* Work done during an internship at Meta
A photorealistic, immersive human avatar experience demands capturing fine, person-specific details such as clothing and hair dynamics, subtle facial expressions, and characteristic motion patterns. Achieving this requires large, high-quality datasets, which in turn introduce ambiguities and spurious correlations: very similar poses can correspond to different appearances. Models that fit these details during training tend to overfit and produce unstable, abrupt appearance changes for novel poses. We propose a 3D Gaussian Splatting avatar model with a spatial MLP backbone conditioned on both pose and an appearance latent. The latent is learned during training by an encoder, yielding a compact representation that improves reconstruction quality and disambiguates pose-driven renderings. At driving time, a predictor autoregressively infers the latent, producing temporally smooth appearance evolution and improved stability. Overall, our method offers a robust, practical path to high-fidelity, stable avatar driving.
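To make the conditioning concrete, the sketch below shows one way the pieces described above could fit together: a spatial MLP that decodes per-Gaussian parameters from pose and an appearance latent, and an autoregressive predictor that rolls the latent forward at driving time. This is a minimal illustration, not the authors' released code; all module names, dimensions, and the GRU-based predictor are assumptions.

```python
import torch
import torch.nn as nn

class SpatialMLP(nn.Module):
    """Hypothetical spatial MLP backbone: maps per-Gaussian features,
    the driving pose, and an appearance latent to Gaussian parameters."""
    def __init__(self, feat_dim=64, pose_dim=72, latent_dim=16, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + pose_dim + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            # e.g. offset(3) + rotation(4) + scale(3) + color(3) + opacity(1)
            nn.Linear(hidden, 14),
        )

    def forward(self, feats, pose, latent):
        # Broadcast the global pose and appearance latent to every Gaussian.
        cond = torch.cat([pose, latent], dim=-1).expand(feats.shape[0], -1)
        return self.net(torch.cat([feats, cond], dim=-1))

class LatentPredictor(nn.Module):
    """Hypothetical autoregressive predictor: infers the current appearance
    latent from the pose stream, replacing the training-time encoder."""
    def __init__(self, pose_dim=72, latent_dim=16, hidden=128):
        super().__init__()
        self.cell = nn.GRUCell(pose_dim, hidden)
        self.head = nn.Linear(hidden, latent_dim)

    def forward(self, pose, h):
        h = self.cell(pose, h)           # carry temporal state across frames
        return self.head(h), h

# Driving loop: infer the latent frame by frame, then decode Gaussians.
model, predictor = SpatialMLP(), LatentPredictor()
feats = torch.randn(10_000, 64)          # per-Gaussian spatial features
h = torch.zeros(128)                     # predictor state
for pose in torch.randn(100, 72):        # driven pose sequence
    latent, h = predictor(pose, h)
    gaussians = model(feats, pose, latent)
```

Because the latent at each frame depends on the predictor's hidden state, consecutive frames receive correlated latents, which is one plausible mechanism for the temporally smooth appearance evolution described above.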
We compare results obtained with our appearance predictor against two baselines: a re-implementation of [Zhan et al., 2025] with added face and hand/finger conditioning, denoted MMLPs; and a non-relightable version of the Relightable Full-Body Gaussian Codec Avatar (RFGCA) model of [Wang et al., 2025], denoted nRFGCA.
Qualitative comparison of global (MMLPs [Zhan et al., 2025]) versus localized (Ours) pose conditioning. We manually perturb a single pose parameter: global conditioning induces large, non-local deformations and spurious appearance changes, whereas localized conditioning confines the effect to local, physically plausible motion.
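For concreteness, the perturbation protocol behind this comparison can be sketched with the hypothetical interface above: freeze the pose, sweep a single parameter, and inspect how far the decoded Gaussians move. The parameter index and sweep range here are arbitrary illustrations.

```python
# Sweep one pose parameter while keeping all others fixed (continues the
# hypothetical sketch above; index 17 stands in for one joint angle).
base_pose = torch.zeros(72)
for delta in torch.linspace(-0.5, 0.5, steps=20):
    pose = base_pose.clone()
    pose[17] = delta
    latent, _ = predictor(pose, torch.zeros(128))
    gaussians = model(feats, pose, latent)
    # With localized conditioning, only Gaussians near the perturbed joint
    # should change; global conditioning can deform Gaussians body-wide.
```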