* Work done during an internship at Meta
A photorealistic, immersive human avatar experience demands capturing fine, person-specific details such as clothing and hair dynamics, subtle facial expressions, and characteristic motion patterns. Achieving this requires large, high-quality datasets, which in turn introduce ambiguities and spurious correlations: very similar poses can correspond to different appearances. Models that fit these details during training tend to overfit and produce unstable, abrupt appearance changes for novel poses. We propose a 3D Gaussian Splatting avatar model with a spatial MLP backbone conditioned on both pose and an appearance latent. The latent is learned during training by an encoder, yielding a compact representation that improves reconstruction quality and disambiguates pose-driven renderings. At driving time, a predictor autoregressively infers the latent, producing temporally smooth appearance evolution and improved stability. Overall, our method offers a robust, practical path to high-fidelity, stable avatar driving.
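To make the conditioning concrete, the sketch below shows one way the pieces described above could fit together: a spatial MLP that decodes per-Gaussian parameters from pose and an appearance latent, and an autoregressive predictor that rolls the latent forward at driving time. This is a minimal illustration, not the authors' released code; all module names, dimensions, and the GRU-based predictor are assumptions.

```python
import torch
import torch.nn as nn

class SpatialMLP(nn.Module):
    """Hypothetical spatial MLP backbone: maps per-Gaussian features,
    the driving pose, and an appearance latent to Gaussian parameters."""
    def __init__(self, feat_dim=64, pose_dim=72, latent_dim=16, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + pose_dim + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            # e.g. offset(3) + rotation(4) + scale(3) + color(3) + opacity(1)
            nn.Linear(hidden, 14),
        )

    def forward(self, feats, pose, latent):
        # Broadcast the global pose and appearance latent to every Gaussian.
        cond = torch.cat([pose, latent], dim=-1).expand(feats.shape[0], -1)
        return self.net(torch.cat([feats, cond], dim=-1))

class LatentPredictor(nn.Module):
    """Hypothetical autoregressive predictor: infers the current appearance
    latent from the pose stream, replacing the training-time encoder."""
    def __init__(self, pose_dim=72, latent_dim=16, hidden=128):
        super().__init__()
        self.cell = nn.GRUCell(pose_dim, hidden)
        self.head = nn.Linear(hidden, latent_dim)

    def forward(self, pose, h):
        h = self.cell(pose, h)           # carry temporal state across frames
        return self.head(h), h

# Driving loop: infer the latent frame by frame, then decode Gaussians.
model, predictor = SpatialMLP(), LatentPredictor()
feats = torch.randn(10_000, 64)          # per-Gaussian spatial features
h = torch.zeros(128)                     # predictor state
for pose in torch.randn(100, 72):        # driven pose sequence
    latent, h = predictor(pose, h)
    gaussians = model(feats, pose, latent)
```

Because the latent at each frame depends on the predictor's hidden state, consecutive frames receive correlated latents, which is one plausible mechanism for the temporally smooth appearance evolution described above.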
We compare results obtained with our appearance predictor against two baselines: a re-implementation of [Zhan et al., 2025] with added face and hand/finger conditioning, denoted MMLPs; and a non-relightable version of the Relightable Full-Body Gaussian Codec Avatar (RFGCA) model of [Wang et al., 2025], denoted nRFGCA.
Qualitative comparison of global (MMLPs [Zhan et al., 2025]) versus localized (Ours) pose conditioning. We manually perturb a single pose parameter: global conditioning induces large, non-local deformations and spurious appearance changes, whereas localized conditioning confines the effect to local, physically plausible motion.
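For concreteness, the perturbation protocol behind this comparison can be sketched with the hypothetical interface above: freeze the pose, sweep a single parameter, and inspect how far the decoded Gaussians move. The parameter index and sweep range here are arbitrary illustrations.

```python
# Sweep one pose parameter while keeping all others fixed (continues the
# hypothetical sketch above; index 17 stands in for one joint angle).
base_pose = torch.zeros(72)
for delta in torch.linspace(-0.5, 0.5, steps=20):
    pose = base_pose.clone()
    pose[17] = delta
    latent, _ = predictor(pose, torch.zeros(128))
    gaussians = model(feats, pose, latent)
    # With localized conditioning, only Gaussians near the perturbed joint
    # should change; global conditioning can deform Gaussians body-wide.
```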