Keywords: digital humans, animatable avatars, wavelet decomposition, blendshapes, model compression, distillation, texture space factorization, real-time rendering, VR headsets, appearance modeling
Abstract
Building photorealistic, animatable full-body digital humans remains a longstanding challenge in computer graphics and vision. Recent advances in animatable avatar modeling have largely progressed along two directions: improving the fidelity of dynamic geometry and appearance, or reducing computational complexity to enable deployment on resource-constrained platforms, e.g., VR headsets. However, existing approaches fail to achieve both goals simultaneously: ultra-high-fidelity avatars typically require substantial computation on server-class GPUs, whereas lightweight avatars often suffer from limited surface dynamics, reduced appearance detail, and noticeable artifacts. To bridge this gap, we propose a novel animatable avatar representation, termed Wavelet-guided Multi-level Spatial Factorized Blendshapes, together with a distillation pipeline that transfers motion-aware clothing dynamics and fine-grained appearance details from a pre-trained ultra-high-quality avatar model into a compact, efficient representation. By coupling multi-level wavelet spectral decomposition with low-rank structural factorization in texture space, our method achieves up to 2000× lower computational cost and a 10× smaller model size than the original high-quality teacher avatar model, while preserving visually plausible dynamics and appearance details that closely resemble those of the teacher model. Extensive comparisons with state-of-the-art methods show that our approach significantly outperforms existing avatar approaches designed for mobile settings and achieves comparable or superior rendering quality to most approaches that can only run on servers. Importantly, our representation substantially improves the practicality of high-fidelity avatars for immersive applications, achieving over 180 FPS on a desktop PC and real-time native on-device performance at 24 FPS on a standalone Meta Quest 3.
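The two compression ingredients named in the abstract, wavelet spectral decomposition and low-rank factorization in texture space, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the Haar transform, the truncated-SVD factorization, the texture resolution, and the rank are all illustrative assumptions chosen for brevity.

```python
import numpy as np

def haar2d(x):
    """One level of 2D Haar wavelet decomposition.
    Returns (LL, LH, HL, HH) sub-bands, each half the resolution of x."""
    a = (x[0::2, :] + x[1::2, :]) / 2.0   # vertical averages
    d = (x[0::2, :] - x[1::2, :]) / 2.0   # vertical details
    LL = (a[:, 0::2] + a[:, 1::2]) / 2.0  # coarse approximation
    LH = (a[:, 0::2] - a[:, 1::2]) / 2.0  # horizontal detail
    HL = (d[:, 0::2] + d[:, 1::2]) / 2.0  # vertical detail
    HH = (d[:, 0::2] - d[:, 1::2]) / 2.0  # diagonal detail
    return LL, LH, HL, HH

def low_rank(M, rank):
    """Truncated-SVD factorization M ≈ U @ V, with U: (m, r) and V: (r, n)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :rank] * s[:rank], Vt[:rank, :]

# Toy stand-in for one texture-space blendshape map (hypothetical 64x64 texture).
rng = np.random.default_rng(0)
texture = rng.standard_normal((64, 64))

# Split into spectral sub-bands, then store the coarse band at low rank.
LL, LH, HL, HH = haar2d(texture)
U, V = low_rank(LL, rank=4)
recon = U @ V  # compact approximation of the coarse sub-band

print(LL.shape, U.shape, V.shape)  # → (32, 32) (32, 4) (4, 32)
```

Here a single level halves the spatial resolution, and the low-rank factors replace a 32×32 band (1024 values) with 32×4 + 4×32 = 256 values; applying more wavelet levels and factorizing each retained band is the general idea behind shrinking a blendshape basis this way.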