SARAH: Spatially Aware Real-time Agentic Humans

2026-02-20

Computer Vision and Pattern Recognition
AI summary

The authors developed a way for virtual characters to move and look around naturally during conversations in virtual reality. Their method lets the character turn toward the user, respond to the user's movements, and adjust eye contact smoothly while speaking. The system runs in real time on VR headsets and is faster than previous methods. They evaluated it on the Embody 3D dataset and in a live VR setup, showing it produces realistic, spatially aware agent behavior.

embodied agents, virtual reality, causal transformer, variational autoencoder, flow matching, gaze control, real-time inference, streaming VR, dyadic audio, spatial awareness
Authors
Evonne Ng, Siwei Zhang, Zhang Chen, Michael Zollhoefer, Alexander Richard
Abstract
As embodied agents become central to VR, telepresence, and digital human applications, their motion must go beyond speech-aligned gestures: agents should turn toward users, respond to their movement, and maintain natural gaze. Current methods lack this spatial awareness. We close this gap with the first real-time, fully causal method for spatially-aware conversational motion, deployable on a streaming VR headset. Given a user's position and dyadic audio, our approach produces full-body motion that aligns gestures with speech while orienting the agent according to the user. Our architecture combines a causal transformer-based VAE with interleaved latent tokens for streaming inference and a flow matching model conditioned on user trajectory and audio. To support varying gaze preferences, we introduce a gaze scoring mechanism with classifier-free guidance to decouple learning from control: the model captures natural spatial alignment from data, while users can adjust eye contact intensity at inference time. On the Embody 3D dataset, our method achieves state-of-the-art motion quality at over 300 FPS -- 3x faster than non-causal baselines -- while capturing the subtle spatial dynamics of natural conversation. We validate our approach on a live VR system, bringing spatially-aware conversational agents to real-time deployment. Please see https://evonneng.github.io/sarah/ for details.
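The abstract describes decoupling learning from control via classifier-free guidance: the model learns natural spatial alignment from data, and a guidance weight lets users dial eye-contact intensity at inference time. A minimal sketch of that idea follows, using the standard classifier-free-guidance blend of conditional and unconditional flow-matching velocities; the function and variable names are illustrative assumptions, not the paper's API.

```python
import numpy as np

def cfg_velocity(v_uncond: np.ndarray, v_cond: np.ndarray, w: float) -> np.ndarray:
    """Standard classifier-free-guidance blend of two predicted velocities.

    v_uncond: velocity predicted without the gaze condition.
    v_cond:   velocity predicted with the gaze condition.
    w:        guidance weight. w = 0 ignores the condition, w = 1 follows
              it as learned, and w > 1 exaggerates it (here, stronger
              eye contact). Names are hypothetical, not from the paper.
    """
    return v_uncond + w * (v_cond - v_uncond)

# One Euler step of flow-matching sampling with the blended velocity
# (a generic sketch, not the paper's exact sampler).
def euler_step(x: np.ndarray, v: np.ndarray, dt: float) -> np.ndarray:
    return x + dt * v
```

With w chosen per user, the same trained model can serve different gaze preferences without retraining, which is the decoupling the abstract highlights.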