Towards Highly-Constrained Human Motion Generation with Retrieval-Guided Diffusion Noise Optimization

2026-05-08 · Computer Vision and Pattern Recognition

AI summary

The authors developed a method for generating human-like motion that meets highly specific, difficult goals, such as avoiding severe obstacles or taking an exact number of walking steps. Their approach retrieves examples from large motion datasets and uses them to guide generation. A large language model (LLM) parses the task, identifies the constraints that are hardest to satisfy, and decides what to retrieve. The retrieved examples give the diffusion noise optimization a better starting point, so the system can handle these constraints without any additional training.

human motion generation, zero-shot learning, diffusion noise optimization, retrieval-guided generation, relational task parsing, large language models (LLM), spatiotemporal constraints, controllable character animation, reward-guided mask, training-free optimization
Authors
Hanchao Liu, Fang-Lue Zhang, Shining Zhang, Tai-Jiang Mu, Shi-Min Hu
Abstract
Generating human motion that satisfies customized zero-shot goal functions is a critical capability, enabling applications such as controllable character animation and behavior synthesis for virtual agents. While current approaches handle many unseen constraints, they fail on tasks with very challenging spatiotemporal restrictions, such as severe spatial obstacles or a specified number of walking steps. To equip motion generators for these highly constrained tasks, we present a retrieval-guided method built on the training-free diffusion noise optimization framework. The key idea is to search large motion datasets for guidance that can potentially satisfy difficult constraints. We introduce relational task parsing to group target constraints and identify the difficult ones to be handled by retrieved references. A better initialization for the diffusion noise is then obtained via a reward-guided mask that combines random noise with retrieved noise. By optimizing the diffusion noise from this improved initialization, we successfully solve highly constrained generation tasks. By leveraging an LLM for relational task parsing, the framework can further reason automatically about what to retrieve, improving the intelligence of moving agents under a training-free optimization scheme.
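
The masked initialization and the subsequent noise optimization can be illustrated with a small toy sketch. This is not the authors' implementation: the abstract does not specify how the mask is derived from rewards, so the thresholding rule, the per-frame reward values, the cumulative-sum "decoder" standing in for a frozen diffusion sampler, and all function names (`decode`, `goal_loss`, `goal_grad`, `blend_noise`, `optimize_noise`) are assumptions made purely for illustration.

```python
import numpy as np

T, D = 8, 2      # number of frames and feature size (toy values)
SCALE = 0.5      # scale of the toy decoder

def decode(z):
    # Toy stand-in for a frozen diffusion sampler (assumption):
    # integrates per-frame noise into a 2D path.
    return SCALE * np.cumsum(z, axis=0)

def goal_loss(z, target):
    # Example goal function: squared distance of the final position to a target.
    return float(np.sum((decode(z)[-1] - target) ** 2))

def goal_grad(z, target):
    # Analytic gradient of goal_loss w.r.t. z; in this toy decoder every
    # frame contributes equally to the final position.
    residual = decode(z)[-1] - target
    return np.tile(2.0 * SCALE * residual, (z.shape[0], 1))

def blend_noise(z_random, z_retrieved, frame_reward, threshold=0.5):
    # Hypothetical mask rule: keep retrieved noise on frames whose reward
    # clears the threshold, and random noise everywhere else.
    mask = (frame_reward >= threshold).astype(z_random.dtype)[:, None]
    return mask * z_retrieved + (1.0 - mask) * z_random, mask

def optimize_noise(z_init, target, steps=100, lr=0.1):
    # Plain gradient descent on the noise only; the decoder is never updated,
    # which is what makes the scheme training-free.
    z = z_init.copy()
    for _ in range(steps):
        z -= lr * goal_grad(z, target)
    return z

rng = np.random.default_rng(0)
z_random = rng.standard_normal((T, D))
z_retrieved = rng.standard_normal((T, D))   # placeholder for noise from a retrieved motion
frame_reward = np.linspace(0.0, 1.0, T)     # placeholder per-frame reward scores
target = np.array([3.0, -1.0])

z_init, mask = blend_noise(z_random, z_retrieved, frame_reward)
z_opt = optimize_noise(z_init, target)
print("loss before:", goal_loss(z_init, target))
print("loss after: ", goal_loss(z_opt, target))
```

A binary mask is used here so that every entry of the blended initialization still comes from a unit-variance Gaussian sample; a soft mask would change the noise statistics and would likely need rescaling.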