Spatio-Temporal Grounding of Large Language Models from Perception Streams
2026-04-08 • Robotics
AI summary
The authors developed FESTS, a framework that helps smaller language models reason about how objects move and interact over time in 3-D space. It compiles natural-language questions into Spatial Regular Expressions (SpRE), precise patterns over space and time that can be matched against structured video logs. Because every match is computed automatically, the pipeline can generate large amounts of training data without manual labels. Training a 3-billion-parameter model on 27k such examples raised its frame-level F1 from 48.5% to 87.5%, matching GPT-4.1 on complex spatio-temporal reasoning while being about two orders of magnitude smaller.
Embodied AI, Large Language Models, Spatio-temporal reasoning, Spatial Regular Expressions, Spatial logic, Universal and existential quantification, Video logs, Frame-level F1 score, GPT-4, Model training
Authors
Jacob Anderson, Bardh Hoxha, Georgios Fainekos, Hideki Okamoto, Danil Prokhorov
Abstract
Embodied-AI agents must reason about how objects move and interact in 3-D space over time, yet existing smaller frontier Large Language Models (LLMs) still mishandle fine-grained spatial relations, metric distances, and temporal orderings. We introduce Formally Explainable Spatio-Temporal Scenes (FESTS), a general framework that injects verifiable spatio-temporal supervision into an LLM by compiling natural-language queries into Spatial Regular Expressions (SpRE), a language that combines regular-expression syntax with S4u spatial logic and is extended here with universal and existential quantification. The pipeline matches each SpRE against any structured video log and exports aligned (query, frames, match, explanation) tuples, enabling unlimited training data without manual labels. Training a 3-billion-parameter model on 27k such tuples boosts frame-level F1 from 48.5% to 87.5%, matching GPT-4.1 on complex spatio-temporal reasoning while remaining two orders of magnitude smaller, thereby enabling spatio-temporal intelligence for Video LLMs.
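To make the shape of the exported tuples and the frame-level metric concrete, here is a minimal Python sketch. It is not the authors' implementation and does not use SpRE syntax: the log layout, the `Obj` class, and the helpers `exists_near`, `make_tuple`, and `frame_f1` are all hypothetical, and a single existentially quantified distance predicate stands in for a compiled query. Frame-level F1 is assumed here to be the standard F1 over sets of matched frame indices, which the abstract does not spell out.

```python
from dataclasses import dataclass
from math import dist


@dataclass(frozen=True)
class Obj:
    label: str
    xyz: tuple  # (x, y, z) position in metres (assumed log format)


# Hypothetical structured log: one list of detected objects per frame.
LOG = [
    [Obj("car", (0.0, 0.0, 0.0)), Obj("pedestrian", (10.0, 0.0, 0.0))],
    [Obj("car", (4.0, 0.0, 0.0)), Obj("pedestrian", (8.0, 0.0, 0.0))],
    [Obj("car", (6.0, 0.0, 0.0)), Obj("pedestrian", (7.0, 0.0, 0.0))],
    [Obj("car", (9.0, 0.0, 0.0))],
]


def exists_near(frame, a, b, max_dist):
    """Existential check: some `a` lies within `max_dist` metres of some `b`."""
    return any(
        p.label == a and q.label == b and dist(p.xyz, q.xyz) <= max_dist
        for p in frame
        for q in frame
    )


def make_tuple(query, log, a, b, max_dist):
    """Build one (query, frames, match, explanation) record from the log."""
    match = [i for i, frame in enumerate(log) if exists_near(frame, a, b, max_dist)]
    explanation = f"some {a} is within {max_dist} m of some {b} in frames {match}"
    return {
        "query": query,
        "frames": list(range(len(log))),
        "match": match,
        "explanation": explanation,
    }


def frame_f1(predicted, truth):
    """F1 over sets of frame indices (assumed definition of frame-level F1)."""
    predicted, truth = set(predicted), set(truth)
    tp = len(predicted & truth)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(truth)
    return 2 * precision * recall / (precision + recall)


example = make_tuple(
    "When is a car within 5 m of a pedestrian?", LOG, "car", "pedestrian", 5.0
)
```

In this toy log the predicate holds in frames 1 and 2 only, so `example["match"]` serves as a label obtained without manual annotation. Scoring a model answer that also includes frame 3 against it with `frame_f1` penalizes the extra frame through precision while recall stays at 1.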