Not All Features Are Created Equal: A Mechanistic Study of Vision-Language-Action Models

2026-03-19

Robotics
AI summary

The authors studied how Vision-Language-Action (VLA) models turn what they see and read into actions by examining six different models. They found that the visual input mostly drives the robot’s movements, with language helping only when the task isn’t clear from visuals alone. Different parts of the models separately handle motor programs (how to move) and goal understanding (what to achieve). They also discovered specific activation patterns linked to actions and released a tool called Action Atlas to explore these findings. This work helps reveal the internal workings of complex VLA models.

Vision-Language-Action models, activation injection, sparse autoencoders, linear probes, motor programs, multimodal inputs, behavioral displacement, contrastive identification, causal ablation, activation subspaces
Authors
Bryce Grant, Xijia Zhao, Peng Wang
Abstract
Vision-Language-Action (VLA) models combine perception, language, and motor control in a single architecture, yet how they translate multimodal inputs into actions remains poorly understood. We apply activation injection, sparse autoencoders (SAEs), and linear probes to six models spanning 80M–7B parameters across 394,000+ rollout episodes on four benchmarks. The visual pathway dominates action generation across all architectures: injecting baseline activations into null-prompt episodes recovers near-identical behavior, while cross-task injection steers robots toward source-task positions (99.8% of X-VLA episodes align with the source trajectory), exposing spatially bound motor programs tied to scene coordinates rather than abstract task representations. Language sensitivity depends on task structure, not model design: when visual context uniquely specifies the task, language is ignored; when multiple goals share a scene, language becomes essential (X-VLA libero_goal: 94% → 10% under wrong prompts vs. libero_object: 60–100% regardless). In all three multi-pathway architectures (π0.5, SmolVLA, GR00T), expert pathways encode motor programs while VLM pathways encode goal semantics (2× greater behavioral displacement from expert injection), and subspace injection confirms these occupy separable activation subspaces. Per-token SAE processing is essential for action fidelity on most architectures, though mean-pooling improves fidelity on X-VLA. Contrastive identification recovers 82+ manipulation concepts, and causal ablation reveals sensitivity spanning 28–92% zero-effect rates independent of representation width. We release Action Atlas (https://action-atlas.com) for interactive exploration of VLA representations across all six models.
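
To make the idea of subspace injection concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function names (`project`, `inject_subspace`), the 4-dimensional toy activations, and the choice of the first two axes as a stand-in "motor" subspace are all hypothetical. The sketch assumes the subspace is given by an orthonormal basis; it swaps the target activation's component inside that subspace for the source activation's component and leaves everything orthogonal to the subspace unchanged.

```python
# Hypothetical sketch of subspace injection; names and shapes are
# illustrative assumptions, not taken from the paper or its code.

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def project(x, basis):
    """Project x onto the span of an orthonormal basis (list of vectors)."""
    out = [0.0] * len(x)
    for b in basis:
        coeff = dot(x, b)
        out = [o + coeff * bi for o, bi in zip(out, b)]
    return out

def inject_subspace(target, source, basis):
    """Replace target's component inside the subspace with source's.

    The part of target orthogonal to the subspace is left untouched.
    """
    target_in = project(target, basis)
    source_in = project(source, basis)
    return [t - ti + si for t, ti, si in zip(target, target_in, source_in)]

# Toy 4-d activations; the first two axes stand in for a "motor" subspace.
motor_basis = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
target_act = [0.5, -1.0, 2.0, 3.0]
source_act = [4.0, 7.0, -9.0, -9.0]

patched = inject_subspace(target_act, source_act, motor_basis)
print(patched)
```

In this toy case the patched vector takes its first two coordinates from the source and keeps the last two from the target. In a real model the basis would have to be estimated from data and the patched activation written back into the forward pass, both of which are outside the scope of this sketch.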