Flatness Preserves Instruction Following in Vision-Language-Action Models

2026-06-22 · Robotics

AI summary

The authors study vision-language-action (VLA) models, which map visual input and language instructions to robot actions. They find that fine-tuning these models on small robot datasets can degrade the pretrained representations, so the policy ignores the instruction and relies on visual shortcuts, a failure mode they call instruction blindness. They hypothesize that limited-data fine-tuning produces a sharp loss landscape with high-curvature minima, and they propose flatness-preserving optimization: applying sharpness-aware minimization during fine-tuning on the same data. They report that this improves instruction following by over 60% across simulation and real-world benchmarks, without extra data or changes to the model architecture.

vision-language-action (VLA) models, fine-tuning, instruction blindness, loss landscape, sharpness-aware minimization, flatness-preserving optimization, robot learning, pretrained representations
Authors
Haochen Zhang, Yonatan Bisk
Abstract
Vision-language-action (VLA) models have the potential for open-world generalization by leveraging pretrained vision-language representations, yet downstream finetuning on limited robot data often degrades these representations, leading to brittle policies that ignore language instructions in favor of visual shortcuts, a failure mode we term instruction blindness. We hypothesize that standard finetuning with limited data applies gradients to a sparse set of points, which manifests as a sharp loss landscape with high-curvature minima. We propose to address this directly through flatness-preserving optimization while finetuning on the exact same data, where learning a flatter landscape results in a model more robust to perturbations in the weight space. Specifically, we demonstrate that simply applying sharpness-aware minimization during VLA finetuning significantly improves instruction following by over 60% across multiple simulation and real-world benchmarks without additional data, architectural modification, or retraining. We further analyze the effect of selective sharpness, quantify its effects, and show that our approach is complementary to existing guidance techniques. Project page can be found at https://haochenz11.github.io/papers/flatness-vla/.
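To make the sharpness-aware minimization step mentioned in the abstract concrete, here is a minimal sketch of the generic two-pass SAM update on a toy quadratic loss. It is not the authors' implementation: the toy loss, the function names (`loss`, `grad`, `sam_step`), and the values of `lr` and `rho` are assumptions chosen only for illustration.

```python
import math


def loss(w, curv):
    """Toy loss L(w) = 0.5 * sum(c_i * w_i**2); curv holds per-axis curvature (assumed)."""
    return 0.5 * sum(c * wi * wi for c, wi in zip(curv, w))


def grad(w, curv):
    """Gradient of the toy loss with respect to w."""
    return [c * wi for c, wi in zip(curv, w)]


def sam_step(w, curv, lr=0.05, rho=0.05, eps=1e-12):
    """One generic SAM update; lr and rho are illustrative values, not the paper's."""
    # 1) Gradient at the current weights.
    g = grad(w, curv)
    norm = math.sqrt(sum(gi * gi for gi in g))
    # 2) Ascent: step to the approximate worst-case point in an L2 ball of radius rho.
    w_adv = [wi + rho * gi / (norm + eps) for wi, gi in zip(w, g)]
    # 3) Gradient at the perturbed weights.
    g_adv = grad(w_adv, curv)
    # 4) Descent: update the original weights using the perturbed gradient.
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]


if __name__ == "__main__":
    curv = [1.0, 10.0]  # one flat and one sharp direction
    w = [1.0, 1.0]
    for _ in range(50):
        w = sam_step(w, curv)
    print("final loss:", loss(w, curv))
```

Setting `rho=0.0` reduces the update to plain gradient descent, which makes the extra ascent step easy to isolate. In a deep learning framework the same pattern costs two forward-backward passes per update, since the gradient is evaluated once at the current weights and once at the perturbed weights.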