Multimodal Latent Reasoning via Predictive Embeddings

2026-04-09 | Machine Learning

AI summary

The authors propose Pearl, a method that lets visual language models benefit from tool-based reasoning without actually calling those tools at inference time. Instead, Pearl learns from expert tool-use trajectories in a compressed hidden representation (latent space), which avoids the inference overhead of explicit tool calls. Unlike methods that try to reconstruct tool effects step by step, it handles trajectories with multiple tool calls naturally. Experiments show that Pearl matches or outperforms existing techniques, and the authors argue that predicting embeddings is a more principled learning objective than reconstructing tool effects directly.

Visual Language Models, Multimodal Reasoning, Latent Space, Tool Augmentation, Predictive Embeddings, Reconstruction-based Methods, Inference Overhead, Self-supervised Learning, Multi-step Reasoning, Vision-Language Generation
Authors
Ashutosh Adhikari, Mirella Lapata
Abstract
Tool-augmented multimodal reasoning enables visual language models (VLMs) to improve perception by interacting with external tools (e.g., cropping, depth estimation). However, such approaches incur substantial inference overhead, require specialized supervision, and are prone to erroneous tool calls. We propose Pearl (Predictive Embedding Alignment for Reasoning in Latent space), a JEPA-inspired framework that learns from expert tool-use trajectories entirely in the latent space, eliminating the need for explicit tool invocation at inference time. Unlike reconstruction-based latent reasoning methods, which autoregressively generate latent tokens and suffer from training-inference mismatch and limited support for multi-step tool use, Pearl directly learns predictive embeddings from multimodal trajectories while preserving the standard vision-language generation pipeline: it is model-agnostic, simple to train, and naturally supports trajectories with multiple tool calls. Experiments across multiple perception benchmarks show that Pearl matches or outperforms standard supervised fine-tuning and reconstruction-based latent reasoning approaches. Furthermore, we provide empirical evidence that reconstruction-based methods primarily learn embeddings rather than image edits in latent space, motivating predictive embedding learning as a more principled alternative.
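
To make the idea of "predictive embedding learning" concrete, the following is a minimal sketch of a generic JEPA-style objective, not the authors' implementation. The abstract does not specify the encoders, the predictor, or the loss, so everything here is an assumption for illustration: the linear predictor, the cosine-distance loss, the toy two-dimensional embeddings, and all function names are hypothetical.

```python
import math

def linear_predict(weights, context_embedding):
    """Hypothetical linear predictor: one output value per weight row."""
    return [sum(w * x for w, x in zip(row, context_embedding)) for row in weights]

def cosine_distance(a, b):
    """1 - cosine similarity; 0 when the two vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def predictive_embedding_loss(weights, context_embedding, target_embedding):
    """JEPA-style objective (assumed form): predict the target embedding
    from the context embedding and score the prediction in latent space.
    The target is treated as a fixed constant, as with a stop-gradient."""
    predicted = linear_predict(weights, context_embedding)
    return cosine_distance(predicted, target_embedding)

# Toy stand-ins (assumptions): the context plays the role of an embedding of
# the input image and question; the target plays the role of an embedding of
# an expert tool-use trajectory.
context = [1.0, 0.0]
target = [0.0, 2.0]

identity = [[1.0, 0.0], [0.0, 1.0]]   # predictor that leaves the context unchanged
rotation = [[0.0, -1.0], [1.0, 0.0]]  # predictor that rotates the context by 90 degrees

loss_identity = predictive_embedding_loss(identity, context, target)
loss_rotation = predictive_embedding_loss(rotation, context, target)
print(loss_identity, loss_rotation)
```

In this toy setup the rotating predictor aligns the predicted embedding with the target, so its loss is lower than that of the identity predictor. The point of the sketch is only that the training signal compares embeddings directly; no image or tool output is reconstructed, and no tool is called.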