The Unreasonable Effectiveness of Text Embedding Interpolation for Continuous Image Steering

2026-03-18 · Computer Vision and Pattern Recognition
AI summary

The authors propose a new way to edit images with text-conditioned AI models without extra training or manual tweaking. They use a large language model to construct special contrastive text prompts that guide the model toward a specific change, such as adjusting a facial expression or improving photorealism. From these prompts they derive a steering vector, and by adding scaled versions of it to the text input they obtain a smooth range over which the edit can be controlled continuously. Because the method only modifies text representations, it applies not only to images but also to videos, and it performs on par with more complex methods that require additional training. The authors also introduce a new metric for measuring how smoothly these changes progress.

Keywords
text-conditioned generative models, text-embedding space, contrastive prompts, steering vector, semantic control, elastic range search, continuous image editing, large language model, text encoder, evaluation metric
Authors
Yigit Ekin, Yossi Gandelsman
Abstract
We present a training-free framework for continuous and controllable image editing at test time for text-conditioned generative models. In contrast to prior approaches that rely on additional training or manual user intervention, we find that simple steering in the text-embedding space is sufficient to produce smooth edit control. Given a target concept (e.g., enhancing photorealism or changing a facial expression), we use a large language model to automatically construct a small set of debiased contrastive prompt pairs, from which we compute a steering vector in the generator's text-encoder space. We then add this vector directly to the input prompt representation to control generation along the desired semantic axis. To obtain continuous control, we propose an elastic range search procedure that automatically identifies an effective interval of steering magnitudes, avoiding both under-steering (no edit) and over-steering (changing other attributes). Adding scaled versions of the same vector within this interval yields smooth and continuous edits. Since our method modifies only textual representations, it naturally generalizes across text-conditioned modalities, including image and video generation. To quantify steering continuity, we introduce a new evaluation metric that measures the uniformity of semantic change across edit strengths. We compare continuous editing behavior across methods and find that, despite its simplicity and lightweight design, our approach is comparable to training-based alternatives and outperforms other training-free methods.
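The core mechanism described in the abstract — computing a steering vector from contrastive prompt pairs and adding scaled versions of it to the prompt embedding — can be sketched as follows. This is a minimal illustration, not the authors' implementation: `toy_embed` is a deterministic stand-in for a real text encoder (e.g., a CLIP text encoder), and the prompt pairs and scaling interval are hypothetical examples.

```python
import numpy as np

def toy_embed(text: str, dim: int = 8) -> np.ndarray:
    # Stand-in for a real text encoder: hash the string to seed a
    # random generator, so each prompt maps to a repeatable vector.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(dim)

def steering_vector(pairs: list[tuple[str, str]]) -> np.ndarray:
    # Mean embedding difference over the contrastive prompt pairs
    # (target-concept prompt minus neutral prompt).
    diffs = [toy_embed(pos) - toy_embed(neg) for pos, neg in pairs]
    return np.mean(diffs, axis=0)

def steer(prompt_emb: np.ndarray, v: np.ndarray, alpha: float) -> np.ndarray:
    # Shift the prompt representation along the semantic axis by alpha.
    return prompt_emb + alpha * v

# Hypothetical contrastive pairs for a "smiling" concept.
pairs = [("a smiling person", "a person"),
         ("a grinning person", "a person")]
v = steering_vector(pairs)
base = toy_embed("a portrait photo of a person")

# Sweeping alpha over an interval (found in the paper by elastic range
# search; here fixed to [0, 1] for illustration) yields the sequence of
# steered embeddings that drives the continuous edit.
trajectory = [steer(base, v, a) for a in np.linspace(0.0, 1.0, 5)]
```

In a real text-conditioned pipeline, each steered embedding would replace the prompt embedding fed to the generator, producing one frame of the continuous edit per steering magnitude.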