Modular Energy Steering for Safe Text-to-Image Generation with Foundation Models
2026-04-02 • Computer Vision and Pattern Recognition
text-to-image generation, inference-time steering, foundation models, energy-based sampling, diffusion models, flow-matching models, semantic representations, safety control, NSFW detection, latent space
Authors
Yaoteng Tan, Zikui Cai, M. Salman Asif
Abstract
Controlling the behavior of text-to-image generative models is critical for safe and practical deployment. Existing safety approaches typically rely on model fine-tuning or curated datasets, which can degrade generation quality or limit scalability. We propose an inference-time steering framework that leverages gradient feedback from frozen pretrained foundation models to guide the generation process without modifying the underlying generator. Our key observation is that vision-language foundation models encode rich semantic representations that can be repurposed as off-the-shelf supervisory signals during generation. By injecting such feedback through clean latent estimates at each sampling step, our method formulates safety steering as an energy-based sampling problem. This design enables modular, training-free safety control that is compatible with both diffusion and flow-matching models and can generalize across diverse visual concepts. Experiments demonstrate state-of-the-art robustness against NSFW red-teaming benchmarks and effective multi-target steering, while preserving high generation quality on benign non-targeted prompts. Our framework provides a principled approach for utilizing foundation models as semantic energy estimators, enabling reliable and scalable safety control for text-to-image generation.
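The energy-based steering idea in the abstract — nudging the clean latent estimate at each sampling step by the gradient of an energy defined by a frozen critic — can be illustrated with a deliberately tiny one-dimensional sketch. Everything below (the scalar latent, the toy denoiser, the Gaussian energy bump at `BAD`, and the constants `SIGMA2`, `GUIDANCE`, `STEPS`) is an illustrative assumption, not the paper's models or parameters:

```python
import math

BAD = 0.5        # location of a hypothetical "unsafe" concept in latent space
SIGMA2 = 0.05    # width of the energy bump around it
GUIDANCE = 4.0   # steering strength
STEPS = 50       # number of sampling steps

def energy_grad(x0):
    """Gradient of the toy energy E(x0) = exp(-(x0 - BAD)^2 / (2 * SIGMA2)).

    A frozen critic would supply this signal; here it is closed-form.
    """
    e = math.exp(-(x0 - BAD) ** 2 / (2 * SIGMA2))
    return -(x0 - BAD) / SIGMA2 * e

def sample(guided, x=1.0):
    """Toy sampling loop: at each step, form a clean-latent estimate,
    optionally steer it by descending the energy, then move toward it."""
    for _ in range(STEPS):
        x0_hat = 0.5 * x                              # toy clean-latent estimate
        if guided:
            x0_hat -= GUIDANCE * energy_grad(x0_hat)  # energy-descent steering
        x += 0.2 * (x0_hat - x)                       # advance the latent
    return x

unsteered = sample(False)  # drifts toward the unsteered estimate near 0
steered = sample(True)     # the energy gradient pushes it away from BAD
```

The point of the sketch is the placement of the correction: the gradient is applied to the clean estimate `x0_hat` inside the loop, so the underlying "generator" (here, the line computing `x0_hat`) is never modified — mirroring the modular, training-free design the abstract describes.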