HorizonWeaver: Generalizable Multi-Level Semantic Editing for Driving Scenes

2026-04-06 · Computer Vision and Pattern Recognition
AI summary

The authors created HorizonWeaver, a tool that helps edit photos of driving scenes based on instructions, making it easier to test self-driving car safety. They addressed challenges like editing both small objects and whole scenes, keeping important details while following instructions, and handling different weather and road layouts. To do this, they combined data from real and synthetic sources, used language-guided masks for precise edits, and trained the model to keep content consistent and follow instructions closely. Their method works better than previous ones in several tests and is preferred by users.

autonomous driving, image editing, language-guided editing, scene understanding, synthetic data, semantic masks, instruction alignment, BEV segmentation, nuScenes dataset, photorealistic rendering
Authors
Mauricio Soroco, Francesco Pittaluga, Zaid Tasneem, Abhishek Aich, Bingbing Zhuang, Wuyang Chen, Manmohan Chandraker, Ziyu Jiang
Abstract
Ensuring safety in autonomous driving requires scalable generation of realistic, controllable driving scenes beyond what real-world testing provides. Yet existing instruction-guided image editors, trained on object-centric or artistic data, struggle with dense, safety-critical driving layouts. We propose HorizonWeaver, which tackles three fundamental challenges in driving scene editing: (1) multi-level granularity, requiring coherent object- and scene-level edits in dense environments; (2) rich high-level semantics, preserving diverse objects while following detailed instructions; and (3) ubiquitous domain shifts, handling changes in climate, layout, and traffic across unseen environments. The core of HorizonWeaver is a set of complementary contributions across data, model, and training: (1) Data: large-scale dataset generation, where we build a paired real/synthetic dataset from Boreas, nuScenes, and Argoverse2 to improve generalization; (2) Model: language-guided masks for fine-grained editing, where semantics-enriched masks and prompts enable precise, language-guided edits; and (3) Training: content preservation and instruction alignment, where joint losses enforce scene consistency and instruction fidelity. Together, these contributions make HorizonWeaver a scalable framework for photorealistic, instruction-driven editing of complex driving scenes. We collect 255K images across 13 editing categories, outperform prior methods on L1, CLIP, and DINO metrics, achieve a +46.4% user preference, and improve BEV segmentation IoU by +33%. Project page: https://msoroco.github.io/horizonweaver/
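The abstract's third contribution pairs a content-preservation loss with an instruction-alignment loss. The paper does not give the formulation here, so the following is a minimal hypothetical sketch of how such a joint objective could be combined: a masked L1 term that penalizes changes outside the edited region, plus a cosine term that pulls the edit's embedding toward the instruction's embedding. The function name, weights, and embedding inputs are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def joint_editing_loss(edited, target, preserve_mask, edit_emb, instr_emb,
                       w_content=1.0, w_align=0.5):
    """Hypothetical joint objective (not the paper's exact losses):
    content preservation (masked L1) + instruction alignment (cosine).

    edited, target: (H, W, C) images; preserve_mask: (H, W), 1 = keep unchanged;
    edit_emb, instr_emb: 1-D feature vectors (e.g. from a CLIP-like encoder).
    """
    # Content preservation: L1 distance only over pixels marked as unedited.
    mask = preserve_mask[..., None]  # broadcast over channels
    denom = max(mask.sum() * edited.shape[-1], 1)
    l_content = np.abs((edited - target) * mask).sum() / denom

    # Instruction alignment: 1 - cosine similarity of edit vs. instruction embeddings.
    cos = edit_emb @ instr_emb / (
        np.linalg.norm(edit_emb) * np.linalg.norm(instr_emb) + 1e-8)
    l_align = 1.0 - cos

    return w_content * l_content + w_align * l_align
```

With an unchanged image and an edit embedding identical to the instruction embedding, both terms vanish, so the loss is (near) zero; edits that disturb masked regions or drift from the instruction raise it.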