Rethinking Vector Field Learning for Generative Segmentation

2026-03-19 · Computer Vision and Pattern Recognition
AI summary

The authors look at how diffusion models are used to segment images, pointing out problems with the usual math that makes training slow and less precise. They focus on the idea of vector fields and show that the common training approach causes the model to have weak learning signals and difficulty separating different classes. To fix this, the authors add a special correction that makes the model pay more attention to important areas, improving learning speed and accuracy without changing the overall training setup. They also create a new encoding method for categories that works well with pixel-based neural networks, leading to better segmentation results compared to standard methods.

diffusion models · generative segmentation · flow matching objective · vector field · gradient vanishing · trajectory traversing · velocity field · category encoding · pixel neural field · semantic alignment
Authors
Chaoyang Wang, Yaobo Liang, Boci Peng, Fan Duan, Jingdong Wang, Yunhai Tong
Abstract
Taming diffusion models for generative segmentation has attracted increasing attention. While existing approaches primarily focus on architectural tweaks or training heuristics, there remains a limited understanding of the intrinsic mismatch between continuous flow matching objectives and discrete perception tasks. In this work, we revisit diffusion segmentation from the perspective of vector field learning. We identify two key limitations of the commonly used flow matching objective: gradient vanishing and trajectory traversing, which result in slow convergence and poor class separation. To tackle these issues, we propose a principled vector field reshaping strategy that augments the learned velocity field with a detached distance-aware correction term. This correction introduces both attractive and repulsive interactions, enhancing gradient magnitudes near centroids while preserving the original diffusion training framework. Furthermore, we design a computationally efficient, quasi-random category encoding scheme inspired by Kronecker sequences, which integrates seamlessly with an end-to-end pixel neural field framework for pixel-level semantic alignment. Extensive experiments consistently demonstrate significant improvements over vanilla flow matching approaches, substantially narrowing the performance gap between generative segmentation and strong discriminative specialists.
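The abstract's "quasi-random category encoding scheme inspired by Kronecker sequences" can be illustrated with a standard Kronecker (Weyl) low-discrepancy construction, where the n-th point along each dimension is the fractional part of n times an irrational step. The sketch below is not the authors' implementation; the function name, dimensions, and choice of square roots of primes as irrational steps are illustrative assumptions.

```python
import numpy as np

def kronecker_category_codes(num_classes: int, dim: int) -> np.ndarray:
    """Hypothetical sketch: assign each class a quasi-random code by
    sampling a Kronecker (Weyl) sequence x_n = frac(n * alpha_d) in each
    embedding dimension d, with alpha_d = sqrt(prime_d) as the
    irrational step (a common low-discrepancy choice)."""
    # Collect the first `dim` primes with a simple trial-division loop.
    primes = []
    n = 2
    while len(primes) < dim:
        if all(n % p for p in primes):
            primes.append(n)
        n += 1
    alpha = np.sqrt(np.array(primes, dtype=np.float64))  # irrational steps
    ids = np.arange(1, num_classes + 1).reshape(-1, 1)   # class indices 1..K
    codes = np.modf(ids * alpha)[0]                      # fractional part, in [0, 1)
    return codes * 2.0 - 1.0                             # rescale to [-1, 1)

codes = kronecker_category_codes(num_classes=21, dim=8)
print(codes.shape)  # (21, 8)
```

Such codes are cheap to compute (no learned embedding table), deterministic, and spread classes roughly uniformly over the code space, which is plausibly why the paper favors them for pixel-level semantic alignment targets.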