T-REN: Learning Text-Aligned Region Tokens Improves Dense Vision-Language Alignment and Scalability

2026-04-20

Computer Vision and Pattern Recognition

Keywords
vision-language encoder, open-vocabulary segmentation, region-level representation, token reduction, cross-modal alignment, semantic segmentation, video object localization, text-image retrieval, compact encoding, vision backbone
Authors
Savya Khosla, Sethuraman T, Aryan Chadha, Alex Schwing, Derek Hoiem
Abstract
Despite recent progress, vision-language encoders struggle with two core limitations: (1) weak alignment between language and dense vision features, which hurts tasks like open-vocabulary semantic segmentation; and (2) high token counts for fine-grained visual representations, which limits scalability to long videos. This work addresses both limitations. We propose T-REN (Text-aligned Region Encoder Network), an efficient encoder that maps visual data to a compact set of text-aligned region-level representations (or region tokens). T-REN achieves this through a lightweight network added on top of a frozen vision backbone, trained to pool patch-level representations within each semantic region into region tokens and align them with region-level text annotations. With only 3.7% additional parameters compared to the vision-language backbone, this design yields substantially stronger dense cross-modal understanding while reducing the token count by orders of magnitude. Specifically, T-REN delivers +5.9 mIoU on ADE20K open-vocabulary segmentation, +18.4% recall on COCO object-level text-image retrieval, +15.6% recall on Ego4D video object localization, and +17.6% mIoU on VSPW video scene parsing, all while reducing token counts by more than 24x for images and 187x for videos compared to the patch-based vision-language backbone. The code and model are available at https://github.com/savya08/T-REN.
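The core operation the abstract describes, pooling the patch-level features within each semantic region into a single region token and scoring it against text embeddings, can be sketched as below. This is a minimal illustration using simple masked mean pooling and cosine similarity; the function names and pooling choice are assumptions for clarity, not the paper's exact implementation.

```python
import numpy as np

def pool_region_tokens(patch_feats, region_masks):
    """Pool patch features into region tokens (illustrative masked mean pooling).

    patch_feats: (N, D) patch-level features from a frozen vision backbone.
    region_masks: (R, N) binary masks, one row per semantic region.
    Returns (R, D) region tokens, one per region.
    """
    counts = region_masks.sum(axis=1, keepdims=True)          # patches per region
    tokens = region_masks @ patch_feats / np.maximum(counts, 1)
    return tokens

def region_text_scores(region_tokens, text_embeds):
    """Cosine similarity between region tokens (R, D) and text embeddings (T, D),
    the kind of region-level alignment signal used for training and retrieval."""
    r = region_tokens / np.linalg.norm(region_tokens, axis=1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    return r @ t.T
```

For example, pooling a 24x24 grid of 576 patch tokens into a handful of region tokens reduces the token count by more than an order of magnitude, consistent with the >24x image-level reduction the abstract reports.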