Multi-Vector Index Compression in Any Modality

2026-02-24

Subjects: Information Retrieval, Computation and Language, Computer Vision and Pattern Recognition
AI summary

The authors look at how to make searching through complex data like text, images, and videos faster and use less memory. They focus on a method called late interaction, which is powerful but can get expensive when handling long or rich documents. To fix this, they try different ways to shrink the data that represents documents, including a new method called attention-guided clustering that smartly picks important parts of the document to keep. They test these methods on various tasks and find that their new approach usually works better or just as well as keeping all the data. The authors also share their code publicly.
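The late-interaction scoring the summary refers to can be sketched in a few lines. This is a minimal illustration of ColBERT-style MaxSim scoring (not the authors' implementation): each query vector takes its maximum similarity over all document vectors, and those maxima are summed, so scoring cost and index size both grow with the number of document vectors, which is what the compression methods target.

```python
import numpy as np

def late_interaction_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """MaxSim late interaction: sum over query vectors of the max
    similarity against every document vector. Assumes rows are
    L2-normalized so the dot product is cosine similarity."""
    sims = query_vecs @ doc_vecs.T   # (n_query, n_doc) similarity matrix
    return float(sims.max(axis=1).sum())
```

Because the inner matrix has shape `(n_query, n_doc)`, shrinking a document's vector set from thousands of tokens to a constant budget directly reduces both storage and per-query compute.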

Keywords: multi-vector retrieval, late interaction, index compression, attention-guided clustering, hierarchical pooling, sequence resizing, memory tokens, information retrieval, semantic clustering, document representation
Authors
Hanxiang Qin, Alexander Martin, Rohan Jha, Chunsheng Zuo, Reno Kriz, Benjamin Van Durme
Abstract
We study efficient multi-vector retrieval for late interaction in any modality. Late interaction has emerged as a dominant paradigm for information retrieval in text, images, visual documents, and videos, but its computation and storage costs grow linearly with document length, making it costly for image-, video-, and audio-rich corpora. To address this limitation, we explore query-agnostic methods for compressing multi-vector document representations under a constant vector budget. We introduce four approaches for index compression: sequence resizing, memory tokens, hierarchical pooling, and a novel attention-guided clustering (AGC). AGC uses an attention-guided mechanism to identify the most semantically salient regions of a document as cluster centroids and to weight token aggregation. Evaluating these methods on retrieval tasks spanning text (BEIR), visual-document (ViDoRe), and video (MSR-VTT, MultiVENT 2.0), we show that attention-guided clustering consistently outperforms other parameterized compression methods (sequence resizing and memory tokens), provides greater flexibility in index size than non-parametric hierarchical clustering, and achieves competitive or improved performance compared to a full, uncompressed index. The source code is available at: github.com/hanxiangqin/omni-col-press.
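The attention-guided clustering idea described in the abstract can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the paper's implementation: we assume per-token attention saliency scores are available, take the `budget` most salient tokens as cluster centroids, assign every token to its nearest centroid by cosine similarity, and collapse each cluster to an attention-weighted mean.

```python
import numpy as np

def attention_guided_clustering(token_vecs: np.ndarray,
                                attn_scores: np.ndarray,
                                budget: int) -> np.ndarray:
    """Compress (n_tokens, dim) document vectors to (budget, dim).

    Sketch of attention-guided clustering: the most attended tokens
    act as centroids; remaining tokens are pooled into them with
    attention weighting. Assumes attn_scores are positive.
    """
    # Normalize rows so dot products are cosine similarities.
    normed = token_vecs / np.linalg.norm(token_vecs, axis=1, keepdims=True)
    # Select the `budget` most salient tokens as centroids.
    centroid_idx = np.argsort(attn_scores)[-budget:]
    # Assign each token to its nearest centroid.
    sims = normed @ normed[centroid_idx].T        # (n_tokens, budget)
    assign = sims.argmax(axis=1)
    # Attention-weighted mean pooling per cluster. Each centroid token
    # assigns to itself, so no cluster is empty.
    compressed = np.zeros((budget, token_vecs.shape[1]))
    for c in range(budget):
        members = assign == c
        w = attn_scores[members]
        compressed[c] = (w[:, None] * token_vecs[members]).sum(axis=0) / w.sum()
    return compressed
```

The compressed vectors can then be scored with standard MaxSim late interaction; the index cost per document is fixed at `budget` vectors regardless of document length.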