Your UnEmbedding Matrix is Secretly a Feature Lens for Text Embeddings

2026-06-05 • Computation and Language

Computation and Language, Information Retrieval

AI summary

The authors found that large language models (LLMs) create text embeddings that often focus too much on common, unimportant words, which hurts their ability to understand deeper meanings. To fix this, they designed EmbedFilter, a method that removes the influence of these frequent tokens from the embeddings. This not only improves how well the embeddings capture meaning but also reduces their size, making searches faster. Their tests show that EmbedFilter helps LLMs perform better on various tasks even with smaller embeddings.

large language models, text embeddings, zero-shot learning, unembedding matrix, semantic representation, dimensionality reduction, embedding space, high-frequency tokens, linear transformation, embedding retrieval

Authors

Songhao Wu, Zhongxin Chen, Yuxuan Liu, Heng Cui, Cong Li, Rui Yan

Abstract

Large language models exhibit impressive zero-shot capabilities across a wide range of downstream tasks. However, they struggle to function as off-the-shelf embedding models, leading to suboptimal performance on massive text embedding benchmarks. In this paper, we identify a potential cause underlying this deficiency. Our motivation stems from an unexpected observation: text embeddings tend to align with frequent but uninformative tokens when projected onto the vocabulary space. We argue that this excessive expression of high-frequency tokens suppresses the model's ability to capture nuanced semantics. To address this, we introduce EmbedFilter, a simple linear transformation designed to refine text embeddings derived from LLMs directly. Specifically, we uncover that the unembedding matrix within LLMs encodes a latent subspace that actively writes these frequent tokens into the embedding space. By filtering out this subspace, EmbedFilter suppresses the influence of high-frequency tokens, thereby enhancing semantic representations. As a compelling byproduct, this enables an inherent dimensionality reduction, lowering index storage and speeding up retrieval while fully preserving the refined embedding quality. Our experiments across multiple LLM backbones demonstrate that LLMs equipped with EmbedFilter achieve superior zero-shot downstream performance even with significantly reduced embedding dimensions. We hope our findings provide deeper insights into the mechanisms of LLM-based representations and inspire more principled designs to improve text embedding training. Our code is available at https://github.com/CentreChen/EmbFilter.
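
The abstract does not spell out how the filtered subspace is constructed, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes the subspace can be approximated by the top-k singular directions of the unembedding rows belonging to a hand-picked set of frequent tokens. The function names (`top_token_ids`, `build_filter_basis`, `apply_filter`), the choice of SVD, and the parameters `frequent_ids` and `k` are all hypothetical and chosen for illustration.

```python
import numpy as np


def top_token_ids(embedding, unembed, n=5):
    """Project one embedding (d,) onto the vocabulary space with the
    unembedding matrix (V, d) and return the n highest-scoring token ids."""
    logits = unembed @ embedding              # (V,)
    return np.argsort(-logits)[:n]


def build_filter_basis(unembed, frequent_ids, k):
    """Return an orthonormal basis of shape (d, d - k) for the complement of
    the top-k directions spanned by the frequent tokens' unembedding rows.

    Assumption: the "frequent token" subspace is taken to be the top-k right
    singular vectors of those rows; the paper may define it differently.
    """
    rows = unembed[frequent_ids]              # (m, d)
    # With full_matrices=True, vt is (d, d) with orthonormal rows, ordered by
    # decreasing singular value; rows past the rank span the null space.
    _, _, vt = np.linalg.svd(rows, full_matrices=True)
    return vt[k:].T                           # (d, d - k)


def apply_filter(embeddings, basis):
    """Linearly map embeddings (n, d) to filtered coordinates (n, d - k).

    Because the basis is orthonormal, dot products between the reduced
    vectors equal dot products between the embeddings after the k
    directions have been projected out in the original d-dimensional space.
    """
    return embeddings @ basis
```

Under these assumptions the filter is a single matrix multiplication, and the output already has d - k dimensions, which is one way a filtering step could reduce index size without changing similarity scores among the filtered embeddings. When `k` equals the number of linearly independent frequent-token rows, the filtered embeddings contribute nothing to those tokens' logits.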
