Optimization Dynamics Imprint Semantic Specificity in Contrastive Embedding Norms

2026-06-29 | Artificial Intelligence

Artificial Intelligence, Machine Learning
AI summary

The authors studied why the length (or norm) of embeddings from contrastive models, which is usually ignored during training, still seems to carry meaningful information. They developed a theoretical explanation showing that the embedding length accidentally captures things like how specific a concept is or how often a token appears, due to the way the models are optimized. Their findings explain why this extra information exists and suggest it can be useful for improving some models without extra cost.

contrastive learning, embedding norm, cosine similarity, scale-invariant loss, optimization dynamics, semantic properties, concept specificity, token frequency, model calibration, retrieval tasks
Authors
Ziwei Su, Junyu Ren, Victor Veitch
Abstract
Contrastive embedding models trained with scale-invariant losses are typically paired with similarity measures such as cosine similarity, which effectively ignore embedding magnitudes. Surprisingly, empirical studies suggest that these "discarded" norms nonetheless correlate with semantic properties such as concept specificity, token frequency, and human uncertainty. In this work, we provide a formal theoretical framework explaining this phenomenon. By analyzing the optimization dynamics, we derive an analytic formula demonstrating that embedding length naturally encodes this information as a byproduct of the training process. We also show how this gives rise to signals that can serve as "free" calibration tools in specific models and retrieval tasks, providing a grounded explanation for a previously heuristic observation.
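
As a minimal illustration of the setup the abstract describes (not the authors' analytic formula, which is not reproduced here), the sketch below shows why cosine similarity discards embedding magnitude: rescaling either vector leaves the score unchanged, while the norm itself remains available as a separate quantity. The vectors are random placeholders, and the helper names `l2_norm` and `cosine_similarity` are chosen for this example only.

```python
import math
import random


def l2_norm(x):
    """Euclidean length of a vector given as a list of floats."""
    return math.sqrt(sum(xi * xi for xi in x))


def cosine_similarity(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(ai * bi for ai, bi in zip(a, b))
    return dot / (l2_norm(a) * l2_norm(b))


# Placeholder "embeddings"; a real model would produce these.
rng = random.Random(0)
u = [rng.gauss(0.0, 1.0) for _ in range(8)]
v = [rng.gauss(0.0, 1.0) for _ in range(8)]

# Rescale each vector by a positive constant.
u_scaled = [3.0 * x for x in u]
v_scaled = [0.5 * x for x in v]

base_score = cosine_similarity(u, v)
scaled_score = cosine_similarity(u_scaled, v_scaled)

# The similarity score is unchanged by the rescaling ...
print(math.isclose(base_score, scaled_score, rel_tol=1e-9))

# ... but the norm is not, so it can be read off as a separate signal.
print(l2_norm(u), l2_norm(u_scaled))
```

Because the score is blind to scale, any information carried by the norm is invisible to cosine-based retrieval unless it is read out explicitly, which is the quantity the abstract proposes to use as a calibration signal.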