VISReg: Variance-Invariance-Sketching Regularization for JEPA training

2026-06-01 · Computer Vision and Pattern Recognition

AI summary

The authors propose VISReg, a regularization method for self-supervised learning, in which models learn useful data representations without labels. Previous methods such as VICReg constrain only variance and covariance (second-order statistics). VISReg instead uses a Sliced-Wasserstein sketching objective that constrains the full shape of the embedding distribution, and keeps a separate variance term to control scale. The authors report that this separation gives usable gradients even when embeddings collapse and makes training more robust on low-quality, long-tailed, and low-rank data. Pre-trained on ImageNet-1K, VISReg achieves state-of-the-art performance on out-of-distribution datasets; pre-trained on ImageNet-22K, it matches DINOv2's out-of-distribution performance even though DINOv2 was trained on 10x more data.

Self-supervised learning, Embedding collapse, Variance regularization, Covariance, Sliced-Wasserstein distance, Distribution alignment, Sketching methods, Out-of-distribution generalization, ImageNet, Representation learning
Authors
Haiyu Wu, Randall Balestriero, Morgan Levine
Abstract
Self-supervised learning methods prevent embedding collapse via modeling heuristics or explicit regularization of the embedding space. Among the latter, VICReg decomposes regularization into variance and covariance objectives, offering flexibility and interpretability. However, covariance captures only second-order statistics -- encouraging decorrelation but failing to enforce the full distributional shape needed for stable training. Sketching-based methods such as SIGReg address this by aligning embeddings to an isotropic Gaussian, but lack flexibility and suffer from vanishing gradients under collapse. We propose Variance-Invariance-Sketching Regularization (VISReg), which replaces covariance with a Sliced-Wasserstein-based sketching objective that enforces full distributional shape, while retaining a variance term for scale control. By decoupling scale and shape, VISReg combines VICReg's flexibility with the distributional rigor of sketching methods, providing robust gradients even under collapse. We show that VISReg scales linearly, outperforms existing regularization on low-quality datasets, and is resilient to long-tailed and low-rank regimes. Pre-trained on ImageNet-1K, VISReg achieves state-of-the-art performance on out-of-distribution datasets. Pre-trained on ImageNet-22K, it matches DINOv2's OOD performance despite the latter using 10x more data (LVD-142M). Project and code: https://haiyuwu.github.io/visreg.
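
To make the scale/shape decomposition in the abstract concrete, here is a minimal NumPy sketch of a loss with the same three ingredients: an invariance term between two views, a variance term for scale, and a sliced-Wasserstein term for shape. It is an illustration under stated assumptions, not the authors' implementation. The abstract does not give the exact formulation, so everything below is hypothetical: the function names, the VICReg-style hinge on the standard deviation, the mean-squared-error invariance term, the per-dimension standardization used to separate shape from scale, the standard Gaussian target, the number of slices, and the loss weights.

```python
import numpy as np
from statistics import NormalDist


def variance_term(z, target_std=1.0, eps=1e-4):
    """Scale control: VICReg-style hinge on each dimension's standard deviation."""
    std = np.sqrt(z.var(axis=0) + eps)
    return float(np.mean(np.maximum(0.0, target_std - std)))


def standardize(z, eps=1e-4):
    """Remove per-dimension mean and scale (assumed way to isolate shape)."""
    return (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)


def sliced_w2_to_gaussian(z, num_slices=64, rng=None):
    """Monte Carlo estimate of the squared sliced 2-Wasserstein distance to N(0, I).

    Each random unit direction reduces the problem to 1D, where the squared
    2-Wasserstein distance is the mean squared gap between sorted samples and
    the target quantiles. Any unit-norm projection of N(0, I) is N(0, 1).
    """
    rng = np.random.default_rng(rng)
    n, d = z.shape
    dirs = rng.standard_normal((d, num_slices))
    dirs /= np.linalg.norm(dirs, axis=0, keepdims=True)  # unit directions
    proj = np.sort(z @ dirs, axis=0)                     # (n, num_slices)
    normal = NormalDist()
    # Standard normal quantiles at the midpoints (i + 0.5) / n.
    q = np.array([normal.inv_cdf((i + 0.5) / n) for i in range(n)])
    return float(np.mean((proj - q[:, None]) ** 2))


def visreg_like_loss(z_a, z_b, lam_var=1.0, lam_shape=1.0, seed=0):
    """Invariance + variance + sketching, with hypothetical weights."""
    invariance = float(np.mean((z_a - z_b) ** 2))
    variance = variance_term(z_a) + variance_term(z_b)
    shape = (sliced_w2_to_gaussian(standardize(z_a), rng=seed)
             + sliced_w2_to_gaussian(standardize(z_b), rng=seed))
    return invariance + lam_var * variance + lam_shape * shape
```

In this sketch, fully collapsed embeddings (all rows identical) keep both regularizers large: the hinge sits near its maximum, and the sorted projections are all zero, so the sliced term stays close to the variance of the Gaussian target. A training version would use an autodiff framework so that gradients flow through the sort and the projections.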