UNIEGO: Proxies as Mediators for Unified Egocentric Video Representation Learning

2026-06-18 · Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition, Machine Learning
AI summary

The authors address the challenge that video from a wearable camera shows only a limited view of human actions. They build UNIEGO, an egocentric encoder that learns from multiple viewpoints, modalities, and foundation models by first training proxy models that translate each teacher's knowledge into a common egocentric representation space. For each training clip, the method then distills only from the proxies that are both correct and confident. The approach outperforms naive multi-teacher distillation baselines on action recognition, video retrieval, and action segmentation across three challenging ego-exo benchmarks.

Egocentric video, Multi-teacher distillation, Representation learning, Proxy models, Action recognition, Video retrieval, Action segmentation, Modality, Foundation models
Authors
Wenhao Chi, Arkaprava Sinha, Dominick Reilly, Hieu Le, Srijan Das
Abstract
Egocentric video understanding is inherently limited by the narrow perspective of wearable cameras: a single viewpoint, a single modality, or a single model cannot capture the full richness of human action. We argue that a truly expressive egocentric representation must subsume complementary knowledge across viewpoints, modalities, and foundation model representations, yet remain deployable from egocentric video alone. To this end, we introduce a hierarchical multi-teacher distillation framework that produces UNIEGO, a unified egocentric encoder trained with nine teachers spanning ego-exo viewpoints; RGB, depth, and skeleton modalities; and four foundation models. Rather than distilling directly from heterogeneous teachers whose incompatible architectures and feature geometries induce conflicting gradients, our framework interposes a layer of representation-specific Proxy models that translate diverse teacher knowledge into a homogeneous egocentric space. A second distillation stage, Selective Proxy Distillation (SPD), then adaptively selects, for each training sample, the subset of proxies that are both correct and confident, distilling exclusively from reliable supervision and suppressing erroneous signals. SPD is further stabilized by initializing UNIEGO as a learned convex combination of proxy parameters, placing the unified model in a well-conditioned region of the loss landscape before distillation begins. UNIEGO achieves state-of-the-art performance across three egocentric video understanding tasks (action recognition, video retrieval, and action segmentation) on three challenging ego-exo benchmarks, outperforming naive multi-teacher distillation baselines and demonstrating that structured, proxy-mediated knowledge transfer yields richer and more discriminative egocentric representations.
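The two mechanisms named in the abstract, per-sample selection of correct and confident proxies and initialization as a convex combination of proxy parameters, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. Several details are assumptions made only for illustration: "correct" is taken to mean the proxy's top-1 prediction matches the label, "confident" is taken to mean its top-1 probability exceeds a hypothetical `conf_threshold`, the distillation loss is a plain feature-space mean squared error, and the convex weights come from a softmax over hypothetical `mixing_logits` that would be learned in practice. All function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_reliable_proxies(proxy_logits, label, conf_threshold=0.5):
    """Return indices of proxies that are correct and confident on one sample.

    proxy_logits: one list of class logits per proxy, for a single clip.
    label: ground-truth class index.
    conf_threshold: assumed cutoff on the top-1 probability (hypothetical).
    """
    selected = []
    for idx, logits in enumerate(proxy_logits):
        probs = softmax(logits)
        pred = max(range(len(probs)), key=probs.__getitem__)
        if pred == label and probs[pred] >= conf_threshold:
            selected.append(idx)
    return selected


def selective_distillation_loss(student_feat, proxy_feats, selected):
    """Average feature-space MSE against the selected proxies only.

    Returns 0.0 when no proxy is selected, so unreliable supervision
    contributes nothing for that sample (an assumed fallback).
    """
    if not selected:
        return 0.0
    total = 0.0
    for idx in selected:
        target = proxy_feats[idx]
        sq_err = sum((s - t) ** 2 for s, t in zip(student_feat, target))
        total += sq_err / len(student_feat)
    return total / len(selected)


def convex_combination_init(proxy_params, mixing_logits):
    """Blend same-shaped proxy parameter vectors with softmax weights.

    Softmax guarantees non-negative weights that sum to one, so the
    result lies in the convex hull of the proxy parameters.
    """
    weights = softmax(mixing_logits)
    dim = len(proxy_params[0])
    return [
        sum(w * params[i] for w, params in zip(weights, proxy_params))
        for i in range(dim)
    ]
```

In this sketch a proxy that predicts the wrong class, or the right class with low probability, is dropped for that clip, and the student's initial parameters are a weighted average of the proxies' parameters. The weighted average presumes all proxies share one architecture, which is consistent with the abstract's description of a homogeneous egocentric space but is not stated there explicitly.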