EgoGroups: A Benchmark For Detecting Social Groups of People in the Wild
2026-03-23 • Computer Vision and Pattern Recognition
AI summary
The authors created EgoGroups, a new dataset using first-person videos from 65 countries to better study how groups of people interact in everyday life. Unlike older datasets, EgoGroups includes diverse locations, crowd sizes, and weather conditions, providing a more realistic look at social group formation. They tested advanced AI models on this data and found that some models can identify social groups without extra training, but things like crowd density and cultural differences still affect performance. This work helps improve AI understanding of social interactions in real-world settings.
Keywords: Social group detection, First-person view dataset, Visual language models (VLM), Large language models (LLM), Zero-shot learning, Crowd density, Cultural context, Human annotations, Social dynamics, Scene metadata
Authors
Jeffri Murrugarra-Llerena, Pranav Chitale, Zicheng Liu, Kai Ao, Yujin Ham, Guha Balakrishnan, Paola Cascante-Bonilla
Abstract
Social group detection, or the identification of humans engaged in reciprocal interpersonal interactions (e.g., family members, friends, or customers and merchants), is a crucial component of the social intelligence needed by agents acting in the world. The few existing benchmarks for social group detection are limited by low scene diversity and a reliance on third-person camera sources (e.g., surveillance footage). Consequently, these benchmarks generally lack realistic evaluation of how groups form and evolve in diverse cultural contexts and unconstrained settings. To address this gap, we introduce EgoGroups, a first-person view dataset that captures social dynamics in cities around the world. EgoGroups spans 65 countries, covering low-, medium-, and high-crowd settings under four weather/time-of-day conditions. We include dense human annotations for persons and social groups, along with rich geographic and scene metadata. Using this dataset, we performed an extensive evaluation of the group detection capabilities of state-of-the-art VLMs/LLMs and supervised models. Among our findings: VLMs and LLMs can outperform supervised baselines in a zero-shot setting, while crowd density and cultural region clearly influence model performance.