GroupToM-Bench: Benchmarking Group Theory of Mind and Nonlinear Social Emergence in MLLMs

2026-06-02 · Computer Vision and Pattern Recognition

AI summary

The authors argue that true general intelligence requires understanding not only how individuals think but also how their mental states combine to produce group behavior. They report that current multimodal large language models struggle with this because collective behavior emerges non-linearly and cannot be obtained by summing individuals' intentions. To test this, they introduce GroupToM-Bench, a multimodal benchmark that traces how individual beliefs, desires, and intentions give rise to group tension, structural constraints, and group-level outcomes. Their experiments reveal that existing models lag behind human baselines in reasoning about social structures and non-linear collective dynamics.

Theory of Mind, multimodal models, collective behavior, group-level cognition, belief-desire-intention model, social structures, non-linear dynamics, benchmark, cognitive audit, mechanistic attribution
Authors
Weidong Tang, Jierui Li, Yueling Hou, Zihan Mei, Can Zhang, Xinyan Wan, Zhiyuan Liang, Pengfei Zhou, Yang You, Wangbo Zhao
Abstract
True general intelligence requires not only a model of the physical world but also a social world model: the capacity to infer how individual mental states interact and crystallize into group-level outcomes. Despite notable progress in individual-level Theory of Mind (ToM) reasoning, existing multimodal large language models fail at this broader task. Collective behavior emerges non-linearly from social tensions, conformity dynamics, and structural constraints, meaning it cannot be recovered by merely summing individual intentions. We present GroupToM-Bench, the first multimodal benchmark for group-level ToM, built around a causal chain spanning micro-level BDI states (belief, desire, intention), meso-level group tension and structural constraints, and macro-level outcome prediction and mechanistic attribution. To probe this full arc, we develop a seven-level cognitive audit framework. Experiments reveal a gap between current models and human baselines, highlighting a failure to process social structures and non-linear collective dynamics.