TriViewBench: Controlled Complexity Scaling for Multi-View Structural Reasoning in MLLMs

2026-06-24, Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition, Artificial Intelligence
AI summary

The authors created a new test called TriViewBench to see how well multimodal large language models (MLLMs) can understand complex 3D scenes with different numbers of objects and levels of occlusion. They found that all tested models were best at simple local decisions, worse at counting objects, and struggled most with understanding the entire scene globally, especially as tasks got harder. Mistakes included missing objects when seen from one view and confusing objects when seen from multiple views. Their work suggests current models have trouble with understanding spatial information across views, and their new benchmark helps study these kinds of reasoning problems.

Multimodal Large Language Models, Visual Question Answering, 3D Scenes, Occlusion, Object Counting, Global Recovery, Chain-of-Thought Prompting, Spatial Representation, Synthetic Benchmark, Structural Reasoning
Authors
Yu-Yang Chen, Lan-Zhe Guo
Abstract
Multimodal Large Language Models (MLLMs) demonstrate strong performance on standard visual question answering benchmarks, yet their scalability under controlled structural complexity remains poorly understood. We introduce TriViewBench, a controlled three-view visual reasoning benchmark constructed from synthetic 3D scenes with explicitly parameterized object count and occlusion. The benchmark contains 1,923 scenes and over 14K Question-Answer (QA) pairs organized into four complexity levels and three reasoning categories: Local Decision, Object Counting, and Global Recovery. We evaluate 18 open- and closed-source MLLMs under a unified prompting protocol. All 18 models exhibit an identical capability hierarchy without exception (Local Decision > Object Counting > Global Recovery), and performance degrades monotonically with complexity: Local Decision tasks decline modestly (12.11% relative drop), while Object Counting degrades substantially (59.14%) and Global Recovery collapses severely (80.02%). Error analysis on Object Counting reveals two mechanistically independent failure modes: single-view tasks are dominated by undercounting due to occlusion blindness, whereas the multi-view task reverses to overcounting due to cross-view identity confusion. Chain-of-Thought (CoT) prompting yields near-zero overall benefit ($\Delta = -0.16\%$) and its effect on Global Recovery is strongly capability-gated, suggesting that the bottleneck lies in cross-view spatial representation rather than reasoning strategy. These findings reveal fundamental scalability limitations in current MLLMs and position TriViewBench as a controlled diagnostic framework for analyzing structural reasoning failures.
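The two quantities the abstract leans on, the relative drop across complexity levels and the direction of counting errors, can be illustrated with a minimal Python sketch. It rests on assumptions the abstract does not state: that "relative drop" compares accuracy at the easiest level with accuracy at the hardest level, and that undercounting and overcounting are decided by the sign of the difference between predicted and true counts. The function names and all input values are hypothetical and are not taken from the benchmark.

```python
def relative_drop(acc_easiest: float, acc_hardest: float) -> float:
    """Relative accuracy drop, in percent, from the easiest to the hardest level.

    Assumed definition: 100 * (a_easy - a_hard) / a_easy.
    """
    if acc_easiest <= 0:
        raise ValueError("acc_easiest must be positive")
    return 100.0 * (acc_easiest - acc_hardest) / acc_easiest


def counting_error_profile(predicted, truth):
    """Tally counting answers as undercounts, overcounts, or exact matches."""
    if len(predicted) != len(truth):
        raise ValueError("predicted and truth must have the same length")
    profile = {"under": 0, "over": 0, "exact": 0}
    for p, t in zip(predicted, truth):
        if p < t:
            profile["under"] += 1   # model reported fewer objects than present
        elif p > t:
            profile["over"] += 1    # model reported more objects than present
        else:
            profile["exact"] += 1
    return profile


# Made-up illustrative inputs, not results from the paper.
drop = relative_drop(80.0, 40.0)
profile = counting_error_profile([3, 5, 4, 6], [4, 5, 6, 5])
```

Under these assumptions, a profile dominated by `under` would correspond to the occlusion-blindness pattern described for single-view tasks, and one dominated by `over` to the cross-view identity confusion described for the multi-view task.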