Appear2Meaning: A Cross-Cultural Benchmark for Structured Cultural Metadata Inference from Images
2026-04-08 • Computer Vision and Pattern Recognition • Computation and Language • Multimedia
AI Summary
The authors studied how well models that understand both images and language can identify detailed cultural information, such as who made an artwork or where it comes from, just by looking at pictures. They created a new test covering different cultures and used another AI to judge how well the models' answers matched reference data. The results showed that the models got only fragments right and struggled especially across cultures and types of information. This means current models are not yet reliable for extracting detailed cultural facts from images alone.
Keywords
vision-language models, image captioning, cultural metadata, cross-cultural benchmark, semantic alignment, LLM-as-Judge, exact-match accuracy, partial-match accuracy, attribute-level accuracy, structured inference
Authors
Yuechen Jiang, Enze Zhang, Md Mohsinul Kabir, Qianqian Xie, Stavroula Golfomitsou, Konstantinos Arvanitis, Sophia Ananiadou
Abstract
Recent advances in vision-language models (VLMs) have improved image captioning for cultural heritage. However, inferring structured cultural metadata (e.g., creator, origin, period) from visual input remains underexplored. We introduce a multi-category, cross-cultural benchmark for this task and evaluate VLMs using an LLM-as-Judge framework that measures semantic alignment with reference annotations. To assess cultural reasoning, we report exact-match, partial-match, and attribute-level accuracy across cultural regions. Results show that models capture fragmented signals and exhibit substantial performance variation across cultures and metadata types, leading to inconsistent and weakly grounded predictions. These findings highlight the limitations of current VLMs in structured cultural metadata inference beyond visual perception.
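To make the reported metrics concrete, here is a minimal sketch of how exact-match, partial-match, and attribute-level accuracy could be computed over structured metadata predictions. The attribute schema (creator, origin, period, taken from the abstract's examples) and the match() placeholder are illustrative assumptions, not the authors' released code; in the benchmark itself, an LLM-as-Judge decides semantic alignment rather than a string comparison.

```python
# Illustrative sketch of the three accuracy metrics named in the abstract.
# Field names and the match() criterion are assumptions for illustration.

from typing import Dict, List

ATTRIBUTES = ["creator", "origin", "period"]  # hypothetical metadata schema


def match(pred: str, ref: str) -> bool:
    """Stand-in for the LLM-as-Judge alignment decision.

    A case-insensitive string comparison is used as a placeholder; the
    benchmark instead asks an LLM whether the prediction is semantically
    aligned with the reference annotation.
    """
    return pred.strip().lower() == ref.strip().lower()


def evaluate(preds: List[Dict[str, str]], refs: List[Dict[str, str]]) -> dict:
    n = len(refs)
    exact = 0                               # all attributes judged correct
    partial = 0                             # at least one attribute correct
    per_attr = {a: 0 for a in ATTRIBUTES}   # correct counts per attribute

    for pred, ref in zip(preds, refs):
        hits = [a for a in ATTRIBUTES if match(pred.get(a, ""), ref[a])]
        exact += len(hits) == len(ATTRIBUTES)
        partial += len(hits) > 0
        for a in hits:
            per_attr[a] += 1

    return {
        "exact_match": exact / n,
        "partial_match": partial / n,
        "attribute_level": {a: c / n for a, c in per_attr.items()},
    }


if __name__ == "__main__":
    refs = [{"creator": "Hokusai", "origin": "Japan", "period": "Edo"}]
    preds = [{"creator": "Hokusai", "origin": "China", "period": "Edo"}]
    print(evaluate(preds, refs))
    # The item earns partial-match but not exact-match credit, and the
    # attribute-level breakdown shows which field failed.
```

In this toy example the prediction scores on partial-match but not exact-match, and the attribute-level breakdown isolates the failed field; this per-attribute view is what lets the paper report performance variation across metadata types.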