AEGIS: A Holistic Benchmark for Evaluating Forensic Analysis of AI-Generated Academic Images

2026-04-30 • Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition, Computers and Society

AI summary

The authors introduce AEGIS, a benchmark for testing how well models detect AI-generated forgeries in academic images. It covers seven academic image categories and four realistic forgery strategies, and shows that detection remains hard: even GPT-5.1 reaches only 48.80% overall. Strengths differ by model family, with multimodal large language models better at recognizing textual artifacts and expert detectors better at binary authenticity detection. AEGIS highlights current gaps in the forensic analysis of AI-generated academic images.

AI-generated images, academic image forensics, benchmark, forgery detection, localization accuracy, multimodal large language models, GPT-5.1, generative models, forensic evaluation

Authors

Bo Zhang, Tzu-Yen Ma, Zichen Tang, Junpeng Ding, Zirui Wang, Yizhuo Zhao, Peilin Gao, Zijie Xi, Zixin Ding, Haiyang Sun, Haocheng Gao, Yuan Liu, Liangjia Wang, Yiling Huang, Yujie Wang, Yuyue Zhang, Ronghui Xi, Yuanze Li, Jiacheng Liu, Zhongjun Yang, Haihong E

Abstract

We introduce AEGIS, A holistic benchmark for Evaluating forensic analysis of AI-Generated academic ImageS. Compared to existing benchmarks, AEGIS features three key advances: (1) Domain-Specific Complexity: covering seven academic categories with 39 fine-grained subtypes, exposing intrinsic forensic difficulty, where even GPT-5.1 reaches 48.80% overall performance and expert models achieve only limited localization accuracy (IoU 30.09%); (2) Diverse Forgery Simulations: modeling four prevalent academic forgery strategies across 25 generative models, with 11 yielding average forensic accuracy below 50%, showing that forensics lag behind generative advances; and (3) Multi-Dimensional Forensic Evaluation: jointly assessing detection, reasoning, and localization, revealing complementary strengths between model families, with multimodal large language models (MLLMs) at 84.74% accuracy in textual artifact recognition and expert detectors peaking at 79.54% accuracy in binary authenticity detection. By evaluating 25 leading MLLMs, nine expert models, and one unified multimodal understanding and generation model, AEGIS serves as a diagnostic testbed exposing fundamental limitations in academic image forensics.
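
The abstract reports localization quality as IoU and detection quality as binary accuracy, but does not spell out how either is computed. As a rough illustration only, the sketch below uses the standard textbook definitions of both metrics; the function names `mask_iou` and `binary_accuracy`, the nested-list mask format, and the empty-mask convention are assumptions made for this example and are not taken from the AEGIS paper or its code.

```python
def mask_iou(pred, truth):
    """Intersection over union of two binary masks (nested lists of 0/1).

    Assumes both masks have the same shape. This is the standard IoU
    definition, not necessarily the exact variant used by AEGIS.
    """
    inter = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    # Assumed convention: two completely empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def binary_accuracy(pred_labels, true_labels):
    """Fraction of images whose real/forged label is predicted correctly."""
    if len(pred_labels) != len(true_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("label lists must not be empty")
    correct = sum(1 for p, t in zip(pred_labels, true_labels) if p == t)
    return correct / len(true_labels)


# Toy data, invented for illustration (1 = forged pixel / forged image).
predicted_mask = [[0, 1, 1],
                  [0, 1, 0]]
ground_truth_mask = [[0, 1, 0],
                     [0, 1, 1]]

iou = mask_iou(predicted_mask, ground_truth_mask)
acc = binary_accuracy([1, 0, 1, 1], [1, 0, 0, 1])
print(f"IoU = {iou:.2f}, accuracy = {acc:.2f}")
```

On this toy input, two pixels overlap out of four marked by either mask, and three of four image labels match, so a reported IoU of 30.09% would correspond to predicted and true forged regions sharing under a third of their combined area.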
