From Frames to Events: Rethinking Evaluation in Human-Centric Video Anomaly Detection
2026-04-10 • Computer Vision and Pattern Recognition
AI summary
The authors argue that current video anomaly detection methods are evaluated frame by frame, which does not match how real anomalies occur: as connected events that unfold over time. They audit widely used benchmarks and find that models can score well on frame-level metrics yet localize whole anomalous events poorly. To address this, they propose two event-localization strategies (a score-refinement pipeline and an end-to-end Dual-Branch Model) and event-based metrics adapted from Temporal Action Localization. Their results show a large gap between frame-level and event-level performance, highlighting the need for event-focused approaches.
Video Anomaly Detection, Pose-based Detection, Frame-level Evaluation, Event Localization, Temporal Action Localization, tIoU (temporal Intersection over Union), AUC-ROC, F1 Score, Gaussian Smoothing, Adaptive Binarization
Authors
Narges Rashvand, Shanle Yao, Armin Danesh Pazho, Babak Rahimi Ardabili, Hamed Tabkhi
Abstract
Pose-based Video Anomaly Detection (VAD) has gained significant attention for its privacy-preserving nature and robustness to environmental variations. However, traditional frame-level evaluation treats video as a collection of isolated frames, which is fundamentally misaligned with how anomalies manifest and are acted upon in the real world. In operational surveillance systems, what matters is not the flagging of individual frames but the reliable detection, localization, and reporting of a coherent anomalous event: a contiguous temporal episode with an identifiable onset and duration. Frame-level metrics are blind to this distinction and, as a result, systematically overestimate model performance for any deployment that requires actionable, event-level alerts. In this work, we propose a shift toward an event-centric perspective in VAD. We first audit widely used VAD benchmarks, including SHT [19], CHAD [6], NWPUC [4], and HuVAD [25], to characterize their event structure. We then introduce two strategies for temporal event localization: a score-refinement pipeline with hierarchical Gaussian smoothing and adaptive binarization, and an end-to-end Dual-Branch Model that directly generates event-level detections. Finally, we establish the first event-based evaluation standard for VAD by adapting Temporal Action Localization metrics, including tIoU-based event matching and multi-threshold F1 evaluation. Our results quantify a substantial performance gap: while all state-of-the-art (SoTA) models achieve frame-level AUC-ROC exceeding 52% on NWPUC [4], their event-level localization precision falls below 10% even at a minimal tIoU of 0.2, with an average event-level F1 of only 0.11 across all thresholds. The codebase for this work is available at https://github.com/TeCSAR-UNCC/EventCentric-VAD.
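To make the frame-versus-event distinction concrete, the following is a minimal Python sketch of the general idea, not the authors' implementation. It smooths per-frame scores, binarizes them, groups flagged frames into events, and scores predicted events against ground truth with tIoU matching and F1. Several pieces are assumptions made for illustration: the smoothing is a single Gaussian pass rather than the hierarchical scheme the abstract mentions, the "mean + k * std" threshold stands in for the unspecified adaptive binarization rule, the matching is a simple greedy one-to-one assignment, and the threshold set in `mean_f1` is hypothetical. All function names are invented here.

```python
import math

def gaussian_smooth(scores, sigma=2.0):
    """Smooth per-frame scores with a truncated Gaussian kernel (single pass only)."""
    radius = max(1, int(3 * sigma))
    offsets = range(-radius, radius + 1)
    kernel = [math.exp(-0.5 * (k / sigma) ** 2) for k in offsets]
    smoothed = []
    for i in range(len(scores)):
        num = den = 0.0
        for k, w in zip(offsets, kernel):
            j = i + k
            if 0 <= j < len(scores):  # renormalize at the borders instead of padding
                num += w * scores[j]
                den += w
        smoothed.append(num / den)
    return smoothed

def binarize(scores, k=0.5):
    """Flag frames above a per-video threshold of mean + k * std (assumed rule)."""
    if not scores:
        return []
    mean = sum(scores) / len(scores)
    std = math.sqrt(sum((s - mean) ** 2 for s in scores) / len(scores))
    return [s > mean + k * std for s in scores]

def extract_events(mask):
    """Group consecutive flagged frames into half-open (start, end) intervals."""
    events, start = [], None
    for i, flagged in enumerate(mask):
        if flagged and start is None:
            start = i
        elif not flagged and start is not None:
            events.append((start, i))
            start = None
    if start is not None:
        events.append((start, len(mask)))
    return events

def tiou(a, b):
    """Temporal IoU of two half-open frame intervals."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def event_f1(pred, gt, tiou_thr):
    """Event-level F1 with greedy matching; each ground-truth event matches once."""
    matched, tp = set(), 0
    for p in pred:
        best_j, best = None, 0.0
        for j, g in enumerate(gt):
            if j not in matched and tiou(p, g) > best:
                best_j, best = j, tiou(p, g)
        if best_j is not None and best >= tiou_thr:
            matched.add(best_j)
            tp += 1
    fp, fn = len(pred) - tp, len(gt) - tp
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def mean_f1(pred, gt, thresholds=(0.2, 0.3, 0.4, 0.5)):
    """Average F1 over several tIoU thresholds (threshold set is hypothetical)."""
    return sum(event_f1(pred, gt, t) for t in thresholds) / len(thresholds)

# One 40-frame ground-truth event versus a fragmented and a coherent prediction.
gt = [(10, 50)]
fragmented = [(10, 14), (20, 24), (30, 34), (40, 44)]
print(event_f1(fragmented, gt, tiou_thr=0.2))  # 0.0: each fragment has tIoU 0.1
print(event_f1([(12, 48)], gt, tiou_thr=0.5))  # 1.0: tIoU is 0.9
```

In this toy case the fragmented prediction flags 16 of the 40 anomalous frames and no normal frames, so a frame-level view would credit it with partial success. At the event level it produces four false alarms and one missed event, which is the kind of discrepancy the abstract describes.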