VETime: Vision Enhanced Zero-Shot Time Series Anomaly Detection

2026-02-18

Computer Vision and Pattern Recognition
AI summary

The authors address the challenge of finding unusual patterns in time-series data, which requires detecting both sudden point-level anomalies and longer-range contextual anomalies. They point out that current methods either focus on fine-grained timing without seeing the big picture or capture global patterns while missing fine details. To resolve this, they created VETime, a system that combines temporal and image-based views by carefully aligning and merging the two perspectives. Their experiments show that VETime detects anomalies more accurately and efficiently than previous methods without needing extra training.

Time-series anomaly detection, Point anomalies, Context anomalies, Temporal alignment, Vision-based models, Contrastive learning, Multi-modal fusion, Zero-shot learning, Reversible image conversion
Authors
Yingyuan Yang, Tian Lan, Yifei Gao, Yimeng Lu, Wenjun He, Meng Wang, Chenghao Liu, Chen Zhang
Abstract
Time-series anomaly detection (TSAD) requires identifying both immediate Point Anomalies and long-range Context Anomalies. However, existing foundation models face a fundamental trade-off: 1D temporal models provide fine-grained pointwise localization but lack a global contextual perspective, while 2D vision-based models capture global patterns but suffer from information bottlenecks due to a lack of temporal alignment and coarse-grained pointwise detection. To resolve this dilemma, we propose VETime, the first TSAD framework that unifies temporal and visual modalities through fine-grained visual-temporal alignment and dynamic fusion. VETime introduces a Reversible Image Conversion and a Patch-Level Temporal Alignment module to establish a shared visual-temporal timeline, preserving discriminative details while maintaining temporal sensitivity. Furthermore, we design an Anomaly Window Contrastive Learning mechanism and a Task-Adaptive Multi-Modal Fusion to adaptively integrate the complementary perceptual strengths of both modalities. Extensive experiments demonstrate that VETime significantly outperforms state-of-the-art models in zero-shot scenarios, achieving superior localization precision with lower computational overhead than current vision-based approaches. Code available at: https://github.com/yyyangcoder/VETime.
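The Reversible Image Conversion is only described at a high level in the abstract. As a rough, hypothetical illustration of what a lossless 1D-to-2D mapping can look like (not the authors' actual implementation), the sketch below reshapes a series into fixed-length patches stacked as image rows and inverts the mapping exactly; the helper names series_to_image/image_to_series and the patch_len value are assumptions made for this example.

```python
# Hypothetical sketch of a reversible 1D-to-2D conversion, NOT the paper's
# Reversible Image Conversion: the series is edge-padded and reshaped into
# fixed-length patches stacked as image rows, so every pixel corresponds to
# exactly one timestep and the inverse mapping is lossless.
import numpy as np

def series_to_image(x: np.ndarray, patch_len: int = 32):
    """Reshape a 1D series into a 2D array of shape (n_patches, patch_len)."""
    pad = (-len(x)) % patch_len                  # pad so the length divides evenly
    x_padded = np.pad(x, (0, pad), mode="edge")  # repeat the last value at the tail
    image = x_padded.reshape(-1, patch_len)      # each row is one temporal patch
    return image, len(x)                         # keep the original length for inversion

def image_to_series(image: np.ndarray, orig_len: int) -> np.ndarray:
    """Invert series_to_image exactly by flattening and trimming the padding."""
    return image.reshape(-1)[:orig_len]

if __name__ == "__main__":
    t = np.linspace(0, 8 * np.pi, 500)
    series = np.sin(t) + 0.05 * np.random.randn(500)
    img, n = series_to_image(series, patch_len=32)
    recovered = image_to_series(img, n)
    assert np.allclose(series, recovered)        # round trip is exact (reversible)
    print("image shape:", img.shape)             # rows index patches on a shared timeline
```

Because each pixel maps back to a unique timestep, a 2D vision model's patch-level scores can in principle be projected onto the original timeline, which is the kind of shared visual-temporal alignment the abstract describes.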