Modulate-and-Map: Crossmodal Feature Mapping with Cross-View Modulation for 3D Anomaly Detection

2026-04-02
Computer Vision and Pattern Recognition

AI summary

The authors introduce ModMap, a new method for finding and highlighting unusual spots in 3D objects using multiple views and types of data together. Unlike older methods that look at each view separately, their approach learns how features relate across different views and data types. They also created a way to train the model using all view combinations for better detection. They tested ModMap on a new 3D anomaly benchmark and showed it works better than previous methods.

3D anomaly detection · multiview learning · multimodal data · feature mapping · feature modulation · cross-view training · depth encoder · SiM3D benchmark · anomaly segmentation
Authors
Alex Costanzino, Pierluigi Zama Ramirez, Giuseppe Lisanti, Luigi Di Stefano
Abstract
We present ModMap, a natively multiview and multimodal framework for 3D anomaly detection and segmentation. Unlike existing methods that process views independently, our method draws inspiration from the crossmodal feature mapping paradigm to learn to map features across both modalities and views, while explicitly modelling view-dependent relationships through feature-wise modulation. We introduce a cross-view training strategy that leverages all possible view combinations, enabling effective anomaly scoring through multiview ensembling and aggregation. To process high-resolution 3D data, we train and publicly release a foundational depth encoder tailored to industrial datasets. Experiments on SiM3D, a recent benchmark that introduces the first multiview and multimodal setup for 3D anomaly detection and segmentation, demonstrate that ModMap attains state-of-the-art performance by surpassing previous methods by wide margins.
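The core idea of view-conditioned feature mapping can be illustrated with a small sketch. This is not the authors' implementation: the dimensions, the linear modulation head, and the L2 anomaly score are illustrative assumptions. It shows FiLM-style feature-wise modulation, where a code describing the target view produces a per-channel scale and shift applied to source-view features, and the anomaly map is the per-pixel distance between the mapped features and those observed in the target view.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (illustrative assumptions, not the paper's configuration).
C, H, W = 8, 4, 4        # feature channels and spatial resolution
D_VIEW = 3               # dimensionality of the view-conditioning code

# Hypothetical learned weights of the modulation head (random stand-ins).
W_gamma = rng.normal(size=(D_VIEW, C))
W_beta = rng.normal(size=(D_VIEW, C))

def film_modulate(feats, view_code):
    """FiLM-style feature-wise modulation: each channel is scaled and
    shifted according to the conditioning vector (here, the target view)."""
    gamma = view_code @ W_gamma                      # (C,) per-channel scale
    beta = view_code @ W_beta                        # (C,) per-channel shift
    return feats * gamma[:, None, None] + beta[:, None, None]

def anomaly_map(mapped, observed):
    """Per-pixel anomaly score: L2 distance between the mapped (predicted)
    target-view features and the features actually observed in that view."""
    return np.linalg.norm(mapped - observed, axis=0)  # (H, W)

src_feats = rng.normal(size=(C, H, W))               # source-view features
view_code = rng.normal(size=(D_VIEW,))               # encodes the target view
mapped = film_modulate(src_feats, view_code)

# On nominal data the observed features should match the mapping closely,
# so scores stay low; anomalies would produce large residuals.
observed = mapped + 0.01 * rng.normal(size=(C, H, W))
scores = anomaly_map(mapped, observed)
```

Multiview ensembling, as described in the abstract, would then aggregate such per-pixel maps over all source/target view pairs, e.g. by averaging or taking a per-pixel maximum.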