Improving Generalization on Cybersecurity Tasks with Multi-Modal Contrastive Learning
2026-03-20 • Cryptography and Security • Artificial Intelligence
AI summary
The authors studied why machine learning models in cybersecurity often fail in real-world situations because they learn easy but unhelpful shortcuts instead of true security concepts. They tried a new learning method that connects information from rich sources like text descriptions to less detailed data like network payloads. Their two-step approach first creates a meaningful space from text descriptions, then teaches the payload data to fit into this space, helping the model learn better features. Tests on private and synthetic datasets show their method helps reduce shortcut learning. They also shared their code and synthetic dataset for others to use.
machine learning, cybersecurity, generalization, contrastive learning, multi-modal learning, payload, text embeddings, threat classification, CVE, large language models
Authors
Jianan Huang, Rodolfo V. Valentim, Luca Vassio, Matteo Boffa, Marco Mellia, Idilio Drago, Dario Rossi
Abstract
The use of ML in cybersecurity has long been impaired by generalization issues: models that work well in controlled scenarios fail to maintain performance in production. The root cause often lies in ML algorithms learning superficial patterns (shortcuts) rather than underlying cybersecurity concepts. We investigate contrastive multi-modal learning as a first step towards improving ML performance in cybersecurity tasks. We aim to transfer knowledge from data-rich modalities, such as text, to data-scarce modalities, such as payloads. We set up a case study on threat classification and propose a two-stage multi-modal contrastive learning framework that uses textual vulnerability descriptions to guide payload classification. First, we construct a semantically meaningful embedding space using contrastive learning on descriptions. Then, we align payloads to this space, transferring knowledge from text to payloads. We evaluate the approach on a large-scale private dataset and a synthetic benchmark built from public CVE descriptions and LLM-generated payloads. The methodology appears to reduce shortcut learning over baselines on both benchmarks. We release our synthetic benchmark and source code as open source.
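The second stage described above, aligning payload embeddings to a previously learned text-description embedding space, is typically trained with a contrastive objective over matched (payload, description) pairs. The sketch below is a minimal, hedged illustration of one common choice, a symmetric InfoNCE loss; the paper does not specify its exact loss, so the function name, temperature value, and numpy-only setup here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def info_nce(payload_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss aligning payload embeddings to a (frozen)
    text-description embedding space. Row i of payload_emb and row i of
    text_emb are assumed to be a matched positive pair; all other rows in
    the batch act as negatives. Hypothetical sketch, not the paper's code."""
    # L2-normalize so that dot products are cosine similarities
    p = payload_emb / np.linalg.norm(payload_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = p @ t.T / temperature          # similarity matrix, scaled
    idx = np.arange(len(p))                 # positives sit on the diagonal

    def xent(l):
        # numerically stable cross-entropy with the diagonal as targets
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[idx, idx].mean()

    # average the payload->text and text->payload directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

When payload embeddings coincide with their paired text embeddings the loss is near zero, and it grows as pairs are mismatched, which is the gradient signal that pulls the payload encoder toward the text space.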