CG-DMER: Hybrid Contrastive-Generative Framework for Disentangled Multimodal ECG Representation Learning

2026-02-24 • Artificial Intelligence

AI summary

The authors identify two problems in current multimodal ECG methods: they process ECGs without regard to which lead a signal comes from, so they miss spatial-temporal dependencies across leads, and they align ECG signals directly with free-text clinical reports, which introduces modality-specific biases. To address this, they propose CG-DMER, which masks and reconstructs ECG signals along both the spatial (lead) and temporal dimensions to capture fine-grained patterns, and which uses modality-specific and modality-shared encoders to separate shared information from modality-specific noise when pairing ECGs with reports. The authors report state-of-the-art performance across diverse downstream tasks on three public datasets.

Electrocardiogram (ECG), Multimodal learning, Spatial-temporal modeling, Representation disentanglement, Contrastive learning, Generative modeling, Modality alignment, Lead-agnostic processing, Cardiovascular diagnosis, Clinical reports

Authors

Ziwei Niu, Hao Sun, Shujun Bian, Xihong Yang, Lanfen Lin, Yuxin Liu, Yueming Jin

Abstract

Accurate interpretation of electrocardiogram (ECG) signals is crucial for diagnosing cardiovascular diseases. Recent multimodal approaches that integrate ECGs with accompanying clinical reports show strong potential, but they still face two main concerns from a modality perspective: (1) intra-modality: existing models process ECGs in a lead-agnostic manner, overlooking spatial-temporal dependencies across leads, which restricts their effectiveness in modeling fine-grained diagnostic patterns; (2) inter-modality: existing methods directly align ECG signals with clinical reports, introducing modality-specific biases due to the free-text nature of the reports. In light of these two issues, we propose CG-DMER, a contrastive-generative framework for disentangled multimodal ECG representation learning, powered by two key designs: (1) Spatial-temporal masked modeling is designed to better capture fine-grained temporal dynamics and inter-lead spatial dependencies by applying masking across both spatial and temporal dimensions and reconstructing the missing information. (2) A representation disentanglement and alignment strategy is designed to mitigate unnecessary noise and modality-specific biases by introducing modality-specific and modality-shared encoders, ensuring a clearer separation between modality-invariant and modality-specific representations. Experiments on three public datasets demonstrate that CG-DMER achieves state-of-the-art performance across diverse downstream tasks.
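
The abstract does not specify how the spatial-temporal masking is implemented, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' method. It assumes a 12-lead ECG stored as a NumPy array of shape (leads, samples), non-overlapping time patches, zero-filling of masked regions, and a mean-squared reconstruction error computed on masked positions only. The function names, the patch length, and the mask ratios are hypothetical choices made for illustration.

```python
import numpy as np


def spatial_temporal_mask(ecg, patch_len=50, lead_mask_ratio=0.25,
                          time_mask_ratio=0.5, seed=0):
    """Mask an ECG along both the lead (spatial) and time axes.

    ecg: array of shape (n_leads, n_samples).
    Returns (masked_ecg, patch_mask), where patch_mask has shape
    (n_leads, n_patches) and True marks a hidden patch.
    All hyperparameters here are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    n_leads, n_samples = ecg.shape
    n_patches = n_samples // patch_len
    patch_mask = np.zeros((n_leads, n_patches), dtype=bool)

    # Spatial masking: hide every patch of a random subset of leads.
    n_masked_leads = int(round(lead_mask_ratio * n_leads))
    masked_leads = rng.choice(n_leads, size=n_masked_leads, replace=False)
    patch_mask[masked_leads, :] = True

    # Temporal masking: hide the same random time patches on all leads.
    n_masked_patches = int(round(time_mask_ratio * n_patches))
    masked_patches = rng.choice(n_patches, size=n_masked_patches, replace=False)
    patch_mask[:, masked_patches] = True

    # Expand the patch-level mask to sample level and zero out hidden samples.
    sample_mask = np.repeat(patch_mask, patch_len, axis=1)
    masked_ecg = ecg[:, : n_patches * patch_len].copy()
    masked_ecg[sample_mask] = 0.0
    return masked_ecg, patch_mask


def masked_reconstruction_mse(pred, target, patch_mask, patch_len=50):
    """Mean squared error over masked samples only (assumed objective)."""
    sample_mask = np.repeat(patch_mask, patch_len, axis=1)
    diff = pred[sample_mask] - target[:, : sample_mask.shape[1]][sample_mask]
    return float(np.mean(diff ** 2))


if __name__ == "__main__":
    ecg = np.random.default_rng(1).standard_normal((12, 500))
    masked, mask = spatial_temporal_mask(ecg)
    # A trivial "model" that predicts zeros everywhere, for illustration only.
    loss = masked_reconstruction_mse(np.zeros_like(masked), ecg, mask)
    print(mask.shape, int(mask.sum()), loss)
```

In a real pipeline the zero prediction would be replaced by the output of an encoder-decoder network, and masking whole leads forces that network to infer a hidden lead from the visible ones, which is one plausible way to encourage inter-lead (spatial) modeling.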
