FineLAP: Taming Heterogeneous Supervision for Fine-grained Language-Audio Pretraining
2026-04-01 • Sound
AI summary
The authors present FineLAP, a method that improves audio-language models by teaching them to understand sounds both as whole clips and as short, precisely located segments. It addresses the imbalance between abundant clip-level text descriptions and scarce frame-level labels by combining both kinds of supervision during training. The authors also build FineLAP-100k, a large synthetic dataset with detailed event annotations to support this training. Experiments show that FineLAP performs strongly on sound classification, sound event detection, and audio-text retrieval, and that learning coarse and fine details together improves both.
Contrastive Learning · Audio-Language Models · Frame-level Supervision · Clip-level Alignment · Self-supervised Encoder · Sound Event Detection · Data Augmentation · Dual-stream Loss · Text-to-Audio Grounding · Synthetic Dataset
Authors
Xiquan Li, Xuenan Xu, Ziyang Ma, Wenxi Chen, Haolin He, Qiuqiang Kong, Xie Chen
Abstract
Contrastively pretrained audio-language models (e.g., CLAP) excel at clip-level understanding but struggle with frame-level tasks. Existing extensions fail to exploit the varying granularity of real-world audio-text data, where massive clip-level textual descriptions coexist with limited frame-level annotations. This paper proposes Fine-grained Language-Audio Pretraining (FineLAP), a novel training paradigm that advances both clip- and frame-level alignment in CLAP with heterogeneous data. FineLAP introduces a dual-stream sigmoid loss with a cluster-based sampling strategy to jointly learn from clip- and frame-level supervision. To capture both global semantics and local details, FineLAP uses a decoupled audio projector on top of a self-supervised encoder. To alleviate the scarcity of temporally annotated data, we present FineLAP-100k, a large-scale synthetic SED dataset constructed through a scalable curation pipeline. Extensive experiments demonstrate that FineLAP achieves SOTA performance across multiple audio understanding tasks, including retrieval, classification, sound event detection, and text-to-audio grounding. Ablation studies further show that coarse- and fine-grained alignment are mutually beneficial, providing insights for building better audio-language models (ALMs).
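To make the "dual-stream sigmoid loss" concrete: the abstract does not give the exact formulation, but a plausible reading is a SigLIP-style pairwise sigmoid loss applied in two streams, one over clip embeddings versus clip captions (positives on the diagonal) and one over frame embeddings versus event-text embeddings (positives given by temporal overlap). The sketch below is an illustrative assumption, not the paper's implementation; the function names, the weighting scheme `w`, and the temperature/bias values are hypothetical.

```python
import numpy as np

def sigmoid_loss(audio_emb, text_emb, match, t=10.0, b=-10.0):
    """SigLIP-style pairwise sigmoid loss (illustrative, not FineLAP's exact loss).
    match[i, j] = +1 for a positive audio-text pair, -1 for a negative."""
    # Cosine similarities between L2-normalized embeddings, scaled and shifted.
    a = audio_emb / np.linalg.norm(audio_emb, axis=-1, keepdims=True)
    x = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    logits = t * (a @ x.T) + b
    # -log sigmoid(match * logits), computed stably as log(1 + exp(-z)).
    z = match * logits
    return np.mean(np.logaddexp(0.0, -z))

def dual_stream_loss(clip_a, clip_t, frame_a, event_t, frame_match, w=0.5):
    """Combine clip-level and frame-level sigmoid losses.
    Clip stream: matched pairs sit on the diagonal of the batch.
    Frame stream: frame_match marks which frames overlap which event texts."""
    n = clip_a.shape[0]
    clip_match = 2.0 * np.eye(n) - 1.0  # +1 on the diagonal, -1 elsewhere
    return (1 - w) * sigmoid_loss(clip_a, clip_t, clip_match) \
         + w * sigmoid_loss(frame_a, event_t, frame_match)
```

A sigmoid loss treats every audio-text pair as an independent binary decision, so the two streams can use match matrices of different shapes (square for clips, rectangular for frames versus event texts) without a shared softmax normalization, which is what makes mixing clip- and frame-level supervision in one batch straightforward.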