TAC: Timestamped Audio Captioning

2026-02-17

Sound
AI summary

The authors created a new model called Timestamped Audio Captioner (TAC) that better describes overlapping sounds by producing timed captions at different levels of detail. They trained TAC on synthetic but realistic mixtures of real sounds so it learns to handle overlapping audio. TAC outperforms other methods at detecting events and giving detailed captions without making up false information. They also built a version called TAC-V that combines sound and video for richer descriptions. Finally, they showed that linking TAC and TAC-V to a text-based AI improves understanding and reasoning about both audio and audio-visual content.

Audio Language Models, Timestamped Audio Captioning, Polyphonic Audio, Temporal Grounding, Dense Captioning, Synthetic Data, Audio-Visual Integration, Large Language Models, Event Detection, Hallucination in AI
Authors
Sonal Kumar, Prem Seetharaman, Ke Chen, Oriol Nieto, Jiaqi Su, Zhepei Wang, Rithesh Kumar, Dinesh Manocha, Nicholas J. Bryan, Zeyu Jin, Justin Salamon
Abstract
Large Audio Language Models struggle to disentangle overlapping events in complex acoustic scenes, yielding temporally inconsistent captions and frequent hallucinations. We introduce Timestamped Audio Captioner (TAC), a model that produces temporally grounded audio descriptions at varying degrees of detail and resolution. TAC is trained with a synthetic data pipeline that constructs challenging and dynamic mixtures from real-world audio sources, enabling robust learning under realistic polyphonic conditions. Across event detection and dense captioning, TAC outperforms all competing methods, with a low hallucination rate and accurate temporal grounding. We also introduce TAC-V, an audio-visual pipeline that generates semantically rich audio-visual descriptions. We then show that TAC and TAC-V serve as a "semantic bridge" for a text-only reasoner: simple TAC$\rightarrow$LLM and TAC-V$\rightarrow$LLM cascades achieve state-of-the-art scores on benchmarks for audio (MMAU-Pro, MMSU, MMAR) and audio-visual (DailyOmni, VideoHolmes) understanding and reasoning, respectively.
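
To make the synthetic-data idea concrete, here is a minimal sketch of how such a pipeline might overlay real source clips into polyphonic training mixtures paired with timestamped caption targets. The event bank, gain range, sample rate, and caption format below are illustrative assumptions, not the paper's actual pipeline.

```python
import random
import numpy as np

SR = 16000  # sample rate in Hz (an assumption; the paper's setting may differ)

def synth_mixture(event_bank, duration_s=10.0, n_events=3, seed=0):
    """Overlay labeled source clips at random onsets to build a polyphonic
    mixture plus timestamped caption targets.

    event_bank: list of (label, waveform) pairs of mono float32 audio.
    Returns (mixture, captions), where captions is a list of
    (onset_s, offset_s, label) tuples sorted by onset.
    """
    rng = random.Random(seed)
    n = int(duration_s * SR)
    mix = np.zeros(n, dtype=np.float32)
    captions = []
    for label, wav in rng.sample(event_bank, k=n_events):
        wav = wav[:n]                            # clip to the mixture length
        gain = 10 ** (rng.uniform(-12, 0) / 20)  # random level, -12 to 0 dB
        onset = rng.randint(0, n - len(wav))     # random placement in samples
        mix[onset:onset + len(wav)] += gain * wav
        captions.append((onset / SR, (onset + len(wav)) / SR, label))
    peak = np.abs(mix).max()
    if peak > 1.0:                               # normalize to avoid clipping
        mix /= peak
    return mix, sorted(captions)

# Illustrative event bank: noise bursts standing in for real recorded clips.
bank = [("dog barking", np.random.randn(2 * SR).astype(np.float32) * 0.1),
        ("car horn",    np.random.randn(SR).astype(np.float32) * 0.1),
        ("speech",      np.random.randn(3 * SR).astype(np.float32) * 0.1)]
mixture, targets = synth_mixture(bank)
for on, off, label in targets:
    print(f"[{on:5.2f}s - {off:5.2f}s] {label}")
```

Because the mixture is assembled programmatically, the onsets and offsets of every event are known exactly, which is what makes dense, temporally grounded caption supervision possible at scale.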
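The TAC$\rightarrow$LLM cascade can be sketched in the same spirit: serialize the timestamped captions into text and hand them to a text-only reasoner. The prompt template and the `llm` callable here are placeholders of our own, not the interface used in the paper.

```python
def tac_to_llm(captions, question, llm):
    """Cascade sketch: turn TAC's timestamped captions into a text prompt,
    then let a text-only LLM reason over them. `llm` is any
    callable(str) -> str; the prompt wording is an assumption."""
    lines = [f"[{on:.1f}s-{off:.1f}s] {text}" for on, off, text in captions]
    prompt = ("You are given timestamped captions of an audio clip:\n"
              + "\n".join(lines)
              + f"\n\nQuestion: {question}\nAnswer:")
    return llm(prompt)

# Example with captions TAC might produce and a stub in place of a real LLM.
caps = [(0.0, 2.1, "a dog barks twice"),
        (1.4, 3.0, "a car horn honks over the barking"),
        (3.0, 8.5, "a man speaks calmly")]
print(tac_to_llm(caps, "What happens while the dog is barking?",
                 llm=lambda p: "(LLM response here)"))
```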