How Do Instructions Shape Speech? Cross-Attention Attribution for Style-Captioned Text-to-Speech

2026-06-18 • Artificial Intelligence


AI summary

The authors studied how individual words in style descriptions affect the sound of generated speech in text-to-speech systems. They developed a method to trace how each word influences parts of the speech during generation, applying cross-attention attribution to a speech diffusion model. Their analysis showed that style words act on overall voice traits consistently across the utterance, that attention to style words correlates with pitch (F0) and loudness (energy), and that style conditioning is strongest in the early generation steps and the deep network layers. This work helps explain how natural language controls voice characteristics in advanced TTS models.

text-to-speech (TTS), style captioning, cross-attention, speech diffusion models, DAAM framework, F0 (fundamental frequency), energy, ODE steps, network layers, attention entropy

Authors

Nityanand Mathur, Hamees Sayed, Wasim Madha, Apoorv Singh, Sameer Khurana, Akshat Mandloi, Sudarshan Kamath

Abstract

Style-captioned text-to-speech systems use natural language to control voice characteristics, but how individual words influence acoustic output remains unclear. Understanding this is critical for diagnosing failure modes and improving controllability in expressive TTS. We propose cross-attention attribution for speech diffusion models, adapting the DAAM framework to the speech domain for the first time, and apply it to CapSpeech-TTS. Our method extracts per-token heatmaps across 25 layers and 24 ODE steps. We analyze 3,600 (style caption, text transcript) combinations, comprising 120 style captions that each condition the generation of 30 text transcripts, revealing how caption tokens shape waveforms. Results show: (1) style tokens have lower temporal variance than content/function tokens, confirming global conditioning; (2) style attention correlates with F0 and energy; (3) style conditioning peaks in early steps and deep layers; (4) attention entropy reaches its minimum at layer 17, co-occurring with the style importance peak, indicating maximal network selectivity at the most style-critical stage. This is the first study of how natural language influences cross-attention in speech diffusion models.
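The quantities named in the abstract (per-token heatmaps, temporal variance, and per-layer attention entropy) can be illustrated with a minimal sketch. This is not the authors' implementation: the tensor layout `(steps, layers, frames, tokens)`, the function names, and the use of a plain mean over steps and layers are all assumptions made for illustration. Only the counts of 24 ODE steps and 25 layers come from the abstract, and the attention values below are random placeholders, so the printed layer index carries no meaning.

```python
import numpy as np


def token_heatmaps(attn):
    """Average cross-attention over ODE steps and layers.

    attn: array of shape (steps, layers, frames, tokens) (assumed layout).
    Returns an array of shape (tokens, frames): one heatmap per caption token.
    """
    return attn.mean(axis=(0, 1)).T


def temporal_variance(heatmaps):
    """Variance of each token's heatmap along the time (frame) axis."""
    return heatmaps.var(axis=1)


def layer_entropy(attn, eps=1e-12):
    """Mean Shannon entropy of the attention distribution over tokens, per layer."""
    p = attn / (attn.sum(axis=-1, keepdims=True) + eps)  # normalize over tokens
    ent = -(p * np.log(p + eps)).sum(axis=-1)            # (steps, layers, frames)
    return ent.mean(axis=(0, 2))                         # (layers,)


# Placeholder data only: random attention, not model output.
rng = np.random.default_rng(0)
steps, layers, frames, tokens = 24, 25, 50, 6  # frames and tokens are arbitrary
attn = rng.random((steps, layers, frames, tokens))
attn /= attn.sum(axis=-1, keepdims=True)

heat = token_heatmaps(attn)
var = temporal_variance(heat)
ent = layer_entropy(attn)

print("heatmap shape:", heat.shape)
print("per-token temporal variance shape:", var.shape)
print("index of the minimum-entropy layer (meaningless on random data):", int(ent.argmin()))
```

Under these assumptions, a token that conditions the whole utterance uniformly would show a flat heatmap and therefore low temporal variance, while a layer that concentrates attention on few tokens would show low entropy.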
