How Do Instructions Shape Speech? Cross-Attention Attribution for Style-Captioned Text-to-Speech

2026-06-18 | Artificial Intelligence

AI summary

The authors studied how individual words in a style description affect the sound of generated speech in text-to-speech systems. They developed a method that traces each word's influence on the speech during generation, applying cross-attention attribution to a speech diffusion model. Their analysis found that attention to style words varies little over time, consistent with those words setting global voice traits; that this attention correlates with pitch (F0) and loudness (energy); and that style conditioning is strongest in the early generation steps and in the deep network layers. This work helps explain how natural language controls voice characteristics in style-captioned TTS models.

text-to-speech (TTS), style captioning, cross-attention, speech diffusion models, DAAM framework, F0 (fundamental frequency), energy, ODE steps, network layers, attention entropy
Authors
Nityanand Mathur, Hamees Sayed, Wasim Madha, Apoorv Singh, Sameer Khurana, Akshat Mandloi, Sudarshan Kamath
Abstract
Style-captioned text-to-speech systems use natural language to control voice characteristics, but how individual words influence acoustic output remains unclear. Understanding this is critical for diagnosing failure modes and improving controllability in expressive TTS. We propose cross-attention attribution for speech diffusion models, adapting the DAAM framework to the speech domain for the first time, and apply it to CapSpeech-TTS. Our method extracts per-token heatmaps across 25 layers and 24 ODE steps. We analyze 3,600 (style caption, text transcript) combinations comprising 120 style captions conditioning the generation of 30 text transcripts each, revealing how caption tokens shape waveforms. Results show: (1) style tokens have lower temporal variance than content/function tokens, confirming global conditioning; (2) style attention correlates with F0 and energy; (3) style conditioning peaks in early steps and deep layers; (4) attention entropy reaches its minimum at layer 17, co-occurring with the style importance peak, indicating maximal network selectivity at the most style-critical stage. This is the first study of how natural language influences cross-attention in speech diffusion models.
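
The quantities named in the abstract (per-token heatmaps, temporal variance, attention entropy per layer) can be illustrated with a minimal sketch. This is not the authors' implementation: the nested-list layout `attn[layer][step][frame][token]`, the choice to average over layers and ODE steps, and all function names are assumptions made for illustration, and the toy numbers are invented rather than taken from the paper.

```python
import math

def token_heatmaps(attn):
    """Average attention over layers and steps (an assumed aggregation).

    attn[layer][step][frame][token] -> heat[token][frame]
    """
    n_layers, n_steps = len(attn), len(attn[0])
    n_frames, n_tokens = len(attn[0][0]), len(attn[0][0][0])
    heat = [[0.0] * n_frames for _ in range(n_tokens)]
    for layer in attn:
        for step in layer:
            for f, row in enumerate(step):
                for t, w in enumerate(row):
                    heat[t][f] += w
    scale = 1.0 / (n_layers * n_steps)
    return [[v * scale for v in track] for track in heat]

def temporal_variance(track):
    """Population variance of one token's heatmap across speech frames."""
    mean = sum(track) / len(track)
    return sum((v - mean) ** 2 for v in track) / len(track)

def layer_entropy(layer):
    """Mean Shannon entropy (nats) of the attention distribution over tokens."""
    total, count = 0.0, 0
    for step in layer:
        for row in step:
            total -= sum(w * math.log(w) for w in row if w > 0)
            count += 1
    return total / count

# Toy input: 2 layers, 1 step, 2 frames, 3 caption tokens (hypothetical values).
attn = [
    [[[0.5, 0.25, 0.25], [0.5, 0.25, 0.25]]],  # layer 0: diffuse attention
    [[[0.5, 0.5, 0.0], [0.5, 0.0, 0.5]]],      # layer 1: more selective
]
heat = token_heatmaps(attn)
variances = [temporal_variance(track) for track in heat]
entropies = [layer_entropy(layer) for layer in attn]
```

In this toy input, token 0 receives the same weight in every frame, so its temporal variance is zero, which is the signature the abstract associates with global style conditioning. Layer 1 concentrates attention on fewer tokens and therefore has lower entropy than layer 0.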