Listening with Attention: Entropy-Guided Explainability for Transformer-Based Audio Models

2026-06-12 • Sound

Sound • Artificial Intelligence

AI summary

The authors present LEAF-X, a method for explaining how transformer-based speech recognition models such as Whisper arrive at their output. LEAF-X weights the model's attention by entropy, favoring low-entropy (sharply focused) heads and layers, to show which audio frames influenced each output token. The authors report that its explanations are more faithful, more localized, and more stable than those of earlier explanation techniques, which makes the model's decisions easier to inspect and audit.

automatic speech recognition, transformers, attention mechanism, entropy, explainable AI, token-to-frame attribution, encoder-decoder models, causal ablation, faithfulness, locality

Authors

Ravi Ranjan, Utkarsh Grover, Xiaomin Lin, Agoritsa Polyzou

Abstract

Transformer-based automatic speech recognition (ASR) models such as Whisper are highly accurate, but their predictions remain difficult to interpret. Existing explainable AI (XAI) methods often lack faithfulness and precise temporal grounding. We propose Listening with Entropy-guided Attention for Faithful explainability (LEAF-X), a model-intrinsic XAI framework for transformer-based ASR. LEAF-X combines entropy-guided attention weighting, multi-layer attention rollout, and optional causal ablations to identify low-entropy, high-impact heads and layers, producing sparse token-to-frame attributions. Unlike perturbation-based explainers or raw attention maps, LEAF-X exploits the internal structure of encoder-decoder and speech-augmented decoder-only models to generate explanations that better reflect model computation. Results show 32% improved faithfulness, 35-39% stronger locality/sparsity, and the most stable attributions, supporting more transparent and auditable ASR.
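The abstract names two building blocks, entropy-guided attention weighting and multi-layer attention rollout, without giving formulas. The sketch below is only an illustration of those general ideas and is not the authors' implementation: it assumes heads are weighted by a softmax over negative mean entropy, and it uses the common residual-averaged form of rollout. The function names, the `temperature` parameter, and the 0.5 residual mix are all hypothetical choices made for this example.

```python
import numpy as np


def head_entropy(attn, eps=1e-12):
    """Mean Shannon entropy (in nats) of each head's attention rows.

    attn: array of shape (heads, tokens, frames); each row sums to 1.
    """
    row_entropy = -(attn * np.log(attn + eps)).sum(axis=-1)  # (heads, tokens)
    return row_entropy.mean(axis=-1)                         # (heads,)


def entropy_weighted_attention(attn, temperature=1.0):
    """Combine heads into one token-to-frame map, favoring low-entropy heads.

    Assumption: head weights are softmax(-entropy / temperature).
    Returns the (tokens, frames) map and the per-head weights.
    """
    logits = -head_entropy(attn) / temperature
    logits -= logits.max()            # shift for numerical stability
    weights = np.exp(logits)
    weights /= weights.sum()
    combined = np.einsum("h,htf->tf", weights, attn)
    return combined, weights


def attention_rollout(layers):
    """Propagate self-attention through layers (residual-averaged rollout).

    layers: list of (n, n) row-stochastic matrices, first layer first.
    """
    n = layers[0].shape[0]
    rollout = np.eye(n)
    for a in layers:
        a_res = 0.5 * a + 0.5 * np.eye(n)            # account for the skip connection
        a_res /= a_res.sum(axis=-1, keepdims=True)   # keep rows normalized
        rollout = a_res @ rollout
    return rollout


# Toy input: one sharply focused head and one uniform head, 3 tokens x 4 frames.
peaked = np.tile(np.array([0.97, 0.01, 0.01, 0.01]), (3, 1))
uniform = np.full((3, 4), 0.25)
token_to_frame, head_weights = entropy_weighted_attention(np.stack([peaked, uniform]))
```

In this toy input the focused head receives the larger weight, so the combined map stays closer to the peaked pattern than a plain average over heads would. How LEAF-X selects heads and layers, and how it applies the optional causal ablations, is specified in the paper itself rather than here.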
