CapTalk: Unified Voice Design for Single-Utterance and Dialogue Speech Generation
2026-04-09 • Sound
AI summary
The authors created a system called CapTalk that designs voices from text descriptions, not just for single sentences but also for whole conversations. CapTalk uses captions at two levels: one describing the voice for an individual sentence and one describing the overall speaker in a conversation. In dialogue, it also plans how the voice should sound on each turn. The authors developed a method that keeps the voice's timbre consistent while still letting its expression change naturally with the conversation. Experiments show that CapTalk outperforms previous methods on a single-sentence voice design benchmark and gives better expression control and contextual fit in multi-turn dialogue.
text-to-speech, voice design, timbre, speaking style, multimodal generation, autoregressive model, variational conditioning, dialogue modeling, expression control, turn-level control
Authors
Xiaosu Su, Zihan Sun, Peilei Jia, Jun Gao
Abstract
Voice design from natural language descriptions is emerging as a new task in text-to-speech multimodal generation, aiming to synthesize speech with a target timbre and speaking style without relying on reference audio. However, existing methods mainly focus on single-utterance generation, leaving conversational voice design largely unexplored. In this work, we extend voice design to dialogue, enabling better target speaker modeling and turn-level expressive control in natural conversational settings. We propose CapTalk, a unified caption-conditioned text-audio autoregressive framework for both single-utterance and dialogue voice design. CapTalk uses utterance-level captions for single-utterance voice design and speaker-level captions for dialogue speaker modeling, and further introduces a CoT control sequence in dialogue to explicitly plan turn-level dynamic attributes. To resolve the conflict between stable timbre preservation and context-adaptive expression, we propose a hierarchical variational conditioning module with an utterance-level speaker encoder. This enables timbre reuse while keeping expression adaptive to the current utterance and, in dialogue, the surrounding context. We also build a comprehensive evaluation protocol for both single-utterance and dialogue settings. Experiments show that CapTalk achieves state-of-the-art performance on a single-utterance voice design benchmark and delivers better expression controllability and contextual appropriateness in multi-turn dialogue. Audio samples are available at: https://anonymous.4open.science/api/repo/Captalk-D601/file/index.html.
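To make the two conditioning modes in the abstract concrete, the following is a minimal sketch of how the inputs to a caption-conditioned autoregressive model might be laid out. It is not the authors' implementation: the bracketed markers (`[CAPTION]`, `[SPEAKER_CAPTION]`, `[CONTEXT]`, `[PLAN]`, `[TEXT]`, `[AUDIO]`), the `Turn` class, the attribute names (`emotion`, `pace`), and all function names are hypothetical and chosen only for illustration. The abstract does not specify the actual sequence format, and in the real system the control sequence is presumably produced by the model rather than supplied by hand.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Turn:
    speaker: str
    text: str
    # Hypothetical turn-level dynamic attributes (e.g. emotion, pace).
    control: Dict[str, str] = field(default_factory=dict)


def single_utterance_prompt(utterance_caption: str, text: str) -> str:
    """An utterance-level caption conditions one standalone utterance."""
    return f"[CAPTION] {utterance_caption} [TEXT] {text} [AUDIO]"


def control_sequence(control: Dict[str, str]) -> str:
    """Serialize turn-level attributes in a fixed (sorted) key order."""
    return " ".join(f"{key}={control[key]}" for key in sorted(control))


def dialogue_prompt(speaker_caption: str, history: List[Turn], current: Turn) -> str:
    """Speaker-level caption first, then context, then the plan, then the text."""
    parts = [f"[SPEAKER_CAPTION] {speaker_caption}"]
    for turn in history:
        parts.append(f"[CONTEXT] {turn.speaker}: {turn.text}")
    # The plan precedes the text and audio, so the turn-level attributes are
    # fixed before any speech tokens for this turn would be generated.
    parts.append(f"[PLAN] {control_sequence(current.control)}")
    parts.append(f"[TEXT] {current.text}")
    parts.append("[AUDIO]")
    return " ".join(parts)
```

In this layout, the speaker-level caption stays constant across turns of the same speaker, while the `[PLAN]` segment changes per turn, which mirrors the abstract's split between stable timbre and context-adaptive expression.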