StyleStream: Real-Time Zero-Shot Voice Style Conversion

2026-02-23

Sound, Artificial Intelligence
AI summary

The authors developed StyleStream, a new system that can change the way someone sounds in real time to match another person's voice style, including their tone, accent, and emotion. Their method separates what is being said (content) from how it is said (style) using a two-part process: first removing the style, then adding the new style back in. They use special techniques to keep the meaning unchanged while changing the voice style, making it fast enough to work with only one second of delay. This is the first system to do zero-shot voice style conversion in real time.

voice style conversion, linguistic content, style disentanglement, zero-shot learning, diffusion transformer, non-autoregressive architecture, information bottleneck, real-time processing, end-to-end latency, speaker timbre
Authors
Yisi Liu, Nicholas Lee, Gopala Anumanchipalli
Abstract
Voice style conversion aims to transform an input utterance to match a target speaker's timbre, accent, and emotion, with a central challenge being the disentanglement of linguistic content from style. While prior work has explored this problem, conversion quality remains limited, and real-time voice style conversion has not been addressed. We propose StyleStream, the first streamable zero-shot voice style conversion system that achieves state-of-the-art performance. StyleStream consists of two components: a Destylizer, which removes style attributes while preserving linguistic content, and a Stylizer, a diffusion transformer (DiT) that reintroduces target style conditioned on reference speech. Robust content-style disentanglement is enforced through text supervision and a highly constrained information bottleneck. This design enables a fully non-autoregressive architecture, achieving real-time voice style conversion with an end-to-end latency of 1 second. Samples and real-time demo: https://berkeley-speech-group.github.io/StyleStream/.
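The abstract's two-stage pipeline can be sketched in miniature: a Destylizer squeezes each frame through a narrow information bottleneck so little room is left for style, and a Stylizer re-expands the content codes conditioned on a style embedding pooled from reference speech. The sketch below is a toy illustration under stated assumptions, not the paper's architecture: the class names mirror the abstract, but the linear projections, the 8-dimensional bottleneck, and the mean-pooled style vector are all hypothetical stand-ins for the text-supervised Destylizer and the diffusion transformer Stylizer.

```python
import numpy as np

rng = np.random.default_rng(0)

class Destylizer:
    """Toy content extractor (hypothetical): a narrow linear bottleneck
    that discards most per-frame variance, standing in for style removal."""
    def __init__(self, feat_dim=80, bottleneck_dim=8):
        # Random projection in place of a learned, text-supervised encoder.
        self.proj = rng.standard_normal((feat_dim, bottleneck_dim)) / np.sqrt(feat_dim)

    def __call__(self, frames):
        # frames: (T, feat_dim) -> content codes: (T, bottleneck_dim)
        return frames @ self.proj

class Stylizer:
    """Toy style re-injector (hypothetical): conditions content codes on a
    global style embedding pooled from reference speech, standing in for
    the reference-conditioned diffusion transformer."""
    def __init__(self, bottleneck_dim=8, feat_dim=80):
        self.up = rng.standard_normal((bottleneck_dim, feat_dim)) / np.sqrt(bottleneck_dim)

    def __call__(self, content, reference):
        style = reference.mean(axis=0)    # (feat_dim,) global style vector
        return content @ self.up + style  # broadcast style over all frames

# Usage: convert 100 frames of 80-dim source features toward a reference style.
source = rng.standard_normal((100, 80))
reference = rng.standard_normal((50, 80))
destylize, stylize = Destylizer(), Stylizer()
content = destylize(source)              # (100, 8): heavily bottlenecked
converted = stylize(content, reference)  # (100, 80): content + reference style
```

Because every operation here is framewise (no recurrence over future frames), the sketch also hints at why a non-autoregressive design lends itself to streaming: each chunk of frames can be destylized and restylized as it arrives.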