Audio Avatar Fingerprinting: An Approach for Authorized Use of Voice Cloning in the Era of Synthetic Audio

2026-03-20

Sound
AI summary

The authors explain that new AI tools can clone a voice realistically from only a few seconds of someone's audio, which creates problems for voice-based security systems and video calls. They highlight the need to check whether a synthetic voice is being used with the speaker's permission, a task they call 'audio avatar fingerprinting.' To address it, they evaluate a speaker verification model for detecting fake speech and introduce a new dataset, since no existing one suited this purpose. Their work aims to help verify whether synthetic voices are being used by authorized people or not.

AI speech synthesis, speaker verification, fake speech detection, audio avatar fingerprinting, speech forensics, deepfake audio, authentication systems, synthetic voices, dataset
Authors
Candice R. Gerstner
Abstract
With the advancements in AI speech synthesis, it is easier than ever before to generate realistic audio in a target voice. One only needs a few seconds of reference audio from the target, quite literally putting words in the target person's mouth. This imposes a new set of forensics-related challenges on speech-based authentication systems, videoconferencing, and audio-visual broadcasting platforms, where we want to detect synthetic speech. At the same time, leveraging AI speech synthesis can enhance different modes of communication through features such as low-bandwidth communication and audio enhancement, leading to ever-increasing legitimate use cases of synthetic audio. In these cases, we want to verify that the synthesized voice is actually driven by the user. This requires a mechanism to verify whether a given piece of synthetic audio is driven by an authorized identity or not. We term this task audio avatar fingerprinting. As a step towards audio forensics in these new and emerging situations, we analyze and extend an off-the-shelf speaker verification model, developed outside of a forensics context, for the tasks of fake speech detection and audio avatar fingerprinting, the first experimentation of its kind. Furthermore, we observe that no existing dataset allows for the novel task of verifying the authorized use of synthetic audio, a limitation which we address by introducing a new speech forensics dataset for this task.
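To make the verification step concrete, the sketch below (not the authors' implementation) shows one common way an off-the-shelf speaker verification model can be repurposed for this kind of check: compare a speaker embedding of the enrolled, authorized user's reference audio against an embedding of the incoming synthetic utterance using cosine similarity, and accept only above a tuned threshold. The embedding extractor, embedding dimensionality, and threshold value here are illustrative assumptions, not details from the paper.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def is_authorized(enrolled_emb: np.ndarray,
                  synthetic_emb: np.ndarray,
                  threshold: float = 0.5) -> bool:
    """Accept the synthetic utterance only if its speaker embedding is close
    enough to the enrolled (authorized) user's embedding.

    The threshold is a placeholder; in practice it would be tuned on a
    labeled development set for the target operating point."""
    return cosine_similarity(enrolled_emb, synthetic_emb) >= threshold


# Random vectors stand in for real embeddings that a pretrained speaker
# verification model (e.g., an x-vector or ECAPA-TDNN system) would produce
# from the reference audio and the synthetic utterance.
rng = np.random.default_rng(0)
enrolled = rng.normal(size=192)    # embedding of the authorized user's reference audio
candidate = rng.normal(size=192)   # embedding of the incoming synthetic utterance
print(is_authorized(enrolled, candidate))
```

In such a setup, fake speech detection and authorized-use verification remain distinct decisions: the former asks whether the audio is synthetic at all, while the latter, as framed in the abstract, asks whether a synthetic utterance is driven by an identity that is allowed to use that voice.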