Towards Real-Time Human-AI Musical Co-Performance: Accompaniment Generation with Latent Diffusion Models and MAX/MSP

2026-04-08

Sound, Artificial Intelligence
AI summary

The authors created a system in which an AI plays music along with a live musician in real time. They combined a widely used music environment (MAX/MSP) with a Python-based AI model that predicts and generates musical accompaniment quickly enough to keep up during a performance. To make the AI faster, they used a method called consistency distillation, which cut the time needed to generate music by a factor of 5.4. In their tests the system produced good-quality music that stays in rhythm, and the results show a trade-off between how fast the AI works, how far ahead it has to predict, and how good the output sounds.

latent diffusion model, MAX/MSP, real-time audio processing, generative model, OSC/UDP communication, sliding-window look-ahead, consistency distillation, sampling time, beat alignment, musical accompaniment
Authors
Tornike Karchkhadze, Shlomo Dubnov
Abstract
We present a framework for real-time human-AI musical co-performance, in which a latent diffusion model generates instrumental accompaniment in response to a live stream of context audio. The system combines a MAX/MSP front-end, which handles real-time audio input, buffering, and playback, with a Python inference server that runs the generative model; the two communicate via OSC/UDP messages. This allows musicians to perform in MAX/MSP, a well-established, real-time-capable environment, while interacting with a large-scale Python-based generative model, overcoming the fundamental disconnect between real-time music tools and state-of-the-art AI models. We formulate accompaniment generation as a sliding-window look-ahead protocol, training the model to predict future audio from partial context, where system latency is a critical constraint. To reduce latency, we apply consistency distillation to our diffusion model, achieving a 5.4x reduction in sampling time; both the original and the distilled models operate in real time. Evaluated on musical coherence, beat alignment, and audio quality, both models achieve strong performance in the Retrospective regime and degrade gracefully as look-ahead increases. These results demonstrate the feasibility of diffusion-based real-time accompaniment and expose the fundamental trade-off between model latency, look-ahead depth, and generation quality that any such system must navigate.
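To make the timing constraint behind a sliding-window look-ahead protocol concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. The names `WindowPlan`, `plan_windows`, and `meets_deadline`, and all durations used, are hypothetical. It assumes that at each step the model has heard a fixed-length context window, is asked for one hop of audio starting `lookahead_s` seconds after the end of that context, and must return the result before that audio is due for playback.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowPlan:
    """One generation step, all times in seconds (hypothetical layout)."""
    context_start: float
    context_end: float
    target_start: float
    target_end: float


def plan_windows(total_s, window_s, hop_s, lookahead_s):
    """List the context/target spans for a performance of length total_s.

    Assumption: the first step runs once a full context window has been
    heard, and each step predicts one hop of audio that starts
    lookahead_s after the end of the context.
    """
    plans = []
    t = window_s  # end of the first full context window
    while t + lookahead_s + hop_s <= total_s:
        plans.append(
            WindowPlan(
                context_start=t - window_s,
                context_end=t,
                target_start=t + lookahead_s,
                target_end=t + lookahead_s + hop_s,
            )
        )
        t += hop_s
    return plans


def meets_deadline(lookahead_s, inference_s, transport_s=0.0):
    """True if generation plus messaging finishes before playback is due."""
    return inference_s + transport_s <= lookahead_s


if __name__ == "__main__":
    for plan in plan_windows(total_s=10.0, window_s=4.0, hop_s=1.0, lookahead_s=1.0):
        print(plan)
    print(meets_deadline(lookahead_s=1.0, inference_s=0.4, transport_s=0.05))
```

Under this simplified model, a shorter sampling time lets the same deadline be met with a smaller look-ahead, which is one way to read the trade-off between model latency, look-ahead depth, and generation quality described in the abstract.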