Magenta

Magenta RealTime 2: Open & Local Live Music Models

Thu, 04 Jun 2026 06:00:00 -0700

We’re excited to share Magenta RealTime 2 (MRT2), a state-of-the-art open model and efficient real-time inference engine that enables you to build and play AI musical instruments on your laptop!

To get started, download the apps on your MacBook (requires Apple Silicon).

Unlike other large generative music models that work offline to turn a prompt into a track, MRT2 is a live, interactive model that you can control with MIDI and audio, in addition to text. It performs low-latency on-device inference to respond to your inputs instantly. You can run it as a standalone app, drop it into your DAW, or integrate it into other music software.

In addition to the open-weights model, we are releasing a collection of playable instruments and experiences built with MRT2. Experiment with cloning sounds, blending styles, and creating live accompaniment with this low-latency music model.

To explore the potential of live music models as instruments, today we are releasing:

Magenta RealTime 2, an open-weights model (2.4B parameters) capable of high-quality real-time music synthesis with low-latency real-time controls via MIDI, text, and audio.
An open-source Python library (pip install magenta-rt) offering inference via JAX/MLX using SequenceLayers.
An inference engine written in C++, enabling efficient streaming audio generation on a MacBook GPU via MLX.
A suite of example applications built on the inference engine. These offer a glimpse into the creative potential of Magenta RealTime 2, and serve as references to help you get started building new instruments and software integrations.

For a decade, the Magenta team has championed a vision of AI as a tool for musicians, never a replacement. In 2017 we released our first neural synthesizer, NSynth, which put machine learning into playable hardware. We continued creating AI instruments with projects such as DDSP, Piano Genie, and the first version of Magenta RealTime, our debut live music model capable of generating and blending a wide range of musical styles. MRT2 achieves ~15x lower control latency than version one (~200 ms versus ~3 s), works on standard hardware, and integrates directly into DAWs, making this live model a true musical instrument.

A live music model with lower latency and expanded control

	Magenta RealTime	Magenta RealTime 2
Live music generation	✅	✅
Hardware required	TPU/GPU	MacBook
Frame size	2s	40ms
Control latency	~3s	~200ms
Control modalities	Text, Audio	Text, Audio, MIDI
Model sizes	760M / 220M	2.4B / 230M

Both MRT and MRT2 are codec language models operating on sequences of audio tokens from the SpectroStream codec, but MRT2 achieves lower latency by performing frame-level autoregression with frame-aligned conditioning. To enable expressive musical control, MRT2 is designed to model audio that continuously follows MIDI inputs, alongside style prompts which can be either audio or text; prompts are embedded via MusicCoCa. For minimal interaction lag, both signals are injected as frame-aligned conditioning at every generation step, allowing the model to react to changes in the signal within a single frame (40 ms, plus additional sources of empirical latency, see below).

Key to this approach is a causal sliding window attention mechanism, which enables continuous streaming generation while bounding memory requirements. Alongside this, learnable attention embeddings are incorporated to improve generalization to arbitrary durations and to reduce context eviction artifacts (e.g., ringing and feedback) during long-context generation.

Fast C++ inference engine via MLX

While the original Magenta RealTime required a high-power GPU or TPU, Magenta RealTime 2 brings live generation to the hardware musicians actually use. To achieve this, we built a C++ inference engine powered by MLX that allows MRT2 to run natively on Apple Silicon. Apple’s MLX framework provides the link between Python and C++. More specifically, we use MLX to compile the MRT2 model, implemented using the SequenceLayers library, into an .mlxfn file, a model container that bundles the weights and computational graph. Our C++ inference engine loads that file and uses the MLX runtime to efficiently execute it on Apple Silicon GPUs. The inference engine also handles the other necessary infrastructure (model state, audio buffering / resampling, MIDI input) and can be embedded into any music application framework where C++ is supported.

MLX allows MRT2 to run on Apple Silicon (M-series): both model sizes can run offline (non-real-time) inference on any Apple Silicon Mac, while real-time streaming (generating audio faster than playback) is supported on the following devices:

Model	Platform
Base (2.4B)	MacBook M3 Pro (or higher); MacBook M2 Max (or higher)
Small (230M)	Any Apple Silicon MacBook, including MacBook Air

A suite of example applications for musicians and developers

A key goal of Magenta RealTime 2 is to allow musicians to integrate live music models within existing software, and help developers build custom applications. To help you get started, our codebase provides several examples, including standalone apps, plugins and extensions.

What’s Next?

Our team members have been building new instruments with machine learning for nearly 10 years, excitedly making unique and quirky sounds from statistical knowledge of music. With Magenta RealTime 2, AI instruments are finally starting to gain the controllability and immediacy we expect from music creation tools, but plenty remains to be explored. From even more interaction and lower control latency, to audio streaming inputs that can enable jamming and real-time audio control, we look forward to expanding the capabilities of live music models further. Stay tuned for future updates!

And in the meantime, we are also excited to bring more features and example applications to MRT2 soon, including:

Finetuning, allowing anyone to customize the model by directly training on their own data.
Example performance tools created in collaboration with Manaswi Mishra.

In the next few days, we will also be at the Music Technology Hackathon in Boston, where we are presenting a challenge centered around Magenta RealTime 2. We look forward to seeing what everyone will come up with!

Citation

Please cite our work as:

Magenta Team. “Magenta RealTime 2: Open & Local Live Music Models”. https://magenta.withgoogle.com/magenta-realtime-2. June 2026

@article{mrt2,
  title  = {Magenta RealTime 2: Open \& Local Live Music Models},
  author = {Magenta Team},
  year   = {2026},
  note   = {https://magenta.withgoogle.com/magenta-realtime-2}
}

Appendix: Technical Details

Low-latency streaming generation

Some background on Codec Language Modeling. A codec language model (LM) operates on discrete sequences of tokens from a neural audio codec. Here a codec refers to a pair of functions, an encoder and decoder, that convert audio to and from a discrete, compressed representation while minimizing distortion.

More formally, the encoder is a function mapping raw stereo audio waveforms \(\mathbf{a} \in \mathbb{R}^{T f_s \times 2}\) into matrices of discrete tokens \(\mathbf{x} \in \mathbb{V}_c^{Tf_k \times d_c}\), where \(T\) is the duration in seconds, \(f_s\) the audio sampling rate, \(f_k\) the token frame rate, \(\mathbb{V}_c\) the codec vocabulary, and \(d_c\) the number of tokens per frame. Here, \(d_c\) is the “depth” of the residual vector quantization algorithm, which iteratively quantizes the continuous embedding of each audio frame.

The goal of the codec LM is to model these token matrices. For efficiency, an increasingly common approach is to adopt a hierarchical autoregressive framework using a pair of Transformers: one which compresses temporal history into fixed-length embedding vectors (\(\texttt{Temporal}_\theta\)), and another which iteratively decodes tokens depth-wise given the current frame embedding (\(\texttt{Depth}_\phi\)). Assuming \(\mathbf{x}_i\) refers to the \(i\)-th frame of \(\mathbf{x}\), and \(x_i^j\) refers to its \(j\)-th token, the joint distribution over \(\mathbf{x}\) is modeled autoregressively as: \[ P_{\theta,\phi}(\mathbf{x}) = \prod_{i=1}^{Tf_k} \prod_{j=1}^{d_c} P_\phi(x_i^j \mid \mathbf{x}_i^{<j}, \texttt{Temporal}_{\theta}(\mathbf{x}_{<i})), \] where \(P_\phi(x_i^j \mid \cdot) = \texttt{SoftMax}(\texttt{Depth}_\phi(\cdot))\).

At inference time, we generate audio by first sampling a token sequence \(\mathbf{x}' \sim P_{\theta,\phi}(\mathbf{x})\) and then outputting \(\mathbf{a}' = \texttt{Dec}(\mathbf{x}')\), where \(\texttt{Dec}\) is the codec decoder. This describes our base modeling approach, shared with Magenta RealTime. For our codec, we use SpectroStream to compress high-fidelity (\(f_s = 48\) kHz) stereo audio into tokens at \(3\) kbps (\(f_k = 25\) Hz, \(d_c = 12\), \(|\mathbb{V}_c| = 2^{10}\)).
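
As a quick sanity check on these figures, the bitrate follows from frame rate × tokens per frame × bits per token. The short Python sketch below only reproduces that arithmetic; the variable names are ours and are not part of any MRT2 API.

```python
import math

# Values quoted in the text for the SpectroStream configuration.
frame_rate_hz = 25       # f_k, token frames per second
tokens_per_frame = 12    # d_c, RVQ depth
vocab_size = 2 ** 10     # |V_c|, codebook entries per token

bits_per_token = math.log2(vocab_size)                 # 10 bits
tokens_per_second = frame_rate_hz * tokens_per_frame   # streaming budget
bitrate_bps = tokens_per_second * bits_per_token
frame_duration_ms = 1000 / frame_rate_hz

print(tokens_per_second)   # 300
print(bitrate_bps)         # 3000.0
print(frame_duration_ms)   # 40.0
```

The 300 tokens per second is the minimum generation throughput needed to keep up with playback, and the 40 ms frame duration is the granularity at which new control input can take effect.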

Lowering autoregression granularity: from chunk to frame. To achieve streaming audio generation, we need to enforce two constraints:

1. The system must generate at least \(f_k \cdot d_c\) tokens per second.
2. The decoder must be causal, meaning its output audio for frame \(i\) only depends on \(\mathbf{x}_{\leq i}\).

In the original Magenta RealTime, we satisfied requirement (1) by performing autoregression on chunks of frames, where each chunk is 2 seconds in duration. This design was chosen to amortize model runtime over chunk length to achieve real-time streaming. However, because the system must wait until the next chunk to inject any new user control information, the chunk duration creates a lower bound on control delay, resulting in a response time of 2 seconds at a minimum. Instead, Magenta RealTime 2 models individual frames, allowing us to reduce model response time significantly. To ensure continuous streaming generation while operating on single frames, we adopt a decoder-only architecture, using a local sliding window attention (SWA) in the temporal Transformer.

This has two key advantages: (1) the decoder-only architecture allows us to remove the sequential bottleneck introduced by the bidirectional encoder in Magenta RealTime, where the full encoder output has to be materialized before decoding can begin; (2) the rolling attention mechanism allows us to extend the context length while keeping the KV cache size fixed. At each step of the autoregressive generation, key-value entries for new tokens are written into the cache, and entries older than the window size \(w\) are evicted.

As in previous work, we find that sliding window attention causes the model to deteriorate significantly once the initial tokens are evicted from the cache. To remedy this, we use a learnable attention sink embedding. To reconcile the finite training length with the receptive field induced by the SWA mechanism, we also set the attention window size such that the effective receptive field does not exceed the training crop length. Finally, we further reduce train/test mismatch and achieve better length generalization by dropping explicit positional embeddings (NoPE), after observing that RoPE hinders generalization beyond the training length. Instead, the model implicitly learns positional information by relying on causal masking and SWA, which naturally extend to arbitrary-length sequences without extrapolation issues.
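
To make the cache behavior concrete, here is a toy Python sketch of a fixed-size rolling cache with a persistent sink slot. It is an illustration under our own assumptions, not the MRT2 implementation: the class name, the string placeholders standing in for key-value tensors, and the way the sink is stored are all hypothetical.

```python
from collections import deque


class RollingKVCache:
    """Toy cache: one persistent sink slot plus the last `window` entries."""

    def __init__(self, window, sink="<sink>"):
        # Placeholder for the learnable attention sink embedding.
        self.sink = sink
        # A bounded deque drops its oldest entry when a new one is appended.
        self.entries = deque(maxlen=window)

    def append(self, kv):
        self.entries.append(kv)

    def visible(self):
        # What the current step may attend to: the sink, then the window.
        return [self.sink, *self.entries]


cache = RollingKVCache(window=4)
for frame_index in range(7):
    cache.append(f"kv{frame_index}")

print(cache.visible())  # ['<sink>', 'kv3', 'kv4', 'kv5', 'kv6']
```

Memory stays constant no matter how long generation runs, and the sink slot is never evicted, so the attention pattern always has a stable entry to attend to after the earliest frames have left the window.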

Putting all this together, our model presents significant architectural differences compared to the previous version:

Model	Magenta RealTime	Magenta RealTime 2
Autoregressive unit	2-second chunks (25 frames × 16 RVQ = 400 tokens)	Individual frames (12 RVQ tokens at 25 Hz = 40 ms)
Architecture	T5-style bidirectional encoder + causal decoder; encoder processes the full chunk of conditioning before decoding begins	Decoder-only; conditioning is injected at every frame, with no encoder forward pass as a sequential bottleneck
Minimum control delay	≥ 2 s (next chunk boundary)	~0.2 s (frame processing + depth decode + codec decode). See full latency diagram

Precise control through frame-by-frame conditioning

A central feature of MRT2 is responsive, multi-signal control: in addition to style control expressed through audio or text, MRT2 also supports note and drums on/off control. This is achieved by modeling the conditional distribution \(P_{\theta,\phi}(\mathbf{x} | \mathbf{c})\), where \(\mathbf{c} = (\mathbf{c}_{style}, \mathbf{c}_{notes}, \mathbf{c}_{drums})\) is formed by tokenized representations of all conditioning signals at the audio frame rate (25 Hz), concatenated together into a single conditioning vector per frame. This vector is then mapped to a multi-channel embedding and injected into the temporal decoder through streaming cross-attention, enabling the model to react to changes in any signal within a single frame (~40 ms).

At inference we enable flexible joint guidance by extending the classifier-free guidance (CFG) approach in Magenta RealTime to multiple signals. This allows us to balance the contribution of each conditioning signal separately and according to the desired level of adherence, while also supporting unconditional generation for any subset of controls.
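
The exact guidance formula is not spelled out here, so the following Python sketch assumes one common multi-signal formulation: start from the unconditional logits and add a separately scaled correction for each signal, \(\ell = \ell_u + \sum_k w_k (\ell_k - \ell_u)\). The function name, signal names, and toy logits are hypothetical.

```python
def multi_signal_cfg(uncond_logits, cond_logits_by_signal, scales):
    """Combine logits as l_u + sum_k w_k * (l_k - l_u) (assumed formulation)."""
    guided = list(uncond_logits)
    for name, cond_logits in cond_logits_by_signal.items():
        weight = scales.get(name, 0.0)  # a missing scale disables the signal
        for i, (cond, uncond) in enumerate(zip(cond_logits, uncond_logits)):
            guided[i] += weight * (cond - uncond)
    return guided


# Toy logits over a 3-token vocabulary.
uncond = [0.0, 0.0, 0.0]
cond = {
    "style": [1.0, 0.0, -1.0],
    "notes": [0.0, 2.0, 0.0],
    "drums": [0.0, 0.0, 1.0],
}
scales = {"style": 2.0, "notes": 1.0, "drums": 0.0}

print(multi_signal_cfg(uncond, cond, scales))  # [2.0, 2.0, -2.0]
```

Setting a scale to zero leaves that signal unconditional, and setting all scales to zero recovers fully unconditional generation, which matches the "any subset of controls" behavior described above.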

Style control through audio and text. Similarly to Magenta RealTime, MRT2 can also be steered through audio and text via quantized MusicCoCa embeddings. During training, we freeze the embeddings associated with the MusicCoCa tokens instead of learning them from scratch. The goal is to leverage the rich, pre-trained semantic representations coming from the Residual Vector Quantizer (RVQ). By keeping these embeddings frozen, we ensure the generative model receives stable semantic embeddings, which significantly improves prompt adherence at inference time. While MusicCoCa provides a joint embedding space between text and audio, the underlying distributions associated with both modalities do not match exactly. This creates a train-test mismatch during inference, as the model has only been trained on audio embeddings, but receives text embeddings during inference. To bridge this gap, we train a generative model from which we can sample diverse audio embeddings given an input text embedding, learning the one-to-many relationship between a single text prompt and multiple valid audio signals. To ensure high performance, we employ a pixel Mean Flow (pMF) formulation, enabling high-quality one-step inference. Finally, training this mapper module on a mix of short tags and long-form captions provides flexible style control, ranging from simple tag-style inputs to highly detailed text descriptions.

Note control. We enable note control by training on (audio, MIDI) pairs. Note activity is encoded as a 128-channel pianoroll – one channel per MIDI pitch – at the audio frame rate (25 Hz). The model is trained on around 71k hours of mostly instrumental stock music from a variety of sources, with MIDI labels inferred by the MT3 transcription model. We structure the per-pitch token vocabulary to support two control modes at inference. In Auto-Strum mode, the user specifies only which pitches are active at each frame, and the model determines where to place note onsets. In Auto-Strum OFF mode, the user can additionally specify the exact timing of each note onset, giving precise attack-level control. This is achieved through a 4-token vocabulary that distinguishes between note off, generic note on, note onset, and note continuation. When Auto-Strum is off, the model receives onset and continuation tokens directly, and respects the specified attack timing. When Auto-Strum is on, onset information is replaced with an onset mask token, and the model freely chooses when to place attacks based on the active pitch information alone. To support both modes with a single model, we employ onset masking, a training-time augmentation that stochastically replaces the onset and continuation tokens of randomly selected notes with the onset mask token. This trains the model to generate musically plausible attacks when no explicit onset information is present, while faithfully following onset cues when they are provided.
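
The two modes can be illustrated with a small Python sketch that tokenizes a single pitch channel of the pianoroll. The token ids are hypothetical, and we assume the "generic note on" token doubles as the onset mask token; the text does not state the actual ids or layout.

```python
# Hypothetical token ids for the 4-token per-pitch vocabulary.
NOTE_OFF, NOTE_ON, ONSET, CONTINUATION = 0, 1, 2, 3


def encode_pitch(active_frames, auto_strum):
    """Tokenize one pitch channel, one token per 40 ms frame."""
    tokens = []
    previously_active = False
    for active in active_frames:
        if not active:
            tokens.append(NOTE_OFF)
        elif auto_strum:
            # Onset timing is masked: the model decides where attacks go.
            tokens.append(NOTE_ON)
        elif previously_active:
            tokens.append(CONTINUATION)
        else:
            tokens.append(ONSET)
        previously_active = active
    return tokens


held = [False, True, True, True, False, True]
print(encode_pitch(held, auto_strum=False))  # [0, 2, 3, 3, 0, 2]
print(encode_pitch(held, auto_strum=True))   # [0, 1, 1, 1, 0, 1]
```

This simplified encoder infers onsets from off-to-on transitions, so it cannot express a re-attack without a silent frame in between; a real MIDI front end would take onset events directly from note-on messages.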

Drums on/off control. The note conditioning described so far gives us control over the melodic and harmonic content of the generated audio, but leaves us with no mechanism to control the presence of percussive elements. As a result, the model can arbitrarily include drums as part of the generated audio whenever this is admissible by the style conditioning (e.g. “jazz”). This can often be undesirable if, for example, the model is played alongside other instruments or as part of a multi-track session (e.g. in a DAW). For this reason, it’s useful to optionally switch off drum generation through an explicit control. We enable this through an additional conditioning signal: at training time, we pass a frame-wise sequence of drum hits obtained by transcribing drum stems from each training example using OaF Drums. While this trains the model to respond to drum hits, we find that direct drum control is infeasible in practice, given the end-to-end response time. Instead, we leverage this control purely for switching between drum-unconditional and drumless generation, using the same multi-guidance CFG as the other signals.

Inference-time masking as creative control. Beyond providing a set of control signals to guide generation, it is crucial to have a way to compose and modulate them. We accomplish this through selective input masking coupled with CFG scales, a technique that allows us to flexibly define playing modes at inference. More specifically, we introduce a masking scheme designed to accomplish two complementary goals: (1) strengthen the model’s ability to follow the controls while remaining robust to noisy or missing inputs, (2) enable partially unconditional generation as a form of creative control. During training, we stochastically mask contiguous regions of each conditioning signal independently, varying both the masking probability and spatial scale. We find that this results in better adherence to the inputs when they are specified. Importantly, this augmentation implicitly trains the model to interpret masked regions as unspecified, opening up a new dimension of creative interaction at inference. The Auto-Strum mode described above in the Note Control section is one such example. Similarly, we employ masking over the pitch dimension of the pianoroll to give the model more or less “creative freedom” over which pitches can be active. For example, masking all pianoroll pitches except those currently pressed allows the model to freely add harmonies or embellishments, while explicitly setting neighboring pitches to “off” (silent) constrains it to play only the input notes.
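
The pitch-masking example can be sketched in a few lines of Python. The symbols and the function below are hypothetical, and for simplicity the constrained mode sets every unpressed pitch to "off" rather than only the neighboring ones.

```python
# Hypothetical per-pitch symbols for a single conditioning frame.
MASK, OFF, ON = "mask", "off", "on"


def build_pitch_frame(pressed, num_pitches=128, constrain=False):
    """Return one pianoroll frame for the given set of pressed MIDI pitches.

    constrain=False: unpressed pitches are masked (unspecified), so the
    model may add harmonies or embellishments.
    constrain=True: unpressed pitches are explicitly off, so the model
    should play only the pressed notes.
    """
    fill = OFF if constrain else MASK
    frame = [fill] * num_pitches
    for pitch in pressed:
        frame[pitch] = ON
    return frame


c_major = {60, 64, 67}
free = build_pitch_frame(c_major)
strict = build_pitch_frame(c_major, constrain=True)

print(free.count(MASK), free.count(ON))      # 125 3
print(strict.count(OFF), strict.count(ON))   # 125 3
```

The same pressed keys therefore yield two different conditioning frames, and the choice between them acts as a playing-mode switch rather than a change to the model.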

Real-world control latency. While we have significantly reduced the model frame size (from 2s to 40ms) compared to the previous generation, inference time isn’t the only source of latency. Below we give a sketch of end-to-end reaction time, taking into account input and output buffers, alongside additional sources of latency introduced by external components.

Open-sourcing The Infinite Crate DAW plugin

Mon, 09 Mar 2026 12:00:00 -0700

Six months ago we released The Infinite Crate, a DAW plugin that brings the Lyria RealTime music model into Digital Audio Workstations (DAWs) to improve the sampling workflow for producers. Since its release it’s been used by some of our favorite artists — including a wonderful showcase with Daito Manabe in Tokyo — and was featured as an exciting new music tool at NAMM 2026.

Today we’re fully open sourcing the DAW plugin for developers to fork, modify, and make their own under the permissive Apache 2.0 license.

The VST was born out of discussions and studio collaborations with musicians and producers from around the world. Many were intrigued by music models as a creative partner but needed deeper integration into the tools they know and trust — Ableton, Logic, and other DAWs that support VST3/AU plugins. Bridging this gap simplifies audio routing and MIDI-mapping for studio recording and live performance, allowing musicians to focus on what matters: the music.

We architected the plugin using React/TypeScript for the UI layer and JUCE/C++ for the DAW connection, audio processing, and websocket audio streaming from the Gemini/Lyria API. This allowed us to rapidly iterate on the frontend using hot-reload (Shadcn/Tailwind), while ensuring latency-sensitive operations (audio streaming and playback) happen in a compiled and unmanaged language with a tight clock. State is synced between TypeScript and C++ using Zustand’s state management and nlohmann json.

The plugin is a functional interface that exposes most of the controls available on the Lyria RealTime API to the React frontend and feeds the resulting audio stream into the DAW. Developers can fork the plugin and build creative interfaces and visualizations for the API (like Space DJ, MIDI DJ, or creative controls) directly in the DAW by spinning up the Vite server. Because the frontend uses a standard set of web frameworks, it’s easy to explore new interfaces using AI-assisted coding tools like Gemini and Antigravity.

Looking ahead

In the near term, we hope to update the plugin to support on-device inference of the Magenta RealTime open-weights model for offline use. In the long term, we hope to support future music models with improved controls, such as audio and MIDI input.

We hope this open source plugin can support and be built with the growing community of music makers using machine learning as part of their creative process.

Join the discussion on our Discord.

Acknowledgments

We thank: Spencer Salazar for his talk on prototyping DAW plugins in web technologies at ADC 2020, JUCE for implementing a C++ to Web/JS bridge in JUCE 8, Tommy Cappel for rigorous testing, Alberto Lalama and Joyce Xie for their work on the API, Nikhil Bhanu for his work on the Windows build, and the DeepMind research team that contributed to Lyria RealTime.

Lyria Camera: Soundtrack your life

Wed, 03 Dec 2025 12:00:00 -0800

Today we’re launching Lyria Camera, an app that uses Lyria RealTime to make music with your camera. By combining Gemini’s image understanding and the Lyria RealTime API, Lyria Camera generates a musical score that adapts to your environment on the fly.

It works by translating the visual scene into musical descriptors via Gemini, producing prompts like Reflective piano, cityscape calm. The Lyria RealTime API uses these terms as prompts to create a continuous stream of music that’s generated on the fly. As you move about your world, the prompts and the music they create will evolve over time.

Try Lyria Camera now or remix it on AI Studio.

The world is your instrument

Reward your curiosity: When you’re using Lyria Camera, every image is a new instrument. You can find songs in your sketchbook, at the laundromat or in your breakfast cereal. Film around and see what you can find.
DJ your commute: Point your camera out the train window or mount it on your dashboard. Lyria Camera responds to the shifting scenery—the rhythm of passing streetlights or the calm of an open road—creating a drive-time score that matches your journey beat for beat.
Score your screen: On desktop, try the “Share Screen” feature to use a browser tab instead of your camera. Actually, any app on your computer can be used as a video feed. Try it while you’re working or gaming for a tailor-made soundtrack.

How it Works

Lyria Camera brings together several AI capabilities to create a seamless audiovisual feedback loop.

Multimodal Prompting: This is the bridge between sight and sound. We use Gemini to analyze your camera feed, translating visual cues into rich textual descriptions. These descriptions act as musical instructions, telling Lyria exactly how to interpret and “play” what you’re seeing.
Continuous & Steerable Generation: The Lyria RealTime API is designed for continuous music generation. Instead of generating a static song, it creates an endless stream of audio that you can “steer” in different directions. This allows the music to morph smoothly from one mood to another without ever stopping or skipping a beat.

What will you build?

Lyria Camera is a great companion for a walk or a drive, and it’s just one thing you can do with the Lyria RealTime API. We built this app to demonstrate the possibilities of continuous, steerable music generation, but the real potential lies in what comes next.

You can try Lyria Camera on your phone or desktop today. For developers ready to push the boundaries further, the Lyria RealTime API can help you build the next generation of music experiences.

Space DJ: Navigating a Musical Universe

Mon, 03 Nov 2025 12:00:00 -0800

Today, we’re excited to launch Space DJ, a web application from Magenta that turns music exploration into an interactive journey through a constellation of sounds. You pilot a spaceship through a galaxy where each star represents a musical genre. As you navigate this universe, Space DJ uses the Lyria RealTime API to generate a continuous stream of music that reflects your position and selections in real-time.

We used the deploy app feature in AI Studio to make this available to everyone!
Try Space DJ now, or view and fork the source code in AI Studio.

Fly Through Music

Explore a Musical Universe: Fly through a star constellation where each star is labeled with a music genre. This galaxy is a 3D projection of genre embeddings.
Generate Music in Real-Time: As you fly, the stars close to the spaceship light up and influence the music. Clicking on a star or a point in space anchors your selection. The Lyria RealTime model blends the prompts of nearby genres into a unique musical mashup that evolves dynamically as you move.
Uncover Hidden Connections: Similar genres appear close together in the 3D space. You can also enable “High-Dimensional Neighbors” to find genres that are semantically similar in the original high-dimensional embedding space, even if they aren’t visual neighbors in the projection.
Engage Auto-Pilot: Randomly drift through space for an ever-changing, generative soundscape.

How it Works

Space DJ combines several technologies to create an immersive experience:

Genre Embeddings: We start with text prompts for 300 musical genres out of a 1,000-genre dataset. The text is converted into a rich numerical representation (embedding) using the open-source Magenta RT model’s MusicCoCa embedder. These 768-dimensional embeddings are then reduced to 128 dimensions using Principal Component Analysis for efficiency.
3D Projection: To render the embeddings in 3D, we use Uniform Manifold Approximation and Projection (UMAP), an algorithm that projects the data into 3D space while trying to preserve its high-dimensional structure. You can tweak UMAP parameters in the settings for different constellation shapes.
Interactive Rendering: The 3D space, spaceship, and stars are rendered in your browser using three.js. You can select how many stars to create and whether to randomize the selection.
Real-Time Audio Synthesis: Your interactions within the 3D space are translated into a set of weighted text prompts (e.g., Deep House: 0.7, Ambient Techno: 0.3) based on proximity. These prompts are sent to the Lyria RealTime API, which synthesizes the music you hear, responding instantly to the spaceship’s position.
Development and Deployment: We used AI Studio to develop the applet through its interactive code editor. We leveraged AI Studio’s Cloud Run integration to deploy the application. This approach simplifies the deployment process and helps protect the Gemini API key by securely proxying requests to the Lyria RealTime API.
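
The proximity-based prompt weighting in the Real-Time Audio Synthesis step can be sketched as follows. The post does not say how proximity maps to weights, so this Python example assumes normalized inverse-distance weighting over the k nearest stars; the function name, star coordinates, and the choice of k are all hypothetical.

```python
import math


def prompt_weights(ship, stars, k=2):
    """Return {genre: weight} for the k nearest stars; weights sum to 1."""
    nearest = sorted(stars.items(), key=lambda item: math.dist(ship, item[1]))[:k]
    # The small epsilon avoids division by zero when the ship sits on a star.
    inverse = {genre: 1.0 / (math.dist(ship, pos) + 1e-6) for genre, pos in nearest}
    total = sum(inverse.values())
    return {genre: value / total for genre, value in inverse.items()}


# Toy 3D star positions.
stars = {
    "Deep House": (0.0, 0.0, 0.0),
    "Ambient Techno": (4.0, 0.0, 0.0),
    "Bossa Nova": (0.0, 50.0, 0.0),
}
weights = prompt_weights((1.0, 0.0, 0.0), stars)
print({genre: round(w, 2) for genre, w in weights.items()})
# {'Deep House': 0.75, 'Ambient Techno': 0.25}
```

Recomputing the weights as the ship moves yields a smoothly changing prompt mixture, which is the kind of input a weighted-prompt music API can blend over time.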

A New Frontier for Musical Interaction

Space DJ is an exploration into new ways of interacting with generative AI models for music. We hope to inspire new forms of musical expression and discovery.

Ready to take flight? Try Space DJ Now!

Lyria RealTime VST: The Infinite Crate

Wed, 09 Jul 2025 07:00:01 -0700

Live Generative Music in your DAW

Today, we’re happy to share The Infinite Crate, a DAW plugin prototype that integrates the Lyria RealTime API directly into your favorite music software. Use text prompts to steer a continuously evolving stream of music and feed the audio directly into your DAW for sampling, live performance, or a backing track to jam with.

Integrating generative models with existing creative workflows has always been an important part of Magenta’s mission, as it allows people more control and agency in how they use these models in their own practice. Our previous experiments with plugins, including Magenta Studio for manipulating MIDI clips and DDSP VST for realtime audio-to-audio transformations, have over a million downloads combined and have validated for us the value of making these tools creatively accessible.

We hope The Infinite Crate will be a welcome addition to this lineup. We were inspired to create it through our collaborations with musicians such as Jacob Collier and Toro y Moi, where we saw the potential for integrating capabilities similar to MusicFX DJ more directly into studio and live performance workflows.

The Infinite Crate is cross-platform, available for both Mac and Windows, as a VST3 plugin, an AU component, and a standalone app.

Looking ahead

Lyria RealTime is not capable of running locally on consumer hardware, so the plugin requires an API key (free for Lyria RealTime) and internet access. We’re excited to explore complementing this approach with more efficient variants that can run locally on consumer hardware, such as our recently released open model Magenta RealTime, so stay tuned!

Magenta RealTime: An Open-Weights Live Music Model

Fri, 20 Jun 2025 07:00:01 -0700

Today, we’re happy to share a research preview of Magenta RealTime (Magenta RT), an open-weights live music model that allows you to interactively create, control and perform music in the moment.

Magenta RT is the latest in a series of models and applications developed as part of the Magenta Project. It is the open-weights cousin of Lyria RealTime, the real-time generative music model powering MusicFX DJ and the real-time music API in Google AI Studio, developed by Google DeepMind. Real-time music generation models open up unique opportunities for live music exploration and performance, and we’re excited to see what new tools, experiences, and art you create with them.

As an open-weights model, Magenta RT is targeted towards eventually running locally on consumer hardware (it currently runs on free-tier Colab TPUs). It is an 800-million-parameter autoregressive transformer model trained on ~190k hours of stock music from multiple sources, mostly instrumental. The model code is available on GitHub and the weights are available on Google Cloud Storage and Hugging Face under permissive licenses with some additional bespoke terms. To see how to run inference with the model and try it yourself, check out our Colab Demo. You may also customize Magenta RT on your own audio or explore live audio input. Options for local, on-device inference are coming soon.

How it Works

Live generative music is particularly difficult because it requires real-time generation (i.e. a real-time factor > 1, generating X seconds of audio in less than X seconds), causal streaming (i.e. online generation), and low-latency controllability.

Magenta RT overcomes these challenges by adapting the MusicLM architecture to perform block autoregression. The model generates a continuous stream of music in sequential chunks, each conditioned on the previous audio output (10s of coarse audio tokens) and a style embedding to produce the next audio chunk (2s of fine audio tokens). By manipulating the style embedding (weighted average of text or audio prompt embeddings), players can shape and morph the music in real-time, mixing together different styles, instruments, and musical attributes.
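
The style blending step, a weighted average of prompt embeddings, can be sketched in plain Python. The 4-dimensional vectors below are toy stand-ins for real embeddings, and the function name is hypothetical rather than part of the Magenta RT library.

```python
def blend_style(embeddings, weights):
    """Return the weighted average of equally sized embedding vectors."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    dim = len(embeddings[0])
    return [
        sum(w * emb[d] for emb, w in zip(embeddings, weights)) / total
        for d in range(dim)
    ]


# Toy embeddings for two prompts, e.g. one text prompt and one audio prompt.
techno = [1.0, 0.0, 0.0, 2.0]
strings = [0.0, 1.0, 0.0, 0.0]

print(blend_style([techno, strings], [3.0, 1.0]))  # [0.75, 0.25, 0.0, 1.5]
```

Moving a weight slider changes only this blended vector, so the next generated chunk is conditioned on the new mixture while the audio context keeps the transition continuous.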

The latency of controls is set by the chunk size, which has a maximum output size of two seconds but can be reduced to increase reactivity. On a Colab free-tier TPU (v2-8 TPU), these two seconds of audio are generated in 1.25 seconds, giving a real-time factor of 1.6.
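The real-time factor quoted above is simply seconds of audio produced divided by seconds of compute. A small sketch of the arithmetic, using the figures from the text (the helper name `real_time_factor` is hypothetical):

```python
def real_time_factor(audio_seconds, compute_seconds):
    """Seconds of audio produced per second of compute."""
    if compute_seconds <= 0:
        raise ValueError("compute_seconds must be positive")
    return audio_seconds / compute_seconds


CHUNK_SECONDS = 2.0     # audio per chunk, from the text
COMPUTE_SECONDS = 1.25  # time to generate one chunk on a Colab v2-8 TPU

rtf = real_time_factor(CHUNK_SECONDS, COMPUTE_SECONDS)
# Time left over per chunk before playback would run dry.
headroom = CHUNK_SECONDS - COMPUTE_SECONDS

print(f"real-time factor: {rtf:.2f}")
print(f"headroom per chunk: {headroom:.2f} s")
```

Any factor above 1 means generation keeps ahead of playback; a factor below 1 would cause gaps in the audio stream.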

Compared to the original MusicLM, we’ve upgraded our representations to SpectroStream for high-fidelity (48kHz stereo) audio, which is a successor to SoundStream (Zeghidour+ 21). We also trained a new joint music+text embedding model called MusicCoCa that is influenced by both MuLan (Huang+ 22) and the CoCa models (Yu+ 22). Additional details are provided in the model card and deeper technical descriptions are available in our paper.

Latent Space Exploration… In Real Time

Magenta’s earlier work in latent music models for MIDI clips (MusicVAE, GrooVAE) and instrumental timbre (NSynth) offered a wide range of possible interfaces.

With Magenta RT, it is now possible to traverse the space of multi-instrumental audio: explore the never-before-heard music between genres, unusual instrument combinations, or your own audio samples.

The ability to adjust prompt mixtures in real-time allows you to efficiently explore the sonic landscape and find novel textures and loops to use as part of a larger piece of music.

Real-time interactivity also provides the possibility of this latent exploration being its own type of musical performance, the interpolation through space combined with anchoring of the audio context producing a structure similar to a DJ set or improvisation session. Beyond performance, it can also be used to provide interactive soundscapes for physical spaces like artist installations or virtual spaces like video games.

This opens up a world of possibilities to build new tools and interfaces, and below you can see three example applications built on the Lyria RealTime API in AI Studio. Over time, Magenta RT will open up similar opportunities for on-device applications.

PromptDJ

PromptDJ MIDI

PromptDJ Pad

Why Magenta RealTime?

Enhancing human creativity (not replacing it) has always been at the core of Magenta’s mission. AI, however, can be a double-edged sword for creative agency. It offers new opportunities for accessibility and expression, but it can also create a deluge of more passive creation and consumption compared to traditional methods. With this in mind, we have always strived to build tools that help close the skill gap to make creation more accessible, while also valuing existing musical practices and encouraging people to dig deeper in their own creative journeys. In this regard, real-time interactive music models offer several important advantages that have motivated our research over the years (Piano Genie, DDSP, NSynth, AI Duet, and more).

Live interaction demands more from the player but can offer more in return. The continuous perception-action loop between the human and the model provides access to a creative flow state, centering the experience on the joy of the process over the final product. The higher bandwidth channel of communication and control often results in outputs that are more unique and personal, as every action the player takes (or doesn’t) has an effect.

Finally, live models naturally avoid creating a deluge of passive content, because they intrinsically balance listening with generation in a 1:1 ratio. They create a unique moment in time, shared by the player, the model, and listeners.

While Lyria RealTime provides access to state-of-the-art live music generation to developers and users around the globe, the Magenta Project remains committed to providing more direct access to code and models, so that researchers, artists, and creative coders can build upon and adapt them to achieve their creative goals.

Known Limitations

Coverage of broad musical styles. Magenta RT’s training data primarily consists of Western instrumental music. As a consequence, Magenta RT has incomplete coverage of both vocal performance and the broader landscape of rich musical traditions worldwide. For real-time generation with broader style coverage, we refer users to our Lyria RealTime API.

Vocals. While the model is capable of generating non-lexical vocalizations and humming, it is not conditioned on lyrics and is unlikely to generate actual words. However, there remains some risk of generating explicit or culturally-insensitive lyrical content.

Latency. Because the Magenta RT LLM operates on two-second chunks, user inputs for the style prompt may take two or more seconds to influence the musical output.

Limited context. Because the Magenta RT encoder has a maximum audio context window of ten seconds, the model is unable to directly reference music that has been output earlier than that. While the context is sufficient to enable the model to create melodies, rhythms, and chord progressions, the model is not capable of automatically creating longer-term song structures.
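The effect of the ten-second context window can be illustrated with a fixed-length buffer. This is a sketch only: it assumes two-second chunks and represents each chunk by its index rather than by audio tokens.

```python
from collections import deque

CONTEXT_SECONDS = 10  # maximum audio context, from the text
CHUNK_SECONDS = 2     # maximum chunk size, from the text

# A deque with maxlen drops the oldest chunk once the window is full.
context = deque(maxlen=CONTEXT_SECONDS // CHUNK_SECONDS)

for chunk_index in range(8):
    context.append(chunk_index)

# Only the most recent five chunks (ten seconds) remain visible.
visible = list(context)
print(visible)
```

After eight chunks, chunks 0 through 2 have fallen out of the window, which is why the model cannot refer back to earlier material when shaping longer-term structure.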

Future Work

Magenta RT and Lyria RT are pushing the boundaries of live generative music, and we are happy that Magenta RT marks a return of open releases from Magenta.

We are hard at work on making Magenta RT run locally on your own device. Stay tuned for more info!

We are also working on the next generation of real-time models with higher quality, lower latency, and more interactivity, to create truly playable instruments and live accompaniment.

How to cite

Please cite our technical report:

BibTeX:

@article{gdmlyria2025live,
    title={Live Music Models},
    author={Caillon, Antoine and McWilliams, Brian and Tarakajian, Cassie and Simon, Ian and Manco, Ilaria and Engel, Jesse and Constant, Noah and Li, Pen and Denk, Timo I. and Lalama, Alberto and Agostinelli, Andrea and Huang, Anna and Manilow, Ethan and Brower, George and Erdogan, Hakan and Lei, Heidi and Rolnick, Itai and Grishchenko, Ivan and Orsini, Manu and Kastelic, Matej and Zuluaga, Mauricio and Verzetti, Mauro and Dooley, Michael and Skopek, Ondrej and Ferrer, Rafael and Borsos, Zal{\'a}n and van den Oord, {\"A}aron and Eck, Douglas and Collins, Eli and Baldridge, Jason and Hume, Tom and Donahue, Chris and Han, Kehang and Roberts, Adam},
    journal={arXiv:2508.04651},
    year={2025}
}

Introducing Lyria RealTime API

Thu, 12 Jun 2025 07:00:01 -0700

Lyria RealTime API

Lyria team

For the last few years, we have continued to explore how different ways of interacting with generative AI technologies for music can lead to new creative possibilities. A primary focus has been on what we refer to as “live music models”, which can be controlled by a user in real-time.

Lyria RealTime is Google DeepMind’s latest model developed for this purpose, and we are excited to share an experimental API that anyone can use to explore the technology, create some jams, develop an app, or build their own musical instruments. You can try a demo app now in Google AI Studio, fork it to build your own, or have a look at the API documentation. For more details on how Lyria RealTime works, see our technical report.

Here are a few interfaces we have open sourced in Google AI Studio for inspiration that you can easily fork and make your own:

PromptDJ

Our most fully-featured demo allows you to add prompts and use sliders to control their relative impact on the music. Advanced Settings let you try out manual overrides for different musical aspects like note density, tempo, and key.

Try it now!

PromptDJ MIDI

With PromptDJ MIDI, you can use a virtual MIDI controller to mix together text descriptors (that you can edit) and produce a single stream of music. You can even map the knobs to a physical MIDI controller via WebMIDI like Toro y Moi used during the I/O preshow.

Try it now!

PromptDJ Pad

PromptDJ Pad harkens back to our earlier experiments with latent space interfaces NSynth Super and MusicVAE Beat Blender, allowing you to easily explore the space between four editable prompts.

Try it now!
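One plausible way to map a pad position to a mixture of four prompts is bilinear interpolation. The sketch below is an assumption for illustration, not the PromptDJ Pad implementation; the function name `pad_weights` and the corner labels are hypothetical.

```python
def pad_weights(x, y):
    """Bilinear weights for four corner prompts.

    x and y are pad coordinates in [0, 1], with (0, 0) at the top-left corner.
    """
    if not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0):
        raise ValueError("x and y must lie in [0, 1]")
    return {
        "top_left": (1 - x) * (1 - y),
        "top_right": x * (1 - y),
        "bottom_left": (1 - x) * y,
        "bottom_right": x * y,
    }


weights = pad_weights(0.25, 0.5)
print(weights)
```

The four weights always sum to 1, and each corner of the pad gives full weight to a single prompt, so dragging across the pad moves smoothly between the four styles.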

A key advantage of the API is its versatility, allowing it to be called from various platforms, not just web apps. For instance, we’ve developed a VST plugin called The Infinite Crate, which enables a seamless interaction between Lyria RealTime and the digital audio workstation of your choice!

Capabilities

With Lyria RealTime, it is possible to traverse the space of multi-instrumental audio: explore the never-before-heard music between genres, unusual instrument combinations, or abstract concepts.

The core capabilities of the model and API are:

Generates a continuous stream of 48kHz stereo music.
Low latency – maximum of 2 seconds between control change and effect.
Latent space steering based on a mixture of text descriptors.
Manual control over music features
- Tempo, key.
- Options to reduce or silence particular instrument groups (drums, bass, other).
- Control for density of note onsets.
- Control for spectral brightness.
Sampling temperature and top-k settings (“chaos” control).
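For readers unfamiliar with the last item, the following is a generic sketch of temperature and top-k sampling over a list of logits. It shows the standard technique only; how the Lyria RealTime API applies these settings internally is not described here, and the function name `sample_top_k` is hypothetical.

```python
import math
import random


def sample_top_k(logits, temperature=1.0, top_k=40, rng=random):
    """Sample an index from the top_k highest logits after temperature scaling."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    k = max(1, min(top_k, len(logits)))
    # Indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Higher temperature flattens the distribution; lower sharpens it.
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)
    # Subtract the peak before exponentiating for numerical stability.
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]


rng = random.Random(0)
token = sample_top_k([0.1, 2.0, -1.0, 1.5], temperature=0.8, top_k=2, rng=rng)
print(token)
```

With `top_k=1` the function always returns the most likely index. Raising `top_k` or `temperature` admits less likely choices, which is the sense in which these settings act as a "chaos" control.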

Interfaces for Live Music Models

One of the things we are most excited about with live music models is the number of novel interfaces they make possible by mapping human actions to musical controls. This harkens back to our earlier work with Magenta.js and the large number of applications it and other earlier Magenta technologies spawned. We hope the Lyria RealTime API will empower even more creativity by developers.

Live music models introduce a different interaction paradigm than text-to-song generators, which have impressive capabilities but lack the instantaneous feedback loops available to players of traditional instruments. The goal of models like Lyria RealTime is to put the human more deeply in the loop, centering the experience on the joy of the process over the final product. The higher bandwidth channel of communication and control often results in outputs that are more unique and personal, as every action the player takes (or doesn’t) has an effect.

In Lyria RealTime, the ability to adjust prompt mixtures and quickly hear the results allows players to efficiently explore the sonic landscape to find novel textures and loops. Real-time interactivity also provides the possibility of this latent exploration being its own type of musical performance, the interpolation through space combined with anchoring of the audio context producing a structure similar to a DJ set or improvisation session. Beyond performance, it can also be used to provide interactive soundscapes for physical spaces like artist installations or virtual spaces like video games.

Our first public experiment with Lyria RealTime was MusicFX DJ, which we developed last year as a collaboration with Google Labs. MusicFX DJ allows you to create and conduct a continuous flow of music, and we worked with producers and artists to make the tool more inspiring and useful to musicians and amateurs alike.

At this year’s I/O, Toro y Moi (Chaz Bear) took Lyria RealTime for a spin on stage before the keynote, using a different interface that he operated via a physical MIDI controller. Chaz’s performance leaned deeply into the live nature of the model, improvising with it to lead the crowd on a sonic journey full of surprises for himself and the audience.

Chaz Bear's performance at Google I/O 2025.

How it Works

Live generative music is particularly difficult because it requires all three of the following at once: real-time generation (i.e. real-time factor > 1, generating 2 seconds of audio in less than 2 seconds), causal streaming (i.e. online generation), and low-latency controllability.

Lyria RealTime diagram

Lyria RealTime overcomes these challenges by adapting the MusicLM architecture to perform block autoregression. The model generates a continuous stream of music in sequential chunks, each steered by the previous audio output and a style embedding for the next chunk. By manipulating the style embedding (weighted average of text or audio prompt embeddings), players can shape and morph the music in real-time, mixing together different styles, instruments, and musical attributes.
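The block-autoregressive loop described above can be sketched with a stand-in model. Everything here is illustrative: `stream`, `toy_generate`, and `toy_style` are hypothetical names, chunks are represented as strings rather than audio, and the context length of five chunks is an assumption chosen for the example.

```python
def stream(generate_chunk, get_style, num_chunks, context_chunks=5):
    """Yield chunks one at a time, each conditioned on recent output and style."""
    context = []
    for step in range(num_chunks):
        # Controls are read once per chunk, so a change takes effect
        # at the next chunk boundary.
        style = get_style(step)
        chunk = generate_chunk(context, style)
        yield chunk
        # Keep only the most recent chunks as context for the next step.
        context = (context + [chunk])[-context_chunks:]


def toy_generate(context, style):
    """Stand-in for the model: labels a chunk with its style and context size."""
    return f"{style}#{len(context)}"


def toy_style(step):
    """Stand-in for live controls: the player switches style at step 2."""
    return "ambient" if step < 2 else "techno"


chunks = list(stream(toy_generate, toy_style, 4))
print(chunks)
```

Because the loop is a generator, each chunk can be sent to the audio output as soon as it is produced instead of waiting for the whole piece.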

Future Work

We are currently working on the next generation of real-time models with higher quality, lower latency, more interactivity, and on-device operability, to create truly playable instruments and live accompaniment. Stay tuned as we continue working with communities of musicians and developers on these technologies.

Magenta Studio 2.0

Thu, 24 Aug 2023 07:00:01 -0700

TL;DR: Magenta Studio, first released in 2019, has been updated to more seamlessly integrate with Ableton Live. No functionality has changed; there are only UI changes and internal fixes. Please download and enjoy!

If you’re new to Magenta Studio, please read our previous post about what it is and how it works.

What’s New

In the previous version of Magenta Studio, the Max for Live (M4L) plugin would launch a separate application specific to your operating system for each of the tools. Unfortunately, as operating systems were upgraded, sometimes the applications stopped working. Therefore, we made the decision to integrate the tools directly into the Max for Live environment to ensure longer-term stability. The machine learning models are still directly integrated into the M4L plugin and do not require access to the Internet to use.

Upgrading

To upgrade from the previous version of Magenta Studio, you can download the latest version and drop it into Live directly in the place of the old plugin. The functionality has not been altered, only the interface and integration, so it works in exactly the same way.

Documentation

The documentation has been updated to reflect the new interface. The tool-specific videos have not been updated with the new interface, but the functionality is identical.

Support

Please report any issues to the GitHub repository. Thanks for using Magenta Studio!

Acknowledgements

Magenta Studio is based on work by members of the Google DeepMind team’s Magenta project along with contributors to the Magenta and Magenta.js libraries. The plug-ins were implemented by Yotam Mann and extended by Cassie Tarakajian.

The 2023 I/O Preshow – Composed by Dan Deacon (with some help from MusicLM)

Wed, 21 Jun 2023 13:00:00 -0700

TL;DR: Dan Deacon worked with Google’s latest music AI models to compose the preshow music. Check out the MusicLM demo in the AI Test Kitchen app. Read on for more details about our collaboration with Dan Deacon.

Dan Deacon’s I/O Performance

On several occasions, we have had the pleasure of working with musicians that perform at Google I/O. This is an opportunity for us to bring our latest creative machine learning tools out of the lab and into the hands of the musicians. In previous years, we have worked with YACHT and The Flaming Lips. With YACHT we explored custom symbolic music generation models tailored to the band, and with The Flaming Lips we explored an interaction to bridge the audience and performers.

This year’s I/O pre-show was performed by electronic musician and composer Dan Deacon. With Dan we explored how artists might interact with generative models of music audio and incorporate them into their artistic process. Check out his performance in the video below and read on to learn more about his process using Google’s latest music AI tools:

Dan Deacon's performance at Google I/O 2023.

Dan used two of our new generative models in his performance: MusicLM (paper, demo), which produces music based on a text-based input prompt, and SingSong (paper), which will generate an accompaniment track for an audio-based singing input. Both of these models are part of the AudioLM (paper) family, and they directly produce audio based on the input conditioning (i.e., text or singing) by autoregressively predicting SoundStream (paper) tokens with one or more Transformer language models. SoundStream tokens can then be converted back to raw audio that can be used in conjunction with other audio editing software.

For his performance, Dan used MusicLM to create the chill, relaxing piano groove that’s heard behind his two meditations starring the Duck with Lips. Additionally, Dan used both MusicLM and SingSong to create the Chiptune song. Most excitingly, Dan didn’t just use both SingSong and MusicLM, but actually extended their capabilities to put his performance together. We’ll discuss more of how Dan shaped the tools, and why it’s important that he did so, in the next section.

Working with Dan

As Dan discusses at around 7 minutes into his performance, he has always been excited by the promise that new technologies bring to the compositional process. Technology has a long and intertwined history with the art of making music. We might not think of things like flutes, violins, or trombones in the same way we think of computers now, but these were revolutionary new technologies when they were first introduced! They can also often seem disruptive at first: at one point in history, microphones caused quite a stir because they let vocalists sing much more softly (as opposed to singing so loudly that they could be heard over the band). Yet in retrospect, microphones changed our relationship to music in many positive ways, enabling us to create, represent, and distribute music in ways that would have been inconceivable beforehand. Importantly, each new technological development expanded the creative palette of musicians, bringing with it new textures, new techniques, and sometimes new conceptions of music itself.

We view our new models as a continuation of music technology’s evolution. We’re incredibly inspired by the opportunity for these new tools to bring new creative capabilities to humanity, while remaining conscious of their potential negative consequences and working hard to mitigate them. Our goal is and always has been to empower artists and musicians; a crucial piece of empowering musicians is understanding how these new tools situate themselves in different artists’ creative processes. With that in mind, collaborating with Dan was a great opportunity for us to work towards embodying our goals of empowering musicians in the era of generative modeling.

A glimpse of our in-person workshop where we showed our new tools to Dan Deacon.

About a month before I/O, we had a workshop with Dan where we introduced him to MusicLM and SingSong. Initially, Dan found many interesting text prompts for MusicLM, such as “a 600ft trombone.” He started to push the tools past their limits by, for example, playing his synthesizer into SingSong, ignoring that the system was trained only on singing inputs. These initial experiments turned out to be really fun and promising!

As we kept working with Dan, he surprised us by pushing these tools even further. Inspired by “I Am Sitting in a Room”, he fed the output of the SingSong model back into itself… over and over and over. Again, Dan moved beyond the model’s design of accepting singing input; because the model was fed its own output, the input audio fell outside the distribution that the model had seen during training, and we weren’t sure if this would work at all. Yet, not only did it work, but the feedback loop tended to produce music that still accompanies the input; it has the same key, tempo, and style. This was the interaction that Dan designed to compose the Chiptune song, above.

Dan began with a handful of text prompts to MusicLM, and then used the generated audio as input to SingSong and that output back through SingSong for numerous iterations. He was able to create hundreds of audio clips that complemented each other. From these, he handpicked his favorite clips, edited them slightly, and performed them.
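The feedback process described above amounts to repeatedly applying an accompaniment model to its own output and keeping every intermediate clip. A minimal sketch, in which `feedback_clips` and `toy_accompany` are hypothetical stand-ins and clips are represented as strings rather than audio:

```python
def feedback_clips(seed_clip, accompany, iterations):
    """Feed each output back in as the next input, keeping every clip."""
    clips = [seed_clip]
    for _ in range(iterations):
        clips.append(accompany(clips[-1]))
    return clips


def toy_accompany(clip):
    """Stand-in for an accompaniment model: marks one more pass through it."""
    return clip + "'"


# The seed plays the role of audio generated from a text prompt.
clips = feedback_clips("seed", toy_accompany, 3)
print(clips)
```

Keeping the whole list rather than only the final result mirrors the workflow in the text, where many generated clips were collected and a favorite subset was picked by hand afterwards.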

We’re very proud to have been a part of Dan’s amazing performance. We’re extremely excited for the direction that this research is headed, and we’re always looking for ways to give musicians new tools to interact with. Check out the Google Keyword blog post to learn more about MusicLM and you can try it yourself by signing up via the AI Test Kitchen app.

Acknowledgements

This year’s I/O pre-show was a huge collaborative effort. We would like to thank everyone involved in making the performance a success (in no particular order): Josh Christman, Daniel Chandler, Meghan Reinhardt, Carolyne De Bellefeuille, Adi Goodrich, Jon Barron, Irina Blok, Spencer Sterling, Ruben Beddeleem, Ben Poole, Cadie Desbiens-Desmeules, Chris Donahue, Jorge Gonzalez Mendez, Noah Constant, Jesse Engel, Timo Denk, Andrea Agostinelli, Neil Zeghidour, Christian Frank, Mauricio Zuluaga, Hema Manickavasagam, Tom Hume, and Lynn Cherry.

The Wordcraft Writers Workshop: Creative Co-Writing with AI

Thu, 01 Dec 2022 08:00:00 -0800

A core piece of Magenta’s mission is to empower creativity using AI and machine learning. In order to evaluate how well this goal is being achieved, it is important to put tools in the hands of creators, encouraging them to share honest and critical feedback. This feedback can help researchers to thoughtfully develop the next generations of ML-powered creative tools. Most of our prior efforts to engage with creators have been in the domain of music (for example, Magenta Studio and NSynth).

However, human creativity encompasses far more than just music: visual artists paint, draw, and sculpt, and writers craft stories and poetry. In recent years, we’ve seen huge advancements in machine learning techniques that can facilitate creativity in these other modalities. Creative writing is an especially interesting domain because it is so challenging for AI to get right. Even short stories commonly have narrative arcs that span paragraphs or longer, multiple characters with diverging points of view, and a careful balance of familiar archetypes and novel storytelling–all difficult traits for state-of-the-art AI to replicate. At the same time, the omnipresent writer’s block is not a problem at all for neural language models like LaMDA, which can effortlessly generate as many words as you ask them for.

Earlier this year, we invited a cohort of 13 professional creative writers to try their hands at writing stories using Wordcraft, an AI-augmented text editor with a wide range of generative capabilities targeted at creative writing assistance. Wordcraft can suggest story ideas, rewrite text according to user-provided instructions, and elaborate on what has already been written. It also has a chatbot interface where users can engage with LaMDA, Google’s dialog-based language model, about their stories.

A demo of the Wordcraft web application

As in generative music, AI-assisted story writing can be a mixed bag. At its best, Wordcraft made suggestions that were inspiring and surrealistic, and writers applauded its usefulness for ideation and overcoming writer’s block. However, it also had a tendency to rehash tired tropes, and it could take wading through many dull suggestions before finding an interesting one.

All of the writers’ stories are available in the Wordcraft Writers Workshop’s digital literary magazine, along with a detailed writeup of what we learned about the role machine learning can play in creative writing.

We hope you enjoy perusing the stories, and we are excited to hear your ideas about how AI can create valuable creative writing tools.