Kyutai this week introduced Moshi, a multimodal AI model that can process audio streams in real time. The model, which reportedly outperforms OpenAI’s GPT-4o in some capabilities, is designed to understand speech and respond very quickly, and can even interrupt the person it is talking to. Currently, Moshi speaks and understands English in a variety of accents, including a French accent, and can listen to and generate audio and speech while maintaining a continuous stream of textual thoughts.
A technology open to all
A major feature of Moshi is its ability to handle two audio streams simultaneously, allowing it to listen and speak at the same time. This real-time interaction is made possible by joint pre-training on a mixture of text and audio: Moshi uses synthetic text data from Helium, a 7-billion-parameter language model developed by Kyutai.
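To make the idea of listening and speaking at the same time concrete, here is a minimal sketch in Python. It is not Kyutai's implementation: the function names, the string "frames", and the use of `asyncio` are all illustrative assumptions. It only shows two streams advancing concurrently instead of taking turns.

```python
import asyncio

# Hypothetical sketch: strings stand in for audio frames, and none of
# these names come from Moshi's actual codebase.

async def user(incoming, n_frames):
    """Simulate a person talking: push frames, then an end-of-stream marker."""
    for i in range(n_frames):
        await incoming.put(f"user-frame-{i}")
        await asyncio.sleep(0)  # yield control so other tasks can run
    await incoming.put(None)

async def listen(incoming, timeline):
    """Consume the user's frames as they arrive."""
    while True:
        frame = await incoming.get()
        if frame is None:  # end of the incoming stream
            break
        timeline.append(("heard", frame))

async def speak(n_frames, timeline):
    """Emit the model's own frames without waiting for the user to finish."""
    for i in range(n_frames):
        timeline.append(("spoke", f"model-frame-{i}"))
        await asyncio.sleep(0)  # yield control so listening can continue

async def full_duplex(n_frames=3):
    incoming = asyncio.Queue()
    timeline = []
    # All three tasks run concurrently; neither stream blocks the other.
    await asyncio.gather(
        user(incoming, n_frames),
        listen(incoming, timeline),
        speak(n_frames, timeline),
    )
    return timeline

def run_demo(n_frames=3):
    """Run the sketch and return the combined event timeline."""
    return asyncio.run(full_duplex(n_frames))
```

Calling `run_demo()` returns a single timeline in which "heard" and "spoke" events are recorded by independent tasks, which is the property the article describes. A real system would replace the strings with audio frames and the `speak` task with model inference.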
Moshi’s fine-tuning involved 100,000 synthetic spoken conversations, converted to audio using text-to-speech (TTS) technology. The model’s voice was trained on data generated by a separate TTS model, which achieved an end-to-end latency of 200 milliseconds.
Kyutai has also developed a smaller variant of Moshi that can run on a MacBook or a consumer-grade GPU, which makes the technology easier for many users to run at home.
Kyutai has stressed the importance of responsible AI use: the lab is building in a watermark to detect AI-generated audio, although this feature is still in development. The decision to open source Moshi also reflects the lab’s commitment to transparency and collaborative development within the AI community.
“The model code and weights will soon be shared freely and at no cost, which is also unprecedented for such technology,” explains Kyutai. “They will be useful to both researchers in the field and developers working on voice-based products and services.”
The team plans to release a technical report and open versions of the model, including the inference codebase, the 7B model, the audio codec, and the entire optimized stack. Future versions (Moshi 1.1, 1.2, and 2.0) will refine the model based on user feedback. Moshi’s license is intended to be as open as possible, which should help foster the broadest possible adoption.
It is already possible to test Moshi, since the bot is online at this address.