Sometimes you want to transcribe something, but don’t want it to be hanging out on the internet for any hacker to see. Maybe it’s a conversation with your doctor or lawyer. Maybe you’re a journalist, and it’s a sensitive interview. Privacy and control are important.
That desire for privacy is one reason the French developer Mistral AI built its latest transcription models small enough to run on-device. They can run on your phone, on your laptop or in the cloud.
Voxtral Mini Transcribe 2, one of the new models announced Wednesday, is “super, super small,” Pierre Stock, Mistral’s vice president of science operations, told me. Another new model, Voxtral Realtime, can do the same thing but live, like closed captioning.
Privacy is not the only reason the company wanted to build small open-source models. By running right on the device you’re using, these models can work faster. No more waiting on files to find their way through the internet to a data center and back.
“What you want is the transcription to happen super, super close to you,” Stock said. “And the closest we can find to you is any edge device, so a laptop, a phone, a wearable like a smartwatch, for instance.”
The low latency (read: high speed) is especially important for real-time transcription. The Voxtral Realtime model can generate with a latency of less than 200 milliseconds, Stock said. It can transcribe a speaker’s words about as quickly as you can read them. No more waiting two or three seconds for the closed captioning to catch up.
The Voxtral Realtime model is available through Mistral’s API and on Hugging Face, along with a demo where you can try it out.
In some brief testing, I found it generated fairly quickly (although not as fast as you’d expect if it were on device) and managed to capture what I said accurately in English with a little bit of Spanish mixed in. It’s capable of handling 13 languages right now, according to Mistral.
Voxtral Mini Transcribe 2 is also available through the company’s API, or you can play around with it in Mistral’s AI Studio. I used the model to transcribe my interview with Stock.
I found it to be quick and pretty reliable, although it struggled with proper names like Mistral AI (which it called Mr. Lay Eye) and Voxtral (VoxTroll). Yes, the AI model got its own name wrong. But Stock said users can customize the model to understand certain words, names and jargon better if they’re using it for specific tasks.
The challenge of building small, fast AI models is that they also have to be accurate, Stock said. The company touted the models' performance on benchmarks showing lower error rates than competitors'.
“It’s not enough to say, OK, I’ll make a small model,” Stock said. “What you need is a small model that has the same quality as larger models, right?”
