Microsoft Corp. today introduced a trio of artificial intelligence models optimized to process images and audio.
The algorithms are available through Microsoft Foundry, an Azure service that developers can use to build AI applications. The tech giant has also started rolling out the models to a number of other products.
The first new algorithm, MAI-Image-2, can generate images with a resolution of up to 1024 by 1024 pixels based on user instructions. Each prompt may contain up to 32,000 tokens’ worth of text. Under the hood, MAI-Image-2 turns instructions into images using 10 billion to 50 billion non-embedding parameters. Non-embedding parameters are model components that focus on generating content rather than preliminary data preparation tasks.
Microsoft says that MAI-Image-2 is at least twice as fast as its previous-generation image generator. The second new model that debuted today, MAI-Transcribe-1, also brings significant speed improvements. It can transcribe speech 2.5 times faster than Microsoft’s earlier models.
MAI-Transcribe-1’s other selling point is its accuracy. Microsoft tested the model’s mean word error rate, a measure of transcript quality, across 25 languages. MAI-Transcribe-1 logged an error rate of 3.9%, which put it ahead of Gemini 3.1 Flash and OpenAI Group PBC’s GPT-Transcribe. One contributor to the model’s accuracy is that it includes features for filtering environmental noise.
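For readers unfamiliar with the metric, word error rate is conventionally computed as the number of word-level substitutions, deletions and insertions needed to turn a model's transcript into the reference transcript, divided by the number of words in the reference. The Python sketch below illustrates that standard definition only. Microsoft has not published its evaluation code here, so the function names, the whitespace tokenization and the simple per-language averaging are all assumptions made for illustration.

```python
from typing import Iterable, Tuple


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level edit distance divided by reference length.

    Assumption: words are split on whitespace with no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current row) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def mean_word_error_rate(pairs: Iterable[Tuple[str, str]]) -> float:
    """Unweighted mean of per-sample WER (hypothetical aggregation scheme)."""
    rates = [word_error_rate(ref, hyp) for ref, hyp in pairs]
    if not rates:
        raise ValueError("at least one (reference, hypothesis) pair is required")
    return sum(rates) / len(rates)
```

Under this definition, a 3.9% error rate corresponds to roughly 39 word-level mistakes for every 1,000 words in the reference transcripts.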
At launch, MAI-Transcribe-1 supports only batch transcription, which means it can process pre-prepared files such as audiobooks but not live audio. According to Microsoft, a future update will add the ability to transcribe real-time audio streams. The company is also working on a so-called diarization feature that can split the text of a transcript into speaker-specific segments.
The third model that Microsoft introduced today is called MAI-Voice-1. As the name suggests, it’s optimized to generate synthetic speech based on user-provided scripts. Customers can choose one of the built-in AI voices or use their own voice.
Microsoft says all three models are priced competitively with rival offerings. MAI-Image-2 is priced at $5 per 1 million input tokens and $33 per 1 million output tokens. MAI-Transcribe-1 costs $0.36 per hour of transcribed speech, while MAI-Voice-1 starts at $22 per 1 million characters.
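To put those list prices in concrete terms, the short Python sketch below estimates what a workload might cost. It uses only the rates quoted above. The function names are hypothetical, the MAI-Voice-1 figure is treated as a lower bound because that model "starts at" $22, and the number of output tokens MAI-Image-2 bills per image is not stated, so it is left as an input.

```python
from decimal import Decimal

# List prices quoted in the article, in U.S. dollars.
IMAGE_INPUT_USD_PER_MILLION_TOKENS = Decimal("5")
IMAGE_OUTPUT_USD_PER_MILLION_TOKENS = Decimal("33")
TRANSCRIBE_USD_PER_HOUR = Decimal("0.36")
VOICE_USD_PER_MILLION_CHARACTERS = Decimal("22")  # "starts at" price, so a floor

ONE_MILLION = Decimal(1_000_000)


def image_cost(input_tokens: int, output_tokens: int) -> Decimal:
    """Estimated MAI-Image-2 cost for the given token counts."""
    return (
        IMAGE_INPUT_USD_PER_MILLION_TOKENS * input_tokens
        + IMAGE_OUTPUT_USD_PER_MILLION_TOKENS * output_tokens
    ) / ONE_MILLION


def transcribe_cost(audio_hours: Decimal) -> Decimal:
    """Estimated MAI-Transcribe-1 cost for a number of hours of audio."""
    return TRANSCRIBE_USD_PER_HOUR * audio_hours


def voice_cost_floor(characters: int) -> Decimal:
    """Minimum MAI-Voice-1 cost for a script of the given length."""
    return VOICE_USD_PER_MILLION_CHARACTERS * characters / ONE_MILLION
```

By this arithmetic, transcribing 100 hours of audio would cost $36, and synthesizing a 500,000-character script would cost at least $11.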
The models are available through not only Microsoft Foundry but also several other services. Microsoft is currently in the process of rolling out MAI-Image-2 to Bing and PowerPoint, while MAI-Voice-1 is accessible in an audio creation tool called Copilot Audio Expressions.
The tech giant has developed a line of custom AI chips called Maia to power its AI workloads. The newest addition to the series, the inference-optimized Maia 200, made its debut in late January. Microsoft says that the three-nanometer chip outperforms competing cloud providers’ custom AI chips across several benchmarks.
Photo: Microsoft
