Only a tiny fraction of the more than 7,000 languages on Earth are supported by artificial intelligence models, so today Nvidia Corp. announced a massive new AI-ready dataset and models to support the development of high-quality AI translation for European languages.
The new dataset, named Granary, is an open-source corpus of multilingual audio totaling about a million hours, including roughly 650,000 hours for speech recognition and 350,000 hours for speech translation.
Nvidia’s speech AI team collaborated with researchers from Carnegie Mellon University and Fondazione Bruno Kessler to process unlabeled audio and public speech data into information usable for AI training. The dataset is openly available for free on Hugging Face.
Granary includes 25 European languages, representing nearly all of the European Union’s 24 official languages, plus Russian and Ukrainian. The dataset also contains languages with limited available data, such as Croatian, Estonian and Maltese.
This matters because data for these underrepresented languages will enable developers to build more inclusive speech technologies for the people who speak them, while using less training data in their AI applications and models.
Nvidia tailored the dataset to European languages, focusing on high-quality audio and annotation specific to those language families, which allows models to train on less data. The team demonstrated in its research paper that, compared with other popular datasets, it takes around half as much Granary training data to achieve high accuracy for automatic speech recognition and automatic speech translation.
New AI translation and transcription models
Alongside Granary, Nvidia also released new Canary and Parakeet models to demonstrate what can be created with the dataset.
The two models are Canary-1b-v2, a model optimized for high accuracy on complex tasks, and Parakeet-tdt-0.6b-v3, a smaller model designed for high-speed, low-latency translation and transcription tasks.
The new Canary is available under a fairly permissive license for commercial and research use, and it expands Canary’s language coverage from four languages to 25. It offers transcription and translation quality comparable to models three times larger while running inference up to 10 times faster. At 1 billion parameters, it can run entirely on-device on most next-gen flagship smartphones for on-the-fly speech translation.
Parakeet prioritizes high throughput and is capable of ingesting and transcribing 24 minutes of audio in a single pass. It can detect the audio language and transcribe without additional prompting. Both Canary and Parakeet provide accurate punctuation, capitalization and word-level timestamps in their outputs.
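As a rough illustration of what the 24-minute single-pass figure could mean for longer recordings, the sketch below plans how many passes a file would need. It assumes the caller splits the audio into consecutive windows; `plan_passes` is a hypothetical helper, not part of any Nvidia library, and the actual tooling may handle long audio differently.

```python
from typing import List, Tuple

# Single-pass limit quoted above for Parakeet (24 minutes), in seconds.
MAX_PASS_SECONDS = 24 * 60


def plan_passes(
    duration_s: float, max_pass_s: float = MAX_PASS_SECONDS
) -> List[Tuple[float, float]]:
    """Split a recording into consecutive (start, end) windows.

    Hypothetical helper for illustration only. Each window is at most
    max_pass_s seconds long, and the last one may be shorter.
    """
    if duration_s < 0 or max_pass_s <= 0:
        raise ValueError("duration must be >= 0 and window length must be > 0")
    windows: List[Tuple[float, float]] = []
    start = 0.0
    while start < duration_s:
        end = min(start + max_pass_s, duration_s)
        windows.append((start, end))
        start = end
    return windows


if __name__ == "__main__":
    # A one-hour recording needs three passes under this assumption.
    for start, end in plan_passes(60 * 60):
        print(f"{start:7.1f}s -> {end:7.1f}s")
```

Under this assumption, a one-hour recording needs three passes: two full 24-minute windows and a final 12-minute one. A real pipeline would probably cut at pauses rather than fixed offsets so that words are not split across windows.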
Other AI models with massively multilingual capabilities include Cohere for AI’s Aya Expanse, a family of high-performance multilingual models developed by the nonprofit research lab run by the AI startup Cohere Inc. It is part of the broader Aya initiative, which includes the Aya Collection, one of the largest multilingual dataset collections to date at 513 million examples, and Aya-101, an open AI model covering more than 100 languages.
Nvidia has published additional information on GitHub about how to fine-tune models using the Granary dataset, including how the company trained Canary and Parakeet, and it has made the new multilingual dataset available to developers on Hugging Face.
Image: Nvidia