Nvidia Releases Massive AI-ready European Language Dataset And Tools

Only a tiny fraction of the more than 7,000 languages on Earth are supported by artificial intelligence models, so today Nvidia Corp. announced a massive new AI-ready dataset and models to support the development of high-quality AI translation for European languages.

The new dataset, named Granary, is a massive open-source corpus of multilingual speech comprising around a million hours of audio: 650,000 hours for speech recognition and 350,000 hours for speech translation.

Nvidia’s speech AI team collaborated with researchers from Carnegie Mellon University and Fondazione Bruno Kessler to process unlabeled audio and public speech data into information usable for AI training. The dataset is available openly and for free on GitHub.

Granary includes 25 European languages, representing nearly all of the European Union’s 24 official languages, plus Russian and Ukrainian. The dataset also contains languages with limited available data, such as Croatian, Estonian and Maltese.

This matters because annotated data for these underrepresented languages will enable developers to create more inclusive speech technologies for the people who speak them, while using less training data in their AI applications and models.

Nvidia tailored the dataset to European languages, focusing on high-quality audio and annotation specific to those language families, which allows models to train on less data. The team demonstrated in its research paper that, compared with other popular datasets, it takes around half as much Granary training data to achieve high accuracy for automatic speech recognition and automatic speech translation.

New AI translation and transcription models

Alongside Granary, Nvidia also released new Canary and Parakeet models to demonstrate what can be created with the dataset.

The two models are Canary-1b-v2, a model optimized for high accuracy on complex tasks, and Parakeet-tdt-0.6b-v3, a smaller model designed for high-speed, low-latency translation and transcription tasks.

The new Canary is available under a fairly permissive license for commercial and research use, and it expands Canary’s language coverage from four to 25. It offers transcription and translation quality comparable to models three times larger while running inference up to 10 times faster. At 1 billion parameters, it can run completely on-device on most next-gen flagship smartphones for speech translation on the fly.
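To put the on-device claim in perspective, a back-of-the-envelope estimate of weight storage for a 1-billion-parameter model can be sketched as follows. The precisions listed and the helper name `weight_memory_gb` are illustrative assumptions, not details from Nvidia; the figure covers weights only and ignores activations, caches and runtime overhead.

```python
# Rough weight-memory estimate for a model with a given parameter count.
# Assumption: each parameter is stored at a single fixed precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_gb(num_params: int, precision: str) -> float:
    """Return weight storage in gigabytes (10**9 bytes) at the given precision."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    # 1 billion parameters, as quoted for Canary-1b-v2.
    for precision in BYTES_PER_PARAM:
        size = weight_memory_gb(1_000_000_000, precision)
        print(f"{precision}: {size:.1f} GB")
```

Under these assumptions the weights alone would occupy roughly 4 GB at 32-bit precision, 2 GB at 16-bit and 1 GB at 8-bit, which is the range where phone memory becomes a practical constraint.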

Parakeet prioritizes high throughput and is capable of ingesting and transcribing 24 minutes of audio in a single pass. It can detect the audio language and transcribe without additional prompting. Both Canary and Parakeet provide accurate punctuation, capitalization and word-level timestamps in their outputs.
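For recordings longer than the 24-minute single-pass window, one simple option would be to plan fixed-length segments before transcription. The sketch below is only an illustration: `plan_chunks` is a hypothetical helper, not part of any Nvidia tooling, and it cuts at fixed offsets without regard to pauses or word boundaries.

```python
# Hypothetical helper: split a long recording into consecutive windows
# that each fit within an assumed 24-minute single-pass limit.
MAX_CHUNK_SECONDS = 24 * 60  # assumption taken from the 24-minute figure


def plan_chunks(
    total_seconds: float, max_chunk_seconds: float = MAX_CHUNK_SECONDS
) -> list[tuple[float, float]]:
    """Return (start, end) offsets in seconds covering [0, total_seconds)."""
    if total_seconds < 0 or max_chunk_seconds <= 0:
        raise ValueError("durations must be non-negative and chunk size positive")
    chunks: list[tuple[float, float]] = []
    start = 0.0
    while start < total_seconds:
        # The final window is shortened so it ends exactly at the recording's end.
        end = min(start + max_chunk_seconds, float(total_seconds))
        chunks.append((start, end))
        start = end
    return chunks


if __name__ == "__main__":
    # A one-hour recording needs three windows under this assumption.
    for start, end in plan_chunks(3600):
        print(f"{start:.0f}s to {end:.0f}s")
```

In practice the cut points would likely be moved to silences, and timestamps from each segment would need to be shifted by the segment's start offset before merging.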

Other AI models with massively multilingual capabilities include Aya Expanse, a family of high-performance multilingual models from Cohere for AI, the nonprofit research lab run by the AI startup Cohere Inc. It is part of the Aya Collection, one of the largest multilingual dataset collections to date with 513 million examples, which also includes Aya-101, an open AI model covering more than 100 languages.

On GitHub, Nvidia provided additional information on how to fine-tune models using the Granary dataset, including how the company trained Canary and Parakeet. It has also made the new multilingual dataset available to developers on Hugging Face.

Image: Nvidia
