Nvidia releases massive AI-ready European language dataset and tools

News Room
Last updated: 2025/08/17 at 11:48 AM
News Room Published 17 August 2025

Only a tiny fraction of the more than 7,000 languages on Earth are supported by artificial intelligence models, so today Nvidia Corp. announced a massive new AI-ready dataset and models to support the development of high-quality AI translation for European languages.

The new dataset, named Granary, is a massive open-source corpus of multilingual audio. It contains around a million hours of audio, including 650,000 hours for speech recognition and 350,000 hours for speech translation.

Nvidia’s speech AI team collaborated with researchers from Carnegie Mellon University and Fondazione Bruno Kessler to process unlabeled audio and public speech data into information usable for AI training. The dataset is available openly and for free on GitHub.

Granary includes 25 European languages, representing nearly all of the European Union’s 24 official languages, plus Russian and Ukrainian. The dataset also contains languages with limited available data, such as Croatian, Estonian and Maltese.

This matters because annotated data for these underrepresented languages will let developers build more inclusive speech technologies for the people who speak them, while using less training data in their AI applications and models.

Nvidia tailored the dataset to European languages, focusing on high-quality audio and annotation specific to those language families, which allows models to train on less data. In their research paper, the team showed that reaching high accuracy in automatic speech recognition and automatic speech translation takes around half as much Granary training data as it does with other popular datasets.

New AI translation and transcription models

Alongside Granary, Nvidia also released new Canary and Parakeet models to demonstrate what can be created with the dataset.

The two models are Canary-1b-v2, a model optimized for high accuracy on complex tasks, and Parakeet-tdt-0.6b-v3, a smaller model designed for high-speed, low-latency translation and transcription tasks.

The new Canary is available under a fairly permissive license for commercial and research use, expanding Canary’s current languages from four to 25. It offers transcription and translation quality comparable to models three times larger while running inference up to 10 times faster. At 1 billion parameters, it can run completely on-device on most next-gen flagship smartphones for speech translation on the fly.
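To put the on-device claim in perspective, the weight storage of a model can be approximated as parameter count times bytes per parameter. The sketch below is a back-of-the-envelope estimate only: the precision table and the function name are illustrative assumptions, not anything Nvidia published, and the figure covers weights alone, ignoring activations and runtime overhead.

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
# Covers weights only; activations, caches and runtime overhead are ignored.
# The precision table and function name are illustrative assumptions.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Return approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    # Parameter counts taken from the article: 1 billion and 0.6 billion.
    for label, params in [("1B-parameter model", 1.0e9),
                          ("0.6B-parameter model", 0.6e9)]:
        for precision in ("fp32", "fp16", "int8"):
            size = weight_memory_gb(params, precision)
            print(f"{label} at {precision}: ~{size:.1f} GB of weights")
```

At 16-bit precision a 1-billion-parameter model needs about 2 GB for its weights, which is why a model of this size is plausible on a high-end phone while models several times larger are not.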

Parakeet prioritizes high throughput and can ingest and transcribe 24 minutes of audio in a single pass. It can detect the audio language and transcribe without additional prompting. Both Canary and Parakeet provide accurate punctuation, capitalization and word-level timestamps in their outputs.
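A recording longer than the 24-minute single-pass limit would have to be handled in several passes. The following is a minimal sketch of how a caller might plan those passes; `plan_passes` is a hypothetical helper, not part of any Nvidia tool, and it cuts at fixed offsets, whereas a real pipeline would more likely cut at silences to avoid splitting words.

```python
from typing import List, Tuple

# Single-pass limit cited in the article: 24 minutes, expressed in seconds.
MAX_PASS_SECONDS = 24 * 60


def plan_passes(total_seconds: float,
                max_pass: float = MAX_PASS_SECONDS) -> List[Tuple[float, float]]:
    """Split [0, total_seconds) into consecutive (start, end) windows.

    Each window is at most max_pass seconds long. Hypothetical helper:
    it cuts at fixed offsets and does not look at the audio content.
    """
    if total_seconds < 0 or max_pass <= 0:
        raise ValueError("total_seconds must be >= 0 and max_pass > 0")
    windows: List[Tuple[float, float]] = []
    start = 0.0
    while start < total_seconds:
        end = min(start + max_pass, total_seconds)
        windows.append((start, end))
        start = end
    return windows


if __name__ == "__main__":
    # A one-hour recording (3,600 seconds) needs three passes.
    for start, end in plan_passes(3600):
        print(f"transcribe {start:.0f}s to {end:.0f}s")
```

A one-hour recording yields two full 24-minute windows and a final 12-minute window.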

Other AI models with massively multilingual capabilities include Aya Expanse, a family of high-performance multilingual models from Cohere for AI, the nonprofit research lab run by the AI startup Cohere Inc. It is part of the Aya Collection, one of the largest multilingual dataset collections to date with 513 million examples, which also includes Aya-101, an open AI model covering more than 100 languages.

Nvidia has published additional guidance on GitHub on fine-tuning models with the Granary dataset, including how the company trained Canary and Parakeet, and has made the dataset available to developers on Hugging Face.

Image: Nvidia
