Copyright © All Rights Reserved. World of Software.

Why real-time voice AI is harder than it sounds

News Room
Last updated: 2026/02/21 at 11:23 AM
News Room Published 21 February 2026

Real-time voice recognition has become so common that many of us now take it for granted. But that convenience rests on years of deep learning research and on earlier products that yielded more frustration than results.

It turns out that simultaneous voice transcription is one of the hardest engineering problems in modern artificial intelligence, for reasons that have more to do with the foibles of human speech and our lack of tolerance for delay than with the underlying technology.

Voice is where many AI systems first break down, especially as companies rush to deploy agents in customer-facing environments, said Scott Stephenson, co-founder and chief executive officer of Deepgram Inc., developer of a scalable platform for automatic speech recognition and text-to-speech capabilities delivered via an application programming interface.

Human tolerance has its limits

“It has to do with real time,” he said. “If people are working with a product that isn’t expected to work in real time, they’ll allow more failures or silent failures.”

A misfiring chatbot can be retried. A voice assistant that pauses, misunderstands or responds awkwardly annoys the user. Those latency constraints mean “you have to get everything that you need to get done in 500 milliseconds or less,” Stephenson said.

Unlike text, which is standardized, speech is variable. The same word can sound dramatically different depending on accent, age, language, microphone quality, background noise or even where the speaker is standing. Stephenson called this one of the biggest problems in building robust speech systems.

Transcription tools have been around for years, but most only worked well with perfect audio. Those rule-based speech systems were built from layered models that tended to compound errors.

“Each of the models was maybe 80% or 85% accurate,” Stephenson said. “When you stack five of those together, you get down to 50% accuracy.”
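The compounding effect Stephenson describes can be sketched with simple arithmetic. The snippet below is an illustration, not a model of any real pipeline: it assumes every stage fails independently and that one failed stage spoils the final result, and the function names are hypothetical. Under those assumptions, five stages at 80% to 85% each land at roughly 33% to 44% overall, the same ballpark as the rough figure in the quote.

```python
def pipeline_accuracy(stage_accuracy: float, stages: int) -> float:
    """Overall accuracy of a chain of stages.

    Assumption: stages err independently, and any single stage
    error makes the final output wrong.
    """
    return stage_accuracy ** stages


def required_stage_accuracy(target: float, stages: int) -> float:
    """Per-stage accuracy needed to reach a target overall accuracy."""
    return target ** (1.0 / stages)


for acc in (0.80, 0.85):
    overall = pipeline_accuracy(acc, 5)
    print(f"{acc:.0%} per stage x 5 stages -> {overall:.1%} overall")

# Per-stage accuracy that five chained stages would need for 50% overall.
print(f"needed per stage for 50% overall: {required_stage_accuracy(0.5, 5):.1%}")
```

The takeaway is the direction rather than the exact numbers: every extra stage in the chain multiplies in another chance to be wrong, which is the weakness that end-to-end training avoids.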

Deep learning breakthrough

The breakthrough was end-to-end deep learning, in which models trained directly on massive datasets and inferred the rules themselves.

But even strong models are only part of the equation. Enterprise voice systems must be deployed like infrastructure, and the needs of business buyers are fundamentally different from those of consumers. “It has to have low latency, it has to have high throughput, it has to be reliable, it has to be debuggable, it has to be adaptable and get better over time,” Stephenson said.

Deployment options matter too. Many enterprises want voice recognition to run in their own environments for regulatory or privacy reasons. Deepgram delivers its technology using an API-first approach, but Stephenson said the differentiator is not the interface but the ability to deliver consistent performance at scale.

Measuring quality in voice recognition is more complex than many executives assume, he said. The primary metric for speech-to-text is word error rate, or the percentage of words transcribed incorrectly. “If your word error rate is 25% or less, you can get value,” he said. But perfection is unrealistic: “There really isn’t a zero percent word error rate,” even with humans.
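For readers who want to see how word error rate is usually computed, here is a minimal sketch. It follows the conventional definition: the number of word substitutions, deletions and insertions needed to turn the transcript into the reference, divided by the number of words in the reference. The normalization (lowercasing and splitting on whitespace) is a simplifying assumption, and the sample sentences are made up for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing (deletion)
                curr[j - 1] + 1,     # extra hypothesis word (insertion)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


reference = "please transfer fifty dollars to my savings account"
hypothesis = "please transfer fifteen dollars to savings account"
print(f"WER: {word_error_rate(reference, hypothesis):.0%}")
```

In this made-up example, one substituted word and one dropped word out of eight put the transcript at exactly the 25% mark. Because insertions count as errors, the rate can exceed 100% when a transcript contains many extra words.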

Voice generation is even harder to score objectively. Stephenson said it relies heavily on human preference testing with “tens or hundreds of people” across different scenarios.

The infrastructure burden is growing as voice agents increasingly rely on large language models and tool use behind the scenes. Latency is a physics problem at global scale. Real-time voice systems require regional endpoints because “the Earth is large enough that the speed of light matters,” Stephenson said. That’s why Deepgram is expanding its endpoint network to Europe this year, with Asia on deck.
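A back-of-the-envelope calculation shows why distance matters against a budget of about 500 milliseconds. The sketch below estimates only the minimum round-trip propagation delay over optical fiber. The refractive index of 1.47 is an assumed typical value, the distances are hypothetical, and real routes add detours, routing and processing time on top of this floor.

```python
SPEED_OF_LIGHT_KM_S = 299_792.458  # speed of light in a vacuum
FIBER_REFRACTIVE_INDEX = 1.47      # assumed typical value for optical fiber
BUDGET_MS = 500                    # response budget cited in the article


def min_round_trip_ms(distance_km: float) -> float:
    """Lower bound on round-trip time over fiber, in milliseconds."""
    speed_in_fiber = SPEED_OF_LIGHT_KM_S / FIBER_REFRACTIVE_INDEX
    return 2 * distance_km / speed_in_fiber * 1000


# Hypothetical user-to-endpoint distances.
for distance_km in (500, 5_000, 10_000):
    rtt = min_round_trip_ms(distance_km)
    share = rtt / BUDGET_MS
    print(f"{distance_km:>6} km: at least {rtt:5.1f} ms ({share:.0%} of budget)")
```

At 10,000 km, propagation alone uses close to a fifth of the budget before any speech recognition, language model call or speech synthesis has run, which is the case for regional endpoints.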

Because of its inherent complexity, voice AI shouldn’t be viewed as an all-or-nothing proposition. Stephenson advised testing in a few scenarios where the vocabulary is limited and expanding from there. “Don’t try to boil the ocean,” he said.

Voice recognition may be the most natural interface humans have, but making it work reliably in real time requires disciplined engineering, global infrastructure and models trained to survive the chaos of the way people speak.
