Google’s Healthcare AI Made Up A Body Part — What Happens When Doctors Don’t Notice?

Scenario: A radiologist is looking at your brain scan and flags an abnormality in the basal ganglia. It’s an area of the brain that helps you with motor control, learning, and emotional processing. The name sounds a bit like another part of the brain, the basilar artery, which supplies blood to your brainstem — but the radiologist knows not to confuse them. A stroke or abnormality in one is typically treated in a very different way than in the other.

Now imagine your doctor is using an AI model to do the reading. The model says you have a problem with your “basilar ganglia,” conflating the two names into an area of the brain that does not exist. You’d hope your doctor would catch the mistake and double-check the scan. But there’s a chance they don’t.

Though it did not happen in a hospital setting, “basilar ganglia” is a real error served up by Google’s healthcare AI model, Med-Gemini. A 2024 research paper introducing Med-Gemini included the hallucination in a section on head CT scans, and nobody at Google caught it, in either that paper or a blog post announcing it. When Bryan Moore, a board-certified neurologist and researcher with expertise in AI, flagged the mistake, he tells The Verge, the company quietly edited the blog post to fix the error with no public acknowledgement, and the paper remained unchanged. Google calls the incident a simple misspelling of “basal ganglia.” Some medical professionals say it’s a dangerous error and an example of the limitations of healthcare AI.

Med-Gemini is a collection of AI models that can summarize health data, create radiology reports, analyze electronic health records, and more. The pre-print research paper, meant to demonstrate its value to doctors, highlighted a series of abnormalities in scans that radiologists “missed” but AI caught. One of its examples was that Med-Gemini diagnosed an “old left basilar ganglia infarct.” But as established, there’s no such thing.

Fast-forward about a year, and Med-Gemini’s trusted tester program is no longer accepting new entrants — likely meaning that the program is being tested in real-life medical scenarios on a pilot basis. It’s still an early trial, but the stakes of AI errors are getting higher. Med-Gemini isn’t the only model making them. And it’s not clear how doctors should respond.

“What you’re talking about is super dangerous,” Maulin Shah, chief medical information officer at Providence, a healthcare system serving 51 hospitals and more than 1,000 clinics, tells The Verge. He added, “Two letters, but it’s a big deal.”

In a statement, Google spokesperson Jason Freidenfelds told The Verge that the company partners with the medical community to test its models and that Google is transparent about their limitations.

“Though the system did spot a missed pathology, it used an incorrect term to describe it (basilar instead of basal). That’s why we clarified in the blog post,” Freidenfelds said. He added, “We’re continually working to improve our models, rigorously examining an extensive range of performance attributes — see our training and deployment practices for a detailed view into our process.”

A ‘common mis-transcription’

On May 6th, 2024, Google debuted its newest suite of healthcare AI models with fanfare. It billed “Med-Gemini” as a “leap forward” with “substantial potential in medicine,” touting its real-world applications in radiology, pathology, dermatology, ophthalmology, and genomics.

The models trained on medical images, like chest X-rays, CT slices, pathology slides, and more, using de-identified medical data with text labels, according to a Google blog post. The company said the AI models could “interpret complex 3D scans, answer clinical questions, and generate state-of-the-art radiology reports” — even going as far as to say they could help predict disease risk via genomic information.

Moore saw the authors’ promotions of the paper early on and took a look. He caught the mistake and was alarmed, flagging the error to Google on LinkedIn and contacting authors directly to let them know.

The company, he saw, quietly switched out evidence of the AI model’s error. It updated the debut blog post phrasing from “basilar ganglia” to “basal ganglia” with no other differences and no change to the paper itself. In communication viewed by The Verge, Google Health employees responded to Moore, calling the mistake a typo.

In response, Moore publicly called out Google for the quiet edit. This time the company changed the result back with a clarifying caption, writing that “‘basilar’ is a common mis-transcription of ‘basal’ that Med-Gemini has learned from the training data, though the meaning of the report is unchanged.”

Google acknowledged the issue in a public LinkedIn comment, again downplaying the issue as a “misspelling.”

“Thank you for noting this!” the company said. “We’ve updated the blog post figure to show the original model output, and agree it is important to showcase how the model actually operates.”

As of this article’s publication, the research paper itself still contains the error with no updates or acknowledgement.

Whether it’s a typo, a hallucination, or both, errors like these raise much larger questions about the standards healthcare AI should be held to, and when it will be ready to be released into public-facing use cases.

“The problem with these typos or other hallucinations is I don’t trust our humans to review them, or certainly not at every level,” Shah tells The Verge. “These things propagate. We found in one of our analyses of a tool that somebody had written a note with an incorrect pathologic assessment — pathology was positive for cancer, they put negative (inadvertently) … But now the AI is reading all those notes and propagating it, and propagating it, and making decisions off that bad data.”

Errors with Google’s healthcare models have persisted. Two months ago, Google debuted MedGemma, a newer and more advanced healthcare model that specializes in AI-based radiology results. Medical professionals found that its answers varied depending on how they phrased a question, which could lead to inaccurate outputs.

In one example, Dr. Judy Gichoya, an associate professor in the department of radiology and informatics at Emory University School of Medicine, asked MedGemma about a problem with a patient’s rib X-ray with a lot of specifics — “Here is an X-ray of a patient [age] [gender]. What do you see in the X-ray?” — and the model correctly diagnosed the issue. When the system was shown the same image but with a simpler question — “What do you see in the X-ray?” — the AI said there weren’t any issues at all. “The X-ray shows a normal adult chest,” MedGemma wrote.

In another example, Gichoya asked MedGemma about an X-ray showing pneumoperitoneum, or gas under the diaphragm. The first time, the system answered correctly. But with slightly different query wording, the AI hallucinated multiple types of diagnoses.

“The question is, are we going to actually question the AI or not?” Shah says. Even if an AI system is listening to a doctor-patient conversation to generate clinical notes, or translating a doctor’s own shorthand, he says, those have hallucination risks which could lead to even more dangers. That’s because medical professionals could be less likely to double-check the AI-generated text, especially since it’s often accurate.

“If I write ‘ASA 325 mg qd,’ it should change it to ‘Take an aspirin every day, 325 milligrams,’ or something that a patient can understand,” Shah says. “You do that enough times, you stop reading the patient part. So if it now hallucinates — if it thinks the ASA is the anesthesia standard assessment … you’re not going to catch it.”

Shah says he’s hoping the industry moves toward augmentation of healthcare professionals instead of replacing clinical aspects. He’s also looking to see real-time hallucination detection in the AI industry — for instance, one AI model checking another for hallucination risk and either not showing those parts to the end user or flagging them with a warning.

“In healthcare, ‘confabulation’ happens in dementia and in alcoholism where you just make stuff up that sounds really accurate — so you don’t realize someone has dementia because they’re making it up and it sounds right, and then you really listen and you’re like, ‘Wait, that’s not right’ — that’s exactly what these things are doing,” Shah says. “So we have these confabulation alerts in our system that we put in where we’re using AI.”
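To make the flag-with-a-warning idea concrete, here is a minimal sketch in Python. It is not Providence's system and not the model-checking-model approach Shah describes: a plain vocabulary lookup stands in for the second model, and the term list, the head words, and the function names are all hypothetical. A check this crude would also flag ordinary phrases such as "the artery," so it illustrates the idea rather than a usable tool.

```python
import re

# Hypothetical, deliberately tiny vocabulary (assumption). A real system would
# draw on a curated anatomical terminology, not a hand-written set.
KNOWN_TERMS = {"basal ganglia", "basilar artery"}

# Head words that mark a two-word phrase as worth checking (assumption).
HEAD_WORDS = ("ganglia", "artery")


def flag_unknown_terms(report: str) -> list[str]:
    """Return two-word phrases ending in a head word that are not in KNOWN_TERMS."""
    pattern = r"\b([a-z]+)\s+(" + "|".join(HEAD_WORDS) + r")\b"
    flagged = []
    for modifier, head in re.findall(pattern, report.lower()):
        phrase = f"{modifier} {head}"
        if phrase not in KNOWN_TERMS and phrase not in flagged:
            flagged.append(phrase)
    return flagged


def annotate(report: str) -> str:
    """Append a warning line when the report contains unrecognized terms."""
    flagged = flag_unknown_terms(report)
    if not flagged:
        return report
    return report + "\n[WARNING: unrecognized term(s): " + ", ".join(flagged) + "]"


if __name__ == "__main__":
    print(annotate("Old left basilar ganglia infarct."))
```

Run on the phrase from the Med-Gemini paper, the sketch appends a warning naming "basilar ganglia," while a report that says "basal ganglia" passes through unchanged.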

Gichoya, who leads Emory’s Healthcare AI Innovation and Translational Informatics lab, says she’s seen newer versions of Med-Gemini hallucinate in research environments, just like most large-scale AI healthcare models.

“Their nature is that [they] tend to make up things, and it doesn’t say ‘I don’t know,’ which is a big, big problem for high-stakes domains like medicine,” Gichoya says.

She added, “People are trying to change the workflow of radiologists to come back and say, ‘AI will generate the report, then you read the report,’ but that report has so many hallucinations, and most of us radiologists would not be able to work like that. And so I see the bar for adoption being much higher, even if people don’t realize it.”

Dr. Jonathan Chen, associate professor at the Stanford School of Medicine and the director for medical education in AI, searched for the right adjective — trying out “treacherous,” “dangerous,” and “precarious” — before settling on how to describe this moment in healthcare AI. “It’s a very weird threshold moment where a lot of these things are being adopted too fast into clinical care,” he says. “They’re really not mature.”

On the “basilar ganglia” issue, he says, “Maybe it’s a typo, maybe it’s a meaningful difference — all of those are very real issues that need to be unpacked.”

Some parts of the healthcare industry are desperate for help from AI tools, but the industry needs to have appropriate skepticism before adopting them, Chen says. Perhaps the biggest danger is not that these systems are sometimes wrong — it’s how credible and trustworthy they sound when they tell you an obstruction in the “basilar ganglia” is a real thing, he says. Plenty of errors slip into human medical notes, but AI can actually exacerbate the problem, thanks to a well-documented phenomenon known as automation bias, where complacency leads people to miss errors in a system that’s right most of the time. Even AI checking an AI’s work is still imperfect, he says. “When we deal with medical care, imperfect can feel intolerable.”

“You know the driverless car analogy, ‘Hey, it’s driven me so well so many times, I’m going to go to sleep at the wheel.’ It’s like, ‘Whoa, whoa, wait a minute, when your or somebody else’s life is on the line, maybe that’s not the right way to do this,’” Chen says, adding, “I think there’s a lot of help and benefit we get, but also very obvious mistakes will happen that don’t need to happen if we approach this in a more deliberate way.”

Requiring AI to work perfectly without human intervention, Chen says, could mean “we’ll never get the benefits out of it that we can use right now. On the other hand, we should hold it to as high a bar as it can achieve. And I think there’s still a higher bar it can and should reach for.” Getting second opinions from multiple, real people remains vital.

That said, Google’s paper had more than 50 authors, and it was reviewed by medical professionals before publication. It’s not clear exactly why none of them caught the error; Google did not directly answer a question about why it slipped through.

Dr. Michael Pencina, chief data scientist at Duke Health, tells The Verge he’s “much more likely to believe” the Med-Gemini error is a hallucination than a typo, adding, “The question is, again, what are the consequences of it?” The answer, to him, rests in the stakes of making an error — and with healthcare, those stakes are serious. “The higher-risk the application is and the more autonomous the system is … the higher the bar for evidence needs to be,” he says. “And unfortunately we are at a stage in the development of AI that is still very much what I would call the Wild West.”

“In my mind, AI has to have a way higher bar of error than a human,” Providence’s Shah says. “Maybe other people are like, ‘If we can get as high as a human, we’re good enough.’ I don’t buy that for a second. Otherwise, I’ll just keep my humans doing the work. With humans I know how to go and talk to them and say, ‘Hey, let’s look at this case together. How could we have done it differently?’ What are you going to do when the AI does that?”

Hayden Field
