
Hallucinations by Design (Part 2): The Silent Flaws of Embeddings & Why Your AI Is Getting It Wrong | HackerNoon

News Room · Published 2 April 2025

Caption: The two characters look different but share a striking similarity in posture, expression, and background—almost like they are “embeddings” of different sentences that end up close together.

Read Part 1 here: https://hackernoon.com/hallucination-by-design-how-embedding-models-misunderstand-language

Last month, I shared how embedding models hallucinate when handling simple language variations like negation and capitalization. The response was overwhelming; it seems I'm not the only one who has been burnt by these issues. Today, I'm diving deeper into even more troubling blind spots I've discovered through testing. These are the kinds that keep me up at night and make me question everything about how we're building AI systems.

This is the second part in the series on Hallucinations by Design, continuing the previous discussion of how embeddings hallucinate. To get the most out of this article, I highly recommend reading the linked piece first, as it lays out the foundational concepts needed to fully grasp the ideas explored here.

Hypothetical vs. actual? Just details!

Here’s where things get truly disturbing. When I ran “If the treatment works, symptoms should improve” against “The treatment works and symptoms have improved”, the similarity score hit 0.95. I sat staring at my screen in disbelief. One’s speculating about potential outcomes; the other’s reporting confirmed results!

I hit this problem while working on a clinical research document search. The search couldn't distinguish between hypothesized treatment outcomes and verified results, so doctors searching for proven treatments were getting results mixed with unproven hypotheses. Do you think physicians making treatment decisions appreciate confusing speculation with evidence? I am sure I wouldn't want my medical care based on "might work" rather than "does work".

Again, think about all the cases where distinguishing hypotheticals from facts is essential – scientific research, medical trials, legal precedents, and investment analyses. When your model conflates “if X then possibly Y” with “X happened and caused Y”, you’ve completely misunderstood the epistemic status of the information. We’re basically working with models that can’t tell the difference between speculation and confirmation despite analyzing text where this distinction determines whether something is reliable information or mere conjecture.
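If you want to sanity-check this kind of pair on your own stack, here is a minimal sketch using sentence-transformers and cosine similarity (all-mpnet-base-v2 is one of the models compared later in this post). The pair below is the hypothetical-vs.-confirmed example from above; exact scores will vary with your model and library versions.

```python
# Minimal sketch: score a sentence pair with a bi-encoder and cosine similarity.
# Exact numbers depend on the model and library versions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-mpnet-base-v2")

a = "If the treatment works, symptoms should improve"
b = "The treatment works and symptoms have improved"

emb = model.encode([a, b], convert_to_tensor=True)
score = util.cos_sim(emb[0], emb[1]).item()
print(f"similarity = {score:.2f}")  # high despite speculation vs. confirmation
```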

Temporal order? Whatever order!

Embedding models see "She completed her degree before starting the job" and "She started her job before completing her degree" as NEARLY identical, with a ridiculous 0.97 similarity score. One is a traditional career path; the other is working while studying. Completely different situations!

I found this while building a resume screening system. The embeddings couldn’t distinguish between candidates who finished their degrees before working and those who were still completing studies. Hiring managers wasted hours interviewing candidates who didn’t meet their basic qualification requirements. Do you think busy recruiters appreciate having their time wasted with mismatched candidates? I am sure I wouldn’t want my hiring pipeline filled with noise.

Think about all the cases where sequence is crucial – medical treatment protocols, legal procedural requirements, cooking recipes, assembly instructions, and chemical formulations. When your model can’t tell “A before B” from “B before A,” you’ve lost fundamental causal relationships. We’re basically working with models that treat time as an optional concept despite analyzing text that’s full of critical sequential information.
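One mitigation I've been experimenting with (a sketch, not part of any production pipeline described here) is running an NLI cross-encoder over pairs the bi-encoder scores as near-identical: a contradiction verdict is a strong reason to down-rank the match. The model name and label order below follow the sentence-transformers documentation for cross-encoder/nli-deberta-v3-base; confirm them against the model card before relying on this.

```python
# Sketch: second-stage NLI check on pairs the bi-encoder calls near-identical.
# Label order follows the SBERT model card for this cross-encoder; verify it.
from sentence_transformers import CrossEncoder

nli = CrossEncoder("cross-encoder/nli-deberta-v3-base")
LABELS = ["contradiction", "entailment", "neutral"]

pair = (
    "She completed her degree before starting the job",
    "She started her job before completing her degree",
)

scores = nli.predict([pair])            # raw scores, shape (1, 3)
verdict = LABELS[scores.argmax(axis=1)[0]]
print(verdict)  # a "contradiction" verdict is a reason to down-rank the match
```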

Quantitative thresholds vanish into thin air

This one actually made me spill my coffee. Embedding models see “The company barely exceeded earnings expectations” and “The company significantly missed earnings expectations” as SHOCKINGLY similar – 0.93 similarity score. Exceeded versus missed! These mean opposite things in finance!

If you were building a financial news analysis system, the embeddings wouldn't distinguish between positive and negative earnings surprises, which is literally the difference between a stock price going up or down. Investors making trading decisions based on such summaries would be getting completely contradictory information. Do you think people risking actual money appreciate getting fundamentally wrong market signals? I am sure I wouldn't want my retirement account guided by such confusion.

Now, think about all the cases where crossing a threshold changes everything – passing vs. failing grades, healthy vs. dangerous vital signs, profitable vs. unprofitable businesses, compliant vs. non-compliant regulatory statuses. Your model loses its ability to make meaningful distinctions when it cannot distinguish between barely meeting the target and completely missing it. We’re basically working with models that don’t understand the concept of thresholds despite analyzing text that’s constantly discussing whether targets were met or missed.
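A cheap guardrail that has worked for me is a purely lexical direction check layered on top of the embedding score: if the query and the candidate sit on opposite sides of a known axis (exceeded vs. missed, longer vs. shorter), the match gets flagged no matter how high the cosine similarity is. The axes and word lists below are illustrative, not a curated vocabulary.

```python
# Sketch: lexical direction check as a guardrail over embedding similarity.
# The axes and word lists are illustrative and deliberately tiny.
import re

OPPOSITE_AXES = [
    ({"exceeded", "beat", "surpassed"}, {"missed", "undershot"}),
    ({"longer", "higher", "above", "over"}, {"shorter", "lower", "below", "under"}),
]

def direction_conflict(query: str, candidate: str) -> bool:
    q = set(re.findall(r"[a-z]+", query.lower()))
    c = set(re.findall(r"[a-z]+", candidate.lower()))
    for pos, neg in OPPOSITE_AXES:
        q_side = ("pos" if q & pos else "") + ("neg" if q & neg else "")
        c_side = ("pos" if c & pos else "") + ("neg" if c & neg else "")
        # Flag only when each text sits cleanly on one side and the sides differ.
        if {q_side, c_side} == {"pos", "neg"}:
            return True
    return False

print(direction_conflict(
    "The company barely exceeded earnings expectations",
    "The company significantly missed earnings expectations",
))  # True: flag or down-rank despite the 0.93 cosine similarity
```

The same axes also catch directional flips like "longer" vs. "shorter", which happens to be the next blind spot.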

Scalar inversions get completely flipped

The absurdity just keeps piling up. During testing, I found that “The meeting ran significantly shorter than planned” and “The meeting ran significantly longer than planned” scored a 0.96 similarity. I was in complete shock. These sentences describe completely opposite situations – time saved versus time wasted!

I encountered this with project management documents. The search couldn't distinguish between schedule overruns and time savings, so managers searching for examples of time-saving techniques were being shown projects with serious delays. Do you think executives tracking project timelines appreciate getting the exact opposite of the information they asked for? I am sure I would be furious if I were preparing for a board meeting with such backward data.

Think about all the cases where direction on a scale is crucial – cost savings vs. overruns, performance improvements vs. degradations, health improvements vs. declines, and risk increases vs. decreases. When your model treats “much higher than” as interchangeable with “much lower than”, you’ve lost the ability to track directional change. We’re basically working with models that don’t understand opposing directions despite analyzing text filled with comparative assessments.
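This is also the easiest blind spot to screen for automatically: take your own sentences, flip one scalar word to its antonym, and see whether the similarity barely moves. A minimal sketch of that kind of stress test follows; the swap table and threshold are illustrative only.

```python
# Sketch: antonym-swap stress test. A near-unchanged score after flipping a
# scalar word marks a blind spot for the model. Swap table and threshold are
# illustrative only.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-mpnet-base-v2")
SWAPS = {"shorter": "longer", "longer": "shorter", "higher": "lower", "lower": "higher"}

def stress_test(sentence: str, flag_above: float = 0.9) -> None:
    tokens = sentence.split()
    for i, tok in enumerate(tokens):
        if tok.lower() in SWAPS:
            flipped = " ".join(tokens[:i] + [SWAPS[tok.lower()]] + tokens[i + 1:])
            emb = model.encode([sentence, flipped], convert_to_tensor=True)
            score = util.cos_sim(emb[0], emb[1]).item()
            status = "BLIND SPOT" if score > flag_above else "ok"
            print(f"{score:.2f} {status}: {flipped}")

stress_test("The meeting ran significantly shorter than planned")
```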

Domain-specific opposites look like synonyms

Medical documents

I couldn’t believe what I was seeing in the healthcare tests. “The patient presents with tachycardia” versus “The patient presents with bradycardia” returned a 0.94 similarity score. For non-medical folks, that’s like confusing a racing heart with one that’s dangerously slow – conditions with opposite treatments!

I discovered this while working on a symptom-matching system for electronic health records. The embedding model couldn’t distinguish between fundamentally different medical conditions that require opposite treatments. Physicians searching for cases similar to a patient with a racing heart were shown cases of patients with dangerously slow heartbeats. Do you think doctors making time-sensitive decisions appreciate getting contradictory clinical information? I am sure I wouldn’t want my treatment based on the opposite of my actual condition.

In the field of medicine, these distinctions can have significant consequences. Tachycardia might be treated with beta-blockers, while bradycardia might require a pacemaker – giving the wrong treatment could be fatal. We’re basically working with models that can’t distinguish between opposite medical conditions despite analyzing text where this distinction determines appropriate care.
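The guardrail I ended up trusting here is not a smarter embedding but a dumb lookup: a table of clinical opposite pairs that vetoes a "similar case" suggestion whenever the query and the record sit on opposite sides of a pair. The pairs below are just examples; a real system would pull them from a curated ontology.

```python
# Sketch: antonym-pair veto for clinical retrieval. Pairs are illustrative;
# a production system would use a curated medical ontology.
CLINICAL_OPPOSITES = [
    ("tachycardia", "bradycardia"),
    ("hypertension", "hypotension"),
    ("hyperglycemia", "hypoglycemia"),
]

def opposite_condition(query: str, record: str) -> bool:
    q, r = query.lower(), record.lower()
    return any((a in q and b in r) or (b in q and a in r) for a, b in CLINICAL_OPPOSITES)

print(opposite_condition(
    "The patient presents with tachycardia",
    "The patient presents with bradycardia",
))  # True: never surface this as a similar case without a warning
```

The same pattern applies to the legal example below (plaintiff vs. defendant bearing the burden of proof); only the lookup table changes.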

Legal documents

The legal tests were just as bad. When comparing “Plaintiff bears the burden of proof” with “Defendant bears the burden of proof”, the model returned a staggering 0.97 similarity. Let that sink in. These statements literally determine which side has to prove their case in court! Mixing these up could lose you your lawsuit.

The search couldn’t distinguish between fundamentally different legal standards and responsibilities. Lawyers researching precedents about plaintiff burdens were shown cases discussing defendant burdens. Do you think attorneys preparing for trial appreciate getting precisely backward legal standards? I am sure I wouldn’t want my lawsuit built on completely inverted legal principles.

In legal contexts, who bears the burden of proof often determines the outcome of a case. When your model can’t distinguish which party has which responsibilities, you’ve undermined the entire basis of legal reasoning. We’re basically working with models that confuse legal roles despite analyzing text where these distinctions define how justice functions.

Units of measurement

I had to run this test multiple times because I couldn’t believe the results. “The procedure takes about 5 minutes” versus “The procedure takes about 5 hours” scored a whopping 0.97 similarity. Is this for real? That’s a 60x time difference! Imagine waiting for your “5-minute” appointment that actually takes 5 hours.

I found this while building the same healthcare system. The embeddings couldn’t distinguish between brief and lengthy procedures. Clinic managers trying to schedule short procedures were being shown lengthy operations that would block their surgery suites for entire days. Do you think medical facilities with tight scheduling constraints appreciate having their entire day’s workflow disrupted? I am sure I wouldn’t want my hospital running 60x behind schedule.

Units of measurement fundamentally change meaning. When your model treats “5 minutes” and “5 hours” as essentially identical, you’ve lost the ability to understand magnitude. We’re basically working with models that ignore units despite analyzing text where units determine whether something is trivial or significant.

More measurement problems

And it just gets worse from there. Using the same healthcare documents, I found that "The tumor is 2 centimeters in diameter" and "The tumor is 2 inches in diameter" scored an alarming 0.98 similarity. For context, that's the difference between a potentially minor tumor and one that's 2.54x larger, which is often the threshold between "watch and wait" and immediate surgery.

The embeddings couldn’t distinguish between metric and imperial measurements. Oncologists researching treatment options for small tumors were being shown cases of much larger growths. Do you think cancer specialists appreciate getting case studies that aren’t remotely comparable to their patients?

Even speed limits get confused. Models treat “Maintain speeds under 30 mph” and “Maintain speeds under 30 kph” as HIGHLY similar – a problematic 0.96 similarity score. That’s the difference between 30 and 18.6 miles per hour – enough to determine whether an accident is fatal!

Converting between units isn’t just a mathematical exercise – it fundamentally changes recommendations, safety parameters, and outcomes. We’re basically working with models that think numbers without units are sufficient despite analyzing text where the units completely transform the meaning.
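The fix I reach for here is equally unglamorous: extract quantities, convert them to canonical units, and refuse to treat two texts as comparable when their magnitudes differ wildly, whatever the cosine score says. Below is a minimal sketch with a deliberately tiny conversion table (a real system could lean on a library such as pint); the regex and unit coverage are illustrative only.

```python
# Sketch: normalize quantities to canonical units before trusting a match.
# The conversion table and regex are deliberately minimal and illustrative.
import re

TO_CANONICAL = {
    "minute": 1.0, "minutes": 1.0, "hour": 60.0, "hours": 60.0,   # durations -> minutes
    "centimeter": 1.0, "centimeters": 1.0, "cm": 1.0,             # lengths -> centimeters
    "inch": 2.54, "inches": 2.54,
    "kph": 1.0, "mph": 1.609,                                     # speeds -> km/h
}

def extract_quantities(text: str) -> list[float]:
    """Pull number-unit pairs and convert each to its canonical unit."""
    values = []
    for value, unit in re.findall(r"(\d+(?:\.\d+)?)\s*([a-zA-Z]+)", text):
        factor = TO_CANONICAL.get(unit.lower())
        if factor is not None:
            values.append(float(value) * factor)
    return values

print(extract_quantities("The procedure takes about 5 minutes"))  # [5.0]
print(extract_quantities("The procedure takes about 5 hours"))    # [300.0]
print(extract_quantities("The tumor is 2 inches in diameter"))    # [5.08]
```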

The Truth and the Results

Here is the comparison between msmarco-distilbert-base-tas-b, all-mpnet-base-v2, and OpenAI's text-embedding-3-large; you will notice that there is no significant difference between the outputs of these models.

***msmarco-distilbert-base-tas-b embedding score across different test cases***
***all-mpnet-base-v2 embedding score across different test cases***
***openai-text-embedding-3-large embedding score across different test cases***

Just to repeat…

Look, embeddings are amazingly useful despite these problems. I'm not advocating against using them; rather, it's crucial to approach them with caution. Here's my battle-tested advice after dozens of projects and countless failures:

  1. Test your model on real user language patterns before deployment. Not academic benchmarks, not sanitized test cases – actual examples of how your users communicate. We built a “linguistic stress test” toolkit that simulates common variations like negations, typos, and numerical differences. Every system we test fails in some areas – the question is whether those areas matter for your specific application.

  2. Build guardrails around critical blind spots. Different applications have different can’t-fail requirements. For healthcare, it’s typically negation and entity precision. For finance, it’s numbers and temporal relationships. For legal, it’s conditions and obligations. Identify what absolutely can’t go wrong in your domain, and implement specialized safeguards.

  3. Layer different techniques instead of betting everything on embeddings. Our most successful systems combine embedding-based retrieval with keyword verification, explicit rule checks, and specialized classifiers for critical distinctions. This redundancy isn't inefficient; it's essential. (A minimal sketch of this layering follows the list.)

  4. Be transparent with users about what the system can and can’t do reliably. We added confidence scores that explicitly flag when a result might involve negation, numerical comparison, or other potential weak points. Users appreciate the honesty, and it builds trust in the system overall.
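To make point 3 concrete, here is a rough sketch of what that layering can look like: the bi-encoder proposes candidates, and cheap deterministic checks (such as the direction_conflict and opposite_condition sketches from earlier sections) attach warnings instead of letting the cosine score speak alone. Function names and wiring are illustrative, not a production library.

```python
# Sketch: embedding retrieval plus deterministic guardrail checks. The check
# callables (e.g. direction_conflict, opposite_condition from the earlier
# sketches) are passed in; nothing here is production-ready.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-mpnet-base-v2")

def guarded_search(query: str, documents: list[str], checks, top_k: int = 5):
    doc_emb = model.encode(documents, convert_to_tensor=True)
    query_emb = model.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(query_emb, doc_emb, top_k=top_k)[0]
    results = []
    for hit in hits:
        doc = documents[hit["corpus_id"]]
        # Attach explicit warnings rather than silently trusting the cosine score.
        flags = [name for name, check in checks if check(query, doc)]
        results.append({"doc": doc, "score": hit["score"], "flags": flags})
    return results

# Example wiring, reusing the earlier sketches:
# results = guarded_search(query, corpus,
#                          checks=[("direction", direction_conflict),
#                                  ("clinical opposite", opposite_condition)])
```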

**Here’s the most important thing I’ve learnt:** these models don’t understand language the way humans do – they understand statistical patterns. When I stopped expecting human-like understanding and started treating them as sophisticated pattern-matching tools with specific blind spots, my systems got better. Much better.

The blind spots I’ve described aren’t going away anytime soon – they’re baked into how these models work. But if you know they’re there, you can design around them. And sometimes, acknowledging a limitation is the first step toward overcoming it.

Note: I have found many more such cases through my experiments and will cover them in a subsequent post.

The next article in this series will be out soon. Stay tuned!
