In my previous installment (READ HERE), I explored how embedding models struggle with basic language variations such as units of measurement and domain-specific sentences. The community response was remarkable – clearly, many of you have encountered similar challenges in your own work. Today, I’m expanding the investigation to uncover even more concerning blind spots I’ve identified through extensive testing. These fundamental issues have profound implications for how we approach AI system development.
This article is the third entry in my Hallucinations by Design series, building directly on our examination of embedding hallucinations. I strongly suggest reading the previous articles first (HERE and HERE); they establish the background needed to fully appreciate the issues discussed here.
Statistical significance gets completely reversed
My statistician colleague turned pale when I showed him this. The model rated “The results showed a significant difference (p<0.05)” and “The results showed no significant difference (p>0.05)” at 0.94 similarity. He just kept shaking his head. “That’s… those are opposites. That’s the whole point of statistical testing.”
The embeddings couldn’t distinguish between statistically significant and insignificant findings. Researchers searching for proven effects were shown a mix of validated and invalidated studies. Do you think scientists making research decisions appreciate confusing evidence with non-evidence? I am sure I wouldn’t want my research funding wasted pursuing effects that studies actually disproved.
Statistical significance is the cornerstone of empirical research. When your model can’t distinguish between “proven effect” and “no proven effect,” you’ve undermined the entire scientific method. We’re basically working with models that ignore p-values despite analyzing text where these values determine whether a finding is considered valid.
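If you want to reproduce this kind of comparison yourself, here is a minimal sketch, assuming the sentence-transformers package and the all-mpnet-base-v2 model (one of the models compared later in this post). The exact score will vary by model and version, but the pattern holds, and swapping in any other sentence pair from this article works the same way.

```python
# Minimal similarity check for a contrasting sentence pair.
# Requires: pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-mpnet-base-v2")

a = "The results showed a significant difference (p<0.05)"
b = "The results showed no significant difference (p>0.05)"

# Encode both sentences and compare them with cosine similarity.
emb_a, emb_b = model.encode([a, b], normalize_embeddings=True)
print(f"cosine similarity: {util.cos_sim(emb_a, emb_b).item():.2f}")
```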
Identity versus similarity gets muddled
My materials science colleague laughed out loud when I showed her this one. The model gave a 0.97 similarity score to “This material is aluminum” versus “This material resembles aluminum.” She stopped laughing when I explained how this was affecting her lab’s search results. Being something and looking like something are fundamentally different!
The embeddings couldn’t distinguish between materials that actually were a substance and those that merely looked or behaved similarly. Engineers searching for aluminum components were getting results for aluminum-like alloys with completely different properties. Do you think aerospace engineers appreciate getting information about aluminum-like materials when structural integrity depends on actual aluminum? I am sure I wouldn’t want to fly in a plane built with “sort of like aluminum” parts.
The distinction between identity and similarity is fundamental to precise communication. When your model treats “X is Y” as equivalent to “X resembles Y,” you’ve lost the ability to make definitive identifications. We’re basically working with models that blur categorical boundaries despite analyzing text where precise identification is often the entire point.
Presuppositions vanish into thin air
Now for a bit of philosophy, and this one also bothered me on a deep level. The embedding model rated “What caused the system to fail?” and “Did the system fail?” at a 0.93 similarity. That completely misses the point! The first question assumes the failure happened, while the second asks whether it happened at all. That’s Logic 101!
The embeddings couldn’t distinguish between questions presupposing a condition and questions asking if the condition existed. Support agents investigating system failures were getting mixed with cases questioning whether failures had occurred at all. Do you think IT managers trying to resolve incidents appreciate wasting time on cases where nothing actually broke? I am sure I wouldn’t want my support team chasing phantoms instead of fixing real problems.
Presuppositions fundamentally change what a sentence means. When your model treats “Why is X true?” as equivalent to “Is X true?”, you’ve lost the ability to understand what is being assumed versus questioned. We’re basically working with models that miss embedded assumptions despite analyzing text where these assumptions often contain critical information.
Percentages get completely inverted
Looking at pharmaceutical data gave me a headache that no embedding model could properly classify. Get this: “Only 5% of patients experienced side effects” versus “Up to 95% of patients experienced side effects” scored a 0.90 similarity. In what universe are those remotely the same? One’s a remarkably safe drug, the other would probably never get FDA approval!
I discovered this building a pharmaceutical research database. The algorithm couldn’t distinguish between dramatically different safety profiles. Researchers looking for treatments with minimal side effects were shown options with nearly universal adverse effects. Do you think doctors prescribing medications appreciate confusing remarkably safe drugs with overwhelmingly problematic ones? I am sure I wouldn’t want to take medication with a 95% side effect rate when I thought it was 5%.
Percentages express fundamentally different magnitudes that often determine risk assessments and decision-making. When your model treats “5%” as similar to “95%,” you’ve lost the ability to represent magnitude at all. We’re basically working with models that see percentages as decorative rather than substantive, despite analyzing text where these values drive critical decisions.
There are ways to fix these kinds of issues, and I will cover them later. For now, let’s look at more problems with embeddings.
Metaphors get taken literally
Remember when embedding models were supposed to understand context? Well, they don’t. “The market is climbing a wall of worry” versus “Rock climbers are scaling a worrying wall” scored a 0.89 similarity. Anyone with basic reading comprehension knows one’s a financial metaphor and the other is about actual rock climbing.
I discovered this building a financial news analysis system. The algorithm couldn’t distinguish between metaphorical and literal language. Investors searching for market analyses were getting mixed results including actual rock climbing stories. Do you think traders making investment decisions appreciate getting literal climbing articles mixed with financial analyses? I am sure I wouldn’t want my retirement savings influenced by articles about rock scaling techniques.
Metaphorical language is pervasive in specialized domains like finance, medicine, and law. When your model confuses metaphors with their literal interpretations, you’ve lost the ability to understand domain jargon. We’re basically working with models that miss figurative meaning despite analyzing text filled with specialized metaphors that domain experts instantly recognize.
Extensional versus intensional references get confused
I’m not an astronomer, but even I know this is wrong. “The Morning Star is visible at dawn” versus “The Evening Star is visible at dawn” scored a 0.93 similarity. Here’s the kicker – both refer to Venus, but only one statement is actually true! The Evening Star (Venus in the evening) isn’t visible at dawn, by definition.
The embeddings couldn’t distinguish between different ways of referring to the same object when those references carried different truth values. Astronomers searching for accurate observation times were shown contradictory information. Do you think researchers planning observations appreciate getting objectively false viewing times? I am sure I wouldn’t want to wake up at dawn to see something that’s only visible at dusk.
Different ways of referring to the same entity often carry different contextual implications. When your model treats all references to an entity as interchangeable, you’ve lost the ability to preserve context-dependent truth. We’re basically working with models that collapse referential distinctions despite analyzing text where these distinctions determine factual accuracy.
Domain-specific thresholds get completely missed
My doctor friend nearly had a heart attack when I showed her this test result. “The patient’s fever was 101°F” versus “The patient’s fever was 104°F” scored a 0.97 similarity. “Are you KIDDING me?” she shouted. “That’s the difference between ‘take some Tylenol’ and ‘get to the ER immediately’!”
The embeddings couldn’t distinguish between clinically significant temperature thresholds. Doctors searching for cases of dangerous fevers were getting mixed results including mild temperature elevations. Do you think emergency physicians appreciate getting non-urgent cases mixed with life-threatening ones? I am sure I wouldn’t want my dangerously ill child triaged as having a mild fever.
Domain-specific thresholds often represent critical decision boundaries. When your model treats “just above normal” the same as “critically elevated,” you’ve lost the ability to distinguish between routine and emergency situations. We’re basically working with models that see numbers as interchangeable despite analyzing text where small numerical differences represent hugely different clinical situations.
Date formats cause international incidents
Ever missed a deadline because of date format confusion? Our models do it consistently. “Submit your application by 12/10/2023” versus “Submit your application by 10/12/2023” scored an almost perfect 0.99 similarity. Depending on whether you’re in the US or Europe, those dates are two months apart!
The embeddings couldn’t distinguish between MM/DD and DD/MM formats. Students applying to universities were missing deadlines because date formats were interpreted differently across countries. Do you think applicants appreciate missing life-changing opportunities because of date format confusion? I am sure I wouldn’t want my future derailed because an AI couldn’t tell October from December.
Date format ambiguity isn’t just annoying – it can have legal, financial, and personal consequences. When your model treats different date formats as identical, you’ve introduced a cultural bias that particularly impacts international systems. We’re basically working with models that ignore date format conventions despite analyzing text where these distinctions can determine whether something is on time or hopelessly late.
The Truth and the Results
Here is the comparison between msmarco-distilbert-base-tas-b, all-mpnet-base-v2, and OpenAI’s text-embedding-3-large. You will notice that there is no meaningful difference between the outputs of these models.
msmarco-distilbert-base-tas-b embedding score across different test cases
all-mpnet-base-v2 embedding score across different test cases
openai-text-embedding-3-large embedding score across different test cases
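For anyone who wants to reproduce the comparison, here is a rough sketch. It assumes the sentence-transformers and openai Python packages, an OPENAI_API_KEY in your environment, and a small cosine helper written just for this illustration; the exact scores you get may differ slightly from the ones shown above.

```python
# Rough sketch: score a few contrasting pairs across the three models compared above.
# Requires: pip install sentence-transformers openai numpy
import numpy as np
from sentence_transformers import SentenceTransformer
from openai import OpenAI

PAIRS = [
    ("The results showed a significant difference (p<0.05)",
     "The results showed no significant difference (p>0.05)"),
    ("Only 5% of patients experienced side effects",
     "Up to 95% of patients experienced side effects"),
    ("Submit your application by 12/10/2023",
     "Submit your application by 10/12/2023"),
]

def cosine(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def local_scores(model_name):
    model = SentenceTransformer(model_name)
    return [cosine(*model.encode([x, y])) for x, y in PAIRS]

def openai_scores(model_name="text-embedding-3-large"):
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    scores = []
    for x, y in PAIRS:
        resp = client.embeddings.create(model=model_name, input=[x, y])
        scores.append(cosine(resp.data[0].embedding, resp.data[1].embedding))
    return scores

for name in ("msmarco-distilbert-base-tas-b", "all-mpnet-base-v2"):
    print(name, [round(s, 2) for s in local_scores(name)])
print("text-embedding-3-large", [round(s, 2) for s in openai_scores()])
```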
I cannot stress this enough
Look, embeddings are amazingly useful despite these problems. I’m not advocating against using them, but it’s crucial to approach them cautiously. Here’s my battle-tested advice after dozens of projects and countless failures:
- Test your model on real user language patterns before deployment. Not academic benchmarks, not sanitized test cases – actual examples of how your users communicate. We built a “linguistic stress test” toolkit that simulates common variations like negations, typos, and numerical differences (a simplified sketch of such a harness appears after this list). Every system we test fails in some areas – the question is whether those areas matter for your specific application.
- Build guardrails around critical blind spots. Different applications have different can’t-fail requirements. For healthcare, it’s typically negation and entity precision. For finance, it’s numbers and temporal relationships. For legal, it’s conditions and obligations. Identify what absolutely can’t go wrong in your domain, and implement specialized safeguards.
- Layer different techniques instead of betting everything on embeddings. Our most successful systems combine embedding-based retrieval with keyword verification, explicit rule checks, and specialized classifiers for critical distinctions (a sketch of one such layered check also appears below). This redundancy isn’t inefficient; it’s essential.
- Be transparent with users about what the system can and can’t do reliably. We added confidence scores that explicitly flag when a result might involve negation, numerical comparison, or other potential weak points. Users appreciate the honesty, and it builds trust in the system overall.
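To make the first point concrete, here is a minimal sketch of what a linguistic stress test can look like. The pair list, the run_stress_test name, and the 0.85 threshold are illustrative choices for this post, not fixed values from my toolkit; pick pairs and thresholds that reflect what actually matters in your domain.

```python
# Illustrative linguistic stress test: contrasting pairs that SHOULD score low.
# Requires: pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

# Each pair differs in a way the embedding must not ignore.
STRESS_PAIRS = [
    ("The results showed a significant difference (p<0.05)",
     "The results showed no significant difference (p>0.05)"),   # negated finding
    ("Only 5% of patients experienced side effects",
     "Up to 95% of patients experienced side effects"),          # inverted percentage
    ("The patient's fever was 101°F",
     "The patient's fever was 104°F"),                           # clinical threshold
    ("Submit your application by 12/10/2023",
     "Submit your application by 10/12/2023"),                   # date format
]

THRESHOLD = 0.85  # illustrative cut-off, not a universal constant

def run_stress_test(model_name: str) -> None:
    model = SentenceTransformer(model_name)
    for a, b in STRESS_PAIRS:
        score = util.cos_sim(*model.encode([a, b], normalize_embeddings=True)).item()
        status = "FAIL" if score >= THRESHOLD else "ok"
        print(f"[{status}] {score:.2f}  {a!r} vs {b!r}")

run_stress_test("all-mpnet-base-v2")
```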
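And here is one way the guardrail, layering, and transparency points can fit together around an embedding retriever. Everything in it (the rerank_with_guardrails name, the negation word list, the number regex, and the flag wording) is a simplified assumption for illustration, not the exact implementation from my projects.

```python
# Illustrative layered retrieval: embedding scores plus cheap symbolic checks,
# with explicit flags so users can see when a result touches a known weak spot.
import re
from dataclasses import dataclass, field

NEGATION_WORDS = {"no", "not", "never", "without"}  # illustrative, not exhaustive

@dataclass
class FlaggedResult:
    text: str
    score: float
    flags: list[str] = field(default_factory=list)

def has_negation(text: str) -> bool:
    tokens = re.findall(r"[a-z']+", text.lower())
    return any(t in NEGATION_WORDS or t.endswith("n't") for t in tokens)

def numbers_in(text: str) -> set[str]:
    return set(re.findall(r"\d+(?:\.\d+)?%?", text))

def rerank_with_guardrails(query: str, candidates: list[tuple[str, float]]) -> list[FlaggedResult]:
    """Attach warning flags to (text, embedding_score) candidates and demote flagged ones."""
    results = []
    for text, score in candidates:
        flags = []
        if has_negation(query) != has_negation(text):
            flags.append("negation mismatch")   # guardrail: polarity differs from the query
        q_nums, c_nums = numbers_in(query), numbers_in(text)
        if q_nums and not q_nums & c_nums:
            flags.append("numbers differ")      # keyword-style check on numeric values
        results.append(FlaggedResult(text, score, flags))
    # Demote flagged results rather than hiding them, and keep the flags visible to users.
    return sorted(results, key=lambda r: (len(r.flags), -r.score))

# Usage sketch: the scores would come from your embedding retriever.
hits = [
    ("Up to 95% of patients experienced side effects", 0.90),
    ("Only 5% of patients experienced side effects", 0.92),
]
for r in rerank_with_guardrails("Only 5% of patients experienced side effects", hits):
    print(f"{r.score:.2f} {r.flags or ''} {r.text}")
```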
Here’s the most important thing I’ve learned: these models don’t understand language the way humans do – they understand statistical patterns. When I stopped expecting human-like understanding and started treating them as sophisticated pattern-matching tools with specific blind spots, my systems got better. Much better.
The blind spots I’ve described aren’t going away anytime soon – they’re baked into how these models work. But if you know they’re there, you can design around them. And sometimes, acknowledging a limitation is the first step toward overcoming it.
Note: I have found many more such cases through my experiments, but I won’t be covering them here. Instead, in my subsequent posts, I will start covering solutions to each of the embedding problems described above.
The next article in the series will be coming out soon. Stay tuned!