AI text generators fail a classic test from psychology, and understandably so: they cannot reliably name the font color of a color word when the word and its color do not match. This was discovered by a US research team that had GPT-4o, Claude 3.5 Sonnet, GPT-5, Claude Opus 4.1 and Gemini 2.5 complete the so-called Stroop test.
If the word and its font color did not match, the AI models named the visible color correctly only when the list contained a few words. When they were presented with more, the error rate rose sharply. When matching and mismatching pairs of color and word were shown alternately, the AI models were consistently wrong. The team believes that the finding must be taken into account in the development of artificial general intelligence.
More words, significantly more errors
The Stroop test is used clinically to assess how well people can suppress an automatic reaction. Words are printed in colored font, and the test subjects are asked to say the font color while ignoring the meaning of the word. On average, they take a little longer to do this if the color and the meaning (“red”, “blue”) do not match, explains the research team led by Suketu Chandrakant Patel from the City University of New York. Nevertheless, they “can provide stable and highly precise performance even with long word lists.” The same cannot be said of the AI models examined; quite the opposite.
As the research group explains, the results are, as expected, best when the AI models have to name font colors that match the respective word. But even then performance drops at 40 words. If the two don’t match, naming only works for a few words: with ten words the hit rate drops to 60%, and with 40 it is below 20%. If the color sometimes matches and sometimes doesn’t, the models lose track completely, and with 40 words the hit rate drops to 0%. The models were completely correct under only one condition: when, instead of color words, neutral strings of “X” characters were presented, each as long as the respective color word.
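The four conditions described above can be illustrated with a short Python sketch. It is not the research team's code: the color list, the function names `make_trials` and `hit_rate`, the even/odd alternation in the mixed condition and the way the length of the "X" strings is chosen are all assumptions made for illustration.

```python
import random

# Hypothetical color set; the article does not list the colors used in the study.
COLORS = ["red", "blue", "green", "yellow"]


def make_trials(condition, n, seed=0):
    """Build n (word, font_color) pairs for one Stroop condition."""
    rng = random.Random(seed)
    trials = []
    for i in range(n):
        font_color = rng.choice(COLORS)
        other = rng.choice([c for c in COLORS if c != font_color])
        if condition == "congruent":
            word = font_color              # word and color match
        elif condition == "incongruent":
            word = other                   # word names a different color
        elif condition == "mixed":
            # Assumption: strict alternation of matching and mismatching pairs.
            word = font_color if i % 2 == 0 else other
        elif condition == "neutral":
            # Assumption: as many "X" characters as a randomly drawn color word has letters.
            word = "X" * len(other)
        else:
            raise ValueError(f"unknown condition: {condition}")
        trials.append((word, font_color))
    return trials


def hit_rate(trials, answers):
    """Share of trials in which the answer equals the font color."""
    if not trials:
        raise ValueError("no trials given")
    correct = sum(
        1 for (_, font_color), answer in zip(trials, answers)
        if answer.strip().lower() == font_color
    )
    return correct / len(trials)
```

A respondent that simply reads the words scores perfectly in the congruent condition and gets nothing right in the incongruent one, which is the automatic reaction the test asks subjects to suppress.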
The fact that AI models do not reliably determine what needs attention in a text and what does not is not new. The study, published in PNAS Nexus, has now confirmed that large language models (LLMs), like humans, are better practiced at reading words than at naming colors. People, however, can suppress reading and concentrate on naming the color even with long lists of words. The complete drop in performance of the AI models “points to fundamental limitations compared to biological attention.” Such control mechanisms are fundamental for achieving artificial general intelligence. It could also save computing power if AI models could more reliably ignore non-essential information.
(my)
