A Century-Old Psychology Trick Just Revealed What’s Missing in Today’s Most Advanced AI
In A Nutshell
- Researchers tested two leading AI systems, GPT-4o and Claude 3.5 Sonnet, on the Stroop task, a classic brain exercise that measures the ability to stay focused under competing demands.
- Both AI models performed well on short word lists but collapsed dramatically on longer ones, with accuracy falling to nearly zero in some conditions.
- Newer models including GPT-5, Claude Opus 4.1, and Gemini 2.5 Pro showed the same core weakness in a follow-up test, suggesting the problem is architectural rather than fixable by simply building bigger models.
- Researchers say AI lacks the conflict-monitoring system the human brain uses to detect interference and adjust attention, a gap they argue must be addressed to achieve truly general artificial intelligence.
Most adults can look at the word “RED” printed in blue ink and say “blue” without much trouble. For some of today’s leading AI language models, that same simple task exposes a surprising gap that, according to new research, may not be solved by making models bigger or giving them more data.
Researchers from Queens College at The City University of New York and Texas A&M University published a study in PNAS Nexus testing two leading AI systems, OpenAI’s GPT-4o and Anthropic’s Claude 3.5 Sonnet, on a famous psychological exercise called the Stroop task. Dating to 1935, the test asks participants to name the ink color a word is printed in while ignoring what the word itself says. When those two things conflict, such as the word “BLUE” written in red ink, most people slow down a little but stay accurate. Both AI systems handled short lists in much the same way. As the lists grew longer, however, their performance collapsed, in some conditions almost entirely. The authors argue this failure reveals something fundamental about what today’s AI is actually missing.
At its core, the experiment probes a mental ability called executive control: the brain’s capacity to hold a goal in mind, notice when competing pieces of information are pulling in different directions, and resolve the conflict. It is not about intelligence or vocabulary. It is the mental machinery that keeps a person focused even when something keeps tugging their attention away. Modern AI, the researchers contend, is built without anything resembling this mechanism.
How Researchers Put AI Through the Stroop Test
For the study, researchers used the classic Stroop framework but added a twist: instead of one word at a time, they presented lists of varying lengths, 1, 5, 10, 20, and 40 words, to observe how performance held up under increasing demands. On short lists, both models looked impressively human, performing worse on mismatched word-color pairs than on matching ones, which is exactly what happens with people. Things began unraveling quickly as the lists grew longer.
The interference effect, meaning the drop in performance on mismatched word-color pairs, is so well documented in humans that it is widely regarded as a gold standard for studying how the brain manages competing information. Unlike people, who stay largely accurate no matter how long the list gets, the AI models hit a wall.
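To make the design concrete, here is a minimal Python sketch of how matched and mismatched lists of the lengths reported above could be built and scored. It is an illustration only, not the researchers' code: the four-color set, the `(word, ink)` pair representation, and the per-item scoring rule are all assumptions, since the article does not describe the exact stimulus format or scoring procedure.

```python
import random

# Assumed color set; the article does not say which colors the study used.
COLORS = ["red", "blue", "green", "yellow"]

# List lengths reported in the article.
LIST_LENGTHS = [1, 5, 10, 20, 40]


def make_stroop_list(length, congruent, rng):
    """Build a list of (word, ink) pairs.

    congruent=True  -> the word names its own ink color (matching).
    congruent=False -> the word names a different color (mismatched).
    """
    items = []
    for _ in range(length):
        ink = rng.choice(COLORS)
        if congruent:
            word = ink
        else:
            word = rng.choice([c for c in COLORS if c != ink])
        items.append((word.upper(), ink))
    return items


def score(items, responses, task):
    """Fraction of items answered correctly (assumed per-item scoring).

    task="color" expects the ink color; task="word" expects the word itself.
    """
    targets = [ink if task == "color" else word.lower() for word, ink in items]
    correct = sum(r == t for r, t in zip(responses, targets))
    return correct / len(items)
```

Under this sketch, a responder that simply reads each word aloud scores perfectly on the word-reading task but gets every mismatched item wrong on the color-naming task, which is the failure pattern the interference effect is designed to expose.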
The Numbers Tell a Damaging Story for AI Chatbots
GPT-4o’s accuracy on the mismatched color-naming task dropped from 91% on five-word lists to 57% on ten-word lists, fell to 22% at 20 words, and reached just 15% at 40 words. Claude 3.5 Sonnet held on longer, maintaining 76% accuracy at 20 words, before dropping sharply to 24% at 40 words, a fall of 52 percentage points. In some conditions, accuracy fell to nearly zero.
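The size of each step down can be checked with simple arithmetic. The short Python sketch below uses only the percentages quoted above; the dictionary layout and the `drops` helper are illustrative names, not anything taken from the study.

```python
# Accuracy (%) on the mismatched color-naming task, as quoted in the article.
# Keys are list lengths in words. Only the figures the article reports appear.
accuracy = {
    "GPT-4o": {5: 91, 10: 57, 20: 22, 40: 15},
    "Claude 3.5 Sonnet": {20: 76, 40: 24},
}


def drops(series):
    """Return (from_length, to_length, percentage-point drop) for each step."""
    lengths = sorted(series)
    return [(a, b, series[a] - series[b]) for a, b in zip(lengths, lengths[1:])]


for model, series in accuracy.items():
    for start, end, drop in drops(series):
        print(f"{model}: {start} -> {end} words, down {drop} points")
# GPT-4o: 5 -> 10 words, down 34 points
# GPT-4o: 10 -> 20 words, down 35 points
# GPT-4o: 20 -> 40 words, down 7 points
# Claude 3.5 Sonnet: 20 -> 40 words, down 52 points
```

Taken end to end, GPT-4o lost 76 percentage points between five-word and 40-word lists, with most of the loss occurring by the 20-word mark.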
Meanwhile, the word-reading task stayed close to perfect for both models across all list lengths. These models were not struggling in a general sense. They were specifically failing at the part requiring an override of automatic reading and sustained focus on a goal.
Research on human performance cited in the paper shows people maintain roughly 95% accuracy regardless of list length, with some studies recording 97% accuracy on tasks with lists up to 1,500 words.
Newer AI Models Show the Same Core Weakness
To check whether more recently developed systems had moved past this limitation, the team ran a smaller, exploratory follow-up on three newer models: GPT-5, Claude Opus 4.1, and Gemini 2.5 Pro. The sample was notably smaller than in the primary study (n=5 versus n=30), but the results were consistent. All three showed the same core weakness under sustained conflicting demands.
Gemini 2.5 Pro handled word-color conflict better than the others but stumbled on a control condition involving strings of the letter “X” with no meaningful words attached. Researchers interpret that as a sign that even improved performance depends on recognizing familiar word patterns rather than genuine task-focused control. GPT-5, when allowed to use an internal reasoning mode, simply wrote and executed code to solve the task, a workaround the researchers say sidesteps the problem entirely rather than solving it. All of this, the authors argue, points to a structural problem baked into how these systems are designed, not a shortcoming that will disappear by scaling up.
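To see why handing the task to code counts as sidestepping it, consider how little a program needs to do. The Python sketch below is a hypothetical illustration, not the code GPT-5 produced: it assumes each item is written as text like `RED (ink: blue)`, a format invented here because the article does not describe the actual prompt. The program extracts the ink label and never engages with the word, so there is no conflict to resolve and list length makes no difference.

```python
import re

# Hypothetical text encoding of one item: WORD (ink: color).
# The real prompt format used in the study is not described in the article.
ITEM = re.compile(r"(?P<word>[A-Za-z]+)\s*\(ink:\s*(?P<ink>[a-z]+)\)")


def name_inks(stimulus):
    """Return the ink color of every item, ignoring what the word says."""
    return [match.group("ink") for match in ITEM.finditer(stimulus)]


example = "RED (ink: blue), BLUE (ink: red), XXXX (ink: green)"
print(name_inks(example))  # ['blue', 'red', 'green']
```

Because the answer is read directly from a labeled field, this kind of solution says nothing about whether the model itself can hold a goal steady against a competing habit.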
What the Human Brain Has That AI Does Not
Every major AI assistant available today runs on the transformer architecture, a mathematical framework that, according to the researchers, has no equivalent to the conflict-monitoring system the human brain uses to detect interference and stay on task. That gap showed up clearly in the data: the more conflicting information these models had to juggle, the faster they fell apart.
AI excels at drawing on vast stores of learned information, passing professional exams and generating fluent text. But overriding a strong automatic habit while holding a competing goal steady under pressure is a different kind of challenge entirely, and one these systems have not cracked. Getting AI to that level, the authors conclude, may require building something fundamentally new: a system with the ability to notice conflict, maintain focus, and actually adjust course.
Disclaimer: This article is based on a published academic study. Study findings and conclusions are those of the researchers and do not necessarily reflect the views of this publication. Study sample sizes, methodologies, and results may vary.
Paper Notes
Limitations
The study’s authors acknowledge that their follow-up evaluation of newer AI models, GPT-5, Claude Opus 4.1, and Gemini 2.5 Pro, used a notably smaller sample (n=5) than the primary study (n=30), which limits the strength of conclusions drawn from that portion. The researchers also point out that the AI models were tested in a text-based format, and that the way these systems process color information through language may differ in important ways from how humans perceive color visually. Transformer models process color terms as language tokens rather than as actual perceptual input, which may contribute to the performance imbalance between word reading and color naming. The primary analysis covers only two specific models, and while the follow-up testing suggested the findings generalize, the results cannot be assumed to apply to all AI architectures.
Funding and Disclosures
The authors declare no funding and no competing interests.
Publication Details
Authors: Suketu Chandrakant Patel (Department of Psychology, Queens College, The City University of New York), Hongbin Wang (College of Medicine, Texas A&M University), and Jin Fan (Department of Psychology, Queens College, The City University of New York) | Journal: PNAS Nexus, Volume 5, Issue 6 | Paper Title: “Deficient executive control in transformer attention” | Published: June 2, 2026 | DOI: https://doi.org/10.1093/pnasnexus/pgag149







