Science Has A New Explanation For Why You Mishear People In A Crowd
In A Nutshell
- MIT researchers built an AI model to solve the “cocktail party problem” of following one voice in a crowd, and it behaved almost exactly like a human listener, including the mistakes.
- Both humans and the AI struggled most when two voices were similar in pitch and vocal quality, pointing to confusion errors that may sometimes be unavoidable rather than signs of poor attention.
- The AI independently replicated known human listening patterns (like doing better with opposite-sex or foreign-language distractors) without ever being told to.
- The findings have practical implications for hearing aid and cochlear implant research, offering a new tool for predicting how people struggle with speech in noise.
You’re at a crowded party, straining to follow one conversation while someone else talks nearby, when the voice you were tracking suddenly seems to swap with the wrong one. For a moment, you report words the other person said. Most people chalk it up to distraction or tired ears. A new MIT study has a more unsettling explanation: that slip may at times be unavoidable, a built-in consequence of how human voices overlap. And an AI built to solve the same listening challenge made exactly the same mistakes.
Researchers at the Massachusetts Institute of Technology built an artificial intelligence model to tackle what scientists call the “cocktail party problem”: understanding one speaker when several are competing for your attention at once. Rather than programming it to copy human behavior, they optimized it purely for the task and let it find whatever solution worked best on its own. What came back was a system that not only matched human listening performance across dozens of realistic listening conditions but showed strikingly similar errors, triggered in the same situations and at comparable rates.
Published in the journal Nature Human Behaviour, the results carry an uncomfortable suggestion: some of the moments when human attention breaks down may reflect an efficient solution to a difficult problem, not just a flaw.
Both the brain and the model work through the same basic strategy. When a listener hears a brief sample of a target voice, the brain builds a mental template of that voice’s characteristics (its pitch, its tone, its location in space), then uses that template to amplify matching features in the noisy mix. Neuroscientists had long suspected as much, but it had not previously been demonstrated in a model that could match human behavior across real-world conditions.
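To make that template-and-gain idea concrete, here is a minimal Python sketch. It is not the authors’ implementation; the function and array names are illustrative, and it assumes the mixture is represented as non-negative feature channels (for example, spectrogram bands):

```python
import numpy as np

def selective_gain(mix, cue, sharpness=4.0):
    """Toy feature-gain attention: rescale each channel of a noisy mixture
    by how well it matches a template built from a cue clip of the target.

    mix: (n_channels, n_frames) non-negative features of the mixture
    cue: (n_channels,) average feature profile of the cue clip alone
    """
    template = cue / (cue.sum() + 1e-8)          # normalized target profile
    profile = mix.mean(axis=1)
    profile = profile / (profile.sum() + 1e-8)   # the mixture's own profile

    # Channels prominent in the template relative to the mixture get
    # gain > 1; mismatched channels are suppressed. `sharpness` keeps the
    # rescaling gentle, since a hard mask would distort the target too.
    gains = (template / (profile + 1e-8)) ** (1.0 / sharpness)
    return gains[:, None] * mix
```

The failure mode falls out of the same arithmetic: when a distractor’s feature profile overlaps the template, the gains boost both voices at once, which is exactly the confusion pattern described below.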
How Selective Listening Goes Wrong
To build the model, the MIT team trained it on nearly four million audio examples, each pairing a short clip of a target voice with a two-second mix of that voice layered with distractors; the model’s job was to identify the middle word spoken by the target. Training sounds were placed in simulated rooms with realistic echoes, covering a broad range of real-world conditions.
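As a rough sketch of that task structure (all names here, `CocktailExample`, `make_example`, `room_ir`, are hypothetical rather than from the released code, and the room acoustics are simplified to a single convolution):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class CocktailExample:
    cue: np.ndarray      # short clip of the target voice alone
    mixture: np.ndarray  # ~2 s of the target layered with a distractor
    label: str           # the middle word spoken by the target

def make_example(cue_clip, target_speech, distractor_speech, room_ir, middle_word):
    """Assemble one example: place both voices in the same simulated room,
    sum them, and attach the word the model must report."""
    def reverb(x):
        return np.convolve(x, room_ir)[: len(x)]  # crude room simulation

    n = min(len(target_speech), len(distractor_speech))
    mixture = reverb(target_speech[:n]) + reverb(distractor_speech[:n])
    return CocktailExample(cue=cue_clip, mixture=mixture, label=middle_word)
```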
Across several experiments, from a 195-person online study to smaller in-person groups tested with a hemispherical loudspeaker array, the model’s performance tracked human behavior almost step for step. Tellingly, the researchers measured not just how often people got the right word but how often they reported a word spoken by the wrong voice entirely.
Confusion rates were low overall but climbed sharply in two conditions: when the target voice became harder to hear relative to the distractor, and when both voices belonged to speakers of the same sex. Male and female voices differ enough in pitch and vocal quality that the brain can usually keep them apart. When both voices are similar, the mental template built for the target overlaps heavily with the distractor, and errors spike. The researchers concluded that “some selection failures are an inevitable consequence of target-distractor feature similarity,” though they note that attention lapses and other factors also contribute. When two voices are acoustically close enough, even an optimized system struggles to fully untangle them.
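The logic behind that conclusion can be captured in a toy decision model (the numbers are purely illustrative, not the paper’s analysis): if a listener simply reports whichever voice better matches the target template, confusions climb both as the distractor’s feature overlap grows and as the target gets quieter:

```python
import numpy as np

rng = np.random.default_rng(0)

def confusion_rate(overlap, snr_db, trials=20_000):
    """Fraction of trials where the distractor out-matches the template.

    overlap: how much the distractor's features resemble the target's (0-1)
    snr_db:  target level relative to the distractor
    """
    snr = 10 ** (snr_db / 20)
    # Noisy evidence that each voice matches the target template
    target_match = snr * 1.0 + rng.normal(0, 0.5, trials)
    distractor_match = overlap + rng.normal(0, 0.5, trials)
    return float(np.mean(distractor_match > target_match))

# Similar voices (think same-sex pairs) and low SNR both drive confusions up:
for overlap, snr in [(0.3, 6), (0.3, -6), (0.9, 6), (0.9, -6)]:
    print(f"overlap={overlap:.1f}, SNR={snr:+d} dB -> "
          f"{confusion_rate(overlap, snr):.0%} confusions")
```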
Why Similar Voices Trip Up Both Human and AI Listeners
That finding runs deeper than it might seem. People also understand speech more easily when the competing voice speaks an unfamiliar language. A Mandarin-speaking distractor is far easier for a native English speaker to filter out than an English one, likely because unfamiliar speech shares fewer useful acoustic features with the target. Both humans and the model showed the same advantage. Both also struggled more with voices whose natural frequency structure was artificially disrupted, and both dropped sharply in performance when voices were whispered, stripping away the pitch cues the brain relies on to separate talkers.
Spatial cues followed the same pattern. Both benefited when the target and distractor came from different locations, and both were fooled by an auditory illusion, often attributed to what researchers call the precedence effect, in which a brief delay causes the brain to perceive a sound as coming from somewhere other than its physical source. That neither humans nor the model could resist the illusion points to something fundamental: spatial attention in hearing is built on perceptual shortcuts shaped by a world full of echoes and reflections, and those shortcuts come with hard limits baked in.
Where in the Brain Selective Listening Actually Happens
Beyond behavior, the model shed light on where in the brain’s processing chain attention does its heavy lifting. Neuroscience research has shown that attentional enhancement of a target voice tends to appear relatively late in the auditory pathway, in higher-order regions rather than at the earliest stages of hearing. When the researchers analyzed which processing layers in the AI showed the clearest separation between target and distractor, they found a similar late-stage pattern. Early layers carried information about voice identity and location, while the selective boost that separates target from distractor only became visible toward the end.
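One way to picture that layer analysis, as a rough sketch with synthetic activations standing in for the model’s real ones: compute a d′-style separation index per layer and watch it stay near zero early on, then jump at the attention-gated stage:

```python
import numpy as np

rng = np.random.default_rng(1)

def separation(target_acts, distractor_acts):
    """d'-style index: how cleanly a layer's activations distinguish
    target-driven responses from distractor-driven ones."""
    diff = target_acts.mean(axis=0) - distractor_acts.mean(axis=0)
    pooled = 0.5 * (target_acts.var(axis=0) + distractor_acts.var(axis=0))
    return float(np.mean(np.abs(diff) / np.sqrt(pooled + 1e-8)))

# Synthetic stand-ins: an early layer encodes both voices about equally;
# a late, attention-gated layer boosts the target and suppresses the rest.
early_target     = rng.normal(1.0, 1.0, (200, 64))
early_distractor = rng.normal(1.0, 1.0, (200, 64))
late_target      = rng.normal(2.0, 1.0, (200, 64))
late_distractor  = rng.normal(0.5, 1.0, (200, 64))

print("early layer:", separation(early_target, early_distractor))  # ~0
print("late layer: ", separation(late_target, late_distractor))    # large
```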
Models with attention applied only at the earliest or latest stage fell well short of matching human behavior. Spreading it across multiple stages, the way both the brain and the full model do, turned out to be essential. That convergence offers a concrete hypothesis: the timing of attention in the human auditory system may not be arbitrary biology. It may be where attention has to happen for the job to get done.
For researchers working on hearing aids and cochlear implants, these results have real bearing on practical problems. Difficulty following conversations in noise is among the most disabling aspects of hearing loss, and a model that predicts human confusion patterns with this level of accuracy could become a useful tool for testing hearing technologies before they ever reach a patient.
Human attention evolved to handle a genuinely hard problem, one where voices overlap, rooms add echoes, and multiple people talk at once in nearly every social setting people occupy. That an AI, with no exposure to human biology and no instruction to act like a person, ends up solving that problem the same way, with the same limits, suggests the brain’s approach to listening may be one of the most effective ways to solve it.
Paper Notes
Limitations
The researchers note several boundaries on their findings. The model was tested in a controlled task structure where attention was always cued by a prior audio sample of the target voice, a setup that does not fully capture the variety of ways real-world attention can be directed, such as following a voice described in words or tracking a sound that changes over time. The framework also does not account for flexible executive control, the ability listeners have to adjust the strength of attention based on perceived task difficulty or effort. While the model’s architecture was loosely inspired by the biology of the auditory system, it deviates from biological sensory systems in many ways, limiting its use as a direct model of specific neural mechanisms. Human participants also showed a practice effect across spatial listening experiments that the model, whose weights were fixed after training, could not replicate.
Funding and Disclosures
This research was supported by National Institutes of Health grant R01 DC017970. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. The authors declare no competing interests.
Publication Details
This study was authored by Ian M. Griffith, R. Preston Hess, and Josh H. McDermott of the Department of Brain and Cognitive Sciences and the McGovern Institute for Brain Research at the Massachusetts Institute of Technology. Griffith and McDermott are also affiliated with the Program in Speech and Hearing Biosciences and Technology at Harvard University, and McDermott with the Center for Brains, Minds, and Machines at MIT. The paper, “Optimized feature gains explain and predict successes and failures of human selective listening,” was published in Nature Human Behaviour on March 13, 2026. DOI: https://doi.org/10.1038/s41562-026-02414-7. Data and code are publicly available via OSF (https://osf.io/wjzvu) and GitHub (https://github.com/mcdermottLab/auditory_attention).