Scientists Found A Gap Between What The Brain Heard And What Was Consciously Noticed
In A Nutshell
- After just 12 minutes of passive exposure to labeled AI and human voices, the brain showed measurable differences in how it processed synthetic versus real speech, even though conscious detection ability barely improved.
- Modern AI voice synthesis closely mimics the broad character of human speech but may fall short at reproducing the rapid, subtle acoustic fluctuations that the brain quietly registers.
- A gap between neural sensitivity and conscious awareness (called a neural-behavioral dissociation) means most people remain unreliable at identifying AI-generated voices, even when their auditory systems have begun adapting to them.
- Researchers suggest that longer or more targeted training protocols could eventually help people consciously access what their brains are already picking up, with potential applications in defending against voice-based fraud.
Somewhere between the ear and conscious awareness, something gets lost. A new study found that after just 12 minutes of passive exposure to labeled AI and human voices, the brain begins processing them as measurably distinct categories. Consciously, though, participants remained essentially unable to tell the difference. Their ability to correctly identify AI voices barely changed. Their brain activity, on the other hand, told a different story.
Published in eNeuro by researchers at Tianjin University and the Chinese University of Hong Kong, the study used brain recordings to track how the auditory system responds to AI-generated speech before and after a brief training session, then compared that neural data to what participants consciously reported. What emerged was a clear disconnect: the brain was quietly adapting to synthetic voices in ways that never surfaced as better detection ability. Researchers call it a neural-behavioral dissociation, and understanding it may hold the key to building training programs that can help people catch voice deepfakes before they cause harm.
“Our study shows that even when listeners cannot behaviorally distinguish AI-generated voices from real human voices, brief perceptual training enables their brains to detect subtle acoustic differences,” the authors wrote. Given how rapidly AI voice technology is advancing, and how easily it can be used for fraud and impersonation, closing that gap between brain and behavior is now a practical concern as much as a scientific one.
How Researchers Built and Tested the AI-Generated Voices
Three native Mandarin speakers, two women and one man, each recorded 67 short sentences. Those recordings were then fed into GPT-SoVITS, a widely available open-source voice-cloning tool, to produce two types of synthetic speech per speaker. For one version, the model was fine-tuned on each speaker’s own recordings, producing a close imitation of that voice. The second was generated without additional fine-tuning, relying only on short audio samples; it still sounded human but bore a weaker resemblance to the specific speaker.
Thirty adults between ages 20 and 32, all native Mandarin speakers with no neurological history, participated while wearing a 64-electrode EEG cap that records electrical brain activity in real time. In the first session, they listened to 297 randomly ordered sentences drawn from all three voice types and pressed a button after each one to label the speaker as human or AI. No feedback was given on their guesses.
Then came the training phase. Participants heard nine longer audio clips, one per speaker-voice combination, explicitly labeled as either human or AI. No instructions told them what to listen for. The whole thing lasted roughly 12 minutes. After that, a second test session began with a fresh set of sentences.
Why AI Voice Detection Fails at the Conscious Level
Behavioral results were discouraging but not surprising. Participants performed poorly at distinguishing human from AI speech in both sessions. Statistical analysis confirmed that training produced no significant improvement in conscious discrimination ability. What did shift was strategy: after training, participants became more likely to label voices as AI-generated overall, a sign of increased caution rather than sharpened skill.
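The distinction between caution and skill is the same one signal detection theory draws between response bias and sensitivity. The short sketch below uses made-up hit and false-alarm rates, not the study's data, to show how a listener can start answering "AI" more often without actually getting better at telling the voices apart; the paper's own statistics may differ.

```python
# Illustrative only: signal detection theory separates sensitivity (d')
# from response bias (criterion c). The rates below are invented, not the
# study's data, and simply show how "more AI responses" can appear without
# any real gain in discrimination.
from statistics import NormalDist

def dprime_and_criterion(hit_rate, false_alarm_rate):
    """Return (d', c) from hit and false-alarm rates (0 < rate < 1)."""
    z = NormalDist().inv_cdf                              # inverse standard-normal CDF
    d_prime = z(hit_rate) - z(false_alarm_rate)           # sensitivity
    criterion = -0.5 * (z(hit_rate) + z(false_alarm_rate))  # bias; negative = more "AI" answers
    return d_prime, criterion

# Hypothetical pre-training rates: hits = "AI" responses to AI voices,
# false alarms = "AI" responses to human voices.
print(dprime_and_criterion(0.55, 0.50))   # low d', near-neutral criterion
# Hypothetical post-training rates: both rates rise together, so d' barely
# moves while the criterion shifts toward answering "AI" more often.
print(dprime_and_criterion(0.65, 0.60))
```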
Part of what makes conscious detection so difficult may come down to the acoustic properties of the voices themselves. Analysis of the speech recordings revealed that AI-generated voices differ from human ones in the fine, rapid fluctuations that characterize natural speech, the micro-level variations in how a voice moves through individual sounds. Modern AI synthesis does an impressive job mimicking the broad, overall character of a human voice, but it may fall short in precisely reproducing these moment-to-moment dynamics. These acoustic differences may contribute to why listeners struggle to identify synthetic voices, though the study did not establish this as a definitive cause.
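To get a feel for what those rapid fluctuations are, the sketch below shows one common way to quantify them: extract a recording's amplitude envelope and measure how much of its energy sits at fast modulation rates. It is a generic illustration, not the authors' analysis pipeline, and the file names are placeholders.

```python
# A minimal sketch of one common way to quantify rapid amplitude fluctuations
# in speech: extract the amplitude envelope and examine its modulation spectrum.
# Generic illustration only; "human.wav" and "ai.wav" are placeholder files.
import numpy as np
from scipy.io import wavfile
from scipy.signal import hilbert, welch

def modulation_spectrum(path):
    rate, audio = wavfile.read(path)
    audio = audio.astype(np.float64)
    if audio.ndim > 1:                       # mix down to mono if stereo
        audio = audio.mean(axis=1)
    envelope = np.abs(hilbert(audio))        # instantaneous amplitude envelope
    # Power spectrum of the envelope: energy at higher modulation frequencies
    # reflects faster moment-to-moment amplitude changes.
    freqs, power = welch(envelope, fs=rate, nperseg=rate)
    return freqs, power

for label, path in [("human", "human.wav"), ("AI", "ai.wav")]:
    freqs, power = modulation_spectrum(path)
    fast = (freqs >= 10) & (freqs <= 50)     # a rough "fast fluctuation" band
    print(label, "relative power 10-50 Hz:", power[fast].sum() / power.sum())
```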
Where the Brain and Behavior Split on Deepfake Voice Detection
EEG recordings told a story the behavioral data couldn’t. Using a method called temporal response function analysis, which tracks how closely the brain’s electrical activity follows the contours of incoming sound over time, researchers compared neural responses to human and AI voices before and after training. Before training, no meaningful neural distinctions emerged between voice types. After training, the brain showed clear, statistically significant differences in how it processed human versus AI speech at approximately 55 milliseconds, 210 milliseconds, and 455 milliseconds following each sound, spanning early acoustic processing all the way through to higher-level interpretation.
In plain terms, after just 12 minutes of labeled exposure, the brain had begun responding differently to AI and human voices, even as listeners kept pressing the wrong button.
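Temporal response function analysis, in essence, asks how well time-lagged, weighted copies of the speech signal can predict the EEG recording, yielding a response profile across lags like the 55, 210, and 455 millisecond windows reported above. The sketch below is a simplified single-channel version using ridge regression on synthetic data; it illustrates the general idea, not the study's exact implementation.

```python
# Simplified single-channel temporal response function (TRF): ridge regression
# from time-lagged copies of the speech envelope onto an EEG channel.
# The synthetic data below stands in for real recordings; this is a sketch of
# the general method, not the authors' implementation.
import numpy as np

def estimate_trf(envelope, eeg, fs, min_lag_ms=0, max_lag_ms=500, ridge=1.0):
    """Return lag times (ms) and TRF weights mapping envelope -> EEG."""
    lags = np.arange(int(min_lag_ms * fs / 1000), int(max_lag_ms * fs / 1000) + 1)
    # Design matrix: each column is the envelope shifted by one lag.
    X = np.zeros((len(envelope), len(lags)))
    for j, lag in enumerate(lags):
        X[lag:, j] = envelope[:len(envelope) - lag]
    # Ridge-regularized least squares: w = (X'X + lambda*I)^-1 X'y
    XtX = X.T @ X + ridge * np.eye(len(lags))
    w = np.linalg.solve(XtX, X.T @ eeg)
    return lags * 1000 / fs, w

# Synthetic example at 128 Hz: the "brain" echoes the envelope ~200 ms later,
# so the recovered TRF should peak near that lag.
fs = 128
rng = np.random.default_rng(0)
envelope = rng.standard_normal(fs * 60)
eeg = np.roll(envelope, int(0.2 * fs)) + 0.5 * rng.standard_normal(len(envelope))
lag_ms, trf = estimate_trf(envelope, eeg, fs)
print("TRF peaks at ~", lag_ms[np.argmax(np.abs(trf))], "ms")
```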
Broader analyses of brain wave patterns and spatial electrical activity across the scalp found no significant differences between voice types, suggesting the training effect was specific to how the brain tracks fine acoustic detail in real time rather than reflecting widespread changes in neural activity.
What Short-Term Training Could Mean for Catching AI Voice Fakes
The neural data suggests the auditory system may already register subtle differences between human and AI speech, even when listeners cannot consciously act on them. Rather than building a detection skill from scratch, future training programs might help listeners learn to make conscious use of acoustic cues their brains already pick up.
Twelve minutes of passive, labeled exposure was enough to reshape brain responses but not enough to change behavior. Researchers suggest that longer training, or protocols designed to direct a listener’s attention toward the acoustic cues that distinguish human from synthetic speech, could eventually bridge that gap. Whether hearing a familiar person’s voice cloned by AI would make detection easier or harder is an open question the study’s design could not address, but it is one with obvious real-world stakes.
AI voice-cloning tools are already being used to impersonate relatives, employers, and public figures, and most people tested under controlled conditions cannot reliably identify them. The auditory system’s sensitivity to synthetic voices, quiet as it is, may offer a foundation for training programs that haven’t yet been built. Getting that sensitivity into conscious awareness is the next challenge.
Paper Notes
Limitations
This study involved a small, demographically narrow sample of 30 native Mandarin-speaking adults ages 20 to 32, all from a university setting. Findings may not generalize to older adults, non-native speakers, or people with varying hearing ability. All speakers were unfamiliar to participants before the experiment, so the results do not address whether familiarity with a cloned voice would improve or complicate detection. Speech stimuli were short, context-free sentences, which may have limited what participants could draw on behaviorally. Whether longer or richer speech materials would improve conscious detection remains untested. Only one AI synthesis tool, GPT-SoVITS, was used, and results may differ with other voice-cloning systems. The training session was brief and passive, and the study did not test whether more intensive or targeted protocols could close the gap between neural sensitivity and behavioral performance.
Funding and Disclosures
This work was supported by the Improvement on Competitiveness in Hiring New Faculties Funding Scheme at the Chinese University of Hong Kong (grant number 4937113), awarded to corresponding author Xiangbin Teng. Authors report no conflicts of interest.
Publication Details
Authors: Jinghan Yang, Haoran Jiang, Yanru Bai, Guangjian Ni, and Xiangbin Teng | Affiliations: Academy of Medical Engineering and Translational Medicine and State Key Laboratory of Advanced Medical Materials and Devices, Tianjin University, Tianjin, China; Haihe Laboratory of Brain-computer Interaction and Human-machine Integration, Tianjin, China (Yang, Jiang, Bai, Ni); Department of Psychology and Brain and Mind Institute, The Chinese University of Hong Kong, Hong Kong SAR, China (Teng) | Journal: eNeuro | Paper Title: “Short-Term Perceptual Training Modulates Neural Responses to Deepfake Speech but Does Not Improve Behavioral Discrimination” | DOI: https://doi.org/10.1523/ENEURO.0300-25.2026 | Status: Peer-reviewed and accepted; Early Release version. Received August 12, 2025; Revised February 8, 2026; Accepted February 11, 2026.