Forgetful AI Learns Grammar Better Than Standard Models, But Can’t Predict How Humans Read
In A Nutshell
- Researchers built an AI that mimics human memory by making older words fade in importance, and found it learned grammar better than a standard AI.
- The improvement held across multiple tests, with the biggest gains in rules like subject-verb agreement that depend on nearby words.
- Despite better grammar scores, the forgetful AI was worse at predicting how long real humans pause on words while reading.
- Researchers ruled out two leading explanations for the gap, leaving the underlying cause an open question.
Human memory has a strange quirk. The moment words are processed, the brain starts letting them go. By the time a sentence ends, the exact phrasing from just a few words back has already begun to fade. For decades, some scientists have argued that this forgetfulness isn’t a flaw but an aid that actually helps people learn language. A new study put that idea to the test inside artificial intelligence, and the results were mixed in a way few expected.
Researchers at the University of Amsterdam built an AI language model that crudely mimics this feature of human memory: older words fade in importance. When trained on a human-sized collection of text, it learned grammar better than a standard AI with a more expansive memory. But when tested on whether it could predict how long a real human takes to read a given word, a standard measure of human-like language processing, it did worse. Better grammar, worse reading-time prediction: an unexpected mismatch that has left researchers without a clean explanation.
Published in the journal Transactions of the Association for Computational Linguistics, the study takes on a question that has nagged at researchers since AI language models became powerful enough to challenge human performance. Does the way a brain forgets actually make it smarter?
A Classic Idea, Tested on Modern AI
A famous 1993 study by scientist Jeffrey Elman found that artificial neural networks learned the rules of a made-up language more effectively when their memory was restricted. A brain forced to let go of specific words would be pushed to find deeper patterns instead, like learning the underlying rules of grammar rather than just memorizing which words tend to follow other words. This concept, sometimes called “less is more,” became a cornerstone of how scientists think about children learning to talk.
But then came modern AI. Large language models, including those built on the architecture behind tools like ChatGPT, do not impose the kind of systematic forgetting that the modified model does. Older words are not made to fade simply because time has passed, and yet these standard models still learn language remarkably well. Was Elman’s idea actually right, or had modern AI complicated it just by existing?
Rather than studying powerful commercial systems, where too many factors are at play, the Amsterdam team built their own smaller models and ran them in tightly matched pairs, varying one thing: how quickly the model’s attention to past words fades.
How the Forgetful AI Was Built and Tested
Using a scaled-down version of the GPT-2 architecture, the researchers trained their models on the BabyLM dataset: about 10 million words in its smaller version, approximating a young child’s language exposure, and around 100 million words in its larger version.
An early attempt at adding memory decay backfired. When forgetting started from the very first word, the AI began making spelling errors and struggled with basic patterns, because even immediate within-word connections were being disrupted. Human memory doesn’t work that way; people hold the most recent few words in sharp detail before older material begins to fade. So the team built that buffer in, keeping the most recent three to seven words intact before the decay kicked in.
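The buffer-then-decay idea can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the exponential shape, the `rate` value, the function names, and the choice to rescale already-normalized attention weights are all assumptions made for the example, and real models operate on tokens rather than whole words.

```python
import math

def decay_weights(num_positions, buffer_size=5, rate=0.3):
    """Retention weight for each position, as seen from the last position.

    Distance 0 is the current word. Positions closer than `buffer_size`
    keep full weight; older ones decay exponentially. The exponential
    shape is an assumption for illustration, not the paper's function.
    """
    weights = []
    for position in range(num_positions):
        distance = (num_positions - 1) - position
        if distance < buffer_size:
            weights.append(1.0)  # inside the intact buffer
        else:
            weights.append(math.exp(-rate * (distance - buffer_size + 1)))
    return weights

def apply_decay(attention_row, buffer_size=5, rate=0.3):
    """Scale one row of attention weights by the decay, then renormalize to sum to 1."""
    decay = decay_weights(len(attention_row), buffer_size, rate)
    scaled = [a * d for a, d in zip(attention_row, decay)]
    total = sum(scaled)
    return [s / total for s in scaled]

# Start from uniform attention over ten words and keep the last three intact.
uniform = [1 / 10] * 10
faded = apply_decay(uniform, buffer_size=3, rate=0.5)
```

With a buffer of three, the last three positions in `faded` keep equal weight, while each earlier position receives progressively less.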
With that adjustment, the results flipped. Across ten separate training runs, the forgetful AI consistently outperformed the standard version on a broad language modeling evaluation and on BLiMP, a standardized grammar test that checks whether a model correctly prefers grammatical sentences over ungrammatical ones. Gains were especially visible for subject-verb agreement, the kind of rule that depends on nearby words. The improvement held up on the larger dataset too.
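A grammar test of this kind comes down to comparing two scores per sentence pair. The sketch below shows the scoring logic only, under stated assumptions: `minimal_pair_accuracy` is a hypothetical helper, the sentences are written for this example rather than taken from BLiMP, and the log-probabilities in `toy_scores` are invented numbers standing in for a real model's output.

```python
def minimal_pair_accuracy(pairs, sentence_logprob):
    """Fraction of pairs where the grammatical sentence gets the higher score.

    `pairs` holds (grammatical, ungrammatical) sentence tuples and
    `sentence_logprob` maps a sentence to the model's log-probability.
    """
    correct = sum(
        1 for good, bad in pairs
        if sentence_logprob(good) > sentence_logprob(bad)
    )
    return correct / len(pairs)

# Invented log-probabilities standing in for a language model (illustration only).
toy_scores = {
    "The dogs bark.": -12.1,
    "The dogs barks.": -15.8,
    "The girl near the trees smiles.": -24.9,
    "The girl near the trees smile.": -23.7,
}

pairs = [
    ("The dogs bark.", "The dogs barks."),
    ("The girl near the trees smiles.", "The girl near the trees smile."),
]

accuracy = minimal_pair_accuracy(pairs, toy_scores.__getitem__)
print(accuracy)  # 0.5: the toy scorer gets the first pair right and the second wrong
```

In the second pair the plural noun "trees" sits right next to the verb, the sort of distractor that agreement tests are designed to probe.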
The Twist Nobody Fully Predicted
Better grammar scores raised an obvious follow-up question: was the forgetful AI also better at mimicking how humans actually read? Researchers tested this by checking whether the model could predict how long a person pauses on a given word. When a word is surprising in context, readers slow down, and a well-calibrated language model should capture that pattern.
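How surprising a word is can be expressed as surprisal, the negative log of the probability the model assigned to that word given its context. The sketch below is a hedged illustration: the probabilities are invented for the example, `surprisal_bits` is a hypothetical helper name, and the article does not describe the statistical analysis the study used to relate such values to reading times.

```python
import math

def surprisal_bits(probability):
    """Surprisal in bits: the less probable the word, the larger the value."""
    return -math.log2(probability)

# Invented next-word probabilities for one sentence (illustration only).
toy_probabilities = [
    ("the", 0.5),
    ("cat", 0.125),
    ("meowed", 0.25),
    ("sonatas", 0.001953125),  # 2 ** -9, a highly unexpected word
]

surprisals = [(word, surprisal_bits(p)) for word, p in toy_probabilities]
for word, bits in surprisals:
    print(f"{word}: {bits:.1f} bits")
```

Under the idea described above, the word with the highest surprisal ("sonatas" in this toy case) is where readers would be expected to slow down the most.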
The forgetful AI was not better at this. Performance on reading-time prediction got worse, not better, across two datasets: one from a self-paced reading study of 181 participants, and another tracking eye movements from 10 people reading news articles.
Two leading explanations were tested and ruled out. One holds that large models trained on superhuman quantities of text memorize patterns in ways humans never could, but these models were trained on human-scale data. A second points to over-memorization of rare words. Rare words did show up as a factor, but the telltale signature of that kind of memorization wasn’t present. Whatever is driving the gap remains unexplained.
Learning Like a Person Is Not the Same as Thinking Like One
Study authors note that their results apply specifically to the human-scale data range they studied and may not extend to systems trained on vastly larger datasets. Nothing here says commercial chatbots would improve if made more forgetful; the experiment used smaller, research-specific models far removed from systems like ChatGPT.
But the core finding holds: a constraint inspired by human biology made an AI learn grammar better, in a measurable and repeatable way. Yet that more human-like AI predicted human reading behavior worse than the standard version. Building a machine that learns like a person and building one that behaves like a person may be two very different problems.
Paper Notes
Limitations
The authors acknowledge several important boundaries around their findings. The study was conducted specifically within a developmentally plausible data range, roughly 10 million to 100 million words, and the authors note that the benefits of fleeting memory would likely diminish at the massive scales used in commercial AI training, where models can discover local language patterns from sheer data volume alone. Training data was specific to the BabyLM corpus, which skews toward spoken and child-directed language, meaning the results may not generalize to text types with longer structural dependencies, such as academic papers, novels, or programming code. The memory decay function is described by the authors as “a crude approximation of human memory” and does not account for the content-sensitive nature of real human forgetting, in which emotionally or informationally significant words may persist longer than predictable ones. The echoic memory buffer, the short span of recent words kept intact before decay begins, was added after observing poor performance in the naive decay model. That introduces a degree of post-hoc adjustment to the design, even though the buffer was grounded in human memory research and was not optimized against performance metrics.
Funding and Disclosures
This work was partly funded by the Dutch Research Council (NWO) under a Veni grant to co-author M. Heilbron. No other funding sources or conflicts of interest are mentioned in the paper.
Publication Details
Authors: Abishek Thamma (University of Amsterdam, Amsterdam Brain and Cognition; Vrije Universiteit Amsterdam, Department of Informatics) and Micha Heilbron (University of Amsterdam, Amsterdam Brain and Cognition; Max Planck Institute for Psycholinguistics) | Journal: Transactions of the Association for Computational Linguistics, Volume 14, pp. 877–892, 2026 | Paper Title: “Human-like Fleeting Memory Improves Language Learning but Impairs Reading Time Prediction in Transformer Language Models” | DOI: https://doi.org/10.1162/TACL.a.688 | Action Editor: Dilek Hakkani-Tur | Submission: July 2025; Revision: January 2026; Published: June 2026 | Licensed under CC-BY 4.0 by the Association for Computational Linguistics.







