Forgetful AI Learns Grammar Better Than Standard Models, But Can’t Predict How Humans Read

In A Nutshell

  • Researchers built an AI that mimics human memory by making older words fade in importance, and found it learned grammar better than a standard AI.
  • The improvement held across multiple tests, with the biggest gains in rules like subject-verb agreement that depend on nearby words.
  • Despite better grammar scores, the forgetful AI was worse at predicting how long real humans pause on words while reading.
  • Researchers ruled out two leading explanations for the gap, leaving the underlying cause an open question.

Human memory has a strange quirk. The moment words are processed, the brain starts letting them go. By the time a sentence ends, the exact phrasing from just a few words back has already begun to fade. For decades, some scientists have argued that this forgetfulness is not a flaw but an aid that helps people learn language. A new study put that idea to the test inside artificial intelligence, and the results were mixed in a way the researchers did not fully anticipate.

Researchers at the University of Amsterdam built an AI language model that crudely mimics this feature of human memory: older words fade in importance. When trained on a human-sized collection of text, it learned grammar better than a standard AI with a more expansive memory. But when tested on whether it could predict how long a real human takes to read a given word, a standard measure of human-like language processing, it did worse. Better grammar, worse reading-time prediction: an unexpected mismatch that has left researchers without a clean explanation.

Published in the journal Transactions of the Association for Computational Linguistics, the study takes on a question that has nagged at researchers since AI language models became powerful enough to challenge human performance. Does the way a brain forgets actually make it smarter?

A Classic Idea, Tested on Modern AI

A famous 1993 study by cognitive scientist Jeffrey Elman found that artificial neural networks learned the rules of a made-up language more effectively when their memory was restricted. The reasoning: a brain forced to let go of specific words would be pushed to find deeper patterns instead, learning the underlying rules of grammar rather than just memorizing which words tend to follow other words. This concept, sometimes called “less is more,” became a cornerstone of how scientists think about children learning to talk.

But then came modern AI. Large language models, including those built on the architecture that powers tools like ChatGPT, do not impose the kind of systematic forgetting that the modified model does. Older words are not made to fade simply because time has passed, and yet these standard models still learn language remarkably well. Was Elman’s idea actually right, or had the success of modern AI undercut it?

Rather than studying powerful commercial systems, where too many factors are at play, the Amsterdam team built their own smaller models and ran them in tightly matched pairs, varying one thing: how quickly the model’s attention to past words fades.

How the Forgetful AI Was Built and Tested

Using a scaled-down version of the GPT-2 architecture, the researchers trained their models on the BabyLM dataset: about 10 million words in its smaller version, approximating a young child’s language exposure, and around 100 million words in its larger version.

An early attempt at adding memory decay backfired. When forgetting started from the very first word, the AI began making spelling errors and struggled with basic patterns, because even immediate within-word connections were being disrupted. Human memory doesn’t work that way; people hold the most recent few words in sharp detail before older material begins to fade. So the team built that buffer in, keeping the most recent three to seven words intact before the decay kicked in.
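The buffer-then-decay idea can be illustrated with a minimal sketch. It assumes an exponential decay applied to one row of attention weights; the function names, the decay rate, and the exponential form are all hypothetical choices made here for illustration, and the paper's actual decay function may differ.

```python
import math

def retention(distance, buffer_size=3, decay_rate=0.5):
    """Retention weight for a token `distance` positions back (0 = current token).

    The `buffer_size` most recent tokens are kept intact (weight 1.0);
    older tokens fade exponentially. Both the exponential form and the
    default parameters are assumptions, not values from the paper.
    """
    if distance < 0:
        raise ValueError("distance must be non-negative")
    excess = max(0, distance - buffer_size + 1)
    return math.exp(-decay_rate * excess)

def fade_attention(attn_row, buffer_size=3, decay_rate=0.5):
    """Rescale one row of attention weights so older tokens count for less.

    attn_row[j] is the weight the current (last) token places on token j.
    The faded weights are renormalized so they still sum to 1.
    """
    current = len(attn_row) - 1
    faded = [
        weight * retention(current - j, buffer_size, decay_rate)
        for j, weight in enumerate(attn_row)
    ]
    total = sum(faded)
    return [weight / total for weight in faded]

# Six tokens that would otherwise receive equal attention.
uniform = [1 / 6] * 6
faded = fade_attention(uniform)
```

In this sketch the three most recent tokens end up with equal weight, while each earlier token receives progressively less. Setting `buffer_size=0` makes every past token fade, which corresponds to the buffer-free setup described above as having backfired.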

With that adjustment, the results flipped. Across ten separate training runs, the forgetful AI consistently outperformed the standard version on a broad language modeling evaluation and on BLiMP, a standardized grammar test that checks whether a model correctly prefers grammatical sentences over ungrammatical ones. Gains were especially visible for subject-verb agreement, the kind of rule that depends on nearby words. The improvement held up on the larger dataset too.
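A test like BLiMP scores a model on minimal pairs: two sentences that differ in one detail, only one of which is grammatical. The sketch below shows the scoring logic with a toy bigram model standing in for the neural network. The corpus, the sentence pairs, and all function names are made up for illustration and are not taken from the study or from BLiMP itself.

```python
import math
from collections import Counter

def train_bigram(corpus):
    """Count bigrams in a tiny corpus (a toy stand-in for a real language model)."""
    context_counts, bigram_counts, vocab = Counter(), Counter(), set()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split()
        vocab.update(tokens)
        for prev, cur in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return context_counts, bigram_counts, len(vocab)

def sentence_logprob(sentence, model):
    """Log-probability of a sentence under the add-one-smoothed bigram model."""
    context_counts, bigram_counts, vocab_size = model
    tokens = ["<s>"] + sentence.split()
    return sum(
        math.log((bigram_counts[(prev, cur)] + 1) / (context_counts[prev] + vocab_size))
        for prev, cur in zip(tokens, tokens[1:])
    )

def minimal_pair_accuracy(pairs, model):
    """Fraction of pairs where the grammatical sentence gets the higher score."""
    correct = sum(
        sentence_logprob(good, model) > sentence_logprob(bad, model)
        for good, bad in pairs
    )
    return correct / len(pairs)

# Hypothetical training text and subject-verb agreement pairs.
corpus = ["the dog barks", "the dogs bark", "the cat sleeps", "the cats sleep"]
model = train_bigram(corpus)
pairs = [
    ("the dog barks", "the dog bark"),
    ("the cats sleep", "the cats sleeps"),
]
accuracy = minimal_pair_accuracy(pairs, model)
```

The toy model gets both pairs right because the verb sits directly next to its subject, the kind of nearby-word dependency where the article reports the largest gains.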

The Twist Nobody Fully Predicted

Better grammar scores raised an obvious follow-up question: was the forgetful AI also better at mimicking how humans actually read? Researchers tested this by checking whether the model could predict how long a person pauses on a given word. When a word is surprising in context, readers slow down, and a well-calibrated language model should capture that pattern.
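How surprising a word is can be expressed as surprisal, the negative log of the probability a model assigns to that word in context. The sketch below uses made-up probabilities for the word following "The dog chewed the"; the numbers are illustrative assumptions, not model outputs or data from the study.

```python
import math

def surprisal_bits(probability):
    """Surprisal in bits: the less probable the word, the higher the value."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(probability)

# Hypothetical next-word probabilities after "The dog chewed the".
next_word_probs = {"bone": 0.5, "stick": 0.25, "violin": 0.015625}
surprisals = {word: surprisal_bits(p) for word, p in next_word_probs.items()}
```

Under these assumed probabilities, "violin" carries six times the surprisal of "bone". A reading-time analysis then asks whether words with higher model surprisal are also the words readers linger on.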

The forgetful AI captured it less well than the standard model. Performance on reading-time prediction got worse, not better, across two datasets: one from a self-paced reading study of 181 participants, and another tracking eye movements from 10 people reading news articles.

Two leading explanations were tested and ruled out. One holds that large models trained on superhuman quantities of text memorize patterns in ways humans never could, but these models were trained on human-scale data. A second points to over-memorization of rare words. Rare words did show up as a factor, but the telltale signature of that kind of memorization wasn’t present. Whatever is driving the gap remains unexplained.

Learning Like a Person Is Not the Same as Thinking Like One

Study authors note that their results apply specifically to the human-scale data range they studied and may not extend to systems trained on vastly larger datasets. Nothing here says commercial chatbots would improve if made more forgetful; the experiment used smaller, research-specific models far removed from systems like ChatGPT.

But the core finding holds: a constraint inspired by human biology made an AI learn grammar better, in a measurable and repeatable way. Yet that more human-like AI predicted human reading behavior worse than the standard version. Building a machine that learns like a person and building one that behaves like a person may be two very different problems.


Paper Notes

Limitations

The authors acknowledge several important boundaries around their findings. The study was conducted specifically within a developmentally plausible data range, roughly 10 million to 100 million words, and the authors note that the benefits of fleeting memory would likely diminish at the massive scales used in commercial AI training, where models can discover local language patterns from sheer data volume alone. Training data was specific to the BabyLM corpus, which skews toward spoken and child-directed language, meaning the results may not generalize to text types with longer structural dependencies, such as academic papers, novels, or programming code. The memory decay function is described by the authors as “a crude approximation of human memory” and does not account for the content-sensitive nature of real human forgetting, in which emotionally or informationally significant words may persist longer than predictable ones. The echoic memory buffer was added after observing poor performance in the naive decay model, introducing a degree of post-hoc adjustment to the design, even though it was grounded in human memory research and was not optimized against performance metrics.

Funding and Disclosures

This work was partly funded by the Dutch Research Council (NWO) under a Veni grant to co-author M. Heilbron. No other funding sources or conflicts of interest are mentioned in the paper.

Publication Details

Authors: Abishek Thamma (University of Amsterdam, Amsterdam Brain and Cognition; Vrije Universiteit Amsterdam, Department of Informatics) and Micha Heilbron (University of Amsterdam, Amsterdam Brain and Cognition; Max Planck Institute for Psycholinguistics) | Journal: Transactions of the Association for Computational Linguistics, Volume 14, pp. 877–892, 2026 | Paper Title: “Human-like Fleeting Memory Improves Language Learning but Impairs Reading Time Prediction in Transformer Language Models” | DOI: https://doi.org/10.1162/TACL.a.688 | Action Editor: Dilek Hakkani-Tur | Submission: July 2025; Revision: January 2026; Published: June 2026 | Licensed under CC-BY 4.0 by the Association for Computational Linguistics.
