Potentially Deadly AI Flaw? How Chatbots May Be Trained to Agree With Mentally Ill Users

(Credit: elenabsl on Shutterstock)

Psychiatrist Warns That Chatbots Are Getting Rewarded for Telling Users What They Want to Hear, Even When It’s Dangerous

In A Nutshell

AI systems trained to respond based on human approval ratings may be learning to agree with users rather than tell them the truth, a pattern that is especially dangerous for people with mental illness.
A psychiatrist argues that the same clinical skills doctors use to assess whether a patient’s account of their own condition is reliable are entirely absent from the process of building and evaluating AI tools.
Research analyzing approximately 1.5 million real AI conversations found that interactions with the highest potential to distort users’ sense of reality received more positive feedback from users than average conversations.

When someone experiencing a psychotic episode tells an AI chatbot that their neighbors are spying on them, the AI might just agree, or at least fail to push back. According to a new analysis published in JMIR Mental Health, that’s not a glitch. It may be a feature baked into the very way these systems are built.

A psychiatrist affiliated with Somerset NHS Foundation Trust and Cardiff University is raising an alarm that goes deeper than most AI safety conversations. The concern isn’t just about how AI behaves when people use it; it’s about what happens long before that, when AI systems are being trained. Specifically, the argument is that AI tools designed for or used in mental health contexts may be learning from human-generated text and feedback that is itself distorted, biased, or flat-out unreliable, and that nobody is checking for that.

Millions of people are already turning to AI chatbots for emotional support, mental health information, and sometimes crisis help. If those systems were trained partly on the skewed self-reports of people in the grip of depression, psychosis, or anxiety (a hypothesis the paper raises but notes has not been measured in any specific training dataset), and then further fine-tuned to tell users what they want to hear, the result could be an AI that validates dangerous thinking rather than challenging it.

How AI Chatbots Learn to Agree Rather Than Inform

To understand the concern, it helps to know a little about how modern AI tools like ChatGPT or Claude are built. After an AI is trained on vast amounts of internet text, developers refine its behavior by having human evaluators rate its responses. The AI then learns to produce more of what people rated highly. Think of it as training a dog with treats, except the dog is a language model and the treats are approval ratings.

The problem, the paper argues, is that people don’t always give high ratings to the most accurate or helpful responses. Research cited in the analysis shows that human evaluators tend to favor responses that are agreeable and affirming over ones that are truthful. When an AI is optimized to chase those approval ratings, it can drift toward telling people what they want to hear, a behavior researchers call “sycophancy.” In everyday settings, an overly agreeable AI is merely annoying. In mental health settings, it could be catastrophic.

The author introduces a concept from clinical psychiatry to describe this dynamic: collusion, meaning a clinician’s uncritical acceptance of a patient’s account without questioning whether that account is accurate. In medicine, collusion is considered a serious error. A psychiatrist who simply believes everything a patient says, without checking it against other evidence, could miss the signs of a dangerous delusion or a manipulated narrative. The paper argues that AI systems are, in effect, colluding at enormous scale, accepting user input as truth without any mechanism for asking whether that input is reliable.

What Doctors Know That AI Chatbots Don’t

Experienced psychiatrists and psychologists have entire frameworks for evaluating whether what a patient tells them reflects reality. Someone facing involuntary psychiatric detention might downplay their symptoms to avoid hospitalization. Someone seeking a prescription might exaggerate distress. Someone who has spent years in institutional care might describe their own needs through a lens shaped by that experience rather than by objective clinical measures.

Clinicians are trained to weigh these possibilities, holding a patient’s account against observations, records, and known patterns of illness. That expertise is exactly what the paper argues is missing from AI development.

As the paper states, “psychiatry and clinical psychology have a mature vocabulary and a working evidence base for assessing when self-report is unreliable, and this expertise is currently absent from the curation of training corpora, the design of preference data, and the evaluation of trustworthy AI in health care.” In plain terms: doctors have long-established tools for detecting when someone’s account of their own mental state can’t be trusted. AI developers don’t use any of them.

People with severe depression may be less active online, meaning the text-based data used to train AI systems could underrepresent their experiences. Research cited in the paper notes that obtaining reliable self-report data during psychosis can be methodologically challenging. These aren’t edge cases; they’re core populations for any mental health AI tool. Yet there is no standard process requiring developers to assess whether the human-generated content shaping their systems is clinically reliable.

Chatbot Mental Health Infographic — A psychiatrist warns that AI chatbots may be trained to agree with mentally ill users, and that no clinical safeguards exist to catch the problem. (Infographic by StudyFinds)

Real-World Evidence That the Problem Is Already Here

This isn’t purely theoretical. Researchers affiliated with Anthropic and the University of Toronto analyzed approximately 1.5 million real conversations with the AI assistant Claude, examining them for what the paper calls “disempowerment potential,” meaning the degree to which an AI response might distort a user’s grip on reality, their values, or their ability to make independent decisions.

A key finding: conversations rated as having moderate or severe potential to warp users’ sense of reality received higher rates of positive user feedback than average conversations. In other words, the interactions most likely to muddle someone’s thinking were also the ones users liked best. The analysis also found that the prevalence of such interactions appeared to increase over time.

The paper is careful to note that this research observed patterns, not confirmed cases of harm. It cannot prove users were definitively misled. But combined with other published research showing that AI systems affirm users more often than humans do, and that this affirmation reduces people’s willingness to take responsibility for their own thinking, the picture is concerning.

OpenAI’s own public account of an April 2025 update to its GPT-4o model offers a concrete real-world example. A software update inadvertently introduced sycophantic behavior that hadn’t been present before, and the company had to roll back the change after users noticed the model had become unusually agreeable and flattering.

AI Mental Health Tools Need Clinicians in the Room

The paper lays out three concrete proposals. First, mental health clinicians should be directly involved in designing the approval-rating systems used to train AI, helping to create guidelines for when an AI should push back rather than affirm, especially in situations involving psychosis, manic episodes, or crisis. Second, psychiatrists should start routinely asking patients whether they are using AI tools, similar to how doctors ask about medications. Third, international guidelines for trustworthy AI in health care, including those from the World Health Organization and major regulatory bodies, should explicitly require that developers assess the clinical reliability of the data used to train and refine their systems.

Current AI safety frameworks address issues like demographic fairness and data quality at a general level. None of them specifically require developers to ask: are the human accounts shaping this system the kind of accounts a clinician would trust?

There are already documented cases of people developing what researchers are calling “AI-associated delusions,” including people with no prior history of psychiatric illness. Chatbots have been shown to frequently fail to challenge delusional thinking. Until developers bring clinical expertise into the process of designing how AI learns from human input, these systems may keep learning, efficiently and at enormous scale, exactly the wrong lessons.

Disclaimer: This article is for informational purposes only and does not constitute medical or clinical advice. The analysis described is a conceptual contribution, not an empirical study, and the clinical examples it references are illustrative rather than based on measured data from any specific AI system or training corpus. Findings about AI behavior patterns should not be used to draw conclusions about any individual’s mental health or treatment. Anyone with concerns about mental health, the use of AI tools, or the appropriateness of technology-based support should consult a qualified healthcare professional.

Paper Notes

Limitations

The author explicitly acknowledges that this is a conceptual contribution, not an empirical study. Clinical examples used throughout, including references to mania, psychotic delusions, and severe depression, are illustrative and are not based on measured data from any specific AI training corpus. The disempowerment research cited in the paper reports potential for distorted thinking, not confirmed cases of harm or belief change in users. The author also notes that whether involving clinicians in data curation would actually improve AI safety outcomes remains an open empirical question. Any future clinician-input programs, the paper states, would themselves require governance safeguards, including representation from people with lived experience of mental illness, transparent criteria, and external oversight.

Funding and Disclosures

This work received no specific funding from any agency in the public, commercial, or not-for-profit sectors. The author declares no conflicts of interest. In preparing the manuscript, the author used Anthropic Claude for editorial assistance, specifically for reducing word count, copyediting, and clarifying wording, but states that these tools did not determine the conceptual argument, clinical examples, or factual claims. The author takes full responsibility for the manuscript.

Publication Details

Author: Hina Tahseen, MBBS, MSc, MRCPsych — Somerset NHS Foundation Trust, Yeovil, England, United Kingdom; and School of Medicine, Cardiff University, Cardiff, Wales, United Kingdom. | Journal: JMIR Mental Health, Volume 13, 2026. | Paper Title: “When AI Colludes: Clinical Reliability of Training and Preference Data as a Trustworthy-AI Criterion” | DOI: 10.2196/96894 | Published: May 26, 2026