Even the Best AI Chatbot Gets Health Questions Wrong 1 in 5 Times, Doctors Find

ChatGPT, Claude, and Gemini are among the most widely used AI apps. (© prima91 - stock.adobe.com)

Board-Certified Physicians Put Popular LLMs Through Their Paces, and Found Real Problems

In a Nutshell

Across more than 200 AI-generated medical responses, doctors rated about 76% as valid, meaning roughly 1 in 4 fell short.
ChatGPT-4o performed best among the four AI models tested, while Llama3-8b performed worst, with doctors rating only half of its responses as valid.
Adding a specialized medical knowledge library to the AI did not consistently improve results. For some models, doctors actually preferred the standard version.
Mental health queries drew special concern from physicians, with some warning that AI responses in crisis situations could be actively dangerous.

When people feel a strange pain or notice a worrying symptom, more and more of them are skipping the doctor’s office and heading straight to an AI chatbot. It’s fast, free, and available at 3 a.m. But a study suggests that convenience might come with a serious catch: even the best-performing AI gets medical questions wrong roughly one out of every five times.

In a preprint study (not yet peer-reviewed) posted online by researchers from Penn State, four popular AI chatbots were put to the test using real and imagined health concerns submitted by university students, staff, and faculty. A panel of nine board-certified physicians then graded the AI responses. Overall results were mixed: impressive enough to turn heads, but flawed enough to raise real concerns about what happens when someone acts on bad medical advice.

Nearly one in four adults under 30 already use AI monthly for health-related guidance, according to data cited in the paper. Understanding what these tools get right (and wrong) is essential.

The robot doctor may not be ready to see you just yet. (© Slowlifetrader – stock.adobe.com)

How Researchers Tested AI Chatbots on Health Questions

Researchers organized a university-wide competition in fall 2024. A total of 34 participants were invited to query one of four AI chatbots — ChatGPT-4o, ChatGPT-3.5, Gemini-1.5 Pro, and Llama3-8b — with health-related questions they might genuinely want answered. Participants could approach the task from one of three angles: as a patient describing personal symptoms, as a medical professional seeking diagnostic help, or through an out-of-the-box track that allowed for alternative medical query scenarios, such as analyzing images of handwritten prescriptions.

Competition entries generated 212 AI responses in total. Those responses were then divided among a panel of nine board-certified physicians, each of whom graded them on four measures: how valid the information was, the quality of the information, how well the AI reasoned through the problem, and whether the response could cause harm.

Gemini-1.5 Pro produced the largest share of responses, 140 out of 212, while Llama3-8b generated only 6. That imbalance matters when comparing models directly, and the researchers acknowledged it as a limitation.

What Doctors Found When They Graded the AI Responses

Across all four AI models, about 76% of responses were rated as valid by physicians. That sounds reasonable until the math flips: nearly one in four responses didn’t make the cut. For ChatGPT-4o, the highest-performing model, validity hit 84.6%, still leaving more than 15% of answers falling short. Llama3-8b landed at the bottom, with only half its responses rated as valid.
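For readers who want to check the arithmetic, the short Python sketch below converts the validity rates quoted above into the share of responses that were not rated valid, and into an approximate "1 in N" figure. The three percentages come from the article; the function name `shortfall` and the dictionary layout are illustrative choices only.

```python
# Validity rates quoted in the article (percent of responses rated valid).
validity_pct = {
    "All four models": 76.0,
    "ChatGPT-4o": 84.6,
    "Llama3-8b": 50.0,
}


def shortfall(valid_pct: float) -> tuple[float, float]:
    """Return (percent not rated valid, N in 'about 1 in N responses')."""
    not_valid = 100.0 - valid_pct
    return not_valid, 100.0 / not_valid


for name, pct in validity_pct.items():
    not_valid, one_in = shortfall(pct)
    print(f"{name}: {not_valid:.1f}% not rated valid, about 1 in {one_in:.1f}")
```

Run as written, this gives about 1 in 4.2 for all models combined, about 1 in 6.5 for ChatGPT-4o, and 1 in 2 for Llama3-8b.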

The type of medical question also mattered. Questions about obstetrics and gynecology scored the highest for accuracy, while neurology, internal medicine, and dermatology consistently ranked lower. Neurology cases in the study often involved rare conditions that are hard to diagnose under any circumstances, while dermatology relies heavily on visual examination, something a text-based chatbot simply cannot replicate.

Prompt length turned out to be a factor, too. Very short questions and very long, detailed ones both produced weaker results. Best performance came from medium-length queries, somewhere between 60 and 250 characters. Medical professionals said in follow-up interviews that the more specific and focused the question, the better the AI tended to perform.
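As a rough illustration of that length range, the Python sketch below sorts a question into "short," "medium," or "long" by character count. The 60 and 250 cutoffs are the figures reported above; treating both ends as inclusive, the function name `length_bucket`, and the sample question are assumptions made for the example, not details from the study.

```python
def length_bucket(prompt: str, low: int = 60, high: int = 250) -> str:
    """Classify a prompt by character count.

    The 60-250 range is the one cited in the article; counting both
    endpoints as "medium" is an assumption of this sketch.
    """
    n = len(prompt)
    if n < low:
        return "short"
    if n > high:
        return "long"
    return "medium"


# Hypothetical question, used only to show the helper in action.
question = (
    "I have had a dull headache behind my eyes for three days that gets "
    "worse in the afternoon. What could be causing it?"
)
print(len(question), length_bucket(question))
```

A length check like this says nothing about whether a question is specific or focused, which is what the medical professionals said mattered most.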

ChatGPT-4o proved to be the best model among those tested in the study. (Bangla press/Shutterstock)

Adding a Medical Encyclopedia Didn’t Always Help AI Chatbots

One of the study’s more surprising results involved a technique called Retrieval-Augmented Generation, or RAG, essentially giving the AI access to a curated library of medical textbooks, clinical guidelines, and research articles from a university medical school before it generates a response. Grounding the AI in vetted medical sources should, in theory, make its answers more reliable.
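To make the idea concrete, here is a deliberately tiny Python sketch of the retrieve-then-prompt pattern behind RAG. It is not the researchers' pipeline: the three-sentence `LIBRARY`, the word-overlap scoring, and the prompt wording are all hypothetical stand-ins, and no AI model is actually called. Production systems typically use learned text embeddings rather than word overlap to find relevant passages.

```python
import re
from collections import Counter

# Hypothetical stand-in for a curated medical library.
# Placeholder text for illustration only, not clinical guidance.
LIBRARY = [
    "Tension headaches often cause a dull, band-like pain around the head.",
    "Seasonal allergies can cause sneezing, itchy eyes, and a runny nose.",
    "Ankle sprains are commonly managed with rest, ice, compression, and elevation.",
]


def tokenize(text: str) -> Counter:
    """Lowercase the text and count its alphabetic words."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def retrieve(question: str, library: list[str], k: int = 1) -> list[str]:
    """Return the k passages sharing the most words with the question."""
    q = tokenize(question)
    ranked = sorted(
        library,
        key=lambda passage: sum((q & tokenize(passage)).values()),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question: str, passages: list[str]) -> str:
    """Place the retrieved passages ahead of the question."""
    sources = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the sources below.\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {question}"
    )


question = "Why do I get itchy eyes and sneezing every spring?"
prompt = build_prompt(question, retrieve(question, LIBRARY))
print(prompt)  # this text would then be sent to the chatbot
```

The sketch also hints at why RAG can backfire: if the retrieval step pulls in a passage that is only loosely related, the model is nudged toward the wrong material.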

Seven medical professionals were recruited to compare standard AI responses against RAG-enhanced ones, side by side. For Gemini-1.5 Pro and Llama3-8b, the medical professionals actually preferred the standard, unenhanced versions by a wide and statistically significant margin. For the ChatGPT models, there was no significant difference either way.

Researchers stopped short of declaring RAG unhelpful overall, noting that the results varied by model and that future research should explore the approach further.

What Doctors Really Think About AI and Patient Safety

Seven medical professionals who took part in the evaluation were also interviewed about their broader views on AI in medicine. On the positive side, they saw real potential for AI to improve health literacy: helping patients understand their conditions, explore possible explanations for symptoms, and feel more engaged in their own care. Several noted that AI could serve as a useful first step for people deciding whether a symptom warrants a doctor’s visit, potentially easing the burden on overcrowded emergency rooms.

Concerns ran equally deep. Every doctor interviewed raised worries about overreliance. One described the scenario of a parent being falsely reassured by an AI while their child was seriously ill. Another flagged the risk that patients from groups historically underrepresented in medical research might receive less accurate responses, potentially widening existing health disparities.

Privacy was another concern. Several doctors warned that people entering detailed personal health information into AI chatbots may be exposing themselves to serious data risks.

Mental health queries drew particular caution. Some interviewees said AI responses on mental health topics could be actively dangerous, with one suggesting that if an AI can’t handle a mental health crisis responsibly, it simply shouldn’t respond at all.

A failure rate above 15%, the figure for even the best AI model in this study, would be considered unacceptable in almost any medical setting. Researchers are direct about this: these results are not a green light for using AI chatbots as a substitute for professional medical advice. People who rely on these tools for diagnosis or health decisions should treat them as a starting point for conversation, not a final answer. In medicine, “pretty good” has never been good enough.

Disclaimer: This article is for general informational purposes only and does not constitute medical advice. The findings described come from a university competition in which crowdsourced health questions were posed to AI chatbots and evaluated by physicians; they reflect performance on a specific set of queries and may not predict how any AI tool will perform in all real-world situations. Always consult a qualified healthcare professional before making decisions about symptoms, diagnoses, or medical care.

Paper Notes

Limitations

The study’s dataset of 212 responses was unevenly distributed across the four AI models, with Gemini-1.5 Pro accounting for 140 responses and Llama3-8b contributing only 6. This imbalance limits the statistical power of direct model-to-model comparisons and reduces the ability to generalize findings for underrepresented models. Specialty-level analysis was restricted to medical categories with at least 10 entries, meaning several fields were excluded. The RAG pipeline was built from the curriculum of a single university medical school, which may not represent the full breadth of available medical knowledge. Researchers also note that the relatively small number of participants and the voluntary, competition-based format of data collection may affect how broadly the findings apply to the general public’s everyday use of AI health tools.
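To see why six responses say so little, the Python sketch below computes a 95% Wilson score interval, a standard way to put error bars on a proportion. This is an illustration rather than an analysis from the paper. The "3 of 6" count is inferred from the article's figures (six Llama3-8b responses, half rated valid). The "70 of 140" line is purely hypothetical: it shows the same 50% rate at Gemini-1.5 Pro's sample size and is not Gemini's reported result.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives about 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# 3 of 6: inferred from the article (six responses, half rated valid).
small = wilson_interval(3, 6)
# 70 of 140: hypothetical, same 50% rate at a larger sample size.
large = wilson_interval(70, 140)

print(f"n=6:   {small[0]:.0%} to {small[1]:.0%}")
print(f"n=140: {large[0]:.0%} to {large[1]:.0%}")
```

With only six responses, the plausible range for the true validity rate spans most of the scale, which is why the researchers caution against reading too much into direct model-to-model comparisons.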

Funding and Disclosures

No external funding sources or financial disclosures are identified in the paper’s content. The study was conducted under institutional review board (IRB) approval. Participants in the follow-up interviews each received a $60 Amazon e-gift card as compensation. Prize money was awarded to competition participants, with a first-place prize of $1,000, a second-place prize of $500, a third-place prize of $250, five consolation prizes of $50 each, and a separate $1,000 prize for the submission rated highest on harm assessment.

Publication Details

Authors: Bonam Mingole, Aditya Majumdar, Firdaus Ahmed Choudhury, Jennifer L. Kraschnewski, Shyam S. Sundar, Amulya Yadav — Pennsylvania State University and Penn State College of Medicine

Paper Title: Dr. GPT Will See You Now, but Should It? Exploring the Benefits and Harms of Large Language Models in Medical Diagnosis using Crowdsourced Clinical Cases

Source: arXiv preprint (submitted June 2025). This paper has not yet undergone formal peer review.