News - 26 Mar ’26


AI’s Growing Role in Health Advice

 

Promising, But Is It Safe?

Consumer health AI is moving fast. Safety evidence is not. And that should make everyone a little less relaxed.

In Brief

OpenAI launched ChatGPT Health in January 2026 as a dedicated health space inside ChatGPT, and Microsoft announced Copilot Health on March 12, 2026 as a separate health-focused experience inside Copilot. Both products are built around the same promise: pull fragmented health data together and turn it into something more useful for patients. 

But the first independent safety test of ChatGPT Health, published by Mount Sinai researchers in Nature Medicine, found a troubling pattern. The system undertriaged true emergencies and overtriaged nonurgent cases, which is exactly the kind of split personality you do not want in a medical advisor. 

AI’s influence in medicine is no longer creeping in quietly. It has kicked the door open.

The Floodgates Are Open

OpenAI introduced ChatGPT Health in January 2026 as a dedicated experience for health and wellness conversations, with connected records, supported apps, and separate privacy controls. The company says Health is meant to help people feel more informed and better prepared for medical conversations, not to replace medical care. 

Microsoft joined the same race on March 12, 2026 with Copilot Health. Its pitch is familiar and ambitious at the same time: gather records, medications, wearable data, and health history into one place, then use AI to turn the mess into a coherent story a patient can actually understand.

That all sounds sensible. Frankly, much of it is sensible. Modern healthcare data is a landfill with a user interface. If AI can help people make sense of visit summaries, medications, labs, and device data before seeing a doctor, there is obvious value in that. The problem is not the idea.

The problem is whether these consumer-facing tools are safe enough to influence real decisions when the situation is urgent, confusing, or subtle.

The first hard look

A new independent evaluation from Mount Sinai finally gives us something better than product copy. Researchers tested ChatGPT Health using 60 clinician-authored vignettes across 21 specialties and 16 factorial conditions, producing 960 total responses. The study was published in Nature Medicine in February 2026. 

The findings were not subtle. The chatbot showed two major biases at once: it consistently undertriaged emergencies and overtriaged routine cases. That means it could be too calm when immediate care was needed, and too dramatic when home care or watchful waiting would have been enough. In medicine, that is a nasty little combo. 

What researchers found, and why it matters:

  • In 51.6% of true emergency scenarios, ChatGPT Health recommended routine doctor evaluation within 24–48 hours instead of immediate emergency care. Why it matters: a patient facing something like impending respiratory failure or diabetic ketoacidosis could be falsely reassured and delay lifesaving treatment.
  • In 64.8% of nonurgent scenarios, the system escalated to appointments that were not necessary. Why it matters: that creates extra anxiety, wasted visits, and more strain on already overloaded systems.

 

According to the study and Mount Sinai’s summary, ChatGPT Health handled obvious textbook emergencies such as stroke or anaphylaxis better than it handled nuanced threats like diabetic ketoacidosis or impending respiratory failure. In other words, it did best where almost any half-decent checklist would do best — and stumbled where real clinical judgment starts to matter.

“It performed well in textbook emergencies... but struggled in more nuanced situations where clinical judgment matters most.”

That observation, attributed by Mount Sinai to lead author Dr. Ashwin Ramaswamy, is the heart of the matter. Medicine is not just spotting the obvious. The real work is context, severity, timing, comorbidity, uncertainty, and knowing when something ordinary-looking is about to go sideways. 

Helpful Hints

  • Undertriage means a tool treats a dangerous situation as less urgent than it really is.
  • Overtriage means a tool escalates a low-risk situation into a higher level of care than is actually needed.
  • Consumer-facing health AI means tools built for patients or the general public, not just for clinicians inside controlled hospital systems.

The upside is real. So is the risk.

To be fair, health AI tools do offer real benefits. They are available instantly. They are usually cheaper than professional care and often free at the point of use. They can help people organize symptoms, make sense of scattered records, prepare for appointments, and sometimes spot obvious red flags. In overloaded health systems, that is not trivial. 

But usefulness is not the same thing as safety. A calculator can be useful and still dangerous if it quietly gives the wrong answer exactly when the stakes are high.

That is what makes this moment so awkward. The deployment curve is sprinting ahead while the independent evidence base is still putting on its shoes. Millions of people are already using these systems as a first stop for medical questions, but the public still has limited visibility into failure modes, edge-case behavior, and how often the tools get the big calls wrong. 

Why this matters to us

This issue hits close to home for us at VRF. We work in a field where patients often spend years being misread, underinformed, dismissed, or bounced around between half-answers. That makes the appeal of AI obvious. When the system is slow, expensive, and fragmented, a clean digital interface that sounds confident can feel like relief.

But confidence is cheap. Safety is not.

This echoes concerns I've raised before: confident-sounding AI invites trust long before the safety evidence has caught up. That case now looks a lot less theoretical.

The adult conclusion

As patients, clinicians, and researchers, we should insist on rigorous independent testing before widespread dependence on consumer-facing health AI becomes normal. Not after a few million people have already started using it that way. Before.

If a system is going to shape decisions about self-care, doctor visits, urgent symptoms, or emergency care, then it should be held to standards closer to safety-critical medicine than to ordinary software launches. That means independent audits, transparent reporting of failure modes, and far less trust in polished demos.

For now, the practical rule is simple. Use these tools as a starting point, not the final word. They may help you ask better questions. They may help you prepare for a visit. They may even help you spot something worth checking. But for anything serious, fast-moving, unusual, or urgent, they are not your doctor.

Not yet. And in some situations, not even close.

 

Yan Valle

Prof. h.c., CEO VRF

References

  1. OpenAI. Introducing ChatGPT Health. January 2026.
  2. OpenAI Help Center. Health in ChatGPT release notes. January 7, 2026.
  3. Microsoft AI. Introducing Copilot Health. March 12, 2026.
  4. Ramaswamy A, et al. ChatGPT Health performance in a structured test of triage safety. Nature Medicine. 2026.
  5. Mount Sinai. Research Identifies Blind Spots in AI Medical Triage. February 24, 2026.

Educational note: This article is an opinion piece for public discussion and should not be treated as personal medical advice. For urgent or serious symptoms, contact a qualified healthcare professional or emergency services.