Why Most Medical AI Fails in Clinics — And How to Fix It

A new Harvard-led study argues that medical AI will not be ready for routine clinical use until it can understand context — from specialty and geography to patients’ daily lives. The team also lays out a roadmap to make these systems more trustworthy partners for doctors and patients.

Medical artificial intelligence is often sold as a game changer for health care, promising to sift through mountains of data and spot patterns no human could see. Yet despite thousands of models built in labs and companies, only a small fraction have made a real difference in hospitals and clinics.

In a new study published in Nature Medicine, the researchers argue that one big reason is hiding in plain sight: most medical AI systems do not understand context.

They say that even when an algorithm appears to perform well on standardized tests, it can stumble badly in real-world care because it does not account for the specific situation in which its advice will be used. That includes the medical specialty involved, where in the world the patient is being treated, and the socioeconomic and cultural realities that shape a patient’s life.

“This is not a minor fluke,” co-corresponding author Marinka Zitnik, an associate professor of biomedical informatics in the Blavatnik Institute at Harvard Medical School and associate faculty in the Kempner Institute for the Study of Natural and Artificial Intelligence at Harvard University, said in a news release. “It is a broad limitation of all the types of medical AI models that we are developing in the field.”

Zitnik and colleagues call these missteps contextual errors. They are not the obvious, glaring mistakes that come from a model misunderstanding a lab value or misreading an image. Instead, they are answers that sound reasonable, and may even be technically correct, yet are wrong or unhelpful for the particular patient in front of a clinician.

Why context goes missing

In an interview with Harvard Medicine News, Zitnik explained that many of these errors start with the data used to train AI systems. The datasets often lack crucial information that clinicians rely on when making decisions, such as local treatment options, typical disease patterns in a region, or the practical barriers a patient might face in following a care plan.

As a result, models can generate recommendations that look sensible on paper but do not translate into relevant, actionable guidance in the clinic.

To close that gap, the researchers outline three major steps.

First, they argue that contextual information needs to be built into training datasets from the start. That means including richer details about patients, health systems and environments, not just lab results and diagnoses.

Second, they call for stronger computational benchmarks — the standardized test cases used to evaluate models before deployment. Those benchmarks, they say, should be designed to reveal how well a system handles different clinical contexts, not just how accurately it predicts outcomes in a narrow setting.

Third, they recommend weaving context directly into model architectures, the structural designs that determine how AI systems process information. That way, models can be built to recognize and adjust to different specialties, locations and patient circumstances rather than treating all cases as interchangeable.
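To make the second of those steps a bit more concrete, here is a minimal sketch, not taken from the paper, of what a context-stratified benchmark could look like. The context fields (region and specialty) and the simple accuracy metric are assumptions chosen for illustration; the point is that performance is reported per context rather than as a single aggregate score, so a model that works well only in one setting cannot hide behind an average.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Case:
    """One benchmark case: clinical inputs plus the context they came from."""
    features: dict   # e.g., labs, symptoms, imaging findings
    label: str       # reference answer or outcome
    region: str      # hypothetical context fields, not defined in the paper
    specialty: str


def stratified_report(model: Callable[[dict], str], cases: List[Case]) -> Dict[str, float]:
    """Accuracy broken out per context stratum instead of one aggregate number."""
    hits, totals = defaultdict(int), defaultdict(int)
    for case in cases:
        correct = model(case.features) == case.label
        for stratum in (f"region={case.region}", f"specialty={case.specialty}"):
            totals[stratum] += 1
            hits[stratum] += int(correct)
    return {stratum: hits[stratum] / totals[stratum] for stratum in totals}
```

In a report like this, a model that scores well overall but poorly for one region or specialty would surface that gap immediately rather than averaging it away.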

When specialty, place and life circumstances matter

The paper highlights three examples where lack of context can derail medical AI.

The first involves medical specialties. Patients with complex conditions often have symptoms that cut across multiple organ systems. A person who arrives in the emergency department with neurological symptoms and breathing problems might see both a neurologist and a pulmonologist, each trained to focus on a different part of the body.

An AI model trained mostly on data from one specialty might do the same, zeroing in on the organ system it “knows” best and missing that the combination of symptoms points to a multisystem disease. The researchers argue that more robust models will need to be trained across specialties and able to switch focus in real time to whatever information is most relevant.

Geography is another powerful source of context. The same medical question can have very different answers depending on where a patient lives. A disease that is common in one country may be rare in another. Treatments that are standard in one health system may be unavailable, unaffordable or not yet approved elsewhere.

If a model gives the same recommendation in South Africa, the United States and Sweden, the team notes, that advice is likely to be wrong in at least some of those places. Zitnik’s lab is working on models that incorporate geographic information to generate location-specific guidance, which could have major implications for global health.
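The paper does not describe a specific implementation, but one simple way to picture location awareness is a post-processing step that filters a model's ranked suggestions through local availability. Everything below, including the formulary table and function name, is invented for illustration.

```python
# Hypothetical data: which treatments are approved and available in each health system.
LOCAL_FORMULARY = {
    "United States": {"drug_a", "drug_b", "drug_c"},
    "South Africa": {"drug_b"},
    "Sweden": {"drug_a", "drug_b"},
}


def localize(ranked_treatments: list[str], country: str) -> list[str]:
    """Keep the model's ranking but drop options unavailable where the patient is treated."""
    available = LOCAL_FORMULARY.get(country, set())
    return [t for t in ranked_treatments if t in available]


# The same global ranking becomes three different, location-specific answers.
for place in ("United States", "South Africa", "Sweden"):
    print(place, localize(["drug_a", "drug_c", "drug_b"], place))
```

A real system would push context much deeper than a post hoc filter, into the training data and the model itself, as the roadmap above suggests, but even this toy version shows why a single global answer cannot be right everywhere.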

The third example centers on socioeconomic and cultural factors that shape whether a patient can realistically follow through on a care plan. Consider a patient who shows up in the emergency department with severe symptoms, having never scheduled the oncology appointment they were referred to earlier. A clinician might simply remind the patient to schedule the visit.

But that response may ignore real barriers: long travel distances, lack of reliable childcare, rigid work schedules or limited transportation. Those constraints rarely appear in electronic health records, so a typical AI system would not factor them in either.

The researchers envision models that can account for such realities and suggest more practical options, such as arranging transportation or offering appointment times that fit around childcare or work. Done well, that kind of context-aware AI could expand access to care instead of reinforcing existing inequities.

Beyond context: trust and collaboration

Contextual errors are not the only hurdle to bringing AI safely into everyday medicine. Zitnik points to trust as another major challenge. Patients, clinicians, regulators and health systems all need to be confident that AI tools are both reliable and used responsibly.

One way to build that trust, the team suggests, is to design models that are transparent and easy to interpret. Rather than offering opaque recommendations, systems should show how they arrived at their conclusions and be able to signal uncertainty — including, when appropriate, effectively saying “I don’t know.”
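As a rough illustration of that last point, a system can be wrapped so that it abstains whenever its confidence falls below a threshold. The predict_proba interface and the 0.8 cutoff below are assumptions made for this sketch, not details from the study.

```python
def answer_or_abstain(predict_proba, question, threshold: float = 0.8) -> str:
    """Return the top answer with its confidence, or abstain when confidence is low.

    `predict_proba` is assumed to map a question to {answer: probability}.
    """
    probabilities = predict_proba(question)
    best_answer, confidence = max(probabilities.items(), key=lambda kv: kv[1])
    if confidence < threshold:
        return "I don't know: confidence is too low to recommend an answer."
    return f"{best_answer} (confidence {confidence:.0%})"
```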

The way people interact with AI also needs to evolve. Many current tools resemble chatbots: users type a question and receive a single answer. Zitnik argues that future systems should support richer, two-way collaboration.

That could mean tailoring explanations to different audiences, such as providing plain-language summaries for patients and more technical details for specialists. It could also mean allowing models to ask follow-up questions when they lack key information, turning the interaction into a dialogue aimed at solving a shared problem.
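One hypothetical way such a dialogue could work is for the system to check which context it still lacks before answering, and to ask for it one question at a time. The required fields below are placeholders, not a list from the paper.

```python
# Placeholder context fields the system might need before it can give useful advice.
REQUIRED_CONTEXT = ("specialty", "region", "barriers_to_care")


def next_turn(question: str, known_context: dict) -> str:
    """Ask for the first missing piece of context, or answer once enough is known."""
    missing = [field for field in REQUIRED_CONTEXT if field not in known_context]
    if missing:
        return f"Before I answer, could you tell me about the patient's {missing[0]}?"
    return f"Draft answer to '{question}', tailored to {known_context}"
```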

New possibilities for treatment

Despite the obstacles, the researchers see enormous potential if these challenges can be addressed.

Some AI tools are already handling routine tasks, such as drafting clinical notes or helping researchers quickly find relevant scientific papers. But Zitnik is especially interested in how context-aware models could transform treatment for patients with complex conditions.

In the future, she envisions systems that can shift their focus as a patient moves through the care journey. Early on, a model might help analyze symptoms and suggest possible causes. Later, it could surface evidence about treatments that worked in similar patients, then pivot again to practical questions such as drug side effects, prior medications and what therapies are actually available in a given hospital or region.

By continuously adjusting to the most relevant context, such tools could help clinicians tailor decisions for patients whose needs do not fit neatly into standard guidelines.

Building AI that does more good than harm

Zitnik is clear that AI in health care is not a passing trend. These tools are already being used, even as researchers are still learning where they work best and where they fall short.

She argues that the medical AI community has a responsibility to make sure these systems are developed and deployed responsibly. That includes designing models with real-world use in mind, rigorously testing them in clinical environments, and creating clear guidelines for when and how they should be used.

If researchers and developers stay aligned on those goals and ask hard questions early, Zitnik believes they can catch problems before they cause harm.

In the long run, Zitnik and her colleagues are optimistic. With richer context, stronger safeguards and more thoughtful human-AI collaboration, they say, medical AI could help make research more efficient, ease the burden on clinicians and, most importantly, improve care for patients.

Source: Harvard Medical School