A Mass General Brigham team has created an autonomous AI “clinical team” that scans everyday medical notes to flag early signs of cognitive decline, aiming to catch problems before the treatment window closes.
A new artificial intelligence system that reads routine medical notes like a digital clinical team could help doctors catch early signs of cognitive decline long before patients receive a formal diagnosis.
Researchers at Mass General Brigham have developed one of the first fully autonomous AI systems designed to screen for cognitive impairment using everyday clinical documentation. In real-world validation, the system reached very high accuracy in ruling out problems, suggesting it could become a powerful safety net for patients who might otherwise slip through the cracks.
The work, published in npj Digital Medicine, comes as early detection of conditions such as Alzheimer’s disease is becoming more urgent. New therapies tend to work best when given early, but many patients are diagnosed only after their memory or thinking problems have already progressed.
“By the time many patients receive a formal diagnosis, the optimal treatment window may have closed,” co-lead study author Lidia Moura, the director of Population Health and the Center for Healthcare Intelligence in the Department of Neurology at Mass General Brigham, said in a news release.
Traditional cognitive screening tests can be time-consuming, require special training, and are not always accessible to patients. At the same time, clinicians are under intense time pressure and may not be able to sift through years of notes to spot subtle patterns of decline.
The Mass General Brigham team set out to turn that everyday documentation into a continuous, behind-the-scenes screening tool.
Instead of building a single model that spits out a yes-or-no answer, the researchers created a coordinated set of AI “agents” that work together and critique each other’s reasoning.
“We didn’t build a single AI model — we built a digital clinical team,” added corresponding author Hossein Estiri, the director of the Clinical Augmented Intelligence (CLAI) research group and associate professor of medicine at Massachusetts General Hospital, a founding member of the Mass General Brigham health care system. “This AI system includes five specialized agents that critique each other and refine their reasoning, just like clinicians would in a case conference.”
Each of the five agents has a distinct role in reviewing the text of clinical notes and determining whether they show signs of cognitive concern. The agents run in an iterative loop, revisiting their own conclusions, challenging one another, and refining their judgments until they either meet preset performance targets or agree that they have converged on a stable answer.
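The study's actual implementation is not reproduced here, but the general pattern of agents taking turns over a shared transcript can be sketched in a few lines. In the minimal Python sketch below, the agent roles, prompts, and the query_model() helper are hypothetical placeholders, not the team's code.

```python
# Illustrative sketch only: the agent roles, prompts, and query_model() helper
# are hypothetical stand-ins, not the system described in the study.

def query_model(prompt: str) -> str:
    """Placeholder for a call to a locally hosted open-weight language model."""
    raise NotImplementedError("Wire this to your on-premises model endpoint.")

AGENT_ROLES = [
    "extract any mentions of memory or thinking problems from the note",
    "weigh whether those mentions suggest genuine cognitive concern",
    "argue the opposing view and look for benign explanations",
    "check the other agents' reasoning for unsupported leaps",
    "issue a final yes/no judgment with a brief justification",
]

def screen_note(note_text: str, max_rounds: int = 3) -> str:
    """Run the agents in an iterative loop until their answer stops changing."""
    transcript: list[str] = []
    verdict = None
    for _ in range(max_rounds):
        for role in AGENT_ROLES:
            prompt = (
                f"You are an agent whose job is to {role}.\n"
                f"Clinical note:\n{note_text}\n"
                "Discussion so far:\n" + "\n".join(transcript)
            )
            transcript.append(query_model(prompt))
        new_verdict = transcript[-1]
        if new_verdict == verdict:  # stable answer across rounds -> converged
            break
        verdict = new_verdict
    return verdict
```

In a real deployment, query_model() would wrap a locally hosted open-weight model, which is exactly the setup the researchers describe.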
Under the hood, the system uses an open-weight large language model that can be deployed locally within a hospital’s own information technology infrastructure. That means patient data do not have to leave the health system or be sent to external cloud services, a key consideration for privacy and security.
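As a rough illustration of what local deployment can look like, the sketch below loads an open-weight model through the Hugging Face transformers library so that all inference stays on the hospital's own hardware. The library choice and the model name are assumptions made for the example; the study specifies only that its model is open-weight and locally deployable.

```python
# Minimal sketch of on-premises inference with an open-weight model via the
# Hugging Face transformers library. The model name is a placeholder, not the
# model used in the study.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.1-8B-Instruct",  # any locally downloaded open-weight model
    device_map="auto",                          # run on local GPUs/CPU, no external API calls
)

note = "Patient's spouse reports increasing forgetfulness over the past year..."
prompt = f"Does this clinical note suggest possible cognitive concern?\nNote: {note}"
print(generator(prompt, max_new_tokens=128)[0]["generated_text"])
```

Because the weights are downloaded once and served inside the institution's network, no note text ever has to be sent to an external cloud service.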
To test the approach, the researchers analyzed more than 3,300 clinical notes from 200 anonymized patients at Mass General Brigham. These were notes generated during regular health care visits, not special research assessments.
The idea is simple but powerful: if an AI system can read those notes and flag patients whose documentation suggests possible cognitive issues, clinicians can then follow up with formal testing and referrals. In effect, every routine visit becomes a chance to catch early warning signs.
Moura explained why the notes themselves are so valuable.
“Clinical notes contain whispers of cognitive decline that busy clinicians can’t systematically surface,” she said. “This system listens at scale.”
When the AI disagreed with human reviewers, the team brought in an independent expert to re-evaluate those cases. In more than half of the disagreements, the expert sided with the AI, concluding that its reasoning was defensible based on the information in the notes.
“We expected to find AI errors. Instead, we often found the AI was making defensible judgments based on the evidence in the notes,” Estiri added.
The system performed especially well when it had access to rich, narrative documentation that described patients' history, symptoms, and daily functioning. It struggled more when information about cognitive concerns appeared only as brief problem-list entries without supporting detail, or when it encountered clinical indicators it had not been trained to recognize.
In controlled, balanced testing, the AI achieved high sensitivity, meaning it was good at catching cases with cognitive concerns. In more realistic conditions, where only about a third of patients had documented concerns, sensitivity dropped, but the system maintained very high specificity. That high specificity is important in practice, because it reduces the risk of overwhelming clinics with false alarms.
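A back-of-the-envelope calculation shows why specificity dominates at that kind of prevalence. The numbers below are illustrative round figures, not the study's reported results.

```python
# Illustrative arithmetic only: sensitivity, specificity, and prevalence are
# made-up round numbers, not the figures reported in the study.
sensitivity = 0.70   # share of truly concerning records the screen flags
specificity = 0.96   # share of unaffected records it correctly leaves alone
prevalence  = 1 / 3  # roughly one in three patients with documented concerns

true_pos  = sensitivity * prevalence
false_pos = (1 - specificity) * (1 - prevalence)
ppv = true_pos / (true_pos + false_pos)

print(f"Of 1,000 screened patients, ~{1000 * false_pos:.0f} would be false alarms")
print(f"Positive predictive value: {ppv:.2f}")  # high specificity keeps this high
```

Even with modest sensitivity, a specificity near 96 percent in this hypothetical keeps false alarms to a few dozen per thousand patients screened, so most flagged charts would be worth a clinician's follow-up.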
The researchers were explicit about these limitations, emphasizing that transparency about where AI falls short is essential if such tools are to be trusted in medicine.
“We’re publishing exactly the areas in which AI struggles,” added Estiri. “The field needs to stop hiding these calibration challenges if we want clinical AI to be trusted.”
Beyond the specific system tested in the study, the team is releasing an open-source tool called Pythia. It is designed to help other health systems and research groups build and deploy their own autonomous AI screening workflows, not just for cognitive concerns but potentially for other conditions that leave subtle traces in clinical notes.
Because the underlying language model is open-weight and can run locally, Pythia offers a way for institutions to experiment with advanced AI while keeping tight control over patient data.
The study’s authors stress that the AI is not meant to replace clinicians or formal neurocognitive testing. Instead, it acts as a constant, background observer, scanning the record for patterns that might otherwise go unnoticed and prompting clinicians to take a closer look.
In the long term, tools like this could help health systems identify at-risk patients earlier, connect them to treatment and support, and better understand how cognitive decline shows up in real-world care.
As new therapies for Alzheimer’s and related conditions emerge, the pressure to find patients early will only grow. This autonomous AI “clinical team” offers a glimpse of how health systems might meet that challenge by turning the medical record itself into an early warning system.
Source: Mass General Brigham

