A study using agentic artificial intelligence to detect early signs of cognitive decline in unstructured medical records found the technology achieved near-expert performance without any human guidance.
Mass General Brigham researchers built a multi-agent workflow that relied on five debating AI agents powered by open large language models (LLMs): Meta's Llama and the Llama-based clinical model Med42. The analysis drew on records from 200 real MGB patients comprising more than 3,300 clinical notes.
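The article does not publish the researchers' prompts or aggregation rules, but the basic shape of a debate-style, multi-agent screen can be sketched in a few lines. The snippet below is illustrative only, and `query_model` is a hypothetical wrapper around a locally hosted Llama or Med42 deployment, not part of the study's released code.

```python
# Minimal sketch of a debate-style multi-agent screen over a clinical note.
# The study's actual prompts and aggregation logic are not described in the
# article; agent count and round count here are assumptions.

from collections import Counter

NUM_AGENTS = 5   # the article mentions five debating agents
NUM_ROUNDS = 2   # assumed number of debate rounds

def query_model(prompt: str) -> str:
    """Hypothetical call to a local open-weight LLM; returns 'concern' or 'no concern'."""
    raise NotImplementedError("Wire this to your own Llama/Med42 deployment.")

def screen_note(note_text: str) -> str:
    """Have several agents vote, share their answers, and vote again."""
    votes = []
    for agent_id in range(NUM_AGENTS):
        prompt = (
            f"You are reviewer {agent_id}. Read the clinical note below and answer "
            f"'concern' or 'no concern' for early cognitive decline.\n\n{note_text}"
        )
        votes.append(query_model(prompt))

    # Debate rounds: each agent sees the others' answers and may revise its own.
    for _ in range(NUM_ROUNDS):
        summary = ", ".join(votes)
        votes = [
            query_model(
                f"Other reviewers answered: {summary}. Reconsider the note and answer "
                f"'concern' or 'no concern'.\n\n{note_text}"
            )
            for _ in range(NUM_AGENTS)
        ]

    # Majority vote decides the final label for the note.
    return Counter(votes).most_common(1)[0][0]
```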
The AI came close to matching human-guided performance, achieving over 90% of expert-level accuracy without human intervention.
“We basically built a digital clinical team,” Hossein Estiri, Ph.D., a researcher on the study, director of the Clinical Augmented Intelligence research group and associate professor of medicine at Massachusetts General Hospital, told Fierce Healthcare. “And it was much cheaper, because you don’t need a human. It’s completely autonomous.”
The study looked at model performance across a validation dataset, which resembled real-world conditions, and a refinement dataset, which had more balanced training data. Alongside the study's publication, the researchers released Pythia, an open-source tool to help other researchers deploy autonomous prompt optimization for their own AI screening applications.
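Pythia's own interface is documented in the researchers' release rather than in this article, so the sketch below is not Pythia's API. It only illustrates the general idea of autonomous prompt optimization: a model proposes revised prompts, each candidate is scored on a small labeled refinement set, and better-scoring prompts are kept, with no human in the loop. `query_model` is the same hypothetical LLM wrapper used in the earlier sketch.

```python
# Toy illustration of autonomous prompt optimization on a labeled refinement set.
# Not Pythia's actual interface; dataset format and loop structure are assumptions.

def query_model(prompt: str) -> str:
    """Hypothetical wrapper around a local open-weight LLM (e.g., Llama or Med42)."""
    raise NotImplementedError

def score_prompt(prompt: str, labeled_notes: list[tuple[str, str]]) -> float:
    """Fraction of notes whose predicted label matches the reference label."""
    correct = 0
    for note, label in labeled_notes:
        prediction = query_model(f"{prompt}\n\n{note}")
        correct += int(prediction.strip().lower() == label)
    return correct / len(labeled_notes)

def optimize_prompt(seed_prompt: str, labeled_notes, iterations: int = 10) -> str:
    """Greedy loop: ask the model to revise the prompt, keep it only if accuracy improves."""
    best_prompt = seed_prompt
    best_score = score_prompt(seed_prompt, labeled_notes)
    for _ in range(iterations):
        candidate = query_model(
            "Rewrite the following screening prompt so a reviewer model labels "
            "notes for early cognitive decline more accurately. Return only the new prompt.\n\n"
            + best_prompt
        )
        candidate_score = score_prompt(candidate, labeled_notes)
        if candidate_score > best_score:
            best_prompt, best_score = candidate, candidate_score
    return best_prompt
```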
Human reviewers initially disagreed with the AI on a number of cases: in the validation dataset, 16 cases appeared to be false negatives, where the AI had determined there was no cognitive concern. Ultimately, independent experts sided with the AI in 44% of those cases, concluding it had correctly ruled out concerns based on the available evidence. The AI did this at an information disadvantage: it worked only from clinical notes, while the human reviewers had access to complete medical records.
The AI struggled with isolated data points that lacked clinical context but excelled at analyzing comprehensive clinical narratives, including the history of present illness, exam findings and documented clinical reasoning.
