Binghamton Researchers Find Way to Eliminate AI Hallucinations

AI chatbots confidently delivering wrong medical information could become a thing of the past. Binghamton University researchers built a verification system using seven competing AI models that eliminated hallucinations across more than 10,000 biomedical experiments.

The University Network

Millions of college students turn to AI chatbots each year with health questions, from mysterious rashes to unexplained fatigue. The problem: those chatbots sometimes answer with complete confidence and complete inaccuracy, a phenomenon researchers call “hallucinations.” A team at Binghamton University may have cracked the code on stopping them.

A study published in STAR Protocols in May 2026 outlines a new verification workflow that, across more than 10,000 experiments, produced zero unmatched — or fabricated — medical terms. The research was funded by a $100,000 grant from New York state’s Empire AI Consortium.

The Method: Make the AIs Vote

The core idea behind the Binghamton approach is surprisingly democratic. Rather than relying on any single large language model (LLM) to answer a medical question, researchers Ahmed Abdeen Hamed and Luis M. Rocha selected seven open-source AI models and forced them to work from the same authoritative source before responding.

That constraint is called retrieval-augmented generation, or RAG, which requires each model to consult a vetted database of medical terminology before generating a response. When given identical plain-language symptom descriptions, each of the seven models independently produced what it believed to be the correct medical terms, complete with official identification numbers. The models then effectively cast votes, and only terms that earned meaningful consensus were accepted.

The results were striking. According to the study, 76.85% of answers were supported by at least four of the seven LLMs, while the remaining 23.15% were backed by at least two — leaving no unmatched terms and no hallucinations.
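The voting step described above can be sketched in a few lines of Python. This is only an illustration, not the researchers' code: the function names, the model labels, and the placeholder term IDs are all hypothetical, and the acceptance floor of two votes is an assumption drawn from the article's statement that every accepted term was backed by at least two models.

```python
from collections import Counter

def tally_votes(model_outputs, min_votes=2):
    """Count how many models proposed each term ID and keep the supported ones.

    model_outputs maps a model name to the list of term IDs it returned.
    min_votes=2 is an assumed acceptance floor, not a value from the study.
    """
    counts = Counter()
    for terms in model_outputs.values():
        counts.update(set(terms))  # a model votes at most once per term
    return {term: n for term, n in counts.items() if n >= min_votes}

def support_tier(votes, n_models=7):
    """Label a term by whether a strict majority of the panel backed it."""
    majority = n_models // 2 + 1  # four of seven
    return "majority" if votes >= majority else "minority"

# Hypothetical outputs from seven models for one symptom description.
outputs = {
    "model_a": ["TERM:0001", "TERM:0002"],
    "model_b": ["TERM:0001"],
    "model_c": ["TERM:0001", "TERM:0003"],
    "model_d": ["TERM:0001", "TERM:0002"],
    "model_e": [],
    "model_f": ["TERM:0002"],
    "model_g": ["TERM:0001"],
}

accepted = tally_votes(outputs)
for term, votes in sorted(accepted.items()):
    print(term, votes, support_tier(votes))
```

In this made-up example, a term proposed by only one model is dropped, which is the role the vote plays in filtering out isolated, possibly fabricated answers.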

Hamed is a research fellow in the School of Systems Science and Industrial Engineering at the Thomas J. Watson College of Engineering and Applied Science, and is moving to a new position as a research associate professor at the University of Nebraska-Lincoln. He described the new workflow as broadly capable.

“The new workflow is incredible because it can verify anything from a biomedical point of view — biological knowledge with disease and genetics, translational knowledge from diseases to treatments and clinical trials, and also from a healthcare point of view with symptoms and treatments,” Hamed said in a news release.

Built to Scale

One of the most compelling features of the protocol is its scalability. Because hundreds of open-source AI models are available, the experiment can be rerun many times with different random combinations of seven models drawn from that larger pool, with each iteration reinforcing the reliability of the results.

“There can be 100 large language models that are open source, and every time we can perform an experiment with seven LLMs selected at random from that list,” Hamed added. “When we perform the experiment many, many times, we increase the confidence in the voting.”
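The repeated random selection Hamed describes might look something like the following Python sketch. It is a simplified illustration under stated assumptions: the pool of 100 model names is invented, the number of runs and the seed are arbitrary, and `acceptance_rate` is a hypothetical way of summarizing how often a term survives the vote across runs, not a measure taken from the study.

```python
import random

def draw_panels(pool, panel_size=7, n_runs=5, seed=0):
    """Draw n_runs random panels of distinct models from the pool."""
    rng = random.Random(seed)  # seeded so the same draws can be reproduced
    return [rng.sample(pool, panel_size) for _ in range(n_runs)]

def acceptance_rate(term, accepted_per_run):
    """Fraction of runs whose accepted set contains the given term."""
    if not accepted_per_run:
        return 0.0
    hits = sum(term in accepted for accepted in accepted_per_run)
    return hits / len(accepted_per_run)

# Hypothetical pool of 100 open-source models.
pool = [f"model-{i:03d}" for i in range(100)]
panels = draw_panels(pool)

# Hypothetical accepted-term sets from four separate runs.
runs = [{"TERM:0001"}, set(), {"TERM:0001", "TERM:0002"}, {"TERM:0001"}]
rate = acceptance_rate("TERM:0001", runs)
```

A term that keeps being accepted no matter which seven models are drawn is harder to attribute to the quirks of any one model, which is the intuition behind rerunning the experiment.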

That built-in reproducibility is a significant departure from how most AI tools operate, where a single model’s architecture and training data determine whether the output is trustworthy — and users rarely know which way it will go.

Why It Matters for Students and Young Adults

For college students who routinely use ChatGPT or similar tools to research symptoms before deciding whether to visit a campus health center, the stakes of AI hallucinations are real. A fabricated drug interaction or a misidentified condition could delay appropriate care or cause unnecessary alarm.

Beyond individual health decisions, the research has broader implications for the medical field. Rocha, who holds the George J. Klir Professorship in Systems Science, noted that the protocol is a meaningful step forward for large multiscale network models of disease — a central focus of his Complex Adaptive Systems and Computational Intelligence Lab at Binghamton.

One application his lab is actively pursuing involves “digital twins” for precision medicine — dynamic, virtual replicas of human biological processes that are continuously updated with real-time data to simulate how a patient might respond to a given treatment before it is ever tested in the real world.

Rocha explained that the new protocol can do far more than confirm a diagnosis.

“For instance, the protocol can extract and provide multi-agent verification of evidence for an adverse drug reaction for a given medication that is available in clinical trials, the scientific literature, pharmacological databases, and even social media discourse,” Rocha said in the news release. “And it can assist in the extraction of evidence at multiple scales, from multiomics to epidemiological and behavioral data sources, which we have already started to pilot by building multi-layer models of ER+ breast cancer.”

Beyond Medicine

Although the Binghamton team designed its protocol with biomedical use cases in mind, the underlying framework applies wherever AI hallucinations cause harm. Fabricated legal citations (a problem that has already led to real-world court sanctions), fake academic references, and distorted historical facts could all be targets for similar multi-agent verification systems.

“This protocol is a big step toward the democratization of knowledge verification,” Hamed added.

For students writing research papers, journalists fact-checking stories, or anyone who relies on AI for information they need to trust, that kind of systematic verification could eventually become a standard feature rather than a specialized research tool.

Source: Binghamton University