Study Finds AI Errors Can Double Human Mistakes in Life-or-Death Decisions

When AI stumbles, so can people, highlighting the urgent need to assess both human and machine performance in safety-critical systems like healthcare. 


When it comes to adopting artificial intelligence in high-stakes settings like hospitals and airplanes, good AI performance and brief worker training on the technology are not sufficient to ensure systems will run smoothly and patients and passengers will be safe, a new study suggests.

Instead, algorithms and the people who use them in the most safety-critical organizations must be evaluated simultaneously to get an accurate view of AI's effects on human decision making, researchers say.

The team also contends these evaluations should assess how people respond to good, mediocre, and poor technology performance to put the AI-human interaction to a meaningful test – and to expose the level of risk linked to mistakes.

Participants in the study, led by engineering researchers at The Ohio State University, were 450 Ohio State nursing students, primarily undergraduates with varying amounts of clinical training, and 12 licensed nurses. They used AI-assisted technologies in a remote patient-monitoring scenario to determine how likely urgent care would be needed in a range of patient cases.

Results showed that accurate AI predictions about whether a patient was trending toward a medical emergency improved participant performance by 50% to 60%. But when the algorithm produced an inaccurate prediction, even one accompanied by explanatory data that did not support it, human performance collapsed: proper decision making degraded by more than 100% when the algorithm was at its most wrong.

"An AI algorithm can never be perfect. So if you want an AI algorithm that's ready for safety-critical systems, that means something about the team, about the people and AI together, has to be able to cope with a poor-performing AI algorithm," said first author Dane Morey, a research scientist in the Department of Integrated Systems Engineering at Ohio State.

"The point is this is not about making really good safety-critical system technology. It's the joint human-machine capabilities that matter in a safety-critical system."

Morey completed the study with Mike Rayo, associate professor, and David Woods, faculty emeritus, both in integrated systems engineering at Ohio State. The research was published recently in npj Digital Medicine.

The authors, all members of the Cognitive Systems Engineering Lab directed by Rayo, developed the Joint Activity Testing research program in 2020 to address a perceived gap in responsible AI deployment in risky environments, especially medical and defense settings.

The team is also refining a set of evidence-based guiding principles for designing machines with joint activity in mind, principles intended to streamline AI-human performance evaluation and, beyond that, improve system outcomes.

According to their preliminary list, a machine, first and foremost, should convey to people how it may be misaligned with the world, even when it is unaware of that misalignment.

"Even if a technology does well on those heuristics, it probably still isn't quite ready," Rayo said. "We need to do some form of empirical evaluation because those are risk-mitigation steps, and our safety-critical industries deserve at least those two steps of measuring performance of people and AI together and examining a range of challenging cases."

The Cognitive Systems Engineering Lab has been running studies for five years on real technologies to arrive at best-practice evaluation methods, mainly on projects with 20 to 30 participants. Having 462 participants in this project, drawn from a target population for AI-infused technologies and enrolled through a course-based educational activity, gives the researchers high confidence in their findings and recommendations, Rayo said.

Each participant analyzed a sequence of 10 patient cases under differing experimental conditions: no AI help, an AI percentage prediction of imminent need for emergency care, AI annotations of data relevant to the patient's condition, and both AI predictions and annotations.

All examples included a data visualization showing demographics, vital signs and lab results intended to help users anticipate changes to or stability in a patient's status.

Participants were instructed to report their concern for each patient on a scale from 0 to 10. Higher concern for emergency patients and lower concern for non-emergency patients were the indicators deemed to show better performance.
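To make that scoring rule concrete, here is a minimal illustrative sketch in Python of how a 0-to-10 concern rating could be turned into a performance measure: higher concern on emergency cases and lower concern on non-emergency cases counts as better. The function names, normalization, and averaging are assumptions for illustration only, not the paper's actual analysis.

```python
# Illustrative sketch (not the study's actual analysis): score 0-10 concern
# ratings against a binary emergency / non-emergency ground truth.

def concern_score(rating: float, is_emergency: bool) -> float:
    """Map a 0-10 concern rating to a 0-1 appropriateness score."""
    if not 0 <= rating <= 10:
        raise ValueError("concern rating must be between 0 and 10")
    normalized = rating / 10  # 0.0 = no concern, 1.0 = maximal concern
    # High concern is good for emergencies; low concern is good otherwise.
    return normalized if is_emergency else 1 - normalized

def mean_performance(cases: list[tuple[float, bool]]) -> float:
    """Average appropriateness across (rating, is_emergency) pairs."""
    return sum(concern_score(r, e) for r, e in cases) / len(cases)

# Example: an emergency case rated 8 and a non-emergency case rated 2
print(mean_performance([(8, True), (2, False)]))  # -> 0.8
```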

"We found neither the nurses nor the AI algorithm were universally superior to the other in all cases," the authors wrote. The analysis accounted for differences in participants' clinical experience.

While the overall results provided evidence of the need for this type of evaluation, the researchers said they were surprised that the explanations included in some experimental conditions had very little sway over participants' concern ratings – instead, the algorithm's recommendation, presented as a solid red bar, overrode everything else.

"Whatever effect that those annotations had was roundly overwhelmed by the presence of that indicator that swept everything else away," Rayo said.

The team regards the study methods, including custom-built technologies representative of health care applications currently in use, as a demonstration of why their recommendations are needed and a template for how industries could put the suggested practices in place.

The code for the experimental technologies is publicly available, and Morey, Rayo, and Woods further explain their work in an article published at AI-frontiers.org.

"What we're advocating for is a way to help people better understand the variety of effects that may come about from technologies," Morey said. "Basically, the goal is not the best AI performance. It's the best team performance."

This research was funded by the American Nurses Foundation Reimagining Nursing Initiative.

Journal reference:
  • Morey, D. A., Rayo, M. F., & Woods, D. D. (2025). Empirically derived evaluation requirements for responsible deployments of AI in safety-critical settings. npj Digital Medicine, 8(1), 1–11. DOI: 10.1038/s41746-025-01784-y, https://www.nature.com/articles/s41746-025-01784-y
