Decoding Healthcare: Evaluating Advanced Language Models in Clinical Decision-Making

In an article published in the journal Nature Communications, researchers evaluated the clinical accuracy of generative pre-trained transformer (GPT)-3.5 and GPT-4, as well as two configurations of the open-source Llama 2 large language model (LLM), in providing initial diagnoses, examination steps, and treatment suggestions for a range of medical cases.

Study: Decoding Healthcare: Evaluating Advanced Language Models in Clinical Decision-Making. Image credit: Mizkit/Shutterstock

GPT-4 outperformed GPT-3.5 and Google search on diagnosis tasks, and all tools performed better on frequent diseases than on rare ones. While the models show promise, their weaknesses highlight the need for robust, regulated artificial intelligence (AI) models in healthcare.

Background

The rise of LLMs, particularly exemplified by OpenAI's ChatGPT with versions like GPT-3.5 and GPT-4, has revolutionized various text-based tasks, including text summarization, code generation, and personal assistance. However, concerns have been raised about the accuracy and reliability of these models, especially in critical fields like medicine where misinformation can have severe consequences. While preliminary studies have showcased potential applications of ChatGPT in medical contexts, comprehensive evaluations of their diagnostic and therapeutic capabilities are lacking.

Existing research has primarily focused on simulating medical exams or assisting with medical writing, leaving a gap in assessing their performance in clinical decision-making tasks such as initial diagnosis, examination recommendations, and treatment suggestions across various diseases. This paper aimed to address this gap by conducting a thorough analysis of the clinical accuracy of GPT-3.5 and GPT-4 in handling these tasks, considering the frequency of diseases to account for varying difficulty levels. Additionally, it explored the potential of open-source LLMs like Llama 2 as an alternative.

Methods

The researchers evaluated the clinical accuracy of LLMs, specifically GPT-3.5 and GPT-4, on diagnostic, examination, and treatment tasks across a diverse range of medical cases. First, a comprehensive selection process was undertaken to assemble a representative sample of realistic cases from German clinical casebooks. Cases were then categorized by disease frequency into rare, less frequent, and frequent diseases.

To generate patient queries, the cases were translated into layman's language and presented to the LLMs and the Google search engine. Two independent physicians rated the outputs of the LLMs and Google on a five-point Likert scale. Additionally, an exploratory analysis was conducted on open-source LLMs, specifically Llama 2, in two different model sizes.
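To make the querying step concrete, the following is a minimal sketch of how a layman-language case might be posed to a model for the three tasks, assuming the OpenAI Python SDK. The example case, task questions, and model name are illustrative assumptions, not the study's actual prompts.

```python
# Minimal sketch of the patient-query step, assuming the OpenAI Python SDK.
# The case text, task questions, and model name are illustrative, not the
# study's actual prompts.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

patient_case = (
    "For two days I have had a fever, a stiff neck, and a severe headache "
    "that gets worse in bright light."
)

tasks = {
    "diagnosis": "What is the most likely initial diagnosis?",
    "examination": "Which examinations should be performed next?",
    "treatment": "Which treatment would you suggest?",
}

responses = {}
for task, question in tasks.items():
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": f"{patient_case} {question}"}],
    )
    responses[task] = completion.choices[0].message.content

print(responses["diagnosis"])
```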

The study was statistically powered for comparisons between the LLMs and Google and among the LLMs, while the Llama 2 models were analyzed descriptively. The performance evaluation considered cumulative scores across all three tasks for each LLM, stratified by disease-frequency subgroup. This comprehensive approach provided insights into the capabilities of LLMs in clinical decision-making tasks and contributed to understanding their potential applications in healthcare.
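As an illustration of the stratified scoring described above, the sketch below aggregates per-task Likert ratings into cumulative scores per model and disease-frequency subgroup using pandas. The table layout, column names, and scores are hypothetical, not the study's actual data.

```python
# Sketch of cumulative Likert-score aggregation per model and
# disease-frequency subgroup. The table layout, column names, and
# scores are hypothetical.
import pandas as pd

ratings = pd.DataFrame({
    "model":     ["GPT-4", "GPT-4", "GPT-3.5", "GPT-3.5"],
    "frequency": ["frequent", "rare", "frequent", "rare"],
    "task":      ["diagnosis"] * 4,
    "score":     [5, 3, 4, 2],  # e.g., mean of the two physicians' ratings
})

# Cumulative score across tasks, stratified by disease frequency
summary = (
    ratings.groupby(["model", "frequency"])["score"]
           .sum()
           .unstack("frequency")
)
print(summary)
```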

Results

The study assessed the clinical accuracy of two successive LLMs, GPT-3.5 and GPT-4, in diagnosing, examining, and treating medical cases, and compared them with Google search results. Inter-rater reliability analysis revealed substantial to almost perfect agreement between raters across all tasks. In the performance evaluation, GPT-4 significantly outperformed both GPT-3.5 and Google in diagnosis.
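For readers unfamiliar with inter-rater reliability, agreement between two raters on an ordinal scale is commonly quantified with Cohen's kappa, where values of 0.61-0.80 are conventionally read as substantial and values above 0.80 as almost perfect agreement. The sketch below uses scikit-learn with made-up ratings; the summary does not specify which agreement statistic the study computed.

```python
# Sketch: quantifying agreement between two physician raters with
# Cohen's weighted kappa. All ratings below are made-up examples.
from sklearn.metrics import cohen_kappa_score

rater_a = [5, 4, 2, 5, 3, 1, 4, 5]
rater_b = [5, 4, 3, 5, 3, 1, 4, 4]

# Quadratic weights penalize large disagreements on the ordinal
# five-point Likert scale more heavily than adjacent ones.
kappa = cohen_kappa_score(rater_a, rater_b, weights="quadratic")
print(f"weighted kappa = {kappa:.2f}")  # 0.61-0.80 ~ substantial agreement
```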

Notably, all tools performed better on frequent diseases than on rare ones. For examination recommendations, GPT-4 outperformed GPT-3.5, especially for rare diseases. For treatment suggestions, GPT-4 performed slightly better than GPT-3.5, although the difference was not statistically significant. These findings suggest the potential of commercial LLMs such as GPT-4 to assist clinical decision-making, particularly in diagnosing medical cases.
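The "not statistically significant" judgment implies a formal hypothesis test on the paired per-case scores. The summary does not state which test the authors applied; the sketch below uses a Wilcoxon signed-rank test, a common choice for paired ordinal Likert data, on made-up scores.

```python
# Sketch: paired significance test on per-case treatment scores.
# The summary does not state which test the study used; the Wilcoxon
# signed-rank test is a common choice for paired ordinal Likert data.
# All scores below are made-up examples.
from scipy.stats import wilcoxon

gpt4_scores  = [5, 4, 4, 5, 3, 4, 5, 4, 3, 5]
gpt35_scores = [4, 4, 3, 5, 3, 3, 5, 4, 2, 4]

stat, p_value = wilcoxon(gpt4_scores, gpt35_scores)
print(f"W = {stat}, p = {p_value:.3f}")  # p >= 0.05 -> no significant difference
```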

Yet further improvement was needed, particularly for rare diseases. Moreover, the comparison with open-source models highlighted the ongoing advancement of LLM technology and the need for continued evaluation to ensure reliability and effectiveness in clinical settings.

Discussion

The researchers comprehensively evaluated GPT-3.5 and GPT-4, along with Google search, on clinical decision support tasks across disease frequencies. GPT-4 showed significant improvement over GPT-3.5, outperforming both GPT-3.5 and Google in diagnosis, examination, and treatment recommendation. However, challenges persisted, particularly in diagnosing rare diseases and in refining prompts to elicit accurate responses.

While open-source models like Llama 2 showed promise, they lagged slightly behind their commercial counterparts. The study underscores the evolving role of LLMs in healthcare decision-making, emphasizing the need for continual improvement in accuracy, transparency, and regulatory compliance. Despite these advances, caution was warranted, as LLMs still fell short of the consistently high accuracy required for standalone medical consultation. Future integration of LLMs into healthcare would necessitate adherence to rigorous regulatory standards and exploration of open-source alternatives for greater transparency and oversight.

Conclusion

In conclusion, the researchers underscored the potential of advanced LLMs like GPT-4 in clinical decision support, particularly for diagnosing common diseases. While improvements over previous models were evident, challenges remained, especially in diagnosing rare conditions.

Additionally, open-source LLMs showed promise but required further refinement. The findings highlighted the evolving landscape of AI in healthcare and emphasized the need for ongoing evaluation, regulatory compliance, and transparency to ensure safe and effective integration into clinical practice.


Written by

Soham Nandi

Soham Nandi is a technical writer based in Memari, India. His academic background is in Computer Science Engineering, specializing in Artificial Intelligence and Machine learning. He has extensive experience in Data Analytics, Machine Learning, and Python. He has worked on group projects that required the implementation of Computer Vision, Image Classification, and App Development.

