AI in Academic Assessments: Evaluating GPT-4 in Biomedical Science Exams


By Soham Nandi
Reviewed by Susha Cheriyedath, M.Sc.
Mar 19 2024

In an article published in the journal Scientific Reports, researchers evaluated the performance of generative pre-trained transformer (GPT)-4, a large language model (LLM), on nine graduate-level biomedical science examinations. GPT-4 generally outperformed student averages on various question formats but struggled with questions involving simulated data and hand-drawn answers. Instances of plagiarism and hallucinations were identified. The study aimed to inform the design of future academic examinations in light of the capabilities and limitations of artificial intelligence (AI) LLMs like GPT-4.


Background

LLMs such as OpenAI's GPT-3.5 and GPT-4 have markedly advanced text generation. These models, most visibly through the ChatGPT chatbot, generate human-like text and have affected domains previously exclusive to human expertise. The release of ChatGPT lowered barriers to access and expanded the use of GPT-based models in diverse fields.

Prior research has evaluated the performance of LLMs in tasks ranging from standardized exam questions to professional certification exams. However, existing studies primarily focused on broad-based or discipline-specific exams, potentially influenced by readily available study materials online, thus limiting the assessment's accuracy in gauging the models' domain-specific knowledge and capabilities. Moreover, relying on standardized questions may not fully represent the depth and distribution of topics within real-world exams, raising questions about the assessment's validity.

The present paper addressed these gaps by assessing GPT-4's performance on graduate-level final examinations in the biomedical sciences, a field requiring both deep subject knowledge and critical thinking skills. By focusing on free-response questions typical of doctor of philosophy (Ph.D.) level assessments, the study provided a robust benchmark for evaluating GPT-4's ability to generate accurate and logically consistent responses in expert-level contexts.

Additionally, the study employed blinded grading to reduce potential bias and compared GPT-4's performance directly against student grades, offering insights into areas where the model excelled or fell short. Through this comprehensive evaluation, the researchers contributed to a deeper understanding of LLM capabilities in scientific domains, informing the design of future academic examinations amidst the advent of AI-driven text generation technologies.

Methods

The methods employed in this study aimed to rigorously assess the performance of the GPT-4 LLM in answering graduate-level examination questions in biomedical sciences. Nine courses from the University of Florida provided examinations administered from March to May 2023, encompassing various question formats such as short response, fill-in-the-blank, essay, and diagram-drawing questions. GPT-4 was queried either interactively via ChatGPT or programmatically using the OpenAI API, with responses generated between May and June 2023.

To ensure fair evaluation, different prompt patterns were utilized, including "GPT4-Simple," "GPT4-Expert," and "GPT4-Short," each influencing the style and content of the model's responses. Examination questions were transcribed or copied into the query format, with figures replaced by textual descriptions to maintain uniformity. Blinded grading processes were implemented where possible to minimize bias, with GPT-4 responses handwritten or copied into exam forms to mirror student submissions.
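As a rough illustration of how named prompt patterns can wrap the same examination question, here is a minimal Python sketch. The pattern names come from the study as reported above, but the template wording, the build_prompt helper, and its field and max_words parameters are hypothetical; the article does not give the actual prompt text used by the researchers.

```python
# Hypothetical templates: the study's real prompt wording is not given here.
PROMPT_PATTERNS = {
    # Question passed through unchanged.
    "GPT4-Simple": "{question}",
    # Question preceded by an assumed expert-persona instruction.
    "GPT4-Expert": "Answer as an expert in {field}.\n\n{question}",
    # Question preceded by an assumed length limit.
    "GPT4-Short": "Answer in no more than {max_words} words.\n\n{question}",
}


def build_prompt(pattern, question, field="biomedical science", max_words=100):
    """Return the query text for one question under one prompt pattern."""
    template = PROMPT_PATTERNS[pattern]  # raises KeyError for unknown patterns
    # str.format ignores keyword arguments a template does not use.
    return template.format(question=question, field=field, max_words=max_words)


if __name__ == "__main__":
    for name in PROMPT_PATTERNS:
        print(f"--- {name} ---")
        print(build_prompt(name, "Describe the role of reverse transcriptase."))
```

Keeping the question text fixed and varying only the wrapper is what allows grades to be compared across patterns.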

Additionally, hand-drawn diagrams were included to match the style of student answers. Student performance data, obtained anonymously, facilitated comparison with GPT-4 scores. Statistical analysis was performed in Microsoft Excel: the variance between student and GPT-4 scores was assessed, and scores obtained with the different prompt patterns were compared.

Surveys collected instructor opinions on the efficacy of GPT-4 responses. Exploratory queries further investigated the model's knowledge base and responses to varied stimuli. The manuscript was prepared to adhere to standard conventions, utilizing appropriate software for figure creation and textual editing. The authors contributed valuable insights into the potential applications and limitations of AI LLMs in educational assessment contexts.

Key Results

GPT-4 responses were evaluated against student performance, with the model meeting or exceeding the average student score in seven out of nine examinations, outperforming all students in four cases. However, instances of plagiarism were identified in one course, highlighting the need for vigilance in utilizing AI-generated content. Notably, examinations containing figures posed challenges for GPT-4, resulting in lower performance compared to textual questions.
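The two comparisons reported above (meeting or exceeding the student average, and outperforming every student) can be sketched in Python as follows. This is only an illustration: the function names are hypothetical, the scores in the usage example are made up and are not data from the study, and the authors performed their own analysis in Microsoft Excel.

```python
from statistics import mean


def compare_to_students(model_score, student_scores):
    """Compare one model score with the student scores for the same exam."""
    avg = mean(student_scores)
    return {
        "student_mean": avg,
        "meets_or_exceeds_mean": model_score >= avg,
        "outperforms_all": model_score > max(student_scores),
    }


def summarize(exams):
    """Count exams where the model meets the mean and where it beats everyone.

    `exams` maps an exam name to a (model_score, student_scores) pair.
    """
    results = [compare_to_students(m, s) for m, s in exams.values()]
    met_mean = sum(r["meets_or_exceeds_mean"] for r in results)
    beat_all = sum(r["outperforms_all"] for r in results)
    return met_mean, beat_all


if __name__ == "__main__":
    # Made-up percentages for illustration only; not results from the study.
    example_exams = {
        "exam_a": (92.0, [70.0, 85.0, 90.0]),
        "exam_b": (80.0, [75.0, 85.0, 95.0]),
        "exam_c": (60.0, [70.0, 80.0, 90.0]),
    }
    print(summarize(example_exams))
```

The "outperforms all" check uses a strict comparison, so a tie with the top student does not count as outperforming.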

Exploratory queries revealed the model's ability to interpret scientific figures, albeit with occasional hallucinated descriptions. Instructor surveys reflected a mixture of surprise at GPT-4's quality, concerns about student misuse, and uncertainty about its impact on learning outcomes. Overall, the researchers offered valuable insights into the capabilities and limitations of AI LLMs in academic assessments, prompting considerations for future educational practices and technology integration.

Discussion

GPT-4 performed comparably to above-average graduate students in most examinations, excelling in short-answer and essay questions but struggling with questions based on figures and requiring hand-drawn answers. Prompt patterns did not significantly impact answer grades, suggesting promise in utilizing the model as an "answer engine." However, GPT-4's performance varied across different question types and domains, necessitating caution in its use. Modifications to assessments were proposed to mitigate the influence of LLMs, including the incorporation of complex figures and the avoidance of basic, easily accessible knowledge.

Despite its capabilities, the authors cautioned against relying solely on LLMs like GPT-4 for reference information because of the risk of false responses. Overall, while LLMs offered convenience, their integration into education required careful consideration of their strengths and limitations to preserve academic integrity and efficacy.

Conclusion

In conclusion, the researchers provided a comprehensive evaluation of GPT-4's performance in graduate-level biomedical science examinations, highlighting its strengths in answering various question formats and its limitations in handling questions involving figures and hand-drawn answers. The findings underscored the need for caution in relying solely on AI LLMs like GPT-4 for educational assessments.

While these models offered convenience, their integration should be accompanied by measures to mitigate potential misuse and uphold academic integrity. Moving forward, thoughtful consideration of AI's role in education is essential to harness its benefits effectively while addressing associated challenges.

Journal reference:

Stribling, D., Xia, Y., Amer, M. K., Graim, K. S., Mulligan, C. J., & Renne, R. (2024). The model student: GPT-4 performance on graduate biomedical science exams. Scientific Reports, 14(1). https://doi.org/10.1038/s41598-024-55568-7, https://www.nature.com/articles/s41598-024-55568-7



Written by

Soham Nandi

Soham Nandi is a technical writer based in Memari, India. His academic background is in Computer Science Engineering, specializing in Artificial Intelligence and Machine learning. He has extensive experience in Data Analytics, Machine Learning, and Python. He has worked on group projects that required the implementation of Computer Vision, Image Classification, and App Development.


Citations

Please use one of the following formats to cite this article in your essay, paper or report:

APA
Nandi, Soham. (2024, March 19). AI in Academic Assessments: Evaluating GPT-4 in Biomedical Science Exams. AZoAi. Retrieved on July 18, 2025 from https://www.azoai.com/news/20240319/AI-in-Academic-Assessments-Evaluating-GPT-4-in-Biomedical-Science-Exams.aspx.
MLA
Nandi, Soham. "AI in Academic Assessments: Evaluating GPT-4 in Biomedical Science Exams". AZoAi. 18 July 2025. <https://www.azoai.com/news/20240319/AI-in-Academic-Assessments-Evaluating-GPT-4-in-Biomedical-Science-Exams.aspx>.
Chicago
Nandi, Soham. "AI in Academic Assessments: Evaluating GPT-4 in Biomedical Science Exams". AZoAi. https://www.azoai.com/news/20240319/AI-in-Academic-Assessments-Evaluating-GPT-4-in-Biomedical-Science-Exams.aspx. (accessed July 18, 2025).
Harvard
Nandi, Soham. 2024. AI in Academic Assessments: Evaluating GPT-4 in Biomedical Science Exams. AZoAi, viewed 18 July 2025, https://www.azoai.com/news/20240319/AI-in-Academic-Assessments-Evaluating-GPT-4-in-Biomedical-Science-Exams.aspx.