AI in Academic Assessments: Evaluating GPT-4 in Biomedical Science Exams

In an article published in the journal Nature, researchers evaluated the performance of generative pre-trained transformer (GPT)-4, a large language model (LLM), on nine graduate-level biomedical science examinations. GPT-4 generally outperformed student averages on various question formats but struggled with questions involving simulated data and hand-drawn answers. Instances of plagiarism and hallucinations were identified. The study aimed to inform the design of future academic examinations in light of the capabilities and limitations of artificial intelligence (AI) LLMs like GPT-4.

Study: AI in Academic Assessments: Evaluating GPT-4 in Biomedical Science Exams. Image credit: 1st footage/Shutterstock


Background

LLM AI systems, such as OpenAI's GPT-3.5 and GPT-4, have revolutionized text generation tasks since their emergence. These models, particularly as exemplified by the ChatGPT chatbot, have demonstrated remarkable capabilities in generating human-like text, impacting various domains previously exclusive to human expertise. The release of ChatGPT marked a significant advancement, lowering barriers to access and expanding the utility of GPT-based models in diverse fields.

Prior research has evaluated the performance of LLMs in tasks ranging from standardized exam questions to professional certification exams. However, existing studies primarily focused on broad-based or discipline-specific exams, potentially influenced by readily available study materials online, thus limiting the assessment's accuracy in gauging the models' domain-specific knowledge and capabilities. Moreover, relying on standardized questions may not fully represent the depth and distribution of topics within real-world exams, raising questions about the assessment's validity.

The present paper addressed these gaps by assessing GPT-4's performance on graduate-level final examinations in the biomedical sciences, a field requiring both deep subject knowledge and critical thinking skills. By focusing on free-response questions typical of doctor of philosophy (Ph.D.) level assessments, the study provided a robust benchmark for evaluating GPT-4's ability to generate accurate and logically consistent responses in expert-level contexts.

Additionally, the study employed blinded grading to reduce potential bias and compared GPT-4's performance directly against student grades, offering insights into areas where the model excelled or fell short. Through this comprehensive evaluation, the researchers contributed to a deeper understanding of LLM capabilities in scientific domains, informing the design of future academic examinations amidst the advent of AI-driven text generation technologies. 


Methods

The methods employed in this study aimed to rigorously assess the performance of the GPT-4 LLM in answering graduate-level examination questions in biomedical sciences. Nine courses from the University of Florida provided examinations administered from March to May 2023, encompassing various question formats such as short response, fill-in-the-blank, essay, and diagram-drawing questions. GPT-4 was queried either interactively via ChatGPT or programmatically using the OpenAI API, with responses generated between May and June 2023.

To ensure fair evaluation, different prompt patterns were utilized, including "GPT4-Simple," "GPT4-Expert," and "GPT4-Short," each influencing the style and content of the model's responses. Examination questions were transcribed or copied into the query format, with figures replaced by textual descriptions to maintain uniformity. Blinded grading processes were implemented where possible to minimize bias, with GPT-4 responses handwritten or copied into exam forms to mirror student submissions.
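The programmatic querying described above can be sketched roughly as follows, using the OpenAI Python client. The system-prompt wordings for the three patterns are illustrative assumptions only; the study's exact "GPT4-Simple," "GPT4-Expert," and "GPT4-Short" prompts are not reproduced here, and `ask_gpt4` is a hypothetical helper name.

```python
# Hypothetical system prompts for each of the three prompt patterns
# named in the study (wordings are assumptions, not the paper's own).
PROMPT_PATTERNS = {
    "GPT4-Simple": "Answer the following exam question.",
    "GPT4-Expert": ("You are an expert in the biomedical sciences. "
                    "Answer the following graduate-level exam question."),
    "GPT4-Short": ("Answer the following exam question as briefly "
                   "as possible."),
}

def build_messages(pattern: str, question: str) -> list[dict]:
    """Assemble the chat messages for one prompt pattern."""
    return [
        {"role": "system", "content": PROMPT_PATTERNS[pattern]},
        {"role": "user", "content": question},
    ]

def ask_gpt4(pattern: str, question: str) -> str:
    """Send one exam question to GPT-4 (requires an OpenAI API key)."""
    from openai import OpenAI  # pip install openai
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4",
        messages=build_messages(pattern, question),
    )
    return response.choices[0].message.content
```

Keeping the message assembly separate from the API call, as above, makes it easy to run every exam question through each prompt pattern in turn and compare the graded answers.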

Additionally, hand-drawn diagrams were included to match the style of student answers. Student performance data, obtained anonymously, facilitated comparison with GPT-4 scores. Statistical analysis was performed in Microsoft Excel: the variance between student and GPT-4 scores was assessed, and the different prompt patterns were compared.
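As a rough illustration of the kind of score comparison the authors performed in Excel, the snippet below summarizes a single GPT-4 exam score against a class distribution. All numbers are invented for illustration and do not come from the study.

```python
# Compare one GPT-4 exam score against the class distribution.
# All scores below are hypothetical placeholders.
from statistics import mean, stdev

student_scores = [62.0, 71.5, 78.0, 81.0, 84.5, 88.0, 90.5, 93.0]
gpt4_score = 91.0

avg, sd = mean(student_scores), stdev(student_scores)
z = (gpt4_score - avg) / sd  # standardized distance from the class mean
pct_beaten = sum(s < gpt4_score for s in student_scores) / len(student_scores)

print(f"class mean = {avg:.1f}, sd = {sd:.1f}")
print(f"GPT-4 z-score = {z:.2f}; outscored {pct_beaten:.0%} of students")
```

Repeating such a summary per examination is one simple way to express results like "met or exceeded the average student score in seven of nine examinations."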

Surveys collected instructor opinions on the efficacy of GPT-4's responses, and exploratory queries further probed the model's knowledge base and its responses to varied stimuli. Together, these methods yielded insights into the potential applications and limitations of AI LLMs in educational assessment contexts.

Key Results

GPT-4 responses were evaluated against student performance, with the model meeting or exceeding the average student score in seven out of nine examinations, outperforming all students in four cases. However, instances of plagiarism were identified in one course, highlighting the need for vigilance in utilizing AI-generated content. Notably, examinations containing figures posed challenges for GPT-4, resulting in lower performance compared to textual questions.

Exploratory queries revealed the model's ability to interpret scientific figures, albeit with occasional hallucinated descriptions. Instructor surveys reflected a mixture of surprise at GPT-4's quality, concerns about student misuse, and uncertainty about its impact on learning outcomes. Overall, the researchers offered valuable insights into the capabilities and limitations of AI LLMs in academic assessments, prompting considerations for future educational practices and technology integration. 


Discussion

GPT-4 performed comparably to above-average graduate students in most examinations, excelling in short-answer and essay questions but struggling with questions based on figures and requiring hand-drawn answers. Prompt patterns did not significantly impact answer grades, suggesting promise in utilizing the model as an "answer engine." However, GPT-4's performance varied across different question types and domains, necessitating caution in its use. Modifications to assessments were proposed to mitigate the influence of LLMs, including the incorporation of complex figures and the avoidance of basic, easily accessible knowledge.

Despite its capabilities, the researchers cautioned against relying solely on LLMs like GPT-4 for reference information because of the risk of false responses. Overall, while LLMs offered convenience, their integration into education required careful consideration of their strengths and limitations to preserve academic integrity and efficacy.


In conclusion, the researchers provided a comprehensive evaluation of GPT-4's performance in graduate-level biomedical science examinations, highlighting its strengths in answering various question formats and its limitations in handling questions involving figures and hand-drawn answers. The findings underscored the need for caution in relying solely on AI LLMs like GPT-4 for educational assessments.

While these models offered convenience, their integration should be accompanied by measures to mitigate potential misuse and uphold academic integrity. Moving forward, thoughtful consideration of AI's role in education is essential to harness its benefits effectively while addressing associated challenges.


Written by

Soham Nandi

Soham Nandi is a technical writer based in Memari, India. His academic background is in Computer Science Engineering, specializing in Artificial Intelligence and Machine learning. He has extensive experience in Data Analytics, Machine Learning, and Python. He has worked on group projects that required the implementation of Computer Vision, Image Classification, and App Development.




