LLMs Automate Automated Essay Scoring

In a recent article published in the journal Computers and Education: Artificial Intelligence, researchers investigated the potential of large language models (LLMs) to automate the scoring of essays written by English language learners. Their goal was to evaluate these advanced artificial intelligence (AI) systems as tools for automated essay scoring (AES).

Study: LLMs Automate Automated Essay Scoring. Image Credit: Noin90650/Shutterstock


AES is the process of using technology to analyze and evaluate written work, usually by assigning a numerical score. It addresses the time, cost, and inconsistency associated with human scoring, but it also raises concerns about validity, reliability, transparency, and ethics.

Traditional AES systems use machine learning to extract specific features of writing, such as grammar, vocabulary, and coherence, and compare them against human-scored essays or set criteria. However, these systems are limited by their feature selection, the genre of writing, and their accessibility and cost.

LLMs are AI systems capable of generating natural language text. Although not specifically designed for AES, LLMs offer advantages over traditional systems, such as versatility and user interaction through chatbots like Chat Generative Pre-trained Transformer (ChatGPT), Bard, and Claude.

About the Research

In this paper, the authors explored the validity and reliability of generative LLMs in scoring student writing. Their primary goal was to evaluate how well four widely used LLMs assess essays written by English language learners: Google’s PaLM2, Anthropic’s Claude 2, and OpenAI’s Generative Pre-trained Transformer 3.5 (GPT-3.5) and GPT-4.

For this study, the researchers selected 119 essays from an English language university admission and placement test. Each essay was scored twice by each LLM on separate occasions and by two human raters using a holistic rubric. The main metrics for assessing the models' performance were intrarater reliability (consistency of scores given by the same rater over time) and interrater reliability (agreement between scores given by different raters). The authors also evaluated the validity of the LLMs' scores by comparing them to the human ratings. They measured these reliability metrics using the intraclass correlation coefficient (ICC) and Pearson’s correlation.
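To make the two statistics concrete, here is a minimal Python sketch of Pearson's correlation and one common form of the ICC. The article does not say which ICC variant the authors used, so the two-way random-effects, absolute-agreement, single-rater form, often written ICC(2,1), is an assumption here. The function names and the sample scores are hypothetical and are not taken from the study.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` is a list of rows, one row per essay and one column per
    rater (or per scoring occasion, for intrarater reliability).
    The choice of this ICC variant is an assumption, not the paper's.
    """
    n = len(ratings)       # number of essays
    k = len(ratings[0])    # number of raters or occasions
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical scores for five essays (illustration only)
human = [3, 4, 2, 5, 4]
model_run_1 = [3, 4, 3, 5, 4]
model_run_2 = [3, 5, 3, 5, 4]

# Intrarater reliability: the same model scoring on two occasions
print("intrarater ICC:", icc_2_1(list(zip(model_run_1, model_run_2))))
# Agreement with a human rater
print("model vs human ICC:", icc_2_1(list(zip(human, model_run_1))))
print("model vs human Pearson r:", pearson_r(human, model_run_1))
```

Pearson's correlation only captures whether two sets of scores rise and fall together, while this form of the ICC also penalizes a rater who is systematically harsher or more lenient, which is one reason the two are often reported side by side.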

The methodology involved a detailed analysis of the models' scoring patterns and their consistency over time. The study also examined potential reasons for any variability in the models' performance and offered insights into the strengths and weaknesses of each LLM in the context of AES.

Research Findings

The results showed that GPT-4 was the most reliable LLM, with excellent intrarater reliability and strong validity: its scores correlated highly with those of human raters, at a level comparable to traditional AES systems. Claude 2 demonstrated good intrarater reliability and moderate validity. PaLM2 and GPT-3.5 showed moderate intrarater and interrater reliability. Most LLMs, except GPT-3.5, improved their intrarater reliability over time. However, the interrater reliability of GPT-3.5 and GPT-4 decreased slightly over time.

The study also identified limitations in LLM performance, such as scoring on a continuous scale, completing unfinished sentences, hallucinating text features, and showing non-deterministic behavior. These issues could arise from factors like randomness in sampling, temperature settings, token limits, and model updates. Despite their advanced capabilities, LLMs can exhibit variability due to essay topic complexity, training data differences, and the distinct ways humans and AI assess writing.
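The link between temperature, random sampling, and non-deterministic scores can be illustrated with a short Python sketch of temperature-scaled softmax sampling. This is a generic illustration of the technique, not the internals of any model in the study. The logits, the candidate scores, and the function names are hypothetical, and treating a temperature of 0 as "always pick the most likely option" is a common convention that is assumed here.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn raw logits into probabilities after dividing by the temperature."""
    scaled = [value / temperature for value in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_score(logits, scores, temperature, rng):
    """Pick one candidate score; temperature 0 is treated as greedy (assumed)."""
    if temperature == 0:
        return scores[logits.index(max(logits))]
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(scores, weights=probs, k=1)[0]


# Hypothetical model preferences over three candidate essay scores
candidate_scores = [3, 4, 5]
candidate_logits = [0.5, 2.0, 1.0]

rng = random.Random()  # unseeded, so repeated runs may differ
for temperature in (0, 0.5, 1.0):
    draws = [
        sample_score(candidate_logits, candidate_scores, temperature, rng)
        for _ in range(10)
    ]
    print(f"temperature={temperature}: {draws}")
```

At temperature 0 the same score comes back every time, while higher temperatures spread probability across the alternatives, so the same essay can receive different scores on different occasions.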


This research has significant implications for the future of AES and educational technology. The demonstrated reliability and validity of models like GPT-4 suggest that they can be effectively integrated into educational environments to assist with essay grading. This integration could reduce the grading burden on educators, allowing them to focus more on teaching and providing personalized support to students.

Additionally, the adaptability of generative AI models extends beyond traditional essay assessments. They can be used for a variety of writing tasks, including creative writing and technical reports. Their accessibility and ease of use make them valuable tools for providing formative feedback, enabling students to improve their writing skills through immediate and detailed evaluations.


In summary, the LLMs, GPT-4 in particular, showed promise as tools for language assessment. However, the researchers cautioned that these models were not specifically designed for AES and that how they arrive at their scores is neither fully transparent nor fully understood. Moving forward, they emphasized the need for further research to evaluate the validity and reliability of LLMs across various contexts, writing genres, and assessment criteria.

They also underscored the importance of addressing the ethical and pedagogical implications of using LLMs for AES, and suggested that cross-disciplinary collaboration among computational linguists, machine learning experts, and language assessment experts could help fine-tune LLMs for the specific purpose of assessing language.


Written by

Muhammad Osama

Muhammad Osama is a full-time data analytics consultant and freelance technical writer based in Delhi, India. He specializes in transforming complex technical concepts into accessible content. He has a Bachelor of Technology in Mechanical Engineering with specialization in AI & Robotics from Galgotias University, India, and he has extensive experience in technical content writing, data science and analytics, and artificial intelligence.


Please use one of the following formats to cite this article in your essay, paper or report:

  • APA

    Osama, Muhammad. (2024, July 09). LLMs Automate Automated Essay Scoring. AZoAi. Retrieved on July 17, 2024 from https://www.azoai.com/news/20240709/LLMs-Automate-Automated-Essay-Scoring.aspx.

  • MLA

    Osama, Muhammad. "LLMs Automate Automated Essay Scoring". AZoAi. 17 July 2024. <https://www.azoai.com/news/20240709/LLMs-Automate-Automated-Essay-Scoring.aspx>.

  • Chicago

    Osama, Muhammad. "LLMs Automate Automated Essay Scoring". AZoAi. https://www.azoai.com/news/20240709/LLMs-Automate-Automated-Essay-Scoring.aspx. (accessed July 17, 2024).

  • Harvard

    Osama, Muhammad. 2024. LLMs Automate Automated Essay Scoring. AZoAi, viewed 17 July 2024, https://www.azoai.com/news/20240709/LLMs-Automate-Automated-Essay-Scoring.aspx.


The opinions expressed here are the views of the writer and do not necessarily reflect the views and opinions of AZoAi.