In a study submitted to the arXiv* server, researchers investigated fine-tuning large language models (LLMs) as scalable judges to evaluate LLMs in open-ended benchmarks. As LLMs like Chat Generative Pre-trained Transformer (ChatGPT) and GPT-3 demonstrate remarkable capabilities in open-ended tasks, evaluating them with existing benchmarks and metrics becomes challenging. To address this, the researchers propose JudgeLM, scalable LLM judges that are trained to grade the quality of LLM-generated responses.
 Study: JudgeLM: Scalable Language Models for Evaluating Large Language Models. Image credit: Generated using DALL.E.3

 *Important notice:  arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
Recent advances in foundation models like GPT-3 and T5 have enabled the creation of powerful LLMs through instruction tuning, such as ChatGPT and GPT-4. These models exhibit strong few-shot learning abilities across diverse tasks. However, evaluating their open-ended capabilities using existing benchmarks (like SuperGLUE) and metrics (like BLEU) has proven inadequate. Alternative evaluation methods based on human assessments or closed-source LLMs as judges have downsides like high cost, bias, privacy concerns and instability. This underscores the need for reproducible, efficient LLM judge models to evaluate LLMs in open-ended scenarios accurately.
Data Generation
The dataset comprises over 100,000 samples with seed tasks, LLM responses, and judgments from GPT-4. Seed tasks are drawn from diverse sources to ensure heterogeneity. LLM responses are collated from leading models like LLaMA and Vicuna. Judgments include scores and detailed reasoning for response pairs, with and without reference answers. This high-quality data enables training judges to score responses reliably, even when external context is supplied.
The data generation process involves three key steps. First, over 100,000 seed tasks are sampled from diverse instruction-tuning datasets to create a heterogeneous set of questions and prompts. Second, responses to these seed tasks are gathered from 11 popular LLMs, including LLaMA, Vicuna, and Alpaca. Third, these responses are fed alongside the seed tasks into GPT-4 to obtain fine-grained scores and reasoning judgments for pairs of responses. Two judgments are collected per response pair: one with reference answers and one without. This yields a rich training source with judgments adaptable to both settings.
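The third step can be pictured with a minimal sketch. It assumes a hypothetical `teacher` callable standing in for the GPT-4 judge and a hypothetical `JudgeSample` record; neither name nor the field layout comes from the study, and the real pipeline may store its data differently. The sketch only shows how one response pair yields two judgments, one without and one with a reference answer.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Hypothetical record layout; the study's actual schema may differ.
@dataclass
class JudgeSample:
    task: str
    answer_a: str
    answer_b: str
    reference: Optional[str]  # None in the "without reference" setting
    score_a: float
    score_b: float
    reasoning: str

# Hypothetical teacher signature: (task, answer_a, answer_b, reference)
# -> (score_a, score_b, reasoning). In the study this role is played by GPT-4.
Teacher = Callable[[str, str, str, Optional[str]], Tuple[float, float, str]]

def judge_pair(task: str, answer_a: str, answer_b: str,
               reference: str, teacher: Teacher) -> List[JudgeSample]:
    """Collect two judgments for one response pair: without and with reference."""
    samples = []
    for ref in (None, reference):
        score_a, score_b, reasoning = teacher(task, answer_a, answer_b, ref)
        samples.append(
            JudgeSample(task, answer_a, answer_b, ref, score_a, score_b, reasoning)
        )
    return samples
```

Keeping the reference as an optional field lets the same record type serve both settings, which is one simple way to make the judgments "adaptable to both settings" as described above.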
Model Training
JudgeLMs are initialized from base LLM checkpoints like Vicuna and fine-tuned on the judge dataset using templates that frame judging as a grading task. Multiple JudgeLMs are trained at sizes from 7B to 33B parameters to analyze size-capability tradeoffs.
The model training process formulates response scoring as an instruction-following task. JudgeLMs leverage the strong few-shot learning capabilities of modern LLMs. They are initialized with weights from base models like Vicuna-7B or Vicuna-33B, which provide foundational language skills. JudgeLMs are then fine-tuned on the released dataset using prompt templates that frame judging as grading paired responses. To study scaling trends, JudgeLMs ranging from 7B to 33B parameters are trained. Larger JudgeLMs generally achieve higher performance but at a proportionally higher computing cost.
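To make the idea of a grading template concrete, here is a minimal sketch. The template wording, the 1 to 10 scale, the section labels, and the helper names `build_prompt` and `parse_scores` are all illustrative assumptions, not the templates used in the study. The sketch shows one plausible way to frame a paired comparison as an instruction and to read two scores back from the judge's output.

```python
from typing import Optional, Tuple

def build_prompt(question: str, answer_1: str, answer_2: str,
                 reference: Optional[str] = None) -> str:
    """Frame a paired comparison as a grading instruction (hypothetical wording)."""
    parts = [
        "Grade the two answers to the question below on a scale of 1 to 10.",
        "Write both scores on the first line, separated by a space, then explain.",
        f"[Question]\n{question}",
    ]
    # The reference block is included only in the "with reference" setting.
    if reference is not None:
        parts.append(f"[Reference Answer]\n{reference}")
    parts.append(f"[Answer 1]\n{answer_1}")
    parts.append(f"[Answer 2]\n{answer_2}")
    return "\n\n".join(parts)

def parse_scores(judge_output: str) -> Tuple[float, float]:
    """Read the two scores from the first line of the judge's output.

    Raises an exception if the output is empty or the first line does not
    start with two numbers.
    """
    first_line = judge_output.strip().splitlines()[0]
    score_1, score_2 = (float(tok) for tok in first_line.split()[:2])
    return score_1, score_2
```

Asking for the scores before the explanation keeps parsing simple, since only the first line of the output has to be read.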
Evaluation Protocol
JudgeLMs are evaluated on agreement with GPT-4 judgments and consistency when answers are swapped. Position bias, knowledge bias, and format bias are measured to study inherent limitations. Objective metrics like accuracy and subjective metrics like human alignment are reported on existing and new benchmarks.
The judges are evaluated using a rigorous protocol assessing objective and subjective capabilities. Quantitative metrics evaluate agreement with the GPT-4 teacher judgments and consistency when response positions are swapped. These checks expose biases such as position bias, knowledge bias, and format bias, offering insight into the limits of the judges' reliability.
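The two quantitative metrics can be sketched as follows. This is a simplified illustration under stated assumptions: each judgment is reduced to a pair of scores, the verdict is whichever answer scores higher, and the function names are hypothetical. The study's exact metric definitions may differ.

```python
from typing import List, Tuple

Score = Tuple[float, float]  # (score for first answer, score for second answer)

def verdict(scores: Score) -> str:
    """Reduce a score pair to a preference: 'first', 'second', or 'tie'."""
    first, second = scores
    if first > second:
        return "first"
    if second > first:
        return "second"
    return "tie"

def agreement(judge: List[Score], teacher: List[Score]) -> float:
    """Fraction of pairs where the judge's verdict matches the teacher's."""
    matches = sum(verdict(j) == verdict(t) for j, t in zip(judge, teacher))
    return matches / len(judge)

def swap_consistency(original: List[Score], swapped: List[Score]) -> float:
    """Fraction of pairs whose verdict survives swapping the answer order.

    swapped[i] holds the scores for the same pair presented in reverse
    order, so a consistent judge should flip its positional verdict.
    """
    flip = {"first": "second", "second": "first", "tie": "tie"}
    consistent = sum(
        verdict(o) == flip[verdict(s)] for o, s in zip(original, swapped)
    )
    return consistent / len(original)
```

A judge that always favors whichever answer appears first would score poorly on `swap_consistency` even if its `agreement` looked reasonable, which is why the two numbers are reported together.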
To measure human alignment, judgments are compared against expert and crowd annotations on existing and new benchmarks. The evaluation spans diverse settings, including scoring single responses, ranking multiple responses, multimodal tasks, and dialogue evaluation. This multi-faceted protocol evaluates JudgeLMs' capabilities holistically.
Results
The best JudgeLM achieves over 90% agreement with GPT-4, exceeding typical human consistency rates. Agreement tends to improve with model scale, with the 33B JudgeLM performing the strongest.
On PandaLM, an existing benchmark for LLM judges, the JudgeLMs achieve new state-of-the-art results. Notably, the 33B JudgeLM surpasses its GPT-4 teacher in accuracy.
JudgeLMs also scale efficiently. A JudgeLM-7B can judge 5,000 responses in just 3 minutes on eight graphics processing units (GPUs), drastically faster than previous methods, and the models are over 100 times cheaper to run than GPT-4. These findings underscore the viability of using fine-tuned LLM judges for reliable open-ended LLM evaluation.
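The reported efficiency figure can be turned into throughput with simple arithmetic. The sketch below uses only the numbers quoted above (5,000 responses, 3 minutes, 8 GPUs); the function name is hypothetical, and the result is a back-of-the-envelope estimate rather than a measured benchmark.

```python
from typing import Tuple

def throughput(n_responses: int, minutes: float, n_gpus: int) -> Tuple[float, float, float]:
    """Return (responses per second, responses per GPU-second, total GPU-minutes)."""
    seconds = minutes * 60
    per_second = n_responses / seconds
    per_gpu_second = per_second / n_gpus
    gpu_minutes = minutes * n_gpus
    return per_second, per_gpu_second, gpu_minutes

per_second, per_gpu_second, gpu_minutes = throughput(5000, 3, 8)
print(f"{per_second:.1f} responses/s overall")      # 27.8 responses/s overall
print(f"{per_gpu_second:.2f} responses/s per GPU")  # 3.47 responses/s per GPU
print(f"{gpu_minutes:.0f} GPU-minutes in total")    # 24 GPU-minutes in total
```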
JudgeLMs address key pain points of human evaluation like cost, bias, and scope constraints. Their quantifiable reliability, efficiency, and customizable nature enable autonomous LLM testing. The study also offers insights into biases that can degrade judge consistency, informing future work on robust judge architectures. Overall, JudgeLMs provide a scalable and reproducible solution for evaluating modern LLMs rapidly and accurately in the wild. Their continued development promises to accelerate the aligned deployment of increasingly capable LLMs.
Future Outlook
This research opens promising directions for future work on LLM judges. Two key priorities are scaling up judge models and dataset size. Larger judge models, augmented training data, and techniques like synthetic data hold the potential to boost capability further. Testing JudgeLMs on broader tasks and investigating sample efficiency also merit exploration. Enhancements like hybrid human-JudgeLM loops could improve robustness. Overall, advancing JudgeLMs as an autonomous, low-cost, and unbiased LLM testing framework could profoundly impact the development of aligned LLMs.
