LiveBench: A Dynamic Benchmark for Large Language Models

In an article recently submitted to the arXiv* server, researchers introduced LiveBench, a benchmark designed to prevent test set contamination and biases from large language model (LLM) judging and human crowdsourcing.

Study: LiveBench: A Dynamic Benchmark for Large Language Models. Image Credit: BOY ANTHONY/Shutterstock

*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as conclusive, guide clinical practice/health-related behavior, or treated as established information.

LiveBench features frequently updated questions drawn from recent sources, automatic scoring against objective ground-truth values, and a wide variety of challenging tasks. The benchmark includes harder, contamination-free versions of tasks from previous benchmarks such as Big-Bench Hard and Auxiliary Mathematics Problems and Solutions (AMPS). Evaluations of both closed and open-source models show that even the top models achieve low accuracy.

Related Work

Previous work has introduced several prominent LLM benchmarks relevant to the study. The Hugging Face Open LLM Leaderboard tracks LLM performance but is prone to test set contamination because its questions are static. Benchmarks like AlpacaEval, MT-Bench (a multi-turn benchmark), and Arena-Hard use LLM judges but suffer from biases and errors, while human-judged benchmarks like Chatbot Arena are labor-intensive and variable. Other benchmarks, such as LiveCodeBench (LCB) and the SEAL Benchmark, focus on specific tasks or use private questions. In contrast, LiveBench aims to be comprehensive and continuously updated.

LiveBench Overview

LiveBench comprises six categories: math, coding, reasoning, data analysis, instruction following, and language comprehension. Each category includes two to four tasks with questions drawn from recent information sources or from more challenging variants of existing benchmarks.

Tasks typically consist of approximately 50 questions, ranging in difficulty from easy to highly demanding, with a target overall success rate of 30-70% across top models. Prompts within each category are tailored to include zero-shot chain-of-thought instructions, ask models to guess when unsure, and require the final answer in an easily parseable format, marked by double asterisks.
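The answer format described above lends itself to simple automatic scoring. The following is a minimal sketch, not LiveBench's actual scoring code: the function names, the choice to take the last double-asterisk span, and the case-insensitive exact match are all assumptions made for illustration.

```python
import re
from typing import Optional

# Matches text wrapped in double asterisks, e.g. "**42**".
BOLD_ANSWER = re.compile(r"\*\*(.+?)\*\*", re.DOTALL)


def extract_answer(response: str) -> Optional[str]:
    """Return the last **...** span in a response, or None if there is none.

    Taking the last span is an assumption: a chain-of-thought response may
    emphasize intermediate values before stating its final answer.
    """
    matches = BOLD_ANSWER.findall(response)
    return matches[-1].strip() if matches else None


def score(response: str, ground_truth: str) -> int:
    """Score 1 if the extracted answer equals the ground truth, else 0."""
    answer = extract_answer(response)
    if answer is None:
        return 0
    return int(answer.lower() == ground_truth.strip().lower())


def accuracy(responses: list, truths: list) -> float:
    """Fraction of responses whose extracted answer matches the ground truth."""
    scores = [score(r, t) for r, t in zip(responses, truths)]
    return sum(scores) / len(scores)
```

Because the comparison is made against a fixed ground-truth string, no LLM judge is involved at any point, which is the property the benchmark relies on.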

The math category incorporates questions from recent high school competitions, fill-in-the-blank problems based on the proof-based United States of America Mathematical Olympiad (USAMO), and a more demanding version of the Auxiliary Mathematics Problems and Solutions (AMPS) dataset. The competition-based tasks assess arithmetic, algebra, geometry, number theory, and more complex mathematical problem-solving skills. Additionally, the AMPS_hard task includes synthetic questions that are more challenging than those found in the original AMPS dataset.

LiveBench's coding category includes two distinct tasks: an adapted version of the code generation task from LiveCodeBench (LCB) and a novel code completion task. The LCB Generation task evaluates a model's ability to interpret and correctly respond to a competition-level coding prompt using questions derived from the LiveCodeBench collection. Meanwhile, the Completion task focuses on the model's ability to finish partially correct solutions, sourced from GitHub, to LeetCode medium and hard problems: the final portion of each solution is omitted, and the LLM is prompted to complete it.
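The construction of a completion item can be sketched as follows. This is an illustration under stated assumptions rather than the authors' pipeline: `split_solution`, `build_prompt`, the line-based cut, and the `keep_fraction` default are all hypothetical.

```python
def split_solution(solution: str, keep_fraction: float = 0.7) -> tuple[str, str]:
    """Split a reference solution into a visible prefix and a held-out ending.

    The cut is made on line boundaries; keep_fraction is a hypothetical knob
    controlling how much of the solution the model gets to see.
    """
    lines = solution.splitlines()
    if len(lines) < 2:
        raise ValueError("need at least two lines to hold something out")
    cut = int(len(lines) * keep_fraction)
    # Always show at least one line and always hide at least one line.
    cut = max(1, min(cut, len(lines) - 1))
    return "\n".join(lines[:cut]), "\n".join(lines[cut:])


def build_prompt(problem: str, prefix: str) -> str:
    """Assemble a completion prompt from the problem text and the visible prefix."""
    return (
        f"Problem:\n{problem}\n\n"
        "Complete the partial solution below. Return only the missing code.\n\n"
        f"{prefix}\n"
    )
```

In a setup like this, the model's completion would be appended to the prefix and the stitched program checked against the problem's test cases, so the held-out ending serves only as a reference.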

The reasoning category of LiveBench includes tasks derived from BigBench Hard and Zebra puzzles. The Web of Lies v2 task expands on a challenge from BigBench, requiring models to evaluate the validity of a Boolean function presented in natural language with added deductive elements and misleading clues to heighten difficulty.
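To make the underlying idea concrete, here is a minimal sketch of a basic truth-teller chain, in the style of the original Big-Bench task. It does not reproduce the added deductive elements or misleading clues of Web of Lies v2, and every name in it (`generate_chain`, `solve`, `render`) is hypothetical.

```python
import random


def generate_chain(names: list, seed: int = 0):
    """Randomly decide whether the first person is truthful and what each
    later person claims about the person before them.

    A claim of True means "says the previous person tells the truth".
    """
    rng = random.Random(seed)
    first_truthful = rng.choice([True, False])
    claims = [rng.choice([True, False]) for _ in names[1:]]
    return first_truthful, claims


def solve(first_truthful: bool, claims: list) -> bool:
    """Return whether the last person in the chain tells the truth."""
    truthful = first_truthful
    for says_truth in claims:
        # A speaker is truthful exactly when their claim matches reality.
        truthful = (says_truth == truthful)
    return truthful


def render(names: list, first_truthful: bool, claims: list) -> str:
    """Turn the chain into a natural-language question."""
    verb = {True: "tells the truth", False: "lies"}
    parts = [f"{names[0]} {verb[first_truthful]}."]
    for prev, speaker, says_truth in zip(names, names[1:], claims):
        parts.append(f"{speaker} says {prev} {verb[says_truth]}.")
    parts.append(f"Does {names[-1]} tell the truth?")
    return " ".join(parts)
```

Since `solve` computes the ground truth directly from the generated structure, new questions can be produced and scored without human labeling.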

Similarly, the Zebra Puzzles task assesses models' ability to follow constraints and logically deduce information using procedurally generated puzzles.

In LiveBench's data analysis category, three tasks evaluate the LLM's skills in data manipulation and interpretation: column type annotation, table reformatting, and table join. Each tests a different aspect of the model's ability to handle structured data.
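As an illustration of how a table join answer could be scored objectively, the sketch below compares a predicted column mapping against a reference mapping. The use of F1 over column pairs is an assumption made here for illustration, since the article does not specify the metric, and the function name and column names are hypothetical.

```python
def mapping_f1(predicted: dict, gold: dict) -> float:
    """F1 score between predicted and reference column mappings.

    Each mapping sends a column name in table A to a column name in table B.
    F1 is an assumed metric for this sketch, not one stated in the article.
    """
    pred_pairs = set(predicted.items())
    gold_pairs = set(gold.items())
    if not pred_pairs or not gold_pairs:
        # Both empty counts as a perfect match; one empty counts as a miss.
        return float(pred_pairs == gold_pairs)
    true_positives = len(pred_pairs & gold_pairs)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_pairs)
    recall = true_positives / len(gold_pairs)
    return 2 * precision * recall / (precision + recall)


# Hypothetical reference mapping and a partially correct model answer.
gold = {"id": "user_id", "name": "full_name"}
predicted = {"id": "user_id", "name": "email"}
partial_credit = mapping_f1(predicted, gold)
```

A pair-level metric of this kind gives partial credit when some columns are matched correctly, which a strict exact-match check would not.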

LLM Benchmark Evaluation

The experimental setup involves 49 different LLMs, encompassing a mix of proprietary, large open-source, and small open-source models. These include various versions of generative pre-trained transformer (GPT) models like GPT-4 and GPT-3.5, Anthropic models such as Claude-3, Mistral models like mistral-large-2402 and mistral-small-2402, Google's Gemini-1.5 models, and a range of others from the open-source community, such as deepseek-coder-v2 and phi-3-small-128k-instruct.

The experiments evaluate these models across all 18 LiveBench tasks using standardized evaluation settings with FastChat templates and bfloat16 precision.
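One simple way to roll per-task results up into a single number is to average tasks within each category and then average the categories. The sketch below assumes that scheme purely for illustration, since the article does not describe the aggregation; the task names and scores are made-up placeholders, not reported results.

```python
from statistics import mean


def category_scores(task_scores: dict) -> dict:
    """Average the task scores within each category."""
    return {
        category: mean(list(tasks.values()))
        for category, tasks in task_scores.items()
    }


def overall_score(task_scores: dict) -> float:
    """Average the category scores so every category carries equal weight."""
    return mean(list(category_scores(task_scores).values()))


# Placeholder numbers for illustration only; these are not LiveBench results.
example = {
    "math": {"task_a": 40.0, "task_b": 60.0},
    "coding": {"task_c": 30.0},
}
example_overall = overall_score(example)
```

Averaging by category first keeps a category with four tasks from outweighing one with two, which matters when categories differ in size as they do here.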

The authors then compare model performance on LiveBench with performance on established benchmarks. Notably, models show varying strengths across benchmarks, with some performing significantly better on one benchmark than on another.

For instance, models like GPT-4-0125-preview and GPT-4-turbo-2024-04-09 exhibit notably stronger results on Arena-Hard, potentially influenced by biases associated with using GPT-4 as the judging LLM. These findings underscore the importance of comprehensively considering benchmark-specific biases and preferences in evaluating LLM capabilities.


To summarize this work, LiveBench was introduced as an LLM benchmark to address issues like test set contamination and reliance on LLM judging and human crowdsourcing. It was the first benchmark to incorporate regularly updated questions sourced from recent information, with difficulty increasing over time. Answers were objectively scored based on ground-truth values, eliminating the need for LLM judges. LiveBench featured various challenging tasks in math, coding, reasoning, language, instruction following, and data analysis.

Future work for LiveBench will expand task repositories to cover emerging artificial intelligence (AI) and natural language processing (NLP) domains. Efforts will refine evaluation methods for enhanced benchmark robustness, fostering collaboration with the research community to drive innovation and advance LLM capabilities.


Journal reference:
  • Preliminary scientific report. White, C., et al. (2024). LiveBench: A Challenging, Contamination-Free LLM Benchmark. ArXiv. DOI:10.48550/arXiv.2406.19314

Written by

Silpaja Chandrasekar

Dr. Silpaja Chandrasekar has a Ph.D. in Computer Science from Anna University, Chennai. Her research expertise lies in analyzing traffic parameters under challenging environmental conditions. Additionally, she has gained valuable exposure to diverse research areas, such as detection, tracking, classification, medical image analysis, cancer cell detection, chemistry, and Hamiltonian walks.


Please use one of the following formats to cite this article in your essay, paper or report:

  • APA

    Chandrasekar, Silpaja. (2024, July 09). LiveBench: A Dynamic Benchmark for Large Language Models. AZoAi. Retrieved on July 17, 2024 from


The opinions expressed here are the views of the writer and do not necessarily reflect the views and opinions of AZoAi.