Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

Abstract: Evaluating large language model (LLM) based chat assistants is challenging due to their broad capabilities and the inadequacy of existing benchmarks in measuring human preferences. To address this, we explore using strong LLMs as judges to evaluate these models on more open-ended questions. We examine the usage and limitations of LLM-as-a-judge, including position, verbosity, and self-enhancement biases, as well as limited reasoning ability, and propose solutions to mitigate some of them. We then verify the agreement between LLM judges and human preferences by introducing two benchmarks: MT-bench, a multi-turn question set; and Chatbot Arena, a crowdsourced battle platform. Our results reveal that strong LLM judges like GPT-4 can match both controlled and crowdsourced human preferences well, achieving over 80% agreement, the same level of agreement between humans. Hence, LLM-as-a-judge is a scalable and explainable way to approximate human preferences, which are otherwise very expensive to obtain. Additionally, we show our benchmark and traditional benchmarks complement each other by evaluating several variants of LLaMA and Vicuna. The MT-bench questions, 3K expert votes, and 30K conversations with human preferences are publicly available at https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge.

Synopsis

Overview

  • Keywords: Large Language Models, Evaluation, Human Preferences, MT-Bench, Chatbot Arena
  • Objective: To evaluate large language models (LLMs) as judges for assessing chatbot performance using new benchmarks.
  • Hypothesis: LLMs can effectively approximate human preferences in evaluating chatbot responses.
  • Innovation: Introduction of two new benchmarks (MT-bench and Chatbot Arena) and the concept of LLM-as-a-judge for scalable evaluation.

Background

  • Preliminary Theories:

    • Reinforcement Learning from Human Feedback (RLHF): A method where models are fine-tuned based on human preferences, enhancing their alignment with user expectations.
    • Bias in Evaluation: Existing evaluation methods often exhibit biases such as position bias, verbosity bias, and self-enhancement bias, affecting the reliability of results.
    • Multi-turn Dialogue: The complexity of human conversation necessitates evaluation methods that can assess models over multiple exchanges rather than isolated responses.
  • Prior Research:

    • Traditional benchmarks such as MMLU and HELM focus on closed-ended tasks, which limits their ability to assess conversational capabilities.
    • The emergence of instruction-following benchmarks that expanded evaluation criteria but still lacked depth in open-ended dialogue assessment.
    • Earlier work used LLMs as evaluators without a systematic analysis of their reliability or biases, leading to inconsistent results.

Methodology

  • Key Ideas:

    • MT-bench: A benchmark featuring multi-turn questions designed to evaluate conversational and instruction-following abilities.
    • Chatbot Arena: A crowdsourced platform where users engage with multiple chatbots and vote on their preferences, providing real-world evaluation data.
    • LLM-as-a-Judge: Utilizing advanced LLMs like GPT-4 to evaluate chatbot responses, aiming for scalability and efficiency in preference assessment.
  • Experiments:

    • MT-bench Evaluation: Involves 80 multi-turn questions across eight categories (writing, roleplay, extraction, reasoning, math, coding, and STEM and humanities knowledge), with responses evaluated by both LLM judges and human experts.
    • Chatbot Arena: Collects user votes on chatbot interactions, yielding a dataset of 30,000 votes for analysis.
    • Bias Analysis: Examines biases in LLM judgments, including position bias and verbosity bias, and evaluates mitigations such as swapping answer positions, few-shot examples, and chain-of-thought or reference-guided judging (a minimal judging sketch follows this list).
  • Implications: The methodology allows for a more nuanced understanding of chatbot performance, bridging the gap between traditional benchmarks and human-centric evaluation.
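
To make the LLM-as-a-judge protocol and the position-bias mitigation concrete, the sketch below shows pairwise judging in which each pair is judged twice with the answer order swapped and only a consistent verdict counts; an inconsistent outcome is treated as a tie. This is a minimal illustration, not the released FastChat implementation: the prompt is paraphrased from the paper's pairwise template, and `call_judge` is a placeholder for whatever chat-completion client (e.g. GPT-4) serves as the judge.

```python
# Minimal sketch of pairwise LLM-as-a-judge with position-swap debiasing.
# `call_judge` is an assumed placeholder for a chat-completion client,
# not part of the paper's released code.

from typing import Callable

# Paraphrased from the paper's pairwise judging template.
JUDGE_PROMPT = """[System]
Please act as an impartial judge and evaluate the quality of the responses
provided by two AI assistants to the user question displayed below. Output
your final verdict strictly as "[[A]]", "[[B]]", or "[[C]]" for a tie.

[User Question]
{question}

[The Start of Assistant A's Answer]
{answer_a}
[The End of Assistant A's Answer]

[The Start of Assistant B's Answer]
{answer_b}
[The End of Assistant B's Answer]
"""


def parse_verdict(text: str) -> str:
    """Map the judge's raw output to 'A', 'B', or 'tie'."""
    if "[[A]]" in text:
        return "A"
    if "[[B]]" in text:
        return "B"
    return "tie"


def pairwise_judge(
    question: str,
    answer_a: str,
    answer_b: str,
    call_judge: Callable[[str], str],
) -> str:
    """Judge the pair twice with the answer order swapped; only a consistent
    verdict counts as a win, otherwise the result is a tie."""
    first = parse_verdict(call_judge(JUDGE_PROMPT.format(
        question=question, answer_a=answer_a, answer_b=answer_b)))
    swapped = parse_verdict(call_judge(JUDGE_PROMPT.format(
        question=question, answer_a=answer_b, answer_b=answer_a)))
    # In the swapped call, "[[A]]" refers to the original answer B.
    second = {"A": "B", "B": "A", "tie": "tie"}[swapped]
    return first if first == second else "tie"
```

In practice `call_judge` would wrap a chat-completion request to the judge model (GPT-4 in the paper); the same idea extends to the single-answer grading mode, where the judge scores one response on a 1-10 scale instead of comparing two.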

Findings

  • Outcomes:

    • LLM judges, particularly GPT-4, achieved over 80% agreement with human preferences, the same level of agreement observed between humans (a minimal agreement calculation is sketched after this section).
    • The introduction of MT-bench and Chatbot Arena provided valuable datasets that enhance the evaluation landscape for LLMs.
    • Identified biases in LLM evaluations, such as position bias, which can skew results but can be mitigated through methodological adjustments.
  • Significance: The research highlights the limitations of traditional benchmarks and proposes a hybrid evaluation framework that combines capability-based and preference-based assessments.

  • Future Work: Suggested avenues include expanding the scope of benchmarks, developing open-source LLM judges, and enhancing reasoning capabilities in models.

  • Potential Impact: Advancements in evaluation methods could lead to improved chatbot performance, better alignment with user needs, and more efficient evaluation processes in AI development.
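
The headline agreement numbers reduce to a matching rate between two sets of verdicts over the same question/model pairs. The sketch below is a simplified illustration (one verdict per judge per item) rather than the paper's exact procedure, which treats agreement as the probability that randomly chosen judges of each type agree on a randomly chosen question; the dictionaries and their keys are assumptions for the example.

```python
# Minimal sketch of the agreement metric: the fraction of shared items on
# which two judges (e.g. GPT-4 and a human) reach the same verdict.
# Verdict labels and item ids below are illustrative assumptions.

def agreement(votes_judge_1: dict, votes_judge_2: dict,
              include_ties: bool = True) -> float:
    """votes_* map an item id to a verdict in {"A", "B", "tie"}."""
    shared = votes_judge_1.keys() & votes_judge_2.keys()
    if not include_ties:
        shared = {k for k in shared
                  if votes_judge_1[k] != "tie" and votes_judge_2[k] != "tie"}
    if not shared:
        return 0.0
    matches = sum(votes_judge_1[k] == votes_judge_2[k] for k in shared)
    return matches / len(shared)


# Example: three items, two matching verdicts -> agreement of about 0.67.
gpt4_votes = {"q1": "A", "q2": "B", "q3": "tie"}
human_votes = {"q1": "A", "q2": "B", "q3": "A"}
print(f"agreement: {agreement(gpt4_votes, human_votes):.2f}")
```

The paper reports agreement both with and without tie votes; the headline "over 80%" figure matches the level of agreement among human judges themselves.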

Notes

Meta

Published: 2023-06-09

Updated: 2025-08-27

URL: https://arxiv.org/abs/2306.05685v3

Authors: Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, Ion Stoica

Citations: 1594

H Index: 276

Categories: cs.CL, cs.AI

Model: gpt-4o-mini