Different Types of Benchmarks for LLMs
Evaluating the capabilities of large language models (LLMs) involves using a variety of benchmarks that test different aspects of their knowledge, reasoning, and problem-solving skills. These benchmarks encompass a wide range of domains, from common sense and social reasoning to complex mathematical problem-solving and programming. Here's an introduction summarizing the key benchmarks used to assess LLMs:
MMLU
The MMLU (Massive Multitask Language Understanding) benchmark measures the breadth of knowledge and problem-solving ability a large language model has acquired during pretraining, using multiple-choice questions that span 57 subjects.
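To make the evaluation setup concrete, here is a minimal sketch (not the official harness) of how an MMLU-style item can be rendered as a lettered multiple-choice prompt. The template and the `format_mmlu_question` helper are illustrative assumptions; exact formatting varies between evaluation frameworks.

```python
def format_mmlu_question(question: str, choices: list[str], answer: int | None = None) -> str:
    """Render one MMLU-style item as a lettered multiple-choice prompt.

    `choices` holds the answer options (four for MMLU); `answer` is the index
    of the correct option and is filled in only for few-shot exemplars.
    """
    letters = "ABCD"
    lines = [question]
    lines += [f"{letters[i]}. {choice}" for i, choice in enumerate(choices)]
    lines.append("Answer:" + (f" {letters[answer]}" if answer is not None else ""))
    return "\n".join(lines)
```

The model is then scored on whether it produces, or assigns the highest probability to, the correct answer letter.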
HellaSwag
The HellaSwag benchmark challenges a language model's ability to understand and apply common sense reasoning by selecting the most logical ending to a story.
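HellaSwag, like several of the benchmarks below (PIQA, SIQA, ARC), is typically scored by comparing the model's likelihood of each candidate ending rather than by free-form generation. The sketch below assumes a `log_likelihood(context, continuation)` function supplied by your model or evaluation harness; it is a placeholder, not a real API.

```python
from typing import Callable, Sequence

def pick_best_ending(
    context: str,
    endings: Sequence[str],
    log_likelihood: Callable[[str, str], float],
) -> int:
    """Return the index of the candidate ending the model scores highest.

    `log_likelihood` is assumed to return the summed token log-probability of
    the continuation given the context. Length-normalizing the score (dividing
    by the number of continuation tokens) is a common variant.
    """
    scores = [log_likelihood(context, " " + ending) for ending in endings]
    return max(range(len(scores)), key=scores.__getitem__)
```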
PIQA
The PIQA benchmark tests a language model's ability to understand and apply physical commonsense knowledge by answering questions about everyday physical interactions.
SIQA
The SIQA benchmark evaluates a language model's understanding of social interactions and social common sense by asking questions about people’s actions and their social implications.
BoolQ
The BoolQ benchmark tests a language model's ability to answer naturally occurring yes/no questions (gathered in unprompted and unconstrained settings), probing its ability to perform real-world natural language inference.
Winogrande
The Winogrande benchmark tests a language model's ability to resolve ambiguous fill-in-the-blank tasks with binary options, requiring generalized commonsense reasoning.
CQA (7-shot)
The CQA benchmark assesses a language model's performance on multiple-choice question answering that requires different types of commonsense knowledge.
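The "7-shot" setting means seven solved exemplars are prepended to each test question before the model answers. Below is a rough sketch of how such a few-shot prompt is assembled; the "Question:"/"Answer:" template and dictionary keys are illustrative assumptions, not an official format.

```python
def build_few_shot_prompt(exemplars: list[dict], test_question: str, k: int = 7) -> str:
    """Prepend k solved exemplars to the test question, separated by blank lines.

    Each exemplar is assumed to be a dict with "question" and "answer" keys.
    Real harnesses typically fix the exemplars and their order so results are
    comparable across models.
    """
    blocks = [
        f"Question: {ex['question']}\nAnswer: {ex['answer']}"
        for ex in exemplars[:k]
    ]
    blocks.append(f"Question: {test_question}\nAnswer:")
    return "\n\n".join(blocks)
```

The same pattern explains the 5-shot and 4-shot settings noted for TriviaQA and MATH below.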
OBQA
The OBQA (OpenBookQA) benchmark evaluates a language model's ability to perform advanced question answering that requires multi-step reasoning, commonsense knowledge, and rich text comprehension, modeled after open-book exams.
ARC-e
The ARC-e benchmark, the Easy partition of the AI2 Reasoning Challenge (ARC) dataset, tests a language model's advanced question-answering skills with genuine grade-school-level, multiple-choice science questions.
ARC-c
The ARC-c benchmark is the harder Challenge partition of the ARC dataset, containing only questions answered incorrectly by both a retrieval-based and a word co-occurrence algorithm.
TriviaQA (5-shot)
The TriviaQA benchmark tests reading comprehension skills with question-answer-evidence triples.
HumanEval (pass@1)
The HumanEval benchmark tests a language model's code generation abilities by evaluating whether its solutions pass functional unit tests for programming problems.
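Pass@1 is the probability that a single sampled solution passes all of a problem's unit tests. When n samples are drawn per problem and c of them pass, pass@k is commonly computed with the unbiased combinatorial estimator from the paper that introduced HumanEval; the helper name below is illustrative.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k for a single problem.

    n: total samples generated, c: samples that passed all unit tests,
    k: the k in pass@k. If fewer than k samples failed, every size-k
    subset contains a passing sample, so the estimate is 1.0.
    """
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With one sample per problem (n=1, k=1), this reduces to the raw pass rate:
# pass_at_k(1, 1, 1) == 1.0 and pass_at_k(1, 0, 1) == 0.0
```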
MBPP
The MBPP benchmark tests a language model's ability to solve basic Python programming problems, focusing on fundamental programming concepts and standard library usage.
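MBPP problems ship with a few assert-style test cases, so grading reduces to executing a candidate solution against them. A bare-bones sketch follows; real harnesses run this in a sandboxed subprocess with timeouts, because the code being executed is untrusted model output.

```python
def passes_tests(candidate_code: str, test_asserts: list[str]) -> bool:
    """Run a generated solution and its assert-style tests in one namespace.

    A candidate counts as correct only if every assert passes. WARNING:
    exec() runs untrusted model output directly; use isolation in practice.
    """
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)   # define the candidate function(s)
        for test in test_asserts:
            exec(test, namespace)         # raises AssertionError on failure
        return True
    except Exception:
        return False
```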
GSM8K
The GSM8K benchmark tests a language model's ability to solve grade-school-level math problems that frequently require multiple steps of reasoning.
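GSM8K answers are single numbers, so a common (though not the only) scoring approach is to extract the final number from the model's generated reasoning and exact-match it against the reference answer, which in the dataset follows a "####" marker. The regex and normalization below are illustrative.

```python
import re

def extract_final_number(text: str) -> str | None:
    """Return the last integer or decimal appearing in the text, if any."""
    matches = re.findall(r"-?\d[\d,]*\.?\d*", text)
    return matches[-1].replace(",", "").rstrip(".") if matches else None

def gsm8k_correct(model_output: str, reference_solution: str) -> bool:
    """Exact-match the model's final number against the reference answer."""
    gold = reference_solution.split("####")[-1].strip().replace(",", "")
    pred = extract_final_number(model_output)
    return pred is not None and pred == gold
```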
MATH (4-shot)
The MATH benchmark evaluates a language model's ability to solve complex mathematical word problems, requiring reasoning, multi-step problem-solving, and the understanding of mathematical concepts.
AGIEval
The AGIEval benchmark tests a language model's general intelligence by using questions derived from real-world exams designed to assess human intellectual abilities (college entrance exams, law exams, etc.).
BBH
The BBH (BIG-Bench Hard) benchmark focuses on a subset of BIG-Bench tasks on which earlier language models failed to beat average human performance, testing their limits across various reasoning and understanding domains.
These diverse benchmarks provide a comprehensive evaluation of large language models, helping researchers understand their strengths and areas for improvement across various knowledge and reasoning domains.