Evaluation Harnesses
ℹ️ Information: Consider adding the available Dutch benchmarks to an evaluation harness and sharing it.
Context
An evaluation harness lets you run a standardized set of benchmarks against a model of your choice. This helps you fill gaps in your knowledge about the capabilities of the model you are interested in. The harness also lets you attach your own benchmarks by following well-documented conventions and structures.
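To make the idea concrete, here is a minimal sketch of what a harness does, written in plain Python. It is not the API of any real harness: the names `register_benchmark`, `evaluate`, `BENCHMARKS` and the toy benchmark `toy_dutch_capitals` are all hypothetical and exist only for illustration. A "model" is assumed to be any function that maps a prompt to an answer string, and scoring is assumed to be simple exact-match accuracy.

```python
from typing import Callable, Dict, List, Tuple

# Assumption: a "model" is any callable that maps a prompt to an answer string.
Model = Callable[[str], str]
# Assumption: a benchmark is a list of (prompt, expected answer) pairs.
Benchmark = List[Tuple[str, str]]

# Hypothetical registry: the convention that lets users attach their own benchmarks.
BENCHMARKS: Dict[str, Benchmark] = {}


def register_benchmark(name: str, examples: Benchmark) -> None:
    """Add a benchmark to the registry under a unique name."""
    if name in BENCHMARKS:
        raise ValueError(f"benchmark {name!r} is already registered")
    if not examples:
        raise ValueError("a benchmark needs at least one example")
    BENCHMARKS[name] = examples


def evaluate(model: Model, names: List[str]) -> Dict[str, float]:
    """Run the model on each named benchmark and return exact-match accuracy."""
    scores: Dict[str, float] = {}
    for name in names:
        examples = BENCHMARKS[name]
        correct = sum(model(prompt).strip() == answer for prompt, answer in examples)
        scores[name] = correct / len(examples)
    return scores


# A toy Dutch multiple-choice benchmark (illustrative data only).
register_benchmark(
    "toy_dutch_capitals",
    [
        ("Wat is de hoofdstad van Nederland? A) Amsterdam B) Rotterdam", "A"),
        ("Wat is de hoofdstad van België? A) Antwerpen B) Brussel", "B"),
    ],
)


def always_a(prompt: str) -> str:
    """A trivial baseline 'model' that always answers A."""
    return "A"


results = evaluate(always_a, ["toy_dutch_capitals"])
print(results)  # {'toy_dutch_capitals': 0.5}
```

Real harnesses add much more on top of this (prompt templates, few-shot examples, log-likelihood scoring, caching), but the core pattern of a shared registry plus a common evaluation loop is what makes results comparable across models.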
LM Evaluation Harness
lm-evaluation-harness: The Language Model Evaluation Harness is a unified framework for testing generative language models on a wide variety of benchmarks. It ensures reproducibility by using publicly available prompts and supports customized evaluations.
Quote: The LM Evaluation Harness is a Python-based framework developed by EleutherAI for evaluating the performance of language models on a wide range of NLP benchmarks. It supports multiple tasks such as multiple-choice, question answering, and classification, and is compatible with both local and API-based models (like OpenAI’s GPT or Hugging Face models). The harness provides a standardized way to compare model performance across datasets like MMLU, HellaSwag, ARC, and more. It is modular, extensible, and widely used in the research community for assessing language model capabilities.
Key features include over 60 standard academic benchmarks with hundreds of subtasks.
It includes a Dutch version of the MMLU benchmark. MMLU consists of 15,908 multiple-choice questions, 1,540 of which are used to select and assess optimal settings for models (temperature, batch size and learning rate). The questions span 57 subjects, from highly complex STEM fields and international law to nutrition and religion. It was one of the most commonly used benchmarks for comparing the capabilities of large language models, with over 100 million downloads as of July 2024.
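As a rough illustration of how a run might be launched, the sketch below assembles a command line for the harness in Python. It assumes the v0.4-style `lm_eval` command with the `hf` (Hugging Face) model backend; the helper `build_lm_eval_command`, the model name `EleutherAI/pythia-160m` and the choice of the `hellaswag` task are illustrative assumptions, not taken from this page. Check the task list of your installed version for the exact name of the Dutch MMLU task before substituting it.

```python
import shlex
from typing import List, Optional


def build_lm_eval_command(
    model_name: str,
    tasks: List[str],
    num_fewshot: int = 0,
    batch_size: int = 8,
    output_path: Optional[str] = None,
) -> List[str]:
    """Hypothetical helper: build an argument list for the `lm_eval` CLI.

    Assumes the v0.4-style flags of lm-evaluation-harness.
    """
    if not tasks:
        raise ValueError("at least one task name is required")
    cmd = [
        "lm_eval",
        "--model", "hf",                              # Hugging Face backend
        "--model_args", f"pretrained={model_name}",   # which checkpoint to load
        "--tasks", ",".join(tasks),                   # comma-separated task names
        "--num_fewshot", str(num_fewshot),
        "--batch_size", str(batch_size),
    ]
    if output_path is not None:
        cmd += ["--output_path", output_path]         # where results are written
    return cmd


# Illustrative values only: a small public model and one benchmark named above.
command = build_lm_eval_command("EleutherAI/pythia-160m", ["hellaswag"])
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` once the harness is installed. Building the arguments as a list rather than a single string avoids shell-quoting problems.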
A Dutch Version
⚠️ Warning: Last modified 2 years ago.
You can find a Dutch evaluation harness here: Dutch Evaluation Harness. The harness includes support for Dutch evaluation benchmarks (e.g. SQUADNL) and Dutch prompts.