Narrative

Benchmarks are essential for evaluating the performance, safety, and ethical implications of AI models in educational contexts. They provide standardized metrics to assess how well AI systems can support learning outcomes, ensure student safety, and promote equitable access to education. By categorizing benchmarks into pedagogical, ethical, safety, and operational domains, educators and developers can better understand the strengths and limitations of AI models and tools, guiding informed decisions about their deployment.

But what are benchmarks? What are their limitations? How do we use them together to create a balanced scorecard and how do we make them accessible and fresh for educators? How can we realistically scale benchmarking and for what other purposes can we reuse benchmark data? These are the questions we will explore in this dashboard.

Benchmark Landscape

Landscape

There are a lot of benchmarks measuring many different LLM capabilities. Here is a sample of 250 benchmarks.

How to read this dashboard?

Purpose of this dashboard

I designed this dashboard to present benchmarking and scorecarding information in a logical order. Its purpose is to serve as a standalone document with dynamic elements that we can use to develop an understanding of the value of benchmarking in the context of AI Alignment in Education. The dashboard serves a secondary purpose: providing material for presentations.

As the field of AI alignment evolves rapidly, expect updates.

I initially wrote this dashboard in April 2026.

Motivation

AI Alignment does not happen by accident

At the end of 2025, the Npuls Learning Analytics (LA) team came to the following conclusion about the state of Generative AI in Dutch education:

The importance of LA in (Gen)AI was best summarized by the “Confused Expert” analogy. This argument posits that GenAI is a powerful but unguided medium, not a solution. The LA cycle, grounded in pedagogical data, clear metrics, and a feedback loop, is an essential framework for guiding AI responsibly and effectively in an educational context.

Effective AI will need better data and guiding practices, not the other way around.

The solution argues for Benchmarking and training AI models on pedagogical data, and using the LA cycle to guide the development and deployment of AI in education. This approach ensures that AI systems are aligned with educational goals and can be effectively integrated into teaching and learning processes.

Before spending time, effort, and gold on training models with Learning Analytics data, we need to be able to measure the qualities of the models and AI systems against our values. This project is an initial review of the theme with the goal of developing a framework for benchmarking and scorecarding AI systems in education, with a focus on alignment with educational values and goals. The project will involve identifying relevant metrics, developing benchmarking methodologies, and creating scorecards to evaluate AI systems in the context of education.

Why a scorecard of benchmarks?

Why a scorecard?

A scorecard based on a number of benchmarks

No single benchmark can track the capabilities of AI models and composite AI tools in education. We need to track multiple dimensions. Among many themes, models need to be safe and, where possible, cheap and reliable; align with the curriculum; support different learning styles and pedagogical taxonomies such as Bloom's; be multilingual and multimodal; and be able to explain their own reasoning. As educational tools become more agentic, models will need to work together, follow instructions, and find patterns in ever-increasing volumes of information. Domain-specific tasks, such as coding, researching, and solving mathematical problems, also require specific benchmarks.

Creating a scorecard is complex. We will need to research what is available and generalise to the Dutch context. Organizing starts with understanding what the Dutch educational community cares about.

Risks and benefits?

Risks and benefits

The benefits of using benchmarks include the ability to compare the relative performance of AI systems within our context, improving data-driven decision-making. The risks include cheating by training models on the benchmark data, data contamination from AI-generated content, and, most importantly, a false sense of security.

Using a scorecard or benchmark properly requires a degree of common sense and literacy, and considerable reflection during application.

Change is a constant

Which primary factor should motivate us to track AI alignment in education?

Europe falling behind

Europe is falling behind

Towards AGI

Narrow AGI first

AI Observatory

An AI Observatory for education

An AI Observatory is a center of excellence that systematically monitors, evaluates, and reports on the impact of AI in education.

Here is a concept of what an AI observatory could look like with respect to AI benchmarking and scorecarding.

It serves as a trusted source of information for educators, policymakers, and researchers, providing insights into the capabilities, limitations, and ethical implications of AI tools. The observatory collects data from various benchmarks, real-world deployments, and user feedback to create a comprehensive picture of how AI is shaping education. It also fosters collaboration among stakeholders to ensure that AI technologies are developed and used in ways that promote equity, safety, and effective learning outcomes.

Does an AI observatory exist?

There are a number of AI-related sources of intelligence, but none that are specifically focused on Dutch education. Examples include:

AI Watch: An initiative by the European Commission that monitors AI developments and their societal impact, which could serve as a model for an education-focused observatory.
CBS AI monitor: All figures, articles and reports from Statistics Netherlands about artificial intelligence.
AI Index: An annual report that tracks the progress and impact of AI across various domains, including education, providing valuable data for an educational AI observatory.
AI Incident Database: A repository of real-world AI incidents, including safety failures and ethical breaches, which can inform the development of safer AI systems in education.

Bloom's Taxonomy

Learning Analytics

With Little Data

Little Data

(Zhou et al. 2023) Large language models are trained in two stages: (1) unsupervised pretraining from raw text, to learn general-purpose representations, and (2) large scale instruction tuning and reinforcement learning, to better align to end tasks and user preferences. We measure the relative importance of these two stages by training LIMA, a 65B parameter LLaMa language model fine-tuned with the standard supervised loss on only 1,000 carefully curated prompts and responses.

Taken together, these results strongly suggest that almost all knowledge in large language models is learned during pretraining, and only limited instruction tuning data is necessary to teach models to produce high quality output.

Zhou, Chunting, Pengfei Liu, Puxin Xu, Srini Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, et al. 2023. “LIMA: Less Is More for Alignment.” https://doi.org/10.48550/ARXIV.2305.11206.

Applications

Benchmark

Bloom's Taxonomy

Prompt

Keeping track of taxonomies

Here is an example prompt to create an initial taxonomy of benchmarks to measure a generative AI model against Bloom's taxonomy. The use of prompting is an example of AI keeping up to date with AI, and it provides you with a springboard to drill down into the capabilities you consider important.

Note: I can reproduce the output in this dashboard; however, there is a risk of giving a particular set of benchmarks too much relevance. First, it is important to nail down the themes and priorities (cost vs accuracy, safety vs model size, etc.) most important to the sector.

What is a benchmark

A benchmark is a standardized way to measure how well a model performs on specific tasks.

NOTE

Benchmarks may be extended to measure the performance of tools, which are combinations of models and other components (e.g. retrieval-augmented generation (RAG), external tool calls).
Benchmarks are often associated with datasets that are used to train and evaluate models.

Which types of benchmark exist

Capabilities

Leaderboards

Strengths and weaknesses

Strengths and weaknesses of benchmarks

The requirements for AI to effectively serve students and teachers are significant. These include factual accuracy, effective tool and agent utilization, emotional intelligence, and safety.

Individual benchmarks often lack the complexity needed to fully capture these requirements. A composite set of benchmarks is more likely to address the necessary nuances. For instance, while a benchmark may measure the accuracy of AI models in answering multiple-choice questions, it may fail to assess harmful outputs, consistency across prompts, hallucination rates, security, or student well-being during interactions. Additionally, relying on individual benchmarks can lead to saturation, where scores across different models become very similar. Model developers frequently use benchmark datasets to train newer models, and the benchmarks themselves often measure unrealistic tasks that do not reflect real-world applications or generalize effectively.
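The saturation problem described above can be checked mechanically. The following is a minimal sketch, assuming scores are normalised to the range 0 to 1. The function name `is_saturated`, the thresholds `min_spread` and `ceiling`, and all scores are hypothetical illustrations, not established standards or measured results.

```python
def is_saturated(scores, min_spread=0.02, ceiling=0.95):
    """Flag a benchmark as saturated when all models cluster near the top.

    scores: list of model scores in [0, 1] on a single benchmark.
    min_spread and ceiling are illustrative thresholds (assumptions).
    """
    if len(scores) < 2:
        return False  # spread cannot be judged from a single model
    spread = max(scores) - min(scores)
    return spread < min_spread and min(scores) >= ceiling


# Hypothetical scores, for illustration only.
benchmark_scores = {
    "multiple_choice_qa": [0.97, 0.96, 0.97, 0.97],
    "open_ended_tutoring": [0.52, 0.71, 0.64, 0.80],
}

saturation_report = {
    name: is_saturated(scores) for name, scores in benchmark_scores.items()
}
print(saturation_report)
```

A benchmark flagged this way no longer separates models from one another and is a candidate for replacement or down-weighting in a scorecard.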

Leaderboards that rely on a composite set of benchmarks and community feedback tend to provide more robust findings regarding the relative quality of different models. However, designing these leaderboards must take into account the diverse and, at times, specific needs of the education sector, which is a challenge in itself.

Community

Community weighting process

Visualization

Mockup of a community process

[Interactive mockup: Scorecard Community Weighting. Controls: rename or delete a dimension, add a dimension, adjust community weights. Output: Community Scorecard Summary.]
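One way a community weighting process could turn individual ratings into scorecard weights is sketched below. This is an assumption about how such a process might work, not a description of the mockup's actual logic. The dimension names, the 1 to 5 rating scale, and the function `aggregate_weights` are hypothetical.

```python
def aggregate_weights(votes):
    """Average each dimension's ratings and normalise so weights sum to 1.

    votes: dict mapping dimension name -> list of non-negative ratings.
    Dimensions without any ratings are skipped.
    """
    means = {dim: sum(v) / len(v) for dim, v in votes.items() if v}
    total = sum(means.values())
    if total == 0:
        raise ValueError("at least one dimension needs a positive rating")
    return {dim: mean / total for dim, mean in means.items()}


# Hypothetical ratings from three community members on a 1-5 scale.
community_votes = {
    "safety": [5, 4, 5],
    "pedagogy": [4, 4, 3],
    "cost": [2, 3, 1],
}

weights = aggregate_weights(community_votes)
for dimension, weight in weights.items():
    print(f"{dimension}: {weight:.3f}")
```

Averaging is the simplest choice. A real process might prefer medians or trimmed means so that a few extreme ratings do not dominate.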

Missing Benchmarks

Missingness of benchmark

LLM Benchmark Results (simulation of missing data)
LLM Model	Benchmark 1	Benchmark 2	Benchmark N
GPT-4o	0.91	0.87	Missing
Claude 3.5 Sonnet	0.88	Missing	0.83
Llama 3.1 70B	Missing	0.74	0.69
Mistral Large	0.79	0.81	Missing
Gemini 1.5 Pro	0.86	0.84	0.82
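A scorecard has to decide what to do when a benchmark result is missing. The sketch below uses the simulated values from the table above and renormalises the weights over the benchmarks that are available for each model, reporting how much of the total weight was actually covered. The weights are hypothetical, and renormalisation is only one possible policy. Alternatives include imputing a value or excluding incomplete models.

```python
# Simulated results from the table above; None marks a missing benchmark.
results = {
    "GPT-4o": {"Benchmark 1": 0.91, "Benchmark 2": 0.87, "Benchmark N": None},
    "Claude 3.5 Sonnet": {"Benchmark 1": 0.88, "Benchmark 2": None, "Benchmark N": 0.83},
    "Llama 3.1 70B": {"Benchmark 1": None, "Benchmark 2": 0.74, "Benchmark N": 0.69},
    "Mistral Large": {"Benchmark 1": 0.79, "Benchmark 2": 0.81, "Benchmark N": None},
    "Gemini 1.5 Pro": {"Benchmark 1": 0.86, "Benchmark 2": 0.84, "Benchmark N": 0.82},
}

# Hypothetical weights (assumption); they sum to 1.
benchmark_weights = {"Benchmark 1": 0.5, "Benchmark 2": 0.3, "Benchmark N": 0.2}


def composite(scores, weights):
    """Return (weighted mean over available benchmarks, covered weight)."""
    available = {b: s for b, s in scores.items() if s is not None}
    covered = sum(weights[b] for b in available)
    if covered == 0:
        return None, 0.0  # no benchmark available for this model
    score = sum(weights[b] * s for b, s in available.items()) / covered
    return score, covered


scorecard = {model: composite(s, benchmark_weights) for model, s in results.items()}
for model, (score, coverage) in scorecard.items():
    print(f"{model}: score={score:.3f}, coverage={coverage:.0%}")
```

Reporting coverage alongside the score matters: a model scored on 50% of the weighted benchmarks should not be read with the same confidence as one scored on all of them.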

Critical Path

Critical path

The complexity of a community process is compounded by defining who should be the authoritative body for tracking AI evolution and translating the signals into actionable insights for educators. This is a critical path issue because if we do not have a clear and trusted authority, then the community process will be less effective and less likely to be adopted by educators. We need to identify and empower ONE authoritative body.

Live data - API

Live data

Live data - Ollama

Open Source models

Ollama provides a local platform to run and manage LLMs. The data table below contains metadata about the available models, including their names, descriptions, and categories. The interactive table allows you to explore this information with searching, filtering, and export.

Data source (April 2026): The data is sourced from an LLM-checker JSON file that contains detailed information about the models available in Ollama.

Live data - Parameter complexity

Live data

NARRATIVE

Benchmarking is made more difficult by rapidly changing versions of a given model. Ollama is a tool that allows you to run models locally. Using statistics from Ollama is relevant for benchmarking because it helps ensure you are comparing the same model version across benchmarks. As our laptops, mobile phones, and operating systems adapt to the needs of AI, we can expect more and more models to run locally and version changes to accelerate. Declining computing costs and the rise of open-source models contribute significantly to this trend.

METHOD

LLM checker is a tool that makes suggestions for the best models to run on your computer. As part of the recommendation process, it downloads the most up-to-date information about models that can run through Ollama. By using the same data source, we can track popular locally run model specifications and, if necessary, benchmark viable models based on early warnings.

Ollama Model Options
tag	size	quantization	command	estimated_size_gb	real_size_gb	categories
llama3.1:latest	unknown	Q4_0	ollama pull llama3.1:latest	1	4.9	chat
llama3.1:8b	8b	Q4_0	ollama pull llama3.1:8b	8	4.9	chat
llama3.1:70b	70b	Q4_0	ollama pull llama3.1:70b	70	43.0	chat
llama3.1:405b	405b	Q4_0	ollama pull llama3.1:405b	405	243.0	chat
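The table above can be used for quick sanity checks. The sketch below takes the `tag`, `size`, and `real_size_gb` columns as listed and computes the on-disk gigabytes per billion parameters for each row. In these four rows the `estimated_size_gb` column appears to track the parameter count rather than the download size, which is why the sketch relies on `real_size_gb`. The helper `billions` is a hypothetical name introduced for illustration.

```python
# (tag, size label, real_size_gb) copied from the table above.
rows = [
    ("llama3.1:latest", "unknown", 4.9),
    ("llama3.1:8b", "8b", 4.9),
    ("llama3.1:70b", "70b", 43.0),
    ("llama3.1:405b", "405b", 243.0),
]


def billions(size):
    """Parse a size label such as '70b' into billions of parameters, else None."""
    number = size[:-1]
    if size.endswith("b") and number.replace(".", "", 1).isdigit():
        return float(number)
    return None


gb_per_billion = {}
for tag, size, real_gb in rows:
    params = billions(size)
    if params is not None:  # skip rows with an unknown parameter count
        gb_per_billion[tag] = round(real_gb / params, 2)

print(gb_per_billion)
```

For these Q4_0 rows the ratio is close to 0.6 GB per billion parameters, which gives a rough rule of thumb for judging whether a model at this quantization is likely to fit on a given laptop.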

Snapshots - Kaggle

Kaggle (LLM dataset 2024-2026)

Bibliometrics (data)

Valuable information

Bibliometrics (drilling down)

Drilling down into the literature

Content Analysis

Extracting insight from publications

AI with tools

AI-generated analysis using local models and Google Scholar

Here are the final thoughts (April 2026) from an AI model (Google's open-source Gemma 4) running locally on my laptop and querying Google Scholar. The code is at the bottom of the blog post.

The Measurement Era: Analyzing the Rise of LLM Benchmarking in Education

The integration of Large Language Models (LLMs) into the classroom has moved rapidly from a period of experimental novelty to a phase of rigorous academic scrutiny. As these models transition from simple chatbots to sophisticated pedagogical agents, the focus of the research community is shifting. It is no longer enough to ask if an LLM can generate text; the critical question has become whether these models can reliably navigate the complexities of specific academic disciplines, adhere to curriculum standards, and support diverse learning needs. We are currently witnessing a fundamental shift toward the development of specialized, domain-specific evaluative frameworks.

A survey of recent research reveals an expansive, multi-disciplinary effort to map the capabilities of LLMs across the sciences and engineering. Significant strides are being made in evaluating LLMs within Computer Science (CS) concept inventories and programming education, specifically regarding their ability to assess the quality of multiple-choice questions. In the realm of engineering, new methodologies are emerging to benchmark undergraduate curricula in Electrical and Computer Engineering (ECE), utilizing standardized prompting taxonomies to ensure consistency. This trend extends even further into the natural sciences and mathematics, with researchers introducing Hebrew-language chemistry benchmarks and specialized frameworks like “Mathtutorbench” to measure the open-ended pedagogical capabilities of LLM-based tutors.

The primary driver behind this surge in literature is the urgent need for benchmarking as a pillar of scientific rigor. As researchers note, benchmarking is essential for moving toward more reproducible, replicable, and robust investigations into the intersection of AI and education. Without standardized benchmarks, it is impossible to determine if an LLM is truly performing at a level comparable to human students or if it is merely mimicking patterns. These frameworks serve as the foundation for “continuous quality improvement” in outcome-based education, allowing institutions to treat AI integration not as a trend, but as a measurable component of instructional design.

However, the path toward reliable AI integration is fraught with significant technical and structural challenges. Current research highlights critical gaps in how LLMs process complex data; for instance, many existing models struggle to interpret essential pedagogical data stored in table formats. Beyond data parsing, there are significant hurdles in the deployment of LLM agents within existing educational ecosystems, particularly regarding the seamless integration of new tools into established digital infrastructures. Furthermore, there is a documented lack of benchmarks specifically designed to evaluate an LLM’s pedagogical knowledge or its ability to support students with Special Educational Needs and Disability (SEND), suggesting that our current evaluative tools are still heavily biased toward standard, text-heavy academic tasks.

In conclusion, the landscape of educational AI is currently defined by a transition from “capability discovery” to “capability verification.” The recent influx of papers regarding automated benchmarking infrastructure suggests that the industry is moving toward a more mature, automated, and disciplined approach to AI evaluation. While the potential for LLMs to act as personalized tutors and curriculum evaluators is vast, the academic community’s focus remains correctly placed on the necessity of robust, multi-disciplinary, and technically sound benchmarks. Until we can solve the challenges of data integration and pedagogical nuance, the true efficacy of LLMs in the classroom will remain an unverified promise.

Code that generated the posting:

import ollama
from scholarly import scholarly


def fetch_google_scholar_abstracts(topic, max_results=5):
    """
    Fetch publication metadata + abstracts/snippets from Google Scholar.
    Note: Scholar may rate-limit or challenge frequent requests.
    """
    print(f"📚 Fetching Google Scholar abstracts for: {topic}...")
    rows = []

    try:
        search = scholarly.search_pubs(topic)

        for i, pub in enumerate(search):
            if i >= max_results:
                break

            bib = pub.get("bib", {})
            title = bib.get("title", "Untitled")
            year = bib.get("pub_year", "NA")
            venue = bib.get("venue", "Unknown venue")

            # Abstract is not always present in Scholar results.
            abstract = bib.get("abstract") or pub.get("snippet") or "No abstract/snippet available."

            rows.append(
                f"Paper {i+1}\n"
                f"Title: {title}\n"
                f"Year: {year}\n"
                f"Venue: {venue}\n"
                f"Abstract: {abstract}"
            )

        if not rows:
            return None

        return "\n\n".join(rows)

    except Exception as e:
        print(f"❌ Google Scholar fetch error: {e}")
        return None


def generate_educational_blog():
    search_topic = "LLM benchmarks in education"
    is_speculative = False

    # Ground the prompt in abstracts fetched from Google Scholar
    research_data = fetch_google_scholar_abstracts(search_topic, max_results=10)

    if not research_data:
        print("⚠️ Could not fetch Scholar abstracts. Using general knowledge.")
        research_data = "No specific recent paper abstracts found. General knowledge applies."

    prompt = f"""
    You are a tech journalist.

    ABSTRACT DATA FROM GOOGLE SCHOLAR:
    {research_data}

    TASK:
    Write a 5-paragraph blog post about LLM benchmarks in education.

    {'NOTE: You are writing about the FUTURE. Use a visionary and predictive tone.' if is_speculative else 'NOTE: You are writing about EXISTING research. Use an analytical tone.'}

    STRUCTURE:
    1. Intro to AI in classrooms.
    2. Key findings from the provided abstracts.
    3. The importance of benchmarks.
    4. Challenges (bias, accuracy).
    5. Conclusion.
    """

    print("🧠 Gemma 4 is generating the post...")
    try:
        response = ollama.chat(
            model="gemma4:26b",
            messages=[{"role": "user", "content": prompt}]
        )
        print("\n--- FINAL BLOG POST ---\n")
        print(response["message"]["content"])
    except Exception as e:
        print(f"❌ Error: {e}")


if __name__ == "__main__":
    generate_educational_blog()

Benchmarking

Benchmark harness

Benchmark Harness

There are simple-to-use code libraries that take the effort out of running a predetermined set of benchmarks. A notable example is llm-eval, which at present (April 2026) runs 60 different benchmarks. This is a concrete starting point for selecting the benchmarks that are relevant for education and combining them, with community weighting, with benchmarks that have already been measured. As our understanding of the landscape matures, we can tune the initial scorecard for domain-specific comparisons of different AI systems.
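To make the idea of a harness concrete, here is a minimal sketch of what such a library does internally: it runs a fixed set of prompts through a model and scores the answers. This is not the API of llm-eval or any other real library. The names `run_harness`, `exact_match`, and `toy_model`, and the tiny benchmarks themselves, are hypothetical and exist only for illustration.

```python
from typing import Callable, Dict, List, Tuple

# A benchmark is a list of (prompt, expected answer) pairs.
Benchmark = List[Tuple[str, str]]


def exact_match(prediction: str, expected: str) -> bool:
    """Case-insensitive comparison after trimming whitespace."""
    return prediction.strip().lower() == expected.strip().lower()


def run_harness(
    model: Callable[[str], str], benchmarks: Dict[str, Benchmark]
) -> Dict[str, float]:
    """Run every benchmark against the model and return accuracy per benchmark."""
    results = {}
    for name, items in benchmarks.items():
        correct = sum(exact_match(model(prompt), expected) for prompt, expected in items)
        results[name] = correct / len(items) if items else 0.0
    return results


def toy_model(prompt: str) -> str:
    """Stand-in for a real model call; answers from a fixed lookup table."""
    answers = {"2 + 2 = ?": "4", "Capital of the Netherlands?": "Amsterdam"}
    return answers.get(prompt, "I do not know")


toy_benchmarks = {
    "arithmetic": [("2 + 2 = ?", "4"), ("3 * 3 = ?", "9")],
    "geography": [("Capital of the Netherlands?", "amsterdam")],
}

harness_results = run_harness(toy_model, toy_benchmarks)
print(harness_results)
```

In practice, `toy_model` would be replaced by a call to a local or hosted model, and exact matching by a scoring method suited to the task, since open-ended pedagogical answers rarely match a reference string exactly.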

Who am I

Alan Berg is a Learning Analytics and Data Expert and Security Architect at the University of Amsterdam (UvA). He works within the ICT Services (ICTS) department, providing central technical services for both the UvA and the Amsterdam University of Applied Sciences (HvA).

Key Roles and Expertise

Learning Analytics: He is a prominent specialist in learning analytics, focusing on the architectural lifecycle of data in education.
Data Privacy & Security: His work often intersects with data privacy, synthetic data generation for research, and security architecture within university systems.
Academic Background: He holds a Doctorate and has authored several papers and theses related to Learning Analytics Architecture and campus-wide information systems.

The Project

This project explores the variability of tracking the quality of AI models. We do this by defining the relative quality of AI models deployed within Education.

The project assesses whether the models are performing as intended and which models are best for a given pedagogical context. We explore the role of benchmarking and the use of a composite of benchmarks to score models.

Education isn’t just about recalling facts. Bloom’s taxonomy reminds us that learning spans multiple levels—from remembering and understanding to analysing, evaluating, and creating. An aligned AI model should support all these levels, not just the simplest ones. That means our scorecard must include benchmarks that test how well AI handles tasks across this full spectrum of cognitive skills.

No single benchmark can capture everything. Some tests measure factual accuracy, others measure reasoning, creativity, or ethical behaviour. By combining them into one composite scorecard, we get a more complete picture of how an AI model performs in real educational settings.

AI models evolve quickly. A benchmark that was challenging last year may be too easy today. To keep the scorecard meaningful, we need up‑to‑date benchmarks that reflect current capabilities, new risks, and emerging educational practices. Benchmarking ensures that the AI remains aligned not only with today’s curriculum but also with the rapidly changing world students are preparing for.

CEDA

The goal of CEDA (Center for Educational Data Analytics) is to move from data to reliable insights more quickly through co‑creation with vocational (mbo), applied sciences (hbo), and research universities (wo). This enables educational institutions to maintain a firm grip on the future in an increasingly fast‑changing world.

CEDA does this by gathering successful information products from institutions and making them practically accessible to all other institutions. We develop in Python, R, Power BI, and Azure.

On the one hand, this directly helps all institutions because they can apply these information products locally. CEDA also develops documentation and best practices in the areas of machine learning, visualization, and data analysis related to these tools.

On the other hand, this yields a wealth of information about the real challenges surrounding study data and AI, as well as the differences between mbo, hbo, and wo. CEDA shares these insights with chain partners. In this way, CEDA also contributes to sector‑wide conditions and architecture for data‑informed working.

CODE Location

Acknowledgements

A CEDA funded project: I acknowledge the copious use of Microsoft and GitHub Copilot, and the expert-in-the-loop in the development process. I would also like to recognize the use of the R language, especially the flexdashboard package.