Narrative

Benchmarks are essential for evaluating the performance, safety, and ethical implications of AI models in educational contexts. They provide standardized metrics to assess how well AI systems can support learning outcomes, ensure student safety, and promote equitable access to education. By categorizing benchmarks into pedagogical, ethical, safety, and operational domains, educators and developers can better understand the strengths and limitations of AI models and tools, guiding informed decisions about their deployment.

But what are benchmarks? What are their limitations? How do we use them together to create a balanced scorecard and how do we make them accessible and fresh for educators? How can we realistically scale benchmarking and for what other purposes can we reuse benchmark data? These are the questions we will explore in this dashboard.

You will find context-specific explanations by clicking on buttons such as this one.

Benchmark Landscape

Landscape

There are a lot of benchmarks measuring many different LLM capabilities. Here is a sample of 250 benchmarks.

This is a network diagram of 250 benchmarks. You can click on a particular node and it will give you the benchmark name. At the centre of each group of benchmarks is the type of benchmarking. You can also drag and drop.

Number of benchmarks per theme

You can see from the diagrams that:

  • There are popular and less popular themes.
  • Themes are at times interconnected.
  • There are a lot of benchmarks.

What you should also know:

  • The network is evolving rapidly with new benchmarks being developed and older ones discarded.
  • There are new abilities emerging that impact education, such as emotive understanding, which might affect the popularity of specific benchmarks.
  • Benchmarks have associated datasets which may be used to train new models, which in turn might mean that scoring becomes less reliable over time.

You can currently find the related dataset at Evidently AI. The dataset is not exhaustive and is based on the data available at the time of analysis (250 benchmarks, cutoff late 2025).

How to read this dashboard?

Purpose of this dashboard

I designed this dashboard to present benchmarking and scorecarding information in a logical order. Its purpose is to serve as a standalone document with dynamic elements that we can use to develop an understanding of the value of benchmarking in the context of AI alignment in education. The dashboard serves a secondary purpose: providing material for presentations.

As the field of AI alignment evolves rapidly, expect updates.

I initially wrote this dashboard in April 2026.

AI alignment is the effort to ensure that artificial intelligence systems act in ways that match human values, goals, and intentions. In other words, it’s about making sure AI behaves as we want it to.

In education, this specifically concerns the interaction between teachers, students, and the AI.

Motivation

AI alignment does not happen by accident

The Learning Analytics (LA) cycle, grounded in pedagogical data, clear metrics, and a feedback loop, is an essential framework for guiding AI responsibly and effectively in an educational context.

At the end of 2025 the Npuls LA team came to the following conclusion about the state of Generative AI in Dutch education:

The importance of LA in (Gen)AI was best summarized by the “Confused Expert” analogy. This argument posits that GenAI is a powerful but unguided medium, not a solution. The LA cycle, grounded in pedagogical data, clear metrics, and a feedback loop, is an essential framework for guiding AI responsibly and effectively in an educational context.

Effective AI will need better data and guiding practices, not the other way around.

The proposed solution is to benchmark and train AI models on pedagogical data, and to use the LA cycle to guide the development and deployment of AI in education. This approach ensures that AI systems are aligned with educational goals and can be effectively integrated into teaching and learning processes.

Before investing time, effort, and money in training models on Learning Analytics data, we need to be able to measure the qualities of the models and AI systems against our values. This project is an initial review of the theme with the goal of developing a framework for benchmarking and scorecarding AI systems in education, with a focus on alignment with educational values and goals. The project will involve identifying relevant metrics, developing benchmarking methodologies, and creating scorecards to evaluate AI systems in the context of education.

Why a scorecard of benchmarks?

Why a scorecard?

A scorecard based on a number of benchmarks

No single benchmark can track the capabilities of AI models and composite AI tools in education. We need to track multiple dimensions. Among many other requirements, models need to be safe and, where possible, cheap and reliable; align with the curriculum; support different learning styles and pedagogical taxonomies such as Bloom's; be multilingual and multimodal; and be able to explain their own reasoning. As educational tools become more agentic, models will need to work together, follow instructions, and find patterns in ever-increasing volumes of information. Domain-specific tasks, such as coding, researching, and solving mathematical problems, also require specific benchmarks.

Creating a scorecard is complex. We will need to research what is available and generalise to the Dutch context. Organizing starts with understanding what the Dutch educational community cares about.

Risks and benefits?

Risks and benefits

The benefits of using benchmarks include the ability to compare the relative performance of AI systems within our context, improving data-driven decision-making. The risks include cheating by training models on the benchmark data, data contamination from AI-generated content, and, most importantly, a false sense of security.

Using a scorecard or benchmark properly requires a degree of common sense and literacy, and considerable reflection during application.

As the field matures and we become accustomed to deploying scorecards, we will gain a more fine-grained understanding of the risks and benefits of using benchmarks and scorecards. Using AI to keep track of AI is a helpful start. Consider using a prompt such as the following one to monitor evolution:

You are an expert AI policy and education technology analyst. Your sole mission is to keep me continuously up-to-date on the risks and benefits of using AI benchmarks to track, evaluate, and compare the performance of AI systems deployed in the education sector (AI tutors, automated grading tools, personalized learning platforms, adaptive content generators, misconception detectors, etc.).

Every time I ask for an update (or on a schedule if you support it), deliver a concise, balanced, evidence-based briefing that covers:

  1. Recent developments in how benchmarks are being used or critiqued in educational contexts (new studies, papers, reports, leaderboards, or real-world EdTech deployments from the last 1–3 months).
  2. Benefits of benchmark-driven evaluation, such as:
    • Objective, standardized comparisons across models
    • Guidance for schools/districts/EdTech companies when selecting tools
    • Ability to measure progress on education-specific capabilities (pedagogy, student misconception detection, essay scoring, etc.)
    • Acceleration of model improvement for teaching and learning tasks
  3. Risks and limitations, including but not limited to:
    • Benchmark saturation or overfitting that does not translate to real classroom outcomes
    • Lack of correlation with actual student learning gains, teacher effectiveness, or long-term retention
    • Cultural, linguistic, or socioeconomic biases (especially in low- and middle-income countries)
    • Narrow focus on test-like performance instead of pedagogical depth, emotional support, or equity
    • Potential for “teaching to the benchmark” instead of genuine educational value
  4. Well-known benchmarks you must explicitly reference and analyze in every update (including how they are currently performing, any new critiques, and their relevance to education):
    • MMLU (Massive Multitask Language Understanding) – general knowledge across academic subjects
    • GSM8K (Grade School Math 8K) – mathematical reasoning with word problems
    • Pedagogy Benchmark and SEND Pedagogy Benchmark (AI for Education / Fab AI)
    • Visual Maths Benchmark (early-grade visual math problems)
    • ASAP 2.0 (automated scoring of student persuasive essays – Learning Agency leaderboard)
    • Eedi Misconception Annotation Project (identifying student math misconceptions)
    • Any newly emerging education-specific benchmarks (e.g., PERSUADE dataset extensions or multi-modal classroom benchmarks)
  5. Practical implications and actionable recommendations for:
    • Educators and school leaders
    • Policymakers and regulators
    • AI developers building education tools
  6. Always cite sources with dates and links (academic papers, leaderboards, reputable reports, or news). Structure the response clearly with headings (Recent Developments, Benefits, Risks, Benchmark Spotlight, Recommendations). Keep the tone neutral, evidence-driven, and forward-looking. If no major new information exists since the last update, explicitly say so and summarize the current state of the field.

Begin every response with: “AI Benchmarks in Education – Update as of [current date]”

Change is a constant

Which primary factor should motivate us to track AI alignment in education?

The graph is from Epoch AI, which creates a scorecard based on several well-understood, popular, and high-quality benchmarks.

Change is not a constant; it is actually accelerating across several fronts. Inference is getting cheaper and faster. Model release cycles are shortening. Capabilities are emerging, and our collective understanding of how to use and orchestrate AI for meaningful results is improving and scaling.

Primarily, we should do no harm. Given the rapidly changing landscape, we should track AI alignment to ensure that we are prepared for the coming changes and able to adapt to them in a way that is beneficial for students, teachers, and society as a whole.

Epoch AI releases its findings and datasets regularly. You can find the datasets here.

Europe falling behind

Europe is falling behind

Europe is falling behind in AI frontier capabilities. We can use Chinese (open source) and US models, but as a community, we should add measurable risk factors such as digital sovereignty to our composite scorecard.

Towards AGI

Narrow AGI first

Artificial General Intelligence (AGI) refers to a type of artificial intelligence that has the ability to understand, learn, and apply knowledge across a wide range of tasks at a level comparable to human intelligence. Unlike narrow AI, which is designed for specific tasks, AGI would be capable of performing any intellectual task that a human can do. AGI raises ethical and safety concerns.

Change is a constant… Implications

  1. How do we keep track of the changes that are coming?
  2. Play to our sectoral strengths and avoid weaknesses.
  3. Adopt emergent properties and apply a safety net to mitigate emergent risks.

The diagram is from AGI Definition, which points to a paper on benchmarking various aspects of AGI. The diagram illustrates the different dimensions of AGI and how ChatGPT performs across those dimensions. The trend is clear: capabilities are improving, but unevenly. Some qualities still need addressing.

AI Observatory

An AI Observatory for education

This section explains the concept of an observatory that monitors the rapid evolution of AI and its implications. Without a central observatory, each organisation would need to track these changes itself, and many lack the attention or resources to do so effectively.

An AI Observatory is a center of excellence that systematically monitors, evaluates, and reports on the impact of AI in education.

Here is a concept of what an AI observatory could look like with respect to AI benchmarking and scorecarding.

It serves as a trusted source of information for educators, policymakers, and researchers, providing insights into the capabilities, limitations, and ethical implications of AI tools. The observatory collects data from various benchmarks, real-world deployments, and user feedback to create a comprehensive picture of how AI is shaping education. It also fosters collaboration among stakeholders to ensure that AI technologies are developed and used in ways that promote equity, safety, and effective learning outcomes.

Does an AI observatory exist?

There are a number of AI-related sources of intelligence, but none that are specifically focused on Dutch education. Examples include:

  • AI Watch: An initiative by the European Commission that monitors AI developments and their societal impact, which could serve as a model for an education-focused observatory.

  • CBS AI monitor: All figures, articles and reports from Statistics Netherlands about artificial intelligence.

  • AI Index: An annual report that tracks the progress and impact of AI across various domains, including education, providing valuable data for an educational AI observatory.

  • AI Incident Database: A repository of real-world AI incidents, including safety failures and ethical breaches, which can inform the development of safer AI systems in education.

Bloom's Taxonomy

Source

Bloom’s taxonomy, or an updated version of it, allows for a more nuanced understanding of the different levels of cognitive processes and how they relate to learning outcomes. It can help educators design more effective learning experiences and assessments that target specific cognitive skills and promote deeper learning. The taxonomy also allows us to divide up AI-related activities into pedagogically relevant zones of sophistication and benchmark each zone.

In general, the higher the zone, the more sophisticated the activity. Google, for example, trains and then measures its models based on the different zones. This allows them to track progress and identify areas for improvement.

Other taxonomies are also possible, but Bloom’s is a good starting point because it is widely used and understood in education. It also allows us to categorise benchmarks in a way that is meaningful for educators and students.

Learning Analytics

Source Npuls LA team

Learning Analytics data is an accelerator for the education sector.

Learning Analytics (LA) defines a life cycle around designing educational nudges that improve student outcomes based on online digital traces such as those left by interacting with AI.

Npuls has had a team of Learning Analytics experts review the literature and create a comprehensive overview of best and worst practices in learning analytics. This overview can be used to inform the design of benchmarks that are relevant for learning analytics, and to identify areas where more research is needed.

Correctly structured data gathering can provide the training data for models and benchmarks. If the Dutch cannot compete on building frontier models, then perhaps we should focus on specialising in domain-specific models. These models need fine-tuning.

With Little Data

Little Data

Only a small amount of Learning Analytics data is needed to fine-tune models, which makes the process practical, cheap, and fast. This is a crucial finding because it means that we can create domain-specific models that are tailored to the needs of educators and students, without needing to invest in large amounts of data or computational resources. We will, however, need to invest in building our expertise and processes, and align our institutional policies to agile, data-driven practices.

(Zhou et al. 2023) Large language models are trained in two stages: (1) unsupervised pretraining from raw text, to learn general-purpose representations, and (2) large scale instruction tuning and reinforcement learning, to better align to end tasks and user preferences. We measure the relative importance of these two stages by training LIMA, a 65B parameter LLaMa language model fine-tuned with the standard supervised loss on only 1,000 carefully curated prompts and responses

Taken together, these results strongly suggest that almost all knowledge in large language models is learned during pretraining, and only limited instruction tuning data is necessary to teach models to produce high quality output.

Zhou, Chunting, Pengfei Liu, Puxin Xu, Srini Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, et al. 2023. “LIMA: Less Is More for Alignment.” https://doi.org/10.48550/ARXIV.2305.11206.

Applications

Source

At the different levels of Bloom’s taxonomy, different applications of AI can be used to support learning outcomes. For example, at the lower levels of the taxonomy, AI can be used to provide personalised feedback and support for basic skills such as memorisation and comprehension. At the higher levels of the taxonomy, AI can be used to support more complex cognitive processes such as analysis, synthesis, and evaluation.

As the deployment of AI systems becomes more sophisticated, we will be looking at more complex interactions between generative AI agents, at times glued together with scripts and custom text documents that provide written instructions. Over time, benchmarking will focus on the composite systems and applications, rather than on the individual models. This is because the composite systems are what educators and students will be interacting with, and they will need to be evaluated based on their overall performance and impact on learning outcomes, rather than just on the performance of the individual models that make up the system.

Benchmark

Bloom's Taxonomy

Source

Google is using taxonomies to align and improve its models and has quantified the improvements via benchmarking. This leads to a virtuous cycle of improvement, where the benchmarks inform the development of the models, and the improved models lead to better benchmarks. This is an example of how benchmarking can accelerate progress in AI development and deployment in education.

Prompt

Keeping track of taxonomies

Here is an example prompt to create an initial taxonomy of benchmarks to measure a generative AI model against the Bloom taxonomy. The use of prompting is an example of AI keeping up to date with AI, and it provides you with a springboard to drill down into the capabilities you consider important.

Note: I can reproduce the output in this dashboard; however, there is a risk of giving a particular set of benchmarks too much relevance. First, it is important to nail down the themes and priorities (cost vs accuracy, safety vs model size, etc.) most important to the sector.

You are an AI Research Scientist specializing in Large Language Model Evaluation. Your task is to map contemporary LLM evaluation suites to the hierarchical levels of Bloom’s Taxonomy and provide concise, expert‑level analysis.

Instructions

1. Scope

  • Include static benchmarks (e.g., ARC, MMLU, HellaSwag, GSM8K).
  • Include dynamic/agentic benchmarks (e.g., AgentBench, SWE‑Bench, WebArena, ToolBench).
  • When relevant, distinguish between task format and cognitive demand.

2. Cognitive Mapping

For each benchmark:

  • Assign its primary Bloom’s Taxonomy level.
  • Note any secondary cognitive skills (e.g., multi‑step reasoning, planning).
  • Use Bloom’s hierarchy:
    • Lower‑Order Thinking: remembering, understanding, basic application

    • Higher‑Order Thinking: analysis, evaluation, synthesis/creation

3. Critical Evaluation

Provide a concise, evidence‑based discussion of limitations in using these benchmarks to measure human‑like cognition, including:

  • Overfitting and benchmark contamination
  • Shortcut exploitation
  • Limited ecological validity
  • Lack of metacognition measurement
  • Differences between human conceptual reasoning and LLM statistical patterning
  • Partial improvements from agentic benchmarks and remaining gaps

4. Output Format

Use:

  • Hierarchical headings
  • Bullet points
  • Clear taxonomy labels

Structure the output as:

  1. Overview of Bloom’s Taxonomy
  2. Static Benchmark Mapping
  3. Dynamic/Agentic Benchmark Mapping
  4. Cross‑Benchmark Comparison
  5. Limitations for Human‑Like Cognition
  6. Summary Table

5. Constraints

  • Provide concise reasoning, not detailed internal thought processes.
  • Do not reveal chain‑of‑thought; instead give short, direct explanations.
  • Maintain a scientific, research‑grade tone.
  • Avoid generic descriptions; focus on cognitive implications.

What is a benchmark

A benchmark is a standardized way to measure how well a model performs on specific tasks.

NOTE

  • Benchmarks may be extended to measure the performance of tools, which are combinations of models and other components (e.g. retrieval-augmented generation (RAG), external tool calls).
  • Benchmarks are often associated with datasets that are used to train and evaluate models.

Which types of benchmark exist

Capabilities

Here is an example set of capabilities that can be benchmarked. This is not an exhaustive list.

The implication is that the more capabilities are deployed, the more benchmarks we need to keep track of. This is because each capability has its own set of risks and benefits, and we need to ensure that we either monitor effectively and/or buy in models that have been benchmarked appropriately. This is a big ask for individual institutions, which is why we need a central observatory to do this work for us and make the results accessible and actionable for educators.

Leaderboards

LLM leaderboards display models against various benchmarks or composite sets of benchmarks. A number of the boards enable community updates and feedback processes.

Examples

It is unlikely that any one leaderboard will capture the nuances of the education sector for specific tasks. However, leaderboards can be useful for tracking the general progress of models across a variety of dimensions, and for tracking the general progress of specific models that are relevant to our work.

Strengths and weaknesses

Strengths and weaknesses of benchmarks

Saturation of LLM benchmarks refers to the point where large language models consistently score extremely high on widely used evaluation tests—so high that the benchmarks no longer meaningfully differentiate between models or reveal their weaknesses. When a benchmark becomes saturated, it stops functioning as a reliable measure of progress.
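One way to make saturation concrete is to look at how much headroom remains above the best score and how far apart the top models are. The following is a minimal sketch under stated assumptions: the function name, the thresholds, and the scores are all hypothetical and are not taken from any established standard or leaderboard.

```python
from statistics import pstdev


def is_saturated(scores, ceiling=1.0, top_k=5, min_headroom=0.05, min_spread=0.02):
    """Flag a benchmark as saturated when the best models are close to the
    ceiling and barely distinguishable from one another.

    All thresholds are illustrative assumptions, not established standards.
    """
    top = sorted(scores, reverse=True)[:top_k]
    if len(top) < 2:
        return False  # too few models to judge
    headroom = ceiling - top[0]  # room left above the best score
    spread = pstdev(top)         # how far apart the top models are
    return headroom < min_headroom and spread < min_spread


# Hypothetical accuracy scores for an older and a newer benchmark
old_benchmark = [0.97, 0.96, 0.96, 0.95, 0.95, 0.90]
new_benchmark = [0.41, 0.33, 0.27, 0.22, 0.18, 0.10]

print(is_saturated(old_benchmark), is_saturated(new_benchmark))
```

A check like this could serve as an early warning that a benchmark in a scorecard needs replacing; the thresholds would need to be agreed on by the community.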

For the educational sector, this matters because schools, universities, and policymakers increasingly rely on benchmark results to judge whether AI tools are trustworthy. The sector needs fresh, robust, and domain‑specific evaluations to ensure AI supports learning rather than distorting it.

Communities of experts can help address this issue by continuously developing new benchmarks that are more complex, realistic, and aligned with educational goals.

See: Scorecard section for an example of benchmark evolution in action.

The requirements for AI to effectively serve students and teachers are significant. These include factual accuracy, effective tool and agent utilization, emotional intelligence, and safety.

Individual benchmarks often lack the complexity needed to fully capture these requirements. A composite set of benchmarks is more likely to address the necessary nuances. For instance, while a benchmark may measure the accuracy of AI models in answering multiple-choice questions, it may fail to assess harmful outputs, consistency across prompts, hallucination rates, security, or student well-being during interactions. Additionally, relying on individual benchmarks can lead to saturation, where scores across different models become very similar. Model developers frequently use benchmark datasets to train newer models, and the benchmarks themselves often measure unrealistic tasks that do not reflect real-world applications or generalize effectively.

Leaderboards that rely on a composite set of benchmarks and community feedback tend to provide more robust findings regarding the relative quality of different models. However, designing these leaderboards so that they take into account the diverse and, at times, specific needs of the education sector is a challenge.

Community

Community weighting process

Humanity's Last Exam (HLE) is a shining example of how to engage a community of experts similar to the ones working within Npuls and keep benchmarks fresh. We can certainly learn a lot from their approach.

Humanity's Last Exam

Quote: “AI capability is evaluated based on benchmarks, yet as their progress accelerates, benchmarks become quickly saturated, losing their utility as a measurement tool. Performing well on formerly frontier benchmarks such as MMLU and GPQA are no longer strong signals of progress as frontier models reach or exceed human level performance on them.

In partnership with the Center for AI Safety, HLE addresses the problem of benchmark saturation by creating Humanity’s Last Exam (HLE): 2,500 of the toughest, subject-diverse, multi-modal questions designed to be the last academic exam of its kind for AI. HLE is designed to test for both depth of reasoning (eg. world-class mathematical problems) and breadth of knowledge across its subject domains, providing a precise measurement of model capability. Current frontier models perform poorly on HLE with low accuracies, and systematically exhibit uncalibrated overconfidence in their answers.

Searchable questions were removed by the following procedure. A question is potentially searchable if a model with search tools answered correctly, but answered incorrectly without search. Each of these potentially searchable questions was then manually audited, removing any that were easily found via web search. We used GPT-4o mini/GPT-4o search and Perplexity Sonar models in this procedure.”
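The first step of the quoted procedure, flagging questions a model gets right only when it has search tools, can be sketched as a simple filter. This is an illustration of the rule as described in the quote, not the HLE team's code; the `QuestionResult` type, the function name, and the sample results are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class QuestionResult:
    question_id: str
    correct_with_search: bool
    correct_without_search: bool


def potentially_searchable(results):
    """Return ids of questions answered correctly only when search was available."""
    return [
        r.question_id
        for r in results
        if r.correct_with_search and not r.correct_without_search
    ]


# Hypothetical results for three questions
results = [
    QuestionResult("q1", correct_with_search=True, correct_without_search=False),
    QuestionResult("q2", correct_with_search=True, correct_without_search=True),
    QuestionResult("q3", correct_with_search=False, correct_without_search=False),
]

flagged = potentially_searchable(results)  # candidates for the manual audit
print(flagged)
```

The flagged questions are only candidates: in the quoted procedure each one is then audited by hand before it is removed.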

Visualization

Mockup of a community process

We will need to engage a community to weight abilities and values and to keep the scorecard fresh and relevant. This is a mockup of what a community process could look like. Simply ask and weight based on the feedback. The process can be repeated as often as needed and incrementally.
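As a minimal sketch of "ask and weight based on the feedback", the ratings collected from a community could be averaged per dimension and normalised so that the weights sum to 1. The dimension names, the 1 to 5 rating scale, and the function name are assumptions made for illustration only.

```python
from statistics import mean


def community_weights(ratings):
    """Turn per-dimension importance ratings into weights that sum to 1."""
    averages = {dim: mean(votes) for dim, votes in ratings.items() if votes}
    total = sum(averages.values())
    if total == 0:
        raise ValueError("at least one dimension needs a non-zero rating")
    return {dim: avg / total for dim, avg in averages.items()}


# Hypothetical ratings (1 = unimportant, 5 = essential) from four respondents
ratings = {
    "safety": [5, 5, 4, 5],
    "cost": [3, 2, 3, 4],
    "multilingual": [4, 4, 3, 5],
}

weights = community_weights(ratings)
for dimension, weight in weights.items():
    print(f"{dimension}: {weight:.2f}")
```

Because the function only needs the latest ratings, the weights can be recomputed each time the community is consulted, which fits the incremental process described above.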

Scorecard Community Weighting (mockup)

✏️ Rename or Delete

➕ Add Dimension

⚙️ Adjust Community Weights

Community Scorecard Summary

Missing Benchmarks

Missingness of benchmark

LLM Benchmark Results
Simulation of missing data
| LLM Model | Benchmark 1 | Benchmark 2 | Benchmark N |
|---|---|---|---|
| GPT-4o | 0.91 | 0.87 | Missing |
| Claude 3.5 Sonnet | 0.88 | Missing | 0.83 |
| Llama 3.1 70B | Missing | 0.74 | 0.69 |
| Mistral Large | 0.79 | 0.81 | Missing |
| Gemini 1.5 Pro | 0.86 | 0.84 | 0.82 |

A practical issue for generating scorecards is that not all models are tested against all benchmarks. The missingness issue leaves us with several possible, at times complementary approaches to benchmarks, including:

  • Opportunism: We choose benchmarks based on their popularity or their inclusion in existing leaderboards.
  • Gap analysis: We act on the missing data and run community-selected benchmarks for the models we care about.
  • Embrace our individuality: We design a composite scorecard and only deploy models tested against the scorecard.
  • Test per situation: Provide a framework to design your own scorecard based on already existing metrics.

If we follow the route of least effort and faster deployment, we should start by being opportunistic, building our knowledge and understanding of our values, and adopting a scorecard that weights benchmarks based on our organisation’s values.
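A weighted scorecard can still be computed when some results are missing, as long as the gap is reported alongside the score. The sketch below takes the weighted mean over the benchmarks a model was actually tested on and returns the share of weight those benchmarks cover. The weights, the `min_coverage` threshold, and the function name are assumptions; the scores come from the simulated table above.

```python
def composite_score(scores, weights, min_coverage=0.5):
    """Weighted mean over the benchmarks a model was actually tested on.

    Returns (score, coverage). Coverage is the share of total weight backed
    by real results; the score is None when coverage is below min_coverage.
    """
    total_weight = sum(weights.values())
    available = {b: w for b, w in weights.items() if scores.get(b) is not None}
    covered = sum(available.values())
    coverage = covered / total_weight
    if covered == 0 or coverage < min_coverage:
        return None, coverage
    score = sum(scores[b] * w for b, w in available.items()) / covered
    return score, coverage


# Hypothetical weights; None marks a missing result
weights = {"Benchmark 1": 0.4, "Benchmark 2": 0.4, "Benchmark N": 0.2}
gpt4o = {"Benchmark 1": 0.91, "Benchmark 2": 0.87, "Benchmark N": None}
llama = {"Benchmark 1": None, "Benchmark 2": 0.74, "Benchmark N": 0.69}

gpt4o_score, gpt4o_coverage = composite_score(gpt4o, weights)
llama_score, llama_coverage = composite_score(llama, weights)
print(f"GPT-4o: {gpt4o_score:.3f} (coverage {gpt4o_coverage:.0%})")
print(f"Llama 3.1 70B: {llama_score:.3f} (coverage {llama_coverage:.0%})")
```

Renormalising over the available benchmarks is only one possible design choice. It keeps models comparable, but a low coverage value is a signal that a gap analysis is needed before trusting the score.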

Critical Path

Critical path

The complexity of a community process is compounded by defining who should be the authoritative body for tracking AI evolution and translating the signals into actionable insights for educators. This is a critical path issue because if we do not have a clear and trusted authority, then the community process will be less effective and less likely to be adopted by educators. We need to identify and empower ONE authoritative body.

Live data - API

Live data

https://artificialanalysis.ai provides an API to access data on LLM models, including their evaluations and pricing.

The example plot above visualizes the relationship between the MMLU PRO evaluation scores and the input/output token costs for various LLM models. Each point represents a model, with its name and MMLU PRO score displayed on hover. This allows us to compare the performance of different models against their associated costs.
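One way to read a cost-versus-score comparison is to keep only the models that are not beaten on score by a cheaper or equally priced model. The sketch below does this on hard-coded, hypothetical figures; it does not call the Artificial Analysis API, and the model names, costs, and scores are invented for illustration.

```python
def pareto_frontier(models):
    """Keep models that are not beaten on score by any model costing the
    same or less.

    Each model is a (name, cost, score) tuple; lower cost and higher score
    are preferred.
    """
    frontier = []
    best_score = float("-inf")
    # Walk from cheapest to most expensive; on equal cost, best score first
    for name, cost, score in sorted(models, key=lambda m: (m[1], -m[2])):
        if score > best_score:
            frontier.append(name)
            best_score = score
    return frontier


# Hypothetical (name, cost in USD per million tokens, benchmark score)
models = [
    ("model-a", 0.50, 0.62),
    ("model-b", 2.00, 0.71),
    ("model-c", 3.00, 0.69),  # costs more than model-b and scores lower
    ("model-d", 10.00, 0.80),
]

frontier = pareto_frontier(models)
print(frontier)
```

Models left off the frontier are not necessarily bad choices, since a scorecard may weigh other dimensions such as safety or licensing, but they are the first candidates to question on cost grounds.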

Live data - Ollama

Open Source models

Ollama provides a local platform to run and manage LLM models. The data table below contains metadata about the available models, including their names, descriptions, and categories. The interactive table allows you to explore this information with searching, filtering, and export.

Data source: April 2026. The data is sourced from an LLM-checker JSON file that contains detailed information about the models available in Ollama.

Live data - Parameter complexity

Live data

NARRATIVE

Benchmarking is made more difficult by rapidly changing versions of a given model. Ollama is a tool that allows you to run models locally. Using statistics from Ollama is relevant for benchmarking because it ensures you are comparing the same model version across benchmarks. As our laptops, mobile phones, and operating systems adapt to the needs of AI, we can expect more and more models to run locally and version changes to accelerate. Declining computing costs and the rise of open-source models contribute significantly to this trend.

METHOD

LLM checker is a tool that makes suggestions for the best models to run on your computer. As part of the recommendation process, it downloads the most up-to-date information about models that can run through Ollama. By using the same data source, we can track popular locally run model specifications and, if necessary, benchmark viable models based on early warnings.

Ollama Model Options
| tag | size | quantization | command | estimated_size_gb | real_size_gb | categories |
|---|---|---|---|---|---|---|
| llama3.1:latest | unknown | Q4_0 | ollama pull llama3.1:latest | 1 | 4.9 | chat |
| llama3.1:8b | 8b | Q4_0 | ollama pull llama3.1:8b | 8 | 4.9 | chat |
| llama3.1:70b | 70b | Q4_0 | ollama pull llama3.1:70b | 70 | 43.0 | chat |
| llama3.1:405b | 405b | Q4_0 | ollama pull llama3.1:405b | 405 | 243.0 | chat |
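Metadata like this can be used to shortlist the models worth benchmarking on a given machine. The sketch below filters on the `real_size_gb` column from the table above. The function name and the 1.2 headroom factor are assumptions for illustration; they are not figures from Ollama or LLM checker.

```python
# Rows copied from the table above: (tag, real_size_gb)
models = [
    ("llama3.1:8b", 4.9),
    ("llama3.1:70b", 43.0),
    ("llama3.1:405b", 243.0),
]


def models_that_fit(models, memory_gb, headroom=1.2):
    """Return tags whose size, scaled by a safety margin, fits in memory.

    The headroom factor is an assumed allowance for the context window and
    runtime overhead, not a measured value.
    """
    return [tag for tag, size_gb in models if size_gb * headroom <= memory_gb]


laptop = models_that_fit(models, memory_gb=16)
workstation = models_that_fit(models, memory_gb=64)
print(laptop, workstation)
```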

Open Source is closing the gap

Costs are decreasing

Snapshots - Kaggle

Kaggle (LLM dataset 2024-2026)

Kaggle is a popular platform for data science competitions and datasets. The dataset used in this analysis contains information about various LLM benchmarks, including their performance metrics and the licenses under which they are released.

Kaggle datasets are generally static and age quickly for LLM benchmarks, which are rapidly evolving. However, they can still provide valuable insights into the landscape of LLM benchmarks at a given point in time. The dataset used here is from April 2026, and it includes a ‘license’ column that indicates the type of license for each benchmark.

This pie chart shows the distribution of licenses among the LLM benchmarks in a Kaggle dataset. The ‘license’ column in the dataset indicates the type of license under which each benchmark is released. The chart helps to visualize how many benchmarks are available under each license type, which can be important for understanding the accessibility and usage rights of the benchmarks.
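The shares behind a chart like this can be computed directly from the 'license' column. The following is a minimal sketch using only the standard library; the licence values are hypothetical and do not come from the Kaggle dataset.

```python
from collections import Counter


def license_shares(licenses):
    """Return each licence type's share of the total, largest first."""
    counts = Counter(licenses)
    total = sum(counts.values())
    return {name: count / total for name, count in counts.most_common()}


# Hypothetical values of a 'license' column
licenses = [
    "Proprietary", "Apache 2.0", "Proprietary",
    "MIT", "Proprietary", "Apache 2.0",
]

shares = license_shares(licenses)
for name, share in shares.items():
    print(f"{name}: {share:.0%}")
```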

Many LLM families have slightly modified open source licenses, which can be confusing. The pie chart helps to clarify the distribution of licenses and highlights the prevalence of proprietary license types in the LLM benchmark landscape.

The disadvantages of proprietary licenses are manifold for the educational sector, including the potential for leaking data to training sets, a lack of transparency in model behavior, and the risk of models being discontinued or changed without notice. Open source licenses, on the other hand, typically allow for greater transparency and community involvement. Trust but verify still applies, though, as open source does not necessarily mean safe or ethical.
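
The shares behind a license pie chart can be tallied in a few lines. The sketch below assumes a CSV with a `license` column, as described above; the sample rows, the `name` column, and the `license_shares` helper are made up for illustration and do not come from the Kaggle file.

```python
import csv
import io
from collections import Counter

# Made-up sample rows; the real dataset's contents are not reproduced here.
SAMPLE_CSV = """name,license
entry-a,apache-2.0
entry-b,proprietary
entry-c,Apache-2.0
entry-d,mit
entry-e,proprietary
entry-f,
"""


def license_shares(csv_text, column="license"):
    """Count license values case-insensitively and return {license: share}."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter(
        (row.get(column) or "").strip().lower() or "unknown" for row in reader
    )
    total = sum(counts.values())
    return {name: count / total for name, count in counts.most_common()}
```

Normalizing case and mapping empty cells to "unknown" is a design choice of this sketch: slightly modified license names would still need manual grouping before the chart is trustworthy.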

Bibliometrics (data)

Valuable information

Library databases such as Web of Science allow you to search for terms across large sets of research and then analyse the returned records as a dataset. In this case, the analysis uses bibliometrix, a well-known R library.

Here, we search for the terms:

dataset AND Github AND AI AND benchmark*

From the information, you can monitor the emergence of new benchmarks and the names of specific well-performing LLMs, the health of the community of researchers, which journals and authors to track, and much more.

Bibliometrics

Bibliometrics (drilling down)

Drilling down into the literature

Within seconds you can get an overview of your area of interest.

And with little effort you can drill down to track the emergence of new benchmarks and the names of specific well-performing LLMs, the health of the community of researchers, which journals and authors to track, and much more.

Content Analysis

Extracting insight from publications

The screengrab shows how citations are distributed within a specific document, in this case “Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation”. Content analysis of highly rated papers or theses is helpful for backward citation searching (often called snowballing), where, from one good paper, you can find others.

From content analysis, you can also extract relevant entities, such as the location of datasets used for benchmarking. The datasets are reusable in educational contexts, such as fine-tuning domain-specific models or serving as a safety net against unsafe or inefficient practices.
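
As a rough sketch of this kind of entity extraction, the Python below pulls dataset-like links out of plain text with a regular expression. It is not how the contentanalysis package works; the pattern only covers GitHub, Hugging Face, and Kaggle links by assumption, and the sample text and repository names are invented.

```python
import re

# Assumed pattern: only GitHub, Hugging Face, and Kaggle dataset links match.
DATASET_URL = re.compile(
    r"https?://(?:github\.com|huggingface\.co/datasets|(?:www\.)?kaggle\.com/datasets)"
    r"/[\w.\-]+/[\w.\-]+",
    re.IGNORECASE,
)

# Invented sample text standing in for text extracted from a paper.
SAMPLE_TEXT = (
    "Data is at https://github.com/example-org/example-bench. "
    "See also https://huggingface.co/datasets/example-org/example-set and "
    "https://github.com/example-org/example-bench."
)


def extract_dataset_links(text):
    """Return unique dataset-like URLs in order of first appearance."""
    seen = {}
    for match in DATASET_URL.finditer(text):
        # Drop a trailing full stop picked up from the end of a sentence.
        url = match.group(0).rstrip(".")
        seen.setdefault(url, None)
    return list(seen)
```

Running `extract_dataset_links(SAMPLE_TEXT)` returns each link once, so a list like this could seed a catalogue of candidate benchmark datasets for later manual review.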

The R package contentanalysis provides detailed analysis and features that automatically extract information from PDFs and other content types.

AI with tools

AI-generated analysis using local models and Google Scholar

Here are the final thoughts (April 2026) from an AI model (Google’s open-source Gemma 4) running locally on my laptop and querying Google Scholar. The code is at the bottom of the post.

The Measurement Era: Analyzing the Rise of LLM Benchmarking in Education

The integration of Large Language Models (LLMs) into the classroom has moved rapidly from a period of experimental novelty to a phase of rigorous academic scrutiny. As these models transition from simple chatbots to sophisticated pedagogical agents, the focus of the research community is shifting. It is no longer enough to ask if an LLM can generate text; the critical question has become whether these models can reliably navigate the complexities of specific academic disciplines, adhere to curriculum standards, and support diverse learning needs. We are currently witnessing a fundamental shift toward the development of specialized, domain-specific evaluative frameworks.

A survey of recent research reveals an expansive, multi-disciplinary effort to map the capabilities of LLMs across the sciences and engineering. Significant strides are being made in evaluating LLMs within Computer Science (CS) concept inventories and programming education, specifically regarding their ability to assess the quality of multiple-choice questions. In the realm of engineering, new methodologies are emerging to benchmark undergraduate curricula in Electrical and Computer Engineering (ECE), utilizing standardized prompting taxonomies to ensure consistency. This trend extends even further into the natural sciences and mathematics, with researchers introducing Hebrew-language chemistry benchmarks and specialized frameworks like “Mathtutorbench” to measure the open-ended pedagogical capabilities of LLM-based tutors.

The primary driver behind this surge in literature is the urgent need for benchmarking as a pillar of scientific rigor. As researchers note, benchmarking is essential for moving toward more reproducible, replicable, and robust investigations into the intersection of AI and education. Without standardized benchmarks, it is impossible to determine if an LLM is truly performing at a level comparable to human students or if it is merely mimicking patterns. These frameworks serve as the foundation for “continuous quality improvement” in outcome-based education, allowing institutions to treat AI integration not as a trend, but as a measurable component of instructional design.

However, the path toward reliable AI integration is fraught with significant technical and structural challenges. Current research highlights critical gaps in how LLMs process complex data; for instance, many existing models struggle to interpret essential pedagogical data stored in table formats. Beyond data parsing, there are significant hurdles in the deployment of LLM agents within existing educational ecosystems, particularly regarding the seamless integration of new tools into established digital infrastructures. Furthermore, there is a documented lack of benchmarks specifically designed to evaluate an LLM’s pedagogical knowledge or its ability to support students with Special Educational Needs and Disability (SEND), suggesting that our current evaluative tools are still heavily biased toward standard, text-heavy academic tasks.

In conclusion, the landscape of educational AI is currently defined by a transition from “capability discovery” to “capability verification.” The recent influx of papers regarding automated benchmarking infrastructure suggests that the industry is moving toward a more mature, automated, and disciplined approach to AI evaluation. While the potential for LLMs to act as personalized tutors and curriculum evaluators is vast, the academic community’s focus remains correctly placed on the necessity of robust, multi-disciplinary, and technically sound benchmarks. Until we can solve the challenges of data integration and pedagogical nuance, the true efficacy of LLMs in the classroom will remain an unverified promise.

Code that generated the post:

import ollama
from scholarly import scholarly


def fetch_google_scholar_abstracts(topic, max_results=5):
    """
    Fetch publication metadata + abstracts/snippets from Google Scholar.
    Note: Scholar may rate-limit or challenge frequent requests.
    """
    print(f"📚 Fetching Google Scholar abstracts for: {topic}...")
    rows = []

    try:
        search = scholarly.search_pubs(topic)

        for i, pub in enumerate(search):
            if i >= max_results:
                break

            bib = pub.get("bib", {})
            title = bib.get("title", "Untitled")
            year = bib.get("pub_year", "NA")
            venue = bib.get("venue", "Unknown venue")

            # Abstract is not always present in Scholar results.
            abstract = bib.get("abstract") or pub.get("snippet") or "No abstract/snippet available."

            rows.append(
                f"Paper {i+1}\n"
                f"Title: {title}\n"
                f"Year: {year}\n"
                f"Venue: {venue}\n"
                f"Abstract: {abstract}"
            )

        if not rows:
            return None

        return "\n\n".join(rows)

    except Exception as e:
        print(f"❌ Google Scholar fetch error: {e}")
        return None


def generate_educational_blog():
    search_topic = "LLM benchmarks in education"
    is_speculative = False

    # Use Google Scholar abstracts as the source material for the prompt.
    research_data = fetch_google_scholar_abstracts(search_topic, max_results=10)

    if not research_data:
        print("⚠️ Could not fetch Scholar abstracts. Using general knowledge.")
        research_data = "No specific recent paper abstracts found. General knowledge applies."

    prompt = f"""
    You are a tech journalist.

    ABSTRACT DATA FROM GOOGLE SCHOLAR:
    {research_data}

    TASK:
    Write a 5-paragraph blog post about LLM benchmarks in education.

    {'NOTE: You are writing about the FUTURE. Use a visionary and predictive tone.' if is_speculative else 'NOTE: You are writing about EXISTING research. Use an analytical tone.'}

    STRUCTURE:
    1. Intro to AI in classrooms.
    2. Key findings from the provided abstracts.
    3. The importance of benchmarks.
    4. Challenges (bias, accuracy).
    5. Conclusion.
    """

    print("🧠 Gemma 4 is generating the post...")
    try:
        response = ollama.chat(
            model="gemma4:26b",
            messages=[{"role": "user", "content": prompt}]
        )
        print("\n--- FINAL BLOG POST ---\n")
        print(response["message"]["content"])
    except Exception as e:
        print(f"❌ Error: {e}")


if __name__ == "__main__":
    generate_educational_blog()

Benchmarking

Benchmark harness

There are simple-to-use code libraries that take the effort out of running a predetermined set of benchmarks. A notable example is llm-eval, which at present (April 2026) runs 60 different benchmarks. This is a concrete starting point for selecting the benchmarks that are relevant for education and combining them, using community-agreed weights, with benchmarks that have already been measured. As our understanding of the landscape matures, we can tune the initial scorecard for domain-specific comparisons of different AI systems.

Who am I

Alan Berg is a Learning Analytics and Data Expert and Security Architect at the University of Amsterdam (UvA). He works within the ICT Services (ICTS) department, providing central technical services for both the UvA and the Amsterdam University of Applied Sciences (HvA).

Key Roles and Expertise

  • Learning Analytics: He is a prominent specialist in learning analytics, focusing on the architectural lifecycle of data in education.
  • Data Privacy & Security: His work often intersects with data privacy, synthetic data generation for research, and security architecture within university systems.
  • Academic Background: He holds a Doctorate and has authored several papers and theses related to Learning Analytics Architecture and campus-wide information systems.

The Project

This project explores the viability of tracking the quality of AI models. We do this by defining the relative quality of AI models deployed within education.

The project assesses whether the models are performing as intended and which models are best for a given pedagogical context. We explore the role of benchmarking and the use of a composite of benchmarks to score models.

Education isn’t just about recalling facts. Bloom’s taxonomy reminds us that learning spans multiple levels—from remembering and understanding to analysing, evaluating, and creating. An aligned AI model should support all these levels, not just the simplest ones. That means our scorecard must include benchmarks that test how well AI handles tasks across this full spectrum of cognitive skills.

No single benchmark can capture everything. Some tests measure factual accuracy, others measure reasoning, creativity, or ethical behaviour. By combining them into one composite scorecard, we get a more complete picture of how an AI model performs in real educational settings.
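
One way to picture a composite scorecard is as a weighted mean of per-benchmark scores. The sketch below is only an illustration under assumptions: the benchmark names, weights, and model scores are hypothetical, scores are taken to be already scaled to the range 0 to 1, and `composite_score` is not a function from any benchmarking library.

```python
def composite_score(scores, weights):
    """Weighted mean of benchmark scores that are already scaled to 0..1.

    Benchmarks without a score are skipped and the remaining weights are
    renormalised, so a missing result does not count as zero.
    """
    used = {name: w for name, w in weights.items() if name in scores}
    total_weight = sum(used.values())
    if total_weight <= 0:
        raise ValueError("no weighted benchmark has a score")
    return sum(scores[name] * w for name, w in used.items()) / total_weight


# Hypothetical weights and scores, for illustration only.
WEIGHTS = {"factual_accuracy": 0.4, "reasoning": 0.3, "safety": 0.3}
MODEL_SCORES = {
    "model-a": {"factual_accuracy": 0.80, "reasoning": 0.60, "safety": 0.90},
    "model-b": {"factual_accuracy": 0.90, "reasoning": 0.70},
}

SCORECARD = {
    model: composite_score(scores, WEIGHTS)
    for model, scores in MODEL_SCORES.items()
}
```

Renormalising over the available benchmarks keeps partially measured models comparable, but it can flatter a model that simply lacks a safety result, so a real scorecard would probably also report which benchmarks are missing.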

AI models evolve quickly. A benchmark that was challenging last year may be too easy today. To keep the scorecard meaningful, we need up‑to‑date benchmarks that reflect current capabilities, new risks, and emerging educational practices. Benchmarking ensures that the AI remains aligned not only with today’s curriculum but also with the rapidly changing world students are preparing for.
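
A simple way to notice that a benchmark has become too easy is to flag it once several models score near the ceiling. The following is a sketch with invented numbers; the 0.95 ceiling, the minimum of three models, and the `saturated_benchmarks` helper are assumptions rather than an established rule.

```python
def saturated_benchmarks(results, ceiling=0.95, min_models=3):
    """Return benchmarks where at least min_models score at or above ceiling."""
    flagged = []
    for benchmark, model_scores in results.items():
        high = sum(1 for score in model_scores.values() if score >= ceiling)
        if high >= min_models:
            flagged.append(benchmark)
    return sorted(flagged)


# Invented results: benchmark name -> {model name: score scaled to 0..1}.
RESULTS = {
    "older_quiz": {"m1": 0.97, "m2": 0.96, "m3": 0.98},
    "newer_reasoning": {"m1": 0.55, "m2": 0.71, "m3": 0.96},
}
```

Here `saturated_benchmarks(RESULTS)` flags only `older_quiz`, marking it as a candidate for replacement or a lower weight in the scorecard.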

CEDA

The goal of CEDA (Center for Educational Data Analytics) is to move from data to reliable insights more quickly through co‑creation with vocational (mbo), applied sciences (hbo), and research universities (wo). This enables educational institutions to maintain a firm grip on the future in an increasingly fast‑changing world.

CEDA does this by gathering successful information products from institutions and making them practically accessible to all other institutions. We develop in Python, R, Power BI, and Azure.

On the one hand, this directly helps all institutions because they can apply these information products locally. CEDA also develops documentation and best practices in the areas of machine learning, visualization, and data analysis related to these tools.

On the other hand, this yields a wealth of information about the real challenges surrounding study data and AI, as well as the differences between mbo, hbo, and wo. CEDA shares these insights with chain partners. In this way, CEDA also contributes to sector‑wide conditions and architecture for data‑informed working.

Acknowledgements

This is a CEDA-funded project. I acknowledge the copious use of Microsoft and GitHub Copilot, with an expert in the loop, in the development process. I would also like to recognize the use of the R language, especially the flexdashboard package.