Benchmarks are essential for evaluating the performance, safety, and ethical implications of AI models in educational contexts. They provide standardized metrics to assess how well AI systems can support learning outcomes, ensure student safety, and promote equitable access to education. By categorizing benchmarks into pedagogical, ethical, safety, and operational domains, educators and developers can better understand the strengths and limitations of AI models and tools, guiding informed decisions about their deployment.
But what are benchmarks? What are their limitations? How do we use them together to create a balanced scorecard and how do we make them accessible and fresh for educators? How can we realistically scale benchmarking and for what other purposes can we reuse benchmark data? These are the questions we will explore in this dashboard.
There are a lot of benchmarks measuring many different LLM capabilities. Here is a sample of 250 benchmarks.
I designed this dashboard to present benchmarking and scorecarding information in a logical order. Its purpose is to serve as a standalone document with dynamic elements that we can use to develop an understanding of the value of benchmarking in the context of AI Alignment in Education. The dashboard serves a secondary purpose: providing material for presentations.
As the field of AI alignment evolves rapidly, expect updates.
I initially wrote this dashboard in April 2026.
At the end of 2025 the Npuls LA team came to the following conclusion about the state of Generative AI in Dutch education:
The importance of LA in (Gen)AI was best summarized by the “Confused Expert” analogy. This argument posits that GenAI is a powerful but unguided medium, not a solution. The LA cycle, grounded in pedagogical data, clear metrics, and a feedback loop, is an essential framework for guiding AI responsibly and effectively in an educational context.
Effective AI will need better data and guiding practices, not the other way around.
The solution argues for Benchmarking and training AI models on pedagogical data, and using the LA cycle to guide the development and deployment of AI in education. This approach ensures that AI systems are aligned with educational goals and can be effectively integrated into teaching and learning processes.
Before spending time, effort, and money on training models with Learning Analytics data, we need to be able to measure the qualities of the models and AI systems against our values. This project is an initial review of the theme with the goal of developing a framework for benchmarking and scorecarding AI systems in education, with a focus on alignment with educational values and goals. The project will involve identifying relevant metrics, developing benchmarking methodologies, and creating scorecards to evaluate AI systems in the context of education.
A scorecard based on a number of benchmarks
No one benchmark can track the capabilities of AI models and composite AI tools in education. We need to track multiple dimensions. Among many themes, models need to be safe and, where possible, cheap and reliable. They also need to align with the curriculum, support different learning styles and pedagogical taxonomies such as Bloom, be multilingual and multimodal, and be able to explain their own reasoning. As educational tools become more agentic, models will need to work together, follow instructions, and find patterns in ever-increasing volumes of information. Domain-specific tasks, such as coding, researching, and solving mathematical problems, also require specific benchmarks.
Creating a scorecard is complex. We will need to research what is available and generalise to the Dutch context. Organizing starts with understanding what the Dutch educational community cares about.
The benefits of using benchmarks include the ability to compare the relative performance of AI systems within our context, improving data-driven decision-making. The risks include cheating by training models on the benchmark data, data contamination from AI-generated content, and, most importantly, a false sense of security.
Using a scorecard or benchmark properly requires a degree of common sense and literacy, and considerable reflection during application.
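One way to reason about the contamination risk is to check how much of a benchmark item reappears word for word in a body of text a model may have seen. The following is a minimal sketch of a word n-gram overlap heuristic, not a method used in this project; the function names, the choice of n, and the example strings are all illustrative assumptions.

```python
def ngrams(text, n=8):
    """Return the set of word n-grams in `text` (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_item, corpus_text, n=8):
    """Fraction of the item's word n-grams that also occur in the corpus text.

    A ratio near 1.0 suggests the item may have leaked into the corpus;
    a ratio near 0.0 suggests it has not (at this n-gram length).
    """
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0  # item is shorter than n words, nothing to compare
    return len(item_grams & ngrams(corpus_text, n)) / len(item_grams)


# Hypothetical strings, short n only to keep the example readable.
leaked = overlap_ratio("a b c d", "x a b c d y", n=2)
clean = overlap_ratio("a b c d", "d c b a", n=2)
```

A high ratio is a signal to investigate, not proof of cheating, which is where the common sense and reflection mentioned above come in.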
As the field matures and we become accustomed to deploying scorecards, we will gain a more fine-grained understanding of the risks and benefits of using benchmarks and scorecards. Using AI to keep track of AI is a helpful start. Consider using a prompt such as the following one to monitor evolution:
You are an expert AI policy and education technology analyst. Your sole mission is to keep me continuously up-to-date on the risks and benefits of using AI benchmarks to track, evaluate, and compare the performance of AI systems deployed in the education sector (AI tutors, automated grading tools, personalized learning platforms, adaptive content generators, misconception detectors, etc.).
Every time I ask for an update (or on a schedule if you support it), deliver a concise, balanced, evidence-based briefing that covers:
Begin every response with: “AI Benchmarks in Education – Update as of [current date]”
Which primary factor should motivate us to track AI alignment in education?
An AI Observatory is a center of excellence that systematically monitors, evaluates, and reports on the impact of AI in education.
Here is a concept of what an AI observatory could look like with respect to AI benchmarking and scorecarding.
It serves as a trusted source of information for educators, policymakers, and researchers, providing insights into the capabilities, limitations, and ethical implications of AI tools. The observatory collects data from various benchmarks, real-world deployments, and user feedback to create a comprehensive picture of how AI is shaping education. It also fosters collaboration among stakeholders to ensure that AI technologies are developed and used in ways that promote equity, safety, and effective learning outcomes.
Does an AI observatory exist?
There are a number of AI-related sources of intelligence, but none that are specifically focused on Dutch education. Examples include:
AI Watch: An initiative by the European Commission that monitors AI developments and their societal impact, which could serve as a model for an education-focused observatory.
CBS AI monitor: All figures, articles and reports from Statistics Netherlands about artificial intelligence.
AI Index: An annual report that tracks the progress and impact of AI across various domains, including education, providing valuable data for an educational AI observatory.
AI Incident Database: A repository of real-world AI incidents, including safety failures and ethical breaches, which can inform the development of safer AI systems in education.
(Zhou et al. 2023) Large language models are trained in two stages: (1) unsupervised pretraining from raw text, to learn general-purpose representations, and (2) large scale instruction tuning and reinforcement learning, to better align to end tasks and user preferences. We measure the relative importance of these two stages by training LIMA, a 65B parameter LLaMa language model fine-tuned with the standard supervised loss on only 1,000 carefully curated prompts and responses.
Taken together, these results strongly suggest that almost all knowledge in large language models is learned during pretraining, and only limited instruction tuning data is necessary to teach models to produce high quality output.
Zhou, Chunting, Pengfei Liu, Puxin Xu, Srini Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, et al. 2023. “LIMA: Less Is More for Alignment.” https://doi.org/10.48550/ARXIV.2305.11206.
Here is an example prompt to create an initial taxonomy of benchmarks to measure a generative AI model against the Bloom taxonomy. The use of prompting is an example of AI keeping up to date with AI, and it provides you with a springboard to drill down into the capabilities you consider important.
Note: I can reproduce the output in this dashboard; however, there is a risk of giving a particular set of benchmarks too much relevance. First, it is important to nail down the themes and priorities (cost vs accuracy, safety vs model size, etc.) most important to the sector.
A benchmark is a standardized way to measure how well a model performs on specific tasks.
NOTE
The requirements for AI to effectively serve students and teachers are significant. These include factual accuracy, effective tool and agent utilization, emotional intelligence, and safety.
Individual benchmarks often lack the complexity needed to fully capture these requirements. A composite set of benchmarks is more likely to address the necessary nuances. For instance, while a benchmark may measure the accuracy of AI models in answering multiple-choice questions, it may fail to assess harmful outputs, consistency across prompts, hallucination rates, security, or student well-being during interactions. Additionally, relying on individual benchmarks can lead to saturation, where scores across different models become very similar. Model developers frequently use benchmark datasets to train newer models, and the benchmarks themselves often measure unrealistic tasks that do not reflect real-world applications or generalize effectively.
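The saturation problem described above can be made visible with a simple spread check. This is a minimal sketch, assuming scores on a 0 to 1 scale; the `saturation_report` helper, the 0.05 threshold, and the benchmark and model names are hypothetical and chosen only for illustration.

```python
def saturation_report(scores_by_benchmark, max_spread=0.05):
    """Flag benchmarks where all models score within `max_spread` of each other.

    scores_by_benchmark: dict of benchmark name -> dict of model name -> score.
    Returns a dict of benchmark name -> (spread, is_saturated).
    """
    report = {}
    for benchmark, scores in scores_by_benchmark.items():
        values = list(scores.values())
        spread = max(values) - min(values)
        report[benchmark] = (round(spread, 3), spread <= max_spread)
    return report


# Hypothetical scores: the first benchmark no longer separates the models.
example = {
    "hypothetical_mcq": {"model_a": 0.97, "model_b": 0.96, "model_c": 0.98},
    "hypothetical_tutoring": {"model_a": 0.55, "model_b": 0.71, "model_c": 0.62},
}
report = saturation_report(example)
```

A benchmark flagged this way still has value as a regression check, but it contributes little to ranking models against each other.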
Leaderboards that rely on a composite set of benchmarks and community feedback tend to provide more robust findings regarding the relative quality of different models. However, the design of these leaderboards must take into account the diverse and, at times, specific needs of the education sector, which is a challenge.
LLM Benchmark Results: simulation of missing data

| LLM Model | Benchmark 1 | Benchmark 2 | Benchmark N |
|---|---|---|---|
| GPT-4o | 0.91 | 0.87 | Missing |
| Claude 3.5 Sonnet | 0.88 | Missing | 0.83 |
| Llama 3.1 70B | Missing | 0.74 | 0.69 |
| Mistral Large | 0.79 | 0.81 | Missing |
| Gemini 1.5 Pro | 0.86 | 0.84 | 0.82 |
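Missing results complicate any ranking. The sketch below uses the simulated scores from the table above and reports, per model, the mean over the benchmarks that are available together with the share of benchmarks covered. The `summarise` helper and the choice of a plain mean are assumptions made for illustration, not a recommended scoring method.

```python
# Simulated scores copied from the table above; None marks a missing result.
results = {
    "GPT-4o": [0.91, 0.87, None],
    "Claude 3.5 Sonnet": [0.88, None, 0.83],
    "Llama 3.1 70B": [None, 0.74, 0.69],
    "Mistral Large": [0.79, 0.81, None],
    "Gemini 1.5 Pro": [0.86, 0.84, 0.82],
}


def summarise(scores):
    """Return (mean of available scores, fraction of benchmarks covered)."""
    available = [s for s in scores if s is not None]
    if not available:
        return None, 0.0
    return sum(available) / len(available), len(available) / len(scores)


summary = {model: summarise(scores) for model, scores in results.items()}

# Print models from highest to lowest mean, with coverage alongside.
for model, (avg, coverage) in sorted(
    summary.items(), key=lambda item: item[1][0], reverse=True
):
    print(f"{model:<18} mean={avg:.3f} coverage={coverage:.0%}")
```

With these numbers the model with the highest mean has only two of three results, while the only fully covered model ranks lower. Reporting coverage next to the score is one way to avoid a false sense of security.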
The complexity of a community process is compounded by defining who should be the authoritative body for tracking AI evolution and translating the signals into actionable insights for educators. This is a critical path issue because if we do not have a clear and trusted authority, then the community process will be less effective and less likely to be adopted by educators. We need to identify and empower ONE authoritative body.
Ollama provides a local platform to run and manage LLM models. The data table below contains metadata about the available models, including their names, descriptions, and categories. The interactive table allows you to explore this information with searching, filtering, and export.
Data source (April 2026): the data is sourced from an LLM-checker JSON file that contains detailed information about the models available in Ollama.
NARRATIVE
Benchmarking is made more difficult by rapidly changing versions of a given model. Ollama is a tool that allows you to run models locally. Using statistics from Ollama is relevant for benchmarking because it ensures you are comparing the same model version across benchmarks. As our laptops, mobile phones, and OSs adapt to the needs of AI, we can expect more and more models to run locally and version changes to accelerate. Declining computing costs and the rise of open-source models contribute significantly to this trend.
METHOD
LLM checker is a tool that makes suggestions for the best models to run on your computer. As part of the recommendation process, it downloads the most up-to-date information about models that can run through Ollama. By using the same data source, we can track popular locally run model specifications and, if necessary, benchmark viable models based on early warnings.
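Comparing the same model version across benchmarks can be enforced in the bookkeeping itself. The following is a minimal sketch that keys benchmark runs by model name plus a version digest and reports models whose results mix versions. The record fields (`model`, `digest`, `benchmark`, `score`) and all values are hypothetical; they are not the schema of the LLM-checker JSON file.

```python
from collections import defaultdict


def group_by_version(runs):
    """Group benchmark runs by (model name, version digest).

    Scores are only directly comparable within the same group.
    """
    grouped = defaultdict(dict)
    for run in runs:
        grouped[(run["model"], run["digest"])][run["benchmark"]] = run["score"]
    return dict(grouped)


def mixed_versions(runs):
    """Return the names of models that appear under more than one digest."""
    digests = defaultdict(set)
    for run in runs:
        digests[run["model"]].add(run["digest"])
    return sorted(name for name, seen in digests.items() if len(seen) > 1)


# Hypothetical runs: model-a was benchmarked under two different versions.
runs = [
    {"model": "model-a", "digest": "aaa111", "benchmark": "bench_1", "score": 0.70},
    {"model": "model-a", "digest": "bbb222", "benchmark": "bench_2", "score": 0.75},
    {"model": "model-b", "digest": "ccc333", "benchmark": "bench_1", "score": 0.64},
]
grouped = group_by_version(runs)
flagged = mixed_versions(runs)
```

A flagged model is a candidate for re-running the older benchmarks before its scores are combined into a scorecard.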
Here are the final thoughts (April 2026) from an AI (Google's open-source Gemma 4) running locally on my laptop and talking to Google Scholar. The code is at the bottom of the blog post.
The Measurement Era: Analyzing the Rise of LLM Benchmarking in Education
The integration of Large Language Models (LLMs) into the classroom has moved rapidly from a period of experimental novelty to a phase of rigorous academic scrutiny. As these models transition from simple chatbots to sophisticated pedagogical agents, the focus of the research community is shifting. It is no longer enough to ask if an LLM can generate text; the critical question has become whether these models can reliably navigate the complexities of specific academic disciplines, adhere to curriculum standards, and support diverse learning needs. We are currently witnessing a fundamental shift toward the development of specialized, domain-specific evaluative frameworks.
A survey of recent research reveals an expansive, multi-disciplinary effort to map the capabilities of LLMs across the sciences and engineering. Significant strides are being made in evaluating LLMs within Computer Science (CS) concept inventories and programming education, specifically regarding their ability to assess the quality of multiple-choice questions. In the realm of engineering, new methodologies are emerging to benchmark undergraduate curricula in Electrical and Computer Engineering (ECE), utilizing standardized prompting taxonomies to ensure consistency. This trend extends even further into the natural sciences and mathematics, with researchers introducing Hebrew-language chemistry benchmarks and specialized frameworks like “Mathtutorbench” to measure the open-ended pedagogical capabilities of LLM-based tutors.
The primary driver behind this surge in literature is the urgent need for benchmarking as a pillar of scientific rigor. As researchers note, benchmarking is essential for moving toward more reproducible, replicable, and robust investigations into the intersection of AI and education. Without standardized benchmarks, it is impossible to determine if an LLM is truly performing at a level comparable to human students or if it is merely mimicking patterns. These frameworks serve as the foundation for “continuous quality improvement” in outcome-based education, allowing institutions to treat AI integration not as a trend, but as a measurable component of instructional design.
However, the path toward reliable AI integration is fraught with significant technical and structural challenges. Current research highlights critical gaps in how LLMs process complex data; for instance, many existing models struggle to interpret essential pedagogical data stored in table formats. Beyond data parsing, there are significant hurdles in the deployment of LLM agents within existing educational ecosystems, particularly regarding the seamless integration of new tools into established digital infrastructures. Furthermore, there is a documented lack of benchmarks specifically designed to evaluate an LLM’s pedagogical knowledge or its ability to support students with Special Educational Needs and Disability (SEND), suggesting that our current evaluative tools are still heavily biased toward standard, text-heavy academic tasks.
In conclusion, the landscape of educational AI is currently defined by a transition from “capability discovery” to “capability verification.” The recent influx of papers regarding automated benchmarking infrastructure suggests that the industry is moving toward a more mature, automated, and disciplined approach to AI evaluation. While the potential for LLMs to act as personalized tutors and curriculum evaluators is vast, the academic community’s focus remains correctly placed on the necessity of robust, multi-disciplinary, and technically sound benchmarks. Until we can solve the challenges of data integration and pedagogical nuance, the true efficacy of LLMs in the classroom will remain an unverified promise.
import ollama
from scholarly import scholarly


def fetch_google_scholar_abstracts(topic, max_results=5):
    """
    Fetch publication metadata + abstracts/snippets from Google Scholar.
    Note: Scholar may rate-limit or challenge frequent requests.
    """
    print(f"📚 Fetching Google Scholar abstracts for: {topic}...")
    rows = []
    try:
        search = scholarly.search_pubs(topic)
        for i, pub in enumerate(search):
            if i >= max_results:
                break
            bib = pub.get("bib", {})
            title = bib.get("title", "Untitled")
            year = bib.get("pub_year", "NA")
            venue = bib.get("venue", "Unknown venue")
            # Abstract is not always present in Scholar results.
            abstract = bib.get("abstract") or pub.get("snippet") or "No abstract/snippet available."
            rows.append(
                f"Paper {i+1}\n"
                f"Title: {title}\n"
                f"Year: {year}\n"
                f"Venue: {venue}\n"
                f"Abstract: {abstract}"
            )
        if not rows:
            return None
        return "\n\n".join(rows)
    except Exception as e:
        print(f"❌ Google Scholar fetch error: {e}")
        return None


def generate_educational_blog():
    search_topic = "LLM benchmarks in education"
    is_speculative = False
    # Replace Wikipedia with Google Scholar abstracts
    research_data = fetch_google_scholar_abstracts(search_topic, max_results=10)
    if not research_data:
        print("⚠️ Could not fetch Scholar abstracts. Using general knowledge.")
        research_data = "No specific recent paper abstracts found. General knowledge applies."
    prompt = f"""
    You are a tech journalist.
    ABSTRACT DATA FROM GOOGLE SCHOLAR:
    {research_data}
    TASK:
    Write a 5-paragraph blog post about LLM benchmarks in education.
    {'NOTE: You are writing about the FUTURE. Use a visionary and predictive tone.' if is_speculative else 'NOTE: You are writing about EXISTING research. Use an analytical tone.'}
    STRUCTURE:
    1. Intro to AI in classrooms.
    2. Key findings from the provided abstracts.
    3. The importance of benchmarks.
    4. Challenges (bias, accuracy).
    5. Conclusion.
    """
    print("🧠 Gemma 4 is generating the post...")
    try:
        # Send the prompt to the locally running model through Ollama.
        response = ollama.chat(
            model="gemma4:26b",
            messages=[{"role": "user", "content": prompt}]
        )
        print("\n--- FINAL BLOG POST ---\n")
        print(response["message"]["content"])
    except Exception as e:
        print(f"❌ Error: {e}")


if __name__ == "__main__":
    generate_educational_blog()

Benchmark Harness
There are simple-to-use code libraries that take the effort out of running a predetermined set of benchmarks. A notable example is llm-eval, which at present (April 2026) runs 60 different benchmarks. This is a concrete starting point for selecting benchmarks that are relevant for education and for combining them, with community-agreed weights, with benchmarks that have already been measured. As our understanding of the landscape matures, we can tune the initial scorecard for domain-specific comparisons of different AI systems.
Alan Berg is a Learning Analytics and Data Expert and Security Architect at the University of Amsterdam (UvA). He works within the ICT Services (ICTS) department, providing central technical services for both the UvA and the Amsterdam University of Applied Sciences (HvA).
Key Roles and Expertise
This project explores the variability of tracking the quality of AI models. We do this by defining the relative quality of AI models deployed within Education.
The project assesses whether the models are performing as intended and which models are best for a given pedagogical context. We explore the role of benchmarking and the use of a composite of benchmarks to score models.
Education isn’t just about recalling facts. Bloom’s taxonomy reminds us that learning spans multiple levels—from remembering and understanding to analysing, evaluating, and creating. An aligned AI model should support all these levels, not just the simplest ones. That means our scorecard must include benchmarks that test how well AI handles tasks across this full spectrum of cognitive skills.
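Whether a candidate set of benchmarks spans the full taxonomy can be checked mechanically once each benchmark is tagged with the levels it exercises. This is a minimal sketch using the six levels of the revised Bloom taxonomy; the benchmark names and their level tags are hypothetical, and in practice the tagging would need expert review.

```python
BLOOM_LEVELS = ["remember", "understand", "apply", "analyse", "evaluate", "create"]


def bloom_coverage(benchmark_levels):
    """Return (covered, missing) Bloom levels for a tagged set of benchmarks.

    benchmark_levels: dict of benchmark name -> list of Bloom levels it tests.
    """
    tagged = set()
    for levels in benchmark_levels.values():
        tagged.update(levels)
    unknown = tagged - set(BLOOM_LEVELS)
    if unknown:
        raise ValueError(f"Unknown Bloom levels: {sorted(unknown)}")
    covered = [level for level in BLOOM_LEVELS if level in tagged]
    missing = [level for level in BLOOM_LEVELS if level not in tagged]
    return covered, missing


# Hypothetical tagging of three candidate benchmarks.
candidate = {
    "hypothetical_fact_recall": ["remember"],
    "hypothetical_reading_comprehension": ["understand", "analyse"],
    "hypothetical_essay_feedback": ["evaluate"],
}
covered, missing = bloom_coverage(candidate)
```

In this example the gaps are at the apply and create levels, which tells us where to look for additional benchmarks.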
No single benchmark can capture everything. Some tests measure factual accuracy, others measure reasoning, creativity, or ethical behaviour. By combining them into one composite scorecard, we get a more complete picture of how an AI model performs in real educational settings.
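Combining benchmarks into one number can be sketched as a weighted mean. The example below is a minimal illustration, assuming every benchmark is already normalised to a 0 to 1 scale; the `composite_score` helper, the dimension names, and the weights are hypothetical and stand in for whatever the community agrees on.

```python
def composite_score(scores, weights):
    """Weighted mean of benchmark scores using the supplied weights.

    Raises KeyError if a weighted benchmark has no score, so that gaps
    are never silently ignored.
    """
    missing = [name for name in weights if name not in scores]
    if missing:
        raise KeyError(f"No score for: {missing}")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("Weights must sum to a positive number")
    return sum(scores[name] * weight for name, weight in weights.items()) / total_weight


# Hypothetical community weights and scores for one model.
weights = {"safety": 3, "accuracy": 2, "reasoning": 1}
model_scores = {"safety": 0.90, "accuracy": 0.80, "reasoning": 0.60}
score = composite_score(model_scores, weights)
```

A single composite hides trade-offs, so it is usually worth showing the per-dimension scores alongside it.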
AI models evolve quickly. A benchmark that was challenging last year may be too easy today. To keep the scorecard meaningful, we need up‑to‑date benchmarks that reflect current capabilities, new risks, and emerging educational practices. Benchmarking ensures that the AI remains aligned not only with today’s curriculum but also with the rapidly changing world students are preparing for.
The goal of CEDA (Center for Educational Data Analytics) is to move from data to reliable insights more quickly through co‑creation with vocational (mbo), applied sciences (hbo), and research universities (wo). This enables educational institutions to maintain a firm grip on the future in an increasingly fast‑changing world.
CEDA does this by gathering successful information products from institutions and making them practically accessible to all other institutions. We develop in Python, R, Power BI, and Azure.
On the one hand, this directly helps all institutions because they can apply these information products locally. CEDA also develops documentation and best practices in the areas of machine learning, visualization, and data analysis related to these tools.
On the other hand, this yields a wealth of information about the real challenges surrounding study data and AI, as well as the differences between mbo, hbo, and wo. CEDA shares these insights with chain partners. In this way, CEDA also contributes to sector‑wide conditions and architecture for data‑informed working.
A CEDA-funded project: I acknowledge the copious use of Microsoft and GitHub Copilot, and the expert-in-the-loop in the development process. I would also like to recognize the use of the R language, especially the flexdashboard package.