AI Aligned by Benchmarking

INTRODUCTION

Narrative

This dashboard shows that it is practical to collect live benchmark data and gives a visual intuition for the value of this data in model selection. By the end of the story, we conclude that AI alignment through benchmark comparisons can be operationalized, but that it requires a nuanced understanding of the details to be enacted safely. The dashboard then points to a second dashboard that explains some of these nuances and provides a mockup of a potential community practice.

The live data were gathered from the sources acknowledged below. Their efforts made this dashboard viable.

Acknowledgements

The live datasets are gathered from the following sources:

Artificial Analysis: performs intelligence, quality, performance and price benchmarking on AI models, inference API endpoints and systems.
/oolong-tea-2026: Auto-updated daily snapshots of every Arena AI (formerly LMSYS Chatbot Arena) leaderboard in structured JSON.
Epoch AI: investigating the trajectory of AI for the benefit of society.

1 a: Coverage

Artificial Analysis (https://artificialanalysis.ai) provides its gathered benchmarks for free through a REST API. Its coverage spans around 10 benchmarks, together with its own aggregate AGI benchmark. Artificial Analysis also provides a detailed dashboard that you may want to explore in your own research.
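Coverage here means how many models have a score on each benchmark. The following is a minimal sketch of that count. The record layout (a `scores` dictionary per model, with `None` for a missing evaluation) is an assumption made for illustration and is not the Artificial Analysis API response format; the model names are placeholders.

```python
from collections import defaultdict

# Hypothetical records: one dict per model with a nested dict of benchmark scores.
# None marks a benchmark on which the model has not been evaluated.
records = [
    {"model": "model-a", "scores": {"gpqa": 0.81, "hle": 0.22, "scicode": None}},
    {"model": "model-b", "scores": {"gpqa": 0.77, "hle": None, "scicode": 0.41}},
    {"model": "model-c", "scores": {"gpqa": 0.69, "hle": 0.15, "scicode": 0.38}},
]


def coverage(records):
    """Return {benchmark: number of models with a non-missing score}."""
    counts = defaultdict(int)
    for rec in records:
        for bench, score in rec["scores"].items():
            if score is not None:
                counts[bench] += 1
    return dict(counts)


print(coverage(records))
```

Benchmarks with low counts are the ones where a comparison between models is least informative.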

The graph is dynamic. Hovering over a point will display a bar with further information.

1 b: Best models

Model	Domain	Benchmark	Details
GPT-5.5 (xhigh)	Agentic	tau_banking	τ²-Bench: benchmark for Tool-Agent-User interaction in real-world domains.
JT-35B-Flash	Agentic	tau2	τ²-Bench: benchmark for Tool-Agent-User interaction in real-world domains.
Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback)	Agentic	terminalbench_hard	Terminal-Bench: agentic benchmark evaluating agents on software engineering, sysadmin, and game-playing via a terminal.
Claude Opus 4.8 (Adaptive Reasoning, Max Effort)	Agentic	terminalbench_v2_1	Terminal-Bench: agentic benchmark evaluating agents on software engineering, sysadmin, and game-playing via a terminal.
Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback)	Agentic	terminalbench_v2_1	Terminal-Bench: agentic benchmark evaluating agents on software engineering, sysadmin, and game-playing via a terminal.
Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback)	AGI	artificial_analysis_intelligence_index	Artificial Analysis Intelligence Index combines performance across ten evaluations: GDPval-AA, τ²-Bench Telecom, Terminal-Bench Hard, SciCode, AA-LCR, AA-Omniscience, IFBench, Humanity’s Last Exam, GPQA Diamond, CritPt.
Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback)	AGI	hle	Humanity’s Last Exam: 2,500 challenging questions across over a hundred subjects at the frontier of human knowledge.
Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback)	Coding	artificial_analysis_coding_index	Artificial Analysis Intelligence Index combines performance across ten evaluations: GDPval-AA, τ²-Bench Telecom, Terminal-Bench Hard, SciCode, AA-LCR, AA-Omniscience, IFBench, Humanity’s Last Exam, GPQA Diamond, CritPt.
Gemini 3 Pro Preview (high)	Coding	livecodebench	LiveCodeBench: holistic, contamination-free evaluation of LLMs for code, updated continuously with new problems.
Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback)	Coding	scicode	SciCode: models a realistic scientist workflow of identifying science concepts and transforming them into simulation code.
Grok 4.3 (medium)	Instructions	ifbench	IFBench
GPT-5.2 Codex (xhigh)	Long context Reasoning	lcr	AA-LCR: measures ability to extract and synthesise information from long-form documents ranging from 10k to 100k tokens.
GPT-5 (high)	Maths	aime	AIME 2025: all 30 problems from the 2025 American Invitational Mathematics Examination.
GPT-5.2 (xhigh)	Maths	aime_25	AIME 2025
GPT-5.2 (xhigh)	Maths	artificial_analysis_math_index	Artificial Analysis Intelligence Index combines performance across ten evaluations: GDPval-AA, τ²-Bench Telecom, Terminal-Bench Hard, SciCode, AA-LCR, AA-Omniscience, IFBench, Humanity’s Last Exam, GPQA Diamond, CritPt.
Gemini 3.1 Pro Preview	Reasoning	gpqa	GPQA: 448 multiple-choice questions written by domain experts in biology, physics, and chemistry.
Gemini 3 Pro Preview (high)	Reasoning	mmlu_pro	MMLU-Pro: a more challenging and robust benchmark for language models across 12K complex questions in various disciplines.
GPT-5 (high)	Saturated	math_500	MATH-500: 500 problems spanning algebra, geometry, number theory, and probability, requiring step-by-step solutions.

This table shows the best-performing model for each benchmark.
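A table of this kind can be derived from raw scores by taking the highest-scoring model per benchmark. The sketch below assumes rows of `(model, benchmark, score)` where a higher score is better; the rows themselves are invented placeholders, not values from the live data.

```python
def best_per_benchmark(rows):
    """rows: iterable of (model, benchmark, score), higher score is better.

    Returns {benchmark: (model, score)}. On a tie, the first model seen wins.
    """
    best = {}
    for model, bench, score in rows:
        if bench not in best or score > best[bench][1]:
            best[bench] = (model, score)
    return best


# Placeholder rows for illustration only.
rows = [
    ("model-a", "gpqa", 0.81),
    ("model-b", "gpqa", 0.84),
    ("model-a", "scicode", 0.45),
    ("model-b", "scicode", 0.41),
]

for bench, (model, score) in sorted(best_per_benchmark(rows).items()):
    print(f"{bench}: {model} ({score:.2f})")
```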

1 c: Finding the balance

You can see that some models perform nearly as well on the benchmark as the best-performing model, at considerably lower cost.
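One way to make this balance explicit is to list the models that score within a chosen tolerance of the top score while costing less than the top scorer. The sketch below assumes each model has a single `score` and a single `cost`. The 5% tolerance, the field names and the sample figures are illustrative assumptions.

```python
def cheaper_near_best(models, tolerance=0.05):
    """Return models within `tolerance` (relative) of the top score that
    cost less than the top scorer, cheapest first."""
    top = max(models, key=lambda m: m["score"])
    floor = top["score"] * (1 - tolerance)
    picks = [
        m for m in models
        if m is not top and m["score"] >= floor and m["cost"] < top["cost"]
    ]
    return sorted(picks, key=lambda m: m["cost"])


# Placeholder data: score on one benchmark, cost in arbitrary price units.
models = [
    {"name": "model-a", "score": 0.90, "cost": 10.0},
    {"name": "model-b", "score": 0.88, "cost": 3.0},
    {"name": "model-c", "score": 0.70, "cost": 1.0},
    {"name": "model-d", "score": 0.87, "cost": 12.0},
]

print([m["name"] for m in cheaper_near_best(models)])
```

Widening the tolerance trades benchmark performance for a larger pool of cheaper candidates.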

1 d: Exploring

Lots of actionable information on one graph.

Click on a point to reveal the model name and benchmark. The click also keeps all the data points for that benchmark in focus. Click again to reset the view.

2 a: Human Judgement

Arena	models_benchmarked	best_model	ranked_second_model
agent	25	Claude Fable 5 (High)	GPT 5.5 (xHigh)
code	20	claude-fable-5	claude-opus-4-7-thinking
document	29	claude-opus-4-6	claude-opus-4-6-thinking
image-edit	49	gpt-image-2 (medium)	mai-image-2.5
image-to-video	41	gemini-omni-flash	dreamina-seedance-2.0-720p
search	31	claude-opus-4-6-search	gpt-5.5-search
text-to-image	70	gpt-image-2 (medium)	reve-2.0
text-to-video	41	gemini-omni-flash	dreamina-seedance-2.0-720p
text	20	claude-fable-5	claude-opus-4-6-thinking
video-edit	6	dreamina-seedance-2.0-720p	happyhorse-1.0
vision	40	claude-opus-4-7-thinking	claude-fable-5

Arenas allow humans to choose (online) between different models on a given task. The table shows the results of these human-judgement benchmarks.
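To give an intuition for how pairwise human choices become a ranking, here is a minimal sketch using a plain Elo update. This is a simplified stand-in chosen for illustration and is not claimed to be the method the Arena leaderboards use. The votes, the `k` factor and the starting rating are assumptions.

```python
def elo_ratings(votes, k=32.0, start=1000.0):
    """votes: iterable of (winner, loser) pairs from human comparisons.

    Returns {model: rating} after applying the Elo update to each vote in order.
    """
    ratings = {}
    for winner, loser in votes:
        rw = ratings.setdefault(winner, start)
        rl = ratings.setdefault(loser, start)
        # Probability the winner was expected to win, given current ratings.
        expected_w = 1.0 / (1.0 + 10 ** ((rl - rw) / 400.0))
        delta = k * (1.0 - expected_w)
        ratings[winner] = rw + delta
        ratings[loser] = rl - delta
    return ratings


# Placeholder votes: each tuple is (preferred model, other model).
votes = [
    ("model-a", "model-b"),
    ("model-a", "model-c"),
    ("model-b", "model-c"),
    ("model-a", "model-b"),
]

ratings = elo_ratings(votes)
for model in sorted(ratings, key=ratings.get, reverse=True):
    print(model, round(ratings[model], 1))
```

Sequential Elo depends on the order of the votes, which is one reason to treat a ranking built from few votes with caution.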

2 b: Ranking

Open-source models are ranked on a leaderboard based on human judgements. Some of them are near the top of the leaderboard and might be useful once we take into account the need for digital sovereignty or for keeping your data within your own data center.
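A quick check for this is to list where open-weight models sit within the top of a leaderboard. The sketch assumes the leaderboard is an ordered list of `(model, is_open_weight)` pairs, best first. The entries and the cut-off are placeholders.

```python
def open_weight_in_top(leaderboard, top_n=10):
    """leaderboard: list of (model, is_open_weight) ordered best first.

    Returns [(rank, model)] for open-weight models within the top_n positions.
    """
    return [
        (rank, model)
        for rank, (model, is_open) in enumerate(leaderboard[:top_n], start=1)
        if is_open
    ]


# Placeholder leaderboard, best first.
leaderboard = [
    ("closed-1", False),
    ("open-1", True),
    ("closed-2", False),
    ("open-2", True),
    ("open-3", True),
]

print(open_weight_in_top(leaderboard, top_n=4))
```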

2 b: Ranking Safety

Cisco LLM Security Leaderboard

Comprehensive model safety and security rankings, including single-turn score, multi-turn score, and detailed metrics.

Other available Security benchmarks:

CVE-Bench: evaluates AI agents on real-world web vulnerabilities and exploits collected from the National Vulnerability Database. CVE-Bench includes 40 critical-severity Common Vulnerabilities and Exposures (CVEs), with reference automatic exploits available on request.
TOSSS: a benchmark that measures the ability of LLMs to choose between secure and vulnerable code snippets.

2 c: Updating and Routing

id	open_weight	added_at	organization
kimi-k2.7-code	TRUE	2026-06-12 23:25:40	Moonshot AI
mimo-v2.5-pro	TRUE	2026-06-10 21:36:16	Xiaomi
mimo-v2.5	TRUE	2026-06-10 21:36:16	Xiaomi
diffusiongemma-26b-a4b-it	TRUE	2026-06-10 21:12:04	Google
claude-fable-5	FALSE	2026-06-09 20:20:02	Anthropic
gemma-4-12b-it	TRUE	2026-06-09 17:39:52	Google
minimax-m3	TRUE	2026-06-05 21:54:39	MiniMax
mai-thinking-1	FALSE	2026-06-05 21:54:38	Microsoft
mai-code-1-flash	FALSE	2026-06-05 21:54:38	Microsoft
nova-2-sonic	FALSE	2026-06-05 21:54:33	Amazon

LLM-Stats provides a live feed of model updates, including models newly added to the ecosystem. We can use this type of information to keep our benchmarks up to date and to identify promising new models for testing.
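As a sketch of how such a feed could drive testing, the snippet below picks out models added since the last check. The three entries are taken from the table above; holding them as a list of dictionaries and the helper name `added_since` are assumptions for illustration, not the LLM-Stats API.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"  # timestamp format used in the table above

FEED = [
    {"id": "kimi-k2.7-code", "open_weight": True, "added_at": "2026-06-12 23:25:40"},
    {"id": "claude-fable-5", "open_weight": False, "added_at": "2026-06-09 20:20:02"},
    {"id": "minimax-m3", "open_weight": True, "added_at": "2026-06-05 21:54:39"},
]


def added_since(feed, last_checked, open_weight_only=False):
    """Return ids of models added strictly after `last_checked`, newest first."""
    cutoff = datetime.strptime(last_checked, FMT)
    recent = [
        (datetime.strptime(m["added_at"], FMT), m["id"])
        for m in feed
        if not open_weight_only or m["open_weight"]
    ]
    return [mid for ts, mid in sorted(recent, reverse=True) if ts > cutoff]


print(added_since(FEED, "2026-06-08 00:00:00"))
print(added_since(FEED, "2026-06-08 00:00:00", open_weight_only=True))
```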

LLM-Stats also provides a routing service that automatically directs API calls to the best-performing model. We can use such a service to ensure that we always use a near-best available model for specific capabilities. Routing against a more complex community scorecard would require a discussion with a provider of routing services.
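The routing idea can be illustrated with a small lookup: for a requested capability, pick the highest-ranked model that is currently available. This is a conceptual sketch only. The ranking table, the model names and the `route` function are hypothetical and do not represent the LLM-Stats routing service.

```python
# Hypothetical scorecard: models per capability, ordered best first.
RANKINGS = {
    "coding": ["model-a", "model-b", "model-c"],
    "maths": ["model-b", "model-a"],
}


def route(capability, available, rankings=RANKINGS):
    """Return the highest-ranked model for `capability` that is in `available`."""
    for model in rankings.get(capability, []):
        if model in available:
            return model
    raise LookupError(f"no available model ranked for {capability!r}")


# model-a is unavailable, so coding requests fall through to model-b.
print(route("coding", available={"model-b", "model-c"}))
```

Because the scorecard is plain data, a community process could update it without changing the routing logic.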

3 a: More information

	Benchmark	Model.version	Release.date	Organization	Country
33	hella_swag	ada		OpenAI	United States of America
37	mmlu	amazon.nova-lite-v1:0	2024-12-03	Amazon	United States of America
44	trivia_qa	claude-2.0	2023-07-11	Anthropic	United States of America
29	cad_eval	claude-3-5-haiku-20241022	2024-10-22	Anthropic	United States of America
43	the_agent_company	claude-3-5-sonnet-20241022	2024-10-22	Anthropic	United States of America
11	weirdml	claude-fable-5_high	2026-06-09	Anthropic	United States of America
15	simplebench	claude-fable-5_max	2026-06-09	Anthropic	United States of America
17	metr_time_horizons	claude-mythos-preview-early
31	cybench	claude-opus-4-6_unknown	2026-02-05	Anthropic	United States of America
21	gso	claude-opus-4-7	2026-04-16	Anthropic	United States of America
19	terminalbench	claude-opus-4-7_unknown	2026-04-16	Anthropic	United States of America
22	webdev_arena	claude-opus-4-7_unknown	2026-04-16	Anthropic	United States of America
5	frontiermath	claude-opus-4-8_max	2026-05-28	Anthropic	United States of America
6	frontiermath_tier_4	claude-opus-4-8_max	2026-05-28	Anthropic	United States of America
20	posttrainbench	claude-opus-4-8_max	2026-05-28	Anthropic	United States of America
36	live_bench	gemini-2.5-pro-exp-03-25	2025-03-25	Google DeepMind	United States of America
14	geobench	gemini-3-flash-preview	2025-12-17	Google DeepMind	United States of America
10	balrog	gemini-3-pro-preview	2025-11-18	Google DeepMind	United States of America
12	vpct	gemini-3-pro-preview	2025-11-18	Google DeepMind	United States of America
25	arc_agi	gemini-3.1-pro-preview	2026-02-19	Google DeepMind	United States of America
48	hle	gemini-3.1-pro-preview	2026-02-19	Google DeepMind	United States of America
3	swe_bench_verified	gemini-3.5-flash_high	2026-05-19	Google	United States of America
46	apex_agents	gemini-3.5-flash_unknown	2026-05-19	Google	United States of America
2	math_level_5	gpt-5-mini-2025-08-07_high	2025-08-07	OpenAI	United States of America
16	gdpval	gpt-5.2-2025-12-11_none	2025-12-11	OpenAI	United States of America
47	arc_agi_2	gpt-5.5_xhigh	2026-04-23	OpenAI	United States of America
9	aider_polyglot	gpt-oss-120b_high	2025-08-05	OpenAI	United States of America
40	piqa	Llama-2-7b	2023-07-18	Meta AI	United States of America
28	bool_q	mpt-7b	2023-05-05	MosaicML	United States of America
34	lambada	mpt-7b	2023-05-05	MosaicML	United States of America
35	lech_mazur_writing	o3-2025-04-16_medium	2025-04-16	OpenAI	United States of America
45	wino_grande	PaLM 2-L	2023-05-17
27	bbh	Qwen-1_8B	2023-11-30
1	gpqa_diamond	qwen3.7-max	2026-05-19	Alibaba	China
4	otis_mock_aime_2024_2025	qwen3.7-max	2026-05-19	Alibaba	China
7	simpleqa_verified	qwen3.7-max	2026-05-19	Alibaba	China
8	chess_puzzles	qwen3.7-max	2026-05-19	Alibaba	China
49	epoch_capabilities_index	qwen3.7-max	2026-05-19	Alibaba	China
30	common_sense_qa_2	text-davinci-001	2022-01-27	OpenAI	United States of America
42	superglue	text-davinci-001	2022-01-27	OpenAI	United States of America
23	video_mme	video-SALMONN-2plus	2025-06-18	ByteDance	China
26	arc_ai2	Yi-6B	2023-11-02	01.AI	China

This table lists the best model for each benchmark. The dataset is from Epoch AI and is updated regularly.

Certain models lead on several benchmarks and are therefore robust across specific capabilities. For example, qwen3.7-max tops five of the benchmarks listed.
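Robustness in this sense can be read off by counting how many benchmarks each model leads. The pairs below are copied from a subset of the table above; the tuple layout is an assumption for illustration.

```python
from collections import Counter

# (benchmark, best model) pairs taken from a subset of the table above.
leaders = [
    ("gpqa_diamond", "qwen3.7-max"),
    ("otis_mock_aime_2024_2025", "qwen3.7-max"),
    ("simpleqa_verified", "qwen3.7-max"),
    ("chess_puzzles", "qwen3.7-max"),
    ("epoch_capabilities_index", "qwen3.7-max"),
    ("frontiermath", "claude-opus-4-8_max"),
    ("frontiermath_tier_4", "claude-opus-4-8_max"),
    ("posttrainbench", "claude-opus-4-8_max"),
    ("arc_agi_2", "gpt-5.5_xhigh"),
]

# Count the number of benchmarks each model leads.
wins = Counter(model for _, model in leaders)

for model, n in wins.most_common():
    print(f"{model}: leads {n} benchmark(s)")
```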

3 b: LLM Input / Output

OpenRouter operates one of the largest AI inference platforms in the world, generating an authoritative empirical dataset relied upon by government agencies, academic researchers, major industry analysts, and global media outlets.

Here we use their dataset on the input and output modalities of the models that they measure. Doing so lets policymakers keep track of the expanding range of modalities and their popularity. This is an important consideration as you select benchmarks relevant to the LLMs in your organisation.
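Tracking modalities amounts to counting how many models accept or produce each one. The sketch below assumes each model record carries `input_modalities` and `output_modalities` lists. These field names and the sample records are assumptions for illustration and should be checked against the actual dataset.

```python
from collections import Counter

# Placeholder records; field names are assumed, not confirmed by the source.
models = [
    {"id": "model-a", "input_modalities": ["text", "image"], "output_modalities": ["text"]},
    {"id": "model-b", "input_modalities": ["text"], "output_modalities": ["text"]},
    {"id": "model-c", "input_modalities": ["text", "image", "audio"], "output_modalities": ["text", "image"]},
]


def modality_counts(models, key):
    """Count how many models list each modality under `key`."""
    counts = Counter()
    for m in models:
        # set() guards against a modality being listed twice for one model.
        counts.update(set(m.get(key, [])))
    return counts


print(dict(modality_counts(models, "input_modalities")))
print(dict(modality_counts(models, "output_modalities")))
```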

Conclusions

Choosing AI models based on our values is possible via a set of current benchmarks. Many of the necessary benchmarks are available, but …

It is nuanced:

Not all benchmarks are created equal.
Different domains have different capability needs.
Some benchmarks are more accessible than others.
We might run our own benchmarks for missing models.
Benchmark results are only an estimate, but the more numerous (and more widely used) the benchmarks, the more reliable the estimate.
There are many model variations (size, fine-tuning, etc.), and the benchmarks capture only a subset of them.
Fine-tuning can improve models for our specific tasks, so we will have more models to test in the future.
Benchmarks become less reliable over time due to saturation effects, such as models being trained on benchmark data.
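Saturation can be screened for with a simple heuristic: flag a benchmark when its top few scores all sit close to the maximum, since it then no longer separates the leading models. The sketch below is one possible rule. The thresholds (`margin`, `top_k`) and the sample scores are assumptions, and it does not detect training on benchmark data.

```python
def is_saturated(scores, ceiling=1.0, margin=0.05, top_k=3):
    """Return True when the top_k scores are all within `margin` of `ceiling`.

    Benchmarks with fewer than top_k scores are never flagged.
    """
    top = sorted(scores, reverse=True)[:top_k]
    return len(top) == top_k and all(s >= ceiling - margin for s in top)


# Placeholder score lists on a 0-1 scale.
print(is_saturated([0.99, 0.98, 0.97, 0.60]))  # leaders bunched at the ceiling
print(is_saturated([0.62, 0.40, 0.35]))        # still room to separate models
```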

AI alignment requires a nuanced understanding of the details if we are to enact it safely. However, the payoff is great.

We do need a community process. Here is a mockup of the dashboard that supports such a process.