INTRODUCTION

Narrative

This dashboard shows that it is practical to collect live benchmark data and provides visual intuition into the value of this data for model selection. By the end of the story, we conclude that AI alignment through benchmark comparisons is operationalizable and requires a nuanced understanding of the details to be enacted safely. The dashboard then points to a second dashboard that explains some nuances and provides a mockup of a potential community practice.

The live data were gathered from the sources acknowledged below. Their efforts made this dashboard viable.

Acknowledgements

The live datasets are gathered from the following sources:

  1. Artificial Analysis: performs intelligence, quality, performance and price benchmarking on AI models, inference API endpoints and systems.
  2. /oolong-tea-2026: Auto-updated daily snapshots of every Arena AI (formerly LMSYS Chatbot Arena) leaderboard in structured JSON.
  3. Epoch: Investigating the trajectory of AI for the benefit of society.

1 a: Coverage


Artificial Analysis (https://artificialanalysis.ai) provides its gathered benchmarks for free through a REST API. Their coverage spans a range of benchmarks (around 10) based on their own AGI benchmark. Artificial Analysis also provides a detailed dashboard that you may consider interacting with in your own research.

The graph is dynamic. Hovering over a point will display a bar with further information.

1 b: Best models

Model | Domain | Benchmark | Details
GPT-5.5 (xhigh) | Agentic | tau_banking | τ²-Bench: benchmark for Tool-Agent-User interaction in real-world domains.
JT-35B-Flash | Agentic | tau2 | τ²-Bench: benchmark for Tool-Agent-User interaction in real-world domains.
Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) | Agentic | terminalbench_hard | Terminal-Bench: agentic benchmark evaluating agents on software engineering, sysadmin, and game-playing via a terminal.
Claude Opus 4.8 (Adaptive Reasoning, Max Effort) | Agentic | terminalbench_v2_1 | Terminal-Bench: agentic benchmark evaluating agents on software engineering, sysadmin, and game-playing via a terminal.
Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) | Agentic | terminalbench_v2_1 | Terminal-Bench: agentic benchmark evaluating agents on software engineering, sysadmin, and game-playing via a terminal.
Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) | AGI | artificial_analysis_intelligence_index | Artificial Analysis Intelligence Index combines performance across ten evaluations: GDPval-AA, τ²-Bench Telecom, Terminal-Bench Hard, SciCode, AA-LCR, AA-Omniscience, IFBench, Humanity’s Last Exam, GPQA Diamond, CritPt.
Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) | AGI | hle | Humanity’s Last Exam: 2,500 challenging questions across over a hundred subjects at the frontier of human knowledge.
Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) | Coding | artificial_analysis_coding_index | Artificial Analysis Intelligence Index combines performance across ten evaluations: GDPval-AA, τ²-Bench Telecom, Terminal-Bench Hard, SciCode, AA-LCR, AA-Omniscience, IFBench, Humanity’s Last Exam, GPQA Diamond, CritPt.
Gemini 3 Pro Preview (high) | Coding | livecodebench | LiveCodeBench: holistic, contamination-free evaluation of LLMs for code, updated continuously with new problems.
Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) | Coding | scicode | SciCode: models a realistic scientist workflow of identifying science concepts and transforming them into simulation code.
Grok 4.3 (medium) | Instructions | ifbench | IFBench
GPT-5.2 Codex (xhigh) | Long context Reasoning | lcr | AA-LCR: measures ability to extract and synthesise information from long-form documents ranging from 10k to 100k tokens.
GPT-5 (high) | Maths | aime | AIME 2025: all 30 problems from the 2025 American Invitational Mathematics Examination.
GPT-5.2 (xhigh) | Maths | aime_25 | AIM25
GPT-5.2 (xhigh) | Maths | artificial_analysis_math_index | Artificial Analysis Intelligence Index combines performance across ten evaluations: GDPval-AA, τ²-Bench Telecom, Terminal-Bench Hard, SciCode, AA-LCR, AA-Omniscience, IFBench, Humanity’s Last Exam, GPQA Diamond, CritPt.
Gemini 3.1 Pro Preview | Reasoning | gpqa | GPQA: 448 multiple-choice questions written by domain experts in biology, physics, and chemistry.
Gemini 3 Pro Preview (high) | Reasoning | mmlu_pro | MMLU-Pro: a more challenging and robust benchmark for language models across 12K complex questions in various disciplines.
GPT-5 (high) | Saturated | math_500 | MATH-500: 500 problems spanning algebra, geometry, number theory, and probability, requiring step-by-step solutions.

This table shows the best-performing model for each benchmark.

1 c: Finding the balance


You can see that some models perform well on the benchmark at considerably less cost than the best performing model.
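One way to make this trade-off explicit is to keep only the models that no other model beats on both cost and score (a Pareto frontier). The sketch below is a minimal illustration under stated assumptions: the `ModelPoint` fields, the model names, and the numbers are all hypothetical and are not taken from the dashboard data.

```python
from typing import NamedTuple


class ModelPoint(NamedTuple):
    name: str
    cost: float   # hypothetical price, e.g. USD per million tokens
    score: float  # benchmark score, higher is better


def pareto_frontier(points):
    """Return the models that no cheaper-or-equal model matches or beats on score."""
    frontier = []
    best_score = float("-inf")
    # Walk from cheapest to most expensive; on equal cost, higher score first.
    for p in sorted(points, key=lambda p: (p.cost, -p.score)):
        # Keep a model only if it improves on every cheaper model seen so far.
        if p.score > best_score:
            frontier.append(p)
            best_score = p.score
    return frontier


# Illustrative, made-up numbers (assumption, not dashboard data).
models = [
    ModelPoint("model-a", cost=10.0, score=0.90),
    ModelPoint("model-b", cost=2.0, score=0.85),
    ModelPoint("model-c", cost=3.0, score=0.70),
    ModelPoint("model-d", cost=0.5, score=0.60),
]

frontier_names = [p.name for p in pareto_frontier(models)]
```

In this made-up example, model-b sits on the frontier at a fifth of the cost of model-a, while model-c drops out because model-b is both cheaper and better.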

1 d: Exploring


This graph puts a lot of actionable information in one view.

Click on a point to reveal the model name and benchmark. The click also fixes all the data points for that benchmark in focus. Click again to reset the view.

2 a: Human Judgement

Arena | models_benchmarked | best_model | ranked_second_model
agent | 25 | Claude Fable 5 (High) | GPT 5.5 (xHigh)
code | 20 | claude-fable-5 | claude-opus-4-7-thinking
document | 29 | claude-opus-4-6 | claude-opus-4-6-thinking
image-edit | 49 | gpt-image-2 (medium) | mai-image-2.5
image-to-video | 41 | gemini-omni-flash | dreamina-seedance-2.0-720p
search | 31 | claude-opus-4-6-search | gpt-5.5-search
text-to-image | 70 | gpt-image-2 (medium) | reve-2.0
text-to-video | 41 | gemini-omni-flash | dreamina-seedance-2.0-720p
text | 20 | claude-fable-5 | claude-opus-4-6-thinking
video-edit | 6 | dreamina-seedance-2.0-720p | happyhorse-1.0
vision | 40 | claude-opus-4-7-thinking | claude-fable-5

Arenas allow humans to choose online between the outputs of different models on a given task. The table shows the results of these human-judgement benchmarks.

2 b: Ranking


Open-source models are ranked on a leaderboard based on human judgements. Some of them are near the top of the leaderboard and might be useful once we take into account the need for digital sovereignty, that is, keeping your data within your own data center.
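To give some intuition for how pairwise human votes can become a ranking, here is a minimal Elo-style sketch. This is an assumption for illustration only: the dashboard does not state which rating method the leaderboard uses, and the model names, starting rating, and `k` factor below are hypothetical.

```python
def expected_score(rating_a, rating_b):
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(ratings, winner, loser, k=32.0, start=1000.0):
    """Update ratings in place after one human vote (winner preferred over loser)."""
    r_win = ratings.get(winner, start)
    r_lose = ratings.get(loser, start)
    # The more surprising the result, the larger the rating shift.
    shift = k * (1.0 - expected_score(r_win, r_lose))
    ratings[winner] = r_win + shift
    ratings[loser] = r_lose - shift


# Hypothetical votes: each pair is (preferred model, other model).
votes = [
    ("model-x", "model-y"),
    ("model-x", "model-z"),
    ("model-y", "model-z"),
]

ratings = {}
for winner, loser in votes:
    update_elo(ratings, winner, loser)

# Highest rating first.
leaderboard = sorted(ratings, key=ratings.get, reverse=True)
```

Each vote moves the same number of points from the loser to the winner, so the total rating across models stays constant and only the relative order carries information.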

2 b: Ranking Safety


Cisco LLM Security Leaderboard

Comprehensive model safety and security rankings, including single-turn score, multi-turn score, and detailed metrics.

Other available Security benchmarks:

2 c: Updating and Routing

id | open_weight | added_at | organization
kimi-k2.7-code | TRUE | 2026-06-12 23:25:40 | Moonshot AI
mimo-v2.5-pro | TRUE | 2026-06-10 21:36:16 | Xiaomi
mimo-v2.5 | TRUE | 2026-06-10 21:36:16 | Xiaomi
diffusiongemma-26b-a4b-it | TRUE | 2026-06-10 21:12:04 | Google
claude-fable-5 | FALSE | 2026-06-09 20:20:02 | Anthropic
gemma-4-12b-it | TRUE | 2026-06-09 17:39:52 | Google
minimax-m3 | TRUE | 2026-06-05 21:54:39 | MiniMax
mai-thinking-1 | FALSE | 2026-06-05 21:54:38 | Microsoft
mai-code-1-flash | FALSE | 2026-06-05 21:54:38 | Microsoft
nova-2-sonic | FALSE | 2026-06-05 21:54:33 | Amazon

LLM-Stats provides a live feed of model updates, including newly added models to the ecosystem. We can use this type of information to keep our benchmarks up to date and to identify promising new models for testing.
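As a sketch of how such a feed could drive benchmark updates, the snippet below filters feed rows for models that are recent and not yet benchmarked. The row layout mirrors the columns of the table above, but the function name `new_candidates`, its parameters, and the `known` set are hypothetical; this is not the LLM-Stats API.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"

# A few rows copied from the table above, as plain dictionaries.
feed = [
    {"id": "kimi-k2.7-code", "open_weight": True, "added_at": "2026-06-12 23:25:40"},
    {"id": "claude-fable-5", "open_weight": False, "added_at": "2026-06-09 20:20:02"},
    {"id": "gemma-4-12b-it", "open_weight": True, "added_at": "2026-06-09 17:39:52"},
    {"id": "minimax-m3", "open_weight": True, "added_at": "2026-06-05 21:54:39"},
]


def new_candidates(feed, already_benchmarked, since, open_weight_only=False):
    """Return ids added at or after `since` that we have not benchmarked yet."""
    cutoff = datetime.strptime(since, TIME_FORMAT)
    picked = []
    for row in feed:
        added = datetime.strptime(row["added_at"], TIME_FORMAT)
        if added < cutoff or row["id"] in already_benchmarked:
            continue
        if open_weight_only and not row["open_weight"]:
            continue
        picked.append(row["id"])
    return picked


# Hypothetical set of models we have already tested.
known = {"gemma-4-12b-it"}

open_candidates = new_candidates(
    feed, known, since="2026-06-09 00:00:00", open_weight_only=True
)
all_candidates = new_candidates(feed, known, since="2026-06-09 00:00:00")
```

The `open_weight_only` switch is there because open-weight models matter when data has to stay in your own data center.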

LLM-Stats also provides a routing service that automatically directs API calls to the best-performing model. We can use such a service to ensure that we always use a near-best available model for specific capabilities. A more complex community scorecard would require a discussion with a provider of routing services.
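The idea behind capability-based routing can be sketched locally as a lookup from capability to the current best-scoring model. This is a simplified illustration, not how LLM-Stats or any other provider implements routing; the capability names, model names, and scores are hypothetical.

```python
def build_routing_table(scores):
    """Map each capability to its highest-scoring model.

    `scores` has the shape {capability: {model_name: score}}.
    """
    return {
        capability: max(by_model, key=by_model.get)
        for capability, by_model in scores.items()
        if by_model  # skip capabilities with no scored models
    }


def route(capability, table, default):
    """Pick the model for a capability, falling back to a default model."""
    return table.get(capability, default)


# Made-up scores for illustration only.
scores = {
    "coding": {"model-a": 0.71, "model-b": 0.78},
    "maths": {"model-a": 0.93, "model-b": 0.88},
    "vision": {},
}

table = build_routing_table(scores)
coding_model = route("coding", table, default="model-a")
vision_model = route("vision", table, default="model-a")
```

Rebuilding the table whenever new benchmark data arrives keeps the routing choice in step with the leaderboard.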

3 a: More information

Row | Benchmark | Model.version | Release.date | Organization | Country
33 | hella_swag | ada | | OpenAI | United States of America
37 | mmlu | amazon.nova-lite-v1:0 | 2024-12-03 | Amazon | United States of America
44 | trivia_qa | claude-2.0 | 2023-07-11 | Anthropic | United States of America
29 | cad_eval | claude-3-5-haiku-20241022 | 2024-10-22 | Anthropic | United States of America
43 | the_agent_company | claude-3-5-sonnet-20241022 | 2024-10-22 | Anthropic | United States of America
11 | weirdml | claude-fable-5_high | 2026-06-09 | Anthropic | United States of America
15 | simplebench | claude-fable-5_max | 2026-06-09 | Anthropic | United States of America
17 | metr_time_horizons | claude-mythos-preview-early | | |
31 | cybench | claude-opus-4-6_unknown | 2026-02-05 | Anthropic | United States of America
21 | gso | claude-opus-4-7 | 2026-04-16 | Anthropic | United States of America
19 | terminalbench | claude-opus-4-7_unknown | 2026-04-16 | Anthropic | United States of America
22 | webdev_arena | claude-opus-4-7_unknown | 2026-04-16 | Anthropic | United States of America
5 | frontiermath | claude-opus-4-8_max | 2026-05-28 | Anthropic | United States of America
6 | frontiermath_tier_4 | claude-opus-4-8_max | 2026-05-28 | Anthropic | United States of America
20 | posttrainbench | claude-opus-4-8_max | 2026-05-28 | Anthropic | United States of America
36 | live_bench | gemini-2.5-pro-exp-03-25 | 2025-03-25 | Google DeepMind | United States of America
14 | geobench | gemini-3-flash-preview | 2025-12-17 | Google DeepMind | United States of America
10 | balrog | gemini-3-pro-preview | 2025-11-18 | Google DeepMind | United States of America
12 | vpct | gemini-3-pro-preview | 2025-11-18 | Google DeepMind | United States of America
25 | arc_agi | gemini-3.1-pro-preview | 2026-02-19 | Google DeepMind | United States of America
48 | hle | gemini-3.1-pro-preview | 2026-02-19 | Google DeepMind | United States of America
3 | swe_bench_verified | gemini-3.5-flash_high | 2026-05-19 | Google | United States of America
46 | apex_agents | gemini-3.5-flash_unknown | 2026-05-19 | Google | United States of America
2 | math_level_5 | gpt-5-mini-2025-08-07_high | 2025-08-07 | OpenAI | United States of America
16 | gdpval | gpt-5.2-2025-12-11_none | 2025-12-11 | OpenAI | United States of America
47 | arc_agi_2 | gpt-5.5_xhigh | 2026-04-23 | OpenAI | United States of America
9 | aider_polyglot | gpt-oss-120b_high | 2025-08-05 | OpenAI | United States of America
40 | piqa | Llama-2-7b | 2023-07-18 | Meta AI | United States of America
28 | bool_q | mpt-7b | 2023-05-05 | MosaicML | United States of America
34 | lambada | mpt-7b | 2023-05-05 | MosaicML | United States of America
35 | lech_mazur_writing | o3-2025-04-16_medium | 2025-04-16 | OpenAI | United States of America
45 | wino_grande | PaLM 2-L | 2023-05-17 | |
27 | bbh | Qwen-1_8B | 2023-11-30 | |
1 | gpqa_diamond | qwen3.7-max | 2026-05-19 | Alibaba | China
4 | otis_mock_aime_2024_2025 | qwen3.7-max | 2026-05-19 | Alibaba | China
7 | simpleqa_verified | qwen3.7-max | 2026-05-19 | Alibaba | China
8 | chess_puzzles | qwen3.7-max | 2026-05-19 | Alibaba | China
49 | epoch_capabilities_index | qwen3.7-max | 2026-05-19 | Alibaba | China
30 | common_sense_qa_2 | text-davinci-001 | 2022-01-27 | OpenAI | United States of America
42 | superglue | text-davinci-001 | 2022-01-27 | OpenAI | United States of America
23 | video_mme | video-SALMONN-2plus | 2025-06-18 | ByteDance | China
26 | arc_ai2 | Yi-6B | 2023-11-02 | 01.AI | China

This table lists the best model for each benchmark. The dataset comes from Epoch AI and is updated regularly.

Certain models lead on several benchmarks, which suggests they are robust across the corresponding capabilities.
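A quick way to spot such models is to count how many benchmarks each model tops. The sketch below uses a subset of rows from the table above; the threshold of two benchmarks is an arbitrary assumption chosen for illustration.

```python
from collections import Counter

# Subset of (benchmark -> best model) pairs copied from the table above.
best_by_benchmark = {
    "gpqa_diamond": "qwen3.7-max",
    "otis_mock_aime_2024_2025": "qwen3.7-max",
    "simpleqa_verified": "qwen3.7-max",
    "chess_puzzles": "qwen3.7-max",
    "epoch_capabilities_index": "qwen3.7-max",
    "frontiermath": "claude-opus-4-8_max",
    "frontiermath_tier_4": "claude-opus-4-8_max",
    "posttrainbench": "claude-opus-4-8_max",
    "weirdml": "claude-fable-5_high",
    "arc_agi_2": "gpt-5.5_xhigh",
}

# Count how many benchmarks each model leads.
wins = Counter(best_by_benchmark.values())

# Keep models that lead on at least two benchmarks (arbitrary threshold).
robust = [(model, n) for model, n in wins.most_common() if n >= 2]
```

Topping a benchmark is only a rough proxy for robustness, since it ignores how close the runners-up are.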

3 b: LLM input / Output


OpenRouter operates one of the largest AI inference platforms in the world, generating an authoritative empirical dataset relied upon by government agencies, academic researchers, major industry analysts, and global media outlets.

Here we use their dataset on the input and output modalities of the models that they measure. This lets policymakers keep track of the expanding range of modalities and their popularity, which is an important consideration when you select benchmarks relevant to the LLMs in your organisation.
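Tracking modalities can be as simple as tallying how many models accept or produce each one. The records below are hypothetical: the field names `input_modalities` and `output_modalities` and the model ids are assumptions for illustration and may not match the actual dataset layout.

```python
from collections import Counter

# Hypothetical model records (assumed field names, made-up models).
models = [
    {"id": "model-a", "input_modalities": ["text"], "output_modalities": ["text"]},
    {"id": "model-b", "input_modalities": ["text", "image"], "output_modalities": ["text"]},
    {
        "id": "model-c",
        "input_modalities": ["text", "image", "audio"],
        "output_modalities": ["text", "audio"],
    },
]


def tally_modalities(models, field):
    """Count how many models list each modality under the given field."""
    counts = Counter()
    for model in models:
        # Use a set so a repeated entry is counted once per model.
        counts.update(set(model.get(field, [])))
    return counts


input_counts = tally_modalities(models, "input_modalities")
output_counts = tally_modalities(models, "output_modalities")
```

Running the same tally on successive snapshots would show which modalities are gaining ground over time.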

Conclusions

Choosing AI models based on our values is possible using a set of current benchmarks. Many of the necessary benchmarks are available, but …

It is nuanced:

AI alignment requires a nuanced understanding of the details for us to enact safely. However, the payoff is great.

We do need a community process. Here is a mockup of the dashboard that supports such a process.