This dashboard shows that it is practical to collect live benchmark data and provides visual intuition into the value of this data for model selection. By the end of the story, we conclude that AI alignment through benchmark comparisons is operationalizable and requires a nuanced understanding of the details to be enacted safely. The dashboard then points to a second dashboard that explains some nuances and provides a mockup of a potential community practice.
The live datasets are gathered from the sources acknowledged below. Their efforts made this dashboard viable.
Artificial Analysis (https://artificialanalysis.ai) provides its gathered benchmarks for free through a REST API. Its coverage spans around 10 benchmarks, based on its own AGI benchmark. Artificial Analysis also provides a detailed dashboard that you may want to explore in your own research.
The graph is dynamic. Hovering over a point will display a bar with further information.
| Model | Domain | Benchmark | Details |
|---|---|---|---|
| GPT-5.5 (xhigh) | Agentic | tau_banking | τ²-Bench: benchmark for Tool-Agent-User interaction in real-world domains. |
| JT-35B-Flash | Agentic | tau2 | τ²-Bench: benchmark for Tool-Agent-User interaction in real-world domains. |
| Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) | Agentic | terminalbench_hard | Terminal-Bench: agentic benchmark evaluating agents on software engineering, sysadmin, and game-playing via a terminal. |
| Claude Opus 4.8 (Adaptive Reasoning, Max Effort) | Agentic | terminalbench_v2_1 | Terminal-Bench: agentic benchmark evaluating agents on software engineering, sysadmin, and game-playing via a terminal. |
| Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) | Agentic | terminalbench_v2_1 | Terminal-Bench: agentic benchmark evaluating agents on software engineering, sysadmin, and game-playing via a terminal. |
| Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) | AGI | artificial_analysis_intelligence_index | Artificial Analysis Intelligence Index combines performance across ten evaluations: GDPval-AA, τ²-Bench Telecom, Terminal-Bench Hard, SciCode, AA-LCR, AA-Omniscience, IFBench, Humanity’s Last Exam, GPQA Diamond, CritPt. |
| Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) | AGI | hle | Humanity’s Last Exam: 2,500 challenging questions across over a hundred subjects at the frontier of human knowledge. |
| Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) | Coding | artificial_analysis_coding_index | Artificial Analysis Intelligence Index combines performance across ten evaluations: GDPval-AA, τ²-Bench Telecom, Terminal-Bench Hard, SciCode, AA-LCR, AA-Omniscience, IFBench, Humanity’s Last Exam, GPQA Diamond, CritPt. |
| Gemini 3 Pro Preview (high) | Coding | livecodebench | LiveCodeBench: holistic, contamination-free evaluation of LLMs for code, updated continuously with new problems. |
| Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) | Coding | scicode | SciCode: models a realistic scientist workflow of identifying science concepts and transforming them into simulation code. |
| Grok 4.3 (medium) | Instructions | ifbench | IFBench |
| GPT-5.2 Codex (xhigh) | Long context Reasoning | lcr | AA-LCR: measures ability to extract and synthesise information from long-form documents ranging from 10k to 100k tokens. |
| GPT-5 (high) | Maths | aime | AIME 2025: all 30 problems from the 2025 American Invitational Mathematics Examination. |
| GPT-5.2 (xhigh) | Maths | aime_25 | AIME 2025 |
| GPT-5.2 (xhigh) | Maths | artificial_analysis_math_index | Artificial Analysis Intelligence Index combines performance across ten evaluations: GDPval-AA, τ²-Bench Telecom, Terminal-Bench Hard, SciCode, AA-LCR, AA-Omniscience, IFBench, Humanity’s Last Exam, GPQA Diamond, CritPt. |
| Gemini 3.1 Pro Preview | Reasoning | gpqa | GPQA: 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. |
| Gemini 3 Pro Preview (high) | Reasoning | mmlu_pro | MMLU-Pro: a more challenging and robust benchmark for language models across 12K complex questions in various disciplines. |
| GPT-5 (high) | Saturated | math_500 | MATH-500: 500 problems spanning algebra, geometry, number theory, and probability, requiring step-by-step solutions. |
This table shows the best performing models across all benchmarks.
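A table like this can be derived from per-model scores with a simple reduction. The following is a minimal sketch, assuming the API response has already been parsed into a list of dictionaries; the field names `model`, `benchmark`, and `score` are illustrative assumptions, not the actual Artificial Analysis schema, and the sample values are placeholders rather than real results.

```python
from typing import Iterable


def best_per_benchmark(rows: Iterable[dict]) -> dict[str, dict]:
    """Return the highest-scoring row for each benchmark."""
    best: dict[str, dict] = {}
    for row in rows:
        score = row.get("score")
        if score is None:  # a model may have no result for a benchmark
            continue
        current = best.get(row["benchmark"])
        if current is None or score > current["score"]:
            best[row["benchmark"]] = row
    return best


# Placeholder records (hypothetical models and scores).
sample = [
    {"model": "model-a", "benchmark": "gpqa", "score": 0.81},
    {"model": "model-b", "benchmark": "gpqa", "score": 0.86},
    {"model": "model-a", "benchmark": "ifbench", "score": 0.74},
    {"model": "model-c", "benchmark": "ifbench", "score": None},
]

leaders = best_per_benchmark(sample)
for benchmark, row in sorted(leaders.items()):
    print(f"{benchmark}: {row['model']} ({row['score']:.2f})")
```

Ties are resolved in favour of the first row seen; a real pipeline might prefer to keep all tied models.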
You can see that some models perform well on the benchmark at considerably less cost than the best performing model.
This single graph carries a lot of actionable information.
Click on a point to reveal the model name and benchmark. The click also fixes all the data points for that benchmark in focus. Click again to reset the view.
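The cost observation from the graph can also be expressed as a filter: keep the models that score close to the leader on one benchmark while costing less. This is a minimal sketch under stated assumptions; the `score` and `cost` fields, the `tolerance` parameter, and the sample values are all hypothetical and only illustrate the idea.

```python
def value_picks(rows: list[dict], tolerance: float = 0.05) -> list[dict]:
    """Models within `tolerance` of the best score that cost less than the best model.

    `rows` holds results for a single benchmark. Results are sorted by cost,
    cheapest first.
    """
    best = max(rows, key=lambda r: r["score"])
    picks = [
        r
        for r in rows
        if r is not best
        and r["score"] >= best["score"] - tolerance
        and r["cost"] < best["cost"]
    ]
    return sorted(picks, key=lambda r: r["cost"])


# Placeholder records for one benchmark (hypothetical scores and costs).
results = [
    {"model": "model-a", "score": 0.90, "cost": 30.0},
    {"model": "model-b", "score": 0.88, "cost": 8.0},
    {"model": "model-c", "score": 0.70, "cost": 1.0},
    {"model": "model-d", "score": 0.86, "cost": 12.0},
]

for pick in value_picks(results):
    print(pick["model"], pick["score"], pick["cost"])
```

The tolerance is a policy choice: a wider band surfaces cheaper models at the price of a larger gap to the leader.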
| Arena | models_benchmarked | best_model | ranked_second_model |
|---|---|---|---|
| agent | 25 | Claude Fable 5 (High) | GPT 5.5 (xHigh) |
| code | 20 | claude-fable-5 | claude-opus-4-7-thinking |
| document | 29 | claude-opus-4-6 | claude-opus-4-6-thinking |
| image-edit | 49 | gpt-image-2 (medium) | mai-image-2.5 |
| image-to-video | 41 | gemini-omni-flash | dreamina-seedance-2.0-720p |
| search | 31 | claude-opus-4-6-search | gpt-5.5-search |
| text-to-image | 70 | gpt-image-2 (medium) | reve-2.0 |
| text-to-video | 41 | gemini-omni-flash | dreamina-seedance-2.0-720p |
| text | 20 | claude-fable-5 | claude-opus-4-6-thinking |
| video-edit | 6 | dreamina-seedance-2.0-720p | happyhorse-1.0 |
| vision | 40 | claude-opus-4-7-thinking | claude-fable-5 |
Arenas allow humans to choose (online) between different models on a given task. The table shows the results of human-judgement benchmarks.
Open-source models are ranked on a leaderboard based on human judgements. Some of them are near the top of the leaderboard and may be useful once we take into account the need for digital sovereignty or for keeping your data within your own data center.
Cisco LLM Security Leaderboard
Comprehensive model safety and security rankings, including single-turn score, multi-turn score, and detailed metrics.
Other available Security benchmarks:
| id | open_weight | added_at | organization |
|---|---|---|---|
| kimi-k2.7-code | TRUE | 2026-06-12 23:25:40 | Moonshot AI |
| mimo-v2.5-pro | TRUE | 2026-06-10 21:36:16 | Xiaomi |
| mimo-v2.5 | TRUE | 2026-06-10 21:36:16 | Xiaomi |
| diffusiongemma-26b-a4b-it | TRUE | 2026-06-10 21:12:04 | |
| claude-fable-5 | FALSE | 2026-06-09 20:20:02 | Anthropic |
| gemma-4-12b-it | TRUE | 2026-06-09 17:39:52 | |
| minimax-m3 | TRUE | 2026-06-05 21:54:39 | MiniMax |
| mai-thinking-1 | FALSE | 2026-06-05 21:54:38 | Microsoft |
| mai-code-1-flash | FALSE | 2026-06-05 21:54:38 | Microsoft |
| nova-2-sonic | FALSE | 2026-06-05 21:54:33 | Amazon |
LLM-Stats provides a live feed of model updates, including newly added models to the ecosystem. We can use this type of information to keep our benchmarks up to date and to identify promising new models for testing.
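Spotting new candidates in such a feed amounts to filtering on the `added_at` timestamp. Here is a minimal sketch, assuming the feed has been parsed into dictionaries with the `id`, `open_weight`, and `added_at` columns shown in the table above; the function name and the boolean encoding of `open_weight` are assumptions, not part of the LLM-Stats service.

```python
from datetime import datetime


def new_models(feed: list[dict], since: str, open_weight_only: bool = False) -> list[str]:
    """Return ids of models added on or after `since` (format YYYY-MM-DD)."""
    cutoff = datetime.strptime(since, "%Y-%m-%d")
    selected = []
    for entry in feed:
        added = datetime.strptime(entry["added_at"], "%Y-%m-%d %H:%M:%S")
        if added < cutoff:
            continue
        if open_weight_only and not entry["open_weight"]:
            continue
        selected.append(entry["id"])
    return selected


# A few rows copied from the table above.
feed = [
    {"id": "kimi-k2.7-code", "open_weight": True, "added_at": "2026-06-12 23:25:40"},
    {"id": "claude-fable-5", "open_weight": False, "added_at": "2026-06-09 20:20:02"},
    {"id": "minimax-m3", "open_weight": True, "added_at": "2026-06-05 21:54:39"},
    {"id": "nova-2-sonic", "open_weight": False, "added_at": "2026-06-05 21:54:33"},
]

print(new_models(feed, since="2026-06-09"))
print(new_models(feed, since="2026-06-09", open_weight_only=True))
```

The `open_weight_only` switch is one way to narrow the feed to models that can be hosted in your own data center.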
LLM-Stats also provides a routing service that automatically directs API calls to the best-performing model. We can use such a service to ensure that we always use a near-best available model for specific capabilities. A more complex community scorecard would require a discussion with a provider of routing services.
| | Benchmark | Model.version | Release.date | Organization | Country |
|---|---|---|---|---|---|
| 33 | hella_swag | ada | | OpenAI | United States of America |
| 37 | mmlu | amazon.nova-lite-v1:0 | 2024-12-03 | Amazon | United States of America |
| 44 | trivia_qa | claude-2.0 | 2023-07-11 | Anthropic | United States of America |
| 29 | cad_eval | claude-3-5-haiku-20241022 | 2024-10-22 | Anthropic | United States of America |
| 43 | the_agent_company | claude-3-5-sonnet-20241022 | 2024-10-22 | Anthropic | United States of America |
| 11 | weirdml | claude-fable-5_high | 2026-06-09 | Anthropic | United States of America |
| 15 | simplebench | claude-fable-5_max | 2026-06-09 | Anthropic | United States of America |
| 17 | metr_time_horizons | claude-mythos-preview-early | | | |
| 31 | cybench | claude-opus-4-6_unknown | 2026-02-05 | Anthropic | United States of America |
| 21 | gso | claude-opus-4-7 | 2026-04-16 | Anthropic | United States of America |
| 19 | terminalbench | claude-opus-4-7_unknown | 2026-04-16 | Anthropic | United States of America |
| 22 | webdev_arena | claude-opus-4-7_unknown | 2026-04-16 | Anthropic | United States of America |
| 5 | frontiermath | claude-opus-4-8_max | 2026-05-28 | Anthropic | United States of America |
| 6 | frontiermath_tier_4 | claude-opus-4-8_max | 2026-05-28 | Anthropic | United States of America |
| 20 | posttrainbench | claude-opus-4-8_max | 2026-05-28 | Anthropic | United States of America |
| 36 | live_bench | gemini-2.5-pro-exp-03-25 | 2025-03-25 | Google DeepMind | United States of America |
| 14 | geobench | gemini-3-flash-preview | 2025-12-17 | Google DeepMind | United States of America |
| 10 | balrog | gemini-3-pro-preview | 2025-11-18 | Google DeepMind | United States of America |
| 12 | vpct | gemini-3-pro-preview | 2025-11-18 | Google DeepMind | United States of America |
| 25 | arc_agi | gemini-3.1-pro-preview | 2026-02-19 | Google DeepMind | United States of America |
| 48 | hle | gemini-3.1-pro-preview | 2026-02-19 | Google DeepMind | United States of America |
| 3 | swe_bench_verified | gemini-3.5-flash_high | 2026-05-19 | | United States of America |
| 46 | apex_agents | gemini-3.5-flash_unknown | 2026-05-19 | | United States of America |
| 2 | math_level_5 | gpt-5-mini-2025-08-07_high | 2025-08-07 | OpenAI | United States of America |
| 16 | gdpval | gpt-5.2-2025-12-11_none | 2025-12-11 | OpenAI | United States of America |
| 47 | arc_agi_2 | gpt-5.5_xhigh | 2026-04-23 | OpenAI | United States of America |
| 9 | aider_polyglot | gpt-oss-120b_high | 2025-08-05 | OpenAI | United States of America |
| 40 | piqa | Llama-2-7b | 2023-07-18 | Meta AI | United States of America |
| 28 | bool_q | mpt-7b | 2023-05-05 | MosaicML | United States of America |
| 34 | lambada | mpt-7b | 2023-05-05 | MosaicML | United States of America |
| 35 | lech_mazur_writing | o3-2025-04-16_medium | 2025-04-16 | OpenAI | United States of America |
| 45 | wino_grande | PaLM 2-L | 2023-05-17 | | |
| 27 | bbh | Qwen-1_8B | 2023-11-30 | | |
| 1 | gpqa_diamond | qwen3.7-max | 2026-05-19 | Alibaba | China |
| 4 | otis_mock_aime_2024_2025 | qwen3.7-max | 2026-05-19 | Alibaba | China |
| 7 | simpleqa_verified | qwen3.7-max | 2026-05-19 | Alibaba | China |
| 8 | chess_puzzles | qwen3.7-max | 2026-05-19 | Alibaba | China |
| 49 | epoch_capabilities_index | qwen3.7-max | 2026-05-19 | Alibaba | China |
| 30 | common_sense_qa_2 | text-davinci-001 | 2022-01-27 | OpenAI | United States of America |
| 42 | superglue | text-davinci-001 | 2022-01-27 | OpenAI | United States of America |
| 23 | video_mme | video-SALMONN-2plus | 2025-06-18 | ByteDance | China |
| 26 | arc_ai2 | Yi-6B | 2023-11-02 | 01.AI | China |
This table lists the best model for each benchmark. The dataset comes from Epoch AI and is updated regularly.
Certain models are robust across a series of benchmarks, and therefore across the specific capabilities those benchmarks measure.
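One rough way to quantify this robustness is to count how many benchmarks each model leads. The sketch below uses a handful of (benchmark, model) pairs copied from the table above; the function name is hypothetical, and counting leading positions is only one possible proxy for robustness.

```python
from collections import Counter


def leader_counts(leaders: list[tuple[str, str]]) -> Counter:
    """Count how many benchmarks each model version leads."""
    return Counter(model for _benchmark, model in leaders)


# (benchmark, leading model) pairs taken from the table above.
leaders = [
    ("gpqa_diamond", "qwen3.7-max"),
    ("simpleqa_verified", "qwen3.7-max"),
    ("chess_puzzles", "qwen3.7-max"),
    ("frontiermath", "claude-opus-4-8_max"),
    ("posttrainbench", "claude-opus-4-8_max"),
    ("arc_agi_2", "gpt-5.5_xhigh"),
]

counts = leader_counts(leaders)
for model, n in counts.most_common():
    print(f"{model}: leads {n} benchmark(s)")
```

Note that suffixes such as `_max` or `_high` make variants of the same model count separately; grouping them would require a normalisation step that is not shown here.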
OpenRouter operates one of the largest AI inference platforms in the world, generating an authoritative empirical dataset relied upon by government agencies, academic researchers, major industry analysts, and global media outlets.
Here we use their dataset on the input and output modalities of the models they measure. Doing so lets policymakers keep track of the expanding range of modalities and their popularity. This is an important consideration when you select benchmarks relevant to the LLMs in your organisation.
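Tracking modalities can be as simple as tallying the input and output combinations across models. This is a minimal sketch, assuming each model record carries `input_modalities` and `output_modalities` lists; those field names and the sample records are assumptions for illustration and may not match the dataset used by this dashboard.

```python
from collections import Counter


def modality_counts(models: list[dict]) -> Counter:
    """Tally models by their 'inputs -> outputs' modality signature."""
    counts: Counter = Counter()
    for model in models:
        inputs = "+".join(sorted(model["input_modalities"]))
        outputs = "+".join(sorted(model["output_modalities"]))
        counts[f"{inputs} -> {outputs}"] += 1
    return counts


# Placeholder records (hypothetical models).
models = [
    {"id": "model-a", "input_modalities": ["text"], "output_modalities": ["text"]},
    {"id": "model-b", "input_modalities": ["text", "image"], "output_modalities": ["text"]},
    {"id": "model-c", "input_modalities": ["image", "text"], "output_modalities": ["text"]},
    {"id": "model-d", "input_modalities": ["text"], "output_modalities": ["image"]},
]

signature_counts = modality_counts(models)
for signature, n in signature_counts.most_common():
    print(f"{signature}: {n}")
```

Sorting the modality names before joining them keeps `text+image` and `image+text` in the same bucket.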
Choosing AI models based on our values is possible via a set of current benchmarks. Many of the necessary benchmarks are available, but …
It is nuanced:
AI alignment requires a nuanced understanding of the details if we are to enact it safely. However, the payoff is great.
We do need a community process. Here is a mockup of the dashboard that supports such a process.