Ranked field report
Leaderboard
Final Deep Research Arena Elo ratings across 1303 matches.
15 Models · 1303 Matches · 1205 Top Elo · 524 Spread
| # | Model | Elo | Matches | Wins | L / T | Win % |
|---|---|---|---|---|---|---|
| 01 | Claude Opus 4.6 (Anthropic) | 1205.3 | 144 | 88 | 52 / 4 | 61.1% |
| 02 | Gemini 3.1 Pro (Google) | 1192.2 | 113 | 75 | 38 / 0 | 66.4% |
| 03 | GPT 5.4 (OpenAI) | 1169.7 | 116 | 68 | 46 / 2 | 58.6% |
| 04 | o3 (OpenAI) | 1160.1 | 178 | 101 | 76 / 1 | 56.7% |
| 05 | GPT-5.1 (OpenAI) | 1134.7 | 205 | 105 | 99 / 1 | 51.2% |
| 06 | Gemini 2.5 Pro (Google) | 1102.7 | 178 | 86 | 90 / 2 | 48.3% |
| 07 | Grok 4 (xAI) | 1040.8 | 180 | 94 | 86 / 0 | 52.2% |
| 08 | Claude Opus 4.1 (Anthropic) | 1005.4 | 241 | 125 | 112 / 4 | 51.9% |
| 09 | Kimi K2 (Moonshot AI) | 971.3 | 267 | 133 | 128 / 6 | 49.8% |
| 10 | Sonar Pro (Perplexity) | 952.5 | 239 | 105 | 132 / 2 | 43.9% |
| 11 | DeepSeek V3.2 (DeepSeek) | 944.6 | 180 | 95 | 72 / 13 | 52.8% |
| 12 | GLM-4.7 (Zhipu AI) | 912.1 | 121 | 58 | 53 / 10 | 47.9% |
| 13 | Qwen3-235B (Alibaba) | 804.8 | 179 | 67 | 107 / 5 | 37.4% |
| 14 | Seed 1.6 (ByteDance) | 722.7 | 149 | 42 | 101 / 6 | 28.2% |
| 15 | Sonar Reasoning Pro (Perplexity) | 681.3 | 116 | 31 | 81 / 4 | 26.7% |
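
The leaderboard columns can be cross-checked with a few lines of Python. This is a minimal sketch, assuming the two numeric columns between Elo and L / T are total matches and wins, and that Win % is wins divided by matches with ties counted as non-wins. The helper names `win_pct` and `is_consistent` are hypothetical, and only four rows are copied in for illustration.

```python
# Rows copied from the leaderboard: (model, elo, matches, wins, losses, ties).
# Assumption: the unlabeled columns are total matches and wins.
ROWS = [
    ("Claude Opus 4.6", 1205.3, 144, 88, 52, 4),
    ("Gemini 3.1 Pro", 1192.2, 113, 75, 38, 0),
    ("DeepSeek V3.2", 944.6, 180, 95, 72, 13),
    ("Sonar Reasoning Pro", 681.3, 116, 31, 81, 4),
]


def win_pct(wins: int, matches: int) -> float:
    """Win percentage, ties counted as non-wins, rounded to one decimal."""
    return round(100.0 * wins / matches, 1)


def is_consistent(matches: int, wins: int, losses: int, ties: int) -> bool:
    """True when wins, losses and ties add up to the match count."""
    return matches == wins + losses + ties


for model, elo, matches, wins, losses, ties in ROWS:
    assert is_consistent(matches, wins, losses, ties), model
    print(f"{model}: {win_pct(wins, matches)}% over {matches} matches")

# Spread between the highest and lowest Elo among the rows above.
elos = [row[1] for row in ROWS]
spread = round(max(elos) - min(elos), 1)
print(f"Elo spread: {spread}")
```

Because the first and last ranked models are both in `ROWS`, the computed spread can be compared against the 524 figure reported with the table.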
Tournament Overview
#1 Elo Rating: Claude Opus 4.6 (1205.3)
15 Models · 1303 Matches · 524 Elo Spread (1st – Last)
Elo Distribution
| Model | Elo |
|---|---|
| Claude Opus 4.6 | 1205 |
| Gemini 3.1 Pro | 1192 |
| GPT 5.4 | 1170 |
| o3 | 1160 |
| GPT-5.1 | 1135 |
| Gemini 2.5 Pro | 1103 |
| Grok 4 | 1041 |
| Claude Opus 4.1 | 1005 |
| Kimi K2 | 971 |
| Sonar Pro | 952 |
| DeepSeek V3.2 | 945 |
| GLM-4.7 | 912 |
| Qwen3-235B | 805 |
| Seed 1.6 | 723 |
| Sonar Reasoning Pro | 681 |
Head-to-Head Win Rates
Observed win rate (%) from the row model's perspective; a dash marks a pairing with no reported rate.
| vs | Claude Opus 4.6 | Gemini 3.1 Pro | GPT 5.4 | o3 | GPT-5.1 | Gemini 2.5 Pro | Grok 4 | Claude Opus 4.1 | Kimi K2 | Sonar Pro | DeepSeek V3.2 | GLM-4.7 | Qwen3-235B | Seed 1.6 | Sonar Reasoning Pro |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Claude Opus 4.6 | — | 64 | — | 31 | 66 | 69 | — | — | 76 | — | — | — | — | — | — |
| Gemini 3.1 Pro | 36 | — | — | — | 67 | — | — | — | 76 | 86 | — | — | — | — | — |
| GPT 5.4 | — | — | — | 38 | 59 | 69 | — | 69 | — | — | — | — | — | — | — |
| o3 | 66 | — | 62 | — | 40 | 40 | 63 | 70 | — | — | — | — | — | — | — |
| GPT-5.1 | 34 | 33 | 38 | 60 | — | 60 | 57 | 73 | — | — | — | — | — | — | — |
| Gemini 2.5 Pro | 24 | — | 31 | 60 | 40 | — | 70 | — | — | 63 | — | — | — | — | — |
| Grok 4 | — | — | — | 37 | 43 | 30 | — | — | 67 | 50 | — | — | 87 | — | — |
| Claude Opus 4.1 | — | — | 28 | 30 | 27 | — | — | — | 53 | 50 | 77 | 0 | 77 | 77 | — |
| Kimi K2 | 21 | 24 | — | — | — | — | 33 | 47 | — | 63 | 47 | 57 | — | 77 | 79 |
| Sonar Pro | — | 14 | — | — | — | 37 | 50 | 47 | 33 | — | 50 | 60 | 60 | — | — |
| DeepSeek V3.2 | — | — | — | — | — | — | — | 23 | 47 | 50 | — | 40 | 73 | 83 | — |
| GLM-4.7 | — | — | — | — | — | — | — | 0 | 37 | 40 | 40 | — | — | — | 79 |
| Qwen3-235B | — | — | — | — | — | — | 13 | 23 | — | 40 | 20 | — | — | 63 | 66 |
| Seed 1.6 | — | — | — | — | — | — | — | 23 | 23 | — | 7 | — | 33 | — | 55 |
| Sonar Reasoning Pro | — | — | — | — | — | — | — | — | 21 | — | — | 21 | 28 | 38 | — |
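
One way to read the head-to-head matrix is to compare an observed rate with the win probability that the Elo gap alone would predict. The sketch below assumes the textbook logistic Elo formula with a scale of 400; the page does not say which Elo variant or tie handling the arena uses, so the numbers are indicative only. The names `expected_score` and `compare` are hypothetical, and the ratings and observed rates are copied from the tables above.

```python
# Assumption: standard logistic Elo with scale 400 (not confirmed by the source).
def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the standard Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


# Ratings from the leaderboard and observed row-vs-column rates (%) from the matrix.
ELO = {
    "Claude Opus 4.6": 1205.3,
    "Gemini 3.1 Pro": 1192.2,
    "Kimi K2": 971.3,
}
OBSERVED = {
    ("Claude Opus 4.6", "Gemini 3.1 Pro"): 64,
    ("Claude Opus 4.6", "Kimi K2"): 76,
}


def compare(row: str, col: str) -> tuple[float, int]:
    """Return (Elo-predicted %, observed %) for the row model against the column model."""
    predicted = 100.0 * expected_score(ELO[row], ELO[col])
    return round(predicted, 1), OBSERVED[(row, col)]


for row, col in OBSERVED:
    predicted, observed = compare(row, col)
    print(f"{row} vs {col}: predicted {predicted}%, observed {observed}%")
```

Gaps between the predicted and observed figures are expected: each pairing covers a limited number of matches, and the final ratings reflect every opponent a model faced, not just one.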