DR-Arena: an Automated Evaluation Framework for Deep Research Agents

Yiwen Gao1*, Ruochen Zhao2*, Yang Deng3, Wenxuan Zhang4†
1National University of Singapore, 2Nanyang Technological University, 3Singapore Management University, 4Singapore University of Technology and Design

Abstract

As Large Language Models (LLMs) increasingly operate as Deep Research (DR) Agents capable of autonomous investigation and information synthesis, reliable evaluation of their task performance has become a critical bottleneck. Current benchmarks predominantly rely on static datasets, which suffer from several limitations: limited task generality, temporal misalignment, and data contamination.

To address these limitations, we introduce DR-Arena, a fully automated evaluation framework that pushes DR agents to their capability limits through dynamic investigation. DR-Arena constructs real-time Information Trees from fresh web trends to ensure the evaluation rubric is synchronized with the live world state, and employs an automated Examiner to generate structured tasks testing two orthogonal capabilities: Deep reasoning and Wide coverage. DR-Arena further adopts an Adaptive Evolvement Loop, a state-machine controller that dynamically escalates task complexity based on real-time performance, demanding deeper deduction or wider aggregation until a decisive capability boundary emerges.

Experiments with six advanced DR agents demonstrate that DR-Arena achieves a Spearman correlation of 0.94 with the LMSYS Search Arena leaderboard. This represents state-of-the-art alignment with human preferences without any manual effort, validating DR-Arena as a reliable alternative to costly human adjudication.

🏆 DR-Arena Leaderboard

Rank Model DR-Arena Elo
1 GPT-5.1-Search 1084
2 Gemini-2.5-Pro-Grounding 1054
3 o3-Search 1041
4 Grok-4-Search 958
5 Perplexity-Sonar-Pro-High 942
6 Claude-Opus-4.1-Search 921
State-of-the-Art Alignment: DR-Arena achieves a Spearman Correlation of 0.94 with the human-verified LMSYS Search Arena rankings (Dec 3, 2025 version).

* Updated as of January 2026.

Methodology

Overview of the DR-Arena Framework

Figure 1: Overview of DR-Arena. The system operates as a closed-loop ecosystem with three stages: Dynamic Tree Construction, Automated Task Generation, and the Adaptive Evolvement Loop.

1. Automated Task Generation

To ensure task diversity, the Examiner samples a seed topic from Google Trends and constructs an information tree by scraping high-quality, information-rich websites. The tree is expanded via Depth Expansion (building multi-hop reasoning chains) and Width Expansion (aggregating sibling attributes). The Examiner then generates "Deep & Wide" questions that require traversing this topology, strictly avoiding data contamination.
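To make the expansion operators concrete, the sketch below shows one possible in-memory representation of such an information tree; the node fields, helper names, and fact-fetching callbacks are illustrative assumptions rather than the Examiner's actual pipeline.

```python
# Illustrative information-tree skeleton with depth and width expansion.
# Node fields, helper names, and the fact-fetching callbacks are assumptions
# for exposition, not the Examiner's actual implementation.
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class InfoNode:
    fact: str                                   # claim or entity stored at this node
    source_url: str                             # page the fact was scraped from
    children: List["InfoNode"] = field(default_factory=list)


def expand_depth(root: InfoNode,
                 fetch_linked_fact: Callable[[str], Tuple[str, str]]) -> InfoNode:
    """Add one more reasoning hop below the deepest node of the main chain."""
    leaf = root
    while leaf.children:
        leaf = leaf.children[0]                 # follow the primary reasoning chain
    fact, url = fetch_linked_fact(leaf.fact)
    child = InfoNode(fact, url)
    leaf.children.append(child)
    return child


def expand_width(node: InfoNode,
                 fetch_sibling_facts: Callable[[str, int], List[Tuple[str, str]]],
                 k: int = 3) -> List[InfoNode]:
    """Attach k sibling attributes under the same parent for aggregation tasks."""
    siblings = [InfoNode(f, u) for f, u in fetch_sibling_facts(node.fact, k)]
    node.children.extend(siblings)
    return siblings
```

A "Deep & Wide" question then asks for information that can only be recovered by walking the depth hops and aggregating the sibling attributes of this tree.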

Figure 2: Task Generation. The Examiner transforms topological web structures into complex research queries.

2. Evidence-Based Judgement

Evaluating open-ended reports is inherently challenging. DR-Arena employs the Examiner as a Judge using a strict two-stage protocol:

  • Hard Constraints: Verifying against the generated Checklist-Depth (Logic) and Checklist-Width (Data Completeness). Critical errors result in immediate penalties.
  • Soft Constraints: Assessing user experience aspects such as presentation quality, formatting, and information density.

Based on these constraints, the system assigns a tiered verdict (Much Better, Better, or Tie) and diagnoses the specific failure type of the losing agent to guide future rounds.

Table 1: Taxonomy of Failure Types
Failure Tag Diagnostic Criteria
DEEP Logic Failure. Failed to identify the correct core entity due to a broken reasoning chain.
WIDE Coverage Failure. Failed to aggregate specific attribute details (Data Gap).
BOTH Systemic Failure. Failed on both logical identification and factual completeness.
NONE Soft Gap. The loss was determined solely by soft filters (formatting or utility preferences).
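A minimal sketch of this adjudication logic, assuming boolean checklist outcomes and a scalar soft score per report; the field names, the 0.1 soft-score margin, and the verdict boundaries are illustrative assumptions rather than the Examiner's actual prompts.

```python
# Sketch of the two-stage protocol: hard checklist constraints dominate, and
# soft preferences only break ties. Field names and thresholds are assumed.
from enum import Enum


class FailureTag(Enum):
    DEEP = "logic failure"        # broken reasoning chain, wrong core entity
    WIDE = "coverage failure"     # missing attribute details (data gap)
    BOTH = "systemic failure"     # failed on both logic and completeness
    NONE = "soft gap"             # lost only on formatting / utility preferences


def diagnose(depth_ok: bool, width_ok: bool) -> FailureTag:
    """Map the losing report's hard-constraint results to a failure tag."""
    if not depth_ok and not width_ok:
        return FailureTag.BOTH
    if not depth_ok:
        return FailureTag.DEEP
    if not width_ok:
        return FailureTag.WIDE
    return FailureTag.NONE


def judge(a: dict, b: dict) -> tuple[str, str, FailureTag]:
    """Compare two reports; return (winner, tiered verdict, loser's failure tag)."""
    hard_a = a["depth_ok"] + a["width_ok"]        # 0, 1, or 2 hard checks passed
    hard_b = b["depth_ok"] + b["width_ok"]

    if hard_a != hard_b:                          # stage 1: hard constraints dominate
        winner, loser = (a, b) if hard_a > hard_b else (b, a)
        verdict = "Much Better" if abs(hard_a - hard_b) == 2 else "Better"
        tag = diagnose(loser["depth_ok"], loser["width_ok"])
    elif abs(a["soft"] - b["soft"]) > 0.1:        # stage 2: soft filters break the tie
        winner = a if a["soft"] > b["soft"] else b
        verdict, tag = "Better", FailureTag.NONE  # loss decided solely by soft filters
    else:
        return "Tie", "Tie", FailureTag.NONE

    return winner["name"], verdict, tag
```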

3. Adaptive Evolvement Loop

After each round, the Examiner applies a targeted probing strategy. If the agents reach a stalemate, the system intervenes to amplify their differences:

  • High-Quality Tie: The task is too easy. The system triggers a Pressure Test, increasing both Depth ($D$) and Width ($W$) to locate the capability ceiling.
  • Marginal Win: The system aggressively targets the loser's specific weakness (probing either Depth or Width) to force a decisive breakdown.
Table 2: The Evolvement Loop Transition Matrix
Adjudication Verdict | Diagnostic Signal | Evolution Action | Strategic Rationale
Tie (High Quality) | N/A | Pressure Test (D ↑ 1 & W ↑ 1) | Current task too easy; find ceiling.
Tie (Low Quality) | N/A | Backtrack (move to parent) | Current task too hard; re-establish baseline.
Winner Decided | DEEP (Logic Failure) | Probe Depth (D ↑ 1) | Challenge loser's reasoning capabilities.
Winner Decided | WIDE (Retrieval Failure) | Probe Width (W ↑ 1) | Challenge loser's information coverage.
Winner Decided | BOTH / NONE | Pressure Test (D ↑ 1 & W ↑ 1) | Ambiguous failure; increase difficulty.
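The transition matrix can be read as a small state machine over the task's depth ($D$) and width ($W$). The sketch below transcribes Table 2 directly; the state dictionary and action labels are assumptions, and backtracking to the parent node is simplified to a depth decrement.

```python
# Simplified transcription of Table 2. The (depth, width) state and action
# labels are assumed; real backtracking would move to the parent tree node.
def evolve(state: dict, verdict: str, tag: str, high_quality: bool) -> dict:
    """Return the next task-complexity state given the round's adjudication."""
    d, w = state["depth"], state["width"]

    if verdict == "Tie":
        if high_quality:                                     # task too easy
            return {"depth": d + 1, "width": w + 1, "action": "pressure_test"}
        return {"depth": max(d - 1, 1), "width": w, "action": "backtrack"}  # too hard

    # Winner decided: probe the loser's diagnosed weakness.
    if tag == "DEEP":
        return {"depth": d + 1, "width": w, "action": "probe_depth"}
    if tag == "WIDE":
        return {"depth": d, "width": w + 1, "action": "probe_width"}
    return {"depth": d + 1, "width": w + 1, "action": "pressure_test"}      # BOTH / NONE
```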

This adaptive mechanism ensures the system efficiently converges to a verdict by continuously pushing agents toward their specific breakdown points, acting as an efficient sorting algorithm for AI capabilities.

Experimental Analysis

Human Alignment

Elo Correlation

DR-Arena achieves superior alignment with human preferences compared to existing benchmarks, accurately recovering the hierarchy of the LMSYS Search Arena. The linear mapping confirms that our automated metrics effectively proxy human preference without manual annotation.
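For reference, the snippet below shows how a Spearman correlation between two six-model leaderboards is computed; the `lmsys_rank` ordering is hypothetical (one adjacent swap relative to DR-Arena) and serves only to illustrate the calculation, not the actual Search Arena ranking.

```python
# Rank-correlation check over six models. `lmsys_rank` is a hypothetical
# reference ordering, NOT the real Search Arena ranking.
from scipy.stats import spearmanr

models = ["GPT-5.1-Search", "Gemini-2.5-Pro-Grounding", "o3-Search",
          "Grok-4-Search", "Perplexity-Sonar-Pro-High", "Claude-Opus-4.1-Search"]
dr_arena_rank = [1, 2, 3, 4, 5, 6]      # ranks under DR-Arena, per the leaderboard above
lmsys_rank    = [1, 2, 3, 5, 4, 6]      # hypothetical: one adjacent pair swapped

rho, p_value = spearmanr(dr_arena_rank, lmsys_rank)
print(f"Spearman rho = {rho:.2f}")      # 0.94 for a single adjacent swap among six
```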

Adaptive Efficiency

Rounds vs Skill Gap

The system converges rapidly for high-divergence pairs. We observe a significant negative correlation (r = -0.61) between the skill gap of a pair and the number of rounds needed to reach a verdict, confirming that the loop acts as an efficient sorting algorithm, concentrating computational resources on distinguishing closely matched models.

Cognitive Profiles: Error Distribution

Figure 3: Per-Model Performance Profile. Models are ordered by their overall failure rates (shown at the top).

To provide a closer look at each model's performance profile, we dissect failure types in Figure 3. We define the Failure Rate as the ratio of non-winning rounds (losses and ties) to the total number of rounds played. Ordering models by this metric reveals a clear robustness hierarchy, with GPT-5.1-Search demonstrating the highest stability at the lowest failure rate of 50.17%.

Beyond aggregate rates, the normalized failure distribution exposes architectural trade-offs:

  • GPT-5.1-Search exhibits symmetric errors (27% DEEP, 26% WIDE), suggesting balanced capabilities without structural bottlenecks.
  • o3-Search shows asymmetry favoring logical deduction: despite higher total failures, it has the lowest DEEP failure rate (19%), with limitations primarily in coverage breadth (WIDE: 30%).
  • Models like PPL-Sonar-Pro and Grok-4 display the inverse pattern, with failures skewed toward DEEP reasoning deficits, indicating effective retrieval but inconsistent logical reasoning.
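A minimal sketch of how these two statistics (the overall failure rate and the normalized failure-tag distribution) can be derived from round records; the record fields `agents`, `winner`, and `tag` are assumptions for illustration.

```python
# Per-model failure profile: failure rate = non-winning rounds / rounds played,
# plus the normalized distribution of failure tags over those non-winning
# rounds. Round-record fields are assumed for illustration.
from collections import Counter


def failure_profile(model: str, rounds: list[dict]) -> tuple[float, dict]:
    """rounds: [{'agents': (a, b), 'winner': name or 'Tie', 'tag': 'DEEP'|'WIDE'|'BOTH'|'NONE'}, ...]"""
    played = [r for r in rounds if model in r["agents"]]
    non_wins = [r for r in played if r["winner"] != model]    # losses and ties

    failure_rate = len(non_wins) / len(played) if played else 0.0
    tag_counts = Counter(r["tag"] for r in non_wins)
    total = sum(tag_counts.values()) or 1
    distribution = {t: n / total for t, n in tag_counts.items()}
    return failure_rate, distribution
```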

Detailed Dataset Diagnostics

This section presents a macro-level statistical analysis of the full tournament dataset, covering 789 unique interaction rounds.

Verdict & Failure Modes
Verdict Distribution

The "Better" verdict (49.9%) significantly outnumbers "Much Better" (32.8%), indicating that top-tier agents often differentiate themselves through marginal improvements rather than catastrophic failures.

Failure types are remarkably balanced between DEEP (32.3%) and WIDE (30.0%), confirming that our "Deep & Wide" strategy exerts roughly equal pressure on reasoning and aggregation capabilities.

Topological Complexity

Tree Depth: While peaking at Depth 2, the distribution has a long tail extending to Depth 8, confirming the system's ability to generate long-horizon deduction tasks.

Width Constraint: A notable portion of tasks requires synthesizing 4 to 7 distinct data points, effectively testing context window management and retrieval recall.