DR-Arena: an Automated Evaluation Framework for Deep Research Agents

Yiwen Gao1*, Ruochen Zhao2*, Yang Deng3, Wenxuan Zhang4†
1National University of Singapore, 2Nanyang Technological University, 3Singapore Management University, 4Singapore University of Technology and Design

Abstract

As Large Language Models (LLMs) increasingly operate as Deep Research (DR) Agents capable of autonomous investigation and information synthesis, reliable evaluation of their task performance has become a critical bottleneck. Current benchmarks predominantly rely on static datasets, which suffer from several limitations: limited task generality, temporal misalignment, and data contamination.

To address these limitations, we introduce DR-Arena, a fully automated evaluation framework that pushes DR agents to their capability limits through dynamic investigation. DR-Arena constructs real-time Information Trees from fresh web trends to ensure the evaluation rubric is synchronized with the live world state, and employs an automated Examiner to generate structured tasks testing two orthogonal capabilities: Deep reasoning and Wide coverage. DR-Arena further adopts an Adaptive Evolvement Loop, a state-machine controller that dynamically escalates task complexity based on real-time performance, demanding deeper deduction or wider aggregation until a decisive capability boundary emerges.

Experiments with six advanced DR agents demonstrate that DR-Arena achieves a Spearman correlation of 0.94 with the LMSYS Search Arena leaderboard. This represents state-of-the-art alignment with human preferences without any manual effort, validating DR-Arena as a reliable alternative to costly human adjudication.

🏆 DR-Arena Leaderboard

Rank Model DR-Arena Elo
1 GPT-5.1-Search 1084
2 Gemini-2.5-Pro-Grounding 1054
3 o3-Search 1041
4 Grok-4-Search 958
5 Perplexity-Sonar-Pro-High 942
6 Claude-Opus-4.1-Search 921
State-of-the-Art Alignment: DR-Arena achieves a Spearman Correlation of 0.94 with the human-verified LMSYS Search Arena rankings (Dec 3, 2025 version).

* Updated as of January 2026.

Methodology

Overview of the DR-Arena Framework

Figure 1: Overview of DR-Arena. The system operates as a closed-loop ecosystem with three stages: Dynamic Tree Construction, Automated Task Generation, and the Adaptive Evolvement Loop.

1. Automated Task Generation

To ensure task diversity, the Examiner samples a seed topic from Google Trends and constructs an information tree by scraping high-quality, information-rich websites. The tree is expanded via Depth Expansion (building multi-hop reasoning chains) and Width Expansion (aggregating sibling attributes). The Examiner then generates "Deep & Wide" questions that require traversing this topology, strictly avoiding data contamination.
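To make the expansion operators concrete, the sketch below shows one possible in-memory representation of such an information tree; the node fields, helper names, and fact-fetching callbacks are illustrative assumptions rather than the Examiner's actual pipeline.

```python
# Illustrative information-tree skeleton with depth and width expansion.
# Node fields, helper names, and the fact-fetching callbacks are assumptions
# for exposition, not the Examiner's actual implementation.
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class InfoNode:
    fact: str                                   # claim or entity stored at this node
    source_url: str                             # page the fact was scraped from
    children: List["InfoNode"] = field(default_factory=list)


def expand_depth(root: InfoNode,
                 fetch_linked_fact: Callable[[str], Tuple[str, str]]) -> InfoNode:
    """Add one more reasoning hop below the deepest node of the main chain."""
    leaf = root
    while leaf.children:
        leaf = leaf.children[0]                 # follow the primary reasoning chain
    fact, url = fetch_linked_fact(leaf.fact)
    child = InfoNode(fact, url)
    leaf.children.append(child)
    return child


def expand_width(node: InfoNode,
                 fetch_sibling_facts: Callable[[str, int], List[Tuple[str, str]]],
                 k: int = 3) -> List[InfoNode]:
    """Attach k sibling attributes under the same parent for aggregation tasks."""
    siblings = [InfoNode(f, u) for f, u in fetch_sibling_facts(node.fact, k)]
    node.children.extend(siblings)
    return siblings
```

A "Deep & Wide" question then asks for information that can only be recovered by walking the depth hops and aggregating the sibling attributes of this tree.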

Figure 2: Task Generation. The Examiner transforms topological web structures into complex research queries.

2. Evidence-Based Judgement

Evaluating open-ended reports is inherently challenging. DR-Arena employs the Examiner as a Judge using a strict two-stage protocol:

  • Hard Constraints: Verifying against the generated Checklist-Depth (Logic) and Checklist-Width (Data Completeness). Critical errors result in immediate penalties.
  • Soft Constraints: Assessing user experience aspects such as presentation quality, formatting, and information density.

Based on these constraints, the system assigns a tiered verdict (Much Better, Better, or Tie) and diagnoses the specific failure type of the losing agent to guide future rounds.

Table 1: Taxonomy of Failure Types
Failure Tag Diagnostic Criteria
DEEP Logic Failure. Failed to identify the correct core entity due to a broken reasoning chain.
WIDE Coverage Failure. Failed to aggregate specific attribute details (Data Gap).
BOTH Systemic Failure. Failed on both logical identification and factual completeness.
NONE Soft Gap. The loss was determined solely by soft filters (formatting or utility preferences).
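A minimal sketch of this adjudication logic, assuming boolean checklist outcomes and a scalar soft score per report; the field names, the 0.1 soft-score margin, and the verdict boundaries are illustrative assumptions rather than the Examiner's actual prompts.

```python
# Sketch of the two-stage protocol: hard checklist constraints dominate, and
# soft preferences only break ties. Field names and thresholds are assumed.
from enum import Enum


class FailureTag(Enum):
    DEEP = "logic failure"        # broken reasoning chain, wrong core entity
    WIDE = "coverage failure"     # missing attribute details (data gap)
    BOTH = "systemic failure"     # failed on both logic and completeness
    NONE = "soft gap"             # lost only on formatting / utility preferences


def diagnose(depth_ok: bool, width_ok: bool) -> FailureTag:
    """Map the losing report's hard-constraint results to a failure tag."""
    if not depth_ok and not width_ok:
        return FailureTag.BOTH
    if not depth_ok:
        return FailureTag.DEEP
    if not width_ok:
        return FailureTag.WIDE
    return FailureTag.NONE


def judge(a: dict, b: dict) -> tuple[str, str, FailureTag]:
    """Compare two reports; return (winner, tiered verdict, loser's failure tag)."""
    hard_a = a["depth_ok"] + a["width_ok"]        # 0, 1, or 2 hard checks passed
    hard_b = b["depth_ok"] + b["width_ok"]

    if hard_a != hard_b:                          # stage 1: hard constraints dominate
        winner, loser = (a, b) if hard_a > hard_b else (b, a)
        verdict = "Much Better" if abs(hard_a - hard_b) == 2 else "Better"
        tag = diagnose(loser["depth_ok"], loser["width_ok"])
    elif abs(a["soft"] - b["soft"]) > 0.1:        # stage 2: soft filters break the tie
        winner = a if a["soft"] > b["soft"] else b
        verdict, tag = "Better", FailureTag.NONE  # loss decided solely by soft filters
    else:
        return "Tie", "Tie", FailureTag.NONE

    return winner["name"], verdict, tag
```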

3. Adaptive Evolvement Loop

After each round, the Examiner applies a targeted probing strategy. If the agents reach a stalemate, the system intervenes to amplify their differences:

  • High-Quality Tie: The task is too easy. The system triggers a Pressure Test, increasing both Depth ($D$) and Width ($W$) to locate the capability ceiling.
  • Marginal Win: The system aggressively targets the loser's specific weakness (probing either Depth or Width) to force a decisive breakdown.
Table 2: The Evolvement Loop Transition Matrix
Adjudication Verdict | Diagnostic Signal | Evolution Action | Strategic Rationale
Tie (High Quality) | N/A | Pressure Test (D ↑ 1 & W ↑ 1) | Current task too easy; find ceiling.
Tie (Low Quality) | N/A | Backtrack (move to parent) | Current task too hard; re-establish baseline.
Winner Decided | DEEP (Logic Failure) | Probe Depth (D ↑ 1) | Challenge loser's reasoning capabilities.
Winner Decided | WIDE (Retrieval Failure) | Probe Width (W ↑ 1) | Challenge loser's information coverage.
Winner Decided | BOTH / NONE | Pressure Test (D ↑ 1 & W ↑ 1) | Ambiguous failure; increase difficulty.
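The transition matrix can be read as a small state machine over the task's depth ($D$) and width ($W$). The sketch below transcribes Table 2 directly; the state dictionary and action labels are assumptions, and backtracking to the parent node is simplified to a depth decrement.

```python
# Simplified transcription of Table 2. The (depth, width) state and action
# labels are assumed; real backtracking would move to the parent tree node.
def evolve(state: dict, verdict: str, tag: str, high_quality: bool) -> dict:
    """Return the next task-complexity state given the round's adjudication."""
    d, w = state["depth"], state["width"]

    if verdict == "Tie":
        if high_quality:                                     # task too easy
            return {"depth": d + 1, "width": w + 1, "action": "pressure_test"}
        return {"depth": max(d - 1, 1), "width": w, "action": "backtrack"}  # too hard

    # Winner decided: probe the loser's diagnosed weakness.
    if tag == "DEEP":
        return {"depth": d + 1, "width": w, "action": "probe_depth"}
    if tag == "WIDE":
        return {"depth": d, "width": w + 1, "action": "probe_width"}
    return {"depth": d + 1, "width": w + 1, "action": "pressure_test"}      # BOTH / NONE
```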

This adaptive mechanism ensures the system efficiently converges to a verdict by continuously pushing agents toward their specific breakdown points, acting as an efficient sorting algorithm for AI capabilities.

Experimental Analysis

Human Alignment

Elo Correlation

DR-Arena achieves superior alignment with human preferences compared to existing benchmarks, accurately recovering the hierarchy of the LMSYS Search Arena. The linear mapping confirms that our automated metrics effectively proxy human preference without manual annotation.
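For reference, the snippet below shows how a Spearman correlation between two six-model leaderboards is computed; the `lmsys_rank` ordering is hypothetical (one adjacent swap relative to DR-Arena) and serves only to illustrate the calculation, not the actual Search Arena ranking.

```python
# Rank-correlation check over six models. `lmsys_rank` is a hypothetical
# reference ordering, NOT the real Search Arena ranking.
from scipy.stats import spearmanr

models = ["GPT-5.1-Search", "Gemini-2.5-Pro-Grounding", "o3-Search",
          "Grok-4-Search", "Perplexity-Sonar-Pro-High", "Claude-Opus-4.1-Search"]
dr_arena_rank = [1, 2, 3, 4, 5, 6]      # ranks under DR-Arena, per the leaderboard above
lmsys_rank    = [1, 2, 3, 5, 4, 6]      # hypothetical: one adjacent pair swapped

rho, p_value = spearmanr(dr_arena_rank, lmsys_rank)
print(f"Spearman rho = {rho:.2f}")      # 0.94 for a single adjacent swap among six
```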

Adaptive Efficiency

Rounds vs Skill Gap

The system converges rapidly for high-divergence pairs. We observe a significant negative correlation (r = -0.61) between the skill gap of a pair and the number of rounds needed to reach a verdict, confirming that the loop acts as an efficient sorting algorithm, concentrating computational resources on distinguishing closely matched models.

Cognitive Profiles: Error Distribution

Figure 3: Per-Model Performance Profile. Models are ordered by their overall failure rates (shown at the top).

To provide a closer look at each model's performance profile, we dissect failure types in Figure 3. We define the Failure Rate as the ratio of non-winning rounds (losses and ties) to the total number of rounds played. Ordering models by this metric reveals a clear robustness hierarchy, with GPT-5.1-Search demonstrating the highest stability at the lowest failure rate of 50.17%.

Beyond aggregate rates, the normalized failure distribution exposes architectural trade-offs:

  • GPT-5.1-Search exhibits symmetric errors (27% DEEP, 26% WIDE), suggesting balanced capabilities without structural bottlenecks.
  • o3-Search shows asymmetry favoring logical deduction: despite higher total failures, it has the lowest DEEP failure rate (19%), with limitations primarily in coverage breadth (WIDE: 30%).
  • Models like PPL-Sonar-Pro and Grok-4 display the inverse pattern, with failures skewed toward DEEP reasoning deficits, indicating effective retrieval but inconsistent logical reasoning.
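A minimal sketch of how these two statistics (the overall failure rate and the normalized failure-tag distribution) can be derived from round records; the record fields `agents`, `winner`, and `tag` are assumptions for illustration.

```python
# Per-model failure profile: failure rate = non-winning rounds / rounds played,
# plus the normalized distribution of failure tags over those non-winning
# rounds. Round-record fields are assumed for illustration.
from collections import Counter


def failure_profile(model: str, rounds: list[dict]) -> tuple[float, dict]:
    """rounds: [{'agents': (a, b), 'winner': name or 'Tie', 'tag': 'DEEP'|'WIDE'|'BOTH'|'NONE'}, ...]"""
    played = [r for r in rounds if model in r["agents"]]
    non_wins = [r for r in played if r["winner"] != model]    # losses and ties

    failure_rate = len(non_wins) / len(played) if played else 0.0
    tag_counts = Counter(r["tag"] for r in non_wins)
    total = sum(tag_counts.values()) or 1
    distribution = {t: n / total for t, n in tag_counts.items()}
    return failure_rate, distribution
```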

Detailed Dataset Diagnostics

This section presents a macro-level statistical analysis of the full tournament dataset, covering 789 unique interaction rounds.

Verdict & Failure Modes
Verdict Distribution

The "Better" verdict (49.9%) significantly outnumbers "Much Better" (32.8%), indicating that top-tier agents often differentiate themselves through marginal improvements rather than catastrophic failures.

Failure types are remarkably balanced between DEEP (32.3%) and WIDE (30.0%), confirming that our "Deep & Wide" strategy exerts roughly equal pressure on reasoning and aggregation capabilities.

Topological Complexity

Tree Depth: While peaking at Depth 2, the distribution has a long tail extending to Depth 8, confirming the system's ability to generate long-horizon deduction tasks.

Width Constraint: A notable portion of tasks requires synthesizing 4 to 7 distinct data points, effectively testing context window management and retrieval recall.