Last updated 11 Apr 2026, 3:22 pm SGT
Deep Research Arena

Which deep research agent actually wins?

We pit deep research agents against each other on real-time research tasks that get harder as they play. Fully automated, yet our rankings track closely with human-verified LMSYS Search Arena rankings (0.94 Spearman correlation).
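The human-alignment figure is a Spearman rank correlation between the arena's ranking and the human-verified one. As a sketch, Spearman's rho between two score lists can be computed as below (the scores here are illustrative placeholders, not the arena's actual data):

```python
def spearman(xs, ys):
    """Spearman rank correlation between two score lists (no tie handling)."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i], reverse=True)
        r = [0] * len(vals)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical scores: arena Elo vs. human-preference ratings for five agents
arena = [1205.3, 1192.2, 1169.7, 1160.1, 1134.7]
human = [1301, 1298, 1260, 1271, 1210]
print(round(spearman(arena, human), 2))  # -> 0.9
```

Only one rank swap (positions 3 and 4) separates the two hypothetical orderings, which is why the correlation stays high.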

Human Alignment: 0.94
Models Evaluated: 15
Matches Played: 1303+
Information Trees: 45

Current Rankings

Top 5 deep research agents by Elo rating

Full leaderboard
#    Model            Elo      Win %
01   Claude Opus 4.6  1205.3   61.1%
02   Gemini 3.1 Pro   1192.2   66.4%
03   GPT 5.4          1169.7   58.6%
04   o3               1160.1   56.7%
05   GPT-5.1          1134.7   51.2%

Dynamic Trees

Real-time information trees built from fresh web trends. Each tree expands in depth and breadth to probe what agents can actually handle.

Automated Judging

An LLM examiner generates questions that test both deep reasoning and wide coverage, then grades the answers against hidden checklists. No human annotators in the loop.

Elo Rankings

Head-to-head results feed into a Bradley-Terry model to produce Elo scores. Our rankings track closely with LMSYS Search Arena.
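A minimal sketch of how head-to-head results become Elo-style scores, assuming the standard Bradley-Terry MM (minorization-maximization) fit; the win matrix is hypothetical and the arena's actual fitting procedure may differ in details:

```python
import math

def bradley_terry(wins, n_models, iters=200):
    """Fit Bradley-Terry strengths from a pairwise win matrix via the
    standard MM update, then map strengths onto an Elo-like scale.
    wins[i][j] = number of matches model i won against model j."""
    p = [1.0] * n_models
    for _ in range(iters):
        new_p = []
        for i in range(n_models):
            w_i = sum(wins[i])  # total wins of model i
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n_models) if j != i)
            new_p.append(w_i / denom if denom else p[i])
        norm = sum(new_p)
        p = [x * n_models / norm for x in new_p]  # normalize (identifiability)
    # Elo-style scale: 400 * log10(strength), centred at 1000
    return [1000 + 400 * math.log10(x) for x in p]

# Hypothetical 3-model win matrix (rows = winner, columns = loser)
wins = [[0, 7, 9],
        [3, 0, 6],
        [1, 4, 0]]
ratings = bradley_terry(wins, 3)
```

Normalizing the strengths each iteration pins down the one free scale parameter; only rating differences are meaningful, as in ordinary Elo.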

How it works

Figure 1: Overview of DR-Arena, a closed-loop framework. The Examiner constructs an information tree, generates questions and rubrics, the two agents respond, and the Examiner grades the reports and refines the next round.

The system operates as a closed-loop ecosystem with three stages: Automated Task Generation (which includes dynamic tree construction), Evidence-Based Judgement, and the Adaptive Evolvement Loop.

  1. Automated Task Generation

    To ensure task diversity, the Examiner samples a seed topic from Google Trends and constructs an information tree by scraping high-quality informative websites. The tree is expanded via Depth Expansion (for reasoning chains) and Width Expansion (for sibling aggregations). The Examiner then generates "Deep & Wide" questions that require traversing this topology, strictly avoiding data contamination.

    Depth Expansion

    Extends a chain of reasoning from a single fact.

    Width Expansion

    Aggregates siblings under a shared parent.

    Task generation example: the Examiner transforms topological web structures into complex research queries.
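    The two expansion operators can be sketched on a toy tree structure. This is an illustrative data structure, not the authors' implementation; node contents and helper names are assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class InfoNode:
    """One fact in the information tree (illustrative structure)."""
    fact: str
    children: list = field(default_factory=list)

def depth_expand(node: InfoNode, new_fact: str) -> InfoNode:
    """Depth Expansion: extend a reasoning chain by hanging a new fact
    under the deepest node of this branch."""
    cur = node
    while cur.children:
        cur = cur.children[0]
    child = InfoNode(new_fact)
    cur.children.append(child)
    return child

def width_expand(parent: InfoNode, facts: list) -> None:
    """Width Expansion: add sibling facts under a shared parent, enabling
    aggregation-style questions."""
    parent.children.extend(InfoNode(f) for f in facts)

# Hypothetical seed topic and expansions
root = InfoNode("Seed topic sampled from Google Trends")
depth_expand(root, "Core entity identified via one reasoning hop")
width_expand(root, ["Attribute A", "Attribute B", "Attribute C"])
```

    A "Deep & Wide" question then forces an agent to walk the chain (depth) and aggregate the siblings (width) in a single answer.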

  2. Evidence-Based Judgement

    Evaluating open-ended reports is inherently challenging. DR-Arena employs the Examiner as a Judge using a strict two-stage protocol:

    • Hard Constraints: Verifying against the generated Checklist-Depth (Logic) and Checklist-Width (Data Completeness). Critical errors result in immediate penalties.
    • Soft Constraints: Assessing user experience aspects such as presentation quality, formatting, and information density.

    Based on these constraints, the system assigns a tiered verdict (Much Better, Better, or Tie) and diagnoses the specific failure type of the losing agent to guide future rounds.

    Failure Tag  Diagnostic Criteria
    DEEP         Logic Failure. Failed to identify the correct core entity due to a broken reasoning chain.
    WIDE         Coverage Failure. Failed to aggregate specific attribute details (Data Gap).
    BOTH         Systemic Failure. Failed on both logical identification and factual completeness.
    NONE         Soft Gap. The loss was determined solely by soft filters (formatting or utility preferences).
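    The two-stage protocol can be sketched as follows. In DR-Arena the Examiner is an LLM; here simple keyword matching stands in for checklist grading, and the scoring thresholds and soft-constraint proxy are assumptions for illustration only:

```python
def judge(a, b, depth_items, width_items):
    """Sketch of a two-stage adjudication: hard constraints first
    (checklist hits for Depth/logic and Width/coverage), then soft
    constraints break ties. Returns (winner, verdict, failure_tag)."""
    def hits(report, items):
        return sum(item.lower() in report.lower() for item in items)

    da, wa = hits(a, depth_items), hits(a, width_items)
    db, wb = hits(b, depth_items), hits(b, width_items)
    gap = (da + wa) - (db + wb)

    if gap == 0:                                   # hard constraints tie
        if len(a) == len(b):                       # soft filters also tie
            return None, "Tie", None
        winner = "A" if len(a) > len(b) else "B"   # crude presentation proxy
        return winner, "Better", "NONE"            # loss by soft filters only

    winner = "A" if gap > 0 else "B"
    verdict = "Much Better" if abs(gap) >= 2 else "Better"
    # Diagnose the loser's failure type from per-checklist deficits
    d_gap = (da - db) if winner == "A" else (db - da)
    w_gap = (wa - wb) if winner == "A" else (wb - wa)
    if d_gap > 0 and w_gap > 0:
        tag = "BOTH"
    elif d_gap > 0:
        tag = "DEEP"
    elif w_gap > 0:
        tag = "WIDE"
    else:
        tag = "NONE"
    return winner, verdict, tag

print(judge("entity X alpha beta gamma", "alpha beta",
            ["entity X"], ["alpha", "beta", "gamma"]))
```

    The failure tag feeds directly into the next round's probing strategy, described below.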
  3. Adaptive Evolvement Loop

    After each round, the Examiner follows a targeted probing strategy. If agents reach a stalemate, the system intervenes to amplify differences:

    • High-Quality Tie: The task is too easy. The system triggers a Pressure Test, increasing both Depth ($D$) and Width ($W$) to locate the capability ceiling.
    • Marginal Win: The system aggressively targets the loser's specific weakness (probing either Depth or Width) to force a decisive breakdown.

    This adaptive mechanism ensures the system efficiently converges to a verdict by continuously pushing agents toward their specific breakdown points, acting as an efficient sorting algorithm for AI capabilities.

    Adjudication Verdict  Diagnostic Signal         Evolution Action               Strategic Rationale
    Tie (High Quality)    N/A                       Pressure Test (D ↑ 1 & W ↑ 1)  Current task too easy; find ceiling.
    Tie (Low Quality)     N/A                       Backtrack (move to parent)     Current task too hard; re-establish baseline.
    Winner Decided        DEEP (Logic Failure)      Probe Depth (D ↑ 1)            Challenge loser's reasoning capabilities.
    Winner Decided        WIDE (Retrieval Failure)  Probe Width (W ↑ 1)            Challenge loser's information coverage.
    Winner Decided        BOTH / NONE               Pressure Test (D ↑ 1 & W ↑ 1)  Ambiguous failure; increase difficulty.
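    The policy above is a simple lookup from (verdict, diagnostic signal) to an evolution action; a direct transcription might look like this (the actual tree mutation and backtracking mechanics are elided):

```python
def next_action(verdict, tag, depth, width):
    """Map an adjudication verdict and failure tag to the next round's
    evolution action and the updated tree dimensions (D, W)."""
    if verdict == "Tie (High Quality)":
        return "Pressure Test", depth + 1, width + 1   # find the ceiling
    if verdict == "Tie (Low Quality)":
        return "Backtrack", depth, width               # move to parent node
    # Winner decided: probe the loser's diagnosed weakness
    if tag == "DEEP":
        return "Probe Depth", depth + 1, width
    if tag == "WIDE":
        return "Probe Width", depth, width + 1
    return "Pressure Test", depth + 1, width + 1       # BOTH / NONE: ambiguous

print(next_action("Winner Decided", "WIDE", 2, 3))  # -> ('Probe Width', 2, 4)
```

    Because every non-backtracking branch strictly increases task difficulty along the loser's weak axis, repeated rounds drive the pair toward a decisive separation.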

Built at SUTD iNLP Lab · open-source & reproducible