GPT-5.1 vs o3
tree_0016 · Software Developers, Quality Assurance Analysts, and Testers : Occupational Outlook Handbook: : U.S. Bureau of Labor Statistics
Round Context
Software Developers, Quality Assurance Analysts, and Testers / Occupational Outlook Handbook: / U.S. Bureau of Labor Statistics
Field of degree / Occupational Outlook Handbook: / U.S. Bureau of Labor Statistics
Using the U.S. Bureau of Labor Statistics' Occupational Outlook Handbook projections for the 2024–2034 period, identify two distinct computer-related occupations that share a projected job outlook growth rate of exactly 15% ('Much faster than average'). These two occupations must be distinguished by their required work experience in a related occupation: one lists 'None' while the other lists '5 years or more'. For each of these two identified occupations, provide the 2024 median annual pay, the total projected numeric employment change over the decade, and the estimated average number of annual openings.
Answer length: 200-300 words.
Ground Truth Checklist
- Identify Entity 1: Software Developers, Quality Assurance Analysts, and Testers (matches 15% growth, 0 experience)
- Identify Entity 2: Computer and Information Systems Managers (matches 15% growth, 5+ years experience)
- Software Developers/QA: 2024 Median Pay is approx $131,450
- Software Developers/QA: Projected Employment Change is approx 287,900
- Software Developers/QA: Annual Openings is approx 129,200
- CIS Managers: 2024 Median Pay is approx $171,200
- CIS Managers: Projected Employment Change is approx 101,600
- CIS Managers: Annual Openings is approx 55,600
The question requires Deep reasoning to filter occupations within the BLS handbook based on a specific combination of shared attributes (15% growth) and contrasting attributes (experience level). It requires Wide aggregation to retrieve three distinct statistical data points (pay, change, openings) for both identified entities.
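The filter-then-aggregate structure described here can be illustrated with a minimal sketch. It assumes the checklist values above are loaded into a small in-memory table; the `Occupation` class, its field names, and the `find_pair` helper are hypothetical names introduced for illustration, not part of any BLS tooling or of the evaluation harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Occupation:
    # Hypothetical record layout; values are copied from the checklist above.
    title: str
    growth_pct: int          # projected 2024-2034 growth, in percent
    experience: str          # work experience in a related occupation
    median_pay: int          # 2024 median annual pay, USD
    employment_change: int   # projected numeric change over the decade
    annual_openings: int     # estimated average openings per year


CHECKLIST = [
    Occupation("Software Developers, Quality Assurance Analysts, and Testers",
               15, "None", 131_450, 287_900, 129_200),
    Occupation("Computer and Information Systems Managers",
               15, "5 years or more", 171_200, 101_600, 55_600),
]


def find_pair(occupations, growth_pct=15):
    """Deep step: keep occupations with the shared growth rate, then pair
    one 'None' experience entry with one '5 years or more' entry."""
    matches = [o for o in occupations if o.growth_pct == growth_pct]
    no_exp = [o for o in matches if o.experience == "None"]
    senior = [o for o in matches if o.experience == "5 years or more"]
    return [(a, b) for a in no_exp for b in senior]


pairs = find_pair(CHECKLIST)

# Wide step: report the three requested statistics for each entity.
for entry_level, experienced in pairs:
    for occ in (entry_level, experienced):
        print(occ.title, occ.median_pay, occ.employment_change, occ.annual_openings)
```

With only the two checklist rows loaded, the filter returns a single pair; a fuller table of occupations would be needed to reproduce the search the agents had to perform.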
Judgment
Both agents failed the 'Deep Logic' check: each missed the first entity in the Ground Truth Checklist (Software Developers) and substituted another occupation (Research Scientists for Agent A, Support Specialists for Agent B). Both appear to have hallucinated the 15% growth rate for these substitutes to fit the prompt's constraint.
Agent A is nevertheless the clear winner on Data Accuracy for the second entity (CIS Managers), which both agents correctly identified. Agent A's reported figures for CIS Managers (e.g., ~51,800 annual openings) were much closer to the Ground Truth Checklist (~55,600) than Agent B's (34,700 openings, a significant error). Furthermore, Agent A's substitute for the first entity (Research Scientists) is in reality a high-growth role, consistent with the 'Much faster than average' label, whereas Agent B's substitute (Support Specialists) is typically a moderate-growth role, which makes Agent B's 'Much faster than average' claim a qualitative hallucination. Agent B had better formatting, but its data inaccuracies make it the loser.
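The gap between the two agents on CIS Managers annual openings can be made concrete with a short relative-error calculation. This is a sketch using only the figures quoted in the judgment; the `relative_error` helper is a hypothetical name, and the judge is not known to have scored the answers this way.

```python
def relative_error(reported: float, truth: float) -> float:
    """Absolute deviation from the ground truth, as a fraction of the truth."""
    return abs(reported - truth) / truth


TRUTH_OPENINGS = 55_600  # checklist value for CIS Managers annual openings
reported = {"Agent A": 51_800, "Agent B": 34_700}

errors = {agent: relative_error(value, TRUTH_OPENINGS)
          for agent, value in reported.items()}

for agent, err in errors.items():
    print(f"{agent}: {err:.1%} off the checklist value")
```

Agent A's figure is off by roughly 7% and Agent B's by roughly 38%, which matches the judgment's description of Agent B's number as a significant error.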
GPT-5.1
OpenAI
o3
OpenAI