Agent harnesses meaningfully shape a model’s behavior, yet there are few quantitative evals of the code agent harnesses developers actually use. To our knowledge, we present the first.
Takeaways
- AI labs’ commercial code agents fall short of the >70% SWE-bench Verified accuracy the labs report for the underlying models.
- Claude Code with Claude Sonnet 4.5 achieves the best performance, but it’s expensive via the API.
- Anthropic’s subscription pricing for Claude models is far more generous.
- Codex CLI with GPT 5 is behind the frontier, but it strikes a balance between performance, cost, and speed.
- Cursor CLI with Grok Code Fast 1 lags in performance, but it offers an astoundingly low price.
- Cursor CLI lacks fine-grained trajectory logging at the time of publication, so cost breakdowns by instance difficulty and tool use analytics are omitted.
- Gemini CLI with Gemini 2.5 Pro has no advantage in performance, cost, or speed.
Leaderboard
| Code Agent | Accuracy | Cost / Task | Latency |
|---|---|---|---|
| Claude Code + Claude Sonnet 4.5 | 64.0% | $1.193 | 542.6s |
| Codex CLI + GPT 5 | 57.4% | $0.126 | 135.6s |
| Cursor CLI + Grok Code Fast 1 | 50.6% | $0.038 | 256.1s |
| Gemini CLI + Gemini 2.5 Pro | 42.4% | $0.403 | 187.4s |
Tasks are sourced from SWE-bench Verified, which tests agents on 500 GitHub issues from 12 open-source Python repositories: Astropy, Django, Matplotlib, seaborn, Flask, Requests, xarray, Pylint, pytest, scikit-learn, Sphinx, and SymPy.
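The dataset is publicly distributed, so the per-repository breakdown is easy to inspect. A minimal sketch, assuming the Hugging Face `datasets` library and the `princeton-nlp/SWE-bench_Verified` dataset name:

```python
from collections import Counter

from datasets import load_dataset  # pip install datasets

# SWE-bench Verified ships as a single "test" split of 500 instances.
ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")

# Each instance records its source repository and the GitHub issue text
# that the agent later receives as the PR description.
print(len(ds))                             # 500
print(Counter(ds["repo"]).most_common(5))  # django/django is the largest slice
print(ds[0]["instance_id"])
```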
Task Resolution by Code Agent
Cells are tasks for the selected code agent. Color encodes repository; solid = resolved, translucent = failed.
SWE-bench Verified Instance Distribution
| Time to fix | Number of issues |
|---|---|
| <15 minutes | 194 |
| 15 minutes to 1 hour | 261 |
| 1–4 hours | 42 |
| >4 hours | 3 |
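These buckets come from the difficulty annotations released with SWE-bench Verified. As a sanity check, the counts above can be reproduced with a short script; this assumes the Hugging Face dataset exposes the annotation as a `difficulty` column:

```python
from collections import Counter

from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")

# Assumption: each instance carries a `difficulty` label corresponding to one
# of the four time-to-fix buckets in the table above.
for bucket, count in Counter(ds["difficulty"]).most_common():
    print(f"{bucket}: {count}")
```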
Results
[Per-agent charts for Claude Code + Claude Sonnet 4.5, Codex CLI + GPT 5, Gemini CLI + Gemini 2.5 Pro, and Cursor CLI + Grok Code Fast 1.]
- No lab’s agentic programming tool completes any of the long-horizon (>4 hour) instances.
- Cursor CLI with Grok Code Fast 1 and Gemini CLI with Gemini 2.5 Pro spend dramatically more time on long-horizon tasks than on short- and medium-horizon tasks.
- Claude Code with Claude Sonnet 4.5 and Codex CLI with GPT 5 spend only marginally more time.
- Claude Code with Claude Sonnet 4.5 and Codex CLI with GPT 5 cost more on SWE-bench Verified’s 1–4 hour tasks than on its >4 hour tasks.
Tool Use Analysis
Claude Code with Claude Sonnet 4.5 Tool Mapping
| Action | Agent Tool |
|---|---|
| Read | Read |
| Write | Edit, MultiEdit |
| Execute | Bash |
| Search | Grep, Glob |
Codex CLI with GPT 5 Tool Mapping
| Action | Agent Shell Command |
|---|---|
| Read | sed, nl, cat, awk |
| Write | apply_patch, applypatch |
| Execute | python, pytest, cd |
| Search | rg, ls |
Gemini CLI with Gemini 2.5 Pro Tool Mapping
| Action | Agent Tool |
|---|---|
| Read | read_file, read_many_files |
| Write | write_file, replace |
| Execute | run_shell_command |
| Search | glob, google_web_search, list_directory, search_file_content |
- Claude Code with Claude Sonnet 4.5 is the most proactive code agent. It has an especially strong preference to run tests with execute tool calls.
10 Randomly Sampled Claude Code with Claude Sonnet 4.5 Execute Tool Calls
- `python -m pytest /testbed/lib/matplotlib/tests/test_pickle.py /testbed/lib/matplotlib/tests/test_legend.py /testbed/lib/matplotlib/tests/test_offsetbox.py -x -q 2>&1 | tail -30`
- `grep -n "def test_ordering" tests/invalid_models_tests/test_models.py | head -20`
- `grep -l "test_ordering" tests/invalid_models_tests/test_models.py`
- `python -m pytest /testbed/lib/matplotlib/tests/test_pickle.py -xvs`
- `python /testbed/test_final_verification.py`
- `python test_empty_name.py`
- `python /testbed/test_autodetector_integration.py`
- `python tests/runtests.py template_tests -v 2 2>&1 | grep -A 5 "FAILED\|FAIL:"`
- `./tests/runtests.py migrations.test_optimizer migrations.test_autodetector --parallel 1 -v0 2>&1 | tail -10`
- `python -m pytest /testbed/sklearn/feature_selection/tests/ -v -k "cv" 2>&1 | head -100`
- Codex CLI with GPT 5 executes far less code than Claude Code with Claude Sonnet 4.5.
- Gemini CLI with Gemini 2.5 Pro reads comparatively little of the codebase for context, but it writes about as much code as Claude Code with Claude Sonnet 4.5 and Codex CLI with GPT 5.
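Tool use statistics are computed by bucketing each trajectory’s tool calls into the four action categories using mappings like those in the tables above. A minimal sketch, assuming trajectories are available as simple lists of tool names (the log format itself is an assumption):

```python
from collections import Counter

# Mappings mirror the tables above; anything unmapped is counted as "Other".
CLAUDE_CODE_ACTIONS = {
    "Read": "Read",
    "Edit": "Write", "MultiEdit": "Write",
    "Bash": "Execute",
    "Grep": "Search", "Glob": "Search",
}

GEMINI_CLI_ACTIONS = {
    "read_file": "Read", "read_many_files": "Read",
    "write_file": "Write", "replace": "Write",
    "run_shell_command": "Execute",
    "glob": "Search", "google_web_search": "Search",
    "list_directory": "Search", "search_file_content": "Search",
}

def action_counts(tool_calls: list[str], mapping: dict[str, str]) -> Counter:
    """Bucket raw tool names from one trajectory into action categories."""
    return Counter(mapping.get(name, "Other") for name in tool_calls)

# Hypothetical trajectory, for illustration only:
print(action_counts(["Read", "Grep", "Bash", "Bash", "Edit"], CLAUDE_CODE_ACTIONS))
# Counter({'Execute': 2, 'Read': 1, 'Search': 1, 'Write': 1})
```

Codex CLI works through shell commands rather than named tools, so its trajectories first need the leading command (`sed`, `rg`, `apply_patch`, …) extracted before the same bucketing applies.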
Methodology
- Code agents run in SWE-bench Verified task-specific Docker containers with the user prompt below. We use the latest version of each code agent available at the time of publication.
User Prompt
<uploaded_files> {location} </uploaded_files>

I've uploaded a python code repository in the directory {location} (not in /tmp/inputs). Consider the following PR description:

<pr_description> {pr_description} </pr_description>

Can you help me implement the necessary changes to the repository so that the requirements specified in the <pr_description> are met? I've already taken care of all changes to any of the test files described in the <pr_description>. This means you DON'T have to modify the testing logic or any of the tests in any way! Your task is to make the minimal changes to non-tests files in the {location} directory to ensure the <pr_description> is satisfied.

Follow these steps to resolve the issue:
1. As a first step, it might be a good idea to explore the repo to familiarize yourself with its structure.
2. Create a script to reproduce the error and execute it with `python <filename.py>`, to confirm the error
3. Edit the sourcecode of the repo to resolve the issue
4. Rerun your reproduce script and confirm that the error is fixed!
5. Think about edgecases and make sure your fix handles them as well

Your thinking should be thorough and so it's fine if it's very long.

- Code agents execute in headless mode on a 32 vCPU, 64 GiB memory EC2 instance with a 2 hour timeout. Latency is the runtime of the command that invokes the code agent. Cost is computed from API pricing. (A sketch of this loop follows the list.)
- Accuracy (whether the task is resolved), latency, and cost are averaged across instances in the visualizations. One pass was made on each instance.
- Code agent performance might be inflated by data contamination, since SWE-bench Verified tasks are public GitHub issues.
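For concreteness, the measurement loop looks roughly like the sketch below. The container image name, the headless agent invocation, and the per-instance result fields are all assumptions for illustration; only the timing and averaging mirror the methodology above.

```python
# Illustrative harness sketch; not the exact implementation. The image name,
# agent command, and result fields are assumptions.
import statistics
import subprocess
import time

TIMEOUT_S = 2 * 60 * 60  # 2 hour per-instance timeout


def run_instance(image: str, prompt: str) -> dict:
    """Run one code agent headlessly in its task-specific container and time it."""
    # Stand-in headless invocation; the exact binary and flags differ per agent.
    cmd = ["docker", "run", "--rm", image, "claude", "-p", prompt]
    start = time.monotonic()
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=TIMEOUT_S)
    return {"latency_s": time.monotonic() - start, "exit_code": proc.returncode}


def summarize(results: list[dict]) -> dict:
    """Average accuracy, cost, and latency across instances (one pass each)."""
    return {
        "accuracy": statistics.mean(r["resolved"] for r in results),       # resolved: 0/1
        "cost_per_task": statistics.mean(r["cost_usd"] for r in results),  # from API pricing
        "latency_s": statistics.mean(r["latency_s"] for r in results),
    }
```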