Agent harnesses meaningfully shape a model’s behavior, yet there are few quantitative evals of the code agent harnesses developers actually use. To our knowledge, we present the first.
Takeaways
- AI labs’ commercial code agents fall short of the >70% SWE-bench Verified accuracy the labs report for the underlying models.
- Claude Code with Claude Sonnet 4.5 achieves the best performance, but it’s expensive via the API.
- Anthropic’s subscription pricing for Claude models is far more generous.
- Codex CLI with GPT 5 is behind the frontier, but it strikes a balance between performance, cost, and speed.
- Cursor CLI with Grok Code Fast 1 lags in performance, but it offers an astoundingly low price.
- Cursor CLI lacks fine-grained trajectory logging at the time of publication, so cost breakdowns by instance difficulty and tool use analytics are omitted.
- Gemini CLI with Gemini 2.5 Pro has no advantage in performance, cost, or speed.
Leaderboard
| Code Agent | Accuracy | Cost / Task | Latency |
|---|---|---|---|
| Claude Code + Claude Sonnet 4.5 | 64.0% | $1.193 | 542.6s |
| Codex CLI + GPT 5 | 57.4% | $0.126 | 135.6s |
| Cursor CLI + Grok Code Fast 1 | 50.6% | $0.038 | 256.1s |
| Gemini CLI + Gemini 2.5 Pro | 42.4% | $0.403 | 187.4s |
Tasks are sourced from SWE-bench Verified, which tests agents on 500 GitHub issues from 12 open-source Python repositories: Astropy, Django, Matplotlib, seaborn, Flask, Requests, xarray, Pylint, pytest, scikit-learn, Sphinx, and SymPy.
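The dataset is publicly distributed, so the per-repository breakdown is easy to inspect. A minimal sketch, assuming the Hugging Face `datasets` library and the `princeton-nlp/SWE-bench_Verified` dataset name:

```python
from collections import Counter

from datasets import load_dataset  # pip install datasets

# SWE-bench Verified ships as a single "test" split of 500 instances.
ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")

# Each instance records its source repository and the GitHub issue text
# that the agent later receives as the PR description.
print(len(ds))                             # 500
print(Counter(ds["repo"]).most_common(5))  # django/django is the largest slice
print(ds[0]["instance_id"])
```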
Task Resolution by Code Agent
Cells are tasks for the selected code agent. Color encodes repository; solid = resolved, translucent = failed.
SWE-bench Verified Instance Distribution
| Time to fix | Number of issues |
|---|---|
| <15 minutes | 194 |
| 15 minutes to 1 hour | 261 |
| 1–4 hours | 42 |
| >4 hours | 3 |
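These buckets come from the difficulty annotations released with SWE-bench Verified. As a sanity check, the counts above can be reproduced with a short script; this assumes the Hugging Face dataset exposes the annotation as a `difficulty` column:

```python
from collections import Counter

from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")

# Assumption: each instance carries a `difficulty` label corresponding to one
# of the four time-to-fix buckets in the table above.
for bucket, count in Counter(ds["difficulty"]).most_common():
    print(f"{bucket}: {count}")
```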
Results
[Per-agent charts for Claude Code + Claude Sonnet 4.5, Codex CLI + GPT 5, Gemini CLI + Gemini 2.5 Pro, and Cursor CLI + Grok Code Fast 1.]
- No lab’s agentic programming tool completes any of the long-horizon (>4 hour) instances.
- Cursor CLI with Grok Code Fast 1 and Gemini CLI with Gemini 2.5 Pro spend dramatically more time on long-horizon tasks than on short- and medium-horizon tasks.
- Claude Code with Claude Sonnet 4.5 and Codex CLI with GPT 5 spend only marginally more time.
- Claude Code with Claude Sonnet 4.5 and Codex CLI with GPT 5 cost more on SWE-bench Verified’s 1–4 hour tasks than on its >4 hour tasks.
Tool Use Analysis
Claude Code with Claude Sonnet 4.5 Tool Mapping
| Action | Agent Tool |
|---|---|
| Read | Read |
| Write | Edit, MultiEdit |
| Execute | Bash |
| Search | Grep, Glob |
Codex CLI with GPT 5 Tool Mapping
| Action | Agent Shell Command |
|---|---|
| Read | sed, nl, cat, awk |
| Write | apply_patch, applypatch |
| Execute | python, pytest, cd |
| Search | rg, ls |
Gemini CLI with Gemini 2.5 Pro Tool Mapping
| Action | Agent Tool |
|---|---|
| Read | read_file, read_many_files |
| Write | write_file, replace |
| Execute | run_shell_command |
| Search | glob, google_web_search, list_directory, search_file_content |
- Claude Code with Claude Sonnet 4.5 is the most proactive code agent. It has an especially strong preference to run tests with execute tool calls.
10 Randomly Sampled Claude Code with Claude Sonnet 4.5 Execute Tool Calls
- `python -m pytest /testbed/lib/matplotlib/tests/test_pickle.py /testbed/lib/matplotlib/tests/test_legend.py /testbed/lib/matplotlib/tests/test_offsetbox.py -x -q 2>&1 | tail -30`
- `grep -n "def test_ordering" tests/invalid_models_tests/test_models.py | head -20`
- `grep -l "test_ordering" tests/invalid_models_tests/test_models.py`
- `python -m pytest /testbed/lib/matplotlib/tests/test_pickle.py -xvs`
- `python /testbed/test_final_verification.py`
- `python test_empty_name.py`
- `python /testbed/test_autodetector_integration.py`
- `python tests/runtests.py template_tests -v 2 2>&1 | grep -A 5 "FAILED\|FAIL:"`
- `./tests/runtests.py migrations.test_optimizer migrations.test_autodetector --parallel 1 -v0 2>&1 | tail -10`
- `python -m pytest /testbed/sklearn/feature_selection/tests/ -v -k "cv" 2>&1 | head -100`
- Codex CLI with GPT 5 executes far less code than Claude Code with Claude Sonnet 4.5.
- Gemini CLI with Gemini 2.5 Pro reads comparatively little of the codebase for context, but it writes about as much code as Claude Code with Claude Sonnet 4.5 and Codex CLI with GPT 5.
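Tool use statistics are computed by bucketing each trajectory’s tool calls into the four action categories using mappings like those in the tables above. A minimal sketch, assuming trajectories are available as simple lists of tool names (the log format itself is an assumption):

```python
from collections import Counter

# Mappings mirror the tables above; anything unmapped is counted as "Other".
CLAUDE_CODE_ACTIONS = {
    "Read": "Read",
    "Edit": "Write", "MultiEdit": "Write",
    "Bash": "Execute",
    "Grep": "Search", "Glob": "Search",
}

GEMINI_CLI_ACTIONS = {
    "read_file": "Read", "read_many_files": "Read",
    "write_file": "Write", "replace": "Write",
    "run_shell_command": "Execute",
    "glob": "Search", "google_web_search": "Search",
    "list_directory": "Search", "search_file_content": "Search",
}

def action_counts(tool_calls: list[str], mapping: dict[str, str]) -> Counter:
    """Bucket raw tool names from one trajectory into action categories."""
    return Counter(mapping.get(name, "Other") for name in tool_calls)

# Hypothetical trajectory, for illustration only:
print(action_counts(["Read", "Grep", "Bash", "Bash", "Edit"], CLAUDE_CODE_ACTIONS))
# Counter({'Execute': 2, 'Read': 1, 'Search': 1, 'Write': 1})
```

Codex CLI works through shell commands rather than named tools, so its trajectories first need the leading command (`sed`, `rg`, `apply_patch`, …) extracted before the same bucketing applies.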
Methodology
- Code agents run in SWE-bench Verified task-specific Docker containers with the user prompt below. We use the latest version of each code agent available at the time of publication.
User Prompt
<uploaded_files> {location} </uploaded_files>

I've uploaded a python code repository in the directory {location} (not in /tmp/inputs). Consider the following PR description:

<pr_description> {pr_description} </pr_description>

Can you help me implement the necessary changes to the repository so that the requirements specified in the <pr_description> are met? I've already taken care of all changes to any of the test files described in the <pr_description>. This means you DON'T have to modify the testing logic or any of the tests in any way! Your task is to make the minimal changes to non-tests files in the {location} directory to ensure the <pr_description> is satisfied.

Follow these steps to resolve the issue:
1. As a first step, it might be a good idea to explore the repo to familiarize yourself with its structure.
2. Create a script to reproduce the error and execute it with `python <filename.py>`, to confirm the error
3. Edit the sourcecode of the repo to resolve the issue
4. Rerun your reproduce script and confirm that the error is fixed!
5. Think about edgecases and make sure your fix handles them as well

Your thinking should be thorough and so it's fine if it's very long.

- Code agents execute in headless mode on a 32 vCPU, 64 GiB memory EC2 instance with a 2 hour timeout. Latency is the runtime of the command that invokes the code agent. Cost is computed from API pricing. (A sketch of this loop follows the list.)
- Accuracy (whether the task is resolved), latency, and cost are averaged across instances in the visualizations. One pass was made on each instance.
- Code agent performance might be inflated by data contamination, since SWE-bench Verified tasks are public GitHub issues.
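For concreteness, the measurement loop looks roughly like the sketch below. The container image name, the headless agent invocation, and the per-instance result fields are all assumptions for illustration; only the timing and averaging mirror the methodology above.

```python
# Illustrative harness sketch; not the exact implementation. The image name,
# agent command, and result fields are assumptions.
import statistics
import subprocess
import time

TIMEOUT_S = 2 * 60 * 60  # 2 hour per-instance timeout


def run_instance(image: str, prompt: str) -> dict:
    """Run one code agent headlessly in its task-specific container and time it."""
    # Stand-in headless invocation; the exact binary and flags differ per agent.
    cmd = ["docker", "run", "--rm", image, "claude", "-p", prompt]
    start = time.monotonic()
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=TIMEOUT_S)
    return {"latency_s": time.monotonic() - start, "exit_code": proc.returncode}


def summarize(results: list[dict]) -> dict:
    """Average accuracy, cost, and latency across instances (one pass each)."""
    return {
        "accuracy": statistics.mean(r["resolved"] for r in results),       # resolved: 0/1
        "cost_per_task": statistics.mean(r["cost_usd"] for r in results),  # from API pricing
        "latency_s": statistics.mean(r["latency_s"] for r in results),
    }
```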