Hanabi Capital presents...

What’s the best code agent?

Evaluating popular agentic programming tools on 500 real-world GitHub issues.

10/6/2025

Agent harnesses meaningfully shape a model’s behavior, but there are few quantitative evals of the code agent harnesses used by developers. To our knowledge, we present the first such evaluation.

Takeaways

Leaderboard

Agent                              Accuracy    Cost / Task    Latency
Claude Code + Claude Sonnet 4.5    64.0%       $1.193         542.6s
Codex CLI + GPT 5                  57.4%       $0.126         135.6s
Cursor CLI + Grok Code Fast 1      50.6%       $0.038         256.1s
Gemini CLI + Gemini 2.5 Pro        42.4%       $0.403         187.4s
Difficulty

Tasks are sourced from SWE-bench Verified, which tests agents on 500 GitHub issues from 12 open-source Python repositories: Astropy, Django, Matplotlib, seaborn, Flask, Requests, xarray, Pylint, pytest, scikit-learn, Sphinx, and SymPy.
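
The task set itself is public; the short sketch below loads it from Hugging Face and counts instances per repository. The dataset id and field names follow the published SWE-bench Verified release, so treat them as assumptions to verify against the current version.

    from collections import Counter
    from datasets import load_dataset  # pip install datasets

    # SWE-bench Verified as published on Hugging Face (500 instances, test split).
    ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")

    # Each instance records its source repository (e.g. "django/django") and the
    # issue text that becomes the PR description in our user prompt.
    per_repo = Counter(row["repo"] for row in ds)
    print(f"{len(ds)} instances across {len(per_repo)} repositories")
    for repo, count in per_repo.most_common():
        print(f"{repo}: {count}")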

Task Resolution by Code Agent

Interactive figure: each cell is one task for the selected code agent. Color encodes repository; solid = resolved, translucent = failed.
SWE-bench Verified Instance Distribution

Time to fix             Number of issues
<15 minutes             194
15 minutes to 1 hour    261
1–4 hours               42
>4 hours                3

Results

  • No lab’s agentic programming tool is able to complete long-horizon (>4 hour) instances.
  • Cursor CLI with Grok Code Fast 1 and Gemini CLI with Gemini 2.5 Pro spend dramatically more time on long-horizon tasks than on short- and medium-horizon tasks.
    • Claude Code with Claude Sonnet 4.5 and Codex CLI with GPT 5 spend just marginally more time.
  • Claude Code with Claude Sonnet 4.5 and Codex CLI with GPT 5 cost more to solve SWE-bench Verified’s 1–4 hour tasks than its >4 hour tasks (per-bucket averages are computed as in the sketch below).
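
As a rough sketch of how those per-bucket figures are derived, the snippet below groups per-instance results by difficulty bucket and averages latency and cost. The results table and its column names (agent, difficulty, resolved, latency_s, cost_usd) are hypothetical stand-ins for the harness's actual output; the difficulty labels come from the SWE-bench Verified time-to-fix annotations.

    import pandas as pd

    # Hypothetical per-instance results: one row per (agent, instance) run.
    # "difficulty" is the SWE-bench Verified time-to-fix bucket for the instance.
    results = pd.DataFrame([
        {"agent": "Claude Code + Claude Sonnet 4.5", "difficulty": "1-4 hours",
         "resolved": True, "latency_s": 612.0, "cost_usd": 1.42},
        {"agent": "Codex CLI + GPT 5", "difficulty": "<15 minutes",
         "resolved": True, "latency_s": 98.0, "cost_usd": 0.09},
        # ...one row per run...
    ])

    # Resolve rate plus mean latency and cost, per agent and difficulty bucket.
    by_bucket = (results
                 .groupby(["agent", "difficulty"])
                 .agg(resolve_rate=("resolved", "mean"),
                      mean_latency_s=("latency_s", "mean"),
                      mean_cost_usd=("cost_usd", "mean")))
    print(by_bucket)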

Tool Use Analysis

Claude Code with Claude Sonnet 4.5 Tool Mapping

Action     Agent Tool
Read       Read
Write      Edit, MultiEdit
Execute    Bash
Search     Grep, Glob

Codex CLI with GPT 5 Tool Mapping

Action     Agent Shell Command
Read       sed, nl, cat, awk
Write      apply_patch, applypatch
Execute    python, pytest, cd
Search     rg, ls

Gemini CLI with Gemini 2.5 Pro Tool Mapping

Action     Agent Tool
Read       read_file, read_many_files
Write      write_file, replace
Execute    run_shell_command
Search     glob, google_web_search, list_directory, search_file_content
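
To quantify the patterns discussed below, each transcript's tool calls can be bucketed into these four actions and counted. The sketch mirrors the mapping tables above; the transcript format (a flat list of tool-call names) and the classification of Codex CLI shell commands by their leading executable are assumptions about how the logs are parsed.

    from collections import Counter

    # Tool-name → action mappings, mirroring the tables above. Codex CLI issues
    # raw shell commands, so its calls are classified by the first token instead.
    TOOL_TO_ACTION = {
        "claude-code": {"read": "Read", "edit": "Write", "multiedit": "Write",
                        "bash": "Execute", "grep": "Search", "glob": "Search"},
        "codex-cli": {"sed": "Read", "nl": "Read", "cat": "Read", "awk": "Read",
                      "apply_patch": "Write", "applypatch": "Write",
                      "python": "Execute", "pytest": "Execute", "cd": "Execute",
                      "rg": "Search", "ls": "Search"},
        "gemini-cli": {"read_file": "Read", "read_many_files": "Read",
                       "write_file": "Write", "replace": "Write",
                       "run_shell_command": "Execute", "glob": "Search",
                       "google_web_search": "Search", "list_directory": "Search",
                       "search_file_content": "Search"},
    }

    def count_actions(agent: str, tool_calls: list[str]) -> Counter:
        """Bucket a transcript's tool-call names into Read/Write/Execute/Search."""
        mapping = TOOL_TO_ACTION[agent]
        return Counter(mapping.get(name.split()[0].lower(), "Other")
                       for name in tool_calls)

    # Hypothetical transcript excerpt from one Claude Code run.
    print(count_actions("claude-code", ["Read", "Grep", "Bash", "Edit", "Bash"]))
    # e.g. Counter({'Execute': 2, 'Read': 1, 'Search': 1, 'Write': 1})
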
  • Claude Code with Claude Sonnet 4.5 is the most proactive code agent, with an especially strong preference for running tests via execute tool calls.
    10 Randomly Sampled Claude Code with Claude Sonnet 4.5 Execute Tool Calls
    • python -m pytest /testbed/lib/matplotlib/tests/test_pickle.py /testbed/lib/matplotlib/tests/test_legend.py /testbed/lib/matplotlib/tests/test_offsetbox.py -x -q 2>&1 | tail -30
    • grep -n "def test_ordering" tests/invalid_models_tests/test_models.py | head -20
    • grep -l "test_ordering" tests/invalid_models_tests/test_models.py
    • python -m pytest /testbed/lib/matplotlib/tests/test_pickle.py -xvs
    • python /testbed/test_final_verification.py
    • python test_empty_name.py
    • python /testbed/test_autodetector_integration.py
    • python tests/runtests.py template_tests -v 2 2>&1 | grep -A 5 "FAILED\|FAIL:"
    • ./tests/runtests.py migrations.test_optimizer migrations.test_autodetector --parallel 1 -v0 2>&1 | tail -10
    • python -m pytest /testbed/sklearn/feature_selection/tests/ -v -k "cv" 2>&1 | head -100
  • Codex CLI with GPT 5 executes far less code than Claude Code with Claude Sonnet 4.5.
  • Gemini CLI with Gemini 2.5 Pro pulls in little context via reading the codebase, but it writes about the same amount of code as Claude Code with Claude Sonnet 4.5 and Codex CLI with GPT 5.

Methodology

  • Code agents run in SWE-bench Verified task-specific Docker containers with our user prompt (a sketch of the full run loop follows this list). The latest version of each code agent at the time of publication is used.
    User Prompt

    <uploaded_files>
    {location}
    </uploaded_files>

    I've uploaded a python code repository in the directory {location} (not in /tmp/inputs). Consider the following PR description:

    <pr_description>
    {pr_description}
    </pr_description>

    Can you help me implement the necessary changes to the repository so that the requirements specified in the <pr_description> are met?

    I've already taken care of all changes to any of the test files described in the <pr_description>. This means you DON'T have to modify the testing logic or any of the tests in any way!

    Your task is to make the minimal changes to non-tests files in the {location} directory to ensure the <pr_description> is satisfied.

    Follow these steps to resolve the issue:

    1. As a first step, it might be a good idea to explore the repo to familiarize yourself with its structure.

    2. Create a script to reproduce the error and execute it with `python <filename.py>`, to confirm the error

    3. Edit the sourcecode of the repo to resolve the issue

    4. Rerun your reproduce script and confirm that the error is fixed!

    5. Think about edgecases and make sure your fix handles them as well

    Your thinking should be thorough and so it's fine if it's very long.
    
  • Code agents execute in headless mode on a 32 vCPU, 64 GiB memory EC2 instance with a 2-hour timeout. Latency is the wall-clock runtime of the command that invokes the code agent. Cost is computed from API pricing.
  • Accuracy (whether the task is resolved), latency, and cost are averaged across instances in the visualizations. One pass was made on each instance.
  • Code agent performance might be inflated by data contamination, since SWE-bench Verified tasks are public GitHub issues.
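
A minimal sketch of the run loop for a single instance, under stated assumptions: the container name, the /testbed workspace path, and the headless invocation (shown here with Claude Code's -p print flag) stand in for the actual harness, and the other agents would be swapped in with their own headless commands. The prompt template is abbreviated; the full text is quoted in the Methodology bullet above.

    import subprocess
    import time

    # Abbreviated copy of the user prompt quoted above; {location} and
    # {pr_description} are filled per SWE-bench Verified instance.
    PROMPT_TEMPLATE = """<uploaded_files>
    {location}
    </uploaded_files>

    I've uploaded a python code repository in the directory {location} (not in /tmp/inputs). Consider the following PR description:

    <pr_description>
    {pr_description}
    </pr_description>

    ...rest of the user prompt quoted above...
    """

    def run_instance(container: str, location: str, pr_description: str,
                     timeout_s: int = 7200) -> dict:
        """Run one code agent headlessly inside the task container and time it."""
        prompt = PROMPT_TEMPLATE.format(location=location,
                                        pr_description=pr_description)
        # Assumed invocation: Claude Code's headless print mode via docker exec.
        cmd = ["docker", "exec", container, "claude", "-p", prompt]
        start = time.monotonic()
        proc = subprocess.run(cmd, capture_output=True, text=True,
                              timeout=timeout_s)  # 2-hour timeout, as above
        return {"latency_s": time.monotonic() - start,
                "exit_code": proc.returncode,
                "stdout": proc.stdout}

    # Illustrative call; the container name is hypothetical, and /testbed matches
    # the paths seen in the sampled execute tool calls.
    # result = run_instance("swebench-django__django-11099", "/testbed",
    #                       pr_description="<issue text from the dataset>")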