Hanabi Capital presents...

What’s the best code agent?

Evaluating popular agentic programming tools on SWE-bench Verified.

Updated: 9/26/2025

Leaderboard

Agent + Model                    Accuracy    Cost / Task    Latency
Claude Code + Claude Sonnet 4    58.8%       $0.837         473.1s
Codex CLI + GPT 5                57.4%       $0.126         135.6s
Cursor CLI + Grok Code Fast 1    50.6%       $0.038         256.1s
Gemini CLI + Gemini 2.5 Pro      42.4%       $0.403         187.4s
Key takeaways:
  • Frontier labs’ commercial code agents underperform the >70% SWE-bench Verified accuracy their models self-report.
  • Claude Code with Claude Sonnet 4 achieves the best performance by a hair’s breadth, but it is far more expensive than its alternatives (see the cost sketch after this list).
  • Codex CLI with GPT 5 offers frontier performance at a fraction of Claude’s cost while posting the fastest times across the board.
  • Cursor CLI with Grok Code Fast 1 is a respectable entrant: it achieves solid performance and latency at an astoundingly low price.
    • Cursor CLI lacked fine-grained trajectory logging at the time of publication, so cost breakdowns by instance difficulty and tool use analytics are omitted for it.
  • Gemini CLI with Gemini 2.5 Pro lags the field, offering no advantage in performance, price, or speed.
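Scaled to the full 500-instance benchmark, the per-task costs above compound quickly. A back-of-the-envelope sketch using only the leaderboard numbers (cost per resolved task divides average cost by accuracy):

    # Full-run economics from the leaderboard: 500 instances per pass.
    agents = {
        "Claude Code + Claude Sonnet 4": (0.588, 0.837),
        "Codex CLI + GPT 5": (0.574, 0.126),
        "Cursor CLI + Grok Code Fast 1": (0.506, 0.038),
        "Gemini CLI + Gemini 2.5 Pro": (0.424, 0.403),
    }
    for name, (accuracy, cost_per_task) in agents.items():
        full_run = 500 * cost_per_task           # cost of one benchmark pass
        per_resolved = cost_per_task / accuracy  # dollars per resolved task
        print(f"{name}: ${full_run:.2f}/run, ${per_resolved:.3f}/resolved task")

A full Claude Code pass costs roughly $418 versus about $63 for Codex CLI, a ~6.6x gap for a 1.4-point accuracy difference.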

Instance Resolution by Code Agent

[Interactive grid: each cell is a task for the selected code agent. Color encodes repository; solid = resolved, translucent = failed.]

SWE-bench Verified tests LLMs on 500 GitHub issues sourced from 12 open-source Python repositories: Astropy, Django, Matplotlib, seaborn, Flask, Requests, xarray, Pylint, pytest, scikit-learn, Sphinx, and SymPy.

SWE-bench Verified Instance Distribution

Time to fix             Number of issues
<15 minutes             194
15 minutes to 1 hour    261
1–4 hours               42
>4 hours                3
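Both the repository spread and the distribution above can be checked against the public dataset. A minimal sketch, assuming the Hugging Face datasets package and that the princeton-nlp/SWE-bench_Verified release carries repo and difficulty columns:

    from collections import Counter
    from datasets import load_dataset

    # Load the 500-instance verified split from Hugging Face.
    ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
    print(len(ds))                    # 500
    print(Counter(ds["repo"]))        # issues per source repository
    print(Counter(ds["difficulty"]))  # human-annotated time-to-fix buckets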

Results

[Per-agent charts for Claude Code + Claude Sonnet 4, Codex CLI + GPT 5, Gemini CLI + Gemini 2.5 Pro, and Cursor CLI + Grok Code Fast 1.]
  • Claude Code with Claude Sonnet 4 is the only agentic programming tool that completes a long-horizon (>4 hour) instance.
  • Claude Code with Claude Sonnet 4 excels at short-horizon (under 1 hour) tasks, while Codex CLI with GPT 5 excels at medium-horizon (1–4 hour) tasks.
  • Cursor CLI with Grok Code Fast 1 and Gemini CLI with Gemini 2.5 Pro spend disproportionately more time on long-horizon tasks than on short- and medium-horizon tasks.
    • Claude Code with Claude Sonnet 4 and Codex CLI with GPT 5 spend only marginally more time.
  • Claude Code with Claude Sonnet 4 and Codex CLI with GPT 5 cost more on SWE-bench Verified’s 1–4 hour tasks than on its >4 hour tasks (per-bucket averaging is sketched after this list).
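The per-bucket figures above are plain means over the instances in each time-to-fix bucket. A minimal aggregation sketch, assuming per-instance results were logged to a CSV with hypothetical columns difficulty, resolved, latency_s, and cost_usd:

    import pandas as pd

    # Hypothetical per-instance results log; column names are illustrative.
    df = pd.read_csv("results.csv")
    print(df.groupby("difficulty").agg(
        accuracy=("resolved", "mean"),      # share of instances resolved
        latency_s=("latency_s", "mean"),    # mean runtime per bucket
        cost_usd=("cost_usd", "mean"),      # mean cost per bucket
    ))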

Tool Use Analysis

Claude Code with Claude Sonnet 4 Tool Mapping

Action     Agent Tool
Read       Read
Write      Edit, MultiEdit
Execute    Bash
Search     Grep, Glob

Codex CLI with GPT 5 Tool Mapping

Action     Agent Shell Command
Read       sed, nl, cat, awk
Write      apply_patch, applypatch
Execute    python, pytest, cd
Search     rg, ls

Gemini CLI with Gemini 2.5 Pro Tool Mapping

Action     Agent Tool
Read       read_file, read_many_files
Write      write_file, replace
Execute    run_shell_command
Search     glob, google_web_search, list_directory, search_file_content
  • Claude Code with Claude Sonnet 4 is the most proactive code agent. It has an especially strong preference towards running scripts and test cases.
  • Codex CLI with GPT 5 searches and reads the codebase about as much as Claude Code with Claude Sonnet 4 does, but it writes and executes less code.
  • Gemini CLI with Gemini 2.5 Pro writes and executes code roughly as often as Codex CLI with GPT 5 does, but it pulls in much less context via searching and reading.
  • Claude Code with Claude Sonnet 4 and Codex CLI with GPT 5 make fewer tool calls on SWE-bench Verified’s >4 hour tasks than on its 1–4 hour tasks.
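The four action categories come straight from the mapping tables above. A sketch of how raw trajectory tool calls can be bucketed (Claude Code’s mapping shown; the tool-call list is a hypothetical log extract):

    from collections import Counter

    # Claude Code tool -> action, flattened from the mapping table above.
    ACTION = {
        "Read": "Read",
        "Edit": "Write", "MultiEdit": "Write",
        "Bash": "Execute",
        "Grep": "Search", "Glob": "Search",
    }

    def action_counts(tool_calls):
        """Bucket raw tool-call names from a trajectory into actions."""
        return Counter(ACTION.get(name, "Other") for name in tool_calls)

    print(action_counts(["Read", "Grep", "Edit", "Bash", "Bash"]))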

Methodology

  • Code agents run in SWE-bench Verified task-specific Docker containers with the user prompt below. The latest version of each code agent at the time of publication is used.
    User Prompt

    <uploaded_files>
    {location}
    </uploaded_files>

    I've uploaded a python code repository in the directory {location} (not in /tmp/inputs). Consider the following PR description:

    <pr_description>
    {pr_description}
    </pr_description>

    Can you help me implement the necessary changes to the repository so that the requirements specified in the <pr_description> are met?

    I've already taken care of all changes to any of the test files described in the <pr_description>. This means you DON'T have to modify the testing logic or any of the tests in any way!

    Your task is to make the minimal changes to non-tests files in the {location} directory to ensure the <pr_description> is satisfied.

    Follow these steps to resolve the issue:
    1. As a first step, it might be a good idea to explore the repo to familiarize yourself with its structure.
    2. Create a script to reproduce the error and execute it with `python <filename.py>`, to confirm the error
    3. Edit the sourcecode of the repo to resolve the issue
    4. Rerun your reproduce script and confirm that the error is fixed!
    5. Think about edgecases and make sure your fix handles them as well

    Your thinking should be thorough and so it's fine if it's very long.
  • Code agents execute in headless mode on a 32 vCPU, 64 GiB memory EC2 instance with a 2 hour timeout (a minimal invocation is sketched after this list). Latency is the runtime of the command that invokes the code agent.
  • Accuracy (whether the task is resolved), latency, and cost are averaged across instances in visualizations. One pass was made on each instance.
  • Code agent performance might be inflated by data contamination, since SWE-bench Verified tasks are public GitHub issues.
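To make the harness concrete, here is a minimal sketch of a single run, assuming Claude Code’s headless print mode (claude -p); the other agents were driven through their analogous non-interactive entry points:

    import subprocess
    import time

    prompt = open("user_prompt.txt").read()  # the user prompt shown above

    start = time.monotonic()
    # Headless invocation inside the task's Docker container. Latency is the
    # runtime of this command; runs are capped at the 2 hour timeout.
    result = subprocess.run(
        ["claude", "-p", prompt],
        capture_output=True, text=True, timeout=2 * 60 * 60,
    )
    latency_s = time.monotonic() - start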