Hacker News

Senior SWE-Bench: open-source benchmark that assesses agents as senior engineers

Senior SWE-Bench

We treat agents like senior engineers, so why evaluate them like junior engineers?

Senior engineers build features without over-specified requirements

Senior SWE-Bench feature tasks have realistic instructions that read like natural language messages rather than over-specified requirements. To reliably evaluate these tasks, we introduce a validation agent which uses expert-designed recipes to write behavioral tests that adapt to submitted solutions.

Senior engineers solve bugs that require runtime investigation from behavioral reports

Senior SWE-Bench bug tasks reflect tricky user reports and focus on investigation, from starting services to debugging subtle runtime issues. They are sourced from PRs that needed significant runtime investigation to solve (e.g. logs, profiling data, reproduction steps).

Senior engineers ship the right code without being told to

Senior SWE-Bench scores solutions on taste as well as correctness, combining runtime correctness tests with several quality metrics based on observed codebase practices. In addition, verifiers and validation can test for load-bearing codebase practices that the instructions leave unstated.
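One way to picture combining correctness with quality metrics is a gate: a submission counts as solved only if every runtime test passes and every quality metric clears a cutoff. The sketch below is purely illustrative. The metric names, the thresholds, and the `Submission` and `is_tasteful_solve` names are all assumptions, since the text does not say which metrics are used or how they are combined.

```python
from dataclasses import dataclass


@dataclass
class Submission:
    tests_passed: int
    tests_total: int
    quality: dict[str, float]  # metric name -> score in [0, 1]


# Hypothetical metrics and cutoffs; the real ones are not given in the text.
QUALITY_THRESHOLDS = {"follows_conventions": 0.8, "diff_minimality": 0.6}


def is_tasteful_solve(sub: Submission, thresholds=QUALITY_THRESHOLDS) -> bool:
    """Assumed rule: all tests pass AND every quality metric meets its cutoff."""
    correct = sub.tests_total > 0 and sub.tests_passed == sub.tests_total
    meets_quality = all(
        sub.quality.get(name, 0.0) >= cutoff
        for name, cutoff in thresholds.items()
    )
    return correct and meets_quality


good = Submission(5, 5, {"follows_conventions": 0.9, "diff_minimality": 0.7})
sloppy = Submission(5, 5, {"follows_conventions": 0.4, "diff_minimality": 0.7})
print(is_tasteful_solve(good), is_tasteful_solve(sloppy))
```

Under this assumed rule, a solution that passes every test but ignores codebase conventions is still rejected. A weighted score would be an equally plausible reading of the text.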

Leaderboard

#   Model               Effort    Solve rate (pass@1)
1   Claude Opus 4.8     max       24.0%
-   Claude Sonnet 5     max       19.4%
2   GPT-5.5             xhigh     16.0%
3   Claude Opus 4.7     max       14.1%
4   GPT-5.4             xhigh     14.0%
5   GLM-5.2             max       12.5%
6   Kimi K2.6           default   8.2%
7   Claude Sonnet 4.6   high      8.2%
8   Gemini 3.1 Pro      high      6.1%
9   Gemini 3.5 Flash    medium    3.0%

Even the top-performing frontier models fail to complete tasks with senior-level correctness and taste more than 75% of the time: the best pass@1 solve rate on the leaderboard is 24.0%.
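The arithmetic behind this claim can be checked against the leaderboard. This is a minimal sketch, assuming pass@1 is the share of tasks solved on a single attempt, so the failure rate is its complement. The dictionary copies three leaderboard rows, and the `failure_rate` helper is a hypothetical name used only for illustration.

```python
# Solve rates (pass@1, in percent) copied from the leaderboard.
solve_rates = {
    "Claude Opus 4.8": 24.0,
    "GPT-5.5": 16.0,
    "Claude Opus 4.7": 14.1,
}


def failure_rate(solve_rate_pct: float) -> float:
    """Percentage of tasks not solved, given pass@1 in percent."""
    return round(100.0 - solve_rate_pct, 1)


# Pick the model with the highest solve rate and compute its failure rate.
best_model = max(solve_rates, key=solve_rates.get)
best_failure = failure_rate(solve_rates[best_model])
print(f"{best_model} fails {best_failure}% of tasks")
```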

Tasks

Senior SWE-Bench tasks are sourced from PRs authored by engineers with hundreds of commits in their respective repos, which range from libraries to multi-service applications. We focus on multi-phase, multi-stack feature PRs and on bug/performance PRs that required significant runtime investigation. For more on task design, read the blog post.

  • More naturally under-specified instructions – Senior SWE-Bench tasks reflect natural communication with agents, with a median instruction length that is 31% of SWE-Bench Pro's.
  • More diverse task scope – Senior SWE-Bench feature tasks can span multiple services, with an average of 11 files touched per feature task.
  • Longer task horizon – Senior SWE-Bench tasks are designed to be long-horizon, requiring hundreds of steps for even the strongest agents.
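The instruction-length comparison is a ratio of medians. The sketch below shows how such a ratio could be computed. The word counts are invented toy values, not benchmark data, and `median_length_ratio` is a hypothetical helper; how the benchmark actually measures length (words, characters, or tokens) is not stated here.

```python
from statistics import median

# Toy instruction lengths in words; NOT real benchmark data.
senior_lengths = [30, 50, 45, 70, 90]
baseline_lengths = [150, 210, 200, 120, 260]


def median_length_ratio(ours: list[int], baseline: list[int]) -> float:
    """Median instruction length of `ours` as a fraction of `baseline`'s median."""
    return median(ours) / median(baseline)


ratio = median_length_ratio(senior_lengths, baseline_lengths)
print(f"median instruction length is {ratio:.0%} of the baseline's")
```

Using the median rather than the mean keeps a few unusually long instructions from dominating the comparison.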

Reference-solution SLOC & files are measured identically across all three benchmarks. Instruction length excludes harness boilerplate. Token and step counts for other benchmarks are based on their self-reported metrics.
