Methodology

hibench is an open-source benchmark for the hidden default cost of coding agents. It answers one question: how much context does an agent load before it starts useful work?

Benchmark data last updated: June 24, 2026 (UTC)

Capture pipeline

  1. Isolate: Run each agent in Docker inside a fresh, empty Git repo.

  2. Intercept: Point the agent at a local recorder configured with a dummy API key, so no request ever reaches an upstream model.

  3. Capture: Send the prompt "Hi" and record the body of the first outbound request.

  4. Count: Tokenize every field of the captured body with the o200k_base encoding and break the total down by component.
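The Intercept and Capture steps can be sketched with nothing but Python's standard library. This is a minimal illustration, not hibench's recorder: the `Recorder` class, its stub 503 reply, and the assumption that the agent sends a `Content-Length` header (rather than a chunked body) are all hypothetical choices made for the example. The Isolate step (Docker, empty Git repo) is not covered here.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from typing import Optional


class Recorder:
    """Hypothetical local recorder: keeps the first request body and never forwards it."""

    def __init__(self, host: str = "127.0.0.1", port: int = 0) -> None:
        self.first_body: Optional[bytes] = None
        self._lock = threading.Lock()
        recorder = self  # captured by the handler class below

        class Handler(BaseHTTPRequestHandler):
            def do_POST(self) -> None:
                # Assumes the client sends Content-Length (no chunked bodies).
                length = int(self.headers.get("Content-Length", 0))
                raw = self.rfile.read(length)
                with recorder._lock:
                    if recorder.first_body is None:  # only the first request counts
                        recorder.first_body = raw
                # Stub error reply (an assumption): nothing is forwarded upstream.
                payload = b'{"error": {"message": "captured by local recorder"}}'
                self.send_response(503)
                self.send_header("Content-Type", "application/json")
                self.send_header("Content-Length", str(len(payload)))
                self.end_headers()
                self.wfile.write(payload)

            def log_message(self, format: str, *args) -> None:
                pass  # keep stderr quiet

        # Port 0 asks the OS for any free port; the chosen one is read back below.
        self.server = ThreadingHTTPServer((host, port), Handler)
        self.url = f"http://{host}:{self.server.server_address[1]}"

    def start(self) -> None:
        threading.Thread(target=self.server.serve_forever, daemon=True).start()

    def stop(self) -> None:
        self.server.shutdown()
        self.server.server_close()
```

In a real run, the agent would be pointed at `rec.url` through whatever base-URL setting it exposes, and `rec.first_body` would feed the Count step. Binding to 127.0.0.1 assumes the agent shares the recorder's network namespace; an agent inside a container would need the recorder reachable from within it.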

What gets measured

  • Total tokens in the primary request body
  • System and default developer context
  • Declared tools and tool-definition cost
  • Bundled skills and skill-definition cost
  • MCP and sub-agent declarations
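A per-field breakdown like the one listed above might be computed along these lines. This is a sketch under stated assumptions, not hibench's parser: the field layout (a top-level `system` or `instructions` field, a `messages` list with `system`/`developer` roles, and a `tools` list) is a simplified Anthropic/OpenAI-style shape, and treating an `mcp__` name prefix as the MCP marker is a hypothetical convention. Skills and sub-agent declarations are left out because where they appear varies by agent. The token counter is passed in as a function so the sketch runs without a tokenizer installed.

```python
import json
from typing import Any, Callable, Dict

TokenCounter = Callable[[str], int]


def as_text(value: Any) -> str:
    """Strings are counted as-is; any other JSON value is counted as compact JSON."""
    if isinstance(value, str):
        return value
    return json.dumps(value, separators=(",", ":"), ensure_ascii=False)


def breakdown(body: Dict[str, Any], count: TokenCounter) -> Dict[str, int]:
    """Split a captured request body into token buckets (hypothetical field layout)."""
    result = {"total": count(as_text(body)), "system": 0, "messages": 0,
              "tools": 0, "mcp_tools": 0}
    # Top-level prompt fields: "system" (Anthropic-style) or "instructions" (OpenAI-style).
    for key in ("system", "instructions"):
        if key in body:
            result["system"] += count(as_text(body[key]))
    # System/developer messages count as default context; everything else as messages.
    for msg in body.get("messages", []):
        bucket = "system" if msg.get("role") in ("system", "developer") else "messages"
        result[bucket] += count(as_text(msg.get("content", "")))
    # Each tool definition is counted whole; the "mcp__" prefix is an assumed MCP marker.
    for tool in body.get("tools", []):
        name = tool.get("name") or tool.get("function", {}).get("name", "")
        bucket = "mcp_tools" if name.startswith("mcp__") else "tools"
        result[bucket] += count(as_text(tool))
    return result
```

To count with o200k_base, one plausible counter uses the `tiktoken` package: with `enc = tiktoken.get_encoding("o200k_base")`, pass `lambda s: len(enc.encode(s, disallowed_special=()))` as `count`. Note that `total` here tokenizes the whole serialized body, JSON punctuation included, so it will not equal the sum of the buckets.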

Current coverage

  • 16 agents benchmarked
  • 1,050 captured versions

Reproducibility

Every agent has a pinned Docker image, version catalog, and one canonical capture per release. Source, parsers, and export tables live on GitHub.


Benchmarked agents