Benchmarks · 2026-04-17

We tested it, so you don't have to pretend.

Three Claude models (Haiku 4.5, Sonnet 4.6, Opus 4.7). Two configs per model (baseline, +Hydrate). Seven sessions per run — the same prompts, in the same order, on a fresh workspace each time. No human in the loop. Real Anthropic API spend, summed below: ~$67.

3 models · 2 configs · 7 sessions/cell · 60+ subprocess invocations · $67 total spend

How we ran it

A Go harness (hydrate-benchmark) spawns one claude --print subprocess per session, captures the stream-json output, and tallies cost/tokens/turns from the terminal result event. Each cell gets its own workspace directory. Baseline cells have Hydrate env vars unset (the hooks early-exit on empty user). Hydrate cells use a per-cell SESHELL_BENCH_PROJECT slug so facts stay isolated across runs.

The seven prompts — Scenario A (landing page for "Pulse")

  1. Static marketing page for a fictional product "Pulse", plain HTML/CSS, hero + 3 features + footer
  2. Pricing table: Free / Pro ($9/mo) / Team ($29/mo), three features each
  3. /docs page with sidebar + four doc sections
  4. Convert the whole site to Astro, nav + footer as components
  5. Dark-mode toggle with CSS variables, respecting prefers-color-scheme
  6. Fix broken internal links; hero CTA → /#pricing
  7. Playwright e2e: nav renders, CTA target, pricing tiers, /docs sidebar

Scenario B — "build the Hydrate SaaS itself"

Seven sessions building Next.js 16 + Tailwind 4 + real Stripe sandbox integration. Opus-only, because Opus has the tool-use budget to handle the complexity. This is the scenario where the "refusal-to-work" mechanism shows up most clearly.

The numbers

| Model | Config | Sessions shipped | Cost (USD) | Output tokens | Behaviour |
|---|---|---|---|---|---|
| Haiku 4.5 | baseline | 2 / 7 | $0.60 | 23,878 | Refused to build in 5 of 7 sessions. Asked "what are the four doc sections?" instead of listing them. |
| Haiku 4.5 | +Hydrate | 7 / 7 | $1.08 | 43,497 | Executed every prompt. Cost up 79%, features up 82%. |
| Sonnet 4.6 | baseline | 7 / 7 | $3.13 | — | Clean baseline. The polish ceiling we compare everything else to. |
| Sonnet 4.6 | +Hydrate | 7 / 7 | $2.68 | — | Rerun after scope-hierarchy fix. Neutral stabiliser, modest cost reduction. |
| Opus 4.7 | baseline (simple) | 7 / 7 | $11.49 | 114,566 | Ships everything, re-explores every session. |
| Opus 4.7 | +Hydrate (simple) | 7 / 7 | $4.92 | 51,282 | 57% less cost for identical scope. Hydrate displaces exploration tool calls. |
| Opus 4.7 | baseline (complex) | 3 / 7 | $12.31 | — | Refused s3 ("design proposal"), s4 ("need clarification"), s6 (1-turn ack for $3.88). |
| Opus 4.7 | +Hydrate (complex) | 6 / 7 | $12.79 | — | s3/s4/s6 all shipped. s6 became a full docs site with syntax highlighting. |
| Hybrid | Sonnet s1 + Haiku+Hydrate s2–7 | 7 / 7 | $1.41 | 67,083 | Sonnet seeds the architecture, Haiku propagates it via Hydrate memory. 45% of pure-Sonnet cost. |

The four mechanisms

1. Reliability unlock — Haiku

Baseline Haiku refuses to build in 5 of 7 sessions. It asks clarifying questions instead — cheap on the token meter, worthless to the user. Only sessions 1 (no prior context to be unsure about) and 7 (forced by the Playwright requirement) produced code. With Hydrate, fact injection carries the decisions forward: "pricing tiers are Free/Pro/Team", "the four doc sections are…". Haiku stops asking and starts building.

2/7 → 7/7 · +79% cost · +82% output

2. Neutral stabiliser — Sonnet

Sonnet doesn't exhibit Haiku's refuse-to-work pattern. It builds cleanly with or without Hydrate. The +Hydrate run is modestly cheaper (−14%) and structurally identical. Earlier regression reports were traced to a cross-pilot fact-contamination bug (fixed by migration 074, scope hierarchy).

7/7 → 7/7 · −14% cost

3. Cost reducer / refusal converter — Opus

On simple scenarios, Opus cost drops 57% with Hydrate — the same 7/7 ship, with roughly 45% as many tool calls. On complex scenarios, baseline Opus starts refusing ambiguous prompts ("need clarification before I can build this", "design proposal"). Hydrate converts three of those four refusals into real work at near-identical cost per session.

simple: −57% cost · complex: 3/7 → 6/7 ship

4. Architecture propagator — Hybrid mode ★

Spend one Sonnet session at the front of a project to establish the design system. Then let Haiku+Hydrate carry it through the remaining sessions. Haiku reads back Sonnet's CSS var() tokens, component patterns, and conventions — and builds in-style for the rest of the project. $1.41 total. 45% of pure-Sonnet cost at Sonnet-tier output quality.

★ $1.41 total · Sonnet-tier design · Haiku-tier cost

What each model actually built

Every bench cell produced a real workspace. Screenshots and source links let you inspect what the models actually shipped, not just what they cost. Live previews are deployed per-cell to *.pages.dev (coming online shortly).

Scenario A — Haiku 4.5 baseline

2/7 sessions shipped. Panic-built everything in s7.

baseline
screenshot → haiku-baseline.png

Scenario A — Haiku 4.5 + Hydrate

7/7 sessions shipped. Every feature present, tests pass.

+Hydrate
screenshot → haiku-hydrate.png

Scenario B — Opus 4.7 baseline

3/7 shipped. Next.js 16 + Tailwind 4; partial Stripe wiring.

baseline
screenshot → hydrate-baseline-home.png

Hybrid — Sonnet s1 → Haiku+Hydrate s2–7

7/7 shipped. Sonnet-tier design with Haiku-tier economics.

★ flagship
screenshot → hybrid-v2-final.png

What this doesn't prove

Reproduce the benchmarks

The harness is open source. Clone, set an Anthropic key, point at a workspace directory, run. You'll spend ~$67 of API credit and see your own numbers within the hour.

git clone https://github.com/sedasoft/hydrate-benchmark
cd hydrate-benchmark
export ANTHROPIC_API_KEY=sk-ant-...
./bin/bench-runner --scenario-ids=a-landing-page --confirm-run