Scenario A — Haiku 4.5 baseline
2/7 sessions shipped. Panic-built everything in s7.
Three Claude models (Haiku 4.5, Sonnet 4.6, Opus 4.7). Two configs per model (baseline, +Hydrate). Seven sessions per run — the same prompts, in the same order, on a fresh workspace each time. No human in the loop. Real Anthropic API spend, summed below: ~$67.
A Go harness (hydrate-benchmark) spawns one claude --print subprocess per session,
captures the stream-json output, and tallies cost/tokens/turns from the terminal result event.
Each cell gets its own workspace directory. Baseline cells have Hydrate env vars unset (the hooks early-exit on empty
user). Hydrate cells use a per-cell SESHELL_BENCH_PROJECT slug so facts stay isolated across runs.
Seven sessions building Next.js 16 + Tailwind 4 + real Stripe sandbox integration. Opus-only, because Opus has the tool-use budget to handle the complexity. This is the scenario where the "refusal-to-work" mechanism shows up most clearly.
| Model | Config | Sessions shipped | Cost (USD) | Output tokens | Behaviour |
|---|---|---|---|---|---|
| Haiku 4.5 | baseline | 2 / 7 | $0.60 | 23,878 | Refused to build in 5 of 7 sessions. Asked "what are the four doc sections?" instead of listing them. |
| Haiku 4.5 | +Hydrate | 7 / 7 | $1.08 | 43,497 | Executed every prompt. Cost up 79%, features up 82%. |
| Sonnet 4.6 | baseline | 7 / 7 | $3.13 | — | Clean baseline. The polish ceiling we compare everything else to. |
| Sonnet 4.6 | +Hydrate | 7 / 7 | $2.68 | — | Rerun after scope-hierarchy fix. Neutral stabiliser, modest cost reduction. |
| Opus 4.7 | baseline (simple) | 7 / 7 | $11.49 | 114,566 | Ships everything, re-explores every session. |
| Opus 4.7 | +Hydrate (simple) | 7 / 7 | $4.92 | 51,282 | 57% less cost for identical scope. Hydrate displaces exploration tool calls. |
| Opus 4.7 | baseline (complex) | 3 / 7 | $12.31 | — | Refused s3 ("design proposal"), s4 ("need clarification"), s6 (1-turn ack for $3.88). |
| Opus 4.7 | +Hydrate (complex) | 6 / 7 | $12.79 | — | s3/s4/s6 all shipped. s6 became a full docs site with syntax highlighting. |
| Hybrid | Sonnet s1 + Haiku+Hyd s2-7 | 7 / 7 | $1.41 | 67,083 | Sonnet seeds the architecture, Haiku propagates it via Hydrate memory. 45% of pure-Sonnet cost. |
Baseline Haiku refuses to build in 5 of 7 sessions. It asks clarifying questions instead — cheap to the token meter, worthless to the user. Only sessions 1 (no prior context) and 7 (forced by the Playwright requirement) produced code. With Hydrate, fact injection carries the decisions forward: "pricing tiers are Free/Pro/Team", "the four doc sections are…". Haiku stops asking and starts building.
Sonnet doesn't exhibit Haiku's refuse-to-work pattern. It builds cleanly with or without Hydrate. The +Hydrate run is modestly cheaper (−14%) and structurally identical. Earlier regression reports were traced to a cross-pilot fact-contamination bug (fixed by migration 074, scope hierarchy).
On simple scenarios Opus costs drop 57% with Hydrate — same 7/7 ship, ~45% of the tool calls. On complex scenarios Opus baseline starts refusing ambiguous prompts ("need clarification before I can build this", "design proposal"). Hydrate converts 3 of those 4 refusals into real work at near-identical cost per session.
Spend one Sonnet session at the front of a project to establish the design system. Then let Haiku+Hydrate carry
it through the remaining sessions. Haiku reads back Sonnet's CSS var() tokens, component patterns,
and conventions — and builds in-style for the rest of the project. $1.41 total. 45% of pure-Sonnet cost
at Sonnet-tier output quality.
Every bench cell produced a real workspace. Screenshots and source links let you inspect what the models actually
shipped, not just what they cost. Live previews are deployed per-cell to *.pages.dev (coming online shortly).
2/7 sessions shipped. Panic-built everything in s7.
7/7 sessions shipped. Every feature present, tests pass.
3/7 shipped. Next.js 16 + Tailwind 4; partial Stripe wiring.
6/7 shipped. Checkout + webhook + full docs site + syntax highlighting.
7/7 shipped. Sonnet-tier design with Haiku-tier economics.
The harness is open source. Clone, set an Anthropic key, point at a workspace directory, run. You'll spend ~$67 of API credit and see your own numbers within the hour.
git clone https://github.com/sedasoft/hydrate-benchmark
cd hydrate-benchmark
export ANTHROPIC_API_KEY=sk-ant-...
./bin/bench-runner --scenario-ids=a-landing-page --confirm-run