BETA In open beta. Install live. Lock $5/mo for your first 12 months. See pricing →
Benchmarks · compact-survival/v1.0 · n=30 per tool · Hydrate v0.4.2, run 2026-05-17

Compact Survival v1.0: measured, n=30 per tool.

Two-agent-independent measurement of whether a memory tool restores in-flight task state across a session boundary. Hydrate v0.4.2 leads claude-mem by 4 percentage points, but confidence intervals overlap at this sample size, so read this as a directional lead, not a settled result.

Hydrate v0.4.2-2026-05-17
57%
[39% to 73%]
17 / 30 recovered
claude-mem rc-sanity-2026-05-15
53%
[36% to 70%]
16 / 30 recovered
Δ = 4 percentage points. Hydrate leads, but not statistically significant at n=30. The 95% confidence intervals overlap ([39 to 73] vs [36 to 70]). The honest framing is "4 percentage points higher recovery rate, with overlapping confidence intervals at this sample size." The next run is n=60.
Harness: compact-survival/v1.0 (0b40a25) Subject: claude-sonnet-4-6 + Hydrate v0.4.2 brew bottle in both phases Two-agent driver split (fairness control): Hydrate run driven by Claude Code (Sonnet 4.6) in the bench tmux pane; claude-mem run driven by Codex CLI (gpt-5.5 medium) in the vibe-codex pane. Same scenario, same model under test, same Docker image. Neither tool was measured by an agent from its own family.

Per complexity

Tool Complexity Recovered Wrong Rate 95% CI
Hydrate v0.4.2refactor9 / 10190%[60% to 98%]
Hydrate v0.4.2debug6 / 10460%[31% to 83%]
Hydrate v0.4.2design2 / 10820%[6% to 51%]
claude-memrefactor7 / 10370%[40% to 89%]
claude-memdesign5 / 10550%[24% to 76%]
claude-memdebug4 / 10640%[17% to 69%]

Hydrate v0.4.2 is ahead on refactor (90% vs 70%) and debug (60% vs 40%); claude-mem is ahead on design (50% vs 20%). Design is where extractive memory currently under-performs prose summaries on under-specified planning tasks; the design gap widened from the prior build, and we're investigating. This is not a uniform sweep and we don't claim it is.

How we ran this

  • What it measures: recovery of in-flight task state across a session boundary (Stop hook → SessionStart hook). The scenario after /clear, a terminal restart, or context-window overflow.
  • Recovery rule: a cell counts as recovered iff Phase 2 names the task key and the next step without asking "what was I doing?". Anything else is a miss.
  • What it does NOT measure: mid-session PreCompact recovery, a distinct scenario Hydrate handles with a separate hook.
  • n: 10 cells per tool per complexity (debug, design, refactor), total n=30 per tool.
  • Harness: compact-survival/v1.0, commit 0b40a25.
  • Model under test: claude-sonnet-4-6 in both Phase 1 and Phase 2 for every cell, regardless of tool. Hydrate side runs the published v0.4.2 brew bottle. Identical Claude Code container image across runs.
  • Scenario file: bench/compact-survival/scenario.yaml, with three real-feeling task types (refactor: struct rename with partial progress; debug: hypothesis formed and evidence captured; design: plan doc partially written).
  • Raw outputs preserved: Phase 2 verbatim transcripts per cell at bench/compact-survival/raw/{tool}-rc-sanity-2026-05-17/{complexity}/{n}.txt, freely auditable.

Canonical RESULTS: bench/compact-survival/RESULTS.md · Run mirror: RESULTS-rc-sanity-2026-05-17.md · Auditable raw transcripts: bench/compact-survival/raw/

Other measurements

Earlier measurements, kept for completeness. Methodology decided before measurement, negative results published, comparator versions pinned. Fairness rules: 04-benchmarks.md §1.14.

smoke

Post-MemPalace smoke: cross-model feature presence

n: 2 (Haiku), 1 (Sonnet), 2 (Opus) · source: bench/post-mempalace-smoke/RESULTS.md

Hydrate's memory-capture and injection features work end-to-end on every Anthropic model tier.

Model Layer 1 (data) Layer 2 (claude --print)
Haiku 4.56 / 65 / 6 (max-turns flake)
Sonnet 4.66 / 66 / 6
Opus 4.76 / 66 / 6

Smoke benchmark, verifies feature presence per model tier. See RESULTS.md for per-scenario detail.

Reproduce
bash bench/post-mempalace-smoke/run.sh --with-claude --record --model=<model>
paired run

scenario-lquery: canon compliance under Haiku

n: 1 per cell, 7 sessions each · source: bench/scenario-lquery/RESULTS.md

After correcting both the grader and the canon specification, Hydrate produces code that follows the architectural rules it was given.

Cell Grader pass (20 checks) Spend
haiku-baseline19 / 20$2.55
haiku-hydrate-v1 (pre-fix)12 / 20$3.26
haiku-hydrate-v2 (post-fix)16 / 20$2.69

Key finding: the apparent regression (12/20 vs 19/20) was a grader plus canon communication failure; the baseline was hand-rolling argv parsing while the canon required flag.NewFlagSet. After updating both the canon and the grader to penalise hand-rolling, Hydrate v2 reaches 16/20 and the gap shrinks to three points, fully explained by Haiku skipping two subcommand files. No Hydrate-injected fact produced worse code.

Paired run, smoke-level evidence. n=1 per cell. See RESULTS.md for the full grader breakdown and per-session cost.

Reproduce
bash bench/scenario-lquery/run.sh haiku-hydrate-v2 claude-haiku-4-5-20251001 hydrate
go build -o /tmp/lquery-grader ./bench/scenario-lquery/grader
/tmp/lquery-grader --json bench-output/lquery/haiku-hydrate-v2/project

Coming this week

in flight

Pack onboarding gauntlet

Source: 04-benchmarks.md §4.3

We're measuring how many turns a new developer needs to make their first correct change in an unfamiliar codebase, with and without a Hydration Pack. Results ship in the first post-launch update.

  • Four cells: cold-start, CLAUDE.md only, hpack only, CLAUDE.md + hpack
  • Five-task onboarding gauntlet: add feature, fix bug, change convention, write test, refactor
  • Grader: per-task AST/diff pass-fail
  • n = 10 per cell (minimum for a homepage number)
  • Headline metric: turns-to-first-correct-change, hpack vs CLAUDE.md alone

Launch suite: methodology stubs

Every benchmark below has its methodology committed before measurement begins. Status reflects where each is in the schedule.

#BenchmarkStatus
1 pre-prompt-vs-retrieval
Pre-prompt injection vs retrieval-on-demand
Planned: post-launch week 1
2 hydrate-crg-stack
Hydrate stacked on code-review-graph
Planned: post-launch week 1
3 pack-onboarding
Hydration Pack onboarding gauntlet
Measuring: results week 1
4 vs-mempalace-recall
Recall vs MemPalace
Planned: post-launch week 1
5 cross-tool-vs-mem0
Cross-tool memory vs Mem0
Planned: post-launch week 1
6 team-sync-conflict-load
Team sync under conflict load
Planned: post-launch week 2
7 pre-prompt-ablation
Pre-prompt injection ablation
Planned: post-launch week 2
8 memory-retention-curve
Memory retention curve
Started day 0; updating weekly
9 opus-sonnet-handoff
Opus → Sonnet handoff cost delta
Planned: post-launch week 2

Per-benchmark methodology lives in 04-benchmarks.md in the public repo.

Category-defining benchmarks

A separate workstream for benchmarks intended to set norms for the whole memory-tools category. The three v1 metrics below were specced before any competitor had run them; their methodology is fixed in the public repo. No numbers appear here until measured.

coming this week

DRS: Decision Recall Score

Spec: 05-category-benchmarks.md §2.1

Given a session history containing N documented decisions, the percentage retrieved when later prompts require those decisions to answer correctly. Where every other "memory" benchmark measures verbatim recall, DRS measures whether the tool actually surfaces the decisions a developer is paid to apply.

Why it matters: the practical thing a developer needs from session-to-session memory is to be reminded of the decisions they made last time, not the prose they typed. DRS is the metric closest to that. Grader is a deterministic answer key per decision (required concepts, forbidden concepts, acceptable synonyms), not a substance-judgement; the score is reproducible.

Sample size: n = 30 per tool. CI reported on every published score.

coming this week

PPLT: Pre-Prompt Latency Tax

Spec: 05-category-benchmarks.md §2.2

The additional time-to-first-token a memory tool imposes on a prompt, measured against a no-memory baseline. Defined as PPLT_ms = TTFT_with_tool_ms - TTFT_baseline_ms. Lower is better. The metric explicitly reports sub-components (hook_time_ms, retrieval_time_ms, model_ttft_ms) so a fast hook + slow storage cannot hide behind the combined number.

Why it matters: every existing memory comparison benchmarks accuracy or storage size. PPLT is the latency-of-knowing metric, the right one for interactive coding tools where each prompt-response cycle is a foreground activity.

Sample size: n = 100 per tool. p50 / p95 / p99 each reported.

coming this week

MHF: Model Handoff Fidelity

Spec: 05-category-benchmarks.md §2.3

After switching models mid-task (e.g. Opus to Sonnet), the percentage of the prior session's decisions that remain accessible and correctly applied in the next session. Procedure: plan a 20- decision feature in Opus, apply the tool's handoff method, then execute in Sonnet. Grader checks the 20 decisions for correct application in the implementation.

Why it matters: model switching for cost control is now standard developer practice, and no existing benchmark measures handoff quality. MHF defines the metric so the category has a way to score the workflow.

Sample size: n = 10 paired runs per tool.

Full specifications, including TCCT, OV, and CIT (v2, month 1 post-launch): 05-category-benchmarks.md.