Benchmarks · compact-survival/v1.0 · n=30 per tool · Hydrate v0.4.2, run 2026-05-17

Compact Survival v1.0: measured, n=30 per tool.

Two-agent-independent measurement of whether a memory tool restores in-flight task state across a session boundary. Hydrate v0.4.2 leads claude-mem by 4 percentage points, but confidence intervals overlap at this sample size, so read this as a directional lead, not a settled result.

Hydrate v0.4.2-2026-05-17

57%

[39% to 73%]

17 / 30 recovered

claude-mem rc-sanity-2026-05-15

53%

[36% to 70%]

16 / 30 recovered

Δ = 4 percentage points. Hydrate leads, but not statistically significant at n=30. The 95% confidence intervals overlap ([39 to 73] vs [36 to 70]). The honest framing is "4 percentage points higher recovery rate, with overlapping confidence intervals at this sample size." The next run is n=60.

Harness: compact-survival/v1.0 (0b40a25) Subject: claude-sonnet-4-6 + Hydrate v0.4.2 brew bottle in both phases Two-agent driver split (fairness control): Hydrate run driven by Claude Code (Sonnet 4.6) in the bench tmux pane; claude-mem run driven by Codex CLI (gpt-5.5 medium) in the vibe-codex pane. Same scenario, same model under test, same Docker image. Neither tool was measured by an agent from its own family.

Per complexity

Tool	Complexity	Recovered	Wrong	Rate	95% CI
Hydrate v0.4.2	refactor	9 / 10	1	90%	[60% to 98%]
Hydrate v0.4.2	debug	6 / 10	4	60%	[31% to 83%]
Hydrate v0.4.2	design	2 / 10	8	20%	[6% to 51%]
claude-mem	refactor	7 / 10	3	70%	[40% to 89%]
claude-mem	design	5 / 10	5	50%	[24% to 76%]
claude-mem	debug	4 / 10	6	40%	[17% to 69%]

Hydrate v0.4.2 is ahead on refactor (90% vs 70%) and debug (60% vs 40%); claude-mem is ahead on design (50% vs 20%). Design is where extractive memory currently under-performs prose summaries on under-specified planning tasks; the design gap widened from the prior build, and we're investigating. This is not a uniform sweep and we don't claim it is.

How we ran this

What it measures: recovery of in-flight task state across a session boundary (Stop hook → SessionStart hook). The scenario after /clear, a terminal restart, or context-window overflow.
Recovery rule: a cell counts as recovered iff Phase 2 names the task key and the next step without asking "what was I doing?". Anything else is a miss.
What it does NOT measure: mid-session PreCompact recovery, a distinct scenario Hydrate handles with a separate hook.
n: 10 cells per tool per complexity (debug, design, refactor), total n=30 per tool.
Harness: compact-survival/v1.0, commit 0b40a25.
Model under test: claude-sonnet-4-6 in both Phase 1 and Phase 2 for every cell, regardless of tool. Hydrate side runs the published v0.4.2 brew bottle. Identical Claude Code container image across runs.
Scenario file: bench/compact-survival/scenario.yaml, with three real-feeling task types (refactor: struct rename with partial progress; debug: hypothesis formed and evidence captured; design: plan doc partially written).
Raw outputs preserved: Phase 2 verbatim transcripts per cell at bench/compact-survival/raw/{tool}-rc-sanity-2026-05-17/{complexity}/{n}.txt, freely auditable.

Canonical RESULTS: bench/compact-survival/RESULTS.md · Run mirror: RESULTS-rc-sanity-2026-05-17.md · Auditable raw transcripts: bench/compact-survival/raw/

Verified headline results

The figures we stand behind, on the axes that matter for a local memory layer. Each row names exactly what it measures; we do not place a retrieval-recall number beside another tool's QA-accuracy number.

Benchmark	Result	What it measures
Token reduction (headline)	11.1% fewer output tokens, 6.3% lower cost, at quality parity	concision A/B, n=10: the same work for fewer tokens
Retrieval	R@10 = 0.86 (0.95 on multi-session)	LongMemEval, n=500: the right context lands in the top ten. Anchored against Cortex, not mem0's QA accuracy
Compression	99.2% (475.7M → 3.7M tokens)	distil pipeline across 22,483 sessions, before re-injection
Speed and footprint	sub-10 ms retrieval, no reranker, no GPU, no model download	pure Go, small enough to audit in an afternoon

Source: the benchmark scoreboard in the public README. Retrieval recall and end-to-end QA accuracy are different measurements, so we do not set our retrieval number beside a competitor's QA-accuracy number.

runtime · Codex

Concision ports to Codex: -23% lines of code at 12/12 parity

OpenAI Codex CLI 0.141.0, gpt-5.4 · source: runtimes/codex/results/metrics.json

The same weather task, single shot, with and without the /hydrate-yagni concision directive, both held to a 12/12 rubric so the delta is concision, not quality. The directive cut total emitted code from 809 to 623 lines (-23%) and output tokens from 10,545 to 9,395 (-11%).

Reported in lines and tokens, not dollars. This run used ChatGPT-subscription auth, so there is no clean per-token USD figure like the Claude weather-bench. Note input tokens were higher for the concision arm (the directive adds prompt and reasoning); the win is in emitted code size, not input.

weather-bench

Hydrate is not a single-shot cost lever. Concision is.

source: slash-commands/benchmarks/weather-bench

We say this plainly because the honest reading of weather-bench says it. A prompt-only run with Hydrate's hooks active passes the rubric 12/12, but it is the most expensive single-shot arm on the board. The lever that actually reduces single-shot spend is the concision directive, whose arms land far lower at the same 12/12. Do not read Hydrate as a one-shot token saver; its win is memory across sessions, measured below.

We do not rerank. An A/B on the bge-reranker showed no recall gain at 5 to 33 times the latency, so reranking stays off; the sub-10 ms retrieval figure above is the no-reranker path.

Other measurements

Earlier measurements, kept for completeness. Methodology decided before measurement, negative results published, comparator versions pinned. Fairness rules: 04-benchmarks.md §1.14.

smoke

Post-MemPalace smoke: cross-model feature presence

n: 2 (Haiku), 1 (Sonnet), 2 (Opus) · source: bench/post-mempalace-smoke/RESULTS.md

Hydrate's memory-capture and injection features work end-to-end on every Anthropic model tier.

Model	Layer 1 (data)	Layer 2 (claude --print)
Haiku 4.5	6 / 6	5 / 6 (max-turns flake)
Sonnet 4.6	6 / 6	6 / 6
Opus 4.7	6 / 6	6 / 6

Smoke benchmark, verifies feature presence per model tier. See RESULTS.md for per-scenario detail.

Reproduce

bash bench/post-mempalace-smoke/run.sh --with-claude --record --model=<model>

paired run

scenario-lquery: canon compliance under Haiku

n: 1 per cell, 7 sessions each · source: bench/scenario-lquery/RESULTS.md

After correcting both the grader and the canon specification, Hydrate produces code that follows the architectural rules it was given.

Cell	Grader pass (20 checks)	Spend
haiku-baseline	19 / 20	$2.55
haiku-hydrate-v1 (pre-fix)	12 / 20	$3.26
haiku-hydrate-v2 (post-fix)	16 / 20	$2.69

Key finding: the apparent regression (12/20 vs 19/20) was a grader plus canon communication failure; the baseline was hand-rolling argv parsing while the canon required flag.NewFlagSet. After updating both the canon and the grader to penalise hand-rolling, Hydrate v2 reaches 16/20 and the gap shrinks to three points, fully explained by Haiku skipping two subcommand files. No Hydrate-injected fact produced worse code.

Paired run, smoke-level evidence. n=1 per cell. See RESULTS.md for the full grader breakdown and per-session cost.

Reproduce

bash bench/scenario-lquery/run.sh haiku-hydrate-v2 claude-haiku-4-5-20251001 hydrate
go build -o /tmp/lquery-grader ./bench/scenario-lquery/grader
/tmp/lquery-grader --json bench-output/lquery/haiku-hydrate-v2/project

Coming this week

in flight

Pack onboarding gauntlet

Source: 04-benchmarks.md §4.3

We're measuring how many turns a new developer needs to make their first correct change in an unfamiliar codebase, with and without a Hydration Pack. Results ship in the first post-launch update.

Four cells: cold-start, CLAUDE.md only, hpack only, CLAUDE.md + hpack
Five-task onboarding gauntlet: add feature, fix bug, change convention, write test, refactor
Grader: per-task AST/diff pass-fail
n = 10 per cell (minimum for a homepage number)
Headline metric: turns-to-first-correct-change, hpack vs CLAUDE.md alone

How we compare against competitors

One rule governs every comparison on this site: compare on the same metric, against the right anchor. Most memory tools publish different metrics, so a raw number-next-to-number table is usually dishonest. The clearest trap is retrieval. We publish retrieval recall (R@10), the metric Cortex publishes. mem0 publishes end-to-end QA accuracy, a different thing, so our number and mem0's are not a head-to-head, and we anchor R@10 against Cortex instead.

Three lines we hold to: we never set our number beside a competitor's on a different metric; we never restate a competitor's self-reported figure as if we verified it; and we never run a head-to-head against a tool on a different layer, such as code structure or wire compression, where the honest measurement is running both.

Benchmark family	Metric	Honest anchor	Status
Coding-session survival	scenario-lquery (/20), compact-survival recovery	claude-mem; ClawMem and Hindsight next	measured vs claude-mem
Retrieval recall	R@10, MRR	Cortex, not mem0 (mem0 publishes QA accuracy)	LongMemEval done; LoCoMo specced
Cross-runtime memory	cross-vendor compact-survival recall	none: every rival is single-runtime	measured; expanding vendor pairs
IR ranking	NDCG@10 on BEIR	vstash (and ClawMem rerank claim)	planned
Orchestration	cross-family review catch-rate (1-vs-N)	same-model self-review baseline	in-tree, not yet published
Team propagation	pack-onboarding fidelity + attribution survival	Hindsight, claude-mem Pro	measured; vs-Hindsight next
Token economics	tokens-per-task, composed	Headroom, code-review-graph (run both)	planned

Measured figures live on the per-product comparison pages, not here. Where a row says planned or specced, no number appears anywhere on the site until it is run.

Launch suite: methodology stubs

Every benchmark below has its methodology committed before measurement begins. Status reflects where each is in the schedule.

#	Benchmark	Status
1	`pre-prompt-vs-retrieval` Pre-prompt injection vs retrieval-on-demand	Planned: post-launch week 1
2	`hydrate-crg-stack` Hydrate stacked on code-review-graph	Planned: post-launch week 1
3	`pack-onboarding` Hydration Pack onboarding gauntlet	Measuring: results week 1
4	`vs-mempalace-recall` Recall vs MemPalace	Planned: post-launch week 1
5	`cross-tool-vs-mem0` Cross-tool memory vs Mem0	Planned: post-launch week 1
6	`team-sync-conflict-load` Team sync under conflict load	Planned: post-launch week 2
7	`pre-prompt-ablation` Pre-prompt injection ablation	Planned: post-launch week 2
8	`memory-retention-curve` Memory retention curve	Started day 0; updating weekly
9	`opus-sonnet-handoff` Opus → Sonnet handoff cost delta	Planned: post-launch week 2

Per-benchmark methodology lives in 04-benchmarks.md in the public repo.

Category-defining benchmarks

A separate workstream for benchmarks intended to set norms for the whole memory-tools category. The three v1 metrics below were specced before any competitor had run them; their methodology is fixed in the public repo. No numbers appear here until measured.

coming this week

DRS: Decision Recall Score

Spec: 05-category-benchmarks.md §2.1

Given a session history containing N documented decisions, the percentage retrieved when later prompts require those decisions to answer correctly. Where every other "memory" benchmark measures verbatim recall, DRS measures whether the tool actually surfaces the decisions a developer is paid to apply.

Why it matters: the practical thing a developer needs from session-to-session memory is to be reminded of the decisions they made last time, not the prose they typed. DRS is the metric closest to that. Grader is a deterministic answer key per decision (required concepts, forbidden concepts, acceptable synonyms), not a substance-judgement; the score is reproducible.

Sample size: n = 30 per tool. CI reported on every published score.

coming this week

PPLT: Pre-Prompt Latency Tax

Spec: 05-category-benchmarks.md §2.2

The additional time-to-first-token a memory tool imposes on a prompt, measured against a no-memory baseline. Defined as PPLT_ms = TTFT_with_tool_ms - TTFT_baseline_ms. Lower is better. The metric explicitly reports sub-components (hook_time_ms, retrieval_time_ms, model_ttft_ms) so a fast hook + slow storage cannot hide behind the combined number.

Why it matters: every existing memory comparison benchmarks accuracy or storage size. PPLT is the latency-of-knowing metric, the right one for interactive coding tools where each prompt-response cycle is a foreground activity.

Sample size: n = 100 per tool. p50 / p95 / p99 each reported.

coming this week

MHF: Model Handoff Fidelity

Spec: 05-category-benchmarks.md §2.3

After switching models mid-task (e.g. Opus to Sonnet), the percentage of the prior session's decisions that remain accessible and correctly applied in the next session. Procedure: plan a 20- decision feature in Opus, apply the tool's handoff method, then execute in Sonnet. Grader checks the 20 decisions for correct application in the implementation.

Why it matters: model switching for cost control is now standard developer practice, and no existing benchmark measures handoff quality. MHF defines the metric so the category has a way to score the workflow.

Sample size: n = 10 paired runs per tool.

Full specifications, including TCCT, OV, and CIT (v2, month 1 post-launch): 05-category-benchmarks.md.