Two-agent-independent measurement of whether a memory tool
restores in-flight task state across a session boundary. Hydrate
v0.4.2 leads claude-mem by 4 percentage points, but
confidence intervals overlap at this sample size, so read this as a
directional lead, not a settled result.
Hydratev0.4.2-2026-05-17
57%
[39% to 73%]
17 / 30 recovered
claude-memrc-sanity-2026-05-15
53%
[36% to 70%]
16 / 30 recovered
Δ = 4 percentage points. Hydrate leads, but not statistically significant at n=30.
The 95% confidence intervals overlap ([39 to 73] vs [36 to 70]). The
honest framing is "4 percentage points higher recovery rate,
with overlapping confidence intervals at this sample size." The
next run is n=60.
Harness: compact-survival/v1.0 (0b40a25)Subject: claude-sonnet-4-6 + Hydrate v0.4.2 brew bottle in both phasesTwo-agent driver split (fairness control): Hydrate run driven by Claude Code (Sonnet 4.6) in the bench tmux pane; claude-mem run driven by Codex CLI (gpt-5.5 medium) in the vibe-codex pane. Same scenario, same model under test, same Docker image. Neither tool was measured by an agent from its own family.
Per complexity
Tool
Complexity
Recovered
Wrong
Rate
95% CI
Hydrate v0.4.2
refactor
9 / 10
1
90%
[60% to 98%]
Hydrate v0.4.2
debug
6 / 10
4
60%
[31% to 83%]
Hydrate v0.4.2
design
2 / 10
8
20%
[6% to 51%]
claude-mem
refactor
7 / 10
3
70%
[40% to 89%]
claude-mem
design
5 / 10
5
50%
[24% to 76%]
claude-mem
debug
4 / 10
6
40%
[17% to 69%]
Hydrate v0.4.2 is ahead on refactor (90% vs 70%) and
debug (60% vs 40%); claude-mem is ahead on design (50% vs 20%).
Design is where extractive memory currently under-performs prose
summaries on under-specified planning tasks; the design gap widened
from the prior build, and we're investigating. This is not a uniform
sweep and we don't claim it is.
How we ran this
What it measures: recovery of in-flight task state across a session boundary (Stop hook → SessionStart hook). The scenario after /clear, a terminal restart, or context-window overflow.
Recovery rule: a cell counts as recovered iff Phase 2 names the task key and the next step without asking "what was I doing?". Anything else is a miss.
What it does NOT measure: mid-session PreCompact recovery, a distinct scenario Hydrate handles with a separate hook.
n: 10 cells per tool per complexity (debug, design, refactor), total n=30 per tool.
Harness: compact-survival/v1.0, commit 0b40a25.
Model under test: claude-sonnet-4-6 in both Phase 1 and Phase 2 for every cell, regardless of tool. Hydrate side runs the published v0.4.2 brew bottle. Identical Claude Code container image across runs.
Scenario file:bench/compact-survival/scenario.yaml, with three real-feeling task types (refactor: struct rename with partial progress; debug: hypothesis formed and evidence captured; design: plan doc partially written).
Raw outputs preserved: Phase 2 verbatim transcripts per cell at bench/compact-survival/raw/{tool}-rc-sanity-2026-05-17/{complexity}/{n}.txt, freely auditable.
After correcting both the grader and the canon specification, Hydrate
produces code that follows the architectural rules it was given.
Cell
Grader pass (20 checks)
Spend
haiku-baseline
19 / 20
$2.55
haiku-hydrate-v1 (pre-fix)
12 / 20
$3.26
haiku-hydrate-v2 (post-fix)
16 / 20
$2.69
Key finding: the apparent regression (12/20 vs 19/20)
was a grader plus canon communication failure; the baseline was
hand-rolling argv parsing while the canon required
flag.NewFlagSet. After updating both the canon and the
grader to penalise hand-rolling, Hydrate v2 reaches 16/20 and the gap
shrinks to three points, fully explained by Haiku skipping two
subcommand files. No Hydrate-injected fact produced worse code.
Paired run, smoke-level evidence. n=1 per cell. See RESULTS.md for the
full grader breakdown and per-session cost.
We're measuring how many turns a new developer needs to make their
first correct change in an unfamiliar codebase, with and without a
Hydration Pack. Results ship in the first post-launch update.
Four cells: cold-start, CLAUDE.md only, hpack only, CLAUDE.md + hpack
Headline metric: turns-to-first-correct-change, hpack vs CLAUDE.md alone
Launch suite: methodology stubs
Every benchmark below has its methodology committed before measurement
begins. Status reflects where each is in the schedule.
#
Benchmark
Status
1
pre-prompt-vs-retrieval
Pre-prompt injection vs retrieval-on-demand
Planned: post-launch week 1
2
hydrate-crg-stack
Hydrate stacked on code-review-graph
Planned: post-launch week 1
3
pack-onboarding
Hydration Pack onboarding gauntlet
Measuring: results week 1
4
vs-mempalace-recall
Recall vs MemPalace
Planned: post-launch week 1
5
cross-tool-vs-mem0
Cross-tool memory vs Mem0
Planned: post-launch week 1
6
team-sync-conflict-load
Team sync under conflict load
Planned: post-launch week 2
7
pre-prompt-ablation
Pre-prompt injection ablation
Planned: post-launch week 2
8
memory-retention-curve
Memory retention curve
Started day 0; updating weekly
9
opus-sonnet-handoff
Opus → Sonnet handoff cost delta
Planned: post-launch week 2
Per-benchmark methodology lives in
04-benchmarks.md
in the public repo.
Category-defining benchmarks
A separate workstream for benchmarks intended to set norms for the whole
memory-tools category. The three v1 metrics below were specced before
any competitor had run them; their methodology is fixed in the public
repo. No numbers appear here until measured.
Given a session history containing N documented decisions, the
percentage retrieved when later prompts require those decisions to
answer correctly. Where every other "memory" benchmark measures
verbatim recall, DRS measures whether the tool actually surfaces the
decisions a developer is paid to apply.
Why it matters: the practical thing a developer
needs from session-to-session memory is to be reminded of the
decisions they made last time, not the prose they typed. DRS is the
metric closest to that. Grader is a deterministic answer key per
decision (required concepts, forbidden concepts, acceptable
synonyms), not a substance-judgement; the score is reproducible.
Sample size: n = 30 per tool. CI reported on every published score.
The additional time-to-first-token a memory tool imposes on a
prompt, measured against a no-memory baseline. Defined as
PPLT_ms = TTFT_with_tool_ms - TTFT_baseline_ms. Lower
is better. The metric explicitly reports sub-components
(hook_time_ms, retrieval_time_ms,
model_ttft_ms) so a fast hook + slow storage cannot
hide behind the combined number.
Why it matters: every existing memory comparison
benchmarks accuracy or storage size. PPLT is the latency-of-knowing
metric, the right one for interactive coding tools where each
prompt-response cycle is a foreground activity.
Sample size: n = 100 per tool. p50 / p95 / p99 each reported.
After switching models mid-task (e.g. Opus to Sonnet), the
percentage of the prior session's decisions that remain accessible
and correctly applied in the next session. Procedure: plan a 20-
decision feature in Opus, apply the tool's handoff method, then
execute in Sonnet. Grader checks the 20 decisions for correct
application in the implementation.
Why it matters: model switching for cost control
is now standard developer practice, and no existing benchmark
measures handoff quality. MHF defines the metric so the category
has a way to score the workflow.
Sample size: n = 10 paired runs per tool.
Full specifications, including TCCT, OV, and CIT (v2, month 1 post-launch):
05-category-benchmarks.md.