2026-05-07 · 8 min read

Testing memory features without paying the model

Most ways of testing an LLM feature end with "call the model and check what it said". That's slow and it's expensive, and when the test fails you don't know whether the model is confused or your code is broken. Hydrate's pipeline has a clean seam in the middle: a fact gets written, the fact gets injected into the model's context, the model uses the fact. You can verify the first half (did the fact actually reach the context?) without calling a model at all. That half catches the plumbing bugs for free. The second half only needs to run when the plumbing is clean, and it's cheap because by then you know the input is right. I ran this two-stage harness across six post-release features last week. The free stage caught five real bugs. The paid stage cost three cents.

Why two layers

The failure mode I want to avoid: shipping a feature, writing a test that calls Claude, paying for tokens, getting a confusing model response, and spending an hour debugging whether the issue is in the model or the wiring.

Layer 1 eliminates that. It runs without Claude at all: it calls context-preview directly, checks whether the expected text appears in the injection output, and exits. It is free and runs in about 30 seconds. It catches the class of bug where the feature writes the fact correctly but the fact never reaches the model's context.
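A minimal sketch of what a Layer 1 check boils down to. It assumes the hook prints JSON shaped like a Claude Code UserPromptSubmit payload, with the injected text under hookSpecificOutput.additionalContext. The function name and the sample payload are illustrative, not the harness's actual code.

```python
import json
import re


def injection_contains(hook_output: str, pattern: str) -> bool:
    """Return True if the regex `pattern` matches the injected context.

    Assumption: the injected text lives under
    hookSpecificOutput.additionalContext. Adjust the lookup if the
    real output is shaped differently.
    """
    try:
        payload = json.loads(hook_output)
    except json.JSONDecodeError:
        return False  # non-JSON output (for example an error message) fails
    if not isinstance(payload, dict):
        return False
    specific = payload.get("hookSpecificOutput")
    if not isinstance(specific, dict):
        return False
    context = specific.get("additionalContext") or ""
    return re.search(pattern, context) is not None


# Hypothetical hook output for the pinned-canon scenario.
sample = json.dumps({
    "hookSpecificOutput": {
        "hookEventName": "UserPromptSubmit",
        "additionalContext": "Apex API responses MUST be wrapped in {data, error}",
    }
})
```

An empty or missing additionalContext and a non-JSON error body both come back as False, which is the point: every plumbing failure collapses into one cheap, unambiguous signal.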

Layer 2 runs the same scenarios with claude --print on Haiku 4.5 and checks whether the model's response reflects what was injected. That costs money, so it runs after Layer 1 is clean.
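A sketch of the Layer 2 shape under the same caveat: the helper names are mine, and the model identifier is left to the caller because I am not asserting which alias the harness passes. The model call is injected as a function so the check itself can be exercised without spending tokens.

```python
import re
import subprocess
from typing import Callable


def run_claude(prompt: str, model: str) -> str:
    """Run the Claude Code CLI non-interactively and return its stdout.

    `model` is whatever identifier your CLI version accepts; the value
    is an assumption the caller has to supply.
    """
    result = subprocess.run(
        ["claude", "--print", "--model", model, prompt],
        capture_output=True,
        text=True,
        check=True,    # a non-zero exit raises instead of passing silently
        timeout=120,
    )
    return result.stdout


def layer2_check(prompt: str, pattern: str, ask: Callable[[str], str]) -> bool:
    """Ask the model and report whether the response matches `pattern`."""
    response = ask(prompt)
    return re.search(pattern, response, flags=re.DOTALL) is not None
```

In a real run, `ask` would be something like `lambda p: run_claude(p, model=...)`. In a dry run it can be a canned function, which keeps the assertion logic testable for free.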

The harness caught five bugs in Layer 1 before I spent a single token on Layer 2.

What I tested

Six scenarios across three containers, all on hydrate v0.2.0-incident-hardening:

Scenario	Container	Feature under test	Result
L1	`apex-tom`	Pinned canon injects on `UserPromptSubmit`	PASS
L2	`apex-tom`	`decisions scan --apply --pin` end to end	PASS
L3	`apex-dick`	Independent developer container fully wired	PASS
L4	`apex-tom`	`ingest --source=dir --apply` end to end	PASS
L5	`hydrate-team-alice`	Team-virtual container fully wired	PASS
L6	`apex-tom`	Area-tagged fact still injects (non-regression)	PASS

12 / 12 checks PASS (six scenarios × two layers). Total LLM cost: ~$0.03.

What each scenario proves

L1: Pinned canon

Added a canon fact to apex-tom: "Apex API responses MUST be wrapped in {data, error}". Layer 1 checked that the pattern appeared in additionalContext. Layer 2 asked Claude to add a new endpoint and verified the response used the wrapper.

This is the base case: canon add → SQLite → context-preview → hook output → model receives it → model uses it.

L2: Decisions scan

Wrote a # WHY: marker into internal/auth/jwt.go with the text "JWT chosen over sessions; API must be stateless for k8s scaling". Ran hydrate decisions scan --apply --pin --project=apex. Layer 1 checked for stateless|JWT|k8s in the injection. Layer 2 asked "Why is the auth stateless?" and checked that the response matched.

The extraction path is pure regex, no LLM call. The fact lands in canon and injects on every session. The whole workstream verified end to end at zero extraction cost.
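To make "pure regex" concrete, here is a sketch of marker extraction. The pattern and the returned fields are assumptions for illustration; the actual hydrate regex and fact schema may differ.

```python
import re

# Hypothetical marker syntax: "# WHY:" anywhere on a line, then free text.
WHY_MARKER = re.compile(r"#\s*WHY:\s*(?P<reason>\S.*?)\s*$")


def extract_decisions(source: str, path: str) -> list[dict]:
    """Collect one decision record per line that carries a WHY marker."""
    decisions = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        match = WHY_MARKER.search(line)
        if match:
            decisions.append(
                {"path": path, "line": lineno, "text": match.group("reason")}
            )
    return decisions


# Hypothetical file content; the marker text is the one from the scenario.
example = (
    "package auth\n"
    "// # WHY: JWT chosen over sessions; API must be stateless for k8s scaling.\n"
    "func Sign() {}\n"
)
```

Because nothing here calls a model, the same input always yields the same facts, which is what makes the extraction step free to re-run.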

L3: Independent developer container

Added the same {data, error} canon directly to apex-dick's container and verified injection independently. Dick's container has its own hydrate-server, its own auth, its own store. This confirms that the v0.2.0 deploy is consistent across developer identities, not just one container.

Note: cross-developer team-sync propagation (Tom pushes, Dick pulls) is a separate scenario not yet in this harness. L3 establishes that Dick's environment is ready for it.

L4: Ingest from directory

Wrote "We use Postgres 15..." to a markdown file in /tmp/smoke-ingest/ and ran hydrate ingest --source=dir --apply /tmp/smoke-ingest. Layer 1 checked for Postgres|postgres in the injection. Layer 2 asked "What database does this project use, and what version?" and checked that the response named Postgres with version 15.

The ingest path now has a source=dir tag that round-trips through extraction, store, and injection without being silently dropped. That tag was the thing I was most uncertain about after the schema changes; L4 confirms it works.

L5: Team-virtual container

Added a TypeScript strict-mode canon to hydrate-team-alice, asked Claude for tsconfig guidance, and verified the response included the strictness requirement. Confirms that hydrate-team-alice, one of the team-virtual containers (alice, bob, carol), is correctly registered with its own enterprise identity and can inject canon into live Haiku sessions.

L6: Areas don't break injection

Took an existing fact, ran hydrate area move <fact-id> --to=auth, and confirmed the fact still appeared in injection on the same prompt. Areas are navigation-only in v1; they must not affect injection ranking. This scenario is a non-regression test for that design decision, and it held.

The five bugs Layer 1 caught at $0.00

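Every one of these would have produced a confusing failure or model response if I had gone straight to Layer 2. Bug 4 surfaced when Layer 2 first tried to launch claude, before any request was sent, so it cost nothing either: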

#	Issue	Root cause	Fix
1	macOS bash 3.2 incompatibility	Script used associative arrays, unavailable in bash 3.2	Rewrote with `case` lookups
2	Hook returned HTTP 401 from enterprise:8095	Hook was calling enterprise directly, skipping the required local hydrate-server intermediary	Added `ensure_local_server` that starts hydrate-server per container before the hook fires
3	`canon add "text"` silently ignored the positional argument	`canon add` requires `--text "..."`; unlike `fact add`, it does not accept a positional argument	Fixed setup commands to use `--text` flag throughout
4	`bash: line 1: claude: command not found` in Layer 2	`bash -lc` did not pick up the npm-global PATH where the `claude` symlink lives	Use absolute path `/usr/local/share/npm-global/bin/claude` with explicit `PATH=` in `docker exec`
5	alice's container returned empty `additionalContext`	Script defaulted to `dev-key-12345` when env was empty; alice's server had generated a different `hk_...` key	Read `~/.hydrate/api.key` first as the source of truth
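The fix for bug 4 in the table above can be sketched as building the docker exec command explicitly instead of relying on a login shell. The absolute path comes from the fix itself; the remaining PATH entries and the function name are assumptions.

```python
from typing import List

# Absolute path named in the fix; do not rely on the shell to find it.
CLAUDE_BIN = "/usr/local/share/npm-global/bin/claude"


def docker_claude_argv(container: str, prompt: str) -> List[str]:
    """Build a `docker exec` argv that runs claude with an explicit PATH."""
    bin_dir = CLAUDE_BIN.rsplit("/", 1)[0]
    # Assumed fallback entries so tools spawned by claude still resolve.
    path = f"{bin_dir}:/usr/local/bin:/usr/bin:/bin"
    return [
        "docker", "exec",
        "-e", f"PATH={path}",
        container,
        CLAUDE_BIN, "--print", prompt,
    ]
```

Passing the result to `subprocess.run` as a list avoids a shell entirely, so there is no login profile whose PATH handling could differ between containers.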

Bugs 2 and 5 are the ones I wouldn't have found quickly with a live Claude run. Bug 2 would have looked like an authentication error with no obvious connection to the hook architecture. Bug 5 would have looked like Alice's injection was silently empty, which could have been attributed to an empty fact store, a query mismatch, or half a dozen other things.

Layer 1 surfaces both as a grep failure within 30 seconds. No tokens spent.
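The fix for bug 5 is an ordering rule, sketched below. The key file location is the one from the fix; the HYDRATE_API_KEY variable name is a placeholder I made up, and returning None instead of a hard-coded default is my own design choice for the sketch.

```python
from pathlib import Path
from typing import Mapping, Optional


def resolve_api_key(home: Path, env: Mapping[str, str]) -> Optional[str]:
    """Pick the API key, treating the on-disk key as the source of truth.

    Order: ~/.hydrate/api.key, then the environment (hypothetical
    variable name), then None. There is deliberately no default key.
    """
    key_file = home / ".hydrate" / "api.key"
    if key_file.is_file():
        key = key_file.read_text().strip()
        if key:
            return key
    key = env.get("HYDRATE_API_KEY", "").strip()
    return key or None
```

Returning None forces the caller to stop with a clear error, rather than sending a key the server never issued and getting back an empty context.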

On re-running: the harness is idempotent. Re-running --apply against unchanged input is a no-op in all three workstreams (per-conversation ledger for ingest, per-marker ledger for decisions, content-hash dedup for canon). Running the full smoke suite twice produces identical results.
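Content-hash dedup, the mechanism named for canon, can be sketched with an in-memory ledger. The class, the whitespace normalisation, and the choice of SHA-256 are assumptions; the real store presumably persists its ledger in SQLite.

```python
import hashlib


class CanonLedger:
    """In-memory stand-in for a content-hash dedup ledger (hypothetical)."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    @staticmethod
    def _digest(text: str) -> str:
        # Assumed normalisation: collapse runs of whitespace before hashing.
        normalized = " ".join(text.split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def apply(self, text: str) -> bool:
        """Record `text`; return True if it was new, False if a no-op."""
        digest = self._digest(text)
        if digest in self._seen:
            return False
        self._seen.add(digest)
        return True
```

Calling `apply` twice with the same text returns True and then False, which is the property that lets the smoke suite run back to back without piling up duplicate facts.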

What this confirms is shipped and working

hydrate canon add end to end (scenarios L1, L3, L5): canon fact → SQLite → injection → model uses it.
hydrate decisions scan --apply --pin end to end (scenario L2): regex extraction from source comments → fact → pinned canon → injection → model behaviour. Zero LLM calls in the extraction path.
hydrate ingest --source=dir --apply end to end (scenario L4): markdown files in a directory → extracted facts → injection → correct model recall.
Areas are injection-neutral in v1 (scenario L6): the design decision holds in production, not just in unit tests.
Cross-container parity: apex-tom, apex-dick, and hydrate-team-alice all behave correctly under their own enterprise identities.
Schema regression fixes: the splitSQL and hydrate_retrievals.source ALTER fixes from v0.2.0 shipped cleanly; every container's store.Open succeeded under load.

What this run does not cover

Being honest about scope is part of the methodology:

Team-sync propagation. L3 added canon directly to dick's container rather than having Tom push and Dick pull. A separate scenario should exercise hydrate team push/pull once a team is initialised on the apex containers.
hydrate timeline. Gated behind --experimental and depends on a siteengine endpoint not yet GA. Not in scope until the server side lands.
Tier substitution. These tests run on Haiku 4.5 only. To validate the "Haiku with memory matches Sonnet without" claim, the same scenarios need a Sonnet 4.6 baseline with no injection for comparison. That's a benchmark, not a smoke test.
hydrate ingest --source=claude-export and --source=chatgpt-export. Covered by unit tests and fixtures but not yet in the six live scenarios.

Recommended cadence

When	What to run	Cost	Time
Every change to hook / store / canon / decisions / ingest / areas	Layer 1 only (`./run.sh`)	$0.00	~30s
Before a release tag or PR merge	Both layers (`./run.sh --with-claude`)	~$0.03	~2 min
Nightly on `main` (CI)	Both layers	~$1/month	~2 min

The monthly CI cost of verifying every critical path against a live model is under a dollar. That number will climb as the scenario count grows, but it scales with the number of distinct features, not with the number of developers or the size of the fact store.
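The arithmetic behind that figure, as a small sketch. The per-scenario cost is derived from the ~$0.03 total across six scenarios, so treat it as approximate, and 30 nights per month is an assumption.

```python
def monthly_cost(scenarios: int, cost_per_scenario: float = 0.005,
                 nights: int = 30) -> float:
    """Estimate monthly nightly-CI spend in dollars, rounded to cents."""
    return round(scenarios * cost_per_scenario * nights, 2)


print(monthly_cost(6))   # 0.9, the current six scenarios
print(monthly_cost(12))  # 1.8, if the scenario count doubles
```

Neither the number of developers nor the size of the fact store appears as an input, which is the scaling claim in code form.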

Sign-off: all post-MemPalace and post-Repowise features functional end to end, zero regressions, $0.03 total.