
Testing memory features without paying the model

Most ways of testing an LLM feature end with "call the model and check what it said". That's slow and it's expensive, and when the test fails you don't know whether the model is confused or your code is broken. Hydrate's pipeline has a clean seam in the middle: a fact gets written, the fact gets injected into the model's context, the model uses the fact. You can verify the first half (did the fact actually reach the context?) without calling a model at all. That half catches the plumbing bugs for free. The second half only needs to run when the plumbing is clean, and it's cheap because by then you know the input is right. I ran this two-stage harness across six post-release features last week. The free stage caught five real bugs. The paid stage cost three cents.

Why two layers

The failure mode I want to avoid: shipping a feature, writing a test that calls Claude, paying for tokens, getting a confusing model response, and spending an hour debugging whether the issue is in the model or the wiring.

Layer 1 eliminates that. It runs without Claude at all: it calls context-preview directly, checks whether the expected text appears in the injection output, and exits. Free. Runs in 30 seconds. Catches every bug where the feature writes the fact correctly but the fact never reaches the model's context.
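A minimal sketch of a Layer 1 check, assuming the hook output is JSON with the injected text under hookSpecificOutput.additionalContext (the shape Claude Code documents for UserPromptSubmit hooks). The function names and the sample payload are hypothetical; the harness described here does the equivalent with a grep in a shell script.

```python
import json
import re


def injected_context(hook_output: str) -> str:
    """Return the additionalContext string from a hook's JSON output, or ""."""
    try:
        payload = json.loads(hook_output)
    except json.JSONDecodeError:
        return ""
    if not isinstance(payload, dict):
        return ""
    specific = payload.get("hookSpecificOutput") or {}
    if not isinstance(specific, dict):
        return ""
    context = specific.get("additionalContext")
    return context if isinstance(context, str) else ""


def layer1_pass(hook_output: str, pattern: str) -> bool:
    """True when the expected pattern appears in the injected context."""
    return re.search(pattern, injected_context(hook_output)) is not None


# Hypothetical hook output for the pinned-canon scenario.
sample = json.dumps({
    "hookSpecificOutput": {
        "hookEventName": "UserPromptSubmit",
        "additionalContext": "Apex API responses MUST be wrapped in {data, error}",
    }
})

print(layer1_pass(sample, r"\{data, error\}"))  # fact reached the context
print(layer1_pass("{}", r"\{data, error\}"))    # empty injection: plumbing bug
```

An empty or malformed payload is treated as a failed check rather than an exception, so an auth failure or an empty injection shows up as a plain FAIL.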

Layer 2 runs the same scenarios with claude --print on Haiku 4.5 and checks whether the model's response reflects what was injected. That costs money, so it runs after Layer 1 is clean.
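The gating between the two layers can be sketched as follows. This is an illustration under stated assumptions, not the harness itself: Scenario, run_suite, and the two callables are hypothetical names, and the preview and model calls are passed in as functions so the sketch runs without hydrate or claude installed.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Scenario:
    name: str
    prompt: str
    inject_pattern: str    # must appear in the injected context (Layer 1)
    response_pattern: str  # must appear in the model's reply (Layer 2)


def run_suite(
    scenarios: List[Scenario],
    preview: Callable[[str], str],
    ask_model: Callable[[str], str],
) -> dict:
    """Run Layer 1 for every scenario; run Layer 2 only if all of Layer 1 passed."""
    layer1_failures = [
        s.name for s in scenarios
        if not re.search(s.inject_pattern, preview(s.prompt))
    ]
    if layer1_failures:
        # Plumbing is broken: stop before spending any tokens.
        return {"layer1_failures": layer1_failures, "layer2_failures": None}

    layer2_failures = [
        s.name for s in scenarios
        if not re.search(s.response_pattern, ask_model(s.prompt))
    ]
    return {"layer1_failures": [], "layer2_failures": layer2_failures}


# Stand-ins for context-preview and the paid model call.
calls = []


def fake_preview(prompt: str) -> str:
    return "canon: API must be stateless for k8s scaling"


def fake_model(prompt: str) -> str:
    calls.append(prompt)
    return "Auth is stateless because the API has to scale on k8s."


scenario = Scenario(
    name="decisions",
    prompt="Why is the auth stateless?",
    inject_pattern=r"stateless|JWT|k8s",
    response_pattern=r"stateless",
)
print(run_suite([scenario], fake_preview, fake_model))
```

The point of the early return is that the model function is never invoked while any injection check is failing.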

The harness caught five bugs in Layer 1 before I spent a single token on Layer 2.

What I tested

Six scenarios across three containers, all on hydrate v0.2.0-incident-hardening:

| Scenario | Container | Feature under test | Result |
| --- | --- | --- | --- |
| L1 | apex-tom | Pinned canon injects on UserPromptSubmit | PASS |
| L2 | apex-tom | decisions scan --apply --pin end to end | PASS |
| L3 | apex-dick | Independent developer container fully wired | PASS |
| L4 | apex-tom | ingest --source=dir --apply end to end | PASS |
| L5 | hydrate-team-alice | Team-virtual container fully wired | PASS |
| L6 | apex-tom | Area-tagged fact still injects (non-regression) | PASS |

12 / 12 checks PASS: six scenarios, each run through both layers. Total LLM cost: ~$0.03.

What each scenario proves

L1: Pinned canon

Added a canon fact to apex-tom: "Apex API responses MUST be wrapped in {data, error}". Layer 1 checked that the pattern appeared in additionalContext. Layer 2 asked Claude to add a new endpoint and verified the response used the wrapper.

This is the base case: canon add → SQLite → context-preview → hook output → model receives it → model uses it.

L2: Decisions scan

Wrote a # WHY: marker into internal/auth/jwt.go: "JWT chosen over sessions; API must be stateless for k8s scaling." Ran hydrate decisions scan --apply --pin --project=apex. Layer 1 checked for stateless|JWT|k8s in injection. Layer 2 asked "Why is the auth stateless?" and checked that the response reflected the recorded rationale.

The extraction path is pure regex, no LLM call. The fact lands in canon and injects on every session. The whole workstream was verified end to end at zero extraction cost.
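To make "pure regex" concrete, here is a minimal sketch of marker extraction. The pattern is an assumption: the post does not show the expression hydrate uses, so this one simply accepts a line comment (// or #) followed by WHY: and captures the rest of the line. The function name and the sample Go source are hypothetical.

```python
import re
from typing import List

# Assumed marker shape: optional indentation, a // or # comment leader,
# then "WHY:" and the rationale running to the end of the line.
WHY_MARKER = re.compile(
    r"^[ \t]*(?://|#)[ \t#]*WHY:[ \t]*(.+?)[ \t]*$",
    re.MULTILINE,
)


def extract_why_markers(source: str) -> List[str]:
    """Return the rationale text of every WHY: comment line in the source."""
    return [match.group(1) for match in WHY_MARKER.finditer(source)]


go_source = """package auth

// WHY: JWT chosen over sessions; API must be stateless for k8s scaling.
func Verify(token string) error { return nil }
"""

print(extract_why_markers(go_source))
```

Because the pattern is anchored to the start of the line, a string literal that happens to contain WHY: is not picked up. Trailing inline comments are not matched either, which is a limitation of this sketch.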

L3: Independent developer container

Added the same {data, error} canon directly to apex-dick's container and verified injection independently. Dick's container has its own hydrate-server, its own auth, its own store. This confirms that the v0.2.0 deploy is consistent across developer identities, not just one container.

Note: cross-developer team-sync propagation (Tom pushes, Dick pulls) is a separate scenario not yet in this harness. L3 establishes that Dick's environment is ready for it.

L4: Ingest from directory

Wrote "We use Postgres 15..." to a markdown file in /tmp/smoke-ingest/ and ran hydrate ingest --source=dir --apply /tmp/smoke-ingest. Layer 1 checked for Postgres|postgres in injection. Layer 2 asked "What database does this project use, and what version?" and checked that the response named Postgres followed by 15.

The ingest path now has a source=dir tag that round-trips through extraction, store, and injection without being silently dropped. That tag was the thing I was most uncertain about after the schema changes; L4 confirms it works.

L5: Team-virtual container

Added a TypeScript strict-mode canon to hydrate-team-alice, asked Claude for tsconfig guidance, and verified the response included the strictness requirement. Confirms that the team-virtual containers (alice, bob, carol) are correctly registered with their own enterprise identities and can inject canon into live Haiku sessions.

L6: Areas don't break injection

Took an existing fact, ran hydrate area move <fact-id> --to=auth, and confirmed the fact still appeared in injection on the same prompt. Areas are navigation-only in v1; they must not affect injection ranking. This scenario is a non-regression test for that design decision, and it held.

The five bugs Layer 1 caught at $0.00

Every one of these would have produced a confusing model response if I had gone straight to Layer 2:

| # | Issue | Root cause | Fix |
| --- | --- | --- | --- |
| 1 | macOS bash 3.2 incompatibility | Script used associative arrays, unavailable in bash 3.2 | Rewrote with case lookups |
| 2 | Hook returned HTTP 401 from enterprise:8095 | Hook was calling enterprise directly, skipping the required local hydrate-server intermediary | Added ensure_local_server that starts hydrate-server per container before the hook fires |
| 3 | canon add "text" silently ignored the positional argument | canon add requires --text "..."; unlike fact add, it does not accept a positional argument | Fixed setup commands to use --text flag throughout |
| 4 | bash: line 1: claude: command not found in Layer 2 | bash -lc did not pick up the npm-global PATH where the claude symlink lives | Use absolute path /usr/local/share/npm-global/bin/claude with explicit PATH= in docker exec |
| 5 | alice's container returned empty additionalContext | Script defaulted to dev-key-12345 when env was empty; alice's server had generated a different hk_... key | Read ~/.hydrate/api.key first as the source of truth |

Bugs 2 and 5 are the ones I wouldn't have found quickly with a live Claude run. Bug 2 would have looked like an authentication error with no obvious connection to the hook architecture. Bug 5 would have looked like Alice's injection was silently empty, which could have been attributed to an empty fact store, a query mismatch, or half a dozen other things.

Layer 1 surfaces both as a grep failure within 30 seconds. No tokens spent.
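The fix for bug 5 amounts to a lookup order. A minimal sketch, assuming the key lives at ~/.hydrate/api.key as described above: the function name, the HYDRATE_API_KEY variable, and the sample key values are hypothetical, and the sketch deliberately has no hard-coded development key to fall back on.

```python
import tempfile
from pathlib import Path
from typing import Mapping, Optional


def resolve_api_key(
    home: Path,
    env: Mapping[str, str],
    env_var: str = "HYDRATE_API_KEY",  # hypothetical variable name
) -> Optional[str]:
    """Prefer the key file written by the local server, then the environment.

    Returns None when neither source has a key, so the caller can fail
    loudly instead of sending a default key the server will not accept.
    """
    key_file = home / ".hydrate" / "api.key"
    if key_file.is_file():
        key = key_file.read_text(encoding="utf-8").strip()
        if key:
            return key
    key = env.get(env_var, "").strip()
    return key or None


# Demonstration against a throwaway home directory.
with tempfile.TemporaryDirectory() as tmp:
    home = Path(tmp)
    print(resolve_api_key(home, {}))  # no key anywhere
    (home / ".hydrate").mkdir()
    (home / ".hydrate" / "api.key").write_text("hk_example\n", encoding="utf-8")
    print(resolve_api_key(home, {"HYDRATE_API_KEY": "stale-env-key"}))
```

Returning None instead of a default is the design choice that matters: a missing key becomes an immediate, explicit failure rather than an empty additionalContext.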

On re-running: the harness is idempotent. Re-running --apply against unchanged input is a no-op in all three workstreams (per-conversation ledger for ingest, per-marker ledger for decisions, content-hash dedup for canon). Running the full smoke suite twice produces identical results.
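Content-hash dedup is what makes a repeated canon write a no-op. A minimal in-memory sketch of the idea, with loud assumptions: the class name, the use of SHA-256, and the whitespace normalisation are all illustrative choices, not a description of hydrate's store.

```python
import hashlib


class CanonStore:
    """In-memory stand-in for a store that dedups facts on a content hash."""

    def __init__(self) -> None:
        self._facts = {}  # content hash -> original text

    @staticmethod
    def _content_hash(text: str) -> str:
        # Assumption: collapse whitespace so reformatted text hashes the same.
        normalised = " ".join(text.split())
        return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

    def add(self, text: str) -> bool:
        """Store the fact; return False (a no-op) if the content already exists."""
        key = self._content_hash(text)
        if key in self._facts:
            return False
        self._facts[key] = text
        return True

    def __len__(self) -> int:
        return len(self._facts)


store = CanonStore()
fact = "Apex API responses MUST be wrapped in {data, error}"
print(store.add(fact))  # first write stores the fact
print(store.add(fact))  # second write is a no-op
print(len(store))
```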

What this confirms is shipped and working

What this run does not cover

Being honest about scope is part of the methodology:

Recommended cadence

| When | What to run | Cost | Time |
| --- | --- | --- | --- |
| Every change to hook / store / canon / decisions / ingest / areas | Layer 1 only (./run.sh) | $0.00 | ~30s |
| Before a release tag or PR merge | Both layers (./run.sh --with-claude) | ~$0.03 | ~2 min |
| Nightly on main (CI) | Both layers | ~$1/month | ~2 min |

The monthly CI cost of verifying every critical path against a live model is under a dollar. That number will climb as the scenario count grows, but it scales with the number of distinct features, not with the number of developers or the size of the fact store.
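The arithmetic behind the nightly figure, as a small sketch. It assumes 30 runs per month and, for the projection, that cost grows linearly with the number of scenarios. Both are simplifying assumptions, and the function names are hypothetical.

```python
def monthly_cost_usd(cost_per_run_usd: float, runs_per_month: int = 30) -> float:
    """Monthly cost of running the paid layer once per night."""
    return round(cost_per_run_usd * runs_per_month, 2)


def projected_monthly_cost_usd(
    scenarios: int,
    baseline_scenarios: int = 6,
    baseline_cost_per_run_usd: float = 0.03,
    runs_per_month: int = 30,
) -> float:
    """Scale the measured per-run cost linearly with the scenario count."""
    per_scenario = baseline_cost_per_run_usd / baseline_scenarios
    return round(per_scenario * scenarios * runs_per_month, 2)


print(monthly_cost_usd(0.03))          # six scenarios, nightly
print(projected_monthly_cost_usd(12))  # if the suite doubled in size
```

Neither function takes developer count or fact-store size as an input, which mirrors the claim that cost tracks the number of features under test.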

Sign-off: all post-MemPalace and post-Repowise features functional end to end, zero regressions, $0.03 total.