Testing memory features without paying the model
Most ways of testing an LLM feature end with "call the model and check what it said". That's slow and it's expensive, and when the test fails you don't know whether the model is confused or your code is broken. Hydrate's pipeline has a clean seam in the middle: a fact gets written, the fact gets injected into the model's context, the model uses the fact. You can verify the first half (did the fact actually reach the context?) without calling a model at all. That half catches the plumbing bugs for free. The second half only needs to run when the plumbing is clean, and it's cheap because by then you know the input is right. I ran this two-stage harness across six post-release features last week. The free stage caught five real bugs. The paid stage cost three cents.
Why two layers
The failure mode I want to avoid: shipping a feature, writing a test that calls Claude, paying for tokens, getting a confusing model response, and spending an hour debugging whether the issue is in the model or the wiring.
Layer 1 eliminates that. It runs without Claude at all: it calls context-preview
directly, checks whether the expected text appears in the injection output, and exits.
Free. Runs in 30 seconds. Catches every bug where the feature writes the fact correctly but
the fact never reaches the model's context.
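A minimal sketch of the Layer 1 check, assuming the injection output has already been captured as text. `layer1_check` is a hypothetical helper name, and the sample string is a stand-in rather than real hydrate output; how the harness actually invokes context-preview is not shown here.

```shell
# Hypothetical helper: succeed if the injection text on stdin matches an
# extended regex. No model call is involved at any point.
layer1_check() {
  local pattern="$1"
  if grep -Eq -- "$pattern"; then
    echo "PASS"
  else
    echo "FAIL: pattern '$pattern' not found in injection output"
    return 1
  fi
}

# Stand-in for captured injection output (an assumption, not real output).
sample='Canon: Apex API responses MUST be wrapped in {data, error}'
printf '%s\n' "$sample" | layer1_check 'MUST be wrapped'
```

Because the check is plain text matching, a failure here points at the wiring rather than at anything the model did.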
Layer 2 runs the same scenarios with claude --print on Haiku 4.5 and checks
whether the model's response reflects what was injected. That costs money, so it runs after
Layer 1 is clean.
The harness caught five bugs in Layer 1 before I spent a single token on Layer 2.
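The gating between the two layers can be sketched roughly as follows. The stage runners are passed in by name, and `run_harness`, `layer1_stub`, and `layer2_stub` are hypothetical names; only the `--with-claude` flag comes from the harness itself.

```shell
# Sketch of the gate: the paid layer runs only when the free layer is clean
# and the caller explicitly asked for it.
run_harness() {
  local layer1_cmd="$1" layer2_cmd="$2" with_claude="$3"

  if ! "$layer1_cmd"; then
    echo "layer1 failed; skipping paid layer"
    return 1
  fi
  if [ "$with_claude" != "--with-claude" ]; then
    echo "layer1 clean; layer2 not requested"
    return 0
  fi
  "$layer2_cmd"
}

# Stubs standing in for the real stages (assumptions for illustration).
layer1_stub() { echo "layer1: injection checks"; }
layer2_stub() { echo "layer2: model checks"; }

run_harness layer1_stub layer2_stub --with-claude
```

Keeping the gate separate from the stages means the ordering rule can be tested without hydrate or claude installed.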
What I tested
Six scenarios across three containers, all on hydrate v0.2.0-incident-hardening:
| Scenario | Container | Feature under test | Result |
|---|---|---|---|
| L1 | apex-tom | Pinned canon injects on UserPromptSubmit | PASS |
| L2 | apex-tom | decisions scan --apply --pin end to end | PASS |
| L3 | apex-dick | Independent developer container fully wired | PASS |
| L4 | apex-tom | ingest --source=dir --apply end to end | PASS |
| L5 | hydrate-team-alice | Team-virtual container fully wired | PASS |
| L6 | apex-tom | Area-tagged fact still injects (non-regression) | PASS |
What each scenario proves
L1: Pinned canon
Added a canon fact to apex-tom: "Apex API responses MUST be wrapped
in {data, error}". Layer 1 checked that the pattern appeared in
additionalContext. Layer 2 asked Claude to add a new endpoint and verified
the response used the wrapper.
This is the base case: canon add → SQLite → context-preview → hook output
→ model receives it → model uses it.
L2: Decisions scan
Wrote a # WHY: marker into internal/auth/jwt.go with the text "JWT chosen
over sessions; API must be stateless for k8s scaling". Ran
hydrate decisions scan --apply --pin --project=apex. Layer 1 checked for
stateless|JWT|k8s in injection. Layer 2 asked "Why is the auth stateless?"
and checked the response matched.
The extraction path is pure regex, no LLM call. The fact lands in canon and injects on every session. The whole workstream verified end to end at zero extraction cost.
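To make "pure regex" concrete, here is a rough sketch of what marker extraction could look like. The pattern below is an assumption for illustration, not the regex hydrate actually uses, and `extract_why` is a hypothetical helper.

```shell
# Assumed marker format: "# WHY: <text>" anywhere on a line.
# Prints the text after the marker; lines without a marker print nothing.
extract_why() {
  sed -n 's/.*# WHY:[[:space:]]*\(.*\)$/\1/p' "$@"
}

# Stand-in source snippet piped through the extractor.
printf '%s\n' \
  'func sign() {' \
  '# WHY: JWT chosen over sessions; API must be stateless for k8s scaling.' \
  '}' | extract_why
```

Nothing in this path needs a model, which is why the extraction step costs nothing to verify.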
L3: Independent developer container
Added the same {data, error} canon directly to
apex-dick's container and verified injection independently. Dick's container
has its own hydrate-server, its own auth, its own store. This confirms that the v0.2.0
deploy is consistent across developer identities, not just one container.
Note: cross-developer team-sync propagation (Tom pushes, Dick pulls) is a separate scenario not yet in this harness. L3 establishes that Dick's environment is ready for it.
L4: Ingest from directory
Wrote "We use Postgres 15..." to a markdown file in /tmp/smoke-ingest/
and ran hydrate ingest --source=dir --apply /tmp/smoke-ingest. Layer 1 checked
for Postgres|postgres in injection. Layer 2 asked "What database does this
project use, and what version?" and checked for Postgres.3015 in the response.
The ingest path now has a source=dir tag that round-trips through extraction,
store, and injection without being silently dropped. That tag was the thing I was most
uncertain about after the schema changes; L4 confirms it works.
L5: Team-virtual container
Added a TypeScript strict-mode canon to hydrate-team-alice, asked Claude
for tsconfig guidance, and verified the response included the strictness requirement.
Confirms that the team-virtual containers (alice, bob, carol) are correctly registered
with their own enterprise identities and can inject canon into live Haiku sessions.
L6: Areas don't break injection
Took an existing fact, ran hydrate area move <fact-id> --to=auth, and
confirmed the fact still appeared in injection on the same prompt. Areas are navigation-only
in v1; they must not affect injection ranking. This scenario is a non-regression test for
that design decision, and it held.
The five bugs Layer 1 caught at $0.00
Every one of these would have produced a confusing model response if I had gone straight to Layer 2:
| # | Issue | Root cause | Fix |
|---|---|---|---|
| 1 | macOS bash 3.2 incompatibility | Script used associative arrays, unavailable in bash 3.2 | Rewrote with case lookups |
| 2 | Hook returned HTTP 401 from enterprise:8095 | Hook was calling enterprise directly, skipping the required local hydrate-server intermediary | Added ensure_local_server that starts hydrate-server per container before the hook fires |
| 3 | canon add "text" silently ignored the positional argument | canon add requires --text "..."; unlike fact add, it does not accept a positional argument | Fixed setup commands to use --text flag throughout |
| 4 | bash: line 1: claude: command not found in Layer 2 | bash -lc did not pick up the npm-global PATH where the claude symlink lives | Use absolute path /usr/local/share/npm-global/bin/claude with explicit PATH= in docker exec |
| 5 | alice's container returned empty additionalContext | Script defaulted to dev-key-12345 when env was empty; alice's server had generated a different hk_... key | Read ~/.hydrate/api.key first as the source of truth |
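For bug 1, a `case` lookup can stand in for an associative array on bash 3.2. The sketch below uses the scenario-to-container mapping from the results table; `container_for` is a hypothetical function name, not necessarily what the harness calls it.

```shell
# bash 3.2 (the macOS default) has no `declare -A`, so map keys with `case`.
container_for() {
  case "$1" in
    L1|L2|L4|L6) echo "apex-tom" ;;
    L3)          echo "apex-dick" ;;
    L5)          echo "hydrate-team-alice" ;;
    *)           echo "unknown scenario: $1" >&2; return 1 ;;
  esac
}

for s in L1 L2 L3 L4 L5 L6; do
  echo "$s -> $(container_for "$s")"
done
```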
Bugs 2 and 5 are the ones I wouldn't have found quickly with a live Claude run. Bug 2 would have looked like an authentication error with no obvious connection to the hook architecture. Bug 5 would have looked like Alice's injection was silently empty, which could have been attributed to an empty fact store, a query mismatch, or half a dozen other things.
Layer 1 surfaces both as a grep failure within 30 seconds. No tokens spent.
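The fix for bug 5 amounts to a lookup order for the API key. A minimal sketch, assuming an environment variable named `HYDRATE_API_KEY` (the name is a guess; the table only says "env") and a hypothetical `resolve_api_key` helper. Failing instead of falling back to a default is a choice made for this sketch, not something the fix is confirmed to do.

```shell
resolve_api_key() {
  local key_file="${1:-$HOME/.hydrate/api.key}"
  if [ -s "$key_file" ]; then
    # The key file written by the container's own server wins.
    tr -d '[:space:]' < "$key_file"
  elif [ -n "${HYDRATE_API_KEY:-}" ]; then
    # Assumed env var name, used only when no key file exists.
    printf '%s' "$HYDRATE_API_KEY"
  else
    echo "no API key found; refusing to fall back to a default" >&2
    return 1
  fi
}

# Demo against a temporary file holding a placeholder key.
tmp=$(mktemp)
printf 'placeholder-key\n' > "$tmp"
resolve_api_key "$tmp"; echo
rm -f "$tmp"
```

With this ordering, a container that generated its own key can no longer be queried with a stale default, which is the condition that produced the empty additionalContext.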
The suite is also safe to re-run: --apply against unchanged input is a no-op in all
three workstreams (per-conversation ledger for ingest, per-marker ledger for decisions,
content-hash dedup for canon), so running the full smoke suite twice produces identical results.
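A "run it twice and compare" check can be sketched like this. `assert_idempotent` is a hypothetical helper that accepts any command; the `echo` call is a stand-in for the smoke suite itself.

```shell
# Run the given command twice and compare the captured output.
assert_idempotent() {
  local first second
  first=$("$@") || return 1
  second=$("$@") || return 1
  if [ "$first" = "$second" ]; then
    echo "idempotent"
  else
    echo "outputs differ between runs" >&2
    return 1
  fi
}

# Stand-in command; in the harness this would be the suite invocation.
assert_idempotent echo "6 scenarios passed"
```

This only compares what the command prints, so it would miss a run that silently duplicates rows in the store; a stricter version would also compare fact counts before and after.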
What this confirms is shipped and working
- hydrate canon add end to end (scenarios L1, L3, L5): canon fact → SQLite → injection → model uses it.
- hydrate decisions scan --apply --pin end to end (scenario L2): regex extraction from source comments → fact → pinned canon → injection → model behaviour. Zero LLM calls in the extraction path.
- hydrate ingest --source=dir --apply end to end (scenario L4): markdown files in a directory → extracted facts → injection → correct model recall.
- Areas are injection-neutral in v1 (scenario L6): the design decision holds in production, not just in unit tests.
- Cross-container parity: apex-tom, apex-dick, and hydrate-team-alice all behave correctly under their own enterprise identities.
- Schema regression fixes: the splitSQL and hydrate_retrievals.source ALTER fixes from v0.2.0 shipped cleanly; every container's store.Open succeeded under load.
What this run does not cover
Being honest about scope is part of the methodology:
- Team-sync propagation. L3 added canon directly to dick's container rather than having Tom push and Dick pull. A separate scenario should exercise hydrate team push/pull once a team is initialised on the apex containers.
- hydrate timeline. Gated behind --experimental and depends on a siteengine endpoint not yet GA. Not in scope until the server side lands.
- Tier substitution. These tests run on Haiku 4.5 only. To validate the "Haiku with memory matches Sonnet without" claim, the same scenarios need a Sonnet 4.6 baseline with no injection for comparison. That's a benchmark, not a smoke test.
- hydrate ingest --source=claude-export and --source=chatgpt-export. Covered by unit tests and fixtures but not yet in the six live scenarios.
Recommended cadence
| When | What to run | Cost | Time |
|---|---|---|---|
| Every change to hook / store / canon / decisions / ingest / areas | Layer 1 only (./run.sh) | $0.00 | ~30s |
| Before a release tag or PR merge | Both layers (./run.sh --with-claude) | ~$0.03 | ~2 min |
| Nightly on main (CI) | Both layers | ~$1/month | ~2 min |
The monthly CI cost of verifying every critical path against a live model is under a dollar. That number will climb as the scenario count grows, but it scales with the number of distinct features, not with the number of developers or the size of the fact store.
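The under-a-dollar figure follows from the per-run cost in the table. A small sketch of the arithmetic, assuming one nightly run and a 30-day month; `monthly_cost` is a hypothetical helper.

```shell
# $1 = cost per both-layer run in dollars, $2 = runs per month
monthly_cost() {
  awk -v per_run="$1" -v runs="$2" 'BEGIN { printf "%.2f\n", per_run * runs }'
}

# One nightly run at roughly $0.03 each, over an assumed 30-day month.
monthly_cost 0.03 30
```

Adding scenarios raises the per-run figure; adding developers or facts does not, since neither changes how many model calls the suite makes.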