
The autoresearch loop, generalised to software

Karpathy's autoresearch proved one idea cleanly: let an agent loop autonomously overnight, but only keep work that clears an objective gate. Hydrate's Design and Develop modes run that same discipline for software, across vendors, with memory and an audit trail underneath.

autoresearch is not a memory tool, and not a Hydrate competitor. It is the clearest published example of the pattern Hydrate's orchestration is built on: bounded autonomy under a measurable gate. The interesting question is what that pattern looks like once you take it off a single GPU and into a real engineering team. That is what Hydrate's Design and Develop modes are.

autoresearch vs Hydrate · Design + Develop

| Dimension | karpathy/autoresearch | Hydrate · Design + Develop |
| --- | --- | --- |
| Shared idea | Autonomous loop with a keep/discard gate | Same pattern: autonomous rounds, gated before anything is accepted |
| The loop | Edit train.py → train 5 min → check metric → keep or discard → repeat | Design / Develop rounds → acceptance block + verify pass → accept or reject |
| What's gated | One held-out scalar (val_bpb) | A checkable acceptance block (definition-of-done) + goal-coverage contract |
| Agents | Single agent editing one file | Multi-agent, multi-vendor: Claude implements, Codex reviews and judges, Fable as fallback |
| Domain | LLM training on a single GPU | Software engineering across repositories |
| Memory | None | Runs on a shared, cross-runtime memory substrate |
| Audit trail | Logs results to disk for morning review | Durable, attributed audit trail of every round and decision |
| Human role | Review the logs in the morning | Supervisory sign-off mid-run (Design); integration gate (Develop) |
| Multi-agent swarms | Noted as a future direction, not implemented | The product: orchestration is the engine, not a roadmap note |
| Stack | Python + PyTorch + uv, NVIDIA GPU required | Single Go binary + SQLite, no GPU |
| Licence / maturity | MIT, open source | Commercial · Design proven, Develop live |

Facts verified against the repository (June 2026): MIT, tens of thousands of GitHub stars, single-agent, no memory or cross-tool layer, train.py is the only agent-editable file, gated on val_bpb.

What autoresearch is

A self-contained harness for autonomous ML experimentation. Three files: prepare.py (data prep, untouched), train.py (the model and training loop, the only file the agent may edit), and program.md (human-authored research directives). The agent edits train.py, trains for a fixed five-minute budget, checks whether validation bits-per-byte improved, keeps or discards the change, and repeats, logging outcomes for a human to review in the morning. The design is deliberately minimal and points at agent swarms as a future direction.
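
The keep/discard loop is small enough to sketch. What follows is a minimal illustration of the pattern, not autoresearch's actual code: `keep_discard_loop`, `propose` and `evaluate` are hypothetical names, and the toy metric stands in for "edit train.py, train for five minutes, read val_bpb". Lower is treated as better, as with bits-per-byte.

```python
import random
from typing import Callable


def keep_discard_loop(
    propose: Callable[[float], float],
    evaluate: Callable[[float], float],
    initial: float,
    rounds: int,
) -> tuple[float, float, list[dict]]:
    """Hill-climb on one scalar metric, keeping a change only if it improves.

    Hypothetical sketch: `propose` plays the role of the agent's edit and
    `evaluate` plays the role of a fixed-budget training run.
    """
    best = initial
    best_score = evaluate(best)
    log = []
    for i in range(rounds):
        candidate = propose(best)          # the agent proposes a change
        score = evaluate(candidate)        # measured, not self-reported
        kept = score < best_score          # the gate: strictly better or discard
        if kept:
            best, best_score = candidate, score
        log.append({"round": i, "score": score, "kept": kept})
    return best, best_score, log


# Toy stand-ins (assumptions for illustration only).
def toy_metric(x: float) -> float:
    return (x - 3.0) ** 2


rng = random.Random(0)


def toy_propose(x: float) -> float:
    return x + rng.uniform(-1.0, 1.0)


best, best_score, log = keep_discard_loop(toy_propose, toy_metric, 0.0, 50)
```

The log is what a human would read in the morning: every round is recorded, but only rounds that cleared the gate changed the kept state.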

The shared pattern, and what Hydrate adds

Both tools refuse to trust an agent's self-report. An agent may iterate without a human in the inner loop, but work is only accepted when it clears an explicit bar. autoresearch's bar is one number on a held-out split. Hydrate's bar is the acceptance block: a measurable definition-of-done plus a goal-coverage check, enforced by a verify pass in Develop mode before work is integrated.
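
To make the contrast with a single scalar concrete, here is one way such a gate could be modelled. This is a sketch under our own assumptions, not Hydrate's implementation or schema: `Check`, `AcceptanceBlock`, `uncovered_goals` and `verify` are hypothetical names. Work is accepted only if every stated goal is covered by at least one check and every check passes.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Check:
    name: str
    goal: str                   # the stated goal this check covers
    run: Callable[[], bool]     # returns True when the check passes


@dataclass
class AcceptanceBlock:
    """Hypothetical model of a definition-of-done plus goal coverage."""
    goals: list[str]
    checks: list[Check] = field(default_factory=list)

    def uncovered_goals(self) -> list[str]:
        covered = {c.goal for c in self.checks}
        return [g for g in self.goals if g not in covered]

    def verify(self) -> tuple[bool, list[str]]:
        # Collect every reason for rejection rather than stopping at the first.
        reasons = [f"goal not covered: {g}" for g in self.uncovered_goals()]
        reasons += [f"check failed: {c.name}" for c in self.checks if not c.run()]
        return (not reasons, reasons)


block = AcceptanceBlock(
    goals=["tests pass", "no lint errors"],
    checks=[Check("unit tests", "tests pass", lambda: True)],
)
accepted, reasons = block.verify()   # rejected: one goal has no check at all
```

The coverage step is the part a loss curve never needs: a passing check proves nothing about a goal that no check looks at.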

Everything else Hydrate brings is what that pattern needs to survive outside a single GPU. Multiple agents across vendors (Claude implements, Codex reviews and judges, Fable as fallback). A shared cross-runtime memory substrate, so agents inherit context instead of starting cold. A durable audit trail with attribution, and human sign-off at the points that matter. autoresearch has none of these, because optimising one model overnight does not need them. Coordinating an engineering epic does.
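
An attributed audit trail can be pictured as an append-only table of rounds and decisions. The sketch below uses Python's standard `sqlite3` module and an invented schema; the table name, columns and helper functions are assumptions for illustration and do not describe Hydrate's actual storage.

```python
import sqlite3


def open_audit_log(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a hypothetical audit table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS audit (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               round_no INTEGER NOT NULL,
               implementer TEXT NOT NULL,
               judge TEXT NOT NULL,
               decision TEXT NOT NULL CHECK (decision IN ('accept', 'reject')),
               reason TEXT NOT NULL,
               recorded_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
           )"""
    )
    return conn


def record(conn: sqlite3.Connection, round_no: int, implementer: str,
           judge: str, decision: str, reason: str) -> None:
    # `with conn` commits on success and rolls back on error.
    with conn:
        conn.execute(
            "INSERT INTO audit (round_no, implementer, judge, decision, reason) "
            "VALUES (?, ?, ?, ?, ?)",
            (round_no, implementer, judge, decision, reason),
        )


def history(conn: sqlite3.Connection) -> list[tuple]:
    return conn.execute(
        "SELECT round_no, implementer, judge, decision, reason "
        "FROM audit ORDER BY id"
    ).fetchall()


conn = open_audit_log()
record(conn, 1, "claude", "codex", "reject", "check failed: unit tests")
record(conn, 2, "claude", "codex", "accept", "all checks passed")
rows = history(conn)
```

Each row names who produced the work, who judged it, and why it was accepted or rejected, so a rejected round stays on the record instead of disappearing.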

Where autoresearch genuinely wins

ML training has one clean scalar metric to hill-climb: a candidate either lowers val_bpb on the held-out split or it does not. Software has no such number, so Hydrate's gate has to be a structured, checkable contract rather than a loss curve. We took the discipline, not the metric. For the record, Hydrate's acceptance block is, by our own design history, autoresearch-inspired. Borrowing the best idea in the category and saying so is more useful than pretending we invented it.