This post stakes a position: Opus 4.5 changed what orchestration means. Previous harnesses worked around model weaknesses. helix builds on model strengths.
A typical criticism of agents is that they "forget everything between sessions."
And as it goes with these arguments, the word "forget" is load-bearing. It implies a retention mechanism that failed. But agents don't forget—there's no architecture for remembering. Each session starts blank not because memories faded, but because no one built the filing system.
helix is that filing system. This post is my contribution to the discourse on agent memory—specifically, the claim that feedback is what separates storage from learning. ftl persisted artifacts. helix tracks which artifacts actually helped. The difference compounds.
I've been building with Opus 4.5 since its release. Before Opus 4.5, agentic harnesses focused on working around the two worst tendencies of coding LLMs: scope creep and over-engineering. Agents felt like overeager junior-savants that had to be carefully steered whenever projects became even moderately complex.
Opus 4.5 broke this pattern. The model capability is there—agents can be genuine collaborators. What was missing was architecture that let collaboration compound across sessions. helix provides that architecture, built to leverage what Opus 4.5 makes possible rather than compensate for what earlier models lacked.
Six principles, each diagnosing a specific failure mode:
| Principle | The Failure It Prevents |
|---|---|
| Feedback closes the loop | Memory accumulation without learning |
| Verify first | Work that can't prove it succeeded |
| Bounded scope | Unauditable agent modifications |
| Present over future | Premature abstraction, over-engineering |
| Edit over create | File proliferation, duplicate logic |
| Blocking is success | Token spiral on unsolvable problems |
The word "feedback" in the first principle names the mechanism most agent memory systems lack. ftl persisted artifacts—each task left traces that survived. But persistence is not learning. A filing cabinet that grows larger is not getting smarter.
helix closes the loop through verification-based feedback. When memory is injected into a task, the system tracks whether it helped—but the tracking is incorruptible. The Builder doesn't self-report what was useful. Instead, the orchestrator compares what was injected against verification outcomes. A memory claimed as "utilized" but followed by a failed verification? That's a failed memory, regardless of what the Builder reported.
I believe this is the critical difference between storage and learning: tracking not what exists, but what worked. Most agent memory systems fail here—they optimize retrieval relevance when they should optimize retrieval usefulness. And when feedback depends on self-reporting, agents learn to perform helpfulness rather than achieve it.
/helix <objective>
│
▼
┌─────────────────────────────────────┐
│ EXPLORER SWARM (haiku, parallel) │
│ One agent per scope: │
│ structure │ patterns │ memory │ targets │
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ PLANNER (opus) │
│ TaskCreate → Dependencies → DAG │
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ BUILDER(S) (opus, parallel) │
│ Read → Implement → Verify │
│ TaskUpdate(helix_outcome) │
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ ORCHESTRATOR JUDGMENT │
│ Compare verification vs. claims │
│ feedback() based on outcomes │
└─────────────────────────────────────┘
Three agents. Explorer swarm gathers context in parallel. Planner decomposes objectives into a native Task DAG. Builders execute within strict constraints. Learning extraction is orchestrator judgment—not a separate agent that can be gamed.
Memory flows through the entire pipeline:
recall() → inject → verify → feedback() → store()
The Explorer queries memory before planning. Builders receive injected memories and work within delta constraints. Verification determines success—not self-assessment. The orchestrator extracts learning from outcomes, comparing what was injected against what verification proved useful.
What happens when a Builder claims a memory was utilized but verification fails? The feedback loop records a failure for that memory. What happens when verification succeeds without utilizing an injected memory? The orchestrator notices—that memory's effectiveness score drops. The system learns which context actually helps, and the signal is incorruptible because it flows from verification outcomes, not agent claims.
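To make that judgment concrete, here's a minimal sketch of the three cases as a pure function. The fourth combination (memory not claimed, verification failed) isn't specified above, so it's treated as no signal:

```python
def judge_memory(verification_passed: bool, claimed_utilized: bool) -> str:
    """Orchestrator judgment: both inputs come from the orchestrator, not the Builder."""
    if claimed_utilized and verification_passed:
        return "helped"       # claim backed by a passing verification
    if claimed_utilized and not verification_passed:
        return "failed"       # claimed useful, yet the task failed verification
    if verification_passed and not claimed_utilized:
        return "failed"       # task succeeded without it, so effectiveness drops
    return "no_signal"        # not claimed and task failed: no update (assumption)
```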
At one extreme: a single monolithic agent doing everything. Maximum context, minimum coordination overhead. But also maximum scope creep, maximum token waste when things go sideways.
At the other extreme: dozens of specialized microservices, each responsible for one atomic operation. Minimal blast radius, but coordination overhead dominates. The orchestrator becomes more complex than the work.
helix sits between: three agents, distinct capabilities, explicit constraints.
| Agent | Model | Role | Execution |
|---|---|---|---|
| Explorer | Haiku | Codebase reconnaissance | Parallel swarm (one per scope) |
| Planner | Opus | Task DAG decomposition | Sequential |
| Builder | Opus | Execution within constraints | Parallel (by dependency) |
The division is deliberate.
Explorer Swarm runs on Haiku for cost efficiency, with one agent per reconnaissance scope: structure, patterns, memory, targets. Parallel execution means reconnaissance completes faster than sequential querying. Each explorer has a focused mission—no scope creep, no coordination overhead within the swarm.
Planner runs on Opus with no budget limit. Planning is highest-leverage—a poor plan wastes every downstream tool call. The Planner decomposes complex objectives into focused tasks using Claude Code's native task system: TaskCreate for each task, TaskUpdate to set dependencies. Humans see progress via Ctrl+T.
Builder runs on Opus with a tight budget (5-9 tool calls per task). This constraint prevents token spiral. Can't complete within budget? Block with what was tried. The budget forces focus: read delta files, implement, verify. Builders execute in parallel when dependencies allow, with outcomes recorded in task metadata (helix_outcome: delivered or helix_outcome: blocked).
What about learning extraction? Previous versions of helix used a separate Observer agent. The failure mode: an agent extracting lessons from its own work is an agent that can game its own feedback. The Observer could claim patterns were useful when they weren't, could extract "lessons" that confirmed its biases.
In helix v2.0, learning extraction is orchestrator judgment—not a separate agent. The orchestrator compares verification outcomes against injected memories. No self-reporting. No gameability. The system learns from what actually worked, verified by tests and commands, not from what an agent claimed was helpful.
The Planner doesn't create a list. It creates a directed acyclic graph with explicit dependencies, registered directly in Claude Code's native task system.
TaskCreate("001: spec-auth-models")
TaskCreate("002: spec-auth-tests")
TaskCreate("003: impl-auth-service")
TaskUpdate(002, blockedBy=[001])
TaskUpdate(003, blockedBy=[001, 002])
Result:
001: spec-auth-models ─┬─→ 002: spec-auth-tests ─┬─→ 003: impl-auth-service
                       │                         │
                       └─────────────────────────┘
Tasks are visible via Ctrl+T. Humans see progress as the pipeline executes. No custom UI, no separate dashboard—the native task system provides visibility.
Each task carries metadata:
| Field | Purpose |
|---|---|
| `subject` | Execution order + human-readable name ("001: spec-auth-models") |
| `description` | What this task accomplishes |
| `blockedBy` | Tasks that must complete first |
| `metadata.delta` | Files this task may modify (strict constraint) |
| `metadata.verify` | Command to verify completion |
| `metadata.budget` | Tool calls allocated (5-9) |
| `metadata.helix_outcome` | Result: delivered or blocked |
What happens when task 002 fails? The DAG structure means task 003 blocks until its dependencies resolve—but unrelated tasks can proceed. Failures are contained to their branch. A blocked task doesn't poison the entire objective.
The helix_outcome field is critical. When a Builder completes, it updates the task: TaskUpdate(taskId, metadata={helix_outcome: "delivered"}) or helix_outcome: "blocked". The orchestrator reads these outcomes to drive feedback—comparing verification success against memory utilization claims.
This is where helix diverges most from ftl. Not storage—a learning system with effectiveness tracking, decay, and incorruptible feedback.
Everything lives in a single SQLite database at .helix/helix.db. No scattered JSON files, no complex hierarchies.
| Table | Purpose |
|---|---|
| `memory` | Failures, patterns, and systemics with embeddings |
| `memory_edge` | Relationships between memories |
Memories are stored with 384-dimensional embeddings for semantic search. Our brains did not evolve to visualize a 384-dimensional space any more than they evolved to visualize the distance to Andromeda. But similarity in that space captures meaning in ways keyword matching never will. Two memories about "authentication failing silently" and "auth errors swallowed without logging" cluster together—zero shared keywords.
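For a rough feel of what this looks like, here's a sketch using sentence-transformers. helix's embedding model isn't named here; all-MiniLM-L6-v2 is just a common 384-dimensional choice used for illustration:

```python
from numpy import dot
from numpy.linalg import norm
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # 384-dimensional embeddings
a, b = model.encode([
    "authentication failing silently",
    "auth errors swallowed without logging",
])
# Cosine similarity is high despite zero shared keywords.
print(f"similarity: {dot(a, b) / (norm(a) * norm(b)):.2f}")
```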
| Type | Purpose | Source |
|---|---|---|
| failure | What went wrong and how to avoid it | Blocked tasks |
| pattern | What worked well and should be repeated | Delivered tasks (SOAR chunking) |
| systemic | Recurring issues requiring architectural attention | 3+ occurrences of similar failures |
The third type is new. When the same failure appears three or more times, the system promotes it to systemic. Systemic memories surface with higher priority and include a note: "This has occurred 3+ times. Consider architectural changes." The system doesn't just learn from individual failures—it detects patterns of failure that suggest deeper issues.
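A sketch of the promotion rule. The only number fixed above is the 3+ occurrence count; the 0.8 similarity threshold, the field names, and the unit-normalized embeddings are assumptions for illustration:

```python
import numpy as np

def promote_systemic(failures, threshold=0.8, min_occurrences=3):
    """failures: dicts with 'id' and a unit-normalized 'embedding'."""
    systemic, seen = [], set()
    for f in failures:
        cluster = [g["id"] for g in failures
                   if float(np.dot(f["embedding"], g["embedding"])) >= threshold]
        if len(cluster) >= min_occurrences and not seen.intersection(cluster):
            seen.update(cluster)
            systemic.append({
                "ids": cluster,
                "note": "This has occurred 3+ times. Consider architectural changes.",
            })
    return systemic
```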
Ranking is not just semantic relevance:
score = (0.5 × relevance) + (0.3 × effectiveness) + (0.2 × recency)
Where:
- `relevance`: semantic similarity between the query and the memory's embedding
- `effectiveness`: `helped / (helped + failed)`, defaulting to 0.5 when there's no feedback yet
- `recency`: `2^(-days_since_use / 7)` (ACT-R decay)

The ACT-R decay comes from cognitive architecture research. Memories that haven't been used recently fade, matching how human memory works. A memory that helped six months ago but hasn't been touched since ranks lower than one used yesterday.
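As a runnable sketch of that formula (the helped, failed, and last_used field names are illustrative):

```python
import time

def score(memory, relevance, now=None):
    """Rank a recalled memory: 0.5*relevance + 0.3*effectiveness + 0.2*recency."""
    now = now or time.time()
    helped, failed = memory.get("helped", 0), memory.get("failed", 0)
    effectiveness = helped / (helped + failed) if (helped + failed) else 0.5
    days_since_use = (now - memory["last_used"]) / 86400
    recency = 2 ** (-days_since_use / 7)   # ACT-R-style decay, one-week half-life
    return 0.5 * relevance + 0.3 * effectiveness + 0.2 * recency
```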
| Primitive | Purpose |
|---|---|
| `store` | Add a new failure or pattern |
| `recall` | Query memories by semantic similarity |
| `get` | Retrieve a specific memory by ID |
| `edge` | Create a relationship between memories |
| `edges` | List relationships for a memory |
| `feedback` | Update helped/failed counts based on outcomes |
| `decay` | Find dormant memories for review |
| `prune` | Remove memories with effectiveness < 0.25 |
| `health` | Report memory system status |
Nine operations. No more, no less. The constraint is deliberate—complexity in the memory system becomes complexity in every agent that uses it.
The critical difference from self-reported feedback:
# Old approach (corruptible):
# Builder self-reports: "I used memory-1 and memory-2"
# System trusts the report
# New approach (incorruptible):
# 1. Memory injected before task
# 2. Builder works, verification runs
# 3. Orchestrator compares:
# - Verification passed + memory claimed utilized → helped++
# - Verification failed + memory claimed utilized → failed++
# - Verification passed + memory NOT utilized → failed++ (didn't help)
The Builder still reports which memories it utilized. But the feedback loop validates those claims against verification outcomes. A memory can't accumulate "helped" counts if the tasks that claimed to use it keep failing verification.
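Persisting that judgment is a small update against the memory table. A sketch, assuming helped and failed counter columns (the actual schema isn't shown here):

```python
import sqlite3

def record_feedback(db_path: str, memory_id: str, helped: bool) -> None:
    column = "helped" if helped else "failed"   # chosen by orchestrator judgment
    with sqlite3.connect(db_path) as conn:
        conn.execute(
            f"UPDATE memory SET {column} = {column} + 1 WHERE id = ?",
            (memory_id,),
        )

# e.g. record_feedback(".helix/helix.db", "circular-import-auth",
#                      helped=verification_passed and claimed_utilized)
```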
Memory recall uses two parallel queries: one by semantic similarity to the objective, and one by overlap with the files in the task's delta.
The union surfaces context that pure embedding similarity might miss. A failure about "circular imports in auth.py" might not rank high for "add user endpoints"—but it will surface if auth.py is in the delta.
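A sketch of that union, assuming unit-normalized embeddings and that a delta match means the memory's content mentions one of the delta's file paths:

```python
import numpy as np

def recall(objective_emb, delta_files, memories, limit=5):
    # Pass 1: top matches by embedding similarity to the objective.
    ranked = sorted(memories,
                    key=lambda m: float(np.dot(objective_emb, m["embedding"])),
                    reverse=True)[:limit]
    # Pass 2: anything that mentions a file in the task's delta.
    by_delta = [m for m in memories
                if any(path in m["content"] for path in delta_files)]
    # Union, deduplicated by id: delta hits surface even with low similarity.
    merged = {m["id"]: m for m in ranked + by_delta}
    return list(merged.values())
```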
When memories are injected, they include effectiveness signals:
FAILURES TO AVOID:
- circular-import-auth [75%]: When importing auth models...
- pydantic-v1-syntax [unproven]: Using deprecated validator...
[75%] means this memory has helped 75% of the time it was utilized. [unproven] means no feedback data yet—the memory is untested.
Builders can calibrate trust. A 90% effective memory is nearly certain to help. A 40% memory might be outdated or poorly scoped. An unproven memory is a hypothesis worth testing.
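Rendering those signals is simple. A sketch that reproduces the format above, with illustrative field names:

```python
def render_failures(memories):
    lines = ["FAILURES TO AVOID:"]
    for m in memories:
        total = m["helped"] + m["failed"]
        tag = f"{m['helped'] / total:.0%}" if total else "unproven"
        lines.append(f"- {m['id']} [{tag}]: {m['summary']}")
    return "\n".join(lines)

print(render_failures([
    {"id": "circular-import-auth", "helped": 3, "failed": 1,
     "summary": "When importing auth models..."},
    {"id": "pydantic-v1-syntax", "helped": 0, "failed": 0,
     "summary": "Using deprecated validator..."},
]))
```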
Memories don't exist in isolation. The system tracks edges:
| Type | Meaning |
|---|---|
| `co_occurs` | These failures tend to appear together |
| `causes` | This failure leads to that failure |
| `solves` | This pattern resolves that failure |
| `similar` | These memories are semantically close |
What happens when a new failure is stored? The edges() traversal explores its neighborhood—if this failure appears, what related failures should we watch for? What patterns have solved it before? The graph surfaces context that pure embedding similarity misses.
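A sketch of that one-hop lookup against memory_edge, with assumed column names (src, dst, type):

```python
import sqlite3

def neighborhood(db_path, memory_id):
    """One-hop edges touching a memory: what co-occurs, what causes it, what solves it."""
    with sqlite3.connect(db_path) as conn:
        return conn.execute(
            "SELECT src, dst, type FROM memory_edge WHERE src = ? OR dst = ?",
            (memory_id, memory_id),
        ).fetchall()
```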
helix v2.0 is prose-driven. The logic lives in SKILL.md files. Python utilities provide muscle—embeddings, database operations, scoring calculations—but the orchestration logic is expressed in prose that Opus 4.5 can follow directly.
Why does this work now when it wouldn't have before?
The architecture assumes model capability that didn't exist before Opus 4.5. Previous harnesses needed rigid scaffolding because the models needed rigid scaffolding. Complex state machines, explicit transition rules, error recovery logic—all compensating for models that couldn't reliably follow nuanced instructions.
Opus 4.5 can follow nuanced instructions. A prose specification that says "after three failed attempts with similar approaches, STOP and analyze" actually works. The model exercises judgment. The architecture can be simpler because the model is capable enough to handle complexity in prose rather than requiring complexity in code.
The single source of truth is TaskList metadata. The native task system tracks task status, dependencies, and the helix metadata on each task: delta, verify, budget, and helix_outcome.
No custom tables for plan or workspace state. The native system provides visibility (Ctrl+T), persistence, and a stable API. helix builds on Claude Code rather than alongside it.
This matters for maintenance. Previous versions maintained separate plan and workspace tables that could drift out of sync with reality. Native integration means one source of truth—what TaskList says is what exists.
| Command | Purpose |
|---|---|
| `/helix <objective>` | Full pipeline: explore → plan → build → learn |
| `/helix-query "topic"` | Search memory by semantic similarity |
| `/helix-stats` | Memory health metrics and feedback loop status |
The instinct is to keep trying. An agent that gives up feels like failure.
The word "failure" names the wrong thing here. A blocked task with clear documentation is information. An agent that spirals for 100k tokens is waste. The confidence to escalate—"this is beyond what I can solve within constraints, here's what I tried"—is a feature.
When a task goes sideways, the Builder has hard constraints: 5-9 tools max, delta scope enforced. If it hasn't solved the problem within budget, it's exploring, not debugging. At that point, or after hitting the same error three times, the Builder blocks.
The task metadata records what was tried:
TaskUpdate(taskId, metadata={
helix_outcome: "blocked",
blocked_reason: "Need to modify src/main.py but it's not in delta",
tried: "Implemented auth service in src/services/auth.py",
error: "Cannot import auth routes without modifying main.py"
})
The metacognition check is explicit: after three failed attempts with similar approaches, stop and analyze. Is there a fundamentally different approach? Is the task mis-scoped? Is information missing?
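As a sketch, the guard is simple. The thresholds come from the constraints above; "similar approaches" is reduced here to exact error repetition:

```python
def should_block(attempts, tools_used, budget):
    """attempts: list of dicts with an 'error' field, oldest first."""
    if tools_used >= budget:
        return True, "tool budget exhausted"
    errors = [a["error"] for a in attempts]
    if len(errors) >= 3 and len(set(errors[-3:])) == 1:
        return True, "same error three times; stop and analyze"
    return False, ""
```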
Blocked tasks feed the learning system. Every block becomes a potential failure memory—a lesson for future tasks. But critically, the learning happens through orchestrator judgment, not Builder self-assessment. The Builder reports what happened. The orchestrator decides what to learn.
Put frankly: a blocked task with good documentation is a contribution. Token spiral is not.
Use helix when:
- The work is ongoing: a project or codebase you'll return to across sessions
- The objective is complex enough to decompose into multiple dependent tasks
- You want memory to compound, so each session benefits from the last

Skip helix when:
- The task is a one-off fix or a quick script debug
- The orchestration overhead would outweigh any compounding return
I reach for helix when the work will matter again tomorrow. One-off script debugging doesn't need orchestration overhead. Infrastructure that will evolve over months—the memory system pays dividends every session.
The overhead is real. Explorer swarm, Planner, Builder(s), orchestrator judgment—that's more tokens than asking Claude Code directly. The value proposition is compounding returns: each session makes the next session smarter. For isolated tasks, the overhead doesn't pay back. For ongoing projects, the investment compounds.
# Add the crinzo-plugins marketplace
claude plugin marketplace add https://github.com/enzokro/crinzo-plugins
# Install helix
claude plugin install helix@crinzo-plugins
Or from inside Claude Code:
/plugin marketplace add https://github.com/enzokro/crinzo-plugins
/plugin install helix@crinzo-plugins
Context loss is an architecture problem, not a capability problem. Opus 4.5 proved that models can be genuine collaborators. What was missing was architecture that let collaboration compound.
helix builds on ftl with two crucial additions: verification-based feedback and native integration with Claude Code's task system. Memory accumulation is not learning. A system that tracks which memories actually helped—verified by test outcomes, not agent claims—is a system that improves.
The shift from v1 to v2 mirrors the shift Opus 4.5 enabled in the broader ecosystem. Previous harnesses compensated for model weaknesses with rigid scaffolding. helix v2 assumes model strength and builds accordingly: prose-driven logic, incorruptible feedback, parallel execution where dependencies allow.
In short: memory without verification is self-delusion. Memory with verification is learning.
The models are ready. The architecture is learning. We're building toward agents that genuinely compound knowledge across sessions—not filing cabinets that grow larger, but collaborators that grow smarter.