
The Harness

OpenAI shipped Symphony — a daemon that monitors your issue tracker and deploys agents to close tickets. The README says it works best in codebases that have adopted harness engineering. So you click the link. Then you find the Ralph citation. Then it gets interesting.


OpenAI quietly dropped Symphony on GitHub this week. It's a long-running daemon that monitors your Linear board for work, creates an isolated workspace per issue, spawns a Codex agent, streams back proof of work — CI status, PR review, walkthrough video — and lands the PR when accepted. Engineers don't supervise individual agent runs. They manage the work.

One line in the README:

"Symphony works best in codebases that have adopted harness engineering."

So you click the link.

Harness Engineering is a five-month autopsy. One product, zero manually-written lines of code, roughly a million lines shipped, three engineers growing to seven, ~1,500 pull requests. 3.5 PRs per engineer per day. Throughput increased as the team grew — which is the number that matters. Compounding is real when the architecture is right.

Halfway through the article, they link to the Ralph Wiggum Loop.

If you've been following Forgeloop-kit, you know the Ralph loop is the architecture we've been running since January 2026 — before the harness engineering post existed. The same pattern: git-driven task routing, agent execution, proof-of-work verification, loop. OpenAI's team found it independently, ran it internally for five months, then published the manual. Symphony is what they built on top of it.

The proof that the pattern works isn't a blog post. It's the system publishing this one.

What they actually built

What they were actually building — underneath the product, through the product — was an environment capable of building products.

Five months of figuring out what breaks when humans stop writing code and start designing the environment in which agents write code. What does the repo need to look like so an agent can navigate it? What does the observability stack need to look like so an agent can debug it? What does the test suite need to look like so an agent can verify its own work?

Those are different questions than "what should this feature do." The shift is from builder to harness engineer.

The line from the post worth tattooing somewhere:

"what changes when a software engineering team's primary job is no longer to write code, but to design environments, specify intent, and build feedback loops that allow agents to do reliable work."

That's the job description for whatever comes next. I wrote about it from the principal/labor angle — this is the same thing from the inside of the codebase.

The AGENTS.md problem they hit first

Early in the experiment, they tried the obvious approach: one big AGENTS.md file with everything the agent needs to know. It failed in predictable ways.

Context crowding: a giant instruction file leaves no room for the task, the code, or the relevant docs. Agents pattern-match locally instead of navigating intentionally. Rules rot instantly, because humans stop maintaining a monolithic manual and agents can't tell what's still true. And none of it is mechanically verifiable.

Their fix: AGENTS.md as table of contents, ~100 lines, pointers to a structured docs/ directory. The knowledge lives in the repo; the file just maps it.
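A sketch of what that table-of-contents shape could look like. The file names and section headings here are illustrative, not taken from OpenAI's repo:

```markdown
# AGENTS.md — a map, not a manual

## Orientation
- Architecture overview: docs/architecture.md
- Service boundaries and ownership: docs/services.md

## Working here
- How to run the app and tests: docs/dev-loop.md
- Verification checklist before any PR: docs/verification.md

## Conventions
- Code style and review norms: docs/conventions.md
- Observability: what to query and how: docs/observability.md
```

Each pointer is a cheap read for the agent; the expensive detail lives in files it only loads when the task demands it.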

SkDD (Skills-Driven Development) figured this out from a different angle. Skills are modular by design — discrete SKILL.md files, each doing one thing, each maintainable independently. The equivalent of their docs/ structure, except skill-shaped rather than documentation-shaped. Same principle: the repo's knowledge should be navigable, not monolithic.

If your AGENTS.md is over 200 lines, it's already a liability.

What the bottleneck actually is

Code throughput isn't the bottleneck. It hasn't been for a while.

OpenAI's team hit this fast. Once Codex was reliably shipping PRs, the constraint became human QA capacity. Their response was to make everything the agent needs for verification directly legible to the agent — per-worktree app instances, Chrome DevTools Protocol wired into the agent runtime, ephemeral observability stacks (logs, metrics, traces) per worktree.

Prompts like "ensure service startup completes in under 800ms" become tractable when the agent can actually query startup metrics. Prompts like "no span in these four critical user journeys exceeds two seconds" become tractable when the agent has PromQL access.
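A minimal sketch of what "tractable" means here: the agent turns the prompt's budget into a pass/fail check it can run itself. The metric name, verdict shape, and sample PromQL are invented for illustration, not Symphony's actual interface:

```python
# Hypothetical verification step: the agent has queried its per-worktree
# metrics store and now converts "startup under 800ms" into a
# machine-readable verdict it can attach as proof of work.

def verify_startup_budget(samples_ms: list[float], budget_ms: float = 800.0) -> dict:
    """Check observed startup durations against a latency budget."""
    worst = max(samples_ms)
    return {
        "check": "startup_duration_ms",   # illustrative metric name
        "budget_ms": budget_ms,
        "worst_observed_ms": worst,
        "passed": worst <= budget_ms,
    }

# In a real harness, samples_ms would come from a PromQL query such as
# max_over_time(startup_duration_ms[5m]) against the worktree's
# ephemeral Prometheus instance.
verdict = verify_startup_budget([412.0, 530.5, 791.2])
```

The point isn't the check itself; it's that the verdict is structured data the loop can act on without a human reading logs.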

This is the depth-first principle the post describes: when something fails, the fix is never "try harder." The question is always "what capability is missing, and how do we make it legible to the agent?" You build the capability, you make it accessible, and the agent uses it.

The loop doesn't stall on hard tasks. It stalls on tasks where the environment hasn't been instrumented for that kind of work.

What's the same and what's different

Forgeloop's architecture maps almost exactly. git sync → task routing → plan → build → verify → push. The Ralph loop is the same loop they're describing — agents operate on discrete tasks, commit proof of work, loop. The difference is surface area: OpenAI had three full-time engineers and five months. Forgeloop is portable, designed to install into any repo and run from day one.
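The loop's shape, with every stage stubbed out. Function names are illustrative; Forgeloop's real entry points may differ:

```python
# Sketch of a Ralph-style loop: sync work in, run each task through
# plan → build → verify, and only push what passed verification.

def sync():             # git sync: pull latest main and task files
    return ["task-1"]

def plan(task):         # agent drafts an approach for the task
    return f"plan for {task}"

def build(task, plan):  # agent implements against the plan
    return f"diff for {task}"

def verify(diff):       # run tests and checks: proof of work, not vibes
    return True

def push(diff):         # commit and open a PR only if verification passed
    return f"PR({diff})"

def ralph_loop():
    landed = []
    for task in sync():                  # git sync → task routing
        p = task_plan = plan(task)       # plan
        diff = build(task, task_plan)    # build
        if verify(diff):                 # verify
            landed.append(push(diff))    # push
    return landed
```

Everything interesting lives inside `verify`: the harness determines whether that stub is a real gate or a rubber stamp.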

One concrete difference worth naming: Symphony is built around Linear as the work queue. Forgeloop uses GitHub Projects or local markdown files — IMPLEMENTATION_PLAN.md, REQUESTS.md. No Linear dependency, no external SaaS required. For teams already living in GitHub, that's the right default. The primitives are different; the pattern is identical.

The patterns that transferred cleanly: skills-driven modular knowledge, task-driven execution (not conversation-driven), agent-to-agent review, human time as the genuinely scarce resource.

The patterns they went deeper on: per-worktree isolation for parallel agent runs, UI legibility via DevTools Protocol, ephemeral observability. Those aren't in Forgeloop-kit yet. They're in the roadmap — and now there's a public spec to build against.

The pattern I haven't seen anyone talk about clearly enough: when the bottleneck shifts to verification, the harness becomes the product. Not the code. The environment that makes the code verifiable.

Symphony: one level up

Symphony is what harness engineering makes possible.

Once the repo is designed for agents to navigate — docs structured, verification automated, observability legible — Symphony is the daemon that removes the last manual step: a developer opening their laptop and kicking off a run. It monitors Linear, creates a workspace per issue, spawns Codex, and lands PRs. The loop doesn't wait for you. It runs while you sleep.

The policy lives in WORKFLOW.md, versioned with the code, loaded per run. Same principle as AGENTS.md-as-map: the repo owns its own operating instructions, and those instructions evolve with the codebase.
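A tiny illustration of "versioned with the code, loaded per run" under stated assumptions — the path and fallback behavior are mine, not Symphony's documented interface:

```python
# Hypothetical: the daemon reads the repo's own WORKFLOW.md at the start
# of every run, so policy changes land through the same PR flow as code.
from pathlib import Path

def load_policy(repo_root: str) -> str:
    """Return the repo's operating policy, or empty if none is checked in."""
    policy_file = Path(repo_root) / "WORKFLOW.md"
    if not policy_file.exists():
        return ""  # no policy in this repo; daemon falls back to defaults
    return policy_file.read_text()
```

No config server, no dashboard: reviewing a policy change is reviewing a diff.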

One detail the README buries: it says you can tell your coding agent to implement Symphony from the spec. The tool designed to run agents is itself designed to be built by an agent. That's not a cute recursion — it's the methodology validating itself. If your repo is harness-engineered, building Symphony becomes a tractable agent task.

Symphony is marked "low-key engineering preview for trusted environments." Not production-ready for all teams yet. But the spec is public and the reference implementation exists. The target is visible.

The stack nobody's named yet

SkDD, harness engineering, and Symphony aren't competing methodologies. They're sequential prerequisites.

SkDD is the knowledge layer. Agents forge skills as they build. The repo accumulates reusable capabilities. Every session leaves something behind — callable, discoverable, composable.

Harness engineering is the environment layer. The repo is designed for agents to navigate. AGENTS.md is a map. Observability is queryable by agents. Verification is automated. The environment makes the work tractable.

Symphony is the orchestration layer. A daemon reads the work queue, dispatches agents per issue, collects proof of work, lands PRs. Humans manage work, not agents.

You can't run Symphony on an unharnessed repo — it just produces faster chaos. You can't build a good harness without the knowledge primitives to populate it. The order matters. Start with the knowledge layer, build the environment, and the orchestration becomes possible.

Forgeloop-kit covers the middle layer — a portable harness that teams can install from day one, without a three-engineer team, without five months of runway, without a Linear subscription. It was running before OpenAI published the manual. Symphony is the next rung up. The distance between where most repos are and where they need to be is the work.

What to take from this if you're building

The post buries its strongest points. What actually compounds:

Start with the harness, not the tasks. Before you feed your agent a task list, ask what the repo needs to look like for an agent to navigate it. What's in AGENTS.md? What docs exist? What can the agent run to verify its work? The quality of the harness is the ceiling on agent throughput.

Make AGENTS.md a map, not a manual. ~100 lines. Pointers. Let the repo's structure carry the rest. If you're tempted to write everything into one file, write a skill instead.

Instrument for agent access, not human readability. Logs, metrics, UI state — if an agent can query it, the agent can verify it. Build it once, use it on every task. That's the multiplier.

The bottleneck is never generation. Your agent isn't too slow. Your environment isn't legible enough. Debug the harness before debugging the model.

Depth-first task breakdown is the only kind that compounds. When a task fails, find the missing capability. Build it. Make it legible. Now the whole class of tasks is unblocked. Width-first gets coverage. Depth-first gets compounding.

OpenAI ran a five-month experiment to confirm what the Ralph loop has been doing in production. Then they shipped Symphony as the next rung. The question now is whether your repo is ready for it.

The stack is: SkDD → harness → Symphony. The order matters.


Forgeloop-kit: forgeloop.zakelfassi.com
SkDD: github.com/zakelfassi/skills-driven-development
Symphony: github.com/openai/symphony


About the Author

Zak El Fassi

Builder · Founder · Systems engineer
