
The Runtime Nobody Talks About

The hard part of agentic AI is not the model or the prompt. It is the runtime infrastructure — the orchestration layer that makes agent execution reliable, observable, and governable at scale.


Every demo agent is a miracle. Every production agent is a distributed system.

This is the gap nobody wants to discuss, because discussing it means admitting that the hard part of agentic AI isn't the model. It isn't the prompt. It isn't the chain-of-thought reasoning or the tool definitions or the carefully curated system instructions. The hard part is the thing that holds all of it together while the world falls apart around it.

The runtime.

The Demo Delusion

A demo agent looks like this: a single model, a single turn, a handful of tools, a human watching. Latency doesn't matter because the audience is captivated. Failures don't matter because you can restart. Cost doesn't matter because it's a demo. There is no state to corrupt, no permission boundary to violate, no sub-agent to stall, no third-party API to time out, no token budget to blow.

A demo agent is a function call with better marketing.

Now ship that to 10,000 enterprise users hitting it simultaneously. Each with different data permissions. Each expecting sub-second responses. Each triggering multi-turn conversations where the agent decomposes their question into a plan, executes five to twelve steps involving search, data analysis, calendar lookups, and document retrieval, evaluates its own confidence, and adapts. Some of those steps spawn sub-agents that run in parallel. Each sub-agent calls a different model provider. Each provider has different latency characteristics, rate limits, and failure modes. Each tool call has side effects. Each step generates context that the next step depends on.

That's not a prompting problem. That's a distributed systems problem with a language model in the hot path.

I've watched teams go through this transition. The demo takes two weeks. The production hardening takes six months. And most of that six months is spent on problems that never appeared in the demo: what happens when the model provider has an outage mid-conversation? What happens when two sub-agents return contradictory results? What happens when the user's permissions change between step three and step seven of a twelve-step plan? What happens when the agent's plan is correct but the execution takes forty seconds and the user has already navigated away?

These aren't edge cases. They're Tuesday.

What Runtime Actually Means

When I say "runtime," I don't mean a framework. I don't mean LangChain or CrewAI or whatever orchestration library shipped this week. I mean the infrastructure layer that makes agent execution reliable, observable, and governable at scale. The thing that sits between "the model decided to call a tool" and "the tool was actually called, the result was validated, the token budget was checked, the permission boundary was enforced, the latency was measured, and the next step was dispatched."

This is plumbing. And the abstraction doesn't matter. The plumbing does.

A production agentic runtime has to solve at least six hard problems simultaneously.

The first is multi-turn state management. An agent conversation isn't stateless. It's a DAG of decisions, tool calls, intermediate results, and plan revisions that accumulates across turns. The runtime has to maintain this state coherently, including when the plan changes mid-execution, when sub-agents fork and rejoin, when memory from previous sessions needs retrieval. This isn't session storage. It's a transaction log for cognition.
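To make the "transaction log for cognition" idea concrete, here's a minimal sketch of an append-only event log whose current state is a fold over events. The event kinds (`plan`, `tool_result`) and the shape of the replayed state are hypothetical, chosen for illustration, not taken from any particular system:

```python
from dataclasses import dataclass, field
import time

@dataclass
class AgentEvent:
    """One entry in the agent's transaction log: a decision, tool call, or result."""
    kind: str            # e.g. "plan", "plan_revision", "tool_result"
    payload: dict
    parents: list = field(default_factory=list)  # event ids this one depends on (the DAG edges)
    ts: float = field(default_factory=time.time)

class TransactionLog:
    """Append-only log; state is always reconstructible, so replay and audit come for free."""
    def __init__(self):
        self.events: list[AgentEvent] = []

    def append(self, kind, payload, parents=()):
        self.events.append(AgentEvent(kind, payload, list(parents)))
        return len(self.events) - 1   # event id = index in the log

    def replay(self):
        """Rebuild the working state from scratch, e.g. after a crash or plan revision."""
        state = {"plan": None, "results": {}}
        for ev in self.events:
            if ev.kind in ("plan", "plan_revision"):
                state["plan"] = ev.payload
            elif ev.kind == "tool_result":
                state["results"][ev.payload["step"]] = ev.payload["value"]
        return state
```

Because state is derived rather than mutated in place, a plan revision mid-execution is just another event, and forked sub-agents become parallel branches in the `parents` DAG.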

The second is model routing under latency constraints. Not every step deserves the same model. A plan decomposition step might need a frontier reasoning model. A simple entity extraction might need something fast and cheap. A code generation step might need a specialized model. The runtime has to route each step to the right model, factor in current latency and availability from each provider, and keep the aggregate p95 under the SLO while maintaining conversational coherence across model boundaries.
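A routing policy like this can stay deliberately boring: a static table of preferences per task type, filtered by live latency data. The table, model names, and SLO numbers below are invented for illustration:

```python
# Hypothetical routing table: task types mapped to ordered model preferences.
ROUTES = {
    "plan":    ["frontier-reasoner", "mid-tier"],
    "extract": ["small-fast", "mid-tier"],
    "codegen": ["code-specialist", "frontier-reasoner"],
}

def route(task_type, p95_ms, slo_ms=2000):
    """Pick the first preferred model whose current p95 latency fits the SLO.
    p95_ms: live per-model latency measurements, fed in by a health checker."""
    for model in ROUTES.get(task_type, ["mid-tier"]):
        if p95_ms.get(model, float("inf")) <= slo_ms:
            return model
    return "small-fast"   # last resort: degrade quality, protect latency
```

The decision is deterministic given the same inputs, which is exactly what makes it debuggable when the aggregate p95 drifts.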

Third, token and tool budget enforcement. Left unchecked, an agent will consume unbounded resources. It will retry failed tool calls indefinitely. It will expand its plan when it should contract. It will call expensive models for trivial sub-tasks. The runtime needs hard budget enforcement, not as an afterthought, but as a first-class scheduling constraint. This is the backpressure problem: how do you throttle an agent that doesn't know it's expensive?
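One way to make the budget a first-class scheduling constraint is to check it before dispatch rather than after the fact, so an over-budget step never runs. A toy version, with made-up limits:

```python
class BudgetExceeded(Exception):
    pass

class Budget:
    """Hard token and tool-call budget, charged before each step is dispatched."""
    def __init__(self, max_tokens, max_tool_calls):
        self.tokens_left = max_tokens
        self.calls_left = max_tool_calls

    def charge(self, tokens=0, tool_calls=0):
        # Refuse the step up front -- this is the backpressure signal the
        # planner can react to (contract the plan, stop retrying).
        if tokens > self.tokens_left or tool_calls > self.calls_left:
            raise BudgetExceeded("step would exceed budget; contract the plan")
        self.tokens_left -= tokens
        self.calls_left -= tool_calls
```

The exception is the throttle: the agent doesn't know it's expensive, but the scheduler does, and the planner sees a structured refusal instead of silently unbounded retries.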

Fourth, fault isolation across sub-agents. When you decompose a complex query into parallel sub-agents (one searching documents, one analyzing data, one looking up people), you've created a distributed system. And distributed systems fail partially. The search sub-agent times out. The data analysis sub-agent returns a malformed result. The people lookup hits a rate limit. The runtime has to isolate these failures, decide which are recoverable, provide fallbacks for those that aren't, and synthesize a coherent response from incomplete information. Circuit breakers, bulkheads, graceful degradation. Patterns the distributed systems community solved twenty years ago. But now the "service" behind the circuit breaker is a language model that might hallucinate its error message.
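The circuit-breaker pattern translates almost directly; the only agent-specific twist is that "failure" must be judged by the runtime (timeout, malformed output), never by the model's own self-report. A minimal per-sub-agent breaker, with arbitrary threshold and cooldown values:

```python
import time

class CircuitBreaker:
    """After `threshold` consecutive failures, short-circuit calls to this
    sub-agent for `cooldown` seconds so the orchestrator uses a fallback."""
    def __init__(self, threshold=3, cooldown=30.0):
        self.threshold, self.cooldown = threshold, cooldown
        self.failures = 0
        self.opened_at = None

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        if self.opened_at is None:
            return True
        if now - self.opened_at >= self.cooldown:
            self.opened_at, self.failures = None, 0   # half-open: probe again
            return True
        return False

    def record(self, ok, now=None):
        if ok:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic() if now is None else now
```

One breaker per sub-agent gives you the bulkhead: a stalled document-search branch trips its own breaker without taking the data-analysis branch down with it.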

Fifth, permission-aware execution. In enterprise contexts, every piece of data has an access control list. The agent can see what the user can see, nothing more. But the agent's plan doesn't know about permissions until execution time. A plan step might call for retrieving a document the user can't access. The runtime has to enforce this at the tool-call level, in real-time, without leaking information about the existence of restricted resources. This isn't authorization bolted on top. It's authorization woven into the execution graph.
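The non-leakage requirement has a classic implementation: "not permitted" and "does not exist" must be indistinguishable from the caller's side. A toy document-fetch tool illustrating the idea (the store and ACL format are invented):

```python
class NotFound(Exception):
    """Raised identically for 'missing' and 'forbidden' -- no existence leak."""

# Toy document store with per-document ACLs.
DOCS = {"doc-1": {"acl": {"alice"}, "body": "q3 plan"}}

def fetch_document(doc_id, user):
    doc = DOCS.get(doc_id)
    # Enforce at tool-call time, not plan time: the plan may have been written
    # before the user's permissions changed (step three vs. step seven).
    if doc is None or user not in doc["acl"]:
        raise NotFound(doc_id)   # same error either way
    return doc["body"]
```

Because the check sits inside the tool boundary, every step of every plan passes through it, including steps the planner invented after permissions were last evaluated.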

Sixth, observability across the full execution chain. When a twelve-step agent chain produces a wrong answer, you need to understand why. Not just which step failed, but why the plan was structured that way, why the model chose that tool, why the intermediate result looked correct but led to a bad conclusion, why the self-reflection step didn't catch it. This is distributed tracing for cognitive processes. And it's harder than tracing microservices, because the "service" at each node is non-deterministic.
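Structurally this is ordinary span-tree tracing; what changes is what goes in the attributes (model choice, token counts, confidence, raw intermediate output), because the trace has to explain a decision, not just time a call. A bare-bones sketch, not tied to any tracing product:

```python
import time, uuid

class Trace:
    """Minimal span tree: every plan step, tool call, and reflection records a
    span with its parent, so a wrong answer can be walked back to its cause."""
    def __init__(self):
        self.spans = []

    def span(self, name, parent_id=None, **attrs):
        s = {"id": uuid.uuid4().hex, "parent": parent_id, "name": name,
             "attrs": attrs, "start": time.monotonic(), "end": None}
        self.spans.append(s)
        return s

    def end(self, s, **attrs):
        s["end"] = time.monotonic()
        s["attrs"].update(attrs)   # e.g. model, tokens, confidence, raw output

    def children(self, span_id):
        return [s for s in self.spans if s["parent"] == span_id]
```

In practice you'd ship these spans to an existing tracing backend; the point is that the plan step, not the HTTP request, is the unit of the trace.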

The Organizational Isomorphism

I've been building agent orchestration systems for two years. Before that, I spent five years at Meta managing a 70+ person global organization, building coordination infrastructure for platform partnerships across Messenger and Instagram.

The problems are the same.

When I started wiring up multi-agent systems, assigning specialized agents to sub-tasks, defining handoff protocols, building escalation paths, implementing quality checks, I realized I was rebuilding an org chart. Not metaphorically. Structurally. Reporting lines became routing rules. Communication protocols became context-passing interfaces. Performance reviews became output validation. The escalation path when a junior team member is stuck became the fallback strategy when a sub-agent fails.

Consider one concrete parallel: on-call escalation. In a human org, when a junior engineer hits a problem beyond their scope, there's a defined escalation path. They page the senior on-call, provide context about what they tried, and the senior either resolves it or escalates further. The timeout is explicit (fifteen minutes to acknowledge), the context-passing format is standardized (incident ticket), and if the escalation chain breaks, there's a fallback (skip-level page). In a multi-agent system, when a sub-agent fails or produces low-confidence output, the runtime follows the identical pattern: timeout threshold, structured context handoff to a more capable model, fallback if the escalation target is unavailable. Same state machine. Same failure modes. Same design trade-offs between speed and thoroughness. The difference is that the agent never argues about who's really on-call.
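That state machine fits in a few lines. Here's a sketch of the escalation loop described above; `run`, the tier names, and the confidence threshold are all stand-ins for whatever your system actually uses:

```python
def run_with_escalation(task, tiers, run, timeout_s=15.0, min_confidence=0.7):
    """Walk the escalation chain. Each tier receives the task plus structured
    context about what previous tiers tried (the 'incident ticket').
    `run(tier, task, context, timeout_s)` is assumed to return
    (output, confidence) or raise TimeoutError."""
    context = []
    for tier in tiers:
        try:
            output, confidence = run(tier, task, context, timeout_s)
        except TimeoutError:
            context.append({"tier": tier, "outcome": "timeout"})
            continue                       # skip-level: go straight to the next tier
        if confidence >= min_confidence:
            return tier, output
        context.append({"tier": tier, "outcome": "low_confidence", "attempt": output})
    raise RuntimeError("escalation chain exhausted")
```

The `context` list is the standardized handoff format: the senior tier sees what the junior tier tried, exactly as an incident ticket would carry it.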

Stripping out the emotional layer didn't simplify management. It clarified it. Half of management is coordination engineering: who does what, how information flows, where decisions get made, how quality gets enforced. The other half is emotional labor: motivation, trust, belonging, growth. We've bundled these under one job title for so long that we forgot they're separate disciplines. Agent systems need the first half. All of it. None of the second.

This means the people best positioned to build production agentic runtimes aren't just distributed systems engineers. They're not just ML engineers. They're people who've operated coordination systems at scale, who understand that the runtime's job isn't intelligence. It's governance. The model provides the capability. The runtime provides the structure that makes capability reliable.

Context Is Inside the Loop

The most common architectural mistake in agent systems is treating context retrieval as a preprocessing step. You fetch the relevant documents, stuff them into the prompt, and let the model reason. This works for single-turn search. It breaks completely for multi-turn agent workflows.

In a production agent system, context is generated at every step. The model's plan creates context. Each tool call returns context. Each sub-agent produces context. The self-reflection step generates meta-context about the agent's confidence. And all of this feeds back into the next planning decision.

The knowledge graph isn't upstream of the orchestration layer. It's inside it. Context shapes the plan. The plan generates new context. The cycle runs until the agent is confident or the budget runs out.

This is why the companies building serious agentic platforms have invested years in search and knowledge graph infrastructure before adding agents. The agent layer is a consumer of context infrastructure and a producer of it. You can't bolt agents onto a system that doesn't already have deep, permission-aware, real-time context retrieval. You'll get demo agents. You won't get production agents.

I see this in my own system. An agent tasked with writing a deployment runbook needs to retrieve the last three incident reports, the current infrastructure state, and the team's on-call schedule. But the act of retrieving those reports generates new context: the agent notices a recurring failure pattern across incidents, which changes its plan from "write a standard runbook" to "write a runbook that addresses this specific failure mode." That new plan requires retrieving different context (the relevant service's architecture docs). The loop continues. Each retrieval reshapes the plan. Each plan reshapes the retrieval.

The most thoughtful architectures I've seen separate the stack into layers (context, models, orchestration, security, interfaces) but deliberately couple context and orchestration tightly. The orchestration layer needs to know what the organization knows in order to plan. The context layer needs to ingest what the orchestration layer produces in order to learn. Decoupling models from context is correct, since you want to swap providers without losing organizational knowledge. But decoupling context from orchestration is an anti-pattern. They're the same system wearing different hats.

This has a practical implication for anyone building agent infrastructure: your context retrieval system and your orchestration engine need to share state. Not through an API boundary with request-response semantics, but through shared memory, co-located processes, or at minimum a pub-sub channel that lets the orchestrator's plan updates flow to the retrieval system in real time. The moment you treat context as a service that the orchestrator calls, you've introduced a latency barrier into the innermost loop of agent execution. Every millisecond in that loop multiplies across every step of every agent chain of every concurrent user.
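For the in-process case, the pub-sub channel can be as simple as fan-out queues: the orchestrator publishes plan updates, the retrieval side subscribes and pre-warms context, and no request-response hop sits in the inner loop. A minimal sketch (the message shape is hypothetical):

```python
import queue, threading

class PlanBus:
    """In-process pub-sub: orchestrator publishes plan updates; the retrieval
    system subscribes and starts fetching before it's explicitly asked."""
    def __init__(self):
        self.subscribers = []
        self.lock = threading.Lock()

    def subscribe(self):
        q = queue.Queue()
        with self.lock:
            self.subscribers.append(q)
        return q

    def publish(self, plan_update):
        with self.lock:
            for q in self.subscribers:
                q.put(plan_update)   # non-blocking fan-out; no round trip
```

Across processes the same shape holds with Redis pub-sub or similar; the invariant is that publishing never waits on the retrieval system.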

What Production Teaches You

I run a multi-model orchestration system in production. Not at Google scale. At indie-builder scale, which means every failure is my problem, every dollar of compute comes out of my pocket, and every architectural decision has to justify itself against the alternative of just doing the task manually.

Model routing turns out to be a scheduling problem, not an intelligence problem. You don't need an AI to pick the right AI. You need a policy engine that maps task types to model capabilities, factors in current latency and cost, and makes a deterministic decision. The routing logic should be boring. If it's interesting, you've over-engineered it. My system routes across OpenAI, Anthropic, and Google models using static rules with dynamic fallbacks. The most valuable optimization wasn't smarter routing. It was faster detection of provider degradation. When Anthropic's API starts returning elevated p95 latencies, the runtime needs to shift load to the fallback model within seconds, not minutes. That's not AI. That's health checking. The same health checking we've done for stateless HTTP services for two decades, except now the "service" occasionally returns creative fiction instead of a 500.
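The degradation detector is similarly unglamorous: a rolling window of latency samples per provider and a p95 threshold. A toy version (window size and limit are arbitrary, and a real one would also track error rates):

```python
from collections import deque

class LatencyMonitor:
    """Rolling p95 over the last `window` calls per provider; flags
    degradation within seconds instead of waiting for hard errors."""
    def __init__(self, window=100, p95_limit_ms=3000):
        self.samples = {}
        self.window, self.limit = window, p95_limit_ms

    def observe(self, provider, latency_ms):
        self.samples.setdefault(provider, deque(maxlen=self.window)).append(latency_ms)

    def healthy(self, provider):
        s = sorted(self.samples.get(provider, []))
        if not s:
            return True                       # no data yet: assume healthy
        p95 = s[min(len(s) - 1, int(0.95 * len(s)))]
        return p95 <= self.limit
```

Wire `healthy()` into the routing decision and load shifts to the fallback model as soon as the window fills with slow samples, which is the seconds-not-minutes property that mattered.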

Memory is the hardest unsolved problem, and not because storage is hard. Relevance retrieval across sessions is hard. What does the agent need to remember from three conversations ago? Not everything. Not nothing. The right things. And "right" depends on the current task, which the agent hasn't finished planning yet. Memory and planning are co-dependent, and most architectures treat them as sequential. I've watched agents retrieve perfectly relevant context from two weeks ago and completely ignore it because the planning step had already committed to a direction before memory results arrived. The ordering problem is brutal. You can't plan without memory. You can't query memory without a plan. Every solution I've seen either resolves this with a two-pass approach (rough plan, then memory retrieval, then refined plan) or accepts the latency cost of interleaving them. Neither is satisfying.
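The two-pass approach is easy to state as code, even though neither pass is satisfying on its own. A skeleton where `rough_plan`, `retrieve`, and `refine` stand in for whatever model and retrieval calls a real system would make:

```python
def plan_with_memory(task, rough_plan, retrieve, refine):
    """Two-pass resolution of the plan/memory ordering problem:
    (1) cheap rough plan with no memory, (2) memory retrieval keyed by that
    plan, (3) refined plan produced with the memory results in hand."""
    draft = rough_plan(task)           # first pass: commits to nothing expensive
    memories = retrieve(task, draft)   # query memory *with* a plan to key on
    return refine(task, draft, memories)  # second pass: commit with memory present
```

The key property is that the final plan is produced after memory arrives, which is precisely what the single-pass architectures described above fail to guarantee.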

Validation beats iteration, consistently and by a wide margin. I'd rather have an agent that runs once and validates its output against explicit criteria than an agent that runs five times and "improves." Self-reflection loops sound elegant. In practice, they're expensive, slow, and surprisingly bad at catching the failures that matter. A deterministic validation step (does this output match the schema? does this code compile? does this answer contain a citation?) catches more real failures than any amount of LLM self-critique. The reason is straightforward: the same model that produced the error is unlikely to catch the error on re-examination. You need an external reference frame. Deterministic checks provide one. Another language model does not.
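A deterministic validation gate of the kind described can be a plain function that returns a list of failures; an empty list means the output ships. The schema format and citation rule here are illustrative:

```python
def validate(output, schema, require_citation=True):
    """Deterministic gate: check the output against explicit criteria.
    Returns a list of failure strings; empty means the output passes."""
    failures = []
    for key, typ in schema.items():
        if key not in output:
            failures.append(f"missing field: {key}")
        elif not isinstance(output[key], typ):
            failures.append(f"wrong type for {key}")
    # External reference frame, not self-critique: a structural check the
    # producing model cannot talk its way past.
    if require_citation and not output.get("sources"):
        failures.append("no citation")
    return failures
```

Run once, validate, and either ship or fail loudly; the failures list is also a better retry prompt than "please improve this."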

Perhaps the deepest lesson: the agent boundary is an organizational decision, not a technical one. Where you draw the line between one agent and another, what constitutes a "sub-agent" versus a "tool call" versus a "step in the plan," is org design dressed up as architecture. It's about span of control, information flow, and failure isolation. Not about what's technically possible. Everything is technically possible. The question is what's governable.

I restructured my agent boundaries three times in six months. The first design had too many specialists (eight agents, constant context-passing overhead, failures cascading everywhere). The second had too few (two mega-agents that ran out of context window on complex tasks). The third landed on a pattern that mirrors how I'd staff a small team: a generalist coordinator with access to specialist tools, escalating to dedicated agents only when the task requires sustained focus in a specific domain. The coordinator handles 80% of requests directly. The specialists handle the 20% that would blow the coordinator's context budget. This isn't novel distributed systems theory. It's the same span-of-control trade-off every engineering manager makes when deciding whether to split a team.

The Work Ahead

The next generation of agentic systems won't be differentiated by model capability. Frontier models are converging. The differentiation will be in runtime: who can make agents reliable, observable, governable, and fast at scale. Who can solve the distributed systems problems that emerge when you put a non-deterministic process in the hot path of a production system that ten thousand people depend on.

Forty percent of agentic AI projects will be cancelled by 2027, according to current industry analysis. Not because the models aren't good enough. Because the runtime engineering isn't there. Teams built demo agents, showed them to stakeholders, got funding, and then discovered that making them reliable requires solving problems that have nothing to do with AI and everything to do with systems engineering.

The bottleneck isn't people who understand transformers and attention mechanisms and RLHF. Those skills matter, but they're not scarce in the way that matters. The bottleneck is people who can build low-latency distributed systems with non-deterministic components in the critical path. People who can implement circuit breakers for services that hallucinate their status codes. People who can design observability for processes where the execution trace is a tree of natural-language reasoning steps. People who understand that "the agent failed" is as useful as "the server returned an error," which is to say, not useful at all without the full trace.

The coordination skills, the management instincts, the ability to reason about how information flows through a system of semi-autonomous actors: these are transferable, and increasingly they're the skills that determine whether an agentic system works in production or only works in the demo room. The runtime engineering underneath, the actual plumbing that makes it all hold together, remains stubbornly hard and stubbornly human.

This isn't glamorous work. It doesn't get covered in keynotes or featured in product launches. It's the work that makes product launches possible.

In infrastructure, the plumbing is the product.


About the Author

Zak El Fassi

Builder · Founder · Systems engineer
