The Missing Runtime for Knowledge Work

Agentic AI thrives in software engineering because code already runs inside a repeatable execution environment. Most knowledge work doesn't. The ceiling isn't model intelligence. It's the absence of runtimes that can execute, validate, and provide feedback on structured work.

Feb 21, 2026·14 min read

If you look at where "agentic AI" actually works today, a pattern jumps out.

It's not evenly distributed across domains. It's not correlated with how valuable the work is. And it's not explained by model capability.

Agentic systems thrive in software engineering, and struggle almost everywhere else.

The common explanations are familiar. "Code is easier for models than language." "Software engineers are early adopters." "Other domains are riskier or more regulated." I've heard all of these. I don't think any of them hold up under scrutiny.

The real reason is simpler. And more structural.

Software already runs inside a repeatable execution environment. Most knowledge work does not.

I think this observation explains most of the variance in where agentic AI has succeeded and where it's stalled. And I think it points toward a much bigger opportunity than most people in the AI ecosystem are currently building toward.

Let me explain.

Why software wins (and keeps winning)

When an AI system writes code, something interesting happens that doesn't happen in most other domains: the environment pushes back immediately.

The code compiles or it doesn't. It runs or it crashes. Tests pass or they fail. The type checker complains. The linter flags issues. CI goes red. These aren't suggestions. They're constraints, enforced by the system itself.

That feedback loop is doing most of the work.

I want to be precise about this, because I think it's widely misunderstood. When people see an AI agent write a working function, the intuition is: "the model understood the problem." Maybe. But what's actually happening is more interesting. The model generates a candidate. The environment evaluates it. The model observes the result. It adjusts. The environment evaluates again.

The intelligence isn't just in the model. It's distributed across the model and the environment.

The environment is doing a huge share of the cognitive labor: rejecting bad outputs, surfacing errors, creating the gradient the model follows toward a correct solution.

Software has all of this as table stakes: a runtime, a type system, constraints, tests, clear failure modes, and replayability. Agents don't just generate output. They execute, observe, and adjust. That's what makes horizontal scaling possible: because the environment does the checking, you can add more agents without adding more human reviewers.
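To make the generate, evaluate, adjust loop concrete, here is a toy sketch in Python. It is an illustration under stated assumptions, not how any particular agent framework works: `propose` is a hypothetical stand-in for a model call (it just walks a fixed list of candidates), and `evaluate` plays the role of the environment by running a small test suite.

```python
from typing import Callable, Optional

# The "environment": a candidate is accepted only if it passes every test.
TESTS = [((2, 3), 5), ((-1, 1), 0), ((0, 0), 0)]

def evaluate(candidate: Callable[[int, int], int]) -> list[str]:
    """Run the tests and return failure messages (an empty list means pass)."""
    failures = []
    for args, expected in TESTS:
        got = candidate(*args)
        if got != expected:
            failures.append(f"f{args} returned {got}, expected {expected}")
    return failures

# Hypothetical stand-in for a model: a fixed sequence of attempts.
CANDIDATES = [
    lambda a, b: a * b,  # wrong
    lambda a, b: a - b,  # wrong
    lambda a, b: a + b,  # passes
]

def propose(attempt: int, feedback: list[str]) -> Callable[[int, int], int]:
    """A real model would condition on `feedback`; this stub ignores it."""
    return CANDIDATES[attempt]

def solve(max_attempts: int = 3) -> tuple[Optional[Callable[[int, int], int]], list[list[str]]]:
    """Generate, evaluate, adjust, until the environment stops pushing back."""
    feedback: list[str] = []
    history: list[list[str]] = []
    for attempt in range(max_attempts):
        candidate = propose(attempt, feedback)
        feedback = evaluate(candidate)   # the environment pushes back here
        history.append(feedback)
        if not feedback:
            return candidate, history
    return None, history
```

Notice that `solve` never needs to know which candidate is right. The test suite carries that knowledge, which is the sense in which the environment shares the cognitive labor.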

The analogy I keep coming back to is this: imagine trying to learn to play basketball in a gym with no hoops, no lines on the floor, no scoreboard, and no one to tell you whether the ball went in. You could still practice, in some abstract sense. But you'd never get meaningfully better. The hoop is the feedback mechanism. Without it, practice is just motion.

Software engineering has the hoop. Most knowledge work doesn't even have the gym.

The problem everywhere else

Now contrast that with the work most organizations actually do.

Strategy. Research. Analysis. Policy. Legal. Operations. Planning. Procurement. Consulting. These domains represent the overwhelming majority of knowledge work in the economy. They're where the value is. And they're exactly where agentic AI has stalled.

These domains are text-heavy. The primary artifact is a document, and language is slippery. There's no equivalent of "it compiles." A strategy memo can be beautifully written and completely wrong.

They're context-dependent. The same recommendation might be excellent for one company and disastrous for another. The evaluation function isn't portable. It's embedded in organizational context, relationships, and politics that are rarely written down.

They're evaluated subjectively. Ask five executives to rate a strategic plan and you'll get five different answers for five different reasons. "Good" is a vibe, not a boolean.

They're executed by humans. In software, the computer executes the code and reports back. In knowledge work, a human reads the output, maybe acts on it, and the results show up weeks later, entangled with a hundred other variables.

And they're loosely structured, if structured at all. Most knowledge work processes exist as tribal knowledge. There's no schema. There's no DAG. There's certainly no runtime.

So what happens when an agent produces a strategy memo or a research synthesis? Nothing pushes back. The output might sound right (LLMs are extremely good at sounding right) but it isn't grounded in anything that can verify it. There's no compiler error. There's no failing test. There's no red build.

Without grounding, there's no feedback loop. Without a feedback loop, agents can't improve iteratively. They're stuck in single-shot mode: generate, deliver, hope for the best.

This isn't a model problem. Giving agents a better LLM doesn't fix it. This is an environment problem.

A brief detour: why "language as code" never worked

If you've been around software long enough, you'll know this isn't the first time someone noticed this gap.

For decades, researchers tried to make natural language executable. Natural language programming. Controlled English. Executable specifications. Rule engines. Workflow DSLs. Business process modeling languages. Every enterprise software company from the '90s through the '00s took a swing at this.

They mostly failed, and they failed for the same reason.

Natural language is inherently ambiguous. Traditional compilers are built on deterministic parsers, and those need unambiguous input. So when you try to compile natural language, you have two options, and neither is great.

Option one: force the language to become unnaturally rigid. You end up with something that looks like English but reads like a legal contract written by a robot. Users hate it. Adoption dies.

Option two: stay flexible and accept unreliability. The system tries to interpret what you mean, gets it wrong half the time, and users lose trust. This is the Siri problem, the early chatbot problem, the "enterprise AI" problem of 2015.

What these approaches lacked wasn't ambition. It was a compiler front-end capable of handling ambiguity.

For fifty years, that front-end didn't exist.

What changed (and why this is different now)

Large language models quietly flipped the equation.

I want to be careful here, because this is the point where it's easy to either overclaim or underclaim, and both are wrong.

LLMs are not perfect reasoners. They're not deterministic. They're not truth engines. They hallucinate. They lose the thread on long contexts. I don't need to enumerate the limitations. If you're reading this, you know them.

But they are extremely good at one specific thing that matters enormously for this problem: turning messy human language into structured representations.

The best mental model for LLMs, at least for this application, is not "artificial brain" or "reasoning engine." It's something more like:

Probabilistic compiler front-end.

Think about what a compiler front-end does. It takes source code, relatively human-readable and expressive, and produces an intermediate representation that the back-end can optimize and execute. The front-end handles lexical analysis, parsing, and semantic analysis. It takes something messy and produces something structured.

LLMs can do this for natural language in a way that no prior system could. They extract structure from ambiguity. They normalize inconsistent inputs. They generate intermediate representations: schemas, plans, graphs, structured objects. And critically, they can revise those representations when constraints fail. You can say "that's wrong, the budget constraint is $500K not $5M" and the model adjusts the structured output accordingly.

This is the missing piece that decades of NLP research didn't have. Not because the researchers weren't brilliant (they were) but because the technology wasn't there. You can't build a probabilistic front-end with deterministic tools.

But here's the critical part, and it's where I think most of the AI industry is currently confused:

A compiler without a runtime is useless.

The most sophisticated parser in the world doesn't matter if there's nothing to execute the output against. A front-end without a back-end is an academic exercise. It produces intermediate representations that sit there, inert, unvalidated, unexecuted.

This is the state of most "AI agents" today. They have an incredibly powerful front-end (the LLM). They can parse human intent into structured plans, schemas, queries, and artifacts. But there's no runtime to execute those artifacts against. No constraints to validate them. No tests to verify them. No feedback loop to improve them.

They're compilers that produce code for a machine that doesn't exist.

The real abstraction: work needs runtimes

What makes software engineering uniquely amenable to agentic automation isn't intelligence. It's executability.

Here's the general pattern, one that I think applies far more broadly than people currently realize:

Human intent (language)
        ↓
Structured intermediate representation
        ↓
Execution inside a domain runtime
        ↓
Tests, validation, feedback
        ↓
Artifacts + telemetry

Software fits this model perfectly. You express intent ("build a login page"), the model produces an intermediate representation (code), the runtime executes it (the browser, the server), tests validate it, and you get artifacts plus telemetry.

Most other domains don't fit this model. Yet.

A runtime provides a very specific set of capabilities that most knowledge work currently lacks:

Execution semantics. A well-defined notion of what it means to "run" a piece of work. In software, the code executes. In strategy: what does it mean to "run" a strategic plan? There are no execution semantics for a strategy memo.

Feedback. The runtime tells you what happened. Did it work? Where did it fail? Why? In software, this is stack traces, test results, log output. In most knowledge work, feedback is a meeting three weeks later where someone says "I don't think this is quite right."

Constraints. The runtime enforces rules. Types must match. Resources must be available. Constraints aren't suggestions. They're walls. In knowledge work, constraints are typically buried in prose and enforced by human judgment.

Observability. You can see what the system is doing. Logs, metrics, traces. In knowledge work, observability is "ask Sarah, she knows how this process works."

Replayability. You can run the same work again and compare results. This is the foundation of testing and continuous improvement. In knowledge work, every analysis is a snowflake.

Once work can be executed, it can be tested. Once it can be tested, it can be automated. Once it can be automated, agents can scale. Each step depends on the one before it.
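Here is a minimal sketch of what those five properties could look like for a toy staffing task. Everything in it is an assumption made for illustration: `Step`, `Runtime`, `ConstraintViolation`, and the task itself are hypothetical names, and a real domain runtime would be far richer.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

class ConstraintViolation(Exception):
    """Raised when a step's result breaks a rule the runtime enforces."""

@dataclass
class Step:
    name: str
    run: Callable[[dict], dict]             # execution semantics: state in, new fields out
    check: Callable[[dict], Optional[str]]  # constraint: error message, or None if satisfied

@dataclass
class Runtime:
    steps: list[Step]
    trace: list[dict] = field(default_factory=list)  # observability: one record per step

    def execute(self, inputs: dict) -> dict:
        state = dict(inputs)
        for step in self.steps:
            state = {**state, **step.run(state)}
            error = step.check(state)
            self.trace.append({"step": step.name, "state": dict(state), "error": error})
            if error is not None:
                # Feedback: the failure names the step and the reason.
                raise ConstraintViolation(f"{step.name}: {error}")
        return state

def sum_hours(state: dict) -> dict:
    return {"hours_needed": sum(state["tasks"].values())}

def allocate(state: dict) -> dict:
    return {"hours_left": state["hours_available"] - state["hours_needed"]}

def no_check(state: dict) -> Optional[str]:
    return None

def hours_must_be_available(state: dict) -> Optional[str]:
    if state["hours_left"] < 0:
        return f"needs {-state['hours_left']} more hours than are available"
    return None

STEPS = [
    Step("sum_hours", sum_hours, no_check),
    Step("allocate", allocate, hours_must_be_available),
]

def replay(inputs: dict) -> bool:
    """Replayability: two independent runs on the same inputs must agree."""
    return Runtime(STEPS).execute(inputs) == Runtime(STEPS).execute(inputs)
```

The interesting part is not the arithmetic. It is that a plan asking for 50 hours against 40 available fails loudly, at a named step, with a trace you can inspect afterward.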

Why agents plateau without structure

This framing explains something that a lot of people in the AI space are experiencing but not naming correctly.

Agent demos are impressive. You watch an agent research a topic, build a plan, call tools, and produce a deliverable, and it looks like magic. Then you deploy it in production and it falls apart.

The demo works because a human is watching, correcting, and guiding. The human is the runtime. They're providing the feedback loop, the constraints, the validation. The agent generates candidates, and the human evaluates them.

But humans don't scale. That's the whole point.

If you need a human in the loop to validate every output, you haven't automated the work. You've automated the typing.

So agents plateau. They generate plans but can't enforce them. They reason in chains but can't verify the chains. They produce deliverables but can't test them.

What they're missing is everything a runtime provides. Enforceable constraints: not "the agent should consider the budget" but "this plan is rejected because it exceeds the budget by $200K." Clear success criteria: not "does this feel right" but "this analysis covers all required segments and reconciles with the financial model within 2%." Failure states that are specific and actionable. Regression tests that prove today's output is better than yesterday's.
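The difference between a suggestion and an enforceable check is easy to show in code. The sketch below is a hedged illustration: `Verdict` and the three check functions are hypothetical names, the 2% tolerance comes from the example above, and the dollar figures and segment names are made up for the demonstration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Verdict:
    accepted: bool
    reason: str  # specific and actionable, so an agent can act on it

def check_budget(plan_cost: float, budget: float) -> Verdict:
    """Enforceable constraint: reject any plan that costs more than the budget."""
    overage = plan_cost - budget
    if overage > 0:
        return Verdict(False, f"rejected: exceeds the budget by ${overage:,.0f}")
    return Verdict(True, "within budget")

def check_reconciliation(analysis_total: float, model_total: float,
                         tolerance: float = 0.02) -> Verdict:
    """Success criterion: totals must agree within `tolerance` (assumes model_total != 0)."""
    gap = abs(analysis_total - model_total) / abs(model_total)
    if gap > tolerance:
        return Verdict(False, f"rejected: off from the financial model by {gap:.1%}")
    return Verdict(True, f"reconciles within {tolerance:.0%}")

def check_coverage(covered: set[str], required: set[str]) -> Verdict:
    """Success criterion: every required segment must appear in the analysis."""
    missing = sorted(required - covered)
    if missing:
        return Verdict(False, "rejected: missing segments: " + ", ".join(missing))
    return Verdict(True, "all required segments covered")
```

For example, `check_budget(700_000, 500_000)` returns a rejection whose reason reads "rejected: exceeds the budget by $200,000". Because every check is a plain function of the output, the same checks can run again tomorrow as regression tests.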

Without these, improvement is manual. Scaling is vertical: you add more humans, not more agents. And trust never compounds, because there's no mechanism to demonstrate reliability.

Without a runtime, agents are eloquent guessers. And guessing doesn't compound.

What building a runtime actually means

I want to be clear about what I'm not saying. I'm not saying we should turn everything into code, or that every knowledge worker needs to learn to program. And I'm not saying we should force the richness of human work into some rigid formalism that strips it of nuance.

What I am saying is that for work to benefit from agentic automation, it needs certain properties. And those properties look a lot like what a runtime provides.

None of this requires perfect formalization. Partial structure already changes what's possible.

Concretely, building a runtime for a domain of knowledge work means defining:

Intermediate representations. What is the structured form of the work? For procurement, this might be a schema of suppliers, pricing tiers, contract terms, and compliance requirements. Not as prose, but as structured data a system can reason about. For legal review, a graph of clauses, obligations, parties, and conditions. For strategic planning, a plan object with goals, milestones, dependencies, and risk assessments.

Execution steps with inputs and outputs. What does it mean to "run" this work? What are the discrete steps, what does each take as input, and what does it produce? This isn't a rigid waterfall. It's more like defining an interface. Each step has a contract: given these inputs, produce these outputs, subject to these constraints.

Validation rules. What makes the output correct, or at least not obviously wrong? Even partial validation is enormously more useful than none. "All numbers in the financial summary must reconcile with the source data" is a validation rule. "The analysis must reference at least three primary sources from the last 12 months" is a validation rule. They're not sufficient for quality, but they catch a huge class of errors that agents currently make silently.

Acceptance tests. Given a known input, does the system produce an acceptable output? You can define test cases: "given this RFP and these three vendor proposals, the system should identify the lowest-cost compliant option and flag compliance gaps." The expected output doesn't need to be exact. It needs to be evaluable.

Logging and traceability. Every step should produce a record of what happened, what inputs were used, what decisions were made. When an analysis is wrong, you should be able to trace back through the execution log and find where it went wrong, rather than reconstructing the reasoning from a finished document.
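To show how these pieces fit together, here is a small sketch built around the procurement acceptance test described above. It is one possible shape under stated assumptions, not a reference design: the `RFP`, `Proposal`, and `Evaluation` types, the vendor names, the requirement labels, and the prices are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Proposal:
    """Intermediate representation of one vendor proposal."""
    vendor: str
    price: float
    requirements_met: frozenset[str]

@dataclass(frozen=True)
class RFP:
    required: frozenset[str]
    max_price: float

@dataclass
class Evaluation:
    winner: Optional[str]
    gaps: dict[str, list[str]]  # vendor -> compliance problems
    log: list[str]              # one traceable record per decision

def evaluate(rfp: RFP, proposals: list[Proposal]) -> Evaluation:
    """One execution step. Input: an RFP and proposals. Output: an Evaluation."""
    log: list[str] = []
    gaps: dict[str, list[str]] = {}
    compliant: list[Proposal] = []
    for p in proposals:
        # Validation rules: every requirement met, price under the cap.
        problems = [f"missing requirement: {r}"
                    for r in sorted(rfp.required - p.requirements_met)]
        if p.price > rfp.max_price:
            problems.append(f"price {p.price:,.0f} exceeds cap {rfp.max_price:,.0f}")
        if problems:
            gaps[p.vendor] = problems
            log.append(f"{p.vendor}: " + "; ".join(problems))
        else:
            compliant.append(p)
            log.append(f"{p.vendor}: compliant at {p.price:,.0f}")
    winner = min(compliant, key=lambda p: p.price).vendor if compliant else None
    log.append(f"selected: {winner}")
    return Evaluation(winner, gaps, log)

def acceptance_test() -> Evaluation:
    """Known input, evaluable output: lowest-cost compliant option, gaps flagged."""
    rfp = RFP(frozenset({"security-audit", "data-residency"}), max_price=120_000)
    proposals = [
        Proposal("Acme", 90_000, frozenset({"security-audit"})),  # cheapest, not compliant
        Proposal("Birch", 110_000, frozenset({"security-audit", "data-residency"})),
        Proposal("Cedar", 115_000, frozenset({"security-audit", "data-residency"})),
    ]
    result = evaluate(rfp, proposals)
    assert result.winner == "Birch"
    assert result.gaps == {"Acme": ["missing requirement: data-residency"]}
    return result
```

The hard part in practice is getting from messy vendor documents to `Proposal` objects, and that is the job of the front-end. Once the structured form exists, though, the rest is ordinary, testable code, and the log tells you why Acme lost without anyone rereading a memo.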

Once these things exist, agents stop being chatbots that produce one-shot outputs and start behaving like programs that execute, get feedback, and improve. The human moves from being in the loop on every output to supervising exceptions and edge cases. Trust starts to compound because you can demonstrate that the system works reliably on known test cases.

This is exactly the transition that happened in software engineering. Programmers used to write code and manually verify it. Then testing frameworks emerged. Then continuous integration. Then automated deployment. Each step moved the programmer up the abstraction stack. The same progression is possible in knowledge work, but only if we build the runtimes to support it.

The uncomfortable implication

Here's the part that makes people uneasy.

The ceiling for agentic AI isn't model intelligence. It's not context windows, or reasoning capability, or tool use, or any of the other things the AI labs are racing to improve. Those things matter. But they're not the binding constraint.

The binding constraint is how much of our work is still vibes, judgment, and unstructured text pretending to be process.

I've talked to dozens of organizations that want to "deploy AI agents" for their knowledge work. Almost all of them hit the same wall. They realize that before an agent can do the work, someone has to define what the work actually is, in sufficient detail that a system can execute it, validate it, and improve on it. And that definition doesn't exist. It's in people's heads. It's in institutional knowledge. It's in the way Sarah on the third floor "just knows" how to do the quarterly analysis.

That's not an AI problem. That's a process problem that AI makes visible.

Vibes don't scale. Executability does.

The next wave of AI systems won't be defined by smarter models. They'll be defined by new runtimes, new execution environments for domains of work that have never had them. Runtimes for legal analysis. Runtimes for procurement. Runtimes for strategic planning. Runtimes for research synthesis.

Each will look different. Different intermediate representations, different execution semantics, different validation rules. But they'll share the same fundamental architecture: a probabilistic front-end (the LLM) that compiles human intent into structured representations, and a domain-specific runtime that executes, validates, and provides feedback on those representations.

Software engineering just got there first because it already had the runtime. Everyone else needs to build theirs.

The organizations that figure this out, that invest in defining the execution environments for their most important knowledge work, will be the ones that actually capture the value of agentic AI. Everyone else will be stuck in demo mode, watching impressive outputs that never quite work in production, wondering why the technology that transforms software engineering can't seem to transform anything else.

The answer was never the model.

The answer is the runtime.

And for most of the work that matters, we haven't built it yet.