AI Automates the Verifiable

Coding agents are designed to be coupled with an oracle. AI agents that automate knowledge work in business processes often cannot replicate this pattern. That is the fundamental gap that explains why coding is the flagship agent use-case, while agents for knowledge work are often still in experimentation.

Why coding agents and not everything else?

The gap between the hype around AI agents and operational reality is most visible in business process automation. Coding agents are shipping production code, running full test suites, operating with increasing autonomy. Meanwhile, most enterprise AI agent programs are still running pilots — impressive demos that haven't made it into production workflows at scale.

The easy explanation is that LLMs have been trained on more code than business process data. That's partially true, but it misses something more fundamental about how coding agents actually work — and why that pattern is difficult to replicate in knowledge work.

The oracle problem

AI agents are built around a loop. Given a goal and some initial context, they call tools repeatedly, accumulating information and taking actions in an effort to reach an end state.

The key question: what is the end state?

The more clearly you can define and verify that end state, the better the agent does at converging on it.

This is where coding agents have a structural advantage. Code compiles. It can be tested against structured, verifiable metrics. The end state is a test suite passing — a clear specification that the agent can check itself against.

An underwriting agent doesn't have this. It needs an SME's judgment to determine whether it made the right call. "Was this underwriting decision sound?" has no compiler or lint check that gives the agent a pass/fail signal. The end state is subjective, contextual, and often takes time to reveal itself.

In software testing, an oracle is any mechanism that can determine whether a program's output is correct. Coding agents are designed to leverage oracles: they self-verify whether the code they've written produces the desired output. This lets them run long, complex tasks and course-correct along the way. In agentic engineering, this self-verification capability is often called back pressure.
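As a minimal sketch of that self-verification loop, the Python below uses a tiny test suite as the oracle and a fixed list of draft implementations as a stand-in for model output. The names `run_test_suite`, `converge`, and `drafts` are hypothetical and exist only for illustration; a real agent would feed the failure messages back to the model rather than walk a pre-written list.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Candidate = Callable[[int, int], int]


def run_test_suite(candidate: Candidate) -> List[str]:
    """Oracle: return failure messages; an empty list means the suite passes."""
    cases = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]
    failures = []
    for args, expected in cases:
        actual = candidate(*args)
        if actual != expected:
            failures.append(f"add{args} returned {actual}, expected {expected}")
    return failures


def converge(
    candidates: Iterable[Candidate], max_attempts: int = 5
) -> Tuple[Optional[Candidate], int]:
    """Try drafts in order until the oracle accepts one or attempts run out."""
    attempts = 0
    for candidate in candidates:
        if attempts >= max_attempts:
            break
        attempts += 1
        if not run_test_suite(candidate):  # no failures: end state reached
            return candidate, attempts
    return None, attempts


# Hypothetical drafts standing in for successive model outputs.
drafts = [lambda a, b: a - b, lambda a, b: a * b, lambda a, b: a + b]
accepted, attempts = converge(drafts)
```

The point of the sketch is the stopping condition: the loop ends because the oracle says so, not because the agent believes it is done.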

Back pressure unlocks end-to-end automation

Back pressure is the feedback mechanism that steers an agent toward the defined end state. For a coding agent, this is concrete: lint checks, test suites, browser dev tools, and type checkers can all be run by the agent itself to verify its own work.

I happen to have some background in control theory — I spent time building large-scale control systems for manufacturing plants. The pattern maps well. Think of your home thermostat: you set a target temperature, and the system continuously measures the actual temperature and adjusts to close the gap. The "back pressure" is the temperature reading that tells the system how far off it is. Without that feedback, the system has no way to know whether it's converging or drifting.

[Figure: step response diagram showing back pressure damping the agent toward the setpoint]
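The thermostat analogy can be sketched in a few lines of Python. This is a deliberately simplified proportional controller, not a model of any real heating system: the function names, the `gain` value, and the temperatures are all assumptions chosen for illustration.

```python
from typing import List


def closed_loop(setpoint: float, start: float, gain: float = 0.5, steps: int = 10) -> List[float]:
    """Each step measures the error and corrects a fraction of it."""
    temperature = start
    history = [temperature]
    for _ in range(steps):
        error = setpoint - temperature  # the feedback signal
        temperature += gain * error
        history.append(temperature)
    return history


def open_loop(start: float, heat_per_step: float, steps: int = 10) -> List[float]:
    """No measurement: the same heat is applied regardless of the result."""
    temperature = start
    history = [temperature]
    for _ in range(steps):
        temperature += heat_per_step
        history.append(temperature)
    return history


with_feedback = closed_loop(setpoint=21.0, start=15.0)
without_feedback = open_loop(start=15.0, heat_per_step=1.0)
```

With feedback, the remaining error halves at every step and the temperature settles near the setpoint. Without it, the system keeps heating past the target because nothing tells it to stop.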

Coding agents work the same way, just with software tools as the feedback mechanism instead of a temperature sensor.

[Figure: back pressure control loop in which the oracle closes the gap]

The feedback from a failing test suite tells the agent where to focus. Without it, the agent is flying blind: it has no way to assert whether its work is complete or correct.

Back pressure is increasingly proving to be a prerequisite for full automation. If an agent cannot self-verify that its work is correct, it cannot know when the task is done — and it cannot be trusted to run without human oversight.

Why this works for code and breaks for business processes

Level	Example	Verifiable?	How it is verified
Syntactic	Code compiles	Yes	Deterministic check
Structural	Tests pass, lint clean	Yes	Deterministic check
Semantic	Code does what was intended	Partially	Human review, evals
Business process	Decision was correct	Rarely	SME judgment
High-stakes business	Underwriting decision was sound	Almost never	Regulatory review and outcomes over time

Coding agents live in the first three rows. Business processes operate in the last two.

Notice that even at the semantic level — whether code actually does what was intended — verification is only partial and requires human judgment. Coding agents aren't fully autonomous even here. But the deterministic checks in rows one and two do so much of the heavy lifting that agents can operate with high confidence across most of the task.
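To make the first two rows concrete, here is a small sketch of layered deterministic checks using only Python built-ins. The helper names (`check_syntax`, `check_tests`, `verify`) and the `double` example are hypothetical; real coding agents typically call out to compilers, linters, and test runners instead.

```python
from typing import List, Sequence, Tuple

Case = Tuple[tuple, object]


def check_syntax(source: str) -> List[str]:
    """Syntactic level: does the source compile at all?"""
    try:
        compile(source, "<candidate>", "exec")
    except SyntaxError as exc:
        return [f"line {exc.lineno}: {exc.msg}"]
    return []


def check_tests(source: str, func_name: str, cases: Sequence[Case]) -> List[str]:
    """Structural level: does the function pass the given test cases?"""
    namespace: dict = {}
    exec(source, namespace)  # a real agent would run this in a sandbox
    func = namespace.get(func_name)
    if not callable(func):
        return [f"{func_name} is not defined"]
    failures = []
    for args, expected in cases:
        actual = func(*args)
        if actual != expected:
            failures.append(f"{func_name}{args} returned {actual!r}, expected {expected!r}")
    return failures


def verify(source: str, func_name: str, cases: Sequence[Case]) -> Tuple[str, List[str]]:
    """Return the first level that fails, or 'passed' with no messages."""
    errors = check_syntax(source)
    if errors:
        return "syntactic", errors
    errors = check_tests(source, func_name, cases)
    if errors:
        return "structural", errors
    return "passed", []


cases = [((2,), 4), ((3,), 6)]
broken = "def double(x) return x * 2"
wrong = "def double(x):\n    return x + 2\n"
right = "def double(x):\n    return x * 2\n"
results = [verify(src, "double", cases)[0] for src in (broken, wrong, right)]
```

Nothing here can answer the semantic question of whether doubling was the right thing to build. That gap is why the third row remains only partially verifiable.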

In knowledge work, almost none of the task is deterministically verifiable. The back pressure mechanism is a subject matter expert — a human. That is why human-in-the-loop isn't just a nice-to-have design pattern; it's often the only available oracle.

This affects more than just runtime behavior. Agent quality in knowledge work is judged through evaluations, which require labelled examples of correct decisions. Building those evaluations means collecting representative case data, having SMEs label it, and reviewing agent outputs against those labels over time. With code, this data is abundant and the feedback is fast. In business processes, the historical data is often scarce, the labels don't exist, and the SME time required to create them is one of the hardest resources to allocate. And the consequences of errors are asymmetric — a bug in code surfaces quickly and can be patched; a bad underwriting decision might not surface for months, and the damage is already done.

What back pressure looks like in practice for business processes

The fact that back pressure is harder to build in knowledge work doesn't mean you can skip it — it means you have to construct it deliberately.

Eval-driven design. This is the foundation. Before you build, you need a labelled dataset that represents the realistic distribution of cases your agent will face: not cherry-picked examples, but the real distribution, including the edge cases and ambiguous calls that are hardest to get right. Building this golden dataset is the hardest part of the work and the part most often underinvested in. Everything downstream (testing, iteration, deployment confidence) depends on its quality.
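A minimal sketch of what scoring an agent against a golden dataset might look like is shown below. Everything in it is illustrative: the `LabelledCase` structure, the `debt_ratio` feature, the decision labels, and the toy rule-based agent are assumptions, not a real underwriting model or dataset.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class LabelledCase:
    case_id: str
    features: Dict[str, float]
    expected: str  # decision label assigned by an SME


def evaluate(agent: Callable[[Dict[str, float]], str], dataset: List[LabelledCase]) -> dict:
    """Compare agent decisions with SME labels and list the misses."""
    failures = [case.case_id for case in dataset if agent(case.features) != case.expected]
    accuracy = (len(dataset) - len(failures)) / len(dataset)
    return {"accuracy": accuracy, "failures": failures}


def toy_agent(features: Dict[str, float]) -> str:
    """Hypothetical stand-in for the real agent: a single threshold rule."""
    return "approve" if features["debt_ratio"] < 0.4 else "decline"


golden = [
    LabelledCase("c1", {"debt_ratio": 0.2}, "approve"),
    LabelledCase("c2", {"debt_ratio": 0.6}, "decline"),
    LabelledCase("c3", {"debt_ratio": 0.45}, "refer"),  # ambiguous call
    LabelledCase("c4", {"debt_ratio": 0.3}, "approve"),
]
report = evaluate(toy_agent, golden)
```

In this toy data the only miss is the ambiguous case, which is the kind of example a cherry-picked dataset would tend to leave out.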

Human-in-the-loop, phased. Start with 100% human review and treat that as the baseline, not a temporary crutch. As you build confidence in specific case types, route those to the agent and keep complex or high-stakes cases with humans. This lets you scale gradually without betting the whole operation on early performance.
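One way to picture phased routing is a router that starts with every case going to a human and only promotes a case type once reviewed history supports it. The sketch below assumes hypothetical thresholds (`min_agreement`, `min_reviewed`, `max_auto_amount`) and case type names; the right values depend entirely on the process and its risk tolerance.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class PhasedRouter:
    min_agreement: float = 0.98       # assumed agent/human agreement needed to trust a case type
    min_reviewed: int = 200           # assumed minimum number of human-reviewed cases
    max_auto_amount: float = 10_000.0  # assumed ceiling above which humans always decide
    trusted_case_types: Set[str] = field(default_factory=set)

    def record_review_stats(self, case_type: str, reviewed: int, agreed: int) -> None:
        """Promote or demote a case type based on human review history."""
        if reviewed >= self.min_reviewed and agreed / reviewed >= self.min_agreement:
            self.trusted_case_types.add(case_type)
        else:
            self.trusted_case_types.discard(case_type)

    def route(self, case_type: str, amount: float) -> str:
        """Send a case to the agent only if its type is trusted and stakes are low."""
        if case_type in self.trusted_case_types and amount <= self.max_auto_amount:
            return "agent"
        return "human"


router = PhasedRouter()
before = router.route("renewal", 500.0)        # nothing is trusted yet
router.record_review_stats("renewal", reviewed=250, agreed=248)
after = router.route("renewal", 500.0)
high_stakes = router.route("renewal", 50_000.0)
```

Demotion is part of the design: if agreement on a previously trusted case type drops, the same call that promoted it sends it back to human review.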

LLM-as-judge. A second AI model can review the primary agent's decisions as a signal — useful for catching obvious errors at scale. But judges drift, miss nuance, and can mask regressions. Treat this as one signal among several, not a verification gate.

A visual model of the system. Most business leaders don't read code, and most process automation systems aren't reviewed by the people who understand the process best. A clear visual representation of what the agent does, what data it touches, and where humans remain in the loop acts as back pressure at design time. SMEs reviewing that map can catch process errors that an eval suite would miss.

Continuous feedback. Every production trace is a candidate for improving your evaluation set. Cases where the agent was wrong, cases where it was right but for unclear reasons, cases that were escalated to humans — all of it is signal. The loop from production back to evaluation data is how agent performance improves over time rather than decaying.
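The production-to-evaluation loop can be sketched as a filter over traces. The `Trace` fields and the selection rules below are assumptions made for illustration. The sketch covers only escalations and agent/human disagreements; cases that were right for unclear reasons would need a separate review signal that is not modelled here.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Trace:
    case_id: str
    agent_decision: str
    human_decision: Optional[str]  # None if no human reviewed the case
    escalated: bool


def select_eval_candidates(traces: List[Trace]) -> List[Tuple[str, str]]:
    """Pick traces worth sending to SMEs for labelling, with the reason."""
    candidates = []
    for trace in traces:
        if trace.escalated:
            reason = "escalated"
        elif trace.human_decision is not None and trace.human_decision != trace.agent_decision:
            reason = "disagreement"
        else:
            continue  # agreed with the human, or never reviewed
        candidates.append((trace.case_id, reason))
    return candidates


traces = [
    Trace("t1", "approve", "approve", escalated=False),
    Trace("t2", "approve", "decline", escalated=False),
    Trace("t3", "refer", None, escalated=True),
    Trace("t4", "decline", None, escalated=False),
]
candidates = select_eval_candidates(traces)
```

Selected cases still need an SME label before they join the evaluation set; the filter only decides where scarce review time goes first.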

Conclusion

The reason coding agents are the flagship use case isn't that software development is uniquely suited to AI. It's that software development already had a verification infrastructure built up over decades: compilers, type systems, linters, test frameworks, CI pipelines. That infrastructure wasn't built for AI; it was built because software quality is hard to maintain at scale. Coding agents inherited all of it for free.

Knowledge work is only beginning to build the equivalent. The golden dataset is the oracle. The SME review process is the test suite. Human-in-the-loop routing is the CI gate. These aren't workarounds — they are the infrastructure, and they take real time and investment to build well.

This doesn't mean AI agents in business processes are years away or overhyped to the point of uselessness. It means the organizations that will see real, durable results are the ones doing the unsexy work of building verification infrastructure before scaling automation. The ones skipping that step and deploying on vibes will get the demo but not the outcome.

The ceiling on autonomous agents in knowledge work is real. But it's a function of verification quality, not model capability. That's actually good news — it means the lever is in your hands.