Thesis · 2026
The hard part was never the model.
It was everything around it.
Every six months a new model lifts another benchmark. SWE-bench. GPQA. The bar exam. Math olympiads. The frontier moves fast enough that any prediction we made about agent capability in 2023 was wrong by 2024 and embarrassing by 2025.
Then you put one of those models inside your business, and it fails three times out of four.
Not because the model is dumb. The same model that wrote your code yesterday cannot reliably file a refund today. The smartest agent on Earth still cannot see how your business actually runs. It does not know who owns the account. It does not know what changed last Tuesday. It does not know which policy applies when a customer has two open tickets, a partial refund, and an enterprise SLA.
Those facts do not live in a doc. They live in twelve systems and two Slack threads. They change every day.
Without that picture, every action is a guess.
The capability frontier is moving. Adoption is not.
MIT NANDA's 2025 GenAI Divide report: 95% of enterprise GenAI pilots show no measurable P&L impact. Gartner: 40% of agentic AI projects will be cancelled by the end of 2027. McKinsey: 80% of organizations have already encountered risky behavior from AI agents. Only about one in twenty has agents fully deployed.
The best frontier model on τ³-bench, the leading real-world agent benchmark, hits 25%. Hand it the exact reference documents and it crawls to 40%.
The math underneath is unforgiving. An agent that gets each step right 99% of the time still finishes a 50-step task only 60% of the time. Series reliability multiplies. You cannot out-model multiplication.
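Written out, alongside the same multiplication one notch worse, at 95% per step:

```latex
0.99^{50} \approx 0.605, \qquad 0.95^{50} \approx 0.077
```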
This is not a story you fix by waiting for the next model.
Everyone is reaching for the wrong fix.
Most “agents” shipping today are spam machines with slightly better grammar. The internet is drowning in AI slop. That is not progress. It is noise pretending to be progress.
The reflex of the industry has been to throw more model at the problem. Bigger context. More tokens. More agents talking to more agents. None of it fixes the underlying shape.
A million-token context window still does not tell you which fact in the context is true today. A swarm of agents hits the same compounding-error wall. Self-correction makes things worse, not better. The model cannot find its own errors.
The bottleneck is not capability. It is everything that surrounds capability.
What does an autopilot have that a refund agent does not?
Redundant sensors. Declared rules. A failure mode that prefers a controlled abort over a wrong move. An audit chain that closes the loop on every action. We trust an autopilot to fly a plane through a storm. We do not trust an AI to file a refund. The gap is not the model. It is everything underneath.
What Cortal is.
Cortal is a runtime that sits between your agents and your systems. Every fact and action inside the workflow passes through it. We check four things, before and after.
Does the agent know what counts as the job done right?
Not a vague goal. A concrete definition: who owns this account, what counts as resolved, what the policy says, what the customer was promised. Drawn from your real systems, structured, current.
Is the agent allowed to take this specific action right now?
Not a broad API token. A scoped, time-bounded permission tied to the operator who authorized the work. A $10k refund to a stranger gets refused, even when the API token says read-write.
Is the agent right enough to act?
Not what the model claims about itself. What its actual track record says on this exact kind of decision, in this exact kind of context. Below the bar, the agent stops and tells you why. Above the bar, it proceeds. Models lie about their confidence. Cortal makes them prove it.
Did the action actually work?
Not whether the API returned 200. Whether the customer was helped, the deal moved, the ticket resolved. Outcomes flow back. Every result sharpens the next decision.
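For the engineers, a minimal sketch of those four checks in code. Every name below is hypothetical, an illustration of the shape, not Cortal's API.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class Grant:
    """A scoped, time-bounded permission tied to the operator who authorized the work."""
    operator: str
    action: str            # e.g. "refund"
    max_amount: float      # ceiling for this grant, in dollars
    expires_at: datetime

@dataclass
class Decision:
    action: str
    amount: float
    confidence: float      # measured hit rate on this decision type, not the model's self-report

def gate(decision: Decision, grant: Grant, job_spec: dict, floor: float = 0.95) -> str:
    """The four checks, in order. Returns 'proceed' or the reason for stopping."""
    # 1. Does the agent know what counts as the job done right?
    if decision.action not in job_spec:
        return "abstain: no definition of done for this action"
    # 2. Is the agent allowed to take this specific action right now?
    if (grant.action != decision.action
            or decision.amount > grant.max_amount
            or datetime.now(timezone.utc) > grant.expires_at):
        return f"refuse: outside the grant {grant.operator} authorized"
    # 3. Is the agent right enough to act?
    if decision.confidence < floor:
        return f"abstain: track record {decision.confidence:.2f} is below the bar"
    # 4. "Did it work?" runs after execution, against the job spec, not the HTTP status.
    return "proceed"

grant = Grant("maria@example.com", "refund", 500.0,
              datetime(2026, 12, 31, tzinfo=timezone.utc))
spec = {"refund": "customer confirms resolution; ledger and CRM agree"}
print(gate(Decision("refund", 10_000.0, 0.99), grant, spec))
# -> refuse: outside the grant maria@example.com authorized
```

The shape is the point: permission lives in the grant, not the API token, and confidence is a measured record, not a self-report.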
Concretely: a refund agent reads from Stripe, Salesforce, and your internal CRM. The CRM says enterprise, Stripe says basic-tier. Cortal stops the agent, surfaces the conflict, and waits for a human. Later the same agent wants to auto-approve a credit against a pattern Cortal has seen fail more often than it has succeeded. The evidence does not clear the bar; Cortal makes the agent abstain and explain why.
When a downstream step fails after an earlier one succeeded, Cortal runs the declared reversal where one exists, blocks unsafe follow-on writes where it does not, and reports the trail. When policy blocks an action, Cortal cites the specific rule that refused it.
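That unwind is the familiar compensation pattern, sketched here with hypothetical names:

```python
def run_steps(steps, reversals):
    """steps: ordered (name, action) pairs; reversals: name -> declared reversal.
    On failure, completed steps with a declared reversal are unwound; steps
    without one are frozen and escalated instead of guessed at. The audit
    trail is returned either way."""
    done, trail = [], []
    for name, action in steps:
        try:
            action()
        except Exception as err:
            trail.append((name, f"failed: {err}"))
            for prior in reversed(done):
                if prior in reversals:
                    reversals[prior]()
                    trail.append((prior, "reversed"))
                else:
                    trail.append((prior, "no declared reversal: frozen, escalated"))
            break
        done.append(name)
        trail.append((name, "ok"))
    return trail

trail = run_steps(
    [("charge", lambda: None), ("provision", lambda: 1 / 0)],  # second step fails
    {"charge": lambda: None},                                  # declared reversal for the charge
)
# [('charge', 'ok'), ('provision', 'failed: division by zero'), ('charge', 'reversed')]
```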
No guesses. No silent failures. No customer surprises.
Open-loop systems forget. Closed-loop systems compound.
Cortal logs what every agent saw, what it decided, and what happened next. When reality contradicts a stale doc, the doc loses priority. When a human corrects an answer, Cortal records the correction with source, owner, and age, so the next agent knows what to trust.
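One way to picture that correction record, fields hypothetical:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Correction:
    claim: str             # what the stale doc or the agent asserted
    corrected_to: str      # what the human said instead
    source: str            # where the correction came from: ticket, thread, review
    owner: str             # who stands behind it
    recorded_at: datetime  # age, so a fresh human correction outranks a stale doc
```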
Week one is fine. Week ten is your best rep.
The ceiling is every place a real decision has real consequences. Eventually that includes physical autonomy. The same questions, asked of an agent that moves things in the world.
We start narrow because the math only works when we know one company cold.
Where this ends up.
Human progress means humans stop doing the work machines can do. Autonomous agents should make that possible. Right now they cannot. We are building the layer that lets them act, on behalf of real people.
The tedious work goes to the machines. Humans go back to building, designing, planning, and creating the things that matter.
Long term, every autonomous system, physical or digital, eventually runs on something like this. Customer support and back office this year. Coding agents and research agents next. Then field operations, logistics, fleet control. Then anything that decides and acts with real consequences. The shape of the answer does not change. Does the agent know the job? Is it allowed to act? Is it right enough to act? Did the action work?
The next century should be humans doing the work only humans can do. On Earth, and not just on Earth.
That seems obviously worth doing.
Who we are looking for.
If you are running an AI agent or building one, and the failures are landing in your support queue, your Slack, or your churn data, we want to get this into your hands and work with you in as much depth as it takes.
B2B SaaS or services. 30 to 200 people.
If you are worried about uncontrolled agents, we should be talking tomorrow.
