Agent autonomy is usually pictured as a ladder — a low rung where the agent does a little on its own, a high rung where it runs nearly unattended, and the job is to climb. There’s a sharper way to read the same progression. Each step up isn’t more autonomy in some general sense; it’s one specific decision moving out of human judgment, or out of code, and into the agent.

That changes the question worth asking. Not “how autonomous does this sound,” but which decision moved — and unlike a vague autonomy level, that’s something you can point to exactly: which tool the agent gets to pick, say, or whether it defines the task at all. Each phase moves one.

Agent maturity is a matter of which decision boundary you moved into the agent.

There are five of those boundaries worth naming. Each phase hands the agent exactly one more: the form of an artifact, a judgment inside a fixed route, the path through a tool surface, the correction of its own past output, and finally the shape of the task itself. The phases stack — each assumes the one below — but the thing that defines each is the single boundary it crosses, not the total amount of freedom the system appears to have.

At a glance

| Phase | Decision handed over | What stays fixed | Example | Failure mode |
| --- | --- | --- | --- | --- |
| 1 | The form of an artifact | You supply context, you consume the output | Generate a landing page, an app scaffold, a draft | Works only from what you paste; no tool or live lookup |
| 2 | One bounded judgment inside a route | Code decides what context to gather and what happens next | Classify an alert; server-side triage | The route can’t adapt to an event it wasn’t shaped for |
| 3 | Which tools to call, in what order | The task and its boundaries | Local triage exploring a wide MCP surface | Variability: a different path, and cost, each run |
| 4 | Whether its own output was wrong, across runs | The rules, and who authors new ones | Self-evolving triage | Sharpens existing rules; can’t author missing ones |
| 5 | The task graph itself | The outcome, accountability, the permission ladder | Find incidents that hit us but never paged anyone | The fuzzier the goal, the harder “done / correct” is to judge |

The rest of the post takes each phase in turn — what moves into the agent, what stays out of it, and where each one breaks down.

Phase 1 — Generate the artifact

The first thing anyone hands an agent is the form of a thing they want made. You describe a website, an app scaffold, a report, a function, and the model produces it — refined over as many turns as it takes. The defining property is that there’s no durable system boundary: the agent isn’t wired into anything live. You bring the context yourself, and you do something with the artifact yourself; what the model decides is how the thing looks and reads. Everything else — what context it had, what happens to the output next — stays with you.

This is the phase most people mean when they say “I use AI”: the consumer chat experience, which is genuinely useful for a huge range of one-off artifacts.

The limit is the flip side of that: the agent can only work from what you put in front of it. There’s no tool, no MCP, no live lookup behind it — ask for this month’s numbers and it can only reuse whatever you pasted, never go and pull them. If a fact wasn’t in your input, it won’t be in the output, and nothing in the loop will fetch it. That single limit — no reach beyond the conversation — is what the next phase exists to lift.

Phase 2 — A bounded judgment inside a fixed route

The second boundary is the first one that lives inside software. Here the agent is one stage in a pipeline: code gathers the context, code invokes the model for a single bounded judgment, and code decides what happens to the result. The agent doesn’t choose what to read or what to do next. It reads what it’s handed and returns a constrained answer.

The thing that separates this from Phase 1 chat is programmatic context assembly. You’re not pasting a log into a box — your code pulls it, through an MCP, a direct API, or a CLI, shapes it, and feeds it in. The classify stage of a triage funnel is exactly this: one model call, no tools, emit a category tag, and a deterministic router acts on the tag. So is server-side triage on a hot path, where the client pre-fetches the APIs it already knows the alert needs and hands the model a compressed payload to interpret. In both, the route is fixed in code — the human pre-decided the path; the agent supplies judgment at one point on it.
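The classify stage described above can be sketched in a few lines. This is a minimal illustration, not a real triage funnel: `call_model` is a hypothetical stand-in for whatever model client you use, and the category names, alert fields, and router actions are all assumptions made up for the example.

```python
from typing import Callable

# Hypothetical category set; a real funnel would define its own.
CATEGORIES = {"noise", "known_issue", "needs_human"}


def classify_alert(alert: dict, call_model: Callable[[str], str]) -> str:
    """One bounded judgment: code assembles the context, the model returns a tag."""
    # Context assembly is done by code, not by the model.
    prompt = (
        "Classify this alert as one of: " + ", ".join(sorted(CATEGORIES)) + "\n"
        f"service={alert['service']} message={alert['message']}"
    )
    tag = call_model(prompt).strip().lower()
    # Constrain the answer: anything off-list falls back to the safe default.
    return tag if tag in CATEGORIES else "needs_human"


def route(tag: str) -> str:
    """Deterministic router: code, not the model, decides what happens next."""
    actions = {
        "noise": "suppress",
        "known_issue": "link_runbook",
        "needs_human": "page_oncall",
    }
    return actions[tag]


def fake_model(prompt: str) -> str:
    # Stand-in for a real model call so the sketch runs offline.
    return "Noise" if "disk usage" in prompt else "something unexpected"
```

The model never sees a tool and never picks the next step. An answer outside the allowed set is treated as `needs_human`, so a surprising model output degrades to the cautious branch instead of an unknown one.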

That fixed route is the boundary, and it’s also the limit. The route handles what it was shaped for, and that’s its scope. An alert that needs a step the pipeline doesn’t have falls outside what the design can reach, because the agent stays on-script by construction. That rigidity is the point on a high-volume hot path — predictable, cheap, fast — and it’s the limit the next phase exists to lift.

Phase 3 — Hand over the path

The third boundary is the one most people picture when they hear “agent.” You give the model a tool surface — several MCPs, a shell, a set of APIs — and let it decide which tool to call, in what order, when to backtrack and try a different hypothesis. The route stops being something code wrote ahead of time and becomes something the agent discovers at runtime.

This is the right shape for work where the path isn’t knowable in advance. A local co-oncall investigating a novel incident wants the wide loadout precisely because you don’t know which system the next incident will pull you into — so you give it more tools rather than fewer and let it pick. Discovery is the feature.

What still stays fixed is the task and its boundaries. A human or a webhook set the goal of this run, and the agent’s blast radius is bounded by code, not by prompt — read-only by default, writes on an allow-list, the dangerous things simply absent from its reach. Handing over the path is not handing over the authority.
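One way to picture “bounded by code, not by prompt” is a thin gate in front of the tool surface. The sketch below is an assumption about how such a gate could look; the `ToolGate` class and the tool names are hypothetical and not taken from any particular agent framework.

```python
from typing import Callable


class ToolGate:
    """Wraps the tool surface so permissions live in code, not in the prompt."""

    def __init__(
        self,
        tools: dict[str, Callable[..., object]],
        read_only: set[str],
        write_allow_list: set[str],
    ) -> None:
        self._tools = tools
        self._read_only = read_only
        self._write_allow_list = write_allow_list

    def call(self, name: str, *args, **kwargs):
        # A tool that was never registered is simply absent from the agent's reach.
        if name not in self._tools:
            raise PermissionError(f"tool not available: {name}")
        # Reads pass by default; anything else must be explicitly allow-listed.
        if name not in self._read_only and name not in self._write_allow_list:
            raise PermissionError(f"write not allow-listed: {name}")
        return self._tools[name](*args, **kwargs)


# Hypothetical loadout: one read tool, one allowed write, one blocked write.
gate = ToolGate(
    tools={
        "get_logs": lambda service: f"logs for {service}",
        "add_comment": lambda text: "ok",
        "restart_service": lambda service: "restarted",
    },
    read_only={"get_logs"},
    write_allow_list={"add_comment"},
)
```

The agent is free to choose any order of calls through `gate.call`, but no wording in the prompt can make `restart_service` succeed. That is the difference between handing over the path and handing over the authority.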

The trade-off comes bundled with the freedom: the route varies, and so does the cost. The same task can take a different path each run, and how far the agent explores, and therefore what each run costs, isn’t fixed ahead of time. Locally, where you’re reading along and exploration is the point, that variability is exactly what you want. On a hot path, where every run should be predictable and cheap, the same variability is the reason you’d reach for Phase 2 instead. That’s why the two aren’t a ranking; they’re fits for different jobs.

Phase 4 — Hand over the correction

The fourth boundary is a subtle one, and it’s easy to conflate with something simpler. The new decision the agent gets is whether its own past output was wrong, and how to tighten the rule behind it — and the word that carries the weight is past. An agent that reflects mid-run and retries is still Phase 3; reflection inside one run is just a smarter path. Phase 4 is a closed loop across runs: findings accumulate, a patch lands somewhere durable, and future behavior actually changes.

The self-evolving triage agent is this phase in full. A different model family audits every response and tool call, the aggregator clusters findings across many events, a patch is validated against past runs, and the result is committed to a branch from which a human cherry-picks. Persistence is what makes it Phase 4. Without somewhere for the correction to land, what you have is still Phase 3: reflection that improves a single run but doesn’t carry over to the next.
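Two pieces of that loop are easy to show in miniature: clustering findings across runs, and replaying a candidate patch against past runs before it goes anywhere durable. The sketch below is only an illustration. The field names (`rule_id`, `input`, `expected`), the recurrence threshold, and the idea that past runs carry a reviewed expected answer are all assumptions, not details of the actual system.

```python
from collections import Counter
from typing import Callable


def recurring_findings(findings: list[dict], min_count: int = 3) -> list[str]:
    """Cluster audit findings by the rule they implicate; keep the ones that recur."""
    counts = Counter(f["rule_id"] for f in findings)
    return [rule_id for rule_id, n in counts.items() if n >= min_count]


def patch_passes_replay(
    patched: Callable[[str], str], past_runs: list[dict]
) -> bool:
    """Replay past inputs through the patched behavior.

    Assumes each past run stores the input and a reviewed expected answer.
    """
    return all(patched(run["input"]) == run["expected"] for run in past_runs)
```

A single odd finding doesn’t trigger a patch here; only a rule that keeps showing up across events does. And a patch that breaks a previously correct run fails the replay, so it never reaches the branch a human reviews.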

What stays fixed is the rules themselves. The loop sharpens rules that already exist — tightens wording, adds a counter-example, promotes “should” to “must” — but it can’t write a rule that was never there; that still takes a human. And it’s only as good as its eval: a check that shares the agent’s blind spots — which a same-model self-audit does — tends to agree with the agent instead of catching what it missed. So the judgment has to come from somewhere that doesn’t share those blind spots, like a second model. Get the eval right and the loop sharpens in the right direction.

Phase 5 — Hand over the task graph

The last boundary is the one the previous four were holding the line on: the shape of the task itself. Up through Phase 4, someone — a human, a router, a webhook — defined the task and the agent executed it well. In Phase 5 you hand over a fuzzy outcome and let one or more agents pursue it — working out for themselves what the subgoals are and in what order, instead of executing a task someone already shaped.

Take a fuzzy goal like find the incidents that hit us but never paged anyone — the mirror image of alert noise: under-alerting instead of over. Nobody can hand the agent a checklist for that; it has to build one. To even establish what counts as a missed page, it cross-references two sources — the status page for what actually broke and when, and PagerDuty for what actually paged and when — lines them up on a timeline, and isolates the incidents with no page nearby. Then it digs into each one: was the alert muted, was the threshold set too high to ever fire, or was there no alert for that signal at all? It comes back with a conclusion — here are the blind spots, and the likely reason for each. None of those steps were specified. Deciding which steps exist, and in what order, is the work — and it’s the boundary that defines the phase.
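To make the timeline step concrete, here is roughly what one of the agent’s self-chosen steps might reduce to once written out. Nobody hands the agent this function; it is a sketch of the kind of check it would have to arrive at on its own. The 30-minute window, the `started_at` field, and matching purely on time without regard to service are assumptions for illustration, and fetching data from the status page or PagerDuty is left out entirely.

```python
from datetime import datetime, timedelta


def unpaged_incidents(
    incidents: list[dict],
    page_times: list[datetime],
    window: timedelta = timedelta(minutes=30),
) -> list[dict]:
    """Return incidents with no page within `window` of their start time."""
    missed = []
    for incident in incidents:
        start = incident["started_at"]
        # An incident counts as paged if any page landed close to its start.
        paged = any(abs(page_time - start) <= window for page_time in page_times)
        if not paged:
            missed.append(incident)
    return missed
```

The output of a step like this is only the candidate list. The muted-alert, threshold, and missing-alert questions still have to be asked of each incident it returns.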

It’s worth being precise about where that line falls, because it’s easy to mistake for something else. It isn’t the number of steps — a Phase 3 investigation is already multi-step, self-directed, and analytical. It isn’t self-correction — that’s Phase 4, getting better at a given task across runs. What’s new in Phase 5 is that no task was given at all, only an outcome, and the agent had to invent the task list to reach it. A Phase 4 triage agent can run forever and never decide what its task is; deciding that is exactly what a Phase 5 agent does.

Whether that runs in one agent or a few — one pulling status-page history, one pulling PagerDuty, one reconciling the two — is an implementation choice, not what defines the phase. A single agent can decompose a fuzzy goal, and several agents can just as easily run a Phase 3 investigation in parallel. What makes it Phase 5 is that the subgoals weren’t handed over — outcome decomposition, not headcount.

What stays human is the part that was never the agent’s to take. The agent runs the investigation and brings back findings, but a person reviews them — and signs off before any alert actually gets muted, retuned, or created, on the same permission ladder the earlier phases use. You set the outcome; the agent worked out how to chase it; the accountable calls stay with a human. Run it on a cadence and it keeps surfacing new blind spots as the system changes — still proposing, never committing on its own.

Evaluating an open-ended agent

An open-ended agent is the hardest of the five to trust — and not because it has free rein. It doesn’t: the open-endedness is in the goal, not in what the agent is allowed to do. The hard part is that a fuzzy goal gives you nothing to check the answer against. If it reports five blind spots, how do you know there weren’t eight — or that the five are even real? A defined task tells you when it’s done; find what we’re missing never does.

You can’t grade the goal, but you can grade what the agent hands back: each blind spot it reports. Take them one at a time. “Incident X had no page because alert Y was muted” is a specific claim you can go verify: did X happen, was there really no page, was Y really muted? Checking the ones it reported (its precision) is the easy half, and you can do it on every run. The harder half is what it missed (its recall). For that you need a set of incidents you already know slipped through, chosen to cover the different ways a page goes missing rather than just a few easy examples, and you watch whether it catches them. And because it chose its own steps, skim the log too: now and then the answer is right but the reasoning only got there by luck, which means it may not get there on the next case. So you sign off on two concrete things: the plan before it runs, and each finding before you act on it. You never sign off on the goal itself, because there’s nothing there to grade.
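The two halves of that check are ordinary precision and recall over sets of incident IDs. A minimal sketch, assuming findings can be keyed by an incident identifier (the `INC-…` IDs below are made up):

```python
def precision(reported: set[str], verified: set[str]) -> float:
    """Share of reported findings that checked out on review."""
    return len(reported & verified) / len(reported) if reported else 0.0


def recall(reported: set[str], known_missed: set[str]) -> float:
    """Share of known slipped-through incidents that the agent caught."""
    return len(reported & known_missed) / len(known_missed) if known_missed else 0.0


# Hypothetical run: four findings reported, three verified on review.
reported = {"INC-1", "INC-2", "INC-3", "INC-4"}
verified = {"INC-1", "INC-2", "INC-3"}
# Hypothetical seeded set of incidents already known to have slipped through.
known_missed = {"INC-1", "INC-2", "INC-7", "INC-9"}
```

Precision can be computed on every run, because it only needs a reviewer to check what came back. Recall is only as meaningful as the seeded set: if `known_missed` covers one failure mode, a high score says nothing about the others.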

None of that works unless you build three things in from the start. The big one is a log: every query the agent ran, every decision it made, every subgoal it set for itself. Without it, a wrong conclusion is a black box, with nothing for you or an eval to go back and inspect. The other two are tempting to defer and shouldn’t be. Cap the steps, cost, and time up front, since a fuzzy goal will expand to fill whatever budget it gets. And if it runs on a schedule, give it memory of what it already flagged; otherwise it hands you the same list every week instead of just what’s new.
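All three pieces are small. The sketch below shows one possible shape for them, with the log and the caps sharing a single object; the class name, the fields, and the choice to raise on overrun are assumptions for illustration rather than a prescribed design.

```python
import time
from dataclasses import dataclass, field


@dataclass
class RunBudget:
    """Logs every step of a run and enforces caps on steps, cost, and time."""

    max_steps: int
    max_cost_usd: float
    max_seconds: float
    steps: int = 0
    cost_usd: float = 0.0
    started: float = field(default_factory=time.monotonic)
    log: list = field(default_factory=list)

    def record(self, kind: str, detail: str, cost_usd: float = 0.0) -> None:
        """Log one query, decision, or subgoal, then check the caps."""
        self.steps += 1
        self.cost_usd += cost_usd
        # The entry is written before any cap fires, so the overrun is inspectable.
        self.log.append({"step": self.steps, "kind": kind, "detail": detail})
        if self.steps > self.max_steps:
            raise RuntimeError("step budget exceeded")
        if self.cost_usd > self.max_cost_usd:
            raise RuntimeError("cost budget exceeded")
        if time.monotonic() - self.started > self.max_seconds:
            raise RuntimeError("time budget exceeded")


def only_new(findings: list[str], already_flagged: set[str]) -> list[str]:
    """Memory across scheduled runs: report only what was not flagged before."""
    fresh = [f for f in findings if f not in already_flagged]
    already_flagged.update(fresh)
    return fresh
```

In a scheduled setup, `already_flagged` would be loaded from and saved to durable storage between runs; here it is just a set that the caller keeps.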

What stays human, all the way up

Notice what each phase kept on the human side, because it’s the same thing every time, named differently:

  • Phase 1 keeps the context and the use of the output.
  • Phase 2 keeps the route.
  • Phase 3 keeps the task and the blast radius.
  • Phase 4 keeps the rules and authorship of new ones.
  • Phase 5 keeps the outcome definition and the permission ladder.

The boundary moves; the accountability doesn’t. At no phase did the agent acquire the authority to spend money, alter another party’s data, or own an irreversible call without a human at the gate. Climbing phases hands over more of the path. It never hands over the responsibility — that line is enforced in code, not in the prompt, and it sits across all five phases unchanged.

Which phase fits

The phases are numbered, but the numbering isn’t a ranking. A higher phase isn’t a better system — it’s a different fit for a different kind of problem.

Letting an agent decompose a goal whose path you already knew is more machinery than the problem needs — you pay variability and an eval burden to rediscover something a few lines of code could have settled. The opposite error is freezing a fixed route onto an investigation whose path you can’t know in advance, which boxes in work that needed room to explore. The local-vs-server split is this choice made well: the same reasoning core explores freely on your laptop and follows a fixed, pre-picked path server-side, because the jobs differ, not because one is more evolved.

Three questions place a problem, without needing the numbers:

  • Can the path be written down ahead of time? If yes, fix it in code — a set route is cheaper and more predictable than letting an agent rediscover it on every run. If no, hand over the tools and let the agent find the path.
  • Does the agent’s output have a trustworthy check? Self-correction only earns its place where something with a different blind spot can tell the agent it was wrong. Without that, a self-improvement loop just reinforces the agent’s own blind spots.
  • Is what you’re handing over a task, or an outcome? A defined task the agent can execute directly. Only a genuinely fuzzy outcome calls for letting it build the plan itself — and only when there’s a check underneath that catches a plan gone wrong.
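Written as code, the three questions are a literal transcription of the bullets above into yes/no inputs. The function name, parameter names, and advice strings are hypothetical, and real problems rarely answer this cleanly; treat it as a reading aid, not a decision engine.

```python
def place_problem(
    path_known: bool, independent_check: bool, is_outcome: bool
) -> list[str]:
    """Turn the three yes/no answers into plain-language guidance."""
    advice = []
    # Question 1: can the path be written down ahead of time?
    advice.append(
        "fix the route in code"
        if path_known
        else "hand over the tools and let the agent find the path"
    )
    # Question 2: is there a check with a different blind spot?
    advice.append(
        "a correction loop can earn its place"
        if independent_check
        else "skip the self-improvement loop for now"
    )
    # Question 3: task or outcome? Planning needs a check underneath it.
    if not is_outcome:
        advice.append("give the agent the defined task")
    elif independent_check:
        advice.append("let the agent build the plan")
    else:
        advice.append("build a check before handing over the plan")
    return advice
```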

The right phase is the lowest one that fits the problem. Everything above that is cost you’re paying for autonomy the problem didn’t ask for.

Closing

The ladder framing asks “how autonomous is your agent.” A more useful question sits underneath it. Autonomy isn’t a single level; it’s a set of specific boundaries, each in its own place, and the real shape of a system is which ones you moved.

Five boundaries, five phases: the form of an artifact, a judgment inside a fixed route, the path through a tool surface, the correction of past output, the shape of the task. Each phase hands over exactly one. None of them hands over accountability — that stays code-enforced and human-owned at every level.

So the question to ask of any agent system isn’t where it lands on a ladder. It’s: which decision did you move into the agent, was that the right one to move, and is the boundary that should have stayed put actually still there.