Most production AI agents are static. The prompt you ship is the prompt you keep — until somebody reads enough output to notice a recurring failure, edits by hand, and ships a fix. The cycle takes weeks; most failures slip through it.

A self-evolving agent shortens that loop, but it has to solve one structural problem first: a model auditing its own output is bad at it. Same-model audit produces self-deception — the auditor rationalises the same blind spot the triage agent just exhibited. The fix isn’t more prompt engineering; it’s a different model.

The setup: a triage system that processes incoming events — a PagerDuty alert, a metric anomaly, an error report; anything that lands in a queue and needs a structured first response. In our design, the triage agent — the model that drafts the response — runs on Claude Opus; the auditor runs on OpenAI GPT — a different model family, deliberately. The second agent watches every response and every tool call, accumulates findings, clusters them, and writes patches to the triage agent’s prompt. The patches self-validate before commit. The team’s job is reduced to cherry-picking the commits that look right.

The cross-model trick

The auditor’s biases are different from the triage agent’s. When the triage agent fabricates a citation URL — a known failure mode on this kind of output — the auditor flags it as a hallucination, because it doesn’t share that tendency. Detection isn’t because the auditor is smarter; it’s because the failure is in the triage agent’s blind spot but not in the auditor’s.

This is the part most “self-correcting agent” pitches skip. Same-model is not enough. A second model from a different family is the cheapest mechanism that gets you out of the self-deception trap, and the loop downstream of it only works if this part is right.

A natural follow-up: if same-model audit fails, who audits the auditor? It’s an LLM too, with biases of its own. The honest answer is that nothing does — but the design works around it. The auditor’s job is to flag, not to decide: it produces findings, the aggregator clusters them, regression tests on real events validate any patch before commit, and a human cherry-picks. Auditor bias becomes false positives that get filtered out by the cluster threshold, the regression tests, or the cherry-pick gate. The downstream layers are the ground-truth checkpoints; the auditor doesn’t need to be one.

The cost is small. One subprocess for the triage agent; one queue plus one worker for the auditor. The audit runs asynchronously after the note is posted, off the hot path. From the user’s perspective, it’s invisible.
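
To make the shape concrete, here is a minimal sketch of the off-hot-path audit in Python, assuming an in-process queue and the OpenAI SDK; the model name, prompt, and helper names are placeholders, not the production wiring.

import json
import queue
import threading

from openai import OpenAI

RULES = "..."                      # the triage prompt's rules, passed verbatim to the auditor
audit_queue: "queue.Queue[dict]" = queue.Queue()
findings_log: list[dict] = []      # stand-in for wherever findings actually land
client = OpenAI()                  # the auditor deliberately lives in a different model family

def enqueue_audit(event_id: str, note: str, trace: list) -> None:
    # Called right after the note is posted; nothing here blocks the hot path.
    audit_queue.put({"event_id": event_id, "note": note, "trace": trace})

def audit_worker() -> None:
    while True:
        job = audit_queue.get()
        prompt = (
            "Audit this triage note and its tool-call trace against these rules.\n"
            f"RULES:\n{RULES}\n\nNOTE:\n{job['note']}\n\n"
            f"TRACE:\n{json.dumps(job['trace'], indent=2)}\n\n"
            "Return one JSON finding per rule: verdict, evidence."
        )
        resp = client.chat.completions.create(
            model="gpt-4o",        # placeholder; any model outside the triage agent's family
            messages=[{"role": "user", "content": prompt}],
        )
        findings_log.append({"event_id": job["event_id"],
                             "raw": resp.choices[0].message.content})
        audit_queue.task_done()

threading.Thread(target=audit_worker, daemon=True).start()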

What gets audited: content vs. process

A triage prompt has two kinds of rules, and they need different ground-truth sources to audit:

Rule type   What it constrains   Audit source
---------   ------------------   --------------------
Content     What the note says   The note text
Process     What the agent did   The compliance trace

Content rules are the obvious ones. “Do NOT fabricate citation URLs.” “Use this exact section structure.” “Never speculate about config the agent didn’t read.” The auditor reads the response and checks.

Process rules are harder. “For events of category X, query the canonical data source first.” “Log queries MUST be filtered by the event’s identifier.” Reading the response isn’t enough — the response might say “confirmed via the canonical source” without the agent actually having queried it, or query it without mentioning it. To audit process rules, you need a record of what the agent actually did.

That record is the compliance trace — an ordered list of every tool invocation the agent made, with arguments and results, captured in the same run that produced the response. The auditor receives both inputs (response + trace) and reports findings against each independently.

A finding looks like this:

{
  "rule": "For category X, the canonical source MUST be queried before synthesis.",
  "rule_sha": "f3a1c9d2",
  "verdict": "violated",
  "evidence": "Trace contains no canonical-source call; response cites it as confirmed.",
  "event_id": "ev-18095"
}

A trace excerpt looks like this:

1. mcp.metrics.query(filter={endpoint:"abc"}, range="15m")  → 200, 47 series
2. mcp.logs.search(query="severity:error", range="15m")     → 200, 312 hits
3. (no canonical-source call)
4. response posted: "...confirmed via the canonical source..."

The trace is what catches the silent failures: a query the agent ran without mentioning, a tool invoked with the wrong argument shape, a check the rule said to do that didn’t happen at all. None of these are visible in the response alone.
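
To show what that buys in practice, here is a toy version of one process-rule check, written directly in Python over a trace modelled as a list of tool calls. In the real loop the auditor LLM makes this judgment; the deterministic version below only illustrates what the trace makes visible that the response alone does not. The tool name mcp.canonical.lookup is hypothetical.

from dataclasses import dataclass

@dataclass
class ToolCall:
    tool: str          # e.g. "mcp.metrics.query"
    args: dict
    result: str        # status + summary, as captured in the original run

RULE = "For category X, the canonical source MUST be queried before synthesis."
RULE_SHA = "f3a1c9d2"

def check_canonical_source(trace: list[ToolCall], note: str, event_id: str) -> dict:
    queried = any(call.tool == "mcp.canonical.lookup" for call in trace)   # hypothetical tool
    claimed = "canonical source" in note.lower()
    if queried:
        return {"rule": RULE, "rule_sha": RULE_SHA, "verdict": "followed",
                "evidence": "Canonical-source call present in trace.", "event_id": event_id}
    evidence = "Trace contains no canonical-source call"
    evidence += "; response cites it as confirmed." if claimed else "."
    return {"rule": RULE, "rule_sha": RULE_SHA, "verdict": "violated",
            "evidence": evidence, "event_id": event_id}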

From findings to patches

A single finding is noise. A finding that recurs across many events is signal — one mislabelled response could be a fluke; the same shape across N runs is a rule problem. Only signal justifies a prompt patch.

The aggregator clusters findings by rule_sha (a stable hash of the rule + failure shape), drops singletons, and only proposes a patch once a cluster crosses a threshold (> N occurrences). For each surviving cluster it spawns a headless agent — same model family as the triage agent — running under a strictly limited tool surface: read, edit, and stage changes only; no commit, push, or shell. The agent stages a patch under one of two paths:

  • The prompt itself — the rule was unclear or missing a counter-example. Tighten the wording, add a counter-example, or promote it from “should” to “must.” (e.g. “don’t speculate about config” → “don’t claim a config value unless the config-tool returned it this run.”)
  • A referenced skill or tool doc — the rule is clear, but the doc the agent reads at runtime misled it (wrong example, ambiguous “when to use”). Edit the doc, not the prompt.
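
Before either path, there is the gate itself. A sketch of the aggregator's clustering and threshold step, with the patcher hand-off stubbed out; the names and the value of THRESHOLD are illustrative, and findings are the same dict shape as the finding JSON above.

from collections import defaultdict

THRESHOLD = 3   # illustrative; the real N is a tuning choice

def spawn_patcher(rule_sha: str, cluster: list[dict]) -> None:
    # Stub. The real call launches a headless same-model agent whose tool surface
    # is read, edit, and stage only: no commit, no push, no shell.
    print(f"patch candidate: rule {rule_sha}, {len(cluster)} occurrences")

def propose_patches(findings: list[dict]) -> None:
    clusters: dict[str, list[dict]] = defaultdict(list)
    for f in findings:
        if f["verdict"] == "violated":
            clusters[f["rule_sha"]].append(f)      # same rule + failure shape hash together
    for rule_sha, cluster in clusters.items():
        if len(cluster) <= THRESHOLD:
            continue                               # singletons and small clusters are noise
        spawn_patcher(rule_sha, cluster)           # only recurring signal earns a patch attempt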

A note on why the patcher is same-model. The audit needs a different lens — that’s the whole point of cross-model. Editing the prompt needs a familiar voice. A cross-model patcher fights the triage agent’s own phrasing — it rewrites in its own dialect, which then forces the triage agent’s next read to bridge two voices on every paragraph. Same-model editing fits into the existing prompt without pulling against it. Cross-model where you need a different mind; same-model where you need the same hand.

Self-validation: the part that makes autonomy safe

A patched prompt that fixes the original cluster but breaks five other things isn’t a fix. The validation step exists to catch that.

For prompt patches, before commit:

  1. Cluster regression check. Re-verify each event in the cluster under the patched prompt. The rule that was failing must now pass.
  2. Clean-sample regression check. Re-verify a sample of events that previously passed cleanly. No new rules may flag.

If both pass, the patch commits — to a dedicated proposals branch, never to main — with a rule-sha trailer for dedup so the same fix doesn’t get re-proposed next run.
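
In code, the gate before that commit is small. A sketch, with the re-verification call stubbed; in the real loop it is a prompt-only auditor call that replays the event's stored response and trace under the patched prompt.

def re_verify(event: dict, prompt: str) -> set[str]:
    # Stub. Replays the event's stored response + trace under `prompt` and returns
    # the rule_shas the auditor still flags. One prompt-only call per event.
    return set()

def validate_patch(patched_prompt: str, cluster_events: list[dict],
                   clean_sample: list[dict], rule_sha: str) -> bool:
    # 1. Cluster regression: the rule that was failing must now pass.
    for event in cluster_events:
        if rule_sha in re_verify(event, patched_prompt):
            return False
    # 2. Clean-sample regression: previously clean events must not flag anything new.
    for event in clean_sample:
        if re_verify(event, patched_prompt):
            return False
    return True   # both checks pass: commit to the proposals branch, never to main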

For skill-doc patches, self-validation is skipped — and the asymmetry is worth spelling out. Prompt validation is cheap because the agent’s tool-call trace from the original run can be replayed: the trace is fixed input, only the prompt changes, so checking whether the rule now passes is a single prompt-only call. A skill-doc patch breaks that shortcut. The doc is what the agent reads while deciding which tools to call — change it, and the old trace is no longer representative; the agent would have made different calls. Honest validation requires re-running the agent end-to-end with live tools, which is too expensive inside the autonomous loop. So these patches commit with a validation: skill-change-needs-review caveat in the message body. The human is expected to read them before cherry-picking; the caveat is a deliberate “this one needs your review before merge.”
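
A commit like that might read as follows on the proposals branch; the subject line and wording are invented for illustration, but the two trailers match the conventions above:

fix(skill): tighten "when to use" guidance for the canonical-source tool

The doc's example was ambiguous about when the lookup is required for
category-X events, and the triage agent skipped it while claiming it
had been done. Reworded the guidance and added a counter-example.

rule-sha: f3a1c9d2
validation: skill-change-needs-review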

This split — auto-validate the cheap thing, route the expensive thing to human review — is what keeps the loop autonomous on the easy path while staying honest about which patches are which.

The cherry-pick gate

It would be tempting to say “all patches go through a human approval queue.” That’s the obvious gate, and it’s the wrong one.

The right gate is cherry-pick, not approve-each. The bot commits to a branch, one commit per cluster fix. The reviewer diffs the proposals branch against main, reads what’s there, and cherry-picks the commits that look right. Rejected commits stay on the bot’s branch — they don’t go away unless someone resets it. Cherry-picked commits drop automatically on the next git rebase main, because the aggregator rebases from main at the start of each run.
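
The bot's side of that ritual is a handful of git calls. A sketch in Python, assuming the proposals branch is named agent/proposals and that commits carry the rule-sha trailer shown earlier; both names are this write-up's convention, not a requirement.

import subprocess

def git(*args: str) -> str:
    return subprocess.run(["git", *args], check=True,
                          capture_output=True, text=True).stdout

def start_of_run(branch: str = "agent/proposals") -> set[str]:
    git("checkout", branch)
    # Rebasing onto main silently drops commits whose patches are already in main,
    # which is exactly how cherry-picked proposals disappear from the branch.
    git("rebase", "main")
    # Collect the rule-sha trailers still pending on the branch, so a cluster that
    # already has an open proposal doesn't get a duplicate commit this run.
    log = git("log", "main..HEAD", "--format=%(trailers:key=rule-sha,valueonly)")
    return {line.strip() for line in log.splitlines() if line.strip()}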

A few things this gets right that an approval queue doesn’t:

  • Curation, not approval. A reviewer reads ten commits, takes three, ignores the rest. That’s faster and less attention-burning than approving each in turn.
  • No fight-back. A cherry-picked commit doesn’t get re-proposed, because the bot’s diff against main is now empty for that change. A rejected commit doesn’t get thrown away either; it sits on the branch, ready for re-evaluation if context changes.
  • The branch is the audit log. Every patch the bot has ever proposed is in git history on the branch. There’s no separate review queue to maintain.

The cherry-pick gate has a slightly higher setup cost (a branch, a rebase ritual) than an approval queue. It has a much lower ongoing attention cost — and ongoing attention is what determines whether the loop survives past month two.

Two deployment shapes

Same loop, two ways to run it — and the choice is mostly about who’s paying and where the agent lives.

Server-side, fully automated. A headless runtime where the auditor and patcher both make per-token API calls: the auditor on every event, the patcher fired by the aggregator when a cluster crosses the threshold. In our setup, that’s headless Claude Code as the triage agent + patcher and the OpenAI API as the auditor. The queue gets serviced 24/7.

Local, subscription-backed. A workstation-resident runtime where the cost is a flat subscription and the tool surface is the local environment. In our setup, that’s Claude Code’s loop feature for the triage agent + patcher and Codex CLI as the auditor. The local environment unlocks things the server-side version can’t easily do — read/write the actual repo, run the test suite, view diffs in an IDE. Trade-off: uptime is yours to maintain — 24/7 requires keeping the workstation awake and online.

Cost and the runaway-loop risk

At any nontrivial workload, the subscription comes out cheaper than per-token API billing: the fee is flat, so extra iterations cost nothing at the margin. The server-side version pays in tokens for everything, including the wasted iterations of a regression-test stall: the auditor keeps flagging a cluster, the patcher drafts a fix, regression validation rejects it because the clean-sample check now fails, the patch resets, and the cycle repeats — every iteration paid in tokens. A regression a human would spot in thirty minutes can take the loop a week of API spend to converge on, or never converge.

That’s why a daily budget cap is non-negotiable for the server-side version. Without one, a single bad regression can burn a month of LLM spend in an afternoon. The local version is immune to this exact failure mode — the same stall burns workstation time, not API budget.
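
The cap doesn't need to be fancy. A sketch of the minimum useful version, with the cap value and its placement left as choices:

import datetime

class DailyBudget:
    """Hard stop for the server-side loop: once today's spend crosses the cap,
    nothing new gets fired until tomorrow."""

    def __init__(self, cap_usd: float):
        self.cap_usd = cap_usd
        self.spent = 0.0
        self.day = datetime.date.today()

    def _roll_day(self) -> None:
        today = datetime.date.today()
        if today != self.day:                  # new day, reset the meter
            self.day, self.spent = today, 0.0

    def record(self, cost_usd: float) -> None:
        self._roll_day()
        self.spent += cost_usd                 # add each call's metered cost as it lands

    def allows_more(self) -> bool:
        self._roll_day()
        return self.spent < self.cap_usd

One reasonable placement is a check before each auditor batch and each patcher run, so a stalled regression stops spending the moment the cap is hit rather than at the end of the day.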

The four pieces (cross-model audit, compliance trace, self-validation, cherry-pick gate) work the same in both shapes. What changes is the runtime: who pays, where it lives, and whether the loop runs while you sleep.

What this loop is and isn’t

This system is good at one specific thing: detecting rule-compliance failures and tightening the rules that produced them. A rule already in the prompt isn’t being followed reliably; the auditor sees it across multiple events; the loop sharpens the rule’s wording or the skill doc behind it.

It’s not good at finding rules that don’t yet exist. If the prompt never said “for events of category X, query the canonical source,” the auditor won’t flag the absence of that query. New rules — the kind that come from operator pattern-matching across many events a human reads in aggregate — still need a human writing them. The loop sharpens existing rules; it doesn’t author new ones.

Closing

A static agent works as long as the rules you wrote when you shipped it are the rules you wish you’d written. They never are.

The non-self-evolving fix is a human, periodically — read enough output, notice a pattern, edit the prompt, ship. The cycle is long, attention-expensive, and most failures slip through it.

The self-evolving fix has four structural pieces. Cross-model audit, because same-model audit doesn’t catch what it would have made itself. The compliance trace, so the audit covers what the agent did, not just what it said. Self-validation, so a patch can’t ship if it breaks something that was passing. Cherry-pick gating, because curation scales where per-edit approval doesn’t. Together they let the loop run autonomously on the easy path, honestly on the hard one, and bounded on every path.

The result isn’t an agent that gets smarter on its own. It’s an agent whose rules get sharper on their own — under the watch of a different model, validated against its own previous output, and curated by a team that mostly just reads commits.