The same triage agent can run on your laptop or server-side off an alert webhook, and from the outside it looks like one agent — but how much agency you give the LLM should be completely different on each side.
Locally, you should let the LLM stretch out: hand it the MCPs you’ve set up for investigation, let it pick which tool to call, let it choose its own path, let it backtrack and try a different hypothesis. Discovery is a feature. You’re standing right there — every step shows up on your screen, and you can redirect at any time.
Server-side, you should deliberately take some of that agency back — not as a limit on the LLM, but as a fit to a different job. The local job is deep investigation: root-cause something nobody has seen before, take as long as the case takes. The server-side job is preliminary triage on a hot path: classify the incoming event quickly, pull the obvious data, post a first answer that lets the on-call decide what to do next. Different jobs, different tool surfaces. A free-roaming exploratory agent is the right shape for the first; a narrow, predictable, fast agent is the right shape for the second.
This post is about that split. Same reasoning core, loose on one side, tight on the other. The LLM’s agency shouldn’t look the same on both sides.
The cleanest way to see the split is in terms most engineers already live in: same agent, two modes of teamwork.
Local is pair programming. You and the LLM share one screen, the incident spread between you. It tries a query, you nudge; you notice a thread, it pulls. Either of you catches the other before a wrong path runs too far. The output is better than either of you alone because you’re in sync — discovery is fast precisely because two minds are reacting to the same screen.
Server is solo-on-runbook. Same agent, no peer at the screen. It executes the SOP you and it agreed on last week, when there was time to think. Inside the runbook it moves fast, decisive, capable — same skill, just no improvisation. Outside the runbook it stops and pages a human. The intelligence didn’t shrink; the scope of discretion did.
A note on the premise. The whole argument below assumes the local loadout actually works — that handing an LLM ten MCP servers and a wide command surface produces a competent investigator. A couple of years ago that wasn’t a safe assumption; today it is. The post isn’t about whether to give the LLM agency — that question is settled by the fact that local works. The post is about why a powerful, freely-exploring agent shouldn’t ship server-side in the same shape.
Local: let the LLM be the LLM
Local triage is the pair-programming mode — you’re at the screen, every output in front of you. The agent’s job here isn’t to “execute the steps you wrote.” Its job is to work alongside you: SSH, query, browse, stitch sequences of steps together. The thing that makes it useful is that it picks the right tool for the moment, on its own.
The design rule is simple: err on the side of more tools, not fewer, and let the agent choose. (A sketch of what that loadout can look like follows the list.)
- Install the MCPs you’d actually reach for during an investigation. PagerDuty, Grafana, internal admin, log store, ticket system. The discovery cost is worth paying — you don’t know which system the next incident will pull you into, so don’t pre-decide for the agent.
- Tool overlap is fine. When the same data has both an MCP and a CLI, keep both. The agent picks what it reaches for first; you’ll see the choice in the transcript afterward.
- Don’t economize on context. Local inference latency is part of your reading speed. Adding ten more tool results to the context doesn’t feel slower to you, but it gives the agent another hypothesis to test.
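Most of that loadout is configuration. A minimal sketch in the mcpServers shape that several MCP clients read; every server name, command, and package below is a placeholder for whatever your org actually runs:

```json
{
  "mcpServers": {
    "pagerduty": { "command": "npx", "args": ["-y", "your-org-pagerduty-mcp"] },
    "grafana": { "command": "npx", "args": ["-y", "your-org-grafana-mcp"] },
    "logstore": { "command": "python", "args": ["-m", "your_org.logstore_mcp"] },
    "tickets": { "command": "npx", "args": ["-y", "your-org-tickets-mcp"] },
    "admin": { "command": "python", "args": ["-m", "your_org.admin_mcp"] }
  }
}
```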
The point of the local loadout is agency. You don’t know which path this incident wants, so you hand the LLM a wide tool surface and let it pick. You watch the path it takes; if a thread doesn’t pan out, you redirect.
Server: a different job, not a constrained agent
Server-side is the solo-on-runbook mode — headless, event-triggered, no human in the loop. An alert fires (or a cron tick, or a webhook), the agent runs, the result lands in a Slack thread, and that’s the end of the loop. The next event kicks off another round.
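Mechanically, that loop can be very small. A minimal sketch, assuming Flask, a standard Slack incoming webhook, and a stand-in run_triage_agent for the shared reasoning core; the endpoint and payload fields are illustrative:

```python
# Minimal headless loop: alert webhook in, first answer out to a Slack thread.
from flask import Flask, request
import requests

app = Flask(__name__)
SLACK_WEBHOOK = "https://hooks.slack.com/services/..."  # your incoming webhook URL

def run_triage_agent(alert: dict) -> str:
    """Stand-in for the shared reasoning core (prompt, model, domain knowledge)."""
    return f"Triage: {alert.get('alert_type', 'unknown')} on {alert.get('service', '?')}"

@app.post("/alerts")
def handle_alert():
    alert = request.get_json(force=True)
    # One event, one run, one Slack message. Nobody is watching this run.
    summary = run_triage_agent(alert)
    requests.post(SLACK_WEBHOOK, json={"text": summary}, timeout=5)
    return {"status": "posted"}
```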
The job here isn’t the same job as local. Local is “investigate this novel thing carefully.” Server is “classify this incoming event quickly, pull the canonical data, post a first answer fast enough that the on-call can decide what to do next.” Preliminary analysis at speed — that’s the brief. Anything that doesn’t serve that brief is overhead.
That reframes the design rule. Match the agent’s surface area to the job, not to the LLM’s potential. Not because the LLM can’t do more — but because the job doesn’t ask for more, and asking for more spends budget you’d rather put toward speed and a tighter blast radius.
Replace MCP with a fixed API
MCP is the right tool for the local job — discovery, exploration, finding the path you didn’t know you’d need. On the server-side hot path — alert webhooks where seconds matter — the picture flips: you already know which APIs need to be called for this alert type, and MCP’s discovery layer becomes overhead the brief doesn’t ask for. (For non-hot-path server-side work — batch analysis, scheduled reports, internal lookups behind sandboxed timeouts — MCP can still earn its place. The argument here is specifically about the hot path.) Two distinct costs are worth separating:
The framework tax — paid for indirection you no longer need.
- Tool schema listing eats context. An agent with ten MCP servers attached burns thousands of tokens describing tools before it does anything.
- Retries and timeouts are easy to leave loose. A flaky MCP server retrying transient errors adds dead seconds to every invocation; unless the client is configured with tight timeouts and a circuit breaker, those seconds compound, regardless of what the agent is doing. (Strictly, this is an implementation risk, not a property of MCP itself — but it’s the default risk most setups inherit.)
- Round-trips compound. Every MCP call is a process boundary. Five tool calls is five round-trips you wouldn’t pay if the calls were inline.
The exploration tax — paid for letting the agent pick a path the brief already knows.
- Discovery cuts both ways. A buffet of tools means the agent will sample. That’s exactly what makes the local job work — it finds paths you wouldn’t have. Server-side, where the path is already known, sampling is just latency the on-call waits through.
- Flexibility implies variability. Same task takes a different path each run — a feature on the local job, a debt on the server-side one, where the brief is “do the same fast thing every time.”
The two costs argue for different fixes. The framework tax argues for fewer tool layers — call the API directly instead of through MCP. The exploration tax argues for fewer choices — pre-pick which APIs to call.
Concretely: locally, the data path is LLM → MCP → API — two layers, both paying the framework tax on every call. Server-side, you write a fixed API client that cuts the middle layer out: LLM → API, directly. You already know which API to call for this alert type — the LLM doesn’t need MCP’s discovery surface to figure it out, and you don’t need to pay for retries and schema listings on a code path you wrote by hand. Direct is just faster than mediated, when you know in advance what to mediate.
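Here is what that direct path can look like for one alert type, as a sketch. The internal endpoints, the llm.complete interface, and the alert fields are all assumptions, not a prescribed stack:

```python
# Fixed path for one alert type: the APIs are pre-picked in code;
# the LLM only interprets what they return.
import requests

GRAFANA = "https://grafana.internal/api"  # illustrative internal endpoints
LOGS = "https://logs.internal/api"

def gather_evidence(service: str) -> dict:
    """LLM -> API, directly: no discovery layer, tight timeouts, no retries."""
    error_rate = requests.get(
        f"{GRAFANA}/query",
        params={"q": f"error_rate{{service='{service}'}}"},
        timeout=3,
    ).json()
    top_errors = requests.get(
        f"{LOGS}/search",
        params={"service": service, "level": "error", "limit": 5},
        timeout=3,
    ).json()
    return {"error_rate": error_rate, "top_errors": top_errors}

def triage(alert: dict, llm) -> str:
    """Judgment shifted from tool selection to data interpretation."""
    evidence = gather_evidence(alert["service"])
    return llm.complete(
        f"Alert: {alert}\nEvidence: {evidence}\n"
        "Classify severity and suggest the on-call's first next step."
    )
```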
Notice exactly what got moved: the LLM’s choice of which tools to use is gone, but its choice of how to read the data those tools return is still there. The LLM hasn’t left the loop — it’s still doing the interpretation, still deciding what the metrics mean, still writing the Slack update. Judgment didn’t disappear; it shifted from tool selection to data interpretation. That’s exactly the part of the work the server-side brief asks for — read this fast, summarize this clearly — and exactly where the LLM is at its strongest.
Tight context is a latency budget
The rule on this side is inverted from local: feed the LLM only what it absolutely needs to answer this alert, and nothing else.
A five-minute local run is fine. A five-minute server-side run means nobody is reading the alert in time — alerts don’t wait for you.
Holding that budget means every tool result is squeezed before the agent sees it. Grafana doesn’t return the full series — it returns a summary plus a few salient samples. Log search doesn’t dump 1000 lines — it returns the top five clusters with counts. That compression isn’t the agent’s job; it’s the server-side client’s job, because that client knows up front what shape it should hand back.
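As a sketch, the log half of that squeeze might look like this; the clustering is deliberately crude (mask the variable parts, count the shapes), and the example output is illustrative:

```python
# The server-side client squeezes before the LLM sees anything:
# 1000 raw log lines in, five clusters with counts out.
import re
from collections import Counter

def compress_logs(lines: list[str], top_n: int = 5) -> list[str]:
    """Collapse raw error lines into the top-N message shapes, with counts."""
    def normalize(line: str) -> str:
        line = re.sub(r"0x[0-9a-fA-F]+", "<HEX>", line)  # mask addresses first
        line = re.sub(r"\d+", "<N>", line)               # then ids, ports, counts
        return line

    clusters = Counter(normalize(line) for line in lines)
    return [f"{count}x {shape}" for shape, count in clusters.most_common(top_n)]

# 1000 lines of raw output become five lines of context, e.g.:
# ["412x timeout connecting to db-<N>", "203x retry budget exhausted for shard <N>", ...]
```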
Restricted commands are the production firewall
SSH wide open server-side is an incident waiting to happen. Locally it’s more workable — you’re watching, you can intervene — but it’s not zero risk either. Prompt injection, polluted tool descriptions, and a misread context don’t care that you’re at the keyboard. Even on the local side, destructive-command confirmation, clear prod/staging labels, secrets kept out of the model’s view, and an audit transcript are worth the small friction.
The defaults that matter:
- Read-only is the default. Query metrics, tail logs, look up config — all green.
- Writes go on an allow-list. Restart a service? Fine, but only the service named in the alert. Disable a feature flag? Fine, but only flags from a list that’s been vetted.
- Anything global is hard-blocked. Drop table, kill -9 across hosts, deploy, scale to zero — these aren’t “needs confirmation” problems. The server-side agent should not have the capability at all. If it has to happen, a human presses the button.
The spirit of all three defaults: the server-side agent’s blast radius is bounded by code, not by prompt. A prompt that says “don’t do dangerous things” is a wish. An allow-list is a fact. The first one breaks; the second one doesn’t.
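A minimal sketch of what bounded-by-code can look like. The command prefixes, and the rule that writes are scoped to the service named in the alert, are illustrative choices rather than a complete policy:

```python
# Blast radius bounded by code. The agent proposes a command; this gate
# decides. There is no prompt that can widen it.
import shlex

READ_ONLY_PREFIXES = ("kubectl get", "kubectl logs", "grep", "tail")  # always green
SCOPED_WRITE_PREFIXES = ("systemctl restart",)  # allowed, but only the alerting service
HARD_BLOCKED_TOKENS = {"drop", "deploy", "kill", "rm"}  # never; a human presses these

def allow(command: str, alert_service: str) -> bool:
    """Gate every command the server-side agent proposes."""
    if set(shlex.split(command.lower())) & HARD_BLOCKED_TOKENS:
        return False  # hard-blocked: there is no override path
    if command.startswith(READ_ONLY_PREFIXES):
        return True   # read-only is the default yes
    if command.startswith(SCOPED_WRITE_PREFIXES):
        return alert_service in command  # writes: only the service named in the alert
    return False      # anything unrecognized defaults to no

# allow("systemctl restart checkout-api", "checkout-api")  -> True
# allow("systemctl restart billing-api", "checkout-api")   -> False
# allow("kill -9 $(pgrep java)", "checkout-api")           -> False
```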
Why you can’t just pick one
The two failure modes that show up most:
“It works great locally, let’s just plug it into the webhook.” The first production alert lands and the agent does what it does best locally — explores. Three minutes listing tools, five minutes investigating a tangentially related host, seven minutes cross-referencing staging against prod. None of those moves are wrong; they’re the local job done correctly. They’re just not the server-side job, which was “classify this in 30 seconds and post a first answer.”
“Server-side is so reliable, let’s just use that loadout locally too.” Now you’re debugging a novel incident and the agent can only run the five paths someone predefined. The shape that was a perfect fit for fast triage becomes a self-imposed cage on a job that calls for exploration.
It’s not a “which side is right” question. The two modes optimize for genuinely different things:
- Local is built for deep investigation — hand the LLM a wide tool surface, let it explore.
- Server is built for fast preliminary triage — pre-pick the path, keep blast radius small, let the LLM do the reading and summarizing.
What carries across both sides is the prompt, the model, the domain knowledge of this system. What doesn’t carry across is the job — and so the agent’s surface shouldn’t either.
How to design the split
In practice, slice the agent into three layers (made literal in the sketch after the list):
- Reasoning core — prompt, model, domain knowledge. Shared across both sides.
- Tool layer — MCP locally (the right shape for discovery), direct API client server-side (the right shape for known-path triage). Same reasoning core, different jobs.
- Output layer — terminal locally, Slack thread / ticket / metric server-side.
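One way to make the three layers literal, as a sketch; every class name is illustrative, and the tool layer is just an interface the two sides implement differently:

```python
# The split, literally: one reasoning core, two tool layers, two output layers.
from typing import Protocol

class ToolLayer(Protocol):
    def gather(self, incident: dict) -> dict: ...

class OutputLayer(Protocol):
    def emit(self, report: str) -> None: ...

class TriageAgent:
    """Reasoning core: prompt, model, domain knowledge. Shared across both sides."""
    def __init__(self, tools: ToolLayer, output: OutputLayer, llm):
        self.tools, self.output, self.llm = tools, output, llm

    def run(self, incident: dict) -> None:
        evidence = self.tools.gather(incident)  # local: MCP; server: fixed API client
        report = self.llm.complete(f"Incident: {incident}\nEvidence: {evidence}")
        self.output.emit(report)                # local: terminal; server: Slack thread

# Local:  TriageAgent(McpToolLayer(servers), TerminalOutput(), llm)
# Server: TriageAgent(FixedApiClient(),      SlackThread(channel), llm)
```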
Once that split is clean, the workflow for adding a new alert type is three steps (a sketch follows them):
- Identify the APIs that should be called for this alert type. For most alert types you already know which ones — that’s exactly the engineering judgment that doesn’t need an LLM in the loop.
- Write a server-side fixed client that calls those APIs directly, returning compressed shapes — just what the LLM needs to interpret, no more.
- Pin the path’s command scope to an allow-list.
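Those three steps tend to collapse into one table entry per alert type. A sketch reusing gather_evidence from the fixed-client sketch above; the field names and alert types are assumptions:

```python
# Adding an alert type = one entry: pre-picked APIs (step 2) plus the
# path's command scope (step 3). Unknown shapes stop and page a human.
from dataclasses import dataclass
from typing import Callable

@dataclass
class AlertPath:
    gather: Callable[[dict], dict]           # step 2: fixed client, compressed output
    allowed_commands: tuple[str, ...] = ()   # step 3: this path's write allow-list

ALERT_PATHS: dict[str, AlertPath] = {
    "high_error_rate": AlertPath(
        gather=lambda a: gather_evidence(a["service"]),  # client from the sketch above
        allowed_commands=("systemctl restart",),
    ),
    # ...one entry per alert type you onboard
}

def dispatch(alert: dict) -> dict | None:
    path = ALERT_PATHS.get(alert["alert_type"])
    return path.gather(alert) if path else None  # None: page a human instead
```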
The whole goal on the server side is to minimize tool round-trips and minimize context — both spend the same precious time budget. Locally those costs don’t matter; you’re reading along, and exploration is the point. Server-side, you already know the answer for this alert shape, so cut anything that exists only to help discover it.
Closing
Same agent code, two deployment shapes.
Local: tools maxed out, context maxed out, commands wide open, human in the loop. Let the LLM stretch — discovery is the job, exploration is the work, the wide tool surface is what lets it land. Server: tool layer narrowed to direct API calls, context tight, commands on an allow-list. The LLM still interprets; it just doesn’t pick the path. Fast triage is the job, and the narrow surface is what lets the LLM be fast at it.
The point worth seeing: constraining the LLM server-side isn’t a comment on the LLM. It’s a comment on the job. A free-roaming exploratory agent is the wrong shape for “classify this in 60 seconds and post a first answer,” not because it would do that job badly but because it would do a different job.
You need both — and you need to see that they shouldn’t look the same.