Engineer handing tools to an AI robot on the floor

Context Engineering for DevOps AI Agents

Most AI agent setups that disappoint their teams don’t disappoint because the model is wrong. They disappoint because the agent was asked to reason about systems it can’t see. A triage agent without PagerDuty access produces a vague analysis. An on-call agent without metrics hallucinates a root cause from alert titles. The agent isn’t bad; it’s undercontexted. Context engineering is the long game, and it has a structure. Specifically, it has four techniques — not a staircase. Each one matches a different shape of context source, and which ones apply depends on what you already have. A team with a CLI-heavy internal stack will spend most of its effort on technique 3. A team whose vendors all expose public MCPs might never touch technique 2. What remains true, regardless, is that technique 4 — bundling — is what turns any subset of the others into a team asset. ...

April 1, 2026 · 6 min · Jared L.
On-call engineer with AI robot partner at the desk

The Slack-Native AI On-Call Agent That Stands Shift With You

The limiting factor on how quickly an on-call engineer resolves an incident is rarely the rate at which they make decisions. It is the rate at which they can look at things — logs, metrics, traces, dashboards, SSH sessions — and assemble a picture of what the system is doing. Decision-making is cheap once the picture exists. Building the picture is where the time goes. This work is structurally single-seat. To bring a colleague in, an engineer has to translate what they have already seen: which log lines, when the metric started climbing, and what the network trace confirmed. The translation cost is high enough that most on-call engineers defer involving others until the problem forces it. The team is nominally 24/7; in practice, investigations happen alone. ...

March 1, 2026 · 8 min · Jared L.
Confused customer with laptop while engineer watches dashboards

Time to Customer Awareness: the Incident KPI No One Measures

In most incident response pipelines, customers begin experiencing impact at minute zero. Within a few minutes, the first support tickets arrive. By minute ten, customer-initiated posts appear in shared channels. Public status pages are typically updated somewhere between minute thirty and minute ninety. The gap between when a customer first feels impact and when that customer is formally told is a measurable, important, and almost universally unmeasured KPI. Call it Time-to-Customer Awareness, or TTC-Aware. ...
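As defined in the excerpt, TTC-Aware is just the delta between two timestamps, which makes it trivial to compute once both are recorded. A minimal sketch (the function name and example timestamps are illustrative, not from the article):

```python
from datetime import datetime, timedelta

def ttc_aware(first_impact: datetime, first_customer_notice: datetime) -> timedelta:
    """Time-to-Customer Awareness: the gap between when a customer
    first feels impact and when customers are formally told."""
    return first_customer_notice - first_impact

# Using the teaser's typical timeline: impact at minute zero,
# status page updated 45 minutes later.
gap = ttc_aware(datetime(2026, 2, 1, 9, 0), datetime(2026, 2, 1, 9, 45))
print(gap)  # 0:45:00
```

The hard part is not the arithmetic but capturing `first_impact` honestly; the article's point is that teams rarely record it at all.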

February 1, 2026 · 6 min · Jared L.
On-call engineer at multi-monitor setup with alert bells

Measuring PagerDuty Alert Noise: A Diagnostic Framework

Every on-call veteran has the same intuition: a meaningful portion of what pages us isn’t actionable. What most of us don’t have is a way to prove it — let alone to classify why a given alert is noisy. Without classification, the only lever is blunt: raise the threshold, disable the alert, ignore the channel. None of these distinguishes an alert that’s noisy because the platform self-heals from one that’s noisy because it flaps. ...
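The excerpt contrasts two noise signatures that blunt levers can't tell apart: alerts that are noisy because the platform self-heals, and alerts that are noisy because they flap. A minimal classification sketch under assumed definitions (all names, fields, and thresholds here are hypothetical, not from the article): flapping means repeated triggers within a short window, self-healing means the alert resolves with no human action.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class AlertEvent:
    triggered_at: datetime
    resolved_at: datetime
    acked_by_human: bool  # did anyone actually act on this firing?

def classify_noise(
    events: list[AlertEvent],
    flap_window: timedelta = timedelta(hours=1),
    flap_threshold: int = 3,
) -> str:
    """Distinguish flapping (rapid re-triggering) from self-healing
    (resolves without human action) for one alert definition."""
    triggers = sorted(e.triggered_at for e in events)
    # Count consecutive triggers that recur within flap_window of each other.
    rapid = sum(1 for a, b in zip(triggers, triggers[1:]) if b - a <= flap_window)
    if rapid + 1 >= flap_threshold:
        return "flapping"
    if events and all(not e.acked_by_human for e in events):
        return "self-healing"
    return "actionable"
```

The fixes diverge accordingly: a flapping alert wants hysteresis or a longer evaluation window, while a self-healing one wants a delay or removal, which is exactly why classification has to precede tuning.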

January 1, 2026 · 5 min · Jared L.