These are notes on building AI agents for on-call, incident response, and infra reliability.
The premise is simple: the on-call stack (paging tools, dashboards, runbooks, postmortems) was shaped by what human responders need during an incident. As AI agents start to handle the procedural first pass, every layer of that stack is up for reorganization. The posts here work through that reframing one piece at a time: what survives, what changes form, and what becomes redundant.
The writing is observational rather than promotional. Each post argues for a specific structural move (a permission ladder, a triage funnel, a memory layout, a boundary pattern, a cross-model audit loop), concrete enough to apply today. The intent isn’t to claim AI will replace humans; it’s to show which parts of the current stack exist only because the responder was human, and what reorganizes when that constraint changes.
Topics covered: on-call ops, incident response, AI agent design, MCP servers, prompt engineering, context engineering, security boundaries around LLMs, observability tooling, customer-facing communication during incidents.
Contact: [email protected]