In most incident response pipelines, customers begin experiencing impact at minute zero. Within a few minutes, the first support tickets arrive. By minute ten, customer-initiated posts appear in shared channels. The public status page is typically updated somewhere between minute thirty and minute ninety.

The gap between when a customer first feels impact and when that customer is formally told is a measurable, important, and almost universally unmeasured KPI. Call it Time-to-Customer Awareness, or TTC-Aware.

Most incident response playbooks optimize Time to Detection and Time to Resolution. Both matter. But neither is what determines whether an incident erodes trust. A customer can tolerate a long incident they are kept informed about. They struggle to tolerate a short incident they discover through their own usage.

TTC-Aware is the window in which your team either earns or burns trust capital with that customer. It deserves a name, a number, and a place on the dashboard.

Defining the metric

Time to Customer Awareness: the elapsed time from the first customer-impacting event to the first communication received by an affected customer.

A few nuances:

  • The clock starts when impact begins, not when the alert fires. In practice, you measure from the alert timestamp because it’s the earliest timestamp you have, but the real clock started earlier.
  • The clock stops when a specific customer receives the communication, not when you publish it. A status page update at minute 30 has TTC-Aware = 30 min for a customer who checks the page, and TTC-Aware = ∞ for a customer who doesn’t.
  • “Affected customer” is the right denominator. Aggregating across unaffected customers dilutes the signal.
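The definition above can be made concrete in a few lines. This is an illustrative sketch, not any particular tool's API: the function name, data shapes, and timestamps are all assumptions. Per-affected-customer TTC-Aware falls out of two timestamps, with an infinite value for a customer the communication never reaches:

```python
from datetime import datetime
from math import inf

def ttc_aware_minutes(impact_start, first_comm_received):
    """Minutes from the first customer-impacting event to the first
    communication *received* by this customer; infinite if never received."""
    if first_comm_received is None:
        return inf  # the update was published, but this customer never saw it
    return (first_comm_received - impact_start).total_seconds() / 60

impact_start = datetime(2024, 5, 1, 12, 0)

# Only affected customers belong in the denominator.
received = {
    "acme":   datetime(2024, 5, 1, 12, 30),  # checked the status page at minute 30
    "globex": None,                          # never saw any update
}
per_customer = {c: ttc_aware_minutes(impact_start, t) for c, t in received.items()}
```

Note how the same published update yields 30 minutes for one customer and infinity for another, which is exactly why reach (stage 6 below) belongs inside the metric.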

The six-stage pipeline

The traditional path from incident to customer communication has at least six stages, each with its own latency:

  1. Detection (1–5 min): the alert fires and a human notices.
  2. Triage (5–30 min): understanding scope and impact.
  3. Authorization (10–60 min): deciding who tells customers and who approves the wording.
  4. Drafting (15–60 min): translating engineering language into customer language.
  5. Publishing (< 1 min): pressing the button.
  6. Reach (minutes to hours): the customer actively checks a status page or receives an email.

Specific numbers vary by organization; the structure does not. Typical total: 30–90 minutes before the first affected customer is meaningfully informed.

During that window, the customer’s experience is dominated by silence. They are debugging their own integration, filing tickets, pinging sales, and — most consequentially — forming a narrative about your reliability that you have no input into.

What this reframes

Once TTC-Aware is the metric, several long-running architecture debates either dissolve or clarify.

Status page vs. Slack vs. email. These aren’t equivalent alternatives; they’re three points on the reach axis (stage 6) with wildly different latencies. A customer already in a shared Slack channel has reach ≈ 0. A customer who has to remember to check a status page has reach that depends on whether they happen to look. Multiple channels aren’t redundancy; they’re insurance against variable reach across the customer base.

Who writes the message. Under message-quality optimization, the answer is “whoever writes best.” Under TTC-Aware, the answer is “whoever is present now”: a present on-call engineer beats an absent PM, who in turn beats an absent support lead. The quality of any individual message matters less than the presence of someone authorized to write one.

How detailed the message should be. Under TTC-Aware, detail trades off against frequency. Two terse updates beat one thorough one. The thorough writeup belongs in the post-incident summary, which operates under a different time budget.

What a TTC-Aware-optimized system looks like

The framework so far is diagnostic: it identifies where time is lost. A system designed around TTC-Aware, rather than retrofitted to it, has a few structural properties.

An alert-to-affected-customer mapping exists before the incident. The question “which customers care about this alert” should not be answered in real time by a human. It should be answered by configuration — alert tags mapped to customer scopes — so that when an alert fires, the list of affected customers is already available. Getting this wrong in either direction is expensive: broadcasting to uninvolved customers generates noise; missing affected ones reintroduces the original problem.
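A minimal sketch of what such configuration might look like. The alert tags, scope fields, and customer attributes are all hypothetical; the point is only that the answer to “which customers care?” is a lookup, decided before any incident:

```python
# Hypothetical pre-incident configuration: alert tags mapped to customer scopes.
ALERT_SCOPES = {
    "payments-api-5xx":     {"region": "eu", "plan": "enterprise"},
    "webhook-delivery-lag": {"feature": "webhooks"},
}

CUSTOMERS = [
    {"id": "acme",   "region": "eu", "plan": "enterprise", "features": {"webhooks"}},
    {"id": "globex", "region": "us", "plan": "pro",        "features": {"webhooks"}},
]

def affected_customers(alert_tag):
    """Resolve an alert tag to affected customers via the configured scope."""
    scope = ALERT_SCOPES.get(alert_tag, {})
    hits = []
    for c in CUSTOMERS:
        if "region" in scope and c["region"] != scope["region"]:
            continue
        if "plan" in scope and c["plan"] != scope["plan"]:
            continue
        if "feature" in scope and scope["feature"] not in c["features"]:
            continue
        hits.append(c["id"])
    return hits
```

When the alert fires, `affected_customers(tag)` is the list; no human answers the scoping question in real time.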

The first draft is generated automatically from alert context. As soon as the alert is ingested, an LLM produces a customer-facing draft using a consistent voice and the known incident state. The draft is not published automatically — it waits for human review — but the human starts with a near-ready message, not a blank page. This is the single largest collapse of stage 4.
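A rough sketch of the drafting step. A plain template stands in for the LLM call so the example stays deterministic, and every field name in the alert context is an assumption:

```python
def draft_customer_update(alert):
    """Produce a first-draft, customer-facing message from alert context.
    In the system described above this draft would be LLM-generated in a
    consistent voice; a template stands in here. The draft is never
    auto-published; it waits for human review."""
    return (
        f"We are investigating elevated {alert['symptom']} affecting "
        f"{alert['surface']}. Impact began around {alert['started_at']} UTC. "
        "Next update within 30 minutes."
    )

alert = {
    "symptom": "error rates",
    "surface": "the Payments API",
    "started_at": "12:02",
}
draft = draft_customer_update(alert)  # the on-call starts here, not from a blank page
```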

Approval is ergonomic. One button to approve, one button to edit. If approval takes more than a few seconds of cognitive load, on-calls defer it; when they defer it, TTC-Aware balloons. Designing for a tired human at 3am is not optional.

The state machine is built into the workflow, not a convention. Investigating → identified → monitoring → resolved are the buttons the on-call presses; each produces a correctly structured update. Tiered communication becomes the default path, not something the team has to remember to do.
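The state machine can be sketched in a few lines. The phases come from the text; the specific transition rules are an assumption about a reasonable default:

```python
from enum import Enum

class Phase(Enum):
    INVESTIGATING = "investigating"
    IDENTIFIED = "identified"
    MONITORING = "monitoring"
    RESOLVED = "resolved"

# Each button press advances the incident and emits a structured update;
# the workflow, not convention, enforces the order.
TRANSITIONS = {
    Phase.INVESTIGATING: {Phase.IDENTIFIED, Phase.RESOLVED},
    Phase.IDENTIFIED:    {Phase.MONITORING, Phase.RESOLVED},
    Phase.MONITORING:    {Phase.RESOLVED},
    Phase.RESOLVED:      set(),
}

def advance(current, target, note):
    """Move the incident to the next phase, producing a structured update."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"cannot move {current.value} -> {target.value}")
    return target, {"phase": target.value, "note": note}
```

Pressing “identified” while investigating yields both the new state and a correctly shaped update; pressing an illegal button is simply impossible.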

Cadence is tracked explicitly. The system shows the on-call, at a glance, how long since the last update went out. If the gap exceeds a threshold, it surfaces a prompt — “still investigating?” — which is itself a one-click update. Raising cadence is cheaper than relying on discipline.
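A cadence check might look like this. The 20-minute threshold is an assumed policy, not from the text:

```python
from datetime import datetime, timedelta

CADENCE_THRESHOLD = timedelta(minutes=20)  # assumed policy

def cadence_prompt(last_update_at, now):
    """Surface a one-click 'still investigating?' prompt when the gap
    since the last outbound update exceeds the threshold."""
    gap = now - last_update_at
    if gap > CADENCE_THRESHOLD:
        minutes = int(gap.total_seconds() // 60)
        return f"Still investigating? Last update {minutes} min ago."
    return None
```

The prompt itself is the update: one click converts it into “still investigating” on every channel.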

Reach is multi-channel, same content. The same approved message fans out to Slack (for customers present there), to the status page (for record and for non-Slack customers), and optionally to email. These are not separate publishing workflows; they are one action with multiple destinations.
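Fan-out as one action with multiple destinations can be sketched with stand-in senders in place of real Slack, status page, and email integrations:

```python
def publish(update, channels):
    """One approved message, one action, every destination.
    Each sender here is a stand-in; real ones would call the Slack API,
    the status page API, and the mailer."""
    delivered_to = []
    for name, send in channels.items():
        send(update)
        delivered_to.append(name)
    return delivered_to

sent = []
channels = {
    "slack":       lambda msg: sent.append(("slack", msg)),
    "status_page": lambda msg: sent.append(("status_page", msg)),
    "email":       lambda msg: sent.append(("email", msg)),
}
delivered_to = publish("Elevated error rates on the Payments API; investigating.", channels)
```

There is no per-channel workflow to remember: approving the message is publishing it everywhere.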

Post-incident output is derived, not re-authored. The sequence of updates during the incident is the substrate for the post-mortem summary. An LLM can draft the public-facing writeup from that substrate; the team edits for nuance and adds what’s missing. This avoids the trap where the post-mortem becomes a separate four-hour task that either ships late or doesn’t ship.
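Deriving the substrate from the update sequence can be as simple as the following sketch (field names assumed); the LLM or a human drafts the public writeup from this, rather than re-authoring from memory:

```python
def postmortem_substrate(updates):
    """Assemble the in-incident update sequence into the substrate the
    post-mortem draft is derived from; nothing is re-authored."""
    return "\n".join(f"[{u['at']}] {u['phase']}: {u['note']}" for u in updates)

updates = [
    {"at": "12:05", "phase": "investigating", "note": "Elevated 5xx on Payments API."},
    {"at": "12:25", "phase": "identified",    "note": "Bad config push to edge tier."},
    {"at": "12:40", "phase": "resolved",      "note": "Config rolled back; error rates normal."},
]
timeline = postmortem_substrate(updates)
```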

None of these components is novel on its own. What matters is that they be designed together, with TTC-Aware as the shared objective from the start.

A note on automation limits

TTC-Aware is minimized, not eliminated. Pushing it toward zero eventually produces communications that are wrong, premature, or leak internal detail. The relevant constraints:

  • Affected-customer identification has false positives and false negatives, and neither failure mode is free: noise erodes attention, and misses recreate the original silence. This is where ongoing investment compounds.
  • Drafted messages need a human on the trigger. Autoposting LLM output is a PR incident waiting to happen. The workflow is “LLM drafts, human approves in seconds” — the second half doesn’t go away.
  • Not every alert warrants a customer-facing update. Automating too aggressively collapses that distinction. The “should we communicate at all” decision — even if compressed — must remain in the pipeline. Over-automation degrades the channel’s meaning faster than occasional silence does.

The realistic target for TTC-Aware is single-digit minutes, not seconds. That is still a meaningful improvement over the 30–90 minutes most teams implicitly accept today.

Closing

Time to Detection and Time to Resolution are measured because they are owned by engineering. They describe how the team performs. Time to Customer Awareness is owned by nobody — it sits between engineering, support, and communications — and that’s why it doesn’t get measured.

Name it. Measure it. Every tool, process, and staffing decision can then be evaluated against a single question: Does this shorten TTC-Aware?

That’s the only question that matters to the customer in the window they remember most.