
SEV1: The Art of Incident Command

A Modern SRE-Aligned Approach to Incident Management

By Frank Jantunen

Why This Book?

Most books on incident response are locked behind paywalls or written like policy manuals. This isn’t that.

This is a tactical field guide for the people actually on-call—the ones who get paged at 3AM and have to lead through the fog of war.

If you’ve ever had to coordinate across Slack threads while pulling logs and writing updates to leadership—this is for you. If you’re the only SRE at your org, or part of a centralized team trying to shift culture from the edges, this is especially for you.

The Mission

This book exists to align modern incident response with SRE culture: fast, humane, and relentlessly practical.

It’s about using incidents as catalysts—not just to fix systems, but to transform how organizations think and operate. It’s a survival manual for the mess and an argument that culture—not just tooling—is what defines reliability.

Incidents drive change. Never let a crisis go to waste.

How to Use It

The structure is dead simple: before, during, and after the incident. You can jump to any section as needed.

The language is intentionally spartan. No fluff, no filler. Just clear ideas and hard-won practices tested under pressure.

It’s built to match human limitations—especially under cognitive load.

That’s why the guidance here is short, structured, and designed to be scanned—not read front to back like a novel. It’s optimized for clarity during degraded cognition, not academic perfection.

🧠 In incident response, the enemy isn’t just downtime—it’s overload. This book is built for peak usability during peak stress.

Why Emojis, Callouts, and Formatting Matter

You’re going to see a lot of visual cues in this book: emojis, callout boxes, tight bullets, and bolded takeaways. That’s not for style points. That’s for scannability under stress.

This is written for on-call humans—people skimming this at 3AM, half-asleep, with alerts firing and Slack melting down. The goal isn’t clever formatting. The goal is to make signal pop.

Emojis 🧠📉🚨
Used sparingly, they act like visual road signs. They help anchor ideas and break up cognitive load—especially in runbooks, alert payloads, and checklists. If it helps you spot the 🛑 STOP or ✅ DONE faster, it’s doing its job.

Callouts & Takeaways 📦
These isolate what actually matters. They’re the stuff people highlight in trainings—or forget when it counts. Use them to orient, not decorate.

Spartan Layout, Fast Reading 🏃‍♂️
Short paragraphs. Minimal prose. If it takes more than five seconds to understand, it needs a rewrite. This isn’t about dumbing things down. It’s about reducing friction.

🧠 This book isn’t a blog. It’s a cockpit manual. And every second counts.

Value-for-Value

This book is free to read, remix, and share. If it helps you or your team, consider sending value back—feedback, stories, signal boosts, or donations.

There’s no DRM. No paywall. Just trust.

If it helps you, pass it on.

Copyright © 2025 Frank Jantunen. All rights reserved.

This work is distributed under a value-for-value model. It may be freely read, shared, and discussed for personal, non-commercial use. If you found it valuable, consider supporting the project, offering feedback or sharing it. 🙏

No paywall. No ads. Just value-for-value.

Support the project:

⚡ Bitcoin: bc1qxl8uy3acrhlhgvn7653twmdmhr97j0xjxk2cak

💸 PayPal: https://paypal.me/frankjantunen

For commercial use—including redistribution, employee training, or internal documentation—please contact the author directly at [email protected].

No part of this publication may be reproduced, stored in a retrieval system, or transmitted by any means—electronic, mechanical, photocopying, recording, or otherwise—for commercial use without prior written permission from the author.

Printed in USA 🇺🇸 First Edition – June 2025

This book is intended for informational and educational purposes only. The views expressed are those of the author and do not represent the positions of any employer, organization, or entity unless explicitly stated.

All trademarks, logos, and brand names mentioned are the property of their respective owners. Their use is for identification and illustrative purposes only and does not imply affiliation, sponsorship, or endorsement.

Mentions of specific services, platforms, or vendors—including but not limited to PagerDuty, Datadog, Honeycomb, Gremlin, Netflix, Google, PayPal, and Microsoft—are made for example and context. No payments, sponsorships, or kickbacks were received. This book promotes no specific tool or service. All references are used in a neutral, educational context.

The content is provided “as-is.” Readers assume full responsibility for the use of any information presented herein. Always evaluate ideas in the context of your organization’s specific needs and risks.

Table of Contents 📜

Acknowledgements

Foreword

Part I: Before the Incident 🕰️

1. What Is an Incident, Really? 🤔
2. Operational Mindset & Culture 🧠
3. Clear Criteria for Incident Declaration ✅
4. Systems, Playbooks & Observability 🗺️
5. Alerting Without the Noise 🔕
6. Training, Simulation & Team Maturity 🏋️‍♀️

Part II: During the Incident 🔥

7. Triggers & Assembly 🚦
8. Incident Command in Practice 🧑‍✈️
9. Communication Under Pressure 🗣️
10. Managing People, Pace & Burnout 🧘

Part III: After the Incident 📝

11. Declaring the End & Recovery 🏁
12. Postmortems That Don’t Suck ✨
13. From Lessons to Systems Change 🔄
14. Measuring What Matters 📊
15. The Future State of Incident Command 🔮

Conclusion

The Journey Continues: Further Learning and Resources 🚀

Acknowledgements 🙏

To my family, who never asked why I was obsessed with writing this book—just made sure I didn’t forget to eat. Thank you for the support! ❤️

To Eric, who’s been a great mentor and a constant source of inspiration.

To everyone I’ve worked with over the years. 🤝

To the Learning From Incidents community, and to those who’ve pushed reliability thinking beyond dashboards and into the human domain—your work paved the way for this one.

Thank you to everyone who’s ever written a clear postmortem, spoken up when something felt off, or challenged process for the sake of people. You’ve made this field more humane, and this book wouldn’t exist without your example.

And to anyone who reads this and offers value for value—thank you. That exchange means more than you know. ✨

Foreword

When I got into tech in June 2000, I was slapping together fugly websites, streaming low-res videos, and trying to keep NT4 servers running. Before YouTube was even a concept, I was live streaming, running end-to-end event production, and becoming the SME for anything streaming or CDN. 👨‍💻

By 2011, I’d stumbled into incident management. The industry was deep in its ITIL hangover—rigid process, thick hierarchies, and enough red tape to mummify a data center. 📜 It brought order, sure, but agility? Like trying to steer a cargo ship with a joystick. 🚢

Then came the SRE wave. 🌊 Suddenly everyone wanted to “do SRE,” flipping the script on how we think about reliability and response. But despite all the tooling, the frameworks, the culture decks—we’re still flailing when it comes to human factors.

I’ve ridden every wave since—sometimes surfing 🏄‍♂️, sometimes just staying afloat. In 2018, working at a startup, I got my first exposure to the role of incident commander. No training, no playbook, barely any system visibility. Just raw chaos, flaming chainsaws 🔥🪚, and the expectation to “own it.” That trial by fire taught me this: strong incident command is non-negotiable, especially when you’re also wearing three other hats. 🎩🎩🎩

Across startups and giants, I’ve watched teams fumble and stall—not because they lacked tools, but because they ignored culture. Fixing incident management means wrestling that beast. And let’s not kid ourselves—it’s like sprinting uphill through molasses.

SEV1 – The Art of Incident Command is the distilled chaos. Not sanitized “best practices,” but the book I wish someone had handed me when I was drowning. It’s built from scars, scraped from real-world incidents, and shaped by teams both scrappy and sprawling.

Today, incident response is a three-ring circus: engineers juggling pagers 📟, debugging blind 🕶️, and improvising in real time while the stakes climb and the tooling sprawls. This book is your survival guide and your last line of defense.

🌊 The water’s rough. Are you ready to jump in?

—Frank Jantunen

PART I: Before the Incident 🕰️

1. What Is an Incident, Really? 🤔

The ITIL View: A Starting Point

The ITIL (Information Technology Infrastructure Library) framework provides a classic definition of an incident:

“An unplanned interruption to an IT service or reduction in the quality of an IT service.”

This approach is service-focused, reactive, and operational by nature—an incident exists when someone or something detects a problem.

Where ITIL Falls Short: The Priority Matrix Trap 😬

In modern, complex systems, the traditional ITIL model’s handling of urgency and impact is a critical bottleneck. The model separates priority from severity, then calculates priority as a function of two inputs:

Priority = Impact x Urgency

Debating whether an incident is a P2 or P3 wastes time that could be spent mitigating escalating customer impact.
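To see why this turns into a debate, here is a minimal sketch of the kind of lookup the model implies (the scales and values are illustrative, not official ITIL ones):

# Hypothetical ITIL-style matrix: impact and urgency each rated 1 (high) to 3 (low).
PRIORITY_MATRIX = {
    (1, 1): "P1", (1, 2): "P2", (1, 3): "P3",
    (2, 1): "P2", (2, 2): "P3", (2, 3): "P4",
    (3, 1): "P3", (3, 2): "P4", (3, 3): "P5",
}

def itil_priority(impact: int, urgency: int) -> str:
    """Return the ticket priority for a given impact/urgency pair."""
    return PRIORITY_MATRIX[(impact, urgency)]

print(itil_priority(impact=1, urgency=2))  # "P2" -- and the debate over these two inputs is the trap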

The SRE Mindset: Engineering for Failure 💡

Site Reliability Engineering (SRE) collapses the distinction between priority and severity to move faster, and it assumes system failures are inevitable.

Key shifts:

Where an Incident Begins

A modern guideline:

An incident begins when a responder believes action may be needed to preserve service reliability.

One person is all it takes to declare: “Something may be wrong. We should respond as if it is until we confirm otherwise.”

Example triggers:

Example Severity Matrix (Impact-Focused) 🚀

Severity | Impact | Typical Response Time | Examples & Notes
SEV-0 (optional) | Severe platform failure, business risk | Immediate | Catastrophic event, exec-level coordination, unknown recovery path
SEV-1 | Major service degradation or outage | < 3 min | Core features down, large-scale impact, “all-hands” response
SEV-2 | Moderate service impact | < 5 min | Significant performance issues, workaround may exist, multiple services affected
SEV-3 | Degraded user experience | < 15 min | Minor bug, single-service impact, logged for resolution
SEV-4 (optional) | Minimal/cosmetic impact | < 48 hours | Flexible, for deferred issues
SEV-5 (optional) | External/Partner issues | Monitor Only | Third-party outage, visible but not actionable

📊 Reality Check: Most teams operate with just SEV-1 to SEV-3. Start simple, expand only if needed.
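If you encode the matrix in tooling, keep it that simple: one shared definition for humans and automation. A minimal sketch (names and fields are assumptions, mirroring the table above):

from dataclasses import dataclass

@dataclass(frozen=True)
class SeverityLevel:
    name: str
    impact: str
    response_target: str  # acknowledgement target from the matrix above

SEVERITIES = [
    SeverityLevel("SEV-1", "Major service degradation or outage", "< 3 min"),
    SeverityLevel("SEV-2", "Moderate service impact", "< 5 min"),
    SeverityLevel("SEV-3", "Degraded user experience", "< 15 min"),
]

def severity(name: str) -> SeverityLevel:
    return next(s for s in SEVERITIES if s.name == name)

print(severity("SEV-2").response_target)  # "< 5 min"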

🔄 Sidebar: Severity vs. Priority

📌 This matrix maps severity as a measure of impact—not priority.

  • Severity = how bad.
  • Priority = how fast.

A SEV-3 might trigger a P1 if it risks legal exposure. A SEV-2 might be stable and non-urgent.

🔔 Let the alert decide—use worst-case interpretation at time of fire. Severity should reflect what could go wrong if nothing is done. Escalate early; downgrade with certainty.

Treat severity as an engineering signal. Treat priority as a business response. Most orgs route by SEV; stakeholders triage by P#. If you deal with contracts, SLAs, or compliance—track both.

Lifecycle Comparison

Framework | Lifecycle Steps | Primary Context
ITIL | Detection → Logging → Categorization → Prioritization → Diagnosis → Resolution → Closure | Operational helpdesk 📞
SRE | Detect → Triage → Mitigate → Resolve → Review | Fast-moving, distributed systems 💨
NIST | Preparation → Detection & Analysis → Containment, Eradication & Recovery → Post-Incident Activity | Security-focused response 🛡️

🔑 Key Takeaway: Keep it simple: map severity to priority directly and define levels by the response they demand.

2. Operational Mindset & Culture 🧠

Incidents do not happen in a vacuum. Team response, escalation, and recovery are shaped by culture—how a team thinks, behaves, and values its work.

Resilience Over Redundancy

Redundancy can mask fragility. Instead of fixing flaky systems, teams add layers.

Resilience means being honest about what breaks, why, and what to do when it breaks again.

It’s about graceful degradation, fast recovery, and human readiness.

Resilient teams:

Resilient systems:

Blamelessness and Psychological Safety ❤️

If engineers are afraid to speak up, incident response is compromised.

Blame kills curiosity. 🔪

Blameless culture separates the person from the process, focusing on why a decision made sense at the time.

Psychological safety means:

SRE vs. DevOps Culture: Bridging the Mindsets

SREs | DevOps
Emphasize error budgets, reliability as feature | Fast, iterative, delivery-focused
Treat ops as software problem | Willing to trade stability for speed
Quantify risk, push back on unsustainable pace | Adaptable, but risk burnout or inconsistent quality

Bridge-building strategies:

Tooling Signals Culture

Your incident management tooling—Slack vs. Teams, PagerDuty vs. homegrown schedulers, orchestrators like Rootly, FireHydrant, Blameless or incident.io, and even how you structure observability—says a lot about your engineering culture. These choices shape more than your incident response; they signal what kind of environment you’re building and who it’s built for.

Some tools come with historical baggage. Others imply a more modern, progressive approach. Slack implies high-context, fast-moving collaboration. Teams might signal heavier governance. PagerDuty suggests urgency and maturity. Blameless implies structured learning and psychological safety. Homegrown tooling could imply a startup culture, and a tool you may have to maintain yourself.

These are cultural decisions disguised as tooling choices. Your stack becomes your story. Choose with intention—because it attracts (or repels) the kind of engineers you’ll end up relying on in a SEV1.

🏗️ Building Resilient Systems: Two Pillars

System-Level Resilience

Adaptive Capacity (Resilience Engineering)

🔑 Key Takeaway: Culture isn’t a slide deck or a slogan. It’s what people actually do—under pressure, in the dark, without a script. If you want real resilience, you need both: systems built to absorb shocks, and teams trained to adapt.

3. Clear Criteria for Incident Declaration ✅

If you ask five teams what counts as an incident, you’ll likely get ten different answers. Incident management cannot start effectively until everyone knows what qualifies as an incident, who can declare one, and what should happen next.

ITIL vs. SRE: Definitions

Concept | ITIL | SRE
Severity | Not formal. Often muddled with “impact.” | Clear measure of technical impact (e.g. downtime).
Priority | Blend of impact and urgency for ticket SLAs | Rarely used. Urgency implied by severity.

Common Failure Modes 💥

Scenario: A production database flips into read-only mode.

The Fix:

Incident Declaration Criteria

A healthy incident process starts with specific, trigger-based criteria:

📢 Important: “Incident” doesn’t mean “disaster.” It means structured response.

The Security Dimension 🔒

Some incidents are the direct result of malicious activity (e.g., DDoS attack). SRE and Security must collaborate:

Who Can Declare?

Anyone in the organization should be empowered to declare an incident. If it turns out to be a false alarm, that’s acceptable—over-alerting is better than delay.

Example Incident Assembly Orchestration:

Transparency and Announcement 📢

Incidents should be visible. Unless security-sensitive, post in a public #incidents channel with an auto-generated summary.

Example:

JIRA# INC-1234
SEV2 - Checkout - API - High error rate on checkout API
Slack Channel: #INC-1234
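A rough sketch of auto-generating that announcement with a Slack incoming webhook; the webhook URL and field names are placeholders, not a prescribed integration:

import requests

# Placeholder webhook for the public #incidents channel (not a real URL).
INCIDENTS_WEBHOOK = "https://hooks.slack.com/services/T000/B000/XXXX"

def announce(ticket: str, sev: str, service: str, summary: str, channel: str) -> None:
    text = (
        f"JIRA# {ticket}\n"
        f"{sev} - {service} - {summary}\n"
        f"Slack Channel: #{channel}"
    )
    requests.post(INCIDENTS_WEBHOOK, json={"text": text}, timeout=5).raise_for_status()

# Example usage, once a real webhook is configured:
# announce("INC-1234", "SEV2", "Checkout - API", "High error rate on checkout API", "INC-1234")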

🔑 Key Takeaway: Define clear criteria for declaring incidents; this removes hesitation. When anyone can declare an incident quickly and transparently, teams respond faster and learn more effectively.

4. Systems, Playbooks & Observability 🗺️

Incidents aren’t just about people responding—they’re about systems telling us something is wrong and giving enough information to act.

MELT: Metrics, Events, Logs, Traces

Pillar | Purpose | Example
Metrics | Trends, thresholds | CPU usage, error rates 📈
Events | Discrete signals | Deploy, config change ⚙️
Logs | Granular detail, forensics | Error logs, audit trails 📜
Traces | Connect dots across services | Request tracing ➡️

💡 Tip: Mature systems integrate all four, but balance coverage with cost. 💰
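One illustrative way to make the pillars join up is to emit structured events that share identifiers with your metrics and traces. A sketch, with assumed field names:

import json, time, uuid

# One structured event that other pillars can join on: dashboards filter by
# "service", traces link via "trace_id". Field names are assumptions.
event = {
    "timestamp": time.time(),
    "service": "checkout-api",
    "event": "deploy.finished",     # the Events pillar: a discrete signal
    "trace_id": uuid.uuid4().hex,   # lets Logs and Traces connect the dots
    "error_rate": 0.42,             # a Metrics snapshot captured at the time
}
print(json.dumps(event))            # ships as a log line: the Logs pillar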

The Service Catalog: Your Operational Map

A robust service catalog is indispensable:

What a Good Service Catalog Contains:

Runbooks, Dashboards, Dashbooks

Checklists: Always structure docs as clear checklists to reduce errors and ensure critical steps aren’t missed.

Ultra-Terse Runbooks & Visual Cues ✂️👀

Runbooks are most useful when they’re scannable under stress. In high-tempo incidents, no one wants a wall of text. What I’ve found most effective is writing runbooks in ultra-terse, command-style language. Think: checklist, not essay.

Add visual cues—like emojis or icons—to guide the eye to high-priority actions (🛑 STOP, 🧪 VERIFY, ✅ DONE). These cues reduce mental overhead, especially when runbooks are embedded directly into alert payloads or chat workflows. The goal is clarity and speed, not cuteness.

💡 Tip: If your runbook isn’t readable in five seconds during a fire, it’s too long.
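An illustrative ultra-terse runbook in that style (the service, thresholds, and commands are made up):

🚨 CHECKOUT ERROR RATE HIGH
🛑 STOP: freeze deploys to checkout-api
🧪 VERIFY: error rate >1% on the checkout dashboard
🔧 ACT: rollback last deploy via /rollback checkout-api
✅ DONE: error rate <1% for 10 min, post update in the incident channel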

Auto-remediation: Guardrails & Pitfalls 🤖

Automation can act faster than humans, but speed without context is dangerous.
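A sketch of what guardrails can look like in code: caps, context checks, and a bias toward escalating to a human. The restart command, thresholds, and recent-deploy check are all assumptions:

import subprocess, time

MAX_RESTARTS_PER_HOUR = 2          # guardrail: stop thrashing, escalate instead
_restart_log: list[float] = []

def safe_restart(service: str, recently_deployed: bool) -> bool:
    """Auto-restart only when guardrails allow; return False to page a human."""
    now = time.time()
    last_hour = [t for t in _restart_log if now - t < 3600]
    if recently_deployed:
        return False               # context check: a bad deploy needs a rollback, not a restart loop
    if len(last_hour) >= MAX_RESTARTS_PER_HOUR:
        return False               # cap reached: a human should look at this
    subprocess.run(["systemctl", "restart", service], check=True)
    _restart_log.append(now)
    return True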

Platform Engineering Connection 🏗️

Modern platform teams embed observability, runbooks, and automation into the dev workflow, making reliability everyone’s responsibility.

🔑 Key Takeaway: Modern incident readiness requires integrated systems, current docs, practiced chaos, thoughtful automation, and platform-embedded reliability practices.

5. Alerting Without the Noise 🔕

The best alert is the one that matters. The rest are distractions—expensive ones.

SLO-Based Alerting and Signal Quality 📈

SLOs are contracts between system reliability and user expectations. Good alerts are rooted in those contracts.

Want quick triage? Link each alert type to its impact criteria and example dashboards. Even better, use AI-generated summaries (reviewed by humans) to surface what matters—so you’re not chasing 99 dashboards to find one root cause.
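Burn-rate alerting is one way to tie pages to that contract. A minimal sketch, assuming a 99.9% availability SLO and the commonly cited multiwindow fast-burn thresholds (tune them to your own budget):

SLO_TARGET = 0.999                    # 99.9% availability over a 30-day window
ERROR_BUDGET = 1 - SLO_TARGET

def burn_rate(bad_events: int, total_events: int) -> float:
    """How fast we're spending error budget relative to the allowed rate."""
    if total_events == 0:
        return 0.0
    return (bad_events / total_events) / ERROR_BUDGET

def should_page(burn_1h: float, burn_5m: float) -> bool:
    # Both a long and a short window must burn fast, so a brief blip doesn't
    # page but a sustained burn does. 14.4x is roughly 2% of the budget in one hour.
    return burn_1h > 14.4 and burn_5m > 14.4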

Routing, Deduping & Silencing the Noise ➡️⛈️🤫

These are the hygiene layers of alerting.

All of this should link directly to filtered dashboards, current runbooks, and team docs. No more hunting.
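A sketch of that dedupe/silence/route hygiene layer; the fingerprint fields and routing convention are assumptions about your stack:

from collections import defaultdict
from typing import Optional

# "Same problem" fingerprint: adjust to whatever identifies a duplicate in your stack.
open_count: dict[tuple, int] = defaultdict(int)

def route(alert: dict) -> Optional[str]:
    fingerprint = (alert["service"], alert["check"], alert.get("env", "prod"))
    open_count[fingerprint] += 1
    if open_count[fingerprint] > 1:
        return None                          # dedupe: annotate the existing page, don't re-page
    if alert.get("silenced"):
        return None                          # maintenance window or known issue
    return f"{alert['team']}-oncall"         # route to the owning team, with links attached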

Alert Fatigue, False Positives, and Pager Hell 📟🔥

Bad alerts create distrust. False positives drain focus. Pager hell burns people out.

Track key alert health metrics:

Make these visible. Better yet, include them in each service’s landing page so responders see the context in real-time.
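A minimal rollup you could surface on a service’s landing page (the metric choices here are examples, not a standard):

def alert_health(pages: list[dict]) -> dict:
    """Rollup for a service landing page: how noisy is this pager, really?"""
    total = len(pages)
    actionable = sum(1 for p in pages if p.get("actionable"))
    return {
        "pages": total,
        "actionable_rate": round(actionable / total, 2) if total else None,
        "noise_rate": round((total - actionable) / total, 2) if total else None,
    }

print(alert_health([{"actionable": True}, {"actionable": False}, {"actionable": False}]))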

AI/ML for Detection: Promise vs. Reality 🤖🧃

AI can find weird patterns—but unfiltered, it just adds to the noise.

🤖 Reality Check: If AI fires an alert, humans still own the action. Treat it as a suggestion, not a verdict.

Tuning Alerts: From Wall of Noise to Layered Intelligence 🎛️🧠

Avoid alert overload by designing a three-tiered model:

🔁 Three-Tiered Alert Strategy:

  1. 📟 Page Alerts (High Fidelity):
    • 🚨 User impact is likely or confirmed
    • 🤖 No auto-remediation
    • 🕐 Needs immediate response
  2. 📮 Ticket Alerts (Medium Fidelity):
    • 📌 Worth tracking (e.g., disk 80%, 5xx spikes)
    • 🎫 Routed into backlog
  3. 📊 Dashboard/FYI Alerts (Low Fidelity):
    • 🧾 Informational
    • 🛑 Suppress during incidents

💡 Every alert should answer: “What action do I expect someone to take?”

You should be able to sort every alert into one of these buckets—if not, it probably doesn’t belong.
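A sketch of that sorting as code, using assumed decision inputs:

def alert_tier(user_impact: bool, needs_immediate_response: bool, worth_tracking: bool) -> str:
    if user_impact and needs_immediate_response:
        return "page"        # 📟 wake a human now
    if worth_tracking:
        return "ticket"      # 📮 route to the backlog
    return "dashboard"       # 📊 FYI only; suppress during incidents

# Disk at 80% is worth tracking but shouldn't page anyone at 3AM:
print(alert_tier(user_impact=False, needs_immediate_response=False, worth_tracking=True))  # "ticket"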

Living Documentation Inside the Alert Payload 📎📦

A strong alert payload is a mini-playbook.

📦 Include in every payload:

💬 Bonus: Use Slack bots to auto-expand this context when the alert fires.

🛠️ Tip: If your payload doesn’t help someone triage in 60 seconds, it’s not done.

Alert Ownership and Hygiene 🧼🧑‍🔧

Don’t let ancient alerts linger. Maintain alert quality like you maintain code.

🧽 Alert Hygiene Checklist:

✂️ If nobody would miss the alert, delete it.

Fire Drill Your Alerts 🔥📣

Test alerts in controlled environments. See if humans can actually respond to them.

🧪 Simulation Steps:

Use environment-specific channels for drills too—don’t test everything in #general.

If it can’t survive a drill, it won’t survive a real SEV.
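A drill harness can be as small as this sketch; fire_test_alert and wait_for_ack are placeholders for whatever your alerting stack actually exposes:

import time

def run_drill(fire_test_alert, wait_for_ack, budget_seconds: int = 300) -> bool:
    """Fire a synthetic alert into a drill channel and time the human response."""
    started = time.monotonic()
    alert_id = fire_test_alert(channel="#incident-drills")   # never #general
    acked = wait_for_ack(alert_id, timeout=budget_seconds)
    elapsed = time.monotonic() - started
    print(f"acked in {elapsed:.0f}s" if acked else f"no ack within {budget_seconds}s")
    return acked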

Alert Response Plans: Terse Runbooks 🛬📚

When alerts come in from all sides, responders shouldn’t have to assemble their own context puzzle.

Create Alert Response Plans: simple, reusable docs for each alert type (e.g., high latency, full disk, SLO breach).

Each ARP includes:

This becomes the first link shared in triage. Build it once, iterate, reuse it every time.

Minimize Clicks: Make It Instant, Not a Scavenger Hunt 🖱️❌

When you’re on-call at 4AM, every click is a tax on cognition. Responder UX matters.

Design alerts so responders don’t have to dig.

🏎️ Low-Click Design Principles:

🧠 Think like UX for responders:
When the alert hits, they should immediately see what broke, how bad, what to check, and what to do.

Visual Cues & Mental Anchors 🎯👁️‍🗨️

Design alert payloads for skimmability. Use emoji and formatting to direct the eye.

✅ Good format:

🚨 SEV-1: Checkout Errors
🔍 Error Rate: 42% (normal <1%)
📉 SLO Burn: 7% in last hour
🔗 [Dashboard] | [Runbook Step 1] | [Escalate to on-call]
📎 Context: New deploy @12:32, API latency spiked
🎯 Next Step: Rollback deploy via /rollback checkout-api

💡 Design your alert like a status page update for engineers—tight, scannable, decisive.
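If you template it, every alert ships the same skimmable shape. A sketch that renders the format above from structured fields (field names are assumptions):

def render_alert(sev: str, title: str, error_rate: str, slo_burn: str,
                 links: str, context: str, next_step: str) -> str:
    return "\n".join([
        f"🚨 {sev}: {title}",
        f"🔍 Error Rate: {error_rate}",
        f"📉 SLO Burn: {slo_burn}",
        f"🔗 {links}",
        f"📎 Context: {context}",
        f"🎯 Next Step: {next_step}",
    ])

print(render_alert("SEV-1", "Checkout Errors", "42% (normal <1%)", "7% in last hour",
                   "[Dashboard] | [Runbook Step 1] | [Escalate to on-call]",
                   "New deploy @12:32, API latency spiked",
                   "Rollback deploy via /rollback checkout-api"))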

The “First 5 Seconds” Rule ⏱️👀

A responder should be able to answer these five within seconds of seeing the alert:

  1. ❓ What broke?
  2. 🧠 What’s the impact?
  3. 📊 Where can I verify it?
  4. 🛠️ What should I try first?
  5. 🧑‍💻 Who do I call if I’m stuck?

If your alert doesn’t answer those, fix the payload—not the human.

🔑 Key Takeaway:
🔕 Alerting isn’t about flooding inboxes—it’s about earning the right to interrupt someone.
🧩 Design your alerts like products: layered, human-aware, context-rich.
✅ Quiet alerts = faster humans = faster resolution.

6. Training, Simulation & Team Maturity 🏋️‍♀️

Chaos Engineering as Ongoing Readiness

Practice, don’t just plan.

Chaos engineering deliberately introduces failure to test resilience.

Practical Chaos Engineering: Building Muscle

Start with safe, controlled experiments in staging/dev environments.

Example: Simulating API Node Failure
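A minimal sketch of what this kind of experiment might look like in a staging environment; the node names, stop/start commands, and health check are illustrative assumptions, not a recipe:

import random, subprocess

API_NODES = ["api-staging-1", "api-staging-2", "api-staging-3"]

def kill_one_node() -> str:
    victim = random.choice(API_NODES)
    subprocess.run(["ssh", victim, "sudo systemctl stop api"], check=True)
    return victim

def run_experiment(check_slo) -> None:
    victim = kill_one_node()
    print(f"killed {victim}; SLO held: {check_slo()}")   # did failover actually absorb it?
    subprocess.run(["ssh", victim, "sudo systemctl start api"], check=True)  # always restore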

Chaos Maturity Levels:

Level | Description
Level 1 | Reactive: Terminate instances, kill processes
Level 2 | Proactive: Schedule experiments
Level 3 | Integrated: Chaos in CI/CD, automate faults
Level 4 | Adaptive: System adjusts based on live feedback

🔑 Key Takeaway: You can’t control when the next incident hits—but you can train your team to meet it with confidence. Chaos engineering and simulation aren’t optional; they’re how you transform individual skill into organizational readiness.

PART II: During the Incident 🔥

7. Triggers & Assembly 🚦

Every alert begins with a signal. The difference between chaos and coordination starts at that moment.

Who Triages the Alert

Alert Payload: From Noise to Signal

The best alerts are:

Checklist Example: ✅

Alert: High CPU on API Server

Checklist:

From Triage to Declaration

The transition between “alert received” and “incident declared” should be explicit and documented.

Standardized Intake Questions: ❓

Compliance and Business Risk

Not every incident requires immediate action. Sometimes the business accepts risk—document the risk, monitoring, and who made the call.

Access Controls and Break-Glass Scenarios 🚨

🔑 Key Takeaway: The first few minutes are where clarity and chaos compete. Triage is about signal discernment, role clarity, and high-quality intake.

8. Incident Command in Practice 🧑‍✈️

The Incident Commander (IC) is the single person responsible for the overall incident response. This is a temporary, highly focused role.

The Role of the Incident Commander

The IC is like the conductor of an orchestra—they don’t play every instrument, but they ensure everyone is playing in harmony.

IC Responsibilities:

🙅‍♀️ What an IC is NOT: The IC is not the person who fixes the problem; they are the person who ensures the problem gets fixed. If the incident commander is glued to dashboards, no one is steering the response! Delegate the analysis. Coordinate the people. Stay above the weeds. Resist the urge to dive into debugging!

Incident Roles and Responsibilities

Effective incident response relies on clear roles:

Important: In many organizations, one person may wear multiple hats initially, but the mindset of these distinct roles is crucial.

Handling Swarming: Creating Focused Workstreams During Chaos

Large-scale incidents often attract a flood of well-meaning responders. Slack fills with noise. The bridge becomes a spectator sport. People want to help—but without structure, they end up repeating efforts, derailing focus, or just adding background chaos.

The IC’s job isn’t to shut people out. It’s to create order from the influx. That means giving the swarm something useful to do—and somewhere to do it.

Break Into Workstreams

Divide the incident into focused areas of investigation or remediation. These typically follow existing team boundaries or runbook domains.

Examples:

Each workstream should have:

This keeps effort compartmentalized and allows the IC to move horizontally without micromanaging.

Use a Shared Landing Page

Establish a central document to orient everyone. This is the front door for anyone dropping into the incident.

Options include:

The landing page should contain:

Drop this link early and often. Anyone asking “What’s going on?” gets pointed here first.

Slack Discipline

Avoid the scroll-of-death. Centralize updates in a few clearly named threads:

Pin these in the incident channel or on the landing page. ICs should post summary updates, not raw logs. Ask responders to reply in the relevant thread, not the main channel.

Managing the Video Bridge

Video bridges are useful—but risky when unmanaged. Treat them like a war room, not a water cooler.

Best practices:

Most tactical work still happens in Slack or docs. If your bridge feels like a hangout, it’s time to trim the invite list.

Every responder wants to help. Make it easy for them to be useful without becoming a distraction.

The Incident Lifecycle: From Active to Resolved

  1. Detection & Declaration: Alert fires, IC declared. 🚦
  2. Triage & Assessment: What’s the impact? What’s the severity? 🤔
  3. Investigation: Deep dive into root cause. 🔬
  4. Mitigation: Reduce or stop impact (e.g., rollback, disable feature). 🛡️
  5. Resolution: Full fix applied, service restored. ✅
  6. Recovery: Bring systems back to full health. 💚
  7. Post-Incident Analysis: Learn from the incident. 📝

Decision-Making Under Pressure: OODA Loop

The OODA Loop (Observe, Orient, Decide, Act) is a powerful model for rapid decision-making:

  1. Observe: Gather information (metrics, logs, reports). 🧐
  2. Orient: Analyze the situation, put it in context (mental models, past incidents). 🧠
  3. Decide: Choose a course of action (mitigate, investigate further). ✅
  4. Act: Implement the decision. 🚀

Then, the loop repeats, constantly adapting to new information. This iterative process is vital in chaotic environments.

🧭 Seek Clarity Early

Incidents begin in a fog.
📊 Dashboards light up.
🚨 Alerts fire.
💬 Slack explodes.

It’s easy to confuse motion with progress.
But flailing fast is still flailing.
The IC’s first job isn’t to fix—it’s to make sense.

🧭 Clarity Is the Compass

Not certainty. Not root cause.
Just a grounded view of:

🔍 Start With the Basics

🗣️ Say it out loud.
👂 Ask others to explain their thinking.
If someone says, “It’s the database,”—ask:

Not to challenge—just to stabilize the narrative.

🧱 Use Structure

📝 Shared doc
📌 Pinned update
📋 List of knowns, unknowns, blockers

These small anchors ⛓️ reduce thrash and help the team move together.

Without shared clarity:

🧠 Practice Epistemic Humility

Remember:

The hardest part isn’t knowing what’s broken—it’s knowing what you can’t see.

Great responders ask:

They treat:

🔑 Key Takeaway:
Strong incident command isn’t about heroics—it’s about structure, clear roles, and iterative clarity.
In the fog, clarity > certainty.
But clarity without humility becomes overconfidence.
🧠 Question everything—especially yourself.

9. Communication Under Pressure 🗣️

During an incident, clear communication is paramount. Misinformation or lack of information fuels panic and slows resolution.

Internal Communication: Keeping the Team Aligned

🧭 The CAN Format: A Lightweight Comms Standard

C: Condition
What’s happening right now?
What systems or services are impacted?
When did it start?

A: Action
What’s been done?
What’s underway or queued?
What mitigation steps or playbooks have been attempted?

N: Need
What do we need?
Who should act, investigate, or approve?
What blockers exist?

Use this format in Slack threads, bridge updates, and stakeholder pings. It cuts noise and ensures people hear what matters.

Example Update:
C: Elevated 5xxs on checkout API, spike at 10:14 UTC
A: Rolled back 10:00 deploy, investigating DB connection pool
N: Need SRE to confirm read replica lag in #checkout-db

Want to scale this? Use a Slack /can shortcut to prompt structured updates or train leads to anchor standups and bridges with it.
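A sketch of that /can shortcut as a tiny slash-command handler (this assumes a Slack slash command pointed at the endpoint below; the parsing is deliberately naive):

from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/slack/can", methods=["POST"])
def can_update():
    # Expected input: /can C: what's happening | A: what's been done | N: what we need
    text = request.form.get("text", "")
    parts = dict(p.strip().split(":", 1) for p in text.split("|") if ":" in p)
    update = (
        f"*C (Condition):* {parts.get('C', '?').strip()}\n"
        f"*A (Action):* {parts.get('A', '?').strip()}\n"
        f"*N (Need):* {parts.get('N', '?').strip()}"
    )
    return jsonify({"response_type": "in_channel", "text": update})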

🧠 Speak the Same Language: Standardized Terminology in High-Pressure Environments

Communication during an incident hinges not just on speed, but clarity. Terminology friction—when responders don’t speak the same operational language—slows things down, increases error rates, and misroutes work. The fix isn’t fancy tooling—it’s consistent language, used everywhere.

✂️ Terseness, Not Obscurity

Terse language is a feature, not a bug. But it becomes a liability when masked behind team aliases, obscure acronyms, or insider references.

If someone says “get Bluebird on it” and half the team doesn’t know that’s the Traffic SRE group, you’ve just added confusion. Similarly, acronyms like “MARS” mean different things to different teams. Assume nothing. Spell it out.

🧩 Consistency Across the Stack

Standardized terminology should appear everywhere:

Pick a canonical term—“Probes,” not “Canaries”—and use it across the board. One word, one meaning.

🏗️ Build Language Into Culture

Clear, shared language reflects a strong ops culture. Encourage staff engineers and ICs to model it. Bake it into code reviews, alert payloads, postmortems, and onboarding.

You don’t need to sound clever. You need to be understood.

✅ The best responders sound boring. Clear, repeatable, boring language wins.

Slack First, Zoom If You Must

When every second matters, Slack is your command center. Zoom is supplementary.

Why Slack wins:

Zoom? Great for:

But if a decision is made on Zoom, someone must write it into Slack.
📢 If it didn’t make it to the channel, it didn’t happen.

Communication Tools & Workflows

External Communication: Managing Stakeholder Expectations

Segment your audience:

Pro Tips:

🔑 Key Takeaway:
Clarity under pressure isn’t optional—it’s the product of culture, structure, and repetition. Use Slack as your cockpit, use language precisely, and give everyone the same map. The only good chaos is the kind you’re driving.

10. Managing People, Pace & Burnout 🧘

Incidents are sprints, not marathons. Sustained high-pressure work leads to burnout and errors.

Recognizing and Mitigating Fatigue

Avoiding Cognitive Overload

Psychological Safety During the Incident

IC Self-Care & Handover

The IC role is incredibly demanding. Self-care is crucial.

Follow-the-Sun Coverage ☀️🌏

Global teams are a superpower—if you use them right.

Follow-the-sun coverage reduces fatigue and preserves decision quality by shifting incidents to fresh responders in aligned time zones. Instead of waking up heroes at 3AM, you rotate responsibility across regions as the sun moves.

It only works if:

This isn’t just operationally efficient—it’s biologically smart. Humans are not 24/7 systems. Sleep debt, disrupted circadian rhythms, and cognitive fatigue all degrade incident response.

🧠 Human factors matter. Tired responders miss signals, miscommunicate, and default to tunnel vision.

Wake-the-right-person beats wake-the-best-person. Optimizing for local time zones isn’t about laziness—it’s about preserving clarity under pressure.

If your team spans multiple continents but you’re still running incidents out of a single timezone, you’re paying for 24/7—but operating like 9-to-5.

🔑 Key Takeaway:
Follow-the-sun coverage isn’t just about scale—it’s about respecting the limits of human cognition. Minimize task switching, protect sleep, and align your processes to human performance windows.

PART III: After the Incident 📝

11. Declaring the End & Recovery 🏁

The incident isn’t truly over until services are fully restored, systems are stable, and the learning process begins.

Criteria for Incident Resolution

Resolution is not just “it’s working now.” It requires:

The Role of the Incident Commander in Closure

The IC is responsible for officially declaring the end of the active incident. This involves:

Recovery Steps & Checklist

Recovery means bringing systems back to their pre-incident state, and often better.

Recovery Checklist:

The “All Clear” Signal 🚥

A clear, unambiguous “all clear” signal helps shift the team’s focus from crisis to recovery and learning. This could be a message in the incident channel:

🔑 Key Takeaway: A clear and deliberate closure process ensures true resolution, prevents “phantom incidents,” and smoothly transitions the team to the critical learning phase.

12. Postmortems That Don’t Suck ✨

The post-mortem (or post-incident review) is the most critical learning opportunity. A “good” post-mortem isn’t about assigning blame; it’s about understanding and improving.

Blameless Postmortems: The Foundation of Learning ❤️

A blameless culture is non-negotiable for effective post-mortems.

Structure of a Modern Postmortem

A robust post-mortem document typically includes:

  1. Summary: High-level overview of the incident, impact, and resolution.
  2. Timeline: Detailed chronological log of events, including detection, actions taken, and key decisions. ⏳
  3. Impact: Comprehensive description of business and customer impact. 💸
  4. Root Cause(s): The underlying systemic factors that led to the incident. (Often multiple contributing factors). 🌳
  5. Detection: How was the incident detected? Was it timely? 🚨
  6. Mitigation: How was the impact reduced or stopped?
  7. Resolution: How was the service fully restored?
  8. Lessons Learned: What did we learn about our systems, processes, and people? 🎓
  9. Action Items: Concrete, measurable tasks assigned to specific owners with due dates. These are the outputs of the post-mortem. ✅
  10. Preventative Measures: What changes will prevent recurrence or reduce future impact? 🛡️

Facilitating the Postmortem Meeting

🔑 Key Takeaway: A blameless postmortem is a gift to your organization. It transforms errors into opportunities for systemic improvement, fostering a culture of continuous learning and resilience.

Positive Retrospectives: When Nothing Broke (Because You Did It Right) ✨

We usually wait for things to break before we learn from them. But some of the best signals come from the near-misses—the moments where something could have gone sideways but didn’t.

Maybe a deploy was flagged and rolled back before it hit prod. Maybe someone spotted an odd metric pattern, kicked off an investigation, and quietly averted a major issue. Maybe a fallback system kicked in perfectly and no one even noticed there was a problem.

These are not accidents. These are successes. And they deserve just as much attention as the big blowups.

We call these positive retrospectives.

A positive retrospective is a deliberate look back at a time when the system, the team, or the process caught something early and acted before damage occurred. It’s not about high-fives or chest-thumping. It’s about studying what worked, so you can do it again.

What to explore in a positive retro:

You’re not chasing a root cause here—you’re mapping the early warning system and the immune response. These moments are often quiet wins that disappear into the noise unless someone captures them.

If you want real resilience, you can’t just study failures. You have to study the things that almost failed but didn’t. They show you where your systems flexed instead of snapped, and where your people trusted their gut and were right.

🔑 Key Takeaway:
Celebrate the anti-incidents. They’re often invisible, but they’re proof your systems—and your people—are getting stronger.

13. From Lessons to Systems Change 🔄

A retrospective without action is just a history lesson. The real value comes from turning insights into tangible improvements.

The Action Item Lifecycle

Action items must be treated with the same rigor as product features.

  1. Creation: Clear, specific, measurable, assigned, time-bound (SMART).
  2. Prioritization: Integrated into existing backlog processes (e.g., JIRA, Asana). Prioritized alongside other development work. 📈
  3. Tracking: Regularly reviewed and updated.
  4. Completion: Verified and closed. ✅
  5. Verification: Confirm the change had the intended effect.
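If it helps, encode the SMART fields so nothing ships without an owner and a due date. A sketch with assumed field names (the example item is made up):

from dataclasses import dataclass
from datetime import date

@dataclass
class ActionItem:
    title: str              # specific and measurable
    owner: str              # assigned to a person, not a team alias
    due: date               # time-bound
    done: bool = False
    verified: bool = False  # closed only after the change is shown to work

item = ActionItem("Add read-replica lag alert for checkout DB", "frank", date(2025, 7, 15))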

Prioritizing Reliability Work

This is often the hardest part. Reliability work (from post-mortems) competes with new feature development.

The Feedback Loop: How Incidents Inform Product & Engineering

Championing Systemic Change

🔑 Key Takeaway: The true measure of an effective incident management program is its ability to drive concrete, systemic change. Turn lessons learned into prioritized, actionable work that continuously improves reliability.

14. Measuring What Matters 📊

You can’t improve what you don’t measure. Metrics provide insights into the health of your incident response process and system reliability.

Key Incident Metrics

The Danger of Vanity Metrics

Building Incident Dashboards & Reports

Continuous Improvement Loop

Measuring is part of a continuous loop:

  1. Define Metrics: What do you want to improve?
  2. Collect Data: Implement logging and tooling.
  3. Analyze & Visualize: Understand trends and outliers.
  4. Identify Areas for Improvement: Where are the bottlenecks?
  5. Implement Changes: Prioritize and execute action items.
  6. Measure Again: Did the changes have the desired effect?
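As a sketch of what “collect and analyze” can look like, here’s MTTA/MTTR computed from whatever timestamps your tracker exports (field names and values are assumptions):

from datetime import datetime

def mean_minutes(incidents: list[dict], start: str, end: str) -> float:
    deltas = [
        (datetime.fromisoformat(i[end]) - datetime.fromisoformat(i[start])).total_seconds() / 60
        for i in incidents
    ]
    return sum(deltas) / len(deltas)

incidents = [{"detected": "2025-06-01T10:14:00",
              "acknowledged": "2025-06-01T10:16:00",
              "resolved": "2025-06-01T11:02:00"}]
print("MTTA (min):", mean_minutes(incidents, "detected", "acknowledged"))  # 2.0
print("MTTR (min):", mean_minutes(incidents, "detected", "resolved"))      # 48.0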

🔑 Key Takeaway: Strategic metrics provide the evidence needed to understand your current state, justify investment in reliability, and demonstrate the impact of your incident management program. Choose metrics that drive actionable insights, not just numbers.

15. The Future State of Incident Command 🔮

Incident management is a constantly evolving discipline. What’s next?

AI/ML in Incident Response: Beyond Anomaly Detection

🤖 Reality Check: AI won’t replace human ICs soon. It will augment their capabilities, offloading cognitive burden and speeding up information processing. Human judgment, empathy, and creative problem-solving remain essential.

Proactive Incident Management & Resilience Engineering

Distributed & Federated Incident Command

As systems become more distributed, so too will incident response.

Human-Centered Design for On-Call & Tooling

🔑 Key Takeaway: The future of incident command is about continuous human-computer collaboration, reliability deeply integrated into every stage of the software lifecycle, and a relentless focus on the well-being and adaptive capacity of the people on the front lines.

Conclusion

The journey of mastering incident command is continuous. It’s a blend of technical expertise, human psychology, and organizational culture. You’ve learned about:

The next time an alert fires, you’ll be better equipped. Not just with tools, but with a mindset, a framework, and the confidence to lead. The art of incident command is about transforming chaos into learning, and ultimately, building more resilient systems and teams.

The Journey Continues: Further Learning and Resources 🚀

Keep learning. Keep practicing. Keep building resilient systems and, more importantly, resilient people. Your users—and your on-call teams—will thank you. 🙏

One Last Thing 💬

If this book helped you—if it made you think, saved you time, or gave you language for what you’ve lived—consider helping someone else.

That might mean sending feedback. Sharing it with a teammate. Or supporting the project so it stays free for the next person who needs it.

This is value-for-value. No gatekeepers. Just trust.

Thanks for reading. Stay resilient. 🙏