SEV1: The Art of Incident Command

A Modern SRE-Aligned Approach to Incident Management

By Frank Jantunen

Copyright © 2025 Frank Jantunen All rights reserved.

This work is distributed under a value-for-value model. It may be freely read, shared, and discussed for personal, non-commercial use. If you found it valuable, consider supporting the project, offering feedback or sharing it. 🙏

No paywall. No ads. Just value-for-value.

Support the project:

⚡ Bitcoin: bc1qxl8uy3acrhlhgvn7653twmdmhr97j0xjxk2cak

💸 PayPal: https://paypal.me/frankjantunen

For commercial use—including redistribution, employee training, or internal documentation—please contact the author directly at [email protected].

No part of this publication may be reproduced, stored in a retrieval system, or transmitted by any means—electronic, mechanical, photocopying, recording, or otherwise—for commercial use without prior written permission from the author.

Printed in USA 🇺🇸

First Edition – June 2025

This book is intended for informational and educational purposes only. The views expressed are those of the author and do not represent the positions of any employer, organization, or entity unless explicitly stated.

All trademarks, logos, and brand names mentioned are the property of their respective owners. Their use is for identification and illustrative purposes only and does not imply affiliation, sponsorship, or endorsement.

Mentions of specific services, platforms, or vendors—including but not limited to PagerDuty, Datadog, Honeycomb, Gremlin, Netflix, Google, PayPal, and Microsoft—are made for example and context. No payments, sponsorships, or kickbacks were received. This book promotes no specific tool or service. All references are used in a neutral, educational context.

The content is provided “as-is.” Readers assume full responsibility for the use of any information presented herein. Always evaluate ideas in the context of your organization’s specific needs and risks.

Table of Contents 📜

Acknowledgements
Foreword

Part I: Before the Incident 🕰️

  1. What Is an Incident, Really? 🤔
  2. Operational Mindset & Culture 🧠
  3. Clear Criteria for Incident Declaration ✅
  4. Systems, Playbooks & Observability 🗺️
  5. Alerting Without the Noise 🔕
  6. Training, Simulation & Team Maturity 🏋️‍♀️

Part II: During the Incident 🔥

  7. Triggers & Assembly 🚦
  8. Incident Command in Practice 🧑‍✈️
  9. Communication Under Pressure 🗣️
  10. Managing People, Pace & Burnout 🧘

Part III: After the Incident 📝

  11. Declaring the End & Recovery 🏁
  12. Postmortems That Don’t Suck ✨
  13. From Lessons to Systems Change 🔄
  14. Measuring What Matters 📊
  15. The Future State of Incident Command 🔮

Conclusion

The Journey Continues: Further Learning and Resources 🚀

Acknowledgements 🙏

To my family, who never asked why I was obsessed with writing this book—just made sure I didn’t forget to eat. Thank you for the support! ❤️

To Eric, who’s been a great mentor and a constant source of inspiration.

To everyone I’ve worked with over the years. 🤝

To the Learning From Incidents community, and to those who’ve pushed reliability thinking beyond dashboards and into the human domain—your work paved the way for this one.

Thank you to everyone who’s ever written a clear postmortem, spoken up when something felt off, or challenged process for the sake of people. You’ve made this field more humane, and this book wouldn’t exist without your example.

And to anyone who reads this and offers value for value—thank you. That exchange means more than you know. ✨

Foreword

I got into tech in June 2000—slapping together fugly websites, streaming low-res videos, and trying to keep NT4 servers running. Before YouTube was even a concept, I was live streaming, running end-to-end event production, and becoming the SME for anything streaming or CDN. 👨‍💻

By 2011, I’d stumbled into incident management. The industry was deep in its ITIL hangover—rigid process, thick hierarchies, and enough red tape to mummify a data center. 📜 It brought order, sure, but agility? Like trying to steer a cargo ship with a joystick. 🚢

Then came the SRE wave. 🌊 Suddenly everyone wanted to “do SRE,” flipping the script on how we think about reliability and response. But despite all the tooling, the frameworks, the culture decks—we’re still flailing when it comes to human factors.

I’ve ridden every wave since—sometimes surfing 🏄‍♂️, sometimes just staying afloat. In 2018, working at a startup, I got my first exposure to the role of incident commander. No training, no playbook, barely any system visibility. Just raw chaos, flaming chainsaws 🔥🪚, and the expectation to “own it.” That trial by fire taught me this: strong incident command is non-negotiable, especially when you’re also wearing three other hats. 🎩🎩🎩

Across startups and giants, I’ve watched teams fumble and stall—not because they lacked tools, but because they ignored culture. Fixing incident management means wrestling that beast. And let’s not kid ourselves—it’s like sprinting uphill through molasses.

SEV1 – The Art of Incident Command is the distilled chaos. Not sanitized “best practices,” but the book I wish someone had handed me when I was drowning. It’s built from scars, scraped from real-world incidents, and shaped by teams both scrappy and sprawling.

Today, incident response is a three-ring circus: engineers juggling pagers 📟, debugging blind 🕶️, and improvising in real time while the stakes climb and the tooling sprawls. This book is your survival guide and your last line of defense.

🌊 The water’s rough. Are you ready to jump in?

—Frank Jantunen

PART I: Before the Incident 🕰️

1. What Is an Incident, Really? 🤔

The ITIL View: A Starting Point

The ITIL (Information Technology Infrastructure Library) framework provides a classic definition of an incident:

“An unplanned interruption to an IT service or reduction in the quality of an IT service.”

This approach is service-focused, reactive, and operational by nature—an incident exists when someone or something detects a problem.

Where ITIL Falls Short: The Priority Matrix Trap 😬

In modern, complex systems, the traditional ITIL model’s handling of urgency and impact becomes a critical bottleneck. The model separates priority from severity and calculates priority as a function of two inputs:

Priority = Impact x Urgency

Debating whether an incident is a P2 or a P3 wastes time that should be spent mitigating escalating customer impact.
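
To make the trap concrete, here is a minimal sketch of the classic lookup (illustrative mapping only, not taken from ITIL or from this book). Notice that the code exists purely to pick a label, not to reduce impact.

```python
# Classic ITIL-style lookup: Priority = Impact x Urgency.
# Illustrative mapping only; real matrices vary by organization.
IMPACT = {"high": 1, "medium": 2, "low": 3}
URGENCY = {"high": 1, "medium": 2, "low": 3}

# Rows = impact, columns = urgency; P1 is the most urgent ticket class.
PRIORITY_MATRIX = {
    (1, 1): "P1", (1, 2): "P2", (1, 3): "P3",
    (2, 1): "P2", (2, 2): "P3", (2, 3): "P4",
    (3, 1): "P3", (3, 2): "P4", (3, 3): "P5",
}

def priority(impact: str, urgency: str) -> str:
    """Return the ticket priority for a given impact/urgency pair."""
    return PRIORITY_MATRIX[(IMPACT[impact], URGENCY[urgency])]

# The debate ("is impact high or medium?") only changes the label,
# not the outage: P2 vs. P3 while customers are still affected.
print(priority("high", "medium"))    # P2
print(priority("medium", "medium"))  # P3
```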

The SRE Mindset: Engineering for Failure 💡

Site Reliability Engineering (SRE) assumes system failures are inevitable and collapses the distinction between priority and severity so teams can move faster.

Key shifts:

Where an Incident Begins

A modern guideline:

An incident begins when a responder believes action may be needed to preserve service reliability.

One person is all it takes to declare: “Something may be wrong. We should respond as if it is until we confirm otherwise.”

Example triggers:

Example Modern Severity Matrix 🚀

| Severity | Impact | Typical Response Time | Examples & Notes |
| --- | --- | --- | --- |
| SEV-0 (optional) | Severe platform failure, business risk | Immediate | Catastrophic event, exec-level coordination, unknown recovery path |
| SEV-1 | Major service degradation or outage | < 3 min | Core features down, large-scale impact, “all-hands” response |
| SEV-2 | Moderate service impact | < 5 min | Significant performance issues, workaround may exist, multiple services affected |
| SEV-3 | Degraded user experience | < 15 min | Minor bug, single-service impact, logged for resolution |
| SEV-4 (optional) | Minimal/cosmetic impact | < 48 hours | Flexible, for deferred issues |
| SEV-5 (optional) | External/Partner issues | Monitor Only | Third-party outage, visible but not actionable |

📊 Reality Check: Most teams operate with just SEV-1 to SEV-3. Start simple, expand only if needed.
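
One way to keep a matrix like this actionable is to encode it next to the tooling that consumes it, so paging rules and chat bots read the same definitions. A minimal sketch, with assumed field names and thresholds:

```python
# Severity definitions as data: one source of truth for bots,
# paging rules, and status pages. Values here are illustrative.
SEVERITIES = {
    "SEV1": {"impact": "Major service degradation or outage",
             "response_sla_min": 3, "all_hands": True},
    "SEV2": {"impact": "Moderate service impact",
             "response_sla_min": 5, "all_hands": False},
    "SEV3": {"impact": "Degraded user experience",
             "response_sla_min": 15, "all_hands": False},
}

def response_sla(sev: str) -> int:
    """Minutes within which a responder should be engaged."""
    return SEVERITIES[sev]["response_sla_min"]

print(response_sla("SEV2"))  # 5
```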

Lifecycle Comparison

| Framework | Lifecycle Steps | Primary Context |
| --- | --- | --- |
| ITIL | Detection → Logging → Categorization → Prioritization → Diagnosis → Resolution → Closure | Operational helpdesk 📞 |
| SRE | Detect → Triage → Mitigate → Resolve → Review | Fast-moving, distributed systems 💨 |
| NIST | Preparation → Detection & Analysis → Containment, Eradication & Recovery → Post-Incident Activity | Security-focused response 🛡️ |

🔑 Key Takeaway: Effective incident management requires knowing which framework to apply and when to adapt. SRE principles thrive on clarity and speed, collapsing the old severity/priority math into a single, actionable SEV level.

🔑 Keep it simple: map severity to priority directly and define levels by the response they demand.

2. Operational Mindset & Culture 🧠

Incidents do not happen in a vacuum. Team response, escalation, and recovery are shaped by culture—how a team thinks, behaves, and values its work.

Resilience Over Redundancy

Redundancy can mask fragility. Instead of fixing flaky systems, teams add layers.

Resilience means being honest about what breaks, why, and what to do when it breaks again.

It’s about graceful degradation, fast recovery, and human readiness.

Resilient teams:

Resilient systems:

Blamelessness and Psychological Safety ❤️

If engineers are afraid to speak up, incident response is compromised.

Blame kills curiosity. 🔪

Blameless culture separates the person from the process, focusing on why a decision made sense at the time.

Psychological safety means:

SRE vs. DevOps Culture: Bridging the Mindsets

| SREs | DevOps |
| --- | --- |
| Emphasize error budgets, reliability as feature | Fast, iterative, delivery-focused |
| Treat ops as software problem | Willing to trade stability for speed |
| Quantify risk, push back on unsustainable pace | Adaptable, but risk burnout or inconsistent quality |

Bridge-building strategies:

🏗️ Building Resilient Systems: Two Pillars

System-Level Resilience

Adaptive Capacity (Resilience Engineering)

🔑 Key Takeaway: Culture isn’t a slide deck or a slogan. It’s what people actually do—under pressure, in the dark, without a script. If you want real resilience, you need both: systems built to absorb shocks, and teams trained to adapt.

3. Clear Criteria for Incident Declaration ✅

If you ask five teams what counts as an incident, you’ll likely get ten different answers. Incident management cannot start effectively until everyone knows what qualifies as an incident, who can declare one, and what should happen next.

ITIL vs. SRE: Definitions

| Concept | ITIL | SRE |
| --- | --- | --- |
| Severity | Not formal. Often muddled with “impact.” | Clear measure of technical impact (e.g., downtime). |
| Priority | Blend of impact and urgency for ticket SLAs | Rarely used. Urgency implied by severity. |

In Practice: Where It Breaks 💥

Scenario: A production database flips into read-only mode.

The Fix:

Incident Declaration Criteria

A healthy incident process starts with specific, trigger-based criteria:

📢 Important: “Incident” doesn’t mean “disaster.” It means structured response.

The Security Dimension 🔒

Some incidents are the direct result of malicious activity (e.g., DDoS attack). SRE and Security must collaborate:

Who Can Declare?

Anyone in the organization should be empowered to declare an incident. If it turns out to be a false alarm, that’s acceptable—over-declaring is better than delaying.

Example Incident Assembly Orchestration:
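
The orchestration example itself isn’t reproduced above, so here is a minimal sketch of what automated assembly can look like. The helper functions are stand-ins for your paging and chat integrations, not a real API.

```python
# Hypothetical assembly flow: anyone can call declare_incident().
# The helpers are placeholders for real integrations
# (PagerDuty, Slack, Jira, etc.); replace them with your own clients.
from datetime import datetime, timezone
from itertools import count

_ticket_ids = count(2341)

def create_channel(name: str) -> str:
    print(f"[chat] created channel {name}")
    return name

def page_oncall(service: str, severity: str) -> None:
    print(f"[pager] paging on-call for {service} ({severity})")

def post_summary(channel: str, text: str) -> None:
    print(f"[chat] {channel}:\n{text}")

def declare_incident(service: str, summary: str, severity: str = "SEV2") -> str:
    """Anyone can declare; over-declaring beats delaying."""
    ticket = f"INC-{next(_ticket_ids)}"
    channel = create_channel(f"#inc-{ticket.split('-')[1]}")
    page_oncall(service, severity)
    post_summary(
        "#incidents",
        f"Ticket# {ticket}\n"
        f"{severity} - {service} - {summary}\n"
        f"Slack Channel: {channel} | Declared "
        f"{datetime.now(timezone.utc):%H:%M} UTC",
    )
    return channel

declare_incident("Checkout API", "High error rate on checkout API")
```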

Transparency and Announcement 📢

Incidents should be visible. Unless security-sensitive, post in a public #incidents channel with an auto-generated summary.

Example:

Ticket# INC-2341
SEV2 - Checkout - API - High error rate on checkout API
Slack Channel: #inc-2341

🔑 Key Takeaway: Clear criteria eliminate hesitation. When anyone can declare an incident quickly and transparently, teams respond faster and learn more effectively.

4. Systems, Playbooks & Observability 🗺️

Incidents aren’t just about people responding—they’re about systems telling us something is wrong and giving enough information to act.

MELT: Metrics, Events, Logs, Traces

| Pillar | Purpose | Example |
| --- | --- | --- |
| Metrics | Trends, thresholds | CPU usage, error rates 📈 |
| Events | Discrete signals | Deploy, config change ⚙️ |
| Logs | Granular detail, forensics | Error logs, audit trails 📜 |
| Traces | Connect dots across services | Request tracing ➡️ |

💡 Tip: Mature systems integrate all four, but balance coverage with cost. 💰
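
For a feel of how little code each pillar requires, here is a stdlib-only sketch that emits all four signals for one failing request; a real system would use an SDK (e.g., OpenTelemetry) rather than this hand-rolled plumbing.

```python
# Emitting the four MELT pillars for one request, stdlib only.
import json, logging, time, uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("checkout")

error_count = 0              # Metric: a counter a scraper would collect
trace_id = uuid.uuid4().hex  # Trace: correlates work across services

def handle_request() -> None:
    global error_count
    start = time.monotonic()
    log.info(json.dumps({"msg": "checkout started",      # Log: searchable detail
                         "trace_id": trace_id}))
    try:
        raise TimeoutError("payment gateway timed out")   # simulated fault
    except TimeoutError as exc:
        error_count += 1                                   # Metric update on failure
        log.info(json.dumps({"event": "payment_timeout",   # Event: discrete signal
                             "trace_id": trace_id,
                             "error": str(exc),
                             "duration_ms": round((time.monotonic() - start) * 1000)}))

handle_request()
print("error_count =", error_count)
```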

The Service Catalog: Your Operational Map

A robust service catalog is indispensable:

What a Good Catalog Contains:

Runbooks, Dashboards, Dashbooks

Checklists: Structure docs as clear checklists to reduce errors and ensure critical steps aren’t missed.

Auto-remediation: Guardrails & Pitfalls 🤖

Automation can act faster than humans, but speed without context is dangerous.

Platform Engineering Connection 🏗️

Modern platform teams embed observability, runbooks, and automation into the dev workflow, making reliability everyone’s responsibility.

🔑 Key Takeaway: Modern incident readiness requires integrated systems, current docs, practiced chaos, thoughtful automation, and platform-embedded reliability practices.

5. Alerting Without the Noise 🔕

The best alert is the one that matters. The rest are distractions—expensive ones.

SLO-Based Alerting and Signal Quality

SLOs are contracts between system reliability and user expectations. Good alerts are rooted in these contracts.
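
As an illustration, a burn-rate check against a 99.9% availability SLO might look like the sketch below; the window sizes and the 14.4x threshold are common starting points, not rules, and should be tuned to your own SLOs.

```python
# Multi-window burn-rate alerting sketch for a 99.9% availability SLO.
SLO_TARGET = 0.999
ERROR_BUDGET = 1 - SLO_TARGET  # 0.1% of requests may fail

def burn_rate(error_ratio: float) -> float:
    """How fast the error budget is being consumed (1.0 = exactly on budget)."""
    return error_ratio / ERROR_BUDGET

def should_page(error_ratio_1h: float, error_ratio_5m: float) -> bool:
    # Page only when both the long and short windows are burning fast,
    # so a brief blip that has already recovered does not wake anyone.
    return burn_rate(error_ratio_1h) > 14.4 and burn_rate(error_ratio_5m) > 14.4

print(should_page(error_ratio_1h=0.002, error_ratio_5m=0.0001))  # False: recovering
print(should_page(error_ratio_1h=0.02, error_ratio_5m=0.03))     # True: page now
```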

Alert Routing, Deduplication, and Suppression

Alert Fatigue, False Positives, and Pager Hell 📟🔥

High false positives erode trust. Key metrics:

AI/ML for Detection: Promise vs. Reality 🤖

AI-driven anomaly detection can flood inboxes with irrelevant noise. Use ML to augment, not automate, human judgment.

🔑 Key Takeaway: Alerting without noise is about discipline. Earn the right to wake someone up—not with volume, but with relevance.

6. Training, Simulation & Team Maturity 🏋️‍♀️

Chaos Engineering as Ongoing Readiness

Practice, don’t just plan.

Chaos engineering deliberately introduces failure to test resilience.

Practical Chaos Engineering: Building Muscle

Start with safe, controlled experiments in staging/dev environments.

Example: Simulating API Node Failure
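
The walkthrough itself isn’t reproduced above, so here is a minimal sketch of the experiment shape: verify steady state, inject the failure, observe, and always restore. The kill/restore helpers are stand-ins for your own infrastructure tooling.

```python
# Chaos experiment skeleton: hypothesis first, blast radius small,
# abort and clean up if the steady state does not hold.
import random
import time

def steady_state_ok() -> bool:
    """Stand-in health probe; in practice, query your SLO dashboards."""
    return random.random() > 0.05  # ~95% of probes report healthy

def kill_api_node(node: str) -> None:
    print(f"[chaos] terminating {node} (staging only!)")

def restore(node: str) -> None:
    print(f"[chaos] restoring {node}")

def run_experiment(node: str = "api-staging-3", checks: int = 5) -> None:
    assert steady_state_ok(), "System unhealthy before experiment; aborting."
    kill_api_node(node)
    try:
        for _ in range(checks):
            time.sleep(1)  # observation window
            if not steady_state_ok():
                print("Hypothesis failed: traffic did not reroute cleanly.")
                return
        print("Hypothesis held: remaining nodes absorbed the load.")
    finally:
        restore(node)  # always clean up, pass or fail

run_experiment()
```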

Chaos Maturity Levels:

| Level | Description |
| --- | --- |
| Level 1 | Reactive: Terminate instances, kill processes |
| Level 2 | Proactive: Schedule experiments |
| Level 3 | Integrated: Chaos in CI/CD, automate faults |
| Level 4 | Adaptive: System adjusts based on live feedback |

🔑 Key Takeaway: You can’t control when the next incident hits—but you can train your team to meet it with confidence. Chaos engineering and simulation aren’t optional; they’re how you transform individual skill into organizational readiness.

PART II: During the Incident 🔥

7. Triggers & Assembly 🚦

Every alert begins with a signal. The difference between chaos and coordination starts at that moment.

Who Triages the Alert

Alert Payload: From Noise to Signal

The best alerts are:

Checklist Example: ✅

Alert: High CPU on API Server

Checklist:
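
The checklist entries aren’t included above; as an illustration only (hypothetical thresholds and placeholder links), an alert payload that carries its own triage checklist might look like this:

```python
# Hypothetical alert payload: enough context to act, plus the triage
# checklist, so the responder never starts from a blank page.
ALERT = {
    "title": "High CPU on API Server",
    "service": "checkout-api",
    "severity_hint": "SEV3",
    "dashboard": "https://example.internal/d/checkout-api",   # placeholder URL
    "runbook": "https://example.internal/runbooks/high-cpu",  # placeholder URL
    "checklist": [
        "Confirm CPU > 90% for more than 5 minutes (not a scrape blip)",
        "Check for a deploy or config change in the last 30 minutes",
        "Compare traffic volume with the same hour last week",
        "If saturation is real, scale out or roll back; escalate to SEV2 if errors climb",
    ],
}

for step in ALERT["checklist"]:
    print(f"[ ] {step}")
```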

From Triage to Declaration

The transition between “alert received” and “incident declared” should be explicit and documented.

Standardized Intake Questions: ❓

Compliance and Business Risk

Not every incident requires immediate action. Sometimes the business accepts risk—document the risk, monitoring, and who made the call.

Access Controls and Break-Glass Scenarios 🚨

🔑 Key Takeaway: The first few minutes are where clarity and chaos compete. Triage is about signal discernment, role clarity, and high-quality intake.

8. Incident Command in Practice 🧑‍✈️

The Incident Commander (IC) is the single person responsible for the overall incident response. This is a temporary, highly focused role.

The Role of the Incident Commander

The IC is like the conductor of an orchestra—they don’t play every instrument, but they ensure everyone is playing in harmony.

IC Responsibilities:

🙅‍♀️ What an IC is NOT: The IC is not the person who fixes the problem. They are the person who ensures the problem gets fixed. Resist the urge to dive into debugging!

Incident Roles and Responsibilities

Effective incident response relies on clear roles:

Important: In many organizations, one person may wear multiple hats initially, but the mindset of these distinct roles is crucial.

The Incident Lifecycle: From Active to Resolved

  1. Detection & Declaration: Alert fires, IC declared. 🚦
  2. Triage & Assessment: What’s the impact? What’s the severity? 🤔
  3. Investigation: Deep dive into root cause. 🔬
  4. Mitigation: Reduce or stop impact (e.g., rollback, disable feature). 🛡️
  5. Resolution: Full fix applied, service restored. ✅
  6. Recovery: Bring systems back to full health. 💚
  7. Post-Incident Analysis: Learn from the incident. 📝

Decision-Making Under Pressure: OODA Loop

The OODA Loop (Observe, Orient, Decide, Act) is a powerful model for rapid decision-making:

  1. Observe: Gather information (metrics, logs, reports). 🧐
  2. Orient: Analyze the situation, put it in context (mental models, past incidents). 🧠
  3. Decide: Choose a course of action (mitigate, investigate further). ✅
  4. Act: Implement the decision. 🚀

Then, the loop repeats, constantly adapting to new information. This iterative process is vital in chaotic environments.

🔑 Key Takeaway: Strong incident command isn’t about individual heroics; it’s about structured leadership, clear roles, and rapid, iterative decision-making to tame the chaos.

9. Communication Under Pressure 🗣️

During an incident, clear communication is paramount. Misinformation or lack of information fuels panic and slows resolution.

Internal Communication: Keeping the Team Aligned

External Communication: Managing Stakeholder Expectations

🚨 Crisis Communication Tip: When communicating externally, always err on the side of transparency. Acknowledge impact, provide updates frequently, and communicate when you don’t have an update (e.g., “Still investigating, next update in 15 minutes”).

Communication Tools & Workflows

🔑 Key Takeaway: Effective incident communication is structured, timely, and audience-aware. It builds trust, reduces noise, and ensures everyone stays aligned towards resolution.

10. Managing People, Pace & Burnout 🧘

Incidents are sprints, not marathons. Sustained high-pressure work leads to burnout and errors.

Recognizing and Mitigating Fatigue

Avoiding Cognitive Overload

Psychological Safety During the Incident

IC Self-Care & Handover

The IC role is incredibly demanding. Self-care is crucial.

🔑 Key Takeaway: Managing an incident isn’t just about systems; it’s about managing humans under stress. Prioritize wellbeing, prevent overload, and foster psychological safety to ensure a sustained, effective response.

PART III: After the Incident 📝

11. Declaring the End & Recovery 🏁

The incident isn’t truly over until services are fully restored, systems are stable, and the learning process begins.

Criteria for Incident Resolution

Resolution is not just “it’s working now.” It requires:

The Role of the Incident Commander in Closure

The IC is responsible for officially declaring the end of the active incident. This involves:

Recovery Steps & Checklist

Recovery means bringing systems back to their pre-incident state, and often better.

Recovery Checklist:

The “All Clear” Signal 🚥

A clear, unambiguous “all clear” signal helps shift the team’s focus from crisis to recovery and learning. This could be a message in the incident channel:

🔑 Key Takeaway: A clear and deliberate closure process ensures true resolution, prevents “phantom incidents,” and smoothly transitions the team to the critical learning phase.

12. Postmortems That Don’t Suck ✨

The postmortem (or post-incident review) is the most critical learning opportunity. A “good” postmortem isn’t about assigning blame; it’s about understanding and improving.

Blameless Postmortems: The Foundation of Learning ❤️

A blameless culture is non-negotiable for effective postmortems.

Structure of a Modern Postmortem

A robust postmortem document typically includes:

  1. Summary: High-level overview of the incident, impact, and resolution.
  2. Timeline: Detailed chronological log of events, including detection, actions taken, and key decisions. ⏳
  3. Impact: Comprehensive description of business and customer impact. 💸
  4. Root Cause(s): The underlying systemic factors that led to the incident. (Often multiple contributing factors). 🌳
  5. Detection: How was the incident detected? Was it timely? 🚨
  6. Mitigation: How was the impact reduced or stopped?
  7. Resolution: How was the service fully restored?
  8. Lessons Learned: What did we learn about our systems, processes, and people? 🎓
  9. Action Items: Concrete, measurable tasks assigned to specific owners with due dates. These are the outputs of the postmortem. ✅
  10. Preventative Measures: What changes will prevent recurrence or reduce future impact? 🛡️
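
Teams that turn this structure into a shared template get more consistent reviews. A minimal sketch of that template as data (field names follow the list above; everything else is an assumption):

```python
# Postmortem skeleton as data: every review captures the same fields,
# and action items can be exported straight into the backlog.
from dataclasses import dataclass, field

@dataclass
class ActionItem:
    description: str
    owner: str
    due_date: str  # ISO date, e.g. "2025-07-15"
    done: bool = False

@dataclass
class Postmortem:
    summary: str
    timeline: list[str]      # e.g. "14:02 UTC - error rate crosses 5%"
    impact: str
    root_causes: list[str]   # usually several contributing factors
    detection: str
    mitigation: str
    resolution: str
    lessons_learned: list[str]
    action_items: list[ActionItem] = field(default_factory=list)
    preventative_measures: list[str] = field(default_factory=list)
```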

Facilitating the Postmortem Meeting

🔑 Key Takeaway: A blameless postmortem is a gift to your organization. It transforms errors into opportunities for systemic improvement, fostering a culture of continuous learning and resilience.

13. From Lessons to Systems Change 🔄

A retrospective without action is just a history lesson. The real value comes from turning insights into tangible improvements.

The Action Item Lifecycle

Action items must be treated with the same rigor as product features.

  1. Creation: Clear, specific, measurable, assigned, time-bound (SMART).
  2. Prioritization: Integrated into existing backlog processes (e.g., JIRA, Asana). Prioritized alongside other development work. 📈
  3. Tracking: Regularly reviewed and updated.
  4. Completion: Verified and closed. ✅
  5. Verification: Confirm the change had the intended effect.
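
One way to keep that lifecycle honest is to model it explicitly, so an item cannot be considered finished until the change is verified. A sketch with assumed stage names:

```python
# Action-item lifecycle as an explicit sequence of stages: completion
# is not the finish line, verification is.
from enum import Enum

class Stage(Enum):
    CREATED = 1
    PRIORITIZED = 2
    IN_PROGRESS = 3
    COMPLETED = 4
    VERIFIED = 5  # confirmed the change had the intended effect

def advance(item: dict) -> None:
    """Move an item to the next stage; stops at VERIFIED."""
    if item["stage"] is not Stage.VERIFIED:
        item["stage"] = Stage(item["stage"].value + 1)

item = {
    "title": "Add circuit breaker to payment gateway calls",
    "owner": "team-checkout",
    "due": "2025-08-01",
    "stage": Stage.CREATED,
}

while item["stage"] is not Stage.VERIFIED:
    advance(item)
    print(item["stage"].name)
```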

Prioritizing Reliability Work

This is often the hardest part. Reliability work (from postmortems) competes with new feature development.

The Feedback Loop: How Incidents Inform Product & Engineering

Championing Systemic Change

🔑 Key Takeaway: The true measure of an effective incident management program is its ability to drive concrete, systemic change. Turn lessons learned into prioritized, actionable work that continuously improves reliability.

14. Measuring What Matters 📊

You can’t improve what you don’t measure. Metrics provide insights into the health of your incident response process and system reliability.

Key Incident Metrics

The Danger of Vanity Metrics

Building Incident Dashboards & Reports

Continuous Improvement Loop

Measuring is part of a continuous loop:

  1. Define Metrics: What do you want to improve?
  2. Collect Data: Implement logging and tooling.
  3. Analyze & Visualize: Understand trends and outliers.
  4. Identify Areas for Improvement: Where are the bottlenecks?
  5. Implement Changes: Prioritize and execute action items.
  6. Measure Again: Did the changes have the desired effect?
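
A minimal sketch of steps 2 and 3 for the time-based incident metrics (field names and the sample records are assumptions; most incident platforms export equivalents):

```python
# Computing MTTD / MTTA / MTTR from exported incident timestamps.
from datetime import datetime
from statistics import mean

FMT = "%Y-%m-%d %H:%M"

def minutes(start: str, end: str) -> float:
    return (datetime.strptime(end, FMT) - datetime.strptime(start, FMT)).total_seconds() / 60

incidents = [  # illustrative records, not real data
    {"started": "2025-06-01 10:00", "detected": "2025-06-01 10:04",
     "acknowledged": "2025-06-01 10:07", "resolved": "2025-06-01 11:30"},
    {"started": "2025-06-12 22:15", "detected": "2025-06-12 22:16",
     "acknowledged": "2025-06-12 22:25", "resolved": "2025-06-12 23:00"},
]

mttd = mean(minutes(i["started"], i["detected"]) for i in incidents)
mtta = mean(minutes(i["detected"], i["acknowledged"]) for i in incidents)
mttr = mean(minutes(i["started"], i["resolved"]) for i in incidents)
print(f"MTTD {mttd:.1f} min | MTTA {mtta:.1f} min | MTTR {mttr:.1f} min")
```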

🔑 Key Takeaway: Strategic metrics provide the evidence needed to understand your current state, justify investment in reliability, and demonstrate the impact of your incident management program. Choose metrics that drive actionable insights, not just numbers.

15. The Future State of Incident Command 🔮

Incident management is a constantly evolving discipline. What’s next?

AI/ML in Incident Response: Beyond Anomaly Detection

🤖 Reality Check: AI won’t replace human ICs soon. It will augment their capabilities, offloading cognitive burden and speeding up information processing. Human judgment, empathy, and creative problem-solving remain essential.

Proactive Incident Management & Resilience Engineering

Distributed & Federated Incident Command

As systems become more distributed, so too will incident response.

Human-Centered Design for On-Call & Tooling

🔑 Key Takeaway: The future of incident command is about continuous human-computer collaboration, reliability deeply integrated into every stage of the software lifecycle, and a relentless focus on the well-being and adaptive capacity of the people on the front lines.

Conclusion

The journey of mastering incident command is continuous. It’s a blend of technical expertise, human psychology, and organizational culture. You’ve learned about:

The next time an alert fires, you’ll be better equipped. Not just with tools, but with a mindset, a framework, and the confidence to lead. The art of incident command is about transforming chaos into learning, and ultimately, building more resilient systems and teams.

The Journey Continues: Further Learning and Resources 🚀

Keep learning. Keep practicing. Keep building resilient systems and, more importantly, resilient people. Your users—and your on-call teams—will thank you. 🙏