Module Review: Operations

[!NOTE] This review recaps the core principles of the Operations module, deriving solutions from first principles and hardware constraints to build production-ready expertise.

1. Key Takeaways

  1. Reliability is a Feature: It must be prioritized like any other feature. SLOs and Error Budgets are the tools to negotiate this prioritization with Product.
  2. Blameless Culture: You cannot fix a system if people are afraid to admit mistakes. Post-Mortems focus on process, not people.
  3. Command & Control: During an outage, democracy is suspended. The Incident Commander leads; everyone else follows.
  4. Mitigate ≠ Resolve: During an incident, your first goal is to stop the bleeding (Mitigate), not to find the perfect fix (Resolve).
  5. Paved Roads: Don’t force standards; make the right way the easiest way using Golden Paths.

2. Interactive Flashcards

Test your knowledge of Operational Excellence.

What is the formula for Error Budget?

100% - SLO
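
The formula can be turned into a quick calculation. The sketch below is illustrative only: the function names and the request-based accounting (counting failed requests against a window of total requests) are assumptions, not something this module prescribes.

```python
import math


def error_budget(slo_percent: float) -> float:
    """Error budget as a percentage: 100% - SLO."""
    if not 0 < slo_percent < 100:
        raise ValueError("SLO must be strictly between 0 and 100")
    return 100.0 - slo_percent


def allowed_failures(slo_percent: float, total_requests: int) -> int:
    """Requests that may fail in the window without breaching the SLO."""
    raw = total_requests * error_budget(slo_percent) / 100.0
    # Round first to absorb floating-point noise, then floor to whole requests.
    return math.floor(round(raw, 6))


def budget_remaining(slo_percent: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the budget still unspent; a negative value means the SLO is breached."""
    allowed = total_requests * error_budget(slo_percent) / 100.0
    return 1.0 - failed_requests / allowed


if __name__ == "__main__":
    # Hypothetical window: 1,000,000 requests against a 99.9% SLO.
    print(allowed_failures(99.9, 1_000_000))
    print(round(budget_remaining(99.9, 1_000_000, 250), 4))
```

A remaining fraction near zero is the signal to slow feature work and spend time on reliability, which is the negotiation described in the first takeaway.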

3. Cheat Sheet

The Nines Table

Availability | Downtime per Year | Downtime per Month | Typical Use Case
99%          | 3.65 days         | 7.31 hours         | Batch jobs, non-critical internal tools
99.5%        | 1.83 days         | 3.65 hours         | Standard e-commerce, user dashboards
99.9%        | 8.76 hours        | 43.8 minutes       | Industry Standard for SaaS
99.99%       | 52.6 minutes      | 4.38 minutes       | Core Banking, Payments, Auth
99.999%      | 5.26 minutes      | 26.3 seconds       | Telco, Pacemakers, AWS S3
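
The downtime figures in the table follow directly from the availability target. The sketch below shows one way to derive them; it assumes a 365-day year and an average month of one twelfth of a year, so results may differ from the table in the last digit. The names `YEAR`, `MONTH`, and `allowed_downtime` are illustrative.

```python
from datetime import timedelta

YEAR = timedelta(days=365)  # assumption: non-leap year
MONTH = YEAR / 12           # assumption: average month length


def allowed_downtime(availability_percent: float, period: timedelta) -> timedelta:
    """Downtime permitted within `period` at the given availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return period * ((100.0 - availability_percent) / 100.0)


# Print the yearly and monthly allowance for each row of the table.
for target in (99.0, 99.5, 99.9, 99.99, 99.999):
    per_year = allowed_downtime(target, YEAR)
    per_month = allowed_downtime(target, MONTH)
    print(
        f"{target}%: {per_year.total_seconds() / 3600:.2f} h/year, "
        f"{per_month.total_seconds() / 60:.2f} min/month"
    )
```

Each extra nine divides the allowance by ten, which is why the jump from 99.9% to 99.99% moves the monthly budget from most of an hour to a few minutes.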

Incident Roles

  • Incident Commander (IC): Leader. Decision maker.
  • Operations Lead: Doer. Executes commands.
  • Scribe: Recorder. Timeline keeper.
  • Comms Lead: Speaker. Updates stakeholders.

SEV Levels

  • SEV-1: Critical. Site down. All hands on deck.
  • SEV-2: High. Major feature broken. Fix ASAP.
  • SEV-3: Medium. Minor bug. Fix in business hours.
  • SEV-4: Low. Cosmetic. Backlog.
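
The SEV levels above can be encoded as a small lookup so that tooling and humans share one definition. This is a minimal sketch: the `Severity` enum, the `RESPONSE` mapping, and the rule that SEV-1 and SEV-2 need an immediate response are assumptions drawn from the wording of the list, not a standard.

```python
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1  # Critical: site down
    SEV2 = 2  # High: major feature broken
    SEV3 = 3  # Medium: minor bug
    SEV4 = 4  # Low: cosmetic


# Expected response per level, taken from the list above.
RESPONSE = {
    Severity.SEV1: "All hands on deck",
    Severity.SEV2: "Fix ASAP",
    Severity.SEV3: "Fix in business hours",
    Severity.SEV4: "Backlog",
}


def needs_immediate_response(sev: Severity) -> bool:
    """Assumed policy: SEV-1 and SEV-2 are handled now; lower levels can wait."""
    return sev <= Severity.SEV2


if __name__ == "__main__":
    for sev in Severity:
        print(sev.name, RESPONSE[sev], needs_immediate_response(sev))
```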

4. Further Reading

Staff Prep Glossary