Review & Cheat Sheet
Congratulations! You have mastered the “Day 2” operations that keep systems alive. Before moving to the Final Assessment, let’s review.
1. Key Takeaways
- Observability is Not Monitoring: Monitoring tells you something is broken. Observability gives you the data to ask why it’s broken (Logs, Metrics, Traces).
- Beware Cardinality: Tagging metrics with unbounded values like user_id causes Cardinality Explosion, crashing your time series databases (Prometheus).
- Fail Fast with Circuit Breakers: Don’t let a slow dependency drag down your entire service. Open the circuit, shed load, and recover gracefully.
- Retries Require Jitter: Naive retries create Thundering Herds. Always use Exponential Backoff and Jitter to spread out the recovery load.
- Zero Trust Architecture: Inside your VPC, trust nothing. Use mTLS for service-to-service authentication, and OAuth 2.0 (Valet Keys) for authorization.
- Deploy ≠ Release: Use Deployment Strategies (Blue/Green, Canary) and Feature Flags to separate the act of deploying code from exposing it to users.
- Infrastructure as Code: GitOps ensures your infrastructure matches the state defined in your Git repository, providing a clear audit trail and easy rollbacks.
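To make the cardinality takeaway concrete, here is a rough back-of-the-envelope sketch. The label names and counts are illustrative assumptions, not measurements from any real system. The idea it shows is that the worst-case number of time series for one metric is the product of the number of distinct values each label can take.

```python
from math import prod


def max_series(label_cardinalities: dict[str, int]) -> int:
    """Worst-case series count for one metric: the product of the number
    of distinct values each label can take."""
    return prod(label_cardinalities.values())


# Hypothetical labels on a single request-latency metric.
bounded = {"method": 5, "status_code": 10, "region": 4}

# Same metric, with an unbounded user_id label added (assumed 1M users).
unbounded = {**bounded, "user_id": 1_000_000}

print(max_series(bounded))    # 200
print(max_series(unbounded))  # 200000000
```

One extra unbounded label multiplies the series count by the number of users, which is why identifiers like user_id belong in logs or traces rather than metric labels.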
2. War Story: The 3 AM Retry Storm
In a well-documented outage, a major ride-sharing app experienced a minor network blip that disconnected thousands of mobile clients. When the network recovered seconds later, every single app attempted to reconnect simultaneously without any delay.
This massive spike in requests overwhelmed their API gateways, causing them to time out and drop connections. The apps interpreted the timeouts as failures and immediately retried again. The system had entered a Thundering Herd death spiral, effectively DDoS-ing itself. The resolution required engineers to completely shut off traffic at the load balancer and slowly bleed it back in, all because the client retry logic lacked Exponential Backoff and Jitter. This is why we always spread out recovery loads.
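The client-side fix described in this story can be sketched as follows. This is a minimal illustration, not the retry logic of any particular app: the function names, the default base and cap values, and the "full jitter" variant (a random delay between zero and the exponential ceiling) are all assumptions chosen for the example.

```python
import random
import time


def jittered_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Full jitter: pick a random delay in [0, min(cap, base * 2**attempt)]."""
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0, ceiling)


def call_with_retries(operation, is_transient, max_attempts=5,
                      base=0.5, cap=30.0, sleep=time.sleep):
    """Call operation(); retry transient failures with backoff and jitter.

    Non-transient errors, and the error from the final attempt, are re-raised.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception as exc:
            if not is_transient(exc) or attempt == max_attempts - 1:
                raise
            # The ceiling doubles each attempt; the random draw spreads
            # clients out so they do not all reconnect at the same instant.
            sleep(jittered_delay(attempt, base, cap))
```

A caller might wrap a reconnect like `call_with_retries(connect, lambda e: isinstance(e, TimeoutError))`, where `connect` is a hypothetical function. Because each client draws its own random delay, thousands of clients recovering from the same blip arrive spread over a window instead of in a single spike.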
4. Interactive Scenario: The Panic Button
It’s 3 AM. You are on-call. The system is down. What do you do?
PAGERDUTY ALERT
"High Latency Detected on Payment Service (p99 > 5s)"
5. System Design Cheat Sheet
| Category | Concept | Key Takeaway |
|---|---|---|
| Observability | Logs | Structured (JSON) for querying specific events. |
| Observability | Metrics | Aggregates for trends. Watch out for Cardinality Explosion (no user_id). |
| Observability | Tracing | Follow request across microservices. Use Sampling (Head/Tail). |
| Observability | OTel | Vendor-neutral standard. Use SDK + Collector. |
| Reliability | Circuit Breaker | Stop cascading failures. States: Closed, Open, Half-Open. |
| Reliability | Retry | Only for transient errors. Always use Exponential Backoff + Jitter. |
| Reliability | Idempotency | Ensure f(f(x)) = f(x). Use Idempotency-Key header. |
| Security | TLS 1.3 | Encrypts transit. 1-RTT handshake. Forward Secrecy. |
| Security | OAuth 2.0 | Authorization (Valet Key). Flows: Auth Code (User), Client Creds (Service). |
| Security | mTLS | Mutual TLS. Zero Trust for service-to-service calls. |
| Security | JWT | Stateless token. Header.Payload.Signature. |
| Deployment | Rolling | Low cost, K8s default. Slow rollback. |
| Deployment | Blue/Green | Safe, instant rollback, 2x cost. |
| Deployment | Canary | Test in production with real users (1% → 100%). Lowest risk. |
| Deployment | GitOps | Infrastructure as Code + Automated Sync (ArgoCD). Pull Model. |
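The Idempotency row can be illustrated with a small sketch. Everything here is hypothetical: the `PaymentService` class, its `charge` method, and the in-memory dictionary stand in for a real handler and a durable store, and the code ignores concurrency. It only shows the core idea that a repeated request with the same Idempotency-Key returns the stored response instead of repeating the side effect.

```python
class PaymentService:
    """Toy in-memory handler; a real service would use a durable store."""

    def __init__(self):
        self._responses = {}  # idempotency key -> stored response
        self.charges = []     # side effects actually performed

    def charge(self, idempotency_key: str, amount_cents: int) -> dict:
        # Replay: return the stored response without charging again.
        if idempotency_key in self._responses:
            return self._responses[idempotency_key]

        # First time this key is seen: perform the side effect once.
        self.charges.append(amount_cents)
        response = {"charge_id": len(self.charges), "amount_cents": amount_cents}
        self._responses[idempotency_key] = response
        return response


service = PaymentService()
first = service.charge("key-123", 500)
retry = service.charge("key-123", 500)  # e.g. the client timed out and retried
print(first == retry, len(service.charges))  # True 1
```

This is what makes retries safe for non-read operations: the client can resend the same request after a timeout without risking a double charge.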
6. Quick Revision
- Logs: Point-in-time events (structured JSON). Best for deep debugging.
- Metrics: Aggregated numerical data over time. Best for alerting and dashboards. Low cardinality is key.
- Traces: Tracks a single request as it traverses multiple services (Context Propagation). Vital for latency analysis.
- Circuit Breaker: Stops requests to failing services. Closed → Open → Half-Open.
- Bulkhead Pattern: Isolates resources (like thread pools) to prevent a failure in one area from affecting others.
- TLS 1.3: Faster (1-RTT) and more secure (Forward Secrecy).
- OAuth 2.0 vs OIDC: OAuth is for Authorization (Delegated access). OIDC is for Authentication (Identity).
- Canary Deployment: The safest deployment strategy. Roll out to a small percentage, verify, then expand.
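The Closed → Open → Half-Open cycle from the Circuit Breaker item can be sketched as a small state machine. This is a single-threaded illustration under assumed defaults (three failures to open, a 30-second reset timeout); the class and exception names are made up for the example and do not come from any specific library.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, operation):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("failing fast")
            self.state = "half_open"  # timeout elapsed: allow one trial call

        try:
            result = operation()
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, opens the circuit.
            if self.state == "half_open" or self.failures >= self.failure_threshold:
                self.state = "open"
                self.opened_at = self.clock()
            raise

        # Success closes the circuit and clears the failure count.
        self.state = "closed"
        self.failures = 0
        return result
```

While the circuit is open, callers get an immediate `CircuitOpenError` instead of waiting on a slow dependency, which is the "fail fast" behavior that stops one failing service from tying up resources upstream.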
7. Glossary Link
Review all the terms mentioned in this module: System Design Glossary
8. What’s Next?
You have completed the core technical modules! You are now ready for the Final Boss.
The next module is Module 18: Final Assessment, where we will simulate a real System Design Interview with a full Mock Scenario.