Metrics, Logs & Tracing
[!TIP] The Senior Engineer’s Mantra: “You cannot fix what you cannot see.”
In a Monolith, debugging is easy: `tail -f /var/log/syslog`.
In Microservices, a single user request hits 50 different services. If one fails, where do you look?
Observability is not just “monitoring”. Monitoring tells you when something is wrong (“CPU is 99%”). Observability allows you to ask why it is wrong (“Which user caused the CPU spike?”).
This chapter covers the Three Pillars of Observability and the most critical concept for scaling them: Cardinality.
1. The Three Pillars
Logs vs. Metrics vs. Traces

| Pillar | Question Answered | Characteristics |
|---|---|---|
| Logs | "Something happened at T=0." | High volume. Expensive to store. Good for debugging specific errors. |
| Metrics | "What is the trend?" | Low volume. Cheap to store. Good for alerting (dashboards). |
| Traces | "Where did the time go?" | Links logs across services. Critical for latency optimization. |
1.1 Logs: Structured vs Unstructured
- Unstructured: `2023-10-01 10:00:00 Error connecting to DB`. Hard to query.
- Structured (JSON): `{"time": "...", "level": "error", "component": "db", "error": "timeout"}`. Easy to query (`component="db"`).
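The difference can be sketched with Python's stdlib `logging` and a small JSON formatter (the field names mirror the example above; the formatter class is illustrative, not a standard library feature):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record):
        return json.dumps({
            "time": self.formatTime(record),
            "level": record.levelname.lower(),
            "component": getattr(record, "component", "app"),
            "message": record.getMessage(),
        })

logger = logging.getLogger("demo")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)

# Structured: every field is a queryable key, not a substring.
logger.error("Error connecting to DB", extra={"component": "db"})
```

Because each field is a key rather than a substring, a log backend can index and filter on `component="db"` directly.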
[!TIP] Use Exemplars: Modern metrics systems (Prometheus + OpenTelemetry) allow you to link a specific Metric bucket to a Trace ID.
- “Show me the p99 latency bucket.” → Click → “Here is a Trace ID `abc-123` that took 5 seconds.”
- This is the “Holy Grail” that connects Metrics to Traces.
2. Metrics & The Cardinality Trap
[!CAUTION] The #1 Mistake: Tagging metrics with High Cardinality data like `user_id`, `trace_id`, or `url`.
Metrics are stored in Time Series Databases (TSDB) like Prometheus. A metric is identified by its name and its labels (tags). Each unique combination of labels creates a new Time Series.
- Low Cardinality: `method="GET"`. Only ≈5 possibilities (GET, POST, PUT…). Safe.
- High Cardinality: `user_id="u_123"`. 1 million users = 1 million time series. Disaster.
When you have too many time series, the TSDB exhausts its RAM indexing them and crashes. This is called a Cardinality Explosion.
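The arithmetic behind the explosion is simple multiplication: the number of time series is the product of each label's cardinality. A quick back-of-the-envelope helper (pure Python; the label sets and counts are illustrative):

```python
import math

def series_count(label_cardinalities: dict) -> int:
    """Each unique combination of label values creates one time series."""
    return math.prod(label_cardinalities.values())

# Safe: method x status x endpoint
safe = series_count({"method": 5, "status": 10, "endpoint": 50})

# Disaster: adding user_id multiplies everything by the user count
unsafe = series_count({"method": 5, "status": 10, "endpoint": 50,
                       "user_id": 1_000_000})

print(safe)    # 2,500 series: trivial for any TSDB
print(unsafe)  # 2,500,000,000 series: cardinality explosion
```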
[!NOTE] War Story: The $50,000 Metrics Bill
A fast-growing social app decided to track API latency. They tagged the metric `http_request_duration_seconds` with the `user_id` to see if specific users had slow experiences. Everything was fine until a marketing push brought in 500,000 new users in a weekend. The Time Series Database (Prometheus) had to create 500,000 new time series. The RAM usage spiked, crashing their monitoring system during their biggest launch. When they switched to a managed observability vendor as a quick fix, the cardinality explosion resulted in a surprise $50,000 bill at the end of the month. The Fix: They removed `user_id` from the metric tags and moved it to Distributed Tracing spans instead.
Interactive Visualizer: Cardinality Explosion
See what happens to your memory usage when you add high-cardinality tags.
[!TIP] Try it yourself: Click the buttons to change the tagging strategy and watch the RAM usage explode.
TSDB Memory Simulator
Tagging Strategy: Low Cardinality (Safe)
3. Distributed Tracing
Distributed Tracing solves the “Needle in a Haystack” problem.
[!NOTE] War Story: Needle in the Haystack
During Black Friday, a major e-commerce platform started seeing a 5% timeout rate on the Checkout endpoint. Logs showed random timeouts across Auth, Inventory, and Payment services. Without Distributed Tracing, engineers spent 6 hours manually matching log timestamps across 15 microservices. Once Distributed Tracing (Jaeger) was rolled out the next year, a similar incident was root-caused in 2 minutes: the visual flame graph clearly showed a 4-second delay in a secondary fraud-check service blocking the critical path.
3.1 Trace Context Propagation
How do we know that a log in the Payment Service belongs to the same request as a log in the API Gateway?
- Trace ID: A unique ID generated at the edge (Gateway), e.g., `x-trace-id: abc-123`.
- Span ID: Represents a single unit of work (e.g., “DB Query”).
- Context Propagation: Passing these IDs via HTTP headers (`traceparent` in the W3C standard) to every downstream service.
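A bare-bones sketch of propagation: mint IDs at the edge, then forward the trace ID (with a fresh span per hop) in the W3C `traceparent` header format. The helper names here are invented for illustration; real services would use an OTel SDK instead:

```python
import uuid

def start_trace() -> dict:
    """Called once at the edge (API Gateway): mint a new trace context."""
    return {
        "trace_id": uuid.uuid4().hex,        # 32 hex chars
        "span_id": uuid.uuid4().hex[:16],    # 16 hex chars
    }

def outgoing_headers(ctx: dict) -> dict:
    """Build headers for a downstream call: same trace, current span as parent."""
    # W3C trace context format: version-trace_id-parent_span_id-flags
    return {"traceparent": f"00-{ctx['trace_id']}-{ctx['span_id']}-01"}

ctx = start_trace()
headers = outgoing_headers(ctx)  # attach to every downstream HTTP request
print(headers["traceparent"])
```

Every downstream service parses `traceparent`, keeps the same `trace_id`, and substitutes its own span ID as the parent for the next hop, which is what lets the tracing backend stitch the hops into one flame graph.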
3.2 Sampling: Head vs Tail
Tracing every request is expensive (CPU & Storage). We need to Sample.
- Head-Based Sampling: The Gateway decides randomly (e.g., 1%) at the start of the request.
- Pros: Simple, low overhead.
- Cons: You might miss the one interesting error because it wasn’t sampled.
- Tail-Based Sampling: Collect all spans, wait for the request to finish, then decide. “Did it fail? If yes, keep it. If no, discard.”
- Pros: You keep 100% of errors.
- Cons: Requires buffering trace data in memory (Complex & Expensive).
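The trade-off between the two strategies can be simulated in a few lines (pure Python; the 1% error rate and sampling rate are illustrative):

```python
import random

def head_sample(requests, rate=0.01, seed=42):
    """Decide at the START of each request: keep ~rate of ALL traffic."""
    rng = random.Random(seed)
    return [r for r in requests if rng.random() < rate]

def tail_sample(requests, rate=0.01, seed=42):
    """Decide at the END: keep every error, plus ~rate of successes."""
    rng = random.Random(seed)
    return [r for r in requests if r["error"] or rng.random() < rate]

# 10,000 requests, 1% of which fail
requests = [{"id": i, "error": i % 100 == 0} for i in range(10_000)]

head_errors = sum(r["error"] for r in head_sample(requests))
tail_errors = sum(r["error"] for r in tail_sample(requests))
print(head_errors, "of 100 errors kept by head sampling")
print(tail_errors, "of 100 errors kept by tail sampling")
```

Head sampling keeps roughly 1% of the errors by chance; tail sampling keeps all 100, at the cost of buffering every span until the request outcome is known.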
Interactive Visualizer: Head vs Tail Sampling
Compare how many errors you catch with each strategy.
[!TIP] Try it yourself: Toggle between Head and Tail sampling. Notice how Tail sampling catches 100% of errors (Red dots) while Head sampling misses most of them.
Sampling Simulator
4. The “RED” Method
For every microservice, you should measure:
- Rate (Requests per second).
- Errors (Failed requests per second).
- Duration (Latency histograms).
Combine this with the USE Method (Utilization, Saturation, Errors) for infrastructure (CPU/Disk).
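Computing RED over a window of raw request records is straightforward; a minimal sketch (the record shape and window size are illustrative):

```python
def red_metrics(requests, window_seconds: float) -> dict:
    """Rate, Errors, Duration (p99) over one time window."""
    durations = sorted(r["duration_ms"] for r in requests)
    p99_index = max(0, int(len(durations) * 0.99) - 1)
    return {
        "rate_rps": len(requests) / window_seconds,          # R: requests/sec
        "error_rps": sum(r["error"] for r in requests) / window_seconds,  # E
        "p99_ms": durations[p99_index],                      # D: tail latency
    }

# 600 requests over a 60-second window, 4% errors, 20-69ms latencies
window = [{"duration_ms": 20 + i % 50, "error": i % 25 == 0} for i in range(600)]
print(red_metrics(window, window_seconds=60.0))
```

In production these numbers come pre-aggregated from a histogram metric rather than raw records, but the three questions they answer are the same.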
5. OpenTelemetry (OTel)
In the past, we had vendor lock-in (Datadog agents, New Relic agents). Now, we have OpenTelemetry. It provides a vendor-neutral SDK to generate logs, metrics, and traces.
- Auto-Instrumentation: OTel agents can automatically attach to your Java/Python/Node.js app and start sending data without you writing a single line of code.
- Collector: A sidecar or central service that receives OTel data, processes it (e.g., removes PII), and exports it to your backend (DataDog, Prometheus, Jaeger).
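The Collector's receive → process → export pipeline can be sketched conceptually in a few lines (pure Python; the real Collector is configured in YAML, and the function names and PII key list here are invented for illustration):

```python
def scrub_pii(span: dict) -> dict:
    """Processor stage: drop attributes that could identify a user."""
    pii_keys = {"user_id", "email", "ip_address"}  # illustrative list
    span["attributes"] = {k: v for k, v in span["attributes"].items()
                          if k not in pii_keys}
    return span

def export(spans, backends):
    """Export stage: fan the same processed data out to every backend."""
    for backend in backends:
        backend.extend(spans)

jaeger, vendor = [], []  # stand-ins for real exporter targets
incoming = [{"name": "checkout",
             "attributes": {"user_id": "u_123", "http.method": "POST"}}]
export([scrub_pii(s) for s in incoming], backends=[jaeger, vendor])
```

The key design point is that scrubbing happens once, centrally, before the data fans out, so no backend ever receives the PII.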
[!TIP] Interview Answer: “I would use OpenTelemetry for instrumentation to avoid vendor lock-in. I’d configure the OTel Collector to export metrics to Prometheus and traces to Jaeger.”
6. Summary
| Concept | Tool Example | Best For |
|---|---|---|
| Logs | ELK Stack (Elasticsearch), Splunk | Debugging specific crashes. “Why did this request fail?” |
| Metrics | Prometheus, Grafana | Alerting & Trends. “Is the site slower today?” |
| Traces | Jaeger, Zipkin, Tempo | Latency Analysis. “Which microservice is the bottleneck?” |