Observability: The Eyes of the System

[!TIP] The Senior Engineer’s Mantra: “You cannot fix what you cannot see.”

In a Monolith, debugging is easy: tail -f /var/log/syslog. In Microservices, a single user request hits 50 different services. If one fails, where do you look?

Observability is not just “monitoring”. Monitoring tells you when something is wrong (“CPU is 99%”). Observability allows you to ask why it is wrong (“Which user caused the CPU spike?”).

This chapter covers the Three Pillars of Observability and the most critical concept for scaling them: Cardinality.


1. The Three Pillars

Logs vs. Metrics vs. Traces

1. Logs (Events)

"Something happened at T=0."

{
"level": "ERROR",
"msg": "DB Connection Failed",
"user_id": "u_123"
}

High Volume. Expensive to store. Good for debugging specific errors.
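As a concrete illustration, here is a minimal sketch of emitting such a structured log line from Python's standard logging module. The log_event helper and the JSON-per-line format are illustrative, not a specific library's API:

```python
import json
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")

def log_event(level: int, msg: str, **fields) -> None:
    # Emit one JSON object per line so a log shipper (e.g. Filebeat) can parse it.
    payload = {"level": logging.getLevelName(level), "msg": msg, **fields}
    logging.log(level, json.dumps(payload))

log_event(logging.ERROR, "DB Connection Failed", user_id="u_123")
# Output: {"level": "ERROR", "msg": "DB Connection Failed", "user_id": "u_123"}
```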

2. Metrics (Aggregates)

"What is the trend?"

http_requests_total{status="500"} = 42

Low Volume. Cheap to store. Good for alerting and dashboards.

3. Traces (Context)

"Where did the time go?"

[Gateway] -> [Auth] -> [DB]
(20ms) -> (50ms) -> (TIMEOUT)

Links a single request's work (and its logs, via the trace ID) across services. Critical for latency optimization.


2. Metrics & The Cardinality Trap

[!CAUTION] The #1 Mistake: Tagging metrics with user_id, trace_id, or url.

Metrics are stored in Time Series Databases (TSDB) like Prometheus. A metric is identified by its name and its labels (tags). Each unique combination of labels creates a new Time Series.

  • Low Cardinality: method="GET". Only ~5 possibilities (GET, POST, PUT…). Safe.
  • High Cardinality: user_id="u_123". 1 Million users = 1 Million time series. Disaster.

When you have too many time series, the TSDB exhausts its RAM indexing them and crashes. This is called a Cardinality Explosion.
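A quick sketch of the difference using the Python prometheus_client library (metric and label names are illustrative):

```python
from prometheus_client import Counter

# Safe: bounded label values -> at most a few dozen time series.
HTTP_REQUESTS = Counter(
    "http_requests_total",
    "Total HTTP requests",
    ["method", "status"],   # a handful of methods x a few dozen status codes
)
HTTP_REQUESTS.labels(method="GET", status="500").inc()

# Dangerous: unbounded label values -> one time series per user.
# REQUESTS_BY_USER = Counter(
#     "http_requests_by_user_total",
#     "Requests per user",
#     ["user_id"],          # 1 million users -> 1 million time series
# )
```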

Interactive Visualizer: Cardinality Explosion

See what happens to your memory usage when you add high-cardinality tags.

[Interactive widget: TSDB Memory Simulator — compares tagging strategies, showing how the number of unique time series and the RAM needed to index them grow as you move from low-cardinality labels (safe, e.g. 5 series ≈ 50 MB) to high-cardinality ones.]

3. Distributed Tracing

Distributed Tracing solves the “Needle in a Haystack” problem.

3.1 Trace Context Propagation

How do we know that a log in the Payment Service belongs to the same request as a log in the API Gateway?

  1. Trace ID: A unique ID generated at the edge (Gateway). e.g., x-trace-id: abc-123.
  2. Span ID: Represents a single unit of work (e.g., “DB Query”).
  3. Context Propagation: Passing these IDs to every downstream service via HTTP headers (the traceparent header from the W3C Trace Context standard).
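A minimal sketch of propagation on the calling side, assuming the OpenTelemetry Python SDK is configured and using its propagate.inject helper to write the traceparent header; the payments URL and span name are made up:

```python
import requests
from opentelemetry import trace
from opentelemetry.propagate import inject

tracer = trace.get_tracer("gateway")

def call_payment_service(order_id: str):
    with tracer.start_as_current_span("charge-card"):
        headers: dict[str, str] = {}
        inject(headers)  # writes the W3C 'traceparent' header for the current span
        # The downstream service extracts the same Trace ID from this header,
        # so its spans and logs join the same trace.
        return requests.post(
            "http://payments/charge", json={"order": order_id}, headers=headers
        )
```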

3.2 Sampling: Head vs Tail

Tracing every request is expensive (CPU & Storage). We need to Sample.

  • Head-Based Sampling: The Gateway decides randomly (e.g., 1%) at the start of the request.
    • Pros: Simple, low overhead.
    • Cons: You might miss the one interesting error because it wasn’t sampled.
  • Tail-Based Sampling: Collect all spans, wait for the request to finish, then decide. “Did it fail? If yes, keep it. If no, discard.”
    • Pros: You keep 100% of errors.
    • Cons: Requires buffering trace data in memory (Complex & Expensive).
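Head-based sampling is typically configured in the SDK at startup; here is a sketch with the OpenTelemetry Python SDK, assuming a 1% ratio sampler:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Head-based sampling: decide up front, keep ~1% of traces.
# ParentBased honours the decision already made upstream (via traceparent),
# so each trace is either fully kept or fully dropped.
provider = TracerProvider(sampler=ParentBased(TraceIdRatioBased(0.01)))
trace.set_tracer_provider(provider)
```

Tail-based sampling, by contrast, is usually implemented outside the application, in a collector that buffers complete traces before deciding which ones to keep.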

3.3 The “RED” Method

For every microservice, you should measure:

  1. Rate (Requests per second).
  2. Errors (Failed requests per second).
  3. Duration (Latency histograms).

Combine this with the USE Method (Utilization, Saturation, Errors) for infrastructure (CPU/Disk).
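A sketch of RED instrumentation for one service using prometheus_client (the route label and handle wrapper are illustrative):

```python
import time
from prometheus_client import Counter, Histogram

REQUESTS = Counter("http_requests_total", "Requests by route/status", ["route", "status"])
LATENCY = Histogram("http_request_duration_seconds", "Request duration", ["route"])

def handle(route: str, handler):
    start = time.perf_counter()
    status = "200"
    try:
        return handler()
    except Exception:
        status = "500"
        raise
    finally:
        REQUESTS.labels(route=route, status=status).inc()                    # Rate + Errors
        LATENCY.labels(route=route).observe(time.perf_counter() - start)     # Duration
```

Rate and Errors then come from PromQL queries over the counter (e.g. rate(http_requests_total{status=~"5.."}[5m])), while Duration comes from the histogram buckets.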


4. OpenTelemetry (OTel)

In the past, we had vendor lock-in (Datadog agents, New Relic agents). Now, we have OpenTelemetry. It provides a vendor-neutral SDK to generate logs, metrics, and traces.

  • Collector: A sidecar or central service that receives OTel data, processes it (e.g., removes PII), and exports it to your backend (Datadog, Prometheus, Jaeger).
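A sketch of the typical setup with the OpenTelemetry Python SDK, exporting spans over OTLP to a Collector; the service name and collector endpoint are placeholders:

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Instrument with the vendor-neutral SDK; only the Collector knows about the backend.
provider = TracerProvider(resource=Resource.create({"service.name": "payment-service"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("charge-card"):
    ...  # business logic; the span is shipped to the Collector, then to Jaeger/Prometheus/etc.
```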

[!TIP] Interview Answer: “I would use OpenTelemetry for instrumentation to avoid vendor lock-in. I’d configure the OTel Collector to export metrics to Prometheus and traces to Jaeger.”


5. Summary

| Concept | Example Tools | Best For |
| --- | --- | --- |
| Logs | ELK Stack (Elasticsearch), Splunk | Debugging specific crashes. "Why did this request fail?" |
| Metrics | Prometheus, Grafana | Alerting & Trends. "Is the site slower today?" |
| Traces | Jaeger, Zipkin, Tempo | Latency Analysis. "Which microservice is the bottleneck?" |

Next: Reliability Patterns (Circuit Breakers) ->