
Observability Overview

What is Observability?

Observability is the ability to understand the internal state of a system based on its external outputs. In software systems, this means being able to answer questions about your application's behavior in production by examining logs, metrics, and traces.

The difference between monitoring and observability:

  • Monitoring answers known questions: "Is the CPU above 80%?" "Is the error rate above 5%?"
  • Observability enables exploring unknown questions: "Why is this specific user's checkout failing?" "Where in the call chain is latency increasing?"

Monitoring tells you that something is wrong. Observability helps you understand why and where.


The Three Pillars of Observability

Modern observability relies on three complementary data types, often called the "three pillars":

1. Logs - The Narrative

Logs capture discrete events that happened in your system. Each log entry tells a small story: "User U-123 attempted login," "Database query took 2.5s," "Payment P-456 completed successfully."

Strengths:

  • Detailed context about specific events
  • Essential for debugging specific issues
  • Provides audit trails for compliance
  • Can answer "What exactly happened at 10:15:30 AM?"

Limitations:

  • High volume (every event creates a log)
  • Difficult to aggregate and trend
  • Can be noisy and hard to search without structure

Example questions logs answer:

  • Why did payment PAY-123 fail?
  • What error message did user U-456 see?
  • Who accessed customer record C-789 yesterday?
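To make the "structure" point concrete, here is a minimal sketch of a structured log entry: one JSON object per event, so fields like `userId` become searchable keys instead of free text buried in a message. The class and field names are illustrative; in practice a logging library with a JSON encoder (e.g. Logback) does this for you.

```java
import java.time.Instant;
import java.util.Map;

// Minimal structured-logging sketch: each event becomes one JSON object,
// so a query like userId="U-123" works without text parsing.
public class StructuredLog {
    public static String entry(String level, String message, Map<String, String> fields) {
        StringBuilder sb = new StringBuilder("{");
        sb.append("\"timestamp\":\"").append(Instant.now()).append("\",");
        sb.append("\"level\":\"").append(level).append("\",");
        sb.append("\"message\":\"").append(message).append("\"");
        for (Map.Entry<String, String> f : fields.entrySet()) {
            sb.append(",\"").append(f.getKey()).append("\":\"").append(f.getValue()).append("\"");
        }
        return sb.append("}").toString();
    }

    public static void main(String[] args) {
        // Prints one self-describing JSON line per event.
        System.out.println(entry("INFO", "User attempted login", Map.of("userId", "U-123")));
    }
}
```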

2. Metrics - The Statistics

Metrics provide quantitative measurements aggregated over time. Instead of recording each individual request, metrics tell you "1000 requests per second," "P95 latency is 250ms," "error rate is 2%."

Strengths:

  • Low storage cost (aggregated data)
  • Excellent for alerting and dashboards
  • Shows trends over time
  • Can answer "How much?" and "How fast?"

Limitations:

  • Loses individual request details
  • Can't tell you why a specific thing happened
  • Requires knowing what to measure upfront

Example questions metrics answer:

  • Is latency increasing over time?
  • What's the current error rate?
  • Are we meeting our SLA of 99.9% uptime?
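A small sketch of why tail percentiles matter more than averages (using the nearest-rank method; the numbers are made up for illustration): one slow outlier barely moves the mean but is exactly what P99 surfaces.

```java
import java.util.Arrays;

// Sketch: mean vs. tail percentile over latency samples (nearest-rank method).
public class LatencyStats {
    // Nearest-rank percentile, p in (0, 100].
    public static double percentile(double[] samples, double p) {
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length); // 1-based rank
        return sorted[Math.max(rank, 1) - 1];
    }

    public static double mean(double[] samples) {
        return Arrays.stream(samples).average().orElse(0.0);
    }

    public static void main(String[] args) {
        // 19 requests at 100 ms, one outlier at 3000 ms.
        double[] ms = new double[20];
        Arrays.fill(ms, 100.0);
        ms[19] = 3000.0;
        // The mean hides the outlier; P99 exposes it.
        System.out.printf("mean=%.0fms p99=%.0fms%n", mean(ms), percentile(ms, 99)); // mean=245ms p99=3000ms
    }
}
```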

3. Distributed Tracing - The Journey

Distributed tracing visualizes how individual requests flow through your distributed system. It reconstructs the call path across services, showing exactly where time was spent.

Strengths:

  • Shows request flow across services
  • Identifies bottlenecks (which service is slow?)
  • Visualizes dependencies between services
  • Can answer "Where did the request go?"

Limitations:

  • More expensive than metrics (per-request data)
  • Requires instrumentation across all services
  • Sampling needed for cost management

Example questions tracing answers:

  • Why are checkout requests taking 5 seconds?
  • Which service in the call chain is the bottleneck?
  • How many services does a typical request touch?
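What a trace actually reconstructs can be sketched as a tree of timed spans; walking it to the slowest leaf is how a bottleneck is found. The service names and timings below are invented for illustration, not a real system's.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a trace: a tree of timed spans, one per operation per service.
public class TraceSketch {
    public static class Span {
        public final String service, operation;
        public final long startMs, endMs;
        public final List<Span> children = new ArrayList<>();
        public Span(String service, String operation, long startMs, long endMs) {
            this.service = service; this.operation = operation;
            this.startMs = startMs; this.endMs = endMs;
        }
        public long durationMs() { return endMs - startMs; }
    }

    // The slowest leaf span is the bottleneck candidate.
    public static Span slowestLeaf(Span span) {
        if (span.children.isEmpty()) return span;
        Span worst = null;
        for (Span child : span.children) {
            Span leaf = slowestLeaf(child);
            if (worst == null || leaf.durationMs() > worst.durationMs()) worst = leaf;
        }
        return worst;
    }

    public static void main(String[] args) {
        Span root = new Span("api-gateway", "POST /api/checkout", 0, 5000);
        Span order = new Span("order-service", "createOrder", 10, 4900);
        order.children.add(new Span("order-service", "db.query", 20, 3020));    // 3000 ms
        order.children.add(new Span("payment-service", "charge", 3030, 4890));  // 1860 ms
        root.children.add(order);
        Span b = slowestLeaf(root);
        System.out.println(b.service + " " + b.operation + " took " + b.durationMs() + "ms");
    }
}
```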

How the Three Pillars Work Together

The three pillars are complementary, not redundant. A mature observability strategy uses all three together:

Typical Investigation Workflow

Step-by-step:

  1. Metrics detect the anomaly: Alert fires because P95 latency exceeded threshold
  2. Metrics narrow scope: Dashboard shows the /api/checkout endpoint is affected
  3. Traces identify location: Trace visualization shows database queries in order-service taking 3s
  4. Logs explain why: Logs filtered by trace ID show "connection pool exhausted" errors
  5. Resolution: Increase database connection pool size

Complementary Strengths

Scenario               | Metrics                              | Logs                         | Traces
Alert triggers         | Primary (aggregates detect patterns) | Too noisy                    | Sampled, unreliable for alerting
Find affected endpoint | Dashboard breakdown                  | Requires aggregation         | Possible but slow
Locate bottleneck      | No request-level detail              | No cross-service view        | Primary (shows where time was spent)
Understand root cause  | No contextual detail                 | Detailed error messages      | Shows symptoms, not causes
Historical analysis    | Efficient time-series storage        | Expensive long-term storage  | Usually retained short-term

Key insight: You need all three because they answer different questions at different stages of investigation.


Correlation: Tying It All Together

The true power of observability comes from correlating data across the three pillars:

Correlation IDs / Trace IDs

A unique identifier (correlation ID or trace ID) flows through all three pillars, enabling you to:

  • Metrics → Traces: When metrics show an issue, query traces for affected requests
  • Traces → Logs: When trace shows a slow operation, filter logs by trace ID for detailed context
  • Logs → Metrics: When investigating a log error, check metrics to see if it's widespread

Implementation:

  • Generate a unique ID when request enters system
  • Propagate ID across service boundaries via HTTP headers
  • Include ID in all logs (e.g. MDC in Java, AsyncLocalStorage in Node.js)
  • Attach ID to metrics as exemplar (where supported)
  • Use ID as trace ID in distributed tracing
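The steps above can be sketched as follows. The header name `X-Correlation-ID` is a common convention rather than a formal standard, and the `ThreadLocal` stands in for what MDC does in Java logging frameworks; real systems would use a logging framework and HTTP middleware instead of these hand-rolled pieces.

```java
import java.util.Map;
import java.util.UUID;

// Sketch: reuse an incoming correlation ID or mint one at the edge,
// then stamp it into every log line for per-request filtering.
public class CorrelationIds {
    public static final String HEADER = "X-Correlation-ID"; // conventional name, not a standard

    // Per-request context, analogous to MDC in Java logging frameworks.
    private static final ThreadLocal<String> CURRENT = new ThreadLocal<>();

    // Reuse the caller's ID if one arrived on the header; otherwise start a new one.
    public static String resolve(Map<String, String> incomingHeaders) {
        String id = incomingHeaders.getOrDefault(HEADER, UUID.randomUUID().toString());
        CURRENT.set(id);
        return id;
    }

    // Every log line carries the ID, so logs can be joined with traces later.
    public static String log(String message) {
        return "[correlationId=" + CURRENT.get() + "] " + message;
    }

    public static void main(String[] args) {
        resolve(Map.of(HEADER, "abc-123")); // simulating an upstream caller's header
        System.out.println(log("Processing checkout"));
    }
}
```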

See Logging: Correlation IDs for implementation details.


Observability Maturity Model

Teams typically progress through these stages:

Level 1: Basic Monitoring

  • Server metrics (CPU, memory, disk)
  • Application logs to files
  • Manual investigation when issues reported
  • Pain points: Reactive, slow troubleshooting, no visibility

Level 2: Structured Observability

  • Centralized logging (ELK/Splunk)
  • Application metrics (Prometheus/Grafana)
  • Correlation IDs in logs
  • Alerting on key metrics
  • Pain points: Still hard to trace issues across services

Level 3: Distributed Tracing

  • OpenTelemetry instrumentation
  • Trace visualization (Jaeger/Zipkin)
  • Correlation between logs, metrics, traces
  • Service dependency mapping
  • Pain points: Need better automation, coverage gaps

Level 4: Full Observability

  • Automatic instrumentation across all services
  • SLI/SLO-based alerting
  • Proactive anomaly detection
  • Self-service investigation tools
  • Observability as code

Most teams should target Level 3. Level 4 requires significant investment and is typically needed only at scale.


Getting Started

For New Projects

  1. Start with structured logging: JSON format from day one
  2. Add correlation IDs early: Easier to add before you have many services
  3. Instrument metrics: Use auto-instrumentation where available (Spring Boot Actuator, Micrometer)
  4. Add tracing when going multi-service: Single service doesn't need distributed tracing

For Existing Projects

  1. Structured logging first: Highest ROI, enables better log analysis
  2. Metrics second: Alerting and dashboards catch issues proactively
  3. Tracing third: Most valuable with 3+ services, higher implementation cost
  4. Incremental adoption: Don't need 100% coverage - start with critical paths

Tool Ecosystem

Pillar  | Tool                                        | Why
Logs    | ELK Stack (Elasticsearch, Logstash, Kibana) | Industry standard, powerful search, self-hosted
Metrics | Prometheus + Grafana                        | Pull-based metrics, flexible queries (PromQL), rich visualization
Traces  | OpenTelemetry + Jaeger                      | Vendor-neutral standard, strong Spring Boot integration

Commercial Alternatives

  • All-in-one: Datadog, New Relic, Dynatrace (unified platform, easier setup, higher cost)
  • Logs: Splunk (enterprise features, high cost)
  • Metrics: Datadog, SignalFx (managed Prometheus-compatible)
  • Traces: Datadog APM, Lightstep (advanced features, automatic profiling)

Recommendation: Start with open source to learn fundamentals, consider commercial for scale or convenience.


Best Practices

Do

  • Use correlation IDs everywhere - The glue between logs, metrics, and traces
  • Structured logging (JSON) - Enables machine parsing and analysis
  • Monitor the Four Golden Signals - Latency, traffic, errors, saturation
  • Sample traces intelligently - 100% of errors, 5-10% of successes
  • Alert on symptoms - User-visible problems, not infrastructure details
  • Percentiles over averages - P95/P99 reveal what averages hide
  • Instrument at service boundaries - HTTP, database, message queues
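The sampling rule above (keep every error, keep a small fraction of successes) is simple enough to sketch directly. The 5% rate is within the range the text suggests; tune it to your traffic and storage budget.

```java
import java.util.concurrent.ThreadLocalRandom;

// Sketch of head-based sampling: 100% of error traces, a fixed fraction of successes.
public class TraceSampler {
    private final double successRate; // fraction of successful requests to keep, e.g. 0.05

    public TraceSampler(double successRate) {
        this.successRate = successRate;
    }

    public boolean shouldSample(boolean isError) {
        if (isError) return true; // never drop an error trace
        return ThreadLocalRandom.current().nextDouble() < successRate;
    }

    public static void main(String[] args) {
        TraceSampler sampler = new TraceSampler(0.05);
        System.out.println("error sampled: " + sampler.shouldSample(true)); // always true
    }
}
```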

Don't

  • Log sensitive data - Credentials, tokens, PII must never appear in logs
  • High cardinality metrics - User IDs/transaction IDs as metric tags exhaust memory
  • Excessive logging - DEBUG level in production creates noise and cost
  • Ignore log retention - Balance cost with compliance/troubleshooting needs
  • Trace everything - Sampling is necessary for cost management
  • Forget to correlate - Without correlation IDs, pillars exist in isolation
  • Alert fatigue - Too many alerts mean real issues get ignored




Summary

Observability is about understanding your system's behavior from its outputs. The three pillars - logs (what happened), metrics (how much), and traces (where) - work together to provide complete visibility.

Key takeaways:

  • Each pillar answers different questions - you need all three
  • Correlation IDs tie the pillars together for powerful investigations
  • Start with structured logging and metrics, add tracing when multi-service
  • Open source tools (ELK, Prometheus, Jaeger) provide production-ready observability
  • Observability enables proactive problem detection and faster resolution

Ready to dive deeper? Start with Logging Best Practices to build your observability foundation.