
Observability Overview

What is Observability?

Observability is the ability to understand the internal state of a system based on its external outputs. In software systems, this means being able to answer questions about your application's behavior in production by examining logs, metrics, and traces.

The difference between monitoring and observability:

  • Monitoring answers known questions: "Is the CPU above 80%?" "Is the error rate above 5%?"
  • Observability enables exploring unknown questions: "Why is this specific user's checkout failing?" "Where in the call chain is latency increasing?"

Monitoring tells you that something is wrong. Observability helps you understand why and where.


The Three Pillars of Observability

Modern observability relies on three complementary data types, often called the "three pillars":

1. Logs - The Narrative

Logs capture discrete events that happened in your system. Each log entry tells a small story: "User U-123 attempted login," "Database query took 2.5s," "Payment P-456 completed successfully."

Strengths:

  • Detailed context about specific events
  • Essential for debugging specific issues
  • Provides audit trails for compliance
  • Can answer "What exactly happened at 10:15:30 AM?"

Limitations:

  • High volume (every event creates a log)
  • Difficult to aggregate and trend
  • Can be noisy and hard to search without structure

Example questions logs answer:

  • Why did payment PAY-123 fail?
  • What error message did user U-456 see?
  • Who accessed customer record C-789 yesterday?
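To make the "structure" point concrete, here is a minimal sketch of a structured log entry: one JSON object per event, so fields like `userId` become searchable keys instead of free text buried in a message. The class and field names are illustrative; in practice a logging library with a JSON encoder (e.g. Logback) does this for you.

```java
import java.time.Instant;
import java.util.Map;

// Minimal structured-logging sketch: each event becomes one JSON object,
// so a query like userId="U-123" works without text parsing.
public class StructuredLog {
    public static String entry(String level, String message, Map<String, String> fields) {
        StringBuilder sb = new StringBuilder("{");
        sb.append("\"timestamp\":\"").append(Instant.now()).append("\",");
        sb.append("\"level\":\"").append(level).append("\",");
        sb.append("\"message\":\"").append(message).append("\"");
        for (Map.Entry<String, String> f : fields.entrySet()) {
            sb.append(",\"").append(f.getKey()).append("\":\"").append(f.getValue()).append("\"");
        }
        return sb.append("}").toString();
    }

    public static void main(String[] args) {
        // Prints one self-describing JSON line per event.
        System.out.println(entry("INFO", "User attempted login", Map.of("userId", "U-123")));
    }
}
```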

2. Metrics - The Statistics

Metrics provide quantitative measurements aggregated over time. Instead of recording each individual request, metrics tell you "1000 requests per second," "P95 latency is 250ms," "error rate is 2%."

Strengths:

  • Low storage cost (aggregated data)
  • Excellent for alerting and dashboards
  • Shows trends over time
  • Can answer "How much?" and "How fast?"

Limitations:

  • Loses individual request details
  • Can't tell you why a specific thing happened
  • Requires knowing what to measure upfront

Example questions metrics answer:

  • Is latency increasing over time?
  • What's the current error rate?
  • Are we meeting our SLA of 99.9% uptime?
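A small sketch of why tail percentiles matter more than averages (using the nearest-rank method; the numbers are made up for illustration): one slow outlier barely moves the mean but is exactly what P99 surfaces.

```java
import java.util.Arrays;

// Sketch: mean vs. tail percentile over latency samples (nearest-rank method).
public class LatencyStats {
    // Nearest-rank percentile, p in (0, 100].
    public static double percentile(double[] samples, double p) {
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length); // 1-based rank
        return sorted[Math.max(rank, 1) - 1];
    }

    public static double mean(double[] samples) {
        return Arrays.stream(samples).average().orElse(0.0);
    }

    public static void main(String[] args) {
        // 19 requests at 100 ms, one outlier at 3000 ms.
        double[] ms = new double[20];
        Arrays.fill(ms, 100.0);
        ms[19] = 3000.0;
        // The mean hides the outlier; P99 exposes it.
        System.out.printf("mean=%.0fms p99=%.0fms%n", mean(ms), percentile(ms, 99)); // mean=245ms p99=3000ms
    }
}
```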

3. Distributed Tracing - The Journey

Distributed tracing visualizes how individual requests flow through your distributed system. It reconstructs the call path across services, showing exactly where time was spent.

Strengths:

  • Shows request flow across services
  • Identifies bottlenecks (which service is slow?)
  • Visualizes dependencies between services
  • Can answer "Where did the request go?"

Limitations:

  • More expensive than metrics (per-request data)
  • Requires instrumentation across all services
  • Sampling needed for cost management

Example questions tracing answers:

  • Why are checkout requests taking 5 seconds?
  • Which service in the call chain is the bottleneck?
  • How many services does a typical request touch?
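What a trace actually reconstructs can be sketched as a tree of timed spans; walking it to the slowest leaf is how a bottleneck is found. The service names and timings below are invented for illustration, not a real system's.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a trace: a tree of timed spans, one per operation per service.
public class TraceSketch {
    public static class Span {
        public final String service, operation;
        public final long startMs, endMs;
        public final List<Span> children = new ArrayList<>();
        public Span(String service, String operation, long startMs, long endMs) {
            this.service = service; this.operation = operation;
            this.startMs = startMs; this.endMs = endMs;
        }
        public long durationMs() { return endMs - startMs; }
    }

    // The slowest leaf span is the bottleneck candidate.
    public static Span slowestLeaf(Span span) {
        if (span.children.isEmpty()) return span;
        Span worst = null;
        for (Span child : span.children) {
            Span leaf = slowestLeaf(child);
            if (worst == null || leaf.durationMs() > worst.durationMs()) worst = leaf;
        }
        return worst;
    }

    public static void main(String[] args) {
        Span root = new Span("api-gateway", "POST /api/checkout", 0, 5000);
        Span order = new Span("order-service", "createOrder", 10, 4900);
        order.children.add(new Span("order-service", "db.query", 20, 3020));    // 3000 ms
        order.children.add(new Span("payment-service", "charge", 3030, 4890));  // 1860 ms
        root.children.add(order);
        Span b = slowestLeaf(root);
        System.out.println(b.service + " " + b.operation + " took " + b.durationMs() + "ms");
    }
}
```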

How the Three Pillars Work Together

The three pillars are complementary, not redundant. A mature observability strategy uses all three together:

Typical Investigation Workflow

Step-by-step:

  1. Metrics detect the anomaly: Alert fires because P95 latency exceeded threshold
  2. Metrics narrow scope: Dashboard shows the /api/checkout endpoint is affected
  3. Traces identify location: Trace visualization shows database queries in order-service taking 3s
  4. Logs explain why: Logs filtered by trace ID show "connection pool exhausted" errors
  5. Resolution: Increase database connection pool size

Complementary Strengths

Scenario               | Metrics                              | Logs                         | Traces
Alert triggers         | Primary (aggregates detect patterns) | Too noisy                    | Sampled, unreliable for alerting
Find affected endpoint | Dashboard breakdown                  | Requires aggregation         | Possible but slow
Locate bottleneck      | No request-level detail              | No cross-service view        | Primary (shows where time was spent)
Understand root cause  | No contextual detail                 | Detailed error messages      | Shows symptoms, not causes
Historical analysis    | Efficient time-series storage        | Expensive long-term storage  | Usually retained short-term

Key insight: You need all three because they answer different questions at different stages of investigation.


Correlation: Tying It All Together

The true power of observability comes from correlating data across the three pillars:

Correlation IDs / Trace IDs

A unique identifier (correlation ID or trace ID) flows through all three pillars, enabling you to:

  • Metrics → Traces: When metrics show an issue, query traces for affected requests
  • Traces → Logs: When trace shows a slow operation, filter logs by trace ID for detailed context
  • Logs → Metrics: When investigating a log error, check metrics to see if it's widespread

Implementation:

  • Generate a unique ID when request enters system
  • Propagate ID across service boundaries via HTTP headers
  • Include ID in all logs (e.g. MDC in Java, AsyncLocalStorage in Node.js)
  • Attach ID to metrics as exemplar (where supported)
  • Use ID as trace ID in distributed tracing
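The steps above can be sketched as follows. The header name `X-Correlation-ID` is a common convention rather than a formal standard, and the `ThreadLocal` stands in for what MDC does in Java logging frameworks; real systems would use a logging framework and HTTP middleware instead of these hand-rolled pieces.

```java
import java.util.Map;
import java.util.UUID;

// Sketch: reuse an incoming correlation ID or mint one at the edge,
// then stamp it into every log line for per-request filtering.
public class CorrelationIds {
    public static final String HEADER = "X-Correlation-ID"; // conventional name, not a standard

    // Per-request context, analogous to MDC in Java logging frameworks.
    private static final ThreadLocal<String> CURRENT = new ThreadLocal<>();

    // Reuse the caller's ID if one arrived on the header; otherwise start a new one.
    public static String resolve(Map<String, String> incomingHeaders) {
        String id = incomingHeaders.getOrDefault(HEADER, UUID.randomUUID().toString());
        CURRENT.set(id);
        return id;
    }

    // Every log line carries the ID, so logs can be joined with traces later.
    public static String log(String message) {
        return "[correlationId=" + CURRENT.get() + "] " + message;
    }

    public static void main(String[] args) {
        resolve(Map.of(HEADER, "abc-123")); // simulating an upstream caller's header
        System.out.println(log("Processing checkout"));
    }
}
```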

See Logging: Correlation IDs for implementation details.


Observability Maturity Model

Teams typically progress through these stages:

Level 1: Basic Monitoring

  • Server metrics (CPU, memory, disk)
  • Application logs to files
  • Manual investigation when issues reported
  • Pain points: Reactive, slow troubleshooting, no visibility

Level 2: Structured Observability

  • Centralized logging (ELK/Splunk)
  • Application metrics (Prometheus/Grafana)
  • Correlation IDs in logs
  • Alerting on key metrics
  • Pain points: Still hard to trace issues across services

Level 3: Distributed Tracing

  • OpenTelemetry instrumentation
  • Trace visualization (Jaeger/Zipkin)
  • Correlation between logs, metrics, traces
  • Service dependency mapping
  • Pain points: Need better automation, coverage gaps

Level 4: Full Observability

  • Automatic instrumentation across all services
  • SLI/SLO-based alerting
  • Proactive anomaly detection
  • Self-service investigation tools
  • Observability as code

Most teams should target Level 3. Level 4 requires significant investment and is typically needed only at scale.


Getting Started

For New Projects

  1. Start with structured logging: JSON format from day one
  2. Add correlation IDs early: Easier to add before you have many services
  3. Instrument metrics: Use auto-instrumentation where available (Spring Boot Actuator, Micrometer)
  4. Add tracing when going multi-service: Single service doesn't need distributed tracing

For Existing Projects

  1. Structured logging first: Highest ROI, enables better log analysis
  2. Metrics second: Alerting and dashboards catch issues proactively
  3. Tracing third: Most valuable with 3+ services, higher implementation cost
  4. Incremental adoption: Don't need 100% coverage - start with critical paths

Tool Ecosystem

Pillar  | Tool                                        | Why
Logs    | ELK Stack (Elasticsearch, Logstash, Kibana) | Industry standard, powerful search, self-hosted
Metrics | Prometheus + Grafana                        | Pull-based metrics, flexible queries (PromQL), rich visualization
Traces  | OpenTelemetry + Jaeger                      | Vendor-neutral standard, strong Spring Boot integration

Commercial Alternatives

  • All-in-one: Datadog, New Relic, Dynatrace (unified platform, easier setup, higher cost)
  • Logs: Splunk (enterprise features, high cost)
  • Metrics: Datadog, SignalFx (managed Prometheus-compatible)
  • Traces: Datadog APM, Lightstep (advanced features, automatic profiling)

Recommendation: Start with open source to learn fundamentals, consider commercial for scale or convenience.


Best Practices

Do

  • Use correlation IDs everywhere - The glue between logs, metrics, and traces
  • Structured logging (JSON) - Enables machine parsing and analysis
  • Monitor the Four Golden Signals - Latency, traffic, errors, saturation
  • Sample traces intelligently - 100% of errors, 5-10% of successes
  • Alert on symptoms - User-visible problems, not infrastructure details
  • Percentiles over averages - P95/P99 reveal what averages hide
  • Instrument at service boundaries - HTTP, database, message queues
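The sampling rule above (keep every error, keep a small fraction of successes) is simple enough to sketch directly. The 5% rate is within the range the text suggests; tune it to your traffic and storage budget.

```java
import java.util.concurrent.ThreadLocalRandom;

// Sketch of head-based sampling: 100% of error traces, a fixed fraction of successes.
public class TraceSampler {
    private final double successRate; // fraction of successful requests to keep, e.g. 0.05

    public TraceSampler(double successRate) {
        this.successRate = successRate;
    }

    public boolean shouldSample(boolean isError) {
        if (isError) return true; // never drop an error trace
        return ThreadLocalRandom.current().nextDouble() < successRate;
    }

    public static void main(String[] args) {
        TraceSampler sampler = new TraceSampler(0.05);
        System.out.println("error sampled: " + sampler.shouldSample(true)); // always true
    }
}
```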

Don't

  • Log sensitive data - Credentials, tokens, PII must never appear in logs
  • High cardinality metrics - User IDs/transaction IDs as metric tags exhaust memory
  • Excessive logging - DEBUG level in production creates noise and cost
  • Ignore log retention - Balance cost with compliance/troubleshooting needs
  • Trace everything - Sampling is necessary for cost management
  • Forget to correlate - Without correlation IDs, pillars exist in isolation
  • Alert fatigue - Too many alerts mean real issues get ignored




Summary

Observability is about understanding your system's behavior from its outputs. The three pillars - logs (what happened), metrics (how much), and traces (where) - work together to provide complete visibility.

Key takeaways:

  • Each pillar answers different questions - you need all three
  • Correlation IDs tie the pillars together for powerful investigations
  • Start with structured logging and metrics, add tracing when multi-service
  • Open source tools (ELK, Prometheus, Jaeger) provide production-ready observability
  • Observability enables proactive problem detection and faster resolution

Ready to dive deeper? Start with Logging Best Practices to build your observability foundation.