Distributed Tracing Best Practices
Overview
Distributed tracing is the third pillar of observability, providing visibility into how requests flow through distributed systems. While metrics answer "how many/how fast?" and logs provide "what happened?" detail, tracing answers "where did the request go?" and "where did time get spent?"
In monolithic applications, a stack trace tells you the call path. In microservices, a single user action might traverse 10+ services - tracing reconstructs that distributed call path.
What distributed tracing provides:
- Request flow visualization: See the complete path a request takes through your system
- Latency breakdown: Understand which services/operations contribute to total response time
- Bottleneck identification: Pinpoint slow database queries, external API calls, or inefficient code
- Dependency mapping: Visualize how services call each other
- Error attribution: Identify exactly where failures occur in the call chain
- Performance regression detection: Compare trace patterns over time to spot degradations
This guide covers tracing fundamentals, OpenTelemetry instrumentation, trace context propagation, and integration with Jaeger for visualization.
Tracing vs. Metrics vs. Logs
Understanding when tracing provides unique value:
| Aspect | Distributed Tracing | Metrics | Logs |
|---|---|---|---|
| Question | "Where did time go?" | "How much/how fast?" | "What happened?" |
| Structure | Directed acyclic graph (DAG) of spans | Time-series numbers | Text/JSON events |
| Scope | Single request across services | Aggregate across all requests | Individual events |
| Use case | Latency investigation, bottleneck ID | Alerting, trend analysis | Debugging, audit |
| Example | "Request took 2.5s: 1s in API gateway, 1.2s in database, 0.3s in cache" | "P95 latency: 850ms" | "DB query failed: timeout after 30s" |
When to use tracing:
- Investigating why specific requests are slow
- Understanding service dependencies
- Debugging cross-service issues
- Performance optimization (finding where time is spent)
Tracing complements but doesn't replace:
- Metrics for alerting (trace all requests, but metrics tell you when to investigate)
- Logs for detailed why (traces show the slow service, logs show the specific error)
Core Principles
- Trace Every Request: Instrument all services to capture complete request flows
- Propagate Context: Pass trace and span IDs across service boundaries via headers
- Meaningful Span Names: Use descriptive names that identify operations (GET /api/users, db.query.users.findById)
- Capture Key Attributes: Add relevant context (user ID, resource IDs, response codes) as span attributes
- Sample Intelligently: Balance visibility with cost by sampling strategically (always sample errors, sample % of success)
- Correlate with Logs: Use trace IDs in logs to link detailed context with trace visualization
Tracing Concepts
Before implementing tracing, you need to understand its fundamental building blocks.
Trace
A trace represents the complete journey of a single request through your distributed system. It's identified by a unique trace ID that stays constant as the request flows through services.
Think of it as: A story with a beginning (user clicks button) and end (response displayed), covering everything that happened in between across all services.
Key characteristics:
- Has a unique trace ID (e.g., abc-123-xyz-789)
- Spans multiple services
- Contains a hierarchy of spans
- Has a duration (total time from start to finish)
Span
A span represents a single operation within a trace - one unit of work. Each span has its own unique span ID and knows its parent span ID.
Think of it as: A chapter in the trace story. Each HTTP call, database query, or significant method execution gets its own span.
Key characteristics:
- Has a unique span ID
- References parent span ID (except root span)
- Has an operation name (e.g., HTTP GET /api/users)
- Records start time and duration
- Contains attributes (metadata about the operation)
- Can record events (discrete moments like "cache hit" or "retrying")
- Has status (OK, ERROR)
Span types:
- Client span: Outgoing call to another service
- Server span: Incoming request from another service
- Internal span: Operation within your service (database query, cache lookup)
Trace Context
Trace context is the metadata that must be propagated across service boundaries to maintain the trace. It's passed via HTTP headers (or message headers for async systems).
Required context:
- Trace ID: Identifies the overall trace
- Span ID: Identifies the current span (becomes parent span ID for next service)
- Trace flags: Sampling decision, debug flags
Standard: W3C Trace Context
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
             └┘ └──────────── trace-id ─────────┘ └── span-id ───┘ └┘
          version         32 hex chars               16 hex chars  flags
          (2 hex)                                                 (2 hex)
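The dash-separated fields above are straightforward to pull apart. Here is a minimal sketch (the `TraceparentParser` class and its `parse` method are hypothetical helpers, not part of any library):

```java
// Hypothetical helper that splits a W3C traceparent header into its four fields.
public class TraceparentParser {

    public record Traceparent(String version, String traceId, String spanId, String flags) {}

    public static Traceparent parse(String header) {
        // Format: version-traceid-spanid-flags (all lowercase hex, no dashes inside fields)
        String[] parts = header.split("-");
        if (parts.length != 4) {
            throw new IllegalArgumentException("Expected version-traceid-spanid-flags: " + header);
        }
        return new Traceparent(parts[0], parts[1], parts[2], parts[3]);
    }

    public static void main(String[] args) {
        Traceparent tp = parse("00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01");
        System.out.println(tp.traceId()); // prints 0af7651916cd43dd8448eb211c80319c
    }
}
```

The receiving service uses the incoming span-id as the parent of the server span it creates, which is how the hierarchy stays connected across the network hop.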
Trace Hierarchy Example
Here's how spans nest to form a complete trace:
Trace: Process Payment (trace_id: abc-123, total: 2.5s)
│
├── Span: [SERVER] HTTP POST /api/payments (span_id: 1, duration: 2.5s)
│ │
│ ├── Span: [INTERNAL] Validate Payment (span_id: 2, parent: 1, duration: 50ms)
│ │ └── Event: "Validation passed"
│ │
│ ├── Span: [INTERNAL] Check Balance (span_id: 3, parent: 1, duration: 1.2s)
│ │ └── Span: [CLIENT] DB Query accounts.findById (span_id: 4, parent: 3, duration: 1.15s)
│ │ └── Attributes: db.system=postgresql, db.statement=SELECT...
│ │
│ ├── Span: [CLIENT] HTTP POST /fraud-service/check (span_id: 5, parent: 1, duration: 800ms)
│ │ └── Span: [SERVER] POST /check (span_id: 6, parent: 5, duration: 790ms)
│ │ └── Span: [INTERNAL] ML Model Inference (span_id: 7, parent: 6, duration: 750ms)
│ │ └── Attributes: model.version=v2.1, fraud.score=0.05
│ │
│ └── Span: [INTERNAL] Process Transaction (span_id: 8, parent: 1, duration: 400ms)
│ └── Span: [CLIENT] DB Insert transactions (span_id: 9, parent: 8, duration: 380ms)
│ └── Attributes: db.system=postgresql, db.statement=INSERT...
What this visualization reveals:
- Total request took 2.5s
- Largest bottleneck: Database query took 1.15s (46% of total time)
- Fraud check took 800ms (32% of total time)
- Validation was fast (50ms)
- Operations happened sequentially (no parallelism opportunities visible)
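The percentages above are just span duration divided by total trace duration. A quick sketch of that arithmetic, using the numbers from the trace (the `spanShares` helper is hypothetical, for illustration only):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SpanShares {

    // Percentage of total trace time spent in each operation
    static Map<String, Double> spanShares(Map<String, Long> durationsMs, long traceTotalMs) {
        Map<String, Double> shares = new LinkedHashMap<>();
        durationsMs.forEach((op, ms) -> shares.put(op, 100.0 * ms / traceTotalMs));
        return shares;
    }

    public static void main(String[] args) {
        Map<String, Long> spans = new LinkedHashMap<>();
        spans.put("db.query.accounts.findById", 1150L); // the database query
        spans.put("fraud-service/check", 800L);         // the fraud check
        spans.put("validate.payment", 50L);             // validation
        Map<String, Double> shares = spanShares(spans, 2500L);
        System.out.println(shares.get("db.query.accounts.findById")); // prints 46.0
    }
}
```

Trace backends compute exactly this kind of breakdown for you, but it is worth knowing the numbers come straight from span start/end timestamps, nothing more exotic.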
Spring Boot with OpenTelemetry
Dependencies
// build.gradle — Spring Boot 3.x includes Micrometer Tracing which wraps OpenTelemetry
implementation 'org.springframework.boot:spring-boot-starter-actuator'
implementation 'io.micrometer:micrometer-tracing-bridge-otel' // OTel bridge
implementation 'io.opentelemetry:opentelemetry-exporter-otlp' // OTLP exporter (Jaeger, Grafana Tempo, etc.)
Configuration
# application.yml — matches the Micrometer Tracing + OTLP exporter dependencies above
spring:
  application:
    name: payment-service  # used as the service name on exported spans
management:
  tracing:
    sampling:
      probability: 1.0  # trace everything in dev; reduce in production
  otlp:
    tracing:
      endpoint: http://jaeger:4318/v1/traces  # Jaeger accepts OTLP over HTTP on 4318
  opentelemetry:
    resource-attributes:  # requires Spring Boot 3.2+
      environment: ${spring.profiles.active}
      service.version: 1.0.0
Automatic Instrumentation
Spring Boot with OpenTelemetry automatically traces:
- HTTP requests/responses
- Database queries (JDBC)
- Redis operations
- Kafka producers/consumers
- RestTemplate/WebClient calls
Custom Spans
import java.math.BigDecimal;

import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;
import org.springframework.stereotype.Service;

@Service
public class PaymentService {

    private final Tracer tracer;

    public PaymentService(Tracer tracer) {
        this.tracer = tracer;
    }

    public PaymentResult processPayment(Payment payment) {
        // Enrich the current span (created automatically by Spring for the HTTP request)
        Span currentSpan = Span.current();
        currentSpan.setAttribute("payment.id", payment.getId());
        currentSpan.setAttribute("payment.amount", payment.getAmount().doubleValue());
        currentSpan.setAttribute("payment.currency", payment.getCurrency());

        // Create a custom span for the validation step
        Span validationSpan = tracer.spanBuilder("validate.payment").startSpan();
        try (Scope scope = validationSpan.makeCurrent()) {
            validatePayment(payment);
            validationSpan.setAttribute("validation.result", "success");
        } catch (ValidationException e) {
            validationSpan.recordException(e);
            validationSpan.setAttribute("validation.result", "failed");
            throw e;
        } finally {
            validationSpan.end();
        }

        // Process payment logic
        return executePayment(payment);
    }

    private void validatePayment(Payment payment) {
        Span span = tracer.spanBuilder("validate.payment.rules").startSpan();
        try (Scope scope = span.makeCurrent()) {
            span.addEvent("Validating amount");
            if (payment.getAmount().compareTo(BigDecimal.ZERO) <= 0) {
                span.setAttribute("error", "Invalid amount");
                throw new ValidationException("Amount must be positive");
            }
            span.addEvent("Validating currency");
            // More validation...
            span.setAttribute("rules.checked", 5);
        } finally {
            span.end();
        }
    }
}
Context Propagation
HTTP Headers
OpenTelemetry uses W3C Trace Context standard:
- traceparent: version, trace-id, parent-span-id, trace-flags
- tracestate: vendor-specific data
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
tracestate: rojo=00f067aa0ba902b7
RestTemplate Configuration
@Configuration
public class RestTemplateConfig {

    @Bean
    public RestTemplate restTemplate(RestTemplateBuilder builder) {
        // Build via RestTemplateBuilder so Spring Boot's auto-configured
        // instrumentation propagates trace context on outgoing calls --
        // no manual interceptor is needed
        return builder.build();
    }
}
Manual Propagation (if needed)
import io.opentelemetry.context.Context;
import io.opentelemetry.context.propagation.TextMapPropagator;
import org.springframework.http.HttpEntity;
import org.springframework.http.HttpHeaders;
import org.springframework.http.HttpMethod;
import org.springframework.stereotype.Service;
import org.springframework.web.client.RestTemplate;

@Service
public class ExternalServiceClient {

    private final TextMapPropagator propagator;
    private final RestTemplate restTemplate;

    public ExternalServiceClient(TextMapPropagator propagator, RestTemplate restTemplate) {
        this.propagator = propagator;
        this.restTemplate = restTemplate;
    }

    public void callExternalService(String url) {
        HttpHeaders headers = new HttpHeaders();
        // Inject the current trace context into the outgoing headers
        propagator.inject(Context.current(), headers, (carrier, key, value) -> carrier.add(key, value));
        // Make the HTTP request with the propagated headers
        restTemplate.exchange(url, HttpMethod.POST, new HttpEntity<>(headers), String.class);
    }
}
Span Attributes
Standard Attributes
// HTTP attributes
span.setAttribute("http.method", "POST");
span.setAttribute("http.url", "/api/payments");
span.setAttribute("http.status_code", 200);
// Database attributes
span.setAttribute("db.system", "postgresql");
span.setAttribute("db.name", "payments");
span.setAttribute("db.operation", "SELECT");
span.setAttribute("db.statement", "SELECT * FROM payments WHERE id = ?");
// Messaging attributes
span.setAttribute("messaging.system", "kafka");
span.setAttribute("messaging.destination", "payment.events");
span.setAttribute("messaging.operation", "publish");
Business Attributes
// Payment processing
span.setAttribute("payment.id", "PAY-123");
span.setAttribute("payment.amount", 100.00);
span.setAttribute("payment.currency", "USD");
span.setAttribute("user.id", "USER-456");
span.setAttribute("account.id", "ACC-789");
// Fraud detection
span.setAttribute("fraud.score", 0.05);
span.setAttribute("fraud.checked", true);
// Transaction details
span.setAttribute("transaction.id", "TXN-101");
span.setAttribute("transaction.type", "transfer");
Sampling Strategies
Always On (Development)
management:
  tracing:
    sampling:
      probability: 1.0  # record every trace
Probability-Based (Production)
management:
  tracing:
    sampling:
      probability: 0.1  # sample 10% of traces
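Probability-based sampling is deterministic on the trace ID: every service derives the same decision from the same ID, so a trace is never half-sampled across services. A rough sketch of the idea (illustrative only; OpenTelemetry's real TraceIdRatioBasedSampler compares the trace ID's low 64 bits against a precomputed threshold and handles edge cases this sketch ignores):

```java
public class RatioSamplingSketch {

    // Deterministic decision: compare the trace ID's low 64 bits against a
    // threshold derived from the ratio. Every service that sees the same
    // trace ID reaches the same verdict -- no coordination needed.
    // (Sketch only: e.g., it mishandles the Long.MIN_VALUE edge case.)
    static boolean shouldSample(String traceIdHex32, double ratio) {
        long low64 = Long.parseUnsignedLong(traceIdHex32.substring(16), 16);
        long bound = (long) (ratio * Long.MAX_VALUE);
        return Math.abs(low64) < bound;
    }

    public static void main(String[] args) {
        String traceId = "0af7651916cd43dd8448eb211c80319c";
        // Same ID, same ratio => always the same decision
        System.out.println(shouldSample(traceId, 0.1) == shouldSample(traceId, 0.1)); // prints true
    }
}
```

This is why ratio sampling can be decided at the root ("head-based") and simply propagated via the trace flags, rather than re-rolled in each service.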
Custom Sampling
import java.util.List;

import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.trace.SpanKind;
import io.opentelemetry.context.Context;
import io.opentelemetry.sdk.trace.data.LinkData;
import io.opentelemetry.sdk.trace.samplers.Sampler;
import io.opentelemetry.sdk.trace.samplers.SamplingResult;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class TracingConfig {

    @Bean
    public Sampler sampler() {
        // Respect the parent's decision; apply custom logic only for root spans
        return Sampler.parentBased(new CustomSampler());
    }

    private static class CustomSampler implements Sampler {

        @Override
        public SamplingResult shouldSample(
                Context parentContext,
                String traceId,
                String name,
                SpanKind spanKind,
                Attributes attributes,
                List<LinkData> parentLinks) {
            // Note: this runs at span *start*, so it only sees attributes set at
            // creation time. Sampling by eventual outcome (e.g., errors discovered
            // later) requires tail-based sampling in a collector.
            if (attributes.get(AttributeKey.stringKey("error")) != null) {
                return SamplingResult.recordAndSample();
            }
            // Always sample high-value transactions
            Double amount = attributes.get(AttributeKey.doubleKey("payment.amount"));
            if (amount != null && amount > 10000) {
                return SamplingResult.recordAndSample();
            }
            // 10% sampling for everything else
            return Math.random() < 0.1
                    ? SamplingResult.recordAndSample()
                    : SamplingResult.drop();
        }

        @Override
        public String getDescription() {
            return "CustomSampler";
        }
    }
}
Jaeger Setup
Docker Compose
version: '3'
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"   # Jaeger UI
      - "4317:4317"     # OTLP gRPC
      - "4318:4318"     # OTLP HTTP (used by the configs in this guide)
      - "14250:14250"   # Collector gRPC (legacy Jaeger protocol)
      - "14268:14268"   # Collector HTTP (legacy Jaeger protocol)
      - "6831:6831/udp" # Agent compact thrift (legacy)
    environment:
      - COLLECTOR_ZIPKIN_HOST_PORT=:9411
      - COLLECTOR_OTLP_ENABLED=true  # default in recent images; explicit for older ones
Accessing Jaeger UI
- Open http://localhost:16686
- Select service (e.g., payment-service)
- Search traces by:
- Operation
- Tags (payment.id, user.id)
- Duration
- Time range
Correlating Traces with Logs
MDC Integration
import java.io.IOException;

import io.opentelemetry.api.trace.Span;
import jakarta.servlet.Filter;
import jakarta.servlet.FilterChain;
import jakarta.servlet.ServletException;
import jakarta.servlet.ServletRequest;
import jakarta.servlet.ServletResponse;
import org.slf4j.MDC;
import org.springframework.stereotype.Component;

@Component
public class TraceLoggingFilter implements Filter {

    @Override
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        Span currentSpan = Span.current();
        String traceId = currentSpan.getSpanContext().getTraceId();
        String spanId = currentSpan.getSpanContext().getSpanId();

        // Add to MDC so every log line in this request carries the IDs
        MDC.put("traceId", traceId);
        MDC.put("spanId", spanId);
        try {
            chain.doFilter(request, response);
        } finally {
            MDC.remove("traceId");
            MDC.remove("spanId");
        }
    }
}
Log Pattern with Trace IDs
<!-- logback-spring.xml -->
<pattern>
%d{ISO8601} [%thread] %-5level %logger{36} - [traceId=%X{traceId} spanId=%X{spanId}] - %msg%n
</pattern>
Example Log Output
2025-01-28 10:15:30 [http-nio-8080-exec-1] INFO c.b.p.PaymentService - [traceId=0af7651916cd43dd8448eb211c80319c spanId=b7ad6b7169203331] - Processing payment PAY-123
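Pivoting from a log line back to its trace (e.g., to open it in Jaeger) is just a matter of extracting the 32-hex-char trace ID. A small sketch against the log pattern above (`TraceIdExtractor` is a hypothetical helper):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TraceIdExtractor {

    // Matches the "traceId=<32 lowercase hex chars>" token from the log pattern
    private static final Pattern TRACE_ID = Pattern.compile("traceId=([0-9a-f]{32})");

    static String extractTraceId(String logLine) {
        Matcher m = TRACE_ID.matcher(logLine);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String line = "2025-01-28 10:15:30 [http-nio-8080-exec-1] INFO c.b.p.PaymentService"
                + " - [traceId=0af7651916cd43dd8448eb211c80319c spanId=b7ad6b7169203331]"
                + " - Processing payment PAY-123";
        System.out.println(extractTraceId(line)); // prints 0af7651916cd43dd8448eb211c80319c
    }
}
```

In practice your log aggregator does this parsing for you; structured (JSON) logs make the trace ID a first-class field instead of a regex target.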
Node.js / TypeScript Tracing
OpenTelemetry Setup
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

// The dedicated Jaeger exporter package is deprecated; Jaeger ingests OTLP directly
const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: 'http://jaeger:4318/v1/traces'
  }),
  instrumentations: [getNodeAutoInstrumentations()],
  serviceName: 'payment-service'
});

sdk.start();
Custom Spans
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('payment-service');

export class PaymentService {
  async processPayment(payment: Payment): Promise<PaymentResult> {
    const span = tracer.startSpan('process.payment');
    span.setAttributes({
      'payment.id': payment.id,
      'payment.amount': payment.amount,
      'payment.currency': payment.currency
    });

    try {
      const result = await this.executePayment(payment);
      span.setAttribute('result.status', result.status);
      return result;
    } catch (error) {
      span.recordException(error as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw error;
    } finally {
      span.end();
    }
  }
}
Best Practices
Span Naming
// GOOD: Descriptive operation names
"POST /api/payments"
"validate.payment.rules"
"db.query.accounts.findById"
"kafka.publish.payment.events"
// BAD: Vague names
"process"
"handle"
"do"
Error Recording
try {
processPayment(payment);
} catch (Exception e) {
span.recordException(e);
span.setStatus(StatusCode.ERROR, "Payment processing failed");
throw e;
}
Avoid Sensitive Data
// BAD: Logging sensitive data
span.setAttribute("card.number", cardNumber);
span.setAttribute("password", password);
// GOOD: Mask or omit
span.setAttribute("card.last4", cardNumber.substring(cardNumber.length() - 4));
Further Reading
Internal Documentation
- Observability Logging - Structured logging
- Observability Metrics - Application metrics
- Spring Boot Observability - Spring Boot integration
Summary
Distributed tracing completes the observability triad. While metrics aggregate data ("P95 latency is 850ms") and logs provide detailed events ("query timeout after 30s"), tracing visualizes the complete request journey ("request spent 1.2s in database, 800ms in fraud service").
Key Takeaways
- Tracing reconstructs distributed call paths: In microservices, a single user action crosses multiple services. Tracing rebuilds that path like a stack trace does for monoliths.
- Traces are directed acyclic graphs of spans: A trace contains spans (operations) arranged in a parent-child hierarchy. Each span measures one operation (HTTP call, DB query, method execution).
- Trace context propagates via standardized headers: The W3C Trace Context standard defines how trace ID, span ID, and sampling flags flow through HTTP headers across service boundaries.
- OpenTelemetry is the vendor-neutral standard: Like Micrometer for metrics or SLF4J for logging, OpenTelemetry provides instrumentation APIs that work with multiple backends (Jaeger, Zipkin, Datadog, New Relic).
- Automatic instrumentation handles common scenarios: OpenTelemetry agents automatically trace HTTP requests, database calls, Redis operations, and Kafka messaging without code changes.
- Custom spans add business context: While auto-instrumentation captures infrastructure, custom spans track business operations ("validate payment," "calculate risk score") with relevant attributes.
- Span attributes enable filtering and analysis: Adding metadata (user ID, resource IDs, business values) to spans allows querying traces by business criteria, not just technical dimensions.
- Sampling balances cost and visibility: Trace storage is expensive. Sample 100% of errors but only 5-10% of successes. Head-based sampling decides at trace start, tail-based decides after seeing outcomes.
- Correlation with logs provides complete context: Add trace ID and span ID to log MDC. When investigating a slow trace, you can query logs for detailed context about specific operations.
- Jaeger visualizes traces for human analysis: Raw trace data is JSON. Jaeger (or Zipkin, Tempo) provides waterfall diagrams showing span hierarchy and duration, making bottlenecks obvious.
Relationship to Other Observability Pillars
Tracing is most powerful when combined with the other pillars:
- Metrics → Tracing: Metrics alert "P95 latency > 1s," then you search traces for slow requests to understand why
- Tracing → Logs: Trace shows database operation took 5s, logs show the specific query and error message
- Complete workflow:
  1. Metrics dashboard shows latency spike at 10:15 AM
  2. Trace query finds slow traces at that time
  3. Trace visualization reveals bottleneck is in account-service database call
  4. Logs (filtered by trace ID) show "connection pool exhausted"
  5. Fix: Increase database connection pool size
Trade-offs to consider:
- Trace storage is more expensive than metrics (per-request data vs. aggregates)
- Sampling means you don't have every trace (but that's okay for most investigations)
- Instrumentation overhead is low but not zero (~1-2ms per trace)
OpenTelemetry vs. Proprietary Solutions
OpenTelemetry advantages:
- Open standard, no vendor lock-in
- Works with 40+ backends
- Strong community and support
- Auto-instrumentation for common libraries
Proprietary advantages (Datadog APM, New Relic, etc.):
- Tighter integration with their platforms
- Advanced features (profiling, deployment tracking)
- Hosted infrastructure (no Jaeger to manage)
Recommendation: Start with OpenTelemetry. You can always send traces to proprietary backends later without changing application code.
Next Steps:
- Review Logging Best Practices to learn how to correlate trace IDs with logs
- Read Application Metrics to understand how metrics complement tracing
- See framework implementations in Spring Boot Observability
- Explore OpenTelemetry documentation for advanced instrumentation patterns