Distributed Tracing Best Practices
Overview
Distributed tracing is the third pillar of observability, providing visibility into how requests flow through distributed systems. While metrics answer "how many/how fast?" and logs provide "what happened?" detail, tracing answers "where did the request go?" and "where did time get spent?"
In monolithic applications, a stack trace tells you the call path. In microservices, a single user action might traverse 10+ services - tracing reconstructs that distributed call path.
What distributed tracing provides:
- Request flow visualization: See the complete path a request takes through your system
- Latency breakdown: Understand which services/operations contribute to total response time
- Bottleneck identification: Pinpoint slow database queries, external API calls, or inefficient code
- Dependency mapping: Visualize how services call each other
- Error attribution: Identify exactly where failures occur in the call chain
- Performance regression detection: Compare trace patterns over time to spot degradations
This guide covers tracing fundamentals, OpenTelemetry instrumentation, trace context propagation, and integration with Jaeger for visualization.
Tracing vs. Metrics vs. Logs
Understanding when tracing provides unique value:
| Aspect | Distributed Tracing | Metrics | Logs |
|---|---|---|---|
| Question | "Where did time go?" | "How much/how fast?" | "What happened?" |
| Structure | Directed acyclic graph (DAG) of spans | Time-series numbers | Text/JSON events |
| Scope | Single request across services | Aggregate across all requests | Individual events |
| Use case | Latency investigation, bottleneck ID | Alerting, trend analysis | Debugging, audit |
| Example | "Request took 2.5s: 1s in API gateway, 1.2s in database, 0.3s in cache" | "P95 latency: 850ms" | "DB query failed: timeout after 30s" |
When to use tracing:
- Investigating why specific requests are slow
- Understanding service dependencies
- Debugging cross-service issues
- Performance optimization (finding where time is spent)
Tracing complements but doesn't replace:
- Metrics for alerting (trace all requests, but metrics tell you when to investigate)
- Logs for detailed why (traces show the slow service, logs show the specific error)
Core Principles
- Trace Every Request: Instrument all services to capture complete request flows
- Propagate Context: Pass trace and span IDs across service boundaries via headers
- Meaningful Span Names: Use descriptive names that identify operations (GET /api/users, db.query.users.findById)
- Capture Key Attributes: Add relevant context (user ID, resource IDs, response codes) as span attributes
- Sample Intelligently: Balance visibility with cost by sampling strategically (always sample errors, sample % of success)
- Correlate with Logs: Use trace IDs in logs to link detailed context with trace visualization
Tracing Concepts
Before implementing tracing, you need to understand its fundamental building blocks.
Trace
A trace represents the complete journey of a single request through your distributed system. It's identified by a unique trace ID that stays constant as the request flows through services.
Think of it as: A story with a beginning (user clicks button) and end (response displayed), covering everything that happened in between across all services.
Key characteristics:
- Has a unique trace ID (e.g., abc-123-xyz-789)
- Spans multiple services
- Contains a hierarchy of spans
- Has a duration (total time from start to finish)
Span
A span represents a single operation within a trace - one unit of work. Each span has its own unique span ID and knows its parent span ID.
Think of it as: A chapter in the trace story. Each HTTP call, database query, or significant method execution gets its own span.
Key characteristics:
- Has a unique span ID
- References parent span ID (except root span)
- Has an operation name (e.g., HTTP GET /api/users)
- Records start time and duration
- Contains attributes (metadata about the operation)
- Can record events (discrete moments like "cache hit" or "retrying")
- Has status (OK, ERROR)
Span types:
- Client span: Outgoing call to another service
- Server span: Incoming request from another service
- Internal span: Operation within your service (database query, cache lookup)
Trace Context
Trace context is the metadata that must be propagated across service boundaries to maintain the trace. It's passed via HTTP headers (or message headers for async systems).
Required context:
- Trace ID: Identifies the overall trace
- Span ID: Identifies the current span (becomes parent span ID for next service)
- Trace flags: Sampling decision, debug flags
Standard: W3C Trace Context
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
             └┘ └──────────── trace-id ─────────┘ └── span-id ───┘ └┘
          version         32 hex chars               16 hex chars  flags
          (2 hex)                                                 (2 hex)
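The dash-separated fields above are straightforward to pull apart. Here is a minimal sketch (the `TraceparentParser` class and its `parse` method are hypothetical helpers, not part of any library):

```java
// Hypothetical helper that splits a W3C traceparent header into its four fields.
public class TraceparentParser {

    public record Traceparent(String version, String traceId, String spanId, String flags) {}

    public static Traceparent parse(String header) {
        // Format: version-traceid-spanid-flags (all lowercase hex, no dashes inside fields)
        String[] parts = header.split("-");
        if (parts.length != 4) {
            throw new IllegalArgumentException("Expected version-traceid-spanid-flags: " + header);
        }
        return new Traceparent(parts[0], parts[1], parts[2], parts[3]);
    }

    public static void main(String[] args) {
        Traceparent tp = parse("00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01");
        System.out.println(tp.traceId()); // prints 0af7651916cd43dd8448eb211c80319c
    }
}
```

The receiving service uses the incoming span-id as the parent of the server span it creates, which is how the hierarchy stays connected across the network hop.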
Trace Hierarchy Example
Here's how spans nest to form a complete trace:
Trace: Process Payment (trace_id: abc-123, total: 2.5s)
│
├── Span: [SERVER] HTTP POST /api/payments (span_id: 1, duration: 2.5s)
│ │
│ ├── Span: [INTERNAL] Validate Payment (span_id: 2, parent: 1, duration: 50ms)
│ │ └── Event: "Validation passed"
│ │
│ ├── Span: [INTERNAL] Check Balance (span_id: 3, parent: 1, duration: 1.2s)
│ │ └── Span: [CLIENT] DB Query accounts.findById (span_id: 4, parent: 3, duration: 1.15s)
│ │ └── Attributes: db.system=postgresql, db.statement=SELECT...
│ │
│ ├── Span: [CLIENT] HTTP POST /fraud-service/check (span_id: 5, parent: 1, duration: 800ms)
│ │ └── Span: [SERVER] POST /check (span_id: 6, parent: 5, duration: 790ms)
│ │ └── Span: [INTERNAL] ML Model Inference (span_id: 7, parent: 6, duration: 750ms)
│ │ └── Attributes: model.version=v2.1, fraud.score=0.05
│ │
│ └── Span: [INTERNAL] Process Transaction (span_id: 8, parent: 1, duration: 400ms)
│ └── Span: [CLIENT] DB Insert transactions (span_id: 9, parent: 8, duration: 380ms)
│ └── Attributes: db.system=postgresql, db.statement=INSERT...
What this visualization reveals:
- Total request took 2.5s
- Largest bottleneck: Database query took 1.15s (46% of total time)
- Fraud check took 800ms (32% of total time)
- Validation was fast (50ms)
- Operations happened sequentially (no parallelism opportunities visible)
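The percentages above are just span duration divided by total trace duration. A quick sketch of that arithmetic, using the numbers from the trace (the `spanShares` helper is hypothetical, for illustration only):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SpanShares {

    // Percentage of total trace time spent in each operation
    static Map<String, Double> spanShares(Map<String, Long> durationsMs, long traceTotalMs) {
        Map<String, Double> shares = new LinkedHashMap<>();
        durationsMs.forEach((op, ms) -> shares.put(op, 100.0 * ms / traceTotalMs));
        return shares;
    }

    public static void main(String[] args) {
        Map<String, Long> spans = new LinkedHashMap<>();
        spans.put("db.query.accounts.findById", 1150L); // the database query
        spans.put("fraud-service/check", 800L);         // the fraud check
        spans.put("validate.payment", 50L);             // validation
        Map<String, Double> shares = spanShares(spans, 2500L);
        System.out.println(shares.get("db.query.accounts.findById")); // prints 46.0
    }
}
```

Trace backends compute exactly this kind of breakdown for you, but it is worth knowing the numbers come straight from span start/end timestamps, nothing more exotic.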
Spring Boot with OpenTelemetry
Dependencies
// build.gradle — Spring Boot 3.x includes Micrometer Tracing which wraps OpenTelemetry
implementation 'org.springframework.boot:spring-boot-starter-actuator'
implementation 'io.micrometer:micrometer-tracing-bridge-otel' // OTel bridge
implementation 'io.opentelemetry:opentelemetry-exporter-otlp' // OTLP exporter (Jaeger, Grafana Tempo, etc.)
Configuration
# application.yml — matches the Micrometer Tracing + OTLP exporter dependencies above
spring:
  application:
    name: payment-service  # used as the service name on exported spans
management:
  tracing:
    sampling:
      probability: 1.0  # trace everything in dev; reduce in production
  otlp:
    tracing:
      endpoint: http://jaeger:4318/v1/traces  # Jaeger accepts OTLP over HTTP on 4318
  opentelemetry:
    resource-attributes:  # requires Spring Boot 3.2+
      environment: ${spring.profiles.active}
      service.version: 1.0.0
Automatic Instrumentation
Spring Boot with OpenTelemetry automatically traces:
- HTTP requests/responses
- Database queries (JDBC)
- Redis operations
- Kafka producers/consumers
- RestTemplate/WebClient calls
Custom Spans
import java.math.BigDecimal;

import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;
import org.springframework.stereotype.Service;

@Service
public class PaymentService {

    private final Tracer tracer;

    public PaymentService(Tracer tracer) {
        this.tracer = tracer;
    }

    public PaymentResult processPayment(Payment payment) {
        // Enrich the current span (created automatically by Spring for the HTTP request)
        Span currentSpan = Span.current();
        currentSpan.setAttribute("payment.id", payment.getId());
        currentSpan.setAttribute("payment.amount", payment.getAmount().doubleValue());
        currentSpan.setAttribute("payment.currency", payment.getCurrency());

        // Create a custom span for the validation step
        Span validationSpan = tracer.spanBuilder("validate.payment").startSpan();
        try (Scope scope = validationSpan.makeCurrent()) {
            validatePayment(payment);
            validationSpan.setAttribute("validation.result", "success");
        } catch (ValidationException e) {
            validationSpan.recordException(e);
            validationSpan.setAttribute("validation.result", "failed");
            throw e;
        } finally {
            validationSpan.end();
        }

        // Process payment logic
        return executePayment(payment);
    }

    private void validatePayment(Payment payment) {
        Span span = tracer.spanBuilder("validate.payment.rules").startSpan();
        try (Scope scope = span.makeCurrent()) {
            span.addEvent("Validating amount");
            if (payment.getAmount().compareTo(BigDecimal.ZERO) <= 0) {
                span.setAttribute("error", "Invalid amount");
                throw new ValidationException("Amount must be positive");
            }
            span.addEvent("Validating currency");
            // More validation...
            span.setAttribute("rules.checked", 5);
        } finally {
            span.end();
        }
    }
}
Context Propagation
HTTP Headers
OpenTelemetry uses W3C Trace Context standard:
- traceparent: version, trace-id, parent-span-id, trace-flags
- tracestate: vendor-specific data
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
tracestate: rojo=00f067aa0ba902b7
RestTemplate Configuration
@Configuration
public class RestTemplateConfig {

    @Bean
    public RestTemplate restTemplate(RestTemplateBuilder builder) {
        // Build via RestTemplateBuilder so Spring Boot's auto-configured
        // instrumentation propagates trace context on outgoing calls --
        // no manual interceptor is needed
        return builder.build();
    }
}
Manual Propagation (if needed)
import io.opentelemetry.context.Context;
import io.opentelemetry.context.propagation.TextMapPropagator;
import org.springframework.http.HttpEntity;
import org.springframework.http.HttpHeaders;
import org.springframework.http.HttpMethod;
import org.springframework.stereotype.Service;
import org.springframework.web.client.RestTemplate;

@Service
public class ExternalServiceClient {

    private final TextMapPropagator propagator;
    private final RestTemplate restTemplate;

    public ExternalServiceClient(TextMapPropagator propagator, RestTemplate restTemplate) {
        this.propagator = propagator;
        this.restTemplate = restTemplate;
    }

    public void callExternalService(String url) {
        HttpHeaders headers = new HttpHeaders();
        // Inject the current trace context into the outgoing headers
        propagator.inject(Context.current(), headers, (carrier, key, value) -> carrier.add(key, value));
        // Make the HTTP request with the propagated headers
        restTemplate.exchange(url, HttpMethod.POST, new HttpEntity<>(headers), String.class);
    }
}
Span Attributes
Standard Attributes
// HTTP attributes
span.setAttribute("http.method", "POST");
span.setAttribute("http.url", "/api/payments");
span.setAttribute("http.status_code", 200);
// Database attributes
span.setAttribute("db.system", "postgresql");
span.setAttribute("db.name", "payments");
span.setAttribute("db.operation", "SELECT");
span.setAttribute("db.statement", "SELECT * FROM payments WHERE id = ?");
// Messaging attributes
span.setAttribute("messaging.system", "kafka");
span.setAttribute("messaging.destination", "payment.events");
span.setAttribute("messaging.operation", "publish");
Business Attributes
// Payment processing
span.setAttribute("payment.id", "PAY-123");
span.setAttribute("payment.amount", 100.00);
span.setAttribute("payment.currency", "USD");
span.setAttribute("user.id", "USER-456");
span.setAttribute("account.id", "ACC-789");
// Fraud detection
span.setAttribute("fraud.score", 0.05);
span.setAttribute("fraud.checked", true);
// Transaction details
span.setAttribute("transaction.id", "TXN-101");
span.setAttribute("transaction.type", "transfer");
Sampling Strategies
Always On (Development)
management:
  tracing:
    sampling:
      probability: 1.0  # record every trace
Probability-Based (Production)
management:
  tracing:
    sampling:
      probability: 0.1  # sample 10% of traces
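Probability-based sampling is deterministic on the trace ID: every service derives the same decision from the same ID, so a trace is never half-sampled across services. A rough sketch of the idea (illustrative only; OpenTelemetry's real TraceIdRatioBasedSampler compares the trace ID's low 64 bits against a precomputed threshold and handles edge cases this sketch ignores):

```java
public class RatioSamplingSketch {

    // Deterministic decision: compare the trace ID's low 64 bits against a
    // threshold derived from the ratio. Every service that sees the same
    // trace ID reaches the same verdict -- no coordination needed.
    // (Sketch only: e.g., it mishandles the Long.MIN_VALUE edge case.)
    static boolean shouldSample(String traceIdHex32, double ratio) {
        long low64 = Long.parseUnsignedLong(traceIdHex32.substring(16), 16);
        long bound = (long) (ratio * Long.MAX_VALUE);
        return Math.abs(low64) < bound;
    }

    public static void main(String[] args) {
        String traceId = "0af7651916cd43dd8448eb211c80319c";
        // Same ID, same ratio => always the same decision
        System.out.println(shouldSample(traceId, 0.1) == shouldSample(traceId, 0.1)); // prints true
    }
}
```

This is why ratio sampling can be decided at the root ("head-based") and simply propagated via the trace flags, rather than re-rolled in each service.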
Custom Sampling
import java.util.List;

import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.trace.SpanKind;
import io.opentelemetry.context.Context;
import io.opentelemetry.sdk.trace.data.LinkData;
import io.opentelemetry.sdk.trace.samplers.Sampler;
import io.opentelemetry.sdk.trace.samplers.SamplingResult;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class TracingConfig {

    @Bean
    public Sampler sampler() {
        // Respect the parent's decision; apply custom logic only for root spans
        return Sampler.parentBased(new CustomSampler());
    }

    private static class CustomSampler implements Sampler {

        @Override
        public SamplingResult shouldSample(
                Context parentContext,
                String traceId,
                String name,
                SpanKind spanKind,
                Attributes attributes,
                List<LinkData> parentLinks) {
            // Note: this runs at span *start*, so it only sees attributes set at
            // creation time. Sampling by eventual outcome (e.g., errors discovered
            // later) requires tail-based sampling in a collector.
            if (attributes.get(AttributeKey.stringKey("error")) != null) {
                return SamplingResult.recordAndSample();
            }
            // Always sample high-value transactions
            Double amount = attributes.get(AttributeKey.doubleKey("payment.amount"));
            if (amount != null && amount > 10000) {
                return SamplingResult.recordAndSample();
            }
            // 10% sampling for everything else
            return Math.random() < 0.1
                    ? SamplingResult.recordAndSample()
                    : SamplingResult.drop();
        }

        @Override
        public String getDescription() {
            return "CustomSampler";
        }
    }
}
Jaeger Setup
Docker Compose
version: '3'
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"   # Jaeger UI
      - "4317:4317"     # OTLP gRPC
      - "4318:4318"     # OTLP HTTP (used by the configs in this guide)
      - "14250:14250"   # Collector gRPC (legacy Jaeger protocol)
      - "14268:14268"   # Collector HTTP (legacy Jaeger protocol)
      - "6831:6831/udp" # Agent compact thrift (legacy)
    environment:
      - COLLECTOR_ZIPKIN_HOST_PORT=:9411
      - COLLECTOR_OTLP_ENABLED=true  # default in recent images; explicit for older ones
Accessing Jaeger UI
- Open http://localhost:16686
- Select service (e.g., payment-service)
- Search traces by:
- Operation
- Tags (payment.id, user.id)
- Duration
- Time range
Correlating Traces with Logs
MDC Integration
import java.io.IOException;

import io.opentelemetry.api.trace.Span;
import jakarta.servlet.Filter;
import jakarta.servlet.FilterChain;
import jakarta.servlet.ServletException;
import jakarta.servlet.ServletRequest;
import jakarta.servlet.ServletResponse;
import org.slf4j.MDC;
import org.springframework.stereotype.Component;

@Component
public class TraceLoggingFilter implements Filter {

    @Override
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        Span currentSpan = Span.current();
        String traceId = currentSpan.getSpanContext().getTraceId();
        String spanId = currentSpan.getSpanContext().getSpanId();

        // Add to MDC so every log line in this request carries the IDs
        MDC.put("traceId", traceId);
        MDC.put("spanId", spanId);
        try {
            chain.doFilter(request, response);
        } finally {
            MDC.remove("traceId");
            MDC.remove("spanId");
        }
    }
}
Log Pattern with Trace IDs
<!-- logback-spring.xml -->
<pattern>
%d{ISO8601} [%thread] %-5level %logger{36} - [traceId=%X{traceId} spanId=%X{spanId}] - %msg%n
</pattern>
Example Log Output
2025-01-28 10:15:30 [http-nio-8080-exec-1] INFO c.b.p.PaymentService - [traceId=0af7651916cd43dd8448eb211c80319c spanId=b7ad6b7169203331] - Processing payment PAY-123
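Pivoting from a log line back to its trace (e.g., to open it in Jaeger) is just a matter of extracting the 32-hex-char trace ID. A small sketch against the log pattern above (`TraceIdExtractor` is a hypothetical helper):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TraceIdExtractor {

    // Matches the "traceId=<32 lowercase hex chars>" token from the log pattern
    private static final Pattern TRACE_ID = Pattern.compile("traceId=([0-9a-f]{32})");

    static String extractTraceId(String logLine) {
        Matcher m = TRACE_ID.matcher(logLine);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String line = "2025-01-28 10:15:30 [http-nio-8080-exec-1] INFO c.b.p.PaymentService"
                + " - [traceId=0af7651916cd43dd8448eb211c80319c spanId=b7ad6b7169203331]"
                + " - Processing payment PAY-123";
        System.out.println(extractTraceId(line)); // prints 0af7651916cd43dd8448eb211c80319c
    }
}
```

In practice your log aggregator does this parsing for you; structured (JSON) logs make the trace ID a first-class field instead of a regex target.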
Node.js / TypeScript Tracing
OpenTelemetry Setup
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

// The dedicated Jaeger exporter package is deprecated; Jaeger ingests OTLP directly
const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: 'http://jaeger:4318/v1/traces'
  }),
  instrumentations: [getNodeAutoInstrumentations()],
  serviceName: 'payment-service'
});

sdk.start();
Custom Spans
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('payment-service');

export class PaymentService {
  async processPayment(payment: Payment): Promise<PaymentResult> {
    const span = tracer.startSpan('process.payment');
    span.setAttributes({
      'payment.id': payment.id,
      'payment.amount': payment.amount,
      'payment.currency': payment.currency
    });

    try {
      const result = await this.executePayment(payment);
      span.setAttribute('result.status', result.status);
      return result;
    } catch (error) {
      span.recordException(error as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw error;
    } finally {
      span.end();
    }
  }
}
Best Practices
Span Naming
// GOOD: Descriptive operation names
"POST /api/payments"
"validate.payment.rules"
"db.query.accounts.findById"
"kafka.publish.payment.events"
// BAD: Vague names
"process"
"handle"
"do"
Error Recording
try {
processPayment(payment);
} catch (Exception e) {
span.recordException(e);
span.setStatus(StatusCode.ERROR, "Payment processing failed");
throw e;
}
Avoid Sensitive Data
// BAD: Logging sensitive data
span.setAttribute("card.number", cardNumber);
span.setAttribute("password", password);
// GOOD: Mask or omit
span.setAttribute("card.last4", cardNumber.substring(cardNumber.length() - 4));
Further Reading
Internal Documentation
- Observability Logging - Structured logging
- Observability Metrics - Application metrics
- Spring Boot Observability - Spring Boot integration
Summary
Distributed tracing completes the observability triad. While metrics aggregate data ("P95 latency is 850ms") and logs provide detailed events ("query timeout after 30s"), tracing visualizes the complete request journey ("request spent 1.2s in database, 800ms in fraud service").
Key Takeaways
- Tracing reconstructs distributed call paths: In microservices, a single user action crosses multiple services. Tracing rebuilds that path like a stack trace does for monoliths.
- Traces are directed acyclic graphs of spans: A trace contains spans (operations) arranged in a parent-child hierarchy. Each span measures one operation (HTTP call, DB query, method execution).
- Trace context propagates via standardized headers: The W3C Trace Context standard defines how trace ID, span ID, and sampling flags flow through HTTP headers across service boundaries.
- OpenTelemetry is the vendor-neutral standard: Like Micrometer for metrics or SLF4J for logging, OpenTelemetry provides instrumentation APIs that work with multiple backends (Jaeger, Zipkin, Datadog, New Relic).
- Automatic instrumentation handles common scenarios: OpenTelemetry agents automatically trace HTTP requests, database calls, Redis operations, and Kafka messaging without code changes.
- Custom spans add business context: While auto-instrumentation captures infrastructure, custom spans track business operations ("validate payment," "calculate risk score") with relevant attributes.
- Span attributes enable filtering and analysis: Adding metadata (user ID, resource IDs, business values) to spans allows querying traces by business criteria, not just technical dimensions.
- Sampling balances cost and visibility: Trace storage is expensive. Sample 100% of errors but only 5-10% of successes. Head-based sampling decides at trace start, tail-based decides after seeing outcomes.
- Correlation with logs provides complete context: Add trace ID and span ID to log MDC. When investigating a slow trace, you can query logs for detailed context about specific operations.
- Jaeger visualizes traces for human analysis: Raw trace data is JSON. Jaeger (or Zipkin, Tempo) provides waterfall diagrams showing span hierarchy and duration, making bottlenecks obvious.
Relationship to Other Observability Pillars
Tracing is most powerful when combined with the other pillars:
- Metrics → Tracing: Metrics alert "P95 latency > 1s," then you search traces for slow requests to understand why
- Tracing → Logs: Trace shows database operation took 5s, logs show the specific query and error message
- Complete workflow:
  1. Metrics dashboard shows latency spike at 10:15 AM
  2. Trace query finds slow traces at that time
  3. Trace visualization reveals bottleneck is in account-service database call
  4. Logs (filtered by trace ID) show "connection pool exhausted"
  5. Fix: Increase database connection pool size
Trade-offs to consider:
- Trace storage is more expensive than metrics (per-request data vs. aggregates)
- Sampling means you don't have every trace (but that's okay for most investigations)
- Instrumentation overhead is low but not zero (~1-2ms per trace)
OpenTelemetry vs. Proprietary Solutions
OpenTelemetry advantages:
- Open standard, no vendor lock-in
- Works with 40+ backends
- Strong community and support
- Auto-instrumentation for common libraries
Proprietary advantages (Datadog APM, New Relic, etc.):
- Tighter integration with their platforms
- Advanced features (profiling, deployment tracking)
- Hosted infrastructure (no Jaeger to manage)
Recommendation: Start with OpenTelemetry. You can always send traces to proprietary backends later without changing application code.
Next Steps:
- Review Logging Best Practices to learn how to correlate trace IDs with logs
- Read Application Metrics to understand how metrics complement tracing
- See framework implementations in Spring Boot Observability
- Explore OpenTelemetry documentation for advanced instrumentation patterns