AWS Observability

Implementing comprehensive observability on AWS using CloudWatch and X-Ray for logs, metrics, and distributed tracing.

Overview

AWS provides a comprehensive observability stack through CloudWatch (logs and metrics), X-Ray (distributed tracing), and specialized services like Container Insights for containerized workloads. This guide covers how to instrument your applications, configure monitoring infrastructure, and build effective observability practices on AWS.

For foundational observability concepts, see the Observability Overview. For Spring Boot-specific implementation details, see Spring Boot Observability. This guide focuses on AWS-specific integration patterns and services.


Core Principles

  • Structured logging: Use JSON format with consistent field names for efficient querying via CloudWatch Logs Insights
  • Correlation IDs: Propagate trace context through all AWS services and application layers for end-to-end traceability
  • Cost-aware sampling: Implement intelligent sampling for traces and verbose logs to manage CloudWatch costs
  • Actionable alarms: Create alarms based on business metrics and user-impacting symptoms, not just infrastructure thresholds
  • Centralized observability: Aggregate logs and metrics across accounts and regions for unified visibility

CloudWatch Logs

CloudWatch Logs is AWS's centralized logging service. Applications, AWS services, and infrastructure all emit logs to CloudWatch for storage, search, and analysis.

Log Groups and Streams

CloudWatch organizes logs into log groups (logical containers like /aws/ecs/my-service) and log streams (individual sources like container instances or Lambda executions).

Structure:

Log Group: /aws/ecs/payment-service
├─ Log Stream: task/abc123-container-1
├─ Log Stream: task/abc123-container-2
└─ Log Stream: task/xyz789-container-1

Each log event within a stream has:

  • Timestamp: When the event occurred
  • Message: The actual log content (text or JSON)
  • Ingestion time: When CloudWatch received it

Best practices:

  • Use hierarchical naming: /company/environment/service (e.g., /acme/prod/payment-api)
  • Separate log groups by environment to avoid production data contamination
  • Use consistent naming across teams for easier cross-service querying

Structured Logging with JSON

CloudWatch Logs Insights can parse and query JSON logs much more efficiently than plain text. Send logs as JSON with consistent field names.

Plain text log (harder to query):

2025-01-15T10:30:45 INFO Payment created for customer 12345 amount 100.50 USD

JSON structured log (queryable):

{
  "timestamp": "2025-01-15T10:30:45.123Z",
  "level": "INFO",
  "message": "Payment created",
  "customerId": "12345",
  "paymentId": "pay-67890",
  "amount": 100.50,
  "currency": "USD",
  "correlationId": "abc-123-def"
}

With JSON logs, you can run queries like:

fields @timestamp, customerId, amount
| filter amount > 1000
| stats sum(amount) by customerId

For implementation details on structured logging in Spring Boot, see Logging Guidelines.

Log Retention and Lifecycle

By default, logs never expire. This can become expensive quickly. Set appropriate retention periods based on compliance requirements and usage patterns.

Common retention policies:

  • Development logs: 3-7 days (short-term debugging)
  • Production application logs: 30-90 days (operational troubleshooting)
  • Audit logs: 1-7 years (compliance requirements like PCI-DSS, SOX)
  • Access logs: 90-365 days (security analysis)

Cost optimization:

  • Short retention for high-volume, low-value logs (debug logs, health check requests)
  • Export older logs to S3 for archival at 1/10th the cost
  • Use subscription filters to stream logs to S3/Kinesis instead of long retention

Setting retention via Terraform:

resource "aws_cloudwatch_log_group" "payment_service" {
  name              = "/aws/ecs/payment-service"
  retention_in_days = 30 # Automatically delete after 30 days

  tags = {
    Environment = "production"
    Service     = "payment-service"
  }
}

See Terraform Guidelines for infrastructure as code patterns.

CloudWatch Logs Insights

Logs Insights is CloudWatch's built-in query language for analyzing log data. It provides SQL-like syntax for filtering, aggregating, and visualizing logs.

Query anatomy:

fields @timestamp, level, message, correlationId    # Select fields to display
| filter level = "ERROR" # Filter conditions
| filter @message like /timeout|connection/ # Regex matching
| stats count(*) by bin(5m) # Aggregate by 5-minute bins
| sort @timestamp desc # Order results
| limit 100 # Limit output

Example: Find slow database queries

fields @timestamp, queryTime, query
| filter queryTime > 1000 # Queries taking over 1 second
| stats avg(queryTime), max(queryTime), count() by query
| sort max(queryTime) desc

Example: Error rate over time

fields @timestamp
| filter level = "ERROR"
| stats count(*) as errorCount by bin(5m)

Example: Trace a specific request

fields @timestamp, level, message, customerId
| filter correlationId = "abc-123-def" # All logs for one request
| sort @timestamp asc

Performance tips:

  • Filter early (use filter before stats to reduce data processed)
  • Query specific time ranges (avoid "all time" queries)
  • Use bin() for time-series aggregation instead of processing individual events
  • Limit result sets to avoid timeout

For more query patterns, see Logging Guidelines.

Subscription Filters

Subscription filters stream logs in real-time to other AWS services for processing or archival. This enables log aggregation, real-time alerting, and cost-effective long-term storage.

Common use cases:

  1. S3 archival: Stream logs to S3 via Kinesis Firehose for long-term storage at lower cost
  2. Real-time processing: Trigger Lambda functions for specific log patterns (e.g., error alerts)
  3. Centralized logging: Aggregate logs from multiple accounts into a central account
  4. Security analysis: Stream to security tools for threat detection

Example: Archive logs to S3

resource "aws_cloudwatch_log_subscription_filter" "log_archive" {
  name            = "archive-to-s3"
  log_group_name  = "/aws/ecs/payment-service"
  filter_pattern  = "" # Empty = all logs
  destination_arn = aws_kinesis_firehose_delivery_stream.logs_to_s3.arn
}

resource "aws_kinesis_firehose_delivery_stream" "logs_to_s3" {
  name        = "logs-to-s3"
  destination = "s3"

  s3_configuration {
    role_arn   = aws_iam_role.firehose.arn
    bucket_arn = aws_s3_bucket.log_archive.arn
    prefix     = "logs/payment-service/"

    # Compress logs to save storage costs
    compression_format = "GZIP"
  }
}

Cost consideration: Streaming logs incurs data transfer and processing costs. For infrequently accessed logs, consider exporting directly to S3 on a schedule instead of real-time streaming.
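A scheduled export can be driven by the `aws logs create-export-task` CLI command (a sketch; the bucket name is hypothetical, and the bucket policy must grant CloudWatch Logs permission to write to it):

```shell
# Export the last 24 hours of logs to S3 (timestamps in milliseconds since epoch)
FROM=$(( ($(date +%s) - 86400) * 1000 ))
TO=$(( $(date +%s) * 1000 ))

aws logs create-export-task \
  --task-name "payment-service-daily-export" \
  --log-group-name "/aws/ecs/payment-service" \
  --from "$FROM" --to "$TO" \
  --destination "acme-log-archive" \
  --destination-prefix "logs/payment-service"
```

Export tasks run asynchronously and one at a time per account, so a scheduled Lambda or EventBridge rule typically drives them.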


CloudWatch Metrics

Metrics provide time-series data about system behavior. Unlike logs (which record discrete events), metrics aggregate measurements over time: "average CPU is 45%," "request rate is 1000/sec."

For general metrics patterns and design principles, see Metrics Guidelines.

Metric Namespaces and Dimensions

CloudWatch organizes metrics into namespaces (like AWS/ECS, AWS/RDS, or custom namespaces like MyCompany/Payments). Each metric has:

  • Metric name: What is being measured (CPUUtilization, PaymentCount)
  • Dimensions: Tags that identify the specific resource (ServiceName=payment-api, ClusterId=prod-cluster)
  • Unit: Measurement unit (Percent, Count, Seconds)
  • Timestamp: When the measurement occurred
  • Value: The actual measurement

Dimensions enable filtering:

Namespace: MyCompany/Payments
Metric: ProcessingTime
Dimensions: {Service=payment-api, Environment=prod, PaymentType=card}

You can query: "Show me card payment processing time for prod" or aggregate: "Show me all payment processing times across all payment types."
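The same dimension filtering works from the CLI (namespace and dimensions taken from the example above; note that for custom metrics, the queried dimension set must exactly match the dimensions the metric was published with):

```shell
aws cloudwatch get-metric-statistics \
  --namespace "MyCompany/Payments" \
  --metric-name "ProcessingTime" \
  --dimensions Name=Service,Value=payment-api Name=Environment,Value=prod Name=PaymentType,Value=card \
  --statistics Average Maximum \
  --period 300 \
  --start-time 2025-01-15T00:00:00Z --end-time 2025-01-15T12:00:00Z
```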

Best practices:

  • Use consistent dimension names across services (Environment, Service, Region)
  • Avoid high-cardinality dimension values (customer IDs or request IDs as dimensions create millions of unique metrics)
  • Use CloudWatch Embedded Metric Format (EMF) for efficient metric publishing from logs

Publishing Custom Metrics

AWS services automatically publish metrics (EC2 CPU, RDS connections, ALB requests), but you'll want custom metrics for business logic.

From Application Code (AWS SDK)

Spring Boot example:

@Service
@RequiredArgsConstructor
public class PaymentService {

    private final CloudWatchAsyncClient cloudWatch;

    public Payment createPayment(PaymentRequest request) {
        Instant startTime = Instant.now();

        try {
            Payment payment = processPayment(request);

            // Publish success metric
            publishMetric("PaymentCreated", 1.0, StandardUnit.COUNT,
                    Map.of("PaymentType", request.type(), "Status", "success"));

            // Publish processing time
            double durationMs = Duration.between(startTime, Instant.now()).toMillis();
            publishMetric("PaymentProcessingTime", durationMs, StandardUnit.MILLISECONDS,
                    Map.of("PaymentType", request.type()));

            return payment;

        } catch (Exception ex) {
            // Track failures separately
            publishMetric("PaymentCreated", 1.0, StandardUnit.COUNT,
                    Map.of("PaymentType", request.type(), "Status", "failure"));
            throw ex;
        }
    }

    private void publishMetric(String metricName, double value, StandardUnit unit,
                               Map<String, String> dimensions) {
        var dimensionList = dimensions.entrySet().stream()
                .map(e -> Dimension.builder().name(e.getKey()).value(e.getValue()).build())
                .toList();

        var metricDatum = MetricDatum.builder()
                .metricName(metricName)
                .value(value)
                .unit(unit)
                .timestamp(Instant.now())
                .dimensions(dimensionList)
                .build();

        var request = PutMetricDataRequest.builder()
                .namespace("MyCompany/Payments") // Custom namespace
                .metricData(metricDatum)
                .build();

        // Async call to avoid blocking business logic
        cloudWatch.putMetricData(request);
    }
}

Key points:

  • Use async client (CloudWatchAsyncClient) to avoid blocking application threads
  • Batch multiple metrics into a single PutMetricDataRequest (up to 1000 metrics per call)
  • Track both successes and failures with dimensions for error analysis
  • Record timing data for performance analysis
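The batching point can be sketched independently of the SDK: PutMetricData accepts up to 1,000 datums per call, so a buffered publisher flushes in fixed-size chunks. A minimal illustrative helper (not tied to any AWS SDK):

```python
def chunk_metrics(datums, max_per_request=1000):
    """Split a list of metric datums into PutMetricData-sized batches."""
    return [datums[i:i + max_per_request]
            for i in range(0, len(datums), max_per_request)]

# Each chunk would become one PutMetricDataRequest instead of
# one API call per metric.
batches = chunk_metrics([{"name": "PaymentCreated", "value": 1.0}] * 2500)
```

Buffering metrics and flushing them on a timer (or at batch size) cuts API call volume dramatically compared to one call per metric.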

For Spring Boot integration patterns, see Spring Boot Observability.

From Logs (Embedded Metric Format)

CloudWatch can automatically extract metrics from structured JSON logs using Embedded Metric Format (EMF). This is more cost-effective than calling PutMetricData API directly.

EMF log format:

{
  "LogGroup": "/aws/ecs/payment-service",
  "ServiceName": "payment-api",
  "PaymentType": "card",
  "ProcessingTime": 245,
  "Amount": 150.00,
  "_aws": {
    "Timestamp": 1705320645000,
    "CloudWatchMetrics": [{
      "Namespace": "MyCompany/Payments",
      "Dimensions": [["ServiceName", "PaymentType"]],
      "Metrics": [
        {"Name": "ProcessingTime", "Unit": "Milliseconds"},
        {"Name": "Amount", "Unit": "None"}
      ]
    }]
  }
}

Advantages of EMF:

  • Single write creates both log entry and metric (no separate API call)
  • Lower cost (log ingestion pricing only, no metric API charges)
  • Log and metric guaranteed to have same timestamp
  • Automatic aggregation by CloudWatch

Java library for EMF:

@Service
public class PaymentService {

    private final MetricsLogger metricsLogger = new MetricsLogger();

    public Payment createPayment(PaymentRequest request) {
        Instant start = Instant.now();
        Payment payment = processPayment(request);

        metricsLogger.putDimensions(DimensionSet.of(
                "ServiceName", "payment-api",
                "PaymentType", request.type()
        ));
        metricsLogger.putMetric("ProcessingTime",
                Duration.between(start, Instant.now()).toMillis(), Unit.MILLISECONDS);
        metricsLogger.putProperty("CustomerId", request.customerId());
        metricsLogger.putProperty("Amount", request.amount());

        metricsLogger.flush(); // Writes EMF JSON to stdout

        return payment;
    }
}

For Lambda functions, use aws-embedded-metrics library (Node.js) or aws-embedded-metrics-java which automatically configures CloudWatch destination.

Metric Math

CloudWatch supports mathematical expressions across metrics for derived calculations. This enables creating custom metrics from existing data without writing code.

Examples:

Error rate percentage:

errorRate = (errors / totalRequests) * 100

Available capacity:

availableCapacity = maxCapacity - currentUsage

Custom SLI (99th percentile latency):

sli = 1 - (p99Latency / latencyThreshold)

In CloudWatch console:

Expression: m1 / m2 * 100
m1 = SUM(Errors)
m2 = SUM(RequestCount)
Result: Error rate percentage

Use cases:

  • Calculate business KPIs from multiple metrics
  • Create composite alarms (alert when multiple conditions are true)
  • Normalize metrics across different services
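The error-rate expression above can back an alarm directly in Terraform. A sketch, assuming the custom Errors and RequestCount metrics from the earlier examples exist:

```hcl
resource "aws_cloudwatch_metric_alarm" "error_rate" {
  alarm_name          = "payment-api-error-rate"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 3
  threshold           = 5.0

  metric_query {
    id          = "error_rate"
    expression  = "m1 / m2 * 100"
    label       = "Error rate (%)"
    return_data = true # Alarm on the expression result
  }

  metric_query {
    id = "m1"
    metric {
      metric_name = "Errors"
      namespace   = "MyCompany/Payments"
      period      = 60
      stat        = "Sum"
    }
  }

  metric_query {
    id = "m2"
    metric {
      metric_name = "RequestCount"
      namespace   = "MyCompany/Payments"
      period      = 60
      stat        = "Sum"
    }
  }
}
```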

Metric Filters

Metric filters extract metric data from log events. This is useful for creating metrics from legacy applications that don't publish metrics directly.

Example: Create metric from error logs

Filter pattern: [timestamp, level=ERROR, ...]
Metric namespace: MyCompany/Payments
Metric name: ErrorCount
Metric value: 1

Every time a log line matches level=ERROR, CloudWatch increments the ErrorCount metric.
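The same filter can be expressed in Terraform, following the pattern above:

```hcl
resource "aws_cloudwatch_log_metric_filter" "error_count" {
  name           = "payment-service-errors"
  log_group_name = "/aws/ecs/payment-service"
  pattern        = "[timestamp, level=ERROR, ...]"

  metric_transformation {
    name      = "ErrorCount"
    namespace = "MyCompany/Payments"
    value     = "1" # Increment by 1 per matching log line
  }
}
```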

When to use metric filters:

  • Legacy applications that only emit logs
  • Quick metrics from existing logs without code changes
  • Counting occurrences of specific log patterns

When NOT to use:

  • New applications (use EMF or direct API instead for better performance)
  • Complex aggregations (use CloudWatch Logs Insights queries instead)
  • High-volume logs (filter pattern evaluation adds cost)

CloudWatch Alarms

Alarms monitor metrics and trigger actions when thresholds are breached. They're the foundation of proactive incident response.

For general alerting principles, see Monitoring and Alerting.

Alarm Anatomy

Each alarm monitors a single metric and evaluates it against a threshold over a time period:

Alarm: HighErrorRate
Metric: MyCompany/Payments:ErrorRate
Threshold: > 5%
Evaluation period: 3 consecutive 1-minute periods
Actions: Send SNS notification to on-call team

Evaluation logic:

  1. CloudWatch evaluates metric every minute
  2. If error rate > 5% for 3 consecutive minutes, alarm enters ALARM state
  3. SNS notification triggers (email, SMS, Lambda, etc.)
  4. If error rate drops below 5% for 3 consecutive minutes, alarm returns to OK state

Alarm states:

  • OK: Metric is within threshold
  • ALARM: Metric breached threshold for evaluation periods
  • INSUFFICIENT_DATA: Not enough data points to evaluate (service just started, metric not published)

Creating Effective Alarms

Bad alarm (noisy, not actionable):

Alarm: HighCPU
Metric: EC2 CPUUtilization > 80%
Problem: CPU spikes are normal. This will alert constantly without indicating actual issues.

Good alarm (actionable, user-impacting):

Alarm: HighErrorRate
Metric: API error rate > 1% for 5 minutes
Action: Page on-call engineer
Rationale: Directly impacts users. Sustained elevation indicates systemic issue.

Alarm design principles:

  • Alert on symptoms users experience (error rate, latency), not causes (CPU, memory)
  • Symptoms are universal (users care about errors); causes vary by implementation
  • Set thresholds based on SLAs and user impact, not arbitrary percentages
  • Use longer evaluation periods to avoid flapping (brief spikes don't indicate problems)
  • Different severity levels: page for critical (user-impacting), email for warning (trending toward issue)

Alarm Actions

Alarms can trigger multiple actions in different states:

Terraform example:

resource "aws_cloudwatch_metric_alarm" "high_error_rate" {
  alarm_name          = "payment-api-high-error-rate"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 3
  metric_name         = "ErrorRate"
  namespace           = "MyCompany/Payments"
  period              = 60  # 1 minute
  statistic           = "Average"
  threshold           = 5.0 # 5% error rate
  treat_missing_data  = "notBreaching" # Don't alarm if no data

  dimensions = {
    ServiceName = "payment-api"
    Environment = "production"
  }

  # Actions when entering ALARM state
  alarm_actions = [
    aws_sns_topic.pagerduty_critical.arn, # Page on-call
    aws_sns_topic.slack_alerts.arn        # Post to Slack
  ]

  # Actions when returning to OK state
  ok_actions = [
    aws_sns_topic.slack_alerts.arn # Notify resolution
  ]

  # Actions when data is insufficient
  insufficient_data_actions = [] # Don't alert on missing data
}

Common actions:

  • SNS topic: Notification (email, SMS, HTTP endpoint, Lambda)
  • Auto Scaling action: Scale EC2/ECS capacity
  • EC2 action: Stop, terminate, reboot instance
  • Systems Manager action: Run automation document

Action chaining (auto-remediation): an alarm's SNS action can invoke a Lambda function that remediates the issue automatically (for example, restarting an unhealthy task) before a human is paged.
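One common chaining pattern routes an alarm's SNS notification into a remediation Lambda. A hedged Terraform sketch (topic and function names are hypothetical, and the Lambda itself is assumed to exist):

```hcl
resource "aws_sns_topic" "remediation" {
  name = "payment-api-remediation"
}

# Subscribe the remediation Lambda to the alarm's topic
resource "aws_sns_topic_subscription" "invoke_remediation" {
  topic_arn = aws_sns_topic.remediation.arn
  protocol  = "lambda"
  endpoint  = aws_lambda_function.restart_task.arn
}

# Allow SNS to invoke the function
resource "aws_lambda_permission" "allow_sns" {
  statement_id  = "AllowSNSInvoke"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.restart_task.function_name
  principal     = "sns.amazonaws.com"
  source_arn    = aws_sns_topic.remediation.arn
}
```

The alarm's `alarm_actions` would then reference `aws_sns_topic.remediation.arn` alongside (or before) the paging topic.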

Composite Alarms

Composite alarms combine multiple alarms with AND/OR logic. This reduces alert noise by requiring multiple conditions simultaneously.

Example: Alert only if BOTH error rate is high AND latency is high

resource "aws_cloudwatch_composite_alarm" "service_degraded" {
  alarm_name        = "payment-api-degraded"
  alarm_description = "Service is experiencing both high errors and high latency"

  alarm_rule = "ALARM(${aws_cloudwatch_metric_alarm.high_error_rate.alarm_name}) AND ALARM(${aws_cloudwatch_metric_alarm.high_latency.alarm_name})"

  actions_enabled = true
  alarm_actions   = [aws_sns_topic.pagerduty_critical.arn]
}

Use cases:

  • Reduce false positives (alert only when multiple symptoms confirm an issue)
  • Create severity levels (warning if one condition, critical if multiple)
  • Correlated failures (alert if issue affects multiple services)

Anomaly Detection

CloudWatch can use machine learning to detect anomalies in metric patterns without setting static thresholds.

How it works:

  1. CloudWatch analyzes 2+ weeks of metric history
  2. Builds model of normal behavior (daily patterns, weekly cycles, trends)
  3. Creates dynamic thresholds (bands) around expected values
  4. Alerts when metric deviates from band

When to use anomaly detection:

  • Metrics with daily/weekly patterns (traffic during business hours)
  • Gradual growth trends (hard to set static threshold)
  • Seasonal patterns (holiday traffic spikes)

When NOT to use:

  • New services (not enough history)
  • Highly variable metrics (too many false positives)
  • Critical thresholds defined by SLA (use static thresholds for SLAs)

Creating anomaly alarm:

resource "aws_cloudwatch_metric_alarm" "anomaly_detection" {
  alarm_name          = "payment-api-request-anomaly"
  comparison_operator = "LessThanLowerOrGreaterThanUpperThreshold"
  evaluation_periods  = 2
  threshold_metric_id = "anomaly"

  metric_query {
    id          = "traffic"
    return_data = true

    metric {
      metric_name = "RequestCount"
      namespace   = "AWS/ApplicationELB"
      period      = 300
      stat        = "Sum"
      dimensions = {
        LoadBalancer = "app/payment-api/abc123"
      }
    }
  }

  metric_query {
    id          = "anomaly"
    expression  = "ANOMALY_DETECTION_BAND(traffic, 2)" # 2 standard deviations
    label       = "RequestCount (expected)"
    return_data = true
  }

  alarm_actions = [aws_sns_topic.alerts.arn]
}

AWS X-Ray

X-Ray provides distributed tracing for AWS applications. It visualizes request flow across services, identifies bottlenecks, and helps debug latency issues.

For foundational distributed tracing concepts, see Tracing Guidelines.

How X-Ray Works

When a request enters your application:

  1. Entry point creates trace: API Gateway, ALB, or application generates a trace ID
  2. Trace context propagates: X-Ray SDK adds trace ID to outbound calls (HTTP headers, message attributes)
  3. Services create segments: Each service creates a segment (unit of work) associated with the trace
  4. Subsegments track details: Segments contain subsegments for database calls, external APIs, etc.
  5. X-Ray assembles service map: X-Ray reconstructs the complete request flow

Service map visualization:

  • Circles = services
  • Lines = calls between services
  • Color = health (green = healthy, orange = client errors, red = faults/server errors)
  • Thickness = traffic volume

Instrumenting Applications

Lambda Functions

Lambda has built-in X-Ray support. Enable via function configuration:

Terraform:

resource "aws_lambda_function" "payment_processor" {
  function_name = "payment-processor"
  runtime       = "java17"
  handler       = "com.example.PaymentHandler"

  tracing_config {
    mode = "Active" # Enable X-Ray tracing
  }
}

No code changes required for basic tracing. AWS SDK calls, HTTP requests, and SQL queries are automatically traced.

Custom subsegments (Java):

public class PaymentHandler implements RequestHandler<APIGatewayProxyRequestEvent, APIGatewayProxyResponseEvent> {

    public APIGatewayProxyResponseEvent handleRequest(APIGatewayProxyRequestEvent event, Context context) {
        // Create custom subsegment for business logic
        Subsegment subsegment = AWSXRay.beginSubsegment("processPayment");

        try {
            subsegment.putMetadata("customerId", event.getQueryStringParameters().get("customerId"));
            subsegment.putAnnotation("paymentType", "card"); // Indexed for filtering

            Payment payment = processPayment(event);

            subsegment.putMetadata("paymentId", payment.getId());

            return new APIGatewayProxyResponseEvent()
                    .withStatusCode(200)
                    .withBody(toJson(payment));

        } catch (Exception ex) {
            subsegment.addException(ex);
            throw ex;
        } finally {
            AWSXRay.endSubsegment();
        }
    }
}

Metadata vs Annotations:

  • Annotations: Indexed, filterable, up to 50 per segment (use for filtering traces: paymentType, userId)
  • Metadata: Not indexed, unlimited, any data structure (use for debugging details: full request/response)

ECS/Fargate

ECS tasks require the X-Ray daemon sidecar container to send trace data to X-Ray service.

Task definition with X-Ray daemon:

{
  "family": "payment-service",
  "containerDefinitions": [
    {
      "name": "payment-api",
      "image": "payment-api:latest",
      "environment": [
        {"name": "AWS_XRAY_DAEMON_ADDRESS", "value": "xray-daemon:2000"}
      ]
    },
    {
      "name": "xray-daemon",
      "image": "public.ecr.aws/xray/aws-xray-daemon:latest",
      "portMappings": [
        {"containerPort": 2000, "protocol": "udp"}
      ],
      "environment": [
        {"name": "AWS_REGION", "value": "us-east-1"}
      ]
    }
  ]
}

Application code (Spring Boot):

// build.gradle
dependencies {
    implementation 'com.amazonaws:aws-xray-recorder-sdk-spring:2.15.0'
}

// Application configuration
@Configuration
@EnableAspectJAutoProxy
public class XRayConfig {

    @Bean
    public Filter tracingFilter() {
        return new AWSXRayServletFilter("payment-service");
    }

    @Bean
    public TracingInterceptor tracingInterceptor() {
        // Custom interceptor (typically extending the SDK's
        // AbstractXRayInterceptor) that traces annotated Spring beans
        return new TracingInterceptor();
    }
}

Automatic instrumentation:

  • HTTP requests (incoming/outgoing)
  • AWS SDK calls (S3, DynamoDB, SQS, etc.)
  • SQL queries (JDBC)

For detailed Spring Boot integration, see Spring Boot Observability.

API Gateway

API Gateway automatically creates X-Ray traces when tracing is enabled:

Enable via Terraform:

resource "aws_api_gateway_stage" "prod" {
  stage_name    = "prod"
  rest_api_id   = aws_api_gateway_rest_api.api.id
  deployment_id = aws_api_gateway_deployment.deployment.id

  xray_tracing_enabled = true
}

API Gateway propagates trace context to downstream Lambda functions or HTTP endpoints via X-Amzn-Trace-Id header.
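The header carries the trace root (a version, an epoch timestamp, and a unique identifier), the parent segment ID, and the sampling decision:

```
X-Amzn-Trace-Id: Root=1-5759e988-bd862e3fe1be46a994272793;Parent=53995c3f42cd8ad8;Sampled=1
```

Downstream services that honor this header (Lambda with active tracing, services using the X-Ray SDK) attach their segments to the same trace.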

Sampling Rules

Tracing every request is expensive and generates massive data volume. Sampling reduces cost while maintaining visibility.

Default sampling rule:

  • 1 request per second: Always trace at least 1 req/sec per service
  • 5% of additional requests: Randomly sample 5% of traffic above 1/sec

Why this works:

  • Guarantees some traces even during low traffic
  • Reduces volume during high traffic (tracing 5% of 10,000 req/sec = 500 traces/sec is plenty)

Custom sampling rules:

{
  "version": 2,
  "rules": [
    {
      "description": "Trace all requests matching 5xx status codes",
      "http_method": "*",
      "url_path": "*",
      "fixed_target": 0,
      "rate": 1.0,
      "service_name": "*",
      "service_type": "*",
      "resource_arn": "*",
      "priority": 100,
      "attributes": {
        "http.status_code": "5*"
      }
    },
    {
      "description": "Trace 10% of checkout flow",
      "http_method": "*",
      "url_path": "/api/checkout",
      "fixed_target": 1,
      "rate": 0.1,
      "priority": 200
    },
    {
      "description": "Trace 1% of everything else",
      "http_method": "*",
      "url_path": "*",
      "fixed_target": 1,
      "rate": 0.01,
      "priority": 1000
    }
  ],
  "default": {
    "fixed_target": 1,
    "rate": 0.01
  }
}

Rule priority:

  • Lower number = higher priority
  • First matching rule wins
  • Always trace errors (100% sampling) for debugging
  • Higher sampling for critical paths (checkout: 10%)
  • Lower sampling for high-volume, low-value paths (health checks: 0%)

Analyzing Traces

Finding slow requests:

  1. Go to X-Ray console → Traces
  2. Filter: responsetime > 2 (traces taking over 2 seconds)
  3. Select trace to see waterfall visualization
  4. Identify longest segment/subsegment

Finding errors:

filter: fault = true AND http.status = 500

Finding specific user requests:

annotation.userId = "user-12345"

Trace map use cases:

  • Identify service dependencies (what does my service call?)
  • Find highest latency services (color-coded by response time)
  • Detect cascading failures (errors propagating downstream)
  • Understand traffic patterns (line thickness shows volume)

X-Ray Cost Optimization

X-Ray charges per trace recorded and scanned:

  • Recording: $5 per 1 million traces
  • Scanning: $0.50 per 1 million traces scanned (queries)

Cost reduction strategies:

  1. Lower sampling rates: 1% sampling on 1B requests/month = $50/month (vs $5000 at 100%)
  2. Sample critical paths more: 10% on checkout, 0.1% on health checks
  3. Trace errors heavily: 100% of errors, 1% of successes
  4. Short retention: 30 days default (vs CloudWatch Logs which you control)
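Strategy 1 is simple arithmetic; a quick estimator using the recording price above (illustrative only, and ignoring scanning charges):

```python
def monthly_trace_cost(requests_per_month, sampling_rate,
                       price_per_million=5.0):
    """Estimate X-Ray recording cost: traces recorded x $5 per million."""
    traces_recorded = requests_per_month * sampling_rate
    return traces_recorded / 1_000_000 * price_per_million

# 1B requests/month at 1% sampling -> about $50/month;
# at 100% sampling -> about $5,000/month
```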

Container Insights

Container Insights provides metrics and logs for ECS, EKS, and Kubernetes clusters. It automatically collects, aggregates, and visualizes performance data.

ECS Container Insights

Enable at cluster or task level to collect:

  • Task-level metrics: CPU, memory, network, disk I/O per task
  • Service-level metrics: Aggregated across all tasks in a service
  • Container-level metrics: Per-container granularity

Enable via Terraform:

resource "aws_ecs_cluster" "main" {
  name = "production-cluster"

  setting {
    name  = "containerInsights"
    value = "enabled"
  }
}

Metrics available:

  • CpuUtilized, CpuReserved: Task/container CPU usage
  • MemoryUtilized, MemoryReserved: Memory usage
  • NetworkRxBytes, NetworkTxBytes: Network traffic
  • StorageReadBytes, StorageWriteBytes: Disk I/O

Use cases:

  • Right-size task definitions (are you over-allocating CPU/memory?)
  • Identify memory leaks (gradual memory increase over time)
  • Network bottlenecks (high network utilization)
  • Cost optimization (reduce reserved CPU/memory if underutilized)
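For right-sizing, the raw performance events can be queried with Logs Insights against the cluster's performance log group (typically /aws/ecs/containerinsights/{cluster-name}/performance). A hedged example; field names follow the ECS performance-event schema:

```
fields @timestamp, TaskId, MemoryUtilized, MemoryReserved
| filter Type = "Task"
| stats avg(MemoryUtilized) as avgMem, max(MemoryUtilized) as maxMem by TaskId
| sort maxMem desc
| limit 20
```

Comparing `maxMem` against `MemoryReserved` quickly shows which task definitions are over-allocated.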

EKS Container Insights

For EKS, Container Insights requires deploying CloudWatch agent and Fluent Bit as DaemonSets.

Install via Helm:

helm repo add eks https://aws.github.io/eks-charts
helm install aws-cloudwatch-metrics eks/aws-cloudwatch-metrics \
--namespace amazon-cloudwatch \
--set clusterName=my-cluster

Metrics collected:

  • Cluster-level: Node count, pod count, CPU/memory across cluster
  • Namespace-level: Resources used per namespace
  • Pod-level: CPU, memory, network per pod
  • Container-level: Resource usage per container

Performance dashboard: Container Insights automatically creates CloudWatch dashboards showing:

  • Cluster resource utilization over time
  • Top pods by CPU/memory
  • Pod/container failures
  • Node resource allocation vs usage

For EKS-specific details, see AWS EKS. For general Kubernetes patterns, see Kubernetes Guidelines.

Log Aggregation

Container Insights aggregates logs from all containers into CloudWatch Logs:

Log structure:

Log Group: /aws/containerinsights/{cluster-name}/application
├─ Log Stream: pod-{namespace}_{pod-name}_{container-name}
└─ Log format: JSON with Kubernetes metadata

Kubernetes metadata added automatically:

{
  "log": "Payment created for customer 12345",
  "stream": "stdout",
  "time": "2025-01-15T10:30:45.123Z",
  "kubernetes": {
    "pod_name": "payment-api-abc123",
    "namespace_name": "production",
    "pod_id": "xyz-789",
    "labels": {
      "app": "payment-api",
      "version": "v2.1.0"
    },
    "container_name": "payment-api",
    "docker_id": "docker://123abc"
  }
}

Query logs by pod label:

fields @timestamp, log, kubernetes.pod_name
| filter kubernetes.labels.app = "payment-api"
| filter kubernetes.labels.version = "v2.1.0"
| sort @timestamp desc

This enables filtering by deployment version, finding logs for specific releases, or aggregating across all pods of a service.


Integration with Existing Observability Stack

Correlation IDs Across AWS Services

Maintain correlation throughout the request lifecycle:

API Gateway → Lambda:

// Lambda receives correlation ID from API Gateway request
import { SQSClient, SendMessageCommand } from '@aws-sdk/client-sqs';
import { v4 as uuidv4 } from 'uuid';

const sqsClient = new SQSClient({});

export const handler = async (event, context) => {
    const correlationId = event.headers['X-Correlation-ID'] || uuidv4();

    // Add to all logs
    console.log(JSON.stringify({
        message: 'Processing payment',
        correlationId,
        customerId: event.pathParameters.customerId
    }));

    // Pass to downstream services
    await sqsClient.send(new SendMessageCommand({
        QueueUrl: QUEUE_URL,
        MessageBody: JSON.stringify({...}),
        MessageAttributes: {
            CorrelationId: {
                DataType: 'String',
                StringValue: correlationId
            }
        }
    }));
};

SQS → Lambda:

export const handler = async (event, context) => {
    for (const record of event.Records) {
        // Extract correlation ID from message attributes
        const correlationId = record.messageAttributes.CorrelationId?.stringValue;

        console.log(JSON.stringify({
            message: 'Processing SQS message',
            correlationId,
            messageId: record.messageId
        }));
    }
};

Spring Boot with AWS SDK:

@Component
public class SqsPublisher {

    private final SqsClient sqsClient;

    public void publishEvent(PaymentEvent event) {
        // Get correlation ID from MDC (set by filter)
        String correlationId = MDC.get("correlationId");

        sqsClient.sendMessage(SendMessageRequest.builder()
                .queueUrl(queueUrl)
                .messageBody(toJson(event))
                .messageAttributes(Map.of(
                        "CorrelationId", MessageAttributeValue.builder()
                                .dataType("String")
                                .stringValue(correlationId)
                                .build()
                ))
                .build());
    }
}

For more on correlation ID patterns, see Logging Guidelines.

Cross-Account Observability

In multi-account architectures, centralize observability data for unified visibility:

Pattern: Central logging account

Setup:

  1. Create central logging account with S3 bucket and Kinesis stream
  2. Configure subscription filters in each workload account
  3. Grant cross-account permissions (IAM roles)
  4. Logs from all accounts aggregate in central location
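In Terraform, this wiring centers on a CloudWatch Logs destination in the central account. A sketch (the IAM role, Kinesis stream, and access policy document are assumed to exist, and the destination ARN would be shared with workload accounts as a variable):

```hcl
# Central logging account: destination fronting the Kinesis stream
resource "aws_cloudwatch_log_destination" "central" {
  name       = "central-log-destination"
  role_arn   = aws_iam_role.cwl_to_kinesis.arn
  target_arn = aws_kinesis_stream.central_logs.arn
}

# Grant workload accounts permission to subscribe
resource "aws_cloudwatch_log_destination_policy" "allow_workloads" {
  destination_name = aws_cloudwatch_log_destination.central.name
  access_policy    = data.aws_iam_policy_document.allow_workload_accounts.json
}

# Each workload account: subscription filter pointing at the destination ARN
resource "aws_cloudwatch_log_subscription_filter" "to_central" {
  name            = "to-central-logging"
  log_group_name  = "/aws/ecs/payment-service"
  filter_pattern  = ""
  destination_arn = var.central_destination_arn # Passed in from the central account
}
```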

Benefits:

  • Unified search across all environments
  • Compliance and audit (logs in separate account from workloads)
  • Cost optimization (single S3 bucket with lifecycle policies)
  • Security (logs immutable in separate account)

CloudWatch cross-account dashboard: CloudWatch supports cross-account dashboards to visualize metrics from multiple accounts:

resource "aws_cloudwatch_dashboard" "multi_account" {
  dashboard_name = "cross-account-overview"

  dashboard_body = jsonencode({
    widgets = [
      {
        type = "metric"
        properties = {
          metrics = [
            ["AWS/ECS", "CPUUtilization", { region = "us-east-1", accountId = "111111111111" }],
            ["AWS/ECS", "CPUUtilization", { region = "us-east-1", accountId = "222222222222" }]
          ]
          title = "ECS CPU Across Accounts"
        }
      }
    ]
  })
}

Third-Party Integration

CloudWatch integrates with third-party observability platforms for enhanced visualization and analysis:

Prometheus:

  • Use CloudWatch Exporter to scrape CloudWatch metrics into Prometheus
  • Query with PromQL, visualize in Grafana
  • Combine AWS metrics with application metrics

Datadog/New Relic:

  • Install agent on EC2/ECS for deeper instrumentation
  • Forward CloudWatch Logs to platform via Lambda forwarder
  • Unified dashboard with AWS metrics + APM data

Grafana:

  • Native CloudWatch data source
  • Query CloudWatch Logs Insights directly from Grafana
  • Combine with Prometheus, Jaeger for unified view

See Observability Overview for multi-tool integration patterns.


Best Practices

Logging Best Practices

  • Use structured JSON logs: Enable efficient CloudWatch Logs Insights queries
  • Add correlation IDs: Track requests across all services and accounts
  • Set appropriate retention: 30 days for operational logs, longer for audit logs
  • Never log sensitive data: PII, passwords, API keys, credit card numbers
  • Use log levels correctly: ERROR for user-impacting failures, WARN for recoverable issues, INFO for business events
  • Export to S3 for archival: substantially cheaper than long-term retention in CloudWatch Logs, especially with lifecycle transitions to Glacier tiers
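A minimal sketch of the first two bullets, using Python's standard logging module: a formatter that emits one JSON object per line with a correlationId field that Logs Insights can filter on. The field names and the payment-service logger name are illustrative, not a fixed schema.

```python
import json
import logging
import sys
import uuid

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line for CloudWatch Logs Insights."""
    def format(self, record):
        return json.dumps({
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S%z"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Correlation ID arrives via the `extra` kwarg, if the caller set one
            "correlationId": getattr(record, "correlation_id", None),
        })

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("payment-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Generate the correlation ID at the edge and pass it on every log call.
correlation_id = str(uuid.uuid4())
logger.info("payment authorized", extra={"correlation_id": correlation_id})
```

Because every line is valid JSON with consistent keys, a query like `filter correlationId = "..."` returns the full request history across services.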

Metrics Best Practices

  • Track business metrics: Payment success rate, checkout completion time (not just infrastructure metrics)
  • Use consistent dimensions: Environment, Service, Region across all services
  • Publish metrics asynchronously: Don't block business logic waiting for CloudWatch API
  • Use EMF for cost efficiency: Embedded Metric Format extracts metrics from logs without API calls
  • Avoid high-cardinality dimensions: Don't use customer IDs or request IDs as dimensions (millions of unique values)
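EMF works by embedding metric metadata inside the log event itself: the CloudWatch agent or Lambda runtime extracts the declared metrics on ingestion, so no PutMetricData API calls are made. A minimal sketch of building such an event (the PaymentService namespace and field names are illustrative):

```python
import json
import time

def emf_event(namespace, dimensions, metrics, **extra_fields):
    """Build an Embedded Metric Format log line.

    dimensions: dict of dimension name -> value (keep cardinality low).
    metrics: dict of metric name -> (value, unit).
    extra_fields: context (e.g. correlation ID) stored in the log but not extracted.
    """
    return json.dumps({
        "_aws": {
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": namespace,
                "Dimensions": [list(dimensions)],
                "Metrics": [{"Name": n, "Unit": u} for n, (_, u) in metrics.items()],
            }],
        },
        **dimensions,
        **{name: value for name, (value, _) in metrics.items()},
        **extra_fields,
    })

# Printing the line to stdout is enough; the log pipeline does the rest.
print(emf_event("PaymentService",
                {"Environment": "prod", "Service": "payment"},
                {"PaymentLatency": (183, "Milliseconds")},
                correlationId="abc-123"))
```

High-cardinality values like the correlation ID stay queryable in the log event without becoming metric dimensions.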

Tracing Best Practices

  • Sample intelligently: 100% of errors, 10% of critical paths, 1% of normal traffic
  • Add business context: Annotate traces with customer ID, payment type, order ID for filtering
  • Trace external dependencies: Ensure all HTTP, database, and message queue calls are captured
  • Set meaningful subsegment names: processPayment not method1
  • Always end spans: Use try-finally blocks to prevent memory leaks
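X-Ray normally enforces a sampling policy like the one above through its centralized sampling rules. As a standalone sketch (the path names and rates are illustrative), the tiered decision reduces to:

```python
import random

# Hypothetical critical paths; in practice this would mirror your X-Ray
# sampling rules rather than live in application code.
CRITICAL_PATHS = {"/checkout", "/payment"}

def should_sample(path: str, is_error: bool, rng=random.random) -> bool:
    """Tiered sampling: 100% of errors, 10% of critical paths, 1% of the rest."""
    if is_error:
        return True             # errors are always traced
    if path in CRITICAL_PATHS:
        return rng() < 0.10     # 10% of critical paths
    return rng() < 0.01         # 1% of normal traffic
```

The injectable `rng` parameter is only there to make the decision deterministic in tests.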

Alarming Best Practices

  • Alert on symptoms, not causes: Error rate (symptom) not CPU usage (cause)
  • Set thresholds based on SLAs: if the SLA is 99.9% availability, alarm when availability drops below 99.95% to catch degradation before the SLA is breached
  • Use composite alarms: Reduce noise by requiring multiple symptoms
  • Different severity levels: Page for critical user-impacting issues, email for warnings
  • Test your alarms: Trigger intentionally to verify on-call gets notified

Cost Optimization

  • Short retention for verbose logs: 3-7 days for debug logs, longer for audit logs
  • Sample traces: 1% sampling reduces costs by 99% with minimal visibility loss
  • Use metric filters sparingly: High-volume log parsing is expensive
  • Archive to S3: CloudWatch Logs charges roughly $0.50/GB for ingestion plus about $0.03/GB-month for storage, versus about $0.023/GB-month in S3 Standard (us-east-1 list prices), and lifecycle policies to Glacier tiers cut archival costs much further
  • Aggregation at source: Use EMF to create metrics from logs instead of storing verbose logs
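A back-of-the-envelope sketch of the archival math from the bullets above. The rates are assumed us-east-1 list prices (CloudWatch Logs storage about $0.03/GB-month, S3 Standard about $0.023/GB-month); ingestion is paid once either way and is excluded.

```python
# Assumed list prices, us-east-1 (verify against current AWS pricing pages).
CW_STORAGE_PER_GB_MONTH = 0.03
S3_STORAGE_PER_GB_MONTH = 0.023

def monthly_storage_cost(gb_per_day: float, retained_days: int, rate: float) -> float:
    """Steady-state monthly cost once `retained_days` of logs have accumulated."""
    return gb_per_day * retained_days * rate

# 50 GB/day retained for a year: hot in CloudWatch vs. archived to S3 Standard.
in_cloudwatch = monthly_storage_cost(50, 365, CW_STORAGE_PER_GB_MONTH)  # 547.50/month
in_s3 = monthly_storage_cost(50, 365, S3_STORAGE_PER_GB_MONTH)          # 419.75/month
```

At S3 Standard the saving is modest; the large wins come from short hot retention (3-7 days in CloudWatch) plus lifecycle transitions to Glacier tiers for the archive.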

Anti-Patterns

Logging Anti-Patterns

  • Plain text logs: Difficult to query and aggregate
  • No correlation IDs: Impossible to trace requests across services
  • Logging PII: Compliance violations (GDPR, PCI-DSS)
  • Infinite retention: Logs stored forever cost thousands per month
  • Excessive logging: Logging every method call generates noise and cost

Metrics Anti-Patterns

  • Only infrastructure metrics: CPU and memory don't tell you if users are happy
  • High-cardinality dimensions: Using customer ID as dimension creates millions of metric combinations
  • Synchronous publishing: Blocking business logic waiting for CloudWatch API
  • No aggregation: Sending individual events instead of aggregated metrics

Tracing Anti-Patterns

  • 100% sampling: Expensive and unnecessary (1% is often sufficient)
  • No sampling of errors: Errors should always be traced for debugging
  • Missing propagation: Trace context not passed to downstream services (breaks distributed trace)
  • Trace ID as correlation ID: Use separate correlation ID for logs (traces are sampled, logs aren't)

Alarming Anti-Patterns

  • Alert on everything: Too many alarms = alert fatigue = ignored alarms
  • No alarm runbooks: On-call doesn't know what to do when alarm fires
  • Static thresholds on variable metrics: Daily traffic patterns trigger false alarms at night
  • Alerting on causes: CPU high (cause) instead of latency high (symptom)

Summary

AWS provides a comprehensive observability stack:

CloudWatch Logs:

  • Centralized logging with JSON for structured data
  • CloudWatch Logs Insights for powerful querying
  • Subscription filters for real-time streaming and archival

CloudWatch Metrics:

  • Time-series data for trends and alerting
  • Custom metrics via SDK or Embedded Metric Format
  • Metric math for derived calculations

CloudWatch Alarms:

  • Threshold-based alerting with SNS actions
  • Composite alarms for reduced noise
  • Anomaly detection for dynamic thresholds

AWS X-Ray:

  • Distributed tracing across services
  • Service map visualization
  • Custom segments for business logic

Container Insights:

  • ECS/EKS performance metrics
  • Pod/container-level visibility
  • Automatic log aggregation with Kubernetes metadata

Key Practices:

  • Use correlation IDs across all services
  • Structured JSON logs for queryability
  • Sample traces intelligently (errors + random sample)
  • Alert on user-impacting symptoms
  • Cost-optimize with retention policies and archival
