AWS Observability

Implementing comprehensive observability on AWS using CloudWatch and X-Ray for logs, metrics, and distributed tracing.

Overview

AWS provides a comprehensive observability stack through CloudWatch (logs and metrics), X-Ray (distributed tracing), and specialized services like Container Insights for containerized workloads. This guide covers how to instrument your applications, configure monitoring infrastructure, and build effective observability practices on AWS.

For foundational observability concepts, see the Observability Overview. For Spring Boot-specific implementation details, see Spring Boot Observability. This guide focuses on AWS-specific integration patterns and services.


Core Principles

  • Structured logging: Use JSON format with consistent field names for efficient querying via CloudWatch Logs Insights
  • Correlation IDs: Propagate trace context through all AWS services and application layers for end-to-end traceability
  • Cost-aware sampling: Implement intelligent sampling for traces and verbose logs to manage CloudWatch costs
  • Actionable alarms: Create alarms based on business metrics and user-impacting symptoms, not just infrastructure thresholds
  • Centralized observability: Aggregate logs and metrics across accounts and regions for unified visibility

CloudWatch Logs

CloudWatch Logs is AWS's centralized logging service. Applications, AWS services, and infrastructure all emit logs to CloudWatch for storage, search, and analysis.

Log Groups and Streams

CloudWatch organizes logs into log groups (logical containers like /aws/ecs/my-service) and log streams (individual sources like container instances or Lambda executions).

Structure:

Log Group: /aws/ecs/payment-service
├─ Log Stream: task/abc123-container-1
├─ Log Stream: task/abc123-container-2
└─ Log Stream: task/xyz789-container-1

Each log event within a stream has:

  • Timestamp: When the event occurred
  • Message: The actual log content (text or JSON)
  • Ingestion time: When CloudWatch received it

Best practices:

  • Use hierarchical naming: /company/environment/service (e.g., /acme/prod/payment-api)
  • Separate log groups by environment to avoid production data contamination
  • Use consistent naming across teams for easier cross-service querying

Structured Logging with JSON

CloudWatch Logs Insights can parse and query JSON logs much more efficiently than plain text. Send logs as JSON with consistent field names.

Plain text log (harder to query):

2025-01-15T10:30:45 INFO Payment created for customer 12345 amount 100.50 USD

JSON structured log (queryable):

{
  "timestamp": "2025-01-15T10:30:45.123Z",
  "level": "INFO",
  "message": "Payment created",
  "customerId": "12345",
  "paymentId": "pay-67890",
  "amount": 100.50,
  "currency": "USD",
  "correlationId": "abc-123-def"
}

With JSON logs, you can run queries like:

fields @timestamp, customerId, amount
| filter amount > 1000
| stats sum(amount) by customerId

For implementation details on structured logging in Spring Boot, see Logging Guidelines.

Log Retention and Lifecycle

By default, logs never expire. This can become expensive quickly. Set appropriate retention periods based on compliance requirements and usage patterns.

Common retention policies:

  • Development logs: 3-7 days (short-term debugging)
  • Production application logs: 30-90 days (operational troubleshooting)
  • Audit logs: 1-7 years (compliance requirements like PCI-DSS, SOX)
  • Access logs: 90-365 days (security analysis)

Cost optimization:

  • Short retention for high-volume, low-value logs (debug logs, health check requests)
  • Export older logs to S3 for archival at 1/10th the cost
  • Use subscription filters to stream logs to S3/Kinesis instead of long retention

Setting retention via Terraform:

resource "aws_cloudwatch_log_group" "payment_service" {
  name              = "/aws/ecs/payment-service"
  retention_in_days = 30 # Automatically delete after 30 days

  tags = {
    Environment = "production"
    Service     = "payment-service"
  }
}

See Terraform Guidelines for infrastructure as code patterns.

CloudWatch Logs Insights

Logs Insights is CloudWatch's built-in query language for analyzing log data. It provides SQL-like syntax for filtering, aggregating, and visualizing logs.

Query anatomy:

fields @timestamp, level, message, correlationId    # Select fields to display
| filter level = "ERROR" # Filter conditions
| filter @message like /timeout|connection/ # Regex matching
| stats count(*) by bin(5m) # Aggregate by 5-minute bins
| sort @timestamp desc # Order results
| limit 100 # Limit output

Example: Find slow database queries

fields @timestamp, queryTime, query
| filter queryTime > 1000 # Queries taking over 1 second
| stats avg(queryTime), max(queryTime), count() by query
| sort max(queryTime) desc

Example: Error rate over time

fields @timestamp
| filter level = "ERROR"
| stats count(*) as errorCount by bin(5m)

Example: Trace a specific request

fields @timestamp, level, message, customerId
| filter correlationId = "abc-123-def" # All logs for one request
| sort @timestamp asc

Performance tips:

  • Filter early (use filter before stats to reduce data processed)
  • Query specific time ranges (avoid "all time" queries)
  • Use bin() for time-series aggregation instead of processing individual events
  • Limit result sets to avoid timeout

For more query patterns, see Logging Guidelines.

Subscription Filters

Subscription filters stream logs in real-time to other AWS services for processing or archival. This enables log aggregation, real-time alerting, and cost-effective long-term storage.

Common use cases:

  1. S3 archival: Stream logs to S3 via Kinesis Firehose for long-term storage at lower cost
  2. Real-time processing: Trigger Lambda functions for specific log patterns (e.g., error alerts)
  3. Centralized logging: Aggregate logs from multiple accounts into a central account
  4. Security analysis: Stream to security tools for threat detection

Example: Archive logs to S3

resource "aws_cloudwatch_log_subscription_filter" "log_archive" {
  name            = "archive-to-s3"
  log_group_name  = "/aws/ecs/payment-service"
  filter_pattern  = "" # Empty = all logs
  destination_arn = aws_kinesis_firehose_delivery_stream.logs_to_s3.arn
}

resource "aws_kinesis_firehose_delivery_stream" "logs_to_s3" {
  name        = "logs-to-s3"
  destination = "s3"

  s3_configuration {
    role_arn   = aws_iam_role.firehose.arn
    bucket_arn = aws_s3_bucket.log_archive.arn
    prefix     = "logs/payment-service/"

    # Compress logs to save storage costs
    compression_format = "GZIP"
  }
}

Cost consideration: Streaming logs incurs data transfer and processing costs. For infrequently accessed logs, consider exporting directly to S3 on a schedule instead of real-time streaming.
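A scheduled export can be driven by the `aws logs create-export-task` CLI command (a sketch; the bucket name is hypothetical, and the bucket policy must grant CloudWatch Logs permission to write to it):

```shell
# Export the last 24 hours of logs to S3 (timestamps in milliseconds since epoch)
FROM=$(( ($(date +%s) - 86400) * 1000 ))
TO=$(( $(date +%s) * 1000 ))

aws logs create-export-task \
  --task-name "payment-service-daily-export" \
  --log-group-name "/aws/ecs/payment-service" \
  --from "$FROM" --to "$TO" \
  --destination "acme-log-archive" \
  --destination-prefix "logs/payment-service"
```

Export tasks run asynchronously and one at a time per account, so a scheduled Lambda or EventBridge rule typically drives them.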


CloudWatch Metrics

Metrics provide time-series data about system behavior. Unlike logs (which record discrete events), metrics aggregate measurements over time: "average CPU is 45%," "request rate is 1000/sec."

For general metrics patterns and design principles, see Metrics Guidelines.

Metric Namespaces and Dimensions

CloudWatch organizes metrics into namespaces (like AWS/ECS, AWS/RDS, or custom namespaces like MyCompany/Payments). Each metric has:

  • Metric name: What is being measured (CPUUtilization, PaymentCount)
  • Dimensions: Tags that identify the specific resource (ServiceName=payment-api, ClusterId=prod-cluster)
  • Unit: Measurement unit (Percent, Count, Seconds)
  • Timestamp: When the measurement occurred
  • Value: The actual measurement

Dimensions enable filtering:

Namespace: MyCompany/Payments
Metric: ProcessingTime
Dimensions: {Service=payment-api, Environment=prod, PaymentType=card}

You can query: "Show me card payment processing time for prod" or aggregate: "Show me all payment processing times across all payment types."
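The same dimension filtering works from the CLI (namespace and dimensions taken from the example above; note that for custom metrics, the queried dimension set must exactly match the dimensions the metric was published with):

```shell
aws cloudwatch get-metric-statistics \
  --namespace "MyCompany/Payments" \
  --metric-name "ProcessingTime" \
  --dimensions Name=Service,Value=payment-api Name=Environment,Value=prod Name=PaymentType,Value=card \
  --statistics Average Maximum \
  --period 300 \
  --start-time 2025-01-15T00:00:00Z --end-time 2025-01-15T12:00:00Z
```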

Best practices:

  • Use consistent dimension names across services (Environment, Service, Region)
  • Avoid high-cardinality dimension values (customer IDs or request IDs as dimensions create millions of unique metrics)
  • Use CloudWatch Embedded Metric Format (EMF) for efficient metric publishing from logs

Publishing Custom Metrics

AWS services automatically publish metrics (EC2 CPU, RDS connections, ALB requests), but you'll want custom metrics for business logic.

From Application Code (AWS SDK)

Spring Boot example:

@Service
@RequiredArgsConstructor
public class PaymentService {

    private final CloudWatchAsyncClient cloudWatch;

    public Payment createPayment(PaymentRequest request) {
        Instant startTime = Instant.now();

        try {
            Payment payment = processPayment(request);

            // Publish success metric
            publishMetric("PaymentCreated", 1.0, StandardUnit.COUNT,
                    Map.of("PaymentType", request.type(), "Status", "success"));

            // Publish processing time
            double durationMs = Duration.between(startTime, Instant.now()).toMillis();
            publishMetric("PaymentProcessingTime", durationMs, StandardUnit.MILLISECONDS,
                    Map.of("PaymentType", request.type()));

            return payment;

        } catch (Exception ex) {
            // Track failures separately
            publishMetric("PaymentCreated", 1.0, StandardUnit.COUNT,
                    Map.of("PaymentType", request.type(), "Status", "failure"));
            throw ex;
        }
    }

    private void publishMetric(String metricName, double value, StandardUnit unit,
                               Map<String, String> dimensions) {
        var dimensionList = dimensions.entrySet().stream()
                .map(e -> Dimension.builder().name(e.getKey()).value(e.getValue()).build())
                .toList();

        var metricDatum = MetricDatum.builder()
                .metricName(metricName)
                .value(value)
                .unit(unit)
                .timestamp(Instant.now())
                .dimensions(dimensionList)
                .build();

        var request = PutMetricDataRequest.builder()
                .namespace("MyCompany/Payments") // Custom namespace
                .metricData(metricDatum)
                .build();

        // Async call to avoid blocking business logic
        cloudWatch.putMetricData(request);
    }
}

Key points:

  • Use async client (CloudWatchAsyncClient) to avoid blocking application threads
  • Batch multiple metrics into a single PutMetricDataRequest (up to 1000 metrics per call)
  • Track both successes and failures with dimensions for error analysis
  • Record timing data for performance analysis
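The batching point can be sketched independently of the SDK: PutMetricData accepts up to 1,000 datums per call, so a buffered publisher flushes in fixed-size chunks. A minimal illustrative helper (not tied to any AWS SDK):

```python
def chunk_metrics(datums, max_per_request=1000):
    """Split a list of metric datums into PutMetricData-sized batches."""
    return [datums[i:i + max_per_request]
            for i in range(0, len(datums), max_per_request)]

# Each chunk would become one PutMetricDataRequest instead of
# one API call per metric.
batches = chunk_metrics([{"name": "PaymentCreated", "value": 1.0}] * 2500)
```

Buffering metrics and flushing them on a timer (or at batch size) cuts API call volume dramatically compared to one call per metric.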

For Spring Boot integration patterns, see Spring Boot Observability.

From Logs (Embedded Metric Format)

CloudWatch can automatically extract metrics from structured JSON logs using Embedded Metric Format (EMF). This is more cost-effective than calling PutMetricData API directly.

EMF log format:

{
  "LogGroup": "/aws/ecs/payment-service",
  "ServiceName": "payment-api",
  "PaymentType": "card",
  "ProcessingTime": 245,
  "Amount": 150.00,
  "_aws": {
    "Timestamp": 1705320645000,
    "CloudWatchMetrics": [{
      "Namespace": "MyCompany/Payments",
      "Dimensions": [["ServiceName", "PaymentType"]],
      "Metrics": [
        {"Name": "ProcessingTime", "Unit": "Milliseconds"},
        {"Name": "Amount", "Unit": "None"}
      ]
    }]
  }
}

Advantages of EMF:

  • Single write creates both log entry and metric (no separate API call)
  • Lower cost (log ingestion pricing only, no metric API charges)
  • Log and metric guaranteed to have same timestamp
  • Automatic aggregation by CloudWatch

Java library for EMF:

@Service
public class PaymentService {

    private final MetricsLogger metricsLogger = new MetricsLogger();

    public Payment createPayment(PaymentRequest request) {
        Instant start = Instant.now();
        Payment payment = processPayment(request);

        metricsLogger.putDimensions(DimensionSet.of(
                "ServiceName", "payment-api",
                "PaymentType", request.type()
        ));
        metricsLogger.putMetric("ProcessingTime",
                Duration.between(start, Instant.now()).toMillis(), Unit.MILLISECONDS);
        metricsLogger.putProperty("CustomerId", request.customerId());
        metricsLogger.putProperty("Amount", request.amount());

        metricsLogger.flush(); // Writes EMF JSON to stdout

        return payment;
    }
}

For Lambda functions, use aws-embedded-metrics library (Node.js) or aws-embedded-metrics-java which automatically configures CloudWatch destination.

Metric Math

CloudWatch supports mathematical expressions across metrics for derived calculations. This enables creating custom metrics from existing data without writing code.

Examples:

Error rate percentage:

errorRate = (errors / totalRequests) * 100

Available capacity:

availableCapacity = maxCapacity - currentUsage

Custom SLI (99th percentile latency):

sli = 1 - (p99Latency / latencyThreshold)

In CloudWatch console:

Expression: m1 / m2 * 100
m1 = SUM(Errors)
m2 = SUM(RequestCount)
Result: Error rate percentage

Use cases:

  • Calculate business KPIs from multiple metrics
  • Create composite alarms (alert when multiple conditions are true)
  • Normalize metrics across different services
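The error-rate expression above can back an alarm directly in Terraform. A sketch, assuming the custom Errors and RequestCount metrics from the earlier examples exist:

```hcl
resource "aws_cloudwatch_metric_alarm" "error_rate" {
  alarm_name          = "payment-api-error-rate"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 3
  threshold           = 5.0

  metric_query {
    id          = "error_rate"
    expression  = "m1 / m2 * 100"
    label       = "Error rate (%)"
    return_data = true # Alarm on the expression result
  }

  metric_query {
    id = "m1"
    metric {
      metric_name = "Errors"
      namespace   = "MyCompany/Payments"
      period      = 60
      stat        = "Sum"
    }
  }

  metric_query {
    id = "m2"
    metric {
      metric_name = "RequestCount"
      namespace   = "MyCompany/Payments"
      period      = 60
      stat        = "Sum"
    }
  }
}
```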

Metric Filters

Metric filters extract metric data from log events. This is useful for creating metrics from legacy applications that don't publish metrics directly.

Example: Create metric from error logs

Filter pattern: [timestamp, level=ERROR, ...]
Metric namespace: MyCompany/Payments
Metric name: ErrorCount
Metric value: 1

Every time a log line matches level=ERROR, CloudWatch increments the ErrorCount metric.
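The same filter can be expressed in Terraform, following the pattern above:

```hcl
resource "aws_cloudwatch_log_metric_filter" "error_count" {
  name           = "payment-service-errors"
  log_group_name = "/aws/ecs/payment-service"
  pattern        = "[timestamp, level=ERROR, ...]"

  metric_transformation {
    name      = "ErrorCount"
    namespace = "MyCompany/Payments"
    value     = "1" # Increment by 1 per matching log line
  }
}
```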

When to use metric filters:

  • Legacy applications that only emit logs
  • Quick metrics from existing logs without code changes
  • Counting occurrences of specific log patterns

When NOT to use:

  • New applications (use EMF or direct API instead for better performance)
  • Complex aggregations (use CloudWatch Logs Insights queries instead)
  • High-volume logs (filter pattern evaluation adds cost)

CloudWatch Alarms

Alarms monitor metrics and trigger actions when thresholds are breached. They're the foundation of proactive incident response.

For general alerting principles, see Monitoring and Alerting.

Alarm Anatomy

Each alarm monitors a single metric and evaluates it against a threshold over a time period:

Alarm: HighErrorRate
Metric: MyCompany/Payments:ErrorRate
Threshold: > 5%
Evaluation period: 3 consecutive 1-minute periods
Actions: Send SNS notification to on-call team

Evaluation logic:

  1. CloudWatch evaluates metric every minute
  2. If error rate > 5% for 3 consecutive minutes, alarm enters ALARM state
  3. SNS notification triggers (email, SMS, Lambda, etc.)
  4. If error rate drops below 5% for 3 consecutive minutes, alarm returns to OK state

Alarm states:

  • OK: Metric is within threshold
  • ALARM: Metric breached threshold for evaluation periods
  • INSUFFICIENT_DATA: Not enough data points to evaluate (service just started, metric not published)

Creating Effective Alarms

Bad alarm (noisy, not actionable):

Alarm: HighCPU
Metric: EC2 CPUUtilization > 80%
Problem: CPU spikes are normal. This will alert constantly without indicating actual issues.

Good alarm (actionable, user-impacting):

Alarm: HighErrorRate
Metric: API error rate > 1% for 5 minutes
Action: Page on-call engineer
Rationale: Directly impacts users. Sustained elevation indicates systemic issue.

Alarm design principles:

  • Alert on symptoms users experience (error rate, latency), not causes (CPU, memory)
  • Symptoms are universal (users care about errors); causes vary by implementation
  • Set thresholds based on SLAs and user impact, not arbitrary percentages
  • Use longer evaluation periods to avoid flapping (brief spikes don't indicate problems)
  • Different severity levels: page for critical (user-impacting), email for warning (trending toward issue)

Alarm Actions

Alarms can trigger multiple actions in different states:

Terraform example:

resource "aws_cloudwatch_metric_alarm" "high_error_rate" {
  alarm_name          = "payment-api-high-error-rate"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 3
  metric_name         = "ErrorRate"
  namespace           = "MyCompany/Payments"
  period              = 60  # 1 minute
  statistic           = "Average"
  threshold           = 5.0 # 5% error rate
  treat_missing_data  = "notBreaching" # Don't alarm if no data

  dimensions = {
    ServiceName = "payment-api"
    Environment = "production"
  }

  # Actions when entering ALARM state
  alarm_actions = [
    aws_sns_topic.pagerduty_critical.arn, # Page on-call
    aws_sns_topic.slack_alerts.arn        # Post to Slack
  ]

  # Actions when returning to OK state
  ok_actions = [
    aws_sns_topic.slack_alerts.arn # Notify resolution
  ]

  # Actions when data is insufficient
  insufficient_data_actions = [] # Don't alert on missing data
}

Common actions:

  • SNS topic: Notification (email, SMS, HTTP endpoint, Lambda)
  • Auto Scaling action: Scale EC2/ECS capacity
  • EC2 action: Stop, terminate, reboot instance
  • Systems Manager action: Run automation document

Action chaining (auto-remediation): an alarm's SNS action can invoke a Lambda function that remediates the issue automatically (for example, restarting an unhealthy task) before a human is paged.
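One common chaining pattern routes an alarm's SNS notification into a remediation Lambda. A hedged Terraform sketch (topic and function names are hypothetical, and the Lambda itself is assumed to exist):

```hcl
resource "aws_sns_topic" "remediation" {
  name = "payment-api-remediation"
}

# Subscribe the remediation Lambda to the alarm's topic
resource "aws_sns_topic_subscription" "invoke_remediation" {
  topic_arn = aws_sns_topic.remediation.arn
  protocol  = "lambda"
  endpoint  = aws_lambda_function.restart_task.arn
}

# Allow SNS to invoke the function
resource "aws_lambda_permission" "allow_sns" {
  statement_id  = "AllowSNSInvoke"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.restart_task.function_name
  principal     = "sns.amazonaws.com"
  source_arn    = aws_sns_topic.remediation.arn
}
```

The alarm's `alarm_actions` would then reference `aws_sns_topic.remediation.arn` alongside (or before) the paging topic.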

Composite Alarms

Composite alarms combine multiple alarms with AND/OR logic. This reduces alert noise by requiring multiple conditions simultaneously.

Example: Alert only if BOTH error rate is high AND latency is high

resource "aws_cloudwatch_composite_alarm" "service_degraded" {
  alarm_name        = "payment-api-degraded"
  alarm_description = "Service is experiencing both high errors and high latency"

  alarm_rule = "ALARM(${aws_cloudwatch_metric_alarm.high_error_rate.alarm_name}) AND ALARM(${aws_cloudwatch_metric_alarm.high_latency.alarm_name})"

  actions_enabled = true
  alarm_actions   = [aws_sns_topic.pagerduty_critical.arn]
}

Use cases:

  • Reduce false positives (alert only when multiple symptoms confirm an issue)
  • Create severity levels (warning if one condition, critical if multiple)
  • Correlated failures (alert if issue affects multiple services)

Anomaly Detection

CloudWatch can use machine learning to detect anomalies in metric patterns without setting static thresholds.

How it works:

  1. CloudWatch analyzes 2+ weeks of metric history
  2. Builds model of normal behavior (daily patterns, weekly cycles, trends)
  3. Creates dynamic thresholds (bands) around expected values
  4. Alerts when metric deviates from band

When to use anomaly detection:

  • Metrics with daily/weekly patterns (traffic during business hours)
  • Gradual growth trends (hard to set static threshold)
  • Seasonal patterns (holiday traffic spikes)

When NOT to use:

  • New services (not enough history)
  • Highly variable metrics (too many false positives)
  • Critical thresholds defined by SLA (use static thresholds for SLAs)

Creating anomaly alarm:

resource "aws_cloudwatch_metric_alarm" "anomaly_detection" {
  alarm_name          = "payment-api-request-anomaly"
  comparison_operator = "LessThanLowerOrGreaterThanUpperThreshold"
  evaluation_periods  = 2
  threshold_metric_id = "anomaly"

  metric_query {
    id          = "traffic"
    return_data = true

    metric {
      metric_name = "RequestCount"
      namespace   = "AWS/ApplicationELB"
      period      = 300
      stat        = "Sum"
      dimensions = {
        LoadBalancer = "app/payment-api/abc123"
      }
    }
  }

  metric_query {
    id          = "anomaly"
    expression  = "ANOMALY_DETECTION_BAND(traffic, 2)" # 2 standard deviations
    label       = "RequestCount (expected)"
    return_data = true
  }

  alarm_actions = [aws_sns_topic.alerts.arn]
}

AWS X-Ray

X-Ray provides distributed tracing for AWS applications. It visualizes request flow across services, identifies bottlenecks, and helps debug latency issues.

For foundational distributed tracing concepts, see Tracing Guidelines.

How X-Ray Works

When a request enters your application:

  1. Entry point creates trace: API Gateway, ALB, or application generates a trace ID
  2. Trace context propagates: X-Ray SDK adds trace ID to outbound calls (HTTP headers, message attributes)
  3. Services create segments: Each service creates a segment (unit of work) associated with the trace
  4. Subsegments track details: Segments contain subsegments for database calls, external APIs, etc.
  5. X-Ray assembles service map: X-Ray reconstructs the complete request flow

Service map visualization:

  • Circles = services
  • Lines = calls between services
  • Color = health (green = healthy, orange = client errors, red = faults/server errors)
  • Thickness = traffic volume

Instrumenting Applications

Lambda Functions

Lambda has built-in X-Ray support. Enable via function configuration:

Terraform:

resource "aws_lambda_function" "payment_processor" {
  function_name = "payment-processor"
  runtime       = "java17"
  handler       = "com.example.PaymentHandler"

  tracing_config {
    mode = "Active" # Enable X-Ray tracing
  }
}

No code changes required for basic tracing. AWS SDK calls, HTTP requests, and SQL queries are automatically traced.

Custom subsegments (Java):

public class PaymentHandler implements RequestHandler<APIGatewayProxyRequestEvent, APIGatewayProxyResponseEvent> {

    public APIGatewayProxyResponseEvent handleRequest(APIGatewayProxyRequestEvent event, Context context) {
        // Create custom subsegment for business logic
        Subsegment subsegment = AWSXRay.beginSubsegment("processPayment");

        try {
            subsegment.putMetadata("customerId", event.getQueryStringParameters().get("customerId"));
            subsegment.putAnnotation("paymentType", "card"); // Indexed for filtering

            Payment payment = processPayment(event);

            subsegment.putMetadata("paymentId", payment.getId());

            return new APIGatewayProxyResponseEvent()
                    .withStatusCode(200)
                    .withBody(toJson(payment));

        } catch (Exception ex) {
            subsegment.addException(ex);
            throw ex;
        } finally {
            AWSXRay.endSubsegment();
        }
    }
}

Metadata vs Annotations:

  • Annotations: Indexed, filterable, up to 50 per segment (use for filtering traces: paymentType, userId)
  • Metadata: Not indexed, unlimited, any data structure (use for debugging details: full request/response)

ECS/Fargate

ECS tasks require the X-Ray daemon sidecar container to send trace data to X-Ray service.

Task definition with X-Ray daemon:

{
  "family": "payment-service",
  "containerDefinitions": [
    {
      "name": "payment-api",
      "image": "payment-api:latest",
      "environment": [
        {"name": "AWS_XRAY_DAEMON_ADDRESS", "value": "xray-daemon:2000"}
      ]
    },
    {
      "name": "xray-daemon",
      "image": "public.ecr.aws/xray/aws-xray-daemon:latest",
      "portMappings": [
        {"containerPort": 2000, "protocol": "udp"}
      ],
      "environment": [
        {"name": "AWS_REGION", "value": "us-east-1"}
      ]
    }
  ]
}

Application code (Spring Boot):

// build.gradle
dependencies {
    implementation 'com.amazonaws:aws-xray-recorder-sdk-spring:2.15.0'
}

// Application configuration
@Configuration
@EnableAspectJAutoProxy
public class XRayConfig {

    @Bean
    public Filter tracingFilter() {
        return new AWSXRayServletFilter("payment-service");
    }

    @Bean
    public TracingInterceptor tracingInterceptor() {
        // Custom interceptor (typically extending the SDK's
        // AbstractXRayInterceptor) that traces annotated Spring beans
        return new TracingInterceptor();
    }
}

Automatic instrumentation:

  • HTTP requests (incoming/outgoing)
  • AWS SDK calls (S3, DynamoDB, SQS, etc.)
  • SQL queries (JDBC)

For detailed Spring Boot integration, see Spring Boot Observability.

API Gateway

API Gateway automatically creates X-Ray traces when tracing is enabled:

Enable via Terraform:

resource "aws_api_gateway_stage" "prod" {
  stage_name    = "prod"
  rest_api_id   = aws_api_gateway_rest_api.api.id
  deployment_id = aws_api_gateway_deployment.deployment.id

  xray_tracing_enabled = true
}

API Gateway propagates trace context to downstream Lambda functions or HTTP endpoints via X-Amzn-Trace-Id header.
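The header carries the trace root (a version, an epoch timestamp, and a unique identifier), the parent segment ID, and the sampling decision:

```
X-Amzn-Trace-Id: Root=1-5759e988-bd862e3fe1be46a994272793;Parent=53995c3f42cd8ad8;Sampled=1
```

Downstream services that honor this header (Lambda with active tracing, services using the X-Ray SDK) attach their segments to the same trace.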

Sampling Rules

Tracing every request is expensive and generates massive data volume. Sampling reduces cost while maintaining visibility.

Default sampling rule:

  • 1 request per second: Always trace at least 1 req/sec per service
  • 5% of additional requests: Randomly sample 5% of traffic above 1/sec

Why this works:

  • Guarantees some traces even during low traffic
  • Reduces volume during high traffic (tracing 5% of 10,000 req/sec = 500 traces/sec is plenty)

Custom sampling rules:

{
  "version": 2,
  "rules": [
    {
      "description": "Trace all requests matching 5xx status codes",
      "http_method": "*",
      "url_path": "*",
      "fixed_target": 0,
      "rate": 1.0,
      "service_name": "*",
      "service_type": "*",
      "resource_arn": "*",
      "priority": 100,
      "attributes": {
        "http.status_code": "5*"
      }
    },
    {
      "description": "Trace 10% of checkout flow",
      "http_method": "*",
      "url_path": "/api/checkout",
      "fixed_target": 1,
      "rate": 0.1,
      "priority": 200
    },
    {
      "description": "Trace 1% of everything else",
      "http_method": "*",
      "url_path": "*",
      "fixed_target": 1,
      "rate": 0.01,
      "priority": 1000
    }
  ],
  "default": {
    "fixed_target": 1,
    "rate": 0.01
  }
}

Rule priority:

  • Lower number = higher priority
  • First matching rule wins
  • Always trace errors (100% sampling) for debugging
  • Higher sampling for critical paths (checkout: 10%)
  • Lower sampling for high-volume, low-value paths (health checks: 0%)

Analyzing Traces

Finding slow requests:

  1. Go to X-Ray console → Traces
  2. Filter: responsetime > 2 (traces taking over 2 seconds)
  3. Select trace to see waterfall visualization
  4. Identify longest segment/subsegment

Finding errors:

filter: fault = true AND http.status = 500

Finding specific user requests:

annotation.userId = "user-12345"

Trace map use cases:

  • Identify service dependencies (what does my service call?)
  • Find highest latency services (color-coded by response time)
  • Detect cascading failures (errors propagating downstream)
  • Understand traffic patterns (line thickness shows volume)

X-Ray Cost Optimization

X-Ray charges per trace recorded and scanned:

  • Recording: $5 per 1 million traces
  • Scanning: $0.50 per 1 million traces scanned (queries)

Cost reduction strategies:

  1. Lower sampling rates: 1% sampling on 1B requests/month = $50/month (vs $5000 at 100%)
  2. Sample critical paths more: 10% on checkout, 0.1% on health checks
  3. Trace errors heavily: 100% of errors, 1% of successes
  4. Short retention: 30 days default (vs CloudWatch Logs which you control)
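Strategy 1 is simple arithmetic; a quick estimator using the recording price above (illustrative only, and ignoring scanning charges):

```python
def monthly_trace_cost(requests_per_month, sampling_rate,
                       price_per_million=5.0):
    """Estimate X-Ray recording cost: traces recorded x $5 per million."""
    traces_recorded = requests_per_month * sampling_rate
    return traces_recorded / 1_000_000 * price_per_million

# 1B requests/month at 1% sampling -> about $50/month;
# at 100% sampling -> about $5,000/month
```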

Container Insights

Container Insights provides metrics and logs for ECS, EKS, and Kubernetes clusters. It automatically collects, aggregates, and visualizes performance data.

ECS Container Insights

Enable at cluster or task level to collect:

  • Task-level metrics: CPU, memory, network, disk I/O per task
  • Service-level metrics: Aggregated across all tasks in a service
  • Container-level metrics: Per-container granularity

Enable via Terraform:

resource "aws_ecs_cluster" "main" {
  name = "production-cluster"

  setting {
    name  = "containerInsights"
    value = "enabled"
  }
}

Metrics available:

  • CpuUtilized, CpuReserved: Task/container CPU usage
  • MemoryUtilized, MemoryReserved: Memory usage
  • NetworkRxBytes, NetworkTxBytes: Network traffic
  • StorageReadBytes, StorageWriteBytes: Disk I/O

Use cases:

  • Right-size task definitions (are you over-allocating CPU/memory?)
  • Identify memory leaks (gradual memory increase over time)
  • Network bottlenecks (high network utilization)
  • Cost optimization (reduce reserved CPU/memory if underutilized)
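For right-sizing, the raw performance events can be queried with Logs Insights against the cluster's performance log group (typically /aws/ecs/containerinsights/{cluster-name}/performance). A hedged example; field names follow the ECS performance-event schema:

```
fields @timestamp, TaskId, MemoryUtilized, MemoryReserved
| filter Type = "Task"
| stats avg(MemoryUtilized) as avgMem, max(MemoryUtilized) as maxMem by TaskId
| sort maxMem desc
| limit 20
```

Comparing `maxMem` against `MemoryReserved` quickly shows which task definitions are over-allocated.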

EKS Container Insights

For EKS, Container Insights requires deploying CloudWatch agent and Fluent Bit as DaemonSets.

Install via Helm:

helm repo add eks https://aws.github.io/eks-charts
helm install aws-cloudwatch-metrics eks/aws-cloudwatch-metrics \
--namespace amazon-cloudwatch \
--set clusterName=my-cluster

Metrics collected:

  • Cluster-level: Node count, pod count, CPU/memory across cluster
  • Namespace-level: Resources used per namespace
  • Pod-level: CPU, memory, network per pod
  • Container-level: Resource usage per container

Performance dashboard: Container Insights automatically creates CloudWatch dashboards showing:

  • Cluster resource utilization over time
  • Top pods by CPU/memory
  • Pod/container failures
  • Node resource allocation vs usage

For EKS-specific details, see AWS EKS. For general Kubernetes patterns, see Kubernetes Guidelines.

Log Aggregation

Container Insights aggregates logs from all containers into CloudWatch Logs:

Log structure:

Log Group: /aws/containerinsights/{cluster-name}/application
├─ Log Stream: pod-{namespace}_{pod-name}_{container-name}
└─ Log format: JSON with Kubernetes metadata

Kubernetes metadata added automatically:

{
  "log": "Payment created for customer 12345",
  "stream": "stdout",
  "time": "2025-01-15T10:30:45.123Z",
  "kubernetes": {
    "pod_name": "payment-api-abc123",
    "namespace_name": "production",
    "pod_id": "xyz-789",
    "labels": {
      "app": "payment-api",
      "version": "v2.1.0"
    },
    "container_name": "payment-api",
    "docker_id": "docker://123abc"
  }
}

Query logs by pod label:

fields @timestamp, log, kubernetes.pod_name
| filter kubernetes.labels.app = "payment-api"
| filter kubernetes.labels.version = "v2.1.0"
| sort @timestamp desc

This enables filtering by deployment version, finding logs for specific releases, or aggregating across all pods of a service.


Integration with Existing Observability Stack

Correlation IDs Across AWS Services

Maintain correlation throughout the request lifecycle:

API Gateway → Lambda:

// Lambda receives correlation ID from API Gateway request
import { SQSClient, SendMessageCommand } from '@aws-sdk/client-sqs';
import { v4 as uuidv4 } from 'uuid';

const sqsClient = new SQSClient({});

export const handler = async (event, context) => {
    const correlationId = event.headers['X-Correlation-ID'] || uuidv4();

    // Add to all logs
    console.log(JSON.stringify({
        message: 'Processing payment',
        correlationId,
        customerId: event.pathParameters.customerId
    }));

    // Pass to downstream services
    await sqsClient.send(new SendMessageCommand({
        QueueUrl: QUEUE_URL,
        MessageBody: JSON.stringify({...}),
        MessageAttributes: {
            CorrelationId: {
                DataType: 'String',
                StringValue: correlationId
            }
        }
    }));
};

SQS → Lambda:

export const handler = async (event, context) => {
    for (const record of event.Records) {
        // Extract correlation ID from message attributes
        const correlationId = record.messageAttributes.CorrelationId?.stringValue;

        console.log(JSON.stringify({
            message: 'Processing SQS message',
            correlationId,
            messageId: record.messageId
        }));
    }
};

Spring Boot with AWS SDK:

@Component
public class SqsPublisher {

    private final SqsClient sqsClient;

    public void publishEvent(PaymentEvent event) {
        // Get correlation ID from MDC (set by filter)
        String correlationId = MDC.get("correlationId");

        sqsClient.sendMessage(SendMessageRequest.builder()
                .queueUrl(queueUrl)
                .messageBody(toJson(event))
                .messageAttributes(Map.of(
                        "CorrelationId", MessageAttributeValue.builder()
                                .dataType("String")
                                .stringValue(correlationId)
                                .build()
                ))
                .build());
    }
}

For more on correlation ID patterns, see Logging Guidelines.

Cross-Account Observability

In multi-account architectures, centralize observability data for unified visibility:

Pattern: Central logging account

Setup:

  1. Create central logging account with S3 bucket and Kinesis stream
  2. Configure subscription filters in each workload account
  3. Grant cross-account permissions (IAM roles)
  4. Logs from all accounts aggregate in central location
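In Terraform, this wiring centers on a CloudWatch Logs destination in the central account. A sketch (the IAM role, Kinesis stream, and access policy document are assumed to exist, and the destination ARN would be shared with workload accounts as a variable):

```hcl
# Central logging account: destination fronting the Kinesis stream
resource "aws_cloudwatch_log_destination" "central" {
  name       = "central-log-destination"
  role_arn   = aws_iam_role.cwl_to_kinesis.arn
  target_arn = aws_kinesis_stream.central_logs.arn
}

# Grant workload accounts permission to subscribe
resource "aws_cloudwatch_log_destination_policy" "allow_workloads" {
  destination_name = aws_cloudwatch_log_destination.central.name
  access_policy    = data.aws_iam_policy_document.allow_workload_accounts.json
}

# Each workload account: subscription filter pointing at the destination ARN
resource "aws_cloudwatch_log_subscription_filter" "to_central" {
  name            = "to-central-logging"
  log_group_name  = "/aws/ecs/payment-service"
  filter_pattern  = ""
  destination_arn = var.central_destination_arn # Passed in from the central account
}
```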

Benefits:

  • Unified search across all environments
  • Compliance and audit (logs in separate account from workloads)
  • Cost optimization (single S3 bucket with lifecycle policies)
  • Security (logs immutable in separate account)

CloudWatch cross-account dashboard: CloudWatch supports cross-account dashboards to visualize metrics from multiple accounts:

resource "aws_cloudwatch_dashboard" "multi_account" {
  dashboard_name = "cross-account-overview"

  dashboard_body = jsonencode({
    widgets = [
      {
        type = "metric"
        properties = {
          metrics = [
            ["AWS/ECS", "CPUUtilization", { region = "us-east-1", accountId = "111111111111" }],
            ["AWS/ECS", "CPUUtilization", { region = "us-east-1", accountId = "222222222222" }]
          ]
          title = "ECS CPU Across Accounts"
        }
      }
    ]
  })
}

Third-Party Integration

CloudWatch integrates with third-party observability platforms for enhanced visualization and analysis:

Prometheus:

  • Use CloudWatch Exporter to scrape CloudWatch metrics into Prometheus
  • Query with PromQL, visualize in Grafana
  • Combine AWS metrics with application metrics

Datadog/New Relic:

  • Install agent on EC2/ECS for deeper instrumentation
  • Forward CloudWatch Logs to platform via Lambda forwarder
  • Unified dashboard with AWS metrics + APM data

Grafana:

  • Native CloudWatch data source
  • Query CloudWatch Logs Insights directly from Grafana
  • Combine with Prometheus, Jaeger for unified view

See Observability Overview for multi-tool integration patterns.


Best Practices

Logging Best Practices

  • Use structured JSON logs: Enable efficient CloudWatch Logs Insights queries
  • Add correlation IDs: Track requests across all services and accounts
  • Set appropriate retention: 30 days for operational logs, longer for audit logs
  • Never log sensitive data: PII, passwords, API keys, credit card numbers
  • Use log levels correctly: ERROR for user-impacting failures, WARN for recoverable issues, INFO for business events
  • Export to S3 for archival: substantially cheaper than long-term retention in CloudWatch Logs, especially with lifecycle transitions to Glacier tiers
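A minimal sketch of the first two bullets, using Python's standard logging module: a formatter that emits one JSON object per line with a correlationId field that Logs Insights can filter on. The field names and the payment-service logger name are illustrative, not a fixed schema.

```python
import json
import logging
import sys
import uuid

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line for CloudWatch Logs Insights."""
    def format(self, record):
        return json.dumps({
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S%z"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Correlation ID arrives via the `extra` kwarg, if the caller set one
            "correlationId": getattr(record, "correlation_id", None),
        })

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("payment-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Generate the correlation ID at the edge and pass it on every log call.
correlation_id = str(uuid.uuid4())
logger.info("payment authorized", extra={"correlation_id": correlation_id})
```

Because every line is valid JSON with consistent keys, a query like `filter correlationId = "..."` returns the full request history across services.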

Metrics Best Practices

  • Track business metrics: Payment success rate, checkout completion time (not just infrastructure metrics)
  • Use consistent dimensions: Environment, Service, Region across all services
  • Publish metrics asynchronously: Don't block business logic waiting for CloudWatch API
  • Use EMF for cost efficiency: Embedded Metric Format extracts metrics from logs without API calls
  • Avoid high-cardinality dimensions: Don't use customer IDs or request IDs as dimensions (millions of unique values)
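EMF works by embedding metric metadata inside the log event itself: the CloudWatch agent or Lambda runtime extracts the declared metrics on ingestion, so no PutMetricData API calls are made. A minimal sketch of building such an event (the PaymentService namespace and field names are illustrative):

```python
import json
import time

def emf_event(namespace, dimensions, metrics, **extra_fields):
    """Build an Embedded Metric Format log line.

    dimensions: dict of dimension name -> value (keep cardinality low).
    metrics: dict of metric name -> (value, unit).
    extra_fields: context (e.g. correlation ID) stored in the log but not extracted.
    """
    return json.dumps({
        "_aws": {
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": namespace,
                "Dimensions": [list(dimensions)],
                "Metrics": [{"Name": n, "Unit": u} for n, (_, u) in metrics.items()],
            }],
        },
        **dimensions,
        **{name: value for name, (value, _) in metrics.items()},
        **extra_fields,
    })

# Printing the line to stdout is enough; the log pipeline does the rest.
print(emf_event("PaymentService",
                {"Environment": "prod", "Service": "payment"},
                {"PaymentLatency": (183, "Milliseconds")},
                correlationId="abc-123"))
```

High-cardinality values like the correlation ID stay queryable in the log event without becoming metric dimensions.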

Tracing Best Practices

  • Sample intelligently: 100% of errors, 10% of critical paths, 1% of normal traffic
  • Add business context: Annotate traces with customer ID, payment type, order ID for filtering
  • Trace external dependencies: Ensure all HTTP, database, and message queue calls are captured
  • Set meaningful subsegment names: processPayment not method1
  • Always end spans: Use try-finally blocks to prevent memory leaks
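X-Ray normally enforces a sampling policy like the one above through its centralized sampling rules. As a standalone sketch (the path names and rates are illustrative), the tiered decision reduces to:

```python
import random

# Hypothetical critical paths; in practice this would mirror your X-Ray
# sampling rules rather than live in application code.
CRITICAL_PATHS = {"/checkout", "/payment"}

def should_sample(path: str, is_error: bool, rng=random.random) -> bool:
    """Tiered sampling: 100% of errors, 10% of critical paths, 1% of the rest."""
    if is_error:
        return True             # errors are always traced
    if path in CRITICAL_PATHS:
        return rng() < 0.10     # 10% of critical paths
    return rng() < 0.01         # 1% of normal traffic
```

The injectable `rng` parameter is only there to make the decision deterministic in tests.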

Alarming Best Practices

  • Alert on symptoms, not causes: Error rate (symptom) not CPU usage (cause)
  • Set thresholds based on SLAs: if the SLA is 99.9% availability, alarm when availability drops below 99.95% to catch degradation before the SLA is breached
  • Use composite alarms: Reduce noise by requiring multiple symptoms
  • Different severity levels: Page for critical user-impacting issues, email for warnings
  • Test your alarms: Trigger intentionally to verify on-call gets notified

Cost Optimization

  • Short retention for verbose logs: 3-7 days for debug logs, longer for audit logs
  • Sample traces: 1% sampling reduces costs by 99% with minimal visibility loss
  • Use metric filters sparingly: High-volume log parsing is expensive
  • Archive to S3: CloudWatch Logs charges roughly $0.50/GB for ingestion plus about $0.03/GB-month for storage, versus about $0.023/GB-month in S3 Standard (us-east-1 list prices), and lifecycle policies to Glacier tiers cut archival costs much further
  • Aggregation at source: Use EMF to create metrics from logs instead of storing verbose logs
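A back-of-the-envelope sketch of the archival math from the bullets above. The rates are assumed us-east-1 list prices (CloudWatch Logs storage about $0.03/GB-month, S3 Standard about $0.023/GB-month); ingestion is paid once either way and is excluded.

```python
# Assumed list prices, us-east-1 (verify against current AWS pricing pages).
CW_STORAGE_PER_GB_MONTH = 0.03
S3_STORAGE_PER_GB_MONTH = 0.023

def monthly_storage_cost(gb_per_day: float, retained_days: int, rate: float) -> float:
    """Steady-state monthly cost once `retained_days` of logs have accumulated."""
    return gb_per_day * retained_days * rate

# 50 GB/day retained for a year: hot in CloudWatch vs. archived to S3 Standard.
in_cloudwatch = monthly_storage_cost(50, 365, CW_STORAGE_PER_GB_MONTH)  # 547.50/month
in_s3 = monthly_storage_cost(50, 365, S3_STORAGE_PER_GB_MONTH)          # 419.75/month
```

At S3 Standard the saving is modest; the large wins come from short hot retention (3-7 days in CloudWatch) plus lifecycle transitions to Glacier tiers for the archive.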

Anti-Patterns

Logging Anti-Patterns

  • Plain text logs: Difficult to query and aggregate
  • No correlation IDs: Impossible to trace requests across services
  • Logging PII: Compliance violations (GDPR, PCI-DSS)
  • Infinite retention: Logs stored forever cost thousands per month
  • Excessive logging: Logging every method call generates noise and cost

Metrics Anti-Patterns

  • Only infrastructure metrics: CPU and memory don't tell you if users are happy
  • High-cardinality dimensions: Using customer ID as dimension creates millions of metric combinations
  • Synchronous publishing: Blocking business logic waiting for CloudWatch API
  • No aggregation: Sending individual events instead of aggregated metrics

Tracing Anti-Patterns

  • 100% sampling: Expensive and unnecessary (1% is often sufficient)
  • No sampling of errors: Errors should always be traced for debugging
  • Missing propagation: Trace context not passed to downstream services (breaks distributed trace)
  • Trace ID as correlation ID: Use separate correlation ID for logs (traces are sampled, logs aren't)

Alarming Anti-Patterns

  • Alert on everything: Too many alarms = alert fatigue = ignored alarms
  • No alarm runbooks: On-call doesn't know what to do when alarm fires
  • Static thresholds on variable metrics: Daily traffic patterns trigger false alarms at night
  • Alerting on causes: CPU high (cause) instead of latency high (symptom)

Summary

AWS provides a comprehensive observability stack:

CloudWatch Logs:

  • Centralized logging with JSON for structured data
  • CloudWatch Logs Insights for powerful querying
  • Subscription filters for real-time streaming and archival

CloudWatch Metrics:

  • Time-series data for trends and alerting
  • Custom metrics via SDK or Embedded Metric Format
  • Metric math for derived calculations

CloudWatch Alarms:

  • Threshold-based alerting with SNS actions
  • Composite alarms for reduced noise
  • Anomaly detection for dynamic thresholds

AWS X-Ray:

  • Distributed tracing across services
  • Service map visualization
  • Custom segments for business logic

Container Insights:

  • ECS/EKS performance metrics
  • Pod/container-level visibility
  • Automatic log aggregation with Kubernetes metadata

Key Practices:

  • Use correlation IDs across all services
  • Structured JSON logs for queryability
  • Sample traces intelligently (errors + random sample)
  • Alert on user-impacting symptoms
  • Cost-optimize with retention policies and archival
