AWS Observability
Implementing comprehensive observability on AWS using CloudWatch and X-Ray for logs, metrics, and distributed tracing.
Overview
AWS provides a full observability stack through CloudWatch (logs and metrics), X-Ray (distributed tracing), and specialized services like Container Insights for containerized workloads. This guide covers how to instrument your applications, configure monitoring infrastructure, and build effective observability practices on AWS.
For foundational observability concepts, see the Observability Overview. For Spring Boot-specific implementation details, see Spring Boot Observability. This guide focuses on AWS-specific integration patterns and services.
Core Principles
- Structured logging: Use JSON format with consistent field names for efficient querying via CloudWatch Logs Insights
- Correlation IDs: Propagate trace context through all AWS services and application layers for end-to-end traceability
- Cost-aware sampling: Implement intelligent sampling for traces and verbose logs to manage CloudWatch costs
- Actionable alarms: Create alarms based on business metrics and user-impacting symptoms, not just infrastructure thresholds
- Centralized observability: Aggregate logs and metrics across accounts and regions for unified visibility
CloudWatch Logs
CloudWatch Logs is AWS's centralized logging service. Applications, AWS services, and infrastructure all emit logs to CloudWatch for storage, search, and analysis.
Log Groups and Streams
CloudWatch organizes logs into log groups (logical containers like /aws/ecs/my-service) and log streams (individual sources like container instances or Lambda executions).
Structure:
Log Group: /aws/ecs/payment-service
├─ Log Stream: task/abc123-container-1
├─ Log Stream: task/abc123-container-2
└─ Log Stream: task/xyz789-container-1
Each log event within a stream has:
- Timestamp: When the event occurred
- Message: The actual log content (text or JSON)
- Ingestion time: When CloudWatch received it
Best practices:
- Use hierarchical naming: /company/environment/service (e.g., /acme/prod/payment-api)
- Separate log groups by environment to avoid production data contamination
- Use consistent naming across teams for easier cross-service querying
Structured Logging with JSON
CloudWatch Logs Insights can parse and query JSON logs much more efficiently than plain text. Send logs as JSON with consistent field names.
Plain text log (harder to query):
2025-01-15T10:30:45 INFO Payment created for customer 12345 amount 100.50 USD
JSON structured log (queryable):
{
"timestamp": "2025-01-15T10:30:45.123Z",
"level": "INFO",
"message": "Payment created",
"customerId": "12345",
"paymentId": "pay-67890",
"amount": 100.50,
"currency": "USD",
"correlationId": "abc-123-def"
}
With JSON logs, you can run queries like:
fields @timestamp, customerId, amount
| filter amount > 1000
| stats sum(amount) by customerId
For implementation details on structured logging in Spring Boot, see Logging Guidelines.
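The JSON log shape above can be produced without any logging framework; here is a stdlib-only Java sketch (the `StructuredLog` class name and field handling are illustrative — real services would typically use a JSON encoder in their logging framework, and this sketch treats all field values as strings for brevity):

```java
import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal structured-log formatter: emits one JSON object per line
// so CloudWatch Logs Insights can discover the fields automatically.
final class StructuredLog {

    public static String line(String level, String message, Map<String, String> fields) {
        Map<String, String> event = new LinkedHashMap<>();
        event.put("timestamp", Instant.now().toString());
        event.put("level", level);
        event.put("message", message);
        event.putAll(fields);

        StringBuilder json = new StringBuilder("{");
        event.forEach((k, v) -> json.append('"').append(escape(k)).append("\":\"")
                .append(escape(v)).append("\","));
        json.setLength(json.length() - 1); // drop the trailing comma
        return json.append('}').toString();
    }

    // Escape only quotes and backslashes; enough for this sketch.
    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }
}
```

Each call returns a single line, which matters for CloudWatch: one log event per JSON object keeps Logs Insights field discovery working.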
Log Retention and Lifecycle
By default, logs never expire. This can become expensive quickly. Set appropriate retention periods based on compliance requirements and usage patterns.
Common retention policies:
- Development logs: 3-7 days (short-term debugging)
- Production application logs: 30-90 days (operational troubleshooting)
- Audit logs: 1-7 years (compliance requirements like PCI-DSS, SOX)
- Access logs: 90-365 days (security analysis)
Cost optimization:
- Short retention for high-volume, low-value logs (debug logs, health check requests)
- Export older logs to S3 for archival at 1/10th the cost
- Use subscription filters to stream logs to S3/Kinesis instead of long retention
Setting retention via Terraform:
resource "aws_cloudwatch_log_group" "payment_service" {
name = "/aws/ecs/payment-service"
retention_in_days = 30 # Automatically delete after 30 days
tags = {
Environment = "production"
Service = "payment-service"
}
}
See Terraform Guidelines for infrastructure as code patterns.
CloudWatch Logs Insights
Logs Insights is CloudWatch's built-in query language for analyzing log data. It provides SQL-like syntax for filtering, aggregating, and visualizing logs.
Query anatomy:
fields @timestamp, level, message, correlationId # Select fields to display
| filter level = "ERROR" # Filter conditions
| filter @message like /timeout|connection/ # Regex matching
| stats count(*) by bin(5m) # Aggregate by 5-minute bins
| sort @timestamp desc # Order results
| limit 100 # Limit output
Example: Find slow database queries
fields @timestamp, queryTime, query
| filter queryTime > 1000 # Queries taking over 1 second
| stats avg(queryTime), max(queryTime), count() by query
| sort max(queryTime) desc
Example: Error rate over time
fields @timestamp
| filter level = "ERROR"
| stats count(*) as errorCount by bin(5m)
Example: Trace a specific request
fields @timestamp, level, message, customerId
| filter correlationId = "abc-123-def" # All logs for one request
| sort @timestamp asc
Performance tips:
- Filter early (use filter before stats to reduce data processed)
- Query specific time ranges (avoid "all time" queries)
- Use bin() for time-series aggregation instead of processing individual events
- Limit result sets to avoid timeouts
For more query patterns, see Logging Guidelines.
Subscription Filters
Subscription filters stream logs in real-time to other AWS services for processing or archival. This enables log aggregation, real-time alerting, and cost-effective long-term storage.
Common use cases:
- S3 archival: Stream logs to S3 via Kinesis Firehose for long-term storage at lower cost
- Real-time processing: Trigger Lambda functions for specific log patterns (e.g., error alerts)
- Centralized logging: Aggregate logs from multiple accounts into a central account
- Security analysis: Stream to security tools for threat detection
Example: Archive logs to S3
resource "aws_cloudwatch_log_subscription_filter" "log_archive" {
name = "archive-to-s3"
log_group_name = "/aws/ecs/payment-service"
filter_pattern = "" # Empty = all logs
destination_arn = aws_kinesis_firehose_delivery_stream.logs_to_s3.arn
}
resource "aws_kinesis_firehose_delivery_stream" "logs_to_s3" {
name = "logs-to-s3"
destination = "s3"
s3_configuration {
role_arn = aws_iam_role.firehose.arn
bucket_arn = aws_s3_bucket.log_archive.arn
prefix = "logs/payment-service/"
# Compress logs to save storage costs
compression_format = "GZIP"
}
}
Cost consideration: Streaming logs incurs data transfer and processing costs. For infrequently accessed logs, consider exporting directly to S3 on a schedule instead of real-time streaming.
CloudWatch Metrics
Metrics provide time-series data about system behavior. Unlike logs (which record discrete events), metrics aggregate measurements over time: "average CPU is 45%," "request rate is 1000/sec."
For general metrics patterns and design principles, see Metrics Guidelines.
Metric Namespaces and Dimensions
CloudWatch organizes metrics into namespaces (like AWS/ECS, AWS/RDS, or custom namespaces like MyCompany/Payments). Each metric has:
- Metric name: What is being measured (CPUUtilization, PaymentCount)
- Dimensions: Tags that identify the specific resource (ServiceName=payment-api, ClusterId=prod-cluster)
- Unit: Measurement unit (Percent, Count, Seconds)
- Timestamp: When the measurement occurred
- Value: The actual measurement
Dimensions enable filtering:
Namespace: MyCompany/Payments
Metric: ProcessingTime
Dimensions: {Service=payment-api, Environment=prod, PaymentType=card}
You can query: "Show me card payment processing time for prod" or aggregate: "Show me all payment processing times across all payment types."
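The dimension-based filtering described above can be sketched in a few lines of stdlib Java (the `DimensionQuery` class and `Datapoint` record are hypothetical): a query matches any datapoint whose dimension map contains all of the requested key/value pairs, so a narrower filter selects fewer datapoints and an empty filter aggregates everything.

```java
import java.util.List;
import java.util.Map;

// Sketch of how dimensions act as filters: each datapoint carries its full
// dimension map, and a query matches any datapoint whose dimensions contain
// all of the requested key/value pairs.
final class DimensionQuery {

    public record Datapoint(Map<String, String> dimensions, double value) {}

    // "Show me card payment processing time for prod" = filter on a subset.
    public static double sum(List<Datapoint> points, Map<String, String> filter) {
        return points.stream()
                .filter(p -> filter.entrySet().stream()
                        .allMatch(e -> e.getValue().equals(p.dimensions().get(e.getKey()))))
                .mapToDouble(Datapoint::value)
                .sum();
    }
}
```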
Best practices:
- Use consistent dimension names across services (Environment, Service, Region)
- Avoid high-cardinality dimension values (customer IDs as dimensions create too many unique metric combinations)
- Use CloudWatch Embedded Metric Format (EMF) for efficient metric publishing from logs
Publishing Custom Metrics
AWS services automatically publish metrics (EC2 CPU, RDS connections, ALB requests), but you'll want custom metrics for business logic.
From Application Code (AWS SDK)
Spring Boot example:
@Service
@RequiredArgsConstructor
public class PaymentService {
private final CloudWatchAsyncClient cloudWatch;
public Payment createPayment(PaymentRequest request) {
Instant startTime = Instant.now();
try {
Payment payment = processPayment(request);
// Publish success metric
publishMetric("PaymentCreated", 1.0, StandardUnit.COUNT,
Map.of("PaymentType", request.type(), "Status", "success"));
// Publish processing time
double durationMs = Duration.between(startTime, Instant.now()).toMillis();
publishMetric("PaymentProcessingTime", durationMs, StandardUnit.MILLISECONDS,
Map.of("PaymentType", request.type()));
return payment;
} catch (Exception ex) {
// Track failures separately
publishMetric("PaymentCreated", 1.0, StandardUnit.COUNT,
Map.of("PaymentType", request.type(), "Status", "failure"));
throw ex;
}
}
private void publishMetric(String metricName, double value, StandardUnit unit,
Map<String, String> dimensions) {
var dimensionList = dimensions.entrySet().stream()
.map(e -> Dimension.builder().name(e.getKey()).value(e.getValue()).build())
.toList();
var metricDatum = MetricDatum.builder()
.metricName(metricName)
.value(value)
.unit(unit)
.timestamp(Instant.now())
.dimensions(dimensionList)
.build();
var request = PutMetricDataRequest.builder()
.namespace("MyCompany/Payments") // Custom namespace
.metricData(metricDatum)
.build();
// Async call to avoid blocking business logic
cloudWatch.putMetricData(request);
}
}
Key points:
- Use the async client (CloudWatchAsyncClient) to avoid blocking application threads
- Batch multiple metrics into a single PutMetricDataRequest (up to 1000 metrics per call)
- Track both successes and failures with dimensions for error analysis
- Record timing data for performance analysis
For Spring Boot integration patterns, see Spring Boot Observability.
From Logs (Embedded Metric Format)
CloudWatch can automatically extract metrics from structured JSON logs using Embedded Metric Format (EMF). This is more cost-effective than calling PutMetricData API directly.
EMF log format:
{
"LogGroup": "/aws/ecs/payment-service",
"ServiceName": "payment-api",
"PaymentType": "card",
"ProcessingTime": 245,
"Amount": 150.00,
"_aws": {
"Timestamp": 1705320645000,
"CloudWatchMetrics": [{
"Namespace": "MyCompany/Payments",
"Dimensions": [["ServiceName", "PaymentType"]],
"Metrics": [
{"Name": "ProcessingTime", "Unit": "Milliseconds"},
{"Name": "Amount", "Unit": "None"}
]
}]
}
}
Advantages of EMF:
- Single write creates both log entry and metric (no separate API call)
- Lower cost (log ingestion pricing only, no metric API charges)
- Log and metric guaranteed to have same timestamp
- Automatic aggregation by CloudWatch
Java library for EMF:
@Service
public class PaymentService {
private final MetricsLogger metricsLogger = new MetricsLogger();
public Payment createPayment(PaymentRequest request) {
Payment payment = processPayment(request);
metricsLogger.putDimensions(DimensionSet.of(
"ServiceName", "payment-api",
"PaymentType", request.type()
));
metricsLogger.putMetric("ProcessingTime", 245, Unit.MILLISECONDS);
metricsLogger.putProperty("CustomerId", request.customerId());
metricsLogger.putProperty("Amount", request.amount());
metricsLogger.flush(); // Writes the EMF JSON to stdout
return payment;
}
}
For Lambda functions, use aws-embedded-metrics library (Node.js) or aws-embedded-metrics-java which automatically configures CloudWatch destination.
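Under the hood, flush() emits a single-line JSON envelope like the one shown earlier. As a rough illustration of what that write amounts to, here is a stdlib-only sketch that hand-rolls the envelope (the `EmfEnvelope` class name is hypothetical, and the dimension/metric layout is hardcoded to this one example):

```java
// Hand-rolled EMF envelope, mirroring what an EMF metrics logger writes on
// flush(): metric values live at the top level, and the "_aws" block tells
// CloudWatch which keys to index as metrics and which are dimensions.
final class EmfEnvelope {

    public static String build(String namespace, String serviceName, String paymentType,
                               double processingTimeMs, long epochMillis) {
        // EMF requires one JSON object per log line, so no pretty-printing here.
        return "{\"ServiceName\":\"" + serviceName + "\",\"PaymentType\":\"" + paymentType
                + "\",\"ProcessingTime\":" + processingTimeMs
                + ",\"_aws\":{\"Timestamp\":" + epochMillis
                + ",\"CloudWatchMetrics\":[{\"Namespace\":\"" + namespace + "\""
                + ",\"Dimensions\":[[\"ServiceName\",\"PaymentType\"]]"
                + ",\"Metrics\":[{\"Name\":\"ProcessingTime\",\"Unit\":\"Milliseconds\"}]}]}}";
    }
}
```

In practice, prefer the official libraries: they batch values, validate the envelope, and handle the Lambda/agent transport for you.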
Metric Math
CloudWatch supports mathematical expressions across metrics for derived calculations. This enables creating custom metrics from existing data without writing code.
Examples:
Error rate percentage:
errorRate = (errors / totalRequests) * 100
Available capacity:
availableCapacity = maxCapacity - currentUsage
Custom SLI (99th percentile latency):
sli = 1 - (p99Latency / latencyThreshold)
In CloudWatch console:
Expression: m1 / m2 * 100
m1 = SUM(Errors)
m2 = SUM(RequestCount)
Result: Error rate percentage
Use cases:
- Calculate business KPIs from multiple metrics
- Create composite alarms (alert when multiple conditions are true)
- Normalize metrics across different services
Metric Filters
Metric filters extract metric data from log events. This is useful for creating metrics from legacy applications that don't publish metrics directly.
Example: Create metric from error logs
Filter pattern: [timestamp, level=ERROR, ...]
Metric namespace: MyCompany/Payments
Metric name: ErrorCount
Metric value: 1
Every time a log line matches level=ERROR, CloudWatch increments the ErrorCount metric.
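In simplified terms, that space-delimited filter pattern splits each log line into fields and tests the second one (real CloudWatch filter patterns support richer matching than this; the `MetricFilterSketch` class name is hypothetical):

```java
import java.util.List;

// Sketch of what the filter pattern [timestamp, level=ERROR, ...] does:
// split the line on whitespace and require the second field to equal ERROR.
final class MetricFilterSketch {

    public static int errorCount(List<String> logLines) {
        int count = 0;
        for (String line : logLines) {
            String[] fields = line.trim().split("\\s+");
            if (fields.length >= 2 && "ERROR".equals(fields[1])) {
                count++; // CloudWatch would publish metric value 1 for this match
            }
        }
        return count;
    }
}
```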
When to use metric filters:
- Legacy applications that only emit logs
- Quick metrics from existing logs without code changes
- Counting occurrences of specific log patterns
When NOT to use:
- New applications (use EMF or direct API instead for better performance)
- Complex aggregations (use CloudWatch Logs Insights queries instead)
- High-volume logs (filter pattern evaluation adds cost)
CloudWatch Alarms
Alarms monitor metrics and trigger actions when thresholds are breached. They're the foundation of proactive incident response.
For general alerting principles, see Monitoring and Alerting.
Alarm Anatomy
Each alarm monitors a single metric and evaluates it against a threshold over a time period:
Alarm: HighErrorRate
Metric: MyCompany/Payments:ErrorRate
Threshold: > 5%
Evaluation period: 3 consecutive 1-minute periods
Actions: Send SNS notification to on-call team
Evaluation logic:
- CloudWatch evaluates metric every minute
- If error rate > 5% for 3 consecutive minutes, the alarm enters the ALARM state
- SNS notification triggers (email, SMS, Lambda, etc.)
- If error rate drops below 5% for 3 consecutive minutes, the alarm returns to the OK state
Alarm states:
- OK: Metric is within threshold
- ALARM: Metric breached threshold for the evaluation periods
- INSUFFICIENT_DATA: Not enough data points to evaluate (service just started, metric not published)
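The state transition can be sketched as a check over the most recent datapoints (a simplification — real alarms also handle missing data and keep their previous state on mixed windows; the `AlarmSketch` class name is hypothetical):

```java
import java.util.List;

// Sketch of alarm evaluation: ALARM only after the threshold is breached for
// `evaluationPeriods` consecutive datapoints; OK after the same number of
// non-breaching datapoints; INSUFFICIENT_DATA when there is too little history.
final class AlarmSketch {

    public enum State { OK, ALARM, INSUFFICIENT_DATA }

    public static State evaluate(List<Double> datapoints, double threshold, int evaluationPeriods) {
        if (datapoints.size() < evaluationPeriods) {
            return State.INSUFFICIENT_DATA; // not enough history yet
        }
        List<Double> window = datapoints.subList(datapoints.size() - evaluationPeriods, datapoints.size());
        if (window.stream().allMatch(v -> v > threshold)) {
            return State.ALARM;
        }
        // Mixed or fully non-breaching window: real alarms would keep their
        // previous state on a mixed window; this sketch just reports OK.
        return State.OK;
    }
}
```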
Creating Effective Alarms
Bad alarm (noisy, not actionable):
Alarm: HighCPU
Metric: EC2 CPUUtilization > 80%
Problem: CPU spikes are normal. This will alert constantly without indicating actual issues.
Good alarm (actionable, user-impacting):
Alarm: HighErrorRate
Metric: API error rate > 1% for 5 minutes
Action: Page on-call engineer
Rationale: Directly impacts users. Sustained elevation indicates systemic issue.
Alarm design principles:
- Alert on symptoms users experience (error rate, latency), not causes (CPU, memory)
- Symptoms are universal (users care about errors); causes vary by implementation
- Set thresholds based on SLAs and user impact, not arbitrary percentages
- Use longer evaluation periods to avoid flapping (brief spikes don't indicate problems)
- Different severity levels: page for critical (user-impacting), email for warning (trending toward issue)
Alarm Actions
Alarms can trigger multiple actions in different states:
Terraform example:
resource "aws_cloudwatch_metric_alarm" "high_error_rate" {
alarm_name = "payment-api-high-error-rate"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 3
metric_name = "ErrorRate"
namespace = "MyCompany/Payments"
period = 60 # 1 minute
statistic = "Average"
threshold = 5.0 # 5% error rate
treat_missing_data = "notBreaching" # Don't alarm if no data
dimensions = {
ServiceName = "payment-api"
Environment = "production"
}
# Actions when entering ALARM state
alarm_actions = [
aws_sns_topic.pagerduty_critical.arn, # Page on-call
aws_sns_topic.slack_alerts.arn # Post to Slack
]
# Actions when returning to OK state
ok_actions = [
aws_sns_topic.slack_alerts.arn # Notify resolution
]
# Actions when data is insufficient
insufficient_data_actions = [] # Don't alert on missing data
}
Common actions:
- SNS topic: Notification (email, SMS, HTTP endpoint, Lambda)
- Auto Scaling action: Scale EC2/ECS capacity
- EC2 action: Stop, terminate, reboot instance
- Systems Manager action: Run automation document
Actions can also be chained for auto-remediation: for example, an ALARM action publishes to an SNS topic that triggers a Lambda function, which in turn restarts a task or runs a Systems Manager automation document.
Composite Alarms
Composite alarms combine multiple alarms with AND/OR logic. This reduces alert noise by requiring multiple conditions simultaneously.
Example: Alert only if BOTH error rate is high AND latency is high
resource "aws_cloudwatch_composite_alarm" "service_degraded" {
alarm_name = "payment-api-degraded"
alarm_description = "Service is experiencing both high errors and high latency"
alarm_rule = "ALARM(${aws_cloudwatch_metric_alarm.high_error_rate.alarm_name}) AND ALARM(${aws_cloudwatch_metric_alarm.high_latency.alarm_name})"
actions_enabled = true
alarm_actions = [aws_sns_topic.pagerduty_critical.arn]
}
Use cases:
- Reduce false positives (alert only when multiple symptoms confirm an issue)
- Create severity levels (warning if one condition, critical if multiple)
- Correlated failures (alert if issue affects multiple services)
Anomaly Detection
CloudWatch can use machine learning to detect anomalies in metric patterns without setting static thresholds.
How it works:
- CloudWatch analyzes 2+ weeks of metric history
- Builds model of normal behavior (daily patterns, weekly cycles, trends)
- Creates dynamic thresholds (bands) around expected values
- Alerts when metric deviates from band
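CloudWatch's actual model is ML-based and accounts for seasonality, but the core band idea can be illustrated with a simple mean plus/minus k standard deviations over recent history (the `AnomalyBandSketch` class name is hypothetical):

```java
import java.util.List;

// Sketch of an anomaly band: expected value +/- k standard deviations,
// computed from recent history; a point outside the band is anomalous.
final class AnomalyBandSketch {

    public static boolean isAnomalous(List<Double> history, double current, double k) {
        double mean = history.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
        double variance = history.stream()
                .mapToDouble(v -> (v - mean) * (v - mean))
                .average().orElse(0.0);
        double stddev = Math.sqrt(variance);
        // Outside [mean - k*stddev, mean + k*stddev] counts as an anomaly.
        return current < mean - k * stddev || current > mean + k * stddev;
    }
}
```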
When to use anomaly detection:
- Metrics with daily/weekly patterns (traffic during business hours)
- Gradual growth trends (hard to set static threshold)
- Seasonal patterns (holiday traffic spikes)
When NOT to use:
- New services (not enough history)
- Highly variable metrics (too many false positives)
- Critical thresholds defined by SLA (use static thresholds for SLAs)
Creating anomaly alarm:
resource "aws_cloudwatch_metric_alarm" "anomaly_detection" {
alarm_name = "payment-api-request-anomaly"
comparison_operator = "LessThanLowerOrGreaterThanUpperThreshold"
evaluation_periods = 2
threshold_metric_id = "anomaly"
metric_query {
id = "traffic"
return_data = true
metric {
metric_name = "RequestCount"
namespace = "AWS/ApplicationELB"
period = 300
stat = "Sum"
dimensions = {
LoadBalancer = "app/payment-api/abc123"
}
}
}
metric_query {
id = "anomaly"
expression = "ANOMALY_DETECTION_BAND(traffic, 2)" # 2 standard deviations
label = "RequestCount (expected)"
return_data = true
}
alarm_actions = [aws_sns_topic.alerts.arn]
}
AWS X-Ray
X-Ray provides distributed tracing for AWS applications. It visualizes request flow across services, identifies bottlenecks, and helps debug latency issues.
For foundational distributed tracing concepts, see Tracing Guidelines.
How X-Ray Works
When a request enters your application:
- Entry point creates trace: API Gateway, ALB, or application generates a trace ID
- Trace context propagates: X-Ray SDK adds trace ID to outbound calls (HTTP headers, message attributes)
- Services create segments: Each service creates a segment (unit of work) associated with the trace
- Subsegments track details: Segments contain subsegments for database calls, external APIs, etc.
- X-Ray assembles service map: X-Ray reconstructs the complete request flow
Service map visualization:
- Circles = services
- Lines = calls between services
- Color = health (green = healthy, orange = high latency, red = errors)
- Thickness = traffic volume
Instrumenting Applications
Lambda Functions
Lambda has built-in X-Ray support. Enable via function configuration:
Terraform:
resource "aws_lambda_function" "payment_processor" {
function_name = "payment-processor"
runtime = "java17"
handler = "com.example.PaymentHandler"
tracing_config {
mode = "Active" # Enable X-Ray tracing
}
}
No code changes required for basic tracing. AWS SDK calls, HTTP requests, and SQL queries are automatically traced.
Custom subsegments (Java):
public class PaymentHandler implements RequestHandler<APIGatewayProxyRequestEvent, APIGatewayProxyResponseEvent> {
public APIGatewayProxyResponseEvent handleRequest(APIGatewayProxyRequestEvent event, Context context) {
// Create custom subsegment for business logic
Subsegment subsegment = AWSXRay.beginSubsegment("processPayment");
try {
subsegment.putMetadata("customerId", event.getQueryStringParameters().get("customerId"));
subsegment.putAnnotation("paymentType", "card"); // Indexed for filtering
Payment payment = processPayment(event);
subsegment.putMetadata("paymentId", payment.getId());
return new APIGatewayProxyResponseEvent()
.withStatusCode(200)
.withBody(toJson(payment));
} catch (Exception ex) {
subsegment.addException(ex);
throw ex;
} finally {
AWSXRay.endSubsegment();
}
}
}
Metadata vs Annotations:
- Annotations: Indexed and filterable, up to 50 per segment (use for filtering traces: paymentType, userId)
- Metadata: Not indexed, unlimited size, any data structure (use for debugging details: full request/response)
ECS/Fargate
ECS tasks require the X-Ray daemon sidecar container to send trace data to X-Ray service.
Task definition with X-Ray daemon:
{
"family": "payment-service",
"containerDefinitions": [
{
"name": "payment-api",
"image": "payment-api:latest",
"environment": [
{"name": "AWS_XRAY_DAEMON_ADDRESS", "value": "xray-daemon:2000"}
]
},
{
"name": "xray-daemon",
"image": "public.ecr.aws/xray/aws-xray-daemon:latest",
"portMappings": [
{"containerPort": 2000, "protocol": "udp"}
],
"environment": [
{"name": "AWS_REGION", "value": "us-east-1"}
]
}
]
}
Application code (Spring Boot):
// build.gradle
dependencies {
implementation 'com.amazonaws:aws-xray-recorder-sdk-spring:2.15.0'
}
// Application configuration
@Configuration
@EnableAspectJAutoProxy
public class XRayConfig {
@Bean
public Filter tracingFilter() {
return new AWSXRayServletFilter("payment-service");
}
@Bean
public TracingInterceptor tracingInterceptor() {
return new TracingInterceptor();
}
}
Automatic instrumentation:
- HTTP requests (incoming/outgoing)
- AWS SDK calls (S3, DynamoDB, SQS, etc.)
- SQL queries (JDBC)
For detailed Spring Boot integration, see Spring Boot Observability.
API Gateway
API Gateway automatically creates X-Ray traces when tracing is enabled:
Enable via Terraform:
resource "aws_api_gateway_stage" "prod" {
stage_name = "prod"
rest_api_id = aws_api_gateway_rest_api.api.id
deployment_id = aws_api_gateway_deployment.deployment.id
xray_tracing_enabled = true
}
API Gateway propagates trace context to downstream Lambda functions or HTTP endpoints via X-Amzn-Trace-Id header.
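The header value is a semicolon-delimited list of key=value pairs (e.g. Root=1-5759e988-bd862e3fe1be46a994272793;Parent=53995c3f42cd8ad8;Sampled=1), which the X-Ray SDK parses for you; this stdlib sketch (hypothetical `TraceHeader` class) shows the shape of that parsing:

```java
import java.util.HashMap;
import java.util.Map;

// Parse an X-Amzn-Trace-Id header into its parts (Root, Parent, Sampled).
final class TraceHeader {

    public static Map<String, String> parse(String header) {
        Map<String, String> parts = new HashMap<>();
        for (String pair : header.split(";")) {
            String[] kv = pair.trim().split("=", 2); // value may itself contain '='
            if (kv.length == 2) {
                parts.put(kv[0], kv[1]);
            }
        }
        return parts;
    }
}
```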
Sampling Rules
Tracing every request is expensive and generates massive data volume. Sampling reduces cost while maintaining visibility.
Default sampling rule:
- 1 request per second: Always trace at least 1 req/sec per service
- 5% of additional requests: Randomly sample 5% of traffic above 1/sec
Why this works:
- Guarantees some traces even during low traffic
- Reduces volume during high traffic (tracing 5% of 10,000 req/sec = 500 traces/sec is plenty)
Custom sampling rules:
{
"version": 2,
"rules": [
{
"description": "Trace all errors",
"http_method": "*",
"url_path": "*",
"fixed_target": 0,
"rate": 1.0,
"service_name": "*",
"service_type": "*",
"resource_arn": "*",
"priority": 100,
"attributes": {
"http.status_code": "5*" # Match all 5xx errors
}
},
{
"description": "Trace 10% of checkout flow",
"http_method": "*",
"url_path": "/api/checkout",
"fixed_target": 1,
"rate": 0.1, # 10%
"priority": 200
},
{
"description": "Trace 1% of everything else",
"http_method": "*",
"url_path": "*",
"fixed_target": 1,
"rate": 0.01, # 1%
"priority": 1000
}
],
"default": {
"fixed_target": 1,
"rate": 0.01
}
}
Rule priority:
- Lower number = higher priority
- First matching rule wins
- Always trace errors (100% sampling) for debugging
- Higher sampling for critical paths (checkout: 10%)
- Lower sampling for high-volume, low-value paths (health checks: 0%)
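The reservoir-plus-rate behavior behind fixed_target and rate can be sketched as follows (a single-node simplification — the real X-Ray SDK coordinates reservoir quotas with the service; the `SamplerSketch` class name is hypothetical):

```java
import java.util.concurrent.ThreadLocalRandom;

// Sketch of X-Ray's reservoir-plus-rate sampling: the first `fixedTarget`
// requests in each one-second window are always traced, and the remainder
// are traced with probability `rate`.
final class SamplerSketch {

    private final int fixedTarget;
    private final double rate;
    private long currentSecond = -1;
    private int takenThisSecond = 0;

    public SamplerSketch(int fixedTarget, double rate) {
        this.fixedTarget = fixedTarget;
        this.rate = rate;
    }

    public synchronized boolean shouldTrace(long epochMillis) {
        long second = epochMillis / 1000;
        if (second != currentSecond) {
            currentSecond = second;
            takenThisSecond = 0; // new reservoir each second
        }
        if (takenThisSecond < fixedTarget) {
            takenThisSecond++;
            return true; // reservoir guarantees a baseline of traces
        }
        return ThreadLocalRandom.current().nextDouble() < rate;
    }
}
```

This is why low-traffic services still get traces (the reservoir fires) while high-traffic services stay cheap (everything past the reservoir is rate-sampled).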
Analyzing Traces
Finding slow requests:
- Go to X-Ray console → Traces
- Filter: responsetime > 2 (traces taking over 2 seconds)
- Select a trace to see the waterfall visualization
- Identify longest segment/subsegment
Finding errors:
filter: error = true AND http.status = "500"
Finding specific user requests:
annotation.userId = "user-12345"
Trace map use cases:
- Identify service dependencies (what does my service call?)
- Find highest latency services (color-coded by response time)
- Detect cascading failures (errors propagating downstream)
- Understand traffic patterns (line thickness shows volume)
X-Ray Cost Optimization
X-Ray charges per trace recorded and scanned:
- Recording: $5 per 1 million traces
- Scanning: $0.50 per 1 million traces scanned (queries)
Cost reduction strategies:
- Lower sampling rates: 1% sampling on 1B requests/month = $50/month (vs $5000 at 100%)
- Sample critical paths more: 10% on checkout, 0.1% on health checks
- Trace errors heavily: 100% of errors, 1% of successes
- Short retention: 30 days default (vs CloudWatch Logs which you control)
Container Insights
Container Insights provides metrics and logs for ECS, EKS, and Kubernetes clusters. It automatically collects, aggregates, and visualizes performance data.
ECS Container Insights
Enable at cluster or task level to collect:
- Task-level metrics: CPU, memory, network, disk I/O per task
- Service-level metrics: Aggregated across all tasks in a service
- Container-level metrics: Per-container granularity
Enable via Terraform:
resource "aws_ecs_cluster" "main" {
name = "production-cluster"
setting {
name = "containerInsights"
value = "enabled"
}
}
Metrics available:
- CpuUtilized, CpuReserved: Task/container CPU usage
- MemoryUtilized, MemoryReserved: Memory usage
- NetworkRxBytes, NetworkTxBytes: Network traffic
- StorageReadBytes, StorageWriteBytes: Disk I/O
Use cases:
- Right-size task definitions (are you over-allocating CPU/memory?)
- Identify memory leaks (gradual memory increase over time)
- Network bottlenecks (high network utilization)
- Cost optimization (reduce reserved CPU/memory if underutilized)
EKS Container Insights
For EKS, Container Insights requires deploying CloudWatch agent and Fluent Bit as DaemonSets.
Install via Helm:
helm repo add eks https://aws.github.io/eks-charts
helm install aws-cloudwatch-metrics eks/aws-cloudwatch-metrics \
--namespace amazon-cloudwatch \
--set clusterName=my-cluster
Metrics collected:
- Cluster-level: Node count, pod count, CPU/memory across cluster
- Namespace-level: Resources used per namespace
- Pod-level: CPU, memory, network per pod
- Container-level: Resource usage per container
Performance dashboard: Container Insights automatically creates CloudWatch dashboards showing:
- Cluster resource utilization over time
- Top pods by CPU/memory
- Pod/container failures
- Node resource allocation vs usage
For EKS-specific details, see AWS EKS. For general Kubernetes patterns, see Kubernetes Guidelines.
Log Aggregation
Container Insights aggregates logs from all containers into CloudWatch Logs:
Log structure:
Log Group: /aws/containerinsights/{cluster-name}/application
├─ Log Stream: pod-{namespace}_{pod-name}_{container-name}
└─ Log format: JSON with Kubernetes metadata
Kubernetes metadata added automatically:
{
"log": "Payment created for customer 12345",
"stream": "stdout",
"time": "2025-01-15T10:30:45.123Z",
"kubernetes": {
"pod_name": "payment-api-abc123",
"namespace_name": "production",
"pod_id": "xyz-789",
"labels": {
"app": "payment-api",
"version": "v2.1.0"
},
"container_name": "payment-api",
"docker_id": "docker://123abc"
}
}
Query logs by pod label:
fields @timestamp, log, kubernetes.pod_name
| filter kubernetes.labels.app = "payment-api"
| filter kubernetes.labels.version = "v2.1.0"
| sort @timestamp desc
This enables filtering by deployment version, finding logs for specific releases, or aggregating across all pods of a service.
Integration with Existing Observability Stack
Correlation IDs Across AWS Services
Maintain correlation throughout the request lifecycle:
API Gateway → Lambda:
// Lambda receives correlation ID from API Gateway request
import { SQSClient, SendMessageCommand } from '@aws-sdk/client-sqs';
import { randomUUID } from 'crypto';

const sqsClient = new SQSClient({});
const QUEUE_URL = process.env.QUEUE_URL;

export const handler = async (event, context) => {
const correlationId = event.headers['X-Correlation-ID'] || randomUUID();
// Add to all logs
console.log(JSON.stringify({
message: 'Processing payment',
correlationId,
customerId: event.pathParameters.customerId
}));
// Pass to downstream services
await sqsClient.send(new SendMessageCommand({
QueueUrl: QUEUE_URL,
MessageBody: JSON.stringify({...}),
MessageAttributes: {
CorrelationId: {
DataType: 'String',
StringValue: correlationId
}
}
}));
};
SQS → Lambda:
export const handler = async (event, context) => {
for (const record of event.Records) {
// Extract correlation ID from message attributes
const correlationId = record.messageAttributes.CorrelationId?.stringValue;
console.log(JSON.stringify({
message: 'Processing SQS message',
correlationId,
messageId: record.messageId
}));
}
};
Spring Boot with AWS SDK:
@Component
@RequiredArgsConstructor
public class SqsPublisher {
private final SqsClient sqsClient;
private final String queueUrl;
public void publishEvent(PaymentEvent event) {
// Get correlation ID from MDC (set by filter)
String correlationId = MDC.get("correlationId");
sqsClient.sendMessage(SendMessageRequest.builder()
.queueUrl(queueUrl)
.messageBody(toJson(event))
.messageAttributes(Map.of(
"CorrelationId", MessageAttributeValue.builder()
.dataType("String")
.stringValue(correlationId)
.build()
))
.build());
}
}
For more on correlation ID patterns, see Logging Guidelines.
Cross-Account Observability
In multi-account architectures, centralize observability data for unified visibility:
Pattern: Central logging account
Setup:
- Create central logging account with S3 bucket and Kinesis stream
- Configure subscription filters in each workload account
- Grant cross-account permissions (IAM roles)
- Logs from all accounts aggregate in central location
Benefits:
- Unified search across all environments
- Compliance and audit (logs in separate account from workloads)
- Cost optimization (single S3 bucket with lifecycle policies)
- Security (logs immutable in separate account)
CloudWatch cross-account dashboard: CloudWatch supports cross-account dashboards to visualize metrics from multiple accounts:
resource "aws_cloudwatch_dashboard" "multi_account" {
dashboard_name = "cross-account-overview"
dashboard_body = jsonencode({
widgets = [
{
type = "metric"
properties = {
metrics = [
["AWS/ECS", "CPUUtilization", {region: "us-east-1", accountId: "111111111111"}],
["AWS/ECS", "CPUUtilization", {region: "us-east-1", accountId: "222222222222"}]
]
title = "ECS CPU Across Accounts"
}
}
]
})
}
Third-Party Integration
CloudWatch integrates with third-party observability platforms for enhanced visualization and analysis:
Prometheus:
- Use CloudWatch Exporter to scrape CloudWatch metrics into Prometheus
- Query with PromQL, visualize in Grafana
- Combine AWS metrics with application metrics
Datadog/New Relic:
- Install agent on EC2/ECS for deeper instrumentation
- Forward CloudWatch Logs to platform via Lambda forwarder
- Unified dashboard with AWS metrics + APM data
Grafana:
- Native CloudWatch data source
- Query CloudWatch Logs Insights directly from Grafana
- Combine with Prometheus, Jaeger for unified view
See Observability Overview for multi-tool integration patterns.
Best Practices
Logging Best Practices
- Use structured JSON logs: Enable efficient CloudWatch Logs Insights queries
- Add correlation IDs: Track requests across all services and accounts
- Set appropriate retention: 30 days for operational logs, longer for audit logs
- Never log sensitive data: PII, passwords, API keys, credit card numbers
- Use log levels correctly: ERROR for user-impacting failures, WARN for recoverable issues, INFO for business events
- Export to S3 for archival: 10x cheaper than CloudWatch Logs long-term retention
Metrics Best Practices
- Track business metrics: Payment success rate, checkout completion time (not just infrastructure metrics)
- Use consistent dimensions: Environment, Service, Region across all services
- Publish metrics asynchronously: Don't block business logic waiting for the CloudWatch API
- Use EMF for cost efficiency: Embedded Metric Format extracts metrics from logs without API calls
- Avoid high-cardinality dimensions: Don't use customer IDs or request IDs as dimensions (millions of unique values)
Tracing Best Practices
- Sample intelligently: 100% of errors, 10% of critical paths, 1% of normal traffic
- Add business context: Annotate traces with customer ID, payment type, order ID for filtering
- Trace external dependencies: Ensure all HTTP, database, and message queue calls are captured
- Set meaningful subsegment names: processPayment, not method1
- Always end spans: Use try-finally blocks to prevent memory leaks
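The tiered sampling rule above (100% of errors, 10% of critical paths, 1% of normal traffic) can be sketched as a simple decision function. The route names and rates are illustrative assumptions; in practice X-Ray sampling rules can express the same policy centrally.

```python
import random

# Illustrative rates and route names, matching the tiers described above.
ERROR_RATE, CRITICAL_RATE, DEFAULT_RATE = 1.0, 0.10, 0.01
CRITICAL_PATHS = {"/checkout", "/payments"}  # hypothetical critical routes

def should_sample(path, is_error, rng=random.random):
    """Decide whether to record a trace for this request."""
    if is_error:
        return True  # never drop error traces: they are the ones you debug
    rate = CRITICAL_RATE if path in CRITICAL_PATHS else DEFAULT_RATE
    return rng() < rate
```

Injecting `rng` keeps the decision deterministic in tests; production code just uses the default `random.random`.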
Alarming Best Practices
- Alert on symptoms, not causes: Error rate (symptom) not CPU usage (cause)
- Set thresholds based on SLAs: If the SLA is 99.9% availability, alert when availability drops below 99.95% to catch degradation before the SLA is breached
- Use composite alarms: Reduce noise by requiring multiple symptoms
- Different severity levels: Page for critical user-impacting issues, email for warnings
- Test your alarms: Trigger intentionally to verify on-call gets notified
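The composite-alarm pattern above can be expressed in Terraform. A minimal sketch, assuming two child alarms (`payment-error-rate-high`, `payment-latency-high`) and an SNS topic resource named `pagerduty` already exist elsewhere in the configuration:

```hcl
# Page only when both symptoms fire together, reducing noise
# from either alarm flapping on its own.
resource "aws_cloudwatch_composite_alarm" "user_impact" {
  alarm_name = "payment-service-user-impact"
  alarm_rule = "ALARM(payment-error-rate-high) AND ALARM(payment-latency-high)"

  alarm_actions = [aws_sns_topic.pagerduty.arn]
}
```

The `alarm_rule` expression supports AND/OR/NOT over child alarm states, so severity tiers (page vs email) can be built from the same underlying alarms.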
Cost Optimization
- Short retention for verbose logs: 3-7 days for debug logs, longer for audit logs
- Sample traces: 1% sampling reduces costs by 99% with minimal visibility loss
- Use metric filters sparingly: High-volume log parsing is expensive
- Archive to S3: CloudWatch Logs charges $0.50/GB for ingestion plus roughly $0.03/GB/month for storage; S3 Standard storage is about $0.023/GB/month, and Glacier tiers are cheaper still
- Aggregation at source: Use EMF to create metrics from logs instead of storing verbose logs
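The retention guidance above maps directly to Terraform. A minimal sketch with illustrative log group names and retention periods:

```hcl
# Verbose operational logs expire quickly to control storage cost.
resource "aws_cloudwatch_log_group" "app_debug" {
  name              = "/aws/ecs/payment-service-debug"
  retention_in_days = 7
}

# Audit logs stay longer in CloudWatch before (or alongside) S3 archival.
resource "aws_cloudwatch_log_group" "audit" {
  name              = "/aws/ecs/payment-service-audit"
  retention_in_days = 365
}
```

Setting `retention_in_days` explicitly also prevents the default of infinite retention, the cost trap called out in the anti-patterns below.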
Anti-Patterns
Logging Anti-Patterns
- Plain text logs: Difficult to query and aggregate
- No correlation IDs: Impossible to trace requests across services
- Logging PII: Compliance violations (GDPR, PCI-DSS)
- Infinite retention: Logs stored forever cost thousands per month
- Excessive logging: Logging every method call generates noise and cost
Metrics Anti-Patterns
- Only infrastructure metrics: CPU and memory don't tell you if users are happy
- High-cardinality dimensions: Using customer ID as dimension creates millions of metric combinations
- Synchronous publishing: Blocking business logic waiting for CloudWatch API
- No aggregation: Sending individual events instead of aggregated metrics
Tracing Anti-Patterns
- 100% sampling: Expensive and unnecessary (1% is often sufficient)
- No sampling of errors: Errors should always be traced for debugging
- Missing propagation: Trace context not passed to downstream services (breaks distributed trace)
- Trace ID as correlation ID: Use separate correlation ID for logs (traces are sampled, logs aren't)
Alarming Anti-Patterns
- Alert on everything: Too many alarms = alert fatigue = ignored alarms
- No alarm runbooks: On-call doesn't know what to do when alarm fires
- Static thresholds on variable metrics: Daily traffic patterns trigger false alarms at night
- Alerting on causes: CPU high (cause) instead of latency high (symptom)
Summary
AWS provides a comprehensive observability stack:
CloudWatch Logs:
- Centralized logging with JSON for structured data
- CloudWatch Logs Insights for powerful querying
- Subscription filters for real-time streaming and archival
CloudWatch Metrics:
- Time-series data for trends and alerting
- Custom metrics via SDK or Embedded Metric Format
- Metric math for derived calculations
CloudWatch Alarms:
- Threshold-based alerting with SNS actions
- Composite alarms for reduced noise
- Anomaly detection for dynamic thresholds
AWS X-Ray:
- Distributed tracing across services
- Service map visualization
- Custom segments for business logic
Container Insights:
- ECS/EKS performance metrics
- Pod/container-level visibility
- Automatic log aggregation with Kubernetes metadata
Key Practices:
- Use correlation IDs across all services
- Structured JSON logs for queryability
- Sample traces intelligently (errors + random sample)
- Alert on user-impacting symptoms
- Cost-optimize with retention policies and archival
Cross-References:
- Observability Overview - Foundational concepts
- Logging Guidelines - General logging patterns
- Metrics Guidelines - Metric design principles
- Tracing Guidelines - Distributed tracing patterns
- Monitoring and Alerting - Alerting best practices
- Spring Boot Observability - Spring Boot implementation
- AWS EKS - EKS Container Insights
- AWS Lambda - Lambda tracing
- Terraform Guidelines - Infrastructure as code
Further Reading: