Application Metrics Best Practices
Overview
Metrics are the second pillar of observability, providing real-time quantitative data about system behavior. While logs give you detailed narratives of specific events and traces show request flows, metrics answer questions like "How many?" and "How fast?" through aggregated time-series data.
Metrics enable you to:
- Detect anomalies: Spot unusual patterns (error rate spikes, latency increases) before they cause outages
- Validate SLAs: Measure actual performance against service level objectives
- Trigger alerts: Automatically notify teams when thresholds are breached
- Inform capacity planning: Understand resource usage trends over time
- Provide business insights: Track domain-specific KPIs (orders per second, conversion rates)
This guide covers metrics fundamentals, the Four Golden Signals framework, implementation with Micrometer and Prometheus, and best practices for dashboards and alerting.
Metrics vs. Logs vs. Traces
Understanding when to use each observability pillar is crucial:
| Aspect | Metrics | Logs | Traces |
|---|---|---|---|
| Purpose | Quantitative aggregates | Detailed event narrative | Request flow visualization |
| Question answered | "How many? How fast?" | "What happened and why?" | "Where did the request go?" |
| Data structure | Time-series numbers | Text/JSON events | Directed acyclic graph of spans |
| Volume | Low (aggregated) | High (per event) | Medium (sampled) |
| Storage | Time-series database (Prometheus) | Search engine (Elasticsearch) | Trace store (Jaeger) |
| Example | "Error rate: 5% (500/10000)" | "Payment PAY-123 failed: insufficient balance" | "Request took 2.5s: 1s in API, 1.5s in DB" |
When to use each:
- Metrics: Continuous monitoring, alerting on thresholds, trend analysis
- Logs: Investigating specific errors, debugging, audit trails
- Traces: Understanding latency distribution across services, identifying bottlenecks
In practice, you need all three working together - see the summary for how they complement each other.
Core Principles
- Four Golden Signals: Latency, Traffic, Errors, Saturation - the foundation of effective monitoring
- Business Metrics: Track domain-specific events (orders placed, checkouts completed, API calls)
- Technical Metrics: Monitor infrastructure (JVM, HTTP, database, cache, message queues)
- High Cardinality: Avoid unbounded tag values (no user IDs, transaction IDs) to prevent memory issues
- Consistent Naming: Follow conventions (metric.name.unit) for discoverability
- Percentiles Over Averages: P95/P99 reveal outliers that averages hide
Metric Types
Metrics libraries provide different types optimized for specific measurement patterns. Understanding which type to use is fundamental to effective metrics.
Counter
A counter is a cumulative metric that only increases (or resets to zero on restart). Use counters for things you want to count: requests received, errors encountered, tasks completed.
Characteristics:
- Monotonically increasing (never decreases)
- Resets to 0 on application restart
- Query using rate() or increase() to see change over time
When to use:
- Total requests processed
- Total errors occurred
- Items created/completed
- Bytes transferred
Example queries:
# Requests per second over last 5 minutes
rate(http_requests_total[5m])
# Total errors in last hour
increase(errors_total[1h])
Why not track rates directly? Counters are more resilient than tracking rates. If metric scraping fails for a period, counters can reconstruct the rate from the delta, whereas a rate metric would show gaps.
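The delta-based reconstruction can be sketched in plain Java. This is illustrative only; the names are not from any metrics library, and real backends like Prometheus apply the same idea server-side:

```java
// Illustrative only: reconstructing a per-second rate from two counter
// samples, tolerating the reset-to-zero that happens on restart.
public class CounterRate {

    // Returns events per second between two scrapes.
    static double rate(double previous, double current, double intervalSeconds) {
        // A counter never decreases; if it did, the process restarted and
        // the counter reset to zero, so the current value IS the delta.
        double delta = current >= previous ? current - previous : current;
        return delta / intervalSeconds;
    }

    public static void main(String[] args) {
        System.out.println(rate(1000, 1300, 60)); // steady state: 5.0 req/s
        System.out.println(rate(1000, 120, 60));  // after a restart: 2.0 req/s
    }
}
```

A gauge that directly stored "requests per second" would simply show a gap (or a wrong value) across the restart; the counter keeps enough information to recover.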
Gauge
A gauge represents a single numerical value that can arbitrarily go up or down. Think of it like a thermometer reading.
Characteristics:
- Can increase or decrease
- Represents current state at measurement time
- Directly queryable (no rate calculation needed)
When to use:
- Current memory usage
- Active database connections
- Queue depth
- Number of in-progress tasks
- Temperature, CPU percentage
Example queries:
# Current memory usage
jvm_memory_used_bytes
# Connection pool utilization
db_connections_active / db_connections_max * 100
Anti-pattern: Don't use gauges for things that always increase (use Counter instead). Gauges can miss spikes between scrapes.
Timer
A timer measures both the duration of events and their frequency. Timers are actually a combination of multiple metrics: count, sum, and histograms.
Characteristics:
- Records duration of operations
- Tracks call count (how many times operation called)
- Produces percentiles (P50, P95, P99)
- Can calculate rate and average duration
When to use:
- HTTP request latency
- Database query duration
- Method execution time
- External API call time
What you get from a timer:
- timer_count: Total number of calls (acts like a counter)
- timer_sum: Total time spent across all calls
- timer_max: Maximum recorded duration
- timer_bucket: Histogram buckets for percentile calculation
Example queries:
# Average latency over last 5 minutes
rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])
# 95th percentile latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# Requests per second
rate(http_request_duration_seconds_count[5m])
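Under the hood, histogram_quantile() estimates a percentile by locating the bucket that contains the target rank and interpolating linearly inside it. A rough sketch of that idea, with made-up bucket data:

```java
// Illustrative only: percentile estimation from cumulative histogram
// buckets, the same basic approach Prometheus uses on _bucket series.
public class HistogramQuantile {

    // upperBounds sorted ascending; counts are cumulative per bucket.
    static double quantile(double q, double[] upperBounds, double[] cumulativeCounts) {
        double total = cumulativeCounts[cumulativeCounts.length - 1];
        double rank = q * total;
        for (int i = 0; i < upperBounds.length; i++) {
            if (cumulativeCounts[i] >= rank) {
                double lower = i == 0 ? 0.0 : upperBounds[i - 1];
                double before = i == 0 ? 0.0 : cumulativeCounts[i - 1];
                double inBucket = cumulativeCounts[i] - before;
                // Assume observations are spread evenly within the bucket.
                return lower + (upperBounds[i] - lower) * ((rank - before) / inBucket);
            }
        }
        return upperBounds[upperBounds.length - 1];
    }

    public static void main(String[] args) {
        double[] le = {0.1, 0.5, 1.0, 5.0};  // bucket upper bounds, in seconds
        double[] counts = {50, 90, 98, 100}; // cumulative observation counts
        // P95 falls in the (0.5, 1.0] bucket: 0.5 + 0.5 * (95 - 90) / 8 = 0.8125
        System.out.println(quantile(0.95, le, counts));
    }
}
```

This is also why bucket boundaries matter: the estimate can only be as precise as the bucket the rank lands in.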
Distribution Summary
Similar to timers but for tracking the distribution of non-time values.
Characteristics:
- Records distribution of values
- Tracks count and sum
- Produces percentiles
- Does not measure time
When to use:
- Request/response payload sizes
- Order amounts
- Batch sizes
- Any measured value where distribution matters
Example:
DistributionSummary.builder("order.amount")
.baseUnit("dollars")
.register(registry)
.record(orderAmount);
Difference from Timer: Timer is specifically for durations (nanoseconds), while DistributionSummary is for arbitrary value distributions.
The Four Golden Signals
Google's SRE book defines four key metrics that effectively monitor any system. Focus on these before adding custom metrics.
1. Latency
Definition: Time it takes to service a request.
Why it matters: Latency directly impacts user experience. A 10ms increase in latency might drop conversion rates.
What to measure:
- HTTP request duration (P50, P95, P99)
- Database query time
- External API call duration
- Message queue processing time
Key insight: Always use percentiles (P95, P99) not averages. An average latency of 100ms might hide that 5% of requests take 5 seconds.
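A tiny worked example makes the point concrete. With 95 fast requests and 5 slow ones, the average looks acceptable while P99 exposes the five-second tail (the percentile convention below is nearest-rank, one of several):

```java
import java.util.Arrays;

// Illustrative only: averages hide the tail that percentiles reveal.
public class TailLatency {

    static double average(double[] values) {
        return Arrays.stream(values).average().orElse(0.0);
    }

    // Nearest-rank percentile on a sorted array (one common convention).
    static double percentile(double[] sorted, double p) {
        int rank = (int) Math.ceil(p * sorted.length) - 1;
        return sorted[Math.max(0, rank)];
    }

    public static void main(String[] args) {
        double[] latencies = new double[100];
        Arrays.fill(latencies, 0, 95, 0.05);  // 95 requests at 50 ms
        Arrays.fill(latencies, 95, 100, 5.0); // 5 requests at 5 s
        Arrays.sort(latencies);
        System.out.println(average(latencies));          // 0.2975 s: looks fine
        System.out.println(percentile(latencies, 0.99)); // 5.0 s: the real story
    }
}
```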
2. Traffic
Definition: How much demand is being placed on your system.
Why it matters: Understanding traffic patterns helps capacity planning and identifying unusual load.
What to measure:
- HTTP requests per second
- Database queries per second
- Messages published/consumed per second
- Active WebSocket connections
Key insight: Monitor traffic by endpoint/operation to identify hotspots.
3. Errors
Definition: Rate of failed requests.
Why it matters: Errors directly impact users and indicate system health problems.
What to measure:
- HTTP 5xx error rate
- Exception count by type
- Failed database transactions
- Circuit breaker trips
Key insight: Track error rate (errors per second) and error ratio (errors / total requests) - they tell different stories.
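The difference between the two is easy to see with numbers. The same service at two times of day (figures are made up for illustration):

```java
// Illustrative only: error rate (errors/second) and error ratio
// (errors/total) can point in opposite directions depending on traffic.
public class ErrorSignals {

    static double errorRate(double errors, double windowSeconds) {
        return errors / windowSeconds; // errors per second
    }

    static double errorRatio(double errors, double totalRequests) {
        return totalRequests == 0 ? 0.0 : errors / totalRequests;
    }

    public static void main(String[] args) {
        // Quiet period: tiny absolute rate, but half of all requests fail.
        System.out.println(errorRate(120, 60));   // 2.0 err/s
        System.out.println(errorRatio(120, 240)); // 0.5 (50%)
        // Peak period: much higher rate, but only 0.5% of requests fail.
        System.out.println(errorRate(3000, 60));      // 50.0 err/s
        System.out.println(errorRatio(3000, 600000)); // 0.005 (0.5%)
    }
}
```

Alerting only on the absolute rate would miss the quiet-period outage; alerting only on the ratio can fire on statistically meaningless blips at very low traffic.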
4. Saturation
Definition: How "full" your service is (resource utilization).
Why it matters: Saturation predicts future problems. At 90% CPU, you're close to degraded performance.
What to measure:
- CPU utilization
- Memory usage (heap, non-heap)
- Database connection pool usage
- Disk I/O, Network bandwidth
- Thread pool utilization
Key insight: Set alerts before hitting 100%. Alert at 80-85% to give time to scale.
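As a sketch, a saturation check with headroom might classify utilization like this (the thresholds and names here are hypothetical, not a standard):

```java
// Illustrative only: warn well before 100% so there is time to scale.
public class SaturationCheck {

    enum Level { OK, WARN, CRITICAL }

    static Level classify(int active, int max) {
        double utilization = (double) active / max;
        if (utilization >= 0.95) return Level.CRITICAL;
        if (utilization >= 0.85) return Level.WARN; // alert with headroom left
        return Level.OK;
    }

    public static void main(String[] args) {
        System.out.println(classify(10, 20)); // OK (50%)
        System.out.println(classify(17, 20)); // WARN (85%)
        System.out.println(classify(19, 20)); // CRITICAL (95%)
    }
}
```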
Spring Boot with Micrometer
Micrometer is a metrics instrumentation library that provides a vendor-neutral facade - similar to how SLF4J works for logging. You write metrics code once using Micrometer's API, then choose a backend (Prometheus, Datadog, New Relic, etc.) through configuration.
Why Micrometer matters:
- Vendor neutrality: Switch monitoring backends without changing application code
- Spring Boot integration: Auto-configuration for common metrics (HTTP, JVM, database)
- Rich API: Simple builder patterns for all metric types
- Production ready: Battle-tested in thousands of Spring Boot applications
Dependencies
// build.gradle
implementation 'org.springframework.boot:spring-boot-starter-actuator'
runtimeOnly 'io.micrometer:micrometer-registry-prometheus'
Configuration
# application.yml
management:
endpoints:
web:
exposure:
include: health,info,metrics,prometheus
metrics:
tags:
application: ${spring.application.name}
environment: ${spring.profiles.active}
distribution:
percentiles-histogram:
http.server.requests: true
percentiles:
http.server.requests: 0.5, 0.95, 0.99
Custom Metrics
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Tags;
import io.micrometer.core.instrument.Timer;
import jakarta.annotation.PostConstruct;
import org.springframework.stereotype.Service;
@Service
public class PaymentService {
private final MeterRegistry meterRegistry;
private final Counter paymentsProcessedCounter;
private final Counter paymentsFailedCounter;
private final Timer paymentProcessingTimer;
public PaymentService(MeterRegistry meterRegistry) {
this.meterRegistry = meterRegistry;
// Counter: Total payments processed
this.paymentsProcessedCounter = Counter.builder("payments.processed.total")
.description("Total number of payments processed")
.tag("status", "success")
.register(meterRegistry);
// Counter: Failed payments
this.paymentsFailedCounter = Counter.builder("payments.processed.total")
.description("Total number of payments processed")
.tag("status", "failed")
.register(meterRegistry);
// Timer: Payment processing duration
this.paymentProcessingTimer = Timer.builder("payments.processing.duration")
.description("Payment processing duration")
.register(meterRegistry);
}
public PaymentResult processPayment(Payment payment) {
// record(Supplier) times the call without forcing a checked exception
return paymentProcessingTimer.record(() -> {
try {
PaymentResult result = executePayment(payment);
// Increment success counter
paymentsProcessedCounter.increment();
return result;
} catch (RuntimeException e) {
// Increment failure counter
paymentsFailedCounter.increment();
throw e;
}
});
}
// Gauge: Active payment processing count
@PostConstruct
public void init() {
meterRegistry.gauge("payments.processing.active",
Tags.empty(),
this,
service -> getActivePaymentCount());
}
private long getActivePaymentCount() {
// Return current active payments
return 0;
}
}
Business Metrics
Payment Metrics
@Component
public class PaymentMetrics {
private final MeterRegistry registry;
public PaymentMetrics(MeterRegistry registry) {
this.registry = registry;
}
public void recordPaymentProcessed(Payment payment, PaymentResult result) {
// Counter by currency
Counter.builder("payments.processed.total")
.tag("currency", payment.getCurrency())
.tag("status", result.getStatus().name())
.register(registry)
.increment();
// Distribution summary for payment amounts
DistributionSummary.builder("payments.amount")
.baseUnit("dollars")
.tag("currency", payment.getCurrency())
.register(registry)
.record(payment.getAmount().doubleValue());
// Timer for payment processing by type
Timer.builder("payments.processing.time")
.tag("type", payment.getType().name())
.register(registry)
.record(result.getDuration());
}
public void recordPaymentFailed(Payment payment, String errorType) {
Counter.builder("payments.failed.total")
.tag("currency", payment.getCurrency())
.tag("error_type", errorType)
.register(registry)
.increment();
}
public void recordAccountCreated(Account account) {
Counter.builder("accounts.created.total")
.tag("type", account.getType().name())
.register(registry)
.increment();
}
}
Technical Metrics
HTTP Metrics (Auto-configured)
Spring Boot automatically provides:
- http.server.requests - Request count and latency
- http.client.requests - Outgoing HTTP calls
JVM Metrics (Auto-configured)
- jvm.memory.used - Memory usage
- jvm.gc.pause - GC pause times
- jvm.threads.live - Thread count
Database Metrics
@Configuration
public class DataSourceMetricsConfig {
@Bean
public DataSourcePoolMetadataProvider dataSourcePoolMetadataProvider(
DataSource dataSource,
MeterRegistry registry) {
// Hikari pool metrics
HikariDataSource hikariDataSource = (HikariDataSource) dataSource;
registry.gauge("db.pool.active", hikariDataSource,
ds -> ds.getHikariPoolMXBean().getActiveConnections());
registry.gauge("db.pool.idle", hikariDataSource,
ds -> ds.getHikariPoolMXBean().getIdleConnections());
registry.gauge("db.pool.total", hikariDataSource,
ds -> ds.getHikariPoolMXBean().getTotalConnections());
return ds -> new HikariDataSourcePoolMetadata(hikariDataSource);
}
}
Prometheus Integration
Exposing Metrics Endpoint
// Metrics available at /actuator/prometheus
// Example output:
// # HELP payments_processed_total Total number of payments processed
// # TYPE payments_processed_total counter
// payments_processed_total{currency="USD",status="success"} 1234.0
// payments_processed_total{currency="EUR",status="success"} 567.0
Prometheus Configuration
# prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_configs:
- job_name: 'payment-service'
metrics_path: '/actuator/prometheus'
static_configs:
- targets: ['payment-service:8080']
labels:
environment: 'production'
service: 'payment-service'
- job_name: 'account-service'
metrics_path: '/actuator/prometheus'
static_configs:
- targets: ['account-service:8080']
labels:
environment: 'production'
service: 'account-service'
Grafana Dashboards
Example Dashboard Query
# Request rate (requests per second)
rate(http_server_requests_seconds_count[5m])
# 95th percentile latency
histogram_quantile(0.95, sum(rate(http_server_requests_seconds_bucket[5m])) by (le, uri))
# Error rate
rate(http_server_requests_seconds_count{status=~"5.."}[5m])
# Success rate
sum(rate(payments_processed_total{status="success"}[5m])) by (currency)
# Database connection pool usage
db_pool_active / db_pool_total * 100
# JVM memory usage
jvm_memory_used_bytes{area="heap"} / jvm_memory_max_bytes{area="heap"} * 100
Dashboard Panels
{
"dashboard": {
"title": "Payment Service Dashboard",
"panels": [
{
"title": "Request Rate",
"targets": [
{
"expr": "rate(http_server_requests_seconds_count[5m])"
}
]
},
{
"title": "Error Rate",
"targets": [
{
"expr": "rate(http_server_requests_seconds_count{status=~\"5..\"}[5m])"
}
]
},
{
"title": "P95 Latency",
"targets": [
{
"expr": "histogram_quantile(0.95, rate(http_server_requests_seconds_bucket[5m]))"
}
]
}
]
}
}
Alerting Rules
Prometheus Alerting
# alerts.yml
groups:
- name: payment_service_alerts
interval: 30s
rules:
# High error rate
- alert: HighErrorRate
expr: rate(http_server_requests_seconds_count{status=~"5.."}[5m]) > 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "High error rate detected"
description: "Error rate is {{ $value | humanizePercentage }} for {{ $labels.service }}"
# High latency
- alert: HighLatency
expr: histogram_quantile(0.95, rate(http_server_requests_seconds_bucket[5m])) > 1.0
for: 5m
labels:
severity: warning
annotations:
summary: "High latency detected"
description: "P95 latency is {{ $value }}s for {{ $labels.uri }}"
# Payment failures
- alert: PaymentFailuresHigh
expr: rate(payments_failed_total[5m]) > 0.1
for: 5m
labels:
severity: critical
annotations:
summary: "High payment failure rate"
description: "Payment failures: {{ $value | humanize }} per second"
# Database connection pool exhaustion
- alert: DBPoolExhausted
expr: db_pool_active / db_pool_total > 0.9
for: 5m
labels:
severity: warning
annotations:
summary: "Database pool near exhaustion"
description: "Pool usage: {{ $value | humanizePercentage }}"
# JVM memory pressure
- alert: HighMemoryUsage
expr: jvm_memory_used_bytes{area="heap"} / jvm_memory_max_bytes{area="heap"} > 0.85
for: 5m
labels:
severity: warning
annotations:
summary: "High JVM memory usage"
description: "Heap usage: {{ $value | humanizePercentage }}"
Best Practices
Naming Conventions
// GOOD: Descriptive names with units
Counter.builder("payments.processed.total")
Timer.builder("payments.processing.duration.seconds")
Gauge.builder("db.connections.active.count")
// BAD: Vague names without units
Counter.builder("payments")
Timer.builder("time")
Gauge.builder("connections")
Tag Guidelines
// GOOD: Bounded cardinality
.tag("currency", "USD") // Limited set of currencies
.tag("status", "success") // Limited set of statuses
.tag("payment_type", "transfer") // Limited set of types
// BAD: Unbounded cardinality
.tag("user_id", userId) // Millions of users
.tag("transaction_id", txnId) // Unbounded unique values
.tag("timestamp", timestamp.toString()) // Infinite values
Avoid High Cardinality
High cardinality metrics (millions of unique tag combinations) cause memory issues in Prometheus. Use exemplars or logs for unique identifiers.
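The cost is multiplicative: the number of time series a metric produces is the product of the cardinalities of its tags, so one unbounded tag dominates everything else. A back-of-the-envelope sketch:

```java
// Illustrative only: estimating series count from tag cardinalities.
public class SeriesCount {

    static long seriesCount(long... tagCardinalities) {
        long total = 1;
        for (long c : tagCardinalities) {
            total *= c;
        }
        return total;
    }

    public static void main(String[] args) {
        // Bounded tags: 20 endpoints x 5 statuses x 3 regions.
        System.out.println(seriesCount(20, 5, 3)); // 300 series: cheap
        // Add a user_id tag with one million users:
        System.out.println(seriesCount(20, 5, 3, 1_000_000)); // 300,000,000 series
    }
}
```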
Node.js / TypeScript Metrics
Prometheus Client
import { register, Counter, Histogram, Gauge } from 'prom-client';
// Counter
const paymentsProcessedCounter = new Counter({
name: 'payments_processed_total',
help: 'Total number of payments processed',
labelNames: ['currency', 'status']
});
// Histogram (for latency)
const paymentProcessingDuration = new Histogram({
name: 'payment_processing_duration_seconds',
help: 'Payment processing duration',
labelNames: ['type'],
buckets: [0.1, 0.5, 1, 2, 5]
});
// Gauge
const activePaymentsGauge = new Gauge({
name: 'payments_active_count',
help: 'Number of active payment processing'
});
export class PaymentService {
async processPayment(payment: Payment): Promise<PaymentResult> {
const end = paymentProcessingDuration.startTimer({ type: payment.type });
try {
const result = await this.executePayment(payment);
paymentsProcessedCounter.inc({
currency: payment.currency,
status: 'success'
});
end();
return result;
} catch (error) {
paymentsProcessedCounter.inc({
currency: payment.currency,
status: 'failed'
});
end(); // record duration for failed payments too
throw error;
}
}
}
// Expose metrics endpoint
app.get('/metrics', async (req, res) => {
res.set('Content-Type', register.contentType);
res.end(await register.metrics());
});
Further Reading
Internal Documentation
- Observability Logging - Structured logging
- Observability Tracing - Distributed tracing
- Spring Boot Observability - Spring Boot integration
- Performance Testing - Load testing
Summary
Metrics provide the quantitative foundation of observability. While logs tell you "what happened" in detail and traces show you "where time was spent," metrics answer "how much" and "how fast" through aggregated time-series data.
Key Takeaways
- Metrics measure aggregates, not individual events: Unlike logs (which capture each event) or traces (which follow each request), metrics aggregate data into numbers like "requests per second" or "95th percentile latency."
- The Four Golden Signals provide comprehensive coverage: Latency (how long), Traffic (how much demand), Errors (what's failing), and Saturation (how full) - these four metrics from Google's SRE book cover most monitoring needs.
- Metric types serve different purposes: Counters for cumulative totals (errors, requests), Gauges for point-in-time values (memory, connections), Timers for durations with percentiles, Distribution Summaries for non-time value distributions.
- Percentiles beat averages: Average latency of 100ms might hide that 5% of users experience 5-second delays. Always track P95 and P99 to understand the tail latency affecting real users.
- Micrometer provides vendor neutrality: Like SLF4J for logging, Micrometer lets you write metrics code once and switch backends (Prometheus, Datadog, New Relic) through configuration.
- High cardinality kills memory: Unbounded tag values (user IDs, transaction IDs, email addresses) create millions of unique metric combinations, exhausting Prometheus memory. Use bounded tags only (status, endpoint, region).
- Naming conventions matter: Consistent naming (metric.name.unit, like http.requests.total or payment.processing.duration.seconds) makes metrics discoverable and self-documenting.
- Business metrics complement technical metrics: Technical metrics (JVM, HTTP, database) show system health, but business metrics (orders per second, checkout conversion rate) directly measure business value.
- Alert on symptoms, not causes: Alert on user-visible problems (high latency, error rate), not infrastructure details (CPU, memory). Infrastructure metrics help investigate alerts but shouldn't trigger them directly.
- Prometheus + Grafana form a powerful stack: Prometheus efficiently scrapes and stores time-series data with powerful PromQL queries, while Grafana visualizes it with rich dashboards. Together they're the most popular open-source metrics solution.
Relationship to Other Observability Pillars
Metrics excel at different aspects than logs and traces:
- Metrics → Logs: Metrics alert you to a problem ("error rate spiked to 15%"), logs tell you why specific errors happened
- Metrics → Traces: Metrics show aggregate latency increased, traces reveal which specific service in the call chain slowed down
- Combined workflow:
  1. Metrics alert fires: "P95 latency > 1s"
  2. Grafana dashboard shows it's the /api/payments endpoint
  3. Traces reveal the bottleneck is in database queries
  4. Logs show a specific query failing due to a missing index
The three pillars are complementary, not redundant. You need all three for complete observability.
Next Steps:
- Review Distributed Tracing to understand request flow visualization with OpenTelemetry
- Read Logging Best Practices to learn how logs provide detailed context
- See framework implementations in Spring Boot Observability