Application Metrics Best Practices
Overview
Metrics are the second pillar of observability, providing real-time quantitative data about system behavior. While logs give you detailed narratives of specific events and traces show request flows, metrics answer questions like "How many?" and "How fast?" through aggregated time-series data.
Metrics enable you to:
- Detect anomalies: Spot unusual patterns (error rate spikes, latency increases) before they cause outages
- Validate SLAs: Measure actual performance against service level objectives
- Trigger alerts: Automatically notify teams when thresholds are breached
- Inform capacity planning: Understand resource usage trends over time
- Provide business insights: Track domain-specific KPIs (orders per second, conversion rates)
This guide covers metrics fundamentals, the Four Golden Signals framework, implementation with Micrometer and Prometheus, and best practices for dashboards and alerting.
Metrics vs. Logs vs. Traces
Understanding when to use each observability pillar is crucial:
| Aspect | Metrics | Logs | Traces |
|---|---|---|---|
| Purpose | Quantitative aggregates | Detailed event narrative | Request flow visualization |
| Question answered | "How many? How fast?" | "What happened and why?" | "Where did the request go?" |
| Data structure | Time-series numbers | Text/JSON events | Directed acyclic graph of spans |
| Volume | Low (aggregated) | High (per event) | Medium (sampled) |
| Storage | Time-series database (Prometheus) | Search engine (Elasticsearch) | Trace store (Jaeger) |
| Example | "Error rate: 5% (500/10000)" | "Payment PAY-123 failed: insufficient balance" | "Request took 2.5s: 1s in API, 1.5s in DB" |
When to use each:
- Metrics: Continuous monitoring, alerting on thresholds, trend analysis
- Logs: Investigating specific errors, debugging, audit trails
- Traces: Understanding latency distribution across services, identifying bottlenecks
In practice, you need all three working together - see the summary for how they complement each other.
Core Principles
- Four Golden Signals: Latency, Traffic, Errors, Saturation - the foundation of effective monitoring
- Business Metrics: Track domain-specific events (orders placed, checkouts completed, API calls)
- Technical Metrics: Monitor infrastructure (JVM, HTTP, database, cache, message queues)
- High Cardinality: Avoid unbounded tag values (no user IDs, transaction IDs) to prevent memory issues
- Consistent Naming: Follow conventions (metric.name.unit) for discoverability
- Percentiles Over Averages: P95/P99 reveal outliers that averages hide
Metric Types
Metrics libraries provide different types optimized for specific measurement patterns. Understanding which type to use is fundamental to effective metrics.
Counter
A counter is a cumulative metric that only increases (or resets to zero on restart). Use counters for things you want to count: requests received, errors encountered, tasks completed.
Characteristics:
- Monotonically increasing (never decreases)
- Resets to 0 on application restart
- Query using rate() or increase() to see change over time
When to use:
- Total requests processed
- Total errors occurred
- Items created/completed
- Bytes transferred
Example queries:
# Requests per second over last 5 minutes
rate(http_requests_total[5m])
# Total errors in last hour
increase(errors_total[1h])
Why not track rates directly? Counters are more resilient than tracking rates. If metric scraping fails for a period, counters can reconstruct the rate from the delta, whereas a rate metric would show gaps.
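The delta-based reconstruction can be sketched in plain Java. This is illustrative only; the names are not from any metrics library, and real backends like Prometheus apply the same idea server-side:

```java
// Illustrative only: reconstructing a per-second rate from two counter
// samples, tolerating the reset-to-zero that happens on restart.
public class CounterRate {

    // Returns events per second between two scrapes.
    static double rate(double previous, double current, double intervalSeconds) {
        // A counter never decreases; if it did, the process restarted and
        // the counter reset to zero, so the current value IS the delta.
        double delta = current >= previous ? current - previous : current;
        return delta / intervalSeconds;
    }

    public static void main(String[] args) {
        System.out.println(rate(1000, 1300, 60)); // steady state: 5.0 req/s
        System.out.println(rate(1000, 120, 60));  // after a restart: 2.0 req/s
    }
}
```

A gauge that directly stored "requests per second" would simply show a gap (or a wrong value) across the restart; the counter keeps enough information to recover.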
Gauge
A gauge represents a single numerical value that can arbitrarily go up or down. Think of it like a thermometer reading.
Characteristics:
- Can increase or decrease
- Represents current state at measurement time
- Directly queryable (no rate calculation needed)
When to use:
- Current memory usage
- Active database connections
- Queue depth
- Number of in-progress tasks
- Temperature, CPU percentage
Example queries:
# Current memory usage
jvm_memory_used_bytes
# Connection pool utilization
db_connections_active / db_connections_max * 100
Anti-pattern: Don't use gauges for things that always increase (use Counter instead). Gauges can miss spikes between scrapes.
Timer
A timer measures both the duration of events and their frequency. Timers are actually a combination of multiple metrics: count, sum, and histograms.
Characteristics:
- Records duration of operations
- Tracks call count (how many times operation called)
- Produces percentiles (P50, P95, P99)
- Can calculate rate and average duration
When to use:
- HTTP request latency
- Database query duration
- Method execution time
- External API call time
What you get from a timer:
- timer_count: Total number of calls (acts like a counter)
- timer_sum: Total time spent across all calls
- timer_max: Maximum recorded duration
- timer_bucket: Histogram buckets for percentile calculation
Example queries:
# Average latency over last 5 minutes
rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])
# 95th percentile latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# Requests per second
rate(http_request_duration_seconds_count[5m])
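Under the hood, histogram_quantile() estimates a percentile by locating the bucket that contains the target rank and interpolating linearly inside it. A rough sketch of that idea, with made-up bucket data:

```java
// Illustrative only: percentile estimation from cumulative histogram
// buckets, the same basic approach Prometheus uses on _bucket series.
public class HistogramQuantile {

    // upperBounds sorted ascending; counts are cumulative per bucket.
    static double quantile(double q, double[] upperBounds, double[] cumulativeCounts) {
        double total = cumulativeCounts[cumulativeCounts.length - 1];
        double rank = q * total;
        for (int i = 0; i < upperBounds.length; i++) {
            if (cumulativeCounts[i] >= rank) {
                double lower = i == 0 ? 0.0 : upperBounds[i - 1];
                double before = i == 0 ? 0.0 : cumulativeCounts[i - 1];
                double inBucket = cumulativeCounts[i] - before;
                // Assume observations are spread evenly within the bucket.
                return lower + (upperBounds[i] - lower) * ((rank - before) / inBucket);
            }
        }
        return upperBounds[upperBounds.length - 1];
    }

    public static void main(String[] args) {
        double[] le = {0.1, 0.5, 1.0, 5.0};  // bucket upper bounds, in seconds
        double[] counts = {50, 90, 98, 100}; // cumulative observation counts
        // P95 falls in the (0.5, 1.0] bucket: 0.5 + 0.5 * (95 - 90) / 8 = 0.8125
        System.out.println(quantile(0.95, le, counts));
    }
}
```

This is also why bucket boundaries matter: the estimate can only be as precise as the bucket the rank lands in.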
Distribution Summary
Similar to timers but for tracking the distribution of non-time values.
Characteristics:
- Records distribution of values
- Tracks count and sum
- Produces percentiles
- Does not measure time
When to use:
- Request/response payload sizes
- Order amounts
- Batch sizes
- Any measured value where distribution matters
Example:
DistributionSummary.builder("order.amount")
.baseUnit("dollars")
.register(registry)
.record(orderAmount);
Difference from Timer: Timer is specifically for durations (nanoseconds), while DistributionSummary is for arbitrary value distributions.
The Four Golden Signals
Google's SRE book defines four key metrics that effectively monitor any system. Focus on these before adding custom metrics.
1. Latency
Definition: Time it takes to service a request.
Why it matters: Latency directly impacts user experience. A 10ms increase in latency might drop conversion rates.
What to measure:
- HTTP request duration (P50, P95, P99)
- Database query time
- External API call duration
- Message queue processing time
Key insight: Always use percentiles (P95, P99) not averages. An average latency of 100ms might hide that 5% of requests take 5 seconds.
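A tiny worked example makes the point concrete. With 95 fast requests and 5 slow ones, the average looks acceptable while P99 exposes the five-second tail (the percentile convention below is nearest-rank, one of several):

```java
import java.util.Arrays;

// Illustrative only: averages hide the tail that percentiles reveal.
public class TailLatency {

    static double average(double[] values) {
        return Arrays.stream(values).average().orElse(0.0);
    }

    // Nearest-rank percentile on a sorted array (one common convention).
    static double percentile(double[] sorted, double p) {
        int rank = (int) Math.ceil(p * sorted.length) - 1;
        return sorted[Math.max(0, rank)];
    }

    public static void main(String[] args) {
        double[] latencies = new double[100];
        Arrays.fill(latencies, 0, 95, 0.05);  // 95 requests at 50 ms
        Arrays.fill(latencies, 95, 100, 5.0); // 5 requests at 5 s
        Arrays.sort(latencies);
        System.out.println(average(latencies));          // 0.2975 s: looks fine
        System.out.println(percentile(latencies, 0.99)); // 5.0 s: the real story
    }
}
```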
2. Traffic
Definition: How much demand is being placed on your system.
Why it matters: Understanding traffic patterns helps capacity planning and identifying unusual load.
What to measure:
- HTTP requests per second
- Database queries per second
- Messages published/consumed per second
- Active WebSocket connections
Key insight: Monitor traffic by endpoint/operation to identify hotspots.
3. Errors
Definition: Rate of failed requests.
Why it matters: Errors directly impact users and indicate system health problems.
What to measure:
- HTTP 5xx error rate
- Exception count by type
- Failed database transactions
- Circuit breaker trips
Key insight: Track error rate (errors per second) and error ratio (errors / total requests) - they tell different stories.
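The difference between the two is easy to see with numbers. The same service at two times of day (figures are made up for illustration):

```java
// Illustrative only: error rate (errors/second) and error ratio
// (errors/total) can point in opposite directions depending on traffic.
public class ErrorSignals {

    static double errorRate(double errors, double windowSeconds) {
        return errors / windowSeconds; // errors per second
    }

    static double errorRatio(double errors, double totalRequests) {
        return totalRequests == 0 ? 0.0 : errors / totalRequests;
    }

    public static void main(String[] args) {
        // Quiet period: tiny absolute rate, but half of all requests fail.
        System.out.println(errorRate(120, 60));   // 2.0 err/s
        System.out.println(errorRatio(120, 240)); // 0.5 (50%)
        // Peak period: much higher rate, but only 0.5% of requests fail.
        System.out.println(errorRate(3000, 60));      // 50.0 err/s
        System.out.println(errorRatio(3000, 600000)); // 0.005 (0.5%)
    }
}
```

Alerting only on the absolute rate would miss the quiet-period outage; alerting only on the ratio can fire on statistically meaningless blips at very low traffic.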
4. Saturation
Definition: How "full" your service is (resource utilization).
Why it matters: Saturation predicts future problems. At 90% CPU, you're close to degraded performance.
What to measure:
- CPU utilization
- Memory usage (heap, non-heap)
- Database connection pool usage
- Disk I/O, Network bandwidth
- Thread pool utilization
Key insight: Set alerts before hitting 100%. Alert at 80-85% to give time to scale.
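As a sketch, a saturation check with headroom might classify utilization like this (the thresholds and names here are hypothetical, not a standard):

```java
// Illustrative only: warn well before 100% so there is time to scale.
public class SaturationCheck {

    enum Level { OK, WARN, CRITICAL }

    static Level classify(int active, int max) {
        double utilization = (double) active / max;
        if (utilization >= 0.95) return Level.CRITICAL;
        if (utilization >= 0.85) return Level.WARN; // alert with headroom left
        return Level.OK;
    }

    public static void main(String[] args) {
        System.out.println(classify(10, 20)); // OK (50%)
        System.out.println(classify(17, 20)); // WARN (85%)
        System.out.println(classify(19, 20)); // CRITICAL (95%)
    }
}
```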
Spring Boot with Micrometer
Micrometer is a metrics instrumentation library that provides a vendor-neutral facade - similar to how SLF4J works for logging. You write metrics code once using Micrometer's API, then choose a backend (Prometheus, Datadog, New Relic, etc.) through configuration.
Why Micrometer matters:
- Vendor neutrality: Switch monitoring backends without changing application code
- Spring Boot integration: Auto-configuration for common metrics (HTTP, JVM, database)
- Rich API: Simple builder patterns for all metric types
- Production ready: Battle-tested in thousands of Spring Boot applications
Dependencies
// build.gradle
implementation 'org.springframework.boot:spring-boot-starter-actuator'
runtimeOnly 'io.micrometer:micrometer-registry-prometheus'
Configuration
# application.yml
management:
endpoints:
web:
exposure:
include: health,info,metrics,prometheus
metrics:
tags:
application: ${spring.application.name}
environment: ${spring.profiles.active}
distribution:
percentiles-histogram:
http.server.requests: true
percentiles:
http.server.requests: 0.5, 0.95, 0.99
Custom Metrics
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Tags;
import io.micrometer.core.instrument.Timer;
import jakarta.annotation.PostConstruct;
import org.springframework.stereotype.Service;
@Service
public class PaymentService {
private final MeterRegistry meterRegistry;
private final Counter paymentsProcessedCounter;
private final Counter paymentsFailedCounter;
private final Timer paymentProcessingTimer;
public PaymentService(MeterRegistry meterRegistry) {
this.meterRegistry = meterRegistry;
// Counter: Total payments processed
this.paymentsProcessedCounter = Counter.builder("payments.processed.total")
.description("Total number of payments processed")
.tag("status", "success")
.register(meterRegistry);
// Counter: Failed payments
this.paymentsFailedCounter = Counter.builder("payments.processed.total")
.description("Total number of payments processed")
.tag("status", "failed")
.register(meterRegistry);
// Timer: Payment processing duration
this.paymentProcessingTimer = Timer.builder("payments.processing.duration")
.description("Payment processing duration")
.register(meterRegistry);
}
public PaymentResult processPayment(Payment payment) {
// record(Supplier) times the call without forcing a checked exception
return paymentProcessingTimer.record(() -> {
try {
PaymentResult result = executePayment(payment);
// Increment success counter
paymentsProcessedCounter.increment();
return result;
} catch (RuntimeException e) {
// Increment failure counter
paymentsFailedCounter.increment();
throw e;
}
});
}
// Gauge: Active payment processing count
@PostConstruct
public void init() {
meterRegistry.gauge("payments.processing.active",
Tags.empty(),
this,
service -> getActivePaymentCount());
}
private long getActivePaymentCount() {
// Return current active payments
return 0;
}
}
Business Metrics
Payment Metrics
@Component
public class PaymentMetrics {
private final MeterRegistry registry;
public PaymentMetrics(MeterRegistry registry) {
this.registry = registry;
}
public void recordPaymentProcessed(Payment payment, PaymentResult result) {
// Counter by currency
Counter.builder("payments.processed.total")
.tag("currency", payment.getCurrency())
.tag("status", result.getStatus().name())
.register(registry)
.increment();
// Distribution summary for payment amounts
DistributionSummary.builder("payments.amount")
.baseUnit("dollars")
.tag("currency", payment.getCurrency())
.register(registry)
.record(payment.getAmount().doubleValue());
// Timer for payment processing by type
Timer.builder("payments.processing.time")
.tag("type", payment.getType().name())
.register(registry)
.record(result.getDuration());
}
public void recordPaymentFailed(Payment payment, String errorType) {
Counter.builder("payments.failed.total")
.tag("currency", payment.getCurrency())
.tag("error_type", errorType)
.register(registry)
.increment();
}
public void recordAccountCreated(Account account) {
Counter.builder("accounts.created.total")
.tag("type", account.getType().name())
.register(registry)
.increment();
}
}
Technical Metrics
HTTP Metrics (Auto-configured)
Spring Boot automatically provides:
- http.server.requests - Request count and latency
- http.client.requests - Outgoing HTTP calls
JVM Metrics (Auto-configured)
- jvm.memory.used - Memory usage
- jvm.gc.pause - GC pause times
- jvm.threads.live - Thread count
Database Metrics
@Configuration
public class DataSourceMetricsConfig {
@Bean
public DataSourcePoolMetadataProvider dataSourcePoolMetadataProvider(
DataSource dataSource,
MeterRegistry registry) {
// Hikari pool metrics
HikariDataSource hikariDataSource = (HikariDataSource) dataSource;
registry.gauge("db.pool.active", hikariDataSource,
ds -> ds.getHikariPoolMXBean().getActiveConnections());
registry.gauge("db.pool.idle", hikariDataSource,
ds -> ds.getHikariPoolMXBean().getIdleConnections());
registry.gauge("db.pool.total", hikariDataSource,
ds -> ds.getHikariPoolMXBean().getTotalConnections());
return ds -> new HikariDataSourcePoolMetadata(hikariDataSource);
}
}
Prometheus Integration
Exposing Metrics Endpoint
// Metrics available at /actuator/prometheus
// Example output:
// # HELP payments_processed_total Total number of payments processed
// # TYPE payments_processed_total counter
// payments_processed_total{currency="USD",status="success"} 1234.0
// payments_processed_total{currency="EUR",status="success"} 567.0
Prometheus Configuration
# prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_configs:
- job_name: 'payment-service'
metrics_path: '/actuator/prometheus'
static_configs:
- targets: ['payment-service:8080']
labels:
environment: 'production'
service: 'payment-service'
- job_name: 'account-service'
metrics_path: '/actuator/prometheus'
static_configs:
- targets: ['account-service:8080']
labels:
environment: 'production'
service: 'account-service'
Grafana Dashboards
Example Dashboard Query
# Request rate (requests per second)
rate(http_server_requests_seconds_count[5m])
# 95th percentile latency
histogram_quantile(0.95, sum(rate(http_server_requests_seconds_bucket[5m])) by (le, uri))
# Error rate
rate(http_server_requests_seconds_count{status=~"5.."}[5m])
# Success rate
sum(rate(payments_processed_total{status="success"}[5m])) by (currency)
# Database connection pool usage
db_pool_active / db_pool_total * 100
# JVM memory usage
jvm_memory_used_bytes{area="heap"} / jvm_memory_max_bytes{area="heap"} * 100
Dashboard Panels
{
"dashboard": {
"title": "Payment Service Dashboard",
"panels": [
{
"title": "Request Rate",
"targets": [
{
"expr": "rate(http_server_requests_seconds_count[5m])"
}
]
},
{
"title": "Error Rate",
"targets": [
{
"expr": "rate(http_server_requests_seconds_count{status=~\"5..\"}[5m])"
}
]
},
{
"title": "P95 Latency",
"targets": [
{
"expr": "histogram_quantile(0.95, rate(http_server_requests_seconds_bucket[5m]))"
}
]
}
]
}
}
Alerting Rules
Prometheus Alerting
# alerts.yml
groups:
- name: payment_service_alerts
interval: 30s
rules:
# High error rate
- alert: HighErrorRate
expr: rate(http_server_requests_seconds_count{status=~"5.."}[5m]) > 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "High error rate detected"
description: "Error rate is {{ $value | humanizePercentage }} for {{ $labels.service }}"
# High latency
- alert: HighLatency
expr: histogram_quantile(0.95, rate(http_server_requests_seconds_bucket[5m])) > 1.0
for: 5m
labels:
severity: warning
annotations:
summary: "High latency detected"
description: "P95 latency is {{ $value }}s for {{ $labels.uri }}"
# Payment failures
- alert: PaymentFailuresHigh
expr: rate(payments_failed_total[5m]) > 0.1
for: 5m
labels:
severity: critical
annotations:
summary: "High payment failure rate"
description: "Payment failures: {{ $value | humanize }} per second"
# Database connection pool exhaustion
- alert: DBPoolExhausted
expr: db_pool_active / db_pool_total > 0.9
for: 5m
labels:
severity: warning
annotations:
summary: "Database pool near exhaustion"
description: "Pool usage: {{ $value | humanizePercentage }}"
# JVM memory pressure
- alert: HighMemoryUsage
expr: jvm_memory_used_bytes{area="heap"} / jvm_memory_max_bytes{area="heap"} > 0.85
for: 5m
labels:
severity: warning
annotations:
summary: "High JVM memory usage"
description: "Heap usage: {{ $value | humanizePercentage }}"
Best Practices
Naming Conventions
// GOOD: Descriptive names with units
Counter.builder("payments.processed.total")
Timer.builder("payments.processing.duration.seconds")
Gauge.builder("db.connections.active.count")
// BAD: Vague names without units
Counter.builder("payments")
Timer.builder("time")
Gauge.builder("connections")
Tag Guidelines
// GOOD: Bounded cardinality
.tag("currency", "USD") // Limited set of currencies
.tag("status", "success") // Limited set of statuses
.tag("payment_type", "transfer") // Limited set of types
// BAD: Unbounded cardinality
.tag("user_id", userId) // Millions of users
.tag("transaction_id", txnId) // Unbounded unique values
.tag("timestamp", timestamp.toString()) // Infinite values
Avoid High Cardinality
High cardinality metrics (millions of unique tag combinations) cause memory issues in Prometheus. Use exemplars or logs for unique identifiers.
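The cost is multiplicative: the number of time series a metric produces is the product of the cardinalities of its tags, so one unbounded tag dominates everything else. A back-of-the-envelope sketch:

```java
// Illustrative only: estimating series count from tag cardinalities.
public class SeriesCount {

    static long seriesCount(long... tagCardinalities) {
        long total = 1;
        for (long c : tagCardinalities) {
            total *= c;
        }
        return total;
    }

    public static void main(String[] args) {
        // Bounded tags: 20 endpoints x 5 statuses x 3 regions.
        System.out.println(seriesCount(20, 5, 3)); // 300 series: cheap
        // Add a user_id tag with one million users:
        System.out.println(seriesCount(20, 5, 3, 1_000_000)); // 300,000,000 series
    }
}
```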
Node.js / TypeScript Metrics
Prometheus Client
import { register, Counter, Histogram, Gauge } from 'prom-client';
// Counter
const paymentsProcessedCounter = new Counter({
name: 'payments_processed_total',
help: 'Total number of payments processed',
labelNames: ['currency', 'status']
});
// Histogram (for latency)
const paymentProcessingDuration = new Histogram({
name: 'payment_processing_duration_seconds',
help: 'Payment processing duration',
labelNames: ['type'],
buckets: [0.1, 0.5, 1, 2, 5]
});
// Gauge
const activePaymentsGauge = new Gauge({
name: 'payments_active_count',
help: 'Number of active payment processing'
});
export class PaymentService {
async processPayment(payment: Payment): Promise<PaymentResult> {
const end = paymentProcessingDuration.startTimer({ type: payment.type });
try {
const result = await this.executePayment(payment);
paymentsProcessedCounter.inc({
currency: payment.currency,
status: 'success'
});
end();
return result;
} catch (error) {
paymentsProcessedCounter.inc({
currency: payment.currency,
status: 'failed'
});
end(); // record duration for failed payments too
throw error;
}
}
}
// Expose metrics endpoint
app.get('/metrics', async (req, res) => {
res.set('Content-Type', register.contentType);
res.end(await register.metrics());
});
Further Reading
Internal Documentation
- Observability Logging - Structured logging
- Observability Tracing - Distributed tracing
- Spring Boot Observability - Spring Boot integration
- Performance Testing - Load testing
Summary
Metrics provide the quantitative foundation of observability. While logs tell you "what happened" in detail and traces show you "where time was spent," metrics answer "how much" and "how fast" through aggregated time-series data.
Key Takeaways
- Metrics measure aggregates, not individual events: Unlike logs (which capture each event) or traces (which follow each request), metrics aggregate data into numbers like "requests per second" or "95th percentile latency."
- The Four Golden Signals provide comprehensive coverage: Latency (how long), Traffic (how much demand), Errors (what's failing), and Saturation (how full) - these four metrics from Google's SRE book cover most monitoring needs.
- Metric types serve different purposes: Counters for cumulative totals (errors, requests), Gauges for point-in-time values (memory, connections), Timers for durations with percentiles, Distribution Summaries for non-time value distributions.
- Percentiles beat averages: Average latency of 100ms might hide that 5% of users experience 5-second delays. Always track P95 and P99 to understand the tail latency affecting real users.
- Micrometer provides vendor neutrality: Like SLF4J for logging, Micrometer lets you write metrics code once and switch backends (Prometheus, Datadog, New Relic) through configuration.
- High cardinality kills memory: Unbounded tag values (user IDs, transaction IDs, email addresses) create millions of unique metric combinations, exhausting Prometheus memory. Use bounded tags only (status, endpoint, region).
- Naming conventions matter: Consistent naming (metric.name.unit, like http.requests.total or payment.processing.duration.seconds) makes metrics discoverable and self-documenting.
- Business metrics complement technical metrics: Technical metrics (JVM, HTTP, database) show system health, but business metrics (orders per second, checkout conversion rate) directly measure business value.
- Alert on symptoms, not causes: Alert on user-visible problems (high latency, error rate), not infrastructure details (CPU, memory). Infrastructure metrics help investigate alerts but shouldn't trigger them directly.
- Prometheus + Grafana form a powerful stack: Prometheus efficiently scrapes and stores time-series data with powerful PromQL queries, while Grafana visualizes it with rich dashboards. Together they're the most popular open-source metrics solution.
Relationship to Other Observability Pillars
Metrics excel at different aspects than logs and traces:
- Metrics → Logs: Metrics alert you to a problem ("error rate spiked to 15%"), logs tell you why specific errors happened
- Metrics → Traces: Metrics show aggregate latency increased, traces reveal which specific service in the call chain slowed down
- Combined workflow:
  1. Metrics alert fires: "P95 latency > 1s"
  2. Grafana dashboard shows it's the /api/payments endpoint
  3. Traces reveal the bottleneck is in database queries
  4. Logs show a specific query failing due to a missing index
The three pillars are complementary, not redundant. You need all three for complete observability.
Next Steps:
- Review Distributed Tracing to understand request flow visualization with OpenTelemetry
- Read Logging Best Practices to learn how logs provide detailed context
- See framework implementations in Spring Boot Observability