Monitoring and Alerting Strategy
Purpose of Monitoring and Alerting
Effective monitoring tells you when something is wrong, while effective alerting tells you what action to take. The goal is not to monitor everything, but to monitor what matters to your users and to alert only when human intervention is needed.
Key principles:
- User-centric: Monitor what affects user experience, not just infrastructure
- Actionable: Every alert should require immediate action or investigation
- Proactive: Detect issues before users report them
- Scalable: System works as you grow from 10 to 10,000 services
This guide covers the frameworks and practices for building reliable, sustainable monitoring and alerting systems.
SLIs, SLOs, SLAs, and Error Budgets
These four concepts form the foundation of modern service reliability management. Understanding their relationship is critical for effective monitoring strategy.
Service Level Indicators (SLIs)
An SLI is a quantitative measurement of a specific aspect of service quality. SLIs are what you actually measure.
Definition: A ratio of good events to total events, expressed as a percentage:
SLI = (Good Events / Total Events) × 100%
Common SLIs:
| SLI Type | Measures | Good Event | Total Events | Example Target |
|---|---|---|---|---|
| Availability | Service uptime | Successful requests (2xx/3xx) | All requests | 99.9% |
| Latency | Response speed | Requests < threshold (e.g., 500ms) | All requests | 95% < 500ms |
| Error Rate | Request success | Successful requests (non-5xx) | All requests | 99.5% success |
| Throughput | Request handling capacity | Requests processed | Requests received | 99% processed |
| Durability | Data preservation | Successfully stored items | All write attempts | 99.999% |
Example calculation:
# Latency SLI: Percentage of requests under 500ms
Total requests: 10,000
Requests under 500ms: 9,500
SLI = (9,500 / 10,000) × 100% = 95%
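The same ratio can be sketched as a small Python helper (the function name and the zero-traffic convention are illustrative, not from any specific library):

```python
def sli_percentage(good_events: int, total_events: int) -> float:
    """SLI = (good events / total events) x 100%."""
    if total_events == 0:
        # No traffic in the window; treating this as "met" is one common convention
        return 100.0
    return 100 * good_events / total_events

# 9,500 of 10,000 requests completed under 500ms
print(sli_percentage(9_500, 10_000))  # → 95.0
```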
PromQL query for latency SLI:
# Calculate percentage of requests under 500ms
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
/
sum(rate(http_request_duration_seconds_count[5m]))
* 100
SLI Selection Criteria:
Choose SLIs that:
- Directly impact user experience: Latency affects every user interaction
- Are measurable from user perspective: End-to-end request time, not internal queue depth
- Correlate with user satisfaction: Slow requests lead to user complaints
- Cover different failure modes: Availability + latency + errors gives comprehensive view
Anti-pattern: Monitoring infrastructure metrics (CPU %, memory %) as SLIs. These are symptoms, not user-facing outcomes. Monitor them separately for capacity planning and debugging.
Service Level Objectives (SLOs)
An SLO is a target value for an SLI. It defines "how good is good enough."
Structure: SLI ≥ Target over Time Window
Examples:
# Payment Service SLOs
slos:
  - name: "payment-availability"
    sli: "successful_requests_percentage"
    target: 99.9
    window: "30 days"
    description: "99.9% of payment requests succeed (non-5xx responses)"

  - name: "payment-latency"
    sli: "requests_under_500ms_percentage"
    target: 95.0
    window: "30 days"
    description: "95% of payment requests complete within 500ms"

  - name: "payment-error-rate"
    sli: "error_free_requests_percentage"
    target: 99.5
    window: "30 days"
    description: "99.5% of payment requests have no errors"
SLO as a State Machine:
At any point in time an SLO sits in one of three states: healthy (performance comfortably above target), at risk (error budget burning faster than expected), or breached (target missed for the current window). Alerting urgency and change-management posture should follow the current state.
Setting SLO Targets:
Don't aim for 100% - it's expensive and unnecessary. The target depends on:
- User expectations: Consumer apps (99.9%) vs. critical infrastructure (99.99%)
- Cost of improvement: Going from 99.9% to 99.99% costs exponentially more
- Dependencies: Your SLO can't exceed your dependencies' SLOs
- Historical performance: Start with current performance, then iterate
Rule of thumb: If users aren't complaining and you're meeting business goals, your current SLO might be good enough. Focus optimization effort elsewhere.
Service Level Agreements (SLAs)
An SLA is a contractual commitment with consequences if violated. SLAs are always more lenient than SLOs.
SLO vs SLA relationship:
SLA < SLO < Actual Performance
Example:
- Actual Performance: 99.95% availability (what you deliver)
- SLO: 99.9% availability (internal target with buffer)
- SLA: 99.5% availability (contractual commitment to customers)
The gap between SLO and SLA is your safety buffer for unexpected issues without contractual penalties.
SLA Example:
## Payment Processing SLA
### Availability Commitment
We guarantee 99.5% availability measured monthly.
### Latency Commitment
95% of payment requests will complete within 1 second.
### Remedies
If SLA is not met:
- 99.5% - 99.0%: 10% service credit
- 99.0% - 95.0%: 25% service credit
- < 95.0%: 50% service credit
### Exclusions
Downtime caused by:
- Customer's infrastructure failures
- Scheduled maintenance (with 48hr notice)
- Force majeure events
SLA Considerations:
- Financial consequences: Credits, refunds, or penalties for violations
- Legal review required: SLAs are contracts, involve legal team
- Customer communication: Customers see SLAs, not internal SLOs
- Incident reporting: SLA violations typically require customer notification
Error Budgets
An error budget is the allowed amount of unreliability before violating your SLO. It's calculated from your SLO target.
Calculation:
Error Budget = 100% - SLO Target
Example: 99.9% availability SLO
Error Budget = 100% - 99.9% = 0.1% downtime allowed
Time-based error budget (30 days):
Total time: 30 days × 24 hours × 60 minutes = 43,200 minutes
Error budget: 43,200 × 0.1% = 43.2 minutes of downtime allowed per month
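The downtime allowance for a few common targets can be computed with a short sketch (the helper name is illustrative):

```python
def downtime_budget_minutes(slo_target_pct: float, window_days: int = 30) -> float:
    """Minutes of downtime allowed per window for a given availability SLO."""
    total_minutes = window_days * 24 * 60  # 43,200 for a 30-day window
    return total_minutes * (100 - slo_target_pct) / 100

for target in (99.0, 99.9, 99.99):
    print(f"{target}% -> {downtime_budget_minutes(target):.2f} min/month")
# 99.9% allows 43.20 minutes of downtime per 30-day window
```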
Error Budget as a Shared Resource:
Error budget enables objective conversations between product and engineering: while budget remains, ship features at normal velocity; once it is spent, engineering effort shifts to reliability work.
Error Budget Policy:
Define what happens when budget is exhausted:
error_budget_policy:
  - budget_remaining: "> 50%"
    action: "Business as usual - all changes allowed"
  - budget_remaining: "20% - 50%"
    action: "Elevated caution - review risky changes"
  - budget_remaining: "0% - 20%"
    action: "Freeze non-critical features, focus on reliability"
  - budget_remaining: "< 0%"
    action: "Feature freeze until next window, on-call postmortem required"
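The policy tiers can be encoded as a lookup. This is a hedged sketch: the handling at exactly 20% and 50% is a convention this example picks, since the policy above leaves the boundaries open.

```python
def budget_policy_action(budget_remaining_pct: float) -> str:
    """Map remaining error budget (%) to a policy tier (tier names abbreviated)."""
    if budget_remaining_pct > 50:
        return "business as usual"
    if budget_remaining_pct >= 20:
        return "elevated caution"
    if budget_remaining_pct >= 0:
        return "freeze non-critical features"
    return "feature freeze until next window"

print(budget_policy_action(80))  # → business as usual
print(budget_policy_action(15))  # → freeze non-critical features
```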
Tracking Error Budget:
Example: Payment Service Error Budget Calculation
SLO Target: 99.9% availability
Error Budget: 100% - 99.9% = 0.1%
Current period (30 days):
- Total requests: 1,000,000
- Failed requests: 800
- Actual SLI: (1,000,000 - 800) / 1,000,000 = 99.92%
Budget calculation:
- Allowed failures: 1,000,000 × 0.1% = 1,000 requests
- Actual failures: 800 requests
- Budget consumed: (800 / 1,000) × 100% = 80%
- Budget remaining: 20%
Status: WARNING (20% remaining)
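The same arithmetic, as a small sketch (the function name and rounding are illustrative):

```python
def error_budget_status(total_requests: int, failed_requests: int,
                        slo_target_pct: float) -> tuple[float, float]:
    """Return (budget consumed %, budget remaining %) for a request-based SLO."""
    allowed_failures = total_requests * (100 - slo_target_pct) / 100
    consumed = failed_requests / allowed_failures * 100
    return round(consumed, 2), round(100 - consumed, 2)

consumed, remaining = error_budget_status(1_000_000, 800, 99.9)
print(consumed, remaining)  # → 80.0 20.0
```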
PromQL query for error budget:
# Error budget remaining percentage
(1 - (
sum(rate(http_server_requests_total{status=~"5.."}[30d]))
/
sum(rate(http_server_requests_total[30d]))
) / (1 - 0.999)) * 100
Benefits of Error Budgets:
- Objective decision-making: Data-driven vs. opinion-driven
- Balance innovation and reliability: Known risk tolerance
- Prevent alert fatigue: Small SLO misses don't trigger incidents if budget remains
- Incentive alignment: Product and engineering share reliability goals
The Four Golden Signals
Google's Site Reliability Engineering (SRE) book identifies four metrics that matter most for monitoring user-facing systems. These signals provide comprehensive coverage with minimal metrics.
1. Latency
What it measures: How long requests take to complete.
Why it matters: Users perceive slow systems as broken. Latency directly impacts user satisfaction and conversion rates.
Key insight: Measure latency for successful and failed requests separately. A failed request that returns immediately (e.g., 401 Unauthorized) shouldn't skew your latency metrics.
Metrics structure:
# Track latency separately by outcome
payment_latency_seconds:
  type: histogram
  labels:
    - outcome: success | failure
    - payment_method: card | bank_transfer | wallet
    - error_type: validation | gateway_timeout | network  # only for failures
Example PromQL for separate latency tracking:
# P95 latency for successful requests only
histogram_quantile(0.95,
sum(rate(payment_latency_seconds_bucket{outcome="success"}[5m])) by (le)
)
# P95 latency for failed requests (for debugging slow failures)
histogram_quantile(0.95,
sum(rate(payment_latency_seconds_bucket{outcome="failure"}[5m])) by (le, error_type)
)
For framework-specific implementation of latency tracking, see Spring Boot Observability.
What to monitor:
- Percentiles, not averages: P50 (median), P95, P99, P99.9
- Distribution: Histogram to see full latency distribution
- By endpoint: Different endpoints have different latency expectations
Anti-pattern: Alerting on average latency. Averages hide outliers. A system averaging 100ms could have 5% of requests taking 10 seconds - users experiencing those slow requests will be frustrated.
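A tiny illustration of why, using invented numbers: the mean looks healthy while the tail is terrible.

```python
# 95 fast requests (100 ms) and 5 pathological ones (10 s)
latencies_ms = [100] * 95 + [10_000] * 5

mean = sum(latencies_ms) / len(latencies_ms)
p95 = sorted(latencies_ms)[int(0.95 * len(latencies_ms))]  # crude index-based p95

print(mean)  # → 595.0  (looks acceptable on a dashboard)
print(p95)   # → 10000  (what the unlucky 5% actually experience)
```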
2. Traffic
What it measures: Demand on your system (requests per second, transactions per minute, etc.).
Why it matters: Helps identify capacity constraints, detect unusual patterns (DDoS, viral growth), and forecast infrastructure needs.
Metrics to track:
# Traffic metrics
http_server_requests_total:
  type: counter
  labels:
    - method: GET | POST | PUT | DELETE
    - uri: /api/payments | /api/users | ...
    - status: 200 | 400 | 500 | ...

# Most frameworks auto-instrument this metric
# Spring Boot: http.server.requests
# Express.js: http_request_duration_seconds_count
# FastAPI: http_requests_total
What to monitor:
- Request rate: Requests per second (RPS) overall and by endpoint
- Connection rate: New connections/second (for connection pools, databases)
- Business metrics: Payments/minute, user signups/hour
Example PromQL query:
# Requests per second over last 5 minutes
sum(rate(http_server_requests_total[5m])) by (uri, method)
Pattern: Compare current traffic to historical baselines to detect anomalies:
# Traffic is 3x higher than usual (comparing to last week same time)
sum(rate(http_server_requests_total[5m]))
>
3 * sum(rate(http_server_requests_total[5m] offset 7d))
3. Errors
What it measures: Rate of failed requests.
Why it matters: Errors directly degrade user experience. A 1% error rate means 1 in 100 users has a bad experience.
Error categorization:
Explicit errors are easy to detect (500 status codes, exceptions). Implicit errors are correct technically but wrong functionally (e.g., returning stale data after cache timeout).
Metrics structure for error tracking:
# Track both explicit and implicit errors
payment_errors_total:
  type: counter
  labels:
    - type: gateway_exception | declined | validation | timeout
    - reason: insufficient_funds | card_expired | ...  # for declined
    - error: NetworkException | TimeoutException | ...  # for exceptions
Example: Tracking both error types
Explicit error (HTTP 500):
- Gateway throws exception
- Increment: payment_errors_total{type="gateway_exception", error="NetworkException"}
Implicit error (HTTP 200 but business failure):
- Payment declined by gateway
- Increment: payment_errors_total{type="declined", reason="insufficient_funds"}
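A minimal sketch of recording both kinds, using a plain `collections.Counter` in place of a real metrics client (the function names are illustrative; the label values mirror the examples above):

```python
from collections import Counter

# stand-in for a labeled Prometheus counter
payment_errors_total: Counter = Counter()

def record_gateway_exception(error: str) -> None:
    # explicit error: the request itself failed (HTTP 5xx)
    payment_errors_total[("gateway_exception", error)] += 1

def record_declined(reason: str) -> None:
    # implicit error: HTTP 200, but the business outcome is a failure
    payment_errors_total[("declined", reason)] += 1

record_gateway_exception("NetworkException")
record_declined("insufficient_funds")
print(payment_errors_total[("declined", "insufficient_funds")])  # → 1
```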
For framework-specific implementation, see Spring Boot Observability.
What to monitor:
- Error rate: Errors per second and percentage of total requests
- Error types: 4xx vs 5xx, exception types, business error codes
- Error ratio by endpoint: Some endpoints naturally have higher error rates (authentication)
Example PromQL query:
# Error rate: percentage of 5xx responses
sum(rate(http_server_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_server_requests_total[5m]))
* 100
4. Saturation
What it measures: How "full" your service is - resource utilization and queueing.
Why it matters: Saturation precedes failure. A database connection pool at 95% usage will soon start rejecting requests. Monitoring saturation enables proactive scaling before users are impacted.
Common saturation metrics:
| Resource | Metric | Warning Threshold | Critical Threshold |
|---|---|---|---|
| CPU | % utilization | > 70% | > 85% |
| Memory | % used | > 80% | > 90% |
| Disk | % used | > 75% | > 90% |
| Connection pool | Active connections / Max | > 70% | > 85% |
| Thread pool | Active threads / Max | > 70% | > 85% |
| Queue depth | Messages in queue | > 1000 | > 10,000 |
| Network | % bandwidth used | > 70% | > 85% |
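The thresholds from the table can be checked with a small helper. The defaults below come from the connection-pool row; the function itself is an illustrative sketch:

```python
def saturation_level(active: int, maximum: int,
                     warn: float = 0.70, crit: float = 0.85) -> str:
    """Classify pool utilization against warning/critical thresholds."""
    utilization = active / maximum
    if utilization > crit:
        return "critical"
    if utilization > warn:
        return "warning"
    return "ok"

print(saturation_level(18, 20))  # 90% utilized → critical
print(saturation_level(15, 20))  # 75% utilized → warning
```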
Common saturation metrics by technology:
# Database connection pools
# HikariCP (Java): hikaricp_connections_active, hikaricp_connections_max
# pg-pool (Node.js): pg_pool_size, pg_pool_max
# SQLAlchemy (Python): sqlalchemy_pool_checkedout, sqlalchemy_pool_size
# Thread/worker pools
# JVM: jvm_threads_live, jvm_threads_peak
# Node.js: nodejs_active_handles, nodejs_active_requests
# Python: process_threads
# Message queues
# RabbitMQ: rabbitmq_queue_messages, rabbitmq_queue_consumers
# Kafka: kafka_consumer_lag
For framework-specific saturation monitoring setup, see Spring Boot Observability.
Saturation alerting pattern:
Alert when resources approach limits, not when they're exceeded:
# Alert when connection pool is 80% utilized
(hikaricp_connections_active / hikaricp_connections_max) > 0.8
Why 80% and not 100%? At 80% utilization, you have time to scale before hitting limits. At 100%, users are already experiencing failures.
Combining the Four Signals
The signals work together to diagnose issues:
| Symptom | Latency | Traffic | Errors | Saturation | Likely Cause |
|---|---|---|---|---|---|
| Slow requests | ↑ | – | – | ↑ (CPU/DB) | Resource contention |
| High errors | – | – | ↑ | – | Application bug |
| Both slow + errors | ↑ | ↑ | ↑ | ↑ (all) | Traffic spike overwhelming system |
| Slow DB queries | ↑ | – | – | ↑ (DB conn) | Connection pool exhaustion |
Alert Design Principles
Alerts are notifications that demand immediate human action. Poorly designed alerts lead to alert fatigue, where engineers ignore or silence alerts, missing real incidents.
Characteristics of Good Alerts
Every alert should be:
- Actionable: The person receiving the alert knows what to do
- Urgent: Requires immediate investigation or action
- User-impacting: Affects user experience or will soon
- Clear: Obvious what's wrong and where
- Unique: Not redundant with other alerts
Test your alert: Ask "If this woke me at 3am, would I be able to take meaningful action?" If no, it's not an alert - it's a notification or dashboard metric.
Alert Severity Levels
Define clear severity levels with corresponding response expectations:
severity_levels:
  P1_CRITICAL:
    description: "Service down or severely degraded, users affected NOW"
    response_time: "Immediate (page on-call)"
    examples:
      - "API returning 100% errors for 5+ minutes"
      - "Database unreachable"
      - "Payment processing completely unavailable"
    escalation: "After 15 minutes, escalate to senior engineer"

  P2_HIGH:
    description: "Partial degradation, subset of users affected or imminent total failure"
    response_time: "15 minutes"
    examples:
      - "Error rate above 5% for 10+ minutes"
      - "P95 latency 3x normal"
      - "Connection pool 95% saturated (about to fail)"
    escalation: "After 30 minutes, escalate to lead"

  P3_MEDIUM:
    description: "Minor degradation, small user impact or early warning"
    response_time: "1 hour (during business hours only)"
    examples:
      - "Error rate above 1% for 30+ minutes"
      - "Disk space 85% full (trending toward full)"
      - "Certificate expiring in 7 days"
    escalation: "Create ticket if not resolved in 4 hours"

  P4_LOW:
    description: "Potential issue, no current user impact"
    response_time: "Next business day"
    examples:
      - "Increased latency but within SLO"
      - "Non-critical background job failing"
      - "Staging environment issue"
    escalation: "Weekly review if recurring"
Key rule: Only P1 and P2 should page/wake people. P3 and P4 are tracked but don't require immediate response.
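That routing rule is simple enough to encode directly in alert-routing logic (a sketch; the severity names follow the levels defined above):

```python
PAGING_SEVERITIES = {"P1_CRITICAL", "P2_HIGH"}

def should_page(severity: str) -> bool:
    """Only P1/P2 wake a human; P3/P4 become tracked tickets."""
    return severity in PAGING_SEVERITIES

print(should_page("P2_HIGH"))    # → True
print(should_page("P3_MEDIUM"))  # → False
```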
Alerting on Symptoms, Not Causes
Anti-pattern: Alert on CPU usage exceeding 80%.
Better: Alert on latency exceeding your SLO.
Why? High CPU is a cause, but it doesn't always mean users are impacted. A batch job might spike CPU to 90% without affecting user requests. Conversely, latency directly measures user experience.
Exception: Alert on saturation metrics (connection pool 90% full) as leading indicators before they cause user-visible symptoms. This gives time to respond proactively.
Alert Thresholds and Windows
Threshold selection:
Thresholds should balance sensitivity (catch all real issues) with specificity (avoid false alarms).
# Too sensitive - will fire constantly
alert: HighErrorRate
expr: error_rate > 0.01  # 1% - too low, within normal fluctuation

# Too lenient - misses real issues
expr: error_rate > 0.50  # 50% - too high, users already severely impacted

# Balanced - catches meaningful degradation
expr: error_rate > 0.05  # 5% - above normal noise, below disaster
Time windows:
Require issues to persist before alerting to avoid transient spikes:
alert: HighLatency
expr: p95_latency > 1.0  # 1 second
for: 5m  # Must be true for 5 consecutive minutes
annotations:
  summary: "P95 latency above 1s for 5+ minutes"
Why "for" matters: A single slow request spikes P95 temporarily. If it persists for 5 minutes, it's a systemic issue requiring investigation.
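A rough model of the `for:` clause, assuming one sample per minute (this deliberately simplifies Prometheus's actual pending-state handling):

```python
def alert_fires(samples: list[float], threshold: float, for_minutes: int) -> bool:
    """Fire only if the last `for_minutes` samples all breach the threshold."""
    recent = samples[-for_minutes:]
    return len(recent) == for_minutes and all(v > threshold for v in recent)

# one transient spike does not page...
print(alert_fires([0.3, 1.5, 0.3, 0.4, 0.3], threshold=1.0, for_minutes=5))  # → False
# ...five consecutive bad minutes do
print(alert_fires([1.2, 1.4, 1.3, 1.5, 1.1], threshold=1.0, for_minutes=5))  # → True
```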
Guidance on windows:
- Availability/errors: 5-10 minute windows (catch issues quickly)
- Latency: 5-10 minute windows (avoid transient spikes)
- Saturation: 10-15 minute windows (gives time to scale before critical)
- Resource trends: 30+ minute windows (disk space, memory leaks)
Alert Descriptions and Runbooks
Every alert should include:
- What is wrong: Clear problem statement
- Why it matters: User impact
- What to do: Link to runbook with investigation steps
- Who to escalate to: If responder can't resolve
# Prometheus alert with comprehensive annotations
groups:
  - name: payment_service
    interval: 30s
    rules:
      - alert: PaymentServiceHighErrorRate
        expr: |
          sum(rate(payment_errors_total[5m]))
          /
          sum(rate(payment_requests_total[5m]))
          > 0.05
        for: 10m
        labels:
          severity: P2_HIGH
          service: payment-service
          team: payments
        annotations:
          summary: "Payment service error rate above 5% for 10+ minutes"
          description: |
            {{ $value | humanizePercentage }} of payment requests are failing.
            Current rate: {{ $value }} (threshold: 0.05)
            This affects users' ability to complete purchases.
          impact: "Users cannot complete purchases. Revenue impact: ~$X per minute."
          runbook_url: "https://wiki.company.com/runbooks/payment-service/high-error-rate"
          dashboard_url: "https://grafana.company.com/d/payments/payment-service-overview"
          grafana_query: 'sum(rate(payment_errors_total[5m])) / sum(rate(payment_requests_total[5m]))'
Runbook content (linked from alert):
# Runbook: Payment Service High Error Rate
## Symptom
Error rate for payment processing exceeds 5% for 10+ minutes.
## Impact
Users cannot complete purchases. Revenue loss approximately $500 per minute.
## Investigation Steps
### 1. Check Error Types (2 minutes)
```sh
# View error breakdown by type
kubectl logs -l app=payment-service --tail=100 | grep ERROR | jq .error_type | sort | uniq -c
```
**Common error types:**
- `GATEWAY_TIMEOUT`: Payment gateway unavailable
- `VALIDATION_ERROR`: Bad request data (likely code deployment)
- `DATABASE_ERROR`: Database connectivity issue
### 2. Check Dependencies (3 minutes)
```sh
# Check payment gateway health
curl https://gateway.paymentprovider.com/health
# Check database connectivity
kubectl exec -it payment-service-pod -- psql -c "SELECT 1"
```
### 3. Check Recent Deployments (2 minutes)
```sh
# List recent deployments
kubectl rollout history deployment/payment-service
# If recent deployment (<15 min), rollback:
kubectl rollout undo deployment/payment-service
```
## Resolution Paths
### If Payment Gateway Down
1. Check status page: https://status.paymentprovider.com
2. Contact payment provider support: +1-555-0100
3. Consider failover to backup gateway (requires approval)
### If Database Issue
1. Check database metrics dashboard
2. Restart connection pool: `kubectl rollout restart deployment/payment-service`
3. If database is down, escalate to infrastructure team
### If Recent Deployment
1. Rollback immediately (see step 3 above)
2. Create incident ticket
3. Notify #payments-team Slack channel
## Escalation
- **Primary on-call**: Check PagerDuty rotation
- **Escalate after 15 min**: Senior engineer (see PagerDuty escalation policy)
- **Escalate after 30 min**: Engineering manager + VP Engineering
Alert Fatigue Prevention
Alert fatigue occurs when engineers receive too many non-actionable alerts and start ignoring them.
Symptoms of alert fatigue:
- Alerts are routinely silenced without investigation
- Average time-to-acknowledge increases over time
- On-call engineers report constant interruptions
- Team "tunes out" alert noise
Prevention strategies:
1. Regular alert audits:
-- Query alert statistics (assumes alert history has been exported to a SQL
-- store; Alertmanager itself does not expose a queryable database)
SELECT
    alert_name,
    COUNT(*) AS fires_per_month,
    AVG(resolution_time_minutes) AS avg_resolution,
    COUNT(*) FILTER (WHERE acknowledged = false) AS ignored_count
FROM alerts
WHERE created_at > NOW() - INTERVAL '30 days'
GROUP BY alert_name
HAVING COUNT(*) > 100                                     -- firing too frequently
    OR COUNT(*) FILTER (WHERE acknowledged = false) > 10; -- being ignored
2. Alert tuning cycle:
After each on-call shift, review every alert that fired: was it actionable? If not, raise the threshold, lengthen the "for" window, downgrade its severity, or delete it. Repeat each rotation until pages correlate with real incidents.
3. Alert aggregation:
Instead of alerting on each failing instance, alert when a threshold of failures is reached:
# Bad: fires separately for every failing pod
alert: PodDown
expr: up{job="payment-service"} == 0

# Good: fire when >20% of pods are down (systemic issue)
# Note: count(), not sum() - `up == 0` returns series whose value is 0,
# so summing them would always yield 0
alert: PaymentServiceDegraded
expr: |
  count(up{job="payment-service"} == 0)
  /
  count(up{job="payment-service"})
  > 0.2
4. Inhibition rules:
Suppress redundant alerts when a root cause alert fires:
# Prometheus Alertmanager inhibition
inhibit_rules:
  # If the database is down, inhibit alerts from services that depend on it
  - source_match:
      alertname: DatabaseDown
    target_match_re:
      service: "(payment|user|order)-service"
    equal: ['datacenter']
On-Call Best Practices
On-call engineers are responsible for responding to production incidents outside business hours.
On-Call Rotation Structure
Recommended rotation:
- Shift length: 1 week per engineer
- Primary + Secondary: Two-tier rotation for escalation
- Follow-the-sun: If global team, hand off to next timezone
- Fair distribution: Track hours to ensure equity
On-call compensation:
- Stipend: Flat payment for being on-call (regardless of pages)
- Incident pay: Additional payment per incident handled
- Time off: Comp time if pages occurred during personal time
Incident Response Workflow
When an alert fires, follow a structured response: acknowledge the page, assess user impact, mitigate (rollback, scale out, fail over), communicate status, then resolve and document.
Response SLAs by severity:
| Severity | Acknowledge | Begin Investigation | Escalate If Not Resolved |
|---|---|---|---|
| P1 | 5 minutes | Immediate | 15 minutes |
| P2 | 10 minutes | Within 15 min | 30 minutes |
| P3 | 1 hour | Next business hours | 4 hours |
| P4 | Next business day | - | - |
Runbook Quality Standards
Runbooks are step-by-step guides for responding to specific alerts. Quality runbooks are critical for effective incident response.
Runbook template:
# Runbook: [Alert Name]
## Metadata
- **Owner**: [Team name]
- **Last updated**: [Date]
- **Tested**: [Date last executed]
## Symptom
[Clear description of what is wrong]
## Impact
[User-facing impact and business consequences]
## Investigation (Time-boxed Steps)
### Step 1: [Action] (X minutes)
**What to do:**
[Specific command or action]
**Expected result:**
[What you should see if system is healthy]
**If result is abnormal:**
[What it means and next step]
### Step 2: [Action] (X minutes)
...
## Resolution Paths
### Scenario A: [Common cause]
[Step-by-step resolution]
### Scenario B: [Another common cause]
[Step-by-step resolution]
## Escalation
- **After X minutes**: [Who to escalate to]
- **Contact**: [Phone/Slack/PagerDuty info]
## Post-Resolution
- [ ] Update incident ticket
- [ ] Notify #status channel
- [ ] Schedule post-mortem if P1/P2
## Related Runbooks
- [Link to related runbook]
Runbook maintenance:
- Test regularly: Run through runbooks quarterly
- Update after incidents: Incorporate new learnings
- Version control: Store in Git alongside code
- Rotate ownership: Each engineer owns 2-3 runbooks
Incident Communication
Clear communication during incidents minimizes confusion and coordinates response.
Communication channels:
| Channel | Purpose | Audience | Example |
|---|---|---|---|
| #incidents | Real-time coordination | Responders + leadership | "DB connection pool exhausted. Restarting pods." |
| #status | Customer-facing updates | Internal + customers | "Payment processing experiencing delays. Investigating." |
| StatusPage | External status | Customers only | "Degraded Performance: Payment API latency elevated" |
| Incident ticket | Documentation | Responders + future reference | Detailed timeline, actions taken, resolution |
Update cadence:
incident_communication:
  initial_report:
    when: "Within 10 minutes of incident start"
    content: "What's wrong, impact, who's investigating"
  progress_updates:
    P1_CRITICAL: "Every 15 minutes until resolved"
    P2_HIGH: "Every 30 minutes until resolved"
    P3_MEDIUM: "Hourly during business hours"
  resolution:
    when: "Immediately upon resolution"
    content: "What was fixed, next steps, post-mortem timeline"
  post_mortem:
    when: "Within 3 business days for P1/P2"
    content: "Root cause, timeline, action items"
Dashboards for Different Audiences
Different stakeholders need different views of system health.
Operations Dashboard (Engineers)
Purpose: Real-time system health for troubleshooting.
Contents:
- Four Golden Signals: Latency, traffic, errors, saturation
- SLI/SLO tracking: Current SLI vs. target, error budget remaining
- Request breakdown: By endpoint, status code, method
- Infrastructure metrics: CPU, memory, connection pools
- Recent deployments: Timeline of changes
dashboard:
  name: "Payment Service - Operations"
  refresh: "10s"
  rows:
    - title: "Golden Signals (Last 1 Hour)"
      panels:
        - type: graph
          title: "Latency (P50, P95, P99)"
          query: |
            histogram_quantile(0.50, sum(rate(payment_latency_bucket[5m])) by (le))
            histogram_quantile(0.95, sum(rate(payment_latency_bucket[5m])) by (le))
            histogram_quantile(0.99, sum(rate(payment_latency_bucket[5m])) by (le))
        - type: graph
          title: "Traffic (Requests/sec)"
          query: "sum(rate(payment_requests_total[5m]))"
        - type: graph
          title: "Error Rate (%)"
          query: |
            sum(rate(payment_errors_total[5m]))
            /
            sum(rate(payment_requests_total[5m])) * 100
        - type: gauge
          title: "Connection Pool Saturation"
          query: "hikaricp_connections_active / hikaricp_connections_max"
          thresholds:
            - value: 0.7
              color: yellow
            - value: 0.85
              color: red

    - title: "SLO Tracking (Last 30 Days)"
      panels:
        - type: stat
          title: "Availability SLO"
          query: |
            sum(rate(payment_requests_total{status!~"5.."}[30d]))
            /
            sum(rate(payment_requests_total[30d])) * 100
          target: 99.9
        - type: stat
          title: "Error Budget Remaining"
          query: |
            (1 - (1 - <availability_sli>) / (1 - 0.999)) * 100
          thresholds:
            - value: 20
              color: red
            - value: 50
              color: yellow
            - value: 100
              color: green
Business Dashboard (Management)
Purpose: High-level health and trends for business stakeholders.
Contents:
- Uptime: Overall availability percentage
- User impact: Requests affected, users impacted (if available)
- SLO compliance: Are we meeting our targets?
- Trends: Week-over-week, month-over-month comparisons
- Incidents: Count and severity
dashboard:
  name: "Platform Health - Executive View"
  refresh: "5m"
  rows:
    - title: "This Month"
      panels:
        - type: stat
          title: "Platform Availability"
          query: "avg(sli_availability_30d)"
          unit: "percent"
          colorMode: "background"
          thresholds:
            - value: 99.9
              color: green
            - value: 99.0
              color: yellow
            - value: 0
              color: red
        - type: stat
          title: "Active Incidents"
          query: "count(ALERTS{severity=~'P1|P2'})"
        - type: stat
          title: "Users Affected (Last 7d)"
          query: "sum(increase(error_requests_by_user[7d]))"

    - title: "Trends (Last 90 Days)"
      panels:
        - type: graph
          title: "Daily Availability %"
          query: "avg_over_time(sli_availability_1d[90d])"
        - type: graph
          title: "Incident Count by Severity"
          query: |
            sum(increase(incidents_total[1d])) by (severity)
Customer-Facing Status Page
Purpose: Transparent communication of service health to external users.
Contents (minimal):
- Overall status: Operational, Degraded, Outage
- Component status: API, Web App, Mobile App, etc.
- Incident history: Last 90 days
- Scheduled maintenance: Upcoming windows
Example (StatusPage.io or custom):
status_page:
  components:
    - name: "Payment API"
      status: operational  # operational | degraded | outage
    - name: "Web Application"
      status: operational
    - name: "Mobile App"
      status: degraded
      description: "Increased latency for some users. Investigating."
  incidents:
    - date: "2024-01-15"
      title: "Payment Processing Delays"
      status: resolved
      duration: "45 minutes"
      summary: "Elevated latency caused payment delays. Issue resolved by scaling database."
  maintenance:
    - date: "2024-01-20 02:00-04:00 UTC"
      title: "Database Maintenance"
      impact: "Brief intermittent errors possible during 2-minute cutover"
Key principle: External status pages should be conservative. Don't report every blip - only user-impacting issues.
Further Reading
Internal Documentation
- Observability Overview - Three pillars: logs, metrics, traces
- Logging Best Practices - Structured logging, correlation IDs
- Application Metrics - Micrometer, Prometheus, custom metrics
- Distributed Tracing - OpenTelemetry, Jaeger integration
- Incident Post-Mortems - Blameless retrospectives
External Resources
- Google SRE Book - Monitoring Distributed Systems - Four Golden Signals origin
- Google SRE Book - Service Level Objectives - SLI/SLO/SLA framework
- Site Reliability Workbook - Alerting on SLOs - Alert design patterns
- Prometheus Best Practices - Alerting - Alert rules and templates
- PagerDuty Incident Response Guide - Incident management workflows
Summary
Effective monitoring and alerting balances comprehensive visibility with actionable insights:
Key principles:
- Monitor what matters: SLIs aligned with user experience, not arbitrary infrastructure metrics
- Alert on symptoms: User-visible issues (latency, errors) not causes (CPU, memory)
- Design for action: Every alert should require immediate response with clear runbook
- Prevent fatigue: Regular alert audits, aggregation, and inhibition prevent alert noise
- Error budgets: Objective framework for balancing reliability and innovation
Implementation path:
- Define SLIs: Choose 2-3 metrics that reflect user experience
- Set SLOs: Target values based on user expectations and historical performance
- Implement Four Golden Signals: Latency, traffic, errors, saturation
- Create runbooks: Step-by-step guides for common scenarios
- Establish on-call rotation: Fair, sustainable schedule with clear escalation
- Build dashboards: Operations, business, and customer-facing views
- Iterate: Regular retrospectives to improve alert quality
Start small: Begin with availability SLI/SLO and one high-quality alert. Expand as you learn what works for your team and systems.