Monitoring and Alerting Strategy
Purpose of Monitoring and Alerting
Effective monitoring tells you when something is wrong, while effective alerting tells you what action to take. The goal is not to monitor everything, but to monitor what matters to your users and to alert only when human intervention is needed.
Key principles:
- User-centric: Monitor what affects user experience, not just infrastructure
- Actionable: Every alert should require immediate action or investigation
- Proactive: Detect issues before users report them
- Scalable: System works as you grow from 10 to 10,000 services
This guide covers the frameworks and practices for building reliable, sustainable monitoring and alerting systems.
SLIs, SLOs, SLAs, and Error Budgets
These four concepts form the foundation of modern service reliability management. Understanding their relationship is critical for effective monitoring strategy.
Service Level Indicators (SLIs)
An SLI is a quantitative measurement of a specific aspect of service quality. SLIs are what you actually measure.
Definition: A ratio of good events to total events, expressed as a percentage:
SLI = (Good Events / Total Events) × 100%
Common SLIs:
| SLI Type | Measures | Good Event | Total Events | Example Target |
|---|---|---|---|---|
| Availability | Service uptime | Successful requests (2xx/3xx) | All requests | 99.9% |
| Latency | Response speed | Requests < threshold (e.g., 500ms) | All requests | 95% < 500ms |
| Error Rate | Request success | Successful requests (non-5xx) | All requests | 99.5% success |
| Throughput | Request handling capacity | Requests processed | Requests received | 99% processed |
| Durability | Data preservation | Successfully stored items | All write attempts | 99.999% |
Example calculation:
# Latency SLI: Percentage of requests under 500ms
Total requests: 10,000
Requests under 500ms: 9,500
SLI = (9,500 / 10,000) × 100% = 95%
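The same ratio can be sketched as a small Python helper (the function name and the zero-traffic convention are illustrative, not from any specific library):

```python
def sli_percentage(good_events: int, total_events: int) -> float:
    """SLI = (good events / total events) x 100%."""
    if total_events == 0:
        # No traffic in the window; treating this as "met" is one common convention
        return 100.0
    return 100 * good_events / total_events

# 9,500 of 10,000 requests completed under 500ms
print(sli_percentage(9_500, 10_000))  # → 95.0
```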
PromQL query for latency SLI:
# Calculate percentage of requests under 500ms
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
/
sum(rate(http_request_duration_seconds_count[5m]))
* 100
SLI Selection Criteria:
Choose SLIs that:
- Directly impact user experience: Latency affects every user interaction
- Are measurable from user perspective: End-to-end request time, not internal queue depth
- Correlate with user satisfaction: Slow requests lead to user complaints
- Cover different failure modes: Availability + latency + errors gives comprehensive view
Anti-pattern: Monitoring infrastructure metrics (CPU %, memory %) as SLIs. These are symptoms, not user-facing outcomes. Monitor them separately for capacity planning and debugging.
Service Level Objectives (SLOs)
An SLO is a target value for an SLI. It defines "how good is good enough."
Structure: SLI ≥ Target over Time Window
Examples:
# Payment Service SLOs
slos:
  - name: "payment-availability"
    sli: "successful_requests_percentage"
    target: 99.9
    window: "30 days"
    description: "99.9% of payment requests succeed (non-5xx responses)"

  - name: "payment-latency"
    sli: "requests_under_500ms_percentage"
    target: 95.0
    window: "30 days"
    description: "95% of payment requests complete within 500ms"

  - name: "payment-error-rate"
    sli: "error_free_requests_percentage"
    target: 99.5
    window: "30 days"
    description: "99.5% of payment requests have no errors"
SLO as a State Machine:
At any point in time an SLO sits in one of three states: healthy (performance comfortably above target), at risk (error budget burning faster than expected), or breached (target missed for the current window). Alerting urgency and change-management posture should follow the current state.
Setting SLO Targets:
Don't aim for 100% - it's expensive and unnecessary. The target depends on:
- User expectations: Consumer apps (99.9%) vs. critical infrastructure (99.99%)
- Cost of improvement: Going from 99.9% to 99.99% costs exponentially more
- Dependencies: Your SLO can't exceed your dependencies' SLOs
- Historical performance: Start with current performance, then iterate
Rule of thumb: If users aren't complaining and you're meeting business goals, your current SLO might be good enough. Focus optimization effort elsewhere.
Service Level Agreements (SLAs)
An SLA is a contractual commitment with consequences if violated. SLAs are always more lenient than SLOs.
SLO vs SLA relationship:
SLA < SLO < Actual Performance
Example:
- Actual Performance: 99.95% availability (what you deliver)
- SLO: 99.9% availability (internal target with buffer)
- SLA: 99.5% availability (contractual commitment to customers)
The gap between SLO and SLA is your safety buffer for unexpected issues without contractual penalties.
SLA Example:
## Payment Processing SLA
### Availability Commitment
We guarantee 99.5% availability measured monthly.
### Latency Commitment
95% of payment requests will complete within 1 second.
### Remedies
If SLA is not met:
- 99.5% - 99.0%: 10% service credit
- 99.0% - 95.0%: 25% service credit
- < 95.0%: 50% service credit
### Exclusions
Downtime caused by:
- Customer's infrastructure failures
- Scheduled maintenance (with 48hr notice)
- Force majeure events
SLA Considerations:
- Financial consequences: Credits, refunds, or penalties for violations
- Legal review required: SLAs are contracts, involve legal team
- Customer communication: Customers see SLAs, not internal SLOs
- Incident reporting: SLA violations typically require customer notification
Error Budgets
An error budget is the allowed amount of unreliability before violating your SLO. It's calculated from your SLO target.
Calculation:
Error Budget = 100% - SLO Target
Example: 99.9% availability SLO
Error Budget = 100% - 99.9% = 0.1% downtime allowed
Time-based error budget (30 days):
Total time: 30 days × 24 hours × 60 minutes = 43,200 minutes
Error budget: 43,200 × 0.1% = 43.2 minutes of downtime allowed per month
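The downtime allowance for a few common targets can be computed with a short sketch (the helper name is illustrative):

```python
def downtime_budget_minutes(slo_target_pct: float, window_days: int = 30) -> float:
    """Minutes of downtime allowed per window for a given availability SLO."""
    total_minutes = window_days * 24 * 60  # 43,200 for a 30-day window
    return total_minutes * (100 - slo_target_pct) / 100

for target in (99.0, 99.9, 99.99):
    print(f"{target}% -> {downtime_budget_minutes(target):.2f} min/month")
# 99.9% allows 43.20 minutes of downtime per 30-day window
```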
Error Budget as a Shared Resource:
Error budget enables objective conversations between product and engineering: while budget remains, ship features at normal velocity; once it is spent, engineering effort shifts to reliability work.
Error Budget Policy:
Define what happens when budget is exhausted:
error_budget_policy:
  - budget_remaining: "> 50%"
    action: "Business as usual - all changes allowed"
  - budget_remaining: "20% - 50%"
    action: "Elevated caution - review risky changes"
  - budget_remaining: "0% - 20%"
    action: "Freeze non-critical features, focus on reliability"
  - budget_remaining: "< 0%"
    action: "Feature freeze until next window, on-call postmortem required"
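The policy tiers can be encoded as a lookup. This is a hedged sketch: the handling at exactly 20% and 50% is a convention this example picks, since the policy above leaves the boundaries open.

```python
def budget_policy_action(budget_remaining_pct: float) -> str:
    """Map remaining error budget (%) to a policy tier (tier names abbreviated)."""
    if budget_remaining_pct > 50:
        return "business as usual"
    if budget_remaining_pct >= 20:
        return "elevated caution"
    if budget_remaining_pct >= 0:
        return "freeze non-critical features"
    return "feature freeze until next window"

print(budget_policy_action(80))  # → business as usual
print(budget_policy_action(15))  # → freeze non-critical features
```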
Tracking Error Budget:
Example: Payment Service Error Budget Calculation
SLO Target: 99.9% availability
Error Budget: 100% - 99.9% = 0.1%
Current period (30 days):
- Total requests: 1,000,000
- Failed requests: 800
- Actual SLI: (1,000,000 - 800) / 1,000,000 = 99.92%
Budget calculation:
- Allowed failures: 1,000,000 × 0.1% = 1,000 requests
- Actual failures: 800 requests
- Budget consumed: (800 / 1,000) × 100% = 80%
- Budget remaining: 20%
Status: WARNING (20% remaining)
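The same arithmetic, as a small sketch (the function name and rounding are illustrative):

```python
def error_budget_status(total_requests: int, failed_requests: int,
                        slo_target_pct: float) -> tuple[float, float]:
    """Return (budget consumed %, budget remaining %) for a request-based SLO."""
    allowed_failures = total_requests * (100 - slo_target_pct) / 100
    consumed = failed_requests / allowed_failures * 100
    return round(consumed, 2), round(100 - consumed, 2)

consumed, remaining = error_budget_status(1_000_000, 800, 99.9)
print(consumed, remaining)  # → 80.0 20.0
```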
PromQL query for error budget:
# Error budget remaining percentage
(1 - (
sum(rate(http_server_requests_total{status=~"5.."}[30d]))
/
sum(rate(http_server_requests_total[30d]))
) / (1 - 0.999)) * 100
Benefits of Error Budgets:
- Objective decision-making: Data-driven vs. opinion-driven
- Balance innovation and reliability: Known risk tolerance
- Prevent alert fatigue: Small SLO misses don't trigger incidents if budget remains
- Incentive alignment: Product and engineering share reliability goals
The Four Golden Signals
Google's Site Reliability Engineering (SRE) book identifies four metrics that matter most for monitoring user-facing systems. These signals provide comprehensive coverage with minimal metrics.
1. Latency
What it measures: How long requests take to complete.
Why it matters: Users perceive slow systems as broken. Latency directly impacts user satisfaction and conversion rates.
Key insight: Measure latency for successful and failed requests separately. A failed request that returns immediately (e.g., 401 Unauthorized) shouldn't skew your latency metrics.
Metrics structure:
# Track latency separately by outcome
payment_latency_seconds:
  type: histogram
  labels:
    - outcome: success | failure
    - payment_method: card | bank_transfer | wallet
    - error_type: validation | gateway_timeout | network  # only for failures
Example PromQL for separate latency tracking:
# P95 latency for successful requests only
histogram_quantile(0.95,
sum(rate(payment_latency_seconds_bucket{outcome="success"}[5m])) by (le)
)
# P95 latency for failed requests (for debugging slow failures)
histogram_quantile(0.95,
sum(rate(payment_latency_seconds_bucket{outcome="failure"}[5m])) by (le, error_type)
)
For framework-specific implementation of latency tracking, see Spring Boot Observability.
What to monitor:
- Percentiles, not averages: P50 (median), P95, P99, P99.9
- Distribution: Histogram to see full latency distribution
- By endpoint: Different endpoints have different latency expectations
Anti-pattern: Alerting on average latency. Averages hide outliers. A system averaging 100ms could have 5% of requests taking 10 seconds - users experiencing those slow requests will be frustrated.
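A tiny illustration of why, using invented numbers: the mean looks healthy while the tail is terrible.

```python
# 95 fast requests (100 ms) and 5 pathological ones (10 s)
latencies_ms = [100] * 95 + [10_000] * 5

mean = sum(latencies_ms) / len(latencies_ms)
p95 = sorted(latencies_ms)[int(0.95 * len(latencies_ms))]  # crude index-based p95

print(mean)  # → 595.0  (looks acceptable on a dashboard)
print(p95)   # → 10000  (what the unlucky 5% actually experience)
```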
2. Traffic
What it measures: Demand on your system (requests per second, transactions per minute, etc.).
Why it matters: Helps identify capacity constraints, detect unusual patterns (DDoS, viral growth), and forecast infrastructure needs.
Metrics to track:
# Traffic metrics
http_server_requests_total:
  type: counter
  labels:
    - method: GET | POST | PUT | DELETE
    - uri: /api/payments | /api/users | ...
    - status: 200 | 400 | 500 | ...

# Most frameworks auto-instrument this metric
# Spring Boot: http.server.requests
# Express.js: http_request_duration_seconds_count
# FastAPI: http_requests_total
What to monitor:
- Request rate: Requests per second (RPS) overall and by endpoint
- Connection rate: New connections/second (for connection pools, databases)
- Business metrics: Payments/minute, user signups/hour
Example PromQL query:
# Requests per second over last 5 minutes
sum(rate(http_server_requests_total[5m])) by (uri, method)
Pattern: Compare current traffic to historical baselines to detect anomalies:
# Traffic is 3x higher than usual (comparing to last week same time)
sum(rate(http_server_requests_total[5m]))
>
3 * sum(rate(http_server_requests_total[5m] offset 7d))
3. Errors
What it measures: Rate of failed requests.
Why it matters: Errors directly degrade user experience. A 1% error rate means 1 in 100 users has a bad experience.
Error categorization:
Explicit errors are easy to detect (500 status codes, exceptions). Implicit errors are correct technically but wrong functionally (e.g., returning stale data after cache timeout).
Metrics structure for error tracking:
# Track both explicit and implicit errors
payment_errors_total:
  type: counter
  labels:
    - type: gateway_exception | declined | validation | timeout
    - reason: insufficient_funds | card_expired | ...  # for declined
    - error: NetworkException | TimeoutException | ...  # for exceptions
Example: Tracking both error types
Explicit error (HTTP 500):
- Gateway throws exception
- Increment: payment_errors_total{type="gateway_exception", error="NetworkException"}
Implicit error (HTTP 200 but business failure):
- Payment declined by gateway
- Increment: payment_errors_total{type="declined", reason="insufficient_funds"}
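A minimal sketch of recording both kinds, using a plain `collections.Counter` in place of a real metrics client (the function names are illustrative; the label values mirror the examples above):

```python
from collections import Counter

# stand-in for a labeled Prometheus counter
payment_errors_total: Counter = Counter()

def record_gateway_exception(error: str) -> None:
    # explicit error: the request itself failed (HTTP 5xx)
    payment_errors_total[("gateway_exception", error)] += 1

def record_declined(reason: str) -> None:
    # implicit error: HTTP 200, but the business outcome is a failure
    payment_errors_total[("declined", reason)] += 1

record_gateway_exception("NetworkException")
record_declined("insufficient_funds")
print(payment_errors_total[("declined", "insufficient_funds")])  # → 1
```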
For framework-specific implementation, see Spring Boot Observability.
What to monitor:
- Error rate: Errors per second and percentage of total requests
- Error types: 4xx vs 5xx, exception types, business error codes
- Error ratio by endpoint: Some endpoints naturally have higher error rates (authentication)
Example PromQL query:
# Error rate: percentage of 5xx responses
sum(rate(http_server_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_server_requests_total[5m]))
* 100
4. Saturation
What it measures: How "full" your service is - resource utilization and queueing.
Why it matters: Saturation precedes failure. A database connection pool at 95% usage will soon start rejecting requests. Monitoring saturation enables proactive scaling before users are impacted.
Common saturation metrics:
| Resource | Metric | Warning Threshold | Critical Threshold |
|---|---|---|---|
| CPU | % utilization | > 70% | > 85% |
| Memory | % used | > 80% | > 90% |
| Disk | % used | > 75% | > 90% |
| Connection pool | Active connections / Max | > 70% | > 85% |
| Thread pool | Active threads / Max | > 70% | > 85% |
| Queue depth | Messages in queue | > 1000 | > 10,000 |
| Network | % bandwidth used | > 70% | > 85% |
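The thresholds from the table can be checked with a small helper. The defaults below come from the connection-pool row; the function itself is an illustrative sketch:

```python
def saturation_level(active: int, maximum: int,
                     warn: float = 0.70, crit: float = 0.85) -> str:
    """Classify pool utilization against warning/critical thresholds."""
    utilization = active / maximum
    if utilization > crit:
        return "critical"
    if utilization > warn:
        return "warning"
    return "ok"

print(saturation_level(18, 20))  # 90% utilized → critical
print(saturation_level(15, 20))  # 75% utilized → warning
```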
Common saturation metrics by technology:
# Database connection pools
# HikariCP (Java): hikaricp_connections_active, hikaricp_connections_max
# pg-pool (Node.js): pg_pool_size, pg_pool_max
# SQLAlchemy (Python): sqlalchemy_pool_checkedout, sqlalchemy_pool_size
# Thread/worker pools
# JVM: jvm_threads_live, jvm_threads_peak
# Node.js: nodejs_active_handles, nodejs_active_requests
# Python: process_threads
# Message queues
# RabbitMQ: rabbitmq_queue_messages, rabbitmq_queue_consumers
# Kafka: kafka_consumer_lag
For framework-specific saturation monitoring setup, see Spring Boot Observability.
Saturation alerting pattern:
Alert when resources approach limits, not when they're exceeded:
# Alert when connection pool is 80% utilized
(hikaricp_connections_active / hikaricp_connections_max) > 0.8
Why 80% and not 100%? At 80% utilization, you have time to scale before hitting limits. At 100%, users are already experiencing failures.
Combining the Four Signals
The signals work together to diagnose issues:
| Symptom | Latency | Traffic | Errors | Saturation | Likely Cause |
|---|---|---|---|---|---|
| Slow requests | ↑ | – | – | ↑ (CPU/DB) | Resource contention |
| High errors | – | – | ↑ | – | Application bug |
| Both slow + errors | ↑ | ↑ | ↑ | ↑ (all) | Traffic spike overwhelming system |
| Slow DB queries | ↑ | – | – | ↑ (DB conn) | Connection pool exhaustion |
Alert Design Principles
Alerts are notifications that demand immediate human action. Poorly designed alerts lead to alert fatigue, where engineers ignore or silence alerts, missing real incidents.
Characteristics of Good Alerts
Every alert should be:
- Actionable: The person receiving the alert knows what to do
- Urgent: Requires immediate investigation or action
- User-impacting: Affects user experience or will soon
- Clear: Obvious what's wrong and where
- Unique: Not redundant with other alerts
Test your alert: Ask "If this woke me at 3am, would I be able to take meaningful action?" If no, it's not an alert - it's a notification or dashboard metric.
Alert Severity Levels
Define clear severity levels with corresponding response expectations:
severity_levels:
  P1_CRITICAL:
    description: "Service down or severely degraded, users affected NOW"
    response_time: "Immediate (page on-call)"
    examples:
      - "API returning 100% errors for 5+ minutes"
      - "Database unreachable"
      - "Payment processing completely unavailable"
    escalation: "After 15 minutes, escalate to senior engineer"

  P2_HIGH:
    description: "Partial degradation, subset of users affected or imminent total failure"
    response_time: "15 minutes"
    examples:
      - "Error rate above 5% for 10+ minutes"
      - "P95 latency 3x normal"
      - "Connection pool 95% saturated (about to fail)"
    escalation: "After 30 minutes, escalate to lead"

  P3_MEDIUM:
    description: "Minor degradation, small user impact or early warning"
    response_time: "1 hour (during business hours only)"
    examples:
      - "Error rate above 1% for 30+ minutes"
      - "Disk space 85% full (trending toward full)"
      - "Certificate expiring in 7 days"
    escalation: "Create ticket if not resolved in 4 hours"

  P4_LOW:
    description: "Potential issue, no current user impact"
    response_time: "Next business day"
    examples:
      - "Increased latency but within SLO"
      - "Non-critical background job failing"
      - "Staging environment issue"
    escalation: "Weekly review if recurring"
Key rule: Only P1 and P2 should page/wake people. P3 and P4 are tracked but don't require immediate response.
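That routing rule is simple enough to encode directly in alert-routing logic (a sketch; the severity names follow the levels defined above):

```python
PAGING_SEVERITIES = {"P1_CRITICAL", "P2_HIGH"}

def should_page(severity: str) -> bool:
    """Only P1/P2 wake a human; P3/P4 become tracked tickets."""
    return severity in PAGING_SEVERITIES

print(should_page("P2_HIGH"))    # → True
print(should_page("P3_MEDIUM"))  # → False
```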
Alerting on Symptoms, Not Causes
Anti-pattern: Alert on CPU usage exceeding 80%.
Better: Alert on latency exceeding your SLO.
Why? High CPU is a cause, but it doesn't always mean users are impacted. A batch job might spike CPU to 90% without affecting user requests. Conversely, latency directly measures user experience.
Exception: Alert on saturation metrics (connection pool 90% full) as leading indicators before they cause user-visible symptoms. This gives time to respond proactively.
Alert Thresholds and Windows
Threshold selection:
Thresholds should balance sensitivity (catch all real issues) with specificity (avoid false alarms).
# Too sensitive - will fire constantly
alert: HighErrorRate
expr: error_rate > 0.01  # 1% - too low, within normal fluctuation

# Too lenient - misses real issues
expr: error_rate > 0.50  # 50% - too high, users already severely impacted

# Balanced - catches meaningful degradation
expr: error_rate > 0.05  # 5% - above normal noise, below disaster
Time windows:
Require issues to persist before alerting to avoid transient spikes:
alert: HighLatency
expr: p95_latency > 1.0  # 1 second
for: 5m  # Must be true for 5 consecutive minutes
annotations:
  summary: "P95 latency above 1s for 5+ minutes"
Why "for" matters: A single slow request spikes P95 temporarily. If it persists for 5 minutes, it's a systemic issue requiring investigation.
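A rough model of the `for:` clause, assuming one sample per minute (this deliberately simplifies Prometheus's actual pending-state handling):

```python
def alert_fires(samples: list[float], threshold: float, for_minutes: int) -> bool:
    """Fire only if the last `for_minutes` samples all breach the threshold."""
    recent = samples[-for_minutes:]
    return len(recent) == for_minutes and all(v > threshold for v in recent)

# one transient spike does not page...
print(alert_fires([0.3, 1.5, 0.3, 0.4, 0.3], threshold=1.0, for_minutes=5))  # → False
# ...five consecutive bad minutes do
print(alert_fires([1.2, 1.4, 1.3, 1.5, 1.1], threshold=1.0, for_minutes=5))  # → True
```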
Guidance on windows:
- Availability/errors: 5-10 minute windows (catch issues quickly)
- Latency: 5-10 minute windows (avoid transient spikes)
- Saturation: 10-15 minute windows (gives time to scale before critical)
- Resource trends: 30+ minute windows (disk space, memory leaks)
Alert Descriptions and Runbooks
Every alert should include:
- What is wrong: Clear problem statement
- Why it matters: User impact
- What to do: Link to runbook with investigation steps
- Who to escalate to: If responder can't resolve
# Prometheus alert with comprehensive annotations
groups:
  - name: payment_service
    interval: 30s
    rules:
      - alert: PaymentServiceHighErrorRate
        expr: |
          sum(rate(payment_errors_total[5m]))
          /
          sum(rate(payment_requests_total[5m]))
          > 0.05
        for: 10m
        labels:
          severity: P2_HIGH
          service: payment-service
          team: payments
        annotations:
          summary: "Payment service error rate above 5% for 10+ minutes"
          description: |
            {{ $value | humanizePercentage }} of payment requests are failing.
            Current rate: {{ $value }} (threshold: 0.05)
            This affects users' ability to complete purchases.
          impact: "Users cannot complete purchases. Revenue impact: ~$X per minute."
          runbook_url: "https://wiki.company.com/runbooks/payment-service/high-error-rate"
          dashboard_url: "https://grafana.company.com/d/payments/payment-service-overview"
          grafana_query: 'sum(rate(payment_errors_total[5m])) / sum(rate(payment_requests_total[5m]))'
Runbook content (linked from alert):
# Runbook: Payment Service High Error Rate
## Symptom
Error rate for payment processing exceeds 5% for 10+ minutes.
## Impact
Users cannot complete purchases. Revenue loss approximately $500 per minute.
## Investigation Steps
### 1. Check Error Types (2 minutes)
```sh
# View error breakdown by type
kubectl logs -l app=payment-service --tail=100 | grep ERROR | jq .error_type | sort | uniq -c
```
**Common error types:**
- `GATEWAY_TIMEOUT`: Payment gateway unavailable
- `VALIDATION_ERROR`: Bad request data (likely code deployment)
- `DATABASE_ERROR`: Database connectivity issue
### 2. Check Dependencies (3 minutes)
```sh
# Check payment gateway health
curl https://gateway.paymentprovider.com/health
# Check database connectivity
kubectl exec -it payment-service-pod -- psql -c "SELECT 1"
```
### 3. Check Recent Deployments (2 minutes)
```sh
# List recent deployments
kubectl rollout history deployment/payment-service
# If recent deployment (<15 min), rollback:
kubectl rollout undo deployment/payment-service
```
## Resolution Paths
### If Payment Gateway Down
1. Check status page: https://status.paymentprovider.com
2. Contact payment provider support: +1-555-0100
3. Consider failover to backup gateway (requires approval)
### If Database Issue
1. Check database metrics dashboard
2. Restart connection pool: `kubectl rollout restart deployment/payment-service`
3. If database is down, escalate to infrastructure team
### If Recent Deployment
1. Rollback immediately (see step 3 above)
2. Create incident ticket
3. Notify #payments-team Slack channel
## Escalation
- **Primary on-call**: Check PagerDuty rotation
- **Escalate after 15 min**: Senior engineer (see PagerDuty escalation policy)
- **Escalate after 30 min**: Engineering manager + VP Engineering
Alert Fatigue Prevention
Alert fatigue occurs when engineers receive too many non-actionable alerts and start ignoring them.
Symptoms of alert fatigue:
- Alerts are routinely silenced without investigation
- Average time-to-acknowledge increases over time
- On-call engineers report constant interruptions
- Team "tunes out" alert noise
Prevention strategies:
1. Regular alert audits:
-- Query alert statistics (assumes alert history has been exported to a SQL
-- store; Alertmanager itself does not expose a queryable database)
SELECT
    alert_name,
    COUNT(*) AS fires_per_month,
    AVG(resolution_time_minutes) AS avg_resolution,
    COUNT(*) FILTER (WHERE acknowledged = false) AS ignored_count
FROM alerts
WHERE created_at > NOW() - INTERVAL '30 days'
GROUP BY alert_name
HAVING COUNT(*) > 100                                     -- firing too frequently
    OR COUNT(*) FILTER (WHERE acknowledged = false) > 10; -- being ignored
2. Alert tuning cycle:
After each on-call shift, review every alert that fired: was it actionable? If not, raise the threshold, lengthen the "for" window, downgrade its severity, or delete it. Repeat each rotation until pages correlate with real incidents.
3. Alert aggregation:
Instead of alerting on each failing instance, alert when a threshold of failures is reached:
# Bad: fires separately for every failing pod
alert: PodDown
expr: up{job="payment-service"} == 0

# Good: fire when >20% of pods are down (systemic issue)
# Note: count(), not sum() - `up == 0` returns series whose value is 0,
# so summing them would always yield 0
alert: PaymentServiceDegraded
expr: |
  count(up{job="payment-service"} == 0)
  /
  count(up{job="payment-service"})
  > 0.2
4. Inhibition rules:
Suppress redundant alerts when a root cause alert fires:
# Prometheus Alertmanager inhibition
inhibit_rules:
  # If the database is down, inhibit alerts from services that depend on it
  - source_match:
      alertname: DatabaseDown
    target_match_re:
      service: "(payment|user|order)-service"
    equal: ['datacenter']
On-Call Best Practices
On-call engineers are responsible for responding to production incidents outside business hours.
On-Call Rotation Structure
Recommended rotation:
- Shift length: 1 week per engineer
- Primary + Secondary: Two-tier rotation for escalation
- Follow-the-sun: If global team, hand off to next timezone
- Fair distribution: Track hours to ensure equity
On-call compensation:
- Stipend: Flat payment for being on-call (regardless of pages)
- Incident pay: Additional payment per incident handled
- Time off: Comp time if pages occurred during personal time
Incident Response Workflow
When an alert fires, follow a structured response: acknowledge the page, assess user impact, mitigate (rollback, scale out, fail over), communicate status, then resolve and document.
Response SLAs by severity:
| Severity | Acknowledge | Begin Investigation | Escalate If Not Resolved |
|---|---|---|---|
| P1 | 5 minutes | Immediate | 15 minutes |
| P2 | 10 minutes | Within 15 min | 30 minutes |
| P3 | 1 hour | Next business hours | 4 hours |
| P4 | Next business day | - | - |
Runbook Quality Standards
Runbooks are step-by-step guides for responding to specific alerts. Quality runbooks are critical for effective incident response.
Runbook template:
# Runbook: [Alert Name]
## Metadata
- **Owner**: [Team name]
- **Last updated**: [Date]
- **Tested**: [Date last executed]
## Symptom
[Clear description of what is wrong]
## Impact
[User-facing impact and business consequences]
## Investigation (Time-boxed Steps)
### Step 1: [Action] (X minutes)
**What to do:**
[Specific command or action]
**Expected result:**
[What you should see if system is healthy]
**If result is abnormal:**
[What it means and next step]
### Step 2: [Action] (X minutes)
...
## Resolution Paths
### Scenario A: [Common cause]
[Step-by-step resolution]
### Scenario B: [Another common cause]
[Step-by-step resolution]
## Escalation
- **After X minutes**: [Who to escalate to]
- **Contact**: [Phone/Slack/PagerDuty info]
## Post-Resolution
- [ ] Update incident ticket
- [ ] Notify #status channel
- [ ] Schedule post-mortem if P1/P2
## Related Runbooks
- [Link to related runbook]
Runbook maintenance:
- Test regularly: Run through runbooks quarterly
- Update after incidents: Incorporate new learnings
- Version control: Store in Git alongside code
- Rotate ownership: Each engineer owns 2-3 runbooks
Incident Communication
Clear communication during incidents minimizes confusion and coordinates response.
Communication channels:
| Channel | Purpose | Audience | Example |
|---|---|---|---|
| #incidents | Real-time coordination | Responders + leadership | "DB connection pool exhausted. Restarting pods." |
| #status | Customer-facing updates | Internal + customers | "Payment processing experiencing delays. Investigating." |
| StatusPage | External status | Customers only | "Degraded Performance: Payment API latency elevated" |
| Incident ticket | Documentation | Responders + future reference | Detailed timeline, actions taken, resolution |
Update cadence:
incident_communication:
  initial_report:
    when: "Within 10 minutes of incident start"
    content: "What's wrong, impact, who's investigating"
  progress_updates:
    P1_CRITICAL: "Every 15 minutes until resolved"
    P2_HIGH: "Every 30 minutes until resolved"
    P3_MEDIUM: "Hourly during business hours"
  resolution:
    when: "Immediately upon resolution"
    content: "What was fixed, next steps, post-mortem timeline"
  post_mortem:
    when: "Within 3 business days for P1/P2"
    content: "Root cause, timeline, action items"
Dashboards for Different Audiences
Different stakeholders need different views of system health.
Operations Dashboard (Engineers)
Purpose: Real-time system health for troubleshooting.
Contents:
- Four Golden Signals: Latency, traffic, errors, saturation
- SLI/SLO tracking: Current SLI vs. target, error budget remaining
- Request breakdown: By endpoint, status code, method
- Infrastructure metrics: CPU, memory, connection pools
- Recent deployments: Timeline of changes
dashboard:
  name: "Payment Service - Operations"
  refresh: "10s"
  rows:
    - title: "Golden Signals (Last 1 Hour)"
      panels:
        - type: graph
          title: "Latency (P50, P95, P99)"
          query: |
            histogram_quantile(0.50, sum(rate(payment_latency_bucket[5m])) by (le))
            histogram_quantile(0.95, sum(rate(payment_latency_bucket[5m])) by (le))
            histogram_quantile(0.99, sum(rate(payment_latency_bucket[5m])) by (le))
        - type: graph
          title: "Traffic (Requests/sec)"
          query: "sum(rate(payment_requests_total[5m]))"
        - type: graph
          title: "Error Rate (%)"
          query: |
            sum(rate(payment_errors_total[5m]))
            /
            sum(rate(payment_requests_total[5m])) * 100
        - type: gauge
          title: "Connection Pool Saturation"
          query: "hikaricp_connections_active / hikaricp_connections_max"
          thresholds:
            - value: 0.7
              color: yellow
            - value: 0.85
              color: red

    - title: "SLO Tracking (Last 30 Days)"
      panels:
        - type: stat
          title: "Availability SLO"
          query: |
            sum(rate(payment_requests_total{status!~"5.."}[30d]))
            /
            sum(rate(payment_requests_total[30d])) * 100
          target: 99.9
        - type: stat
          title: "Error Budget Remaining"
          query: |
            (1 - (1 - <availability_sli>) / (1 - 0.999)) * 100
          thresholds:
            - value: 20
              color: red
            - value: 50
              color: yellow
            - value: 100
              color: green
Business Dashboard (Management)
Purpose: High-level health and trends for business stakeholders.
Contents:
- Uptime: Overall availability percentage
- User impact: Requests affected, users impacted (if available)
- SLO compliance: Are we meeting our targets?
- Trends: Week-over-week, month-over-month comparisons
- Incidents: Count and severity
dashboard:
  name: "Platform Health - Executive View"
  refresh: "5m"
  rows:
    - title: "This Month"
      panels:
        - type: stat
          title: "Platform Availability"
          query: "avg(sli_availability_30d)"
          unit: "percent"
          colorMode: "background"
          thresholds:
            - value: 99.9
              color: green
            - value: 99.0
              color: yellow
            - value: 0
              color: red
        - type: stat
          title: "Active Incidents"
          query: "count(ALERTS{severity=~'P1|P2'})"
        - type: stat
          title: "Users Affected (Last 7d)"
          query: "sum(increase(error_requests_by_user[7d]))"

    - title: "Trends (Last 90 Days)"
      panels:
        - type: graph
          title: "Daily Availability %"
          query: "avg_over_time(sli_availability_1d[90d])"
        - type: graph
          title: "Incident Count by Severity"
          query: |
            sum(increase(incidents_total[1d])) by (severity)
Customer-Facing Status Page
Purpose: Transparent communication of service health to external users.
Contents (minimal):
- Overall status: Operational, Degraded, Outage
- Component status: API, Web App, Mobile App, etc.
- Incident history: Last 90 days
- Scheduled maintenance: Upcoming windows
Example (StatusPage.io or custom):
status_page:
  components:
    - name: "Payment API"
      status: operational  # operational | degraded | outage
    - name: "Web Application"
      status: operational
    - name: "Mobile App"
      status: degraded
      description: "Increased latency for some users. Investigating."
  incidents:
    - date: "2024-01-15"
      title: "Payment Processing Delays"
      status: resolved
      duration: "45 minutes"
      summary: "Elevated latency caused payment delays. Issue resolved by scaling database."
  maintenance:
    - date: "2024-01-20 02:00-04:00 UTC"
      title: "Database Maintenance"
      impact: "Brief intermittent errors possible during 2-minute cutover"
Key principle: External status pages should be conservative. Don't report every blip - only user-impacting issues.
Further Reading
Internal Documentation
- Observability Overview - Three pillars: logs, metrics, traces
- Logging Best Practices - Structured logging, correlation IDs
- Application Metrics - Micrometer, Prometheus, custom metrics
- Distributed Tracing - OpenTelemetry, Jaeger integration
- Incident Post-Mortems - Blameless retrospectives
External Resources
- Google SRE Book - Monitoring Distributed Systems - Four Golden Signals origin
- Google SRE Book - Service Level Objectives - SLI/SLO/SLA framework
- Site Reliability Workbook - Alerting on SLOs - Alert design patterns
- Prometheus Best Practices - Alerting - Alert rules and templates
- PagerDuty Incident Response Guide - Incident management workflows
Summary
Effective monitoring and alerting balances comprehensive visibility with actionable insights:
Key principles:
- Monitor what matters: SLIs aligned with user experience, not arbitrary infrastructure metrics
- Alert on symptoms: User-visible issues (latency, errors) not causes (CPU, memory)
- Design for action: Every alert should require immediate response with clear runbook
- Prevent fatigue: Regular alert audits, aggregation, and inhibition prevent alert noise
- Error budgets: Objective framework for balancing reliability and innovation
Implementation path:
- Define SLIs: Choose 2-3 metrics that reflect user experience
- Set SLOs: Target values based on user expectations and historical performance
- Implement Four Golden Signals: Latency, traffic, errors, saturation
- Create runbooks: Step-by-step guides for common scenarios
- Establish on-call rotation: Fair, sustainable schedule with clear escalation
- Build dashboards: Operations, business, and customer-facing views
- Iterate: Regular retrospectives to improve alert quality
Start small: Begin with availability SLI/SLO and one high-quality alert. Expand as you learn what works for your team and systems.