Incident Post-Mortems

Post-mortems transform production incidents into learning opportunities, driving systematic improvements in reliability, processes, and team practices.

Overview

An incident post-mortem is a structured retrospective conducted after a significant production incident. The goal is not to assign blame, but to understand what happened, why it happened, and how to prevent similar incidents in the future.

Purpose

Post-mortems create a culture of continuous improvement where incidents are learning opportunities, not failures to hide. They systematically improve system reliability and team response capabilities.


Core Principles

  • Blameless: Focus on systems and processes, not individuals
  • Thorough: Investigate root causes, not just symptoms
  • Actionable: Produce concrete improvements, not just documentation
  • Transparent: Share learnings across teams and organization
  • Timely: Conduct while details are fresh (within 48-72 hours)

Blameless Post-Mortem Culture

Why Blameless Matters

Blame-focused post-mortems create a culture of fear where engineers hide mistakes, avoid taking risks, and don't report problems early. This makes systems less reliable, not more.

Psychology of Blame: When incidents are blamed on individuals, several negative patterns emerge:

  • Engineers delay reporting problems hoping they'll resolve themselves
  • Teams focus on CYA (covering themselves) rather than fixing issues
  • Knowledge sharing stops because admitting "I didn't know" feels unsafe
  • Innovation stalls because trying new approaches risks becoming the scapegoat
  • Post-mortem meetings become defensive exercises rather than learning opportunities

Blameless ≠ Accountability-Free: Blameless culture doesn't mean no accountability. It means recognizing that humans make errors in complex systems, and focusing on improving the system rather than punishing the person.

  • Bad system: Single engineer has production database credentials, accidentally runs DELETE instead of SELECT. Blame-focused response: "You should have been more careful."
  • Good system: Multiple safeguards prevent accidents - read-only production access by default, elevated privileges require approval, destructive queries require confirmation, automated backups enable quick recovery. System-focused response: "How did our safeguards fail to prevent this?"

Psychological Safety: Teams need psychological safety to report incidents honestly and early. This safety is built through:

  • Leaders modeling vulnerability: Tech leads openly discuss their own past mistakes and learnings
  • Consistent messaging: "We fix systems, not people" repeated until believed
  • Rewarding transparency: Publicly thanking engineers who surface problems early
  • No retribution: Ensuring that honest mistakes never result in punishment, demotion, or negative reviews

When engineers trust they won't be punished, they report problems immediately, provide honest timelines, and volunteer information that helps investigations - all of which improve incident response.

Language Matters

The language used in post-mortems shapes culture. Small word choices signal whether the organization truly embraces blamelessness or merely pays lip service.

Blame-Focused Language (Avoid):

BAD: "John deployed broken code"
BAD: "The developer didn't test properly"
BAD: "Sarah forgot to check the config"
BAD: "Human error caused the outage"
BAD: "The on-call engineer should have responded faster"

These phrases assign fault to individuals, making them defensive and shutting down honest dialogue.

System-Focused Language (Use):

GOOD: "The deployment process allowed untested code to reach production"
GOOD: "Our testing strategy didn't catch this edge case"
GOOD: "The configuration review process was skipped under time pressure"
GOOD: "The system lacked safeguards to prevent this class of error"
GOOD: "The alerting threshold was too high to detect early warning signs"

These phrases acknowledge the human action while focusing on the systemic gap that allowed the error.

Why This Works: System-focused language shifts conversation from "who messed up?" to "what gaps exist?" This framing encourages:

  • Engineers to volunteer information without fear
  • Teams to identify process improvements
  • Organizations to invest in better tooling and safeguards
  • Root cause analysis to go deeper (beyond "human error")

Example Reframing:

| Situation | Blame-Focused | System-Focused |
|-----------|---------------|----------------|
| Missed alert | "On-call engineer ignored the alert" | "Alert fatigue from frequent false positives caused the alert to be missed" |
| Bad deployment | "DevOps team deployed on Friday" | "No policy prevented risky Friday deployments; change freeze not enforced" |
| Database deletion | "DBA ran wrong query" | "Production database lacked row-level security; destructive queries had no confirmation step" |
| Config error | "Engineer typo'd the config value" | "Configuration deployment lacked validation; no schema enforcement caught the invalid value" |

In each case, the system-focused version opens more productive investigation paths than blame.

Creating Psychological Safety

During Incidents:

  • Avoid asking "who did this?" Instead ask "what sequence of events led to this?"
  • Thank people for reporting problems quickly
  • Focus on restoring service, not finding fault
  • Document actions without judgment

During Post-Mortems:

  • Start meetings by explicitly stating "this is a blameless post-mortem"
  • If someone self-blames ("I should have known better"), redirect to system gaps
  • Ask "what could the system have done to prevent this?" not "what should you have done?"
  • Normalize failure: remind team that complex systems fail, and failure is how we learn

In Follow-Up:

  • Performance reviews never penalize honest mistakes documented in post-mortems
  • Promotions recognize engineers who surface and fix systemic issues
  • Team celebrations highlight successful post-incident improvements
  • Post-mortem participation is valued as learning and leadership

Example: Reinforcing Safety:

During post-mortem: "I feel terrible, I should have double-checked the config."

Bad response: [silence, implicit agreement]
Good response: "The fact that a single config typo could cause an outage means our validation failed. Let's talk about schema enforcement and pre-deployment checks."

This response validates the engineer's contribution while shifting focus to actionable improvements.


When to Write a Post-Mortem

Not every incident requires a full post-mortem. Use the following categories to decide:

Always Require Post-Mortem

Customer-Impacting Outages:

  • Any service degradation affecting >1% of users
  • Complete service outage (no users can access)
  • Data loss or corruption affecting customer data
  • Security breach or unauthorized data access
  • Payment processing failures

Example: API latency spiked to 30+ seconds for 15 minutes, affecting 40% of active users. Even though service recovered automatically, the customer impact and unclear root cause warrant investigation.

SLA Violations:

  • Breached uptime SLA (e.g., 99.95% commitment)
  • Breached performance SLA (e.g., P95 latency target)
  • Any incident causing contractual penalties

Example: Service was unavailable for 4 hours in a month with a 99.95% SLA (allows only 21.6 minutes downtime). This breach likely triggers contractual obligations and requires thorough analysis.
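
The 21.6-minute figure follows directly from the SLA arithmetic: a 30-day month has 43,200 minutes, and 0.05% of that is 21.6. A tiny helper (illustrative only, not part of any incident tooling) makes the calculation explicit:

```java
public class SlaBudget {
    /**
     * Allowed downtime for a 30-day month at the given availability
     * (e.g. 0.9995 for a 99.95% SLA). Returns minutes.
     */
    static double allowedDowntimeMinutes(double availability) {
        double minutesPerMonth = 30 * 24 * 60; // 43,200 minutes
        return minutesPerMonth * (1.0 - availability);
    }

    public static void main(String[] args) {
        // A 99.95% SLA allows roughly 21.6 minutes of downtime per month.
        System.out.println(allowedDowntimeMinutes(0.9995));
    }
}
```

Against that 21.6-minute budget, a 4-hour (240-minute) outage is an unambiguous breach by more than a factor of ten.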

Security Incidents:

  • Unauthorized access to systems or data
  • Data exfiltration or breach
  • Successful attack (DDoS, SQL injection, etc.)
  • Exposure of credentials or secrets

Even unsuccessful attacks that reached production systems warrant post-mortems to understand how defenses failed and how to improve.

Data Loss or Corruption:

  • Customer data deleted or corrupted
  • Database integrity compromised
  • Backup/restore failures
  • Data inconsistency between systems

Data is often irreplaceable, making these incidents especially critical to understand and prevent.

May Require Post-Mortem (Judgment Call)

Near-Misses:

  • Serious issue caught before customer impact
  • Escalation that could have become outage
  • Monitoring or alerting caught problem just in time

Example: Database reaching 95% capacity was caught by alerts, and engineers scaled before customer impact. No outage occurred, but the near-miss reveals capacity planning gaps worth investigating.

Recurring Issues:

  • Third occurrence of similar problem
  • Pattern of related incidents
  • Symptom of deeper systemic issue

Even if individual incidents are minor, patterns suggest underlying issues that post-mortems can uncover.

Process Failures:

  • Deployment blocked by unexpected issues
  • Major miscommunication between teams
  • Runbook was inaccurate or incomplete
  • Change management process bypassed

These may not directly impact customers but reveal process gaps that could cause future incidents.

Learning Opportunities:

  • Interesting technical failure mode
  • Novel problem worth documenting
  • New technology or pattern being tested

Even non-critical incidents can be valuable learning opportunities, especially when exploring new technologies.

Typically Don't Require Post-Mortem

  • Individual user reports (not systemic)
  • Planned maintenance windows
  • Known issues with existing workarounds
  • Issues entirely within SLA bounds
  • Incidents fully explained by previous post-mortems

Guidelines:

  • If in doubt, write a lightweight post-mortem
  • Tech lead makes final decision on post-mortem necessity
  • Track decision in incident log (why post-mortem was/wasn't written)

Post-Mortem Template

Use this template to ensure comprehensive, consistent post-mortems.

# Post-Mortem: [Brief Incident Description]

**Date**: YYYY-MM-DD
**Author(s)**: [@person1, @person2]
**Reviewers**: [@tech-lead, @sre, @team]
**Incident ID**: INC-12345
**Severity**: [SEV1 / SEV2 / SEV3]
**Status**: [Draft / Under Review / Published]

---

## Executive Summary

[2-3 sentence summary of what happened, impact, and resolution. Written for leadership who won't read full doc.]

**Example**:
"On November 10, 2025, the Payment API experienced a 2-hour outage affecting 100% of payment processing. The root cause was a database connection pool exhaustion triggered by a traffic spike combined with a connection leak. Service was restored by restarting the application and scaling the database connection pool. No data was lost."

---

## Impact

### Customer Impact
- **Affected users**: [number/percentage of users affected]
- **Duration**: [total duration of customer-facing impact]
- **Severity**: [complete outage / degraded performance / intermittent errors]
- **User experience**: [describe what customers experienced]

**Example**:
- **Affected users**: ~15,000 users (40% of active users during incident)
- **Duration**: 2 hours 15 minutes (10:30 AM - 12:45 PM UTC)
- **Severity**: Complete outage for payment processing; other services operational
- **User experience**: Users received "Service Unavailable" errors when attempting payments. Failed transactions were not retried automatically.

### Business Impact
- **Revenue impact**: [estimated revenue loss]
- **SLA impact**: [SLA breaches, customer credits owed]
- **Reputation impact**: [customer complaints, media coverage]
- **Operational impact**: [team time spent, opportunity cost]

**Example**:
- **Revenue impact**: Estimated $45,000 in lost transaction fees
- **SLA impact**: Breached 99.95% uptime SLA for 3 enterprise customers (~$12,000 in credits)
- **Reputation impact**: 47 support tickets, 3 escalations to account managers
- **Operational impact**: 12 engineer-hours for incident response + 8 engineer-hours for post-mortem and remediation

### Technical Impact
- **Systems affected**: [list of services/components impacted]
- **Data impact**: [data loss, corruption, inconsistency]
- **Downstream impact**: [effects on dependent services]

**Example**:
- **Systems affected**: Payment API, Payment Worker service (dependency)
- **Data impact**: 234 payments stuck in PENDING state (manually reconciled post-incident)
- **Downstream impact**: Notification service queued 15,000 failed payment emails (suppressed)

---

## Timeline

Detailed timeline of events from detection to resolution. Use UTC timestamps.

| Time (UTC) | Event | Who/What |
|------------|-------|----------|
| 10:30 | Payment API error rate spiked to 95% | Automated monitoring |
| 10:32 | On-call engineer paged | PagerDuty |
| 10:35 | Engineer investigated logs, saw database connection errors | @engineer1 |
| 10:40 | Database team engaged | @engineer1 → @dba-team |
| 10:45 | Database metrics showed connection pool exhausted (200/200 connections) | @dba1 |
| 10:50 | Decision made to restart Payment API instances | @engineer1, @dba1 |
| 10:55 | First instance restarted, no improvement | @engineer1 |
| 11:00 | Incident escalated to SEV1 | @engineer1 → @incident-commander |
| 11:05 | All instances restarted, connections cleared temporarily | @engineer1 |
| 11:10 | Connection pool exhausted again within 5 minutes | Monitoring |
| 11:15 | Code review revealed connection leak in new payment method (deployed 2 days prior) | @engineer2 |
| 11:25 | Decision made to rollback to previous version | @incident-commander |
| 11:35 | Rollback deployment initiated | @engineer1 |
| 11:50 | Rollback complete, service restored | Deployment pipeline |
| 12:00 | Monitoring confirmed error rates returned to normal (<0.1%) | Monitoring |
| 12:15 | Database connection pool usage normalized (20/200 connections) | @dba1 |
| 12:45 | All-clear declared, incident closed | @incident-commander |

**Key Observations**:
- 2-minute detection time (monitoring working well)
- 10 minutes spent investigating before engaging database team (could have been faster)
- Restart provided temporary relief but didn't address root cause
- Code review identified root cause 45 minutes into incident
- Rollback took 25 minutes (acceptable, tested process)

---

## Root Cause Analysis

### What Happened

[Detailed technical explanation of the failure sequence]

**Triggering Event**:
On November 8 (2 days before incident), a code change introduced a new payment method for ACH bank transfers. The implementation opened database connections in a try block but failed to close them in a finally block, creating a connection leak.

**Failure Cascade**:
1. **Day 1-2**: Connection leak slowly consumed connection pool under normal traffic (5-10 leaked connections/hour)
2. **Day 3 (incident day)**: Traffic spike from marketing campaign (3x normal volume)
3. **Connection pool exhaustion**: Leak rate increased with traffic; pool exhausted within 30 minutes of spike
4. **Request failures**: New requests failed immediately with "Cannot get connection" errors
5. **Thread exhaustion**: Application threads blocked waiting for database connections
6. **Service unresponsive**: Health checks failed, service marked down by load balancer

**Code Defect**:
```java
// Defective code (simplified)
public void processAchPayment(PaymentRequest request) {
    Connection conn = dataSource.getConnection(); // Connection acquired
    try {
        // Process payment
        PreparedStatement stmt = conn.prepareStatement(SQL);
        stmt.executeUpdate();
    } catch (SQLException e) {
        log.error("Payment failed", e);
        throw new PaymentException(e);
    }
    // Connection never closed! Missing finally block
}
```

With no finally block (or try-with-resources), the connection was never returned to the pool on any code path: every call to this method, successful or not, left one connection orphaned.
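
The remediation the action items point to (try-with-resources) closes the resource on every exit path, including exceptions. A minimal, self-contained sketch using a fake connection type (hypothetical names, for illustration only) shows the difference:

```java
public class TryWithResourcesDemo {
    static int openCount = 0; // stands in for the pool's active-connection gauge

    static class FakeConnection implements AutoCloseable {
        FakeConnection() { openCount++; }
        @Override public void close() { openCount--; }
    }

    // Mirrors the defective pattern: the connection is acquired but never closed.
    static void leaky(boolean fail) {
        FakeConnection conn = new FakeConnection();
        try {
            if (fail) throw new RuntimeException("payment failed");
        } catch (RuntimeException e) {
            // error handled, but conn is now orphaned
        }
    }

    // Corrected pattern: close() runs automatically on every path.
    static void safe(boolean fail) {
        try (FakeConnection conn = new FakeConnection()) {
            if (fail) throw new RuntimeException("payment failed");
        } catch (RuntimeException e) {
            // error handled; conn was already closed before we got here
        }
    }

    public static void main(String[] args) {
        leaky(true);
        System.out.println(openCount); // 1: one connection leaked
        safe(true);
        System.out.println(openCount); // still 1: safe() leaked nothing
    }
}
```

The same guarantee is what the "refactor all database access to use try-with-resources" action item buys for the real code.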

Why It Happened

Use the "5 Whys" technique to drill into the root cause:

  1. Why did the service go down?

    • Database connection pool exhausted, no connections available for new requests
  2. Why was the connection pool exhausted?

    • Code defect leaked database connections (didn't close connections in finally block)
  3. Why did the defective code reach production?

    • Code review didn't catch the missing finally block
    • Static analysis tools didn't flag resource leak
    • Integration tests didn't detect connection leaks (short-lived test executions)
  4. Why didn't code review catch this?

    • Reviewer focused on business logic correctness
    • Connection management is boilerplate, easy to miss in review
    • No explicit checklist item for resource management
  5. Why didn't automated testing catch this?

    • Integration tests ran for under 1 minute (leak not apparent)
    • No connection pool monitoring in test environments
    • Load tests not run for minor payment method additions

Contributing Factors:

  • Traffic spike: Marketing campaign created 3x normal load, accelerating leak impact
  • Monitoring gap: No alerting on connection pool usage (0-80% range)
  • Slow diagnosis: On-call engineer unfamiliar with database connection pool metrics
  • Delayed escalation: Database team not engaged until 10 minutes into incident

Systemic Issues

Beyond immediate defect, what systemic issues allowed this to happen?

Code Review Process:

  • No checklist for resource management (connections, files, streams)
  • Reviewers inconsistent in checking resource cleanup
  • No automated resource leak detection in CI pipeline

Testing Strategy:

  • Integration tests too short-lived to detect resource leaks
  • No soak testing or long-running test environments
  • Load testing not performed for "small" changes
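
One way to close this gap is a leak check that compares the pool's active-connection count before and after exercising the code under test. The sketch below uses a hypothetical counting pool; production pools such as HikariCP expose comparable active-connection gauges through their metrics APIs:

```java
public class LeakCheckDemo {
    /** Toy stand-in for a connection pool that tracks checked-out connections. */
    static class CountingPool {
        int active = 0;
        void acquire() { active++; }
        void release() { active--; }
    }

    /** Fails if the exercised code did not return every connection it took. */
    static void assertNoLeak(CountingPool pool, Runnable codeUnderTest) {
        int before = pool.active;
        codeUnderTest.run();
        if (pool.active != before) {
            throw new AssertionError(
                "Connection leak: " + (pool.active - before) + " unreleased");
        }
    }

    public static void main(String[] args) {
        CountingPool pool = new CountingPool();
        // Well-behaved code acquires and releases in pairs, so the check passes.
        assertNoLeak(pool, () -> { pool.acquire(); pool.release(); });
        System.out.println("no leak detected");
    }
}
```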

Monitoring and Alerting:

  • Database connection pool usage not monitored
  • No alerting on connection pool exhaustion warning thresholds
  • Missing metrics for connection acquisition time (early warning signal)
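
The alerting called for here (and in the action items: >70% warning, >85% critical) reduces to a threshold classification over pool usage. This sketch uses hypothetical names and treats the thresholds as fractions of pool size:

```java
public class PoolAlerts {
    enum Level { OK, WARNING, CRITICAL }

    /** Classify pool usage against the 70%/85% thresholds from the action items. */
    static Level classify(int activeConnections, int maxConnections) {
        double usage = (double) activeConnections / maxConnections;
        if (usage > 0.85) return Level.CRITICAL;
        if (usage > 0.70) return Level.WARNING;
        return Level.OK;
    }

    public static void main(String[] args) {
        System.out.println(classify(200, 200)); // CRITICAL: the incident state (200/200)
        System.out.println(classify(150, 200)); // WARNING: early-intervention zone
        System.out.println(classify(20, 200));  // OK: the post-recovery state (20/200)
    }
}
```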

Deployment Process:

  • No gradual rollout (canary deployment) for this change
  • 100% traffic switched to new version immediately
  • No automated rollback on error rate spike

Knowledge Gaps:

  • On-call engineer unfamiliar with database connection pool troubleshooting
  • No runbook for connection pool exhaustion scenario
  • Team lacks training on resource management patterns in Java

What Went Well

Post-mortems should also highlight successes to reinforce good practices.

GOOD: Fast Detection:

  • Monitoring detected error rate spike within 2 minutes
  • Automated paging worked correctly
  • On-call engineer acknowledged page within 3 minutes

GOOD: Effective Communication:

  • Incident Slack channel created immediately
  • Status updates posted every 15 minutes
  • Customer support notified within 10 minutes

GOOD: Rollback Process:

  • Rollback decision made decisively after confirming root cause
  • Deployment pipeline enabled 1-click rollback
  • Rollback completed in 25 minutes (within target)

GOOD: No Data Loss:

  • Database integrity maintained throughout incident
  • Payment state machine handled failures gracefully
  • Post-incident reconciliation identified and resolved stuck payments

GOOD: Team Collaboration:

  • Cross-functional team (engineers, DBAs, incident commander) worked effectively
  • Clear decision-making authority (incident commander)
  • No finger-pointing during incident

Lessons to Preserve: These practices worked well and should be maintained and reinforced in team onboarding and incident response training.


What Went Wrong

Honest assessment of failures and gaps.

BAD: Code Review Missed Defect:

  • Resource leak not caught despite standard review process
  • No automated detection of resource leaks in CI

BAD: Testing Gaps:

  • Integration tests too short to detect leak
  • No load testing for new payment method
  • No connection pool monitoring in test environments

BAD: Monitoring Blind Spot:

  • Connection pool usage not monitored or alerted on
  • No early warning for resource exhaustion

BAD: Delayed Diagnosis:

  • Took 45 minutes to identify root cause
  • On-call engineer unfamiliar with connection pool metrics
  • Database team engagement delayed

BAD: Deployment Risk:

  • 100% rollout for new payment method (no canary)
  • No automated rollback on error spike

BAD: Runbook Gap:

  • No documented procedure for connection pool exhaustion
  • On-call engineer had to learn troubleshooting during incident

Action Items

Concrete, assigned, time-bound improvements. Each action item must have an owner and deadline.

Prevent Recurrence (High Priority)

| Action | Owner | Deadline | Status | Jira Ticket |
|--------|-------|----------|--------|-------------|
| Add SpotBugs plugin to detect resource leaks in CI pipeline | @engineer2 | 2025-11-20 | Done | |
| Create code review checklist with resource management section | @tech-lead | 2025-11-18 | Done | |
| Add connection pool usage monitoring and alerting (>70% = warning, >85% = critical) | @sre-team | 2025-11-25 | In Progress | PROJ-5680 |
| Refactor all database access to use try-with-resources pattern | @engineer1 | 2025-12-05 | To Do | PROJ-5681 |
| Implement automated connection leak testing in integration test suite | @engineer2 | 2025-12-01 | To Do | PROJ-5682 |

Improve Detection and Response (Medium Priority)

| Action | Owner | Deadline | Status | Jira Ticket |
|--------|-------|----------|--------|-------------|
| Create runbook for "Database connection pool exhaustion" | @dba-team | 2025-11-22 | In Progress | PROJ-5683 |
| Add database troubleshooting to on-call training materials | @sre-team | 2025-12-10 | To Do | PROJ-5684 |
| Implement automated canary deployment for payment-related changes | @devops-team | 2025-12-15 | To Do | PROJ-5685 |
| Configure automated rollback on sustained error rate >10% | @devops-team | 2025-12-15 | To Do | PROJ-5686 |

Systemic Improvements (Lower Priority, Long-Term)

| Action | Owner | Deadline | Status | Jira Ticket |
|--------|-------|----------|--------|-------------|
| Implement soak testing environment running 24/7 with production-like load | @sre-team | 2026-01-15 | To Do | PROJ-5687 |
| Evaluate connection pool size and scaling strategy across all services | @dba-team | 2025-12-20 | To Do | PROJ-5688 |
| Conduct "gameday" exercise simulating resource exhaustion scenarios | @sre-team | 2026-02-01 | To Do | PROJ-5689 |

Action Item Guidelines:

  • Specific: Clearly defined, actionable task (not vague goals like "improve testing")
  • Assigned: Single owner responsible for completion (can delegate, but owns outcome)
  • Time-bound: Realistic deadline based on priority and complexity
  • Tracked: Linked to Jira ticket for visibility and accountability
  • Prioritized: High/Medium/Low priority based on risk and impact
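
These properties can even be enforced in tooling. As an illustration (names and fields are hypothetical), an action item might be modeled as a type that rejects ownerless or open-ended entries:

```java
import java.time.LocalDate;

public class ActionItem {
    final String description;
    final String owner;       // exactly one accountable owner
    final LocalDate deadline; // time-bound
    final String ticket;      // tracked, e.g. "PROJ-5683"

    ActionItem(String description, String owner, LocalDate deadline, String ticket) {
        if (owner == null || owner.isBlank()) {
            throw new IllegalArgumentException("Action item needs a single owner");
        }
        if (deadline == null) {
            throw new IllegalArgumentException("Action item needs a deadline");
        }
        this.description = description;
        this.owner = owner;
        this.deadline = deadline;
        this.ticket = ticket;
    }
}
```

For example, `new ActionItem("Create runbook", "@dba-team", LocalDate.parse("2025-11-22"), "PROJ-5683")` is accepted, while an item with no owner is rejected at construction time.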

Follow-Up Process:

  • Weekly review of action item status in team standup
  • Tech lead tracks overall completion rate
  • Incomplete items discussed in retrospectives
  • Action items from post-mortems never silently dropped

Lessons Learned

High-level takeaways for the team and organization.

Technical Lessons

Resource Management is Critical: Resource leaks (connections, file handles, memory) are subtle bugs that don't manifest in short test runs but cause catastrophic failures under load. Always use language-provided resource management (try-with-resources in Java, context managers in Python, defer in Go).

Monitoring Requires Coverage: If a resource can be exhausted (connections, threads, memory, disk), it must be monitored and alerted on. Waiting until 100% exhaustion is too late - alert at 70-80% to allow proactive intervention.

Static Analysis Catches Defects: Automated tools like SpotBugs, ErrorProne, and SonarQube catch entire classes of bugs that humans miss in review. These tools should be mandatory CI steps, not optional.

Process Lessons

Code Review Needs Structure: Unstructured code reviews are inconsistent. Checklists ensure reviewers check critical aspects like resource management, error handling, and security consistently.

Testing Must Match Production: Tests that run for 30 seconds won't catch issues that manifest after 30 minutes. Soak testing and long-running environments are necessary to catch time-dependent bugs.

Gradual Rollouts Reduce Blast Radius: Deploying changes to 100% of traffic immediately means 100% of users affected when problems occur. Canary deployments (5% → 25% → 100%) limit impact and enable fast rollback.
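
A canary gate of that shape can be sketched as a pure function from the current stage and observed error rate to the next step; the 5/25/100 stages and the 10% rollback threshold mirror the numbers in this post-mortem but are otherwise illustrative:

```java
public class CanaryGate {
    static final int[] STAGES = {5, 25, 100};      // percent of traffic
    static final double ROLLBACK_THRESHOLD = 0.10; // sustained error rate triggering rollback

    /** Returns the next traffic percentage, or -1 to signal rollback. */
    static int next(int currentStagePercent, double errorRate) {
        if (errorRate > ROLLBACK_THRESHOLD) {
            return -1; // roll back instead of promoting
        }
        for (int i = 0; i < STAGES.length - 1; i++) {
            if (STAGES[i] == currentStagePercent) {
                return STAGES[i + 1]; // promote to the next stage
            }
        }
        return 100; // already fully rolled out
    }

    public static void main(String[] args) {
        System.out.println(next(5, 0.001));  // 25: healthy canary, promote
        System.out.println(next(25, 0.001)); // 100: healthy, full rollout
        System.out.println(next(25, 0.95));  // -1: error spike, roll back
    }
}
```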

Organizational Lessons

Runbooks Reduce MTTR: On-call engineers can't know everything. Documented procedures for common failure modes (connection exhaustion, memory leaks, cache failures) dramatically reduce mean time to resolution.

Training Improves Response: Expecting on-call engineers to learn troubleshooting during incidents is unfair and slow. Proactive training and gameday exercises build muscle memory for faster incident response.

Blamelessness Enables Honesty: The engineer who wrote the defective code participated actively in the post-mortem, providing critical context for timeline and root cause analysis. This only happens when the culture is truly blameless.



Appendix

Glossary

  • SEV1: Critical incident, complete service outage or data loss
  • SEV2: Major incident, significant degradation or partial outage
  • SEV3: Minor incident, limited impact or resolved quickly
  • MTTR: Mean Time To Resolution
  • RCA: Root Cause Analysis

Metrics

  • Time to Detect (TTD): 2 minutes (monitoring → alert)
  • Time to Engage (TTE): 3 minutes (alert → engineer acknowledgment)
  • Time to Understand (TTU): 45 minutes (detection → root cause identified)
  • Time to Resolve (TTR): 2 hours 15 minutes (detection → all-clear)
  • Time to Recovery (TTRec): 1 hour 20 minutes (detection → service restored)
Related Incidents

  • INC-11234 - Similar connection leak in Notification service (Sept 2025)
  • INC-10456 - Thread pool exhaustion in API Gateway (July 2025)
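
The response-time metrics above fall straight out of the timeline's UTC timestamps; a short java.time sketch reproduces two of them:

```java
import java.time.Duration;
import java.time.LocalTime;

public class IncidentMetrics {
    static Duration between(String startUtc, String endUtc) {
        return Duration.between(LocalTime.parse(startUtc), LocalTime.parse(endUtc));
    }

    public static void main(String[] args) {
        // TTD: error spike (10:30) -> on-call paged (10:32)
        System.out.println(between("10:30", "10:32").toMinutes()); // 2
        // TTR: detection (10:30) -> all-clear declared (12:45)
        System.out.println(between("10:30", "12:45").toMinutes()); // 135 (2h 15m)
    }
}
```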

---

## Conducting Post-Mortem Meetings

Post-mortem meetings bring the team together to collaboratively analyze the incident and develop action items.

### Meeting Preparation

**Before the Meeting** (Owner: Post-Mortem Author):

1. **Draft post-mortem document** (1-2 days after incident)
- Collect incident timeline from Slack, logs, monitoring
- Interview involved engineers for context
- Draft initial root cause hypothesis
- Share draft 24 hours before meeting

2. **Identify participants**
- All engineers involved in incident response
- On-call engineer
- Tech lead and/or architect
- Product owner (if customer impact significant)
- Representatives from affected teams (database, SRE, etc.)

3. **Schedule meeting**
- Within 48-72 hours of incident (while details fresh)
- 60-90 minutes duration
- Mandatory attendance for key participants

4. **Set up logistics**
- Video conference link (for remote participants)
- Shared screen for live document editing
- Incident timeline, logs, and dashboards accessible

### Meeting Facilitation

**Facilitator Role** (Usually Tech Lead or Incident Commander):

The facilitator guides the meeting to ensure productive, blameless discussion and actionable outcomes.

**Responsibilities**:
- Keep discussion on track (prevent tangents)
- Ensure all voices heard (draw out quiet participants)
- Redirect blame language to system-focused language
- Timebox sections (don't get stuck debating one point)
- Drive toward action items and owners

**Meeting Agenda** (60 minutes):

```markdown
## Post-Mortem Meeting Agenda

**Incident**: [Brief description]
**Date**: [Meeting date]
**Facilitator**: [@person]
**Notetaker**: [@person]
**Participants**: [@person1, @person2, @person3, ...]

---

### 1. Introduction (5 minutes)

**Facilitator**:
- Welcome and thank participants
- State purpose: "We're here to learn, not to blame"
- Review meeting agenda and timeboxes
- Assign notetaker (captures discussion points and action items)

---

### 2. Incident Overview (10 minutes)

**Author** presents:
- What happened (high-level summary)
- Impact (customers, business, systems)
- Timeline (key events, detection to resolution)

**Questions** for clarification only (not debating root cause yet)

---

### 3. Root Cause Analysis (20 minutes)

**Discussion**:
- What was the immediate technical cause?
- What underlying factors contributed?
- 5 Whys exercise (go deeper than surface cause)
- What could have prevented this?

**Facilitator** ensures:
- Discussion stays technical (system gaps, not individual mistakes)
- All perspectives considered (engineers, DBAs, SRE, etc.)
- Root cause clearly identified and agreed upon

---

### 4. What Went Well / What Went Wrong (10 minutes)

**Discussion**:
- What practices should we preserve? (celebrate successes)
- What gaps did we identify? (monitoring, testing, processes)

**Facilitator** encourages:
- Balanced discussion (not just dwelling on negatives)
- Specific examples (not vague generalizations)

---

### 5. Action Items (15 minutes)

**Brainstorm** improvements:
- How do we prevent this from happening again?
- How do we detect faster next time?
- How do we respond better?
- What systemic improvements are needed?

**Prioritize** action items:
- High: Prevent recurrence of this specific issue
- Medium: Improve detection and response
- Low: Long-term systemic improvements

**Assign** owners and deadlines:
- Each action item has single owner
- Deadlines realistic based on priority and complexity
- Owners commit to deadlines (not voluntold)

**Facilitator** ensures:
- Action items are specific and actionable (not vague goals)
- Deadlines are realistic (not aspirational)
- Owners accept assignments (not coerced)

---

### 6. Wrap-Up (10 minutes)

**Review**:
- Recap key decisions
- Confirm action item owners and deadlines
- Identify any open questions requiring follow-up

**Next steps**:
- Author finalizes post-mortem document
- Post-mortem published to team wiki/Confluence
- Action items created in Jira and tracked in sprint planning
- Summary shared with broader organization

**Facilitator** thanks participants and closes meeting
```

Meeting Best Practices

Do:

  • [GOOD] Start on time (respect participants' schedules)
  • [GOOD] Use collaborative document editing (everyone sees updates live)
  • [GOOD] Encourage participation from all attendees (ask quiet people for input)
  • [GOOD] Timebox discussions (use "parking lot" for tangents)
  • [GOOD] Focus on learning and improvement (forward-looking)
  • [GOOD] End with clear action items and next steps

Don't:

  • [BAD] Let one person dominate discussion
  • [BAD] Debate technical implementation details (move to follow-up)
  • [BAD] Skip action item assignment (vague "someone should...")
  • [BAD] Rush through root cause (take time to understand deeply)
  • [BAD] Allow blame language (redirect to system focus)
  • [BAD] Go over time without explicit agreement

Handling Difficult Situations:

If someone self-blames:

Participant: "I should have caught this in code review. This is my fault."

Facilitator: "Thanks for your honesty. Let's talk about what in our review process could have caught this. Do we have a checklist for resource management? Should we add automated checks?"

If discussion becomes a blame session:

Participant: "DevOps shouldn't have deployed on a Friday."

Facilitator: "Let's reframe that. What policy or guardrail could prevent risky Friday deployments? Should we implement a change freeze?"

If root cause debate stalls:

Facilitator: "We have two hypotheses for root cause. Let's document both in the post-mortem and assign someone to investigate further offline. We need to move to action items."

If action items are too vague:

Participant: "We should improve our testing."

Facilitator: "Can we make that more specific? For example: 'Add connection leak test to integration suite' or 'Implement soak testing environment'?"

Post-Meeting Follow-Up

Within 24 Hours (Author):

  • Finalize post-mortem document incorporating meeting discussion
  • Create Jira tickets for all action items with owners and deadlines
  • Publish post-mortem to shared repository (Confluence, internal wiki, GitLab)
  • Share summary with broader engineering team and leadership

Within 1 Week (Tech Lead):

  • Review action items in sprint planning
  • Prioritize high-priority items for immediate sprint
  • Schedule medium/low priority items for future sprints

Ongoing (Team):

  • Track action item completion in weekly standups
  • Update post-mortem with completion status
  • Celebrate completed action items (recognize improvement effort)

Sharing Post-Mortems

Sharing post-mortems broadly multiplies their value by spreading learnings across teams and preventing similar incidents elsewhere.

Internal Sharing

Team Level:

  • Publish all post-mortems to team wiki or Confluence
  • Create "Post-Mortem" tag/category for easy discovery
  • Reference relevant post-mortems in onboarding materials

Engineering Organization:

  • Share summary in engineering all-hands meetings
  • Send email digest of post-mortems monthly
  • Maintain searchable repository of all post-mortems

Cross-Team:

  • Alert related teams to relevant post-mortems (e.g., share database post-mortem with all teams using that database)
  • Invite related teams to post-mortem meetings when their systems are involved
  • Create cross-team action items when systemic issues span teams

Leadership:

  • Provide executive summaries for significant incidents
  • Include post-mortem metrics in quarterly reviews (MTTR trends, action item completion rates)
  • Highlight successful improvements driven by post-mortems

External Sharing

Some organizations publish post-mortems externally to build trust and contribute to industry knowledge.

When to Share Externally:

  • Outage affected customers significantly (transparency builds trust)
  • Incident provides valuable learning for broader industry
  • Novel failure mode or interesting technical problem
  • Company culture values radical transparency

Redaction for Public Sharing:

  • Remove customer-specific details (names, account numbers, business metrics)
  • Remove internal system names and architecture details (security risk)
  • Remove sensitive financial data (revenue numbers, SLA details)
  • Remove names of individuals (privacy and blamelessness)
  • Generalize enough to prevent competitive intelligence gathering
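
The mechanical parts of this redaction checklist can be partially automated before a human review pass. A minimal sketch, assuming post-mortems are plain text; the account-ID and hostname patterns here are invented placeholders, so tune them to your organization's actual formats:

```python
import re

# Hypothetical patterns; adapt to your organization's real data formats.
REDACTIONS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<redacted-email>"),
    (re.compile(r"\bACCT-\d{6,}\b"), "<redacted-account>"),    # assumed account-ID format
    (re.compile(r"\bdb-prod-[\w-]+\b"), "<internal-host>"),    # assumed internal hostname scheme
]

def redact(text: str) -> str:
    """Apply regex-based redactions before publishing a post-mortem externally."""
    for pattern, replacement in REDACTIONS:
        text = pattern.sub(replacement, text)
    return text

print(redact("Contact oncall@example.com; impact on ACCT-123456 via db-prod-users-01."))
```

Automated redaction should be a first pass only; a human reviewer still checks the result, since regexes cannot catch context-dependent details like business metrics in prose.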

Example Public Post-Mortems: Cloudflare, GitHub, and Stripe regularly publish redacted post-mortems that balance transparency with security.

Learning from Others' Post-Mortems

Reading post-mortems from other teams and companies is valuable proactive learning.

Internal Post-Mortem Review:

  • Dedicate 15 minutes in team meetings to review recent post-mortems
  • Ask: "Could this happen to our service?" and "How can we prevent it?"
  • Create action items for preventative measures (even though the incident didn't affect your service)

External Post-Mortem Sources:

  • Public post-mortems from companies such as Cloudflare, GitHub, and Stripe

Learning Discussion: In team meetings, discuss external post-mortems:

  • "Stripe had a Redis failover issue that caused 30 minutes of downtime. Do we have similar Redis single point of failure? Should we test our failover process?"
  • "GitHub's MySQL cluster failover failed due to missing monitoring. Do we have monitoring for our database cluster health?"

This proactive learning prevents incidents before they happen.


Action Item Tracking and Follow-Up

Post-mortems are only valuable if action items are actually completed. Rigorous tracking ensures improvements happen.

Tracking System

Jira Integration: Create action items as Jira tickets with:

  • Type: "Post-Mortem Action Item" (custom issue type or label)
  • Summary: Clear, actionable description
  • Description: Link back to post-mortem document for context
  • Assignee: Single owner responsible for completion
  • Due Date: Deadline based on priority
  • Priority: High / Medium / Low
  • Labels: post-mortem, incident ID (e.g., INC-12345), category (e.g., monitoring, testing, process)
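
Ticket creation with these fields can be scripted against Jira's REST API (`POST /rest/api/2/issue`). A sketch of the payload builder; the project key, issue type, and category label below are assumptions to adapt to your instance:

```python
import json

def action_item_fields(summary, postmortem_url, assignee, due_date,
                       priority="High", incident_id="INC-12345",
                       category="monitoring"):
    """Build the Jira 'fields' payload for a post-mortem action item.

    POST this as {"fields": ...} to /rest/api/2/issue on your instance.
    """
    return {
        "project": {"key": "ENG"},              # assumed project key
        "issuetype": {"name": "Task"},          # or a custom "Post-Mortem Action Item" type
        "summary": summary,
        "description": f"Context: {postmortem_url}",   # link back to the post-mortem
        "assignee": {"name": assignee},
        "duedate": due_date,                    # ISO date, e.g. "2024-07-01"
        "priority": {"name": priority},
        "labels": ["post-mortem", incident_id, category],
    }

payload = {"fields": action_item_fields(
    "Add connection leak test to integration suite",
    "https://wiki.example.com/pm/INC-12345", "jdoe", "2024-07-01")}
print(json.dumps(payload, indent=2))
```

Scripting this from a post-mortem template keeps labels and links consistent, which is what makes the dashboard queries below reliable.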

Dashboard: Create Jira dashboard tracking:

  • Total action items by status (To Do, In Progress, Done)
  • Action items by priority
  • Overdue action items (flagged for escalation)
  • Action item completion rate over time
  • Action items by category (monitoring, testing, process, etc.)

Follow-Up Cadence

Weekly (Team Standup):

  • Review in-progress action items
  • Identify blockers and provide help
  • Celebrate completed items

Sprint Planning:

  • Prioritize high-priority action items for current sprint
  • Schedule medium-priority items for upcoming sprints
  • Ensure bandwidth allocated for action items (not just features)

Monthly (Team Retrospective):

  • Review overall action item completion rate
  • Discuss patterns (categories with most action items = systemic issues)
  • Identify action items that should be escalated or reprioritized

Quarterly (Leadership Review):

  • Report on action item completion metrics
  • Highlight significant improvements from post-mortems
  • Discuss resourcing needs if the action item backlog is growing

Accountability

Owner Responsibility:

  • Owner commits to deadline when action item assigned
  • Owner provides status updates in standups
  • Owner escalates blockers immediately (doesn't wait for deadline)
  • Owner ensures action item completed or explicitly reprioritized (not silently dropped)

Tech Lead Responsibility:

  • Tech lead tracks overall action item health
  • Tech lead escalates overdue items
  • Tech lead ensures action items prioritized in sprint planning
  • Tech lead prevents action items from being silently dropped

Escalation Process: When action items become overdue:

  1. Week 1 overdue: Tech lead follows up with owner (blocker? need help? reprioritize?)
  2. Week 2 overdue: Escalate in team standup (public visibility, team offers help)
  3. Week 3 overdue: Escalate to engineering manager (resourcing issue? priority dispute?)
  4. Week 4+ overdue: Explicit decision required (complete, reprioritize, or drop with documented rationale)
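
The escalation ladder above is a simple lookup, which makes it easy to encode in, say, a weekly bot that flags overdue tickets. A sketch (the step descriptions mirror the list; integration with your tracker is omitted):

```python
def escalation_step(weeks_overdue: int) -> str:
    """Map weeks overdue to the escalation action from the ladder above."""
    if weeks_overdue <= 0:
        return "on track"
    steps = {
        1: "tech lead follows up with owner",
        2: "raise in team standup",
        3: "escalate to engineering manager",
    }
    # Week 4+: force an explicit decision rather than letting the item linger.
    return steps.get(weeks_overdue, "explicit decision: complete, reprioritize, or drop")

for weeks in range(5):
    print(weeks, "->", escalation_step(weeks))
```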

Never Silently Drop: Action items should never disappear without explicit decision. If an action item is no longer relevant or feasible:

  • Document reason in post-mortem
  • Close Jira ticket with explanation
  • Communicate to team (transparency)

This accountability ensures post-mortems drive real improvements, not just documentation.


Measuring Post-Mortem Effectiveness

Track metrics to ensure post-mortem process delivers value.

Process Metrics

Timeliness:

  • Post-mortem completion time: Target <72 hours from incident resolution to published post-mortem
  • Meeting scheduling: Target <48 hours from incident to meeting

Quality:

  • Root cause identified: 100% of post-mortems should identify clear root cause (not "human error")
  • Action items generated: Average 5-10 action items per post-mortem
  • Reviewer feedback: Post-mortem reviews identify gaps or unclear sections

Outcome Metrics

Action Item Completion:

  • Completion rate: Target >90% of action items completed within deadline
  • High-priority completion rate: Target 100% of high-priority items completed
  • Time to completion: Average time from creation to completion

Recurrence Prevention:

  • Incident recurrence: Track whether similar incidents occur after the post-mortem (recurrence indicates action items were ineffective)
  • Problem category trends: Are monitoring, testing, or process categories decreasing over time?

System Reliability:

  • MTTR improvement: Is mean time to resolution decreasing over time?
  • Incident frequency: Is overall incident rate decreasing?
  • Severity distribution: Are SEV1 incidents becoming less frequent?
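
Several of these metrics fall out of basic incident records. A sketch computing MTTR and action-item completion rate; the record field names are assumptions, so adapt them to your incident tooling's export format:

```python
from datetime import datetime
from statistics import mean

# Assumed record shapes; real data would come from your incident tracker.
incidents = [
    {"detected": "2024-05-01T10:00", "resolved": "2024-05-01T11:30", "sev": 1},
    {"detected": "2024-05-14T09:00", "resolved": "2024-05-14T09:40", "sev": 2},
]
action_items = [{"status": "Done"}, {"status": "Done"}, {"status": "In Progress"}]

def mttr_minutes(incidents):
    """Mean time to resolution, in minutes, across all incidents."""
    durations = [
        (datetime.fromisoformat(i["resolved"])
         - datetime.fromisoformat(i["detected"])).total_seconds() / 60
        for i in incidents
    ]
    return mean(durations)

def completion_rate(items):
    """Fraction of action items completed (target: >0.9)."""
    return sum(item["status"] == "Done" for item in items) / len(items)

print(f"MTTR: {mttr_minutes(incidents):.0f} min")
print(f"Completion rate: {completion_rate(action_items):.0%}")
```

Computed monthly or quarterly, these two numbers feed directly into the trend reporting described under "Quarterly (Leadership Review)".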

Learning Metrics

Knowledge Sharing:

  • Post-mortem views: How many people are reading post-mortems?
  • Post-mortem references: Are post-mortems referenced in design docs, code reviews, planning?
  • Preventative action items: Are teams creating action items from other teams' post-mortems?

Culture:

  • Psychological safety: Team surveys on whether engineers feel safe reporting mistakes
  • Participation: Are engineers actively participating in post-mortem meetings?
  • Transparency: Are post-mortems published and accessible?

Common Post-Mortem Anti-Patterns

Avoid these common pitfalls that reduce post-mortem effectiveness.

Blame and Shame

Anti-Pattern: Post-mortem identifies "responsible individual" and focuses on what they should have done differently.

Problem: Creates fear, reduces transparency, prevents learning.

Solution: Use system-focused language. Ask "what process gap allowed this?" not "who made the mistake?"

Superficial Root Cause

Anti-Pattern: Root cause listed as "human error" or "developer mistake" without deeper analysis.

Problem: "Human error" is never the root cause in complex systems. It's a symptom of system gaps.

Solution: Use "5 Whys" to dig deeper. Why did the human make that error? What system safeguards failed?

Action Item Theater

Anti-Pattern: Post-mortem generates long list of action items that are never completed or tracked.

Problem: Post-mortems become checkbox exercise with no actual improvement.

Solution: Limit action items to realistic number, assign owners, track in Jira, review weekly.

Delayed Post-Mortem

Anti-Pattern: Post-mortem written weeks or months after incident.

Problem: Details forgotten, context lost, momentum gone, lessons not learned.

Solution: Publish post-mortem within 72 hours. Schedule meeting within 48 hours. Strike while the iron is hot.

Skipping "What Went Well"

Anti-Pattern: Post-mortem only focuses on negatives and failures.

Problem: Demoralizes team, misses opportunity to reinforce good practices.

Solution: Always include "What Went Well" section. Celebrate successes like fast detection, good communication, effective rollback.

No Follow-Up

Anti-Pattern: Post-mortem published and forgotten. Action items never reviewed.

Problem: Zero improvement from post-mortem effort.

Solution: Track action items in Jira, review weekly, escalate overdue items, celebrate completions.

Echo Chamber

Anti-Pattern: Post-mortem only shared with immediate team, learnings not spread.

Problem: Other teams don't learn from incident, similar incidents repeat elsewhere.

Solution: Publish post-mortems organization-wide, share in all-hands, alert related teams.

Too Much Detail

Anti-Pattern: Post-mortem is 20+ pages of exhaustive technical detail.

Problem: No one reads it, key lessons buried, not maintainable.

Solution: Be concise. Executive summary for skimmers, detailed sections for those who want depth. Use appendix for deep technical details.

Not Learning from Success

Anti-Pattern: Only writing post-mortems for failures, not for near-misses or successes.

Problem: Miss opportunity to understand what went right and reinforce it.

Solution: Consider lightweight post-mortems for near-misses ("what almost went wrong?") and successful incident responses ("what made this recovery so smooth?").


Post-Mortem Culture Maturity Model

Teams evolve through stages of post-mortem maturity.

Level 1: Reactive

Characteristics:

  • Post-mortems written only for major outages
  • Blame-focused language common
  • Action items rarely completed
  • Limited sharing outside immediate team

Improvements Needed:

  • Establish blameless language guidelines
  • Create post-mortem template
  • Implement action item tracking

Level 2: Consistent

Characteristics:

  • Post-mortems written for most significant incidents
  • Generally blameless language
  • Action items tracked in Jira
  • Post-mortems published to team wiki

Improvements Needed:

  • Improve root cause analysis depth (5 Whys)
  • Increase action item completion rate
  • Share post-mortems across teams

Level 3: Proactive

Characteristics:

  • Post-mortems for near-misses and successes
  • Deep root cause analysis standard
  • 90% action item completion rate
  • Post-mortems shared organization-wide
  • Teams learn from other teams' post-mortems

Improvements Needed:

  • Automate post-mortem generation from incident data
  • Trend analysis across multiple post-mortems
  • External sharing (public post-mortems)

Level 4: Learning Organization

Characteristics:

  • Post-mortems embedded in culture
  • Psychological safety evident (engineers openly discuss failures)
  • Automated metrics and trend analysis
  • Proactive improvements from other teams' post-mortems
  • Regular "gameday" exercises based on post-mortem scenarios
  • Declining incident frequency and severity over time

Summary

Key Takeaways:

  1. Blameless Culture: Focus on systems and processes, not individuals. Use system-focused language consistently.
  2. Timely Execution: Conduct post-mortems within 48-72 hours while details are fresh.
  3. Thorough Analysis: Use "5 Whys" to identify root cause beyond "human error."
  4. Actionable Outcomes: Generate specific, assigned, time-bound action items and track them rigorously.
  5. Celebrate Successes: Include "What Went Well" to reinforce good practices.
  6. Share Broadly: Publish post-mortems organization-wide to spread learnings.
  7. Track Action Items: Review weekly, escalate overdue items, complete >90%.
  8. Learn from Others: Read post-mortems from other teams and external sources proactively.
  9. Measure Effectiveness: Track MTTR trends, action item completion, incident recurrence.
  10. Continuous Improvement: Post-mortems are learning opportunities that systematically improve reliability.

Psychological Safety

The quality of your post-mortems directly reflects the psychological safety of your team. If engineers fear blame, they'll hide mistakes and rush through post-mortems. If engineers trust they won't be punished for honest mistakes, they'll provide the transparency needed for deep learning and improvement.

Invest in blameless culture. The return is dramatic: better incident response, faster recovery, fewer recurring incidents, and higher team morale.