# Technical Writing Guide
Guidelines for writing clear, comprehensive, and maintainable technical documentation that enables teams to understand, use, and maintain software systems effectively.
## Overview
Technical documentation reduces onboarding time, prevents errors, facilitates collaboration, and preserves institutional knowledge. Poor documentation forces engineers to reverse-engineer code, slows development, and increases incident resolution time. Documentation quality directly correlates with team productivity because well-documented systems are easier to understand, modify, and debug.
This guide covers documentation types, writing principles, structural patterns, and maintenance practices. Each section provides concrete examples demonstrating the principles in action.
Treat documentation with the same rigor as production code: version control, peer review, automated testing (link checking, spell checking), and continuous updates.
## Core Principles
- **Match documentation type to purpose** - Different documentation types (API docs, runbooks, ADRs, READMEs) serve distinct needs and audiences
- **Write for scannability** - Readers rarely read linearly; use headings, lists, tables, and diagrams to enable quick navigation
- **Provide concrete examples** - Abstract concepts become actionable through real code examples and scenarios
- **Keep documentation current** - Outdated documentation is worse than no documentation because it misleads readers
- **Automate validation** - Link checking, spell checking, and code example testing catch errors before they reach readers
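As an example of the last principle, a link check is small enough to sketch directly. This is an illustrative stand-in for dedicated link-checking tools, assuming documentation lives in Markdown files under one directory:

```python
import re
from pathlib import Path

# Matches [text](target) and [text](target#anchor); captures the file target
LINK_RE = re.compile(r"\[[^\]]+\]\(([^)#]+)(?:#[^)]*)?\)")

def find_broken_links(docs_dir: str) -> list[tuple[str, str]]:
    """Return (file, target) pairs for relative links that point at missing files."""
    broken = []
    root = Path(docs_dir)
    for md_file in root.rglob("*.md"):
        for target in LINK_RE.findall(md_file.read_text(encoding="utf-8")):
            if target.startswith(("http://", "https://", "mailto:")):
                continue  # external links need an HTTP check, out of scope here
            if not (md_file.parent / target).exists():
                broken.append((str(md_file), target))
    return broken
```

Running a script like this in CI turns stale cross-references into build failures instead of reader dead ends.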
## Documentation Types
Different documentation types serve different purposes and audiences. Understanding when to use each type ensures documentation provides maximum value.
### API Documentation
API documentation describes how to use an API, including endpoints, request/response formats, authentication, and error handling. The primary audience is developers consuming the API (internal teams, external partners, or public developers).
**Essential Elements**:
- Endpoint descriptions with HTTP methods and paths
- Request and response schemas with examples
- Authentication and authorization requirements
- Error codes and handling strategies
- Rate limiting and usage quotas
- Versioning information
Use OpenAPI/Swagger specifications to generate interactive API documentation. OpenAPI provides a machine-readable contract that generates documentation, client SDKs, and server stubs automatically. This approach ensures documentation stays synchronized with implementation because the specification is validated against the actual API through contract testing (see Contract Testing).
```yaml
# OpenAPI specification serves as both contract and documentation source
openapi: 3.0.0
info:
  title: Payment API
  version: 1.0.0
  description: API for processing payment transactions
paths:
  /api/v1/payments:
    post:
      summary: Create a new payment
      description: Initiates a payment transaction with the specified amount and currency
      operationId: createPayment
      requestBody:
        required: true
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/PaymentRequest'
            example:
              customerId: "cust_123abc"
              amount: 100.00
              currency: "USD"
              description: "Invoice payment #12345"
      responses:
        '201':
          description: Payment created successfully
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/PaymentResponse'
        '400':
          description: Invalid request - missing or malformed fields
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/Error'
        '401':
          description: Unauthorized - invalid or missing authentication token
```
The specification becomes the source of truth. Tools like Swagger UI, Redoc, or Stoplight generate interactive documentation where developers can test API calls directly in the browser, eliminating the gap between documentation and reality.
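Contract checks need not wait for full tooling. A deliberately tiny sketch that validates a captured response against a schema fragment from the spec — it covers only `required` and `type`, a small subset of what real JSON Schema validators handle:

```python
def check_against_schema(payload: dict, schema: dict) -> list[str]:
    """Check required fields and primitive types against a JSON-Schema-like dict."""
    errors = []
    types = {"string": str, "number": (int, float), "integer": int,
             "boolean": bool, "object": dict, "array": list}
    for field in schema.get("required", []):
        if field not in payload:
            errors.append(f"missing required field: {field}")
    for field, spec in schema.get("properties", {}).items():
        if field in payload and not isinstance(payload[field], types[spec["type"]]):
            errors.append(f"wrong type for {field}: expected {spec['type']}")
    return errors
```

Even this small subset catches the most common drift: a field renamed in code but not in the documented schema.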
**Common Documentation Gaps**:
- Outdated examples that don't match current API behavior
- Missing error scenarios and edge cases (document what happens when requests fail, not just success paths)
- Insufficient explanation of authentication flows (show the complete token acquisition and refresh process)
- No examples of real-world use cases (isolated endpoint docs don't show how to accomplish user goals)
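Some of these gaps can be caught mechanically. A sketch that flags operations documenting only success responses, given an OpenAPI spec already parsed into a dict (a simplified heuristic, not full spec validation):

```python
def endpoints_missing_error_docs(spec: dict) -> list[str]:
    """Flag operations in an OpenAPI-style spec dict that document only 2xx responses."""
    flagged = []
    for path, ops in spec.get("paths", {}).items():
        for method, op in ops.items():
            codes = op.get("responses", {})
            # If every documented status code starts with "2", no error path is documented
            if not any(not str(code).startswith("2") for code in codes):
                flagged.append(f"{method.upper()} {path}")
    return flagged
```

Wired into CI, a check like this keeps "document the failure paths" from depending on reviewer vigilance alone.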
### User Guides
User guides teach users (developers or end users) how to accomplish specific tasks or workflows. Structure guides as task-oriented tutorials rather than feature catalogs. Users come to documentation to solve problems, not to read about features.
Start each guide with a clear goal statement: "By the end of this guide, you will be able to..." This sets expectations and helps readers determine if the guide meets their needs.
```markdown
# Setting Up Local Development Environment

**Time required**: ~20 minutes

**Prerequisites**:
- Java 25 or higher installed (`java -version` to check)
- Docker Desktop running (needed for PostgreSQL and Redis)
- Git configured with SSH keys (test with `ssh -T [email protected]`)

**Goal**: Set up a fully functional local development environment to run and test the payment service.

## Step 1: Clone the Repository

```bash
git clone [email protected]:payments/payment-service.git
cd payment-service
```

## Step 2: Configure Environment Variables

Create a `.env` file in the project root. This file contains local configuration that differs from production settings.

```bash
# Database configuration - connects to local Docker PostgreSQL
DB_HOST=localhost
DB_PORT=5432
DB_NAME=payments_dev
DB_USER=dev_user
DB_PASSWORD=dev_password

# External API keys (request from team lead)
PAYMENT_GATEWAY_API_KEY=<your-api-key>
```

**Security Note**: Never commit the `.env` file to version control. It's already in `.gitignore`. Production secrets are managed through Kubernetes secrets.

## Step 3: Start Dependencies

Start PostgreSQL and Redis using Docker Compose. The `docker-compose.yml` file defines these services with appropriate configurations for local development.

```bash
docker-compose up -d postgres redis
```

Wait for services to be healthy. Check status with:

```bash
docker-compose ps
```

Both services should show status "Up" with health "healthy".

## Step 4: Run Database Migrations

Apply database migrations to create the schema. The application uses Flyway to manage schema versions (see Database Migrations for details).

```bash
./gradlew flywayMigrate
```

You should see output confirming all migrations applied successfully:

```
Successfully applied 5 migrations
```

## Step 5: Start the Application

```bash
./gradlew bootRun
```

The application starts on http://localhost:8080. Watch for log output indicating successful startup:

```
Started PaymentServiceApplication in 8.234 seconds
```

## Verify Installation

Test that the application is working by calling the health endpoint. This endpoint checks database connectivity and other dependencies.

```bash
curl http://localhost:8080/actuator/health
```

Expected response:

```json
{
  "status": "UP",
  "components": {
    "db": {"status": "UP"},
    "redis": {"status": "UP"}
  }
}
```

## Troubleshooting

**Problem**: Port 8080 already in use
**Cause**: Another application is using port 8080
**Solution**: Either stop that application or change the port in `application.yml` under `server.port: 8081`

**Problem**: Could not connect to database
**Cause**: PostgreSQL container not running or not healthy
**Solution**:
- Verify PostgreSQL is running: `docker-compose ps postgres`
- Check logs for errors: `docker-compose logs postgres`
- If the container crashed, restart it: `docker-compose restart postgres`

## Next Steps

- Run the test suite to verify everything works (see Testing Guide)
- Make your first code change (see Development Workflow)
- Submit a pull request (see Pull Request Process)
```
The guide provides a linear sequence of actions with clear verification steps at each stage. Troubleshooting addresses common failure modes encountered during setup, reducing frustration and support burden.
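Verification steps like the health check above are also automatable. A hedged sketch of a script-friendly check that parses an actuator-style health response (field names follow the example response; adapt to your service):

```python
import json

def health_ok(response_body: str) -> tuple[bool, list[str]]:
    """Parse an actuator-style health payload; report any components not UP."""
    payload = json.loads(response_body)
    failing = [name for name, comp in payload.get("components", {}).items()
               if comp.get("status") != "UP"]
    return payload.get("status") == "UP" and not failing, failing
```

A setup guide that ships such a script lets newcomers verify their environment with one command instead of eyeballing JSON.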
---
### Runbooks
Runbooks provide step-by-step operational procedures for managing production systems, handling incidents, and performing routine maintenance. The audience is on-call engineers, DevOps teams, and SREs who need to respond to production issues.
Runbooks must be action-oriented and assume the reader is under stress (incident response at 2 AM). Use checklists, decision trees, and imperative language ("Check X", "If Y, then do Z"). Every alert should have a corresponding runbook entry so on-call engineers know exactly what to do when paged.
```markdown
# Runbook: High Payment Processing Latency
**Alert**: `payment_processing_p95_latency > 2000ms`
**Severity**: High (impacts customer experience)
**On-Call**: @payments-team-oncall
---
## Symptoms
- Payment processing taking longer than 2 seconds (P95 latency)
- Customers reporting slow checkout experience
- Dashboard shows elevated latency metrics in Grafana
## Investigation Steps
### 1. Check System Health
```bash
# Check overall service health
curl https://payments.company.com/actuator/health
# Check database connection pool utilization
curl https://payments.company.com/actuator/metrics/hikari.connections.active
```

**What to look for**:
- Health endpoint returns `DOWN` status -> Service degradation, proceed to step 2
- Active connections near max pool size (20) -> Database connection exhaustion, proceed to step 4
- Health endpoint timeout -> Service completely down, escalate immediately
### 2. Check Recent Deployments

Recent deployments are the most common cause of sudden performance degradation.

```bash
# Check recent deployments
kubectl rollout history deployment/payment-service -n production
```

**What to look for**:
- Deployment within last 15 minutes -> Recent change likely introduced the issue
- If a recent deployment exists, consider rollback (see Rollback section below)
### 3. Check External Dependencies

Payment gateway latency often causes downstream processing delays.

```bash
# Check payment gateway latency metrics
curl https://payments.company.com/actuator/metrics/http.client.requests \
  | grep payment_gateway
```

**What to look for**:
- Payment gateway latency elevated (>1000ms) -> External dependency issue, proceed to Scenario A
- Payment gateway error rate >1% -> Upstream service degradation, proceed to Scenario A
### 4. Check Database Performance

Database performance issues manifest as slow query execution or lock contention.

```bash
# Connect to read-replica for diagnostics (never use primary for diagnostics)
kubectl exec -it postgres-read-replica -n production -- psql -U payments
```

```sql
-- Check for long-running queries
SELECT pid, now() - query_start AS duration, state, query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY duration DESC
LIMIT 10;

-- Check for lock contention
SELECT * FROM pg_locks WHERE NOT granted;
```

**What to look for**:
- Queries running >5 seconds -> Potentially missing indexes or an inefficient query, proceed to Scenario B
- Lock contention present (ungranted locks exist) -> Multiple transactions competing for the same resources, proceed to Scenario B
## Resolution Steps

### Scenario A: External Dependency Degradation

If payment gateway latency is elevated:

1. **Enable circuit breaker** to prevent cascading failures:
   ```bash
   kubectl set env deployment/payment-service \
     CIRCUIT_BREAKER_ENABLED=true -n production
   ```
   The circuit breaker prevents request pile-up when the gateway is slow, returning fast failures instead of timing out (see Resilience Patterns).
2. **Check vendor status page**: https://status.payment-gateway.com
3. **Notify stakeholders** via #payments-incidents Slack channel:
   ```
   INCIDENT: Payment gateway experiencing high latency.
   Circuit breaker enabled. Monitoring vendor status.
   ```
4. **Monitor recovery**: Watch for latency to return to normal (under 500ms P95)
### Scenario B: Database Performance Issue

If database queries are slow:

1. **Identify slow query** from the pg_stat_activity output (copy the query text)
2. **Analyze query execution plan**:
   ```sql
   EXPLAIN ANALYZE <slow-query>;
   ```
   Look for "Seq Scan" on large tables (indicates a missing index) or high row counts in intermediate steps.
3. **If missing index identified**:
   - Create the index on the read-replica first to test impact (safe to test here)
   - If performance improves, schedule index creation on the primary during a maintenance window
   - DO NOT create indexes on the primary during an incident (creates locks that worsen the situation)
4. **Temporary mitigation** - Scale read replicas if traffic is read-heavy:
   ```bash
   kubectl scale deployment/payment-service-read-replica \
     --replicas=5 -n production
   ```
   Additional replicas distribute read load, reducing latency. This doesn't solve the root cause but provides breathing room for a proper fix.
### Scenario C: Recent Deployment Regression

If a deployment occurred within the last 30 minutes and no other cause is identified:

1. **Rollback immediately**:
   ```bash
   kubectl rollout undo deployment/payment-service -n production
   ```
2. **Verify rollback completed**:
   ```bash
   kubectl rollout status deployment/payment-service -n production
   ```
   Wait for the "successfully rolled out" message.
3. **Monitor latency** - should return to baseline within 2-3 minutes after the rollback completes
4. **Notify team** to investigate the problematic deployment before retrying
## Escalation

If the issue is not resolved within 15 minutes:
- Page Tech Lead: @tech-lead-oncall
- Post in #payments-incidents with investigation findings so far
- Create bridge call: https://company.zoom.us/j/emergency-bridge

If the service is completely down (health endpoint not responding):
- IMMEDIATELY page Tech Lead and Engineering Manager
- Update status page: https://status.company.com (customers need visibility)
- Enable maintenance mode to prevent partial failures

## Post-Incident

After resolution:
- Document the timeline in the incident ticket (what happened when, what actions were taken)
- Schedule a post-mortem within 48 hours (see Incident Post-Mortems)
- Update this runbook if new information was discovered (make it better for next time)
- Create follow-up tasks for preventative measures (address the root cause)

## Related Runbooks

- Payment Service Deployment Issues
- Database Connection Pool Exhaustion
- Circuit Breaker Activation
```
This runbook structure provides clear decision trees for investigation with actionable commands and expected outputs. Explicit escalation criteria prevent on-call engineers from struggling too long with unfamiliar issues.
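The rule that every alert has a runbook can itself be enforced in CI. A minimal sketch, assuming each runbook declares its alert with an `**Alert**:` line as in the example above (the line format is an assumption of this sketch):

```python
import re

# Matches lines like: **Alert**: `payment_processing_p95_latency > 2000ms`
ALERT_RE = re.compile(r"\*\*Alert\*\*:\s*`([^`]+)`")

def runbook_coverage(alert_names: list[str], runbook_texts: list[str]) -> list[str]:
    """Return alert names with no runbook declaring them via an **Alert**: line."""
    covered = set()
    for text in runbook_texts:
        covered.update(ALERT_RE.findall(text))
    return sorted(set(alert_names) - covered)
```

Failing the build on uncovered alerts means an on-call engineer is never paged for something with no documented response.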
---
### Architecture Decision Records (ADRs)
Architecture Decision Records document significant architectural decisions, the context that led to them, and their consequences. The audience includes engineers, architects, and future team members who need to understand "why" decisions were made.
ADRs create an architectural paper trail that prevents repeated debates and helps new team members understand the system's evolution. Without ADRs, teams repeatedly revisit the same discussions as team membership changes or memory fades. Write ADRs when making decisions that are expensive to reverse, such as choosing databases, architectural patterns, or third-party dependencies.
The [Technical Design Process](./technical-design.md) guide contains comprehensive ADR guidance including templates, workflow, and detailed examples. This section provides a brief overview.
**When to Write an ADR**:
- Selecting database technology (PostgreSQL vs MongoDB vs DynamoDB) - these choices lock in data models and operational patterns
- Adopting architectural patterns (microservices vs monolith, event sourcing) - these shape how teams work
- Choosing major frameworks or libraries (React vs Angular, Spring Boot vs Micronaut) - these determine skill requirements
- Changing deployment strategy (blue-green vs canary vs rolling) - these affect release processes
- Deciding authentication mechanisms (OAuth 2.0, SAML, API keys) - these impact security and user experience
**ADR Template Structure**:
```markdown
# ADR-015: Use PostgreSQL for Transactional Data
**Status**: Accepted
**Date**: 2025-11-15
**Deciders**: @engineering-team, @platform-team
## Context
We need to select a database for storing payment transaction data. The database must support:
- ACID transactions (payment processing cannot tolerate data loss or inconsistency)
- Complex queries (reporting requires joins across multiple tables)
- Strong consistency (balance calculations must be correct immediately)
- Proven reliability at scale (handling millions of transactions)
Current system uses MongoDB, which has caused production issues:
- Eventual consistency led to incorrect balance displays
- Lack of transactions caused orphaned payment records
- Complex aggregations perform poorly
## Decision
We will use PostgreSQL for all transactional data (payments, accounts, transactions).
**Rationale**:
- ACID guarantees ensure data consistency critical for financial data
- Rich query capabilities support complex reporting without data duplication
- Mature ecosystem with extensive tooling and monitoring
- Team has PostgreSQL expertise from previous projects
## Alternatives Considered
### Alternative 1: Continue with MongoDB
**Pros**:
- No migration required
- Flexible schema useful for rapidly changing requirements
**Cons**:
- Eventual consistency unsuitable for financial data
- Limited transaction support (multi-document transactions added recently, not mature)
- Complex queries require application-level joins or data duplication
**Rejected because**: Consistency requirements for financial data outweigh schema flexibility benefits.
### Alternative 2: DynamoDB
**Pros**:
- Infinite horizontal scaling
- Managed service reduces operational burden
- High availability built-in
**Cons**:
- Limited query capabilities (no joins, must design for access patterns up front)
- Vendor lock-in to AWS
- Higher cost at our scale ($2000/month vs $200/month for PostgreSQL)
- Team lacks DynamoDB expertise
**Rejected because**: Query flexibility needed for evolving reporting requirements. Cost and learning curve not justified by scale requirements.
## Consequences
**Positive**:
- Strong consistency eliminates balance calculation bugs
- ACID transactions simplify application code (no manual compensation logic)
- Rich querying enables ad-hoc analysis and reporting
- Reduced production incidents related to data consistency
**Negative**:
- Vertical scaling has limits (will need sharding if we exceed ~100K transactions/second)
- Migration effort required (estimated 2 weeks)
- Potential performance degradation during migration (mitigated by gradual rollout)
**Mitigation**:
- Plan migration carefully with rollback procedures
- Monitor database performance metrics closely during and after migration
- Design schema for future sharding if needed (partition by account ID)
## Related Decisions
- ADR-012: Event Sourcing for Audit Log
- ADR-018: Database Migration Strategy
```
ADRs capture not just what was decided, but why alternatives were rejected. This prevents future team members from proposing the same alternatives without understanding the trade-offs already considered.
### README Files
README files provide project overview, quick start instructions, and pointers to detailed documentation. The audience is new team members, contributors, or anyone discovering the project.
The README is the front door to your project. It should answer three questions immediately: What is this? Why does it exist? How do I get started? Keep it concise - detailed information belongs in separate documentation files.
```markdown
# Payment Service

> Microservice for processing payment transactions with support for multiple payment gateways, fraud detection, and compliance reporting.

## Overview

The Payment Service handles all payment processing for the platform, including:
- Payment creation, authorization, and capture
- Refunds and chargebacks
- Multi-currency support (USD, EUR, GBP)
- Integration with payment gateways (Stripe, Adyen)
- Fraud detection via risk scoring
- PCI-DSS compliant payment data handling

**Architecture**: Spring Boot microservice with PostgreSQL database and Kafka event streaming.

**Status**: Production (99.95% uptime SLA)

## Quick Start

### Prerequisites

- Java 25+ (`java -version` to verify)
- Docker Desktop (for PostgreSQL and Redis)
- Git with SSH keys configured (`ssh -T [email protected]` to test)

### Run Locally

```bash
# Clone repository
git clone [email protected]:payments/payment-service.git
cd payment-service

# Start dependencies
docker-compose up -d

# Run application
./gradlew bootRun
```

The service will be available at http://localhost:8080.

Verify: `curl http://localhost:8080/actuator/health`

See the Local Development Guide for detailed setup including environment configuration and troubleshooting.

## Documentation

- API Documentation - Interactive API reference with try-it-out functionality
- Architecture Guide - System design, components, and data flow
- Development Guide - Local setup, testing, debugging
- Deployment Guide - CI/CD pipeline and production deployment
- Runbooks - Operational procedures and troubleshooting

## Key Technologies

- **Backend**: Java 25, Spring Boot 3.5, Spring Data JPA
- **Database**: PostgreSQL 15 with Flyway migrations
- **Messaging**: Kafka 3.5 for event streaming
- **Monitoring**: Prometheus metrics, Grafana dashboards
- **Testing**: JUnit 5, TestContainers, Contract Tests (Pact)

## Project Structure

```
payment-service/
|-- src/main/java/          # Application source code
|   `-- com/company/payments/
|       |-- api/            # REST controllers
|       |-- domain/         # Business logic and models
|       |-- infrastructure/ # Data access, external integrations
|       `-- config/         # Spring configuration
|-- src/test/               # Tests
|-- docs/                   # Documentation
|-- docker/                 # Docker configurations
`-- k8s/                    # Kubernetes manifests
```

## Contributing

See CONTRIBUTING.md for:
- Code style and conventions (enforced via Checkstyle and SpotBugs)
- Pull request process (see Pull Request Guidelines)
- Testing requirements (minimum 80% coverage)
- Code review guidelines

## Team & Support

- **Team**: Payments Squad (@payments-team)
- **Slack**: #payments-team
- **On-Call**: @payments-oncall
- **Issue Tracker**: GitLab Issues

## License

Proprietary - Internal use only
```
This README provides immediate context and clear next steps without overwhelming the reader. Links direct users to detailed documentation organized by concern (architecture, development, deployment, operations).
---
### RFCs (Request for Comments)
RFCs propose significant changes or new features, gather feedback from stakeholders, and build consensus before implementation. The audience includes engineering team, architects, product managers, and stakeholders.
RFCs differ from ADRs in timing and purpose. RFCs are collaborative documents created before decisions are finalized to gather input and refine proposals. ADRs document decisions after they're made to explain the final choice. Use RFCs for changes that affect multiple teams or require broad input.
```markdown
# RFC-042: Implement Idempotency Keys for Payment API
**Status**: In Review
**Author**: @jane-developer
**Created**: 2025-11-05
**Updated**: 2025-11-08
**Reviewers**: @payments-team, @platform-team, @api-guild
**Target Decision Date**: 2025-11-15
---
## Problem Statement
Customers occasionally experience duplicate charges when network issues cause payment requests to be retried. Our API does not currently support idempotency, meaning identical requests are processed as separate transactions.
**Impact**:
- ~50 duplicate payment incidents per month
- Customer support burden: 2-3 hours per week handling duplicates
- Customer trust impact from billing errors
- Manual refund processing required
**Example Scenario**:
1. Customer submits payment for $100
2. Request reaches server, payment is created
3. Network timeout before response reaches client
4. Client retries request (reasonable behavior given timeout)
5. Server creates second payment for $100
6. Customer charged $200 instead of $100
This happens because the server cannot distinguish between a retry of a request that succeeded (but client didn't receive response) and a genuinely new request.
## Proposed Solution
Implement idempotency key support following Stripe's idempotency pattern:
1. **Client sends idempotency key** in header:
```http
POST /api/v1/payments
Idempotency-Key: unique-request-id-12345
Content-Type: application/json
{"amount": 100.00, "currency": "USD"}
```

2. **Server checks for existing request** with the same idempotency key:
   - If found within 24-hour window: return the cached response (200 OK)
   - If not found: process the request normally and cache the response

3. **Store idempotency data**:
   - Idempotency key (indexed for fast lookup)
   - Request fingerprint (method + path + body hash to detect conflicting requests)
   - Response (status code, body)
   - Timestamp (for cleanup after 24 hours)
### Database Schema

```sql
CREATE TABLE idempotency_keys (
    idempotency_key VARCHAR(255) PRIMARY KEY,
    request_hash VARCHAR(64) NOT NULL, -- SHA-256 of method+path+body
    response_status_code INT NOT NULL,
    response_body JSONB NOT NULL,
    created_at TIMESTAMP NOT NULL DEFAULT NOW(),
    expires_at TIMESTAMP NOT NULL DEFAULT NOW() + INTERVAL '24 hours'
);

CREATE INDEX idx_idempotency_expires ON idempotency_keys(expires_at);
```

The request_hash detects cases where a client reuses an idempotency key with different request data (which should be rejected as an error).

### Sequence Diagram

When the client retries with the same idempotency key, the server returns the cached response from the first request, preventing duplicate processing.
## Alternatives Considered

### Alternative 1: Client-Generated Request IDs Only

Store payment IDs from clients and reject duplicates with 409 Conflict.

**Pros**:
- Simpler implementation (no response caching)
- Lighter database storage

**Cons**:
- Doesn't handle legitimate retries (client never got a response and doesn't know if the payment succeeded)
- Client must handle 409 and query payment status separately (complex client logic)
- Doesn't solve the core problem of unclear request success

**Rejected because**: Doesn't solve the fundamental problem of clients not knowing whether their request succeeded.

### Alternative 2: Distributed Locking (Redis)

Use Redis distributed locks to prevent concurrent processing of the same key.

**Pros**:
- Prevents race conditions during concurrent retries
- Fast lock acquisition (under 10ms)

**Cons**:
- Adds Redis dependency to critical path
- Lock timeout management complexity (what if the lock holder crashes?)
- Still need database storage for response caching
- Over-engineered for the problem

**Rejected because**: Adds complexity without significant benefit. The database approach is simpler and sufficient.
## Trade-offs and Risks

### Trade-offs
| Aspect | Decision | Rationale |
|---|---|---|
| Storage duration | 24 hours | Balances safety (most retries happen within minutes) with storage cost |
| Storage location | PostgreSQL | Already available, ACID guarantees, simple implementation |
| Key validation | Client responsibility | Server validates format but doesn't generate keys (client controls retry semantics) |
### Risks
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Database storage growth | Medium | Low | Automated cleanup job runs every 6 hours, indexed expiry column for efficient deletion |
| Hash collision | Very Low | High | Use SHA-256 (cryptographically secure), include timestamp in hash input |
| Performance impact | Low | Medium | Database index on key provides fast lookups, add Redis cache layer in Phase 2 if needed |
| Client adoption | Medium | Medium | Make header optional for backward compatibility, provide client libraries with automatic key generation |
## Impact Analysis

**Technical Impact**:

**Breaking Change**: No (backward compatible)
- Idempotency-Key header is optional
- Requests without a key behave as before
- Existing clients unaffected

**Performance**:
- Additional database lookup per request: ~5-10ms
- Negligible impact on overall P95 latency (currently ~150ms)
- Mitigation: Add Redis cache layer in Phase 2 if monitoring shows latency impact

**Infrastructure**:
- Estimated storage: ~500 KB per 1,000 requests
- At 1M requests/day: ~500 MB/day, so roughly 500 MB retained at any time (24-hour window)
- Cleanup job runs every 6 hours, removing expired keys

**Operational Impact**:

**Monitoring**:
- Add metric: `idempotency_key_hit_rate` (% of requests with existing keys)
- Alert if hit rate >10% (indicates client retry issues or bugs)

**Logging**:
- Log when an idempotent response is returned (INFO level)
- Include the original request timestamp to measure retry latency

**Business Impact**:

**Benefits**:
- Eliminate ~50 duplicate payment incidents/month
- Reduce customer support burden by 2-3 hours/week
- Improve customer trust and satisfaction
- Align with industry standard pattern (Stripe, PayPal use this approach)

**Costs**:
- Development: 2 weeks (1 engineer)
- Storage: ~$5/month (PostgreSQL storage)
- Maintenance: Minimal (automated cleanup)
## Open Questions

1. **Should we validate that the request body matches for the same key?**
   - If a client sends the same key with a different body, should we reject (409 Conflict) or allow?
   - Recommendation: Reject with 409 Conflict and a clear error message explaining the issue. This prevents accidental misuse.

2. **What's the maximum key length?**
   - Stripe uses 255 characters max
   - Recommendation: 255 characters (a UUID is typically 36, but client-generated keys may be longer)

3. **Should we support key expiry extension?**
   - If a client retries after 24 hours, should we extend expiry?
   - Recommendation: No. Retries after 24 hours indicate a different issue. A fixed 24-hour window keeps cleanup logic simple.
## Success Criteria

**Technical**:
- Idempotency implementation tested with >90% code coverage
- Performance impact under 10ms P95
- Zero duplicate payments in staging over a 1-week testing period
- Load testing confirms no degradation under 2x normal load

**Business**:
- Duplicate payment incidents reduced by >80% within 3 months
- Customer support hours for duplicates reduced by >75%
- Adoption by at least 2 client applications within 6 months
## Timeline
| Milestone | Target Date | Owner |
|---|---|---|
| RFC Review & Approval | 2025-11-15 | @payments-team |
| Design Review | 2025-11-18 | @jane-developer |
| Implementation | 2025-11-29 | @jane-developer |
| QA & Testing | 2025-12-06 | @qa-team |
| Production Deploy | 2025-12-13 | @devops-team |
| Monitor & Iterate | 2025-12-20 | @payments-team |
## Feedback

**Comments from @tech-lead (2025-11-06)**:
> Looks solid. Consider a Redis cache layer if database lookups become a bottleneck. Also, what about distributed system clock skew affecting timestamps?

**Response**: Will add Redis cache in Phase 2 if monitoring shows >50ms DB lookup latency. Good point on clock skew - using database NOW() for all timestamps to avoid server clock issues.

**Comments from @security-engineer (2025-11-07)**:
> What prevents malicious clients from filling the database with keys?

**Response**: Excellent point. Will add rate limiting (100 unique idempotency keys per client per hour) and automatic cleanup. Also added monitoring for unusual key creation patterns.
## Decision

**Status**: Approved (2025-11-15)

**Outcome**: Proceed with implementation as proposed. Add Redis caching layer in Phase 2 if performance monitoring indicates need.

**Action Items**:
- @jane-developer: Create implementation ticket with detailed subtasks
- @api-guild: Draft company-wide idempotency standard RFC
- @docs-team: Update API documentation with idempotency examples
- @qa-team: Create test plan including race condition scenarios
```
This RFC structure facilitates discussion by clearly presenting the problem, proposed solution, alternatives, and open questions. The "Feedback" section captures the collaborative refinement process, creating a record of how the decision evolved based on team input.
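The idempotency mechanism the RFC proposes can be sketched in a few lines. This is an illustrative sketch, not the production design: `handle` stands in for the API layer, the `store` dict for the idempotency_keys table, and `process` for the payment processor.

```python
import hashlib

def request_fingerprint(method: str, path: str, body: bytes) -> str:
    """SHA-256 over method+path+body, mirroring the RFC's request_hash column."""
    h = hashlib.sha256()
    h.update(method.upper().encode())
    h.update(path.encode())
    h.update(body)
    return h.hexdigest()

def handle(store: dict, key: str, method: str, path: str, body: bytes, process):
    """Idempotent request handling: replay cached responses, reject key reuse."""
    fp = request_fingerprint(method, path, body)
    if key in store:
        cached = store[key]
        if cached["request_hash"] != fp:
            # Same key, different request data -> conflict, per the RFC's open question 1
            return 409, {"error": "idempotency key reused with different request"}
        return cached["status"], cached["body"]  # replay original response
    status, resp = process(body)
    store[key] = {"request_hash": fp, "status": status, "body": resp}
    return status, resp
```

A real implementation would also need the 24-hour expiry and concurrency handling the RFC discusses, but the core replay-or-reject logic is this small.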
---
## Writing Principles
Effective technical writing follows core principles that ensure clarity, accuracy, and usability.
### Active Voice
Active voice makes writing clearer and more direct by explicitly stating who performs each action. This reduces cognitive load because readers immediately know the subject and action.
**Passive (unclear actor)**:
> "The payment is validated by the fraud detection service."
> "Errors should be handled gracefully."
> "The database schema is migrated automatically during deployment."
**Active (clear actor)**:
> "The fraud detection service validates the payment."
> "Your service should handle errors gracefully."
> "Flyway automatically migrates the database schema during deployment."
In runbooks and procedures, passive voice obscures responsibility, which delays incident response. "The database should be restarted" leaves engineers wondering who should restart it. "Restart the database using `kubectl rollout restart`" provides clear action.
### Brevity
Concise writing respects the reader's time. Every sentence should add value. Remove filler words, redundant phrases, and unnecessary qualifiers.
**Verbose**:
> "It is important to note that you should always remember to validate user input in order to prevent potential security vulnerabilities that could possibly be exploited by malicious actors."
**Concise**:
> "Validate user input to prevent security vulnerabilities."
The concise version communicates the same requirement in 7 words instead of 29. Brevity doesn't mean removing necessary detail - it means removing unnecessary words. Explain complex topics thoroughly, but eliminate words that don't contribute meaning.
**Wordy Phrases to Eliminate**:
- "It is important to note that" -> (delete, or use only if truly critical)
- "In order to" -> "To"
- "Due to the fact that" -> "Because"
- "At this point in time" -> "Now" or "Currently"
- "For the purpose of" -> "To"
### Clarity
Clarity means the reader understands your meaning on first reading. It requires precise word choice, logical structure, and appropriate detail level.
**Unclear**:
> "The service handles requests asynchronously when appropriate using various strategies depending on load characteristics and SLA requirements."
**Clear**:
> "The service processes requests asynchronously when response time exceeds 500ms. It uses a thread pool for CPU-intensive tasks and event loop for I/O operations."
The clear version specifies *when* async processing occurs and *how* it's implemented. Vague terms like "when appropriate" and "various strategies" force readers to guess implementation details.
**Techniques for Clarity**:
- Use specific numbers instead of vague quantifiers ("500ms" not "fast", "1000 requests/second" not "high throughput")
- Define acronyms on first use: "SLA (Service Level Agreement)"
- Provide examples to illustrate abstract concepts
- Use consistent terminology (don't alternate between "request", "call", and "invocation" for the same concept)
### Examples
Examples transform abstract concepts into concrete understanding. Every significant concept should have at least one example showing it in practice.
**Abstract (hard to apply)**:
> "Use dependency injection to improve testability."
**Concrete (actionable)**:
```java
// Without dependency injection - hard to test
public class PaymentService {
    // Hard-coded dependency - cannot be mocked in tests
    private PaymentGateway gateway = new StripeGateway();

    public void processPayment(Payment payment) {
        gateway.charge(payment); // Cannot mock gateway for testing
    }
}

// With dependency injection - easy to test
public class PaymentService {
    private final PaymentGateway gateway;

    // Dependency injected via constructor - can inject mock in tests
    public PaymentService(PaymentGateway gateway) {
        this.gateway = gateway;
    }

    public void processPayment(Payment payment) {
        gateway.charge(payment); // Can inject mock gateway in tests
    }
}

// Test demonstrates the benefit
@Test
void processPayment_callsGateway() {
    PaymentGateway mockGateway = mock(PaymentGateway.class);
    PaymentService service = new PaymentService(mockGateway); // Inject mock

    service.processPayment(new Payment(100.00));

    verify(mockGateway).charge(any()); // Verify interaction
}
```

The example demonstrates the principle in action, showing both the problem (hard-coded dependency) and the solution (constructor injection) with concrete code. The test shows why this matters, making the abstract benefit ("improve testability") concrete.
## Structuring Documentation

Good structure makes documentation scannable and navigable. Readers rarely read documentation linearly - they scan headings to locate relevant sections.

### Headings and Hierarchy

Use heading levels to create logical document structure. Well-structured headings enable readers to quickly navigate to relevant information.

**Good hierarchy**:

```markdown
# Payment API
## Authentication
### API Keys
### OAuth 2.0
## Endpoints
### Create Payment
### Get Payment Status
### Refund Payment
## Error Handling
### Error Codes
### Retry Logic
```

**Poor hierarchy (flat structure with no navigation aid)**:

```markdown
# Payment API
This document covers authentication, endpoints, and error handling...
(long undifferentiated block of text with no headings)
```

**Rules**:
- Use one H1 (`#`) per document (document title)
- Don't skip heading levels (H1 -> H2 -> H3, not H1 -> H3)
- Keep headings concise (3-7 words)
- Use parallel structure ("Creating Users", "Updating Users", "Deleting Users" - all gerunds)
### Code Block Formatting

Always specify the language for syntax highlighting. Include comments explaining non-obvious code. Keep examples focused on the concept being demonstrated.

**Good (syntax highlighting, comments, focused)**:

```java
// Use constructor injection for dependencies - enables testing with mocks
public class PaymentService {
    private final PaymentRepository repository;

    // Spring automatically injects repository when creating this bean
    public PaymentService(PaymentRepository repository) {
        this.repository = repository;
    }
}
```

**Poor (no language, no comments, too much irrelevant code)**:

```
public class PaymentService {
    private PaymentRepository repository;
    private MetricsCollector metrics;
    private Logger logger;
    private ConfigService config;
    private CacheManager cache;
    private EmailService email;
    // ... 50 more lines of unrelated setup that obscures the point
}
```

**Supported Languages**: `java`, `typescript`, `kotlin`, `swift`, `bash`, `sql`, `yaml`, `json`, `xml`, `markdown`, `mermaid`
### Lists and Tables

Use lists for sequences, collections, or alternatives. Use tables for structured data with multiple attributes that need comparison.

**Lists (good for steps or options)**:
**Prerequisites**:
- Java 25 or higher installed
- Docker Desktop running
- Git configured with SSH keys
- 4GB available RAM minimum
**Tables (good for comparing options)**:
| Database | Pros | Cons | Use Case |
|----------|------|------|----------|
| PostgreSQL | ACID, rich queries, mature | Vertical scaling limits | Transactional data |
| MongoDB | Flexible schema, easy scaling | Eventual consistency | Rapidly changing schemas |
| DynamoDB | Infinite scale, managed | Vendor lock-in, limited queries | High-scale key-value |
Tables excel at comparing multiple items across several dimensions. Lists work better for simple enumerations or sequential steps.
## Using Diagrams Effectively

Diagrams communicate system structure and behavior more efficiently than prose for spatial and temporal relationships. Different diagram types serve different purposes.

### Architecture Diagrams

Architecture diagrams show system components and their relationships. Use the C4 model for consistent abstraction levels:

- **Context diagrams**: System and external actors (highest level)
- **Container diagrams**: Applications, databases, message queues
- **Component diagrams**: Major components within a container
- **Code diagrams**: Class relationships (rarely needed in documentation)
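A container-level view of the payment system discussed in this section can be sketched directly in Mermaid; the component names and subgraph grouping below mirror the surrounding description, and the layout is illustrative rather than prescriptive:

```mermaid
flowchart LR
    Client([Client]) -->|HTTPS/JSON| API
    subgraph PaymentSystem["Payment System"]
        API[API Service] -->|SQL| DB[(PostgreSQL)]
        API -->|publishes events| Kafka[[Kafka]]
        Kafka -->|consumes events| Worker[Async Worker]
    end
    Worker -->|HTTPS| Gateway[Payment Gateway]
```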
This container-level diagram shows the major applications (API, Worker), data stores (PostgreSQL, Kafka), and external systems (Payment Gateway). Grouping related components in subgraphs improves readability by showing system boundaries.
**Best Practices**:
- Use consistent shapes/colors (rectangles for services, cylinders for databases, parallelograms for queues)
- Show direction of data flow with arrows
- Label connections with protocols or data types
- Keep diagrams focused on one concept (don't try to show everything in one diagram)
### Sequence Diagrams
Sequence diagrams illustrate interactions over time, showing message flow between components. They're essential for understanding complex workflows with multiple steps.
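For instance, a payment-creation flow with success and failure branches might be sketched like this (participant names and status codes are illustrative):

```mermaid
sequenceDiagram
    participant Client
    participant API as Payment API
    participant DB as PostgreSQL
    participant Kafka

    Client->>API: POST /payments
    API->>DB: BEGIN + INSERT payment
    alt payment valid
        DB-->>API: COMMIT ok
        API--)Kafka: publish PaymentCreated (async)
        API-->>Client: 201 Created
    else validation fails
        DB-->>API: ROLLBACK
        API-->>Client: 422 Unprocessable Entity
    end
```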
This sequence diagram shows both success and failure paths, including the database transaction boundary and asynchronous event publishing. The alt block clearly differentiates the two flows, making it easy to understand error handling.
**Best Practices**:
- Show only relevant participants (omit infrastructure like load balancers unless directly relevant)
- Use `alt` blocks for conditional flows (success/failure, different execution paths)
- Use `opt` blocks for optional steps
- Add notes for non-obvious behavior: `Note over PaymentService: Retry up to 3 times with exponential backoff`
### State Machines
State machines represent object lifecycle and valid state transitions. They're critical for documenting workflows with complex business rules.
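A payment lifecycle of this kind can be expressed as a Mermaid state diagram; the state names and the 7-day expiry match the example discussed in this section, while the exact transitions are illustrative:

```mermaid
stateDiagram-v2
    [*] --> Pending
    Pending --> Authorized: authorize()
    Pending --> Failed: authorization declined
    Authorized --> Captured: capture()
    Authorized --> Expired: no capture within 7 days
    Captured --> Refunded: refund()
    Captured --> [*]
    Refunded --> [*]
    Failed --> [*]
    Expired --> [*]
    note right of Authorized
        Authorization expires after 7 days
    end note
```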
State machines clarify which transitions are valid (e.g., you can't refund a Failed payment - there's no arrow) and document business constraints (authorization expiry after 7 days). They prevent implementation bugs where invalid state transitions are accidentally allowed.
**Best Practices**:
- Show all valid states and transitions
- Document invalid transitions by omission (if there's no arrow connecting two states, that transition is not allowed)
- Add notes for business rules, time limits, and conditions
- Indicate terminal states clearly (states that connect to [*])
### Flowcharts
Flowcharts describe decision-making processes and algorithms. They're useful for documenting complex conditional logic that would be hard to follow in prose.
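As an illustration, a payment-processing decision tree covering validation, duplicate detection, and fraud checking might be drawn like this (node labels and outcomes are illustrative):

```mermaid
flowchart TD
    Start([Payment request]) --> Valid{Input valid?}
    Valid -- No --> E422[Return 422]
    Valid -- Yes --> Dup{Duplicate idempotency key?}
    Dup -- Yes --> Cached[Return cached response]
    Dup -- No --> Fraud{Fraud score acceptable?}
    Fraud -- No --> Flag[Reject and flag for review]
    Fraud -- Yes --> Charge[Charge via payment gateway]
    Charge --> Done([Return 201 Created])
```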
This flowchart documents the payment processing decision tree, showing validation, duplicate detection, fraud checking, and processing in a visual flow. It makes the complex conditional logic easy to understand at a glance.
**Best Practices**:
- Use diamonds for decisions, rectangles for actions, rounded rectangles for start/end
- Label decision branches clearly (Yes/No, or specific conditions like "Balance > Amount")
- Show all possible paths through the flow
- Keep flows left-to-right or top-to-bottom for readability (avoid crossing arrows)
### Diagram Tools

**Mermaid (recommended for most diagrams)**:
- Text-based diagrams embedded directly in Markdown
- Version controllable (plain text)
- Automatically rendered by Docusaurus, GitHub, GitLab
- Supports sequence, flowchart, state, ER, Gantt, class diagrams
- Easy to update (just edit text)
**Draw.io / Diagrams.net (for complex diagrams)**:
- Visual editor for intricate diagrams
- Save as `.drawio.svg` for version control and web rendering
- Better for complex C4 diagrams with many components
- Store in `/static/diagrams/`
**When to use each**:
- Use Mermaid for most diagrams (version controllable, easy to update, no external tools needed)
- Use Draw.io for complex architectural diagrams with many components where visual layout matters significantly
- Always prefer text-based (Mermaid) when feasible for easier maintenance and collaboration
## Documentation-as-Code Practices
Treat documentation with the same engineering rigor as production code. This ensures documentation quality and prevents drift.
### Version Control
Store documentation in the same repository as the code it documents. This ensures documentation stays synchronized with code changes and inherits the same versioning, branching, and review processes.
**Repository Structure**:

```text
payment-service/
|-- src/                       # Application code
|-- docs/                      # Documentation
|   |-- architecture.md
|   |-- api.md
|   |-- runbooks/
|   |   |-- high-latency.md
|   |   `-- deployment-failure.md
|   `-- adr/                   # Architecture Decision Records
|       |-- 001-use-postgresql.md
|       `-- 002-event-sourcing.md
|-- README.md
`-- CONTRIBUTING.md
```
Colocating documentation with code enables pull requests to update documentation in the same commit as code changes, preventing drift. When reviewing a PR that changes API behavior, reviewers can verify both code and documentation changes together.
**Best Practices**:
- Include documentation changes in the same PR as code changes
- Use Markdown for version control friendliness (diffs work well on text files)
- Link to code with permalinks containing the commit SHA, not the `main` branch (so links don't break as code evolves)
### Review Process

Documentation should undergo the same review process as code. Include documentation reviewers in pull request reviews to ensure technical accuracy, clarity, and completeness.

**Review Checklist**:
- Technical accuracy (does it reflect actual system behavior?)
- Code examples compile and run as written
- Links are valid (no 404s)
- Grammar and spelling correct
- Appropriate diagram usage (diagrams exist where needed, accurate)
- Consistent with existing documentation style
Tools like markdownlint enforce Markdown style consistency, similar to how ESLint enforces code style. Integrate these into CI pipelines to catch issues before review.
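As a concrete starting point, a minimal markdownlint configuration might look like the following - `MD013` (line length) and `MD033` (no inline HTML) are real markdownlint rule IDs, but this particular selection is illustrative, not a team standard:

```json
{
  "default": true,
  "MD013": { "line_length": 120 },
  "MD033": false
}
```

Here `"default": true` enables all rules, `MD013` relaxes the line-length limit for prose, and disabling `MD033` permits raw HTML where it is genuinely needed.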
## Keeping Documentation Current
Outdated documentation is worse than no documentation because it misleads readers and wastes their time. Engineers lose trust in documentation when it repeatedly contains incorrect information.
### Strategy 1: Documentation Ownership
Assign each documentation section to a team or individual. The owner is responsible for accuracy and updates.
```yaml
---
title: Payment API
owner: "@payments-team"
last_reviewed: 2025-11-01
review_frequency: quarterly
---
```
Schedule quarterly documentation reviews to catch drift and outdated information. During reviews, verify examples still work and information reflects current implementation.
### Strategy 2: Deprecation Notices
When deprecating features, update documentation immediately with clear deprecation warnings and migration paths.
:::warning[Deprecated]
This authentication method is deprecated as of v2.0 and will be removed in v3.0 (scheduled for Q2 2026).
**Migration**: Use OAuth 2.0 authentication instead. See [Authentication and OAuth 2.0](../security/authentication.md) for implementation details.
:::
### Strategy 3: Automated Validation

Automate documentation validation in CI pipelines:

- **Link checking**: Fail builds if documentation contains broken links (use markdown-link-check)
- **Code example testing**: Extract and compile code examples to ensure they work (similar to Rust's doctest)
- **OpenAPI validation**: Validate API documentation against the actual OpenAPI spec to catch drift
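Snippet extraction need not be elaborate; a small `awk` pass over the code fences can feed a compile step. The sketch below is one possible approach - the `snippet-N.java` naming is an assumption for illustration, not a convention from this guide:

```shell
# Copy the body of each fenced java code block in a Markdown file into its
# own snippet-<N>.java file, so a later CI step can try to compile them all.
extract_java_snippets() {
  awk '/^```java$/ {in_block = 1; n++; next}
       /^```$/     {in_block = 0}
       in_block    {print > ("snippet-" n ".java")}' "$1"
}
```

Running `extract_java_snippets docs/api.md` would then produce `snippet-1.java`, `snippet-2.java`, and so on for `javac` to check.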
**Example CI job**:

```yaml
docs-validate:
  stage: test
  image: node:22
  script:
    - npm ci
    - npm run lint:markdown
    - npm run docs:links
    - npm run docs:snippets
  rules:
    - if: '$CI_PIPELINE_SOURCE == "merge_request_event"'
```
Keep this job fast enough to run on every merge request. If snippet execution is slow, split into a fast MR job and a deeper nightly validation job.
### Strategy 4: Update Triggers

Trigger documentation updates automatically:

- **Pull request template checklist**: "[ ] Documentation updated if behavior changed"
- **Pre-commit hooks**: Check whether code changes affect documented behavior
- **Bot comments**: When API contract changes are detected, a bot comments on the PR: "API changes detected. Please update API documentation in `/docs/api.md`."
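A minimal version of the pre-commit idea can be reduced to comparing the staged file list against documentation paths. This is a sketch: the `src/` and `docs/` prefixes are assumptions about repository layout, not part of any standard hook:

```shell
# Return success (0) when the staged changes touch src/ but not docs/,
# i.e. when the author should be reminded to check the documentation.
# $1 is a newline-separated list of staged paths, e.g. from:
#   git diff --cached --name-only
needs_docs_reminder() {
  echo "$1" | grep -q '^src/' && ! echo "$1" | grep -q '^docs/'
}

# A .git/hooks/pre-commit script might then contain:
#   if needs_docs_reminder "$(git diff --cached --name-only)"; then
#     echo "src/ changed without a docs/ update - is documentation current?" >&2
#   fi
```

Emitting a reminder rather than failing the commit keeps the hook low-friction while still prompting the documentation check.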
## Anti-Patterns

Avoid these common documentation mistakes:

### Assuming Knowledge

- [Bad] "Simply configure the service mesh"
- [Good] "Configure the service mesh by editing `istio-config.yaml`. Use the Kubernetes Platform Guidelines for deployment and operational patterns."

Explain prerequisites and provide links to background information. Not everyone has the same context.
### Using Unnecessary Jargon
- [Bad] "Leverage the synergistic capabilities of the orchestration layer"
- [Good] "Use Kubernetes to manage container deployment and scaling"
Prefer plain language when possible. Use technical terms when they're precise and widely understood by your audience.
### Writing Walls of Text
- [Bad] Long paragraphs with no breaks, headings, or whitespace
- [Good] Short paragraphs (3-5 sentences), frequent headings, lists, code examples
Break content into scannable chunks. Readers should be able to skim headings to find relevant sections.
### Forgetting Error Scenarios
- [Bad] Only documenting the happy path
- [Good] Document what happens when requests fail, services are down, or data is invalid
Error handling is often more complex than the happy path. Document failure modes, error codes, and recovery procedures.
### Skipping Diagrams
- [Bad] Describing complex architecture or workflows in prose only
- [Good] Including diagrams for system architecture, sequence flows, state machines
Complex systems need visual aids. A sequence diagram showing API interactions is clearer than paragraphs of text.
### Letting Documentation Drift
- [Bad] Documentation that contradicts actual system behavior
- [Good] Regular reviews, automated validation, updates in same PR as code changes
Outdated documentation frustrates users and erodes trust. Maintain documentation with the same discipline as code.
### Duplicating Information
- [Bad] Copying content from authoritative source into your docs
- [Good] Linking to authoritative source rather than duplicating
When information exists elsewhere (official framework docs, library documentation), link to it rather than copying. Copied content quickly becomes outdated.
### Over-Documenting Trivial Code

- [Bad] `// Increments the counter by 1` above `counter++;`
- [Good] Document why, not what (self-explanatory code needs no documentation)
Don't document what the code obviously does. Document why it does it, especially for non-obvious design decisions.
## Best Practices Summary

**Do**:
- **Know your audience** - write for their expertise level (junior developer vs senior architect)
- **Use active voice** - "The service validates the request" not "The request is validated"
- **Provide examples** - show concepts in practice with runnable code
- **Include diagrams** - visualize architecture, sequences, state transitions
- **Keep it current** - update documentation in the same PR as code changes
- **Review documentation** - apply same rigor as code review
- **Use version control** - store docs in Git alongside code
- **Automate validation** - link checking, spell checking, code example testing
- **Structure for scanning** - headings, TOCs, lists, tables for easy navigation
- **Define acronyms** - spell out on first use: "ADR (Architecture Decision Record)"
**Don't**:
- **Assume knowledge** - explain context and prerequisites
- **Use jargon unnecessarily** - prefer plain language when possible
- **Write walls of text** - break into sections, use headings, add whitespace
- **Forget error scenarios** - document failures, not just happy path
- **Skip diagrams** - complex systems need visual aids
- **Let docs drift** - outdated documentation misleads and frustrates
- **Duplicate content** - link to authoritative source rather than copying
- **Over-document trivial code** - self-explanatory code needs no documentation
## Further Reading

**Internal Guidelines**:
- Technical Design Process - Writing design docs and ADRs
- Incident Post-Mortems - Documenting learnings from incidents
- Pull Request Best Practices - Writing clear PR descriptions
- Code Review Guidelines - Documenting review feedback
- Contract Testing - Keeping API docs and implementation synchronized
**External Resources**:
- Google Developer Documentation Style Guide - Comprehensive style guide for technical writing with detailed grammar and terminology guidance
- Microsoft Writing Style Guide - Style and terminology guidance with focus on clarity and consistency
- Write the Docs - Community for technical writers and engineers with conferences, meetups, and resources
- Diataxis Framework - Systematic approach to technical documentation organizing content as tutorials, how-to guides, reference, and explanation
## Summary

**Key Takeaways**:
- **Match documentation type to purpose** - API docs explain usage, runbooks guide incident response, ADRs record decisions, READMEs provide quick orientation
- **Write clearly and concisely** - Active voice, brevity, specific examples make documentation easy to understand
- **Structure for discoverability** - Headings, TOCs, logical organization enable readers to find information quickly
- **Visualize with diagrams** - Architecture, sequence, state, and flow diagrams communicate complex relationships efficiently
- **Treat docs as code** - Version control, review process, automated validation ensure quality
- **Keep documentation current** - Update with code changes, schedule reviews, automate drift detection
- **Provide examples** - Concrete code examples make abstract concepts actionable
- **Automate validation** - Link checking, spell checking, code testing catch errors early
Well-maintained documentation accelerates onboarding (new engineers become productive faster), reduces incidents (clear runbooks lower MTTR), improves decisions (ADRs prevent repeated debates), and decreases support burden (comprehensive docs mean fewer questions). The investment in documentation quality pays dividends throughout the system lifecycle.