# Technical Writing Guide
Guidelines for writing clear, comprehensive, and maintainable technical documentation that enables teams to understand, use, and maintain software systems effectively.
## Overview
Technical documentation reduces onboarding time, prevents errors, facilitates collaboration, and preserves institutional knowledge. Poor documentation forces engineers to reverse-engineer code, slows development, and increases incident resolution time. Documentation quality directly correlates with team productivity because well-documented systems are easier to understand, modify, and debug.
This guide covers documentation types, writing principles, structural patterns, and maintenance practices. Each section provides concrete examples demonstrating the principles in action.
Treat documentation with the same rigor as production code: version control, peer review, automated testing (link checking, spell checking), and continuous updates.
## Core Principles
- **Match documentation type to purpose** - Different documentation types (API docs, runbooks, ADRs, READMEs) serve distinct needs and audiences
- **Write for scannability** - Readers rarely read linearly; use headings, lists, tables, and diagrams to enable quick navigation
- **Provide concrete examples** - Abstract concepts become actionable through real code examples and scenarios
- **Keep documentation current** - Outdated documentation is worse than no documentation because it misleads readers
- **Automate validation** - Link checking, spell checking, and code example testing catch errors before they reach readers
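As an example of the last principle, a link check is small enough to sketch directly. This is an illustrative stand-in for dedicated link-checking tools, assuming documentation lives in Markdown files under one directory:

```python
import re
from pathlib import Path

# Matches [text](target) and [text](target#anchor); captures the file target
LINK_RE = re.compile(r"\[[^\]]+\]\(([^)#]+)(?:#[^)]*)?\)")

def find_broken_links(docs_dir: str) -> list[tuple[str, str]]:
    """Return (file, target) pairs for relative links that point at missing files."""
    broken = []
    root = Path(docs_dir)
    for md_file in root.rglob("*.md"):
        for target in LINK_RE.findall(md_file.read_text(encoding="utf-8")):
            if target.startswith(("http://", "https://", "mailto:")):
                continue  # external links need an HTTP check, out of scope here
            if not (md_file.parent / target).exists():
                broken.append((str(md_file), target))
    return broken
```

Running a script like this in CI turns stale cross-references into build failures instead of reader dead ends.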
## Documentation Types
Different documentation types serve different purposes and audiences. Understanding when to use each type ensures documentation provides maximum value.
### API Documentation
API documentation describes how to use an API, including endpoints, request/response formats, authentication, and error handling. The primary audience is developers consuming the API (internal teams, external partners, or public developers).
**Essential Elements**:
- Endpoint descriptions with HTTP methods and paths
- Request and response schemas with examples
- Authentication and authorization requirements
- Error codes and handling strategies
- Rate limiting and usage quotas
- Versioning information
Use OpenAPI/Swagger specifications to generate interactive API documentation. OpenAPI provides a machine-readable contract that generates documentation, client SDKs, and server stubs automatically. This approach ensures documentation stays synchronized with implementation because the specification is validated against the actual API through contract testing (see Contract Testing).
```yaml
# OpenAPI specification serves as both contract and documentation source
openapi: 3.0.0
info:
  title: Payment API
  version: 1.0.0
  description: API for processing payment transactions
paths:
  /api/v1/payments:
    post:
      summary: Create a new payment
      description: Initiates a payment transaction with the specified amount and currency
      operationId: createPayment
      requestBody:
        required: true
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/PaymentRequest'
            example:
              customerId: "cust_123abc"
              amount: 100.00
              currency: "USD"
              description: "Invoice payment #12345"
      responses:
        '201':
          description: Payment created successfully
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/PaymentResponse'
        '400':
          description: Invalid request - missing or malformed fields
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/Error'
        '401':
          description: Unauthorized - invalid or missing authentication token
```
The specification becomes the source of truth. Tools like Swagger UI, Redoc, or Stoplight generate interactive documentation where developers can test API calls directly in the browser, eliminating the gap between documentation and reality.
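Contract checks need not wait for full tooling. A deliberately tiny sketch that validates a captured response against a schema fragment from the spec — it covers only `required` and `type`, a small subset of what real JSON Schema validators handle:

```python
def check_against_schema(payload: dict, schema: dict) -> list[str]:
    """Check required fields and primitive types against a JSON-Schema-like dict."""
    errors = []
    types = {"string": str, "number": (int, float), "integer": int,
             "boolean": bool, "object": dict, "array": list}
    for field in schema.get("required", []):
        if field not in payload:
            errors.append(f"missing required field: {field}")
    for field, spec in schema.get("properties", {}).items():
        if field in payload and not isinstance(payload[field], types[spec["type"]]):
            errors.append(f"wrong type for {field}: expected {spec['type']}")
    return errors
```

Even this small subset catches the most common drift: a field renamed in code but not in the documented schema.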
**Common Documentation Gaps**:
- Outdated examples that don't match current API behavior
- Missing error scenarios and edge cases (document what happens when requests fail, not just success paths)
- Insufficient explanation of authentication flows (show the complete token acquisition and refresh process)
- No examples of real-world use cases (isolated endpoint docs don't show how to accomplish user goals)
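Some of these gaps can be caught mechanically. A sketch that flags operations documenting only success responses, given an OpenAPI spec already parsed into a dict (a simplified heuristic, not full spec validation):

```python
def endpoints_missing_error_docs(spec: dict) -> list[str]:
    """Flag operations in an OpenAPI-style spec dict that document only 2xx responses."""
    flagged = []
    for path, ops in spec.get("paths", {}).items():
        for method, op in ops.items():
            codes = op.get("responses", {})
            # If every documented status code starts with "2", no error path is documented
            if not any(not str(code).startswith("2") for code in codes):
                flagged.append(f"{method.upper()} {path}")
    return flagged
```

Wired into CI, a check like this keeps "document the failure paths" from depending on reviewer vigilance alone.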
### User Guides
User guides teach users (developers or end users) how to accomplish specific tasks or workflows. Structure guides as task-oriented tutorials rather than feature catalogs. Users come to documentation to solve problems, not to read about features.
Start each guide with a clear goal statement: "By the end of this guide, you will be able to..." This sets expectations and helps readers determine if the guide meets their needs.
```markdown
# Setting Up Local Development Environment

**Time required**: ~20 minutes

**Prerequisites**:
- Java 25 or higher installed (`java -version` to check)
- Docker Desktop running (needed for PostgreSQL and Redis)
- Git configured with SSH keys (test with `ssh -T [email protected]`)

**Goal**: Set up a fully functional local development environment to run and test the payment service.

## Step 1: Clone the Repository

```bash
git clone [email protected]:payments/payment-service.git
cd payment-service
```

## Step 2: Configure Environment Variables

Create a `.env` file in the project root. This file contains local configuration that differs from production settings.

```bash
# Database configuration - connects to local Docker PostgreSQL
DB_HOST=localhost
DB_PORT=5432
DB_NAME=payments_dev
DB_USER=dev_user
DB_PASSWORD=dev_password

# External API keys (request from team lead)
PAYMENT_GATEWAY_API_KEY=<your-api-key>
```

**Security Note**: Never commit the `.env` file to version control. It's already in `.gitignore`. Production secrets are managed through Kubernetes secrets.

## Step 3: Start Dependencies

Start PostgreSQL and Redis using Docker Compose. The `docker-compose.yml` file defines these services with appropriate configurations for local development.

```bash
docker-compose up -d postgres redis
```

Wait for services to be healthy. Check status with:

```bash
docker-compose ps
```

Both services should show status "Up" with health "healthy".

## Step 4: Run Database Migrations

Apply database migrations to create the schema. The application uses Flyway to manage schema versions (see Database Migrations for details).

```bash
./gradlew flywayMigrate
```

You should see output confirming all migrations applied successfully:

```
Successfully applied 5 migrations
```

## Step 5: Start the Application

```bash
./gradlew bootRun
```

The application starts on http://localhost:8080. Watch for log output indicating successful startup:

```
Started PaymentServiceApplication in 8.234 seconds
```

## Verify Installation

Test that the application is working by calling the health endpoint. This endpoint checks database connectivity and other dependencies.

```bash
curl http://localhost:8080/actuator/health
```

Expected response:

```json
{
  "status": "UP",
  "components": {
    "db": {"status": "UP"},
    "redis": {"status": "UP"}
  }
}
```

## Troubleshooting

**Problem**: Port 8080 already in use
**Cause**: Another application is using port 8080
**Solution**: Either stop that application or change the port in `application.yml` under `server.port: 8081`

**Problem**: Could not connect to database
**Cause**: PostgreSQL container not running or not healthy
**Solution**:
- Verify PostgreSQL is running: `docker-compose ps postgres`
- Check logs for errors: `docker-compose logs postgres`
- If the container crashed, restart it: `docker-compose restart postgres`

## Next Steps

- Run the test suite to verify everything works (see Testing Guide)
- Make your first code change (see Development Workflow)
- Submit a pull request (see Pull Request Process)
```
The guide provides a linear sequence of actions with clear verification steps at each stage. Troubleshooting addresses common failure modes encountered during setup, reducing frustration and support burden.
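Verification steps like the health check above are also automatable. A hedged sketch of a script-friendly check that parses an actuator-style health response (field names follow the example response; adapt to your service):

```python
import json

def health_ok(response_body: str) -> tuple[bool, list[str]]:
    """Parse an actuator-style health payload; report any components not UP."""
    payload = json.loads(response_body)
    failing = [name for name, comp in payload.get("components", {}).items()
               if comp.get("status") != "UP"]
    return payload.get("status") == "UP" and not failing, failing
```

A setup guide that ships such a script lets newcomers verify their environment with one command instead of eyeballing JSON.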
---
### Runbooks
Runbooks provide step-by-step operational procedures for managing production systems, handling incidents, and performing routine maintenance. The audience is on-call engineers, DevOps teams, and SREs who need to respond to production issues.
Runbooks must be action-oriented and assume the reader is under stress (incident response at 2 AM). Use checklists, decision trees, and imperative language ("Check X", "If Y, then do Z"). Every alert should have a corresponding runbook entry so on-call engineers know exactly what to do when paged.
```markdown
# Runbook: High Payment Processing Latency
**Alert**: `payment_processing_p95_latency > 2000ms`
**Severity**: High (impacts customer experience)
**On-Call**: @payments-team-oncall
---
## Symptoms
- Payment processing taking longer than 2 seconds (P95 latency)
- Customers reporting slow checkout experience
- Dashboard shows elevated latency metrics in Grafana
## Investigation Steps
### 1. Check System Health
```bash
# Check overall service health
curl https://payments.company.com/actuator/health
# Check database connection pool utilization
curl https://payments.company.com/actuator/metrics/hikari.connections.active
```

**What to look for**:
- Health endpoint returns `DOWN` status -> Service degradation, proceed to step 2
- Active connections near max pool size (20) -> Database connection exhaustion, proceed to step 4
- Health endpoint timeout -> Service completely down, escalate immediately
### 2. Check Recent Deployments

Recent deployments are the most common cause of sudden performance degradation.

```bash
# Check recent deployments
kubectl rollout history deployment/payment-service -n production
```

**What to look for**:
- Deployment within last 15 minutes -> Recent change likely introduced the issue
- If a recent deployment exists, consider rollback (see Rollback section below)
### 3. Check External Dependencies

Payment gateway latency often causes downstream processing delays.

```bash
# Check payment gateway latency metrics
curl https://payments.company.com/actuator/metrics/http.client.requests \
  | grep payment_gateway
```

**What to look for**:
- Payment gateway latency elevated (>1000ms) -> External dependency issue, proceed to Scenario A
- Payment gateway error rate >1% -> Upstream service degradation, proceed to Scenario A
### 4. Check Database Performance

Database performance issues manifest as slow query execution or lock contention.

```bash
# Connect to read-replica for diagnostics (never use primary for diagnostics)
kubectl exec -it postgres-read-replica -n production -- psql -U payments
```

```sql
-- Check for long-running queries
SELECT pid, now() - query_start AS duration, state, query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY duration DESC
LIMIT 10;

-- Check for lock contention
SELECT * FROM pg_locks WHERE NOT granted;
```

**What to look for**:
- Queries running >5 seconds -> Potentially missing indexes or an inefficient query, proceed to Scenario B
- Lock contention present (ungranted locks exist) -> Multiple transactions competing for the same resources, proceed to Scenario B
## Resolution Steps

### Scenario A: External Dependency Degradation

If payment gateway latency is elevated:

1. **Enable circuit breaker** to prevent cascading failures:
   ```bash
   kubectl set env deployment/payment-service \
     CIRCUIT_BREAKER_ENABLED=true -n production
   ```
   The circuit breaker prevents request pile-up when the gateway is slow, returning fast failures instead of timing out (see Resilience Patterns).
2. **Check vendor status page**: https://status.payment-gateway.com
3. **Notify stakeholders** via #payments-incidents Slack channel:
   ```
   INCIDENT: Payment gateway experiencing high latency.
   Circuit breaker enabled. Monitoring vendor status.
   ```
4. **Monitor recovery**: Watch for latency to return to normal (under 500ms P95)
### Scenario B: Database Performance Issue

If database queries are slow:

1. **Identify slow query** from the pg_stat_activity output (copy the query text)
2. **Analyze query execution plan**:
   ```sql
   EXPLAIN ANALYZE <slow-query>;
   ```
   Look for "Seq Scan" on large tables (indicates a missing index) or high row counts in intermediate steps.
3. **If missing index identified**:
   - Create the index on the read-replica first to test impact (safe to test here)
   - If performance improves, schedule index creation on the primary during a maintenance window
   - DO NOT create indexes on the primary during an incident (creates locks that worsen the situation)
4. **Temporary mitigation** - Scale read replicas if traffic is read-heavy:
   ```bash
   kubectl scale deployment/payment-service-read-replica \
     --replicas=5 -n production
   ```
   Additional replicas distribute read load, reducing latency. This doesn't solve the root cause but provides breathing room for a proper fix.
### Scenario C: Recent Deployment Regression

If a deployment occurred within the last 30 minutes and no other cause is identified:

1. **Rollback immediately**:
   ```bash
   kubectl rollout undo deployment/payment-service -n production
   ```
2. **Verify rollback completed**:
   ```bash
   kubectl rollout status deployment/payment-service -n production
   ```
   Wait for the "successfully rolled out" message.
3. **Monitor latency** - should return to baseline within 2-3 minutes after the rollback completes
4. **Notify team** to investigate the problematic deployment before retrying
## Escalation

If the issue is not resolved within 15 minutes:
- Page Tech Lead: @tech-lead-oncall
- Post in #payments-incidents with investigation findings so far
- Create bridge call: https://company.zoom.us/j/emergency-bridge

If the service is completely down (health endpoint not responding):
- IMMEDIATELY page Tech Lead and Engineering Manager
- Update status page: https://status.company.com (customers need visibility)
- Enable maintenance mode to prevent partial failures

## Post-Incident

After resolution:
- Document the timeline in the incident ticket (what happened when, what actions were taken)
- Schedule a post-mortem within 48 hours (see Incident Post-Mortems)
- Update this runbook if new information was discovered (make it better for next time)
- Create follow-up tasks for preventative measures (address the root cause)

## Related Runbooks

- Payment Service Deployment Issues
- Database Connection Pool Exhaustion
- Circuit Breaker Activation
```
This runbook structure provides clear decision trees for investigation with actionable commands and expected outputs. Explicit escalation criteria prevent on-call engineers from struggling too long with unfamiliar issues.
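The rule that every alert has a runbook can itself be enforced in CI. A minimal sketch, assuming each runbook declares its alert with an `**Alert**:` line as in the example above (the line format is an assumption of this sketch):

```python
import re

# Matches lines like: **Alert**: `payment_processing_p95_latency > 2000ms`
ALERT_RE = re.compile(r"\*\*Alert\*\*:\s*`([^`]+)`")

def runbook_coverage(alert_names: list[str], runbook_texts: list[str]) -> list[str]:
    """Return alert names with no runbook declaring them via an **Alert**: line."""
    covered = set()
    for text in runbook_texts:
        covered.update(ALERT_RE.findall(text))
    return sorted(set(alert_names) - covered)
```

Failing the build on uncovered alerts means an on-call engineer is never paged for something with no documented response.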
---
### Architecture Decision Records (ADRs)
Architecture Decision Records document significant architectural decisions, the context that led to them, and their consequences. The audience includes engineers, architects, and future team members who need to understand "why" decisions were made.
ADRs create an architectural paper trail that prevents repeated debates and helps new team members understand the system's evolution. Without ADRs, teams repeatedly revisit the same discussions as team membership changes or memory fades. Write ADRs when making decisions that are expensive to reverse, such as choosing databases, architectural patterns, or third-party dependencies.
The [Technical Design Process](./technical-design.md) guide contains comprehensive ADR guidance including templates, workflow, and detailed examples. This section provides a brief overview.
**When to Write an ADR**:
- Selecting database technology (PostgreSQL vs MongoDB vs DynamoDB) - these choices lock in data models and operational patterns
- Adopting architectural patterns (microservices vs monolith, event sourcing) - these shape how teams work
- Choosing major frameworks or libraries (React vs Angular, Spring Boot vs Micronaut) - these determine skill requirements
- Changing deployment strategy (blue-green vs canary vs rolling) - these affect release processes
- Deciding authentication mechanisms (OAuth 2.0, SAML, API keys) - these impact security and user experience
**ADR Template Structure**:
```markdown
# ADR-015: Use PostgreSQL for Transactional Data
**Status**: Accepted
**Date**: 2025-11-15
**Deciders**: @engineering-team, @platform-team
## Context
We need to select a database for storing payment transaction data. The database must support:
- ACID transactions (payment processing cannot tolerate data loss or inconsistency)
- Complex queries (reporting requires joins across multiple tables)
- Strong consistency (balance calculations must be correct immediately)
- Proven reliability at scale (handling millions of transactions)
Current system uses MongoDB, which has caused production issues:
- Eventual consistency led to incorrect balance displays
- Lack of transactions caused orphaned payment records
- Complex aggregations perform poorly
## Decision
We will use PostgreSQL for all transactional data (payments, accounts, transactions).
**Rationale**:
- ACID guarantees ensure data consistency critical for financial data
- Rich query capabilities support complex reporting without data duplication
- Mature ecosystem with extensive tooling and monitoring
- Team has PostgreSQL expertise from previous projects
## Alternatives Considered
### Alternative 1: Continue with MongoDB
**Pros**:
- No migration required
- Flexible schema useful for rapidly changing requirements
**Cons**:
- Eventual consistency unsuitable for financial data
- Limited transaction support (multi-document transactions added recently, not mature)
- Complex queries require application-level joins or data duplication
**Rejected because**: Consistency requirements for financial data outweigh schema flexibility benefits.
### Alternative 2: DynamoDB
**Pros**:
- Infinite horizontal scaling
- Managed service reduces operational burden
- High availability built-in
**Cons**:
- Limited query capabilities (no joins, must design for access patterns up front)
- Vendor lock-in to AWS
- Higher cost at our scale ($2000/month vs $200/month for PostgreSQL)
- Team lacks DynamoDB expertise
**Rejected because**: Query flexibility needed for evolving reporting requirements. Cost and learning curve not justified by scale requirements.
## Consequences
**Positive**:
- Strong consistency eliminates balance calculation bugs
- ACID transactions simplify application code (no manual compensation logic)
- Rich querying enables ad-hoc analysis and reporting
- Reduced production incidents related to data consistency
**Negative**:
- Vertical scaling has limits (will need sharding if we exceed ~100K transactions/second)
- Migration effort required (estimated 2 weeks)
- Potential performance degradation during migration (mitigated by gradual rollout)
**Mitigation**:
- Plan migration carefully with rollback procedures
- Monitor database performance metrics closely during and after migration
- Design schema for future sharding if needed (partition by account ID)
## Related Decisions
- ADR-012: Event Sourcing for Audit Log
- ADR-018: Database Migration Strategy
```
ADRs capture not just what was decided, but why alternatives were rejected. This prevents future team members from proposing the same alternatives without understanding the trade-offs already considered.
### README Files
README files provide project overview, quick start instructions, and pointers to detailed documentation. The audience is new team members, contributors, or anyone discovering the project.
The README is the front door to your project. It should answer three questions immediately: What is this? Why does it exist? How do I get started? Keep it concise - detailed information belongs in separate documentation files.
```markdown
# Payment Service

> Microservice for processing payment transactions with support for multiple payment gateways, fraud detection, and compliance reporting.

## Overview

The Payment Service handles all payment processing for the platform, including:
- Payment creation, authorization, and capture
- Refunds and chargebacks
- Multi-currency support (USD, EUR, GBP)
- Integration with payment gateways (Stripe, Adyen)
- Fraud detection via risk scoring
- PCI-DSS compliant payment data handling

**Architecture**: Spring Boot microservice with PostgreSQL database and Kafka event streaming.

**Status**: Production (99.95% uptime SLA)

## Quick Start

### Prerequisites

- Java 25+ (`java -version` to verify)
- Docker Desktop (for PostgreSQL and Redis)
- Git with SSH keys configured (`ssh -T [email protected]` to test)

### Run Locally

```bash
# Clone repository
git clone [email protected]:payments/payment-service.git
cd payment-service

# Start dependencies
docker-compose up -d

# Run application
./gradlew bootRun
```

The service will be available at http://localhost:8080.

Verify: `curl http://localhost:8080/actuator/health`

See the Local Development Guide for detailed setup including environment configuration and troubleshooting.

## Documentation

- API Documentation - Interactive API reference with try-it-out functionality
- Architecture Guide - System design, components, and data flow
- Development Guide - Local setup, testing, debugging
- Deployment Guide - CI/CD pipeline and production deployment
- Runbooks - Operational procedures and troubleshooting

## Key Technologies

- **Backend**: Java 25, Spring Boot 3.5, Spring Data JPA
- **Database**: PostgreSQL 15 with Flyway migrations
- **Messaging**: Kafka 3.5 for event streaming
- **Monitoring**: Prometheus metrics, Grafana dashboards
- **Testing**: JUnit 5, TestContainers, Contract Tests (Pact)

## Project Structure

```
payment-service/
|-- src/main/java/          # Application source code
|   `-- com/company/payments/
|       |-- api/            # REST controllers
|       |-- domain/         # Business logic and models
|       |-- infrastructure/ # Data access, external integrations
|       `-- config/         # Spring configuration
|-- src/test/               # Tests
|-- docs/                   # Documentation
|-- docker/                 # Docker configurations
`-- k8s/                    # Kubernetes manifests
```

## Contributing

See CONTRIBUTING.md for:
- Code style and conventions (enforced via Checkstyle and SpotBugs)
- Pull request process (see Pull Request Guidelines)
- Testing requirements (minimum 80% coverage)
- Code review guidelines

## Team & Support

- **Team**: Payments Squad (@payments-team)
- **Slack**: #payments-team
- **On-Call**: @payments-oncall
- **Issue Tracker**: GitLab Issues

## License

Proprietary - Internal use only
```
This README provides immediate context and clear next steps without overwhelming the reader. Links direct users to detailed documentation organized by concern (architecture, development, deployment, operations).
---
### RFCs (Request for Comments)
RFCs propose significant changes or new features, gather feedback from stakeholders, and build consensus before implementation. The audience includes engineering team, architects, product managers, and stakeholders.
RFCs differ from ADRs in timing and purpose. RFCs are collaborative documents created before decisions are finalized to gather input and refine proposals. ADRs document decisions after they're made to explain the final choice. Use RFCs for changes that affect multiple teams or require broad input.
```markdown
# RFC-042: Implement Idempotency Keys for Payment API
**Status**: In Review
**Author**: @jane-developer
**Created**: 2025-11-05
**Updated**: 2025-11-08
**Reviewers**: @payments-team, @platform-team, @api-guild
**Target Decision Date**: 2025-11-15
---
## Problem Statement
Customers occasionally experience duplicate charges when network issues cause payment requests to be retried. Our API does not currently support idempotency, meaning identical requests are processed as separate transactions.
**Impact**:
- ~50 duplicate payment incidents per month
- Customer support burden: 2-3 hours per week handling duplicates
- Customer trust impact from billing errors
- Manual refund processing required
**Example Scenario**:
1. Customer submits payment for $100
2. Request reaches server, payment is created
3. Network timeout before response reaches client
4. Client retries request (reasonable behavior given timeout)
5. Server creates second payment for $100
6. Customer charged $200 instead of $100
This happens because the server cannot distinguish between a retry of a request that succeeded (but client didn't receive response) and a genuinely new request.
## Proposed Solution
Implement idempotency key support following Stripe's idempotency pattern:
1. **Client sends idempotency key** in header:
```http
POST /api/v1/payments
Idempotency-Key: unique-request-id-12345
Content-Type: application/json
{"amount": 100.00, "currency": "USD"}
```

2. **Server checks for existing request** with the same idempotency key:
   - If found within 24-hour window: return the cached response (200 OK)
   - If not found: process the request normally and cache the response

3. **Store idempotency data**:
   - Idempotency key (indexed for fast lookup)
   - Request fingerprint (method + path + body hash to detect conflicting requests)
   - Response (status code, body)
   - Timestamp (for cleanup after 24 hours)
### Database Schema

```sql
CREATE TABLE idempotency_keys (
    idempotency_key VARCHAR(255) PRIMARY KEY,
    request_hash VARCHAR(64) NOT NULL, -- SHA-256 of method+path+body
    response_status_code INT NOT NULL,
    response_body JSONB NOT NULL,
    created_at TIMESTAMP NOT NULL DEFAULT NOW(),
    expires_at TIMESTAMP NOT NULL DEFAULT NOW() + INTERVAL '24 hours'
);

CREATE INDEX idx_idempotency_expires ON idempotency_keys(expires_at);
```

The request_hash detects cases where a client reuses an idempotency key with different request data (which should be rejected as an error).

### Sequence Diagram

When the client retries with the same idempotency key, the server returns the cached response from the first request, preventing duplicate processing.
## Alternatives Considered

### Alternative 1: Client-Generated Request IDs Only

Store payment IDs from clients and reject duplicates with 409 Conflict.

**Pros**:
- Simpler implementation (no response caching)
- Lighter database storage

**Cons**:
- Doesn't handle legitimate retries (client never got a response and doesn't know if the payment succeeded)
- Client must handle 409 and query payment status separately (complex client logic)
- Doesn't solve the core problem of unclear request success

**Rejected because**: Doesn't solve the fundamental problem of clients not knowing whether their request succeeded.

### Alternative 2: Distributed Locking (Redis)

Use Redis distributed locks to prevent concurrent processing of the same key.

**Pros**:
- Prevents race conditions during concurrent retries
- Fast lock acquisition (under 10ms)

**Cons**:
- Adds Redis dependency to critical path
- Lock timeout management complexity (what if the lock holder crashes?)
- Still need database storage for response caching
- Over-engineered for the problem

**Rejected because**: Adds complexity without significant benefit. The database approach is simpler and sufficient.
## Trade-offs and Risks

### Trade-offs
| Aspect | Decision | Rationale |
|---|---|---|
| Storage duration | 24 hours | Balances safety (most retries happen within minutes) with storage cost |
| Storage location | PostgreSQL | Already available, ACID guarantees, simple implementation |
| Key validation | Client responsibility | Server validates format but doesn't generate keys (client controls retry semantics) |
### Risks
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Database storage growth | Medium | Low | Automated cleanup job runs every 6 hours, indexed expiry column for efficient deletion |
| Hash collision | Very Low | High | Use SHA-256 (cryptographically secure), include timestamp in hash input |
| Performance impact | Low | Medium | Database index on key provides fast lookups, add Redis cache layer in Phase 2 if needed |
| Client adoption | Medium | Medium | Make header optional for backward compatibility, provide client libraries with automatic key generation |
## Impact Analysis

**Technical Impact**:

**Breaking Change**: No (backward compatible)
- Idempotency-Key header is optional
- Requests without a key behave as before
- Existing clients unaffected

**Performance**:
- Additional database lookup per request: ~5-10ms
- Negligible impact on overall P95 latency (currently ~150ms)
- Mitigation: Add Redis cache layer in Phase 2 if monitoring shows latency impact

**Infrastructure**:
- Estimated storage: ~500 KB per 1,000 requests
- At 1M requests/day: ~500 MB/day, so roughly 500 MB retained at any time (24-hour window)
- Cleanup job runs every 6 hours, removing expired keys

**Operational Impact**:

**Monitoring**:
- Add metric: `idempotency_key_hit_rate` (% of requests with existing keys)
- Alert if hit rate >10% (indicates client retry issues or bugs)

**Logging**:
- Log when an idempotent response is returned (INFO level)
- Include the original request timestamp to measure retry latency

**Business Impact**:

**Benefits**:
- Eliminate ~50 duplicate payment incidents/month
- Reduce customer support burden by 2-3 hours/week
- Improve customer trust and satisfaction
- Align with industry standard pattern (Stripe, PayPal use this approach)

**Costs**:
- Development: 2 weeks (1 engineer)
- Storage: ~$5/month (PostgreSQL storage)
- Maintenance: Minimal (automated cleanup)
## Open Questions

1. **Should we validate that the request body matches for the same key?**
   - If a client sends the same key with a different body, should we reject (409 Conflict) or allow?
   - Recommendation: Reject with 409 Conflict and a clear error message explaining the issue. This prevents accidental misuse.

2. **What's the maximum key length?**
   - Stripe uses 255 characters max
   - Recommendation: 255 characters (a UUID is typically 36, but client-generated keys may be longer)

3. **Should we support key expiry extension?**
   - If a client retries after 24 hours, should we extend expiry?
   - Recommendation: No. Retries after 24 hours indicate a different issue. A fixed 24-hour window keeps cleanup logic simple.
## Success Criteria

**Technical**:
- Idempotency implementation tested with >90% code coverage
- Performance impact under 10ms P95
- Zero duplicate payments in staging over a 1-week testing period
- Load testing confirms no degradation under 2x normal load

**Business**:
- Duplicate payment incidents reduced by >80% within 3 months
- Customer support hours for duplicates reduced by >75%
- Adoption by at least 2 client applications within 6 months
## Timeline
| Milestone | Target Date | Owner |
|---|---|---|
| RFC Review & Approval | 2025-11-15 | @payments-team |
| Design Review | 2025-11-18 | @jane-developer |
| Implementation | 2025-11-29 | @jane-developer |
| QA & Testing | 2025-12-06 | @qa-team |
| Production Deploy | 2025-12-13 | @devops-team |
| Monitor & Iterate | 2025-12-20 | @payments-team |
## Feedback

**Comments from @tech-lead (2025-11-06)**:
> Looks solid. Consider a Redis cache layer if database lookups become a bottleneck. Also, what about distributed system clock skew affecting timestamps?

**Response**: Will add Redis cache in Phase 2 if monitoring shows >50ms DB lookup latency. Good point on clock skew - using database NOW() for all timestamps to avoid server clock issues.

**Comments from @security-engineer (2025-11-07)**:
> What prevents malicious clients from filling the database with keys?

**Response**: Excellent point. Will add rate limiting (100 unique idempotency keys per client per hour) and automatic cleanup. Also added monitoring for unusual key creation patterns.
## Decision

**Status**: Approved (2025-11-15)

**Outcome**: Proceed with implementation as proposed. Add Redis caching layer in Phase 2 if performance monitoring indicates need.

**Action Items**:
- @jane-developer: Create implementation ticket with detailed subtasks
- @api-guild: Draft company-wide idempotency standard RFC
- @docs-team: Update API documentation with idempotency examples
- @qa-team: Create test plan including race condition scenarios
```
This RFC structure facilitates discussion by clearly presenting the problem, proposed solution, alternatives, and open questions. The "Feedback" section captures the collaborative refinement process, creating a record of how the decision evolved based on team input.
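The idempotency mechanism the RFC proposes can be sketched in a few lines. This is an illustrative sketch, not the production design: `handle` stands in for the API layer, the `store` dict for the idempotency_keys table, and `process` for the payment processor.

```python
import hashlib

def request_fingerprint(method: str, path: str, body: bytes) -> str:
    """SHA-256 over method+path+body, mirroring the RFC's request_hash column."""
    h = hashlib.sha256()
    h.update(method.upper().encode())
    h.update(path.encode())
    h.update(body)
    return h.hexdigest()

def handle(store: dict, key: str, method: str, path: str, body: bytes, process):
    """Idempotent request handling: replay cached responses, reject key reuse."""
    fp = request_fingerprint(method, path, body)
    if key in store:
        cached = store[key]
        if cached["request_hash"] != fp:
            # Same key, different request data -> conflict, per the RFC's open question 1
            return 409, {"error": "idempotency key reused with different request"}
        return cached["status"], cached["body"]  # replay original response
    status, resp = process(body)
    store[key] = {"request_hash": fp, "status": status, "body": resp}
    return status, resp
```

A real implementation would also need the 24-hour expiry and concurrency handling the RFC discusses, but the core replay-or-reject logic is this small.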
---
## Writing Principles
Effective technical writing follows core principles that ensure clarity, accuracy, and usability.
### Active Voice
Active voice makes writing clearer and more direct by explicitly stating who performs each action. This reduces cognitive load because readers immediately know the subject and action.
**Passive (unclear actor)**:
> "The payment is validated by the fraud detection service."
> "Errors should be handled gracefully."
> "The database schema is migrated automatically during deployment."
**Active (clear actor)**:
> "The fraud detection service validates the payment."
> "Your service should handle errors gracefully."
> "Flyway automatically migrates the database schema during deployment."
In runbooks and procedures, passive voice obscures responsibility, which delays incident response. "The database should be restarted" leaves engineers wondering who should restart it. "Restart the database using `kubectl rollout restart`" provides clear action.
### Brevity
Concise writing respects the reader's time. Every sentence should add value. Remove filler words, redundant phrases, and unnecessary qualifiers.
**Verbose**:
> "It is important to note that you should always remember to validate user input in order to prevent potential security vulnerabilities that could possibly be exploited by malicious actors."
**Concise**:
> "Validate user input to prevent security vulnerabilities."
The concise version communicates the same requirement in 7 words instead of 29. Brevity doesn't mean removing necessary detail - it means removing unnecessary words. Explain complex topics thoroughly, but eliminate words that don't contribute meaning.
**Wordy Phrases to Eliminate**:
- "It is important to note that" -> (delete, or use only if truly critical)
- "In order to" -> "To"
- "Due to the fact that" -> "Because"
- "At this point in time" -> "Now" or "Currently"
- "For the purpose of" -> "To"
### Clarity
Clarity means the reader understands your meaning on first reading. It requires precise word choice, logical structure, and appropriate detail level.
**Unclear**:
> "The service handles requests asynchronously when appropriate using various strategies depending on load characteristics and SLA requirements."
**Clear**:
> "The service processes requests asynchronously when response time exceeds 500ms. It uses a thread pool for CPU-intensive tasks and event loop for I/O operations."
The clear version specifies *when* async processing occurs and *how* it's implemented. Vague terms like "when appropriate" and "various strategies" force readers to guess implementation details.
**Techniques for Clarity**:
- Use specific numbers instead of vague quantifiers ("500ms" not "fast", "1000 requests/second" not "high throughput")
- Define acronyms on first use: "SLA (Service Level Agreement)"
- Provide examples to illustrate abstract concepts
- Use consistent terminology (don't alternate between "request", "call", and "invocation" for the same concept)
### Examples
Examples transform abstract concepts into concrete understanding. Every significant concept should have at least one example showing it in practice.
**Abstract (hard to apply)**:
> "Use dependency injection to improve testability."
**Concrete (actionable)**:
```java
// Without dependency injection - hard to test
public class PaymentService {
    // Hard-coded dependency - cannot be mocked in tests
    private PaymentGateway gateway = new StripeGateway();

    public void processPayment(Payment payment) {
        gateway.charge(payment); // Cannot mock gateway for testing
    }
}

// With dependency injection - easy to test
public class PaymentService {
    private final PaymentGateway gateway;

    // Dependency injected via constructor - can inject mock in tests
    public PaymentService(PaymentGateway gateway) {
        this.gateway = gateway;
    }

    public void processPayment(Payment payment) {
        gateway.charge(payment); // Can inject mock gateway in tests
    }
}

// Test demonstrates the benefit
@Test
void processPayment_callsGateway() {
    PaymentGateway mockGateway = mock(PaymentGateway.class);
    PaymentService service = new PaymentService(mockGateway); // Inject mock

    service.processPayment(new Payment(100.00));

    verify(mockGateway).charge(any()); // Verify interaction
}
```

The example demonstrates the principle in action, showing both the problem (hard-coded dependency) and the solution (constructor injection) with concrete code. The test shows why this matters, making the abstract benefit ("improve testability") concrete.
## Structuring Documentation

Good structure makes documentation scannable and navigable. Readers rarely read documentation linearly - they scan headings to locate relevant sections.

### Headings and Hierarchy

Use heading levels to create logical document structure. Well-structured headings enable readers to quickly navigate to relevant information.

**Good hierarchy**:

```markdown
# Payment API
## Authentication
### API Keys
### OAuth 2.0
## Endpoints
### Create Payment
### Get Payment Status
### Refund Payment
## Error Handling
### Error Codes
### Retry Logic
```

**Poor hierarchy (flat structure with no navigation aid)**:

```markdown
# Payment API
This document covers authentication, endpoints, and error handling...
(long undifferentiated block of text with no headings)
```

**Rules**:
- Use one H1 (`#`) per document (document title)
- Don't skip heading levels (H1 -> H2 -> H3, not H1 -> H3)
- Keep headings concise (3-7 words)
- Use parallel structure ("Creating Users", "Updating Users", "Deleting Users" - all gerunds)
### Code Block Formatting

Always specify the language for syntax highlighting. Include comments explaining non-obvious code. Keep examples focused on the concept being demonstrated.

**Good (syntax highlighting, comments, focused)**:

```java
// Use constructor injection for dependencies - enables testing with mocks
public class PaymentService {
    private final PaymentRepository repository;

    // Spring automatically injects repository when creating this bean
    public PaymentService(PaymentRepository repository) {
        this.repository = repository;
    }
}
```

**Poor (no language, no comments, too much irrelevant code)**:

```
public class PaymentService {
    private PaymentRepository repository;
    private MetricsCollector metrics;
    private Logger logger;
    private ConfigService config;
    private CacheManager cache;
    private EmailService email;
    // ... 50 more lines of unrelated setup that obscures the point
}
```

**Supported Languages**: `java`, `typescript`, `kotlin`, `swift`, `bash`, `sql`, `yaml`, `json`, `xml`, `markdown`, `mermaid`
### Lists and Tables

Use lists for sequences, collections, or alternatives. Use tables for structured data with multiple attributes that need comparison.

**Lists (good for steps or options)**:
**Prerequisites**:
- Java 25 or higher installed
- Docker Desktop running
- Git configured with SSH keys
- 4GB available RAM minimum
**Tables (good for comparing options)**:
| Database | Pros | Cons | Use Case |
|----------|------|------|----------|
| PostgreSQL | ACID, rich queries, mature | Vertical scaling limits | Transactional data |
| MongoDB | Flexible schema, easy scaling | Eventual consistency | Rapidly changing schemas |
| DynamoDB | Infinite scale, managed | Vendor lock-in, limited queries | High-scale key-value |
Tables excel at comparing multiple items across several dimensions. Lists work better for simple enumerations or sequential steps.
## Using Diagrams Effectively

Diagrams communicate system structure and behavior more efficiently than prose for spatial and temporal relationships. Different diagram types serve different purposes.

### Architecture Diagrams

Architecture diagrams show system components and their relationships. Use the C4 model for consistent abstraction levels:

- **Context diagrams**: System and external actors (highest level)
- **Container diagrams**: Applications, databases, message queues
- **Component diagrams**: Major components within a container
- **Code diagrams**: Class relationships (rarely needed in documentation)
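A container-level view of the payment system discussed in this section can be sketched directly in Mermaid; the component names and subgraph grouping below mirror the surrounding description, and the layout is illustrative rather than prescriptive:

```mermaid
flowchart LR
    Client([Client]) -->|HTTPS/JSON| API
    subgraph PaymentSystem["Payment System"]
        API[API Service] -->|SQL| DB[(PostgreSQL)]
        API -->|publishes events| Kafka[[Kafka]]
        Kafka -->|consumes events| Worker[Async Worker]
    end
    Worker -->|HTTPS| Gateway[Payment Gateway]
```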
This container-level diagram shows the major applications (API, Worker), data stores (PostgreSQL, Kafka), and external systems (Payment Gateway). Grouping related components in subgraphs improves readability by showing system boundaries.
**Best Practices**:
- Use consistent shapes/colors (rectangles for services, cylinders for databases, parallelograms for queues)
- Show direction of data flow with arrows
- Label connections with protocols or data types
- Keep diagrams focused on one concept (don't try to show everything in one diagram)
### Sequence Diagrams
Sequence diagrams illustrate interactions over time, showing message flow between components. They're essential for understanding complex workflows with multiple steps.
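For instance, a payment-creation flow with success and failure branches might be sketched like this (participant names and status codes are illustrative):

```mermaid
sequenceDiagram
    participant Client
    participant API as Payment API
    participant DB as PostgreSQL
    participant Kafka

    Client->>API: POST /payments
    API->>DB: BEGIN + INSERT payment
    alt payment valid
        DB-->>API: COMMIT ok
        API--)Kafka: publish PaymentCreated (async)
        API-->>Client: 201 Created
    else validation fails
        DB-->>API: ROLLBACK
        API-->>Client: 422 Unprocessable Entity
    end
```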
This sequence diagram shows both success and failure paths, including the database transaction boundary and asynchronous event publishing. The alt block clearly differentiates the two flows, making it easy to understand error handling.
**Best Practices**:
- Show only relevant participants (omit infrastructure like load balancers unless directly relevant)
- Use `alt` blocks for conditional flows (success/failure, different execution paths)
- Use `opt` blocks for optional steps
- Add notes for non-obvious behavior: `Note over PaymentService: Retry up to 3 times with exponential backoff`
### State Machines
State machines represent object lifecycle and valid state transitions. They're critical for documenting workflows with complex business rules.
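A payment lifecycle of this kind can be expressed as a Mermaid state diagram; the state names and the 7-day expiry match the example discussed in this section, while the exact transitions are illustrative:

```mermaid
stateDiagram-v2
    [*] --> Pending
    Pending --> Authorized: authorize()
    Pending --> Failed: authorization declined
    Authorized --> Captured: capture()
    Authorized --> Expired: no capture within 7 days
    Captured --> Refunded: refund()
    Captured --> [*]
    Refunded --> [*]
    Failed --> [*]
    Expired --> [*]
    note right of Authorized
        Authorization expires after 7 days
    end note
```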
State machines clarify which transitions are valid (e.g., you can't refund a Failed payment - there's no arrow) and document business constraints (authorization expiry after 7 days). They prevent implementation bugs where invalid state transitions are accidentally allowed.
**Best Practices**:
- Show all valid states and transitions
- Document invalid transitions by omission (if there's no arrow connecting two states, that transition is not allowed)
- Add notes for business rules, time limits, and conditions
- Indicate terminal states clearly (states that connect to [*])
### Flowcharts
Flowcharts describe decision-making processes and algorithms. They're useful for documenting complex conditional logic that would be hard to follow in prose.
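As an illustration, a payment-processing decision tree covering validation, duplicate detection, and fraud checking might be drawn like this (node labels and outcomes are illustrative):

```mermaid
flowchart TD
    Start([Payment request]) --> Valid{Input valid?}
    Valid -- No --> E422[Return 422]
    Valid -- Yes --> Dup{Duplicate idempotency key?}
    Dup -- Yes --> Cached[Return cached response]
    Dup -- No --> Fraud{Fraud score acceptable?}
    Fraud -- No --> Flag[Reject and flag for review]
    Fraud -- Yes --> Charge[Charge via payment gateway]
    Charge --> Done([Return 201 Created])
```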
This flowchart documents the payment processing decision tree, showing validation, duplicate detection, fraud checking, and processing in a visual flow. It makes the complex conditional logic easy to understand at a glance.
**Best Practices**:
- Use diamonds for decisions, rectangles for actions, rounded rectangles for start/end
- Label decision branches clearly (Yes/No, or specific conditions like "Balance > Amount")
- Show all possible paths through the flow
- Keep flows left-to-right or top-to-bottom for readability (avoid crossing arrows)
### Diagram Tools

**Mermaid (recommended for most diagrams)**:
- Text-based diagrams embedded directly in Markdown
- Version controllable (plain text)
- Automatically rendered by Docusaurus, GitHub, GitLab
- Supports sequence, flowchart, state, ER, Gantt, class diagrams
- Easy to update (just edit text)
**Draw.io / Diagrams.net (for complex diagrams)**:
- Visual editor for intricate diagrams
- Save as `.drawio.svg` for version control and web rendering
- Better for complex C4 diagrams with many components
- Store in `/static/diagrams/`
**When to use each**:
- Use Mermaid for most diagrams (version controllable, easy to update, no external tools needed)
- Use Draw.io for complex architectural diagrams with many components where visual layout matters significantly
- Always prefer text-based (Mermaid) when feasible for easier maintenance and collaboration
## Documentation-as-Code Practices
Treat documentation with the same engineering rigor as production code. This ensures documentation quality and prevents drift.
### Version Control
Store documentation in the same repository as the code it documents. This ensures documentation stays synchronized with code changes and inherits the same versioning, branching, and review processes.
**Repository Structure**:

```text
payment-service/
|-- src/                       # Application code
|-- docs/                      # Documentation
|   |-- architecture.md
|   |-- api.md
|   |-- runbooks/
|   |   |-- high-latency.md
|   |   `-- deployment-failure.md
|   `-- adr/                   # Architecture Decision Records
|       |-- 001-use-postgresql.md
|       `-- 002-event-sourcing.md
|-- README.md
`-- CONTRIBUTING.md
```
Colocating documentation with code enables pull requests to update documentation in the same commit as code changes, preventing drift. When reviewing a PR that changes API behavior, reviewers can verify both code and documentation changes together.
**Best Practices**:
- Include documentation changes in the same PR as code changes
- Use Markdown for version control friendliness (diffs work well on text files)
- Link to code with permalinks containing the commit SHA, not the `main` branch (so links don't break as code evolves)
### Review Process

Documentation should undergo the same review process as code. Include documentation reviewers in pull request reviews to ensure technical accuracy, clarity, and completeness.

**Review Checklist**:
- Technical accuracy (does it reflect actual system behavior?)
- Code examples compile and run as written
- Links are valid (no 404s)
- Grammar and spelling correct
- Appropriate diagram usage (diagrams exist where needed, accurate)
- Consistent with existing documentation style
Tools like markdownlint enforce Markdown style consistency, similar to how ESLint enforces code style. Integrate these into CI pipelines to catch issues before review.
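As a concrete starting point, a minimal markdownlint configuration might look like the following - `MD013` (line length) and `MD033` (no inline HTML) are real markdownlint rule IDs, but this particular selection is illustrative, not a team standard:

```json
{
  "default": true,
  "MD013": { "line_length": 120 },
  "MD033": false
}
```

Here `"default": true` enables all rules, `MD013` relaxes the line-length limit for prose, and disabling `MD033` permits raw HTML where it is genuinely needed.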
## Keeping Documentation Current
Outdated documentation is worse than no documentation because it misleads readers and wastes their time. Engineers lose trust in documentation when it repeatedly contains incorrect information.
### Strategy 1: Documentation Ownership
Assign each documentation section to a team or individual. The owner is responsible for accuracy and updates.
```yaml
---
title: Payment API
owner: "@payments-team"
last_reviewed: 2025-11-01
review_frequency: quarterly
---
```
Schedule quarterly documentation reviews to catch drift and outdated information. During reviews, verify examples still work and information reflects current implementation.
### Strategy 2: Deprecation Notices
When deprecating features, update documentation immediately with clear deprecation warnings and migration paths.
:::warning[Deprecated]
This authentication method is deprecated as of v2.0 and will be removed in v3.0 (scheduled for Q2 2026).
**Migration**: Use OAuth 2.0 authentication instead. See [Authentication and OAuth 2.0](../security/authentication.md) for implementation details.
:::
### Strategy 3: Automated Validation

Automate documentation validation in CI pipelines:

- **Link checking**: Fail builds if documentation contains broken links (use markdown-link-check)
- **Code example testing**: Extract and compile code examples to ensure they work (similar to Rust's doctest)
- **OpenAPI validation**: Validate API documentation against the actual OpenAPI spec to catch drift
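Snippet extraction need not be elaborate; a small `awk` pass over the code fences can feed a compile step. The sketch below is one possible approach - the `snippet-N.java` naming is an assumption for illustration, not a convention from this guide:

```shell
# Copy the body of each fenced java code block in a Markdown file into its
# own snippet-<N>.java file, so a later CI step can try to compile them all.
extract_java_snippets() {
  awk '/^```java$/ {in_block = 1; n++; next}
       /^```$/     {in_block = 0}
       in_block    {print > ("snippet-" n ".java")}' "$1"
}
```

Running `extract_java_snippets docs/api.md` would then produce `snippet-1.java`, `snippet-2.java`, and so on for `javac` to check.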
**Example CI job**:

```yaml
docs-validate:
  stage: test
  image: node:22
  script:
    - npm ci
    - npm run lint:markdown
    - npm run docs:links
    - npm run docs:snippets
  rules:
    - if: '$CI_PIPELINE_SOURCE == "merge_request_event"'
```
Keep this job fast enough to run on every merge request. If snippet execution is slow, split into a fast MR job and a deeper nightly validation job.
### Strategy 4: Update Triggers

Trigger documentation updates automatically:

- **Pull request template checklist**: "[ ] Documentation updated if behavior changed"
- **Pre-commit hooks**: Check whether code changes affect documented behavior
- **Bot comments**: When API contract changes are detected, a bot comments on the PR: "API changes detected. Please update API documentation in `/docs/api.md`."
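A minimal version of the pre-commit idea can be reduced to comparing the staged file list against documentation paths. This is a sketch: the `src/` and `docs/` prefixes are assumptions about repository layout, not part of any standard hook:

```shell
# Return success (0) when the staged changes touch src/ but not docs/,
# i.e. when the author should be reminded to check the documentation.
# $1 is a newline-separated list of staged paths, e.g. from:
#   git diff --cached --name-only
needs_docs_reminder() {
  echo "$1" | grep -q '^src/' && ! echo "$1" | grep -q '^docs/'
}

# A .git/hooks/pre-commit script might then contain:
#   if needs_docs_reminder "$(git diff --cached --name-only)"; then
#     echo "src/ changed without a docs/ update - is documentation current?" >&2
#   fi
```

Emitting a reminder rather than failing the commit keeps the hook low-friction while still prompting the documentation check.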
## Anti-Patterns

Avoid these common documentation mistakes:

### Assuming Knowledge

- [Bad] "Simply configure the service mesh"
- [Good] "Configure the service mesh by editing `istio-config.yaml`. Use the Kubernetes Platform Guidelines for deployment and operational patterns."

Explain prerequisites and provide links to background information. Not everyone has the same context.
### Using Unnecessary Jargon
- [Bad] "Leverage the synergistic capabilities of the orchestration layer"
- [Good] "Use Kubernetes to manage container deployment and scaling"
Prefer plain language when possible. Use technical terms when they're precise and widely understood by your audience.
### Writing Walls of Text
- [Bad] Long paragraphs with no breaks, headings, or whitespace
- [Good] Short paragraphs (3-5 sentences), frequent headings, lists, code examples
Break content into scannable chunks. Readers should be able to skim headings to find relevant sections.
### Forgetting Error Scenarios
- [Bad] Only documenting the happy path
- [Good] Document what happens when requests fail, services are down, or data is invalid
Error handling is often more complex than the happy path. Document failure modes, error codes, and recovery procedures.
### Skipping Diagrams
- [Bad] Describing complex architecture or workflows in prose only
- [Good] Including diagrams for system architecture, sequence flows, state machines
Complex systems need visual aids. A sequence diagram showing API interactions is clearer than paragraphs of text.
### Letting Documentation Drift
- [Bad] Documentation that contradicts actual system behavior
- [Good] Regular reviews, automated validation, updates in same PR as code changes
Outdated documentation frustrates users and erodes trust. Maintain documentation with the same discipline as code.
### Duplicating Information
- [Bad] Copying content from authoritative source into your docs
- [Good] Linking to authoritative source rather than duplicating
When information exists elsewhere (official framework docs, library documentation), link to it rather than copying. Copied content quickly becomes outdated.
### Over-Documenting Trivial Code

- [Bad] `// Increments the counter by 1` above `counter++;`
- [Good] Document why, not what (self-explanatory code needs no documentation)
Don't document what the code obviously does. Document why it does it, especially for non-obvious design decisions.
## Best Practices Summary

**Do**:
- **Know your audience** - write for their expertise level (junior developer vs senior architect)
- **Use active voice** - "The service validates the request" not "The request is validated"
- **Provide examples** - show concepts in practice with runnable code
- **Include diagrams** - visualize architecture, sequences, state transitions
- **Keep it current** - update documentation in the same PR as code changes
- **Review documentation** - apply same rigor as code review
- **Use version control** - store docs in Git alongside code
- **Automate validation** - link checking, spell checking, code example testing
- **Structure for scanning** - headings, TOCs, lists, tables for easy navigation
- **Define acronyms** - spell out on first use: "ADR (Architecture Decision Record)"
**Don't**:
- **Assume knowledge** - explain context and prerequisites
- **Use jargon unnecessarily** - prefer plain language when possible
- **Write walls of text** - break into sections, use headings, add whitespace
- **Forget error scenarios** - document failures, not just happy path
- **Skip diagrams** - complex systems need visual aids
- **Let docs drift** - outdated documentation misleads and frustrates
- **Duplicate content** - link to authoritative source rather than copying
- **Over-document trivial code** - self-explanatory code needs no documentation
## Further Reading

**Internal Guidelines**:
- Technical Design Process - Writing design docs and ADRs
- Incident Post-Mortems - Documenting learnings from incidents
- Pull Request Best Practices - Writing clear PR descriptions
- Code Review Guidelines - Documenting review feedback
- Contract Testing - Keeping API docs and implementation synchronized
**External Resources**:
- Google Developer Documentation Style Guide - Comprehensive style guide for technical writing with detailed grammar and terminology guidance
- Microsoft Writing Style Guide - Style and terminology guidance with focus on clarity and consistency
- Write the Docs - Community for technical writers and engineers with conferences, meetups, and resources
- Diataxis Framework - Systematic approach to technical documentation organizing content as tutorials, how-to guides, reference, and explanation
## Summary

**Key Takeaways**:
- **Match documentation type to purpose** - API docs explain usage, runbooks guide incident response, ADRs record decisions, READMEs provide quick orientation
- **Write clearly and concisely** - Active voice, brevity, specific examples make documentation easy to understand
- **Structure for discoverability** - Headings, TOCs, logical organization enable readers to find information quickly
- **Visualize with diagrams** - Architecture, sequence, state, and flow diagrams communicate complex relationships efficiently
- **Treat docs as code** - Version control, review process, automated validation ensure quality
- **Keep documentation current** - Update with code changes, schedule reviews, automate drift detection
- **Provide examples** - Concrete code examples make abstract concepts actionable
- **Automate validation** - Link checking, spell checking, code testing catch errors early
Well-maintained documentation accelerates onboarding (new engineers become productive faster), reduces incidents (clear runbooks lower MTTR), improves decisions (ADRs prevent repeated debates), and decreases support burden (comprehensive docs mean fewer questions). The investment in documentation quality pays dividends throughout the system lifecycle.