Technical Design Process

Technical designs document architectural and implementation decisions before coding begins, creating a shared understanding of the approach and preventing costly rework. They serve as both a communication tool during development and a historical record of decision rationale for future engineers.

Overview

This guide covers two complementary documentation approaches: comprehensive technical design documents for complex features involving multiple systems, and Architecture Decision Records (ADRs) for focused architectural decisions. Both create accountability, preserve institutional knowledge, and improve decision quality through structured thinking.

Understanding when and how to use each approach enables teams to balance documentation overhead with long-term maintainability. Technical designs excel at coordinating complex implementations across multiple systems and teams, while ADRs efficiently capture single architectural decisions with lightweight documentation.

Core Principles

Document Before Implementing: Design reviews catch issues earlier and more cheaply than code reviews
Focus on "Why": Future engineers need context and rationale, not just "what" was built
Evaluate Alternatives: Documenting rejected options prevents relitigating decisions
Right-Size Documentation: Use ADRs for single decisions, full designs for multi-system features
Keep Designs Current: Update documentation when implementation deviates from design

When to Write Technical Designs

Always Required

New Services or Applications

New microservices, mobile apps, or frontend applications require comprehensive design documents. These decisions ripple through infrastructure, deployment pipelines, and team ownership boundaries. Documenting the service boundaries, API contracts, data ownership, and operational characteristics upfront prevents integration issues and clarifies responsibilities.

Significant Architectural Changes

Database schema redesigns, monolith-to-microservices migrations, or new technology stack introductions fundamentally alter system architecture. These changes affect multiple teams, require coordinated deployments, and carry high reversal costs. Design documents help identify migration risks, coordinate team efforts, and establish rollback procedures before committing resources.

Cross-Team Integrations

New API contracts between teams, event-driven communication patterns, or shared data models require alignment on interface definitions, versioning strategies, and backward compatibility. Design documents establish these contracts explicitly, reducing integration friction and preventing breaking changes. See API Design Guidelines for API-specific design considerations.

Performance-Critical Changes

Caching layer implementations, database query optimizations, or high-throughput message processing require baseline measurements, target metrics, and capacity planning. Design documents establish performance requirements upfront, guiding implementation decisions and defining success criteria. Performance testing strategies are covered in Performance Testing.

Security-Sensitive Changes

Authentication/authorization changes, payment processing flows, or PII handling modifications carry compliance and risk implications. Design documents ensure security reviews occur before implementation, capture threat models, and document security controls. Coordinate with Security Overview and Data Protection guidelines.

Complex Business Logic

Multi-step workflows, state machines, or complex validation rules benefit from visual diagrams and explicit state modeling. Design documents clarify edge cases, document state transitions, and establish error handling strategies before coding begins.

Optional or Lightweight

Single Feature Within Existing Patterns

Features following established patterns may only need brief design discussion documented in the user story description. Reserve full design documents for novel approaches or pattern deviations.

Bug Fixes

Most bug fixes need only root cause analysis and solution approach in the PR description. Create design documents when fixes require architectural changes or affect system contracts.

Minor Refactoring

Small improvements to existing code rarely justify design documents. Document refactoring rationale in PR descriptions, focusing on maintainability improvements or technical debt reduction.

Decision Framework

When unsure whether a design document is needed, consider:

Reversal cost: High-cost decisions warrant documentation
Team coordination: Work that spans multiple teams requires alignment
Compliance impact: Regulatory requirements need documented controls
Knowledge preservation: Will future engineers need to understand "why"?
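The four questions above can be applied as a rough checklist. The sketch below is illustrative only: the `DecisionContext` field names, the `design_doc_reasons` function, and the idea that any single matching criterion is worth discussing are assumptions, not a policy stated in this guide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecisionContext:
    """Answers to the four framework questions (hypothetical field names)."""
    high_reversal_cost: bool
    multiple_teams: bool
    compliance_impact: bool
    rationale_needed_later: bool


def design_doc_reasons(ctx: DecisionContext) -> list[str]:
    """Return the framework criteria that apply to this change.

    An empty list suggests a PR or user story description may be enough.
    """
    checks = [
        (ctx.high_reversal_cost, "reversal cost"),
        (ctx.multiple_teams, "team coordination"),
        (ctx.compliance_impact, "compliance impact"),
        (ctx.rationale_needed_later, "knowledge preservation"),
    ]
    return [label for applies, label in checks if applies]
```

Returning the matching criteria rather than a bare yes/no keeps the reasoning visible, which is the point of the framework.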

Technical Design Structure

Comprehensive technical designs follow a consistent structure ensuring reviewers can efficiently evaluate completeness and identify gaps. Each section serves a specific purpose in communicating the design effectively.

This structure ensures critical aspects receive explicit consideration. The flow from problem understanding through solution design to operational concerns mirrors the natural thought process while ensuring nothing is overlooked.

1. Overview and Context

Purpose: Orient readers to the problem space and design scope.

The overview establishes what the design covers (and explicitly what it doesn't) with clear goals and non-goals. This prevents scope creep during implementation and manages stakeholder expectations.

Goals define measurable success criteria: "Reduce payment failure rate from 5% to <1%" rather than vague objectives like "improve reliability." Specific goals enable objective evaluation of design success.

Non-goals explicitly exclude out-of-scope work, preventing feature creep: "This design does NOT include multi-currency support (deferred to Q2)" clarifies boundaries upfront.

Background and Context explains the current state, pain points, and business drivers. Without this context, reviewers cannot evaluate whether the proposed solution addresses the right problem. Include business impact, deadlines, and constraints shaping the design.

2. Requirements

Functional Requirements describe what the system must do, stated as verifiable capabilities: "System must process 1000 transactions/second" rather than "System should be fast."

Non-Functional Requirements define quality attributes critical to success:

Performance: Response time targets (P50, P95, P99 latencies), throughput, concurrent user capacity
Scalability: Horizontal scaling limits, traffic spike handling, data volume growth
Availability: Uptime SLA, Recovery Time Objective (RTO), Recovery Point Objective (RPO)
Security: Encryption requirements, PII handling, audit logging, compliance standards
Compliance: Regulatory requirements (PCI-DSS, GDPR, SOC2), data residency

Non-functional requirements constrain design choices and define success beyond feature completion. Missing these requirements leads to rework when production load reveals inadequate capacity or compliance gaps.

Acceptance Criteria for NFRs should be explicit and testable. A good design document states how each NFR will be validated before production rollout.

Example format:

Performance: "P95 latency under 500ms at 1000 req/s" validated via load test in staging.
Availability: "No single-point dependency without failover path" validated by failure-injection test.
Security: "All sensitive fields encrypted at rest and masked in logs" validated by integration tests plus log inspection.
Operability: "Dashboards and alerts exist for top 5 failure modes" validated in pre-production readiness review.
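A latency criterion such as the performance example above is only testable if the percentile calculation is agreed on. The following is a minimal sketch, assuming the nearest-rank percentile definition and latency samples already collected from a load test; the function names are hypothetical, and load-testing tools may use a different interpolation method.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def meets_latency_target(samples_ms: list[float], pct: float, limit_ms: float) -> bool:
    """True when the chosen percentile is strictly under the limit."""
    return percentile(samples_ms, pct) < limit_ms
```

For example, `meets_latency_target(samples_ms, 95, 500)` expresses "P95 latency under 500ms" for one test run; the request rate (1000 req/s) is a property of how the samples were generated and is not checked here.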

3. Proposed Solution

High-Level Approach provides a one-paragraph summary orienting readers before diving into details. Describe the overall strategy: "We will implement an event-driven architecture using Kafka to decouple payment processing from notification delivery, improving resilience and enabling independent scaling."

Architecture Diagrams visualize system structure, component relationships, and data flows.

Diagrams convey structure faster than text. Include component responsibilities, protocols, and data stores. For complex systems, create multiple diagrams at different abstraction levels (system context, service internals, data flow).

Component Breakdown details each component's responsibility, technology stack, APIs, and data storage. This section bridges high-level architecture to implementation specifics:

Payment Service: Processes payment requests, validates input, calls external payment gateway, publishes success/failure events. Built with Java 25 and Spring Boot 3.5+, exposes RESTful API documented in OpenAPI spec, stores transaction records in PostgreSQL.

Data Model shows database schemas, relationships, and indexes. Include both DDL and explanation:

CREATE TABLE payments (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    customer_id UUID NOT NULL,
    amount DECIMAL(19,4) NOT NULL,  -- Store exact decimal values
    currency VARCHAR(3) NOT NULL,
    status VARCHAR(20) NOT NULL,    -- PENDING, COMPLETED, FAILED
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),  -- Time-zone-aware; the API returns UTC timestamps
    updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),

    CONSTRAINT fk_customer FOREIGN KEY (customer_id)
        REFERENCES customers(id) ON DELETE RESTRICT
);

-- Index for customer payment history queries
CREATE INDEX idx_payments_customer_created
    ON payments(customer_id, created_at DESC);

-- Index for payment status filtering and monitoring
CREATE INDEX idx_payments_status
    ON payments(status) WHERE status != 'COMPLETED';

Explain index choices: the idx_payments_customer_created index supports customer payment history queries efficiently by combining customer lookup with chronological ordering. The partial index idx_payments_status excludes completed payments since operational queries focus on pending/failed payments requiring attention.

API Contracts define request/response formats with examples. Use OpenAPI specifications when available, or show representative JSON:

POST /api/v1/payments
{
  "customerId": "550e8400-e29b-41d4-a716-446655440000",
  "amount": 100.00,
  "currency": "USD",
  "description": "Payment for Order #12345"
}

Response: 201 Created
{
  "paymentId": "7c9e6679-7425-40de-944b-e07fc1f90ae7",
  "status": "PENDING",
  "createdAt": "2025-11-23T10:30:00Z"
}

Document error responses and edge cases. API versioning strategies are detailed in API Versioning.
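Error responses are easier to document once the validation rules are written down. The sketch below validates the request body shown above; the `PaymentRequest` type, the `parse_payment_request` function, and the specific rules (positive amount, three-letter uppercase currency) are assumptions for illustration, not part of the documented contract.

```python
import uuid
from dataclasses import dataclass
from decimal import Decimal, InvalidOperation


@dataclass(frozen=True)
class PaymentRequest:
    customer_id: uuid.UUID
    amount: Decimal
    currency: str
    description: str


def parse_payment_request(body: dict) -> PaymentRequest:
    """Validate a decoded JSON body; raise ValueError describing the first problem found."""
    try:
        customer_id = uuid.UUID(str(body["customerId"]))
    except (KeyError, ValueError) as exc:
        raise ValueError("customerId must be a valid UUID") from exc

    try:
        # Convert via str() so a JSON float such as 100.0 does not carry binary rounding noise.
        amount = Decimal(str(body["amount"]))
    except (KeyError, InvalidOperation) as exc:
        raise ValueError("amount must be a decimal number") from exc
    if not amount.is_finite() or amount <= 0:
        raise ValueError("amount must be greater than zero")

    currency = body.get("currency")
    if not (isinstance(currency, str) and len(currency) == 3
            and currency.isalpha() and currency.isupper()):
        raise ValueError("currency must be a three-letter uppercase code")

    return PaymentRequest(customer_id, amount, currency, str(body.get("description", "")))
```

Each `ValueError` message corresponds to one error case the design document should list alongside its HTTP status code.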

State Machines clarify complex state transitions for workflows involving multiple states.

State diagrams make valid transitions explicit, revealing edge cases that text descriptions miss. Document terminal states, retry logic, and timeout behaviors.
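A transition table is one way to make valid transitions explicit in code as well as in a diagram. This sketch reuses the three statuses from the `payments` table above; treating `COMPLETED` and `FAILED` as terminal (a retry would create a new payment) is an assumption made for the example.

```python
from enum import Enum


class PaymentStatus(Enum):
    PENDING = "PENDING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"


# Allowed transitions. COMPLETED and FAILED are terminal in this sketch (assumption).
TRANSITIONS: dict[PaymentStatus, frozenset[PaymentStatus]] = {
    PaymentStatus.PENDING: frozenset({PaymentStatus.COMPLETED, PaymentStatus.FAILED}),
    PaymentStatus.COMPLETED: frozenset(),
    PaymentStatus.FAILED: frozenset(),
}


def transition(current: PaymentStatus, target: PaymentStatus) -> PaymentStatus:
    """Return the target status if the move is allowed, otherwise raise ValueError."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target


def is_terminal(status: PaymentStatus) -> bool:
    """A status with no outgoing transitions is terminal."""
    return not TRANSITIONS[status]
```

Because every status must appear as a key in `TRANSITIONS`, adding a new status forces an explicit decision about its outgoing transitions.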

Sequence Diagrams show interaction timing and flow across components.

Sequence diagrams reveal timing dependencies, error paths, and asynchronous behavior. Show both success and failure scenarios to document error handling.

4. Alternatives Considered

Document each significant alternative with honest evaluation. This section prevents future engineers from proposing already-rejected solutions and demonstrates thorough analysis.

For each alternative, describe the approach, list advantages and disadvantages, and explain the rejection rationale:

Alternative 1: Synchronous Notification via REST

Send email notifications synchronously within the payment transaction instead of publishing events to Kafka.

Pros: Simpler architecture (no message queue), guaranteed notification attempt before response, easier debugging.

Cons: Email service latency (500-2000ms) blocks payment response, email service failures cause payment failures, tight coupling between payment and notification concerns.

Rejected because: Payment response latency unacceptable for user experience (P95 would exceed 2s). Email delivery failures should not prevent payment completion since payment succeeded. Tight coupling prevents independent scaling of notification service.

This level of detail shows decision quality and prevents "why didn't we just..." questions months later. Document alternatives seriously considered, not strawman options.

5. Trade-offs and Risks

Trade-offs Table makes engineering trade-offs explicit:

| Trade-off | Decision | Rationale |
|-----------|----------|-----------|
| Consistency vs Availability | Eventual consistency for notifications | Payment records are immediately consistent, but email notifications may be delayed. Acceptable since payments succeed independently of notifications. |
| Complexity vs Flexibility | Favor simplicity | Implement YAGNI - build for current requirements, not speculative future needs. Kafka provides sufficient flexibility without premature optimization. |
| Cost vs Performance | Optimize for performance | User-facing payment latency justifies infrastructure cost. Budget approved for read replicas and Kafka cluster. |

Trade-offs acknowledge constraints and demonstrate pragmatic decision-making. No perfect solutions exist - only appropriate trade-offs given context.

Risks and Mitigations identify potential problems and mitigation strategies:

| Risk | Probability | Impact | Mitigation |
|------|-------------|--------|------------|
| Database performance degrades under load | Medium | High | Implement connection pooling (HikariCP), add database indexes, provision read replicas for reporting queries, establish query performance monitoring. |
| External payment gateway downtime | High | Medium | Implement circuit breaker pattern (Resilience4j), provide graceful degradation (accept payments in degraded mode), establish retry logic with exponential backoff, monitor gateway SLA metrics. |
| Kafka message queue lag | Low | High | Configure auto-scaling for consumer instances, implement dead letter queue for poison messages, monitor consumer lag metrics, establish lag alerts at 1000 messages. |

Risk management demonstrates operational thinking beyond happy-path implementation. Mitigation strategies show preparedness rather than hope.
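Mitigations such as "retry logic with exponential backoff" are easier to review when the delay schedule is spelled out. The following is a minimal sketch; the function name, the 0.5-second base, the 30-second cap, and the optional "full jitter" variant are assumptions for illustration and are not taken from Resilience4j or any other library.

```python
import random
from typing import Optional


def backoff_delays(
    max_retries: int,
    base_seconds: float = 0.5,
    cap_seconds: float = 30.0,
    jitter: bool = False,
    rng: Optional[random.Random] = None,
) -> list[float]:
    """Delay before each retry: base * 2^attempt, capped at cap_seconds.

    With jitter=True each delay is drawn uniformly from [0, capped delay]
    so that many clients do not retry in lockstep.
    """
    rng = rng if rng is not None else random.Random()
    delays = []
    for attempt in range(max_retries):
        delay = min(cap_seconds, base_seconds * 2 ** attempt)
        if jitter:
            delay = rng.uniform(0, delay)
        delays.append(delay)
    return delays
```

Without jitter, `backoff_delays(5)` yields delays of 0.5, 1, 2, 4, and 8 seconds, for a worst-case added wait of 15.5 seconds. A design document can compare that figure against the caller's timeout budget.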

6. Security and Testing

Security Considerations address authentication, data protection, input validation, audit logging, and vulnerability management. Security requirements integrate into design rather than being retrofitted.

Authentication mechanisms (OAuth 2.0, JWT token expiration, role-based access control), data protection strategies (AES-256 encryption at rest, TLS 1.3 in transit, PII field-level encryption), and input validation approaches (schema validation, parameterized queries, output encoding) require explicit design. Security patterns are detailed in Authentication and Input Validation.

Audit logging must capture who performed what action, when, and from where. Immutable append-only audit logs preserve forensic evidence and satisfy compliance requirements.
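One common way to make an append-only log tamper-evident is to chain entries by hash, so that editing an earlier entry invalidates everything after it. The sketch below is an assumption-laden illustration (the `AuditEntry` fields, the all-zero genesis hash, and the in-memory list are hypothetical); it shows detection only, and real immutability still depends on storage-level controls.

```python
import hashlib
import json
from dataclasses import dataclass

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first entry


@dataclass(frozen=True)
class AuditEntry:
    actor: str       # who
    action: str      # what
    timestamp: str   # when (ISO 8601 string supplied by the caller)
    source_ip: str   # from where
    prev_hash: str
    entry_hash: str


def _digest(actor: str, action: str, timestamp: str, source_ip: str, prev_hash: str) -> str:
    payload = json.dumps([actor, action, timestamp, source_ip, prev_hash], separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_entry(log: list[AuditEntry], actor: str, action: str,
                 timestamp: str, source_ip: str) -> AuditEntry:
    """Append an entry whose hash covers its own fields and the previous entry's hash."""
    prev_hash = log[-1].entry_hash if log else GENESIS_HASH
    entry = AuditEntry(actor, action, timestamp, source_ip, prev_hash,
                       _digest(actor, action, timestamp, source_ip, prev_hash))
    log.append(entry)
    return entry


def verify_chain(log: list[AuditEntry]) -> bool:
    """Recompute every hash; any edited, reordered, or removed entry breaks the chain."""
    prev_hash = GENESIS_HASH
    for entry in log:
        expected = _digest(entry.actor, entry.action, entry.timestamp,
                           entry.source_ip, entry.prev_hash)
        if entry.prev_hash != prev_hash or entry.entry_hash != expected:
            return False
        prev_hash = entry.entry_hash
    return True
```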

Testing Strategy defines coverage targets, testing layers, and verification approaches:

Unit testing targets (>85% code coverage), integration testing approaches (Testcontainers for database and Kafka dependencies, WireMock for external APIs), contract testing strategies (Pact for consumer-driven contracts), and performance testing plans (Gatling load tests targeting 500ms P95 latency at 1000 req/s) establish quality gates before production.

Testing approaches are detailed in Testing Strategy, Unit Testing, and Contract Testing.

7. Deployment and Monitoring

Deployment Strategy outlines rollout phases, feature flags, database migrations, and rollback procedures.

Gradual rollout reduces risk: deploy to dev (Week 1), test environment with QA validation (Week 2), staging with smoke testing (Week 3), production at 10% traffic (Week 4), 50% traffic (Week 5), 100% traffic (Week 6). Feature flags enable quick disablement without redeployment.

Database migrations require backward compatibility analysis. Additive changes (new tables, new columns with defaults) deploy safely, while breaking changes (column removal, type changes) require multi-phase migrations. Flyway migration scripts should be tested in staging with production-sized datasets. Database migration patterns are covered in Database Migrations.

Rollback plans provide safety: toggle feature flag off, redeploy previous version if needed, or run emergency rollback migration (only as last resort). Clear rollback procedures reduce deployment stress.
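The percentage-based rollout and the "toggle feature flag off" rollback described above both rely on a deterministic way to decide which users see the new path. A minimal sketch follows, assuming users are bucketed by a hash of the flag name and user ID; the function names and the flag name in the usage note are hypothetical, and a real feature-flag service would replace this.

```python
import hashlib


def rollout_bucket(flag_name: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(flag_name: str, user_id: str, rollout_percent: int) -> bool:
    """Enable the flag for users whose bucket falls below the rollout percentage."""
    if not 0 <= rollout_percent <= 100:
        raise ValueError("rollout_percent must be between 0 and 100")
    return rollout_bucket(flag_name, user_id) < rollout_percent
```

Because a user's bucket never changes, raising the percentage from 10 to 50 only adds users, and setting it to 0 disables the feature for everyone without a redeployment, for example `is_enabled("new-payment-flow", user_id, 0)`.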

Monitoring and Observability defines metrics, logs, traces, and alerts for operational visibility:

Metrics track request rate, error rate, and latency (RED method) plus business metrics (payments created, success rate, revenue). Infrastructure metrics (CPU, memory, disk, network) establish capacity baselines.

Structured logging in JSON format with correlation IDs enables request tracing across services. Log levels (ERROR for issues requiring action, INFO for audit, DEBUG for troubleshooting) balance signal and noise.
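As an illustration of the structured logging described above, the sketch below emits one JSON object per log line using only the Python standard library. The field names (`correlationId` and so on), the `JsonFormatter` class, and the `build_logger` helper are assumptions for the example, not a logging standard defined in this guide.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S%z"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Set by callers via the `extra` argument; None when absent.
            "correlationId": getattr(record, "correlation_id", None),
        }
        return json.dumps(payload)


def build_logger(name: str, stream) -> logging.Logger:
    """Create a logger that writes JSON lines at INFO level and above to the given stream."""
    logger = logging.getLogger(name)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


# Usage: the DEBUG line is dropped, the INFO line carries the correlation ID.
stream = io.StringIO()
logger = build_logger("payments", stream)
logger.info("payment created", extra={"correlation_id": "req-123"})
logger.debug("gateway payload details")
parsed = json.loads(stream.getvalue())
```

In a service, the correlation ID would normally be read from an incoming request header and attached automatically rather than passed on every call.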

Distributed tracing via OpenTelemetry shows end-to-end request flow through services. Trace sampling balances observability and overhead.

Alerts trigger on actionable conditions: error rate >1% pages on-call, P95 latency >1s warns, queue lag >1000 messages warns, service down pages immediately. Alert thresholds require tuning to minimize false positives. Observability practices are detailed in Observability Overview.
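The thresholds above can be captured as data so that reviewers see exactly which condition pages and which only warns. This is a sketch under stated assumptions: the metric names, the `AlertRule` type, and the `evaluate` function are hypothetical, and real alerting would be configured in the monitoring system rather than in application code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    metric: str
    threshold: float
    severity: str  # "page" or "warn"


# Thresholds taken from the examples in the text; metric names are assumptions.
RULES = [
    AlertRule("error_rate_percent", 1.0, "page"),
    AlertRule("p95_latency_ms", 1000.0, "warn"),
    AlertRule("queue_lag_messages", 1000.0, "warn"),
]


def evaluate(metrics: dict[str, float], rules: list[AlertRule] = RULES) -> list[tuple[str, str]]:
    """Return (severity, metric) for every rule whose threshold is strictly exceeded."""
    fired = []
    for rule in rules:
        value = metrics.get(rule.metric)
        if value is not None and value > rule.threshold:
            fired.append((rule.severity, rule.metric))
    return fired
```

Keeping rules in a table like this also makes threshold tuning a reviewable change rather than an ad hoc edit.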

8. Review, Approval, and Implementation

Review Process ensures design quality through peer feedback:

Draft phase (1-3 days): author creates initial design, shares with 1-2 close collaborators for early feedback, iterates on high-level approach.

Review phase (3-7 days): open pull request with design document, tag relevant reviewers (tech leads, architects, security engineers), present in architecture review meeting if significant, collect written feedback in PR comments, address feedback and update design.

Decision phase: tech lead or architect makes final approval, update status to "Approved", merge pull request, communicate decision to broader team.

Implementation Plan breaks work into sprint-sized chunks with dependencies and milestones. Sprint breakdown shows realistic timelines and helps product planning.

Change Log tracks design evolution: initial draft version, revisions based on feedback, approved version, implementation updates if design deviates. Versioning preserves decision history.

Design Exit Criteria should be explicit before implementation starts:

All critical risks have owner, mitigation, and observable trigger.
NFR acceptance criteria are measurable and mapped to validation steps.
Rollback path is documented and has at least one dry-run in non-production.
Monitoring/alerting requirements are defined with dashboard ownership.
Security and compliance reviewers have signed off where applicable.

Architecture Decision Records (ADRs)

Architecture Decision Records capture focused architectural decisions in a lightweight format (typically 1-3 pages), in contrast to comprehensive design documents. An ADR documents a single decision, its context, alternatives, and consequences without a full system design.

Why ADRs Matter

Knowledge Preservation

When engineers leave or switch projects, decision rationale disappears unless documented. Code reveals "what" was built but not "why." ADRs preserve the thinking behind decisions, enabling future engineers to evaluate whether original constraints still apply before proposing changes.

Consider a decision to use PostgreSQL over MongoDB. Months later, new engineers might propose switching to MongoDB for "better scalability" without realizing the decision hinged on ACID transaction requirements for financial data. An ADR documenting this rationale prevents relitigating the decision.

Decision Quality

Writing an ADR forces structured thinking. Documenting alternatives and trade-offs reveals issues before implementation: "While listing alternatives, I realized the 'simple' approach has a critical flaw." The exercise of articulating decision rationale improves decision quality.

Prevent Repeated Debates

Once a decision is documented with clear rationale, teams avoid rehashing the same discussion six months later. When new information emerges, a new ADR can supersede the old one with updated context, creating a clear evolution trail.

Onboarding Efficiency

New team members read ADRs to understand system evolution and architectural philosophy. This context accelerates onboarding beyond reading code alone.

When to Write ADRs

Write ADRs for decisions that are:

Expensive to Reverse

Database technology choice (PostgreSQL vs MongoDB vs DynamoDB), deployment platform selection (Kubernetes vs AWS ECS vs serverless), programming language adoption for new services, or communication pattern selection (synchronous REST vs asynchronous messaging) carry high switching costs. These decisions warrant documentation because reversal requires significant rework.

Architecturally Significant

Microservices vs monolith architecture, event sourcing vs CRUD state management, multi-tenancy approach (schema-per-tenant vs row-level isolation), or authentication mechanism (OAuth 2.0 vs SAML) define fundamental system characteristics. These pattern-setting decisions shape all future development.

Pattern-Setting

State management approach for frontend (Redux vs Zustand vs Context API), error handling strategy (exceptions vs result types), logging patterns, or API versioning strategy establish conventions other engineers will follow. Documenting these patterns ensures consistency.

Technology Adoption

Major framework adoption (Spring Boot vs Micronaut, React vs Angular), new infrastructure introduction (Redis cache, Kafka message queue), or third-party service integration (payment gateway, email provider) introduce new operational complexity and vendor dependencies.

Do Not Write ADRs For

Skip ADRs for trivial decisions with no long-term impact, easily reversible decisions (CSS framework for small project), implementation details within established patterns, and decisions already comprehensively documented in design documents. An ADR in these cases adds overhead or duplicates existing documentation.

ADR Format

ADRs follow standardized structure for consistency:

# ADR-003: Use PostgreSQL for Payment Transaction Data

**Status**: Accepted
**Date**: 2025-10-15
**Deciders**: [@tech-lead, @dba, @architect, @security-engineer]
**Technical Story**: [JIRA-1234](https://jira.company.com/JIRA-1234)

## Context

We are building a payment processing service requiring persistent data storage for transaction records. This decision affects data integrity, query performance, compliance, and maintainability.

**Requirements**:
- ACID transactions: financial transactions must be atomic and consistent
- Complex queries: reporting requires JOINs, aggregations, filtering
- Strong consistency: cannot tolerate eventual consistency for financial data
- Compliance: must support audit logging and retention policies
- Performance: sub-100ms query latency at 1000 transactions/second

**Constraints**:
- Team has strong PostgreSQL expertise (5 years production experience)
- Team has limited NoSQL experience
- Must integrate with existing Prometheus/Grafana monitoring
- Must support point-in-time recovery for compliance

**Assumptions**:
- Traffic remains primarily write-heavy (80% writes, 20% reads)
- Horizontal scaling unlikely in next 2-3 years
- Read replicas sufficient for read scaling if needed

## Decision

We will use PostgreSQL 15 as the primary data store for payment transaction data.

**Implementation**:
- PostgreSQL 15+ with streaming replication to 2 read replicas
- Connection pooling via HikariCP (max 20 connections)
- Flyway for schema migrations
- Row-level security for multi-tenancy
- JSONB columns for flexible metadata
- Full-text search via PostgreSQL built-in capabilities

## Consequences

### Positive Consequences

**ACID Guarantees**: PostgreSQL provides full ACID compliance. Multi-step transactions (create payment -> update balance -> create audit log) execute atomically, preventing data inconsistencies that eventual consistency systems permit.

**Rich Query Capabilities**: Complex reporting queries using JOINs, aggregations, and window functions are straightforward. SQL knowledge transfers directly, reducing learning curve.

**Team Expertise**: Team has 5 years PostgreSQL production experience, enabling faster development, confident troubleshooting, and minimal learning overhead.

**Mature Ecosystem**: Extensive tooling for monitoring (pg_stat_statements), backup/restore (pg_dump, WAL archiving), performance tuning (pg_stat_activity, EXPLAIN ANALYZE) reduces operational burden.

**Compliance Ready**: Point-in-time recovery via WAL archiving satisfies audit requirements. Row-level security supports data isolation between tenants.

### Negative Consequences

**Horizontal Scaling Complexity**: Scaling PostgreSQL horizontally (sharding) is complex compared to MongoDB or DynamoDB. Requires manual partitioning strategies if single-server limits are reached. However, vertical scaling and read replicas are expected to suffice for the 2-3 year horizon.

**Schema Migration Overhead**: Schema changes require careful migration planning (Flyway scripts, backward compatibility, downtime considerations). Less flexible than schema-less NoSQL, though strong schema prevents data quality issues.

**Vertical Scaling Limits**: Single-server architecture has an eventual vertical scaling limit. Mitigated by using read replicas for read scaling and deferring the sharding decision until data supports the need.

## Alternatives Considered

### Alternative 1: MongoDB

**Description**: Document-oriented NoSQL database with flexible schema and horizontal scaling.

**Pros**: Easy horizontal scaling via sharding, flexible schema (no migrations for adding fields), high write throughput, JSON-native storage.

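**Cons**: ACID guarantees weaker by default (atomicity is document-level unless multi-document transactions, available since MongoDB 4.0, are used, and those carry a performance cost), eventual consistency risks (for example when reading from secondaries or using weak read/write concerns) unacceptable for financial data, team has minimal MongoDB experience (learning curve), complex aggregation queries less intuitive than SQL, reporting queries less mature ($lookup provides only limited JOIN-like support).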

**Rejected because**: ACID guarantees are non-negotiable for payment data. Eventual consistency risks outweigh scaling benefits. Team expertise gap introduces risk.

### Alternative 2: Amazon DynamoDB

**Description**: Fully managed NoSQL key-value/document database with virtually unlimited horizontal scalability.

**Pros**: Fully managed (no server maintenance), virtually unlimited horizontal scaling, predictable performance at any scale.

**Cons**: No native JOIN support (requires denormalization and data duplication), complex queries require secondary indexes (cost multiplier, eventual consistency), query patterns must be known upfront (inflexible for ad-hoc reporting), team has zero DynamoDB experience, vendor lock-in (AWS-specific), cost unpredictable with variable traffic.

**Rejected because**: Query flexibility critical for reporting. Team unfamiliarity high risk. Vendor lock-in concern. Cost model unpredictable.

## References

- [Payment Data Model](https://confluence.company.com/payment-data-model)
- [PostgreSQL Documentation](https://www.postgresql.org/docs/)
- [Benchmark Results](https://confluence.company.com/db-benchmark-2025)

This ADR provides complete context for the decision. Future engineers understand why PostgreSQL was chosen, what alternatives were evaluated, and what trade-offs were accepted.

ADR Status Lifecycle

Proposed: ADR drafted and under review, seeking feedback and consensus, not yet approved for implementation. May be modified based on discussion.

Accepted: Team has approved the decision, implementation may proceed, becomes part of architectural record. Should not be changed - create new ADR instead if decision evolves.

Deprecated: Decision no longer recommended but still in use. Include deprecation date, migration timeline, and reference to superseding ADR if applicable.

Superseded: Decision replaced by newer ADR. Include link to superseding ADR, preserve for historical context showing evolution of thinking.

ADR Workflow

Draft Phase (1-3 days): Engineer facing decision creates ADR file docs/adr/NNN-short-title.md using next sequential number. Fill template with Status: Proposed, document context and problem, research alternatives thoroughly, draft initial recommendation, share with 1-2 collaborators for early feedback.

Review Phase (3-7 days): Open pull request with ADR, tag relevant reviewers (@tech-leads, @architects, @security), present in architecture review meeting if significant, collect written feedback in PR comments, address feedback and update ADR, iterate until consensus reached.

Review guidelines: challenge assumptions (are constraints real or perceived?), evaluate completeness (are all alternatives documented?), assess consequences (are trade-offs understood?), verify alignment (does this fit architectural vision?).

Decision Phase: Tech lead or architect makes final approval, update Status: Accepted, add decision date, merge pull request, communicate decision to broader team (Slack, email, team meeting), update related documentation if needed.

Implementation Phase: Reference ADR in implementation PRs, ensure implementation follows decision, and if significant deviations are discovered, record them with justification as a dated note on the ADR or, when the decision itself changes, in a superseding ADR.

Evolution Phase: When context changes, new information emerges, or a better solution is found, choose the lightest response that fits. For a minor clarification, add a note to the existing ADR. If the decision is no longer valid, create a new ADR to supersede it, update the original ADR's status to Superseded with a reference, and document what changed and why the new decision is better.

ADR Storage and Organization

Store ADRs in codebase alongside the systems they document:

project-root/
|-- docs/
|   `-- adr/                          # Architecture Decision Records
|       |-- README.md                 # ADR index and introduction
|       |-- 001-use-spring-boot.md
|       |-- 002-postgresql-for-transactions.md
|       |-- 003-kafka-event-streaming.md
|       |-- 004-oauth2-authentication.md
|       `-- ...

Naming Convention: NNN-kebab-case-title.md

Use sequential numbering (001, 002, 003). Never reuse numbers even for deleted ADRs - gaps in sequence show decisions were made and later removed, preserving historical awareness. Keep titles short but descriptive. Use kebab-case for consistency.
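The naming convention is simple enough to automate. The sketch below derives the next filename from the files already in `docs/adr/`; the helper names are hypothetical. It takes the highest existing number plus one rather than filling gaps, but note the limitation: if the highest-numbered ADR file has itself been deleted, a directory listing alone cannot tell, so the index in README.md remains the authoritative record of used numbers.

```python
import re


def slugify(title: str) -> str:
    """Lowercase kebab-case: each run of non-alphanumeric characters becomes one hyphen."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def next_adr_filename(existing: list[str], title: str) -> str:
    """Build the next NNN-kebab-case-title.md name from the existing filenames."""
    numbers = [
        int(match.group(1))
        for name in existing
        if (match := re.match(r"(\d{3,})-", name))
    ]
    return f"{max(numbers, default=0) + 1:03d}-{slugify(title)}.md"
```

Given `001-use-spring-boot.md`, `002-postgresql-for-transactions.md`, and `004-oauth2-authentication.md`, a new ADR titled "gRPC for Services" becomes `005-grpc-for-services.md`; the gap at 003 is left alone.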

ADR Index (docs/adr/README.md) provides quick navigation:

# Architecture Decision Records

This directory contains Architecture Decision Records (ADRs) for the Payment Service.

## Active Decisions

| ADR | Title | Status | Date |
|-----|-------|--------|------|
| `001` | Use Spring Boot for Service Framework | Accepted | 2023-03-15 |
| `004` | Adopt OAuth 2.0 for API Authentication | Accepted | 2023-08-20 |

## Superseded Decisions

| ADR | Title | Status | Superseded By | Date |
|-----|-------|--------|---------------|------|
| `002` | Use MongoDB for Data Storage | Superseded | `006` | 2023-05-10 |

The index shows current architectural state at a glance and helps newcomers navigate decision history.
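An index like this tends to drift unless it is generated from the ADR files themselves. The following is a rough sketch, assuming each ADR starts with a `# ADR-NNN: Title` heading and uses the bold `**Status**:` and `**Date**:` header lines shown in the format above; the function names are hypothetical.

```python
import re


def parse_adr_header(text: str) -> dict[str, str]:
    """Extract number, title, status, and date from an ADR in the format shown above."""
    heading = re.search(r"^# ADR-(\d+): (.+)$", text, re.MULTILINE)
    if heading is None:
        raise ValueError("missing '# ADR-NNN: Title' heading")
    # Collect '**Key**: value' lines; only Status and Date are used here.
    fields = dict(re.findall(r"^\*\*([^*]+)\*\*: (.+)$", text, re.MULTILINE))
    return {
        "number": heading.group(1),
        "title": heading.group(2).strip(),
        "status": fields.get("Status", "").strip(),
        "date": fields.get("Date", "").strip(),
    }


def index_row(adr: dict[str, str]) -> str:
    """Format one row of the index table."""
    return f"| `{adr['number']}` | {adr['title']} | {adr['status']} | {adr['date']} |"
```

Running this over every file in `docs/adr/` as part of a documentation build is one option; whether the index is generated or maintained by hand is a team choice.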

Superseding ADRs

ADRs are immutable once accepted - don't edit the decision after approval. Instead, create superseding ADRs documenting evolution.

When to Supersede

Create superseding ADR when: original decision no longer valid due to changed context, better solution discovered after implementation, technology evolved (framework deprecated, better alternatives exist), or business requirements changed significantly.

Process

Write new ADR documenting new decision, reference original ADR in Context section explaining what changed and why new decision is better, update original ADR Status to "Superseded by ADR-XXX", add superseded date to original ADR.
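The last two steps of this process are mechanical edits to the original ADR's header. The sketch below shows one way to script them, assuming the header uses `**Status**:` and `**Date**:` lines as in the ADR format in this guide; the `mark_superseded` function is hypothetical, and the rest of the ADR content is left untouched.

```python
import re


def mark_superseded(adr_text: str, new_adr_id: str, superseded_date: str) -> str:
    """Rewrite the Status line and add a Superseded Date line after the Date line."""
    status_line = re.compile(r"^\*\*Status\*\*: .*$", re.MULTILINE)
    if not status_line.search(adr_text):
        raise ValueError("ADR has no '**Status**:' line")
    updated = status_line.sub(
        lambda _: f"**Status**: Superseded by `{new_adr_id}`", adr_text, count=1
    )
    date_line = re.compile(r"^\*\*Date\*\*: .*$", re.MULTILINE)
    return date_line.sub(
        lambda m: f"{m.group(0)}\n**Superseded Date**: {superseded_date}", updated, count=1
    )
```

Only the header changes; the original decision text stays in place for historical context.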

Example Evolution

Original ADR (docs/adr/012-rest-apis.md):

# ADR-012: Use REST APIs for Service Communication

**Status**: Superseded by `ADR-045`
**Date**: 2023-05-10
**Superseded Date**: 2025-11-01

:::warning[Superseded]
This decision has been superseded by ADR-045. New internal services should use gRPC. Existing REST APIs remain supported but should migrate when practical.
:::

[Original ADR content preserved below for historical context]

Superseding ADR (docs/adr/045-grpc-for-services.md):

# ADR-045: Adopt gRPC for Internal Service Communication

**Status**: Accepted
**Date**: 2025-11-01
**Supersedes**: ADR-012: Use REST APIs for Service Communication

## Context

**Background**: We originally adopted REST APIs for all service communication (ADR-012, May 2023). This served well for external APIs but has limitations for internal service-to-service communication.

**What Changed**:
- Service mesh now in place (increased operational maturity)
- Service-level requirements tightened (99.95% SLA -> 99.99% SLA)
- Type safety issues emerged (runtime errors from contract mismatches)
- Team gained gRPC expertise from mobile team collaboration

**Problem**: REST APIs introduce unnecessary latency and complexity for internal communication: JSON serialization overhead (~15-20ms per request), no compile-time contract validation, manual client generation, limited streaming support.

## Decision

We will adopt gRPC for **internal** service-to-service communication. External public APIs will remain REST/JSON for broad compatibility.

**Scope**: All new internal services use gRPC (required), existing services gradual migration over 12 months, public APIs remain REST (out of scope).

## Consequences

### Positive

**Better Performance**: Binary protocol reduces latency by ~40% measured in staging. Benchmarks show P95 latency dropped from 150ms to 90ms for cross-service calls.

**Type Safety**: Compile-time contract validation prevents integration bugs. Proto file changes caught at build time rather than runtime.

**Streaming Support**: Enables real-time data flows (notifications, event streams) that REST requires workarounds to achieve.

### Negative

**Migration Effort**: Must migrate 15 existing services (~3 weeks per service, 45 weeks total effort). Migration prioritized by traffic volume.

**Learning Curve**: Team needs Protocol Buffers training. Initial velocity slowdown estimated at 20% during first month.

**Debugging Harder**: Binary protocol less human-readable than JSON. Requires tooling (grpcurl, grpc-ui) for manual testing.

## Migration Strategy

**Phase 1** (Months 1-2): New services only. Build gRPC template service, document patterns, train team.

**Phase 2** (Months 3-8): High-value migrations. Priority: Payment -> Fraud Detection (highest traffic), Order -> Inventory (second highest).

**Phase 3** (Months 9-12): Complete migration. Migrate remaining services, decommission REST endpoints for internal use.

**Rollback Plan**: All services maintain dual REST/gRPC support during migration. If critical issues emerge, revert to REST endpoint for affected services.

This superseding ADR explains what changed since the original decision, why the new decision is better, and how to migrate. The original ADR remains in place for historical context.

Best Practices

Start with Problem, Not Solution

Design documents should begin with problem understanding, not jump directly to solution. Reviewers cannot evaluate solution appropriateness without understanding the problem space, constraints, and requirements.

Consider Multiple Alternatives

Evaluating alternatives forces deeper thinking and demonstrates thorough analysis. Document why alternatives were rejected to prevent relitigating decisions later.

Document Trade-offs Honestly

No perfect solutions exist - only appropriate trade-offs. Acknowledging trade-offs demonstrates mature engineering judgment. Hiding downsides undermines trust when issues emerge.

Include Visual Diagrams

Architecture diagrams, sequence diagrams, and state machines communicate structure faster than text. Visual learners grasp concepts more quickly from diagrams. Use Mermaid for version-controlled diagrams.

Make Designs Reviewable

Clear structure, concise writing, and focused scope help reviewers provide quality feedback. Massive documents discourage thorough review.

Update Designs During Implementation

Treat designs as living documents, not contracts set in stone. When implementation reveals better approaches, update the design and document why. Outdated designs mislead future engineers.

Link Design to Code

Once implemented, link design document to relevant code (repository, initial commit, key PRs). This bidirectional linking helps engineers navigate between documentation and implementation.

Review Designs in PRs

Reference design documents in PR descriptions. Reviewers can verify implementation matches approved design. Deviations should be justified in PR comments.

Anti-Patterns

Over-Design (Perfect is Enemy of Good)

Designing for hypothetical future requirements introduces complexity for problems that may never materialize. Follow YAGNI (You Aren't Gonna Need It): solve current requirements, allow future flexibility, but don't build speculative features.

Design in Isolation

Designing without stakeholder input leads to misalignment with requirements. Gather feedback early from tech leads, architects, security engineers, and product owners. Early course correction is cheaper than late redesign.

Ignoring Non-Functional Requirements

Designs focused solely on functional requirements discover performance, security, or scalability issues in production. Define performance targets, security controls, and operational characteristics upfront.

Assuming Context

Designs that assume readers have full context frustrate reviewers and future engineers. Explain background, constraints, and business drivers. Don't assume tribal knowledge.

Code Before Approval

Implementing before design approval wastes effort on approaches that may be rejected. For significant changes, get design approval before substantial coding.

Treating Design as Immutable

Designs that never update during implementation drift from reality. When learning occurs during implementation, update the design. Mark deviations and explain rationale.

Examples

Example: Circuit Breaker Pattern ADR

# ADR-018: Implement Circuit Breaker Pattern for External Payment Gateway

**Status**: Accepted
**Date**: 2025-11-05
**Deciders**: [@tech-lead, @sre-team]
**Technical Story**: [JIRA-5678](https://jira.company.com/JIRA-5678)

## Context

Our payment service integrates with external payment gateway API. Recent incident showed gateway latency spike (200ms -> 15s) exhausted our thread pool within 5 minutes, making entire service unresponsive for 45 minutes until manual restart.

**Problem**: No resilience mechanism for external dependency failures. When payment gateway degrades, our entire service fails.

**Requirements**: Prevent cascading failures, fail fast when dependency unavailable, automatically recover when dependency recovers, provide visibility into circuit state.

**Constraints**: Cannot control payment gateway reliability (third-party), must maintain 99.95% uptime SLA, cannot introduce >10ms latency overhead.

## Decision

We will implement Circuit Breaker pattern using Resilience4j for all payment gateway calls.

**Configuration**:
```yaml
resilience4j.circuitbreaker:
  instances:
    paymentGateway:
      slidingWindowSize: 100              # Track last 100 requests
      failureRateThreshold: 50            # Open if >=50% of calls fail
      waitDurationInOpenState: 30s        # Wait 30s before recovery test
      permittedNumberOfCallsInHalfOpenState: 10   # Trial calls allowed while half-open
      slowCallDurationThreshold: 5s       # Consider >5s as slow
      slowCallRateThreshold: 50           # Open if >=50% of calls are slow
```

**States**: CLOSED (normal operation), OPEN (fail immediately without calling gateway), HALF_OPEN (testing recovery with limited requests).

## Consequences

### Positive

**Prevents Cascading Failures**: Circuit opens once the gateway's failure or slow-call rate crosses the configured threshold. Service remains responsive for other operations.

**Fast Failure**: Requests fail in under 1ms when circuit open instead of waiting 15s for timeout. Clients receive errors immediately.

**Automatic Recovery**: Circuit automatically attempts recovery via half-open state. No manual intervention when dependency recovers.

**Thread Pool Protection**: Failing fast prevents threads blocking on slow responses. Thread pool remains available.

### Negative

**False Positives**: Transient errors might trigger circuit unnecessarily, causing brief service disruption. Mitigated by tuning thresholds based on production traffic patterns.

**Complexity**: Introduces library dependency and configuration requiring understanding of circuit breaker semantics.

**Client Impact**: When circuit open, legitimate requests fail. Clients must implement retry logic or fallback behavior.

**Tuning Required**: Circuit breaker thresholds require tuning based on traffic patterns. Default values may not be optimal.

## Implementation

```java
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;
import org.springframework.stereotype.Service;

// PaymentGatewayApi, PaymentRequest, and PaymentResult are application types defined elsewhere.
@Service
public class PaymentGatewayClient {
    private final CircuitBreaker circuitBreaker;
    private final PaymentGatewayApi gatewayApi;

    public PaymentGatewayClient(CircuitBreakerRegistry registry, PaymentGatewayApi api) {
        // Look up the circuit breaker instance configured under the name "paymentGateway"
        this.circuitBreaker = registry.circuitBreaker("paymentGateway");
        this.gatewayApi = api;
    }

    public PaymentResult processPayment(PaymentRequest request) {
        // Wrap the external call so the circuit breaker records its outcome
        // and rejects it up front while the circuit is open
        return CircuitBreaker.decorateSupplier(circuitBreaker,
            () -> gatewayApi.charge(request)
        ).get();
    }
}
```

The `decorateSupplier` wrapper intercepts calls to the payment gateway. When the circuit is closed, calls pass through normally. When the circuit is open (failure threshold exceeded), the wrapper immediately throws `CallNotPermittedException` without invoking the gateway, preventing thread blocking and enabling fast failure.

**Alerts**: Circuit opened (page on-call, high severity), circuit half-open (warning notification), circuit flapping >3 transitions in 5 min (investigate dependency).

## Alternatives Considered

### Alternative 1: Timeout Only

Configure aggressive timeouts (2s) without circuit breaker.

**Pros**: Simple, no additional library.

**Cons**: Does not prevent cascading failures - threads still block during timeout period. Slow recovery as every request waits for timeout.

**Rejected**: Timeouts alone don't solve thread exhaustion. Circuit breaker provides fail-fast behavior.

### Alternative 2: Manual Killswitch

Feature flag to disable payment gateway integration when issues detected.

**Pros**: Simple, full control.

**Cons**: Requires manual intervention, slow response time (human in loop), doesn't automatically recover.

**Rejected**: Manual intervention too slow (45-minute incident). Automatic recovery preferred.

## References

This ADR shows a decision driven by production incident, demonstrates pattern implementation with code, and explains trade-offs clearly.
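
The CLOSED, OPEN, and HALF_OPEN transitions described in the ADR can be sketched without any library. The class below is a simplified illustration, not Resilience4j's implementation: `SimpleCircuitBreaker`, its constructor parameters, and the injectable clock are hypothetical names chosen for this example. It rejects calls with `IllegalStateException` where Resilience4j throws `CallNotPermittedException`, and it omits slow-call tracking and strict limits on concurrent half-open calls.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Illustrative count-based circuit breaker; a simplified stand-in, not Resilience4j. */
final class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int windowSize;          // calls evaluated while CLOSED
    private final int failureRatePercent;  // trip when the failure rate reaches this value
    private final long waitInOpenMillis;   // how long to stay OPEN before probing
    private final int halfOpenCalls;       // probe calls evaluated while HALF_OPEN
    private final LongSupplier clock;      // injectable time source (milliseconds)

    private final Deque<Boolean> outcomes = new ArrayDeque<>(); // true = failed call
    private State state = State.CLOSED;
    private long openedAt;

    SimpleCircuitBreaker(int windowSize, int failureRatePercent, long waitInOpenMillis,
                         int halfOpenCalls, LongSupplier clock) {
        this.windowSize = windowSize;
        this.failureRatePercent = failureRatePercent;
        this.waitInOpenMillis = waitInOpenMillis;
        this.halfOpenCalls = halfOpenCalls;
        this.clock = clock;
    }

    /** Returns the current state, moving OPEN -> HALF_OPEN once the wait has elapsed. */
    synchronized State state() {
        if (state == State.OPEN && clock.getAsLong() - openedAt >= waitInOpenMillis) {
            transitionTo(State.HALF_OPEN);
        }
        return state;
    }

    /** Runs the action unless the circuit is OPEN, and records whether it failed. */
    <T> T call(Supplier<T> action) {
        if (state() == State.OPEN) {
            throw new IllegalStateException("circuit open: call rejected without invoking dependency");
        }
        try {
            T result = action.get();
            record(false);
            return result;
        } catch (RuntimeException e) {
            record(true);
            throw e;
        }
    }

    private synchronized void record(boolean failed) {
        if (state == State.OPEN) {
            return; // a concurrent call already tripped the circuit
        }
        int limit = (state == State.HALF_OPEN) ? halfOpenCalls : windowSize;
        outcomes.addLast(failed);
        if (outcomes.size() > limit) {
            outcomes.removeFirst(); // keep only the most recent calls
        }
        if (outcomes.size() < limit) {
            return; // not enough calls to judge yet
        }
        long failures = outcomes.stream().filter(Boolean::booleanValue).count();
        if (failures * 100 >= (long) failureRatePercent * outcomes.size()) {
            transitionTo(State.OPEN);
        } else if (state == State.HALF_OPEN) {
            transitionTo(State.CLOSED); // probes succeeded, resume normal operation
        }
    }

    private void transitionTo(State next) {
        state = next;
        outcomes.clear();
        if (next == State.OPEN) {
            openedAt = clock.getAsLong();
        }
    }

    public static void main(String[] args) {
        long[] now = {0L}; // fake clock so the demo does not need to sleep
        SimpleCircuitBreaker breaker = new SimpleCircuitBreaker(4, 50, 30_000L, 2, () -> now[0]);
        for (int i = 0; i < 4; i++) {
            try {
                breaker.call(() -> { throw new RuntimeException("gateway timeout"); });
            } catch (RuntimeException e) {
                // failure already recorded by the breaker
            }
        }
        System.out.println(breaker.state());
        now[0] = 30_000L; // wait duration elapsed
        System.out.println(breaker.state());
    }
}
```

Injecting the clock keeps the state machine testable without real delays. In production code the library's own implementation should be preferred, since it also handles slow calls, metrics, and concurrency limits in the half-open state.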

---

## Further Reading

### Internal Guidelines

- [Pull Request Best Practices](./pull-requests.md) - How to document design decisions in PRs. Design documents establish the "what" and "why"; PR descriptions show the "how" implementation achieves it.
- [Code Review Process](./code-review.md) - Reviewing architectural decisions during code review. Verifying implementation matches approved design.
- [Technical Writing Guide](./technical-writing.md) - Documentation best practices for clarity and consistency.
- [Branching Strategy](./git/branching-strategy.md) - GitFlow workflow for coordinating design and implementation branches.
- [Definition of Done](./definition-of-done.md) - Quality criteria including design documentation requirements.

### External Resources

**Technical Design**:
- [Design Docs at Google](https://www.industrialempathy.com/posts/design-docs-at-google/) - How Google engineers write design documents
- [Technical Design Document Template (AWS)](https://aws.amazon.com/blogs/apn/the-aws-well-architected-framework-the-cloud-architects-must-have-checklist/) - AWS Well-Architected Framework approach

**Architecture Decision Records**:
- [ADR Templates and Examples](https://github.com/joelparkerhenderson/architecture-decision-record) - Collection of ADR templates and examples maintained by Joel Parker Henderson, including Michael Nygard's original format
- [ADR GitHub Organization](https://adr.github.io/) - Comprehensive ADR resources, tools, and examples
- [Documenting Architecture Decisions (ThoughtWorks)](https://www.thoughtworks.com/insights/blog/architecture/documenting-architecture-decisions) - Philosophy and benefits of ADRs
- [When to Write an ADR](https://engineering.atspotify.com/2020/04/when-should-i-write-an-architecture-decision-record/) - Spotify's guidance on ADR usage

**Architecture and Design**:
- [C4 Model for Software Architecture](https://c4model.com/) - Visual architecture documentation approach
- [Software Architecture Patterns (O'Reilly)](https://www.oreilly.com/library/view/software-architecture-patterns/9781491971437/) - Common architectural patterns

---

## Related Guidelines

- **Security**: [Security Overview](../security/security-overview.md), [Authentication](../security/authentication.md), [Data Protection](../security/data-protection.md) - Security considerations for design
- **API Design**: [API Design](/api/api-overview), [API Versioning](/api/rest/rest-versioning), [API Contracts](../api/contracts/api-contracts.md) - API architecture patterns
- **Data**: [Database Design](../data/database-design.md), [Database Migrations](../data/database-migrations.md) - Data architecture considerations
- **Testing**: [Testing Strategy](../testing/testing-strategy.md), [Performance Testing](../performance/performance-testing.md), [Contract Testing](../testing/contract-testing.md) - Testing approaches for designs
- **Observability**: [Observability Overview](../observability/observability-overview.md), [Logging](../observability/logging.md), [Metrics](../observability/metrics.md) - Operational design concerns

---

## Summary

**Key Takeaways**:

1. **Write technical designs for significant changes**: New services, architectural changes, cross-team integrations, performance-critical work, security-sensitive modifications, and complex business logic require comprehensive design documents.

2. **Use ADRs for focused decisions**: Database technology choice, framework adoption, architectural patterns, and communication protocols are well-suited to lightweight ADR format.

3. **Follow consistent structure**: Overview -> Requirements -> Solution -> Alternatives -> Trade-offs -> Security/Testing -> Deployment -> Review ensures completeness and reviewability.

4. **Include visual diagrams**: Architecture diagrams, sequence diagrams, and state machines communicate faster than text. Use Mermaid for version-controlled diagrams.

5. **Document alternatives and trade-offs**: Explain why you chose this approach over alternatives. Honest trade-off discussion demonstrates engineering maturity.

6. **Get approval before implementing**: Design reviews catch issues faster and cheaper than code reviews. For significant changes, invest in design approval upfront.

7. **Keep designs current**: Update documentation when implementation deviates. Stale designs mislead future engineers.

8. **ADRs are immutable once accepted**: Don't edit approved ADRs. Create superseding ADRs documenting evolution with clear rationale for changes.

9. **Start with problem understanding**: Begin with problem statement, context, and requirements. Reviewers cannot evaluate solution appropriateness without understanding the problem.

10. **Make designs actionable**: Include success criteria, implementation milestones, rollout plans, and rollback procedures. Design documents should enable implementation, not just describe it.
