Database Schema Design
Guidelines for designing robust, performant, and maintainable relational database schemas.
Overview
Database schema design is the foundation of data integrity and application performance. A well-designed schema enforces business rules through constraints, supports efficient queries through appropriate indexing, and scales as data volume grows. Poor schema design leads to data anomalies, slow queries, and expensive refactoring efforts.
This guide focuses on PostgreSQL, a powerful open-source relational database with full ACID compliance, but most principles apply to any relational database. For ORM-specific patterns, see Database ORM. For schema evolution, see Database Migrations.
Core Principles
- Normalize to Eliminate Redundancy: Use normal forms to prevent update anomalies and data inconsistencies
- Denormalize for Performance: Strategically duplicate data when read performance is critical
- Enforce Constraints at Database Level: Use primary keys, foreign keys, check constraints, and unique constraints
- Index for Query Patterns: Create indexes based on actual query access patterns, not assumptions
- Choose Appropriate Data Types: Use the most specific type that fits your data (avoid generic TEXT for everything)
- Plan for Audit Requirements: Include created/updated timestamps and user tracking from the start
- Version Control Schema: Treat schema as code with migrations in version control
PostgreSQL: ACID Guarantees
PostgreSQL provides robust ACID guarantees essential for applications requiring data integrity:
ACID Explained:
-
Atomicity: Transactions are all-or-nothing operations. If any part of a transaction fails, the entire transaction rolls back, leaving the database unchanged. This prevents partial updates that would leave data in an inconsistent state.
-
Consistency: Transactions move the database from one valid state to another. All constraints (primary keys, foreign keys, check constraints) are enforced before a transaction commits. Invalid data cannot be persisted.
-
Isolation: Concurrent transactions execute as if they were serialized. Different isolation levels balance between performance and consistency. PostgreSQL supports Read Uncommitted, Read Committed (default), Repeatable Read, and Serializable isolation levels.
-
Durability: Once a transaction commits, the changes are permanent even if the system crashes immediately after. PostgreSQL uses Write-Ahead Logging (WAL) to ensure committed transactions are written to disk before acknowledging success.
For transaction management in Spring Boot applications, see Spring Boot Data Access.
Table Design
Basic Table Structure
Every table should follow these conventions:
CREATE TABLE orders (
-- Primary key (UUID for distributed systems, BIGSERIAL for single instance)
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
-- Business fields
customer_id UUID NOT NULL,
total_amount DECIMAL(19,4) NOT NULL CHECK (total_amount >= 0),
status VARCHAR(20) NOT NULL,
notes TEXT,
-- Audit fields (essential for tracking changes)
created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
created_by VARCHAR(100) NOT NULL,
updated_by VARCHAR(100) NOT NULL,
-- Optimistic locking
version BIGINT NOT NULL DEFAULT 0
);
Key Design Decisions:
-
Primary Key Strategy: UUIDs prevent ID collisions in distributed systems but are larger (16 bytes vs 8 bytes for BIGINT). Use BIGSERIAL for single-database systems for better performance and smaller indexes.
-
Audit Fields: Always include
created_at,updated_at,created_by, andupdated_by. These fields are crucial for debugging, compliance, and understanding data history. They should be set automatically by triggers or ORM frameworks (see Database ORM). -
Version Column: Enables optimistic locking to prevent lost updates when multiple users edit the same record concurrently. The application checks the version hasn't changed before committing updates.
Data Type Selection
Choose the most specific data type for your domain:
-- GOOD: Appropriate types
CREATE TABLE products (
id BIGSERIAL PRIMARY KEY,
sku VARCHAR(50) NOT NULL, -- Fixed max length for SKU
name VARCHAR(200) NOT NULL, -- Reasonable limit for product names
description TEXT, -- Unlimited text for descriptions
price DECIMAL(19,4) NOT NULL, -- Precise decimal for money
quantity INTEGER NOT NULL, -- Whole numbers for quantities
is_active BOOLEAN NOT NULL DEFAULT true,
metadata JSONB, -- Structured data with indexing support
tags TEXT[], -- Array for multiple values
published_at TIMESTAMP WITH TIME ZONE -- Store timezone for global apps
);
-- BAD: Generic types everywhere
CREATE TABLE products (
id TEXT PRIMARY KEY, -- Inefficient, no type safety
sku TEXT, -- No length constraint
name TEXT, -- No length constraint
price FLOAT, -- NEVER use FLOAT for money (rounding errors)
quantity TEXT, -- Should be INTEGER
is_active TEXT, -- Should be BOOLEAN
metadata TEXT, -- JSON as text loses indexing capability
published_at TIMESTAMP -- Missing timezone information
);
Decimal vs Float for Money: Floating-point numbers (FLOAT, DOUBLE PRECISION) cannot accurately represent decimal values like 0.1 in binary, leading to rounding errors. Always use DECIMAL or NUMERIC for monetary values where precision matters.
JSONB vs JSON: PostgreSQL offers both JSON (stores exact text) and JSONB (stores binary format). Use JSONB for better query performance and indexing capabilities. JSONB supports GIN indexes for efficient queries on nested fields.
Normalization
Normalization organizes data to eliminate redundancy and prevent update anomalies. Understanding normal forms helps you design schemas that maintain data integrity.
First Normal Form (1NF)
Rule: Each column contains atomic (indivisible) values, and each row is unique.
BAD: -- VIOLATES 1NF: Multiple values in single column
CREATE TABLE orders_bad (
id UUID PRIMARY KEY,
customer_name VARCHAR(200),
product_ids TEXT -- "prod-1,prod-2,prod-3" - comma-separated!
);
GOOD: -- SATISFIES 1NF: Atomic values, separate table for many-to-many
CREATE TABLE orders (
id UUID PRIMARY KEY,
customer_id UUID NOT NULL
);
CREATE TABLE order_items (
id UUID PRIMARY KEY,
order_id UUID NOT NULL REFERENCES orders(id),
product_id UUID NOT NULL REFERENCES products(id),
quantity INTEGER NOT NULL
);
Second Normal Form (2NF)
Rule: Satisfies 1NF, and no non-key column depends on only part of the primary key (applies to composite keys).
BAD: -- VIOLATES 2NF: product_name depends only on product_id, not full key
CREATE TABLE order_items_bad (
order_id UUID,
product_id UUID,
product_name VARCHAR(200), -- Depends only on product_id
quantity INTEGER,
PRIMARY KEY (order_id, product_id)
);
GOOD: -- SATISFIES 2NF: product_name moved to products table
CREATE TABLE order_items (
order_id UUID,
product_id UUID REFERENCES products(id),
quantity INTEGER,
PRIMARY KEY (order_id, product_id)
);
CREATE TABLE products (
id UUID PRIMARY KEY,
name VARCHAR(200) NOT NULL
);
Third Normal Form (3NF)
Rule: Satisfies 2NF, and no non-key column depends on another non-key column (no transitive dependencies).
BAD: -- VIOLATES 3NF: country_name depends on country_code, not on customer_id
CREATE TABLE customers_bad (
id UUID PRIMARY KEY,
name VARCHAR(200),
country_code CHAR(2),
country_name VARCHAR(100) -- Transitive dependency: id -> country_code -> country_name
);
GOOD: -- SATISFIES 3NF: country_name in separate table
CREATE TABLE customers (
id UUID PRIMARY KEY,
name VARCHAR(200),
country_code CHAR(2) REFERENCES countries(code)
);
CREATE TABLE countries (
code CHAR(2) PRIMARY KEY,
name VARCHAR(100) NOT NULL
);
When to Denormalize: While normalization prevents data anomalies, sometimes controlled denormalization improves read performance. Common scenarios:
- Materialized views: Pre-compute expensive joins for reporting
- Caching computed values: Store aggregations (e.g., order total) to avoid recalculating
- Read-heavy workloads: Duplicate data when reads vastly outnumber writes
Always denormalize intentionally with a clear understanding of the trade-offs (data consistency vs performance).
Constraints
Constraints enforce data integrity at the database level, preventing invalid data from being stored. They are your first line of defense against data corruption.
Primary Keys
-- UUID primary key (distributed systems)
CREATE TABLE orders (
id UUID PRIMARY KEY DEFAULT gen_random_uuid()
);
-- BIGSERIAL primary key (single instance, better performance)
CREATE TABLE customers (
id BIGSERIAL PRIMARY KEY
);
-- Composite primary key (many-to-many relationships)
CREATE TABLE order_items (
order_id UUID NOT NULL,
product_id UUID NOT NULL,
PRIMARY KEY (order_id, product_id)
);
Foreign Keys
Foreign keys maintain referential integrity between tables. The ON DELETE and ON UPDATE clauses control cascading behavior:
CREATE TABLE order_items (
id UUID PRIMARY KEY,
order_id UUID NOT NULL,
product_id UUID NOT NULL,
-- RESTRICT: Prevent deletion if referenced (safe default)
CONSTRAINT fk_order FOREIGN KEY (order_id)
REFERENCES orders(id) ON DELETE RESTRICT,
-- CASCADE: Delete child records when parent is deleted
CONSTRAINT fk_product FOREIGN KEY (product_id)
REFERENCES products(id) ON DELETE CASCADE,
-- SET NULL: Set to null when parent is deleted
-- SET DEFAULT: Set to default value when parent is deleted
-- NO ACTION: Check constraint at end of transaction
);
Cascade Rules Explained:
- RESTRICT: Prevents deletion of parent if children exist (safest option, prevents accidental data loss)
- CASCADE: Automatically deletes children when parent is deleted (use carefully, can delete large amounts of data)
- SET NULL: Sets foreign key to NULL when parent is deleted (requires column to be nullable)
- SET DEFAULT: Sets foreign key to default value when parent is deleted
- NO ACTION: Similar to RESTRICT but checks constraint at transaction end, not immediately
Check Constraints
CREATE TABLE products (
id UUID PRIMARY KEY,
price DECIMAL(19,4) NOT NULL CHECK (price >= 0),
discount_percent INTEGER CHECK (discount_percent BETWEEN 0 AND 100),
status VARCHAR(20) NOT NULL CHECK (status IN ('DRAFT', 'ACTIVE', 'ARCHIVED')),
start_date DATE,
end_date DATE,
CONSTRAINT chk_date_range CHECK (end_date IS NULL OR end_date >= start_date)
);
Check constraints enforce business rules at the database level. Even if application code has bugs, invalid data cannot be inserted.
Unique Constraints
CREATE TABLE users (
id UUID PRIMARY KEY,
email VARCHAR(255) NOT NULL,
username VARCHAR(50) NOT NULL,
-- Single column unique
CONSTRAINT uk_email UNIQUE (email),
-- Composite unique (username is unique within account)
account_id UUID NOT NULL,
CONSTRAINT uk_account_username UNIQUE (account_id, username)
);
-- Partial unique index (conditional uniqueness)
CREATE UNIQUE INDEX uk_active_email ON users (email)
WHERE deleted_at IS NULL;
Indexing Strategy
Indexes dramatically improve query performance by allowing the database to quickly locate rows without scanning the entire table. However, indexes have costs: they consume disk space and slow down writes (INSERT, UPDATE, DELETE) because indexes must be maintained.
When to Create Indexes
Create indexes for:
- Primary keys: Automatically indexed
- Foreign keys: Essential for join performance (not automatically indexed in PostgreSQL)
- WHERE clause columns: Columns frequently used for filtering
- ORDER BY columns: Columns used for sorting results
- JOIN columns: Columns used in join conditions
- Unique constraints: Enforced through unique indexes
Don't over-index:
- Each index adds overhead to writes
- Indexes consume disk space
- PostgreSQL query planner may ignore indexes if table is small
Index Types
-- B-Tree (default, most common - supports equality and range queries)
CREATE INDEX idx_orders_created_at ON orders(created_at);
CREATE INDEX idx_orders_customer ON orders(customer_id);
-- Partial index (smaller, faster for specific conditions)
CREATE INDEX idx_active_orders ON orders(customer_id)
WHERE status = 'PENDING';
-- Composite index (order matters! Leftmost prefix rule)
CREATE INDEX idx_customer_status_date ON orders(customer_id, status, created_at DESC);
-- This index can be used for queries filtering on:
-- - customer_id alone
-- - customer_id + status
-- - customer_id + status + created_at
-- But NOT for status alone or created_at alone
-- GIN index (for full-text search and JSONB)
CREATE INDEX idx_product_metadata ON products USING GIN (metadata);
-- Hash index (equality only, smaller than B-tree)
CREATE INDEX idx_user_email_hash ON users USING HASH (email);
B-Tree Index: The default index type. B-tree (balanced tree) indexes support equality (=), comparison (<, >, <=, >=), range (BETWEEN), and pattern matching (LIKE 'prefix%'). They are sorted, making them efficient for ORDER BY queries.
GIN (Generalized Inverted Index): Designed for indexing composite values like arrays, JSONB documents, and full-text search. GIN indexes are larger but essential for queries on nested JSON fields or array contains operations.
Partial Index: Indexes only rows matching a WHERE condition. Smaller and faster than full indexes when you consistently query a subset of data (e.g., only active records).
Composite Index Leftmost Prefix Rule: PostgreSQL can use a composite index (a, b, c) for queries filtering on a, a + b, or a + b + c, but not for queries filtering only on b or c. The order of columns in a composite index matters.
Analyzing Index Usage
-- Find missing indexes (queries doing sequential scans)
SELECT schemaname, tablename, seq_scan, seq_tup_read, idx_scan, idx_tup_fetch
FROM pg_stat_user_tables
WHERE seq_scan > 0
ORDER BY seq_tup_read DESC
LIMIT 10;
-- Find unused indexes (candidates for removal)
SELECT schemaname, tablename, indexname, idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC
LIMIT 10;
-- Explain query execution plan
EXPLAIN ANALYZE
SELECT * FROM orders
WHERE customer_id = 'some-uuid'
AND status = 'PENDING';
Use EXPLAIN ANALYZE to understand whether PostgreSQL is using your indexes. Look for "Index Scan" (good) vs "Seq Scan" (potentially slow for large tables). For query optimization patterns, see Spring Boot Data Access.
Enum Types and Status Fields
PostgreSQL Enum Types
-- Create enum type
CREATE TYPE order_status AS ENUM ('DRAFT', 'PENDING', 'PROCESSING', 'COMPLETED', 'CANCELLED');
CREATE TABLE orders (
id UUID PRIMARY KEY,
status order_status NOT NULL DEFAULT 'DRAFT'
);
-- Query using enum
SELECT * FROM orders WHERE status = 'PENDING';
Enum Trade-offs:
- [GOOD] Type safety: Invalid values rejected at database level
- [GOOD] Storage efficient: Stored as integers internally
- [BAD] Difficult to modify: Adding/removing values requires schema changes
- [BAD] Ordering matters: Enums are compared based on definition order
Alternative: Check Constraints
CREATE TABLE orders (
id UUID PRIMARY KEY,
status VARCHAR(20) NOT NULL CHECK (status IN ('DRAFT', 'PENDING', 'PROCESSING', 'COMPLETED', 'CANCELLED'))
);
Check Constraint Trade-offs:
- [GOOD] Easier to modify: Change constraint without data migration
- [GOOD] More flexible: Can easily add/remove values
- [BAD] Less type safety: No custom type in queries
- [BAD] Less efficient: Stored as strings
For ORM enum mapping patterns (JPA @Enumerated), see Database ORM.
Triggers and Generated Columns
Auto-Update Timestamp Trigger
CREATE OR REPLACE FUNCTION update_updated_at_column()
RETURNS TRIGGER AS $$
BEGIN
NEW.updated_at = CURRENT_TIMESTAMP;
RETURN NEW;
END;
$$ language 'plpgsql';
CREATE TRIGGER update_orders_updated_at
BEFORE UPDATE ON orders
FOR EACH ROW
EXECUTE FUNCTION update_updated_at_column();
This trigger automatically sets updated_at whenever a row is modified, preventing stale timestamps from application bugs.
Generated Columns (PostgreSQL 12+)
CREATE TABLE products (
id UUID PRIMARY KEY,
price DECIMAL(19,4) NOT NULL,
discount_percent INTEGER NOT NULL DEFAULT 0,
-- Generated column (computed on read, not stored)
discounted_price DECIMAL(19,4) GENERATED ALWAYS AS (
price * (1 - discount_percent::DECIMAL / 100)
) STORED
);
Generated vs Stored Computed Values:
- STORED: Computed on write, stored on disk, can be indexed
- VIRTUAL (not supported by PostgreSQL): Computed on read, not stored
Summary
Key Takeaways:
- ACID guarantees prevent data corruption and ensure consistency in concurrent environments
- Normalize to 3NF to eliminate redundancy, denormalize intentionally for performance
- Use constraints (primary keys, foreign keys, check constraints) to enforce data integrity at the database level
- Index strategically based on actual query patterns, not assumptions
- Choose appropriate data types: DECIMAL for money, JSONB for structured data, UUID or BIGSERIAL for primary keys
- Always include audit fields: created_at, updated_at, created_by, updated_by, version
- Foreign key cascade rules control what happens when parent records are deleted
- Composite index order matters: Leftmost prefix rule determines which queries can use the index
- Analyze index usage with EXPLAIN ANALYZE to ensure queries are efficient
- Triggers automate common tasks like updating timestamps, but add complexity
Related Guides:
- Database ORM - JPA/Hibernate entity mapping, repositories, transactions
- Database Migrations - Flyway versioning and zero-downtime migrations
- Spring Boot Data Access - Integration patterns
- Spring Boot Testing - Database testing with TestContainers