Data & Database Overview

Data management encompasses how applications store, retrieve, and manipulate persistent information. This includes database design, ORM usage, migration strategies, caching, and search implementations.

Why Data Management Matters

Poor data management creates compounding problems:

  • Slow queries degrade user experience
  • Data inconsistencies lead to bugs and customer issues
  • Schema complexity makes changes risky and expensive
  • Lack of scalability limits business growth

Good data management enables:

  • Fast, predictable performance even as data grows
  • Data integrity through constraints and transactions
  • Easy schema evolution as requirements change
  • Horizontal scalability when vertical limits are reached

Database Fundamentals

Relational Databases

Our primary data store. Relational databases organize data into tables with defined relationships.

Key strengths:

  • ACID transactions: Atomicity, Consistency, Isolation, Durability
  • Schema enforcement: Data integrity through constraints
  • Powerful queries: SQL supports complex queries and joins
  • Mature ecosystem: Well-understood, extensive tooling

We use:

  • PostgreSQL (preferred): Advanced features, JSON support, strong consistency
  • MySQL: Where legacy systems require it
  • H2: For testing and local development

-- Example: Well-designed relational schema
CREATE TABLE accounts (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    account_number VARCHAR(20) UNIQUE NOT NULL,
    owner_id UUID NOT NULL REFERENCES users(id),
    balance DECIMAL(19,4) NOT NULL DEFAULT 0 CHECK (balance >= 0),
    currency CHAR(3) NOT NULL,
    status VARCHAR(20) NOT NULL,
    created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,

    CONSTRAINT valid_status CHECK (status IN ('ACTIVE', 'SUSPENDED', 'CLOSED'))
);

CREATE INDEX idx_accounts_owner ON accounts(owner_id);
CREATE INDEX idx_accounts_status ON accounts(status) WHERE status = 'ACTIVE';

Key features demonstrated:

  • Primary key: UUID for global uniqueness
  • Foreign keys: Referential integrity (owner_id references users)
  • Constraints: CHECK constraints ensure valid data
  • Indexes: Speed up common queries
  • Timestamps: Track creation and modification

See Database Design for comprehensive patterns.

NoSQL Databases

Specialized databases for specific use cases.

Document databases (MongoDB, DynamoDB):

  • Schema-flexible JSON documents
  • Good for rapidly evolving schemas
  • Trade consistency for availability/partition tolerance

Key-value stores (Redis, DynamoDB):

  • Simple key-value pairs
  • Extremely fast reads/writes
  • Often used for caching

When to use NoSQL:

  • Caching layer (Redis)
  • Session storage
  • Real-time analytics
  • Highly variable schema

When NOT to use NoSQL:

  • Financial transactions (use PostgreSQL with ACID)
  • Complex relationships (relational better)
  • Strong consistency requirements

Recommendation: Default to PostgreSQL. Use NoSQL only when specific requirements justify it.

Data Modeling

Normalization

Organize data to reduce redundancy and improve integrity.

Normal forms (progressive levels of organization):

1NF (First Normal Form): Atomic values, no repeating groups

-- Violates 1NF: Multiple values in one field
CREATE TABLE users (
    id INT PRIMARY KEY,
    name VARCHAR(100),
    phone_numbers VARCHAR(200) -- "555-1234, 555-5678, 555-9012"
);

-- 1NF: Separate table for phone numbers
CREATE TABLE user_phone_numbers (
    id INT PRIMARY KEY,
    user_id INT REFERENCES users(id),
    phone_number VARCHAR(20)
);

2NF (Second Normal Form): 1NF + no partial dependencies (all non-key attributes depend on entire primary key)

3NF (Third Normal Form): 2NF + no transitive dependencies (non-key attributes don't depend on other non-key attributes)

-- Violates 3NF: city_name depends on zip_code, not directly on id
CREATE TABLE addresses (
    id INT PRIMARY KEY,
    street VARCHAR(200),
    zip_code VARCHAR(10),
    city_name VARCHAR(100) -- Depends on zip_code, not id
);

-- 3NF: Separate cities table
CREATE TABLE cities (
    zip_code VARCHAR(10) PRIMARY KEY,
    city_name VARCHAR(100)
);

CREATE TABLE addresses (
    id INT PRIMARY KEY,
    street VARCHAR(200),
    zip_code VARCHAR(10) REFERENCES cities(zip_code)
);

When to denormalize: Introduce redundancy deliberately where performance demands it, typically in:

  • Read-heavy systems where joins are expensive
  • Reporting databases
  • Caching layers

See Database Design for normalization patterns.

Relationships

One-to-Many (most common):

-- One user has many orders
CREATE TABLE orders (
    id UUID PRIMARY KEY,
    user_id UUID NOT NULL REFERENCES users(id),
    total DECIMAL(19,4)
);

Many-to-Many (requires junction table):

-- Many users can have many roles
CREATE TABLE user_roles (
    user_id UUID REFERENCES users(id),
    role_id UUID REFERENCES roles(id),
    granted_at TIMESTAMP NOT NULL,
    PRIMARY KEY (user_id, role_id)
);

One-to-One (less common):

-- One user has one profile
CREATE TABLE user_profiles (
    user_id UUID PRIMARY KEY REFERENCES users(id),
    bio TEXT,
    avatar_url VARCHAR(500)
);

Object-Relational Mapping (ORM)

ORMs bridge the gap between object-oriented code and relational databases, translating between database rows and application objects.

ORM Concepts

Entity mapping: Classes map to database tables, with properties mapping to columns:

Account Entity
├── id (UUID) → accounts.id
├── accountNumber (String) → accounts.account_number
├── owner (User) → accounts.owner_id (foreign key)
├── balance (Decimal) → accounts.balance
├── status (Enum) → accounts.status
├── createdAt (Timestamp) → accounts.created_at
└── updatedAt (Timestamp) → accounts.updated_at

Relationship mapping: ORMs represent database relationships as object references:

  • One-to-Many: Parent holds collection of children (User has Orders)
  • Many-to-One: Child holds reference to parent (Order references User)
  • Many-to-Many: Both sides hold collections via junction table

Key ORM concerns:

  • Identity management: ORM tracks which objects correspond to which rows
  • Change detection: ORM detects modified objects for persistence
  • Lazy loading: Related objects loaded only when accessed
  • Caching: First-level cache (session) and second-level cache (shared)
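The identity-management and change-detection concerns can be sketched in plain Python. This toy `Session` is illustrative only (real ORMs such as Hibernate or SQLAlchemy expose different APIs): the identity map guarantees one object per row, and a load-time snapshot drives dirty checking.

```python
import copy

class Session:
    """Toy unit of work: identity map plus snapshot-based change detection.
    Illustrative only -- not any real ORM's API."""

    def __init__(self):
        self._identity_map = {}  # primary key -> managed object
        self._snapshots = {}     # primary key -> attributes at load time

    def attach(self, entity):
        """Register a row loaded from the database; dedupe by primary key."""
        if entity.id in self._identity_map:
            return self._identity_map[entity.id]  # same row -> same object
        self._identity_map[entity.id] = entity
        self._snapshots[entity.id] = copy.copy(entity.__dict__)
        return entity

    def dirty(self):
        """Entities changed since load (candidates for an UPDATE at commit)."""
        return [e for pk, e in self._identity_map.items()
                if e.__dict__ != self._snapshots[pk]]

class User:
    def __init__(self, id, email):
        self.id, self.email = id, email
```

Attaching the same row twice returns the same object, and any attribute change makes the entity "dirty" without an explicit save call — which is exactly the behavior behind the unintended-updates pitfall below.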

Common ORM Pitfalls

N+1 Query Problem:

When loading a collection of objects with relationships, the ORM executes one query to fetch the parent objects, then N additional queries to fetch related objects - one per parent.

Query 1: SELECT * FROM users (returns 100 users)
Query 2: SELECT * FROM orders WHERE user_id = 1
Query 3: SELECT * FROM orders WHERE user_id = 2
...
Query 101: SELECT * FROM orders WHERE user_id = 100

Solution: Use eager loading or explicit joins to fetch related data in a single query:

-- Single query fetches users and their orders
SELECT u.*, o.*
FROM users u
LEFT JOIN orders o ON o.user_id = u.id;
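The N+1 pattern and its single-query fix can be demonstrated end to end with Python's built-in sqlite3 module (table names match the example; the data is made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
    INSERT INTO users  VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 20.0), (3, 2, 5.0);
""")

def orders_n_plus_one():
    # The anti-pattern: 1 query for users + N queries for their orders
    result = {}
    users = conn.execute("SELECT id, name FROM users ORDER BY id").fetchall()
    for user_id, name in users:
        rows = conn.execute(
            "SELECT total FROM orders WHERE user_id = ? ORDER BY id",
            (user_id,)).fetchall()
        result[name] = [t for (t,) in rows]
    return result

def orders_single_query():
    # The fix: one JOIN fetches everything; group rows in application code
    result = {}
    for name, total in conn.execute(
            "SELECT u.name, o.total FROM users u "
            "LEFT JOIN orders o ON o.user_id = u.id ORDER BY u.id, o.id"):
        result.setdefault(name, []).append(total)
    return result
```

Both functions return the same data, but the first issues 1 + N statements while the second issues exactly one.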

Unintended Updates:

ORMs track entity state. Modifying a "managed" entity causes automatic persistence at transaction end, even without explicit save:

// Within transaction
user = orm.find(User, id) // Entity is now "managed"
user.email = "[email protected]" // Change tracked automatically
// Transaction commits → change persists to database (!)

Solution: Be explicit about what you intend to persist. Use read-only transactions for queries, or detach entities before modification.

Framework Implementations:

See Database ORM for comprehensive ORM patterns and anti-patterns.

Database Migrations

Schema changes must be versioned and automated. Migration tools track which changes have been applied and apply only new ones.

Migration Concepts

Version tracking: Each migration has a version identifier. The migration tool records applied versions in a tracking table:

schema_version table:

| version | description         | applied_at          |
|---------|---------------------|---------------------|
| 1       | create_users_table  | 2024-01-15 10:30:00 |
| 2       | add_status_to_users | 2024-01-16 14:00:00 |
| 3       | create_orders_table | 2024-01-20 09:15:00 |

Sequential execution: Migrations run in version order. Never modify past migrations - create new ones instead.

Example migration sequence:

-- Migration 1: Create users table
CREATE TABLE users (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    email VARCHAR(255) UNIQUE NOT NULL,
    name VARCHAR(100) NOT NULL,
    created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);

-- Migration 2: Add status column
ALTER TABLE users ADD COLUMN status VARCHAR(20) NOT NULL DEFAULT 'ACTIVE';

-- Migration 3: Create orders table with foreign key
CREATE TABLE orders (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id UUID NOT NULL REFERENCES users(id),
    total DECIMAL(19,4) NOT NULL,
    created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX idx_orders_user_id ON orders(user_id);

Migration principles:

  • Sequential numbering: V1, V2, V3... (never modify past migrations)
  • Idempotent: Safe to run multiple times (use IF NOT EXISTS where possible)
  • Backward compatible: Avoid breaking changes in production
  • Test migrations: Run against production-like data before deploying
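The version-tracking and sequential-execution rules reduce to a short loop. This toy runner (Python with the built-in sqlite3; a stand-in for Flyway-style tools, not any real tool's API) records applied versions and skips them on rerun:

```python
import sqlite3

MIGRATIONS = [  # (version, description, sql) -- applied strictly in order
    (1, "create_users_table",
     "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT UNIQUE NOT NULL)"),
    (2, "add_status_to_users",
     "ALTER TABLE users ADD COLUMN status TEXT NOT NULL DEFAULT 'ACTIVE'"),
]

def migrate(conn):
    # Tracking table: which versions have already been applied
    conn.execute("""CREATE TABLE IF NOT EXISTS schema_version (
        version INTEGER PRIMARY KEY, description TEXT,
        applied_at TEXT DEFAULT CURRENT_TIMESTAMP)""")
    applied = {v for (v,) in conn.execute("SELECT version FROM schema_version")}
    for version, description, sql in MIGRATIONS:
        if version in applied:
            continue  # already recorded -> skip, making reruns safe
        conn.execute(sql)
        conn.execute(
            "INSERT INTO schema_version (version, description) VALUES (?, ?)",
            (version, description))
    conn.commit()
```

Running `migrate` twice applies each migration exactly once, which is the idempotence property the principles above call for.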

Migration tools by platform:

  • Java: Flyway, Liquibase
  • Node.js: Knex, TypeORM migrations, Prisma migrations
  • Ruby: ActiveRecord migrations
  • Python: Alembic, Django migrations

See Database Migrations for detailed strategies.

Zero-Downtime Migrations

For high-availability systems:

-- Step 1: Add new column (nullable)
ALTER TABLE users ADD COLUMN new_email VARCHAR(255);

-- Step 2: Deploy code that writes to both old_email and new_email
-- (Code deployed here)

-- Step 3: Backfill data
UPDATE users SET new_email = old_email WHERE new_email IS NULL;

-- Step 4: Make new column NOT NULL
ALTER TABLE users ALTER COLUMN new_email SET NOT NULL;

-- Step 5: Deploy code that reads from new_email
-- (Code deployed here)

-- Step 6: Drop old column
ALTER TABLE users DROP COLUMN old_email;

Each step is independently deployable without downtime.

Transactions

Transactions ensure data consistency across multiple operations.

ACID Properties

  • Atomicity: All operations succeed or all fail (no partial updates)
  • Consistency: Data remains valid according to all constraints
  • Isolation: Concurrent transactions don't interfere
  • Durability: Committed data survives system crashes

Transaction example (conceptual):

BEGIN TRANSACTION

fromAccount = findAccount(fromAccountId)
toAccount = findAccount(toAccountId)

if (fromAccount.balance < amount) {
    ROLLBACK // Abort everything
    throw InsufficientFundsException
}

fromAccount.balance = fromAccount.balance - amount
toAccount.balance = toAccount.balance + amount

save(fromAccount)
save(toAccount)

COMMIT // All changes persist atomically

// If any operation fails between BEGIN and COMMIT,
// database rolls back ALL changes automatically

This atomicity guarantee ensures that money is never "lost" - either both accounts update or neither does.
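The rollback guarantee can be observed directly with sqlite3 (schema and balances are illustrative; a real transfer would also need row locking or optimistic versioning under concurrency):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL NOT NULL)")
conn.execute("INSERT INTO accounts VALUES (1, 100.0), (2, 50.0)")
conn.commit()

def transfer(conn, from_id, to_id, amount):
    try:
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?",
                     (amount, from_id))
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?",
                     (amount, to_id))
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE id = ?", (from_id,)).fetchone()
        if balance < 0:
            raise ValueError("insufficient funds")
        conn.commit()    # both updates persist together...
    except Exception:
        conn.rollback()  # ...or neither does
        raise
```

A failed transfer leaves both balances untouched even though the debit UPDATE had already executed inside the transaction.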

Isolation levels control how transactions see each other's changes:

LevelDirty ReadNon-Repeatable ReadPhantom ReadPerformance
READ UNCOMMITTEDYesYesYesFastest
READ COMMITTEDNoYesYesFast (default in PostgreSQL)
REPEATABLE READNoNoYesSlower
SERIALIZABLENoNoNoSlowest

Default: Use READ COMMITTED. Only use stricter levels when specific consistency requirements demand it.

Caching

Caching reduces database load and improves performance by storing frequently accessed data in fast storage (memory).

Caching Strategies

Cache-aside (lazy loading): Application manages cache explicitly.

function getUser(id):
    cacheKey = "user:" + id
    user = cache.get(cacheKey)

    if (user is null):
        user = database.findById(id)
        cache.set(cacheKey, user, ttl=15 minutes)

    return user

Read-through: Cache handles database lookup automatically on miss.

Write-through: Update both cache and database on write (cache always consistent).

function updateUser(id, data):
    user = database.update(id, data)
    cache.set("user:" + id, user, ttl=15 minutes) // Keep cache in sync
    return user

Write-behind (write-back): Update cache immediately, persist to database asynchronously (faster writes, eventual consistency).

Cache invalidation: Delete from cache when data changes.

function deleteUser(id):
    database.delete(id)
    cache.delete("user:" + id) // Invalidate cache

TTL (Time-To-Live): Data expires automatically after a duration. Balances freshness vs database load.

Distributed caching: Use Redis or Memcached for shared cache across application instances.

See Caching for detailed patterns and cache coherence strategies.

Search

Full-text search requires specialized tools beyond SQL LIKE queries.

PostgreSQL Full-Text Search

Built-in search capabilities:

-- Create full-text search index
ALTER TABLE products ADD COLUMN search_vector tsvector;

CREATE INDEX idx_products_search ON products USING gin(search_vector);

-- Update search vector when data changes
UPDATE products SET search_vector =
    to_tsvector('english', name || ' ' || description);

-- Search query
SELECT * FROM products
WHERE search_vector @@ to_tsquery('english', 'wireless & headphones')
ORDER BY ts_rank(search_vector, to_tsquery('english', 'wireless & headphones')) DESC;

Good for: Simple search within a single database

Elasticsearch

Dedicated search engine for complex requirements:

When to use Elasticsearch:

  • Fuzzy matching and typo tolerance
  • Faceted search (filters)
  • Highlighting search terms
  • Analyzing search behavior
  • Searching across multiple data sources

Trade-offs: Additional infrastructure, eventual consistency

See Search for comprehensive search strategies.

Data Patterns

Soft Deletes

Keep deleted records for audit trails by marking them as deleted rather than actually removing them:

-- Schema with soft delete column
CREATE TABLE users (
    id UUID PRIMARY KEY,
    email VARCHAR(255) NOT NULL,
    deleted_at TIMESTAMP NULL
);

-- "Delete" operation
UPDATE users SET deleted_at = CURRENT_TIMESTAMP WHERE id = ?;

-- Normal queries exclude deleted records
SELECT * FROM users WHERE deleted_at IS NULL;

-- Include deleted for audit/recovery
SELECT * FROM users WHERE id = ?;

ORMs can automate this with annotations/decorators that rewrite queries to filter soft-deleted records automatically.

Audit Logging

Track who changed what and when with audit columns:

CREATE TABLE accounts (
    id UUID PRIMARY KEY,
    balance DECIMAL(19,4),
    -- Audit fields
    created_by VARCHAR(100) NOT NULL,
    created_at TIMESTAMP NOT NULL,
    last_modified_by VARCHAR(100),
    updated_at TIMESTAMP
);

ORMs typically provide automatic population of these fields from the security context.
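A sketch of what that automatic population might look like (pure Python; `current_user` stands in for the security context, and all names here are illustrative rather than any framework's API):

```python
from datetime import datetime, timezone

current_user = "alice"  # stand-in for the authenticated principal

def stamp_audit_fields(row, is_new):
    """Fill audit columns before persisting, mimicking an ORM auditing hook."""
    now = datetime.now(timezone.utc)
    if is_new:
        row["created_by"] = current_user
        row["created_at"] = now
    row["last_modified_by"] = current_user
    row["updated_at"] = now
    return row
```

In a real framework this runs as a pre-persist/pre-update callback, so application code never sets audit fields by hand.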

For comprehensive audit logs, consider:

  • Event sourcing: Store all changes as events, derive current state
  • CDC (Change Data Capture): Stream database changes to an audit log

Optimistic Locking

Prevent lost updates in concurrent scenarios using a version column:

CREATE TABLE accounts (
    id UUID PRIMARY KEY,
    version BIGINT NOT NULL DEFAULT 0, -- Version counter
    balance DECIMAL(19,4)
);

-- Update with version check
UPDATE accounts
SET balance = 1500.00, version = version + 1
WHERE id = ? AND version = 5;

-- If no rows updated, someone else modified the record → retry

How it works: Read record with version 5, attempt update requiring version 5. If another transaction already updated to version 6, your update affects 0 rows. Application catches this and retries or reports conflict.

Use optimistic locking for most cases (conflicts rare). Use pessimistic locking (SELECT FOR UPDATE) only when conflicts are frequent.
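The read-check-retry protocol can be exercised with sqlite3 (schema mirrors the example above; the bounded retry loop is one reasonable policy, not the only one):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE accounts (
    id INTEGER PRIMARY KEY,
    version INTEGER NOT NULL DEFAULT 0,
    balance REAL NOT NULL)""")
conn.execute("INSERT INTO accounts VALUES (1, 0, 100.0)")
conn.commit()

def add_to_balance(conn, account_id, delta, max_retries=3):
    for _ in range(max_retries):
        version, balance = conn.execute(
            "SELECT version, balance FROM accounts WHERE id = ?",
            (account_id,)).fetchone()
        cur = conn.execute(
            "UPDATE accounts SET balance = ?, version = version + 1 "
            "WHERE id = ? AND version = ?",
            (balance + delta, account_id, version))
        conn.commit()
        if cur.rowcount == 1:
            return  # our version was still current -> success
        # 0 rows updated: someone else won the race -> reread and retry
    raise RuntimeError("too many concurrent updates")
```

The WHERE clause carries the version we read, so a concurrent writer who bumped the version makes our UPDATE match zero rows instead of silently overwriting their change.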

Performance Optimization

Indexing

Indexes dramatically speed up queries but slow down writes.

-- Index for equality searches
CREATE INDEX idx_users_email ON users(email);

-- Composite index for multiple columns
CREATE INDEX idx_orders_user_status ON orders(user_id, status);

-- Partial index (smaller, faster)
CREATE INDEX idx_accounts_active ON accounts(owner_id)
WHERE status = 'ACTIVE';

-- Covering index (includes all columns needed)
CREATE INDEX idx_users_login ON users(email) INCLUDE (name, status);

Guidelines:

  • Index foreign keys
  • Index columns in WHERE clauses
  • Index columns used in ORDER BY
  • Don't over-index (each index slows writes)

Query Optimization

Paginate results instead of loading entire tables:

-- BAD: Load all users
SELECT * FROM users;

-- GOOD: Paginate
SELECT * FROM users ORDER BY id LIMIT 20 OFFSET 0;

Use explicit joins to avoid N+1 queries:

-- BAD: Application loads orders, then loops to load each user
SELECT * FROM orders;
-- Then for each order: SELECT * FROM users WHERE id = ?

-- GOOD: Single query with join
SELECT o.*, u.name FROM orders o
JOIN users u ON u.id = o.user_id;

Database Pooling

Connection pooling reuses database connections instead of creating new ones per request.

Connection pool parameters:

  • Maximum pool size: Upper bound on concurrent connections (10-20 typical)
  • Minimum idle: Connections to keep ready for immediate use
  • Connection timeout: Max wait time for available connection
  • Idle timeout: When to close idle connections
  • Max lifetime: When to recycle connections (before database closes them)

Why pooling matters: Database connections are expensive to establish (TCP handshake, authentication, resource allocation). Pooling amortizes this cost across many requests.

Common connection pools: HikariCP (Java), pg-pool (Node.js), connection_pool (Ruby).
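The core mechanics (check out, use, return) fit in a toy pool. This sketch uses only the Python standard library, with `queue.Queue` providing the block-with-timeout behavior; real pools like HikariCP add health checks, idle reaping, and max-lifetime recycling on top:

```python
import queue
import sqlite3
from contextlib import contextmanager

class ConnectionPool:
    """Toy fixed-size pool: connections are created once and reused."""

    def __init__(self, create_conn, max_size=5, timeout=2.0):
        self._pool = queue.Queue(maxsize=max_size)
        self._timeout = timeout
        for _ in range(max_size):  # pre-create connections up front
            self._pool.put(create_conn())

    @contextmanager
    def connection(self):
        # Blocks up to `timeout` seconds if every connection is checked out
        conn = self._pool.get(timeout=self._timeout)
        try:
            yield conn
        finally:
            self._pool.put(conn)  # return to the pool instead of closing
```

Usage: `with pool.connection() as conn:` checks a connection out for the duration of the block and guarantees its return even if the block raises.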

See Database Design for comprehensive performance patterns.

Further Learning

Books:

  • Designing Data-Intensive Applications by Martin Kleppmann (2017)
  • Database Internals by Alex Petrov (2019)
  • SQL Performance Explained by Markus Winand (2012)
  • High Performance MySQL by Baron Schwartz et al. (2012)

Practice:

  • Use EXPLAIN ANALYZE to understand query performance
  • Monitor slow query logs in production
  • Practice writing migrations for schema changes
  • Profile application queries with tools like Hibernate Statistics