Infrastructure & DevOps Overview

Infrastructure and DevOps encompass the practices, tools, and platforms that enable software to run reliably in production. This includes everything from containerization and orchestration to cloud services, CI/CD pipelines, and disaster recovery.

What is Infrastructure?

Infrastructure refers to the computing resources and platforms that applications run on:

  • Compute: Servers, containers, serverless functions
  • Storage: Databases, object storage, file systems, caches
  • Networking: Load balancers, DNS, CDNs, service meshes
  • Platform services: Message queues, managed databases, monitoring systems

Modern infrastructure is:

  • Code-defined: Infrastructure as Code (IaC) with tools like Terraform
  • Automated: Provisioning, scaling, and recovery happen automatically
  • Observable: Rich telemetry for understanding system behavior
  • Resilient: Designed to handle and recover from failures

What is DevOps?

DevOps is a cultural and technical movement that breaks down silos between development and operations teams. The goal is to deliver software faster, more reliably, and with higher quality.

Core DevOps principles:

  1. Collaboration: Developers and operations work together, sharing responsibility for production
  2. Automation: Automate repetitive tasks - testing, deployment, infrastructure provisioning
  3. Continuous improvement: Measure everything, learn from incidents, iteratively improve
  4. Fast feedback: Shorten the time between code commit and production deployment
  5. Shared responsibility: Everyone owns reliability, security, and performance

DevOps is not just about tools - it's about culture, practices, and how teams work together.

Infrastructure Components

Containerization

Containers package applications with their dependencies into portable, isolated units. Docker is the most common container runtime.

Why containers?

  • Consistency: Same environment from dev to production
  • Isolation: Applications don't interfere with each other
  • Efficiency: Lighter than virtual machines
  • Portability: Run anywhere containers are supported

# Example: Multi-stage Docker build
FROM eclipse-temurin:21-jdk-alpine AS build
WORKDIR /app
COPY gradlew .
COPY gradle gradle
COPY build.gradle settings.gradle ./
RUN ./gradlew dependencies --no-daemon --quiet
COPY src ./src
RUN ./gradlew bootJar --no-daemon

FROM eclipse-temurin:21-jre-alpine
WORKDIR /app
COPY --from=build /app/build/libs/*.jar app.jar
EXPOSE 8080
ENTRYPOINT ["java", "-jar", "app.jar"]

This approach:

  • Uses multi-stage builds to keep final image small
  • Separates build dependencies from runtime
  • Results in smaller, more secure images

See Docker Guidelines for detailed best practices.

Container Orchestration

Container orchestration manages the lifecycle of containers at scale. Kubernetes is the de facto standard.

What Kubernetes provides:

  • Scheduling: Automatically places containers on available nodes
  • Self-healing: Restarts failed containers, replaces unhealthy nodes
  • Scaling: Automatically scales workloads up or down based on demand
  • Service discovery: Applications can find each other using DNS
  • Load balancing: Distributes traffic across container instances
  • Rolling updates: Deploys new versions without downtime

# Example: Kubernetes Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payment-service
  template:
    metadata:
      labels:
        app: payment-service
    spec:
      containers:
        - name: payment-service
          image: payment-service:1.2.3
          ports:
            - containerPort: 8080
          resources:
            requests:
              memory: "512Mi"
              cpu: "500m"
            limits:
              memory: "1Gi"
              cpu: "1000m"
          livenessProbe:
            httpGet:
              path: /actuator/health/liveness
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /actuator/health/readiness
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5

Key features demonstrated:

  • Replication: 3 instances for availability
  • Resource limits: Prevents resource exhaustion
  • Health checks: Liveness (restart if unhealthy) and readiness (remove from load balancer if not ready)
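
The scaling capability is typically implemented with a HorizontalPodAutoscaler. A minimal sketch targeting the payment-service Deployment above, assuming the cluster runs metrics-server so CPU metrics are available (the thresholds here are illustrative):

# Example: HorizontalPodAutoscaler for the Deployment above
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payment-service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-service
  minReplicas: 3          # never scale below the availability baseline
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # add pods when average CPU exceeds 70%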

See Kubernetes Guidelines for comprehensive patterns.

Infrastructure as Code (IaC)

Define infrastructure using declarative code. Terraform is our primary IaC tool.

Benefits:

  • Version control: Track infrastructure changes like application code
  • Reproducibility: Spin up identical environments on demand
  • Documentation: Code describes the infrastructure
  • Safety: Preview changes before applying them

# Example: Terraform AWS infrastructure
resource "aws_ecs_cluster" "main" {
  name = "payment-cluster"
}

resource "aws_ecs_service" "payment" {
  name            = "payment-service"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.payment.arn
  desired_count   = 3

  load_balancer {
    target_group_arn = aws_lb_target_group.payment.arn
    container_name   = "payment"
    container_port   = 8080
  }
}

See Terraform Guidelines for module patterns, state management, and best practices.

Package Management

Helm packages Kubernetes applications into versioned, reusable charts.

Why Helm?

  • Templating: Parameterize configurations for different environments
  • Versioning: Track application versions and rollback if needed
  • Reusability: Share common patterns across applications
  • Dependency management: Applications can depend on other charts
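
As a sketch of the templating idea, a chart can expose environment-specific settings as values that Helm substitutes into manifests at install time (the chart layout and value names here are hypothetical):

# values.yaml (hypothetical chart): defaults that environments can override
replicaCount: 3
image:
  repository: payment-service
  tag: "1.2.3"
---
# templates/deployment.yaml (excerpt): Helm fills in the values above
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Release.Name }}-payment
spec:
  replicas: {{ .Values.replicaCount }}
  template:
    spec:
      containers:
        - name: payment
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"

A per-environment values file (for example, installing with -f values-prod.yaml) then overrides these defaults without touching the templates.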

See Helm Guidelines for chart development and deployment patterns.

Cloud Platforms

Cloud Fundamentals

Cloud computing provides on-demand access to computing resources without managing physical hardware.

Service models:

  • IaaS (Infrastructure as a Service): Virtual machines, storage, networks (e.g., AWS EC2)
  • PaaS (Platform as a Service): Managed platforms for running applications (e.g., AWS ECS, Google Cloud Run)
  • SaaS (Software as a Service): Complete applications (e.g., GitHub, Datadog)

Key benefits:

  • Elasticity: Scale resources up or down based on demand
  • Pay-per-use: Only pay for what you consume
  • Global reach: Deploy applications worldwide
  • Managed services: Offload operational burden (databases, message queues, monitoring)

AWS

AWS is our primary cloud provider. Key services include:

  • Compute: ECS (containers), Lambda (serverless), EC2 (virtual machines)
  • Storage: S3 (object storage), EBS (block storage), EFS (file storage)
  • Databases: RDS (relational), DynamoDB (NoSQL), ElastiCache (caching)
  • Networking: VPC, ALB/NLB (load balancers), Route 53 (DNS), CloudFront (CDN)
  • Security: IAM (identity and access), KMS (encryption), Secrets Manager
  • Observability: CloudWatch (metrics/logs), X-Ray (tracing)

See AWS Guidelines for service-specific best practices.

CI/CD Pipelines

Continuous Integration and Continuous Deployment automate the path from code to production. See Pipeline Guidelines.

Continuous Integration (CI)

Automatically build and test code changes:

# GitLab CI example
stages:
  - build
  - test
  - scan
  - deploy

build:
  stage: build
  script:
    - ./gradlew build
  artifacts:
    paths:
      - build/libs/*.jar

test:
  stage: test
  script:
    - ./gradlew test
    - ./gradlew integrationTest
  coverage: '/Total coverage: (\d+\.\d+)%/'

security-scan:
  stage: scan
  script:
    - trivy image $IMAGE_NAME
    - dependency-check --project $PROJECT --scan .

CI principles:

  • Fast feedback: Run tests quickly (ideally under 10 minutes)
  • Fail fast: Catch issues early in the pipeline
  • Comprehensive: Test functionality, security, quality
  • Consistent: Same tests run locally and in CI

Continuous Deployment (CD)

Automatically deploy tested code to production:

CD principles:

  • Automated: No manual steps from commit to production
  • Progressive: Deploy to environments in sequence (dev → staging → production)
  • Safe: Include automated tests, health checks, and rollback capability
  • Observable: Monitor deployments and track deployment frequency/success rate
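
The progressive sequence above could be sketched as GitLab CI deploy jobs, one per environment, each gated on the previous one succeeding (the job names and deploy script are illustrative, not an existing pipeline):

# Illustrative deploy jobs: dev -> staging -> production, fully automated
deploy-dev:
  stage: deploy
  script:
    - ./deploy.sh dev          # hypothetical deploy script
  environment: dev

deploy-staging:
  stage: deploy
  script:
    - ./deploy.sh staging
  environment: staging
  needs: ["deploy-dev"]        # runs only after dev deploy succeeds

deploy-production:
  stage: deploy
  script:
    - ./deploy.sh production
  environment: production
  needs: ["deploy-staging"]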

See Pipeline Guidelines for detailed pipeline patterns.

Repository Management

Repository structure and organization impact team collaboration and deployment strategies.

Key decisions:

Monorepo vs. Polyrepo

Monorepo: Single repository containing multiple projects

  • Pros: Atomic changes across projects, easier refactoring, simplified dependency management
  • Cons: Large repository size, potential CI/CD complexity, access control challenges

Polyrepo: Separate repository per project

  • Pros: Independent versioning, clear ownership, simpler CI/CD per repo
  • Cons: Cross-project changes require multiple PRs, dependency management complexity

See Repository Guidelines for choosing and implementing these strategies.

Operational Concerns

Secrets Management

Never commit secrets (passwords, API keys, certificates) to version control. Use secrets management tools:

  • AWS Secrets Manager: Managed secret storage with automatic rotation
  • HashiCorp Vault: Enterprise secret management with dynamic secrets
  • Kubernetes Secrets: Built-in secret storage (ensure encryption at rest)

Best practices:

  • Rotate secrets regularly
  • Use short-lived credentials when possible (e.g., IAM roles instead of access keys)
  • Encrypt secrets at rest and in transit
  • Audit secret access
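
As one example of keeping secrets out of images and code, a pod can pull a credential at runtime from a Kubernetes Secret created out-of-band (the secret and key names here are illustrative):

# Illustrative: inject a database password from an existing Kubernetes Secret
apiVersion: v1
kind: Pod
metadata:
  name: payment-service
spec:
  containers:
    - name: payment-service
      image: payment-service:1.2.3
      env:
        - name: DB_PASSWORD
          valueFrom:
            secretKeyRef:
              name: payment-db    # Secret created via kubectl/IaC, never committed
              key: password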

See Secrets Management for implementation patterns.

Disaster Recovery

Plan for failures before they happen. See Disaster Recovery.

Key metrics:

  • RTO (Recovery Time Objective): How long can you be down?
  • RPO (Recovery Point Objective): How much data loss is acceptable?

Strategies:

  • Backups: Regular automated backups with tested restore procedures
  • Multi-region: Deploy to multiple AWS regions for geographic redundancy
  • Chaos engineering: Deliberately cause failures to test recovery (see Chaos Engineering)

Runbooks: Document recovery procedures for common failures:

  • Database corruption
  • Region outage
  • Accidental deletion
  • Security breach

DevOps Metrics

Measure what matters. DORA (DevOps Research and Assessment) identified four key metrics:

1. Deployment Frequency

How often do you deploy to production?

  • Elite: Multiple deployments per day
  • High: Weekly to monthly
  • Medium: Monthly to every 6 months
  • Low: Fewer than once every 6 months

Why it matters: Frequent deployments mean smaller changes, which are easier to test and debug.

2. Lead Time for Changes

How long does it take for a code commit to reach production?

  • Elite: Less than 1 day
  • High: 1 day to 1 week
  • Medium: 1 week to 1 month
  • Low: More than 1 month

Why it matters: Shorter lead times mean faster feedback and quicker value delivery.

3. Time to Restore Service

How long does it take to recover from an incident?

  • Elite: Less than 1 hour
  • High: Less than 1 day
  • Medium: 1 day to 1 week
  • Low: More than 1 week

Why it matters: Faster recovery reduces impact of failures.

4. Change Failure Rate

What percentage of deployments cause a failure?

  • Elite: 0-15%
  • High: 16-30%
  • Medium: 31-45%
  • Low: More than 45%

Why it matters: Lower failure rates indicate higher quality releases.

These metrics are interconnected. Improving deployment frequency without increasing change failure rate requires investment in testing and observability.

Platform Engineering

Platform engineering builds internal platforms that make developers more productive by:

  • Abstracting complexity: Provide simple interfaces to complex infrastructure
  • Enforcing standards: Build security, observability, and resilience into platforms
  • Self-service: Developers can provision infrastructure without waiting for operations

Example: An internal platform might provide:

# Developer runs a simple command
platform deploy --app payment-service --env production

# Behind the scenes, the platform:
# - Builds and scans the Docker image
# - Runs tests
# - Updates Kubernetes deployment
# - Configures load balancer
# - Sets up monitoring/alerting
# - Updates service mesh

Developers don't need to understand Kubernetes, Terraform, or AWS - they just deploy.

DevOps Culture

Technology enables DevOps, but culture makes it work:

Blameless Postmortems

When incidents happen, focus on systems and processes, not individuals. See Incident Post-Mortems.

Questions to ask:

  • What happened?
  • Why did our systems allow this to happen?
  • How do we prevent this class of failure?
  • What did we learn?

Not: "Who caused this?" or "Who's responsible?"

Shared Responsibility

Everyone owns production:

  • Developers participate in on-call rotation
  • Operations contribute to feature development
  • Product managers understand infrastructure constraints

Continuous Learning

Invest in learning:

  • Blameless postmortems for knowledge sharing
  • Internal tech talks and demos
  • Experimentation time
  • Conference attendance and training budgets

Getting Started

For New Services

  1. Containerize from day one: Write a Dockerfile before writing application code
  2. Automate deployment: Set up CI/CD pipeline early
  3. Include observability: Structured logging, metrics, health endpoints (see Observability)
  4. Plan for failure: Include retry logic, circuit breakers, timeouts
  5. Document operations: Write runbooks for common tasks

For Existing Services

Improve incrementally:

  1. Add health checks: Kubernetes liveness/readiness probes
  2. Improve observability: Add structured logging and metrics
  3. Automate deployments: Move from manual to automated deployments
  4. Infrastructure as Code: Convert manual infrastructure to Terraform
  5. Containerize: Move from VMs to containers when re-architecting

Don't try to modernize everything at once. Prioritize high-value or high-pain services.

Related Topics

  • Observability: Logging, metrics, tracing for production systems
  • Security: Security practices for infrastructure and operations
  • Testing: Testing strategies including integration and chaos testing
  • Architecture: How infrastructure decisions impact architecture
  • Performance: Performance optimization and testing

Further Learning

Books:

  • The Phoenix Project by Gene Kim et al. (2013) - DevOps novel
  • The DevOps Handbook by Gene Kim et al. (2016) - Practical guide
  • Site Reliability Engineering by Google (2016) - SRE practices
  • Kubernetes in Action by Marko Lukša (2023) - Comprehensive K8s guide

Certifications:

  • AWS Certified Solutions Architect
  • Certified Kubernetes Administrator (CKA)
  • HashiCorp Certified: Terraform Associate