Infrastructure & DevOps Overview

Infrastructure and DevOps encompass the practices, tools, and platforms that enable software to run reliably in production. This includes everything from containerization and orchestration to cloud services, CI/CD pipelines, and disaster recovery.

What is Infrastructure?

Infrastructure refers to the computing resources and platforms that applications run on:

  • Compute: Servers, containers, serverless functions
  • Storage: Databases, object storage, file systems, caches
  • Networking: Load balancers, DNS, CDNs, service meshes
  • Platform services: Message queues, managed databases, monitoring systems

Modern infrastructure is:

  • Code-defined: Infrastructure as Code (IaC) with tools like Terraform
  • Automated: Provisioning, scaling, and recovery happen automatically
  • Observable: Rich telemetry for understanding system behavior
  • Resilient: Designed to handle and recover from failures

What is DevOps?

DevOps is a cultural and technical movement that breaks down silos between development and operations teams. The goal is to deliver software faster, more reliably, and with higher quality.

Core DevOps principles:

  1. Collaboration: Developers and operations work together, sharing responsibility for production
  2. Automation: Automate repetitive tasks - testing, deployment, infrastructure provisioning
  3. Continuous improvement: Measure everything, learn from incidents, iteratively improve
  4. Fast feedback: Shorten the time between code commit and production deployment
  5. Shared responsibility: Everyone owns reliability, security, and performance

DevOps is not just about tools - it's about culture, practices, and how teams work together.

Infrastructure Components

Containerization

Containers package applications with their dependencies into portable, isolated units. Docker is the most common container runtime.

Why containers?

  • Consistency: Same environment from dev to production
  • Isolation: Applications don't interfere with each other
  • Efficiency: Lighter than virtual machines
  • Portability: Run anywhere containers are supported

# Example: Multi-stage Docker build
FROM eclipse-temurin:21-jdk-alpine AS build
WORKDIR /app
COPY gradlew .
COPY gradle gradle
COPY build.gradle settings.gradle ./
RUN ./gradlew dependencies --no-daemon --quiet
COPY src ./src
RUN ./gradlew bootJar --no-daemon

FROM eclipse-temurin:21-jre-alpine
WORKDIR /app
COPY --from=build /app/build/libs/*.jar app.jar
EXPOSE 8080
ENTRYPOINT ["java", "-jar", "app.jar"]

This approach:

  • Uses multi-stage builds to keep final image small
  • Separates build dependencies from runtime
  • Results in smaller, more secure images

See Docker Guidelines for detailed best practices.

Container Orchestration

Container orchestration manages the lifecycle of containers at scale. Kubernetes is the de facto standard.

What Kubernetes provides:

  • Scheduling: Automatically places containers on available nodes
  • Self-healing: Restarts failed containers, replaces unhealthy nodes
  • Scaling: Automatically scales workloads up or down based on demand
  • Service discovery: Applications can find each other using DNS
  • Load balancing: Distributes traffic across container instances
  • Rolling updates: Deploys new versions without downtime

# Example: Kubernetes Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payment-service
  template:
    metadata:
      labels:
        app: payment-service
    spec:
      containers:
        - name: payment-service
          image: payment-service:1.2.3
          ports:
            - containerPort: 8080
          resources:
            requests:
              memory: "512Mi"
              cpu: "500m"
            limits:
              memory: "1Gi"
              cpu: "1000m"
          livenessProbe:
            httpGet:
              path: /actuator/health/liveness
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /actuator/health/readiness
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5

Key features demonstrated:

  • Replication: 3 instances for availability
  • Resource limits: Prevents resource exhaustion
  • Health checks: Liveness (restart if unhealthy) and readiness (remove from load balancer if not ready)
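
The scaling capability is typically implemented with a HorizontalPodAutoscaler. A minimal sketch targeting the payment-service Deployment above, assuming the cluster runs metrics-server so CPU metrics are available (the thresholds here are illustrative):

# Example: HorizontalPodAutoscaler for the Deployment above
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payment-service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-service
  minReplicas: 3          # never scale below the availability baseline
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # add pods when average CPU exceeds 70%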

See Kubernetes Guidelines for comprehensive patterns.

Infrastructure as Code (IaC)

Define infrastructure using declarative code. Terraform is our primary IaC tool.

Benefits:

  • Version control: Track infrastructure changes like application code
  • Reproducibility: Spin up identical environments on demand
  • Documentation: Code describes the infrastructure
  • Safety: Preview changes before applying them

# Example: Terraform AWS infrastructure
resource "aws_ecs_cluster" "main" {
  name = "payment-cluster"
}

resource "aws_ecs_service" "payment" {
  name            = "payment-service"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.payment.arn
  desired_count   = 3

  load_balancer {
    target_group_arn = aws_lb_target_group.payment.arn
    container_name   = "payment"
    container_port   = 8080
  }
}

See Terraform Guidelines for module patterns, state management, and best practices.

Package Management

Helm packages Kubernetes applications into versioned, reusable charts.

Why Helm?

  • Templating: Parameterize configurations for different environments
  • Versioning: Track application versions and rollback if needed
  • Reusability: Share common patterns across applications
  • Dependency management: Applications can depend on other charts
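
As a sketch of the templating idea, a chart can expose environment-specific settings as values that Helm substitutes into manifests at install time (the chart layout and value names here are hypothetical):

# values.yaml (hypothetical chart): defaults that environments can override
replicaCount: 3
image:
  repository: payment-service
  tag: "1.2.3"
---
# templates/deployment.yaml (excerpt): Helm fills in the values above
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Release.Name }}-payment
spec:
  replicas: {{ .Values.replicaCount }}
  template:
    spec:
      containers:
        - name: payment
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"

A per-environment values file (for example, installing with -f values-prod.yaml) then overrides these defaults without touching the templates.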

See Helm Guidelines for chart development and deployment patterns.

Cloud Platforms

Cloud Fundamentals

Cloud computing provides on-demand access to computing resources without managing physical hardware.

Service models:

  • IaaS (Infrastructure as a Service): Virtual machines, storage, networks (e.g., AWS EC2)
  • PaaS (Platform as a Service): Managed platforms for running applications (e.g., AWS ECS, Google Cloud Run)
  • SaaS (Software as a Service): Complete applications (e.g., GitHub, Datadog)

Key benefits:

  • Elasticity: Scale resources up or down based on demand
  • Pay-per-use: Only pay for what you consume
  • Global reach: Deploy applications worldwide
  • Managed services: Offload operational burden (databases, message queues, monitoring)

AWS

AWS is our primary cloud provider. Key services include:

  • Compute: ECS (containers), Lambda (serverless), EC2 (virtual machines)
  • Storage: S3 (object storage), EBS (block storage), EFS (file storage)
  • Databases: RDS (relational), DynamoDB (NoSQL), ElastiCache (caching)
  • Networking: VPC, ALB/NLB (load balancers), Route 53 (DNS), CloudFront (CDN)
  • Security: IAM (identity and access), KMS (encryption), Secrets Manager
  • Observability: CloudWatch (metrics/logs), X-Ray (tracing)

See AWS Guidelines for service-specific best practices.

CI/CD Pipelines

Continuous Integration and Continuous Deployment automate the path from code to production. See Pipeline Guidelines.

Continuous Integration (CI)

Automatically build and test code changes:

# GitLab CI example
stages:
  - build
  - test
  - scan
  - deploy

build:
  stage: build
  script:
    - ./gradlew build
  artifacts:
    paths:
      - build/libs/*.jar

test:
  stage: test
  script:
    - ./gradlew test
    - ./gradlew integrationTest
  coverage: '/Total coverage: (\d+\.\d+)%/'

security-scan:
  stage: scan
  script:
    - trivy image $IMAGE_NAME
    - dependency-check --project $PROJECT --scan .

CI principles:

  • Fast feedback: Run tests quickly (ideally under 10 minutes)
  • Fail fast: Catch issues early in the pipeline
  • Comprehensive: Test functionality, security, quality
  • Consistent: Same tests run locally and in CI

Continuous Deployment (CD)

Automatically deploy tested code to production:

CD principles:

  • Automated: No manual steps from commit to production
  • Progressive: Deploy to environments in sequence (dev → staging → production)
  • Safe: Include automated tests, health checks, and rollback capability
  • Observable: Monitor deployments and track deployment frequency/success rate
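
The progressive sequence above could be sketched as GitLab CI deploy jobs, one per environment, each gated on the previous one succeeding (the job names and deploy script are illustrative, not an existing pipeline):

# Illustrative deploy jobs: dev -> staging -> production, fully automated
deploy-dev:
  stage: deploy
  script:
    - ./deploy.sh dev          # hypothetical deploy script
  environment: dev

deploy-staging:
  stage: deploy
  script:
    - ./deploy.sh staging
  environment: staging
  needs: ["deploy-dev"]        # runs only after dev deploy succeeds

deploy-production:
  stage: deploy
  script:
    - ./deploy.sh production
  environment: production
  needs: ["deploy-staging"]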

See Pipeline Guidelines for detailed pipeline patterns.

Repository Management

Repository structure and organization impact team collaboration and deployment strategies.

Key decisions:

Monorepo vs. Polyrepo

Monorepo: Single repository containing multiple projects

  • Pros: Atomic changes across projects, easier refactoring, simplified dependency management
  • Cons: Large repository size, potential CI/CD complexity, access control challenges

Polyrepo: Separate repository per project

  • Pros: Independent versioning, clear ownership, simpler CI/CD per repo
  • Cons: Cross-project changes require multiple PRs, dependency management complexity

See Repository Guidelines for choosing and implementing these strategies.

Operational Concerns

Secrets Management

Never commit secrets (passwords, API keys, certificates) to version control. Use secrets management tools:

  • AWS Secrets Manager: Managed secret storage with automatic rotation
  • HashiCorp Vault: Enterprise secret management with dynamic secrets
  • Kubernetes Secrets: Built-in secret storage (ensure encryption at rest)

Best practices:

  • Rotate secrets regularly
  • Use short-lived credentials when possible (e.g., IAM roles instead of access keys)
  • Encrypt secrets at rest and in transit
  • Audit secret access
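
As one example of keeping secrets out of images and code, a pod can pull a credential at runtime from a Kubernetes Secret created out-of-band (the secret and key names here are illustrative):

# Illustrative: inject a database password from an existing Kubernetes Secret
apiVersion: v1
kind: Pod
metadata:
  name: payment-service
spec:
  containers:
    - name: payment-service
      image: payment-service:1.2.3
      env:
        - name: DB_PASSWORD
          valueFrom:
            secretKeyRef:
              name: payment-db    # Secret created via kubectl/IaC, never committed
              key: password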

See Secrets Management for implementation patterns.

Disaster Recovery

Plan for failures before they happen. See Disaster Recovery.

Key metrics:

  • RTO (Recovery Time Objective): How long can you be down?
  • RPO (Recovery Point Objective): How much data loss is acceptable?

Strategies:

  • Backups: Regular automated backups with tested restore procedures
  • Multi-region: Deploy to multiple AWS regions for geographic redundancy
  • Chaos engineering: Deliberately cause failures to test recovery (see Chaos Engineering)

Runbooks: Document recovery procedures for common failures:

  • Database corruption
  • Region outage
  • Accidental deletion
  • Security breach

DevOps Metrics

Measure what matters. DORA (DevOps Research and Assessment) identified four key metrics:

1. Deployment Frequency

How often do you deploy to production?

  • Elite: Multiple deployments per day
  • High: Weekly to monthly
  • Medium: Monthly to every 6 months
  • Low: Fewer than once every 6 months

Why it matters: Frequent deployments mean smaller changes, which are easier to test and debug.

2. Lead Time for Changes

How long does it take for a code commit to reach production?

  • Elite: Less than 1 day
  • High: 1 day to 1 week
  • Medium: 1 week to 1 month
  • Low: More than 1 month

Why it matters: Shorter lead times mean faster feedback and quicker value delivery.

3. Time to Restore Service

How long does it take to recover from an incident?

  • Elite: Less than 1 hour
  • High: Less than 1 day
  • Medium: 1 day to 1 week
  • Low: More than 1 week

Why it matters: Faster recovery reduces impact of failures.

4. Change Failure Rate

What percentage of deployments cause a failure?

  • Elite: 0-15%
  • High: 16-30%
  • Medium: 31-45%
  • Low: More than 45%

Why it matters: Lower failure rates indicate higher quality releases.

These metrics are interconnected. Improving deployment frequency without increasing change failure rate requires investment in testing and observability.

Platform Engineering

Platform engineering builds internal platforms that make developers more productive by:

  • Abstracting complexity: Provide simple interfaces to complex infrastructure
  • Enforcing standards: Build security, observability, and resilience into platforms
  • Self-service: Developers can provision infrastructure without waiting for operations

Example: An internal platform might provide:

# Developer runs a simple command
platform deploy --app payment-service --env production

# Behind the scenes, the platform:
# - Builds and scans the Docker image
# - Runs tests
# - Updates Kubernetes deployment
# - Configures load balancer
# - Sets up monitoring/alerting
# - Updates service mesh

Developers don't need to understand Kubernetes, Terraform, or AWS - they just deploy.

DevOps Culture

Technology enables DevOps, but culture makes it work:

Blameless Postmortems

When incidents happen, focus on systems and processes, not individuals. See Incident Post-Mortems.

Questions to ask:

  • What happened?
  • Why did our systems allow this to happen?
  • How do we prevent this class of failure?
  • What did we learn?

Not: "Who caused this?" or "Who's responsible?"

Shared Responsibility

Everyone owns production:

  • Developers participate in on-call rotation
  • Operations contribute to feature development
  • Product managers understand infrastructure constraints

Continuous Learning

Invest in learning:

  • Blameless postmortems for knowledge sharing
  • Internal tech talks and demos
  • Experimentation time
  • Conference attendance and training budgets

Getting Started

For New Services

  1. Containerize from day one: Write a Dockerfile before writing application code
  2. Automate deployment: Set up CI/CD pipeline early
  3. Include observability: Structured logging, metrics, health endpoints (see Observability)
  4. Plan for failure: Include retry logic, circuit breakers, timeouts
  5. Document operations: Write runbooks for common tasks

For Existing Services

Improve incrementally:

  1. Add health checks: Kubernetes liveness/readiness probes
  2. Improve observability: Add structured logging and metrics
  3. Automate deployments: Move from manual to automated deployments
  4. Infrastructure as Code: Convert manual infrastructure to Terraform
  5. Containerize: Move from VMs to containers when re-architecting

Don't try to modernize everything at once. Prioritize high-value or high-pain services.

Related Topics

  • Observability: Logging, metrics, tracing for production systems
  • Security: Security practices for infrastructure and operations
  • Testing: Testing strategies including integration and chaos testing
  • Architecture: How infrastructure decisions impact architecture
  • Performance: Performance optimization and testing

Further Learning

Books:

  • The Phoenix Project by Gene Kim et al. (2013) - DevOps novel
  • The DevOps Handbook by Gene Kim et al. (2016) - Practical guide
  • Site Reliability Engineering by Google (2016) - SRE practices
  • Kubernetes in Action by Marko Lukša (2023) - Comprehensive K8s guide

Certifications:

  • AWS Certified Solutions Architect
  • Certified Kubernetes Administrator (CKA)
  • HashiCorp Certified: Terraform Associate