Amazon Elastic Kubernetes Service (EKS)
Amazon Elastic Kubernetes Service (EKS) is a managed Kubernetes service that runs the Kubernetes control plane across multiple AWS Availability Zones, eliminating the operational burden of managing control plane nodes. EKS provides AWS-native integrations (IAM, VPC, CloudWatch, ALB) while maintaining compatibility with standard Kubernetes APIs and tooling.
EKS enables you to run production Kubernetes workloads with AWS-managed reliability, security, and scalability. Understanding EKS-specific features - particularly IAM Roles for Service Accounts (IRSA), VPC CNI networking, and AWS Load Balancer Controller - is essential for building secure, cost-effective Kubernetes architectures on AWS.
This guide assumes familiarity with core Kubernetes concepts (Pods, Deployments, Services, Ingress). For comprehensive Kubernetes fundamentals, see Kubernetes Best Practices. For application packaging, see Helm documentation.
EKS Architecture Overview
EKS clusters consist of two primary components:
Control Plane (AWS-managed):
- Kubernetes API server, scheduler, controller manager, etcd
- Runs across 3 availability zones for high availability
- AWS handles patching, scaling, and recovery
- You interact via `kubectl` (standard Kubernetes API)
Data Plane (customer-managed):
- Worker nodes (EC2 instances) or Fargate pods
- Runs your application containers
- You configure node types, scaling, networking
This architecture illustrates EKS's core components. The control plane runs in an AWS-managed VPC (you never see or manage these instances). Worker nodes run in your VPC, with kubelet communicating with the control plane via AWS PrivateLink or public endpoints. The AWS Load Balancer Controller provisions Application Load Balancers in your public subnets to route traffic to pods. NAT Gateways enable pods in private subnets to access the internet for pulling images and external API calls.
Cluster Endpoint Access
EKS clusters have configurable API endpoint access:
- Public endpoint: API server accessible from internet (with optional CIDR restrictions)
- Private endpoint: API server accessible only from within VPC via AWS PrivateLink
- Both public and private: Recommended for production (CI/CD uses public, nodes use private)
Production recommendation: Enable both endpoints. Configure public endpoint with CIDR restrictions (only your office/VPN IPs and CI/CD systems). Worker nodes communicate via private endpoint (no NAT Gateway cost, lower latency, increased security).
```bash
# Create cluster with both endpoints enabled
eksctl create cluster \
  --name production-cluster \
  --region us-east-1 \
  --nodegroup-name standard-workers \
  --node-type m6i.xlarge \
  --nodes 3 \
  --nodes-min 3 \
  --nodes-max 10 \
  --vpc-private-subnets subnet-abc123,subnet-def456 \
  --vpc-public-subnets subnet-ghi789,subnet-jkl012 \
  --endpoint-private-access=true \
  --endpoint-public-access=true \
  --public-access-cidrs 203.0.113.0/24  # Your office/VPN IP range
```
For cluster networking within VPCs, see VPC design patterns.
Node Groups and Compute Options
EKS supports three compute models for running pods:
Managed Node Groups (Recommended)
Managed Node Groups automate EC2 instance provisioning, upgrades, and lifecycle management. EKS handles:
- Creating Auto Scaling Groups with optimal configurations
- Gracefully draining nodes before termination during upgrades
- Applying security patches and AMI updates
- Tagging instances for AWS integrations
```yaml
# Managed Node Group configuration (eksctl)
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: production-cluster
  region: us-east-1

managedNodeGroups:
  - name: general-purpose
    instanceType: m6i.xlarge
    minSize: 3
    maxSize: 10
    desiredCapacity: 3
    volumeSize: 50
    volumeType: gp3
    privateNetworking: true  # Launch in private subnets
    labels:
      workload-type: general
    tags:
      Environment: production
      Team: platform-engineering
    iam:
      withAddonPolicies:
        imageBuilder: true
        autoScaler: true
        cloudWatch: true
        albIngress: true
        ebs: true
        efs: true

  - name: compute-optimized
    instanceType: c6i.2xlarge
    minSize: 0
    maxSize: 5
    desiredCapacity: 0
    labels:
      workload-type: compute-intensive
    taints:
      - key: workload-type
        value: compute-intensive
        effect: NoSchedule
```
Node group best practices:
Use multiple node groups for workload isolation. Create separate node groups for different workload types (general-purpose, memory-optimized, GPU) with appropriate taints and node selectors. This prevents batch jobs from stealing resources from user-facing APIs. See Kubernetes resource management for request/limit patterns.
Enable auto-scaling. Configure Cluster Autoscaler or Karpenter (preferred) to automatically add/remove nodes based on pending pods. See Auto-Scaling section below.
Use Graviton instances (m7g, c7g, r7g) for 20-40% cost savings with equivalent or better performance. Ensure container images are multi-arch (linux/amd64, linux/arm64). See compute instance types.
Launch nodes in private subnets. Public subnets increase attack surface. Use private subnets with NAT Gateways for internet access. See subnet design.
Self-Managed Node Groups
Self-Managed Node Groups give you full control over EC2 instances and Auto Scaling Groups. Use when you need:
- Custom AMIs with specialized software
- Specific instance storage (NVMe SSDs for databases)
- Advanced Auto Scaling Group configurations
Trade-off: You manage all node lifecycle operations (AMI updates, Kubernetes version upgrades, security patching).
Recommendation: Use Managed Node Groups unless specific requirements demand self-managed nodes. The operational overhead of self-managed nodes outweighs benefits for most use cases.
Fargate Profiles
AWS Fargate for EKS runs pods serverless - no EC2 instances to manage. You define Fargate profiles specifying which pods run on Fargate based on namespace and labels.
```yaml
# Fargate Profile configuration
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: production-cluster
  region: us-east-1

fargateProfiles:
  - name: batch-jobs
    selectors:
      - namespace: batch-processing
        labels:
          compute: fargate
  - name: serverless-apps
    selectors:
      - namespace: serverless
```
Pods matching selectors run on Fargate; all others run on EC2 node groups.
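The selector semantics (the namespace must match, and any labels listed in the selector must all be present on the pod) can be sketched as follows. This is a minimal illustration of the matching rule, not actual EKS scheduler code, and `runs_on_fargate` is a hypothetical helper name:

```python
def runs_on_fargate(pod_namespace, pod_labels, profiles):
    """Return True if a pod matches any Fargate profile selector.

    A selector matches when its namespace equals the pod's namespace and
    every label it lists (if any) is present on the pod with the same value.
    """
    for profile in profiles:
        for selector in profile["selectors"]:
            if selector["namespace"] != pod_namespace:
                continue
            required = selector.get("labels", {})
            if all(pod_labels.get(k) == v for k, v in required.items()):
                return True
    return False


# The two profiles from the configuration above, as plain dicts
profiles = [
    {"name": "batch-jobs",
     "selectors": [{"namespace": "batch-processing",
                    "labels": {"compute": "fargate"}}]},
    {"name": "serverless-apps",
     "selectors": [{"namespace": "serverless"}]},
]
```

With these profiles, any pod in `serverless` matches, but a pod in `batch-processing` matches only if it carries the `compute: fargate` label.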
When to use Fargate:
- Batch jobs with unpredictable schedules (only pay when job runs)
- Extremely isolated workloads requiring dedicated compute (no noisy neighbors)
- Development environments (no idle node costs)
Fargate limitations:
- Higher per-vCPU cost than EC2 (not cost-effective for sustained workloads)
- No DaemonSets (Fargate runs one pod per VM; DaemonSets require running on every node)
- No privileged containers or host networking
- Slower pod startup (~60s vs ~5s on warm nodes)
Recommendation: Use Fargate for specific workloads (batch jobs, isolated services). Use EC2 node groups for general workloads (lower cost, faster pod startup, DaemonSet support).
For detailed compute comparison (EC2 vs Fargate vs Lambda), see AWS Compute Services.
IAM Roles for Service Accounts (IRSA)
IRSA provides pod-level IAM permissions, enabling fine-grained access control. Instead of granting node-level permissions (all pods on a node share permissions), IRSA grants permissions to specific Kubernetes ServiceAccounts, which pods reference.
Why IRSA Matters
Without IRSA (legacy approach):
- Attach IAM policies to EC2 node IAM role
- All pods on node inherit permissions (overly permissive, violates least privilege)
- Payment service pod can access S3 buckets intended for reporting service
With IRSA:
- Create an IAM role for the specific service (e.g., `payment-service-role` with S3 access to the payment bucket)
- Associate the IAM role with a Kubernetes ServiceAccount (`payment-service-sa`)
- Pod uses the ServiceAccount and automatically receives temporary IAM credentials
- Only payment service pods get payment bucket access
This sequence shows IRSA's authentication flow. The pod's ServiceAccount is annotated with an IAM role ARN. When the pod makes AWS SDK calls, the SDK retrieves temporary credentials by exchanging the pod's OIDC token (projected into the pod via a volume) for IAM credentials via STS AssumeRoleWithWebIdentity. These credentials are scoped to the ServiceAccount's IAM role policies.
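To make the flow concrete: inside the pod, the EKS pod identity webhook injects the `AWS_ROLE_ARN` and `AWS_WEB_IDENTITY_TOKEN_FILE` environment variables, and the SDK reads both before calling STS. The sketch below surfaces those values; `irsa_sts_params` is a hypothetical helper for illustration, and real applications should simply let the SDK's default credential chain handle this:

```python
import os


def irsa_sts_params(session_name="payment-service"):
    """Collect the inputs an AWS SDK uses for AssumeRoleWithWebIdentity.

    Both environment variables are injected by EKS for pods whose
    ServiceAccount carries the eks.amazonaws.com/role-arn annotation.
    """
    token_file = os.environ["AWS_WEB_IDENTITY_TOKEN_FILE"]
    with open(token_file) as f:
        token = f.read().strip()  # projected OIDC token (a JWT)
    return {
        "RoleArn": os.environ["AWS_ROLE_ARN"],
        "RoleSessionName": session_name,
        "WebIdentityToken": token,
    }
```

The returned dict matches what could be passed to `boto3.client("sts").assume_role_with_web_identity(**params)`, which is essentially what the SDKs do internally before caching the temporary credentials.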
Setting Up IRSA
1. Enable OIDC provider for your cluster (one-time setup):
```bash
eksctl utils associate-iam-oidc-provider \
  --cluster production-cluster \
  --approve
```
This creates an OIDC identity provider in IAM, allowing Kubernetes ServiceAccounts to assume IAM roles.
2. Create IAM role with trust policy for ServiceAccount:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::123456789012:oidc-provider/oidc.eks.us-east-1.amazonaws.com/id/EXAMPLED539D4633E53DE1B71EXAMPLE"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "oidc.eks.us-east-1.amazonaws.com/id/EXAMPLED539D4633E53DE1B71EXAMPLE:sub": "system:serviceaccount:payments:payment-service-sa",
          "oidc.eks.us-east-1.amazonaws.com/id/EXAMPLED539D4633E53DE1B71EXAMPLE:aud": "sts.amazonaws.com"
        }
      }
    }
  ]
}
```
The Condition restricts this role to the specific ServiceAccount (payment-service-sa in payments namespace). This prevents other ServiceAccounts from assuming the role.
3. Attach policies to the role:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject"
      ],
      "Resource": "arn:aws:s3:::payment-data-bucket/*"
    }
  ]
}
```
4. Create ServiceAccount with role annotation:
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: payment-service-sa
  namespace: payments
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/payment-service-role
```
5. Reference ServiceAccount in pod spec:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: payment-service
  namespace: payments
spec:
  serviceAccountName: payment-service-sa
  containers:
    - name: app
      image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/payment-service:1.2.3
      env:
        - name: AWS_REGION
          value: us-east-1
      # AWS SDK automatically discovers credentials via IRSA
```
Application code requires no changes. AWS SDKs automatically discover IRSA credentials via the pod's projected token volume. See IAM documentation for detailed trust policy patterns.
VPC CNI and Networking
EKS uses the Amazon VPC Container Network Interface (CNI) plugin, which assigns pods IP addresses from your VPC CIDR. This differs from many Kubernetes networking plugins (Calico, Flannel) that use overlay networks.
VPC CNI Architecture
How VPC CNI works:
- Each worker node has an Elastic Network Interface (ENI) with a primary IP
- VPC CNI attaches additional ENIs to the node as needed
- Each ENI gets secondary IP addresses from your VPC subnet
- Pods receive secondary IPs, making them first-class VPC citizens
- Pods communicate directly with other VPC resources (RDS, ElastiCache) using VPC routing (no NAT)
Each pod gets an IP directly from your VPC subnet. This enables pods to communicate with RDS databases, ElastiCache clusters, and other VPC resources without additional networking layers. Security Groups can apply directly to pods (via Security Groups for Pods feature), providing network-level isolation.
Pod Capacity Planning
Maximum pods per node depends on instance type (determined by number of ENIs and IPs per ENI). Example:
| Instance Type | Max ENIs | IPs per ENI | Max Pods |
|---|---|---|---|
| t3.small | 3 | 4 | 11 |
| t3.medium | 3 | 6 | 17 |
| m5.large | 3 | 10 | 29 |
| m5.xlarge | 4 | 15 | 58 |
| m5.2xlarge | 4 | 15 | 58 |
Formula: Max Pods = (Max ENIs × (IPs per ENI - 1)) + 2
IP exhaustion is a common EKS issue. If your subnet has a /24 CIDR (256 IPs), you can run ~8 m5.large nodes before exhausting IPs (8 nodes × 29 pods/node × 1 IP/pod = 232 IPs + node IPs).
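The formula and the exhaustion math above can be checked with a short script. `max_pods` and `nodes_per_subnet` are hypothetical helper names, and the subnet estimate deliberately ignores the handful of IPs AWS reserves per subnet:

```python
def max_pods(max_enis, ips_per_eni):
    """EKS max pods per node: each ENI's primary IP is not usable for pods,
    plus 2 for pods that use host networking (e.g., aws-node, kube-proxy)."""
    return max_enis * (ips_per_eni - 1) + 2


def nodes_per_subnet(subnet_ips, max_enis, ips_per_eni):
    """Rough count of nodes a subnet can hold before IP exhaustion:
    each node consumes its pod IPs plus one primary node IP.
    Illustrative only -- ignores AWS-reserved IPs and non-EKS usage."""
    per_node = max_pods(max_enis, ips_per_eni) + 1
    return subnet_ips // per_node
```

For an m5.large (3 ENIs, 10 IPs each) this gives 29 pods per node, and a /24 subnet (256 addresses) supports roughly 8 such nodes, matching the worked example above.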
Mitigation strategies:
1. Use larger VPC CIDR blocks (plan for growth). Start with /16 for production VPCs. See CIDR planning.
2. Enable CNI custom networking to use separate subnets for pods (nodes use one CIDR, pods use another). This separates concerns but adds complexity.
3. Use prefix delegation mode (assign entire /28 prefixes to ENIs instead of individual IPs). This increases pod density 3-4x.
4. Use Fargate for specific workloads (Fargate pods don't consume VPC IPs from node subnets).
5. Right-size node groups (fewer large nodes vs. many small nodes - large nodes are more IP-efficient).
Security Groups for Pods
Security Groups for Pods apply EC2 Security Groups directly to individual pods (not just nodes). This enables:
- Database pods allowing ingress only from application pods
- Application pods allowing ingress only from ALB
- Compliance with network isolation requirements (PCI-DSS, HIPAA)
```yaml
# Define SecurityGroupPolicy (custom resource)
apiVersion: vpcresources.k8s.aws/v1beta1
kind: SecurityGroupPolicy
metadata:
  name: payment-service-sg-policy
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: payment-service
  securityGroups:
    groupIds:
      - sg-0123456789abcdef0  # payment-service-sg (allows ingress from ALB SG)
```
Pods matching the selector get the specified Security Groups applied. This works alongside Kubernetes Network Policies, providing defense-in-depth. See network security for layered security patterns.
For comprehensive VPC design (subnets, routing, Security Groups, NACLs), see VPC and Networking.
EKS Add-Ons
EKS add-ons are Kubernetes components required for cluster operation. AWS manages versions, compatibility, and updates for these add-ons.
Core Add-Ons
1. VPC CNI (aws-node): Provides pod networking. Updated regularly for performance improvements and bug fixes. Enable automatic version updates for non-production clusters; pin and manually control versions in production.
2. CoreDNS: Provides DNS resolution for Services and Pods. Pods query CoreDNS for service discovery (payment-service.payments.svc.cluster.local). See Kubernetes service discovery.
3. kube-proxy: Maintains network rules for Service load balancing. Implements Service abstraction (ClusterIP → pod IPs).
Storage Drivers
EBS CSI Driver: Enables dynamic provisioning of EBS volumes as Persistent Volumes. Required for stateful workloads (databases, caching layers).
```yaml
# StorageClass using EBS CSI driver
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-encrypted
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "3000"
  throughput: "125"
  encrypted: "true"
volumeBindingMode: WaitForFirstConsumer  # Create volume in same AZ as pod
allowVolumeExpansion: true
```
EFS CSI Driver: Provides shared file storage (multi-pod read/write). Use for shared configuration, logs, or stateful apps requiring shared storage.
Install CSI drivers as EKS add-ons (managed lifecycle) rather than manually installing via Helm.
For comprehensive persistent storage patterns, see Kubernetes storage management and AWS storage services.
AWS Load Balancer Controller
The AWS Load Balancer Controller (formerly ALB Ingress Controller) provisions Application Load Balancers (ALB) and Network Load Balancers (NLB) from Kubernetes Ingress and Service resources.
Architecture
The controller continuously watches Kubernetes Ingress and Service resources. When you create an Ingress, the controller provisions an ALB, configures listener rules, creates Target Groups, and registers pod IPs (via VPC CNI, pods have VPC IPs). The ALB routes directly to pod IPs, bypassing kube-proxy and NodePort overhead.
Installing AWS Load Balancer Controller
Prerequisites:
- EKS cluster with OIDC provider enabled
- IAM role with load balancer permissions (using IRSA)
```bash
# Create IAM policy for load balancer controller
curl -o iam_policy.json https://raw.githubusercontent.com/kubernetes-sigs/aws-load-balancer-controller/v2.7.0/docs/install/iam_policy.json

aws iam create-policy \
  --policy-name AWSLoadBalancerControllerIAMPolicy \
  --policy-document file://iam_policy.json

# Create ServiceAccount with IRSA
eksctl create iamserviceaccount \
  --cluster=production-cluster \
  --namespace=kube-system \
  --name=aws-load-balancer-controller \
  --attach-policy-arn=arn:aws:iam::123456789012:policy/AWSLoadBalancerControllerIAMPolicy \
  --approve

# Install controller via Helm
helm repo add eks https://aws.github.io/eks-charts
helm repo update
helm install aws-load-balancer-controller eks/aws-load-balancer-controller \
  -n kube-system \
  --set clusterName=production-cluster \
  --set serviceAccount.create=false \
  --set serviceAccount.name=aws-load-balancer-controller
```
For Helm chart management patterns, see Helm documentation.
Ingress Annotations
The controller uses annotations to configure ALB behavior:
```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: payment-api-ingress
  namespace: payments
  annotations:
    # ALB configuration
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip  # Route to pod IPs (requires VPC CNI)
    alb.ingress.kubernetes.io/subnets: subnet-ghi789,subnet-jkl012  # Public subnets
    # HTTPS configuration
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS": 443}]'
    alb.ingress.kubernetes.io/ssl-redirect: "443"
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:us-east-1:123456789012:certificate/abc-123
    # Health check configuration
    alb.ingress.kubernetes.io/healthcheck-path: /actuator/health
    alb.ingress.kubernetes.io/healthcheck-interval-seconds: "30"
    alb.ingress.kubernetes.io/healthcheck-timeout-seconds: "5"
    alb.ingress.kubernetes.io/healthy-threshold-count: "2"
    alb.ingress.kubernetes.io/unhealthy-threshold-count: "3"
    # Security
    alb.ingress.kubernetes.io/security-groups: sg-0123456789abcdef0
    # Tags
    alb.ingress.kubernetes.io/tags: Environment=production,Team=platform
spec:
  ingressClassName: alb
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /payments
            pathType: Prefix
            backend:
              service:
                name: payment-service
                port:
                  number: 8080
          - path: /accounts
            pathType: Prefix
            backend:
              service:
                name: account-service
                port:
                  number: 8080
```
Key annotations:
- `target-type: ip` (recommended): ALB routes directly to pod IPs. Requires VPC CNI. Lower latency than `instance` mode (no NodePort overhead).
- `target-type: instance`: ALB routes to node IPs via NodePort. Use if not using VPC CNI.
- `scheme`: `internal` (private ALB, accessible only from VPC) or `internet-facing` (public ALB).
- `certificate-arn`: ACM certificate for HTTPS. See Route 53 and DNS for domain management.
For comprehensive Ingress patterns (path-based routing, header-based routing, canary deployments), see Kubernetes Ingress.
Cluster Autoscaling
EKS clusters need to scale both pods (Horizontal Pod Autoscaler) and nodes (Cluster Autoscaler or Karpenter).
Horizontal Pod Autoscaler (HPA)
HPA scales pod replicas based on CPU/memory utilization or custom metrics (requests per second, queue depth). See Kubernetes autoscaling for detailed HPA configuration.
Cluster Autoscaler
Cluster Autoscaler watches for pending pods (pods that can't be scheduled due to insufficient node capacity) and adds nodes to the Auto Scaling Group. Conversely, it removes nodes when utilization is low.
```yaml
# Cluster Autoscaler deployment (Helm values)
autoDiscovery:
  clusterName: production-cluster
  enabled: true
awsRegion: us-east-1
rbac:
  serviceAccount:
    annotations:
      eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/cluster-autoscaler-role
extraArgs:
  scale-down-delay-after-add: 10m
  scale-down-unneeded-time: 10m
  skip-nodes-with-local-storage: false
  skip-nodes-with-system-pods: false
```
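The scale-up decision is driven by pending pods' aggregate resource requests. A rough sketch of the arithmetic (the real autoscaler simulates per-node-group bin-packing rather than dividing totals; `nodes_to_add` is a hypothetical helper):

```python
import math


def nodes_to_add(pending_pods, node_cpu, node_mem_gi):
    """Back-of-envelope estimate of nodes needed for pending pods.

    Sums the pods' CPU and memory requests and divides by a node's
    allocatable capacity, taking whichever dimension needs more nodes.
    """
    cpu = sum(p["cpu"] for p in pending_pods)
    mem = sum(p["mem_gi"] for p in pending_pods)
    return max(math.ceil(cpu / node_cpu), math.ceil(mem / node_mem_gi))
```

For example, ten pending pods each requesting 1 vCPU and 4 GiB need three 4-vCPU/16-GiB nodes under this estimate; the real autoscaler would then add them one at a time, as noted below.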
Cluster Autoscaler limitations:
- Scales incrementally (adds one node at a time, checks, adds another if still needed)
- Requires pre-defined node groups (can't change instance types dynamically)
- Slower to respond to traffic spikes (5-10 minutes to provision new nodes)
- Doesn't consider cost optimization (always uses configured instance type)
Karpenter (Recommended)
Karpenter is an open-source autoscaler that provisions nodes directly (bypassing Auto Scaling Groups). Karpenter:
- Provisions nodes in seconds (vs. minutes with Cluster Autoscaler)
- Chooses optimal instance types dynamically (considers spot/on-demand, instance families, availability zones)
- Consolidates workloads onto fewer nodes to reduce costs
- Simplifies node management (no manual node group creation)
```yaml
# Karpenter Provisioner (defines instance selection criteria)
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot", "on-demand"]  # Prefer spot, fall back to on-demand
    - key: kubernetes.io/arch
      operator: In
      values: ["amd64", "arm64"]  # Support both x86 and Graviton
    - key: karpenter.k8s.aws/instance-category
      operator: In
      values: ["c", "m", "r"]  # Compute, general, memory optimized
    - key: karpenter.k8s.aws/instance-generation
      operator: Gt
      values: ["5"]  # Use generation 6+ (m6i, c6i, r6i or newer)
  limits:
    resources:
      cpu: 1000  # Max 1000 vCPUs across all Karpenter-managed nodes
      memory: 1000Gi
  providerRef:
    name: default
  ttlSecondsAfterEmpty: 30  # Remove empty nodes after 30 seconds
  ttlSecondsUntilExpired: 604800  # Recycle nodes after 7 days (force patching)
---
apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: default
spec:
  subnetSelector:
    karpenter.sh/discovery: production-cluster
  securityGroupSelector:
    karpenter.sh/discovery: production-cluster
  instanceProfile: KarpenterNodeInstanceProfile
  amiFamily: AL2  # Amazon Linux 2
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 50Gi
        volumeType: gp3
        encrypted: true
  tags:
    Environment: production
    ManagedBy: karpenter
```
Karpenter analyzes pending pod requirements (CPU, memory, architecture, node selectors, taints/tolerations) and provisions the most cost-effective instances matching those requirements. If a pod requests arch: arm64, Karpenter provisions Graviton instances. If pods request GPU, Karpenter provisions GPU instances.
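A toy version of that selection step: filter offerings by the pod's requirements, then take the cheapest match. The instance names and hourly prices below are illustrative placeholders (not live AWS pricing), `pick_instance` is a hypothetical helper, and Karpenter's real solver also weighs spot capacity, AZ spread, and consolidation:

```python
def pick_instance(pod, offerings):
    """Choose the cheapest offering that satisfies a pending pod's
    CPU/memory requests and (optional) architecture requirement."""
    candidates = [
        o for o in offerings
        if o["cpu"] >= pod["cpu"]
        and o["mem_gi"] >= pod["mem_gi"]
        and pod.get("arch", o["arch"]) == o["arch"]  # no arch = any arch
    ]
    if not candidates:
        raise ValueError("no offering satisfies the pod requirements")
    return min(candidates, key=lambda o: o["price"])


offerings = [  # hypothetical on-demand prices, USD/hour
    {"type": "m6i.large",  "arch": "amd64", "cpu": 2, "mem_gi": 8, "price": 0.096},
    {"type": "m7g.large",  "arch": "arm64", "cpu": 2, "mem_gi": 8, "price": 0.082},
    {"type": "c6i.xlarge", "arch": "amd64", "cpu": 4, "mem_gi": 8, "price": 0.170},
]
```

With these numbers, a 2-vCPU/4-GiB pod with no architecture constraint lands on the cheaper Graviton offering, illustrating why multi-arch images pay off.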
Recommendation: Use Karpenter for production clusters. It significantly reduces operational overhead and cost compared to Cluster Autoscaler + manual node group management.
For autoscaling costs and tradeoffs, see cost optimization.
Observability Integration
EKS integrates with AWS CloudWatch and third-party observability tools (Prometheus, Grafana, Datadog, New Relic).
CloudWatch Container Insights
Container Insights collects metrics and logs from EKS clusters, providing visibility into cluster, node, pod, and container performance.
Metrics collected:
- Cluster-level: CPU/memory utilization, node count, pod count
- Node-level: CPU/memory/disk/network per node
- Pod-level: CPU/memory per pod, container restarts
- Namespace-level: Resource usage per namespace
Install via Helm or CloudFormation:
```bash
# Install CloudWatch Agent and Fluent Bit via Helm
helm repo add aws https://aws.github.io/eks-charts
helm install aws-cloudwatch-metrics aws/aws-cloudwatch-metrics \
  --namespace amazon-cloudwatch \
  --create-namespace \
  --set clusterName=production-cluster

helm install aws-for-fluent-bit aws/aws-for-fluent-bit \
  --namespace amazon-cloudwatch \
  --set cloudWatch.region=us-east-1 \
  --set cloudWatch.logGroupName=/aws/eks/production-cluster/containers
```
View metrics in CloudWatch Console → Container Insights → EKS Clusters.
For comprehensive observability patterns, see:
- Observability Overview - Three pillars (logs, metrics, traces)
- Logging Best Practices - Structured logging, correlation IDs
- Metrics and Monitoring - Application metrics, RED/USE methods
- Distributed Tracing - X-Ray, OpenTelemetry
Prometheus and Grafana
For advanced metrics (custom application metrics, service-level objectives, long-term retention), deploy Prometheus and Grafana via Helm:
```bash
# Install Prometheus Operator (includes Grafana)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace
```
Expose custom metrics from Spring Boot applications via Micrometer and scrape with Prometheus. See Spring Boot observability.
Security Best Practices
EKS security requires multiple layers: IAM, network policies, pod security, secrets management.
Pod Security Standards
Pod Security Standards (PSS) replace deprecated PodSecurityPolicies, enforcing security controls at the namespace level. Three policies:
- Privileged: Unrestricted (for system components like CNI plugins)
- Baseline: Prevents known privilege escalations (blocks privileged containers, host namespaces)
- Restricted: Strict hardening (requires dropping all capabilities, running as non-root, read-only root filesystem)
```yaml
# Enforce Restricted policy at namespace level
apiVersion: v1
kind: Namespace
metadata:
  name: payments
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
```
Pods violating the policy are rejected. See Kubernetes security.
Network Policies
Network Policies control pod-to-pod communication at the network layer. Default Kubernetes behavior allows all pods to communicate. Network Policies enforce deny-by-default.
```yaml
# Deny all ingress traffic to payment-service pods (except from API gateway)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: payment-service-ingress
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: payment-service
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: api-gateway
          podSelector:
            matchLabels:
              app: api-gateway
      ports:
        - protocol: TCP
          port: 8080
```
Combine Network Policies with Security Groups for Pods for defense-in-depth. See security overview.
Secrets Management
Never store secrets in ConfigMaps or environment variables in plain text. Use AWS Secrets Manager or SSM Parameter Store with the Secrets Store CSI Driver to inject secrets as mounted volumes.
```yaml
# SecretProviderClass (mounts AWS Secrets Manager secret as volume)
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: payment-service-secrets
  namespace: payments
spec:
  provider: aws
  parameters:
    objects: |
      - objectName: "payment-db-password"
        objectType: "secretsmanager"
        objectAlias: "db-password"
---
# Pod using SecretProviderClass
apiVersion: v1
kind: Pod
metadata:
  name: payment-service
  namespace: payments
spec:
  serviceAccountName: payment-service-sa
  containers:
    - name: app
      image: payment-service:1.2.3
      volumeMounts:
        - name: secrets
          mountPath: /mnt/secrets
          readOnly: true
  volumes:
    - name: secrets
      csi:
        driver: secrets-store.csi.k8s.io
        readOnly: true
        volumeAttributes:
          secretProviderClass: payment-service-secrets
```
Application reads secret from /mnt/secrets/db-password. Secrets are never stored in etcd. See secrets management.
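Reading the mounted secret from application code is plain file I/O. A minimal sketch, with `read_secret` as a hypothetical helper and the file name taken from the objectAlias in the SecretProviderClass:

```python
from pathlib import Path


def read_secret(name, base="/mnt/secrets"):
    """Read a secret injected by the Secrets Store CSI driver as a file.

    The file name matches the objectAlias ("db-password" above). Reading
    from the mounted volume keeps the value out of environment variables
    and pod specs.
    """
    return Path(base, name).read_text().strip()
```

For example, `read_secret("db-password")` returns the database password inside the pod; the `base` parameter exists only so the helper is testable outside a cluster.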
Image Scanning
Scan container images for vulnerabilities before deploying to EKS. Use Amazon ECR image scanning or third-party tools (Trivy, Snyk, Aqua).
```bash
# Enable ECR image scanning on push
aws ecr put-image-scanning-configuration \
  --repository-name payment-service \
  --image-scanning-configuration scanOnPush=true

# Retrieve scan findings
aws ecr describe-image-scan-findings \
  --repository-name payment-service \
  --image-id imageTag=1.2.3
```
Block deployments of images with critical vulnerabilities in CI/CD pipelines. See CI/CD pipelines.
Cluster Upgrades and Maintenance
EKS clusters require periodic upgrades to stay on supported Kubernetes versions. AWS supports each Kubernetes minor version for 14 months after release.
Upgrade Strategy
1. Check compatibility: Review EKS Kubernetes version compatibility for breaking changes, deprecated APIs, and add-on version requirements.
2. Upgrade add-ons: Update VPC CNI, CoreDNS, kube-proxy, and CSI drivers to versions compatible with target Kubernetes version.
3. Upgrade control plane:
```bash
eksctl upgrade cluster --name production-cluster --version 1.29 --approve
```
Control plane upgrades complete in 20-30 minutes with zero downtime (API server remains available).
4. Upgrade node groups:
```bash
eksctl upgrade nodegroup \
  --cluster production-cluster \
  --name general-purpose \
  --kubernetes-version 1.29
```
Node upgrades replace nodes one at a time, gracefully draining pods before termination. This process takes 30-60 minutes per node group.
5. Validate: Run integration tests, check application health, monitor metrics.
Testing recommendations:
- Test upgrades in non-production environment first
- Review Kubernetes changelogs for deprecated APIs (use `kubectl-convert` to migrate manifests)
- Audit clusters and manifests for deprecated API usage before upgrading; `kubectl get all --all-namespaces` alone does not report API deprecations, so consider tools built for this (e.g., pluto or kube-no-trouble)
For comprehensive testing strategies, see integration testing and chaos engineering.
Cost Optimization
EKS costs include:
- Control plane: $0.10/hour per cluster (~$73/month)
- Worker nodes: EC2 instance costs (or Fargate task costs)
- Data transfer: Cross-AZ traffic, NAT Gateway traffic
Optimization Strategies
1. Right-size pods and nodes: Use Vertical Pod Autoscaler to recommend resource requests/limits. Avoid over-provisioning. See resource management.
2. Use Spot instances: Run 50-80% of workload on Spot (stateless apps, batch jobs) for 70-90% discount. Use Karpenter to automatically mix Spot and On-Demand.
3. Use Graviton instances: 20-40% cost savings vs. x86. Ensure multi-arch images.
4. Consolidate workloads: Use fewer, larger nodes instead of many small nodes (reduces per-node overhead: kubelet, kube-proxy, CNI).
5. Reduce cross-AZ traffic: Use topology-aware routing to prefer pods in the same AZ as the requester (reduces data transfer costs).
6. Use VPC endpoints: Access S3, ECR, and other AWS services via VPC endpoints (free gateway endpoints, cost-effective interface endpoints) instead of routing through NAT Gateways.
7. Implement PodDisruptionBudgets: Enable safe node draining during scale-down without affecting availability. See high availability patterns.
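Strategies 5 and 6 both target per-GB charges, and a back-of-envelope calculator makes the tradeoff concrete. The rates below are illustrative assumptions (cross-AZ transfer billed in each direction, NAT Gateway per-GB processing on top), not current AWS pricing, and `monthly_transfer_cost` is a hypothetical helper:

```python
def monthly_transfer_cost(gb_cross_az, gb_via_nat,
                          cross_az_rate=0.01, nat_rate=0.045):
    """Rough monthly data transfer cost in USD.

    Assumed rates: cross-AZ traffic billed per GB in each direction
    (hence the factor of 2); NAT Gateway charges per GB processed.
    Check the current AWS price list before relying on these numbers.
    """
    cross_az = gb_cross_az * cross_az_rate * 2  # billed both directions
    nat = gb_via_nat * nat_rate
    return round(cross_az + nat, 2)
```

Under these assumed rates, 1 TB of cross-AZ chatter plus 500 GB through a NAT Gateway costs about $42.50/month; topology-aware routing and VPC endpoints attack those two terms directly.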
For comprehensive cost optimization, see AWS cost optimization.
Common EKS Anti-Patterns
Avoid these mistakes that create operational complexity, security risks, or cost overruns:
Exposing the API server through the public endpoint only: Enable both public and private endpoints. Nodes should use the private endpoint (lower latency, no NAT costs).
Not using IRSA: Attaching IAM policies to node roles grants permissions to all pods on the node (overly permissive). Use IRSA for pod-level permissions.
Ignoring IP exhaustion: VPC CNI consumes many IPs. Plan subnet sizes carefully or use prefix delegation mode.
Over-provisioning pods: Setting requests: 8Gi memory when app uses 512Mi wastes capacity and increases costs. Profile applications and set realistic requests.
Not setting resource limits: Pods without limits can consume all node resources, starving other pods. Always set limits.
Using latest image tags: image: myapp:latest makes deployments non-reproducible. Use semantic versioning (myapp:1.2.3).
Storing secrets in ConfigMaps: ConfigMaps are not encrypted. Use Secrets with encryption at rest or Secrets Manager.
Single node group for all workloads: Mixing batch jobs and critical APIs on the same nodes creates resource contention. Use separate node groups with taints/tolerations.
Not implementing health checks: Pods without readiness probes receive traffic before ready, causing errors. Always implement liveness and readiness probes. See health checks.
Manual kubectl deployments: Deploying via kubectl apply manually creates inconsistency. Use GitOps (ArgoCD, Flux) or CI/CD pipelines with version-controlled manifests.
Further Reading
EKS documentation and Kubernetes ecosystem are vast. Continue learning with these resources:
- AWS EKS Best Practices Guide - Comprehensive AWS-maintained guide
- AWS EKS Workshop - Hands-on tutorials
- VPC CNI Documentation - Deep dive on networking
- Karpenter Documentation - Advanced autoscaling patterns
- AWS Load Balancer Controller Documentation - Ingress patterns
For Kubernetes fundamentals and best practices, see Kubernetes documentation. For application packaging, see Helm charts. For Infrastructure as Code, see Terraform on AWS and Terraform best practices. For observability, see CloudWatch observability and general observability practices. For container image optimization, see Docker guidelines.