Amazon Elastic Kubernetes Service (EKS)
Amazon Elastic Kubernetes Service (EKS) is a managed Kubernetes service that runs the Kubernetes control plane across multiple AWS Availability Zones, eliminating the operational burden of managing control plane nodes. EKS provides AWS-native integrations (IAM, VPC, CloudWatch, ALB) while maintaining compatibility with standard Kubernetes APIs and tooling.
EKS enables you to run production Kubernetes workloads with AWS-managed reliability, security, and scalability. Understanding EKS-specific features - particularly IAM Roles for Service Accounts (IRSA), VPC CNI networking, and AWS Load Balancer Controller - is essential for building secure, cost-effective Kubernetes architectures on AWS.
This guide assumes familiarity with core Kubernetes concepts (Pods, Deployments, Services, Ingress). For comprehensive Kubernetes fundamentals, see Kubernetes Best Practices. For application packaging, see Helm documentation.
EKS Architecture Overview
EKS clusters consist of two primary components:
Control Plane (AWS-managed):
- Kubernetes API server, scheduler, controller manager, etcd
- Runs across 3 availability zones for high availability
- AWS handles patching, scaling, and recovery
- You interact via `kubectl` (standard Kubernetes API)
Data Plane (customer-managed):
- Worker nodes (EC2 instances) or Fargate pods
- Runs your application containers
- You configure node types, scaling, networking
This architecture illustrates EKS's core components. The control plane runs in an AWS-managed VPC (you never see or manage these instances). Worker nodes run in your VPC, with kubelet communicating with the control plane via AWS PrivateLink or public endpoints. The AWS Load Balancer Controller provisions Application Load Balancers in your public subnets to route traffic to pods. NAT Gateways enable pods in private subnets to access the internet for pulling images and external API calls.
Cluster Endpoint Access
EKS clusters have configurable API endpoint access:
- Public endpoint: API server accessible from internet (with optional CIDR restrictions)
- Private endpoint: API server accessible only from within VPC via AWS PrivateLink
- Both public and private: Recommended for production (CI/CD uses public, nodes use private)
Production recommendation: Enable both endpoints. Configure public endpoint with CIDR restrictions (only your office/VPN IPs and CI/CD systems). Worker nodes communicate via private endpoint (no NAT Gateway cost, lower latency, increased security).
```bash
# Create cluster with both endpoints enabled
eksctl create cluster \
  --name production-cluster \
  --region us-east-1 \
  --nodegroup-name standard-workers \
  --node-type m6i.xlarge \
  --nodes 3 \
  --nodes-min 3 \
  --nodes-max 10 \
  --vpc-private-subnets subnet-abc123,subnet-def456 \
  --vpc-public-subnets subnet-ghi789,subnet-jkl012 \
  --endpoint-private-access=true \
  --endpoint-public-access=true \
  --public-access-cidrs 203.0.113.0/24  # Your office/VPN IP range
```
For cluster networking within VPCs, see VPC design patterns.
Node Groups and Compute Options
EKS supports three compute models for running pods:
Managed Node Groups (Recommended)
Managed Node Groups automate EC2 instance provisioning, upgrades, and lifecycle management. EKS handles:
- Creating Auto Scaling Groups with optimal configurations
- Gracefully draining nodes before termination during upgrades
- Applying security patches and AMI updates
- Tagging instances for AWS integrations
```yaml
# Managed Node Group configuration (eksctl)
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: production-cluster
  region: us-east-1

managedNodeGroups:
  - name: general-purpose
    instanceType: m6i.xlarge
    minSize: 3
    maxSize: 10
    desiredCapacity: 3
    volumeSize: 50
    volumeType: gp3
    privateNetworking: true  # Launch in private subnets
    labels:
      workload-type: general
    tags:
      Environment: production
      Team: platform-engineering
    iam:
      withAddonPolicies:
        imageBuilder: true
        autoScaler: true
        cloudWatch: true
        albIngress: true
        ebs: true
        efs: true

  - name: compute-optimized
    instanceType: c6i.2xlarge
    minSize: 0
    maxSize: 5
    desiredCapacity: 0
    labels:
      workload-type: compute-intensive
    taints:
      - key: workload-type
        value: compute-intensive
        effect: NoSchedule
```
Node group best practices:
Use multiple node groups for workload isolation. Create separate node groups for different workload types (general-purpose, memory-optimized, GPU) with appropriate taints and node selectors. This prevents batch jobs from stealing resources from user-facing APIs. See Kubernetes resource management for request/limit patterns.
Enable auto-scaling. Configure Cluster Autoscaler or Karpenter (preferred) to automatically add/remove nodes based on pending pods. See Auto-Scaling section below.
Use Graviton instances (m7g, c7g, r7g) for 20-40% cost savings with equivalent or better performance. Ensure container images are multi-arch (linux/amd64, linux/arm64). See compute instance types.
Launch nodes in private subnets. Public subnets increase attack surface. Use private subnets with NAT Gateways for internet access. See subnet design.
Self-Managed Node Groups
Self-Managed Node Groups give you full control over EC2 instances and Auto Scaling Groups. Use when you need:
- Custom AMIs with specialized software
- Specific instance storage (NVMe SSDs for databases)
- Advanced Auto Scaling Group configurations
Trade-off: You manage all node lifecycle operations (AMI updates, Kubernetes version upgrades, security patching).
Recommendation: Use Managed Node Groups unless specific requirements demand self-managed nodes. The operational overhead of self-managed nodes outweighs benefits for most use cases.
Fargate Profiles
AWS Fargate for EKS runs pods serverless - no EC2 instances to manage. You define Fargate profiles specifying which pods run on Fargate based on namespace and labels.
```yaml
# Fargate Profile configuration
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: production-cluster
  region: us-east-1

fargateProfiles:
  - name: batch-jobs
    selectors:
      - namespace: batch-processing
        labels:
          compute: fargate
  - name: serverless-apps
    selectors:
      - namespace: serverless
```
Pods matching selectors run on Fargate; all others run on EC2 node groups.
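The selector semantics (the namespace must match, and any labels listed in the selector must all be present on the pod) can be sketched as follows. This is a minimal illustration of the matching rule, not actual EKS scheduler code, and `runs_on_fargate` is a hypothetical helper name:

```python
def runs_on_fargate(pod_namespace, pod_labels, profiles):
    """Return True if a pod matches any Fargate profile selector.

    A selector matches when its namespace equals the pod's namespace and
    every label it lists (if any) is present on the pod with the same value.
    """
    for profile in profiles:
        for selector in profile["selectors"]:
            if selector["namespace"] != pod_namespace:
                continue
            required = selector.get("labels", {})
            if all(pod_labels.get(k) == v for k, v in required.items()):
                return True
    return False


# The two profiles from the configuration above, as plain dicts
profiles = [
    {"name": "batch-jobs",
     "selectors": [{"namespace": "batch-processing",
                    "labels": {"compute": "fargate"}}]},
    {"name": "serverless-apps",
     "selectors": [{"namespace": "serverless"}]},
]
```

With these profiles, any pod in `serverless` matches, but a pod in `batch-processing` matches only if it carries the `compute: fargate` label.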
When to use Fargate:
- Batch jobs with unpredictable schedules (only pay when job runs)
- Extremely isolated workloads requiring dedicated compute (no noisy neighbors)
- Development environments (no idle node costs)
Fargate limitations:
- Higher per-vCPU cost than EC2 (not cost-effective for sustained workloads)
- No DaemonSets (Fargate runs one pod per VM; DaemonSets require running on every node)
- No privileged containers or host networking
- Slower pod startup (~60s vs ~5s on warm nodes)
Recommendation: Use Fargate for specific workloads (batch jobs, isolated services). Use EC2 node groups for general workloads (lower cost, faster pod startup, DaemonSet support).
For detailed compute comparison (EC2 vs Fargate vs Lambda), see AWS Compute Services.
IAM Roles for Service Accounts (IRSA)
IRSA provides pod-level IAM permissions, enabling fine-grained access control. Instead of granting node-level permissions (all pods on a node share permissions), IRSA grants permissions to specific Kubernetes ServiceAccounts, which pods reference.
Why IRSA Matters
Without IRSA (legacy approach):
- Attach IAM policies to EC2 node IAM role
- All pods on node inherit permissions (overly permissive, violates least privilege)
- Payment service pod can access S3 buckets intended for reporting service
With IRSA:
- Create an IAM role for the specific service (e.g., `payment-service-role` with S3 access to the payment bucket)
- Associate the IAM role with a Kubernetes ServiceAccount (`payment-service-sa`)
- Pod uses the ServiceAccount and automatically receives temporary IAM credentials
- Only payment service pods get payment bucket access
This sequence shows IRSA's authentication flow. The pod's ServiceAccount is annotated with an IAM role ARN. When the pod makes AWS SDK calls, the SDK retrieves temporary credentials by exchanging the pod's OIDC token (projected into the pod via a volume) for IAM credentials via STS AssumeRoleWithWebIdentity. These credentials are scoped to the ServiceAccount's IAM role policies.
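To make the flow concrete: inside the pod, the EKS pod identity webhook injects the `AWS_ROLE_ARN` and `AWS_WEB_IDENTITY_TOKEN_FILE` environment variables, and the SDK reads both before calling STS. The sketch below surfaces those values; `irsa_sts_params` is a hypothetical helper for illustration, and real applications should simply let the SDK's default credential chain handle this:

```python
import os


def irsa_sts_params(session_name="payment-service"):
    """Collect the inputs an AWS SDK uses for AssumeRoleWithWebIdentity.

    Both environment variables are injected by EKS for pods whose
    ServiceAccount carries the eks.amazonaws.com/role-arn annotation.
    """
    token_file = os.environ["AWS_WEB_IDENTITY_TOKEN_FILE"]
    with open(token_file) as f:
        token = f.read().strip()  # projected OIDC token (a JWT)
    return {
        "RoleArn": os.environ["AWS_ROLE_ARN"],
        "RoleSessionName": session_name,
        "WebIdentityToken": token,
    }
```

The returned dict matches what could be passed to `boto3.client("sts").assume_role_with_web_identity(**params)`, which is essentially what the SDKs do internally before caching the temporary credentials.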
Setting Up IRSA
1. Enable OIDC provider for your cluster (one-time setup):
```bash
eksctl utils associate-iam-oidc-provider \
  --cluster production-cluster \
  --approve
```
This creates an OIDC identity provider in IAM, allowing Kubernetes ServiceAccounts to assume IAM roles.
2. Create IAM role with trust policy for ServiceAccount:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::123456789012:oidc-provider/oidc.eks.us-east-1.amazonaws.com/id/EXAMPLED539D4633E53DE1B71EXAMPLE"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "oidc.eks.us-east-1.amazonaws.com/id/EXAMPLED539D4633E53DE1B71EXAMPLE:sub": "system:serviceaccount:payments:payment-service-sa",
          "oidc.eks.us-east-1.amazonaws.com/id/EXAMPLED539D4633E53DE1B71EXAMPLE:aud": "sts.amazonaws.com"
        }
      }
    }
  ]
}
```
The Condition restricts this role to the specific ServiceAccount (payment-service-sa in payments namespace). This prevents other ServiceAccounts from assuming the role.
3. Attach policies to the role:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject"
      ],
      "Resource": "arn:aws:s3:::payment-data-bucket/*"
    }
  ]
}
```
4. Create ServiceAccount with role annotation:
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: payment-service-sa
  namespace: payments
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/payment-service-role
```
5. Reference ServiceAccount in pod spec:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: payment-service
  namespace: payments
spec:
  serviceAccountName: payment-service-sa
  containers:
    - name: app
      image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/payment-service:1.2.3
      env:
        - name: AWS_REGION
          value: us-east-1
      # AWS SDK automatically discovers credentials via IRSA
```
Application code requires no changes. AWS SDKs automatically discover IRSA credentials via the pod's projected token volume. See IAM documentation for detailed trust policy patterns.
VPC CNI and Networking
EKS uses the Amazon VPC Container Network Interface (CNI) plugin, which assigns pods IP addresses from your VPC CIDR. This differs from many Kubernetes networking plugins (Calico, Flannel) that use overlay networks.
VPC CNI Architecture
How VPC CNI works:
- Each worker node has an Elastic Network Interface (ENI) with a primary IP
- VPC CNI attaches additional ENIs to the node as needed
- Each ENI gets secondary IP addresses from your VPC subnet
- Pods receive secondary IPs, making them first-class VPC citizens
- Pods communicate directly with other VPC resources (RDS, ElastiCache) using VPC routing (no NAT)
Each pod gets an IP directly from your VPC subnet. This enables pods to communicate with RDS databases, ElastiCache clusters, and other VPC resources without additional networking layers. Security Groups can apply directly to pods (via Security Groups for Pods feature), providing network-level isolation.
Pod Capacity Planning
Maximum pods per node depends on instance type (determined by number of ENIs and IPs per ENI). Example:
| Instance Type | Max ENIs | IPs per ENI | Max Pods |
|---|---|---|---|
| t3.small | 3 | 4 | 11 |
| t3.medium | 3 | 6 | 17 |
| m5.large | 3 | 10 | 29 |
| m5.xlarge | 4 | 15 | 58 |
| m5.2xlarge | 4 | 15 | 58 |
Formula: Max Pods = (Max ENIs × (IPs per ENI - 1)) + 2
IP exhaustion is a common EKS issue. If your subnet has a /24 CIDR (256 IPs), you can run ~8 m5.large nodes before exhausting IPs (8 nodes × 29 pods/node × 1 IP/pod = 232 IPs + node IPs).
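The formula and the exhaustion math above can be checked with a short script. `max_pods` and `nodes_per_subnet` are hypothetical helper names, and the subnet estimate deliberately ignores the handful of IPs AWS reserves per subnet:

```python
def max_pods(max_enis, ips_per_eni):
    """EKS max pods per node: each ENI's primary IP is not usable for pods,
    plus 2 for pods that use host networking (e.g., aws-node, kube-proxy)."""
    return max_enis * (ips_per_eni - 1) + 2


def nodes_per_subnet(subnet_ips, max_enis, ips_per_eni):
    """Rough count of nodes a subnet can hold before IP exhaustion:
    each node consumes its pod IPs plus one primary node IP.
    Illustrative only -- ignores AWS-reserved IPs and non-EKS usage."""
    per_node = max_pods(max_enis, ips_per_eni) + 1
    return subnet_ips // per_node
```

For an m5.large (3 ENIs, 10 IPs each) this gives 29 pods per node, and a /24 subnet (256 addresses) supports roughly 8 such nodes, matching the worked example above.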
Mitigation strategies:
1. Use larger VPC CIDR blocks (plan for growth). Start with /16 for production VPCs. See CIDR planning.
2. Enable CNI custom networking to use separate subnets for pods (nodes use one CIDR, pods use another). This separates concerns but adds complexity.
3. Use prefix delegation mode (assign entire /28 prefixes to ENIs instead of individual IPs). This increases pod density 3-4x.
4. Use Fargate for specific workloads (Fargate pods don't consume VPC IPs from node subnets).
5. Right-size node groups (fewer large nodes vs. many small nodes - large nodes are more IP-efficient).
Security Groups for Pods
Security Groups for Pods apply EC2 Security Groups directly to individual pods (not just nodes). This enables:
- Database pods allowing ingress only from application pods
- Application pods allowing ingress only from ALB
- Compliance with network isolation requirements (PCI-DSS, HIPAA)
```yaml
# Define SecurityGroupPolicy (custom resource)
apiVersion: vpcresources.k8s.aws/v1beta1
kind: SecurityGroupPolicy
metadata:
  name: payment-service-sg-policy
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: payment-service
  securityGroups:
    groupIds:
      - sg-0123456789abcdef0  # payment-service-sg (allows ingress from ALB SG)
```
Pods matching the selector get the specified Security Groups applied. This works alongside Kubernetes Network Policies, providing defense-in-depth. See network security for layered security patterns.
For comprehensive VPC design (subnets, routing, Security Groups, NACLs), see VPC and Networking.
EKS Add-Ons
EKS add-ons are Kubernetes components required for cluster operation. AWS manages versions, compatibility, and updates for these add-ons.
Core Add-Ons
1. VPC CNI (aws-node): Provides pod networking. Updated regularly for performance improvements and bug fixes. Enable automatic version updates for non-production clusters; pin and manually control versions in production.
2. CoreDNS: Provides DNS resolution for Services and Pods. Pods query CoreDNS for service discovery (payment-service.payments.svc.cluster.local). See Kubernetes service discovery.
3. kube-proxy: Maintains network rules for Service load balancing. Implements Service abstraction (ClusterIP → pod IPs).
Storage Drivers
EBS CSI Driver: Enables dynamic provisioning of EBS volumes as Persistent Volumes. Required for stateful workloads (databases, caching layers).
```yaml
# StorageClass using EBS CSI driver
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-encrypted
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "3000"
  throughput: "125"
  encrypted: "true"
volumeBindingMode: WaitForFirstConsumer  # Create volume in same AZ as pod
allowVolumeExpansion: true
```
EFS CSI Driver: Provides shared file storage (multi-pod read/write). Use for shared configuration, logs, or stateful apps requiring shared storage.
Install CSI drivers as EKS add-ons (managed lifecycle) rather than manually installing via Helm.
For comprehensive persistent storage patterns, see Kubernetes storage management and AWS storage services.
AWS Load Balancer Controller
The AWS Load Balancer Controller (formerly ALB Ingress Controller) provisions Application Load Balancers (ALB) and Network Load Balancers (NLB) from Kubernetes Ingress and Service resources.
Architecture
The controller continuously watches Kubernetes Ingress and Service resources. When you create an Ingress, the controller provisions an ALB, configures listener rules, creates Target Groups, and registers pod IPs (via VPC CNI, pods have VPC IPs). The ALB routes directly to pod IPs, bypassing kube-proxy and NodePort overhead.
Installing AWS Load Balancer Controller
Prerequisites:
- EKS cluster with OIDC provider enabled
- IAM role with load balancer permissions (using IRSA)
```bash
# Create IAM policy for load balancer controller
curl -o iam_policy.json https://raw.githubusercontent.com/kubernetes-sigs/aws-load-balancer-controller/v2.7.0/docs/install/iam_policy.json

aws iam create-policy \
  --policy-name AWSLoadBalancerControllerIAMPolicy \
  --policy-document file://iam_policy.json

# Create ServiceAccount with IRSA
eksctl create iamserviceaccount \
  --cluster=production-cluster \
  --namespace=kube-system \
  --name=aws-load-balancer-controller \
  --attach-policy-arn=arn:aws:iam::123456789012:policy/AWSLoadBalancerControllerIAMPolicy \
  --approve

# Install controller via Helm
helm repo add eks https://aws.github.io/eks-charts
helm repo update
helm install aws-load-balancer-controller eks/aws-load-balancer-controller \
  -n kube-system \
  --set clusterName=production-cluster \
  --set serviceAccount.create=false \
  --set serviceAccount.name=aws-load-balancer-controller
```
For Helm chart management patterns, see Helm documentation.
Ingress Annotations
The controller uses annotations to configure ALB behavior:
```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: payment-api-ingress
  namespace: payments
  annotations:
    # ALB configuration
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip  # Route to pod IPs (requires VPC CNI)
    alb.ingress.kubernetes.io/subnets: subnet-ghi789,subnet-jkl012  # Public subnets
    # HTTPS configuration
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS": 443}]'
    alb.ingress.kubernetes.io/ssl-redirect: "443"
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:us-east-1:123456789012:certificate/abc-123
    # Health check configuration
    alb.ingress.kubernetes.io/healthcheck-path: /actuator/health
    alb.ingress.kubernetes.io/healthcheck-interval-seconds: "30"
    alb.ingress.kubernetes.io/healthcheck-timeout-seconds: "5"
    alb.ingress.kubernetes.io/healthy-threshold-count: "2"
    alb.ingress.kubernetes.io/unhealthy-threshold-count: "3"
    # Security
    alb.ingress.kubernetes.io/security-groups: sg-0123456789abcdef0
    # Tags
    alb.ingress.kubernetes.io/tags: Environment=production,Team=platform
spec:
  ingressClassName: alb
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /payments
            pathType: Prefix
            backend:
              service:
                name: payment-service
                port:
                  number: 8080
          - path: /accounts
            pathType: Prefix
            backend:
              service:
                name: account-service
                port:
                  number: 8080
```
Key annotations:
- `target-type: ip` (recommended): ALB routes directly to pod IPs. Requires VPC CNI. Lower latency than `instance` mode (no NodePort overhead).
- `target-type: instance`: ALB routes to node IPs via NodePort. Use if not using VPC CNI.
- `scheme`: `internal` (private ALB, accessible only from VPC) or `internet-facing` (public ALB).
- `certificate-arn`: ACM certificate for HTTPS. See Route 53 and DNS for domain management.
For comprehensive Ingress patterns (path-based routing, header-based routing, canary deployments), see Kubernetes Ingress.
Cluster Autoscaling
EKS clusters need to scale both pods (Horizontal Pod Autoscaler) and nodes (Cluster Autoscaler or Karpenter).
Horizontal Pod Autoscaler (HPA)
HPA scales pod replicas based on CPU/memory utilization or custom metrics (requests per second, queue depth). See Kubernetes autoscaling for detailed HPA configuration.
Cluster Autoscaler
Cluster Autoscaler watches for pending pods (pods that can't be scheduled due to insufficient node capacity) and adds nodes to the Auto Scaling Group. Conversely, it removes nodes when utilization is low.
```yaml
# Cluster Autoscaler deployment (Helm values)
autoDiscovery:
  clusterName: production-cluster
  enabled: true
awsRegion: us-east-1
rbac:
  serviceAccount:
    annotations:
      eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/cluster-autoscaler-role
extraArgs:
  scale-down-delay-after-add: 10m
  scale-down-unneeded-time: 10m
  skip-nodes-with-local-storage: false
  skip-nodes-with-system-pods: false
```
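The scale-up decision is driven by pending pods' aggregate resource requests. A rough sketch of the arithmetic (the real autoscaler simulates per-node-group bin-packing rather than dividing totals; `nodes_to_add` is a hypothetical helper):

```python
import math


def nodes_to_add(pending_pods, node_cpu, node_mem_gi):
    """Back-of-envelope estimate of nodes needed for pending pods.

    Sums the pods' CPU and memory requests and divides by a node's
    allocatable capacity, taking whichever dimension needs more nodes.
    """
    cpu = sum(p["cpu"] for p in pending_pods)
    mem = sum(p["mem_gi"] for p in pending_pods)
    return max(math.ceil(cpu / node_cpu), math.ceil(mem / node_mem_gi))
```

For example, ten pending pods each requesting 1 vCPU and 4 GiB need three 4-vCPU/16-GiB nodes under this estimate; the real autoscaler would then add them one at a time, as noted below.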
Cluster Autoscaler limitations:
- Scales incrementally (adds one node at a time, checks, adds another if still needed)
- Requires pre-defined node groups (can't change instance types dynamically)
- Slower to respond to traffic spikes (5-10 minutes to provision new nodes)
- Doesn't consider cost optimization (always uses configured instance type)
Karpenter (Recommended)
Karpenter is an open-source autoscaler that provisions nodes directly (bypassing Auto Scaling Groups). Karpenter:
- Provisions nodes in seconds (vs. minutes with Cluster Autoscaler)
- Chooses optimal instance types dynamically (considers spot/on-demand, instance families, availability zones)
- Consolidates workloads onto fewer nodes to reduce costs
- Simplifies node management (no manual node group creation)
```yaml
# Karpenter Provisioner (defines instance selection criteria)
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot", "on-demand"]  # Prefer spot, fall back to on-demand
    - key: kubernetes.io/arch
      operator: In
      values: ["amd64", "arm64"]  # Support both x86 and Graviton
    - key: karpenter.k8s.aws/instance-category
      operator: In
      values: ["c", "m", "r"]  # Compute, general, memory optimized
    - key: karpenter.k8s.aws/instance-generation
      operator: Gt
      values: ["5"]  # Use generation 6+ (m6i, c6i, r6i or newer)
  limits:
    resources:
      cpu: 1000  # Max 1000 vCPUs across all Karpenter-managed nodes
      memory: 1000Gi
  providerRef:
    name: default
  ttlSecondsAfterEmpty: 30  # Remove empty nodes after 30 seconds
  ttlSecondsUntilExpired: 604800  # Recycle nodes after 7 days (force patching)
---
apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: default
spec:
  subnetSelector:
    karpenter.sh/discovery: production-cluster
  securityGroupSelector:
    karpenter.sh/discovery: production-cluster
  instanceProfile: KarpenterNodeInstanceProfile
  amiFamily: AL2  # Amazon Linux 2
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 50Gi
        volumeType: gp3
        encrypted: true
  tags:
    Environment: production
    ManagedBy: karpenter
```
Karpenter analyzes pending pod requirements (CPU, memory, architecture, node selectors, taints/tolerations) and provisions the most cost-effective instances matching those requirements. If a pod requests arch: arm64, Karpenter provisions Graviton instances. If pods request GPU, Karpenter provisions GPU instances.
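A toy version of that selection step: filter offerings by the pod's requirements, then take the cheapest match. The instance names and hourly prices below are illustrative placeholders (not live AWS pricing), `pick_instance` is a hypothetical helper, and Karpenter's real solver also weighs spot capacity, AZ spread, and consolidation:

```python
def pick_instance(pod, offerings):
    """Choose the cheapest offering that satisfies a pending pod's
    CPU/memory requests and (optional) architecture requirement."""
    candidates = [
        o for o in offerings
        if o["cpu"] >= pod["cpu"]
        and o["mem_gi"] >= pod["mem_gi"]
        and pod.get("arch", o["arch"]) == o["arch"]  # no arch = any arch
    ]
    if not candidates:
        raise ValueError("no offering satisfies the pod requirements")
    return min(candidates, key=lambda o: o["price"])


offerings = [  # hypothetical on-demand prices, USD/hour
    {"type": "m6i.large",  "arch": "amd64", "cpu": 2, "mem_gi": 8, "price": 0.096},
    {"type": "m7g.large",  "arch": "arm64", "cpu": 2, "mem_gi": 8, "price": 0.082},
    {"type": "c6i.xlarge", "arch": "amd64", "cpu": 4, "mem_gi": 8, "price": 0.170},
]
```

With these numbers, a 2-vCPU/4-GiB pod with no architecture constraint lands on the cheaper Graviton offering, illustrating why multi-arch images pay off.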
Recommendation: Use Karpenter for production clusters. It significantly reduces operational overhead and cost compared to Cluster Autoscaler + manual node group management.
For autoscaling costs and tradeoffs, see cost optimization.
Observability Integration
EKS integrates with AWS CloudWatch and third-party observability tools (Prometheus, Grafana, Datadog, New Relic).
CloudWatch Container Insights
Container Insights collects metrics and logs from EKS clusters, providing visibility into cluster, node, pod, and container performance.
Metrics collected:
- Cluster-level: CPU/memory utilization, node count, pod count
- Node-level: CPU/memory/disk/network per node
- Pod-level: CPU/memory per pod, container restarts
- Namespace-level: Resource usage per namespace
Install via Helm or CloudFormation:
```bash
# Install CloudWatch Agent and Fluent Bit via Helm
helm repo add aws https://aws.github.io/eks-charts
helm install aws-cloudwatch-metrics aws/aws-cloudwatch-metrics \
  --namespace amazon-cloudwatch \
  --create-namespace \
  --set clusterName=production-cluster

helm install aws-for-fluent-bit aws/aws-for-fluent-bit \
  --namespace amazon-cloudwatch \
  --set cloudWatch.region=us-east-1 \
  --set cloudWatch.logGroupName=/aws/eks/production-cluster/containers
```
View metrics in CloudWatch Console → Container Insights → EKS Clusters.
For comprehensive observability patterns, see:
- Observability Overview - Three pillars (logs, metrics, traces)
- Logging Best Practices - Structured logging, correlation IDs
- Metrics and Monitoring - Application metrics, RED/USE methods
- Distributed Tracing - X-Ray, OpenTelemetry
Prometheus and Grafana
For advanced metrics (custom application metrics, service-level objectives, long-term retention), deploy Prometheus and Grafana via Helm:
```bash
# Install Prometheus Operator (includes Grafana)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace
```
Expose custom metrics from Spring Boot applications via Micrometer and scrape with Prometheus. See Spring Boot observability.
Security Best Practices
EKS security requires multiple layers: IAM, network policies, pod security, secrets management.
Pod Security Standards
Pod Security Standards (PSS) replace deprecated PodSecurityPolicies, enforcing security controls at the namespace level. Three policies:
- Privileged: Unrestricted (for system components like CNI plugins)
- Baseline: Prevents known privilege escalations (blocks privileged containers, host namespaces)
- Restricted: Strict hardening (requires dropping all capabilities, running as non-root, read-only root filesystem)
```yaml
# Enforce Restricted policy at namespace level
apiVersion: v1
kind: Namespace
metadata:
  name: payments
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
```
Pods violating the policy are rejected. See Kubernetes security.
Network Policies
Network Policies control pod-to-pod communication at the network layer. Default Kubernetes behavior allows all pods to communicate. Network Policies enforce deny-by-default.
```yaml
# Deny all ingress traffic to payment-service pods (except from API gateway)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: payment-service-ingress
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: payment-service
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: api-gateway
          podSelector:
            matchLabels:
              app: api-gateway
      ports:
        - protocol: TCP
          port: 8080
```
Combine Network Policies with Security Groups for Pods for defense-in-depth. See security overview.
Secrets Management
Never store secrets in ConfigMaps or environment variables in plain text. Use AWS Secrets Manager or SSM Parameter Store with the Secrets Store CSI Driver to inject secrets as mounted volumes.
```yaml
# SecretProviderClass (mounts AWS Secrets Manager secret as volume)
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: payment-service-secrets
  namespace: payments
spec:
  provider: aws
  parameters:
    objects: |
      - objectName: "payment-db-password"
        objectType: "secretsmanager"
        objectAlias: "db-password"
---
# Pod using SecretProviderClass
apiVersion: v1
kind: Pod
metadata:
  name: payment-service
  namespace: payments
spec:
  serviceAccountName: payment-service-sa
  containers:
    - name: app
      image: payment-service:1.2.3
      volumeMounts:
        - name: secrets
          mountPath: /mnt/secrets
          readOnly: true
  volumes:
    - name: secrets
      csi:
        driver: secrets-store.csi.k8s.io
        readOnly: true
        volumeAttributes:
          secretProviderClass: payment-service-secrets
```
Application reads secret from /mnt/secrets/db-password. Secrets are never stored in etcd. See secrets management.
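Reading the mounted secret from application code is plain file I/O. A minimal sketch, with `read_secret` as a hypothetical helper and the file name taken from the objectAlias in the SecretProviderClass:

```python
from pathlib import Path


def read_secret(name, base="/mnt/secrets"):
    """Read a secret injected by the Secrets Store CSI driver as a file.

    The file name matches the objectAlias ("db-password" above). Reading
    from the mounted volume keeps the value out of environment variables
    and pod specs.
    """
    return Path(base, name).read_text().strip()
```

For example, `read_secret("db-password")` returns the database password inside the pod; the `base` parameter exists only so the helper is testable outside a cluster.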
Image Scanning
Scan container images for vulnerabilities before deploying to EKS. Use Amazon ECR image scanning or third-party tools (Trivy, Snyk, Aqua).
```bash
# Enable ECR image scanning on push
aws ecr put-image-scanning-configuration \
  --repository-name payment-service \
  --image-scanning-configuration scanOnPush=true

# Retrieve scan findings
aws ecr describe-image-scan-findings \
  --repository-name payment-service \
  --image-id imageTag=1.2.3
```
Block deployments of images with critical vulnerabilities in CI/CD pipelines. See CI/CD pipelines.
Cluster Upgrades and Maintenance
EKS clusters require periodic upgrades to stay on supported Kubernetes versions. AWS supports each Kubernetes minor version for 14 months after release.
Upgrade Strategy
1. Check compatibility: Review EKS Kubernetes version compatibility for breaking changes, deprecated APIs, and add-on version requirements.
2. Upgrade add-ons: Update VPC CNI, CoreDNS, kube-proxy, and CSI drivers to versions compatible with target Kubernetes version.
3. Upgrade control plane:
```bash
eksctl upgrade cluster --name production-cluster --version 1.29 --approve
```
Control plane upgrades complete in 20-30 minutes with zero downtime (API server remains available).
4. Upgrade node groups:
```bash
eksctl upgrade nodegroup \
  --cluster production-cluster \
  --name general-purpose \
  --kubernetes-version 1.29
```
Node upgrades replace nodes one at a time, gracefully draining pods before termination. This process takes 30-60 minutes per node group.
5. Validate: Run integration tests, check application health, monitor metrics.
Testing recommendations:
- Test upgrades in non-production environment first
- Review Kubernetes changelogs for deprecated APIs (use `kubectl-convert` to migrate manifests)
- Audit clusters and manifests for deprecated API usage before upgrading; `kubectl get all --all-namespaces` alone does not report API deprecations, so consider tools built for this (e.g., pluto or kube-no-trouble)
For comprehensive testing strategies, see integration testing and chaos engineering.
Cost Optimization
EKS costs include:
- Control plane: $0.10/hour per cluster (~$73/month)
- Worker nodes: EC2 instance costs (or Fargate task costs)
- Data transfer: Cross-AZ traffic, NAT Gateway traffic
Optimization Strategies
1. Right-size pods and nodes: Use Vertical Pod Autoscaler to recommend resource requests/limits. Avoid over-provisioning. See resource management.
2. Use Spot instances: Run 50-80% of workload on Spot (stateless apps, batch jobs) for 70-90% discount. Use Karpenter to automatically mix Spot and On-Demand.
3. Use Graviton instances: 20-40% cost savings vs. x86. Ensure multi-arch images.
4. Consolidate workloads: Use fewer, larger nodes instead of many small nodes (reduces per-node overhead: kubelet, kube-proxy, CNI).
5. Reduce cross-AZ traffic: Use topology-aware routing to prefer pods in the same AZ as the requester (reduces data transfer costs).
6. Use VPC endpoints: Access S3, ECR, and other AWS services via VPC endpoints (free gateway endpoints, cost-effective interface endpoints) instead of routing through NAT Gateways.
7. Implement PodDisruptionBudgets: Enable safe node draining during scale-down without affecting availability. See high availability patterns.
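Strategies 5 and 6 both target per-GB charges, and a back-of-envelope calculator makes the tradeoff concrete. The rates below are illustrative assumptions (cross-AZ transfer billed in each direction, NAT Gateway per-GB processing on top), not current AWS pricing, and `monthly_transfer_cost` is a hypothetical helper:

```python
def monthly_transfer_cost(gb_cross_az, gb_via_nat,
                          cross_az_rate=0.01, nat_rate=0.045):
    """Rough monthly data transfer cost in USD.

    Assumed rates: cross-AZ traffic billed per GB in each direction
    (hence the factor of 2); NAT Gateway charges per GB processed.
    Check the current AWS price list before relying on these numbers.
    """
    cross_az = gb_cross_az * cross_az_rate * 2  # billed both directions
    nat = gb_via_nat * nat_rate
    return round(cross_az + nat, 2)
```

Under these assumed rates, 1 TB of cross-AZ chatter plus 500 GB through a NAT Gateway costs about $42.50/month; topology-aware routing and VPC endpoints attack those two terms directly.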
For comprehensive cost optimization, see AWS cost optimization.
Common EKS Anti-Patterns
Avoid these mistakes that create operational complexity, security risks, or cost overruns:
Exposing the API server through the public endpoint only: Enable both public and private endpoints. Nodes should use the private endpoint (lower latency, no NAT costs).
Not using IRSA: Attaching IAM policies to node roles grants permissions to all pods on the node (overly permissive). Use IRSA for pod-level permissions.
Ignoring IP exhaustion: VPC CNI consumes many IPs. Plan subnet sizes carefully or use prefix delegation mode.
Over-provisioning pods: Setting requests: 8Gi memory when app uses 512Mi wastes capacity and increases costs. Profile applications and set realistic requests.
Not setting resource limits: Pods without limits can consume all node resources, starving other pods. Always set limits.
Using latest image tags: image: myapp:latest makes deployments non-reproducible. Use semantic versioning (myapp:1.2.3).
Storing secrets in ConfigMaps: ConfigMaps are not encrypted. Use Secrets with encryption at rest or Secrets Manager.
Single node group for all workloads: Mixing batch jobs and critical APIs on the same nodes creates resource contention. Use separate node groups with taints/tolerations.
Not implementing health checks: Pods without readiness probes receive traffic before ready, causing errors. Always implement liveness and readiness probes. See health checks.
Manual kubectl deployments: Deploying via kubectl apply manually creates inconsistency. Use GitOps (ArgoCD, Flux) or CI/CD pipelines with version-controlled manifests.
Further Reading
EKS documentation and Kubernetes ecosystem are vast. Continue learning with these resources:
- AWS EKS Best Practices Guide - Comprehensive AWS-maintained guide
- AWS EKS Workshop - Hands-on tutorials
- VPC CNI Documentation - Deep dive on networking
- Karpenter Documentation - Advanced autoscaling patterns
- AWS Load Balancer Controller Documentation - Ingress patterns
For Kubernetes fundamentals and best practices, see Kubernetes documentation. For application packaging, see Helm charts. For Infrastructure as Code, see Terraform on AWS and Terraform best practices. For observability, see CloudWatch observability and general observability practices. For container image optimization, see Docker guidelines.