EKS is AWS's managed Kubernetes — AWS owns the control plane (etcd, API server, scheduler, controllers) spread across 3+ AZs. You own the data plane (nodes, pods, workloads). This split is the foundation of every EKS tradeoff.
AWS manages: etcd (3–5 nodes, quorum-managed) · API Server (multi-instance behind NLB) · Controller Manager · Scheduler · Control plane VPC + cross-account ENIs · Automatic HA across 3 AZs · Patching, backups, scaling · OIDC endpoint · CloudWatch log forwarding
You manage: Worker nodes (EC2 or Fargate) · OS patches & AMI updates · Kubernetes workloads, Namespaces, RBAC, CRDs · Pod config, resource requests · Security Groups, IAM roles · Node group scaling policies · Add-on version upgrades · VPC/subnet design
| Component | Role | EKS Behavior |
|---|---|---|
| etcd | Distributed key-value store — all cluster state | 3–5 nodes, AWS manages quorum, automatic backups, multi-AZ |
| API Server | All kubectl / SDK calls entry point | Multiple instances behind NLB, auto-scales under load |
| Controller Manager | Reconcile loops (ReplicaSets, Deployments…) | AWS managed; cloud controller runs separately in modern EKS |
| Scheduler | Assigns pods to nodes (resources/taints/affinity) | AWS managed, auto-replaced on failure |
| OIDC Endpoint | Issues tokens for IAM↔K8s identity mapping (IRSA) | Auto-created per cluster; associate via eksctl |
AWS provisions Cross-Account ENIs (X-ENIs) inside your VPC subnets. Traffic between your nodes and the AWS-managed API Server flows through these ENIs — no public internet required in private endpoint mode.
AWS Console: click-through creation. Good for learning. Not reproducible. Avoid for production.
eksctl: official CLI. Fastest path to a cluster. Generates CloudFormation under the hood. Best for learning/labs.
Terraform: HCL + official EKS Blueprints module. Remote state in S3+DynamoDB. Best for production IaC.
AWS CDK: TypeScript/Python code → CloudFormation or direct API. Good for platform teams with existing code pipelines.
Always configure remote backend (S3 + DynamoDB locking) before creating EKS with Terraform. Local state + team collaboration = corruption.
If you create via Console then try to manage with eksctl/Terraform, you'll have state drift. Pick one tool per cluster and stick to it.
| Tool | Purpose | Install |
|---|---|---|
| kubectl | K8s control — deployments, pods, services, logs | brew install kubectl |
| eksctl | EKS-specific cluster/nodegroup/addon management | brew install eksctl |
| aws CLI v2 | AWS API — EKS, IAM, EC2, ECR etc. | brew install awscli |
| helm | Package manager for Kubernetes charts | brew install helm |
| eksdemo | Demo/lab tool — VPC/subnet/ENI inspection | brew tap aws/tap && brew install eksdemo |
| kubectx / kubens | Fast context and namespace switching | brew install kubectx |
| kubent (kube-no-trouble) | Scan for deprecated API usage before upgrades | brew install kubent |
EKS networking uses AWS VPC CNI by default — each pod gets a real VPC IP from your subnet via secondary ENI IPs. No overlay, lowest latency, but IP exhaustion is a real production risk you must plan for upfront.
| CNI | Pod IP Source | Overlay | Latency | Use Case |
|---|---|---|---|---|
| AWS VPC CNI | VPC subnet IPs | No | Lowest | Default EKS, direct VPC routing, no overhead |
| Cilium | Custom CIDR | Yes (Geneve/eBPF) | Variable | Advanced L7 policies, observability, service mesh |
| Calico | Custom CIDR | Yes (IP-in-IP) | Low overhead | NetworkPolicy-heavy, BGP routing |
| Flannel | Custom CIDR | Yes (VXLAN) | Medium | Simple dev clusters |
Without prefix delegation, each secondary ENI slot holds 1 IP → 1 pod. With prefix delegation, each slot holds a /28 block = 16 IPs. Same instance, dramatically more pods.
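The capacity difference is easy to quantify. A back-of-envelope sketch in bash, using the 4-ENI × 15-slot figures this guide quotes for an m5.xlarge (kubelet's actual max-pods setting also reserves a few IPs for the node itself):

```shell
# Pod-IP capacity per node under the VPC CNI (figures from this guide)
ENIS=4
SLOTS=15   # secondary IP slots per ENI on an m5.xlarge

echo "standard mode:     $((ENIS * SLOTS)) pod IPs"        # 1 IP per slot -> 60
echo "prefix delegation: $((ENIS * SLOTS * 16)) pod IPs"   # each slot holds a /28 (16 IPs) -> 960
```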
WARM_ENI_TARGET: keep N spare ENIs attached. Default: 1. Higher = faster pod starts, more idle IPs consumed.
WARM_IP_TARGET: keep N spare IPs ready. More granular than the ENI target. Combine with MINIMUM_IP_TARGET for small nodes.
WARM_PREFIX_TARGET: for prefix delegation, keep N spare /28 prefixes attached. Default: 1. Enables instant pod scheduling.
Enable Prefix Delegation

```bash
# Enable on existing cluster
kubectl set env ds aws-node -n kube-system \
  ENABLE_PREFIX_DELEGATION=true \
  WARM_PREFIX_TARGET=1

# Verify — sort pods by IP to see /28 blocks
kubectl get pods -o wide --sort-by='.status.podIP'
# You'll see blocks like 192.168.114.16–31 on the same node
```
By default all pods can reach all other pods (no isolation). NetworkPolicy resources add L3/L4 traffic rules. Use aws-eks-nodeagent (eBPF) or Cilium/Calico for enforcement.
deny-all ingress default

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
spec:
  podSelector: {}   # all pods
  policyTypes:
    - Ingress       # no rules = deny all ingress
```

allow frontend → backend

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend
spec:
  podSelector:
    matchLabels:
      app: backend
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - port: 8080
```
Enable VPC CNI enforcement: kubectl set env ds aws-node -n kube-system ENABLE_NETWORK_POLICY_CONTROLLER=true. Or use Cilium for L7 policies + deep observability.

Public subnets: nodes with public IPs, IGW route. For load balancers only. Tag: kubernetes.io/role/elb=1
Private subnets: worker nodes (recommended). NAT Gateway for egress. Tag: kubernetes.io/role/internal-elb=1
Use /16 VPC per cluster. /22 subnets minimum per AZ. A /24 (250 IPs) fills up with prefix delegation enabled on just a few nodes.
With AWS VPC CNI, each pod consumes a real VPC IP. A single m5.xlarge can consume 60 IPs (4 ENIs × 15 IPs). On a /24 subnet = 4 nodes max. Plan /22 or /21 subnets per AZ.
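To see why /22 or larger matters, the usable-IP count per prefix can be computed directly (AWS reserves 5 addresses in every subnet):

```shell
# Usable IPs per subnet size: 2^(32 - prefix) minus the 5 AWS-reserved addresses
for prefix in 24 22 21; do
  echo "/${prefix}: $(( (1 << (32 - prefix)) - 5 )) usable IPs"
done
```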
Running multiple clusters in one VPC multiplies IP consumption. Isolate clusters in dedicated VPCs unless you have a specific reason to share.
Three compute models: Managed Node Groups (EC2 ASG, you own the OS), Fargate (serverless pods), and Karpenter (dynamic EC2 provisioner, workload-driven). The right choice depends on workload type, cost target, and ops burden tolerance.
| Dimension | Managed Node Groups | Fargate | Karpenter |
|---|---|---|---|
| Node ownership | You (EC2) | AWS managed | You (EC2) |
| OS patching | You update AMIs | AWS manages | You (NodeClass AMI config) |
| Startup latency | Fast (pre-provisioned) | 30–60s cold start | 30–90s (EC2 boot) |
| Scaling granularity | Group-level ASG | Per pod | Per pod |
| Spot support | Manual per node group | No spot on Fargate | Auto least-cost with spot |
| DaemonSets | ✓ | ✗ | ✓ |
| GPUs / NVMe | ✓ | ✗ | ✓ |
| Host networking | ✓ | ✗ | ✓ |
| Persistent EBS | ✓ | ✗ (EFS only) | ✓ |
| Auto consolidation | ✗ manual | N/A (pod-per-node) | ✓ built-in |
Default is 1 node at a time — very slow for large clusters. Increase maxUnavailable in the nodegroup update config. Ensure your workloads have replicas > maxUnavailable to avoid downtime.
If a pod uses EBS and gets rescheduled to a different AZ, it can't reattach the volume. Use volumeBindingMode: WaitForFirstConsumer on StorageClass and consider per-AZ node groups for stateful workloads.
NodePool CRD

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
  limits:
    cpu: 1000
    memory: 1000Gi
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized  # karpenter.sh/v1 name; v1beta1 called it WhenUnderutilized
    consolidateAfter: 30s
```

EC2NodeClass

```yaml
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiSelectorTerms:
    - alias: al2023@latest
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  role: KarpenterNodeRole-my-cluster
```
Always set resources.requests on all containers. Without them, Karpenter can't determine the right instance size and may bin-pack everything onto one tiny instance.
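A minimal sketch of what that looks like on a container (names and numbers here are illustrative, not from this guide):

```yaml
# Hypothetical Deployment container: requests are Karpenter's sizing signal
containers:
  - name: api
    image: my-registry/api:1.0   # placeholder image
    resources:
      requests:
        cpu: 500m        # Karpenter sums requests across pending pods
        memory: 512Mi    # and picks the cheapest instance type that fits
      limits:
        memory: 512Mi    # memory limit = request avoids OOM surprises
```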
Karpenter drains nodes during consolidation. Without PDBs, critical services may briefly have 0 replicas. Always define PDBs for production workloads.
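A minimal PDB sketch, assuming a Deployment labeled app: web running 3+ replicas:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb            # hypothetical name
spec:
  minAvailable: 2          # evictions (e.g. Karpenter consolidation) pause if fewer than 2 would remain
  selector:
    matchLabels:
      app: web
```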
They will fight over scale-down decisions on the same node groups. Pick one — Karpenter is the recommended choice for new clusters.
EKS storage: EBS (block, AZ-bound, databases), EFS (shared file, multi-AZ, Fargate-compatible), FSx Lustre (HPC/ML), FSx ONTAP (enterprise). All use the CSI driver pattern — a controller Deployment + node DaemonSet.
| Service | Type | Access Mode | AZ Scope | Best For | Fargate |
|---|---|---|---|---|---|
| EBS | Block | RWO | Single AZ | Databases, stateful apps | ✗ |
| EFS | File (NFS) | RWX | Multi-AZ | Shared config, CMS, ML datasets | ✓ |
| FSx Lustre | High-perf parallel FS | RWX | Single AZ | HPC, ML training, batch | ✗ |
| FSx ONTAP | Enterprise NAS | RWO/RWX | Multi-AZ | SMB/NFS/iSCSI enterprise lift-shift | ✗ |
| Instance Store | NVMe local | RWO | Node-local ephemeral | Temp data, cache, shuffle | ✗ |
gp3: general purpose SSD. 3,000 IOPS baseline, configurable up to 16,000. Cheaper than gp2 at the same size. Default for new clusters. Always use gp3 unless you have specific IOPS needs.
gp2: IOPS tied to size (3 IOPS/GB, burst to 3,000). Many clusters still default to gp2; override explicitly. Avoid for new provisioning.
io1/io2: provisioned IOPS SSD up to 64,000 (io2 Block Express: 256,000). For latency-sensitive databases. io2 supports multi-attach (RWX for EC2).
StorageClass · gp3 recommended

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer  # CRITICAL: delays PV creation until pod AZ is known
allowVolumeExpansion: true
parameters:
  type: gp3
  encrypted: "true"
  iops: "3000"
  throughput: "125"
```
EBS volumes are bound to a single AZ. If your pod reschedules to a different AZ, it cannot attach the volume → pod stays Pending. Always use volumeBindingMode: WaitForFirstConsumer so the PV is provisioned in the same AZ as the scheduled pod.
Clusters created before EKS 1.23 may still have gp2 as the default StorageClass. Override: annotate gp3 with storageclass.kubernetes.io/is-default-class: "true" and remove it from gp2.
Fargate pods cannot attach EBS volumes. Use EFS with ReadWriteMany for Fargate persistent storage.
EFS is NFS-backed, multi-AZ, and supports ReadWriteMany — multiple pods across nodes/AZs can mount simultaneously. The only persistent storage option for Fargate.
StorageClass · EFS dynamic provisioning

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap            # access point per PVC (isolation)
  fileSystemId: fs-0123456789abcdef
  directoryPerms: "700"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-data
spec:
  storageClassName: efs-sc
  accessModes: [ReadWriteMany]        # key difference: multiple pods mount simultaneously
  resources:
    requests:
      storage: 5Gi                    # EFS doesn't enforce size, but the field is required
```
EKS security has two axes: cluster access (who can kubectl) via aws-auth/EKS Access Entries, and pod AWS permissions (what AWS APIs can a pod call) via IRSA or Pod Identity. Never give nodes broad IAM roles.
aws-auth ConfigMap (legacy): maps IAM ARNs → K8s usernames/groups. Lives in kube-system. YAML-based, fragile: a malformed edit can lock everyone out of the cluster. Being replaced by EKS Access Entries.
EKS Access Entries: native EKS API for cluster access. No ConfigMap required. Managed via AWS Console/CLI/Terraform. More auditable, CloudTrail-logged. Recommended for all new clusters.
aws-auth — safe management via eksctl

```bash
# Safe: eksctl validates before applying
eksctl create iamidentitymapping \
  --cluster my-cluster \
  --arn arn:aws:iam::123456789012:role/TeamDevRole \
  --group system:masters \
  --username team-dev

# RISKY: direct edit — backup first!
kubectl get cm aws-auth -n kube-system -o yaml > aws-auth-backup.yaml
kubectl edit cm aws-auth -n kube-system

# EKS Access Entries (new API)
aws eks create-access-entry \
  --cluster-name my-cluster \
  --principal-arn arn:aws:iam::123456789012:role/TeamDevRole

aws eks associate-access-policy \
  --cluster-name my-cluster \
  --principal-arn arn:aws:iam::123456789012:role/TeamDevRole \
  --policy-arn arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy \
  --access-scope type=cluster
```
Prefer eksctl create iamidentitymapping over direct kubectl edit. Keep a backup of the current ConfigMap before any changes.

IRSA binds an IAM role to a K8s ServiceAccount. The EKS mutating webhook injects env vars + a projected token into pods. The AWS SDK exchanges the token for STS temporary credentials automatically.
```bash
# Required once per cluster
eksctl utils associate-iam-oidc-provider --cluster my-cluster --approve

# Get OIDC issuer URL
aws eks describe-cluster --name my-cluster \
  --query "cluster.identity.oidc.issuer" --output text
```
IAM role trust policy:

```json
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": { "Federated": "arn:aws:iam::ACCOUNT:oidc-provider/oidc.eks.REGION.amazonaws.com/id/OIDCID" },
    "Action": "sts:AssumeRoleWithWebIdentity",
    "Condition": { "StringEquals": {
      "oidc.eks.REGION.amazonaws.com/id/OIDCID:sub": "system:serviceaccount:NAMESPACE:SA-NAME"
    }}
  }]
}
```
```yaml
# ServiceAccount
metadata:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/MyPodRole
---
# Pod spec
spec:
  serviceAccountName: s3-reader   # webhook injects AWS_ROLE_ARN + AWS_WEB_IDENTITY_TOKEN_FILE
```
Each EKS cluster gets its own OIDC provider. At 100+ clusters you hit the account limit. Consider Pod Identity (no OIDC required) for large fleets.
Each pod startup = 1 STS AssumeRoleWithWebIdentity call. At thousands of pod starts/min, you can hit per-role STS throttling. Pod Identity caches credentials locally — better at scale.
Pod Identity (EKS 1.24+) removes the OIDC provider requirement. Mappings stored in EKS control plane, not pod annotations. A privileged DaemonSet on each node proxies and caches credentials.
| Feature | Kube2IAM | IRSA | Pod Identity |
|---|---|---|---|
| OIDC provider needed | ✗ | ✓ | ✗ |
| Pod annotations needed | ✓ | ✓ | ✗ |
| Local credential proxy + cache | ✗ | ✗ | ✓ |
| ABAC / tag-based policies | ✗ | ✗ | ✓ |
| Works off EKS | ✓ | ✓ (OIDC) | ✗ EKS only |
Pod Identity setup

```bash
# 1. Install Pod Identity Agent addon
eksctl create addon --name eks-pod-identity-agent --cluster my-cluster

# 2. Associate role — no SA annotation needed
aws eks create-pod-identity-association \
  --cluster-name my-cluster \
  --namespace default \
  --service-account my-app-sa \
  --role-arn arn:aws:iam::123456789012:role/MyPodRole

# 3. Verify
aws eks list-pod-identity-associations --cluster-name my-cluster
```
Native K8s Secrets: Base64 encoded, stored in etcd. Not encrypted at rest by default. Any pod in the namespace can access them. Always enable KMS envelope encryption.
AWS Secrets Manager: External Secrets Operator (ESO) or Secrets Store CSI Driver syncs secrets → K8s. Rotation support, CloudTrail audit trail, cross-account support.
SSM Parameter Store: cheaper than Secrets Manager for non-sensitive config. Use SecureString for sensitive values. Accessible via ESO or SSM Agent.
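A hedged sketch of the ESO pattern: an ExternalSecret that syncs a Secrets Manager entry into a K8s Secret. The SecretStore, secret names, and keys here are assumptions, not from this guide:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials
spec:
  refreshInterval: 1h              # periodic re-sync picks up rotations
  secretStoreRef:
    name: aws-secrets-manager      # a SecretStore you define (e.g. with IRSA credentials)
    kind: SecretStore
  target:
    name: db-credentials           # resulting K8s Secret name
  data:
    - secretKey: password
      remoteRef:
        key: prod/db               # Secrets Manager secret name
        property: password         # JSON key inside the secret value
```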
Enable KMS encryption for etcd secrets

```bash
# Enable at cluster creation
eksctl create cluster \
  --name my-cluster \
  --with-oidc \
  --secrets-encryption-key-arn arn:aws:kms:REGION:ACCOUNT:key/KEY-ID
```
AWS Load Balancer Controller (AWS LBC) provisions ALBs (HTTP/HTTPS L7) and NLBs (TCP/UDP L4) from Kubernetes Ingress and Service resources. Understanding the full traffic path is essential for debugging and avoiding cross-AZ cost surprises.
| Feature | ALB (Application LB) | NLB (Network LB) |
|---|---|---|
| OSI Layer | L7 (HTTP/HTTPS) | L4 (TCP/UDP/TLS) |
| Routing rules | Host, path, header, query string | Port-based only |
| Latency | Higher (L7 processing overhead) | Ultra-low (<1ms) |
| Source IP | Via X-Forwarded-For header | Preserved natively |
| Static IP / Elastic IP | ✗ (DNS hostname only) | ✓ |
| AWS PrivateLink | ✗ | ✓ |
| gRPC / HTTP2 | ✓ (native) | ✓ (TLS passthrough) |
| WebSocket | ✓ | ✓ |
| K8s resource | Ingress / IngressClass | Service type:LoadBalancer |
| Best for | HTTP APIs, microservices, HTTPS termination | gRPC, databases, real-time, static IP needs |
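The ALB path uses an Ingress; the NLB counterpart is a Service of type LoadBalancer with AWS LBC annotations. A sketch (service name and ports are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: grpc-backend               # hypothetical service
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: external
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip   # direct to pod IPs
    service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
spec:
  type: LoadBalancer
  selector:
    app: grpc-backend
  ports:
    - port: 443
      targetPort: 8443
```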
Install AWS LBC via Helm

```bash
# Add Helm repo
helm repo add eks https://aws.github.io/eks-charts
helm repo update

# Install (requires IAM SA pre-created with IRSA)
helm upgrade --install aws-load-balancer-controller eks/aws-load-balancer-controller \
  -n kube-system \
  --set clusterName=my-cluster \
  --set serviceAccount.create=false \
  --set serviceAccount.name=aws-load-balancer-controller

# Verify
kubectl get deployment -n kube-system aws-load-balancer-controller
kubectl logs -n kube-system -l app.kubernetes.io/name=aws-load-balancer-controller | tail -20
```
Ingress · ALB with IngressGroup (share one ALB)

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip   # direct to pod, skips kube-proxy
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:...
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTP":80},{"HTTPS":443}]'
    alb.ingress.kubernetes.io/ssl-redirect: "443"
    alb.ingress.kubernetes.io/group.name: my-app   # share one ALB across Ingresses
```
Prefer target-type: ip (routes directly to the pod IP) over target-type: instance (routes via NodePort → kube-proxy). IP mode skips kube-proxy entirely, reduces hops, and requires VPC CNI pod IPs, which EKS provides by default.

Without alb.ingress.kubernetes.io/group.name, every Ingress resource provisions its own ALB. Use IngressGroups to share one ALB across multiple services — critical for cost control.
Unlike NLBs, ALBs go through target health check warm-up. Don't expect instant availability after applying an Ingress. Check events on the Ingress object: kubectl describe ingress <name>
Kubernetes releases 3 minor versions/year, and EKS supports several of them at any time. Each version gets ~14 months of standard support; after that it enters paid extended support, and eventually AWS auto-upgrades your control plane with warning. Plan upgrades annually minimum — one minor version at a time.
| Add-on | Purpose | Required? |
|---|---|---|
| vpc-cni | AWS VPC CNI — pod IP allocation from VPC | Yes |
| kube-proxy | Service networking (iptables / ipvs rules) | Yes |
| coredns | Cluster DNS resolution | Yes |
| aws-ebs-csi-driver | EBS PersistentVolume support | If using EBS |
| eks-pod-identity-agent | Pod Identity credential proxy DaemonSet | If using Pod Identity |
| adot | AWS Distro for OpenTelemetry | If using OTEL |
| aws-guardduty-agent | Runtime threat detection for pods | Security baseline |
Control plane logs: enable API server, audit, authenticator, scheduler, controller-manager log types. Essential for auth debugging. Configure via EKS console or CLI.
Prometheus + Grafana: kube-state-metrics + node-exporter for cluster metrics. Helm chart: kube-prometheus-stack. AWS Managed Prometheus available as a serverless option.
CloudWatch Container Insights: per-pod CPU/memory/disk/network via Fluent Bit DaemonSet. Simpler for CloudWatch-centric teams.
Enable all control plane logging

```bash
aws eks update-cluster-config \
  --name my-cluster \
  --logging \
  '{"clusterLogging":[{"types":["api","audit","authenticator","controllerManager","scheduler"],"enabled":true}]}'
```
EKS only supports upgrading one minor version at a time: 1.27→1.28→1.29→1.30. Attempting to skip fails with an API error. Plan for ~30 min per version hop.
After upgrading the control plane, manually upgrade each addon. Running mismatched versions (e.g., old vpc-cni with new K8s) causes silent networking issues.
PSP was removed in Kubernetes 1.25. If on <1.25 with PSPs, migrate to Pod Security Admission (PSA) or OPA/Gatekeeper before upgrading past 1.24. Use kubent (kube-no-trouble) to detect.
Fargate pods don't upgrade in place. After upgrading control plane, trigger a Deployment rollout (kubectl rollout restart deployment/<name>). New pods land on Fargate nodes with the updated K8s runtime.
Blue/green cluster migration: running both clusters during the migration window doubles your EC2 + control plane costs. It also requires careful ALB/NLB migration and DNS TTL management. Only use for large version jumps or CNI changes.
Staff L6 interview prep — 40 questions across all 8 topic clusters. Each has key points to memorize, a full model answer, and interviewer follow-up probes.
"EKS splits Kubernetes responsibilities at the control plane / data plane boundary. AWS fully manages the control plane — that means etcd with 3–5 nodes spread across AZs maintaining quorum automatically, multiple API server instances behind an NLB that auto-scales under load, the scheduler and controller manager, and all patching, backup, and version management of those components."
"You own everything in your AWS account: worker nodes whether EC2 or Fargate, the OS and AMI lifecycle, all Kubernetes workloads, RBAC policies, CRDs, network policies, IAM role bindings, and security groups. The bridge between the two is AWS provisioning cross-account ENIs into your VPC subnets — your nodes register with the API server through those ENIs, which is why your VPC security groups must allow port 443 outbound to the control plane endpoint."
"The practical implication is that if you misconfigure VPC routing or security groups, your nodes can't join. If AWS has a control plane issue, your existing pods keep running but you can't schedule new ones or change cluster state — etcd goes read-only."
kubectl apply -f deployment.yaml — from CLI to pod running.
Flow: aws eks get-token → STS presigned URL → API server validates via EKS auth webhook → RBAC authorization

"The full path has about 8 distinct phases. Authentication first: kubectl generates a bearer token by calling STS with your IAM credentials to get a presigned URL. The API server passes this to the EKS authentication webhook, which validates the IAM identity and maps it to a Kubernetes username via the aws-auth ConfigMap or EKS Access Entries."
"Then authorization: the API server checks your RBAC permissions for the resource and verb. If you pass, it hits the admission controllers — mutating first, then validating. In a typical EKS cluster this is where IRSA webhook injects environment variables, and where OPA/Gatekeeper blocks policy violations."
"Once admitted, the object is written to etcd — only after a quorum of etcd nodes acknowledges the write. The Deployment controller running inside the controller manager watches etcd via a list-watch and creates a ReplicaSet, which creates Pod objects with no nodeName."
"The scheduler watches for Pods with empty nodeName, scores candidate nodes using predicates (resource fit, taints, pod affinity) and priorities, then writes the chosen node back to the Pod spec in etcd. The kubelet on that node is also watching, picks up the binding, calls containerd to pull the image, creates Linux namespaces, calls the CNI plugin (aws-cni) to allocate a secondary VPC IP and set up the veth pair, then starts the container. The pod transitions through Pending → ContainerCreating → Running."
Key point: etcd stores all cluster objects under the /registry/ prefix.

"etcd is a distributed key-value store built on the Raft consensus algorithm. It elects a leader that handles all writes; followers replicate. To commit a write, the leader needs acknowledgment from a majority — quorum. In EKS, AWS runs 3–5 etcd nodes distributed across AZs. With 3 nodes, you can tolerate one AZ failure and still have quorum. With 5, you can tolerate two."
"The critical failure mode to understand: if quorum is lost, etcd becomes read-only. Your existing pods keep running because kubelet caches the pod spec locally — it doesn't need etcd to manage running containers. But you cannot create, update, or delete any Kubernetes objects until quorum is restored. This is why AWS runs this in 3+ AZs."
"As an operator, the things I watch for are: etcd compaction lag (if not compacted, the database grows unboundedly — AWS handles this but it's important to understand), the 8MB request size limit (trying to apply a very large ConfigMap or CRD can hit this), and at scale, object count — etcd's performance degrades with millions of objects, so namespacing and cleaning up stale resources matters."
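The quorum arithmetic from the answer above can be checked directly:

```shell
# Raft commit needs a majority: quorum = floor(n/2) + 1.
# Failures tolerated before etcd goes read-only: n - quorum.
for n in 3 5; do
  quorum=$(( n / 2 + 1 ))
  echo "nodes=$n quorum=$quorum tolerated_failures=$(( n - quorum ))"
done
```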
"The aws-node DaemonSet — the VPC CNI plugin — runs on each worker node and is responsible for pre-allocating a pool of IPs from your VPC subnet. It attaches secondary ENIs to the EC2 instance and assigns secondary private IPs to those ENIs. When a pod is scheduled, the kubelet calls the CNI plugin, which picks a free IP from the warm pool, creates a veth pair — one end in the pod's network namespace, one on the host — and programs the Linux kernel routing table so traffic to that IP goes into the pod."
"The capacity ceiling is instance-type dependent. An m5.xlarge supports 4 ENIs with up to 15 secondary IPs each, so 60 pod IPs. With prefix delegation enabled, each of those 15 secondary slots becomes a /28 prefix (16 IPs), giving you 4 × 15 × 16 = 960 pod IPs per node."
"The two failure modes I've seen in production: first, subnet IP exhaustion — your /24 runs out of IPs and pods get stuck Pending. The fix is either resize subnets (can't do in-place, need migration) or use prefix delegation with larger subnets. Second, the warm pool latency problem — if WARM_ENI_TARGET is 0, the first pod on a cold node waits for ENI attachment, which can take 10–30 seconds. Setting WARM_ENI_TARGET=1 pre-allocates a spare ENI."
Debug toolkit: DNS (nslookup from pod) · Endpoints (kubectl get endpoints) · iptables -L -t nat · kubectl exec + curl / nc to test connectivity directly

"I'd approach this as a layered debug. First I'd confirm whether it's DNS resolution or actual network connectivity — exec into the source pod and run nslookup service-name.namespace.svc.cluster.local. If DNS fails, the problem is CoreDNS or the service name, not networking."
"If DNS resolves but the connection fails, I'd check NetworkPolicy first — this is the most common cause of cross-namespace failures. A default-deny policy in the destination namespace, or an ingress policy missing a namespaceSelector for the source namespace, would silently block traffic. I'd kubectl get networkpolicy -n <target-ns> and read the selectors carefully."
"If there's no NetworkPolicy blocking it, I'd check the Service has active Endpoints — kubectl get endpoints <service> -n <ns>. No endpoints means the selector doesn't match any pods. Then I'd verify kube-proxy has programmed the iptables rules correctly on the relevant nodes. On EKS, I'd also check whether Security Groups for Pods is in use, which adds an extra layer of SG-based filtering that NetworkPolicy doesn't control."
"I'd also verify the pods are actually running and that aws-node allocated IPs correctly — check aws-node DaemonSet logs for any IP pool exhaustion warnings."
"The first question I ask is whether we need soft or hard multi-tenancy. Soft tenancy — where tenants are internal teams that trust each other at some level — can work in one cluster with namespace isolation. Hard tenancy — where tenants are external customers with no trust relationship — requires separate clusters per tenant or very careful controls."
"For soft tenancy in EKS, I'd layer three controls. Network: default-deny ingress + egress NetworkPolicy in every tenant namespace, then explicit allow rules only for legitimate traffic paths. With aws-eks-nodeagent or Cilium, these are enforced at eBPF level. Compute: dedicated node groups per tenant with taints and tolerations so tenant A pods can't land on tenant B nodes — eliminates shared memory/CPU attack surface. IAM: namespace-scoped RBAC only, no ClusterRole, separate IRSA roles per tenant namespace."
"For full isolation, I'd add Security Groups for Pods — assigning per-tenant SGs at the ENI level. This enforces isolation at the AWS network layer, not just in kernel space, and you get CloudTrail logging of any cross-tenant traffic attempts. The downside is SG-for-Pods requires a specific VPC CNI configuration and can complicate your IP allocation."
"The fundamental architectural difference: Cluster Autoscaler works through Auto Scaling Groups — it scales up by increasing desired count on a pre-configured ASG, which means you must pre-define instance types, have separate ASGs per AZ per instance family, and handle mixed instance types manually. It's reactive, checking every 10–60 seconds, with a conservative scale-down mechanism that waits for an underutilization window."
"Karpenter bypasses ASGs entirely and calls the EC2 Fleet API directly. When a pod is unschedulable, Karpenter reads its resource requests and constraints, queries EC2 for all matching instance types with current spot and on-demand pricing, and launches the cheapest option that fits. This happens in seconds. It's also topology-aware — it considers the pod's AZ preference for EBS volumes."
"I'd choose Karpenter for any new EKS cluster where cost efficiency matters, especially for workloads with variable shape (some pods need GPU, some need memory, some are small). The consolidation feature — where Karpenter continuously right-sizes and terminates underutilized nodes — alone can save 20–40% on compute."
"I'd stick with Cluster Autoscaler if the team has deep existing tooling around ASGs, or if the cluster runs on non-AWS infrastructure. The critical thing with Karpenter: every pod needs explicit resource requests. Without them, Karpenter has no signal for sizing and you get unpredictable behavior."
Use the karpenter.sh/do-not-disrupt annotation on pods that absolutely cannot be evicted.

"Fargate's core value proposition is zero node management and pod-level isolation — AWS provisions a dedicated microVM per pod, handles OS patching, and you never think about node capacity. That sounds appealing, but the production constraints are significant."
"The hard limitations: no DaemonSets — any tooling that relies on DaemonSets (log forwarders, security agents, monitoring collectors) won't work on Fargate pods. You work around this with sidecars, but that's operationally heavier. No EBS — Fargate pods can only use EFS for persistent storage, which has higher latency and different cost characteristics. Cold start is 30–60 seconds, which makes Fargate a poor fit for latency-sensitive scale-out paths."
"I use Fargate for: batch and ETL jobs that spin up infrequently and need strong isolation, CI/CD runners where the cold start doesn't matter, and dev/staging namespaces where the team wants zero node management overhead. For production web services with fast scale-out requirements or any DaemonSet-dependent tooling, I stay on EC2 managed node groups or Karpenter."
"The scheduler runs two phases per pod. Filtering reduces the candidate set to only nodes that can actually run the pod. This includes: does the node have enough CPU/memory requests available, do the pod's tolerations cover all node taints, does the node label match any required nodeAffinity, are the requested ports free, and critically for EKS — is the EBS volume in the same AZ as the node."
"Scoring then ranks the filtered nodes. The main scorer is LeastRequestedPriority — prefer nodes with more free capacity to spread load. PodAffinity scoring attracts pods to nodes where matching pods already run. SpreadConstraints scoring penalizes nodes in AZs that are already overloaded."
"Taints vs affinity is a common confusion point. Taints are on nodes — they push pods away unless the pod has a matching toleration. Affinity is on pods — they pull pods toward nodes. You'd taint a GPU node gpu=true:NoSchedule so only pods that tolerate it land there. You'd use nodeAffinity when you want pods to prefer certain nodes but don't want to block non-matching pods."
"For HA across zones, I prefer TopologySpreadConstraints over pod anti-affinity now — it's more expressive, can enforce a specific skew limit, and scales better. Anti-affinity requires N anti-affinity rules proportional to replica count, which gets expensive to evaluate in large clusters."
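The TopologySpreadConstraints preference can be sketched as a pod-spec fragment (the app label is assumed):

```yaml
# Keep replica counts across AZs within a skew of 1
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone   # spread across AZs
    whenUnsatisfiable: DoNotSchedule           # hard constraint; ScheduleAnyway makes it soft
    labelSelector:
      matchLabels:
        app: web                               # hypothetical label
```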
"With a node IAM role, every pod on that node — regardless of what it does — can call any AWS API that role allows. If one pod is compromised, the attacker gets node-wide AWS access. IRSA solves this with pod-level least privilege: each Kubernetes ServiceAccount maps to a specific IAM role with exactly the permissions that service needs."
"The mechanism: when you associate an OIDC provider with your cluster, EKS acts as an identity provider. You create an IAM role whose trust policy says 'trust tokens issued by this specific OIDC provider for this specific namespace/service-account'. The EKS mutating webhook intercepts pod creation and, if the pod's ServiceAccount is annotated with a role ARN, injects AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE env vars plus a projected token mounted as a file."
"At runtime, the AWS SDK picks this up automatically via its credential chain — it reads the env vars, reads the projected token, calls STS AssumeRoleWithWebIdentity, and gets back temporary credentials (1 hour by default, refreshed automatically by the SDK before expiry). The pod never has a long-lived credential — everything is ephemeral. You can't accidentally commit an IRSA credential to git."
"The scale limits: 100 OIDC providers per AWS account, and STS has per-role throttling. For fleets over 100 clusters, or for clusters with thousands of pod starts per minute all using the same role, Pod Identity is better because it has a local caching proxy on each node."
"A related trust-policy pitfall: a StringLike wildcard in the sub condition (e.g. "*:sub": "system:serviceaccount:*:my-sa") allows any namespace's service account named my-sa to assume the role — a privilege escalation vector. Always use StringEquals with the full namespace:serviceaccount path."

"RBAC in Kubernetes is built on four resources. Role defines what actions (verbs: get, list, watch, create, update, patch, delete) are allowed on which resources (pods, services, configmaps…) within a single namespace. ClusterRole is the same but cluster-scoped — used for non-namespaced resources like Nodes and PersistentVolumes, or when you need uniform permissions across all namespaces."
"RoleBinding attaches a Role (or a ClusterRole!) to a subject within a specific namespace. This is a subtle but important point: you can bind a ClusterRole with a RoleBinding, which scopes those permissions to one namespace. That's how you create reusable permission templates with ClusterRoles but grant them per-namespace. ClusterRoleBinding attaches a ClusterRole cluster-wide — granting access to all namespaces at once."
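The "ClusterRole as a reusable template" pattern looks like this — a sketch with hypothetical names (`pod-reader`, `team-a`, `ci-bot`):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: pod-reader               # reusable permission template, defined once
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding                # RoleBinding, not ClusterRoleBinding:
metadata:                        # scopes the ClusterRole to ONE namespace
  name: pod-reader-binding
  namespace: team-a
subjects:
  - kind: ServiceAccount
    name: ci-bot
    namespace: team-a
roleRef:
  kind: ClusterRole
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```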
"Three things that trip people up: RBAC is purely additive — there are no deny rules. If a user has two bindings, they get the union of permissions. Second, ServiceAccounts are namespace-scoped subjects — a ServiceAccount in namespace A can't be bound to a Role in namespace B without a ClusterRoleBinding. Third, EKS maps IAM identities to K8s RBAC users/groups — the IAM ARN becomes a username, and you bind that username via RoleBinding."
"Native Kubernetes Secrets are often misunderstood as 'secure by default' — they're not. They're base64-encoded, stored in etcd, and accessible to anyone with get secret RBAC permission in that namespace. The baseline fix is KMS envelope encryption for etcd and strict RBAC. But even with encryption at rest, the secret is still decrypted when read via the API."
"Secrets Store CSI Driver avoids etcd entirely — at pod startup, the CSI driver calls AWS Secrets Manager (using IRSA credentials from the pod's ServiceAccount) and mounts the secret directly into the pod's filesystem. The secret never touches etcd. The downside: the pod depends on Secrets Manager being reachable at startup, adding a cold-start dependency. Also, you need Secrets Manager access configured before the pod can start."
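A minimal SecretProviderClass sketch for the CSI-driver path, assuming a hypothetical secret name in Secrets Manager:

```yaml
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: app-aws-secrets
  namespace: payments
spec:
  provider: aws
  parameters:
    objects: |
      - objectName: "prod/payments/db-password"   # hypothetical secret name
        objectType: "secretsmanager"
---
# Pod volume fragment mounting the secret as a file (never stored in etcd):
# volumes:
#   - name: secrets
#     csi:
#       driver: secrets-store.csi.k8s.io
#       readOnly: true
#       volumeAttributes:
#         secretProviderClass: app-aws-secrets
```

The pod's ServiceAccount must have IRSA permissions for `secretsmanager:GetSecretValue` on that secret.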
"External Secrets Operator syncs secrets from Secrets Manager into Kubernetes Secret objects on a schedule. It's easier to integrate with existing tooling that expects K8s secrets, but the secret does end up in etcd. The advantage over manual management: rotation in Secrets Manager automatically propagates to the K8s Secret, and from there you control whether pods see the update via projected volumes (immediate) or env vars (require restart)."
"My recommendation for greenfield EKS: CSI driver for high-sensitivity secrets (API keys, certs), ESO for config-level secrets where etcd exposure risk is acceptable, and always KMS encryption enabled regardless."
Quick checks: kubectl describe pvc <name> · kubectl get pv · volumeBindingMode: WaitForFirstConsumer · kubectl logs -n kube-system -l app=ebs-csi-controller

"First: kubectl describe pvc <name> -n <ns> — the Events section tells you exactly what's failing. If it says 'waiting for first consumer to be created before binding', the StorageClass has WaitForFirstConsumer set, which is correct behavior — the PV won't be provisioned until the pod is scheduled. This usually means the pod itself is what's stuck."
"If the PVC shows 'ProvisioningFailed', the EBS CSI controller tried and failed to create the volume. I check EBS CSI controller pod logs: kubectl logs -n kube-system -l app=ebs-csi-controller -c csi-provisioner. Common errors: IAM permission denied (the IRSA role for the CSI driver is missing ec2:CreateVolume), EBS quota exceeded, or the requested AZ doesn't have the instance type available."
"If the StorageClass uses Immediate binding (legacy), the PV is created in whatever AZ the scheduler picks for provisioning — which may not be the AZ the pod ends up on. This causes 'Multi-Attach error for volume' or the pod stays Pending with 'node had no matching volume'. Fix: migrate the StorageClass to WaitForFirstConsumer and delete the stuck PVC/PV pair."
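The recommended StorageClass, sketched with an illustrative name:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-wffc                 # illustrative name
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer   # provision in the AZ the pod lands in
parameters:
  type: gp3
  encrypted: "true"
```

StorageClass fields are immutable, so the "migration" is creating a new class and repointing workloads, not editing the old one.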
"I'd also check: is the EBS CSI driver installed at all? kubectl get pods -n kube-system | grep ebs-csi. EKS doesn't ship it by default — and since 1.23 the in-tree EBS provisioner is gone, so without the driver add-on, EBS-backed PVCs never provision at all."
"The decision comes down to access pattern and scheduling flexibility. EBS is block storage — it behaves like a fast local disk attached to one EC2 instance at a time. It's ideal for databases (Postgres, MySQL, Cassandra nodes), Kafka broker data, any workload where you need low-latency sequential or random I/O. The constraint is AZ binding — the volume lives in one AZ and so does the pod that uses it."
"EFS is NFS over the network — multiple pods across multiple nodes and AZs can mount the same filesystem simultaneously (ReadWriteMany). The use cases are: shared configuration files, ML training datasets accessed by multiple worker pods, CMS media uploads, WordPress shared content. The latency profile is worse than EBS for small random I/O — you're going across a network to a managed NFS service."
"At production scale, the main EFS traps are: throughput mode — Bursting mode gives you throughput proportional to stored data, so a small but heavily-read filesystem gets throttled. Switch to Elastic mode which auto-scales throughput. Second: EFS pricing — $0.30/GB/month on Standard storage. For a cluster serving ML model files (say 50GB per model × 20 models = 1TB), you're paying $300/month in storage alone. S3 + FUSE might be cheaper depending on access patterns."
"The classic in-tree cloud provider creates a Classic ELB (now a legacy product) whenever you create a Service of type LoadBalancer. It's baked into the controller-manager, hard to update independently, and doesn't support modern ALB/NLB features like host-based routing, WAF integration, or IP-mode targeting."
"The AWS Load Balancer Controller is an out-of-tree controller — a Deployment running in your cluster that watches Ingress and Service objects. It uses IRSA to call AWS APIs and provisions ALBs from Ingress resources, NLBs from Service type:LoadBalancer annotations. Because it runs in-cluster, you can upgrade it independently of the K8s version."
"The most important setting is target-type: ip. With instance mode, the LB targets the EC2 instance on a NodePort, and kube-proxy then forwards to the pod — adding latency and potentially a cross-AZ hop. With IP mode, the LB puts the pod's VPC IP directly into the Target Group and routes straight to it. This requires VPC CNI (pods must have VPC IPs), but eliminates the kube-proxy hop and preserves the source IP at the pod."
"The other key concept is IngressGroup — without it, each Ingress object creates a separate ALB at ~$18/month. With group.name annotation, multiple Ingress resources across namespaces share one ALB, and the LBC manages rule ordering via group.order annotation."
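An IngressGroup sketch — hostname, service name, and group name are illustrative:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api
  namespace: team-a
  annotations:
    alb.ingress.kubernetes.io/group.name: shared-alb  # all Ingresses with this name share one ALB
    alb.ingress.kubernetes.io/group.order: "10"       # rule precedence within the group
    alb.ingress.kubernetes.io/target-type: ip         # route straight to pod IPs
spec:
  ingressClassName: alb
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api-svc
                port:
                  number: 80
```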
"With the default externalTrafficPolicy: Cluster, the cloud load balancer targets every node in the cluster via NodePort. A request arriving at Node A might have its pod running on Node B — kube-proxy on Node A SNATs the packet and forwards it to Node B. This has two costs: latency from the extra hop and AWS cross-AZ data transfer charges (~$0.01/GB each way — significant at scale)."
"With externalTrafficPolicy: Local, the AWS Load Balancer Controller (or cloud provider) registers only the nodes that have a pod for that Service in the Target Group. Requests go directly to pods on the same node — no SNAT, no cross-node hop, and the original client source IP is preserved in the pod, which matters for rate limiting, geo-based routing, and audit logs."
"The risk: if a node has Local policy and the pod on it crashes before the health check catches it, the LB might still send requests there briefly, getting 503s. Mitigate with fast readiness probes and health check intervals. Also, Local means pod distribution matters more — if you have 10 nodes but pods only on 3, those 3 get all the traffic. Use TopologySpreadConstraints to ensure good distribution."
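A Service sketch for the Local policy with an NLB in IP mode — the annotations follow the AWS Load Balancer Controller's conventions, but treat exact values as an assumption to verify against your controller version:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: external        # hand NLB to the LB Controller
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local   # only nodes with a ready pod are registered;
                                 # client source IP preserved at the pod
  selector:
    app: web
  ports:
    - port: 80
      targetPort: 8080
```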
Quick checks: kubectl describe pod — eviction message tells you the reason · kubectl describe node | grep Conditions -A10

"I start with kubectl describe pod <evicted-pod> — the Events section tells me the exact eviction message. Common messages: 'The node was low on resource: memory' means kubelet-initiated eviction due to node memory pressure. 'OOMKilled' means the container itself exceeded its memory limit and was killed by the kernel."
"For node pressure evictions, I check node conditions: kubectl describe node <node> | grep -A8 Conditions. MemoryPressure, DiskPressure, PIDPressure indicate which resource is constrained. I'd then check node memory allocation — kubectl top node plus the allocatable vs capacity breakdown."
"The fix depends on root cause. If pods have no resource requests (BestEffort QoS), they're evicted first — kubelet evicts in QoS order: BestEffort, then Burstable, then Guaranteed — so set explicit requests to move critical pods up that ladder. If the node is genuinely under-provisioned, either scale out (more nodes or larger instances) or tune kubelet eviction thresholds via the node group's kubelet config."
"If the evictions are happening during low-traffic periods, check if Karpenter consolidation is the cause — it drains underutilized nodes, which looks like eviction. Add PodDisruptionBudgets to ensure minimum availability is maintained during consolidation."
"Zero-downtime upgrade has three phases, and the order matters. Pre-upgrade: I run kube-no-trouble (kubent) to scan for deprecated API usage — any workloads using removed APIs (like PSP in 1.25) need to be migrated first. I check EKS Cluster Insights for compatibility warnings. I verify all Helm chart versions support the target K8s version, and I confirm PodDisruptionBudgets exist for all critical Deployments and StatefulSets."
"Control plane upgrade is low-risk from a downtime perspective — AWS does a rolling replace of API server instances behind the NLB. Existing pods keep running. API calls may see brief increased latency during the transition but the endpoint stays available. I upgrade one minor version at a time."
"Data plane is where downtime risk lives. For managed node groups, EKS cordons a node, drains it (respecting PDBs — if draining would violate a PDB, it waits), launches a new node with the updated AMI, waits for it to be Ready, then terminates the old node. The key: PDBs must be set correctly. A Deployment with 3 replicas and no PDB could have all 3 nodes drained simultaneously. With minAvailable: 2, only 1 can be drained at a time."
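The PDB for the 3-replica example above — a minimal sketch with an assumed `app: api` label:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2          # with 3 replicas, the drain may evict only 1 pod at a time
  selector:
    matchLabels:
      app: api             # hypothetical label matching the Deployment's pods
```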
"I also upgrade add-ons between the control plane and data plane steps. Running old vpc-cni with new K8s can cause ENI/IP allocation issues."
"The controller pattern is the core of how Kubernetes works. Every controller watches for a desired state expressed in Kubernetes objects, observes the actual state of the world, and takes actions to close the gap. The reconcile loop is the heart of it: observe current state, compare to desired state, take the minimum action needed, return. It's designed to be called repeatedly."
"The implementation uses informers — not raw watch calls to etcd. An informer maintains a local in-memory cache of the resources it cares about, refreshed by a list-watch stream. When a resource changes, the event is put into a work queue. The reconcile function pops from the queue, reads from the local cache (not etcd — this reduces load dramatically), and acts."
"The critical design property is idempotency: the reconcile function must produce the same result whether called once or ten times. This is because in a distributed system, the controller might crash mid-reconciliation, or receive duplicate events. Kubernetes controllers are designed so re-running reconcile on an already-reconciled object is safe."
"Failure handling: on error, the item is re-queued with exponential backoff. The work queue has rate limiting to prevent thundering herd when many resources need reconciling simultaneously. This is why after a cluster comes back up from an outage, you see a wave of reconciliation activity that settles down over minutes, not a hard spike."
"For a payment processor I'd structure this in layers. Network foundation: one VPC per cluster, three private subnets (one per AZ) sized at /22 (~1,000 IPs each), with VPC CNI prefix delegation enabled — prefix delegation attaches /28 prefixes per node instead of individual secondary IPs, raising per-node pod density without extra ENI API calls (it doesn't add IPs to the subnet, so size subnets for total pod count). Public subnets in each AZ hold only load balancers. No public node IPs."
"Compute: I'd use Karpenter with a NodePool constrained to on-demand only (no spot for payments — spot interruption = transaction disruption). Disruption policy set to WhenEmpty only — no voluntary consolidation during business hours. TopologySpreadConstraints on all Deployments enforce at least one pod per AZ with maxSkew:1."
"Data layer: if using EBS (e.g. for a local cache), WaitForFirstConsumer StorageClass so volumes are provisioned in the pod's AZ. PodDisruptionBudgets with minAvailable = ceil(replicas * 0.75) so upgrades can't take the service below 75% capacity. For the payment DB, I'd likely use Aurora Multi-AZ outside the cluster, accessed by pods via IRSA-authenticated connection."
"Security: IRSA per service component (payment-api gets only the specific DynamoDB tables it needs), KMS-encrypted secrets, NetworkPolicy default-deny in the payment namespace with explicit allow for ingress from the ALB target and egress to the DB endpoint. Pod SecurityContext: runAsNonRoot, readOnlyRootFilesystem, drop ALL capabilities."
"Ingress: multi-AZ ALB with WAF attached, target-type:ip, externalTrafficPolicy:Local, HTTPS only, TLS cert managed via ACM. Health checks on /healthz with 30s interval and 2 unhealthy threshold."
Key fact: kubectl rollout undo scales up the previous ReplicaSet.

"A rolling update creates a new ReplicaSet for the updated pod template. The Deployment controller then scales the new RS up and the old RS down simultaneously, respecting maxSurge and maxUnavailable constraints."
"maxUnavailable=0, maxSurge=1 is the zero-downtime configuration: before killing any old pod, the controller ensures a new pod is Ready. It creates 1 new pod (surge), waits for its readiness probe to pass, then terminates 1 old pod. This repeats until all replicas are updated. Slowest but safest — no user traffic hits the new version until it's proven Ready."
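As a Deployment-spec fragment:

```yaml
# Deployment spec fragment: zero-downtime rolling update
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # never dip below the desired replica count
      maxSurge: 1         # create 1 extra pod, wait for Ready, then kill 1 old pod
```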
"Readiness probes are the critical dependency here. If your new pod version never passes its readiness probe (e.g., the new code has a startup bug), the rolling update halts — it won't kill old pods because it can't bring new ones to Ready. This is actually good: the old version keeps serving traffic. You'd see the Deployment stuck at partial rollout."
"rollout history: each update creates a new RS kept as a revision. kubectl rollout undo scales up the previous RS. The number of retained revisions is controlled by revisionHistoryLimit (default 10)."
"Pod termination is a multi-step process with a critical race condition that causes dropped connections in most clusters. When a pod is deleted: the API server marks it as Terminating, which triggers two concurrent paths — kubelet starts the termination sequence, and endpoints controller removes the pod from Service endpoints, which kube-proxy uses to update iptables rules."
"The kubelet path: if a preStop hook is defined, it runs first. Then SIGTERM is sent to all containers. If the process doesn't exit within terminationGracePeriodSeconds, SIGKILL is sent."
"The race condition: iptables rule removal is asynchronous and may lag behind SIGTERM by 1–10 seconds. During that window, new requests can still arrive at the pod while it's trying to shut down. The standard workaround is a preStop hook with sleep 5 — this delays SIGTERM by 5 seconds, giving iptables time to propagate the endpoint removal before the app starts refusing connections."
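The workaround as a container-spec fragment (the 5-second sleep is the buffer described above; tune it to your environment):

```yaml
# Container fragment: delay SIGTERM so endpoint removal propagates first
lifecycle:
  preStop:
    exec:
      command: ["sh", "-c", "sleep 5"]   # requires a shell in the image
# Pod-level setting: total budget for preStop + graceful shutdown
# terminationGracePeriodSeconds: 30
```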
"For applications that have long-running requests (gRPC streams, websockets), set terminationGracePeriodSeconds to the maximum expected request duration + the sleep buffer. A payment processing service might need 60–90 seconds to drain."
"A CRD lets you extend Kubernetes with domain-specific resource types. Once you define a CRD, the API server handles storage, validation (via OpenAPI schema), RBAC, and watch — you get all of that for free. Your custom object lives in etcd alongside native Kubernetes objects."
"An Operator is a controller that watches those CRDs and implements operational domain knowledge — the stuff a human operator would do. For a database operator: creating a cluster, handling node failures, taking backups, performing rolling upgrades. The key is encoding runbooks as code."
"Build vs buy: for common infrastructure — Postgres, Kafka, Prometheus, Redis — I'd always evaluate existing operators first. Strimzi for Kafka, CloudNativePG for Postgres, and the Prometheus Operator are battle-tested. Building your own is weeks of work with edge cases you haven't thought of yet."
"I'd build custom when: the operational logic is proprietary to our business domain (e.g., an operator for our internal ML training job lifecycle), when existing operators have architectural constraints that conflict with our requirements, or when we need deep integration with internal tooling. The kubebuilder framework makes this approachable — it scaffolds the controller, generates CRD YAML, and handles the watch/cache/queue plumbing."
svc-name → tries svc-name.namespace.svc.cluster.local first

"CoreDNS is Kubernetes' cluster DNS service — a Deployment (typically 2 replicas for HA) running in kube-system. Its ClusterIP is configured in every pod's /etc/resolv.conf as the nameserver. Each pod's resolv.conf also has search domain suffixes, so a bare hostname like my-svc gets tried as my-svc.my-namespace.svc.cluster.local first, then falls back through the search-domain chain."
"The most impactful DNS issue in large EKS clusters is the 5-second DNS timeout. It's caused by a Linux kernel race condition in conntrack when multiple threads from the same pod make concurrent DNS queries — SNAT and packet processing can drop one of them, causing a 5-second retry. The symptoms: intermittent 5-second latency spikes on first connections, usually masked by connection pooling but visible in tail latencies."
"NodeLocal DNSCache is the proper EKS fix: a DaemonSet that runs a DNS cache on every node and intercepts DNS traffic via a link-local address before it reaches CoreDNS. This eliminates the conntrack issue because DNS queries no longer traverse iptables NAT for most lookups. It also reduces load on CoreDNS significantly in large clusters."
"Other CoreDNS failures: CoreDNS pods OOMKilled under load (increase memory limits), CoreDNS pods scheduled on a single node that becomes unavailable (use topologySpreadConstraints on the CoreDNS Deployment), and DNS cache poisoning (validate CoreDNS is not forwarding to untrusted resolvers)."
"I structure observability around three pillars with clear signal priorities. Metrics: I use kube-prometheus-stack (Prometheus + Grafana) for cluster metrics. kube-state-metrics translates Kubernetes object states into Prometheus metrics — pod phases, deployment rollout status, PVC binding state. node-exporter covers host-level resources. For the control plane, EKS doesn't expose Prometheus metrics directly, so I rely on CloudWatch Container Insights for control plane health."
"Critical alerts I always set up: node in NotReady for >2 minutes (hardware/network issue), pod CrashLoopBackOff (app crash), PersistentVolumeClaim stuck Pending (storage provisioning failure), kube-proxy DaemonSet not fully available, and CoreDNS error rate spike. These are the signals that wake me up at 3am."
"Logs: Fluent Bit DaemonSet (the ADOT Fluent Bit variant ships with Container Insights) forwards container logs to CloudWatch Logs with pod metadata enrichment. Control plane logs — especially API audit and auth logs — must be enabled explicitly in EKS and are critical for security investigations and RBAC debugging."
"Traces: for a Staff-level concern — I use OpenTelemetry (ADOT Collector as a DaemonSet or sidecar) to collect traces from services and forward to X-Ray or Jaeger. The key is propagating trace context across service boundaries, especially across Ingress → service → database. Without traces, you can see that latency increased but not which service in the call chain is responsible."
"Immediate priority is containment without losing evidence. I'd immediately apply a NetworkPolicy to the compromised pod's labels — deny all ingress and egress except to a forensics endpoint. This isolates it from lateral movement while keeping it running for evidence collection. I'd label the pod with a quarantine label so it's visually identifiable."
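The deny-all baseline of that quarantine policy can be sketched like this (namespace and label are illustrative; an explicit egress rule to the forensics endpoint would be added on top):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: quarantine
  namespace: payments            # hypothetical namespace
spec:
  podSelector:
    matchLabels:
      quarantine: "true"         # label applied to the compromised pod
  policyTypes: ["Ingress", "Egress"]
  # no ingress/egress rules listed => default-deny in both directions;
  # add a narrow egress rule for the forensics collector as needed
```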
"In parallel, I'd check CloudTrail for any AWS API calls made by the pod's IAM role in the last hour. If the pod had an IRSA role with S3 read access, did it exfiltrate data? I can't revoke the temp credentials directly, but I can attach an explicit Deny policy to the IAM role immediately, which overrides all other permissions even for in-flight temp credentials."
"Blast radius assessment: what K8s Secrets were mounted in that namespace? What ServiceAccounts could the compromised process impersonate? If the pod had node-level access (host PID, host network), the blast radius is the entire node and all pods on it — in that case I'd cordon and drain the node."
"For remediation: rotate all credentials the pod could have accessed (DB passwords, API keys in Secrets, IRSA role rotated by creating a new role), patch the container image, redeploy. Then conduct a post-mortem: how did this happen, what GuardDuty rule would have caught it earlier, is Pod Security Admission configured to prevent privileged containers?"
"Requests and limits serve different purposes. Requests are scheduling hints — the scheduler sums all pod requests per node to determine available capacity. A node with 8 vCPU and all requests totaling 7 vCPU will not schedule another 2-vCPU request pod, even if actual usage is only 3 vCPU. This is intentional — requests guarantee resource availability."
"Limits are enforcement ceilings. CPU limits are implemented with Linux cgroups CFS quota (cfs_quota) — if a container tries to use more than its CPU limit, the kernel throttles it without killing it. Memory limits are different: if a container exceeds its memory limit, the Linux OOM killer kills the process. The container restarts (if restartPolicy allows), which shows up as CrashLoopBackOff with an OOMKilled reason."
"The QoS class determines eviction priority under node pressure. Guaranteed (requests == limits for ALL containers) means kubelet won't evict the pod under memory pressure unless it's actually over limit. Burstable gets evicted next. BestEffort (no requests/limits at all) gets evicted first."
"My recommendation: set requests == limits for memory on critical workloads (protects against memory-pressure eviction; strict Guaranteed QoS would additionally require CPU requests == limits), but keep the CPU limit higher than the request or remove it entirely — CPU throttling degrades latency subtly and is hard to detect. For non-critical workloads, set memory requests conservatively with limits 2x higher."
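That recommendation as a container-spec fragment (values are illustrative):

```yaml
# Container fragment: memory request == limit, no CPU limit
resources:
  requests:
    cpu: "500m"
    memory: "1Gi"
  limits:
    memory: "1Gi"     # == request: stable under node memory pressure
    # no CPU limit: avoids CFS throttling latency; pod may burst
```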
"A sidecar extends or enhances a main container's capabilities by running in the same pod, sharing network namespace and volumes. The canonical examples are: an Envoy proxy sidecar for service mesh traffic control, a Fluent Bit sidecar for log forwarding (especially on Fargate where DaemonSets don't work), and the IRSA token refresh sidecar in older patterns."
"Sidecar vs init container: init containers run sequentially before the main container and must complete successfully — perfect for wait-for-dependency patterns (wait until Postgres is ready), one-time config generation, or DB schema migrations. Sidecars run for the lifetime of the pod alongside the main container."
"In Kubernetes 1.29+, there's now a formal sidecar container type — an initContainer with restartPolicy:Always. It starts before the main container, keeps running, and importantly, it terminates after the main container during pod shutdown. This solves the classic problem where a log forwarder sidecar would die before draining the log buffer because both containers get SIGTERM simultaneously."
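The native sidecar declaration, sketched with hypothetical images:

```yaml
# Pod spec fragment: native sidecar (Kubernetes 1.29+)
spec:
  initContainers:
    - name: log-forwarder
      image: fluent/fluent-bit:latest   # hypothetical image
      restartPolicy: Always             # this field turns the init container into
                                        # a sidecar: starts before the main container,
                                        # terminates after it on shutdown
  containers:
    - name: app
      image: my-app:latest              # hypothetical image
```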
"Separate service is better when: the functionality needs independent scaling (log aggregation serving 100 app pods doesn't need to scale 1:1 with apps), it needs its own RBAC or IAM permissions isolated from the app, or it serves multiple different workloads."
"At the ALB level, the AWS Load Balancer Controller supports weighted target groups. You deploy a v2 Deployment alongside v1, create a separate Service, and in the Ingress you define a forward action via an alb.ingress.kubernetes.io/actions.<action-name> annotation that lists both target groups with weights — e.g., 90% to v1, 10% to v2. This requires no application change and shifts traffic at the LB level."
"For full progressive delivery with automatic rollback, Argo Rollouts replaces the standard Deployment and adds traffic shifting logic. You define canary steps: pause for analysis, shift 10% → wait → shift 25% → wait → promote. The analysis template queries Prometheus — if error rate exceeds a threshold, Rollouts automatically rolls back. No human involvement needed."
"Header-based canary is useful for targeted testing: route requests with X-Canary: true header to the new version, all others to the old version. The ALB can route on request headers. This lets you send your QA team or specific user cohort to the new version while production users stay on stable."
"For true blue-green (instant full cutover): run two identical Deployments, change the Service selector label from version: blue to version: green. The cutover is atomic — Service selector update is a single etcd write, and kube-proxy propagates it within seconds."
pod-0.svc-name.namespace.svc.cluster.local

"The core distinction is pod identity. Deployment pods are interchangeable — if pod-abc crashes, a new pod-xyz replaces it, same spec, different name. StatefulSet pods have stable identities: pod-0, pod-1, pod-2 — if pod-1 crashes, the replacement is also called pod-1 and remounts the same PVC."
"This stable identity enables three things Deployments can't provide: stable network identity via a headless Service (pod-0.db.namespace.svc.cluster.local resolves specifically to pod-0 — critical for Kafka brokers advertising their address to clients), stable persistent storage (each replica gets its own PVC from volumeClaimTemplates, which persists across pod restarts and is not deleted on scale-down), and ordered operations (pod-1 won't start until pod-0 is Ready, rolling updates go in reverse order — pod-2 updates before pod-1 before pod-0)."
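A minimal sketch of the headless Service + volumeClaimTemplates pairing (names and sizes are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: db
spec:
  clusterIP: None            # headless: enables pod-0.db.<ns>.svc.cluster.local
  selector:
    app: db
  ports:
    - port: 5432
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db
spec:
  serviceName: db            # ties pod DNS identity to the headless Service
  replicas: 3
  selector:
    matchLabels:
      app: db
  template:
    metadata:
      labels:
        app: db
    spec:
      containers:
        - name: postgres
          image: postgres:16           # illustrative image
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:      # each replica gets its OWN PVC: data-db-0, data-db-1, ...
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 50Gi
```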
"I use StatefulSet for any application where the instance matters: database nodes, Kafka/Pulsar brokers, ZooKeeper, Redis cluster members, Elasticsearch data nodes. I use Deployment for stateless applications where any pod can serve any request."
"Important EKS-specific nuance: StatefulSet PVCs are not deleted when you scale down. If you scale from 3 to 2 replicas, pod-2 is deleted but data-pod-2 PVC remains, consuming EBS costs. This is intentional (protects data) but requires manual cleanup when decommissioning."
"DaemonSets ensure exactly one pod per node (or per matching node subset). Historically the DaemonSet controller bypassed the scheduler by setting nodeName directly; since Kubernetes 1.12 the default scheduler places DaemonSet pods via injected node-affinity terms, but they still tolerate node conditions — which is why DaemonSet pods appear on nodes in NotReady state or under memory pressure. Critical infrastructure like aws-node (VPC CNI), kube-proxy, and log forwarders use DaemonSets because they must run everywhere."
"By default, DaemonSet pods tolerate most node condition taints — this is intentional so that node maintenance tooling (the DaemonSet) can run even when a node is degraded. You can override this by removing tolerations if you want the DaemonSet to only run on healthy nodes."
"To run a DaemonSet on a subset of nodes: add nodeSelector or nodeAffinity to the DaemonSet's pod spec. For example, run the GPU monitoring DaemonSet only on nodes with label gpu=true. Karpenter is aware of DaemonSet overhead — when sizing a new node, it accounts for DaemonSet pod resource requests as overhead."
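The GPU-subset example as a DaemonSet pod-template fragment:

```yaml
# DaemonSet spec fragment: run only on nodes labeled gpu=true
spec:
  template:
    spec:
      nodeSelector:
        gpu: "true"      # label from the example above
```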
Quick checks: kubectl describe node → Conditions section (KubeletReady = false?) · systemctl status kubelet · journalctl -u kubelet -n 100 · disk full on root volume · docker/containerd OOM · network issue preventing kubelet from reaching the API server · certificate rotation failure

"First: kubectl describe node <node> — the Conditions section tells me whether it's MemoryPressure, DiskPressure, PIDPressure, or plain NotReady (kubelet lost contact with API server). The Events section shows recent node-level events."
"I then SSH via AWS Systems Manager Session Manager (no bastion needed if SSM agent is running on the node). Check kubelet: systemctl status kubelet and journalctl -u kubelet -n 200 --no-pager. Common logs: 'failed to get cgroup stats' (kernel version issue), 'certificate has expired' (cert rotation failure), 'PLEG is not healthy' (container runtime unresponsive)."
"If the disk is full (df -h), check for large log files, stuck containers with big log outputs, or image layer accumulation. On EKS, docker system prune (or containerd equivalent) clears unused images. Increase the root EBS volume in the launch template if this is recurring."
"For managed node groups, the operational response is: cordon, drain (respecting PDBs), terminate the EC2 instance, and let the ASG replace it with a fresh node. Investigating the root cause is parallel — don't hold up workload recovery."
"HPA adjusts the replica count of a Deployment or StatefulSet based on observed metrics. It queries the metrics-server every 15 seconds, applies the formula ceil(current × observed/target), and updates the replicas field. The built-in metrics are CPU and memory utilization relative to requests — which is why resource requests must be set for HPA to work."
"The stabilization window prevents thrashing. By default, scale-up applies almost immediately (stabilization window of 0 seconds), while scale-down is conservative (300-second window — the controller takes the highest recommendation over the last 5 minutes, so replicas drop only after the metric has stayed below target that long). Tune with behavior.scaleDown.stabilizationWindowSeconds, and behavior.scaleUp for the other direction."
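A minimal HPA sketch showing the CPU target and scale-down tuning (names and numbers are illustrative):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api                          # hypothetical Deployment
  minReplicas: 3
  maxReplicas: 30
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70       # relative to CPU requests
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # wait 5 min below target before shrinking
```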
"KEDA (Kubernetes Event-Driven Autoscaling) extends HPA for external triggers — SQS queue depth, Kafka consumer lag, custom Prometheus metrics, even cron schedules. It creates HPA objects under the hood but with external metrics. This is the right pattern for batch/queue-based workloads where CPU utilization is a lagging indicator of load."
"Key limitation: HPA can scale replicas out but not individual pod resource sizes — that's VPA (Vertical Pod Autoscaler). VPA and HPA can conflict if both are targeting CPU metrics on the same resource. Use HPA for throughput-oriented scaling, VPA for right-sizing single-replica workloads."
"A service mesh intercepts all service-to-service traffic via sidecar proxies injected into each pod. This gives you: automatic mTLS between all services (zero-trust networking without any application code change), fine-grained traffic management (retry policies, circuit breakers, canary traffic splitting at L7), and deep observability — traces and metrics for every service-to-service call, even across language boundaries."
"The cost is real: latency overhead (2–10ms per hop for Istio Envoy sidecars, lower for Linkerd), memory overhead (each Envoy sidecar uses 50–150MB — at 100 pods, that's 5–15GB extra memory), and significant operational complexity. Debugging traffic issues through a service mesh is much harder than native Kubernetes networking."
"I'd recommend a service mesh when: compliance mandates mTLS in-cluster (PCI-DSS, HIPAA), you need sophisticated traffic management across microservices without code changes, or you need per-service-pair traffic metrics that Prometheus doesn't provide without custom instrumentation."
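If Istio is the mesh chosen for the mTLS compliance case above, mesh-wide enforcement is a single resource — a sketch, assuming a standard install with Istiod in `istio-system`:

```yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system   # the root namespace, so the policy applies mesh-wide
spec:
  mtls:
    mode: STRICT            # reject any plaintext traffic between sidecars
```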
"For many teams, Cilium's eBPF-based service mesh is a better fit — no sidecars, lower overhead, and it integrates with the CNI. For mTLS specifically, SPIRE/SPIFFE for workload identity might be a lighter option than deploying full Istio."
pod-security.kubernetes.io/enforce=restricted
"PSP was a cluster-scoped resource that defined what security constraints pods must meet. The problem: its RBAC model was confusing (a pod could use a PSP if the pod's ServiceAccount could 'use' it — this led to many accidental privilege escalation bugs). It was deprecated in 1.21 and removed in 1.25."
"Pod Security Admission replaces it with a simpler model: you label a namespace with a policy level, and the built-in admission controller enforces it. Three levels: privileged (no restrictions, for system namespaces), baseline (blocks hostPID, hostNetwork, privileged containers, dangerous capabilities), restricted (requires non-root, drops all capabilities, requires seccompProfile)."
"Three modes per level: enforce (reject the pod), audit (allow but log an audit event), warn (allow but show a warning to the user). The migration strategy: add pod-security.kubernetes.io/audit=restricted labels to all namespaces first, check audit logs for violations, fix the workloads, then switch to enforce mode."
"For EKS, namespaces like kube-system must stay privileged (system components need elevated privileges). Application namespaces should target at least baseline, ideally restricted for production workloads."
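The migration strategy above expressed as namespace labels (the namespace name is illustrative):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: payments             # illustrative app namespace
  labels:
    # Step 1: audit/warn first — violations are logged and surfaced, nothing is blocked
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
    # Step 2: once workloads are fixed, enforce — start at baseline, then tighten
    pod-security.kubernetes.io/enforce: baseline
```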
failurePolicy: Fail — if the webhook is unreachable, the matching request is rejected; Ignore — the request is allowed through. With Fail, an unavailable webhook can mean no new pods can be scheduled (a cluster emergency).
"Admission webhooks intercept API server requests after authentication/authorization but before persisting to etcd. Mutating webhooks run first — they can modify the object being submitted (inject a sidecar container, add an annotation, set a default resource limit). Validating webhooks run after all mutating webhooks — they can only accept or reject, not modify. OPA/Gatekeeper uses validating webhooks."
"The critical operational risk is failurePolicy: Fail. If your webhook service (e.g., Gatekeeper) becomes unavailable and it's configured with Fail policy, every API request that matches the webhook rule is rejected. In the worst case, no new pods can be scheduled and no Deployments can be updated — effectively a cluster incident. I've seen Istio injection webhooks with Fail policy take down a cluster when the Istiod service crashed."
"Best practices: scope webhooks with namespaceSelector to exclude critical system namespaces (kube-system, kube-public) so the cluster can always self-heal. Set reasonable timeouts (2–5s) and have runbooks for emergency disabling. Monitor webhook latency — a slow webhook adds that latency to every kubectl apply."
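The guardrails above in one sketch — the webhook name, service, and path are illustrative:

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: policy-check               # illustrative
webhooks:
  - name: policy.example.com
    clientConfig:
      service:
        name: policy-webhook       # illustrative in-cluster webhook service
        namespace: policy-system
        path: /validate
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["pods"]
    failurePolicy: Ignore          # fail open so the cluster can always self-heal
    timeoutSeconds: 3              # bound the latency added to every matching request
    namespaceSelector:             # exclude critical system namespaces
      matchExpressions:
        - key: kubernetes.io/metadata.name
          operator: NotIn
          values: ["kube-system", "kube-public"]
    admissionReviewVersions: ["v1"]
    sideEffects: None
```

Whether to fail open (Ignore) or closed (Fail) is a policy decision: security-critical policy enforcement may justify Fail, but then the namespaceSelector exclusions become mandatory, not optional.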
"The Ingress API has been stretched beyond its original design. Every feature beyond basic path routing requires implementation-specific annotations — ALB annotations for auth, NLB annotations for TCP, NGINX annotations for rate limiting. There's no standard way to express traffic splitting or header-based routing."
"Gateway API introduces a role-based model with three resource types. GatewayClass is cluster-scoped, controlled by infra teams — it defines the controller (e.g., aws-load-balancer-controller). Gateway is created by cluster operators and provisions the actual load balancer with listener configuration. HTTPRoute is created by app teams and attaches to a Gateway — defining routing rules with standardized syntax for path, header, weight-based traffic splitting."
"The key improvements for platform teams: role separation — app teams write HTTPRoutes without needing cluster-admin, and they can't accidentally misconfigure the shared Gateway. Standardized traffic splitting — canary deployments are first-class, expressed as weight: 90/10 in the HTTPRoute spec without custom annotations. AWS LBC supports Gateway API in newer versions."
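The 90/10 canary split described above as an HTTPRoute — the route, Gateway, and Service names are illustrative; the platform team's Gateway is assumed to already exist:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: checkout               # illustrative; owned by the app team
spec:
  parentRefs:
    - name: shared-gateway     # the Gateway owned by the platform team
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /checkout
      backendRefs:
        - name: checkout-stable
          port: 8080
          weight: 90           # 90/10 canary split, no controller-specific annotations
        - name: checkout-canary
          port: 8080
          weight: 10
```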
"CSI standardizes the interface between Kubernetes and storage vendors. Before CSI, each storage provider had in-tree code in the Kubernetes codebase — a security and release coupling problem. CSI moves this to out-of-tree plugins that run as pods."
"The architecture has two components. The controller plugin is a Deployment that runs on any node and handles cloud API calls: CreateVolume (provision an EBS volume), DeleteVolume, ControllerPublishVolume (attach to an EC2 instance). It runs with IRSA credentials that have ec2 permissions."
"The node plugin is a DaemonSet on every worker node that handles the local mount operations: NodeStageVolume (format and mount to a staging path on the host), NodePublishVolume (bind-mount from staging into the pod's directory). It runs privileged because it needs to create mounts in the host's mount namespace."
"Dynamic provisioning flow: a PVC is created with a StorageClass. The external-provisioner sidecar inside the controller Deployment watches for unbound PVCs, calls CreateVolume on the CSI driver, and creates a PV object. The PVC binds to the PV. When the pod is scheduled to a node, the external-attacher calls ControllerPublishVolume to attach the EBS volume to that EC2 instance, then the node plugin mounts it into the pod."
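The flow above starts from objects like these — a sketch assuming the EBS CSI driver is installed; names and sizes are illustrative:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer  # delay CreateVolume until the pod is
                                         # scheduled, so the volume lands in the right AZ
parameters:
  type: gp3
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data                   # illustrative; the external-provisioner picks this up
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: gp3
  resources:
    requests:
      storage: 20Gi
```

`WaitForFirstConsumer` matters on EKS: EBS volumes are zonal, so provisioning before scheduling can strand a volume in an AZ with no suitable node.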
"The 2-minute spot interruption notice is the key constraint. AWS sends an EC2 instance metadata event, and the AWS Node Termination Handler (or Karpenter natively) picks this up, cordons the node immediately, and drains it — evicting pods gracefully before the instance terminates. This gives pods up to 2 minutes to handle SIGTERM and shut down."
"My tiering strategy: stateless workloads on spot (web services, API servers, workers) with multiple replicas spread across instance families (m5, m5a, m5n, m4 — if m5 spot capacity dries up, m5a picks up the slack). Stateful and critical workloads on on-demand. With Karpenter, this is a NodePool label: karpenter.sh/capacity-type: spot for batch pools, on-demand for critical pools."
"AZ and instance family diversification is critical. A mass reclaim event in us-east-1a m5 family would take down all your spot nodes if they're homogeneous. Diversify: require at least 3 different instance families, 3 AZs. Karpenter automatically selects across families based on pricing."
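The tiering and diversification points above as a Karpenter NodePool sketch (karpenter.sh/v1 schema; pool name illustrative, and the `default` EC2NodeClass is assumed to exist):

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: batch-spot             # illustrative: the spot pool for stateless/batch work
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]     # use "on-demand" in the critical pool
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["m5", "m5a", "m5n"]   # diversify across instance families
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["us-east-1a", "us-east-1b", "us-east-1c"]  # and across AZs
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default          # assumed to exist
```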
"PDBs are non-negotiable for spot workloads. A spot mass reclaim can hit multiple nodes in the same minute. Without PDBs, your 3-replica service could go to 0 during a reclaim wave."
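For the 3-replica example above, a PDB sketch (names and labels illustrative) that caps evictions at one pod at a time:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb                # illustrative
spec:
  minAvailable: 2              # of 3 replicas: drains evict at most one pod at a time
  selector:
    matchLabels:
      app: web
```

Note that a PDB only slows voluntary evictions (drains); it cannot stop EC2 from reclaiming the instance at the 2-minute mark.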
"Migrating a stateful workload involves three parallel concerns: Kubernetes resource migration, data migration, and traffic cutover. I'd approach them as a pipeline."
"Kubernetes resources: use Velero to back up all resources in the namespace from the source cluster and restore them to the destination. Velero handles PV/PVC metadata, ConfigMaps, Secrets, ServiceAccounts, and IRSA annotations."
"Data migration: take a CSI VolumeSnapshot of each PVC in the source cluster. Use the VolumeSnapshotContent object's snapshot handle to create a new PVC from snapshot in the destination cluster. For large volumes, this runs in parallel with resource setup. For databases, I'd quiesce writes (or use application-level replication) before the final snapshot to avoid consistency issues."
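A sketch of the snapshot-and-restore pair described above, with illustrative names; in the destination cluster a VolumeSnapshotContent pointing at the EBS snapshot handle (plus a matching VolumeSnapshot) must be created first, which is elided here:

```yaml
# Source cluster: snapshot the PVC
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: data-snap
spec:
  volumeSnapshotClassName: ebs-csi-snapclass   # assumed to exist
  source:
    persistentVolumeClaimName: data
---
# Destination cluster: restore a new PVC from the (pre-imported) snapshot
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-restored
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: gp3
  dataSource:
    name: data-snap
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  resources:
    requests:
      storage: 20Gi            # must be >= the source volume size
```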
"Traffic cutover: I stand up the workload in the destination cluster in passive mode (workload running, no external traffic). Run smoke tests against the new cluster's internal ALB. Then use Route 53 weighted routing to gradually shift traffic: 10% to new cluster, verify metrics (error rate, latency) for 30 min, shift 50%, verify, then cut to 100%. Keep the old cluster alive for 30 minutes as a rollback option."
"The hardest part is maintaining data consistency during the traffic transition window if the workload has writes. For databases, application-level replication (Postgres logical replication) to the new cluster until cutover is the cleanest approach."