Complete Reference · KodeKloud AWS EKS Course

AWS EKS Deep Reference

Concepts · Animated Diagrams · Commands · Gotchas · Tradeoffs — all 7 modules
7 Modules · 30+ Topics · 80+ Commands

EKS is AWS's managed Kubernetes — AWS owns the control plane (etcd, API server, scheduler, controllers) spread across 3+ AZs. You own the data plane (nodes, pods, workloads). This split is the foundation of every EKS tradeoff.

EKS Architecture — Animated

Control plane (AWS-managed VPC) ↔ Data plane (your VPC) via cross-account ENI
[Architecture diagram: the AWS-managed VPC hosts etcd (3–5 node quorum, HA multi-AZ), a multi-instance API Server behind an NLB, the Controller Manager, Scheduler, OIDC endpoint, and CloudWatch log forwarding. Your VPC hosts the worker nodes, pods, and add-ons (CoreDNS, kube-proxy, VPC CNI), plus Namespaces, RBAC, CRDs, NetworkPolicies, and workloads. The two sides connect through cross-account ENIs (X-ENIs). AWS handles patches, backups, scaling, and HA for the ~$0.10/hr control plane fee; you handle OS patches, IAM, and Security Groups.]

Shared Responsibility Model

Core
AWS Manages (Control Plane)

etcd (3–5 nodes, quorum-managed) · API Server (multi-instance behind NLB) · Controller Manager · Scheduler · Control plane VPC + cross-account ENIs · Automatic HA across 3 AZs · Patching, backups, scaling · OIDC endpoint · CloudWatch log forwarding

You Manage (Data Plane)

Worker nodes (EC2 or Fargate) · OS patches & AMI updates · Kubernetes workloads, Namespaces, RBAC, CRDs · Pod config, resource requests · Security Groups, IAM roles · Node group scaling policies · Add-on version upgrades · VPC/subnet design

You pay ~$0.10/hr per EKS cluster for the control plane regardless of node count. Worker nodes bill separately as EC2. Always clean up test clusters — a forgotten cluster costs ~$72/month in control plane fees alone.
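That monthly figure is just the hourly fee compounded. A quick sketch (assumes the standard $0.10/hr fee; regional pricing and future price changes may differ):

```python
# Control-plane cost math behind the "~$72/month" figure.
# Assumes the standard $0.10/hr EKS cluster fee; excludes worker node EC2 costs.

def monthly_control_plane_cost(hourly_fee: float = 0.10, days: int = 30) -> float:
    """Cost of one idle EKS cluster per month, nodes excluded."""
    return round(hourly_fee * 24 * days, 2)

print(monthly_control_plane_cost())  # 72.0 per forgotten test cluster
```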

Control Plane Components

Component | Role | EKS Behavior
etcd | Distributed key-value store holding all cluster state | 3–5 nodes, AWS manages quorum, automatic backups, multi-AZ
API Server | Entry point for all kubectl / SDK calls | Multiple instances behind an NLB, auto-scales under load
Controller Manager | Reconcile loops (ReplicaSets, Deployments, ...) | AWS managed; the cloud controller runs separately in modern EKS
Scheduler | Assigns pods to nodes (resources/taints/affinity) | AWS managed, auto-replaced on failure
OIDC Endpoint | Issues tokens for IAM-to-K8s identity mapping (IRSA) | Auto-created per cluster; associate via eksctl

Control Plane ↔ Data Plane Communication

AWS provisions Cross-Account ENIs (X-ENIs) inside your VPC subnets. Traffic between your nodes and the AWS-managed API Server flows through these ENIs — no public internet required in private endpoint mode.

How a kubectl apply reaches etcd
[Flow: kubectl apply (your machine) → aws eks get-token (IAM auth) → API Server (RBAC check) → X-ENI bridge (your VPC → AWS) → Controller Manager (reconcile) → etcd (state stored).]
Misconfigured VPC route tables or Security Groups blocking port 443 to X-ENI CIDRs = nodes can't join cluster. Always allow 443 outbound from worker SG to the control plane endpoint.

Deployment Options

IaC
AWS Console

Click-through creation. Good for learning. Not reproducible. Avoid for production.

eksctl

Official CLI. Fastest path to a cluster. Generates CloudFormation under the hood. Best for learning/labs.

Terraform

HCL + official EKS Blueprints module. Remote state in S3+DynamoDB. Best for production IaC.

CDK / Pulumi

TypeScript/Python code → CloudFormation or direct API. Good for platform teams with existing code pipelines.

$ eksctl create cluster --name my-cluster --region us-east-1 --nodes 3 --node-type t3.medium   # fastest cluster creation; generates a CloudFormation stack automatically
$ aws eks update-kubeconfig --region us-east-1 --name my-cluster   # merge cluster credentials into ~/.kube/config
$ eksctl get cluster --region us-east-1   # list all EKS clusters in the region
$ eksctl delete cluster --name my-cluster --region us-east-1   # clean up cluster + all CloudFormation resources
⚠ Terraform State

Always configure remote backend (S3 + DynamoDB locking) before creating EKS with Terraform. Local state + team collaboration = corruption.

⚠ Tool drift

If you create via Console then try to manage with eksctl/Terraform, you'll have state drift. Pick one tool per cluster and stick to it.

Essential Tools

Tool | Purpose | Install
kubectl | K8s control: deployments, pods, services, logs | brew install kubectl
eksctl | EKS-specific cluster/nodegroup/addon management | brew tap weaveworks/tap && brew install eksctl
aws CLI v2 | AWS API: EKS, IAM, EC2, ECR, etc. | brew install awscli
helm | Package manager for Kubernetes charts | brew install helm
eksdemo | Demo/lab tool: VPC/subnet/ENI inspection | brew install eksdemo
kubectx / kubens | Fast context and namespace switching | brew install kubectx
kube-no-trouble (kubent) | Scan for deprecated API usage before upgrades | see github.com/doitintl/kube-no-trouble

EKS networking uses AWS VPC CNI by default — each pod gets a real VPC IP from your subnet via secondary ENI IPs. No overlay, lowest latency, but IP exhaustion is a real production risk you must plan for upfront.

VPC CNI — How Pod IPs Are Assigned

AWS VPC CNI: secondary IPs on ENIs → pod network namespaces
[Diagram: in a 10.0.0.0/16 VPC with subnets 10.0.1.0/24 (us-east-1a) and 10.0.2.0/24 (us-east-1b), each m5.xlarge node has a primary ENI (eth0, node IP) plus secondary ENIs (eth1, eth2), each carrying +15 secondary IPs handed out as pod IPs. The aws-node DaemonSet manages ENI attachment and the IP pool. Pod-to-pod traffic routes directly through the VPC with no NAT.]

CNI Options

CNI | Pod IP Source | Overlay | Latency | Use Case
AWS VPC CNI | VPC subnet IPs | No | Lowest | Default EKS, direct VPC routing, no overhead
Cilium | Custom CIDR | Yes (Geneve/eBPF) | Variable | Advanced L7 policies, observability, service mesh
Calico | Custom CIDR | Yes (IP-in-IP) | Low overhead | NetworkPolicy-heavy, BGP routing
Flannel | Custom CIDR | Yes (VXLAN) | Medium | Simple dev clusters

Prefix Delegation — Pod Density Scale-up

IP Scale

Without prefix delegation, each secondary ENI slot holds 1 IP → 1 pod. With prefix delegation, each slot holds a /28 block = 16 IPs. Same instance, dramatically more pods.

Without vs with prefix delegation — same m5.xlarge (4 ENIs × 15 slots)
[Diagram: an m5.xlarge (4 ENIs × 15 slots) supports 60 pods without prefix delegation (each slot = 1 pod IP) versus 960 pods with ENABLE_PREFIX_DELEGATION=true (each slot = a /28 block of 16 IPs, assigned instantly with no ENI-attach wait).]
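The density jump reduces to arithmetic. A sketch using the figures above (4 ENIs × 15 secondary slots on an m5.xlarge); note that in practice kubelet's max-pods setting caps schedulable pods below the raw IP count:

```python
def pod_ip_capacity(enis: int, slots_per_eni: int, prefix_delegation: bool) -> int:
    """Raw pod-IP capacity of one node under the AWS VPC CNI.

    Without prefix delegation each secondary slot holds 1 IP;
    with it, each slot holds a /28 block of 16 IPs.
    """
    ips_per_slot = 16 if prefix_delegation else 1
    return enis * slots_per_eni * ips_per_slot

assert pod_ip_capacity(4, 15, prefix_delegation=False) == 60
assert pod_ip_capacity(4, 15, prefix_delegation=True) == 960
```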
WARM_ENI_TARGET

Keep N spare ENIs attached. Default: 1. Higher = faster pod starts, more idle IPs consumed.

WARM_IP_TARGET

Keep N spare IPs ready. More granular than the ENI target. Combine with MINIMUM_IP_TARGET for small nodes.

WARM_PREFIX_TARGET

For prefix delegation: keep N spare /28 prefixes attached. Default: 1. Enables instant pod scheduling.
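A simplified model of how the warm pool responds to these knobs; the real aws-node IPAM daemon also honors per-instance ENI limits and subnet headroom, so treat this as illustrative only:

```python
# Simplified warm-pool model for the AWS VPC CNI IPAM daemon (illustrative).
# WARM_IP_TARGET = free IPs to keep ready; MINIMUM_IP_TARGET = floor on total IPs.

def ips_to_attach(total_ips: int, in_use: int, warm_ip_target: int,
                  min_ip_target: int = 0) -> int:
    """How many extra IPs the daemon would try to attach to refill the pool."""
    desired_total = max(min_ip_target, in_use + warm_ip_target)
    return max(0, desired_total - total_ips)

assert ips_to_attach(total_ips=6, in_use=5, warm_ip_target=2) == 1   # 1 free, want 2
assert ips_to_attach(total_ips=10, in_use=5, warm_ip_target=2) == 0  # already 5 free
assert ips_to_attach(total_ips=2, in_use=0, warm_ip_target=0, min_ip_target=4) == 2
```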

Enable Prefix Delegation
# Enable on existing cluster
kubectl set env ds aws-node -n kube-system \
  ENABLE_PREFIX_DELEGATION=true \
  WARM_PREFIX_TARGET=1

# Verify — sort pods by IP to see /28 blocks
kubectl get pods -o wide --sort-by='.status.podIP'
# You'll see blocks like 192.168.114.16–31 on same node
Prefix delegation requires contiguous /28 blocks in your subnet. On a fragmented subnet, prefix allocation fails even with free individual IPs. Plan subnet size (at least /22) upfront.

Network Policies

By default all pods can reach all other pods (no isolation). NetworkPolicy resources add L3/L4 traffic rules. Use aws-eks-nodeagent (eBPF) or Cilium/Calico for enforcement.

deny-all ingress default
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
spec:
  podSelector: {}   # all pods
  policyTypes:
    - Ingress
  # no rules = deny all ingress
allow frontend → backend
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend
spec:
  podSelector:
    matchLabels:
      app: backend
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend
    ports:
    - port: 8080
Enable eBPF network policy enforcement via the managed addon configuration: aws eks update-addon --cluster-name my-cluster --addon-name vpc-cni --configuration-values '{"enableNetworkPolicy": "true"}'. Or use Cilium for L7 policies + deep observability.

VPC & Subnet Design Rules

Public Subnets

Nodes with public IPs, IGW route. For load balancers only. Tag: kubernetes.io/role/elb=1

Private Subnets

Worker nodes (recommended). NAT Gateway for egress. Tag: kubernetes.io/role/internal-elb=1

Sizing Rule

Use /16 VPC per cluster. /22 subnets minimum per AZ. A /24 (250 IPs) fills up with prefix delegation enabled on just a few nodes.

⛔ IP Exhaustion

With AWS VPC CNI, each pod consumes a real VPC IP. A single m5.xlarge can consume 60 IPs (4 ENIs × 15 IPs). On a /24 subnet = 4 nodes max. Plan /22 or /21 subnets per AZ.

⚠ Multi-cluster in one VPC

Running multiple clusters in one VPC multiplies IP consumption. Isolate clusters in dedicated VPCs unless you have a specific reason to share.

Three compute models: Managed Node Groups (EC2 ASG, you own the OS), Fargate (serverless pods), and Karpenter (dynamic EC2 provisioner, workload-driven). The right choice depends on workload type, cost target, and ops burden tolerance.

Animated Compute Model Comparison

Three ways to run pods on EKS — click boxes to highlight
[Diagram: Managed Node Groups (EC2 ASG: ✓ DaemonSets, ✓ GPUs / host network, △ you patch AMIs, ✗ slower scaling) · AWS Fargate (one microVM per pod, AWS manages the host: ✓ zero node management, ✓ perfect isolation, ✗ no DaemonSets / GPUs, ✗ cold start ~30–60s) · Karpenter (controller provisions optimal EC2 per pending pod via the Fleet API, any instance type: ✓ pod-level provisioning, ✓ auto least-cost / spot, ✓ consolidation, △ needs resource requests).]

Full Feature Comparison

Dimension | Managed Node Groups | Fargate | Karpenter
Node ownership | You (EC2) | AWS managed | You (EC2)
OS patching | You update AMIs | AWS manages | You (NodeClass AMI config)
Startup latency | Fast (pre-provisioned) | 30–60s cold start | 30–90s (EC2 boot)
Scaling granularity | Group-level ASG | Per pod | Per pod
Spot support | Manual per node group | No spot on Fargate | Auto least-cost with spot
DaemonSets | ✓ | ✗ | ✓
GPUs / NVMe | ✓ | ✗ | ✓
Host networking | ✓ | ✗ | ✓
Persistent EBS | ✓ | ✗ (EFS only) | ✓
Auto consolidation | ✗ manual | N/A (pod-per-node) | ✓ built-in

Managed Node Groups — Commands

$ eksctl create nodegroup --cluster my-cluster --name ng-workers --node-type m5.xlarge --nodes 3 --nodes-min 1 --nodes-max 10 --managed   # create a managed node group with auto-scaling
$ eksctl scale nodegroup --cluster my-cluster --name ng-workers --nodes 5   # manually scale a node group
$ eksctl upgrade nodegroup --cluster my-cluster --name ng-workers --kubernetes-version 1.30   # rolling upgrade of the node group to a new K8s version
$ eksctl delete nodegroup --cluster my-cluster --name ng-workers --drain   # drain workloads, then delete the node group
⚠ maxUnavailable during upgrades

Default is 1 node at a time — very slow for large clusters. Increase maxUnavailable in the nodegroup update config. Ensure your workloads have replicas > maxUnavailable to avoid downtime.

⚠ EBS + AZ affinity

If a pod uses EBS and gets rescheduled to a different AZ, it can't reattach the volume. Use volumeBindingMode: WaitForFirstConsumer on StorageClass and consider per-AZ node groups for stateful workloads.

Karpenter — NodePool + NodeClass

Recommended
NodePool CRD
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      nodeClassRef:
        name: default
      requirements:
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["spot", "on-demand"]
      - key: kubernetes.io/arch
        operator: In
        values: ["amd64"]
  limits:
    cpu: 1000
    memory: 1000Gi
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
EC2NodeClass
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiSelectorTerms:
  - alias: al2023@latest
  subnetSelectorTerms:
  - tags:
      karpenter.sh/discovery: my-cluster
  securityGroupSelectorTerms:
  - tags:
      karpenter.sh/discovery: my-cluster
  role: KarpenterNodeRole-my-cluster
⚠ Missing resource requests = bad scheduling

Always set resources.requests on all containers. Without them, Karpenter can't determine the right instance size and may bin-pack everything onto one tiny instance.
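Why requests matter can be shown with a toy version of instance selection: pick the cheapest type that fits the aggregate requests of pending pods. The instance specs and prices below are illustrative, not live pricing, and real Karpenter consults the EC2 Fleet API:

```python
# Toy model of Karpenter-style instance selection (illustrative catalog/prices).
CATALOG = [  # (name, vCPU, memory GiB, $/hr)
    ("t3.medium", 2, 4, 0.0416),
    ("m5.large", 2, 8, 0.096),
    ("m5.xlarge", 4, 16, 0.192),
    ("m5.2xlarge", 8, 32, 0.384),
]

def pick_instance(pending_pods: list[tuple[float, float]]) -> str:
    """Cheapest instance fitting the summed (cpu, mem_gib) requests.

    With no requests set, every pod sums to zero and the smallest
    instance 'fits': the bin-packing failure mode in the warning above.
    """
    need_cpu = sum(p[0] for p in pending_pods)
    need_mem = sum(p[1] for p in pending_pods)
    for name, cpu, mem, _price in sorted(CATALOG, key=lambda i: i[3]):
        if cpu >= need_cpu and mem >= need_mem:
            return name
    raise ValueError("no single instance fits; split the pod batch")

assert pick_instance([(1, 2), (1, 2), (1.5, 4)]) == "m5.xlarge"  # 3.5 vCPU, 8 GiB
assert pick_instance([(0, 0)] * 50) == "t3.medium"               # no requests set!
```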

⚠ No PodDisruptionBudget = disruptive consolidation

Karpenter drains nodes during consolidation. Without PDBs, critical services may briefly have 0 replicas. Always define PDBs for production workloads.

⛔ Don't run Karpenter + Cluster Autoscaler together

They will fight over scale-down decisions on the same node groups. Pick one — Karpenter is the recommended choice for new clusters.

EKS storage: EBS (block, AZ-bound, databases), EFS (shared file, multi-AZ, Fargate-compatible), FSx Lustre (HPC/ML), FSx ONTAP (enterprise). All use the CSI driver pattern — a controller Deployment + node DaemonSet.

Storage Options — Animated Decision Map

Choose the right storage based on access pattern and AZ scope
[Decision map: shared across multiple pods? NO → EBS (block, ReadWriteOnce, single AZ, low latency: databases, StatefulSets), or Instance Store (ephemeral NVMe: temp/cache only). YES → EFS (NFS, ReadWriteMany, multi-AZ, Fargate ✓); high perf / HPC → FSx Lustre (HPC, ML, batch); enterprise NAS → FSx ONTAP (SMB/NFS/iSCSI).]

Storage Comparison

Service | Type | Access Mode | AZ Scope | Best For | Fargate
EBS | Block | RWO | Single AZ | Databases, stateful apps | ✗
EFS | File (NFS) | RWX | Multi-AZ | Shared config, CMS, ML datasets | ✓
FSx Lustre | High-perf parallel FS | RWX | Single AZ | HPC, ML training, batch | ✗
FSx ONTAP | Enterprise NAS | RWO/RWX | Multi-AZ | SMB/NFS/iSCSI enterprise lift-shift | ✗
Instance Store | NVMe local | RWO | Node-local ephemeral | Temp data, cache, shuffle | ✗

EBS Volume Types

gp3 ← Use This

General purpose SSD. 3,000 IOPS baseline, configurable up to 16,000. Cheaper than gp2 at same size. Default for new clusters. Always use gp3 unless specific IOPS needs.

IOPS/GB efficiency★★★★★
gp2 (legacy)

IOPS tied to size (3 IOPS/GB, burst 3000). Many clusters still default to gp2 — override explicitly. Avoid for new provisioning.

IOPS/GB efficiency★★★☆☆
io1 / io2

Provisioned IOPS SSD up to 64,000 (io2 Block Express: 256,000). For latency-sensitive databases. io2 supports multi-attach (RWX for EC2).

IOPS/GB efficiency★★★★★
StorageClass · gp3 recommended
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer  # ← CRITICAL: delays PV creation until pod AZ is known
allowVolumeExpansion: true
parameters:
  type: gp3
  encrypted: "true"
  iops: "3000"
  throughput: "125"
$ kubectl get storageclasses   # list StorageClasses and their provisioners/binding modes
$ kubectl get pvc --all-namespaces   # list all PersistentVolumeClaims; check for Pending PVCs
$ kubectl describe pvc <name> -n <ns>   # debug PVC binding; the events section shows why it's stuck
$ eksctl create addon --name aws-ebs-csi-driver --cluster my-cluster --service-account-role-arn arn:aws:iam::ACCOUNT:role/EBSCSIRole   # install the EBS CSI driver as a managed addon
⛔ EBS AZ Lock-in

EBS volumes are bound to a single AZ. If your pod reschedules to a different AZ, it cannot attach the volume → pod stays Pending. Always use volumeBindingMode: WaitForFirstConsumer so the PV is provisioned in the same AZ as the scheduled pod.

⛔ gp2 still default on older clusters

Clusters created before EKS 1.23 may still have gp2 as the default StorageClass. Override: annotate gp3 with storageclass.kubernetes.io/is-default-class: "true" and remove it from gp2.

⚠ EBS not available on Fargate

Fargate pods cannot attach EBS volumes. Use EFS with ReadWriteMany for Fargate persistent storage.

EFS — Elastic File System

EFS is NFS-backed, multi-AZ, and supports ReadWriteMany — multiple pods across nodes/AZs can mount simultaneously. The only persistent storage option for Fargate.

StorageClass · EFS dynamic provisioning
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap          # access point per PVC (isolation)
  fileSystemId: fs-0123456789abcdef
  directoryPerms: "700"
---
# PVC excerpt: the key difference from EBS
accessModes: [ReadWriteMany]  # multiple pods can mount simultaneously
EFS charges per GB stored (~$0.30/GB/month Standard tier) plus data transfer. For large datasets, compare with S3 + EBS snapshot patterns. EFS has higher latency than EBS for small random I/O workloads.

EKS security has two axes: cluster access (who can kubectl) via aws-auth/EKS Access Entries, and pod AWS permissions (what AWS APIs can a pod call) via IRSA or Pod Identity. Never give nodes broad IAM roles.

Authentication Flow — Animated

kubectl → IAM → EKS Auth → K8s RBAC — full request path
[Flow: ① developer sends kubectl apply using ~/.kube/config → ② IAM identity token via aws eks get-token (STS presigned URL) → ③ EKS Auth API validates the token and maps it to a K8s username (via the aws-auth ConfigMap or EKS Access Entries) → ④ K8s RBAC checks ClusterRoleBindings/RoleBindings → ⑤ API Server admits or denies the request, then writes to etcd.]

Cluster Access — aws-auth vs EKS Access Entries

aws-auth ConfigMap (legacy)

Maps IAM ARNs → K8s usernames/groups. Lives in kube-system. YAML-based, fragile. A malformed YAML can lock everyone out of the cluster. Being replaced by EKS Access Entries.

EKS Access Entries (newer; supported on 1.23+)

Native EKS API for cluster access. No ConfigMap required. Manages via AWS Console/CLI/Terraform. More auditable, CloudTrail-logged. Recommended for all new clusters.

aws-auth — safe management via eksctl
# Safe: eksctl validates before applying
eksctl create iamidentitymapping \
  --cluster my-cluster \
  --arn arn:aws:iam::123456789012:role/TeamDevRole \
  --group system:masters \
  --username team-dev

# RISKY: direct edit — backup first!
kubectl get cm aws-auth -n kube-system -o yaml > aws-auth-backup.yaml
kubectl edit cm aws-auth -n kube-system

# EKS Access Entries (new API)
aws eks create-access-entry \
  --cluster-name my-cluster \
  --principal-arn arn:aws:iam::123456789012:role/TeamDevRole

aws eks associate-access-policy \
  --cluster-name my-cluster \
  --principal-arn arn:aws:iam::123456789012:role/TeamDevRole \
  --policy-arn arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy \
  --access-scope type=cluster
A YAML syntax error in aws-auth locks out ALL IAM users except the original cluster creator. Always use eksctl create iamidentitymapping instead of direct kubectl edit. Keep a backup of the current ConfigMap before any changes.

IRSA — IAM Roles for Service Accounts

IRSA binds an IAM role to a K8s ServiceAccount. The EKS mutating webhook injects env vars + a projected token into pods. The AWS SDK exchanges the token for STS temporary credentials automatically.

IRSA token exchange flow — pod → OIDC → STS → AWS APIs
[Flow: ① pod starts with ServiceAccount s3-reader → ② mutating webhook injects env vars + a projected token → ③ OIDC provider issues a JWT → ④ STS AssumeRoleWithWebIdentity against the IAM role trust policy → ⑤ temp creds → ⑥ AWS API call (S3 / DDB). Steps ③–⑤ happen automatically inside the AWS SDK; zero app code changes needed.]
Step 1 · Associate OIDC Provider (once per cluster)
# Required once
eksctl utils associate-iam-oidc-provider --cluster my-cluster --approve
# Get OIDC issuer URL
aws eks describe-cluster --name my-cluster --query "cluster.identity.oidc.issuer" --output text
Step 2 · Create IAM Role with OIDC trust policy
{
  "Principal": { "Federated": "arn:aws:iam::ACCOUNT:oidc-provider/oidc.eks.REGION.amazonaws.com/id/OIDCID" },
  "Action": "sts:AssumeRoleWithWebIdentity",
  "Condition": { "StringEquals": {
    "oidc...sub": "system:serviceaccount:NAMESPACE:SA-NAME"
  }}
}
Step 3 · Annotate ServiceAccount
annotations:
  eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/MyPodRole
Step 4 · Reference in Pod spec (automatic token injection)
spec:
  serviceAccountName: s3-reader   # webhook injects AWS_ROLE_ARN + AWS_WEB_IDENTITY_TOKEN_FILE
⚠ 100 OIDC provider limit per AWS account

Each EKS cluster gets its own OIDC provider. At 100+ clusters you hit the account limit. Consider Pod Identity (no OIDC required) for large fleets.

⚠ STS rate limits at scale

Each pod startup = 1 STS AssumeRoleWithWebIdentity call. At thousands of pod starts/min, you can hit per-role STS throttling. Pod Identity caches credentials locally — better at scale.

Pod Identity — IRSA v2

Recommended New

Pod Identity (EKS 1.24+) removes the OIDC provider requirement. Mappings stored in EKS control plane, not pod annotations. A privileged DaemonSet on each node proxies and caches credentials.

Feature | Kube2IAM | IRSA | Pod Identity
OIDC provider needed | ✗ | ✓ | ✗
Annotations needed | ✓ (pod) | ✓ (ServiceAccount) | ✗
Local credential proxy + cache | ✓ | ✗ | ✓
ABAC / tag-based policies | ✗ | ✗ | ✓
Works off EKS | ✓ | ✓ (OIDC) | ✗ EKS only
Pod Identity setup
# 1. Install Pod Identity Agent addon
eksctl create addon --name eks-pod-identity-agent --cluster my-cluster

# 2. Associate role — no SA annotation needed
aws eks create-pod-identity-association \
  --cluster-name my-cluster \
  --namespace default \
  --service-account my-app-sa \
  --role-arn arn:aws:iam::123456789012:role/MyPodRole

# 3. Verify
aws eks list-pod-identity-associations --cluster-name my-cluster

Secrets Management

Native K8s Secrets ⚠

Base64 encoded, stored in etcd. Not encrypted at rest by default. Any pod in namespace can access. Always enable KMS envelope encryption.

AWS Secrets Manager

External Secrets Operator (ESO) or Secrets Store CSI Driver syncs secrets → K8s. Rotation support, CloudTrail audit trail, cross-account support.

AWS SSM Parameter Store

Cheaper than Secrets Manager for non-sensitive config. Use SecureString for sensitive values. Accessible via ESO or SSM Agent.

Enable KMS encryption for etcd secrets
# Enable at cluster creation
eksctl create cluster \
  --name my-cluster \
  --with-oidc \
  --secrets-encryption-key-arn arn:aws:kms:REGION:ACCOUNT:key/KEY-ID

AWS Load Balancer Controller (AWS LBC) provisions ALBs (HTTP/HTTPS L7) and NLBs (TCP/UDP L4) from Kubernetes Ingress and Service resources. Understanding the full traffic path is essential for debugging and avoiding cross-AZ cost surprises.

Traffic Flow — Animated End to End

External request path: Internet → ALB → NodePort → kube-proxy → Pod
[Flow: Internet client → ALB (L7 routing, TLS termination, host/path rules; managed by AWS LBC) → NodePort :30xxx on all nodes → kube-proxy (iptables/ipvs; may hop cross-AZ!) → pod on the same or a different node. Setting spec.externalTrafficPolicy: Local removes the cross-node hop and preserves the client source IP; use it in production.]

ALB vs NLB

Feature | ALB (Application LB) | NLB (Network LB)
OSI Layer | L7 (HTTP/HTTPS) | L4 (TCP/UDP/TLS)
Routing rules | Host, path, header, query string | Port-based only
Latency | Higher (L7 processing overhead) | Ultra-low (<1ms)
Source IP | Via X-Forwarded-For header | Preserved natively
Static IP / Elastic IP | ✗ (DNS hostname only) | ✓
AWS PrivateLink | ✗ | ✓
gRPC / HTTP2 | ✓ (native) | ✓ (TLS passthrough)
WebSocket | ✓ | ✓
K8s resource | Ingress / IngressClass | Service type:LoadBalancer
Best for | HTTP APIs, microservices, HTTPS termination | gRPC, databases, real-time, static IP needs

AWS Load Balancer Controller

Install AWS LBC via Helm
# Add Helm repo
helm repo add eks https://aws.github.io/eks-charts
helm repo update

# Install (requires IAM SA pre-created with IRSA)
helm upgrade --install aws-load-balancer-controller eks/aws-load-balancer-controller \
  -n kube-system \
  --set clusterName=my-cluster \
  --set serviceAccount.create=false \
  --set serviceAccount.name=aws-load-balancer-controller

# Verify
kubectl get deployment -n kube-system aws-load-balancer-controller
kubectl logs -n kube-system -l app.kubernetes.io/name=aws-load-balancer-controller | tail -20
Ingress · ALB with IngressGroup (share one ALB)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    kubernetes.io/ingress.class: alb    # legacy annotation; newer LBC versions prefer spec.ingressClassName: alb
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip            # direct to pod, skips kube-proxy
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:...
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTP":80},{"HTTPS":443}]'
    alb.ingress.kubernetes.io/ssl-redirect: "443"
    alb.ingress.kubernetes.io/group.name: my-app           # share one ALB across Ingresses
Use target-type: ip (routes directly to pod IP) over target-type: instance (routes via NodePort → kube-proxy). IP mode skips kube-proxy entirely, reduces hops, and requires VPC CNI pod IPs — which EKS provides by default.
⚠ Each Ingress without group.name = separate ALB (~$18/month each)

Without alb.ingress.kubernetes.io/group.name, every Ingress resource provisions its own ALB. Use IngressGroups to share one ALB across multiple services — critical for cost control.
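The savings from IngressGroups are easy to quantify (using the ~$18/month base fee cited above; LCU and data charges excluded):

```python
def alb_monthly_base_cost(num_ingresses: int, shared_group: bool,
                          per_alb: float = 18.0) -> float:
    """Base ALB fees only; LCU and data-processing charges excluded."""
    albs = 1 if shared_group else num_ingresses  # group.name collapses to one ALB
    return albs * per_alb

assert alb_monthly_base_cost(6, shared_group=False) == 108.0  # one ALB per Ingress
assert alb_monthly_base_cost(6, shared_group=True) == 18.0    # one shared ALB
```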

⚠ ALB provisioning takes 1–3 minutes

Unlike NLBs, ALBs go through target health check warm-up. Don't expect instant availability after applying an Ingress. Check events on the Ingress object: kubectl describe ingress <name>

Kubernetes releases 3 minor versions per year. Each EKS version gets roughly 14 months of standard support; after that it enters paid extended support, and eventually AWS auto-upgrades your control plane after a warning. Plan upgrades at least annually, one minor version at a time.
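The one-minor-version rule turns a multi-version catch-up into a sequence of hops; a small helper to enumerate them:

```python
def upgrade_path(current: str, target: str) -> list[str]:
    """Control-plane versions to pass through, one minor hop at a time."""
    major, cur_minor = (int(x) for x in current.split("."))
    tgt_minor = int(target.split(".")[1])
    if tgt_minor < cur_minor:
        raise ValueError("EKS does not support downgrades")
    return [f"{major}.{m}" for m in range(cur_minor + 1, tgt_minor + 1)]

assert upgrade_path("1.27", "1.30") == ["1.28", "1.29", "1.30"]  # three separate hops
assert upgrade_path("1.30", "1.30") == []                        # already current
```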

In-Place Upgrade — Animated Sequence

5-step in-place upgrade sequence — control plane first, data plane last
[Sequence: ① Pre-check (kubent + EKS Insights API: deprecated APIs? fix PSPs if <1.25; ~10–30 min) → ② Control plane (eksctl upgrade cluster or aws eks update-cluster-version; AWS rolling replace, ~15–30 min) → ③ Add-ons (aws eks update-addon: vpc-cni, kube-proxy, coredns; ~5 min each) → ④ Data plane (eksctl upgrade nodegroup: cordon → drain → new AMI → uncordon; slowest step) → ⑤ Verify (kubectl get nodes -o wide, kubectl get pods -A, smoke tests). Example: v1.29 → v1.30; one minor version at a time only.]

Upgrade Commands

$ kubent   # scan cluster for deprecated API usage before upgrading (kube-no-trouble)
$ aws eks list-insights --cluster-name my-cluster --filter kubernetesVersion=1.30   # EKS Insights: flags removed APIs and compatibility issues
$ eksctl upgrade cluster --name my-cluster --version 1.30 --approve   # upgrade control plane (one minor version at a time)
$ aws eks update-addon --cluster-name my-cluster --addon-name coredns --addon-version v1.11.1-eksbuild.9   # upgrade an individual managed addon
$ aws eks describe-addon-versions --kubernetes-version 1.30 --addon-name coredns --query "addons[].addonVersions[0].addonVersion"   # find the latest addon version for the target K8s version
$ eksctl upgrade nodegroup --cluster my-cluster --name ng-workers --kubernetes-version 1.30 --force-upgrade   # upgrade managed node group: rolling replace with a new AMI

Managed Add-ons

Add-on | Purpose | Required?
vpc-cni | AWS VPC CNI: pod IP allocation from the VPC | Yes
kube-proxy | Service networking (iptables/ipvs rules) | Yes
coredns | Cluster DNS resolution | Yes
aws-ebs-csi-driver | EBS PersistentVolume support | If using EBS
eks-pod-identity-agent | Pod Identity credential proxy DaemonSet | If using Pod Identity
adot | AWS Distro for OpenTelemetry | If using OTEL
aws-guardduty-agent | Runtime threat detection for pods | Security baseline
Managed add-ons do NOT auto-upgrade with the control plane. After upgrading the cluster version, manually upgrade each add-on. Mismatched versions cause subtle networking or DNS breakage that's hard to diagnose.

Monitoring

CloudWatch Control Plane Logs

Enable API server, audit, auth, scheduler, controller-manager log types. Essential for auth debugging. Configure via EKS console or CLI.

Prometheus + Grafana

kube-state-metrics + node-exporter for cluster metrics. Helm chart: kube-prometheus-stack. AWS Managed Prometheus available for serverless option.

Container Insights

CloudWatch Container Insights: per-pod CPU/memory/disk/network via Fluent Bit DaemonSet. Simpler for CloudWatch-centric teams.

Enable all control plane logging
aws eks update-cluster-config \
  --name my-cluster \
  --logging \
    '{"clusterLogging":[{"types":["api","audit","authenticator","controllerManager","scheduler"],"enabled":true}]}'

Critical Gotchas

⛔ Never skip Kubernetes minor versions

EKS only supports upgrading one minor version at a time: 1.27→1.28→1.29→1.30. Attempting to skip fails with an API error. Plan for ~30 min per version hop.

⛔ Add-ons don't auto-upgrade with control plane

After upgrading the control plane, manually upgrade each addon. Running mismatched versions (e.g., old vpc-cni with new K8s) causes silent networking issues.

⚠ PodSecurityPolicy removed in 1.25

PSP was removed in Kubernetes 1.25. If on <1.25 with PSPs, migrate to Pod Security Admission (PSA) or OPA/Gatekeeper before upgrading past 1.24. Use kube-no-trouble (kubent) to detect.

⚠ Fargate upgrade = redeploy

Fargate pods don't upgrade in place. After upgrading control plane, trigger a Deployment rollout (kubectl rollout restart deployment/<name>). New pods land on Fargate nodes with the updated K8s runtime.

⚠ Blue-green upgrades double costs

Running both clusters during migration window doubles your EC2 + control plane costs. Also requires careful ALB/NLB migration and DNS TTL management. Only use for large version jumps or CNI changes.

Health Check Command Reference

$ kubectl get nodes -o wide   # node status, K8s version, internal/external IPs
$ kubectl get pods -A | grep -v Running | grep -v Completed   # find unhealthy pods across all namespaces
$ kubectl describe node <name> | grep -A10 Conditions   # node pressure conditions (MemoryPressure, DiskPressure, PIDPressure)
$ kubectl top nodes   # CPU/memory usage per node (requires metrics-server)
$ kubectl top pods -A --sort-by=memory   # top memory-consuming pods cluster-wide
$ kubectl get events -A --sort-by='.lastTimestamp' | tail -30   # recent cluster events sorted by time
$ aws eks describe-cluster --name my-cluster --query "cluster.{Ver:version,Status:status}"   # cluster version and status from AWS
$ eksctl get nodegroup --cluster my-cluster   # all node groups with versions and status
$ aws eks list-addons --cluster-name my-cluster   # list installed addons
$ kubectl rollout status deployment/<name> -n <ns>   # watch rolling deployment progress

Staff L6 interview prep — 40 questions across all 8 topic clusters. Each has key points to memorize, a full model answer, and interviewer follow-up probes. Click any question to expand.

01 · EKS Architecture & Control Plane

Q1 Explain the EKS shared responsibility model. What exactly does AWS manage and what do you own?
Conceptual · Medium

Key Points

  • AWS manages the control plane: etcd (3–5 nodes, multi-AZ quorum), API server (multi-instance behind NLB), scheduler, controller manager, automatic patching/scaling/HA
  • You manage the data plane: EC2 node instances, OS patches, AMI updates, workloads, RBAC, CRDs, network policies, IAM roles, security groups
  • Communication bridge: AWS provisions cross-account ENIs (X-ENIs) in your VPC — nodes talk to the API server through these, not over public internet
  • Cost split: ~$0.10/hr control plane fee + EC2 costs for nodes
  • Staff depth: know that the control plane runs in an AWS-owned VPC, and AWS uses a dedicated account per cluster for isolation

Model Answer

"EKS splits Kubernetes responsibilities at the control plane / data plane boundary. AWS fully manages the control plane — that means etcd with 3–5 nodes spread across AZs maintaining quorum automatically, multiple API server instances behind an NLB that auto-scales under load, the scheduler and controller manager, and all patching, backup, and version management of those components."

"You own everything in your AWS account: worker nodes whether EC2 or Fargate, the OS and AMI lifecycle, all Kubernetes workloads, RBAC policies, CRDs, network policies, IAM role bindings, and security groups. The bridge between the two is AWS provisioning cross-account ENIs into your VPC subnets — your nodes register with the API server through those ENIs, which is why your VPC security groups must allow port 443 outbound to the control plane endpoint."

"The practical implication is that if you misconfigure VPC routing or security groups, your nodes can't join. If AWS has a control plane issue, your existing pods keep running but you can't schedule new ones or change cluster state — etcd goes read-only."

Interviewer Follow-ups

  • If the API server is unavailable, what happens to running pods?
  • How does AWS ensure control plane HA — can you describe the etcd quorum mechanics?
  • What's the blast radius if the X-ENI is misconfigured?
Most candidates stop at "AWS manages the master". The follow-up is: "so if etcd fails, do your pods die?" — answer is NO. kubelet runs a local cache; existing pods keep running until the node restarts or kubelet needs to re-register.
Q2 Walk me through what happens when you run kubectl apply -f deployment.yaml — from CLI to pod running.
Conceptual · Hard

Key Points

  • Auth: kubectl calls aws eks get-token → STS presigned URL → API server validates via EKS auth webhook → RBAC authorization
  • Admission: Mutating webhooks run first (inject sidecars, IRSA env vars), then Validating webhooks (OPA/Gatekeeper policy checks)
  • Persistence: API server writes to etcd (only after quorum acknowledgment)
  • Controllers: Deployment controller watches etcd → creates ReplicaSet → RS controller creates Pod objects
  • Scheduling: Scheduler watches for unbound pods → scores nodes (resources, taints, affinity) → writes nodeName to Pod spec in etcd
  • kubelet: On the target node, kubelet watches API server → calls container runtime (containerd) → pulls image, creates namespaces, calls CNI plugin, starts container
  • CNI: aws-node assigns secondary IP from ENI pool → sets up veth pair → programs routes

Model Answer

"The full path has about 8 distinct phases. Authentication first: kubectl generates a bearer token by calling STS with your IAM credentials to get a presigned URL. The API server passes this to the EKS authentication webhook, which validates the IAM identity and maps it to a Kubernetes username via the aws-auth ConfigMap or EKS Access Entries."

"Then authorization: the API server checks your RBAC permissions for the resource and verb. If you pass, it hits the admission controllers — mutating first, then validating. In a typical EKS cluster this is where IRSA webhook injects environment variables, and where OPA/Gatekeeper blocks policy violations."

"Once admitted, the object is written to etcd — only after a quorum of etcd nodes acknowledges the write. The Deployment controller running inside the controller manager watches etcd via a list-watch and creates a ReplicaSet, which creates Pod objects with no nodeName."

"The scheduler watches for Pods with empty nodeName, scores candidate nodes using predicates (resource fit, taints, pod affinity) and priorities, then writes the chosen node back to the Pod spec in etcd. The kubelet on that node is also watching, picks up the binding, calls containerd to pull the image, creates Linux namespaces, calls the CNI plugin (aws-cni) to allocate a secondary VPC IP and set up the veth pair, then starts the container. The pod transitions through Pending → ContainerCreating → Running."

Interviewer Follow-ups

  • Where exactly does a PodDisruptionBudget get enforced in this flow?
  • What happens if the scheduler crashes mid-way through scoring?
  • Can admission webhooks add latency to deployments at scale? How would you mitigate?
Many candidates skip admission controllers entirely. Staff interviewers expect you to know about mutating vs validating webhooks, their ordering, and failure modes (what happens if a webhook is unavailable — failurePolicy: Fail vs Ignore).
Q3 How does etcd work in EKS, and what are the failure modes you need to understand as an operator?
Conceptual · Hard

Key Points

  • etcd uses Raft consensus — needs (n/2)+1 nodes to form a quorum for writes. With 3 nodes, 2 must agree. With 5, 3 must agree.
  • If quorum is lost: etcd goes read-only. Existing pods keep running (kubelet has local cache) but you can't make API changes.
  • EKS runs 3–5 etcd nodes across AZs. AWS handles this — you can't SSH into them.
  • etcd stores all K8s objects as key-value pairs under /registry/ prefix
  • Watch mechanism: controllers use long-polling watch on etcd, not polling. This is how controllers get notified of state changes efficiently.
  • Large clusters: etcd compaction, defragmentation, and object count limits matter (~1.5 MiB default max request size)

Model Answer

"etcd is a distributed key-value store built on the Raft consensus algorithm. It elects a leader that handles all writes; followers replicate. To commit a write, the leader needs acknowledgment from a majority — quorum. In EKS, AWS runs 3–5 etcd nodes distributed across AZs. With 3 nodes, you can tolerate one AZ failure and still have quorum. With 5, you can tolerate two."

"The critical failure mode to understand: if quorum is lost, etcd becomes read-only. Your existing pods keep running because kubelet caches the pod spec locally — it doesn't need etcd to manage running containers. But you cannot create, update, or delete any Kubernetes objects until quorum is restored. This is why AWS runs this in 3+ AZs."

"As an operator, the things I watch for are: etcd compaction lag (if not compacted, the database grows unboundedly — AWS handles this but it's important to understand), the 8MB request size limit (trying to apply a very large ConfigMap or CRD can hit this), and at scale, object count — etcd's performance degrades with millions of objects, so namespacing and cleaning up stale resources matters."

Interviewer Follow-ups

  • A pod is Running but you can't delete it — kubectl delete hangs. What's happening and how do you debug?
  • How does the watch mechanism in etcd work, and why is it more efficient than polling?
  • What's a finalizer and how can it cause etcd objects to get stuck?
The "kubectl delete hangs" scenario almost always comes up. The answer: the object has a finalizer that isn't being resolved (e.g., the controller managing it is down). Fix: patch the object to remove the finalizer — but this bypasses cleanup logic, so understand the risk.
02

Networking — CNI, VPC, ENIs, Policies

Q4 Explain how the AWS VPC CNI assigns IP addresses to pods. What are the limits and failure modes?
Conceptual · Hard

Key Points

  • aws-node DaemonSet runs on every node, manages ENI attachment and secondary IP allocation
  • Each pod gets a real VPC IP from the node's ENI secondary IP pool — no overlay, no NAT
  • EC2 instance type determines max ENIs and IPs per ENI — e.g., m5.xlarge: 4 ENIs × 15 IPs per ENI (one primary each), so max pods = 4 × (15 − 1) + 2 = 58
  • Warm pools: WARM_ENI_TARGET / WARM_IP_TARGET pre-allocate to reduce scheduling latency
  • Prefix delegation: each secondary IP slot holds a /28 prefix (16 IPs) instead of a single IP — roughly 16× the pod capacity from the same ENI count
  • Failure mode: subnet IP exhaustion → pods stay Pending with "not enough IPs" error
  • Failure mode: ENI attach limit hit → aws-node can't allocate new IPs, pods Pending

Model Answer

"The aws-node DaemonSet — the VPC CNI plugin — runs on each worker node and is responsible for pre-allocating a pool of IPs from your VPC subnet. It attaches secondary ENIs to the EC2 instance and assigns secondary private IPs to those ENIs. When a pod is scheduled, the kubelet calls the CNI plugin, which picks a free IP from the warm pool, creates a veth pair — one end in the pod's network namespace, one on the host — and programs the Linux kernel routing table so traffic to that IP goes into the pod."

"The capacity ceiling is instance-type dependent. An m5.xlarge supports 4 ENIs with up to 15 secondary IPs each, so 60 pod IPs. With prefix delegation enabled, each of those 15 secondary slots becomes a /28 prefix (16 IPs), giving you 4 × 15 × 16 = 960 pod IPs per node."

"The two failure modes I've seen in production: first, subnet IP exhaustion — your /24 runs out of IPs and pods get stuck Pending. The fix is either resize subnets (can't do in-place, need migration) or use prefix delegation with larger subnets. Second, the warm pool latency problem — if WARM_ENI_TARGET is 0, the first pod on a cold node waits for ENI attachment, which can take 10–30 seconds. Setting WARM_ENI_TARGET=1 pre-allocates a spare ENI."

Interviewer Follow-ups

  • You're seeing pods stuck in Pending with FailedScheduling — how do you diagnose whether it's IP exhaustion vs node capacity?
  • When would you choose Cilium over VPC CNI?
  • How does prefix delegation interact with subnet fragmentation?
Interviewers love asking "why would you NOT use VPC CNI?" — key answer: when you need to conserve VPC IPs (large cluster, tight CIDR), you'd use Cilium or Calico with custom CIDR + overlay. Trade off is encapsulation latency and loss of native VPC routing.
Q5 A pod can't reach another pod in a different namespace. How do you debug this systematically?
Scenario · Hard

Key Points

  • Start at layer 7: is it a DNS issue or a network issue? (nslookup from pod)
  • Check NetworkPolicy — a deny-all default in either namespace blocks cross-namespace traffic
  • Check if the target service exists and has endpoints (kubectl get endpoints)
  • Check kube-proxy / iptables rules on the node: iptables -L -t nat
  • Check aws-node logs for IP allocation failures
  • Check Security Groups if using Security Groups for Pods feature
  • Use kubectl exec + curl / nc to test connectivity directly

Model Answer

"I'd approach this as a layered debug. First I'd confirm whether it's DNS resolution or actual network connectivity — exec into the source pod and run nslookup service-name.namespace.svc.cluster.local. If DNS fails, the problem is CoreDNS or the service name, not networking."

"If DNS resolves but the connection fails, I'd check NetworkPolicy first — this is the most common cause of cross-namespace failures. A default-deny policy in the destination namespace, or an ingress policy missing a namespaceSelector for the source namespace, would silently block traffic. I'd kubectl get networkpolicy -n <target-ns> and read the selectors carefully."

"If there's no NetworkPolicy blocking it, I'd check the Service has active Endpointskubectl get endpoints <service> -n <ns>. No endpoints means the selector doesn't match any pods. Then I'd verify kube-proxy has programmed the iptables rules correctly on the relevant nodes. On EKS, I'd also check whether Security Groups for Pods is in use, which adds an extra layer of SG-based filtering that NetworkPolicy doesn't control."

"I'd also verify the pods are actually running and that aws-node allocated IPs correctly — check aws-node DaemonSet logs for any IP pool exhaustion warnings."

Interviewer Follow-ups

  • How does kube-proxy implement Service routing — iptables vs IPVS mode differences?
  • If you have 10,000 services, what's the scaling problem with iptables mode?
  • What's the difference between a Service ClusterIP, NodePort, and a pod IP in terms of routing?
Candidates often jump to NetworkPolicy immediately. The trap: NetworkPolicy only works if a CNI that enforces it is installed. On a fresh EKS cluster without Cilium/Calico or the aws-eks-nodeagent eBPF policy engine, NetworkPolicy objects exist but are NOT enforced — pods communicate freely regardless.
Q6 You're designing a multi-tenant EKS cluster. How do you achieve network isolation between tenants?
Design · Hard

Key Points

  • Namespace-level isolation: default-deny NetworkPolicy per namespace + explicit allow rules
  • Node-level isolation: dedicated node groups per tenant with taints/tolerations + nodeAffinity
  • Security Groups for Pods: assign different SGs to different tenant pods for AWS-level network enforcement
  • Hard multi-tenancy requires separate clusters — Kubernetes namespaces are soft boundaries
  • RBAC: namespace-scoped Roles, no ClusterRole access to other tenants
  • Resource quotas + LimitRanges per namespace to prevent noisy neighbor

Model Answer

"The first question I ask is whether we need soft or hard multi-tenancy. Soft tenancy — where tenants are internal teams that trust each other at some level — can work in one cluster with namespace isolation. Hard tenancy — where tenants are external customers with no trust relationship — requires separate clusters per tenant or very careful controls."

"For soft tenancy in EKS, I'd layer three controls. Network: default-deny ingress + egress NetworkPolicy in every tenant namespace, then explicit allow rules only for legitimate traffic paths. With aws-eks-nodeagent or Cilium, these are enforced at eBPF level. Compute: dedicated node groups per tenant with taints and tolerations so tenant A pods can't land on tenant B nodes — eliminates shared memory/CPU attack surface. IAM: namespace-scoped RBAC only, no ClusterRole, separate IRSA roles per tenant namespace."

"For full isolation, I'd add Security Groups for Pods — assigning per-tenant SGs at the ENI level. This enforces isolation at the AWS network layer, not just in kernel space, and you get CloudTrail logging of any cross-tenant traffic attempts. The downside is SG-for-Pods requires a specific VPC CNI configuration and can complicate your IP allocation."

Interviewer Follow-ups

  • A tenant namespace somehow got a pod with a privileged security context — how does that break your isolation model?
  • How do you prevent a tenant from exfiltrating data via DNS queries?
  • What's the trade-off between one large multi-tenant cluster vs many smaller single-tenant clusters?
03

Compute & Scaling

Q7 Compare Karpenter vs Cluster Autoscaler. When would you choose one over the other?
Tradeoff · Hard

Key Points

  • Cluster Autoscaler: scales existing ASGs up/down. Needs pre-configured node groups. Slow (checks every 10s, respects scale-down delay). AWS-agnostic.
  • Karpenter: bypasses ASG, calls EC2 Fleet API directly. Any instance type. Pod-aware. Consolidation built-in. Faster (seconds vs minutes).
  • Karpenter needs proper resource requests on all pods — otherwise it can't size nodes correctly
  • CAS is better for: existing investment in ASG tooling, non-EKS K8s, predictable homogeneous workloads
  • Karpenter is better for: mixed instance types, spot optimization, cost-critical, rapid scale-out
  • Never run both simultaneously on the same nodes — they conflict on scale-down decisions

Model Answer

"The fundamental architectural difference: Cluster Autoscaler works through Auto Scaling Groups — it scales up by increasing desired count on a pre-configured ASG, which means you must pre-define instance types, have separate ASGs per AZ per instance family, and handle mixed instance types manually. It's reactive, checking every 10–60 seconds, with a conservative scale-down mechanism that waits for an underutilization window."

"Karpenter bypasses ASGs entirely and calls the EC2 Fleet API directly. When a pod is unschedulable, Karpenter reads its resource requests and constraints, queries EC2 for all matching instance types with current spot and on-demand pricing, and launches the cheapest option that fits. This happens in seconds. It's also topology-aware — it considers the pod's AZ preference for EBS volumes."

"I'd choose Karpenter for any new EKS cluster where cost efficiency matters, especially for workloads with variable shape (some pods need GPU, some need memory, some are small). The consolidation feature — where Karpenter continuously right-sizes and terminates underutilized nodes — alone can save 20–40% on compute."

"I'd stick with Cluster Autoscaler if the team has deep existing tooling around ASGs, or if the cluster runs on non-AWS infrastructure. The critical thing with Karpenter: every pod needs explicit resource requests. Without them, Karpenter has no signal for sizing and you get unpredictable behavior."

Interviewer Follow-ups

  • How does Karpenter handle a pod that needs a specific instance type for GPU access?
  • Walk me through Karpenter's consolidation algorithm — when does it decide to terminate a node?
  • You have a StatefulSet with PVCs — how does Karpenter handle node replacement without losing data?
The consolidation trap: Karpenter can disrupt stateful workloads during consolidation if you don't have PodDisruptionBudgets. Staff-level answer: always pair Karpenter with PDBs on any critical workload, and use karpenter.sh/do-not-disrupt annotation on pods that absolutely cannot be evicted.
Q8 When would you use Fargate vs EC2 node groups, and what are the production gotchas with Fargate?
Tradeoff · Medium

Key Points

  • Fargate: serverless, per-pod isolation, zero node management, cold start latency (~30–60s)
  • No DaemonSets, no GPUs, no host networking, no privileged containers on Fargate
  • EBS not supported — only EFS for persistent storage
  • Fargate profiles define namespace + label selectors — pods matching profile land on Fargate
  • Each Fargate "node" is dedicated to one pod — no bin-packing, potentially expensive
  • Good for: batch jobs, CI runners, burst workloads, dev namespaces, high-isolation requirements

Model Answer

"Fargate's core value proposition is zero node management and pod-level isolation — AWS provisions a dedicated microVM per pod, handles OS patching, and you never think about node capacity. That sounds appealing, but the production constraints are significant."

"The hard limitations: no DaemonSets — any tooling that relies on DaemonSets (log forwarders, security agents, monitoring collectors) won't work on Fargate pods. You work around this with sidecars, but that's operationally heavier. No EBS — Fargate pods can only use EFS for persistent storage, which has higher latency and different cost characteristics. Cold start is 30–60 seconds, which makes Fargate a poor fit for latency-sensitive scale-out paths."

"I use Fargate for: batch and ETL jobs that spin up infrequently and need strong isolation, CI/CD runners where the cold start doesn't matter, and dev/staging namespaces where the team wants zero node management overhead. For production web services with fast scale-out requirements or any DaemonSet-dependent tooling, I stay on EC2 managed node groups or Karpenter."

Interviewer Follow-ups

  • How do you get logs out of Fargate pods if you can't run a DaemonSet?
  • What's the cost comparison between Fargate and equivalently-sized EC2?
  • How do Fargate nodes get upgraded after a cluster version bump?
Q9 How does the Kubernetes scheduler work? Explain predicates, priorities, and how taints/tolerations/affinity interact.
Conceptual · Hard

Key Points

  • Phase 1 — Filtering: eliminates nodes that can't run the pod (insufficient CPU/memory, taints not tolerated, node affinity mismatch, port conflicts, volume zone mismatch)
  • Phase 2 — Scoring: ranks remaining nodes on: least-requested resources, pod affinity/anti-affinity spread, image locality, topology spread constraints
  • Taints: mark a node to repel pods. Tolerations: allow a pod to be scheduled on a tainted node. Effect: NoSchedule, PreferNoSchedule, NoExecute (evicts existing pods)
  • Node affinity: attraction to nodes with specific labels (required vs preferred)
  • Pod affinity/anti-affinity: co-locate or spread based on other pods' labels
  • TopologySpreadConstraints: enforce pod spread across zones/nodes (preferred over anti-affinity for large clusters)

Model Answer

"The scheduler runs two phases per pod. Filtering reduces the candidate set to only nodes that can actually run the pod. This includes: does the node have enough CPU/memory requests available, do the pod's tolerations cover all node taints, does the node label match any required nodeAffinity, are the requested ports free, and critically for EKS — is the EBS volume in the same AZ as the node."

"Scoring then ranks the filtered nodes. The main scorer is LeastRequestedPriority — prefer nodes with more free capacity to spread load. PodAffinity scoring attracts pods to nodes where matching pods already run. SpreadConstraints scoring penalizes nodes in AZs that are already overloaded."

"Taints vs affinity is a common confusion point. Taints are on nodes — they push pods away unless the pod has a matching toleration. Affinity is on pods — they pull pods toward nodes. You'd taint a GPU node gpu=true:NoSchedule so only pods that tolerate it land there. You'd use nodeAffinity when you want pods to prefer certain nodes but don't want to block non-matching pods."

"For HA across zones, I prefer TopologySpreadConstraints over pod anti-affinity now — it's more expressive, can enforce a specific skew limit, and scales better. Anti-affinity requires N anti-affinity rules proportional to replica count, which gets expensive to evaluate in large clusters."

Interviewer Follow-ups

  • A pod is Pending with "0/5 nodes available: 5 node(s) had untolerated taint" — how do you fix it?
  • You have 3 replicas of a Deployment and 3 nodes — how do you guarantee one replica per node?
  • What's the risk of using requiredDuringSchedulingIgnoredDuringExecution for node affinity?
IgnoredDuringExecution is critical: if you use required affinity and then remove the node label, existing pods keep running but new ones can't schedule. For production critical workloads, use preferred affinity or ensure node labels are stable.
04

Security — IRSA, Pod Identity, RBAC, Secrets

Q10 Explain IRSA end-to-end. Why is it better than using a node IAM role?
Conceptual · Medium

Key Points

  • Node IAM role = all pods on that node share the same AWS permissions. Blast radius of a compromise is the entire node's IAM role.
  • IRSA = per-ServiceAccount IAM role. Least privilege at pod granularity.
  • Mechanism: OIDC provider per cluster → trust policy on IAM role → mutating webhook injects token + env vars → AWS SDK calls STS AssumeRoleWithWebIdentity → temp credentials scoped to that role
  • Token is a projected ServiceAccount token with audience and expiry — not a long-lived credential
  • Key limits: 100 OIDC providers per AWS account, STS rate limits at scale

Model Answer

"With a node IAM role, every pod on that node — regardless of what it does — can call any AWS API that role allows. If one pod is compromised, the attacker gets node-wide AWS access. IRSA solves this with pod-level least privilege: each Kubernetes ServiceAccount maps to a specific IAM role with exactly the permissions that service needs."

"The mechanism: when you associate an OIDC provider with your cluster, EKS acts as an identity provider. You create an IAM role whose trust policy says 'trust tokens issued by this specific OIDC provider for this specific namespace/service-account'. The EKS mutating webhook intercepts pod creation and, if the pod's ServiceAccount is annotated with a role ARN, injects AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE env vars plus a projected token mounted as a file."

"At runtime, the AWS SDK picks this up automatically via its credential chain — it reads the env vars, reads the projected token, calls STS AssumeRoleWithWebIdentity, and gets back temporary credentials (15 min by default). The pod never has a long-lived credential — everything is ephemeral. You can't accidentally commit an IRSA credential to git."

"The scale limits: 100 OIDC providers per AWS account, and STS has per-role throttling. For fleets over 100 clusters, or for clusters with thousands of pod starts per minute all using the same role, Pod Identity is better because it has a local caching proxy on each node."

Interviewer Follow-ups

  • How do you prevent one team's pod from using another team's ServiceAccount (cross-namespace IRSA abuse)?
  • The condition in the trust policy says StringEquals — what happens if you use StringLike with a wildcard?
  • How would you detect if a pod is using an overly permissive IAM role?
StringLike with a wildcard in the trust policy condition (e.g. "*:sub": "system:serviceaccount:*:my-sa") allows any namespace's service account named my-sa to assume the role — a privilege escalation vector. Always use StringEquals with the full namespace:serviceaccount path.
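A correctly scoped trust-policy condition, per the note above, looks like the following. The account ID, region, and OIDC provider ID are placeholders, as are the payments namespace and service account names:

```python
import json

# IRSA trust policy sketch with the safe StringEquals condition on the
# full namespace:serviceaccount sub claim. ACCOUNT_ID and OIDC_ID are
# hypothetical placeholders.

OIDC = "oidc.eks.us-east-1.amazonaws.com/id/OIDC_ID"

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Federated": f"arn:aws:iam::ACCOUNT_ID:oidc-provider/{OIDC}"},
        "Action": "sts:AssumeRoleWithWebIdentity",
        "Condition": {
            "StringEquals": {  # exact match — never StringLike with wildcards
                f"{OIDC}:sub": "system:serviceaccount:payments:payments-api",
                f"{OIDC}:aud": "sts.amazonaws.com",
            }
        },
    }],
}

print(json.dumps(trust_policy, indent=2))
```

Pinning both the sub claim (exact namespace and service account) and the aud claim closes the cross-namespace escalation path the note describes.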
Q11 How does RBAC work in Kubernetes? What's the difference between Role, ClusterRole, RoleBinding, ClusterRoleBinding?
Conceptual · Medium

Key Points

  • Role: namespace-scoped permissions (verbs on resources within one namespace)
  • ClusterRole: cluster-scoped permissions (non-namespaced resources like nodes, PVs, or cross-namespace)
  • RoleBinding: grants a Role or ClusterRole to a subject within a namespace
  • ClusterRoleBinding: grants a ClusterRole cluster-wide (affects all namespaces)
  • Subjects: User (IAM mapped), Group, ServiceAccount
  • RBAC is additive only — no deny rules. If multiple bindings apply, permissions union.
  • Aggregated ClusterRoles: ClusterRoles can aggregate by label selector — used by system roles

Model Answer

"RBAC in Kubernetes is built on four resources. Role defines what actions (verbs: get, list, watch, create, update, patch, delete) are allowed on which resources (pods, services, configmaps…) within a single namespace. ClusterRole is the same but cluster-scoped — used for non-namespaced resources like Nodes and PersistentVolumes, or when you need uniform permissions across all namespaces."

"RoleBinding attaches a Role (or a ClusterRole!) to a subject within a specific namespace. This is a subtle but important point: you can bind a ClusterRole with a RoleBinding, which scopes those permissions to one namespace. That's how you create reusable permission templates with ClusterRoles but grant them per-namespace. ClusterRoleBinding attaches a ClusterRole cluster-wide — granting access to all namespaces at once."

"Three things that trip people up: RBAC is purely additive — there are no deny rules. If a user has two bindings, they get the union of permissions. Second, ServiceAccounts are namespace-scoped subjects — a ServiceAccount in namespace A can't be bound to a Role in namespace B without a ClusterRoleBinding. Third, EKS maps IAM identities to K8s RBAC users/groups — the IAM ARN becomes a username, and you bind that username via RoleBinding."

Interviewer Follow-ups

  • How would you audit which IAM roles have cluster-admin privileges in a large EKS cluster?
  • A developer accidentally got system:masters — how do you remove it safely?
  • What's the risk of binding a ServiceAccount to a ClusterRole with wildcard resource access?
Q12 How would you securely inject secrets into a pod? Compare native K8s Secrets, Secrets Manager CSI, and External Secrets Operator.
Tradeoff · Hard

Key Points

  • Native K8s Secrets: base64 (not encrypted), stored in etcd, accessible by anyone with namespace read access. Fix: enable KMS envelope encryption, strict RBAC on secrets resource.
  • Secrets Store CSI: mounts secrets as files at pod startup directly from AWS Secrets Manager / SSM. Never written to etcd. Retrieval happens at pod start, so Secrets Manager reachability is a cold-start dependency.
  • External Secrets Operator: syncs AWS Secrets Manager / SSM → K8s Secret objects. Secrets live in etcd (so same risks) but source-of-truth is AWS. Easier to use in existing tooling.
  • Rotation: CSI and ESO both support rotation. Native K8s requires manual update + pod restart.
  • Audit trail: AWS Secrets Manager has CloudTrail. Native K8s secrets have K8s audit logs.

Model Answer

"Native Kubernetes Secrets are often misunderstood as 'secure by default' — they're not. They're base64-encoded, stored in etcd, and accessible to anyone with get secret RBAC permission in that namespace. The baseline fix is KMS envelope encryption for etcd and strict RBAC. But even with encryption at rest, the secret is still decrypted when read via the API."

"Secrets Store CSI Driver avoids etcd entirely — at pod startup, the CSI driver calls AWS Secrets Manager (using IRSA credentials from the pod's ServiceAccount) and mounts the secret directly into the pod's filesystem. The secret never touches etcd. The downside: the pod depends on Secrets Manager being reachable at startup, adding a cold-start dependency. Also, you need Secrets Manager access configured before the pod can start."

"External Secrets Operator syncs secrets from Secrets Manager into Kubernetes Secret objects on a schedule. It's easier to integrate with existing tooling that expects K8s secrets, but the secret does end up in etcd. The advantage over manual management: rotation in Secrets Manager automatically propagates to the K8s Secret, and from there you control whether pods see the update via projected volumes (immediate) or env vars (require restart)."

"My recommendation for greenfield EKS: CSI driver for high-sensitivity secrets (API keys, certs), ESO for config-level secrets where etcd exposure risk is acceptable, and always KMS encryption enabled regardless."

Interviewer Follow-ups

  • How do you handle secret rotation without restarting pods?
  • A developer checks in a K8s manifest with a hardcoded secret — what controls would catch this?
05

Storage — EBS, EFS, CSI, StatefulSets

Q13 A pod with an EBS-backed PVC is stuck in Pending — walk me through your debug process.
Scenario · Hard

Key Points

  • Check PVC status and events: kubectl describe pvc <name>
  • Check if PV was provisioned: kubectl get pv
  • Common cause 1: AZ mismatch — EBS volume created in wrong AZ before pod was scheduled. Fix: volumeBindingMode: WaitForFirstConsumer
  • Common cause 2: EBS CSI driver not installed or not running
  • Common cause 3: CSI driver lacks IAM permissions (IRSA not configured correctly)
  • Common cause 4: EBS quota exhaustion in the account/region
  • Check EBS CSI controller logs: kubectl logs -n kube-system -l app=ebs-csi-controller

Model Answer

"First: kubectl describe pvc <name> -n <ns> — the Events section tells you exactly what's failing. If it says 'waiting for first consumer to be created before binding', the StorageClass has WaitForFirstConsumer set, which is correct behavior — the PV won't be provisioned until the pod is scheduled. This usually means the pod itself is what's stuck."

"If the PVC shows 'ProvisioningFailed', the EBS CSI controller tried and failed to create the volume. I check EBS CSI controller pod logs: kubectl logs -n kube-system -l app=ebs-csi-controller -c csi-provisioner. Common errors: IAM permission denied (the IRSA role for the CSI driver is missing ec2:CreateVolume), EBS quota exceeded, or the requested AZ doesn't have the instance type available."

"If the StorageClass uses Immediate binding (legacy), the PV is created in whatever AZ the scheduler picks for provisioning — which may not be the AZ the pod ends up on. This causes 'Multi-Attach error for volume' or the pod stays Pending with 'node had no matching volume'. Fix: migrate the StorageClass to WaitForFirstConsumer and delete the stuck PVC/PV pair."

"I'd also check: is the EBS CSI driver installed at all? kubectl get pods -n kube-system | grep ebs-csi. On clusters before EKS 1.23, you had to install it explicitly — it wasn't included."

Interviewer Follow-ups

  • How do you migrate an EBS volume from one AZ to another without data loss?
  • A StatefulSet is being scaled down — what happens to the PVCs?
  • How do you take a snapshot backup of an EBS-backed PVC in Kubernetes?
Q14 When would you use EBS vs EFS, and what are the tradeoffs at production scale?
Tradeoff · Medium

Key Points

  • EBS: block storage, single AZ, RWO, low latency, good IOPS, AZ-bound (can't reschedule pod to different AZ)
  • EFS: NFS, multi-AZ, RWX, higher latency than EBS for random I/O, per-GB pricing (~$0.30/GB/mo), auto-scaling
  • EFS access points: per-namespace isolation, different UIDs/permissions per mount
  • EFS is the only persistent storage option on Fargate
  • EFS throughput modes: Bursting vs Elastic (recommended for variable workloads)
  • Cost trap: EFS charges for all stored data; EBS charges for provisioned capacity even if unused

Model Answer

"The decision comes down to access pattern and scheduling flexibility. EBS is block storage — it behaves like a fast local disk attached to one EC2 instance at a time. It's ideal for databases (Postgres, MySQL, Cassandra nodes), Kafka broker data, any workload where you need low-latency sequential or random I/O. The constraint is AZ binding — the volume lives in one AZ and so does the pod that uses it."

"EFS is NFS over the network — multiple pods across multiple nodes and AZs can mount the same filesystem simultaneously (ReadWriteMany). The use cases are: shared configuration files, ML training datasets accessed by multiple worker pods, CMS media uploads, WordPress shared content. The latency profile is worse than EBS for small random I/O — you're going across a network to a managed NFS service."

"At production scale, the main EFS traps are: throughput mode — Bursting mode gives you throughput proportional to stored data, so a small but heavily-read filesystem gets throttled. Switch to Elastic mode which auto-scales throughput. Second: EFS pricing — $0.30/GB/month on Standard storage. For a cluster serving ML model files (say 50GB per model × 20 models = 1TB), you're paying $300/month in storage alone. S3 + FUSE might be cheaper depending on access patterns."

Interviewer Follow-ups

  • You have a Deployment (not StatefulSet) that needs shared storage — what are the risks?
  • How do EFS access points help with multi-tenant scenarios?
06

Load Balancers & Ingress

Q15 Explain how the AWS Load Balancer Controller works and how it differs from the classic in-tree cloud provider LB.
ConceptualMedium

Key Points

  • Classic: controller in Kubernetes controller-manager creates Classic ELB for every type:LoadBalancer Service. Old API, limited features.
  • AWS LBC: out-of-tree controller (Deployment in your cluster), watches Services + Ingress. Provisions ALBs and NLBs with full feature sets.
  • target-type: ip vs instance: ip routes directly to pod IP (bypasses NodePort/kube-proxy), instance routes to node then kube-proxy hops to pod
  • IngressGroup: multiple Ingress resources share one ALB (cost savings, rule ordering)
  • AWS LBC requires IRSA — it calls EC2 and ELB APIs on your behalf
  • TargetGroupBinding CRD: attach existing TGs to pods outside of standard Ingress flow

Model Answer

"The classic in-tree cloud provider creates a Classic ELB (now a legacy product) whenever you create a Service of type LoadBalancer. It's baked into the controller-manager, hard to update independently, and doesn't support modern ALB/NLB features like host-based routing, WAF integration, or IP-mode targeting."

"The AWS Load Balancer Controller is an out-of-tree controller — a Deployment running in your cluster that watches Ingress and Service objects. It uses IRSA to call AWS APIs and provisions ALBs from Ingress resources, NLBs from Service type:LoadBalancer annotations. Because it runs in-cluster, you can upgrade it independently of the K8s version."

"The most important setting is target-type: ip. With instance mode, the LB targets the EC2 instance on a NodePort, and kube-proxy then forwards to the pod — adding latency and potentially a cross-AZ hop. With IP mode, the LB puts the pod's VPC IP directly into the Target Group and routes straight to it. This requires VPC CNI (pods must have VPC IPs), but eliminates the kube-proxy hop and preserves the source IP at the pod."

"The other key concept is IngressGroup — without it, each Ingress object creates a separate ALB at ~$18/month. With group.name annotation, multiple Ingress resources across namespaces share one ALB, and the LBC manages rule ordering via group.order annotation."

Interviewer Follow-ups

  • How do you handle blue-green or canary deployments at the ALB level with AWS LBC?
  • The AWS LBC can't reconcile an Ingress — how do you debug it?
  • How would you route traffic to services in different namespaces from a single ALB?
Q16 What is externalTrafficPolicy and why does setting it to Local matter for performance and cost?
ConceptualMedium

Key Points

  • Default (Cluster): LB sends traffic to any node, kube-proxy forwards to any pod regardless of node. Causes cross-node/cross-AZ hops. Source IP is SNATed.
  • Local: LB only targets nodes that actually run matching pods. No cross-node forward. Source IP preserved.
  • Cost: cross-AZ traffic in AWS = ~$0.01/GB each way. With Cluster policy on a high-traffic service this adds up.
  • Risk: if a node has no pods for this service but the LB sends to it, 503. Mitigate with readinessProbe + proper LB health checks.
  • Works best with topology-aware routing or ensuring pods are distributed across all nodes

Model Answer

"With the default externalTrafficPolicy: Cluster, the cloud load balancer targets every node in the cluster via NodePort. A request arriving at Node A might have its pod running on Node B — kube-proxy on Node A SNATs the packet and forwards it to Node B. This has two costs: latency from the extra hop and AWS cross-AZ data transfer charges (~$0.01/GB each way — significant at scale)."

"With externalTrafficPolicy: Local, the AWS Load Balancer Controller (or cloud provider) registers only the nodes that have a pod for that Service in the Target Group. Requests go directly to pods on the same node — no SNAT, no cross-node hop, and the original client source IP is preserved in the pod, which matters for rate limiting, geo-based routing, and audit logs."

"The risk: if a node has Local policy and the pod on it crashes before the health check catches it, the LB might still send requests there briefly, getting 503s. Mitigate with fast readiness probes and health check intervals. Also, Local means pod distribution matters more — if you have 10 nodes but pods only on 3, those 3 get all the traffic. Use TopologySpreadConstraints to ensure good distribution."

Interviewer Follow-ups

  • How does topology-aware routing in Kubernetes complement externalTrafficPolicy: Local?
  • You have a DaemonSet-backed Service — does externalTrafficPolicy: Local make sense here?
07

Observability & Upgrades

Q17 Your EKS cluster is experiencing intermittent pod evictions. How do you investigate and fix this?
ScenarioHard

Key Points

  • Check pod events: kubectl describe pod — eviction message tells you the reason
  • Causes: node memory pressure (kubelet evicts low-priority pods), node disk pressure (ephemeral storage limit), OOMKilled (container exceeded memory limit)
  • Check node conditions: kubectl describe node | grep Conditions -A10
  • Memory eviction order: BestEffort (no requests) → Burstable (requests < limits) → Guaranteed (requests = limits)
  • Fix: set proper resource requests (Guaranteed QoS class), tune kubelet eviction thresholds, right-size nodes
  • Karpenter consolidation can also cause "voluntary evictions" — check if consolidation is too aggressive

Model Answer

"I start with kubectl describe pod <evicted-pod> — the Events section tells me the exact eviction message. Common messages: 'The node was low on resource: memory' means kubelet-initiated eviction due to node memory pressure. 'OOMKilled' means the container itself exceeded its memory limit and was killed by the kernel."

"For node pressure evictions, I check node conditions: kubectl describe node <node> | grep -A8 Conditions. MemoryPressure, DiskPressure, PIDPressure indicate which resource is constrained. I'd then check node memory allocationkubectl top node plus the allocatable vs capacity breakdown."

"The fix depends on root cause. If pods have no resource requests (BestEffort QoS), they're evicted first by kubelet under pressure — set explicit resource requests to at minimum BestEffort → Burstable → Guaranteed. If the node is genuinely under-provisioned, either scale out (more nodes or larger instances) or tune kubelet eviction thresholds via the node group's kubelet config."

"If the evictions are happening during low-traffic periods, check if Karpenter consolidation is the cause — it drains underutilized nodes, which looks like eviction. Add PodDisruptionBudgets to ensure minimum availability is maintained during consolidation."

Interviewer Follow-ups

  • What's the difference between QoS classes — BestEffort, Burstable, Guaranteed — and how does kubelet prioritize eviction?
  • How do you prevent a critical pod from being evicted while allowing less critical ones to be evicted normally?
PriorityClass is the answer to preventing critical pod eviction — assign a high custom PriorityClass so kubelet evicts lower-priority pods first (the built-in system-cluster-critical and system-node-critical classes are reserved for kube-system workloads). Combined with Guaranteed QoS (requests == limits), these pods are essentially eviction-immune under normal circumstances.
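A sketch of a custom PriorityClass for this purpose — the name, value, and description are illustrative:

```yaml
# Hypothetical high-priority class for business-critical workloads.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: business-critical        # illustrative name
value: 100000                    # higher value = preempted/evicted later
globalDefault: false
description: "Critical serving path; evict these pods last."
```

Pods opt in via `priorityClassName: business-critical` in their spec; the scheduler will also preempt lower-priority pods to place them when a node is full.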
Q18 How would you approach upgrading an EKS cluster with zero downtime for production workloads?
DesignHard

Key Points

  • Pre-flight: scan deprecated APIs (kube-no-trouble / kubent), check EKS Insights, validate all Helm charts support target version
  • Control plane upgrade: one minor version at a time, API server stays behind NLB during rolling replace — zero downtime
  • Add-ons upgrade: must be done manually after control plane, each addon deploys rolling
  • Data plane: managed node group rolling replace (cordon → drain → new AMI node → old node terminates). Need PDBs set.
  • PodDisruptionBudgets: critical to ensure rolling node replacement doesn't take a service to zero replicas
  • Validate at each step before proceeding to next

Model Answer

"Zero-downtime upgrade has three phases, and the order matters. Pre-upgrade: I run kubectl-no-trouble to scan for deprecated API usage — any workloads using removed APIs (like PSP in 1.25) need to be migrated first. I check EKS Cluster Insights for compatibility warnings. I verify all Helm chart versions support the target K8s version, and I confirm PodDisruptionBudgets exist for all critical Deployments and StatefulSets."

"Control plane upgrade is low-risk from a downtime perspective — AWS does a rolling replace of API server instances behind the NLB. Existing pods keep running. API calls may see brief increased latency during the transition but the endpoint stays available. I upgrade one minor version at a time."

"Data plane is where downtime risk lives. For managed node groups, EKS cordons a node, drains it (respecting PDBs — if draining would violate a PDB, it waits), launches a new node with the updated AMI, waits for it to be Ready, then terminates the old node. The key: PDBs must be set correctly. A Deployment with 3 replicas and no PDB could have all 3 nodes drained simultaneously. With minAvailable: 2, only 1 can be drained at a time."

"I also upgrade add-ons between the control plane and data plane steps. Running old vpc-cni with new K8s can cause ENI/IP allocation issues."

Interviewer Follow-ups

  • A PDB is blocking node drain during upgrade — how do you handle it without data loss?
  • You have a StatefulSet with persistent data — what's your strategy for draining those nodes?
  • When would you choose a blue-green cluster upgrade over in-place?
08

Kubernetes Core — Scheduling, Controllers, etcd

Q19 What is the controller pattern in Kubernetes? Explain the reconcile loop and how it handles failures.
ConceptualHard

Key Points

  • Controllers implement the desired state → actual state reconciliation loop
  • Watch mechanism: list-watch on etcd for resource changes, triggers reconcile
  • Reconcile function is idempotent — safe to call multiple times, same result
  • On failure: exponential backoff retry, work queue with rate limiting
  • Controller uses informers with local cache — reduces etcd load
  • Eventual consistency: the system may temporarily be in non-desired state but converges
  • Custom controllers via controller-runtime / kubebuilder — same pattern

Model Answer

"The controller pattern is the core of how Kubernetes works. Every controller watches for a desired state expressed in Kubernetes objects, observes the actual state of the world, and takes actions to close the gap. The reconcile loop is the heart of it: observe current state, compare to desired state, take the minimum action needed, return. It's designed to be called repeatedly."

"The implementation uses informers — not raw watch calls to etcd. An informer maintains a local in-memory cache of the resources it cares about, refreshed by a list-watch stream. When a resource changes, the event is put into a work queue. The reconcile function pops from the queue, reads from the local cache (not etcd — this reduces load dramatically), and acts."

"The critical design property is idempotency: the reconcile function must produce the same result whether called once or ten times. This is because in a distributed system, the controller might crash mid-reconciliation, or receive duplicate events. Kubernetes controllers are designed so re-running reconcile on an already-reconciled object is safe."

"Failure handling: on error, the item is re-queued with exponential backoff. The work queue has rate limiting to prevent thundering herd when many resources need reconciling simultaneously. This is why after a cluster comes back up from an outage, you see a wave of reconciliation activity that settles down over minutes, not a hard spike."

Interviewer Follow-ups

  • You're writing a custom controller — how do you test that the reconcile function is idempotent?
  • What happens if two controllers both try to reconcile the same resource simultaneously?
  • How do resource versions and optimistic locking prevent stale writes in controllers?
Resource versions (resourceVersion field) implement optimistic concurrency control. If you read an object at version 42 and someone else updates it to 43 before you write, your write fails with a conflict error. Controllers handle this by re-queuing for reconcile — they'll re-read the latest version and retry. This is why controllers never store object state locally between reconcile calls.
Q20 Design a highly available, multi-AZ EKS cluster for a payment processing service. Walk through every layer.
DesignHard

Key Points

  • Network: 3 private subnets (one per AZ), /22 each, VPC CNI with prefix delegation
  • Compute: Managed node groups spread across 3 AZs, or Karpenter with multi-AZ NodePool. At least 2 nodes per AZ for HA.
  • Workload HA: TopologySpreadConstraints across AZs + PodDisruptionBudgets (minAvailable ≥ 1 per AZ)
  • Storage: StatefulSets with WaitForFirstConsumer StorageClass, per-AZ node groups if EBS used
  • Ingress: ALB (multi-AZ by default) with target-type:ip, externalTrafficPolicy:Local
  • Security: IRSA for payment-service pods, KMS for secrets, NetworkPolicy default-deny, SecurityContext non-root
  • Observability: Prometheus + AlertManager, CloudWatch control plane logs, distributed tracing (OTEL)

Model Answer

"For a payment processor I'd structure this in layers. Network foundation: one VPC per cluster, three private subnets (one per AZ) sized at /22 (~1000 IPs each) with prefix delegation enabled — this gives ~16,000 pod IPs per subnet. Public subnets in each AZ hold only load balancers. No public node IPs."

"Compute: I'd use Karpenter with a NodePool constrained to on-demand only (no spot for payments — spot interruption = transaction disruption). Disruption policy set to WhenEmpty only — no voluntary consolidation during business hours. TopologySpreadConstraints on all Deployments enforce at least one pod per AZ with maxSkew:1."

"Data layer: if using EBS (e.g. for a local cache), WaitForFirstConsumer StorageClass so volumes are provisioned in the pod's AZ. PodDisruptionBudgets with minAvailable = ceil(replicas * 0.75) so upgrades can't take the service below 75% capacity. For the payment DB, I'd likely use Aurora Multi-AZ outside the cluster, accessed by pods via IRSA-authenticated connection."

"Security: IRSA per service component (payment-api gets only the specific DynamoDB tables it needs), KMS-encrypted secrets, NetworkPolicy default-deny in the payment namespace with explicit allow for ingress from the ALB target and egress to the DB endpoint. Pod SecurityContext: runAsNonRoot, readOnlyRootFilesystem, drop ALL capabilities."

"Ingress: multi-AZ ALB with WAF attached, target-type:ip, externalTrafficPolicy:Local, HTTPS only, TLS cert managed via ACM. Health checks on /healthz with 30s interval and 2 unhealthy threshold."

Interviewer Follow-ups

  • One AZ becomes completely unavailable — what breaks and how quickly does your system recover?
  • How do you handle database migrations without downtime for the payment service?
  • Your compliance team requires all API calls to AWS services to be logged — how do you implement this?
Q21 What happens during a rolling update of a Deployment? How do maxSurge and maxUnavailable interact?
ConceptualMedium

Key Points

  • maxUnavailable: max pods that can be unavailable during update. Default 25%. Set to 0 for zero-downtime (but slower).
  • maxSurge: max extra pods above desired count. Default 25%. Allows creating new pods before killing old ones.
  • With maxUnavailable=0, maxSurge=1: creates 1 new pod, waits for it to be Ready, deletes 1 old pod — safest, slowest
  • With maxUnavailable=1, maxSurge=0: kills 1 old pod, creates 1 new pod — saves capacity but briefly reduces availability
  • Deployment controller watches ReplicaSets — creates new RS for each rollout revision
  • kubectl rollout undo scales up previous RS

Model Answer

"A rolling update creates a new ReplicaSet for the updated pod template. The Deployment controller then scales the new RS up and the old RS down simultaneously, respecting maxSurge and maxUnavailable constraints."

"maxUnavailable=0, maxSurge=1 is the zero-downtime configuration: before killing any old pod, the controller ensures a new pod is Ready. It creates 1 new pod (surge), waits for its readiness probe to pass, then terminates 1 old pod. This repeats until all replicas are updated. Slowest but safest — no user traffic hits the new version until it's proven Ready."

"Readiness probes are the critical dependency here. If your new pod version never passes its readiness probe (e.g., the new code has a startup bug), the rolling update halts — it won't kill old pods because it can't bring new ones to Ready. This is actually good: the old version keeps serving traffic. You'd see the Deployment stuck at partial rollout."

"rollout history: each update creates a new RS kept as a revision. kubectl rollout undo scales up the previous RS. The number of retained revisions is controlled by revisionHistoryLimit (default 10)."

Interviewer Follow-ups

  • How does Kubernetes ensure traffic isn't sent to pods that are starting up?
  • What's the difference between a readiness probe failure and a liveness probe failure?
Q22 Explain how Kubernetes handles pod termination gracefully. What is SIGTERM, terminationGracePeriodSeconds, and preStop hooks?
ConceptualMedium

Key Points

  • On pod deletion: kubelet sends SIGTERM to PID 1 of each container, starts grace period timer (default 30s)
  • preStop hook runs before SIGTERM — useful for deregistering from service discovery or draining connections
  • After grace period, kubelet sends SIGKILL (forceful kill)
  • Race condition: kube-proxy iptables update and endpoint removal are async — pod may still receive traffic after SIGTERM. preStop sleep(5) is the common workaround.
  • terminationGracePeriodSeconds should be > time for app to finish in-flight requests

Model Answer

"Pod termination is a multi-step process with a critical race condition that causes dropped connections in most clusters. When a pod is deleted: the API server marks it as Terminating, which triggers two concurrent paths — kubelet starts the termination sequence, and endpoints controller removes the pod from Service endpoints, which kube-proxy uses to update iptables rules."

"The kubelet path: if a preStop hook is defined, it runs first. Then SIGTERM is sent to all containers. If the process doesn't exit within terminationGracePeriodSeconds, SIGKILL is sent."

"The race condition: iptables rule removal is asynchronous and may lag behind SIGTERM by 1–10 seconds. During that window, new requests can still arrive at the pod while it's trying to shut down. The standard workaround is a preStop hook with sleep 5 — this delays SIGTERM by 5 seconds, giving iptables time to propagate the endpoint removal before the app starts refusing connections."

"For applications that have long-running requests (gRPC streams, websockets), set terminationGracePeriodSeconds to the maximum expected request duration + the sleep buffer. A payment processing service might need 60–90 seconds to drain."

Interviewer Follow-ups

  • How do you verify a pod is handling SIGTERM correctly in a staging environment?
  • What happens if your app ignores SIGTERM and you've set terminationGracePeriodSeconds: 3600?
The iptables race condition is one of the most common causes of dropped connections during deployments that senior candidates miss. The preStop sleep is a pragmatic but imperfect fix — the proper solution is also to handle SIGTERM gracefully in the app, stop accepting new connections, and drain existing ones.
Q23 What is a CRD and Operator pattern? When would you build a custom operator vs use an existing one?
ConceptualHard

Key Points

  • CRD: extends the Kubernetes API with custom resource types. Stored in etcd, managed like native objects.
  • Operator: a controller that watches CRDs and encodes domain-specific operational knowledge (Day-2 operations)
  • Operator Capability Levels: Basic Install → Seamless Upgrades → Full Lifecycle → Deep Insights → Auto Pilot
  • Build custom when: proprietary operational logic, existing operators don't exist, tight integration with internal systems
  • Use existing when: common infra (Postgres, Kafka, Prometheus), mature operators available (Strimzi, CloudNativePG)
  • Tools: kubebuilder, controller-runtime, Operator SDK

Model Answer

"A CRD lets you extend Kubernetes with domain-specific resource types. Once you define a CRD, the API server handles storage, validation (via OpenAPI schema), RBAC, and watch — you get all of that for free. Your custom object lives in etcd alongside native Kubernetes objects."

"An Operator is a controller that watches those CRDs and implements operational domain knowledge — the stuff a human operator would do. For a database operator: creating a cluster, handling node failures, taking backups, performing rolling upgrades. The key is encoding runbooks as code."

"Build vs buy: for common infrastructure — Postgres, Kafka, Prometheus, Redis — I'd always evaluate existing operators first. Strimzi for Kafka, CloudNativePG for Postgres, and the Prometheus Operator are battle-tested. Building your own is weeks of work with edge cases you haven't thought of yet."

"I'd build custom when: the operational logic is proprietary to our business domain (e.g., an operator for our internal ML training job lifecycle), when existing operators have architectural constraints that conflict with our requirements, or when we need deep integration with internal tooling. The kubebuilder framework makes this approachable — it scaffolds the controller, generates CRD YAML, and handles the watch/cache/queue plumbing."

Interviewer Follow-ups

  • What happens to existing CRD instances if you delete the CRD itself?
  • How do you version a CRD API safely with conversion webhooks?
Q24 How does CoreDNS work in Kubernetes and what are common DNS failure modes in EKS?
ConceptualMedium

Key Points

  • CoreDNS runs as a Deployment (2 replicas by default) in kube-system. Service IP is the cluster DNS.
  • Pod /etc/resolv.conf points to CoreDNS ClusterIP. Search domains: default.svc.cluster.local, svc.cluster.local, cluster.local
  • Short names resolve via search domain chain: svc-name → tries svc-name.namespace.svc.cluster.local first
  • DNS 5-second timeout: caused by a conntrack race condition in the Linux kernel's UDP NAT path. Mitigations: force TCP DNS (glibc use-vc), run a node-local DNS cache, or cut query volume with ndots:2
  • EKS NodeLocal DNSCache: DaemonSet running a DNS cache on each node — eliminates the conntrack issue
  • CoreDNS is an EKS managed addon — upgrade separately

Model Answer

"CoreDNS is Kubernetes' cluster DNS service — a Deployment (typically 2 replicas for HA) running in kube-system. Its ClusterIP is configured in every pod's /etc/resolv.conf as the nameserver. Each pod's resolv.conf also has search domain suffixes, so a bare hostname like my-svc gets tried as my-svc.my-namespace.svc.cluster.local first, then falling back through the domain chain."

"The most impactful DNS issue in large EKS clusters is the 5-second DNS timeout. It's caused by a Linux kernel race condition in conntrack when multiple threads from the same pod make concurrent DNS queries — SNAT and packet processing can drop one of them, causing a 5-second retry. The symptoms: intermittent 5-second latency spikes on first connections, usually masked by connection pooling but visible in tail latencies."

"NodeLocal DNSCache is the proper EKS fix: a DaemonSet that runs a DNS cache on every node and intercepts DNS traffic via a link-local address before it reaches CoreDNS. This eliminates the conntrack issue because DNS queries no longer traverse iptables NAT for most lookups. It also reduces load on CoreDNS significantly in large clusters."

"Other CoreDNS failures: CoreDNS pods OOMKilled under load (increase memory limits), CoreDNS pods scheduled on a single node that becomes unavailable (use topologySpreadConstraints on the CoreDNS Deployment), and DNS cache poisoning (validate CoreDNS is not forwarding to untrusted resolvers)."

Interviewer Follow-ups

  • How do you configure CoreDNS to forward internal domain queries to an on-premises resolver?
  • A pod can reach its service by ClusterIP but not by name — what's wrong?
Q25 How would you implement comprehensive observability for an EKS cluster — what metrics, logs, and traces matter most?
DesignHard

Key Points

  • Metrics: USE (Utilization, Saturation, Errors) for infrastructure; RED (Rate, Errors, Duration) for services
  • kube-state-metrics: pod/deployment/node states as Prometheus metrics
  • node-exporter: host-level CPU/memory/disk/network
  • Control plane logs (API audit, auth) → CloudWatch; critical for security and debugging RBAC issues
  • Application logs: Fluent Bit DaemonSet → CloudWatch Logs or S3/OpenSearch
  • Distributed tracing: OpenTelemetry (ADOT) → AWS X-Ray or Jaeger
  • Critical alerts: node NotReady, pod crashlooping, PVC Pending, API server error rate, etcd leader changes

Model Answer

"I structure observability around three pillars with clear signal priorities. Metrics: I use kube-prometheus-stack (Prometheus + Grafana) for cluster metrics. kube-state-metrics translates Kubernetes object states into Prometheus metrics — pod phases, deployment rollout status, PVC binding state. node-exporter covers host-level resources. For the control plane, EKS doesn't expose Prometheus metrics directly, so I rely on CloudWatch Container Insights for control plane health."

"Critical alerts I always set up: node in NotReady for >2 minutes (hardware/network issue), pod CrashLoopBackOff (app crash), PersistentVolumeClaim stuck Pending (storage provisioning failure), kube-proxy DaemonSet not fully available, and CoreDNS error rate spike. These are the signals that wake me up at 3am."

"Logs: Fluent Bit DaemonSet (the ADOT Fluent Bit variant ships with Container Insights) forwards container logs to CloudWatch Logs with pod metadata enrichment. Control plane logs — especially API audit and auth logs — must be enabled explicitly in EKS and are critical for security investigations and RBAC debugging."

"Traces: for a Staff-level concern — I use OpenTelemetry (ADOT Collector as a DaemonSet or sidecar) to collect traces from services and forward to X-Ray or Jaeger. The key is propagating trace context across service boundaries, especially across Ingress → service → database. Without traces, you can see that latency increased but not which service in the call chain is responsible."

Interviewer Follow-ups

  • Prometheus storage is filling up on your cluster. How do you handle long-term metrics retention at scale?
  • How do you correlate a spike in application error rate with a Kubernetes event (e.g., node replacement)?
Q26 A pod has been compromised — walk through your incident response process for an EKS cluster.
ScenarioHard

Key Points

  • Contain: apply NetworkPolicy to isolate the pod immediately, revoke IRSA role temp creds (can't revoke but can add deny policy)
  • Preserve evidence: don't delete the pod yet. Copy /proc, memory dump if needed. Check CloudTrail for API calls made by the pod's IAM role.
  • Assess blast radius: what IAM permissions did the pod have? What K8s secrets were in that namespace? What services was it talking to?
  • Remediate: rotate all secrets/tokens the pod could access, revoke and recreate IRSA role, patch the vulnerability, redeploy clean version
  • Detect: GuardDuty EKS Protection for runtime threat detection, CloudTrail for AWS API calls

Model Answer

"Immediate priority is containment without losing evidence. I'd immediately apply a NetworkPolicy to the compromised pod's labels — deny all ingress and egress except to a forensics endpoint. This isolates it from lateral movement while keeping it running for evidence collection. I'd label the pod with a quarantine label so it's visually identifiable."

"In parallel, I'd check CloudTrail for any AWS API calls made by the pod's IAM role in the last hour. If the pod had an IRSA role with S3 read access, did it exfiltrate data? I can't revoke the temp credentials directly, but I can attach an explicit Deny policy to the IAM role immediately, which overrides all other permissions even for in-flight temp credentials."

"Blast radius assessment: what K8s Secrets were mounted in that namespace? What ServiceAccounts could the compromised process impersonate? If the pod had node-level access (host PID, host network), the blast radius is the entire node and all pods on it — in that case I'd cordon and drain the node."

"For remediation: rotate all credentials the pod could have accessed (DB passwords, API keys in Secrets, IRSA role rotated by creating a new role), patch the container image, redeploy. Then conduct a post-mortem: how did this happen, what GuardDuty rule would have caught it earlier, is Pod Security Admission configured to prevent privileged containers?"

Interviewer Follow-ups

  • How does GuardDuty EKS Protection work and what threats does it detect?
  • If the attacker escaped to the node (container breakout), how do you detect and respond?
Q27 How do resource requests and limits work in Kubernetes, and what's the relationship to QoS classes and OOMKill behavior?
ConceptualMedium

Key Points

  • Request: what the scheduler uses for node selection. The container is guaranteed this amount. Node allocatable = sum of all requests.
  • Limit: ceiling. CPU limit = throttled if exceeded (not killed). Memory limit = OOMKilled if exceeded.
  • Guaranteed: requests == limits for all containers. Never evicted unless exceeding limits.
  • Burstable: at least one container has requests != limits. Evicted after BestEffort under pressure.
  • BestEffort: no requests or limits set. Evicted first.
  • CPU is compressible (throttled), memory is incompressible (process killed)

Model Answer

"Requests and limits serve different purposes. Requests are scheduling hints — the scheduler sums all pod requests per node to determine available capacity. A node with 8 vCPU and all requests totaling 7 vCPU will not schedule another 2-vCPU request pod, even if actual usage is only 3 vCPU. This is intentional — requests guarantee resource availability."

"Limits are enforcement ceilings. CPU limits are implemented with Linux cgroups cfs_quota — if a container tries to use more than its CPU limit, the scheduler throttles it without killing it. Memory limits are different: if a container exceeds its memory limit, the Linux OOM killer kills the process. The container restarts (if restartPolicy allows), which shows up as CrashLoopBackOff with OOMKilled reason."

"The QoS class determines eviction priority under node pressure. Guaranteed (requests == limits for ALL containers) means kubelet won't evict the pod under memory pressure unless it's actually over limit. Burstable gets evicted next. BestEffort (no requests/limits at all) gets evicted first."

"My recommendation: set requests == limits for memory on critical workloads (Guaranteed QoS, protects against eviction), but keep CPU limit higher than request or remove it entirely — CPU throttling degrades latency subtly and is hard to detect. For non-critical workloads, set memory requests conservatively with limits 2x higher."
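
This recommendation can be sketched as a pod spec (names and values hypothetical) — memory request equals limit, CPU limit omitted:

```yaml
# Hypothetical pod illustrating the pattern above: memory is pinned
# (request == limit, so usage above 1Gi is OOMKilled, but the pod is
# never evicted for someone else's pressure while under its request),
# while CPU has no limit, so the container can burst without CFS
# throttling.
apiVersion: v1
kind: Pod
metadata:
  name: api-server            # hypothetical name
spec:
  containers:
    - name: app
      image: myorg/api:1.0    # hypothetical image
      resources:
        requests:
          cpu: 500m           # scheduling hint only
          memory: 1Gi
        limits:
          memory: 1Gi         # == request; exceeding this is OOMKilled
          # no cpu limit: burstable CPU, no throttling
```

Note that omitting the CPU limit makes this pod's QoS class Burstable, not Guaranteed — for strict Guaranteed QoS you would also set a CPU limit equal to the request, accepting the throttling tradeoff.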

Interviewer Follow-ups

  • Your Java app is getting OOMKilled but heap usage looks fine — what might be happening?
  • How does CPU throttling manifest in application latency, and how do you detect it?
Java apps get OOMKilled because the JVM doesn't account for native memory, metaspace, and thread stacks — the container limit is for total process memory, not just heap. Set -XX:MaxRAMPercentage=75 (not -Xmx) so JVM auto-tunes heap to 75% of container limit, leaving headroom for native memory.
Q28 What is a sidecar container? When is it the right pattern vs an init container vs a separate service?
Conceptual · Medium

Key Points

  • Sidecar: runs alongside main container for lifetime of pod. Shares the pod's network namespace and volumes (and the PID namespace only if shareProcessNamespace is set). Used for: logging agents, proxies (Envoy), IRSA token refresh, metrics exporters.
  • Init container: runs to completion before main container starts. Used for: DB migrations, wait-for-dependency, config generation, secret fetching.
  • K8s 1.29+: sidecar containers (restartPolicy: Always in initContainers) that start before main and stay running — formal sidecar pattern
  • Separate service: when the functionality needs independent scaling, separate RBAC, or serves multiple workloads

Model Answer

"A sidecar extends or enhances a main container's capabilities by running in the same pod, sharing network namespace and volumes. The canonical examples are: an Envoy proxy sidecar for service mesh traffic control, a Fluent Bit sidecar for log forwarding (especially on Fargate where DaemonSets don't work), and the IRSA token refresh sidecar in older patterns."

"Sidecar vs init container: init containers run sequentially before the main container and must complete successfully — perfect for wait-for-dependency patterns (wait until Postgres is ready), one-time config generation, or DB schema migrations. Sidecars run for the lifetime of the pod alongside the main container."

"In Kubernetes 1.29+, there's now a formal sidecar container type — an initContainer with restartPolicy:Always. It starts before the main container, keeps running, and importantly, it terminates after the main container during pod shutdown. This solves the classic problem where a log forwarder sidecar would die before draining the log buffer because both containers get SIGTERM simultaneously."

"Separate service is better when: the functionality needs independent scaling (log aggregation serving 100 app pods doesn't need to scale 1:1 with apps), it needs its own RBAC or IAM permissions isolated from the app, or it serves multiple different workloads."
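
The 1.29+ native sidecar described above can be sketched as follows (image tags and names are placeholders):

```yaml
# K8s 1.29+ native sidecar sketch: an initContainer with
# restartPolicy: Always starts before the main container, keeps
# running for the pod's lifetime, and is terminated AFTER the main
# container on shutdown — so the log forwarder can drain its buffer.
apiVersion: v1
kind: Pod
metadata:
  name: app-with-log-sidecar      # hypothetical name
spec:
  initContainers:
    - name: log-forwarder
      image: fluent/fluent-bit:2.2   # hypothetical tag
      restartPolicy: Always          # this field marks it as a sidecar
      volumeMounts:
        - name: logs
          mountPath: /var/log/app
  containers:
    - name: app
      image: myorg/app:1.0           # hypothetical image
      volumeMounts:
        - name: logs
          mountPath: /var/log/app
  volumes:
    - name: logs
      emptyDir: {}
```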

Interviewer Follow-ups

  • How does adding a sidecar affect pod scheduling and resource requests?
  • The Envoy sidecar in your service mesh is adding 20ms of latency — how do you profile this?
Q29 How would you implement blue-green or canary deployments on EKS without changing application code?
Design · Hard

Key Points

  • Blue-green at Service level: two Deployments (blue/green), switch Service selector. Instant cutover, requires 2x capacity.
  • Canary at ALB level: AWS LBC supports weighted target groups via a custom forward-action annotation (e.g. 90/10 split). No code change.
  • Argo Rollouts: Kubernetes-native progressive delivery. CanaryStep, analysis runs, automatic rollback on metric degradation.
  • Flagger: works with Istio/Linkerd/NGINX for automatic traffic shifting based on Prometheus metrics
  • Header-based routing for targeted canary (A/B test a specific user cohort)

Model Answer

"At the ALB level, the AWS Load Balancer Controller supports weighted target groups. You deploy a v2 Deployment alongside v1, create a separate Service, and in the Ingress you define a custom forward action (an alb.ingress.kubernetes.io/actions.<name> annotation) pointing to both target groups with weights — e.g., 90% to v1, 10% to v2. This requires no application change and shifts traffic at the LB level."

"For full progressive delivery with automatic rollback, Argo Rollouts replaces the standard Deployment and adds traffic shifting logic. You define canary steps: pause for analysis, shift 10% → wait → shift 25% → wait → promote. The analysis template queries Prometheus — if error rate exceeds a threshold, Rollouts automatically rolls back. No human involvement needed."

"Header-based canary is useful for targeted testing: route requests with X-Canary: true header to the new version, all others to the old version. The ALB can route on request headers. This lets you send your QA team or specific user cohort to the new version while production users stay on stable."

"For true blue-green (instant full cutover): run two identical Deployments, change the Service selector label from version: blue to version: green. The cutover is atomic — Service selector update is a single etcd write, and kube-proxy propagates it within seconds."
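
The ALB-level weighted split can be sketched as an Ingress (annotation shape per AWS Load Balancer Controller conventions; service names are hypothetical — verify against your controller version):

```yaml
# 90/10 canary via an AWS LBC custom forward action. The backend
# service name must match the action name, and the port name
# "use-annotation" tells the controller to resolve the annotation.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app
  annotations:
    alb.ingress.kubernetes.io/actions.weighted-canary: >
      {"type":"forward","forwardConfig":{"targetGroups":[
        {"serviceName":"app-v1","servicePort":"80","weight":90},
        {"serviceName":"app-v2","servicePort":"80","weight":10}]}}
spec:
  ingressClassName: alb
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: weighted-canary    # matches the action name
                port:
                  name: use-annotation   # resolved via the annotation
```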

Interviewer Follow-ups

  • How do you handle database schema migrations in a blue-green deployment?
  • What metrics would you use as canary analysis signals for a payment API?
Q30 What is a StatefulSet and how does it differ from a Deployment? When would you use each?
Conceptual · Medium

Key Points

  • StatefulSet gives pods: stable network identity (pod-0, pod-1…), stable persistent storage (PVC per replica, not deleted on scale-down), ordered scaling/rolling updates
  • Deployment: pods are cattle (ephemeral, interchangeable). StatefulSet: pods have identity (pets).
  • StatefulSet headless Service: pods get DNS entries like pod-0.svc-name.namespace.svc.cluster.local
  • Use StatefulSet for: databases, Kafka brokers, ZooKeeper, Elasticsearch nodes, any app that needs stable identity
  • Scale-down: PVCs are NOT deleted — must be cleaned up manually
  • Rolling update: ordered, highest ordinal first (pod-2 before pod-1 before pod-0). Can be paused mid-rollout via the partition field.

Model Answer

"The core distinction is pod identity. Deployment pods are interchangeable — if pod-abc crashes, a new pod-xyz replaces it, same spec, different name. StatefulSet pods have stable identities: pod-0, pod-1, pod-2 — if pod-1 crashes, the replacement is also called pod-1 and remounts the same PVC."

"This stable identity enables three things Deployments can't provide: stable network identity via a headless Service (pod-0.db.namespace.svc.cluster.local resolves specifically to pod-0 — critical for Kafka brokers advertising their address to clients), stable persistent storage (each replica gets its own PVC from volumeClaimTemplates, which persists across pod restarts and is not deleted on scale-down), and ordered operations (pod-1 won't start until pod-0 is Ready, rolling updates go in reverse order — pod-2 updates before pod-1 before pod-0)."

"I use StatefulSet for any application where the instance matters: database nodes, Kafka/Pulsar brokers, ZooKeeper, Redis cluster members, Elasticsearch data nodes. I use Deployment for stateless applications where any pod can serve any request."

"Important EKS-specific nuance: StatefulSet PVCs are not deleted when you scale down. If you scale from 3 to 2 replicas, pod-2 is deleted but data-pod-2 PVC remains, consuming EBS costs. This is intentional (protects data) but requires manual cleanup when decommissioning."
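
A minimal StatefulSet sketch tying these pieces together (names and sizes hypothetical) — the headless Service provides the per-pod DNS, and volumeClaimTemplates give each ordinal its own PVC:

```yaml
# Headless Service: clusterIP None means db-0.db.<ns>.svc resolves
# directly to pod db-0 rather than load-balancing across replicas.
apiVersion: v1
kind: Service
metadata:
  name: db
spec:
  clusterIP: None
  selector:
    app: db
  ports:
    - port: 5432
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db
spec:
  serviceName: db          # links pods to the headless Service for DNS
  replicas: 3
  selector:
    matchLabels:
      app: db
  template:
    metadata:
      labels:
        app: db
    spec:
      containers:
        - name: postgres
          image: postgres:16       # hypothetical version
          volumeMounts:
            - name: db-data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:    # one PVC per ordinal (db-data-db-0, db-data-db-1, …)
    - metadata:            # PVCs survive pod restarts AND scale-down
        name: db-data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 20Gi
```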

Interviewer Follow-ups

  • How do you perform a StatefulSet rolling update that updates pods in batches with a manual promotion gate?
  • A StatefulSet pod is stuck in Pending after a node failure — PVC is in a different AZ — how do you recover?
Q31 What is a DaemonSet and how does it interact with taints and node affinity?
Conceptual · Medium

Key Points

  • DaemonSet: exactly one pod per node (or subset). Used for: log forwarders, monitoring agents, network plugins, security scanners.
  • Since K8s 1.12, DaemonSet pods are placed by the default scheduler — the DaemonSet controller creates one pod per target node with a pinned node affinity (earlier versions bypassed the scheduler entirely).
  • By default DaemonSets tolerate: node.kubernetes.io/not-ready, node.kubernetes.io/unreachable, node.kubernetes.io/disk-pressure — so they run on degraded nodes.
  • To limit DaemonSet to subset of nodes: nodeSelector or nodeAffinity on the DaemonSet spec.
  • DaemonSet does NOT run on Fargate nodes.

Model Answer

"DaemonSets ensure exactly one pod per node (or per matching node subset). The DaemonSet controller creates each pod with a node affinity pinned to its target node, plus tolerations for node-condition taints — which is why DaemonSet pods appear on nodes that are in NotReady state, or that have memory pressure. Critical infrastructure like aws-node (VPC CNI), kube-proxy, and log forwarders use DaemonSets because they must run everywhere."

"By default, DaemonSet pods tolerate most node-condition taints — the controller adds these tolerations automatically, which is intentional so that node-level agents keep running even when a node is degraded. If you need to keep a DaemonSet off specific nodes, use custom taints without matching tolerations, or node selectors."

"To run a DaemonSet on a subset of nodes: add nodeSelector or nodeAffinity to the DaemonSet's pod spec. For example, run the GPU monitoring DaemonSet only on nodes with label gpu=true. Karpenter is aware of DaemonSet overhead — when sizing a new node, it accounts for DaemonSet pod resource requests as overhead."
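
The node-subset pattern can be sketched like this (labels, taint key, and image are hypothetical):

```yaml
# DaemonSet scoped to a node subset: nodeSelector restricts it to
# nodes labeled gpu=true, and the toleration lets it land on nodes
# that are tainted to repel ordinary workloads.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gpu-monitor
spec:
  selector:
    matchLabels:
      app: gpu-monitor
  template:
    metadata:
      labels:
        app: gpu-monitor
    spec:
      nodeSelector:
        gpu: "true"                # only nodes carrying this label
      tolerations:
        - key: nvidia.com/gpu      # hypothetical taint on GPU nodes
          operator: Exists
          effect: NoSchedule
      containers:
        - name: exporter
          image: myorg/gpu-exporter:1.0   # hypothetical image
```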

Interviewer Follow-ups

  • How do you do a rolling update of a DaemonSet with minimal disruption?
  • A DaemonSet pod is not running on a specific new node — how do you debug?
Q32 How do you troubleshoot a node that's NotReady in EKS?
Scenario · Hard

Key Points

  • Check node conditions: kubectl describe node → Conditions section. KubeletReady = false?
  • SSH to node (via SSM Session Manager — no bastion needed on EKS). Check systemctl status kubelet.
  • Common causes: kubelet crashed (check journalctl -u kubelet -n 100), disk full on root volume, docker/containerd OOM, network issue preventing kubelet from reaching API server, certificate rotation failure
  • aws-node not running → pods can't get IPs → node appears degraded
  • For managed node groups: cordon the node, drain it, terminate the EC2 instance — ASG will replace it

Model Answer

"First: kubectl describe node <node> — the Conditions section tells me whether it's MemoryPressure, DiskPressure, PIDPressure, or plain NotReady (kubelet lost contact with API server). The Events section shows recent node-level events."

"I then SSH via AWS Systems Manager Session Manager (no bastion needed if SSM agent is running on the node). Check kubelet: systemctl status kubelet and journalctl -u kubelet -n 200 --no-pager. Common logs: 'failed to get cgroup stats' (kernel version issue), 'certificate has expired' (cert rotation failure), 'PLEG is not healthy' (container runtime unresponsive)."

"If the disk is full (df -h), check for large log files, stuck containers with big log outputs, or image layer accumulation. On EKS, docker system prune (or containerd equivalent) clears unused images. Increase the root EBS volume in the launch template if this is recurring."

"For managed node groups, the operational response is: cordon, drain (respecting PDBs), terminate the EC2 instance, and let the ASG replace it with a fresh node. Investigating the root cause is parallel — don't hold up workload recovery."

Interviewer Follow-ups

  • How do you prevent disk pressure from occurring in the first place?
  • Node is NotReady but pods on it are still running — is that possible?
Yes — pods keep running when a node goes NotReady: containers are managed locally by the kubelet and container runtime, and don't depend on API server connectivity. They'll only be evicted after the default tolerationSeconds (5 minutes for the NotReady/Unreachable taints). This is why you see 5-minute delays before pods reschedule after node failure.
Q33 What is a Horizontal Pod Autoscaler? How does it work and what are its limitations?
Conceptual · Medium

Key Points

  • HPA scales replica count based on metrics from metrics-server (CPU, memory) or custom metrics (KEDA, Prometheus Adapter)
  • Formula: desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric)
  • Stabilization: scale-up applies immediately by default (0s stabilization window), scale-down is slow (5-min window) — prevents thrashing
  • Limitation: HPA can't help workloads pinned to a single RWO EBS volume — replicas can't share the volume, so such workloads don't scale horizontally
  • KEDA extends HPA with event-driven scaling (SQS queue depth, Kafka lag, custom metrics)
  • VPA: vertical scaling (change CPU/memory requests) — can't run with HPA on same resource

Model Answer

"HPA adjusts the replica count of a Deployment or StatefulSet based on observed metrics. It queries the metrics-server every 15 seconds, applies the formula ceil(current × observed/target), and updates the replicas field. The built-in metrics are CPU and memory utilization relative to requests — which is why resource requests must be set for HPA to work."

"The stabilization window prevents thrashing: scale-up decisions apply immediately by default (the scale-up stabilization window is 0 seconds). Scale-down is conservative (5-minute window — don't scale down until the metric has been below target for 5 min). You can tune these with behavior.scaleDown.stabilizationWindowSeconds."

"KEDA (Kubernetes Event-Driven Autoscaling) extends HPA for external triggers — SQS queue depth, Kafka consumer lag, custom Prometheus metrics, even cron schedules. It creates HPA objects under the hood but with external metrics. This is the right pattern for batch/queue-based workloads where CPU utilization is a lagging indicator of load."

"Key limitation: HPA can scale replicas out but not individual pod resource sizes — that's VPA (Vertical Pod Autoscaler). VPA and HPA can conflict if both are targeting CPU metrics on the same resource. Use HPA for throughput-oriented scaling, VPA for right-sizing single-replica workloads."
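
Putting the pieces together, an autoscaling/v2 HPA with tuned scale-down behavior might look like this (deployment name and thresholds hypothetical):

```yaml
# HPA sketch: targets 70% CPU utilization relative to the pods'
# CPU requests (which is why requests must be set), and slows
# scale-down to avoid thrashing.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api                       # hypothetical Deployment
  minReplicas: 3
  maxReplicas: 30
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70    # relative to the CPU request
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # wait 5 min below target
      policies:
        - type: Percent
          value: 50                 # remove at most half the replicas
          periodSeconds: 60         # per minute
```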

Interviewer Follow-ups

  • HPA is scaling up but new pods are Pending — why, and how do you fix?
  • What's the problem with scaling based on CPU for a queue-processing service?
Q34 What is a Service Mesh and when does it make sense to add Istio or Linkerd to an EKS cluster?
Tradeoff · Hard

Key Points

  • Service mesh adds: mTLS between services, traffic management (retries, timeouts, circuit breaking), observability (traces, metrics per service-pair), policy enforcement
  • Implemented via sidecar proxies (Envoy/Linkerd-proxy) injected by mutating webhook
  • Overhead: 2–10ms latency per hop, significant memory per sidecar (~50–100MB), operational complexity
  • Justified when: compliance requires mTLS, complex traffic routing needed, cross-service observability is critical
  • Alternative to full Istio: Cilium eBPF-based service mesh (no sidecars, lower overhead)

Model Answer

"A service mesh intercepts all service-to-service traffic via sidecar proxies injected into each pod. This gives you: automatic mTLS between all services (zero-trust networking without any application code change), fine-grained traffic management (retry policies, circuit breakers, canary traffic splitting at L7), and deep observability — traces and metrics for every service-to-service call, even across language boundaries."

"The cost is real: latency overhead (2–10ms per hop for Istio Envoy sidecars, lower for Linkerd), memory overhead (each Envoy sidecar uses 50–150MB — at 100 pods, that's 5–15GB extra memory), and significant operational complexity. Debugging traffic issues through a service mesh is much harder than native Kubernetes networking."

"I'd recommend a service mesh when: compliance mandates mTLS in-cluster (PCI-DSS, HIPAA), you need sophisticated traffic management across microservices without code changes, or you need per-service-pair traffic metrics that Prometheus doesn't provide without custom instrumentation."

"For many teams, Cilium's eBPF-based service mesh is a better fit — no sidecars, lower overhead, and it integrates with the CNI. For mTLS specifically, SPIRE/SPIFFE for workload identity might be a lighter option than deploying full Istio."

Interviewer Follow-ups

  • How would you migrate an existing EKS cluster to Istio with zero downtime?
  • What's the difference between Istio's virtual services and K8s HTTPRoute (Gateway API)?
Q35 What is Pod Security Admission and how does it replace PodSecurityPolicy?
Conceptual · Medium

Key Points

  • PSP was removed in K8s 1.25 — any upgrade past 1.24 must migrate away
  • PSA (Pod Security Admission) is built-in, no CRD. Labels on namespace define policy: pod-security.kubernetes.io/enforce=restricted
  • Three profiles: privileged (no restrictions), baseline (prevents known privilege escalations), restricted (hardened, follows security best practices)
  • Three modes: enforce (reject), audit (allow + log), warn (allow + user warning)
  • Migration: start with audit mode to find violations, then enforce

Model Answer

"PSP was a cluster-scoped resource that defined what security constraints pods must meet. The problem: its RBAC model was confusing (a pod could use a PSP if the pod's ServiceAccount could 'use' it — this led to many accidental privilege escalation bugs). It was deprecated in 1.21 and removed in 1.25."

"Pod Security Admission replaces it with a simpler model: you label a namespace with a policy level, and the built-in admission controller enforces it. Three levels: privileged (no restrictions, for system namespaces), baseline (blocks hostPID, hostNetwork, privileged containers, dangerous capabilities), restricted (requires non-root, drops all capabilities, requires seccompProfile)."

"Three modes per level: enforce (reject the pod), audit (allow but log an audit event), warn (allow but show a warning to the user). The migration strategy: add pod-security.kubernetes.io/audit=restricted labels to all namespaces first, check audit logs for violations, fix the workloads, then switch to enforce mode."

"For EKS, namespaces like kube-system must stay privileged (system components need elevated privileges). Application namespaces should target at least baseline, ideally restricted for production workloads."
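
The audit-first migration is just namespace labels (namespace name hypothetical) — enforce baseline today while surfacing restricted-level violations:

```yaml
# PSA labels: pods violating baseline are rejected now; pods that
# would violate restricted are still admitted but generate audit
# events and kubectl warnings, so you can fix workloads before
# flipping enforce to restricted.
apiVersion: v1
kind: Namespace
metadata:
  name: payments            # hypothetical namespace
  labels:
    pod-security.kubernetes.io/enforce: baseline
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
```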

Interviewer Follow-ups

  • A third-party Helm chart requires privileged containers — how do you handle this in a restricted namespace?
  • How does OPA/Gatekeeper complement or replace PSA?
Q36 How do admission webhooks work, and what are the failure mode risks?
Conceptual · Hard

Key Points

  • Mutating: runs first, can modify the object (e.g., inject sidecar, add labels, set defaults)
  • Validating: runs after mutating, can only accept/reject (e.g., OPA/Gatekeeper policy enforcement)
  • failurePolicy: Fail — if the webhook is unreachable, the request is rejected. Ignore — allow through.
  • Webhooks add latency to every API call they match — timeout is configurable (default 10s)
  • If the webhook service is down with Fail policy: no new pods can be scheduled (cluster emergency)
  • Best practice: scope webhooks with namespaceSelector to exclude kube-system, monitor webhook latency

Model Answer

"Admission webhooks intercept API server requests after authentication/authorization but before persisting to etcd. Mutating webhooks run first — they can modify the object being submitted (inject a sidecar container, add an annotation, set a default resource limit). Validating webhooks run after all mutating webhooks — they can only accept or reject, not modify. OPA/Gatekeeper uses validating webhooks."

"The critical operational risk is failurePolicy: Fail. If your webhook service (e.g., Gatekeeper) becomes unavailable and it's configured with Fail policy, every API request that matches the webhook rule is rejected. In the worst case, no new pods can be scheduled and no Deployments can be updated — effectively a cluster incident. I've seen Istio injection webhooks with Fail policy take down a cluster when the Istiod service crashed."

"Best practices: scope webhooks with namespaceSelector to exclude critical system namespaces (kube-system, kube-public) so the cluster can always self-heal. Set reasonable timeouts (2–5s) and have runbooks for emergency disabling. Monitor webhook latency — a slow webhook adds that latency to every kubectl apply."
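
Those best practices map onto the webhook configuration itself. A sketch (webhook name, service, and path are hypothetical):

```yaml
# ValidatingWebhookConfiguration fragment showing the risk controls:
# a short timeout instead of the 10s default, and a namespaceSelector
# that exempts system namespaces so the cluster can always self-heal.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: policy-check
webhooks:
  - name: policy.example.com        # hypothetical webhook name
    failurePolicy: Fail             # unreachable webhook => requests rejected
    timeoutSeconds: 3               # fail fast
    namespaceSelector:
      matchExpressions:
        - key: kubernetes.io/metadata.name
          operator: NotIn
          values: ["kube-system", "kube-public"]
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["pods"]
    clientConfig:
      service:
        name: policy-webhook        # hypothetical Service
        namespace: policy-system
        path: /validate
    admissionReviewVersions: ["v1"]
    sideEffects: None
```

In an emergency, deleting or patching this object (or flipping failurePolicy to Ignore) is what unblocks pod creation — which is why the runbook mentioned above matters.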

Interviewer Follow-ups

  • You need to emergency-disable a validating webhook that's blocking all pod creation — how?
  • How do you test a mutating webhook locally before deploying to production?
Q37 Explain the Gateway API and how it improves on Ingress in Kubernetes.
Conceptual · Medium

Key Points

  • Ingress API is limited: single resource, implementation-specific via annotations, no TCP/UDP routing, no traffic splitting standard
  • Gateway API separates concerns: GatewayClass (infra team defines LB type), Gateway (cluster ops deploys LB), HTTPRoute/TCPRoute (app teams configure routing)
  • Multi-namespace: HTTPRoutes in app namespace can attach to a Gateway in infra namespace
  • Standard traffic splitting, header matching, redirects — no annotations needed
  • AWS LBC supports Gateway API via GatewayClass

Model Answer

"The Ingress API has been stretched beyond its original design. Every feature beyond basic path routing requires implementation-specific annotations — ALB annotations for auth, NLB annotations for TCP, NGINX annotations for rate limiting. There's no standard way to express traffic splitting or header-based routing."

"Gateway API introduces a role-based model with three resource types. GatewayClass is cluster-scoped, controlled by infra teams — it defines the controller (e.g., aws-load-balancer-controller). Gateway is created by cluster operators and provisions the actual load balancer with listener configuration. HTTPRoute is created by app teams and attaches to a Gateway — defining routing rules with standardized syntax for path, header, weight-based traffic splitting."

"The key improvements for platform teams: role separation — app teams write HTTPRoutes without needing cluster-admin, and they can't accidentally misconfigure the shared Gateway. Standardized traffic splitting — canary deployments are first-class, expressed as weight: 90/10 in the HTTPRoute spec without custom annotations. AWS LBC supports Gateway API in newer versions."
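
The role split and first-class traffic weights look like this in an HTTPRoute (Gateway, namespace, and service names hypothetical; cross-namespace attachment must be allowed by the Gateway's listener config):

```yaml
# HTTPRoute owned by the app team, attaching to a shared Gateway
# owned by the infra team. The 90/10 canary split is expressed as
# standard backendRef weights — no controller-specific annotations.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: app
  namespace: app-team           # app team's namespace
spec:
  parentRefs:
    - name: shared-gateway      # Gateway in the infra namespace
      namespace: infra
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /
      backendRefs:
        - name: app-v1
          port: 80
          weight: 90
        - name: app-v2
          port: 80
          weight: 10
```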

Interviewer Follow-ups

  • How do you migrate from Ingress to Gateway API without downtime?
Q38 What is the CSI driver architecture? Explain how dynamic volume provisioning works end-to-end.
Conceptual · Hard

Key Points

  • CSI = Container Storage Interface — standardized plugin API, replaces in-tree volume drivers
  • Controller component (Deployment): handles CreateVolume, DeleteVolume, ControllerPublishVolume, ControllerUnpublishVolume (attach/detach) via the cloud API
  • Node component (DaemonSet): handles NodeStageVolume, NodePublishVolume (mount into pod namespace)
  • Dynamic provisioning flow: PVC created → external-provisioner sidecar watches → calls CSI CreateVolume → PV created → PVC bound → pod scheduled → node CSI mounts volume
  • external-provisioner, external-attacher, external-snapshotter: standard sidecars from kubernetes-csi

Model Answer

"CSI standardizes the interface between Kubernetes and storage vendors. Before CSI, each storage provider had in-tree code in the Kubernetes codebase — a security and release coupling problem. CSI moves this to out-of-tree plugins that run as pods."

"The architecture has two components. The controller plugin is a Deployment that runs on any node and handles cloud API calls: CreateVolume (provision an EBS volume), DeleteVolume, ControllerPublishVolume (attach to an EC2 instance). It runs with IRSA credentials that have ec2 permissions."

"The node plugin is a DaemonSet on every worker node that handles the local mount operations: NodeStageVolume (format and mount to a staging path on the host), NodePublishVolume (bind-mount from staging into the pod's directory). It runs privileged because it needs to create mounts in the host's mount namespace."

"Dynamic provisioning flow: a PVC is created with a StorageClass. The external-provisioner sidecar inside the controller Deployment watches for unbound PVCs, calls CreateVolume on the CSI driver, and creates a PV object. The PVC binds to the PV. When the pod is scheduled to a node, the external-attacher calls ControllerPublishVolume to attach the EBS volume to that EC2 instance, then the node plugin mounts it into the pod."
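
The trigger for that flow is just a StorageClass plus a PVC (names and sizes hypothetical):

```yaml
# StorageClass naming the EBS CSI driver. WaitForFirstConsumer delays
# CreateVolume until the pod is scheduled, so the EBS volume is
# provisioned in the same AZ as the node.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer
parameters:
  type: gp3
---
# Creating this PVC is what the external-provisioner sidecar watches
# for; it calls CreateVolume on the driver and creates the PV.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data                # hypothetical name
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: gp3
  resources:
    requests:
      storage: 50Gi
```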

Interviewer Follow-ups

  • Volume attach is stuck — how do you debug it?
  • How does volume snapshot work with CSI?
Q39 How do you run cost-optimized spot instances on EKS without risking production stability?
Design · Hard

Key Points

  • Spot instances can be reclaimed with 2-minute warning. SIGTERM sent to pods via node termination handler.
  • AWS Node Termination Handler (or Karpenter's built-in): watches EC2 spot interruption notices, cordons + drains node gracefully before termination
  • Strategies: spot for stateless/batch, on-demand for stateful. Use karpenter.sh/capacity-type labels to separate.
  • Spread across multiple instance families + AZs to reduce simultaneous interruptions
  • PDBs required — spot interruption can remove multiple nodes simultaneously in same family/AZ
  • Karpenter's spot-to-on-demand fallback: if spot unavailable, automatically provision on-demand

Model Answer

"The 2-minute spot interruption notice is the key constraint. AWS sends an EC2 instance metadata event, and the AWS Node Termination Handler (or Karpenter natively) picks this up, cordons the node immediately, and drains it — evicting pods gracefully before the instance terminates. This gives pods up to 2 minutes to handle SIGTERM and shut down."

"My tiering strategy: stateless workloads on spot (web services, API servers, workers) with multiple replicas spread across instance families (m5, m5a, m5n, m4 — if m5 spot capacity dries up, m5a picks up the slack). Stateful and critical workloads on on-demand. With Karpenter, this is a NodePool label: karpenter.sh/capacity-type: spot for batch pools, on-demand for critical pools."

"AZ and instance family diversification is critical. A mass reclaim event in us-east-1a m5 family would take down all your spot nodes if they're homogeneous. Diversify: require at least 3 different instance families, 3 AZs. Karpenter automatically selects across families based on pricing."

"PDBs are non-negotiable for spot workloads. A spot mass reclaim can hit multiple nodes in the same minute. Without PDBs, your 3-replica service could go to 0 during a reclaim wave."
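
The diversified spot pool can be sketched as a Karpenter NodePool (shape per the karpenter.sh v1 API; the nodeClassRef name and zone list are hypothetical — adjust to your EC2NodeClass and region):

```yaml
# Karpenter NodePool for batch workloads: spot-only capacity,
# diversified across instance families and AZs to reduce the blast
# radius of a mass reclaim in one family/AZ.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: batch-spot
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["m5", "m5a", "m6i"]     # diversify families
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["us-east-1a", "us-east-1b", "us-east-1c"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default                      # hypothetical EC2NodeClass
```

Adding "on-demand" to the capacity-type values is what enables the spot-to-on-demand fallback mentioned above.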

Interviewer Follow-ups

  • How do you handle long-running spot jobs (ML training) that can't complete within 2 minutes of interruption?
  • What's the cost saving percentage you'd expect from spot in a typical workload mix?
Q40 You need to migrate a stateful workload from one EKS cluster to another with minimal downtime. How do you approach this?
Design · Hard

Key Points

  • Snapshot EBS/EFS data, restore in destination cluster (Volume Snapshot API + VolumeSnapshotContent)
  • Run workload in both clusters temporarily (active-passive or dual-active depending on consistency requirements)
  • Use Route 53 weighted routing for gradual traffic migration
  • Tools: Velero for backup/restore of both K8s resources and volumes
  • Data consistency: quiesce writes before final snapshot if strong consistency required
  • Verify: run smoke tests on new cluster before cutting over DNS

Model Answer

"Migrating a stateful workload involves three parallel concerns: Kubernetes resource migration, data migration, and traffic cutover. I'd approach them as a pipeline."

"Kubernetes resources: use Velero to back up all resources in the namespace from the source cluster and restore them to the destination. Velero handles PV/PVC metadata, ConfigMaps, Secrets, ServiceAccounts, and IRSA annotations."

"Data migration: take a CSI VolumeSnapshot of each PVC in the source cluster. Use the VolumeSnapshotContent object's snapshot handle to create a new PVC from snapshot in the destination cluster. For large volumes, this runs in parallel with resource setup. For databases, I'd quiesce writes (or use application-level replication) before the final snapshot to avoid consistency issues."

"Traffic cutover: I stand up the workload in the destination cluster in passive mode (workload running, no external traffic). Run smoke tests against the new cluster's internal ALB. Then use Route 53 weighted routing to gradually shift traffic: 10% to new cluster, verify metrics (error rate, latency) for 30 min, shift 50%, verify, then cut to 100%. Keep the old cluster alive for 30 minutes as a rollback option."

"The hardest part is maintaining data consistency during the traffic transition window if the workload has writes. For databases, application-level replication (Postgres logical replication) to the new cluster until cutover is the cleanest approach."
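
The snapshot/restore step can be sketched with the CSI snapshot API (class and PVC names hypothetical; for cross-cluster use, the EBS snapshot must first be surfaced in the destination cluster via a pre-provisioned VolumeSnapshotContent):

```yaml
# Source cluster: snapshot the PVC.
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: db-data-snap
spec:
  volumeSnapshotClassName: ebs-csi    # hypothetical snapshot class
  source:
    persistentVolumeClaimName: db-data
---
# Destination cluster: create a PVC from the snapshot via dataSource
# (assumes the snapshot is registered there as a VolumeSnapshot of
# the same name).
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: db-data
spec:
  storageClassName: gp3               # hypothetical class
  dataSource:
    name: db-data-snap
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 50Gi
```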

Interviewer Follow-ups

  • How does Velero handle cross-region migration of EBS snapshots?
  • What's your rollback plan if the new cluster has issues at 50% traffic?