EKS is AWS's managed Kubernetes — AWS owns the control plane (etcd, API server, scheduler, controllers) spread across 3+ AZs. You own the data plane (nodes, pods, workloads). This split is the foundation of every EKS tradeoff.
AWS manages: etcd (3–5 nodes, quorum-managed) · API Server (multi-instance behind NLB) · Controller Manager · Scheduler · Control plane VPC + cross-account ENIs · Automatic HA across 3 AZs · Patching, backups, scaling · OIDC endpoint · CloudWatch log forwarding
You manage: Worker nodes (EC2 or Fargate) · OS patches & AMI updates · Kubernetes workloads, Namespaces, RBAC, CRDs · Pod config, resource requests · Security Groups, IAM roles · Node group scaling policies · Add-on version upgrades · VPC/subnet design
| Component | Role | EKS Behavior |
|---|---|---|
| etcd | Distributed key-value store — all cluster state | 3–5 nodes, AWS manages quorum, automatic backups, multi-AZ |
| API Server | All kubectl / SDK calls entry point | Multiple instances behind NLB, auto-scales under load |
| Controller Manager | Reconcile loops (ReplicaSets, Deployments…) | AWS managed; cloud controller runs separately in modern EKS |
| Scheduler | Assigns pods to nodes (resources/taints/affinity) | AWS managed, auto-replaced on failure |
| OIDC Endpoint | Issues tokens for IAM↔K8s identity mapping (IRSA) | Auto-created per cluster; associate via eksctl |
AWS provisions Cross-Account ENIs (X-ENIs) inside your VPC subnets. Traffic between your nodes and the AWS-managed API Server flows through these ENIs — no public internet required in private endpoint mode.
AWS Console: click-through creation. Good for learning. Not reproducible. Avoid for production.
eksctl: official CLI. Fastest path to a cluster. Generates CloudFormation under the hood. Best for learning/labs.
Terraform: HCL + official EKS Blueprints module. Remote state in S3+DynamoDB. Best for production IaC.
AWS CDK: TypeScript/Python code → CloudFormation or direct API. Good for platform teams with existing code pipelines.
Always configure remote backend (S3 + DynamoDB locking) before creating EKS with Terraform. Local state + team collaboration = corruption.
If you create via Console then try to manage with eksctl/Terraform, you'll have state drift. Pick one tool per cluster and stick to it.
| Tool | Purpose | Install |
|---|---|---|
| kubectl | K8s control — deployments, pods, services, logs | brew install kubectl |
| eksctl | EKS-specific cluster/nodegroup/addon management | brew install eksctl |
| aws CLI v2 | AWS API — EKS, IAM, EC2, ECR etc. | brew install awscli |
| helm | Package manager for Kubernetes charts | brew install helm |
| eksdemo | Demo/lab tool — VPC/subnet/ENI inspection | brew tap aws/tap && brew install eksdemo |
| kubectx / kubens | Fast context and namespace switching | brew install kubectx |
| kubent (kube-no-trouble) | Scan for deprecated API usage before upgrades | brew install kubent |
EKS networking uses AWS VPC CNI by default — each pod gets a real VPC IP from your subnet via secondary ENI IPs. No overlay, lowest latency, but IP exhaustion is a real production risk you must plan for upfront.
| CNI | Pod IP Source | Overlay | Latency | Use Case |
|---|---|---|---|---|
| AWS VPC CNI | VPC subnet IPs | No | Lowest | Default EKS, direct VPC routing, no overhead |
| Cilium | Custom CIDR | Yes (Geneve/eBPF) | Variable | Advanced L7 policies, observability, service mesh |
| Calico | Custom CIDR | Yes (IP-in-IP) | Low overhead | NetworkPolicy-heavy, BGP routing |
| Flannel | Custom CIDR | Yes (VXLAN) | Medium | Simple dev clusters |
Without prefix delegation, each secondary ENI slot holds 1 IP → 1 pod. With prefix delegation, each slot holds a /28 block = 16 IPs. Same instance, dramatically more pods.
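The capacity difference is easy to quantify. A back-of-envelope sketch in bash, using the 4-ENI × 15-slot figures this guide quotes for an m5.xlarge (kubelet's actual max-pods setting also reserves a few IPs for the node itself):

```shell
# Pod-IP capacity per node under the VPC CNI (figures from this guide)
ENIS=4
SLOTS=15   # secondary IP slots per ENI on an m5.xlarge

echo "standard mode:     $((ENIS * SLOTS)) pod IPs"        # 1 IP per slot -> 60
echo "prefix delegation: $((ENIS * SLOTS * 16)) pod IPs"   # each slot holds a /28 (16 IPs) -> 960
```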
WARM_ENI_TARGET: keep N spare ENIs attached. Default: 1. Higher = faster pod starts, more idle IPs consumed.
WARM_IP_TARGET: keep N spare IPs ready. More granular than the ENI target. Combine with MINIMUM_IP_TARGET for small nodes.
WARM_PREFIX_TARGET: for prefix delegation, keep N spare /28 prefixes attached. Default: 1. Enables instant pod scheduling.
Enable Prefix Delegation

```bash
# Enable on existing cluster
kubectl set env ds aws-node -n kube-system \
  ENABLE_PREFIX_DELEGATION=true \
  WARM_PREFIX_TARGET=1

# Verify — sort pods by IP to see /28 blocks
kubectl get pods -o wide --sort-by='.status.podIP'
# You'll see blocks like 192.168.114.16–31 on the same node
```
By default all pods can reach all other pods (no isolation). NetworkPolicy resources add L3/L4 traffic rules. Use aws-eks-nodeagent (eBPF) or Cilium/Calico for enforcement.
deny-all ingress default

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
spec:
  podSelector: {}   # all pods
  policyTypes:
    - Ingress       # no rules = deny all ingress
```

allow frontend → backend

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend
spec:
  podSelector:
    matchLabels:
      app: backend
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - port: 8080
```
Enable VPC CNI enforcement: kubectl set env ds aws-node -n kube-system ENABLE_NETWORK_POLICY_CONTROLLER=true. Or use Cilium for L7 policies + deep observability.

Public subnets: nodes with public IPs, IGW route. For load balancers only. Tag: kubernetes.io/role/elb=1
Private subnets: worker nodes (recommended). NAT Gateway for egress. Tag: kubernetes.io/role/internal-elb=1
Use /16 VPC per cluster. /22 subnets minimum per AZ. A /24 (250 IPs) fills up with prefix delegation enabled on just a few nodes.
With AWS VPC CNI, each pod consumes a real VPC IP. A single m5.xlarge can consume 60 IPs (4 ENIs × 15 IPs). On a /24 subnet = 4 nodes max. Plan /22 or /21 subnets per AZ.
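To see why /22 or larger matters, the usable-IP count per prefix can be computed directly (AWS reserves 5 addresses in every subnet):

```shell
# Usable IPs per subnet size: 2^(32 - prefix) minus the 5 AWS-reserved addresses
for prefix in 24 22 21; do
  echo "/${prefix}: $(( (1 << (32 - prefix)) - 5 )) usable IPs"
done
```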
Running multiple clusters in one VPC multiplies IP consumption. Isolate clusters in dedicated VPCs unless you have a specific reason to share.
Three compute models: Managed Node Groups (EC2 ASG, you own the OS), Fargate (serverless pods), and Karpenter (dynamic EC2 provisioner, workload-driven). The right choice depends on workload type, cost target, and ops burden tolerance.
| Dimension | Managed Node Groups | Fargate | Karpenter |
|---|---|---|---|
| Node ownership | You (EC2) | AWS managed | You (EC2) |
| OS patching | You update AMIs | AWS manages | You (NodeClass AMI config) |
| Startup latency | Fast (pre-provisioned) | 30–60s cold start | 30–90s (EC2 boot) |
| Scaling granularity | Group-level ASG | Per pod | Per pod |
| Spot support | Manual per node group | No spot on Fargate | Auto least-cost with spot |
| DaemonSets | ✓ | ✗ | ✓ |
| GPUs / NVMe | ✓ | ✗ | ✓ |
| Host networking | ✓ | ✗ | ✓ |
| Persistent EBS | ✓ | ✗ (EFS only) | ✓ |
| Auto consolidation | ✗ manual | N/A (pod-per-node) | ✓ built-in |
Default is 1 node at a time — very slow for large clusters. Increase maxUnavailable in the nodegroup update config. Ensure your workloads have replicas > maxUnavailable to avoid downtime.
If a pod uses EBS and gets rescheduled to a different AZ, it can't reattach the volume. Use volumeBindingMode: WaitForFirstConsumer on StorageClass and consider per-AZ node groups for stateful workloads.
NodePool CRD

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
  limits:
    cpu: 1000
    memory: 1000Gi
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized  # karpenter.sh/v1 name; v1beta1 called it WhenUnderutilized
    consolidateAfter: 30s
```

EC2NodeClass

```yaml
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiSelectorTerms:
    - alias: al2023@latest
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  role: KarpenterNodeRole-my-cluster
```
Always set resources.requests on all containers. Without them, Karpenter can't determine the right instance size and may bin-pack everything onto one tiny instance.
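A minimal sketch of what that looks like on a container (names and numbers here are illustrative, not from this guide):

```yaml
# Hypothetical Deployment container: requests are Karpenter's sizing signal
containers:
  - name: api
    image: my-registry/api:1.0   # placeholder image
    resources:
      requests:
        cpu: 500m        # Karpenter sums requests across pending pods
        memory: 512Mi    # and picks the cheapest instance type that fits
      limits:
        memory: 512Mi    # memory limit = request avoids OOM surprises
```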
Karpenter drains nodes during consolidation. Without PDBs, critical services may briefly have 0 replicas. Always define PDBs for production workloads.
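A minimal PDB sketch, assuming a Deployment labeled app: web running 3+ replicas:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb            # hypothetical name
spec:
  minAvailable: 2          # evictions (e.g. Karpenter consolidation) pause if fewer than 2 would remain
  selector:
    matchLabels:
      app: web
```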
They will fight over scale-down decisions on the same node groups. Pick one — Karpenter is the recommended choice for new clusters.
EKS storage: EBS (block, AZ-bound, databases), EFS (shared file, multi-AZ, Fargate-compatible), FSx Lustre (HPC/ML), FSx ONTAP (enterprise). All use the CSI driver pattern — a controller Deployment + node DaemonSet.
| Service | Type | Access Mode | AZ Scope | Best For | Fargate |
|---|---|---|---|---|---|
| EBS | Block | RWO | Single AZ | Databases, stateful apps | ✗ |
| EFS | File (NFS) | RWX | Multi-AZ | Shared config, CMS, ML datasets | ✓ |
| FSx Lustre | High-perf parallel FS | RWX | Single AZ | HPC, ML training, batch | ✗ |
| FSx ONTAP | Enterprise NAS | RWO/RWX | Multi-AZ | SMB/NFS/iSCSI enterprise lift-shift | ✗ |
| Instance Store | NVMe local | RWO | Node-local ephemeral | Temp data, cache, shuffle | ✗ |
gp3: general purpose SSD. 3,000 IOPS baseline, configurable up to 16,000. Cheaper than gp2 at the same size. Default for new clusters. Always use gp3 unless you have specific IOPS needs.
gp2: IOPS tied to size (3 IOPS/GB, burst to 3,000). Many clusters still default to gp2; override explicitly. Avoid for new provisioning.
io1/io2: provisioned IOPS SSD up to 64,000 (io2 Block Express: 256,000). For latency-sensitive databases. io2 supports multi-attach (RWX for EC2).
StorageClass · gp3 recommended

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer  # CRITICAL: delays PV creation until pod AZ is known
allowVolumeExpansion: true
parameters:
  type: gp3
  encrypted: "true"
  iops: "3000"
  throughput: "125"
```
EBS volumes are bound to a single AZ. If your pod reschedules to a different AZ, it cannot attach the volume → pod stays Pending. Always use volumeBindingMode: WaitForFirstConsumer so the PV is provisioned in the same AZ as the scheduled pod.
Clusters created before EKS 1.23 may still have gp2 as the default StorageClass. Override: annotate gp3 with storageclass.kubernetes.io/is-default-class: "true" and remove it from gp2.
Fargate pods cannot attach EBS volumes. Use EFS with ReadWriteMany for Fargate persistent storage.
EFS is NFS-backed, multi-AZ, and supports ReadWriteMany — multiple pods across nodes/AZs can mount simultaneously. The only persistent storage option for Fargate.
StorageClass · EFS dynamic provisioning

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap            # access point per PVC (isolation)
  fileSystemId: fs-0123456789abcdef
  directoryPerms: "700"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-data
spec:
  storageClassName: efs-sc
  accessModes: [ReadWriteMany]        # key difference: multiple pods mount simultaneously
  resources:
    requests:
      storage: 5Gi                    # EFS doesn't enforce size, but the field is required
```
EKS security has two axes: cluster access (who can kubectl) via aws-auth/EKS Access Entries, and pod AWS permissions (what AWS APIs can a pod call) via IRSA or Pod Identity. Never give nodes broad IAM roles.
aws-auth ConfigMap (legacy): maps IAM ARNs → K8s usernames/groups. Lives in kube-system. YAML-based, fragile: a malformed edit can lock everyone out of the cluster. Being replaced by EKS Access Entries.
EKS Access Entries: native EKS API for cluster access. No ConfigMap required. Managed via AWS Console/CLI/Terraform. More auditable, CloudTrail-logged. Recommended for all new clusters.
aws-auth — safe management via eksctl

```bash
# Safe: eksctl validates before applying
eksctl create iamidentitymapping \
  --cluster my-cluster \
  --arn arn:aws:iam::123456789012:role/TeamDevRole \
  --group system:masters \
  --username team-dev

# RISKY: direct edit — backup first!
kubectl get cm aws-auth -n kube-system -o yaml > aws-auth-backup.yaml
kubectl edit cm aws-auth -n kube-system

# EKS Access Entries (new API)
aws eks create-access-entry \
  --cluster-name my-cluster \
  --principal-arn arn:aws:iam::123456789012:role/TeamDevRole

aws eks associate-access-policy \
  --cluster-name my-cluster \
  --principal-arn arn:aws:iam::123456789012:role/TeamDevRole \
  --policy-arn arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy \
  --access-scope type=cluster
```
Prefer eksctl create iamidentitymapping over direct kubectl edit. Keep a backup of the current ConfigMap before any changes.

IRSA binds an IAM role to a K8s ServiceAccount. The EKS mutating webhook injects env vars + a projected token into pods. The AWS SDK exchanges the token for STS temporary credentials automatically.
```bash
# Required once per cluster
eksctl utils associate-iam-oidc-provider --cluster my-cluster --approve

# Get OIDC issuer URL
aws eks describe-cluster --name my-cluster \
  --query "cluster.identity.oidc.issuer" --output text
```
IAM role trust policy:

```json
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": { "Federated": "arn:aws:iam::ACCOUNT:oidc-provider/oidc.eks.REGION.amazonaws.com/id/OIDCID" },
    "Action": "sts:AssumeRoleWithWebIdentity",
    "Condition": { "StringEquals": {
      "oidc.eks.REGION.amazonaws.com/id/OIDCID:sub": "system:serviceaccount:NAMESPACE:SA-NAME"
    }}
  }]
}
```
```yaml
# ServiceAccount
metadata:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/MyPodRole
---
# Pod spec
spec:
  serviceAccountName: s3-reader   # webhook injects AWS_ROLE_ARN + AWS_WEB_IDENTITY_TOKEN_FILE
```
Each EKS cluster gets its own OIDC provider. At 100+ clusters you hit the account limit. Consider Pod Identity (no OIDC required) for large fleets.
Each pod startup = 1 STS AssumeRoleWithWebIdentity call. At thousands of pod starts/min, you can hit per-role STS throttling. Pod Identity caches credentials locally — better at scale.
Pod Identity (EKS 1.24+) removes the OIDC provider requirement. Mappings stored in EKS control plane, not pod annotations. A privileged DaemonSet on each node proxies and caches credentials.
| Feature | Kube2IAM | IRSA | Pod Identity |
|---|---|---|---|
| OIDC provider needed | ✗ | ✓ | ✗ |
| Pod annotations needed | ✓ | ✓ | ✗ |
| Local credential proxy + cache | ✗ | ✗ | ✓ |
| ABAC / tag-based policies | ✗ | ✗ | ✓ |
| Works off EKS | ✓ | ✓ (OIDC) | ✗ EKS only |
Pod Identity setup

```bash
# 1. Install Pod Identity Agent addon
eksctl create addon --name eks-pod-identity-agent --cluster my-cluster

# 2. Associate role — no SA annotation needed
aws eks create-pod-identity-association \
  --cluster-name my-cluster \
  --namespace default \
  --service-account my-app-sa \
  --role-arn arn:aws:iam::123456789012:role/MyPodRole

# 3. Verify
aws eks list-pod-identity-associations --cluster-name my-cluster
```
Native K8s Secrets: Base64 encoded, stored in etcd. Not encrypted at rest by default. Any pod in the namespace can access them. Always enable KMS envelope encryption.
AWS Secrets Manager: External Secrets Operator (ESO) or Secrets Store CSI Driver syncs secrets → K8s. Rotation support, CloudTrail audit trail, cross-account support.
SSM Parameter Store: cheaper than Secrets Manager for non-sensitive config. Use SecureString for sensitive values. Accessible via ESO or SSM Agent.
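A hedged sketch of the ESO pattern: an ExternalSecret that syncs a Secrets Manager entry into a K8s Secret. The SecretStore, secret names, and keys here are assumptions, not from this guide:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials
spec:
  refreshInterval: 1h              # periodic re-sync picks up rotations
  secretStoreRef:
    name: aws-secrets-manager      # a SecretStore you define (e.g. with IRSA credentials)
    kind: SecretStore
  target:
    name: db-credentials           # resulting K8s Secret name
  data:
    - secretKey: password
      remoteRef:
        key: prod/db               # Secrets Manager secret name
        property: password         # JSON key inside the secret value
```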
Enable KMS encryption for etcd secrets

```bash
# Enable at cluster creation
eksctl create cluster \
  --name my-cluster \
  --with-oidc \
  --secrets-encryption-key-arn arn:aws:kms:REGION:ACCOUNT:key/KEY-ID
```
AWS Load Balancer Controller (AWS LBC) provisions ALBs (HTTP/HTTPS L7) and NLBs (TCP/UDP L4) from Kubernetes Ingress and Service resources. Understanding the full traffic path is essential for debugging and avoiding cross-AZ cost surprises.
| Feature | ALB (Application LB) | NLB (Network LB) |
|---|---|---|
| OSI Layer | L7 (HTTP/HTTPS) | L4 (TCP/UDP/TLS) |
| Routing rules | Host, path, header, query string | Port-based only |
| Latency | Higher (L7 processing overhead) | Ultra-low (<1ms) |
| Source IP | Via X-Forwarded-For header | Preserved natively |
| Static IP / Elastic IP | ✗ (DNS hostname only) | ✓ |
| AWS PrivateLink | ✗ | ✓ |
| gRPC / HTTP2 | ✓ (native) | ✓ (TLS passthrough) |
| WebSocket | ✓ | ✓ |
| K8s resource | Ingress / IngressClass | Service type:LoadBalancer |
| Best for | HTTP APIs, microservices, HTTPS termination | gRPC, databases, real-time, static IP needs |
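The ALB path uses an Ingress; the NLB counterpart is a Service of type LoadBalancer with AWS LBC annotations. A sketch (service name and ports are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: grpc-backend               # hypothetical service
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: external
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip   # direct to pod IPs
    service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
spec:
  type: LoadBalancer
  selector:
    app: grpc-backend
  ports:
    - port: 443
      targetPort: 8443
```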
Install AWS LBC via Helm

```bash
# Add Helm repo
helm repo add eks https://aws.github.io/eks-charts
helm repo update

# Install (requires IAM SA pre-created with IRSA)
helm upgrade --install aws-load-balancer-controller eks/aws-load-balancer-controller \
  -n kube-system \
  --set clusterName=my-cluster \
  --set serviceAccount.create=false \
  --set serviceAccount.name=aws-load-balancer-controller

# Verify
kubectl get deployment -n kube-system aws-load-balancer-controller
kubectl logs -n kube-system -l app.kubernetes.io/name=aws-load-balancer-controller | tail -20
```
Ingress · ALB with IngressGroup (share one ALB)

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip   # direct to pod, skips kube-proxy
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:...
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTP":80},{"HTTPS":443}]'
    alb.ingress.kubernetes.io/ssl-redirect: "443"
    alb.ingress.kubernetes.io/group.name: my-app   # share one ALB across Ingresses
```
Prefer target-type: ip (routes directly to the pod IP) over target-type: instance (routes via NodePort → kube-proxy). IP mode skips kube-proxy entirely, reduces hops, and requires VPC CNI pod IPs, which EKS provides by default.

Without alb.ingress.kubernetes.io/group.name, every Ingress resource provisions its own ALB. Use IngressGroups to share one ALB across multiple services — critical for cost control.
Unlike NLBs, ALBs go through target health check warm-up. Don't expect instant availability after applying an Ingress. Check events on the Ingress object: kubectl describe ingress <name>
Kubernetes releases 3 minor versions/year, and EKS supports several of them at any time. Each version gets ~14 months of standard support; after that it enters paid extended support, and eventually AWS auto-upgrades your control plane with warning. Plan upgrades annually minimum — one minor version at a time.
| Add-on | Purpose | Required? |
|---|---|---|
| vpc-cni | AWS VPC CNI — pod IP allocation from VPC | Yes |
| kube-proxy | Service networking (iptables / ipvs rules) | Yes |
| coredns | Cluster DNS resolution | Yes |
| aws-ebs-csi-driver | EBS PersistentVolume support | If using EBS |
| eks-pod-identity-agent | Pod Identity credential proxy DaemonSet | If using Pod Identity |
| adot | AWS Distro for OpenTelemetry | If using OTEL |
| aws-guardduty-agent | Runtime threat detection for pods | Security baseline |
Control plane logs: enable API server, audit, authenticator, scheduler, controller-manager log types. Essential for auth debugging. Configure via EKS console or CLI.
Prometheus + Grafana: kube-state-metrics + node-exporter for cluster metrics. Helm chart: kube-prometheus-stack. AWS Managed Prometheus available as a serverless option.
CloudWatch Container Insights: per-pod CPU/memory/disk/network via Fluent Bit DaemonSet. Simpler for CloudWatch-centric teams.
Enable all control plane logging

```bash
aws eks update-cluster-config \
  --name my-cluster \
  --logging \
  '{"clusterLogging":[{"types":["api","audit","authenticator","controllerManager","scheduler"],"enabled":true}]}'
```
EKS only supports upgrading one minor version at a time: 1.27→1.28→1.29→1.30. Attempting to skip fails with an API error. Plan for ~30 min per version hop.
After upgrading the control plane, manually upgrade each addon. Running mismatched versions (e.g., old vpc-cni with new K8s) causes silent networking issues.
PSP was removed in Kubernetes 1.25. If on <1.25 with PSPs, migrate to Pod Security Admission (PSA) or OPA/Gatekeeper before upgrading past 1.24. Use kubent (kube-no-trouble) to detect.
Fargate pods don't upgrade in place. After upgrading control plane, trigger a Deployment rollout (kubectl rollout restart deployment/<name>). New pods land on Fargate nodes with the updated K8s runtime.
Blue/green cluster migration: running both clusters during the migration window doubles your EC2 + control plane costs. It also requires careful ALB/NLB migration and DNS TTL management. Only use for large version jumps or CNI changes.
Staff L6 interview prep — 40 questions across all 8 topic clusters. Each has key points to memorize, a full model answer, and interviewer follow-up probes.
"EKS splits Kubernetes responsibilities at the control plane / data plane boundary. AWS fully manages the control plane — that means etcd with 3–5 nodes spread across AZs maintaining quorum automatically, multiple API server instances behind an NLB that auto-scales under load, the scheduler and controller manager, and all patching, backup, and version management of those components."
"You own everything in your AWS account: worker nodes whether EC2 or Fargate, the OS and AMI lifecycle, all Kubernetes workloads, RBAC policies, CRDs, network policies, IAM role bindings, and security groups. The bridge between the two is AWS provisioning cross-account ENIs into your VPC subnets — your nodes register with the API server through those ENIs, which is why your VPC security groups must allow port 443 outbound to the control plane endpoint."
"The practical implication is that if you misconfigure VPC routing or security groups, your nodes can't join. If AWS has a control plane issue, your existing pods keep running but you can't schedule new ones or change cluster state — etcd goes read-only."
kubectl apply -f deployment.yaml — from CLI to pod running.
Flow: aws eks get-token → STS presigned URL → API server validates via EKS auth webhook → RBAC authorization

"The full path has about 8 distinct phases. Authentication first: kubectl generates a bearer token by calling STS with your IAM credentials to get a presigned URL. The API server passes this to the EKS authentication webhook, which validates the IAM identity and maps it to a Kubernetes username via the aws-auth ConfigMap or EKS Access Entries."
"Then authorization: the API server checks your RBAC permissions for the resource and verb. If you pass, it hits the admission controllers — mutating first, then validating. In a typical EKS cluster this is where IRSA webhook injects environment variables, and where OPA/Gatekeeper blocks policy violations."
"Once admitted, the object is written to etcd — only after a quorum of etcd nodes acknowledges the write. The Deployment controller running inside the controller manager watches etcd via a list-watch and creates a ReplicaSet, which creates Pod objects with no nodeName."
"The scheduler watches for Pods with empty nodeName, scores candidate nodes using predicates (resource fit, taints, pod affinity) and priorities, then writes the chosen node back to the Pod spec in etcd. The kubelet on that node is also watching, picks up the binding, calls containerd to pull the image, creates Linux namespaces, calls the CNI plugin (aws-cni) to allocate a secondary VPC IP and set up the veth pair, then starts the container. The pod transitions through Pending → ContainerCreating → Running."
Key point: etcd stores all cluster objects under the /registry/ prefix.

"etcd is a distributed key-value store built on the Raft consensus algorithm. It elects a leader that handles all writes; followers replicate. To commit a write, the leader needs acknowledgment from a majority — quorum. In EKS, AWS runs 3–5 etcd nodes distributed across AZs. With 3 nodes, you can tolerate one AZ failure and still have quorum. With 5, you can tolerate two."
"The critical failure mode to understand: if quorum is lost, etcd becomes read-only. Your existing pods keep running because kubelet caches the pod spec locally — it doesn't need etcd to manage running containers. But you cannot create, update, or delete any Kubernetes objects until quorum is restored. This is why AWS runs this in 3+ AZs."
"As an operator, the things I watch for are: etcd compaction lag (if not compacted, the database grows unboundedly — AWS handles this but it's important to understand), the 8MB request size limit (trying to apply a very large ConfigMap or CRD can hit this), and at scale, object count — etcd's performance degrades with millions of objects, so namespacing and cleaning up stale resources matters."
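The quorum arithmetic from the answer above can be checked directly:

```shell
# Raft commit needs a majority: quorum = floor(n/2) + 1.
# Failures tolerated before etcd goes read-only: n - quorum.
for n in 3 5; do
  quorum=$(( n / 2 + 1 ))
  echo "nodes=$n quorum=$quorum tolerated_failures=$(( n - quorum ))"
done
```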
"The aws-node DaemonSet — the VPC CNI plugin — runs on each worker node and is responsible for pre-allocating a pool of IPs from your VPC subnet. It attaches secondary ENIs to the EC2 instance and assigns secondary private IPs to those ENIs. When a pod is scheduled, the kubelet calls the CNI plugin, which picks a free IP from the warm pool, creates a veth pair — one end in the pod's network namespace, one on the host — and programs the Linux kernel routing table so traffic to that IP goes into the pod."
"The capacity ceiling is instance-type dependent. An m5.xlarge supports 4 ENIs with up to 15 secondary IPs each, so 60 pod IPs. With prefix delegation enabled, each of those 15 secondary slots becomes a /28 prefix (16 IPs), giving you 4 × 15 × 16 = 960 pod IPs per node."
"The two failure modes I've seen in production: first, subnet IP exhaustion — your /24 runs out of IPs and pods get stuck Pending. The fix is either resize subnets (can't do in-place, need migration) or use prefix delegation with larger subnets. Second, the warm pool latency problem — if WARM_ENI_TARGET is 0, the first pod on a cold node waits for ENI attachment, which can take 10–30 seconds. Setting WARM_ENI_TARGET=1 pre-allocates a spare ENI."
Debug toolkit: DNS (nslookup from pod) · Endpoints (kubectl get endpoints) · iptables -L -t nat · kubectl exec + curl / nc to test connectivity directly

"I'd approach this as a layered debug. First I'd confirm whether it's DNS resolution or actual network connectivity — exec into the source pod and run nslookup service-name.namespace.svc.cluster.local. If DNS fails, the problem is CoreDNS or the service name, not networking."
"If DNS resolves but the connection fails, I'd check NetworkPolicy first — this is the most common cause of cross-namespace failures. A default-deny policy in the destination namespace, or an ingress policy missing a namespaceSelector for the source namespace, would silently block traffic. I'd kubectl get networkpolicy -n <target-ns> and read the selectors carefully."
"If there's no NetworkPolicy blocking it, I'd check the Service has active Endpoints — kubectl get endpoints <service> -n <ns>. No endpoints means the selector doesn't match any pods. Then I'd verify kube-proxy has programmed the iptables rules correctly on the relevant nodes. On EKS, I'd also check whether Security Groups for Pods is in use, which adds an extra layer of SG-based filtering that NetworkPolicy doesn't control."
"I'd also verify the pods are actually running and that aws-node allocated IPs correctly — check aws-node DaemonSet logs for any IP pool exhaustion warnings."
"The first question I ask is whether we need soft or hard multi-tenancy. Soft tenancy — where tenants are internal teams that trust each other at some level — can work in one cluster with namespace isolation. Hard tenancy — where tenants are external customers with no trust relationship — requires separate clusters per tenant or very careful controls."
"For soft tenancy in EKS, I'd layer three controls. Network: default-deny ingress + egress NetworkPolicy in every tenant namespace, then explicit allow rules only for legitimate traffic paths. With aws-eks-nodeagent or Cilium, these are enforced at eBPF level. Compute: dedicated node groups per tenant with taints and tolerations so tenant A pods can't land on tenant B nodes — eliminates shared memory/CPU attack surface. IAM: namespace-scoped RBAC only, no ClusterRole, separate IRSA roles per tenant namespace."
"For full isolation, I'd add Security Groups for Pods — assigning per-tenant SGs at the ENI level. This enforces isolation at the AWS network layer, not just in kernel space, and you get CloudTrail logging of any cross-tenant traffic attempts. The downside is SG-for-Pods requires a specific VPC CNI configuration and can complicate your IP allocation."
"The fundamental architectural difference: Cluster Autoscaler works through Auto Scaling Groups — it scales up by increasing desired count on a pre-configured ASG, which means you must pre-define instance types, have separate ASGs per AZ per instance family, and handle mixed instance types manually. It's reactive, checking every 10–60 seconds, with a conservative scale-down mechanism that waits for an underutilization window."
"Karpenter bypasses ASGs entirely and calls the EC2 Fleet API directly. When a pod is unschedulable, Karpenter reads its resource requests and constraints, queries EC2 for all matching instance types with current spot and on-demand pricing, and launches the cheapest option that fits. This happens in seconds. It's also topology-aware — it considers the pod's AZ preference for EBS volumes."
"I'd choose Karpenter for any new EKS cluster where cost efficiency matters, especially for workloads with variable shape (some pods need GPU, some need memory, some are small). The consolidation feature — where Karpenter continuously right-sizes and terminates underutilized nodes — alone can save 20–40% on compute."
"I'd stick with Cluster Autoscaler if the team has deep existing tooling around ASGs, or if the cluster runs on non-AWS infrastructure. The critical thing with Karpenter: every pod needs explicit resource requests. Without them, Karpenter has no signal for sizing and you get unpredictable behavior."
Use the karpenter.sh/do-not-disrupt annotation on pods that absolutely cannot be evicted.

"Fargate's core value proposition is zero node management and pod-level isolation — AWS provisions a dedicated microVM per pod, handles OS patching, and you never think about node capacity. That sounds appealing, but the production constraints are significant."
"The hard limitations: no DaemonSets — any tooling that relies on DaemonSets (log forwarders, security agents, monitoring collectors) won't work on Fargate pods. You work around this with sidecars, but that's operationally heavier. No EBS — Fargate pods can only use EFS for persistent storage, which has higher latency and different cost characteristics. Cold start is 30–60 seconds, which makes Fargate a poor fit for latency-sensitive scale-out paths."
"I use Fargate for: batch and ETL jobs that spin up infrequently and need strong isolation, CI/CD runners where the cold start doesn't matter, and dev/staging namespaces where the team wants zero node management overhead. For production web services with fast scale-out requirements or any DaemonSet-dependent tooling, I stay on EC2 managed node groups or Karpenter."
"The scheduler runs two phases per pod. Filtering reduces the candidate set to only nodes that can actually run the pod. This includes: does the node have enough CPU/memory requests available, do the pod's tolerations cover all node taints, does the node label match any required nodeAffinity, are the requested ports free, and critically for EKS — is the EBS volume in the same AZ as the node."
"Scoring then ranks the filtered nodes. The main scorer is LeastRequestedPriority — prefer nodes with more free capacity to spread load. PodAffinity scoring attracts pods to nodes where matching pods already run. SpreadConstraints scoring penalizes nodes in AZs that are already overloaded."
"Taints vs affinity is a common confusion point. Taints are on nodes — they push pods away unless the pod has a matching toleration. Affinity is on pods — they pull pods toward nodes. You'd taint a GPU node gpu=true:NoSchedule so only pods that tolerate it land there. You'd use nodeAffinity when you want pods to prefer certain nodes but don't want to block non-matching pods."
"For HA across zones, I prefer TopologySpreadConstraints over pod anti-affinity now — it's more expressive, can enforce a specific skew limit, and scales better. Anti-affinity requires N anti-affinity rules proportional to replica count, which gets expensive to evaluate in large clusters."
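The TopologySpreadConstraints preference can be sketched as a pod-spec fragment (the app label is assumed):

```yaml
# Keep replica counts across AZs within a skew of 1
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone   # spread across AZs
    whenUnsatisfiable: DoNotSchedule           # hard constraint; ScheduleAnyway makes it soft
    labelSelector:
      matchLabels:
        app: web                               # hypothetical label
```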
"With a node IAM role, every pod on that node — regardless of what it does — can call any AWS API that role allows. If one pod is compromised, the attacker gets node-wide AWS access. IRSA solves this with pod-level least privilege: each Kubernetes ServiceAccount maps to a specific IAM role with exactly the permissions that service needs."
"The mechanism: when you associate an OIDC provider with your cluster, EKS acts as an identity provider. You create an IAM role whose trust policy says 'trust tokens issued by this specific OIDC provider for this specific namespace/service-account'. The EKS mutating webhook intercepts pod creation and, if the pod's ServiceAccount is annotated with a role ARN, injects AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE env vars plus a projected token mounted as a file."
"At runtime, the AWS SDK picks this up automatically via its credential chain — it reads the env vars, reads the projected token, calls STS AssumeRoleWithWebIdentity, and gets back temporary credentials (1 hour by default, refreshed automatically by the SDK before expiry). The pod never has a long-lived credential — everything is ephemeral. You can't accidentally commit an IRSA credential to git."
"The scale limits: 100 OIDC providers per AWS account, and STS has per-role throttling. For fleets over 100 clusters, or for clusters with thousands of pod starts per minute all using the same role, Pod Identity is better because it has a local caching proxy on each node."
"A related trust-policy pitfall: a StringLike wildcard in the sub condition (e.g. "*:sub": "system:serviceaccount:*:my-sa") allows any namespace's service account named my-sa to assume the role — a privilege escalation vector. Always use StringEquals with the full namespace:serviceaccount path."

"RBAC in Kubernetes is built on four resources. Role defines what actions (verbs: get, list, watch, create, update, patch, delete) are allowed on which resources (pods, services, configmaps…) within a single namespace. ClusterRole is the same but cluster-scoped — used for non-namespaced resources like Nodes and PersistentVolumes, or when you need uniform permissions across all namespaces."
"RoleBinding attaches a Role (or a ClusterRole!) to a subject within a specific namespace. This is a subtle but important point: you can bind a ClusterRole with a RoleBinding, which scopes those permissions to one namespace. That's how you create reusable permission templates with ClusterRoles but grant them per-namespace. ClusterRoleBinding attaches a ClusterRole cluster-wide — granting access to all namespaces at once."
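The "ClusterRole as a reusable template" pattern looks like this — a sketch with hypothetical names (`pod-reader`, `team-a`, `ci-bot`):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: pod-reader               # reusable permission template, defined once
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding                # RoleBinding, not ClusterRoleBinding:
metadata:                        # scopes the ClusterRole to ONE namespace
  name: pod-reader-binding
  namespace: team-a
subjects:
  - kind: ServiceAccount
    name: ci-bot
    namespace: team-a
roleRef:
  kind: ClusterRole
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```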
"Three things that trip people up: RBAC is purely additive — there are no deny rules. If a user has two bindings, they get the union of permissions. Second, ServiceAccounts are namespace-scoped subjects — a ServiceAccount in namespace A can't be bound to a Role in namespace B without a ClusterRoleBinding. Third, EKS maps IAM identities to K8s RBAC users/groups — the IAM ARN becomes a username, and you bind that username via RoleBinding."
"Native Kubernetes Secrets are often misunderstood as 'secure by default' — they're not. They're base64-encoded, stored in etcd, and accessible to anyone with get secret RBAC permission in that namespace. The baseline fix is KMS envelope encryption for etcd and strict RBAC. But even with encryption at rest, the secret is still decrypted when read via the API."
"Secrets Store CSI Driver avoids etcd entirely — at pod startup, the CSI driver calls AWS Secrets Manager (using IRSA credentials from the pod's ServiceAccount) and mounts the secret directly into the pod's filesystem. The secret never touches etcd. The downside: the pod depends on Secrets Manager being reachable at startup, adding a cold-start dependency. Also, you need Secrets Manager access configured before the pod can start."
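A minimal SecretProviderClass sketch for the CSI-driver path, assuming a hypothetical secret name in Secrets Manager:

```yaml
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: app-aws-secrets
  namespace: payments
spec:
  provider: aws
  parameters:
    objects: |
      - objectName: "prod/payments/db-password"   # hypothetical secret name
        objectType: "secretsmanager"
---
# Pod volume fragment mounting the secret as a file (never stored in etcd):
# volumes:
#   - name: secrets
#     csi:
#       driver: secrets-store.csi.k8s.io
#       readOnly: true
#       volumeAttributes:
#         secretProviderClass: app-aws-secrets
```

The pod's ServiceAccount must have IRSA permissions for `secretsmanager:GetSecretValue` on that secret.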
"External Secrets Operator syncs secrets from Secrets Manager into Kubernetes Secret objects on a schedule. It's easier to integrate with existing tooling that expects K8s secrets, but the secret does end up in etcd. The advantage over manual management: rotation in Secrets Manager automatically propagates to the K8s Secret, and from there you control whether pods see the update via projected volumes (immediate) or env vars (require restart)."
"My recommendation for greenfield EKS: CSI driver for high-sensitivity secrets (API keys, certs), ESO for config-level secrets where etcd exposure risk is acceptable, and always KMS encryption enabled regardless."
Quick checks: kubectl describe pvc <name> · kubectl get pv · volumeBindingMode: WaitForFirstConsumer · kubectl logs -n kube-system -l app=ebs-csi-controller

"First: kubectl describe pvc <name> -n <ns> — the Events section tells you exactly what's failing. If it says 'waiting for first consumer to be created before binding', the StorageClass has WaitForFirstConsumer set, which is correct behavior — the PV won't be provisioned until the pod is scheduled. This usually means the pod itself is what's stuck."
"If the PVC shows 'ProvisioningFailed', the EBS CSI controller tried and failed to create the volume. I check EBS CSI controller pod logs: kubectl logs -n kube-system -l app=ebs-csi-controller -c csi-provisioner. Common errors: IAM permission denied (the IRSA role for the CSI driver is missing ec2:CreateVolume), EBS quota exceeded, or the requested AZ doesn't have the instance type available."
"If the StorageClass uses Immediate binding (legacy), the PV is created in whatever AZ the scheduler picks for provisioning — which may not be the AZ the pod ends up on. This causes 'Multi-Attach error for volume' or the pod stays Pending with 'node had no matching volume'. Fix: migrate the StorageClass to WaitForFirstConsumer and delete the stuck PVC/PV pair."
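The recommended StorageClass, sketched with an illustrative name:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-wffc                 # illustrative name
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer   # provision in the AZ the pod lands in
parameters:
  type: gp3
  encrypted: "true"
```

StorageClass fields are immutable, so the "migration" is creating a new class and repointing workloads, not editing the old one.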
"I'd also check: is the EBS CSI driver installed at all? kubectl get pods -n kube-system | grep ebs-csi. EKS doesn't ship it by default — and since 1.23 the in-tree EBS provisioner is gone, so without the driver add-on, EBS-backed PVCs never provision at all."
"The decision comes down to access pattern and scheduling flexibility. EBS is block storage — it behaves like a fast local disk attached to one EC2 instance at a time. It's ideal for databases (Postgres, MySQL, Cassandra nodes), Kafka broker data, any workload where you need low-latency sequential or random I/O. The constraint is AZ binding — the volume lives in one AZ and so does the pod that uses it."
"EFS is NFS over the network — multiple pods across multiple nodes and AZs can mount the same filesystem simultaneously (ReadWriteMany). The use cases are: shared configuration files, ML training datasets accessed by multiple worker pods, CMS media uploads, WordPress shared content. The latency profile is worse than EBS for small random I/O — you're going across a network to a managed NFS service."
"At production scale, the main EFS traps are: throughput mode — Bursting mode gives you throughput proportional to stored data, so a small but heavily-read filesystem gets throttled. Switch to Elastic mode which auto-scales throughput. Second: EFS pricing — $0.30/GB/month on Standard storage. For a cluster serving ML model files (say 50GB per model × 20 models = 1TB), you're paying $300/month in storage alone. S3 + FUSE might be cheaper depending on access patterns."
"The classic in-tree cloud provider creates a Classic ELB (now a legacy product) whenever you create a Service of type LoadBalancer. It's baked into the controller-manager, hard to update independently, and doesn't support modern ALB/NLB features like host-based routing, WAF integration, or IP-mode targeting."
"The AWS Load Balancer Controller is an out-of-tree controller — a Deployment running in your cluster that watches Ingress and Service objects. It uses IRSA to call AWS APIs and provisions ALBs from Ingress resources, NLBs from Service type:LoadBalancer annotations. Because it runs in-cluster, you can upgrade it independently of the K8s version."
"The most important setting is target-type: ip. With instance mode, the LB targets the EC2 instance on a NodePort, and kube-proxy then forwards to the pod — adding latency and potentially a cross-AZ hop. With IP mode, the LB puts the pod's VPC IP directly into the Target Group and routes straight to it. This requires VPC CNI (pods must have VPC IPs), but eliminates the kube-proxy hop and preserves the source IP at the pod."
"The other key concept is IngressGroup — without it, each Ingress object creates a separate ALB at ~$18/month. With group.name annotation, multiple Ingress resources across namespaces share one ALB, and the LBC manages rule ordering via group.order annotation."
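An IngressGroup sketch — hostname, service name, and group name are illustrative:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api
  namespace: team-a
  annotations:
    alb.ingress.kubernetes.io/group.name: shared-alb  # all Ingresses with this name share one ALB
    alb.ingress.kubernetes.io/group.order: "10"       # rule precedence within the group
    alb.ingress.kubernetes.io/target-type: ip         # route straight to pod IPs
spec:
  ingressClassName: alb
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api-svc
                port:
                  number: 80
```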
"With the default externalTrafficPolicy: Cluster, the cloud load balancer targets every node in the cluster via NodePort. A request arriving at Node A might have its pod running on Node B — kube-proxy on Node A SNATs the packet and forwards it to Node B. This has two costs: latency from the extra hop and AWS cross-AZ data transfer charges (~$0.01/GB each way — significant at scale)."
"With externalTrafficPolicy: Local, the AWS Load Balancer Controller (or cloud provider) registers only the nodes that have a pod for that Service in the Target Group. Requests go directly to pods on the same node — no SNAT, no cross-node hop, and the original client source IP is preserved in the pod, which matters for rate limiting, geo-based routing, and audit logs."
"The risk: if a node has Local policy and the pod on it crashes before the health check catches it, the LB might still send requests there briefly, getting 503s. Mitigate with fast readiness probes and health check intervals. Also, Local means pod distribution matters more — if you have 10 nodes but pods only on 3, those 3 get all the traffic. Use TopologySpreadConstraints to ensure good distribution."
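A Service sketch for the Local policy with an NLB in IP mode — the annotations follow the AWS Load Balancer Controller's conventions, but treat exact values as an assumption to verify against your controller version:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: external        # hand NLB to the LB Controller
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local   # only nodes with a ready pod are registered;
                                 # client source IP preserved at the pod
  selector:
    app: web
  ports:
    - port: 80
      targetPort: 8080
```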
Quick checks: kubectl describe pod — eviction message tells you the reason · kubectl describe node | grep Conditions -A10

"I start with kubectl describe pod <evicted-pod> — the Events section tells me the exact eviction message. Common messages: 'The node was low on resource: memory' means kubelet-initiated eviction due to node memory pressure. 'OOMKilled' means the container itself exceeded its memory limit and was killed by the kernel."
"For node pressure evictions, I check node conditions: kubectl describe node <node> | grep -A8 Conditions. MemoryPressure, DiskPressure, PIDPressure indicate which resource is constrained. I'd then check node memory allocation — kubectl top node plus the allocatable vs capacity breakdown."
"The fix depends on root cause. If pods have no resource requests (BestEffort QoS), they're evicted first — kubelet evicts in QoS order: BestEffort, then Burstable, then Guaranteed — so set explicit requests to move critical pods up that ladder. If the node is genuinely under-provisioned, either scale out (more nodes or larger instances) or tune kubelet eviction thresholds via the node group's kubelet config."
"If the evictions are happening during low-traffic periods, check if Karpenter consolidation is the cause — it drains underutilized nodes, which looks like eviction. Add PodDisruptionBudgets to ensure minimum availability is maintained during consolidation."
"Zero-downtime upgrade has three phases, and the order matters. Pre-upgrade: I run kube-no-trouble (kubent) to scan for deprecated API usage — any workloads using removed APIs (like PSP in 1.25) need to be migrated first. I check EKS Cluster Insights for compatibility warnings. I verify all Helm chart versions support the target K8s version, and I confirm PodDisruptionBudgets exist for all critical Deployments and StatefulSets."
"Control plane upgrade is low-risk from a downtime perspective — AWS does a rolling replace of API server instances behind the NLB. Existing pods keep running. API calls may see brief increased latency during the transition but the endpoint stays available. I upgrade one minor version at a time."
"Data plane is where downtime risk lives. For managed node groups, EKS cordons a node, drains it (respecting PDBs — if draining would violate a PDB, it waits), launches a new node with the updated AMI, waits for it to be Ready, then terminates the old node. The key: PDBs must be set correctly. A Deployment with 3 replicas and no PDB could have all 3 nodes drained simultaneously. With minAvailable: 2, only 1 can be drained at a time."
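The PDB for the 3-replica example above — a minimal sketch with an assumed `app: api` label:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2          # with 3 replicas, the drain may evict only 1 pod at a time
  selector:
    matchLabels:
      app: api             # hypothetical label matching the Deployment's pods
```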
"I also upgrade add-ons between the control plane and data plane steps. Running old vpc-cni with new K8s can cause ENI/IP allocation issues."
"The controller pattern is the core of how Kubernetes works. Every controller watches for a desired state expressed in Kubernetes objects, observes the actual state of the world, and takes actions to close the gap. The reconcile loop is the heart of it: observe current state, compare to desired state, take the minimum action needed, return. It's designed to be called repeatedly."
"The implementation uses informers — not raw watch calls to etcd. An informer maintains a local in-memory cache of the resources it cares about, refreshed by a list-watch stream. When a resource changes, the event is put into a work queue. The reconcile function pops from the queue, reads from the local cache (not etcd — this reduces load dramatically), and acts."
"The critical design property is idempotency: the reconcile function must produce the same result whether called once or ten times. This is because in a distributed system, the controller might crash mid-reconciliation, or receive duplicate events. Kubernetes controllers are designed so re-running reconcile on an already-reconciled object is safe."
"Failure handling: on error, the item is re-queued with exponential backoff. The work queue has rate limiting to prevent thundering herd when many resources need reconciling simultaneously. This is why after a cluster comes back up from an outage, you see a wave of reconciliation activity that settles down over minutes, not a hard spike."
"For a payment processor I'd structure this in layers. Network foundation: one VPC per cluster, three private subnets (one per AZ) sized at /22 (~1,000 IPs each), with VPC CNI prefix delegation enabled — prefix delegation attaches /28 prefixes per node instead of individual secondary IPs, raising per-node pod density without extra ENI API calls (it doesn't add IPs to the subnet, so size subnets for total pod count). Public subnets in each AZ hold only load balancers. No public node IPs."
"Compute: I'd use Karpenter with a NodePool constrained to on-demand only (no spot for payments — spot interruption = transaction disruption). Disruption policy set to WhenEmpty only — no voluntary consolidation during business hours. TopologySpreadConstraints on all Deployments enforce at least one pod per AZ with maxSkew:1."
"Data layer: if using EBS (e.g. for a local cache), WaitForFirstConsumer StorageClass so volumes are provisioned in the pod's AZ. PodDisruptionBudgets with minAvailable = ceil(replicas * 0.75) so upgrades can't take the service below 75% capacity. For the payment DB, I'd likely use Aurora Multi-AZ outside the cluster, accessed by pods via IRSA-authenticated connection."
"Security: IRSA per service component (payment-api gets only the specific DynamoDB tables it needs), KMS-encrypted secrets, NetworkPolicy default-deny in the payment namespace with explicit allow for ingress from the ALB target and egress to the DB endpoint. Pod SecurityContext: runAsNonRoot, readOnlyRootFilesystem, drop ALL capabilities."
"Ingress: multi-AZ ALB with WAF attached, target-type:ip, externalTrafficPolicy:Local, HTTPS only, TLS cert managed via ACM. Health checks on /healthz with 30s interval and 2 unhealthy threshold."
Key fact: kubectl rollout undo scales up the previous ReplicaSet.

"A rolling update creates a new ReplicaSet for the updated pod template. The Deployment controller then scales the new RS up and the old RS down simultaneously, respecting maxSurge and maxUnavailable constraints."
"maxUnavailable=0, maxSurge=1 is the zero-downtime configuration: before killing any old pod, the controller ensures a new pod is Ready. It creates 1 new pod (surge), waits for its readiness probe to pass, then terminates 1 old pod. This repeats until all replicas are updated. Slowest but safest — no user traffic hits the new version until it's proven Ready."
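As a Deployment-spec fragment:

```yaml
# Deployment spec fragment: zero-downtime rolling update
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # never dip below the desired replica count
      maxSurge: 1         # create 1 extra pod, wait for Ready, then kill 1 old pod
```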
"Readiness probes are the critical dependency here. If your new pod version never passes its readiness probe (e.g., the new code has a startup bug), the rolling update halts — it won't kill old pods because it can't bring new ones to Ready. This is actually good: the old version keeps serving traffic. You'd see the Deployment stuck at partial rollout."
"rollout history: each update creates a new RS kept as a revision. kubectl rollout undo scales up the previous RS. The number of retained revisions is controlled by revisionHistoryLimit (default 10)."
"Pod termination is a multi-step process with a critical race condition that causes dropped connections in most clusters. When a pod is deleted: the API server marks it as Terminating, which triggers two concurrent paths — kubelet starts the termination sequence, and endpoints controller removes the pod from Service endpoints, which kube-proxy uses to update iptables rules."
"The kubelet path: if a preStop hook is defined, it runs first. Then SIGTERM is sent to all containers. If the process doesn't exit within terminationGracePeriodSeconds, SIGKILL is sent."
"The race condition: iptables rule removal is asynchronous and may lag behind SIGTERM by 1–10 seconds. During that window, new requests can still arrive at the pod while it's trying to shut down. The standard workaround is a preStop hook with sleep 5 — this delays SIGTERM by 5 seconds, giving iptables time to propagate the endpoint removal before the app starts refusing connections."
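The workaround as a container-spec fragment (the 5-second sleep is the buffer described above; tune it to your environment):

```yaml
# Container fragment: delay SIGTERM so endpoint removal propagates first
lifecycle:
  preStop:
    exec:
      command: ["sh", "-c", "sleep 5"]   # requires a shell in the image
# Pod-level setting: total budget for preStop + graceful shutdown
# terminationGracePeriodSeconds: 30
```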
"For applications that have long-running requests (gRPC streams, websockets), set terminationGracePeriodSeconds to the maximum expected request duration + the sleep buffer. A payment processing service might need 60–90 seconds to drain."
"A CRD lets you extend Kubernetes with domain-specific resource types. Once you define a CRD, the API server handles storage, validation (via OpenAPI schema), RBAC, and watch — you get all of that for free. Your custom object lives in etcd alongside native Kubernetes objects."
"An Operator is a controller that watches those CRDs and implements operational domain knowledge — the stuff a human operator would do. For a database operator: creating a cluster, handling node failures, taking backups, performing rolling upgrades. The key is encoding runbooks as code."
"Build vs buy: for common infrastructure — Postgres, Kafka, Prometheus, Redis — I'd always evaluate existing operators first. Strimzi for Kafka, CloudNativePG for Postgres, and the Prometheus Operator are battle-tested. Building your own is weeks of work with edge cases you haven't thought of yet."
"I'd build custom when: the operational logic is proprietary to our business domain (e.g., an operator for our internal ML training job lifecycle), when existing operators have architectural constraints that conflict with our requirements, or when we need deep integration with internal tooling. The kubebuilder framework makes this approachable — it scaffolds the controller, generates CRD YAML, and handles the watch/cache/queue plumbing."
svc-name → tries svc-name.namespace.svc.cluster.local first

"CoreDNS is Kubernetes' cluster DNS service — a Deployment (typically 2 replicas for HA) running in kube-system. Its ClusterIP is configured in every pod's /etc/resolv.conf as the nameserver. Each pod's resolv.conf also has search domain suffixes, so a bare hostname like my-svc gets tried as my-svc.my-namespace.svc.cluster.local first, then falls back through the search-domain chain."
"The most impactful DNS issue in large EKS clusters is the 5-second DNS timeout. It's caused by a Linux kernel race condition in conntrack when multiple threads from the same pod make concurrent DNS queries — SNAT and packet processing can drop one of them, causing a 5-second retry. The symptoms: intermittent 5-second latency spikes on first connections, usually masked by connection pooling but visible in tail latencies."
"NodeLocal DNSCache is the proper EKS fix: a DaemonSet that runs a DNS cache on every node and intercepts DNS traffic via a link-local address before it reaches CoreDNS. This eliminates the conntrack issue because DNS queries no longer traverse iptables NAT for most lookups. It also reduces load on CoreDNS significantly in large clusters."
"Other CoreDNS failures: CoreDNS pods OOMKilled under load (increase memory limits), CoreDNS pods scheduled on a single node that becomes unavailable (use topologySpreadConstraints on the CoreDNS Deployment), and DNS cache poisoning (validate CoreDNS is not forwarding to untrusted resolvers)."
"I structure observability around three pillars with clear signal priorities. Metrics: I use kube-prometheus-stack (Prometheus + Grafana) for cluster metrics. kube-state-metrics translates Kubernetes object states into Prometheus metrics — pod phases, deployment rollout status, PVC binding state. node-exporter covers host-level resources. For the control plane, EKS doesn't expose Prometheus metrics directly, so I rely on CloudWatch Container Insights for control plane health."
"Critical alerts I always set up: node in NotReady for >2 minutes (hardware/network issue), pod CrashLoopBackOff (app crash), PersistentVolumeClaim stuck Pending (storage provisioning failure), kube-proxy DaemonSet not fully available, and CoreDNS error rate spike. These are the signals that wake me up at 3am."
"Logs: Fluent Bit DaemonSet (the ADOT Fluent Bit variant ships with Container Insights) forwards container logs to CloudWatch Logs with pod metadata enrichment. Control plane logs — especially API audit and auth logs — must be enabled explicitly in EKS and are critical for security investigations and RBAC debugging."
"Traces: for a Staff-level concern — I use OpenTelemetry (ADOT Collector as a DaemonSet or sidecar) to collect traces from services and forward to X-Ray or Jaeger. The key is propagating trace context across service boundaries, especially across Ingress → service → database. Without traces, you can see that latency increased but not which service in the call chain is responsible."
"Immediate priority is containment without losing evidence. I'd immediately apply a NetworkPolicy to the compromised pod's labels — deny all ingress and egress except to a forensics endpoint. This isolates it from lateral movement while keeping it running for evidence collection. I'd label the pod with a quarantine label so it's visually identifiable."
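The deny-all baseline of that quarantine policy can be sketched like this (namespace and label are illustrative; an explicit egress rule to the forensics endpoint would be added on top):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: quarantine
  namespace: payments            # hypothetical namespace
spec:
  podSelector:
    matchLabels:
      quarantine: "true"         # label applied to the compromised pod
  policyTypes: ["Ingress", "Egress"]
  # no ingress/egress rules listed => default-deny in both directions;
  # add a narrow egress rule for the forensics collector as needed
```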
"In parallel, I'd check CloudTrail for any AWS API calls made by the pod's IAM role in the last hour. If the pod had an IRSA role with S3 read access, did it exfiltrate data? I can't revoke the temp credentials directly, but I can attach an explicit Deny policy to the IAM role immediately, which overrides all other permissions even for in-flight temp credentials."
"Blast radius assessment: what K8s Secrets were mounted in that namespace? What ServiceAccounts could the compromised process impersonate? If the pod had node-level access (host PID, host network), the blast radius is the entire node and all pods on it — in that case I'd cordon and drain the node."
"For remediation: rotate all credentials the pod could have accessed (DB passwords, API keys in Secrets, IRSA role rotated by creating a new role), patch the container image, redeploy. Then conduct a post-mortem: how did this happen, what GuardDuty rule would have caught it earlier, is Pod Security Admission configured to prevent privileged containers?"
"Requests and limits serve different purposes. Requests are scheduling hints — the scheduler sums all pod requests per node to determine available capacity. A node with 8 vCPU and all requests totaling 7 vCPU will not schedule another 2-vCPU request pod, even if actual usage is only 3 vCPU. This is intentional — requests guarantee resource availability."
"Limits are enforcement ceilings. CPU limits are implemented with Linux cgroups CFS quota (cfs_quota) — if a container tries to use more than its CPU limit, the kernel throttles it without killing it. Memory limits are different: if a container exceeds its memory limit, the Linux OOM killer kills the process. The container restarts (if restartPolicy allows), which shows up as CrashLoopBackOff with an OOMKilled reason."
"The QoS class determines eviction priority under node pressure. Guaranteed (requests == limits for ALL containers) means kubelet won't evict the pod under memory pressure unless it's actually over limit. Burstable gets evicted next. BestEffort (no requests/limits at all) gets evicted first."
"My recommendation: set requests == limits for memory on critical workloads (protects against memory-pressure eviction; strict Guaranteed QoS would additionally require CPU requests == limits), but keep the CPU limit higher than the request or remove it entirely — CPU throttling degrades latency subtly and is hard to detect. For non-critical workloads, set memory requests conservatively with limits 2x higher."
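That recommendation as a container-spec fragment (values are illustrative):

```yaml
# Container fragment: memory request == limit, no CPU limit
resources:
  requests:
    cpu: "500m"
    memory: "1Gi"
  limits:
    memory: "1Gi"     # == request: stable under node memory pressure
    # no CPU limit: avoids CFS throttling latency; pod may burst
```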
"A sidecar extends or enhances a main container's capabilities by running in the same pod, sharing network namespace and volumes. The canonical examples are: an Envoy proxy sidecar for service mesh traffic control, a Fluent Bit sidecar for log forwarding (especially on Fargate where DaemonSets don't work), and the IRSA token refresh sidecar in older patterns."
"Sidecar vs init container: init containers run sequentially before the main container and must complete successfully — perfect for wait-for-dependency patterns (wait until Postgres is ready), one-time config generation, or DB schema migrations. Sidecars run for the lifetime of the pod alongside the main container."
"In Kubernetes 1.29+, there's now a formal sidecar container type — an initContainer with restartPolicy:Always. It starts before the main container, keeps running, and importantly, it terminates after the main container during pod shutdown. This solves the classic problem where a log forwarder sidecar would die before draining the log buffer because both containers get SIGTERM simultaneously."
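The native sidecar declaration, sketched with hypothetical images:

```yaml
# Pod spec fragment: native sidecar (Kubernetes 1.29+)
spec:
  initContainers:
    - name: log-forwarder
      image: fluent/fluent-bit:latest   # hypothetical image
      restartPolicy: Always             # this field turns the init container into
                                        # a sidecar: starts before the main container,
                                        # terminates after it on shutdown
  containers:
    - name: app
      image: my-app:latest              # hypothetical image
```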
"Separate service is better when: the functionality needs independent scaling (log aggregation serving 100 app pods doesn't need to scale 1:1 with apps), it needs its own RBAC or IAM permissions isolated from the app, or it serves multiple different workloads."
"At the ALB level, the AWS Load Balancer Controller supports weighted target groups. You deploy a v2 Deployment alongside v1, create a separate Service, and in the Ingress you define a forward action via an alb.ingress.kubernetes.io/actions.<action-name> annotation that lists both target groups with weights — e.g., 90% to v1, 10% to v2. This requires no application change and shifts traffic at the LB level."
"For full progressive delivery with automatic rollback, Argo Rollouts replaces the standard Deployment and adds traffic shifting logic. You define canary steps: pause for analysis, shift 10% → wait → shift 25% → wait → promote. The analysis template queries Prometheus — if error rate exceeds a threshold, Rollouts automatically rolls back. No human involvement needed."
"Header-based canary is useful for targeted testing: route requests with X-Canary: true header to the new version, all others to the old version. The ALB can route on request headers. This lets you send your QA team or specific user cohort to the new version while production users stay on stable."
"For true blue-green (instant full cutover): run two identical Deployments, change the Service selector label from version: blue to version: green. The cutover is atomic — Service selector update is a single etcd write, and kube-proxy propagates it within seconds."
pod-0.svc-name.namespace.svc.cluster.local

"The core distinction is pod identity. Deployment pods are interchangeable — if pod-abc crashes, a new pod-xyz replaces it, same spec, different name. StatefulSet pods have stable identities: pod-0, pod-1, pod-2 — if pod-1 crashes, the replacement is also called pod-1 and remounts the same PVC."
"This stable identity enables three things Deployments can't provide: stable network identity via a headless Service (pod-0.db.namespace.svc.cluster.local resolves specifically to pod-0 — critical for Kafka brokers advertising their address to clients), stable persistent storage (each replica gets its own PVC from volumeClaimTemplates, which persists across pod restarts and is not deleted on scale-down), and ordered operations (pod-1 won't start until pod-0 is Ready, rolling updates go in reverse order — pod-2 updates before pod-1 before pod-0)."
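A minimal sketch of the headless Service + volumeClaimTemplates pairing (names and sizes are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: db
spec:
  clusterIP: None            # headless: enables pod-0.db.<ns>.svc.cluster.local
  selector:
    app: db
  ports:
    - port: 5432
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db
spec:
  serviceName: db            # ties pod DNS identity to the headless Service
  replicas: 3
  selector:
    matchLabels:
      app: db
  template:
    metadata:
      labels:
        app: db
    spec:
      containers:
        - name: postgres
          image: postgres:16           # illustrative image
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:      # each replica gets its OWN PVC: data-db-0, data-db-1, ...
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 50Gi
```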
"I use StatefulSet for any application where the instance matters: database nodes, Kafka/Pulsar brokers, ZooKeeper, Redis cluster members, Elasticsearch data nodes. I use Deployment for stateless applications where any pod can serve any request."
"Important EKS-specific nuance: StatefulSet PVCs are not deleted when you scale down. If you scale from 3 to 2 replicas, pod-2 is deleted but data-pod-2 PVC remains, consuming EBS costs. This is intentional (protects data) but requires manual cleanup when decommissioning."
"DaemonSets ensure exactly one pod per node (or per matching node subset). Historically the DaemonSet controller bypassed the scheduler by setting nodeName directly; since Kubernetes 1.12 the default scheduler places DaemonSet pods via injected node-affinity terms, but they still tolerate node conditions — which is why DaemonSet pods appear on nodes in NotReady state or under memory pressure. Critical infrastructure like aws-node (VPC CNI), kube-proxy, and log forwarders use DaemonSets because they must run everywhere."
"By default, DaemonSet pods tolerate most node condition taints — this is intentional so that node maintenance tooling (the DaemonSet) can run even when a node is degraded. You can override this by removing tolerations if you want the DaemonSet to only run on healthy nodes."
"To run a DaemonSet on a subset of nodes: add nodeSelector or nodeAffinity to the DaemonSet's pod spec. For example, run the GPU monitoring DaemonSet only on nodes with label gpu=true. Karpenter is aware of DaemonSet overhead — when sizing a new node, it accounts for DaemonSet pod resource requests as overhead."
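The GPU-subset example as a DaemonSet pod-template fragment:

```yaml
# DaemonSet spec fragment: run only on nodes labeled gpu=true
spec:
  template:
    spec:
      nodeSelector:
        gpu: "true"      # label from the example above
```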
Quick checks: kubectl describe node → Conditions section (KubeletReady = false?) · systemctl status kubelet · journalctl -u kubelet -n 100 · disk full on root volume · docker/containerd OOM · network issue preventing kubelet from reaching the API server · certificate rotation failure

"First: kubectl describe node <node> — the Conditions section tells me whether it's MemoryPressure, DiskPressure, PIDPressure, or plain NotReady (kubelet lost contact with API server). The Events section shows recent node-level events."
"I then SSH via AWS Systems Manager Session Manager (no bastion needed if SSM agent is running on the node). Check kubelet: systemctl status kubelet and journalctl -u kubelet -n 200 --no-pager. Common logs: 'failed to get cgroup stats' (kernel version issue), 'certificate has expired' (cert rotation failure), 'PLEG is not healthy' (container runtime unresponsive)."
"If the disk is full (df -h), check for large log files, stuck containers with big log outputs, or image layer accumulation. On EKS, docker system prune (or containerd equivalent) clears unused images. Increase the root EBS volume in the launch template if this is recurring."
"For managed node groups, the operational response is: cordon, drain (respecting PDBs), terminate the EC2 instance, and let the ASG replace it with a fresh node. Investigating the root cause is parallel — don't hold up workload recovery."
"HPA adjusts the replica count of a Deployment or StatefulSet based on observed metrics. It queries the metrics-server every 15 seconds, applies the formula ceil(current × observed/target), and updates the replicas field. The built-in metrics are CPU and memory utilization relative to requests — which is why resource requests must be set for HPA to work."
"The stabilization window prevents thrashing. By default, scale-up applies almost immediately (stabilization window of 0 seconds), while scale-down is conservative (300-second window — the controller takes the highest recommendation over the last 5 minutes, so replicas drop only after the metric has stayed below target that long). Tune with behavior.scaleDown.stabilizationWindowSeconds, and behavior.scaleUp for the other direction."
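A minimal HPA sketch showing the CPU target and scale-down tuning (names and numbers are illustrative):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api                          # hypothetical Deployment
  minReplicas: 3
  maxReplicas: 30
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70       # relative to CPU requests
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # wait 5 min below target before shrinking
```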
"KEDA (Kubernetes Event-Driven Autoscaling) extends HPA for external triggers — SQS queue depth, Kafka consumer lag, custom Prometheus metrics, even cron schedules. It creates HPA objects under the hood but with external metrics. This is the right pattern for batch/queue-based workloads where CPU utilization is a lagging indicator of load."
"Key limitation: HPA can scale replicas out but not individual pod resource sizes — that's VPA (Vertical Pod Autoscaler). VPA and HPA can conflict if both are targeting CPU metrics on the same resource. Use HPA for throughput-oriented scaling, VPA for right-sizing single-replica workloads."
"A service mesh intercepts all service-to-service traffic via sidecar proxies injected into each pod. This gives you: automatic mTLS between all services (zero-trust networking without any application code change), fine-grained traffic management (retry policies, circuit breakers, canary traffic splitting at L7), and deep observability — traces and metrics for every service-to-service call, even across language boundaries."
"The cost is real: latency overhead (2–10ms per hop for Istio Envoy sidecars, lower for Linkerd), memory overhead (each Envoy sidecar uses 50–150MB — at 100 pods, that's 5–15GB extra memory), and significant operational complexity. Debugging traffic issues through a service mesh is much harder than native Kubernetes networking."
"I'd recommend a service mesh when: compliance mandates mTLS in-cluster (PCI-DSS, HIPAA), you need sophisticated traffic management across microservices without code changes, or you need per-service-pair traffic metrics that Prometheus doesn't provide without custom instrumentation."
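If Istio is the mesh chosen for the mTLS compliance case above, mesh-wide enforcement is a single resource — a sketch, assuming a standard install with Istiod in `istio-system`:

```yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system   # the root namespace, so the policy applies mesh-wide
spec:
  mtls:
    mode: STRICT            # reject any plaintext traffic between sidecars
```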
"For many teams, Cilium's eBPF-based service mesh is a better fit — no sidecars, lower overhead, and it integrates with the CNI. For mTLS specifically, SPIRE/SPIFFE for workload identity might be a lighter option than deploying full Istio."
pod-security.kubernetes.io/enforce=restricted
"PSP was a cluster-scoped resource that defined what security constraints pods must meet. The problem: its RBAC model was confusing (a pod could use a PSP if the pod's ServiceAccount could 'use' it — this led to many accidental privilege escalation bugs). It was deprecated in 1.21 and removed in 1.25."
"Pod Security Admission replaces it with a simpler model: you label a namespace with a policy level, and the built-in admission controller enforces it. Three levels: privileged (no restrictions, for system namespaces), baseline (blocks hostPID, hostNetwork, privileged containers, dangerous capabilities), restricted (requires non-root, drops all capabilities, requires seccompProfile)."
"Three modes per level: enforce (reject the pod), audit (allow but log an audit event), warn (allow but show a warning to the user). The migration strategy: add pod-security.kubernetes.io/audit=restricted labels to all namespaces first, check audit logs for violations, fix the workloads, then switch to enforce mode."
"For EKS, namespaces like kube-system must stay privileged (system components need elevated privileges). Application namespaces should target at least baseline, ideally restricted for production workloads."
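The migration strategy above expressed as namespace labels (the namespace name is illustrative):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: payments             # illustrative app namespace
  labels:
    # Step 1: audit/warn first — violations are logged and surfaced, nothing is blocked
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
    # Step 2: once workloads are fixed, enforce — start at baseline, then tighten
    pod-security.kubernetes.io/enforce: baseline
```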
failurePolicy: Fail — if the webhook is unreachable, the matching request is rejected; Ignore — the request is allowed through. With Fail, an unavailable webhook can mean no new pods can be scheduled (a cluster emergency).
"Admission webhooks intercept API server requests after authentication/authorization but before persisting to etcd. Mutating webhooks run first — they can modify the object being submitted (inject a sidecar container, add an annotation, set a default resource limit). Validating webhooks run after all mutating webhooks — they can only accept or reject, not modify. OPA/Gatekeeper uses validating webhooks."
"The critical operational risk is failurePolicy: Fail. If your webhook service (e.g., Gatekeeper) becomes unavailable and it's configured with Fail policy, every API request that matches the webhook rule is rejected. In the worst case, no new pods can be scheduled and no Deployments can be updated — effectively a cluster incident. I've seen Istio injection webhooks with Fail policy take down a cluster when the Istiod service crashed."
"Best practices: scope webhooks with namespaceSelector to exclude critical system namespaces (kube-system, kube-public) so the cluster can always self-heal. Set reasonable timeouts (2–5s) and have runbooks for emergency disabling. Monitor webhook latency — a slow webhook adds that latency to every kubectl apply."
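The guardrails above in one sketch — the webhook name, service, and path are illustrative:

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: policy-check               # illustrative
webhooks:
  - name: policy.example.com
    clientConfig:
      service:
        name: policy-webhook       # illustrative in-cluster webhook service
        namespace: policy-system
        path: /validate
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["pods"]
    failurePolicy: Ignore          # fail open so the cluster can always self-heal
    timeoutSeconds: 3              # bound the latency added to every matching request
    namespaceSelector:             # exclude critical system namespaces
      matchExpressions:
        - key: kubernetes.io/metadata.name
          operator: NotIn
          values: ["kube-system", "kube-public"]
    admissionReviewVersions: ["v1"]
    sideEffects: None
```

Whether to fail open (Ignore) or closed (Fail) is a policy decision: security-critical policy enforcement may justify Fail, but then the namespaceSelector exclusions become mandatory, not optional.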
"The Ingress API has been stretched beyond its original design. Every feature beyond basic path routing requires implementation-specific annotations — ALB annotations for auth, NLB annotations for TCP, NGINX annotations for rate limiting. There's no standard way to express traffic splitting or header-based routing."
"Gateway API introduces a role-based model with three resource types. GatewayClass is cluster-scoped, controlled by infra teams — it defines the controller (e.g., aws-load-balancer-controller). Gateway is created by cluster operators and provisions the actual load balancer with listener configuration. HTTPRoute is created by app teams and attaches to a Gateway — defining routing rules with standardized syntax for path, header, weight-based traffic splitting."
"The key improvements for platform teams: role separation — app teams write HTTPRoutes without needing cluster-admin, and they can't accidentally misconfigure the shared Gateway. Standardized traffic splitting — canary deployments are first-class, expressed as weight: 90/10 in the HTTPRoute spec without custom annotations. AWS LBC supports Gateway API in newer versions."
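The 90/10 canary split described above as an HTTPRoute — the route, Gateway, and Service names are illustrative; the platform team's Gateway is assumed to already exist:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: checkout               # illustrative; owned by the app team
spec:
  parentRefs:
    - name: shared-gateway     # the Gateway owned by the platform team
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /checkout
      backendRefs:
        - name: checkout-stable
          port: 8080
          weight: 90           # 90/10 canary split, no controller-specific annotations
        - name: checkout-canary
          port: 8080
          weight: 10
```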
"CSI standardizes the interface between Kubernetes and storage vendors. Before CSI, each storage provider had in-tree code in the Kubernetes codebase — a security and release coupling problem. CSI moves this to out-of-tree plugins that run as pods."
"The architecture has two components. The controller plugin is a Deployment that runs on any node and handles cloud API calls: CreateVolume (provision an EBS volume), DeleteVolume, ControllerPublishVolume (attach to an EC2 instance). It runs with IRSA credentials that have ec2 permissions."
"The node plugin is a DaemonSet on every worker node that handles the local mount operations: NodeStageVolume (format and mount to a staging path on the host), NodePublishVolume (bind-mount from staging into the pod's directory). It runs privileged because it needs to create mounts in the host's mount namespace."
"Dynamic provisioning flow: a PVC is created with a StorageClass. The external-provisioner sidecar inside the controller Deployment watches for unbound PVCs, calls CreateVolume on the CSI driver, and creates a PV object. The PVC binds to the PV. When the pod is scheduled to a node, the external-attacher calls ControllerPublishVolume to attach the EBS volume to that EC2 instance, then the node plugin mounts it into the pod."
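The flow above starts from objects like these — a sketch assuming the EBS CSI driver is installed; names and sizes are illustrative:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer  # delay CreateVolume until the pod is
                                         # scheduled, so the volume lands in the right AZ
parameters:
  type: gp3
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data                   # illustrative; the external-provisioner picks this up
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: gp3
  resources:
    requests:
      storage: 20Gi
```

`WaitForFirstConsumer` matters on EKS: EBS volumes are zonal, so provisioning before scheduling can strand a volume in an AZ with no suitable node.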
"The 2-minute spot interruption notice is the key constraint. AWS sends an EC2 instance metadata event, and the AWS Node Termination Handler (or Karpenter natively) picks this up, cordons the node immediately, and drains it — evicting pods gracefully before the instance terminates. This gives pods up to 2 minutes to handle SIGTERM and shut down."
"My tiering strategy: stateless workloads on spot (web services, API servers, workers) with multiple replicas spread across instance families (m5, m5a, m5n, m4 — if m5 spot capacity dries up, m5a picks up the slack). Stateful and critical workloads on on-demand. With Karpenter, this is a NodePool label: karpenter.sh/capacity-type: spot for batch pools, on-demand for critical pools."
"AZ and instance family diversification is critical. A mass reclaim event in us-east-1a m5 family would take down all your spot nodes if they're homogeneous. Diversify: require at least 3 different instance families, 3 AZs. Karpenter automatically selects across families based on pricing."
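The tiering and diversification points above as a Karpenter NodePool sketch (karpenter.sh/v1 schema; pool name illustrative, and the `default` EC2NodeClass is assumed to exist):

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: batch-spot             # illustrative: the spot pool for stateless/batch work
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]     # use "on-demand" in the critical pool
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["m5", "m5a", "m5n"]   # diversify across instance families
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["us-east-1a", "us-east-1b", "us-east-1c"]  # and across AZs
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default          # assumed to exist
```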
"PDBs are non-negotiable for spot workloads. A spot mass reclaim can hit multiple nodes in the same minute. Without PDBs, your 3-replica service could go to 0 during a reclaim wave."
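For the 3-replica example above, a PDB sketch (names and labels illustrative) that caps evictions at one pod at a time:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb                # illustrative
spec:
  minAvailable: 2              # of 3 replicas: drains evict at most one pod at a time
  selector:
    matchLabels:
      app: web
```

Note that a PDB only slows voluntary evictions (drains); it cannot stop EC2 from reclaiming the instance at the 2-minute mark.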
"Migrating a stateful workload involves three parallel concerns: Kubernetes resource migration, data migration, and traffic cutover. I'd approach them as a pipeline."
"Kubernetes resources: use Velero to back up all resources in the namespace from the source cluster and restore them to the destination. Velero handles PV/PVC metadata, ConfigMaps, Secrets, ServiceAccounts, and IRSA annotations."
"Data migration: take a CSI VolumeSnapshot of each PVC in the source cluster. Use the VolumeSnapshotContent object's snapshot handle to create a new PVC from snapshot in the destination cluster. For large volumes, this runs in parallel with resource setup. For databases, I'd quiesce writes (or use application-level replication) before the final snapshot to avoid consistency issues."
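A sketch of the snapshot-and-restore pair described above, with illustrative names; in the destination cluster a VolumeSnapshotContent pointing at the EBS snapshot handle (plus a matching VolumeSnapshot) must be created first, which is elided here:

```yaml
# Source cluster: snapshot the PVC
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: data-snap
spec:
  volumeSnapshotClassName: ebs-csi-snapclass   # assumed to exist
  source:
    persistentVolumeClaimName: data
---
# Destination cluster: restore a new PVC from the (pre-imported) snapshot
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-restored
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: gp3
  dataSource:
    name: data-snap
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  resources:
    requests:
      storage: 20Gi            # must be >= the source volume size
```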
"Traffic cutover: I stand up the workload in the destination cluster in passive mode (workload running, no external traffic). Run smoke tests against the new cluster's internal ALB. Then use Route 53 weighted routing to gradually shift traffic: 10% to new cluster, verify metrics (error rate, latency) for 30 min, shift 50%, verify, then cut to 100%. Keep the old cluster alive for 30 minutes as a rollback option."
"The hardest part is maintaining data consistency during the traffic transition window if the workload has writes. For databases, application-level replication (Postgres logical replication) to the new cluster until cutover is the cleanest approach."