Staff-Level Interview Prep
NAT Gateway
Internals
How AWS NAT Gateway really works under the hood — SNAT, connection tracking, port exhaustion, AZ affinity, and the failure modes interviewers expect you to know.
Connection Tracking · Port Exhaustion · AZ Affinity · Scaling · 5-tuple SNAT
What NAT GW Actually Does
NAT Gateway implements SNAT (Source NAT) — it rewrites the source IP and source port of outbound packets from private EC2s to its own Elastic IP, then tracks the mapping to route replies back. It never touches destination IP — that's DNAT territory (which IGW handles for inbound).
[Diagram: VPC 10.0.0.0/16. Private subnet 10.0.2.0/24 holds EC2-A 10.0.2.10:52341, EC2-B 10.0.2.11:48821, EC2-C 10.0.2.12:61009, with route table 0.0.0.0/0 → nat-xxxxxxxx. Public subnet 10.0.1.0/24 holds the NAT Gateway (EIP 52.23.101.45), whose SNAT engine maps 10.0.2.10:52341 → 52.23.101.45:1024 and 10.0.2.11:48821 → 52.23.101.45:1025, then forwards via the Internet GW to 93.184.216.34:443. Replies are looked up in the conn table.]
① EC2 sends packet with private src IP → ② NAT GW rewrites src to EIP + unique port → ③ IGW forwards; the reply reverses the mapping.
5-Tuple SNAT Translation
Translation unit: 5-tuple — src IP, src port, dst IP, dst port, protocol
What changes: Source IP → EIP · Source port → allocated port
What's unchanged: Destination IP, dst port, protocol
Port range: 1024–65535 per EIP (64,512 ports)
Multiple EIPs: Up to 8 EIPs; scales port space 8×
ICMP: ICMP ID field used as pseudo-port
UDP: Same mechanism; 350s idle timeout
TCP: 350s established; 60s on half-close
NAT GW vs NAT Instance
Management: Fully managed (vs self-managed EC2)
HA: Built-in within an AZ (vs single-point EC2)
Bandwidth: Auto-scales to 100 Gbps
Security Groups: Cannot attach SGs to NAT GW
Port forwarding: Not supported (NAT instance can)
Bastion use: Not possible (NAT instance can)
Cost (idle): ~$0.045/hr always-on vs EC2 stoppable
src/dst check: NAT instance must disable; NAT GW N/A
Connection Tracking Table
NAT GW maintains a per-flow connection tracking table in memory. Every active TCP/UDP flow has an entry keyed by 5-tuple. This is what allows it to demultiplex inbound reply packets back to the correct private EC2 — without any routing table entry pointing inward.
Live Connection Table (representative snapshot)
Private (src) Translated (src) Destination Proto State Idle
10.0.2.10:52341 52.23.101.45:1024 93.184.216.34:443 TCP ESTABLISHED 2s
10.0.2.11:48821 52.23.101.45:1025 93.184.216.34:443 TCP ESTABLISHED 14s
10.0.2.12:61009 52.23.101.45:1026 8.8.8.8:53 UDP ACTIVE 1s
10.0.2.10:44100 52.23.101.45:1027 151.101.1.69:80 TCP TIME_WAIT 58s
10.0.2.11:39002 52.23.101.45:1028 18.210.44.7:5432 TCP SYN_SENT —
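The table above can be sketched in code. This is a hypothetical toy model of SNAT plus connection tracking, not AWS's actual implementation; the EIP and port-allocation scheme are illustrative:

```python
# Toy model of NAT GW SNAT + connection tracking (illustrative only).
# Outbound: src is rewritten to the EIP and a unique allocated port.
# Inbound reply: the reverse mapping restores the private src.
from itertools import count

EIP = "52.23.101.45"

class NatGateway:
    def __init__(self):
        self._next_port = count(1024)   # allocatable range starts at 1024
        # conn table keyed by the translated 5-tuple -> original private src
        self._conn = {}

    def translate_outbound(self, src_ip, src_port, dst_ip, dst_port, proto):
        """Rewrite src to EIP:allocated-port; record the reverse mapping."""
        nat_port = next(self._next_port)
        self._conn[(EIP, nat_port, dst_ip, dst_port, proto)] = (src_ip, src_port)
        return (EIP, nat_port, dst_ip, dst_port, proto)   # dst untouched: SNAT only

    def translate_reply(self, src_ip, src_port, dst_ip, dst_port, proto):
        """Reply arrives addressed to EIP:nat_port; restore the private dst."""
        key = (dst_ip, dst_port, src_ip, src_port, proto)  # mirror of outbound tuple
        orig = self._conn.get(key)
        if orig is None:
            return None                 # no entry -> packet is dropped
        return (src_ip, src_port, *orig, proto)
```

Note how `translate_reply` needs no route table at all: the conn table alone decides where the reply goes.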
Idle Timeouts — The Trap
TCP established: 350 seconds (≈5.8 min)
TCP transitory: 30–350 seconds (FIN / RST)
UDP: 350 seconds
ICMP: 30 seconds
What happens at timeout: Entry purged. Later packets on the flow have no mapping and are dropped
⚠ Long-lived idle TCP connections (DB pools, keep-alives) will be silently dropped. The TCP stack on EC2 thinks the connection is still open; NAT GW has forgotten it. Fix: enable TCP keepalives at the OS or app layer with interval < 350s.
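The purge behavior can be sketched as a toy eviction model (illustrative, not AWS's implementation; in this sketch any packet refreshes the entry, which is exactly what a keepalive probe buys you):

```python
# Toy model of the 350s idle purge. A packet within the window refreshes
# the entry; a packet after the window finds the entry gone and is dropped.
IDLE_TIMEOUT_S = 350

class ConnTable:
    def __init__(self):
        self._last_seen = {}   # flow 5-tuple -> timestamp of last packet

    def packet(self, flow, now):
        """True if the flow is forwarded, False if its entry had expired."""
        seen = self._last_seen.get(flow)
        if seen is not None and now - seen > IDLE_TIMEOUT_S:
            del self._last_seen[flow]   # entry purged: EC2 still thinks it's open
            return False
        self._last_seen[flow] = now     # refresh (what a keepalive probe does)
        return True
```

A keepalive every 60s (as in the sysctl fix below) keeps `now - seen` far under 350, so the entry never expires.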
TCP Keep-Alive Fix
Set at kernel level per EC2 instance (or via launch template user-data):
# Keepalive interval (seconds before first probe)
net.ipv4.tcp_keepalive_time = 60

# Interval between probes
net.ipv4.tcp_keepalive_intvl = 10

# Probes before declaring dead
net.ipv4.tcp_keepalive_probes = 3

# Apply
sysctl -p
✓ Also configure keepalive at the app layer: Go net.Dialer.KeepAlive, PostgreSQL keepalives_idle, MySQL wait_timeout. Belt + suspenders.
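The same settings can also be applied per socket rather than system-wide. A minimal sketch in Python, assuming Linux (TCP_KEEPIDLE / TCP_KEEPINTVL / TCP_KEEPCNT are Linux-specific socket options):

```python
# Per-socket keepalive: probes fire well inside NAT GW's 350s idle window,
# so the conn-table entry is continually refreshed. Linux-only constants.
import socket

def keepalive_socket(idle=60, interval=10, probes=3):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)       # first probe after 60s idle
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)  # then every 10s
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)      # give up after 3 failures
    return s
```

Per-socket settings win over the sysctl defaults for that socket, which is useful when only a few long-lived connections (e.g. a DB pool) need the tighter interval.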
Connection Tracking vs Route Table — A Subtlety

When a reply packet arrives from the internet at IGW, AWS does not use the route table to forward it. The route table only handles outbound new-flow routing. For inbound replies, the connection tracking table in NAT GW is the sole authority. The packet hits IGW → IGW sees the dst IP is NAT GW's EIP → forwards to NAT GW → NAT GW looks up its conn table → rewrites dst back to private IP → delivers to EC2.

This also complicates NACL reasoning for NAT'd return traffic — by the time the reply enters the private subnet, its destination has already been translated back. The subnet NACL sees the private IP (with the remote host as source), never the EIP, so you can filter on the remote address but not on the EIP.

Port Exhaustion — The Scaling Cliff
Each NAT GW EIP has 64,512 ports (1024–65535). A flow is uniquely identified by the 5-tuple from NAT GW's perspective: (EIP, translated-src-port, dst-IP, dst-port, protocol). If many EC2s hammer the same dst-IP:dst-port, the unique dimension is only translated-src-port — and you'll exhaust 64K fast.
Port Math — How You Exhaust
Ports per EIP: 64,512 (1024–65535)
Unique dimension when dst fixed: Only src-port (64,512 slots total)
50 EC2s → same endpoint: 64,512 / 50 = ~1,290 conns each max
Error metric: ErrorPortAllocation in CloudWatch
Impact: New connections silently dropped
🔥 A Lambda function with concurrency=1000 all calling the same RDS endpoint can exhaust 64K ports in seconds. This is a real prod incident pattern.
Mitigations
  • ① Add more EIPs — up to 8 per NAT GW. Multiplies port space 8× to ~500K.
  • ② Reduce unique dst endpoints — use connection pooling (PgBouncer, RDS Proxy) to reduce distinct dst-IP:port combos.
  • ③ Multiple NAT GWs — split subnets; each private subnet routes to its own NAT GW. Shards the connection space.
  • ④ Reuse connections — HTTP/2 multiplexing, long-lived gRPC streams, fewer short-lived TCP connections.
  • ⑤ PrivateLink / VPC Endpoints — S3 and DynamoDB via free gateway endpoints; SQS and other AWS APIs via interface endpoints. Zero NAT GW ports either way, and gateway endpoints carry zero data cost.
  • ⑥ IPv6 + Egress-only IGW — IPv6 flows bypass NAT entirely. No port translation needed.
Port Space Comparison
1 EIP (baseline): 64,512 ports
2 EIPs: 129,024
4 EIPs: 258,048
8 EIPs (max): 516,096
Same workload via PrivateLink (S3/DDB): ∞ — zero NAT ports consumed
CloudWatch Alarms to Set
Metric | Namespace | Alarm threshold | Why
ErrorPortAllocation | AWS/NATGateway | Sum > 0 (5 min) | Port exhaustion is occurring NOW
PacketsDropCount | AWS/NATGateway | Sum > 0 (5 min) | Packets being silently dropped
ActiveConnectionCount | AWS/NATGateway | > 50,000 per EIP | Approaching exhaustion warning
ConnectionAttemptCount | AWS/NATGateway | Spike ratio | Detect connection storms
BytesInFromDestination | AWS/NATGateway | Cost spike detection | Unexpected data transfer cost
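As a sketch, the first (and most urgent) alarm expressed as the keyword arguments for boto3's `put_metric_alarm`; the NAT gateway ID and SNS topic ARN are placeholders, and you would pass the dict as `boto3.client("cloudwatch").put_metric_alarm(**alarm)`:

```python
# Build the ErrorPortAllocation alarm definition from the table above.
# Sum > 0 over a 5-minute period means port exhaustion is happening now.
def port_exhaustion_alarm(nat_gateway_id, sns_topic_arn):
    return {
        "AlarmName": f"natgw-port-exhaustion-{nat_gateway_id}",
        "Namespace": "AWS/NATGateway",
        "MetricName": "ErrorPortAllocation",
        "Dimensions": [{"Name": "NatGatewayId", "Value": nat_gateway_id}],
        "Statistic": "Sum",
        "Period": 300,                                   # 5-minute window
        "EvaluationPeriods": 1,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",    # any failure fires
        "AlarmActions": [sns_topic_arn],
    }
```

The same shape works for PacketsDropCount; only MetricName changes.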
HA & AZ Design — The #1 Mistake
NAT Gateway is highly available within a single AZ. It is not cross-AZ. If you put one NAT GW in AZ-a and your private EC2s in AZ-b route through it, you pay cross-AZ data transfer AND lose outbound connectivity if AZ-a goes down. One NAT GW per AZ is the canonical pattern.
โŒ BAD โ€” Single NAT GW AZ-A Public NAT GW (only one) AZ-A Private EC2-A AZ-B Private EC2-B cross-AZ $0.01/GB + AZ-A outage = down โœ“ GOOD โ€” NAT GW per AZ AZ-A NAT GW-A Private Subnet-A EC2-A RT: 0.0.0.0/0 โ†’ NGW-A AZ-B NAT GW-B Private Subnet-B EC2-B RT: 0.0.0.0/0 โ†’ NGW-B Both NGWs โ†’ same IGW โ†’ Internet
Single NAT GW Risks
AZ failure: All outbound traffic from other AZs fails
Cross-AZ data cost: $0.01/GB for traffic crossing AZ boundary
Latency: Adds cross-AZ hop to every outbound packet
Blast radius: Single point for port exhaustion
NAT GW per AZ Benefits
AZ failure: Only that AZ's egress is affected
Data cost: Traffic stays in-AZ, no cross-AZ charges
Port space: Partitioned: 64K ports × number of AZs
Cost trade-off: +$0.045/hr per additional NAT GW (3 AZs ≈ $99/mo baseline)
ℹ Private NAT Gateway (launched 2021): You can create a NAT GW in a private subnet — no EIP required. Useful for VPC-to-VPC traffic where overlapping CIDRs make peering impossible. Traffic is translated but stays private (no internet egress).
Cost Model & Gotchas
NAT GW is often the surprise budget item in AWS bills. Two charges: hourly ($0.045/hr ≈ $33/mo per gateway) + data processing ($0.045/GB). The data charge applies even for traffic that goes to other AWS services — which gateway VPC endpoints (S3, DynamoDB) eliminate entirely.
Billing Breakdown
Hourly (per NAT GW): $0.045/hr ≈ $32.85/mo
Data processing: $0.045 per GB (both directions)
3 AZs, prod setup: ~$99/mo just for hourly
1 TB/day outbound: ~$1,350/mo data charges
Cross-AZ data via NAT: +$0.01/GB AZ charge on top
VPC Endpoint (S3/DDB): $0.00 per GB — no NAT touch
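The billing lines above combine into a simple model. A sketch using the figures in this section (us-east-1 style prices; rates vary by region, and the ~730 hours/month convention matches the $32.85 figure):

```python
# Back-of-envelope monthly NAT GW cost from the billing breakdown above.
HOURLY = 0.045      # $/hr per NAT GW
DATA = 0.045        # $/GB processed through the gateway
CROSS_AZ = 0.01     # $/GB extra when traffic crosses an AZ boundary

def monthly_cost(gateways, gb_per_day, cross_az_fraction=0.0):
    hourly = gateways * HOURLY * 730                       # ~730 hours/month
    data = gb_per_day * 30 * DATA
    cross = gb_per_day * 30 * cross_az_fraction * CROSS_AZ
    return round(hourly + data + cross, 2)
```

Three idle gateways cost about $98.55/mo; one gateway pushing 1 TB/day lands around $1,382.85/mo, almost all of it the data-processing charge.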
Cost Optimizations
  • S3 / DynamoDB → Gateway Endpoint (free, no NAT)
  • SQS / SNS / ECR / etc. → Interface Endpoint (PrivateLink)
  • Lambda → AWS APIs → keep in VPC with endpoints
  • ECS Fargate → VPC endpoint for ECR image pulls
  • Dev/non-prod → NAT Instance (t3.nano ≈ $3/mo)
  • Batch jobs off-hours → Public subnet + auto-assign public IP (no NAT needed)
Hidden Gotchas Checklist
Gotcha | What Happens | Fix
Idle TCP timeout (350s) | DB pool connections silently killed; app sees stale connection errors | OS TCP keepalive < 350s + app-level keepalive
Port exhaustion | New connections dropped; ErrorPortAllocation spikes | Add EIPs, VPC endpoints, connection pooling
Single NAT GW | Cross-AZ charges + single-AZ SPOF for egress | 1 NAT GW per AZ; route table per private subnet
No SG on NAT GW | Can't restrict which EC2s use it via security group | Use NACLs on private subnet or restrict at EC2 SG egress rules
Cannot log flows | NAT GW itself not visible in VPC Flow Logs by ENI | Enable VPC Flow Logs on private subnet ENIs; CloudWatch NAT GW metrics
EIP cost | Since Feb 2024 every public IPv4 address costs $0.005/hr, attached or not | Release unused EIPs; budget for the ones NAT GWs hold
DNS via NAT | Route 53 Resolver handles DNS; doesn't go through NAT GW | Understand that VPC-resolver DNS is exempt from NAT; the .2 resolver address always works
Staff-Level Interview Q&A
These are the questions that separate Staff candidates from Senior. Focus on the why and trade-offs, not just facts.
Q1 — A private EC2 is making outbound HTTPS calls to an S3 bucket. Describe exactly what happens at each network hop, and identify where you'd optimize cost.
Full hop trace: EC2 → route table → NAT GW (SNAT: private IP → NAT GW's own private IP + ephemeral port) → IGW (1:1 NAT: NAT GW's private IP → EIP) → AWS S3 public endpoint → response reverses the path; conn-table lookup at NAT GW restores the original private IP.

Cost optimization: Every byte processed by NAT GW costs $0.045/GB. S3 is accessible via a Gateway VPC Endpoint at zero cost. Add the endpoint, update the route table with a prefix list entry for S3 → the endpoint, and NAT GW is bypassed entirely. This is the most impactful single change for S3-heavy workloads.
Q2 — Your application has 200 Lambda functions all calling the same RDS PostgreSQL instance. You're seeing intermittent connection failures. What's the root cause and fix?
Root cause — port exhaustion: All 200 Lambdas are behind a single NAT GW EIP, calling the same dst IP:port (RDS endpoint :5432). The 5-tuple unique dimension collapses to just translated src port. 64,512 ports ÷ short-lived Lambda connections = exhaustion fast. CloudWatch ErrorPortAllocation will confirm this.

Fix strategy (layered): ① Deploy RDS Proxy — it pools and multiplexes connections, so Lambdas share a fixed set of real DB connections. This is the primary fix. ② Add more EIPs to NAT GW (up to 8). ③ Consider RDS Proxy via PrivateLink to remove NAT entirely. ④ If Lambdas are creating DB connections on every invocation, fix connection reuse by initialising outside the handler.
Q3 — You have a 3-AZ VPC with one NAT GW in AZ-A. A colleague says "it's highly available because NAT GW auto-scales." Are they correct?
Partially correct, fundamentally wrong: NAT GW is managed and auto-scales bandwidth within its AZ. But it has no cross-AZ failover. If AZ-A has an outage, EC2s in AZ-B and AZ-C lose all outbound internet access — even though they're perfectly healthy.

Correct HA design: Deploy one NAT GW in the public subnet of each AZ. Configure the private route table for each AZ to point 0.0.0.0/0 to that AZ's own NAT GW. Now an AZ failure only isolates that AZ's egress — other AZs are unaffected. Side benefit: eliminates cross-AZ data transfer charges ($0.01/GB).
Q4 — Explain the difference between how NAT GW handles TCP vs UDP from a connection tracking perspective.
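The AZ-affinity rule is mechanically checkable. A sketch over hypothetical data shapes (not an AWS API response) that flags any private subnet whose default route exits through a NAT GW in a different AZ:

```python
# Design check for Q3: every private subnet's 0.0.0.0/0 route should
# target the NAT GW in its OWN AZ. Cross-AZ routes cost $0.01/GB and
# tie the subnet's egress to another AZ's availability.
def cross_az_routes(subnets, natgw_az):
    """subnets: {subnet_id: (az, natgw_id_of_default_route)}
    natgw_az: {natgw_id: az}
    Returns subnet IDs whose egress hops AZs."""
    return [s for s, (az, ngw) in subnets.items() if natgw_az[ngw] != az]
```

In a real audit you'd populate the two dicts from `describe_route_tables` / `describe_nat_gateways` output; the check itself stays this simple.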
TCP — stateful with timeout: TCP connection tracking follows the 3-way handshake. NAT GW tracks SYN, SYN-ACK, and FIN/RST to manage state transitions. Idle timeout is 350s for ESTABLISHED, 60s for transitory states. It tracks the full state machine.

UDP — timeout-only: UDP has no handshake. NAT GW creates an entry on the first packet and purges it after 350s of inactivity. There's no concept of "connection closed." Any packet from the destination to the allocated EIP:port within that window gets forwarded back. After timeout, the same dst sending a packet hits no matching entry → dropped.

Key implication: DNS (UDP :53) is fire-and-forget sub-second so timeout doesn't matter. But long-lived UDP applications (e.g., QUIC/HTTP3, game servers) will see NAT bindings expire silently — apps need to implement their own keepalive probes.
Q5 — Design the egress architecture for a multi-account, 3-region AWS organization with 40 VPCs.
Centralized egress via Transit Gateway: The gold standard for >10 VPCs.

① Each region: One shared "Egress VPC" per region (owned by network account). This VPC has NAT GWs (one per AZ) and an IGW.
② TGW: All 40 spoke VPCs attach to TGW. Spoke VPCs have NO IGW, no NAT GW.
③ TGW route tables: Default route in spoke VPCs → TGW → routed to Egress VPC → NAT GW → IGW.
④ Benefits: All internet egress through a controlled choke point. WAF / firewall appliances can be inserted in Egress VPC. Only N NAT GWs (N = AZs × regions) instead of NAT GWs per VPC.
⑤ Trade-offs: TGW data charge ($0.02/GB). Centralized failure point for egress (mitigated by multi-AZ NAT GWs in Egress VPC). Slightly higher latency due to TGW hop.

For AWS-service traffic: Deploy VPC Endpoints (S3, DDB gateway; SQS/SNS interface) in each spoke VPC — never touch NAT or TGW.
Q6 — Why can't you put a Security Group on a NAT Gateway? What's the implication for security controls?
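The centralized-vs-per-VPC trade-off reduces to arithmetic. A back-of-envelope sketch using this document's prices plus TGW's $0.02/GB data charge; TGW attachment-hour fees and cross-region charges are deliberately omitted, so treat it as illustrative only:

```python
# Per-VPC NAT GWs (hourly cost scales with VPC count) vs centralized
# egress through TGW (hourly cost is flat, but every GB pays TGW's toll).
NATGW_HOURLY, NATGW_DATA, TGW_DATA = 0.045, 0.045, 0.02

def per_vpc_monthly(vpcs, azs, gb_month):
    """Every VPC runs its own NAT GW per AZ; data pays NAT processing only."""
    return round(vpcs * azs * NATGW_HOURLY * 730 + gb_month * NATGW_DATA, 2)

def centralized_monthly(azs, gb_month):
    """One egress VPC: NAT GW per AZ; data pays NAT + TGW processing."""
    return round(azs * NATGW_HOURLY * 730 + gb_month * (NATGW_DATA + TGW_DATA), 2)
```

For 40 VPCs × 3 AZs, per-VPC hourly alone is ~$3,942/mo versus ~$99/mo centralized, so the TGW toll only dominates at very high egress volumes.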
Architecture reason: NAT GW is not an ENI-based resource you manage — it's a managed AWS service that presents as a gateway target in route tables. Security Groups attach to ENIs (Elastic Network Interfaces). Since NAT GW's internal implementation is opaque (multiple ENIs under the hood, managed by AWS), AWS doesn't expose SG attachment.

Security implications & compensating controls:
① EC2 SG egress rules — restrict which destinations your EC2s can reach. The SG is enforced at the EC2's ENI before the packet reaches NAT GW.
② NACLs on private subnet — stateless outbound rules can block specific IP ranges from leaving the subnet toward NAT GW.
③ AWS Network Firewall in the egress path — insert a centralized inspection appliance between private subnet and NAT GW for deep packet inspection, IDS/IPS, domain-based filtering.
④ VPC Flow Logs — audit trail even without SG.

This is a common Staff question to test whether you understand the SG attachment model and can reason about compensating controls.