| Private (src) | Translated (src) | Destination | Proto | State | Idle |
|---|---|---|---|---|---|
| 10.0.2.10:52341 | 52.23.101.45:1024 | 93.184.216.34:443 | TCP | ESTABLISHED | 2s |
| 10.0.2.11:48821 | 52.23.101.45:1025 | 93.184.216.34:443 | TCP | ESTABLISHED | 14s |
| 10.0.2.12:61009 | 52.23.101.45:1026 | 8.8.8.8:53 | UDP | ACTIVE | 1s |
| 10.0.2.10:44100 | 52.23.101.45:1027 | 169.254.170.2:80 | TCP | TIME_WAIT | 58s |
| 10.0.2.11:39002 | 52.23.101.45:1028 | 10.0.3.5:5432 | TCP | SYN_SENT | — |

```
# Keepalive interval (seconds before first probe)
net.ipv4.tcp_keepalive_time = 60
# Interval between probes
net.ipv4.tcp_keepalive_intvl = 10
# Probes before declaring dead
net.ipv4.tcp_keepalive_probes = 3
```

Apply with `sysctl -p`.
Pair this with application-level keepalives where the stack supports them: Go's `net.Dialer.KeepAlive`, PostgreSQL's `keepalives_idle`, MySQL's `wait_timeout`. Belt and suspenders.
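As a sketch of the application-side half, the same values can be set per socket. Python is shown here for illustration; `make_keepalive_socket` is a hypothetical helper, and the `TCP_KEEP*` constants are Linux-specific, hence the guard:

```python
import socket

def make_keepalive_socket(idle=60, interval=10, probes=3):
    """Create a TCP socket with keepalive tuned well below the
    NAT GW's 350-second idle timeout (values mirror the sysctls above)."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # TCP_KEEPIDLE/KEEPINTVL/KEEPCNT are Linux-only; guard for portability.
    if hasattr(socket, "TCP_KEEPIDLE"):
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return s

s = make_keepalive_socket()
print(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE))  # non-zero when enabled
s.close()
```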
When a reply packet arrives from the internet at the IGW, AWS does not use the route table to forward it. The route table only handles outbound new-flow routing. For inbound replies, the connection tracking table in the NAT GW is the sole authority. The packet hits the IGW → IGW sees the dst IP is the NAT GW's EIP → forwards to NAT GW → NAT GW looks up its conn table → rewrites dst back to the private IP → delivers to the EC2 instance.
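A toy model of that reverse lookup, keyed on the translated tuple exactly as in the table above (the `translate_reply` helper is hypothetical, for illustration only):

```python
# Toy model of NAT GW connection tracking: outbound flows record a
# translation; inbound replies are matched purely on the translated
# (dst IP, dst port, proto) tuple -- no route table consulted.
conn_table = {
    # (translated_ip, translated_port, proto) -> (private_ip, private_port)
    ("52.23.101.45", 1024, "TCP"): ("10.0.2.10", 52341),
    ("52.23.101.45", 1026, "UDP"): ("10.0.2.12", 61009),
}

def translate_reply(dst_ip, dst_port, proto):
    """Rewrite an inbound reply's destination back to the private host,
    or return None (drop) if no tracked flow exists."""
    return conn_table.get((dst_ip, dst_port, proto))

print(translate_reply("52.23.101.45", 1024, "TCP"))  # ('10.0.2.10', 52341)
print(translate_reply("52.23.101.45", 9999, "TCP"))  # None: unsolicited, dropped
```

The `None` case is why a NAT GW drops unsolicited inbound traffic: there is simply no entry to translate it back to.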
This is why you cannot use NACLs to block return traffic for a NAT'd flow: by the time the packet enters the subnet, it's already been translated. The NACL sees the private IP, not the EIP.

| Metric | Namespace | Alarm threshold | Why |
|---|---|---|---|
| ErrorPortAllocation | AWS/NATGateway | Sum > 0 (5 min) | Port exhaustion is occurring NOW |
| PacketsDropCount | AWS/NATGateway | Sum > 0 (5 min) | Packets being silently dropped |
| ActiveConnectionCount | AWS/NATGateway | > 50,000 per EIP | Approaching exhaustion warning |
| ConnectionAttemptCount | AWS/NATGateway | Spike ratio | Detect connection storms |
| BytesInFromDestination | AWS/NATGateway | Cost spike detection | Unexpected data transfer cost |
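The `ActiveConnectionCount` threshold above can be turned into a simple headroom check. A rough heuristic (the 50,000-per-EIP figure comes from the table; `port_pressure` is a hypothetical helper that assumes connections spread evenly across EIPs):

```python
ALARM_PER_EIP = 50_000  # warning threshold from the table above

def port_pressure(active_connections, eip_count):
    """Return (estimated per-EIP load, True if above the alarm threshold).
    Rough heuristic: assumes flows distribute evenly across attached EIPs."""
    per_eip = active_connections / eip_count
    return per_eip, per_eip > ALARM_PER_EIP

print(port_pressure(120_000, 2))  # (60000.0, True)  -> add EIPs or VPC endpoints
print(port_pressure(120_000, 4))  # (30000.0, False)
```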

| Gotcha | What Happens | Fix |
|---|---|---|
| Idle TCP timeout (350s) | DB pool connections silently killed; app sees stale connection errors | OS TCP keepalive < 350s + app-level keepalive |
| Port exhaustion | New connections dropped; ErrorPortAllocation spikes | Add EIPs, VPC endpoints, connection pooling |
| Single NAT GW | Cross-AZ charges + single AZ SPOF for egress | 1 NAT GW per AZ; route table per private subnet |
| No SG on NAT GW | Can't restrict which EC2s use it via security group | Use NACLs on private subnet or restrict at EC2 SG egress rules |
| Cannot log flows | NAT GW itself not visible in VPC Flow Logs by ENI | Enable VPC Flow Logs on private subnet ENIs; CloudWatch NAT GW metrics |
| EIP cost | Each EIP attached to a running resource: free. Detached: $0.005/hr | Always attach; release unused EIPs |
| DNS via NAT | Route 53 Resolver handles DNS; doesn't go through NAT GW | Understand that DNS is exempt from NAT; .2 resolver address always works |
S3 traffic can bypass the NAT GW via a Gateway VPC Endpoint at zero cost. Add the endpoint, update the route table with a prefix-list entry for S3 pointing at the endpoint, and the NAT GW is bypassed entirely. This is the most impactful single change for S3-heavy workloads.
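A sketch of the resulting route decision, assuming the managed S3 prefix list resolves to CIDRs like the one below (the CIDR, target names, and `egress_target` helper are illustrative, not real route-table state):

```python
import ipaddress

# Illustrative stand-in for the managed S3 prefix list's CIDRs.
S3_PREFIXES = [ipaddress.ip_network("52.216.0.0/15")]

def egress_target(dst_ip):
    """Return the route target the VPC router would pick for dst_ip:
    the prefix-list entry wins for S3 ranges, else the 0.0.0.0/0 default."""
    addr = ipaddress.ip_address(dst_ip)
    if any(addr in net for net in S3_PREFIXES):
        return "vpce-s3"  # Gateway endpoint: free, NAT GW bypassed
    return "nat-gw"       # default route 0.0.0.0/0

print(egress_target("52.216.1.1"))     # vpce-s3
print(egress_target("93.184.216.34"))  # nat-gw
```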
`ErrorPortAllocation` will confirm this. Deploy one NAT GW per AZ and point each private subnet's 0.0.0.0/0 at that AZ's own NAT GW. Now an AZ failure only isolates that AZ's egress; other AZs are unaffected. Side benefit: eliminates cross-AZ data transfer charges ($0.01/GB).
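The per-AZ routing rule reduces to a one-line mapping (AZ names and NAT GW IDs below are hypothetical):

```python
# One NAT GW per AZ: each private subnet's 0.0.0.0/0 points at the
# NAT GW in its own AZ (IDs hypothetical).
NAT_BY_AZ = {"us-east-1a": "nat-0aaa", "us-east-1b": "nat-0bbb"}

def default_route(subnet_az):
    """Pick the same-AZ NAT GW, so an AZ failure only affects that AZ
    and no cross-AZ data-transfer charges accrue."""
    return NAT_BY_AZ[subnet_az]

print(default_route("us-east-1a"))  # nat-0aaa
```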