How Cloudflare, Stripe, and AWS actually implement rate limiting in production — architecture decisions, scale numbers, and the specific tradeoffs each company made. Knowing these is table-stakes for Staff interviews at these companies.
**Cloudflare Workers · RL Pattern (TypeScript)**

```typescript
interface RLState { count: number; windowStart: number }

// Durable Object = single-threaded actor = no race conditions
export class RateLimiter extends DurableObject {
  private state: RLState = { count: 0, windowStart: Date.now() }

  async check(limit: number, windowMs: number): Promise<boolean> {
    const now = Date.now()
    if (now - this.state.windowStart >= windowMs) {
      this.state = { count: 1, windowStart: now } // new window
      return true
    }
    this.state.count++
    return this.state.count <= limit
  }
}
// DO instance = sticky to one PoP region → strong consistency within region
// key = hash(clientIP + path) → routes to consistent DO shard
```
**Stripe · Hierarchical Token Bucket (Lua)**

```lua
-- Called atomically for every API request
local function checkBucket(key, capacity, refillRate, now)
  local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
  local tokens = tonumber(bucket[1]) or capacity
  local lastRefill = tonumber(bucket[2]) or now
  local elapsed = math.max(0, now - lastRefill)
  tokens = math.min(capacity, tokens + elapsed * refillRate)
  if tokens < 1 then return 0 end
  redis.call('HMSET', key, 'tokens', tokens - 1, 'last_refill', now)
  redis.call('EXPIRE', key, 86400)
  return 1
end

-- Check all 3 tiers. Fail any = deny.
-- TIME returns strings, so convert; server time avoids client clock skew
local now = tonumber(redis.call('TIME')[1])
if checkBucket(KEYS[1], 10000, 100, now) == 0 then return {0, "global_limit"} end
if checkBucket(KEYS[2], 1000, 10, now) == 0 then return {0, "org_limit"} end
if checkBucket(KEYS[3], 100, 1, now) == 0 then return {0, "key_limit"} end
return {1, "allowed"}
```
Geo-distributed rate limiting is the hardest variant — you have WAN latency (50–200ms cross-region), potential network partitions, and the requirement to enforce limits globally against a client that may hit any region.
Cross-region round-trip: US-East ↔ EU-West ≈ 90ms. Synchronous global check = +90ms per request. Unacceptable for <10ms API SLAs.
Under CAP: you must choose. Global consistency during partition = block all requests. Availability during partition = allow potential overage. No right answer — depends on use case.
Different regions can have 10–100ms clock drift. Window boundaries differ. Solution: use logical timestamps or a global clock service (TrueTime, HLC).
Split global limit across regions by traffic weight. Global limit=1000, US=500, EU=300, APAC=200. Each region enforces independently. Zero cross-region coordination. Problem: unused quota in one region cannot be borrowed by another. Simple, works well when traffic is predictable.
**Go · Quota Partition**

```go
type RegionQuota struct {
	Limit  int64
	Region string
	Weight float64 // fraction of global limit
}

// Recalibrate weights every 5min from traffic telemetry
func (m *Manager) Rebalance(traffic map[string]float64) {
	total := sum(traffic)
	for region, t := range traffic {
		m.quotas[region].Weight = t / total
		m.quotas[region].Limit = int64(float64(m.globalLimit) * t / total)
	}
}
```
Each region maintains a G-Counter CRDT. Counters replicate asynchronously across regions (~500ms–2s lag). Each region enforces based on its locally-merged global estimate. Allows small overage = limit × (replication_lag / window_size). For a 60s window with 2s lag: ~3% overage. Ideal for API quotas.
**Go · Geo G-Counter**

```go
type GeoGCounter struct {
	mu     sync.RWMutex
	counts map[string]int64 // regionID → count
	ts     map[string]int64 // HLC timestamps per region
}

func (g *GeoGCounter) Merge(remote *GeoGCounter) {
	g.mu.Lock()
	defer g.mu.Unlock()
	for rid, cnt := range remote.counts {
		if remote.ts[rid] > g.ts[rid] { // HLC compare
			g.counts[rid] = cnt
			g.ts[rid] = remote.ts[rid]
		}
	}
}

func (g *GeoGCounter) GlobalEstimate() int64 {
	g.mu.RLock()
	defer g.mu.RUnlock()
	var total int64
	for _, v := range g.counts {
		total += v
	}
	return total
}
```
Route each clientID to a home region via consistent hash. All RL decisions made locally in home region. Other regions proxy to home region for RL check only. +latency for non-home requests but global consistency. Used when accuracy > latency.
NTP sync is typically ±10–100ms accurate. For rate limiting windows <1s, this matters.
**Go · HLC for Geo RL**

```go
// HLC = max(physical_clock, last_seen_hlc) + counter
// Gives causal ordering without global synchronization
type HLC struct {
	WallTime int64 // milliseconds
	Logical  int32 // tie-breaker
}

func (h *HLCClock) Now() HLC {
	wall := time.Now().UnixMilli()
	h.mu.Lock()
	defer h.mu.Unlock()
	if wall > h.last.WallTime {
		h.last = HLC{WallTime: wall, Logical: 0}
	} else {
		h.last.Logical++
	}
	return h.last
}

func (h *HLCClock) Receive(remote HLC) {
	h.mu.Lock()
	defer h.mu.Unlock()
	if remote.WallTime > h.last.WallTime {
		h.last = HLC{WallTime: remote.WallTime, Logical: remote.Logical + 1}
	} else if remote.WallTime == h.last.WallTime {
		h.last.Logical = max(h.last.Logical, remote.Logical) + 1
	}
}
```
| Replication Lag | Window Size | Max Overage | Acceptable For |
|---|---|---|---|
| 500ms | 1 minute | ~1% | Most API quotas |
| 2s | 1 minute | ~3% | Soft quotas, analytics |
| 2s | 10 seconds | ~20% | Marginal — needs tighter sync |
| 10s | 1 minute | ~17% | Only if accuracy not critical |
| Any | 1 second | Unacceptable | Must use synchronous check |
Formula: max_overage ≈ replication_lag / window_size × 100%
Service meshes (Envoy/Istio) handle rate limiting at the infrastructure layer — decoupled from application code. This is the preferred pattern for platform teams at Staff+ level because it's polyglot, consistent, and operationally uniform.
Envoy supports two RL modes:
Per-Envoy-instance token bucket. No external calls. Sub-ms overhead. Not globally consistent — each pod enforces independently. Good for DDoS protection at the edge, not for per-user quotas.
**Envoy config**

```yaml
http_filters:
- name: envoy.filters.http.local_ratelimit
  typed_config:
    token_bucket:
      max_tokens: 100
      fill_interval: 1s
      tokens_per_fill: 100
    filter_enabled: {default_value: {numerator: 100}}
```
Envoy calls an external Rate Limit Service (RLS) via gRPC on every request; the RLS checks Redis. Globally consistent, at the cost of +1–3ms latency per request. envoyproxy/ratelimit is the reference implementation.
**Envoy config**

```yaml
rate_limits:
- actions:
  - request_headers:
      header_name: ":authority"
      descriptor_key: domain
  - request_headers:
      header_name: "x-api-key"
      descriptor_key: api_key
```
**ratelimit config YAML**

```yaml
domain: api_gateway
descriptors:
  # Per API key: 1000 req/min
  - key: api_key
    rate_limit:
      unit: MINUTE
      requests_per_unit: 1000
    descriptors:
      # Nested: per API key per endpoint
      - key: path
        value: "/v1/payments"
        rate_limit:
          unit: MINUTE
          requests_per_unit: 100
  # Per IP: 10000 req/min (DDoS protection)
  - key: remote_address
    rate_limit:
      unit: MINUTE
      requests_per_unit: 10000
  # Composite key: IP + path
  - key: remote_address
    descriptors:
      - key: path
        value: "/auth/login"
        rate_limit:
          unit: MINUTE
          requests_per_unit: 10 # brute-force protection
```
Istio wraps Envoy's RLS with higher-level CRDs. The EnvoyFilter or WasmPlugin approach lets you inject RL logic at mesh level.
**Istio · EnvoyFilter for Global RL**

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: api-ratelimit
  namespace: prod
spec:
  workloadSelector:
    labels:
      app: api-gateway
  configPatches:
  - applyTo: HTTP_FILTER
    match:
      context: SIDECAR_INBOUND
      listener:
        filterChain:
          filter:
            name: "envoy.filters.network.http_connection_manager"
    patch:
      operation: INSERT_BEFORE
      value:
        name: envoy.filters.http.ratelimit
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.http.ratelimit.v3.RateLimit
          domain: api_prod
          rate_limit_service:
            grpc_service:
              envoy_grpc:
                cluster_name: ratelimit_cluster
            transport_api_version: V3
```
| Factor | Mesh (Envoy/Istio) | Application Layer |
|---|---|---|
| Polyglot | ✓ all services same policy | ✗ each service reimplements |
| Business logic awareness | ✗ no app context | ✓ can check subscription tier, feature flags |
| Request body inspection | ✗ headers only (L7) | ✓ full body access |
| Latency | +1–3ms (gRPC to RLS) | <1ms if local cache |
| Ops complexity | CRD management, mesh overhead | just code |
| Best for | Platform RL, DDoS, internal service mesh | Tenant quotas, billing RL |
API Gateway is the front door to your platform. Rate limiting is one tier in a layered pipeline — understanding where RL fits and how it interacts with auth, routing, and caching is what separates Senior from Staff.
| Key Type | Use Case | Pros | Cons |
|---|---|---|---|
| API Key | B2B / developer portals | Business entity, not IP | Key sharing/theft inflates counts |
| User ID (JWT sub) | Consumer apps | Per-user fairness | Need JWT validation before RL check |
| IP address | Unauthenticated endpoints, DDoS | No auth needed, fast | NAT/proxies share IPs; IPv6 makes subnetting complex |
| IP + User-Agent hash | Login brute force | Better fingerprint than IP alone | UA spoofable |
| Org ID | Multi-tenant SaaS | Tenant isolation, shared pool | One noisy tenant affects all their API keys |
| Composite (Org + endpoint) | Granular quotas | Fine-grained control | Key cardinality explosion, more Redis memory |
These headers are the client's interface to your RL system. Getting them right is a signal of production experience.
**Go · RL Response Headers**

```go
func (rl *RateLimiter) SetHeaders(w http.ResponseWriter, result RLResult) {
	// 429 status is RFC 6585; the X-RateLimit-* headers are de-facto conventions
	w.Header().Set("X-RateLimit-Limit", strconv.FormatInt(result.Limit, 10))
	w.Header().Set("X-RateLimit-Remaining", strconv.FormatInt(result.Remaining, 10))
	w.Header().Set("X-RateLimit-Reset", strconv.FormatInt(result.ResetUnix, 10))
	w.Header().Set("X-RateLimit-Window", "60")       // window size in seconds
	w.Header().Set("X-RateLimit-Policy", "100;w=60") // IETF ratelimit-headers draft format
	if !result.Allowed {
		// How long until they can retry (seconds)
		retryAfter := result.ResetUnix - time.Now().Unix()
		w.Header().Set("Retry-After", strconv.FormatInt(retryAfter, 10))
		http.Error(w, `{"error":"rate_limit_exceeded","code":429}`, http.StatusTooManyRequests)
	}
}
// Under eventual consistency, Remaining is an estimate.
// Be honest about this in your API docs — "approximate remaining count".
```
X-RateLimit-Remaining may show 5 even when globally you're at limit. Some teams solve this by always rounding down (pessimistic) or adding a safety margin to the displayed remaining count.
- **Kong:** Built-in Rate Limiting plugin (local or Redis-backed). The Enterprise version supports sliding window and is cluster-aware via Redis Sentinel. Key: configure `policy: redis` plus `sync_rate` for the cluster sync interval.
- **NGINX / OpenResty:** Lua-based RL via `lua-resty-limit-traffic`; `limit_req_zone` for simple shared-memory RL. For distributed enforcement, use `lua-resty-redis` with pipelining. Sub-ms local, +1–2ms in Redis mode.
- **Custom Go gateway:** `net/http` middleware chain with RL as an `http.Handler` wrapper; inject via context. Benefit: full access to auth context (user tier, feature flags) for dynamic limit computation.
Idempotency keys and rate limiting are siblings — both protect against duplicate/excess operations. Understanding their interaction, and how they relate to distributed transactions, is a Staff-level differentiator.
Client sends POST /payments. Network timeout — did it go through? Client retries. Now you have two charges. The client didn't know the first succeeded. Without idempotency keys, retries cause duplicate mutations.
Client generates a UUID (Idempotency-Key: uuid-v4). Server stores the key + response on first execution. On retry with same key: return cached response, do NOT re-execute. The operation becomes safe to retry.
Key is a client-generated UUID. Scoped to: (clientID, key) — not globally unique, just unique per client. TTL: typically 24h–7 days.
Redis SET NX (set if not exists). If key exists → return cached response immediately. If not → lock key, execute operation, store response.
Two retries arrive simultaneously. Only one gets the SET NX lock. The other spins/polls until the first completes. Return same response to both.
Store: status code, response body, timestamp. On retry: return exactly the same response — same status code, same body, even if it was a 4xx error.
**Go · Idempotency Key Middleware**

```go
type IdempotencyStore interface {
	TryLock(ctx context.Context, key string) (bool, error)
	GetResult(ctx context.Context, key string) (*CachedResponse, error)
	StoreResult(ctx context.Context, key string, res *CachedResponse, ttl time.Duration) error
}

func IdempotencyMiddleware(store IdempotencyStore) func(http.Handler) http.Handler {
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			ikey := r.Header.Get("Idempotency-Key")
			if ikey == "" {
				next.ServeHTTP(w, r)
				return
			}
			clientID := extractClientID(r)
			storeKey := fmt.Sprintf("idem:%s:%s", clientID, ikey)

			// Check if already executed
			cached, err := store.GetResult(r.Context(), storeKey)
			if err == nil && cached != nil {
				w.Header().Set("Idempotent-Replayed", "true")
				w.WriteHeader(cached.StatusCode)
				w.Write(cached.Body)
				return
			}

			// Try to acquire lock (SET NX with 30s TTL for in-flight)
			locked, _ := store.TryLock(r.Context(), storeKey+":lock")
			if !locked {
				// Concurrent retry — poll for result with backoff
				result := pollForResult(r.Context(), store, storeKey)
				w.WriteHeader(result.StatusCode)
				w.Write(result.Body)
				return
			}

			// Capture response, execute, store
			rec := newResponseRecorder(w)
			next.ServeHTTP(rec, r)
			store.StoreResult(r.Context(), storeKey, rec.Result(), 24*time.Hour)
		})
	}
}
```
**Option A: RL first (Stripe-style).** Every request hits the RL check before the idempotency check. Simpler pipeline. The client is penalized for retrying, which incentivizes proper retry handling. Stripe does this.

**Option B: Idempotency first (replay-aware).** Check the idempotency key before the RL check; if it's a replay, skip RL. More complex pipeline. Rewards clients that use idempotency keys. Better UX for transient network failures.
**Go · Ordering matters**

```go
// Option A: RL first (Stripe-style)
//   rateLimitMiddleware → idempotencyMiddleware → handler

// Option B: Idempotency first (replay-aware)
//   idempotencyMiddleware → rateLimitMiddleware → handler
// If idempotencyMiddleware short-circuits (replay), RL is never checked

// Option C: Hybrid — count unique idempotency keys, not requests
// RL key = hash(clientID + idempotency_key) — so retries share a "slot"
rlKey := fmt.Sprintf("rl:%s:%s", clientID, idemKey) // same slot for all retries
```
When a rate-limited operation spans multiple services (e.g., deduct balance, create order, send email), you need compensating transactions if any step fails.
1. **Rate limit check.** If denied → immediate 429, no downstream calls. No compensation needed — nothing happened.
2. **Deduct balance.** Deduct from the balance tentatively. If it fails → saga ends, 402.
3. **Create order.** If it fails → compensate: refund Step 2.
4. **Send email.** If it fails → that's OK (the notification is non-critical). Don't roll back Steps 2–3.
You cannot operate a rate limiter you can't observe. Observability for RL spans three pillars — metrics, traces, and logs — each answering different operational questions.
**Go · Prometheus instrumentation**

```go
var (
	rlDecisions = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "ratelimit_decisions_total",
		Help: "Rate limit decisions by outcome",
	}, []string{"client_id", "endpoint", "outcome", "region"})
	// outcome: "allowed" | "denied" | "error" | "fallback"

	rlCheckDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "ratelimit_check_duration_ms",
		Buckets: []float64{0.1, 0.5, 1, 2, 5, 10, 50},
	}, []string{"backend", "region"})
	// backend: "redis" | "local" | "gossip"

	rlCounterRatio = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "ratelimit_counter_ratio",
		Help: "Current counter as fraction of limit (0.0–1.0+)",
	}, []string{"client_id", "tier"})
	// Alert when ratio > 0.9 → client approaching limit
	// Alert when ratio > 1.0 → overage (gossip lag indicator)

	rlSyncLagMs = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "ratelimit_gossip_sync_lag_ms",
		Buckets: []float64{50, 100, 200, 500, 1000, 2000},
	}, []string{"from_node", "to_node"})
	// Alert when p95 > 500ms → gossip partition risk
)
```
**prometheus · alert rules**

```yaml
groups:
- name: ratelimit
  rules:
  # RL check adding too much latency
  - alert: RateLimitHighLatency
    expr: histogram_quantile(0.99, ratelimit_check_duration_ms) > 10
    for: 2m
    labels: {severity: warning}
    annotations:
      summary: "RL check p99 > 10ms — Redis may be degraded"

  # RL error rate spike — fallback mode active?
  - alert: RateLimitErrorSpike
    expr: |
      rate(ratelimit_decisions_total{outcome="error"}[1m])
        / rate(ratelimit_decisions_total[1m]) > 0.01
    for: 1m
    labels: {severity: critical}
    annotations:
      summary: "More than 1% of RL checks failing — check Redis"

  # Gossip sync lag too high
  - alert: GossipSyncLagHigh
    expr: histogram_quantile(0.95, ratelimit_gossip_sync_lag_ms) > 500
    for: 3m
    labels: {severity: warning}
    annotations:
      summary: "Gossip sync lag p95 > 500ms — possible partition"

  # Counter overage (more allowed than limit)
  - alert: RateLimitOverage
    expr: max(ratelimit_counter_ratio) > 1.15
    for: 5m
    labels: {severity: warning}
    annotations:
      summary: "Client exceeding limit by >15% — gossip lag too wide"

  # Sudden denial spike — DDoS or legitimate traffic surge?
  - alert: DenialRateSpike
    expr: rate(ratelimit_decisions_total{outcome="denied"}[5m]) > 1000
    for: 1m
    labels: {severity: page}
    annotations:
      summary: "High denial rate — possible DDoS or misconfigured limit"
```
**Go · OTEL span for RL check**

```go
func (rl *RateLimiter) Allow(ctx context.Context, clientID string) (bool, error) {
	ctx, span := otel.Tracer("ratelimiter").Start(ctx, "ratelimit.check",
		trace.WithSpanKind(trace.SpanKindInternal))
	defer span.End()

	span.SetAttributes(
		attribute.String("rl.client_id", clientID),
		attribute.String("rl.backend", rl.backend),     // "redis"|"local"
		attribute.String("rl.algorithm", rl.algorithm), // "sliding_window"
		attribute.Int64("rl.limit", rl.limit),
	)

	result, err := rl.check(ctx, clientID)
	if err != nil {
		span.RecordError(err)
		span.SetStatus(codes.Error, err.Error())
		return false, err
	}

	span.SetAttributes(
		attribute.Bool("rl.allowed", result.Allowed),
		attribute.Int64("rl.count", result.Count),
		attribute.Int64("rl.remaining", result.Remaining),
		attribute.Bool("rl.fallback_mode", result.FallbackMode),
	)
	return result.Allowed, nil
}
```
The `rl.fallback_mode` attribute is key — it tells you in traces when an RL decision was made with stale/local data. Filter on it in Jaeger/Tempo to investigate overage incidents.
**JSON log · RL decision event**

```json
{
  "ts": "2026-03-14T10:23:45.123Z",
  "level": "info",
  "event": "ratelimit.decision",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "client_id": "org_9xK2mN",
  "api_key_hash": "sha256:a1b2c3...",
  "endpoint": "POST /v1/charges",
  "outcome": "denied",
  "reason": "key_limit",
  "counter": 102,
  "limit": 100,
  "window_sec": 60,
  "reset_at": "2026-03-14T10:24:00Z",
  "fallback": false,
  "backend_ms": 1.2,
  "region": "us-east-1",
  "node_id": "rl-node-07"
}
```

`api_key_hash` stores a hash — never log the raw key. `reason` names which tier triggered the denial; `backend_ms` is the Redis round-trip.
`node_id` and `region` are critical for debugging "why did node A allow but node B deny?" incidents.
- **Decisions over time.** Time-series: `rate(ratelimit_decisions_total[1m])` grouped by outcome. Overlay denial rate as a percentage. Annotate with config changes (limit updates).
- **Top denied clients.** Table: `topk(20, sum by(client_id) (increase(ratelimit_decisions_total{outcome="denied"}[1h])))`. Links to client details. Updates every 5min.
- **Check latency.** Heatmap of `ratelimit_check_duration_ms`. Separate lines for Redis-backed vs local fallback. Alert line at 10ms p99.
- **Gossip sync lag.** Gauge: `histogram_quantile(0.95, ratelimit_gossip_sync_lag_ms)` per node pair. Red above 500ms. Topology view of which nodes are lagging.
- **Client utilization.** Heatmap of `ratelimit_counter_ratio` across clients. Rows = client tiers. Highlights which clients are near their limit. Useful for capacity planning.
- **Fallback rate.** `sum(rate(ratelimit_decisions_total{outcome="fallback"}[5m]))` over time. Should be near zero. Any spike = Redis health issue. Correlate with the Redis latency panel.