How Cloudflare, Stripe, and AWS actually implement rate limiting in production — architecture decisions, scale numbers, and the specific tradeoffs each company made. Knowing these is table-stakes for Staff interviews at these companies.
**Cloudflare Workers · RL Pattern (TypeScript)**

```typescript
interface RLState { count: number; windowStart: number }

// Durable Object = single-threaded actor = no race conditions
export class RateLimiter extends DurableObject {
  private state: RLState = { count: 0, windowStart: Date.now() }

  async check(limit: number, windowMs: number): Promise<boolean> {
    const now = Date.now()
    if (now - this.state.windowStart >= windowMs) {
      this.state = { count: 1, windowStart: now } // new window
      return true
    }
    this.state.count++
    return this.state.count <= limit
  }
}
// DO instance = sticky to one PoP region → strong consistency within region
// key = hash(clientIP + path) → routes to consistent DO shard
```
**Stripe · Hierarchical Token Bucket (Lua)**

```lua
-- Called atomically for every API request
local function checkBucket(key, capacity, refillRate, now)
  local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
  local tokens = tonumber(bucket[1]) or capacity
  local lastRefill = tonumber(bucket[2]) or now
  local elapsed = math.max(0, now - lastRefill)
  tokens = math.min(capacity, tokens + elapsed * refillRate)
  if tokens < 1 then return 0 end
  redis.call('HMSET', key, 'tokens', tokens - 1, 'last_refill', now)
  redis.call('EXPIRE', key, 86400)
  return 1
end

-- Check all 3 tiers. Fail any = deny.
-- TIME returns strings, so convert; server time avoids client clock skew
local now = tonumber(redis.call('TIME')[1])
if checkBucket(KEYS[1], 10000, 100, now) == 0 then return {0, "global_limit"} end
if checkBucket(KEYS[2], 1000, 10, now) == 0 then return {0, "org_limit"} end
if checkBucket(KEYS[3], 100, 1, now) == 0 then return {0, "key_limit"} end
return {1, "allowed"}
```
Geo-distributed rate limiting is the hardest variant — you have WAN latency (50–200ms cross-region), potential network partitions, and the requirement to enforce limits globally against a client that may hit any region.
Cross-region round-trip: US-East ↔ EU-West ≈ 90ms. Synchronous global check = +90ms per request. Unacceptable for <10ms API SLAs.
Under CAP: you must choose. Global consistency during partition = block all requests. Availability during partition = allow potential overage. No right answer — depends on use case.
Different regions can have 10–100ms clock drift. Window boundaries differ. Solution: use logical timestamps or a global clock service (TrueTime, HLC).
Split global limit across regions by traffic weight. Global limit=1000, US=500, EU=300, APAC=200. Each region enforces independently. Zero cross-region coordination. Problem: unused quota in one region cannot be borrowed by another. Simple, works well when traffic is predictable.
**Go · Quota Partition**

```go
type RegionQuota struct {
	Limit  int64
	Region string
	Weight float64 // fraction of global limit
}

// Recalibrate weights every 5min from traffic telemetry
func (m *Manager) Rebalance(traffic map[string]float64) {
	total := sum(traffic)
	for region, t := range traffic {
		m.quotas[region].Weight = t / total
		m.quotas[region].Limit = int64(float64(m.globalLimit) * t / total)
	}
}
```
Each region maintains a G-Counter CRDT. Counters replicate asynchronously across regions (~500ms–2s lag). Each region enforces based on its locally-merged global estimate. Allows small overage = limit × (replication_lag / window_size). For a 60s window with 2s lag: ~3% overage. Ideal for API quotas.
**Go · Geo G-Counter**

```go
type GeoGCounter struct {
	mu     sync.RWMutex
	counts map[string]int64 // regionID → count
	ts     map[string]int64 // HLC timestamps per region
}

func (g *GeoGCounter) Merge(remote *GeoGCounter) {
	g.mu.Lock()
	defer g.mu.Unlock()
	for rid, cnt := range remote.counts {
		if remote.ts[rid] > g.ts[rid] { // HLC compare
			g.counts[rid] = cnt
			g.ts[rid] = remote.ts[rid]
		}
	}
}

func (g *GeoGCounter) GlobalEstimate() int64 {
	g.mu.RLock()
	defer g.mu.RUnlock()
	var total int64
	for _, v := range g.counts {
		total += v
	}
	return total
}
```
Route each clientID to a home region via consistent hash. All RL decisions made locally in home region. Other regions proxy to home region for RL check only. +latency for non-home requests but global consistency. Used when accuracy > latency.
NTP sync is typically ±10–100ms accurate. For rate limiting windows <1s, this matters.
**Go · HLC for Geo RL**

```go
// HLC = max(physical_clock, last_seen_hlc) + counter
// Gives causal ordering without global synchronization
type HLC struct {
	WallTime int64 // milliseconds
	Logical  int32 // tie-breaker
}

func (h *HLCClock) Now() HLC {
	wall := time.Now().UnixMilli()
	h.mu.Lock()
	defer h.mu.Unlock()
	if wall > h.last.WallTime {
		h.last = HLC{WallTime: wall, Logical: 0}
	} else {
		h.last.Logical++
	}
	return h.last
}

func (h *HLCClock) Receive(remote HLC) {
	h.mu.Lock()
	defer h.mu.Unlock()
	if remote.WallTime > h.last.WallTime {
		h.last = HLC{WallTime: remote.WallTime, Logical: remote.Logical + 1}
	} else if remote.WallTime == h.last.WallTime {
		h.last.Logical = max(h.last.Logical, remote.Logical) + 1
	}
}
```
| Replication Lag | Window Size | Max Overage | Acceptable For |
|---|---|---|---|
| 500ms | 1 minute | ~1% | Most API quotas |
| 2s | 1 minute | ~3% | Soft quotas, analytics |
| 2s | 10 seconds | ~20% | Marginal — needs tighter sync |
| 10s | 1 minute | ~17% | Only if accuracy not critical |
| Any | 1 second | Unacceptable | Must use synchronous check |
Formula: max_overage ≈ replication_lag / window_size × 100%
Service meshes (Envoy/Istio) handle rate limiting at the infrastructure layer — decoupled from application code. This is the preferred pattern for platform teams at Staff+ level because it's polyglot, consistent, and operationally uniform.
Envoy supports two RL modes:
Per-Envoy-instance token bucket. No external calls. Sub-ms overhead. Not globally consistent — each pod enforces independently. Good for DDoS protection at the edge, not for per-user quotas.
**Envoy config**

```yaml
http_filters:
- name: envoy.filters.http.local_ratelimit
  typed_config:
    token_bucket:
      max_tokens: 100
      fill_interval: 1s
      tokens_per_fill: 100
    filter_enabled: {default_value: {numerator: 100}}
```
Envoy calls an external Rate Limit Service (RLS) via gRPC on every request; the RLS checks Redis. Globally consistent, at the cost of +1–3ms latency per request. envoyproxy/ratelimit is the reference implementation.
**Envoy config**

```yaml
rate_limits:
- actions:
  - request_headers:
      header_name: ":authority"
      descriptor_key: domain
  - request_headers:
      header_name: "x-api-key"
      descriptor_key: api_key
```
**ratelimit config YAML**

```yaml
domain: api_gateway
descriptors:
  # Per API key: 1000 req/min
  - key: api_key
    rate_limit:
      unit: MINUTE
      requests_per_unit: 1000
    descriptors:
      # Nested: per API key per endpoint
      - key: path
        value: "/v1/payments"
        rate_limit:
          unit: MINUTE
          requests_per_unit: 100
  # Per IP: 10000 req/min (DDoS protection)
  - key: remote_address
    rate_limit:
      unit: MINUTE
      requests_per_unit: 10000
  # Composite key: IP + path
  - key: remote_address
    descriptors:
      - key: path
        value: "/auth/login"
        rate_limit:
          unit: MINUTE
          requests_per_unit: 10 # brute-force protection
```
Istio wraps Envoy's RLS with higher-level CRDs. The EnvoyFilter or WasmPlugin approach lets you inject RL logic at mesh level.
**Istio · EnvoyFilter for Global RL**

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: api-ratelimit
  namespace: prod
spec:
  workloadSelector:
    labels:
      app: api-gateway
  configPatches:
  - applyTo: HTTP_FILTER
    match:
      context: SIDECAR_INBOUND
      listener:
        filterChain:
          filter:
            name: "envoy.filters.network.http_connection_manager"
    patch:
      operation: INSERT_BEFORE
      value:
        name: envoy.filters.http.ratelimit
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.http.ratelimit.v3.RateLimit
          domain: api_prod
          rate_limit_service:
            grpc_service:
              envoy_grpc:
                cluster_name: ratelimit_cluster
            transport_api_version: V3
```
| Factor | Mesh (Envoy/Istio) | Application Layer |
|---|---|---|
| Polyglot | ✓ all services same policy | ✗ each service reimplements |
| Business logic awareness | ✗ no app context | ✓ can check subscription tier, feature flags |
| Request body inspection | ✗ headers only (L7) | ✓ full body access |
| Latency | +1–3ms (gRPC to RLS) | <1ms if local cache |
| Ops complexity | CRD management, mesh overhead | just code |
| Best for | Platform RL, DDoS, internal service mesh | Tenant quotas, billing RL |
API Gateway is the front door to your platform. Rate limiting is one tier in a layered pipeline — understanding where RL fits and how it interacts with auth, routing, and caching is what separates Senior from Staff.
| Key Type | Use Case | Pros | Cons |
|---|---|---|---|
| API Key | B2B / developer portals | Business entity, not IP | Key sharing/theft inflates counts |
| User ID (JWT sub) | Consumer apps | Per-user fairness | Need JWT validation before RL check |
| IP address | Unauthenticated endpoints, DDoS | No auth needed, fast | NAT/proxies share IPs; IPv6 makes subnetting complex |
| IP + User-Agent hash | Login brute force | Better fingerprint than IP alone | UA spoofable |
| Org ID | Multi-tenant SaaS | Tenant isolation, shared pool | One noisy tenant affects all their API keys |
| Composite (Org + endpoint) | Granular quotas | Fine-grained control | Key cardinality explosion, more Redis memory |
These headers are the client's interface to your RL system. Getting them right is a signal of production experience.
**Go · RL Response Headers**

```go
func (rl *RateLimiter) SetHeaders(w http.ResponseWriter, result RLResult) {
	// 429 status is RFC 6585; the X-RateLimit-* headers are de-facto conventions
	w.Header().Set("X-RateLimit-Limit", strconv.FormatInt(result.Limit, 10))
	w.Header().Set("X-RateLimit-Remaining", strconv.FormatInt(result.Remaining, 10))
	w.Header().Set("X-RateLimit-Reset", strconv.FormatInt(result.ResetUnix, 10))
	w.Header().Set("X-RateLimit-Window", "60")       // window size in seconds
	w.Header().Set("X-RateLimit-Policy", "100;w=60") // IETF ratelimit-headers draft format
	if !result.Allowed {
		// How long until they can retry (seconds)
		retryAfter := result.ResetUnix - time.Now().Unix()
		w.Header().Set("Retry-After", strconv.FormatInt(retryAfter, 10))
		http.Error(w, `{"error":"rate_limit_exceeded","code":429}`, http.StatusTooManyRequests)
	}
}
// Under eventual consistency, Remaining is an estimate.
// Be honest about this in your API docs — "approximate remaining count".
```
X-RateLimit-Remaining may show 5 even when globally you're at limit. Some teams solve this by always rounding down (pessimistic) or adding a safety margin to the displayed remaining count.
- **Kong:** Built-in Rate Limiting plugin (local or Redis-backed). The Enterprise version supports sliding window and is cluster-aware via Redis Sentinel. Key: configure `policy: redis` plus `sync_rate` for the cluster sync interval.
- **NGINX / OpenResty:** Lua-based RL via `lua-resty-limit-traffic`; `limit_req_zone` for simple shared-memory RL. For distributed enforcement, use `lua-resty-redis` with pipelining. Sub-ms local, +1–2ms in Redis mode.
- **Custom Go gateway:** `net/http` middleware chain with RL as an `http.Handler` wrapper; inject via context. Benefit: full access to auth context (user tier, feature flags) for dynamic limit computation.
Idempotency keys and rate limiting are siblings — both protect against duplicate/excess operations. Understanding their interaction, and how they relate to distributed transactions, is a Staff-level differentiator.
Client sends POST /payments. Network timeout — did it go through? Client retries. Now you have two charges. The client didn't know the first succeeded. Without idempotency keys, retries cause duplicate mutations.
Client generates a UUID (Idempotency-Key: uuid-v4). Server stores the key + response on first execution. On retry with same key: return cached response, do NOT re-execute. The operation becomes safe to retry.
Key is a client-generated UUID. Scoped to: (clientID, key) — not globally unique, just unique per client. TTL: typically 24h–7 days.
Redis SET NX (set if not exists). If key exists → return cached response immediately. If not → lock key, execute operation, store response.
Two retries arrive simultaneously. Only one gets the SET NX lock. The other spins/polls until the first completes. Return same response to both.
Store: status code, response body, timestamp. On retry: return exactly the same response — same status code, same body, even if it was a 4xx error.
**Go · Idempotency Key Middleware**

```go
type IdempotencyStore interface {
	TryLock(ctx context.Context, key string) (bool, error)
	GetResult(ctx context.Context, key string) (*CachedResponse, error)
	StoreResult(ctx context.Context, key string, res *CachedResponse, ttl time.Duration) error
}

func IdempotencyMiddleware(store IdempotencyStore) func(http.Handler) http.Handler {
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			ikey := r.Header.Get("Idempotency-Key")
			if ikey == "" {
				next.ServeHTTP(w, r)
				return
			}
			clientID := extractClientID(r)
			storeKey := fmt.Sprintf("idem:%s:%s", clientID, ikey)

			// Check if already executed
			cached, err := store.GetResult(r.Context(), storeKey)
			if err == nil && cached != nil {
				w.Header().Set("Idempotent-Replayed", "true")
				w.WriteHeader(cached.StatusCode)
				w.Write(cached.Body)
				return
			}

			// Try to acquire lock (SET NX with 30s TTL for in-flight)
			locked, _ := store.TryLock(r.Context(), storeKey+":lock")
			if !locked {
				// Concurrent retry — poll for result with backoff
				result := pollForResult(r.Context(), store, storeKey)
				w.WriteHeader(result.StatusCode)
				w.Write(result.Body)
				return
			}

			// Capture response, execute, store
			rec := newResponseRecorder(w)
			next.ServeHTTP(rec, r)
			store.StoreResult(r.Context(), storeKey, rec.Result(), 24*time.Hour)
		})
	}
}
```
**Option A: RL first (Stripe-style).** Every request hits the RL check before the idempotency check. Simpler pipeline. The client is penalized for retrying, which incentivizes proper retry handling. Stripe does this.

**Option B: Idempotency first (replay-aware).** Check the idempotency key before the RL check; if it's a replay, skip RL. More complex pipeline. Rewards clients that use idempotency keys. Better UX for transient network failures.
**Go · Ordering matters**

```go
// Option A: RL first (Stripe-style)
//   rateLimitMiddleware → idempotencyMiddleware → handler

// Option B: Idempotency first (replay-aware)
//   idempotencyMiddleware → rateLimitMiddleware → handler
// If idempotencyMiddleware short-circuits (replay), RL is never checked

// Option C: Hybrid — count unique idempotency keys, not requests
// RL key = hash(clientID + idempotency_key) — so retries share a "slot"
rlKey := fmt.Sprintf("rl:%s:%s", clientID, idemKey) // same slot for all retries
```
When a rate-limited operation spans multiple services (e.g., deduct balance, create order, send email), you need compensating transactions if any step fails.
1. **Rate limit check.** If denied → immediate 429, no downstream calls. No compensation needed — nothing happened.
2. **Deduct balance.** Deduct from the balance tentatively. If it fails → saga ends, 402.
3. **Create order.** If it fails → compensate: refund Step 2.
4. **Send email.** If it fails → that's OK (the notification is non-critical). Don't roll back Steps 2–3.
You cannot operate a rate limiter you can't observe. Observability for RL spans three pillars — metrics, traces, and logs — each answering different operational questions.
**Go · Prometheus instrumentation**

```go
var (
	rlDecisions = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "ratelimit_decisions_total",
		Help: "Rate limit decisions by outcome",
	}, []string{"client_id", "endpoint", "outcome", "region"})
	// outcome: "allowed" | "denied" | "error" | "fallback"

	rlCheckDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "ratelimit_check_duration_ms",
		Buckets: []float64{0.1, 0.5, 1, 2, 5, 10, 50},
	}, []string{"backend", "region"})
	// backend: "redis" | "local" | "gossip"

	rlCounterRatio = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "ratelimit_counter_ratio",
		Help: "Current counter as fraction of limit (0.0–1.0+)",
	}, []string{"client_id", "tier"})
	// Alert when ratio > 0.9 → client approaching limit
	// Alert when ratio > 1.0 → overage (gossip lag indicator)

	rlSyncLagMs = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "ratelimit_gossip_sync_lag_ms",
		Buckets: []float64{50, 100, 200, 500, 1000, 2000},
	}, []string{"from_node", "to_node"})
	// Alert when p95 > 500ms → gossip partition risk
)
```
**prometheus · alert rules**

```yaml
groups:
- name: ratelimit
  rules:
  # RL check adding too much latency
  - alert: RateLimitHighLatency
    expr: histogram_quantile(0.99, ratelimit_check_duration_ms) > 10
    for: 2m
    labels: {severity: warning}
    annotations:
      summary: "RL check p99 > 10ms — Redis may be degraded"

  # RL error rate spike — fallback mode active?
  - alert: RateLimitErrorSpike
    expr: |
      rate(ratelimit_decisions_total{outcome="error"}[1m])
        / rate(ratelimit_decisions_total[1m]) > 0.01
    for: 1m
    labels: {severity: critical}
    annotations:
      summary: "More than 1% of RL checks failing — check Redis"

  # Gossip sync lag too high
  - alert: GossipSyncLagHigh
    expr: histogram_quantile(0.95, ratelimit_gossip_sync_lag_ms) > 500
    for: 3m
    labels: {severity: warning}
    annotations:
      summary: "Gossip sync lag p95 > 500ms — possible partition"

  # Counter overage (more allowed than limit)
  - alert: RateLimitOverage
    expr: max(ratelimit_counter_ratio) > 1.15
    for: 5m
    labels: {severity: warning}
    annotations:
      summary: "Client exceeding limit by >15% — gossip lag too wide"

  # Sudden denial spike — DDoS or legitimate traffic surge?
  - alert: DenialRateSpike
    expr: rate(ratelimit_decisions_total{outcome="denied"}[5m]) > 1000
    for: 1m
    labels: {severity: page}
    annotations:
      summary: "High denial rate — possible DDoS or misconfigured limit"
```
**Go · OTEL span for RL check**

```go
func (rl *RateLimiter) Allow(ctx context.Context, clientID string) (bool, error) {
	ctx, span := otel.Tracer("ratelimiter").Start(ctx, "ratelimit.check",
		trace.WithSpanKind(trace.SpanKindInternal))
	defer span.End()

	span.SetAttributes(
		attribute.String("rl.client_id", clientID),
		attribute.String("rl.backend", rl.backend),     // "redis"|"local"
		attribute.String("rl.algorithm", rl.algorithm), // "sliding_window"
		attribute.Int64("rl.limit", rl.limit),
	)

	result, err := rl.check(ctx, clientID)
	if err != nil {
		span.RecordError(err)
		span.SetStatus(codes.Error, err.Error())
		return false, err
	}

	span.SetAttributes(
		attribute.Bool("rl.allowed", result.Allowed),
		attribute.Int64("rl.count", result.Count),
		attribute.Int64("rl.remaining", result.Remaining),
		attribute.Bool("rl.fallback_mode", result.FallbackMode),
	)
	return result.Allowed, nil
}
```
The `rl.fallback_mode` attribute is key — it tells you in traces when an RL decision was made with stale/local data. Filter on it in Jaeger/Tempo to investigate overage incidents.
**JSON log · RL decision event**

```json
{
  "ts": "2026-03-14T10:23:45.123Z",
  "level": "info",
  "event": "ratelimit.decision",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "client_id": "org_9xK2mN",
  "api_key_hash": "sha256:a1b2c3...",
  "endpoint": "POST /v1/charges",
  "outcome": "denied",
  "reason": "key_limit",
  "counter": 102,
  "limit": 100,
  "window_sec": 60,
  "reset_at": "2026-03-14T10:24:00Z",
  "fallback": false,
  "backend_ms": 1.2,
  "region": "us-east-1",
  "node_id": "rl-node-07"
}
```

`api_key_hash` stores a hash — never log the raw key. `reason` names which tier triggered the denial; `backend_ms` is the Redis round-trip.
`node_id` and `region` are critical for debugging "why did node A allow but node B deny?" incidents.
- **Decisions over time.** Time-series: `rate(ratelimit_decisions_total[1m])` grouped by outcome. Overlay denial rate as a percentage. Annotate with config changes (limit updates).
- **Top denied clients.** Table: `topk(20, sum by(client_id) (increase(ratelimit_decisions_total{outcome="denied"}[1h])))`. Links to client details. Updates every 5min.
- **Check latency.** Heatmap of `ratelimit_check_duration_ms`. Separate lines for Redis-backed vs local fallback. Alert line at 10ms p99.
- **Gossip sync lag.** Gauge: `histogram_quantile(0.95, ratelimit_gossip_sync_lag_ms)` per node pair. Red above 500ms. Topology view of which nodes are lagging.
- **Client utilization.** Heatmap of `ratelimit_counter_ratio` across clients. Rows = client tiers. Highlights which clients are near their limit. Useful for capacity planning.
- **Fallback rate.** `sum(rate(ratelimit_decisions_total{outcome="fallback"}[5m]))` over time. Should be near zero. Any spike = Redis health issue. Correlate with the Redis latency panel.