Gossip Protocol (aka Epidemic Protocol) is a peer-to-peer communication model where nodes periodically exchange state with a random subset of peers — mimicking how rumors spread in a social network.
Node N1 gets new state (e.g., counter=42 for client X). It marks this with a vector clock or timestamp.
N1 picks k=2 random peers (N3, N7). Sends a digest of its state. Peers merge if the version is newer.
N3 and N7 now know. Each independently picks 2 more peers. After O(log N) rounds, all nodes converge.
Full propagation across 100-node cluster in ~7 rounds. No central coordinator. No SPOF.
Susceptible → Infected. An infected node pushes its full state to random peers. Simple, but generates redundant traffic once most nodes are already infected and keep pushing to each other.
Nodes periodically pull state from peers. Better for reconciliation but adds latency. Used in anti-entropy repair.
Hybrid: push new state, pull diffs. Most efficient — used in Cassandra, Consul, and Amazon's Dynamo. Nodes that have finished spreading an update move to a Removed state and stop pushing it, eliminating redundant traffic.
In distributed rate limiting, each node tracks counters locally. The challenge: how do you share counter state across nodes without a central bottleneck? Gossip is the answer.
Information reaches all N nodes in O(log N) gossip rounds. With N=1000 and round interval=200ms, full convergence in ~2 seconds.
Node failures don't stop propagation. A message will route around failed nodes in the next gossip round. No leader election needed.
Nodes may have stale data during propagation window. For rate limiting: tolerable if you accept a small overage window (soft limits).
Short interval (50ms) → fresh data, high bandwidth. Long interval (500ms) → stale data, low bandwidth. Tune per SLA.
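The tradeoff can be made concrete with two back-of-envelope formulas — a sketch; the helper names are mine, and the staleness bound reuses the O(log N) convergence estimate:

```go
package main

import "math"

// gossipMsgRate: push-gossip message load per second =
// n nodes × k fanout × rounds per second.
func gossipMsgRate(n, k, intervalMs int) int {
	return n * k * (1000 / intervalMs)
}

// worstStalenessMs: worst-case staleness ≈ rounds to converge × interval
// ≈ log2(n) × interval.
func worstStalenessMs(n, intervalMs int) float64 {
	return math.Log2(float64(n)) * float64(intervalMs)
}
```

For 100 nodes, k=2, 200ms interval this gives 1,000 msg/sec; for 1,000 nodes at the same interval, worst-case staleness is roughly 2 seconds — matching the convergence figures above.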
Gossip alone isn't enough — you need a conflict-free merge strategy. Rate limiting uses G-Counters (Grow-only CRDTs):
```go
type GCounter struct {
	Counts map[string]int64 // nodeID → local count
}

func (g *GCounter) Merge(other *GCounter) {
	for nodeID, count := range other.Counts {
		if count > g.Counts[nodeID] {
			g.Counts[nodeID] = count // take per-node max → no conflicts
		}
	}
}

func (g *GCounter) Value() int64 {
	var total int64
	for _, v := range g.Counts {
		total += v
	}
	return total // global count = sum of all node counters
}

// On an incoming request at node "n1":
func (rl *RateLimiter) Allow(clientID string) bool {
	rl.counter.Counts["n1"]++                // local increment
	return rl.counter.Value() <= rl.limit    // check global sum
}
```
A distributed cache in the rate limiting context is a shared, low-latency store that multiple rate-limiter nodes read/write to maintain globally consistent counters — replacing local-only or gossip-propagated state with a centralized source of truth.
All RL nodes talk to a Redis cluster. Atomic INCR + EXPIRE or Lua scripts for sliding windows. Strong consistency. Single cache failure = system failure.
Each node has a local cache (HashMap). Gossip or async sync merges state. Gossip interval = consistency window. Used by Cloudflare's rate limiter.
Consistent hash → route client to specific cache shard. Each shard owns its clients fully. Best isolation, harder rebalancing on scale.
L1: in-process LRU (sub-ms reads). L2: Redis or Memcached (ms reads). L1 absorbs 90%+ traffic. L2 is source of truth. Write-through or write-back.
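A minimal read-through sketch of the two-tier idea, with a plain map standing in for the L1 LRU and a callback standing in for the L2 client — all names illustrative; a real L1 needs eviction, TTLs, and locking:

```go
package main

type L1Cache struct {
	data map[string]int64 // stand-in for an in-process LRU
}

// Get reads through L1, falling back to L2 on a miss. l2Get stands in
// for a Redis/Memcached GET.
func (c *L1Cache) Get(key string, l2Get func(string) (int64, bool)) (int64, bool) {
	if v, ok := c.data[key]; ok {
		return v, true // L1 hit: sub-ms, no network hop
	}
	v, ok := l2Get(key) // L1 miss: go to the source of truth
	if ok {
		c.data[key] = v // populate L1 so later reads are absorbed locally
	}
	return v, ok
}

// Set writes through: L1 and L2 are updated together, so L2 stays
// the source of truth.
func (c *L1Cache) Set(key string, v int64, l2Set func(string, int64)) {
	c.data[key] = v
	l2Set(key, v)
}
```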
```lua
-- Fixed Window Counter (Lua, atomic)
-- KEYS[1] = "rl:" .. clientID .. ":" .. window, e.g. rl:user123:202403191400
-- ARGV[1] = limit
local count = redis.call('INCR', KEYS[1])
if count == 1 then
    redis.call('EXPIRE', KEYS[1], 60) -- TTL = window size
end
if count <= tonumber(ARGV[1]) then
    return 1 -- allowed
end
return 0 -- denied
```
```lua
-- Sliding Window Log (ZSET)
local key = KEYS[1]              -- e.g. rl:user123
local now = tonumber(ARGV[1])    -- current timestamp, ms
local window = tonumber(ARGV[2]) -- e.g. 60000 ms
local limit = tonumber(ARGV[3])
redis.call('ZREMRANGEBYSCORE', key, 0, now - window)
local count = redis.call('ZCARD', key)
if count < limit then
    redis.call('ZADD', key, now, now) -- score = timestamp, member = timestamp
    redis.call('EXPIRE', key, math.ceil(window / 1000))
    return 1 -- allowed
end
return 0 -- denied
```
| Failure Mode | Cause | Impact on Rate Limiting | Mitigation |
|---|---|---|---|
| Cache Stampede | Key expires, many nodes request simultaneously | All nodes bypass RL, spike to backend | Probabilistic early expiry, lock-based regeneration, local fallback |
| Hot Key Problem | Popular clientID → all traffic to one shard | Shard overload, high latency, drops | Key sharding (clientID:shard_N), local cache for hot keys |
| Network Partition | RL nodes can't reach Redis | Fail open (allow all) vs fail closed (deny all) | Circuit breaker + local fallback counter, configurable fail policy |
| Thundering Herd | Redis restart or failover | Mass counter reset → clients exceed limits temporarily | Sentinel/Cluster HA, warm-up period, shadow counters |
| Clock Skew | Nodes have different system clocks | Window boundaries misaligned, counter leakage | NTP sync, use Redis server time (TIME command), logical clocks |
| Stale Read | Reading from replica during replication lag | Counter looks lower than reality → allow over-limit | Always write/read from primary, or use quorum reads |
To shard rate-limit counters across cache nodes without global coordination:
```go
import (
	"hash/crc32"
	"sort"
)

type HashRing struct {
	replicas int
	ring     map[uint32]string // hash → nodeID
	sorted   []uint32          // sorted ring positions
}

func (h *HashRing) GetNode(clientID string) string {
	hash := crc32.ChecksumIEEE([]byte(clientID))
	// find first ring position >= hash (clockwise)
	idx := sort.Search(len(h.sorted), func(i int) bool {
		return h.sorted[i] >= hash
	})
	if idx == len(h.sorted) {
		idx = 0 // wrap around
	}
	return h.ring[h.sorted[idx]]
}

// clientID "user:123" always routes to the same cache node.
// Node failure → only that node's keys re-route, not all.
```
Rate limiting is the enforcement of a maximum request count per client per time window. The core challenge at scale: counters must be globally consistent, fast, and fault-tolerant — and under network partitions you can't have all three at once, the same tension CAP formalizes.
| Algorithm | Memory | Accuracy | Burst Handling | Distributed Fit |
|---|---|---|---|---|
| Fixed Window | O(1) | Poor (boundary spike) | Double burst at boundary | Easy (INCR/EXPIRE) |
| Sliding Window Log | O(requests) | Exact | Precise | Heavy (ZSET per user) |
| Sliding Window Counter | O(1) | ~1% error | Good approximation | Good (2 counters) |
| Token Bucket | O(1) | Good | Allows bursts up to bucket size | Needs sync for refill |
| Leaky Bucket | O(1) | Good | Smooths bursts | Queue-based, stateful |
```go
func (rl *SlidingWindowRL) Allow(ctx context.Context, clientID string) (bool, error) {
	now := time.Now()
	currWindow := now.Truncate(rl.windowSize).Unix()
	prevWindow := currWindow - int64(rl.windowSize.Seconds())
	currKey := fmt.Sprintf("rl:%s:%d", clientID, currWindow)
	prevKey := fmt.Sprintf("rl:%s:%d", clientID, prevWindow)

	pipe := rl.redis.Pipeline()
	incrCmd := pipe.Incr(ctx, currKey)
	prevCmd := pipe.Get(ctx, prevKey)
	pipe.Expire(ctx, currKey, rl.windowSize*2)
	if _, err := pipe.Exec(ctx); err != nil && err != redis.Nil {
		return false, err // redis.Nil just means no previous window
	}

	curr := incrCmd.Val()
	prev, _ := strconv.ParseInt(prevCmd.Val(), 10, 64) // missing prev key → 0

	// weighted count: prev_count × (1 − elapsed fraction) + curr_count
	elapsed := float64(now.UnixNano()-currWindow*1e9) / float64(rl.windowSize)
	estimated := float64(prev)*(1-elapsed) + float64(curr)
	return estimated <= float64(rl.limit), nil
}
```
When cache is unreachable, allow all requests. Protects user experience. Risk: traffic spikes hit backend unprotected. Used for non-critical APIs or when DDoS protection is upstream.
When cache is unreachable, deny requests. Protects backend at cost of availability. Used for billing APIs, quota-critical endpoints, or security-sensitive paths.
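A sketch of wiring the two policies together, with `checkRemote` standing in for the Redis-backed decision and `localAllow` for a node-local fallback counter — the names and the sentinel error are illustrative:

```go
package main

import "errors"

type FailPolicy int

const (
	FailOpen   FailPolicy = iota // cache down → allow (protect UX)
	FailClosed                   // cache down → deny (protect backend)
)

// errCacheDown is an example sentinel for an unreachable cache.
var errCacheDown = errors.New("cache unreachable")

// allowWithFallback runs the cache-backed check and, on error, applies
// the configured fail policy. Fail-open still consults a local counter
// as a backstop so a Redis outage doesn't remove all limiting.
func allowWithFallback(checkRemote func() (bool, error), localAllow func() bool, policy FailPolicy) bool {
	ok, err := checkRemote()
	if err == nil {
		return ok // healthy path: the cache decides
	}
	if policy == FailClosed {
		return false // protect backend at the cost of availability
	}
	return localAllow() // fail-open, bounded by the node-local counter
}
```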
```go
type TokenBucket struct {
	capacity   int64
	refillRate float64 // tokens per second
}

// KEYS[1] = token count, KEYS[2] = last refill timestamp
// ARGV[1] = capacity, ARGV[2] = refill rate (tokens/sec), ARGV[3] = now (sec)
var tokenBucketScript = redis.NewScript(`
local cap    = tonumber(ARGV[1])
local rate   = tonumber(ARGV[2])
local now    = tonumber(ARGV[3])
local tokens = tonumber(redis.call('GET', KEYS[1]) or cap)
local last   = tonumber(redis.call('GET', KEYS[2]) or now)
-- refill tokens based on elapsed time
local elapsed = math.max(0, now - last)
tokens = math.min(cap, tokens + elapsed * rate)
if tokens >= 1 then
    tokens = tokens - 1
    redis.call('SET', KEYS[1], tokens, 'EX', 86400)
    redis.call('SET', KEYS[2], now, 'EX', 86400)
    return 1 -- allowed
end
return 0 -- denied
`)
```
How gossip and distributed cache fit together — and the exact tradeoffs you need to articulate in a Staff-level system design interview.
| Dimension | Gossip-Based RL | Centralized Cache (Redis) |
|---|---|---|
| Consistency | Eventual (100ms–2s lag) | Strong (atomic ops) |
| Latency | Sub-ms (local read) | 1–5ms (network hop) |
| Fault Tolerance | Excellent (no SPOF) | Redis failure = outage |
| Accuracy | ±10–20% during propagation | Exact |
| Throughput | Unlimited (local) | Bounded by Redis cluster |
| Best for | Soft limits, high-scale, geo-distributed | Billing, strict quotas, security |
| Real-world examples | Cloudflare RL, Riak, Cassandra | Stripe, Shopify, AWS API GW |
Each node maintains a local counter. Gossip sync every 100–500ms. Handles 95%+ of decisions locally. Uses G-Counter CRDT for conflict-free merge.
For strict limits (billing, OAuth tokens), always hit Redis. Use pipeline + Lua for atomicity. Redis Cluster with sentinel for HA.
Split the global limit across nodes. If limit=1000 and 10 nodes: each node enforces 100. Overage possible but bounded. Recalibrate via gossip rounds.
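A minimal sketch of the split, distributing the remainder so per-node quotas sum exactly to the global limit — `perNodeQuota` is an illustrative name, and `nodeIndex` would come from gossip membership:

```go
package main

// perNodeQuota splits a global limit evenly across live nodes.
// The first (globalLimit mod nodeCount) nodes get one extra unit,
// so the quotas always sum to globalLimit. Recalibrate whenever
// gossip membership changes nodeCount.
func perNodeQuota(globalLimit int64, nodeIndex, nodeCount int) int64 {
	q := globalLimit / int64(nodeCount)
	if int64(nodeIndex) < globalLimit%int64(nodeCount) {
		q++ // absorb the remainder on the lowest-indexed nodes
	}
	return q
}
```

With limit=1000 and 10 nodes each node enforces 100, as in the text; with limit=1003, three nodes enforce 101 and seven enforce 100.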
Route clientID → same RL node via consistent hash at gateway. No cross-node sync needed. Problem: node failure redistributes clients, counters reset.
With k=2 fanout and N=100 nodes, each gossip round generates O(k × N) messages. At 200ms interval: 100 nodes × 2 × 5/sec = 1000 msg/sec. Manageable, but grows with cluster size. Use push-pull to cut redundancy.
During gossip propagation window, a client could exceed limit across nodes. Accept soft enforcement — allow up to 110% of limit, then deny. Design the product around this: "approximately 100 req/min" not "exactly 100".
Rate limit counters are ephemeral. Use TTL aggressively. For sliding window: TTL = window_size × 2. Redis keyspace: O(clients × windows). For 1M active clients with a 1-min window: ~2M keys; budgeting roughly 100 bytes per key once the key string and Redis metadata overhead are counted, that's ≈200MB. Comfortably manageable.
Cross-region gossip adds 50–200ms latency. Options: (a) per-region limits (limit=100 per region), (b) global Redis with CRDT replication, (c) gossip within region + async cross-region sync.
Emit metrics: rl.allowed, rl.denied, rl.counter_value, rl.sync_lag_ms. Alert on sync lag spikes — indicates gossip partition. Trace which node denied vs allowed for debugging.
1. REQUIREMENTS CLARIFICATION
   → Hard limit (billing) or soft limit (API quota)?
   → Per-user, per-IP, or per-API-key?
   → Single region or multi-region?
   → What's acceptable accuracy? (exact vs ~5% overage)
   → What's the target latency budget for the RL check? (<1ms? <5ms?)
2. ALGORITHM CHOICE
   → Soft limit at scale → Sliding Window Counter + Gossip
   → Hard limit, low scale → Token Bucket + Redis
   → Smooth traffic → Leaky Bucket
3. DATA LAYER
   → Redis Cluster for strong consistency
   → Local CRDT + Gossip for eventual consistency
   → Two-tier: local L1 + Redis L2
4. FAILURE MODES
   → Redis down: fail-open or local fallback?
   → Network partition: accept stale reads?
   → Node crash: counter loss → sticky routing helps
5. SCALING & OPERATIONS
   → Consistent hashing for shard assignment
   → Gossip for node membership (like Cassandra)
   → Metrics, alerting, sync-lag monitoring