Cell Router Architecture

Staff-level system design reference

Cell architecture is a blast-radius reduction strategy. Instead of one giant distributed system, you partition everything — customers, data, compute — into independent cells, each a full-stack replica of your service. A cell router sits in front and steers each request to the right cell.

Why does this exist?

At scale, the hardest problem isn't building features — it's containing failure. A single noisy neighbour, a bad deploy, a runaway query, a cascade — any of these can take down your entire customer base. Cell architecture trades some efficiency for strong failure isolation.

Without cell architecture

One bad actor takes down everyone. A degraded database affects all customers. A bad deploy affects all users. Blast radius = 100%.

With cell architecture

Failure is scoped to one cell (typically 1–5% of customers). Other cells are unaffected. You can hotfix, roll back, or evacuate a single cell.

The four core problems it solves

1. Blast radius containment
When a cell fails, only the tenants pinned to that cell are affected. You can surgically evacuate or fix without a global incident. Trade-off: some % of customers will have degraded service during a cell failure rather than all customers having partial degradation.
2. Noisy neighbour prevention
Large enterprise customers can be pinned to dedicated cells. High-traffic tenants can't starve shared resources. You can offer SLA tiers by cell class. Trade-off: bin-packing efficiency drops — a dedicated cell is never fully utilized.
3. Independent deployability
Deploy a new version to cell-03 only. If it works for 10 minutes, roll to cell-04. Safer than % canary because the cell is fully isolated. Trade-off: you must build multi-version compatibility into your API and data layer.
4. Regional data residency
EU customers → EU cell → EU database → EU compute. GDPR, SOC2, FedRAMP scoped per-cell. Trade-off: cross-cell operations (e.g. moving a customer between regions) become hard migrations.
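The independent-deployability point above can be sketched as a rollout loop. This is a minimal illustration, not a real orchestrator API: `deploy` and `healthy` are hypothetical callables standing in for your deploy tooling and post-deploy health checks.

```python
# Sketch of a cell-by-cell rollout loop (hypothetical function names).
# Deploy to one cell, verify it, then proceed; any failure stops the
# rollout with blast radius limited to a single cell.

def rollout(version, cells, deploy, healthy):
    """Deploy `version` cell by cell.

    deploy(cell, version) applies the release to one cell.
    healthy(cell) returns True if the cell's checks pass.
    Returns (cells updated successfully, failed cell or None).
    """
    done = []
    for cell in cells:
        deploy(cell, version)
        # A real orchestrator waits a bake period (e.g. 10 minutes of
        # healthy traffic) before this check; healthy() stands in for it.
        if not healthy(cell):
            return done, cell  # stop: only this cell is affected
        done.append(cell)
    return done, None
```

If cell-04 fails its checks, cell-05 never receives the release, which is the containment property a percentage canary cannot give you.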

System Architecture

Each cell is a complete replica of your stack. They share no runtime state. The cell router is the only global component in the request path.

[Architecture diagram]
clients → DNS (anycast / GeoDNS) → cell router (tenant lookup → cell mapping → proxy / redirect)
Cell mapping store: tenant_id → cell_id
cell-01 (shared · us-east): app servers · job workers · cache · message queue (cell-local) · PostgreSQL primary + replica · tenants t001–t500
cell-02 (shared · us-east): same stack · tenants t501–t1000
cell-03 (dedicated · enterprise): app servers · job workers · dedicated cache · isolated queue · dedicated DB cluster (Aurora) · tenant AcmeCorp
Control plane (out-of-band): cell registry · health monitor · cell provisioner · migration engine · deploy orchestrator · observability layer
Shared globally (SPOF risk)

Cell router · Cell mapping store · Control plane · DNS / CDN · Auth service (if centralized)

Cell-local (isolated)

App servers · Job workers · Cache (Redis) · Message queue · Database (primary + replica)
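The two global stores, the control plane's cell registry and the tenant → cell mapping store, can be sketched as a minimal data model. Field names and values here are illustrative assumptions, not a prescribed schema.

```python
# Minimal data model for the two global stores: the cell registry
# (control plane) and the tenant -> cell mapping store.
from dataclasses import dataclass

@dataclass
class Cell:
    cell_id: str     # e.g. "cell-01"
    cell_class: str  # "shared" or "dedicated"
    region: str      # e.g. "us-east"
    endpoint: str    # where the router proxies traffic

registry = {
    "cell-01": Cell("cell-01", "shared", "us-east", "cell-01.internal"),
    "cell-03": Cell("cell-03", "dedicated", "us-east", "cell-03.internal"),
}
mapping = {"t042": "cell-01", "acme": "cell-03"}  # tenant_id -> cell_id

def resolve(tenant_id):
    """Look up the cell that owns a tenant, or None if unmapped."""
    cell_id = mapping.get(tenant_id)
    return registry.get(cell_id) if cell_id else None
```

Keeping the mapping as a plain tenant_id → cell_id dictionary is what lets the router stay stateless: it caches this table and does no per-request I/O.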

Request Flow Scenarios

Walk through five key scenarios: normal flow, cache miss, cell failure, throttling, and tenant migration. Each shows exactly what happens to a request — or to your system — at each stage.

Cell Router Logic

The cell router is a thin, stateless proxy. Its only job: given an inbound request, determine which cell owns the tenant and forward accordingly. It must be fast (<5 ms P99), highly available, and able to pick up routing table updates without restarts.

Tenant identifier extraction strategies

| Strategy | How it works | Used by | Trade-off |
|---|---|---|---|
| Subdomain | acme.myapp.com → extract acme from Host header | Slack, Shopify | Clean, zero latency. Requires wildcard TLS cert management. |
| JWT claim | Decode JWT, read tenant_id claim | Auth0, many APIs | No extra lookup. Requires JWT validation at the router, which adds CPU. |
| URL path prefix | /org/{tenant_id}/api/... | GitHub, GitLab | Simple. Couples routing to API design. Hard to migrate URLs. |
| Custom header | X-Tenant-ID: acme injected by CDN | Internal microservices | Invisible to end users. Requires CDN/gateway upstream. |
| API key lookup | Router maps API key → tenant in mapping store | Stripe, Twilio | Extra hop. Key rotation invalidates cached mapping. |
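Three of these strategies can be sketched in a few lines each. The header name and base domain below are assumptions for illustration, not fixed conventions.

```python
# Sketches of three tenant-extraction strategies:
# subdomain, URL path prefix, and custom header.
import re

def from_subdomain(host, base="myapp.com"):
    """acme.myapp.com -> 'acme'; None if not a tenant subdomain."""
    suffix = "." + base
    return host[: -len(suffix)] if host.endswith(suffix) else None

def from_path(path):
    """/org/{tenant_id}/api/... -> tenant_id, or None."""
    m = re.match(r"^/org/([^/]+)/", path)
    return m.group(1) if m else None

def from_header(headers):
    """Trust an X-Tenant-ID header injected by an upstream CDN/gateway."""
    return headers.get("X-Tenant-ID")
```

Note that the subdomain and path variants need no lookup at all, which is why they add effectively zero routing latency.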

Routing modes

L7 proxy (preferred)

Router terminates connection and opens a new one to the target cell. Can inject headers, enforce auth, circuit-break. Adds ~1–3ms. Used by Nginx, Envoy, Cloudflare Workers.

HTTP redirect (302/307)

Return a redirect to cell-03.internal/api/.... Client makes a second request. Only works for browser clients — breaks API consumers and gRPC.

DNS-level routing

GeoDNS returns different A record per cell. Very low overhead. Slow TTL propagation makes instant failover hard. Used for coarse-grained region routing.

BGP / anycast

Network routes packets to nearest PoP, which does L7 routing. Used by Cloudflare, AWS, Fastly for their own infra. Overkill for most product teams.
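The proxy and redirect modes above differ only in who makes the second hop. A toy decision function makes the contrast concrete; the request/response shapes and the .internal endpoints are illustrative assumptions.

```python
# Toy L7 routing decision (proxy vs. redirect).
def handle(request, table, mode="proxy"):
    """request: dict with 'tenant' and 'path'. Returns a response dict."""
    cell = table.get(request["tenant"])
    if cell is None:
        # Unknown tenant: fail fast at the edge (421 Misdirected Request).
        return {"status": 421, "body": "unknown tenant"}
    target = f"https://{cell}.internal{request['path']}"
    if mode == "redirect":
        # 307 preserves method and body, but the client must follow it;
        # this breaks gRPC and many API SDKs.
        return {"status": 307, "location": target}
    # Proxy mode: open an upstream connection and stream the response.
    # (A real router would also inject headers and circuit-break here.)
    return {"status": 200, "upstream": target}
```

In proxy mode the client sees one hop; in redirect mode the cell endpoint leaks to the client, which is another reason proxying is preferred.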

Routing table update propagation

Poll from Redis (simple): routers poll every 5–30s. Stale window = poll interval.

Pub/sub invalidation: router subscribes to Redis keyspace events. Near-instant propagation. More operational complexity.

Dual-write + shadow period: during migration, both cells accept writes. Once consistent, cut over. Safest but complex.
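The poll-based option above can be sketched as a lazily refreshed in-memory table; `fetch` stands in for the Redis read and is an assumption of this sketch. Staleness is bounded by the poll interval.

```python
# Poll-based routing-table propagation: routers keep an in-memory copy
# of the tenant -> cell mapping and refresh it on an interval.
import time

class RoutingTable:
    def __init__(self, fetch, poll_interval=5.0, clock=time.monotonic):
        self._fetch = fetch          # () -> dict[tenant_id, cell_id]
        self._interval = poll_interval
        self._clock = clock          # injectable for testing
        self._table = fetch()
        self._fetched_at = clock()

    def lookup(self, tenant_id):
        # Refresh lazily once the cached copy is older than the interval;
        # until then, lookups may return a stale mapping.
        now = self._clock()
        if now - self._fetched_at >= self._interval:
            self._table = self._fetch()
            self._fetched_at = now
        return self._table.get(tenant_id)
```

Pub/sub invalidation replaces the interval check with a subscriber that swaps the table on a keyspace event, shrinking the stale window to near zero at the cost of more moving parts.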

Required Components

To implement cell architecture from scratch, you need these building blocks: a cell router, a cell mapping store, and a control plane comprising a cell registry, health monitor, cell provisioner, migration engine, and deploy orchestrator, plus an observability layer.

Failure Modes

Cell architecture's main value is blast radius containment — but it introduces its own failure classes.


How Real Companies Do It

The strategies above name real adopters: Slack and Shopify route by subdomain, GitHub and GitLab by URL path prefix, Stripe and Twilio by API-key lookup, and Cloudflare, AWS, and Fastly run anycast routing for their own infrastructure.

Trade-offs & When Not To Use It

Cell architecture solves blast radius and noisy neighbour problems but introduces significant operational complexity. A Staff engineer must know not just how it works but when it's overkill.

The cost matrix

| What you gain | What you pay |
|---|---|
| Blast radius ≤ 1 cell | Operational complexity × N cells |
| Independent deployability | Must maintain multi-version API compatibility |
| Noisy neighbour prevention | Lower resource utilisation per cell (idle headroom) |
| Tenant data residency | Cross-tenant queries become impossible or expensive |
| Progressive rollouts | Cell routing adds a hop + latency (1–5 ms) |
| Enterprise dedicated cells | Tenant migration is a hard operational problem |

Anti-patterns

Global shared components

If your auth service, feature flag system, or notification queue is global, a failure there takes out all cells. The router and cell mapping store must themselves be extremely HA — they are your true global SPOFs.

Cross-cell fan-out queries

If a feature needs data across all tenants (e.g. "show activity across all your org's sub-teams"), cells make this hard. You must replicate a summary to a global analytics store, do scatter-gather, or forbid the feature.

Cell size too small

If each cell has <100 tenants, you have hundreds of cells — each with its own DB, queue, cache. The operational burden becomes untenable without strong automation. Cells need to be large enough that the overhead amortises.

When to use vs. alternatives

Strong fit

B2B SaaS with clear tenant boundaries. Enterprise isolation. >100k tenants. Strict SLA tiers. Regulatory data residency.

Moderate fit

B2C apps where cells map to geographic regions or user cohorts. Less clean cell boundary — risk of cross-cell reads.

Poor fit

Consumer apps with no tenant concept. Early-stage products. Teams without platform engineering maturity.

Interview Q&A

Staff-level questions with signal indicators. These are the kinds of questions you'll face at OpenAI, Anthropic, Databricks, or any company asking you to design a multi-tenant platform.