Cell Router Architecture

Staff-level system design reference

Cell architecture is a blast-radius reduction strategy. Instead of one giant distributed system, you partition everything — customers, data, compute — into independent cells, each a full-stack replica of your service. A cell router sits in front and steers each request to the right cell.

Why does this exist?

At scale, the hardest problem isn't building features — it's containing failure. A single noisy neighbour, a bad deploy, a runaway query, a cascade — any of these can take down your entire customer base. Cell architecture trades some efficiency for strong failure isolation.

Without cell architecture

One bad actor takes down everyone. A degraded database affects all customers. A bad deploy affects all users. Blast radius = 100%.

With cell architecture

Failure is scoped to one cell (typically 1–5% of customers). Other cells are unaffected. You can hotfix, roll back, or evacuate a single cell.

The four core problems it solves

1. Blast radius containment
When a cell fails, only the tenants pinned to that cell are affected. You can surgically evacuate or fix without a global incident. Trade-off: some % of customers will have degraded service during a cell failure rather than all customers having partial degradation.
2. Noisy neighbour prevention
Large enterprise customers can be pinned to dedicated cells. High-traffic tenants can't starve shared resources. You can offer SLA tiers by cell class. Trade-off: bin-packing efficiency drops — a dedicated cell is never fully utilized.
3. Independent deployability
Deploy a new version to cell-03 only. If it works for 10 minutes, roll to cell-04. Safer than % canary because the cell is fully isolated. Trade-off: you must build multi-version compatibility into your API and data layer.
4. Regional data residency
EU customers → EU cell → EU database → EU compute. GDPR, SOC2, FedRAMP scoped per-cell. Trade-off: cross-cell operations (e.g. moving a customer between regions) become hard migrations.
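The independent-deployability point above can be sketched as a rollout loop. This is a minimal illustration, not a real orchestrator API: `deploy` and `healthy` are hypothetical callables standing in for your deploy tooling and post-deploy health checks.

```python
# Sketch of a cell-by-cell rollout loop (hypothetical function names).
# Deploy to one cell, verify it, then proceed; any failure stops the
# rollout with blast radius limited to a single cell.

def rollout(version, cells, deploy, healthy):
    """Deploy `version` cell by cell.

    deploy(cell, version) applies the release to one cell.
    healthy(cell) returns True if the cell's checks pass.
    Returns (cells updated successfully, failed cell or None).
    """
    done = []
    for cell in cells:
        deploy(cell, version)
        # A real orchestrator waits a bake period (e.g. 10 minutes of
        # healthy traffic) before this check; healthy() stands in for it.
        if not healthy(cell):
            return done, cell  # stop: only this cell is affected
        done.append(cell)
    return done, None
```

If cell-04 fails its checks, cell-05 never receives the release, which is the containment property a percentage canary cannot give you.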

System Architecture

Each cell is a complete replica of your stack. They share no runtime state. The cell router is the only global component in the request path.

[Architecture diagram]
clients → DNS (anycast / GeoDNS) → cell router (tenant lookup → cell mapping → proxy / redirect)
Cell mapping store: tenant_id → cell_id
cell-01 (shared · us-east): app servers · job workers · cache · message queue (cell-local) · PostgreSQL primary + replica · tenants t001–t500
cell-02 (shared · us-east): same stack · tenants t501–t1000
cell-03 (dedicated · enterprise): app servers · job workers · dedicated cache · isolated queue · dedicated DB cluster (Aurora) · tenant AcmeCorp
Control plane (out-of-band): cell registry · health monitor · cell provisioner · migration engine · deploy orchestrator · observability layer
Shared globally (SPOF risk)

Cell router · Cell mapping store · Control plane · DNS / CDN · Auth service (if centralized)

Cell-local (isolated)

App servers · Job workers · Cache (Redis) · Message queue · Database (primary + replica)
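The two global stores, the control plane's cell registry and the tenant → cell mapping store, can be sketched as a minimal data model. Field names and values here are illustrative assumptions, not a prescribed schema.

```python
# Minimal data model for the two global stores: the cell registry
# (control plane) and the tenant -> cell mapping store.
from dataclasses import dataclass

@dataclass
class Cell:
    cell_id: str     # e.g. "cell-01"
    cell_class: str  # "shared" or "dedicated"
    region: str      # e.g. "us-east"
    endpoint: str    # where the router proxies traffic

registry = {
    "cell-01": Cell("cell-01", "shared", "us-east", "cell-01.internal"),
    "cell-03": Cell("cell-03", "dedicated", "us-east", "cell-03.internal"),
}
mapping = {"t042": "cell-01", "acme": "cell-03"}  # tenant_id -> cell_id

def resolve(tenant_id):
    """Look up the cell that owns a tenant, or None if unmapped."""
    cell_id = mapping.get(tenant_id)
    return registry.get(cell_id) if cell_id else None
```

Keeping the mapping as a plain tenant_id → cell_id dictionary is what lets the router stay stateless: it caches this table and does no per-request I/O.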

Request Flow Scenarios

Walk through five key scenarios: normal flow, cache miss, cell failure, throttling, and tenant migration. Each shows exactly what happens to a request — or to your system — at each stage.

Cell Router Logic

The cell router is a thin, stateless proxy. Its only job: given an inbound request, determine which cell owns the tenant and forward accordingly. It must be fast (<5 ms P99), highly available, and able to pick up routing table updates without restarts.

Tenant identifier extraction strategies

| Strategy | How it works | Used by | Trade-off |
|---|---|---|---|
| Subdomain | acme.myapp.com → extract acme from Host header | Slack, Shopify | Clean, zero latency. Requires wildcard TLS cert management. |
| JWT claim | Decode JWT, read tenant_id claim | Auth0, many APIs | No extra lookup. Requires JWT validation at the router, which adds CPU. |
| URL path prefix | /org/{tenant_id}/api/... | GitHub, GitLab | Simple. Couples routing to API design. Hard to migrate URLs. |
| Custom header | X-Tenant-ID: acme injected by CDN | Internal microservices | Invisible to end users. Requires CDN/gateway upstream. |
| API key lookup | Router maps API key → tenant in mapping store | Stripe, Twilio | Extra hop. Key rotation invalidates cached mapping. |
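Three of these strategies can be sketched in a few lines each. The header name and base domain below are assumptions for illustration, not fixed conventions.

```python
# Sketches of three tenant-extraction strategies:
# subdomain, URL path prefix, and custom header.
import re

def from_subdomain(host, base="myapp.com"):
    """acme.myapp.com -> 'acme'; None if not a tenant subdomain."""
    suffix = "." + base
    return host[: -len(suffix)] if host.endswith(suffix) else None

def from_path(path):
    """/org/{tenant_id}/api/... -> tenant_id, or None."""
    m = re.match(r"^/org/([^/]+)/", path)
    return m.group(1) if m else None

def from_header(headers):
    """Trust an X-Tenant-ID header injected by an upstream CDN/gateway."""
    return headers.get("X-Tenant-ID")
```

Note that the subdomain and path variants need no lookup at all, which is why they add effectively zero routing latency.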

Routing modes

L7 proxy (preferred)

Router terminates connection and opens a new one to the target cell. Can inject headers, enforce auth, circuit-break. Adds ~1–3ms. Used by Nginx, Envoy, Cloudflare Workers.

HTTP redirect (302/307)

Return a redirect to cell-03.internal/api/.... Client makes a second request. Only works for browser clients — breaks API consumers and gRPC.

DNS-level routing

GeoDNS returns different A record per cell. Very low overhead. Slow TTL propagation makes instant failover hard. Used for coarse-grained region routing.

BGP / anycast

Network routes packets to nearest PoP, which does L7 routing. Used by Cloudflare, AWS, Fastly for their own infra. Overkill for most product teams.
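The proxy and redirect modes above differ only in who makes the second hop. A toy decision function makes the contrast concrete; the request/response shapes and the .internal endpoints are illustrative assumptions.

```python
# Toy L7 routing decision (proxy vs. redirect).
def handle(request, table, mode="proxy"):
    """request: dict with 'tenant' and 'path'. Returns a response dict."""
    cell = table.get(request["tenant"])
    if cell is None:
        # Unknown tenant: fail fast at the edge (421 Misdirected Request).
        return {"status": 421, "body": "unknown tenant"}
    target = f"https://{cell}.internal{request['path']}"
    if mode == "redirect":
        # 307 preserves method and body, but the client must follow it;
        # this breaks gRPC and many API SDKs.
        return {"status": 307, "location": target}
    # Proxy mode: open an upstream connection and stream the response.
    # (A real router would also inject headers and circuit-break here.)
    return {"status": 200, "upstream": target}
```

In proxy mode the client sees one hop; in redirect mode the cell endpoint leaks to the client, which is another reason proxying is preferred.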

Routing table update propagation

Poll from Redis (simple): routers poll every 5–30s. Stale window = poll interval.

Pub/sub invalidation: router subscribes to Redis keyspace events. Near-instant propagation. More operational complexity.

Dual-write + shadow period: during migration, both cells accept writes. Once consistent, cut over. Safest but complex.
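The poll-based option above can be sketched as a lazily refreshed in-memory table; `fetch` stands in for the Redis read and is an assumption of this sketch. Staleness is bounded by the poll interval.

```python
# Poll-based routing-table propagation: routers keep an in-memory copy
# of the tenant -> cell mapping and refresh it on an interval.
import time

class RoutingTable:
    def __init__(self, fetch, poll_interval=5.0, clock=time.monotonic):
        self._fetch = fetch          # () -> dict[tenant_id, cell_id]
        self._interval = poll_interval
        self._clock = clock          # injectable for testing
        self._table = fetch()
        self._fetched_at = clock()

    def lookup(self, tenant_id):
        # Refresh lazily once the cached copy is older than the interval;
        # until then, lookups may return a stale mapping.
        now = self._clock()
        if now - self._fetched_at >= self._interval:
            self._table = self._fetch()
            self._fetched_at = now
        return self._table.get(tenant_id)
```

Pub/sub invalidation replaces the interval check with a subscriber that swaps the table on a keyspace event, shrinking the stale window to near zero at the cost of more moving parts.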

Required Components

To implement cell architecture from scratch, you need these building blocks: a cell router, a cell mapping store, and a control plane comprising a cell registry, health monitor, cell provisioner, migration engine, and deploy orchestrator, plus an observability layer.

Failure Modes

Cell architecture's main value is blast radius containment — but it introduces its own failure classes.


How Real Companies Do It

The strategies above name real adopters: Slack and Shopify route by subdomain, GitHub and GitLab by URL path prefix, Stripe and Twilio by API-key lookup, and Cloudflare, AWS, and Fastly run anycast routing for their own infrastructure.

Trade-offs & When Not To Use It

Cell architecture solves blast radius and noisy neighbour problems but introduces significant operational complexity. A Staff engineer must know not just how it works but when it's overkill.

The cost matrix

| What you gain | What you pay |
|---|---|
| Blast radius ≤ 1 cell | Operational complexity × N cells |
| Independent deployability | Must maintain multi-version API compatibility |
| Noisy neighbour prevention | Lower resource utilisation per cell (idle headroom) |
| Tenant data residency | Cross-tenant queries become impossible or expensive |
| Progressive rollouts | Cell routing adds a hop + latency (1–5 ms) |
| Enterprise dedicated cells | Tenant migration is a hard operational problem |

Anti-patterns

Global shared components

If your auth service, feature flag system, or notification queue is global, a failure there takes out all cells. The router and cell mapping store must themselves be extremely HA — they are your true global SPOFs.

Cross-cell fan-out queries

If a feature needs data across all tenants (e.g. "show activity across all your org's sub-teams"), cells make this hard. You must replicate a summary to a global analytics store, do scatter-gather, or forbid the feature.

Cell size too small

If each cell has <100 tenants, you have hundreds of cells — each with its own DB, queue, cache. The operational burden becomes untenable without strong automation. Cells need to be large enough that the overhead amortises.

When to use vs. alternatives

Strong fit

B2B SaaS with clear tenant boundaries. Enterprise isolation. >100k tenants. Strict SLA tiers. Regulatory data residency.

Moderate fit

B2C apps where cells map to geographic regions or user cohorts. Less clean cell boundary — risk of cross-cell reads.

Poor fit

Consumer apps with no tenant concept. Early-stage products. Teams without platform engineering maturity.

Interview Q&A

Staff-level questions with signal indicators. These are the kinds of questions you'll face at OpenAI, Anthropic, Databricks, or any company asking you to design a multi-tenant platform.