Cell architecture is a blast-radius reduction strategy. Instead of one giant distributed system, you partition everything — customers, data, compute — into independent cells, each a full-stack replica of your service. A cell router sits in front and steers each request to the right cell.
At scale, the hardest problem isn't building features — it's containing failure. A single noisy neighbour, a bad deploy, a runaway query, a cascade — any of these can take down your entire customer base. Cell architecture trades some efficiency for strong failure isolation.
**Without cells:** one bad actor takes down everyone. A degraded database affects all customers; a bad deploy affects all users. Blast radius = 100%.
**With cells:** failure is scoped to one cell (typically 1–5% of customers). Other cells are unaffected, and you can hotfix, roll back, or evacuate a single cell.
Each cell is a complete replica of your stack. They share no runtime state. The cell router is the only global component in the request path.
Global components: Cell router · Cell mapping store · Control plane · DNS / CDN · Auth service (if centralized)
Per-cell components: App servers · Job workers · Cache (Redis) · Message queue · Database (primary + replica)
Five key scenarios follow. Each shows exactly what happens to a request — or to your system — at each stage.
The cell router is a thin stateless proxy. Its only job: given an inbound request, determine which cell owns this tenant and forward accordingly. Must be fast (<5ms P99), highly available, and handle routing table updates without restarts.
| Strategy | How it works | Used by | Trade-off |
|---|---|---|---|
| Subdomain | acme.myapp.com → extract acme from Host header | Slack, Shopify | Clean, zero latency. Requires wildcard TLS cert management. |
| JWT claim | Decode JWT, read tenant_id claim | Auth0, many APIs | No extra lookup. Requires JWT validation at router — adds CPU. |
| URL path prefix | /org/{tenant_id}/api/... | GitHub, GitLab | Simple. Couples routing to API design. Hard to migrate URLs. |
| Custom header | X-Tenant-ID: acme injected by CDN | Internal microservices | Invisible to end users. Requires CDN/gateway upstream. |
| API key lookup | Router maps API key → tenant in mapping store | Stripe, Twilio | Extra hop. Key rotation invalidates cached mapping. |
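The subdomain strategy from the table above can be sketched as a minimal routing function. This is a toy with an in-memory routing table — `ROUTING_TABLE`, `DEFAULT_CELL`, and the cell names are hypothetical; a real router would consult the cell mapping store and cache the result:

```python
# Sketch of subdomain-based cell routing (hypothetical names throughout).
# Production routers add caching, health checks, and hot routing-table reloads.

ROUTING_TABLE = {           # tenant -> cell; in practice loaded from a mapping store
    "acme": "cell-01",
    "globex": "cell-02",
}
DEFAULT_CELL = "cell-01"    # fallback for unknown tenants (or reject instead)

def extract_tenant(host: str) -> str:
    """acme.myapp.com -> acme (the subdomain strategy from the table)."""
    return host.split(".", 1)[0]

def route(host: str) -> str:
    """Return the cell that owns the tenant behind this Host header."""
    return ROUTING_TABLE.get(extract_tenant(host), DEFAULT_CELL)
```

The same shape works for the JWT-claim or API-key strategies — only `extract_tenant` changes.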
Reverse proxy: the router terminates the connection and opens a new one to the target cell. It can inject headers, enforce auth, and circuit-break. Adds ~1–3ms. This is the model used by Nginx, Envoy, and Cloudflare Workers.
HTTP redirect: return a redirect to cell-03.internal/api/... and let the client make a second request. Only works for browser clients — it breaks API consumers and gRPC.
DNS-based: GeoDNS returns a different A record per cell. Very low overhead, but slow TTL propagation makes instant failover hard. Used for coarse-grained region routing.
Anycast: the network routes packets to the nearest PoP, which does L7 routing. Used by Cloudflare, AWS, and Fastly for their own infrastructure. Overkill for most product teams.
Poll from Redis (simple): routers poll every 5–30s. Stale window = poll interval.
Pub/sub invalidation: router subscribes to Redis keyspace events. Near-instant propagation. More operational complexity.
Dual-write + shadow period: during migration, both cells accept writes. Once consistent, cut over. Safest but complex.
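The dual-write + shadow-period flow can be sketched with plain dicts standing in for the two cells' stores (`DualWriteMigrator` and its methods are hypothetical names, not a real API):

```python
class DualWriteMigrator:
    """Sketch of dual-write tenant migration between two cells.

    During migration both cells receive writes; reads stay on the source
    cell until the shadow period confirms the cells are consistent."""

    def __init__(self, source: dict, target: dict):
        self.source, self.target = source, target
        self.cut_over = False

    def write(self, key, value):
        self.source[key] = value
        self.target[key] = value          # shadow write to the target cell

    def read(self, key):
        cell = self.target if self.cut_over else self.source
        return cell.get(key)

    def verify_and_cut_over(self) -> bool:
        # Only cut over once the two cells hold identical data.
        if self.source == self.target:
            self.cut_over = True
        return self.cut_over
```

A real migration would backfill historical data into the target before the shadow period and compare checksums rather than whole datasets.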
To implement cell architecture from scratch, you need the following building blocks.
Cell architecture's main value is blast radius containment — but it introduces its own failure classes.
Cell architecture solves blast radius and noisy neighbour problems but introduces significant operational complexity. A Staff engineer must know not just how it works but when it's overkill.
| What you gain | What you pay |
|---|---|
| Blast radius ≤ 1 cell | Operational complexity × N cells |
| Independent deployability | Must maintain multi-version API compatibility |
| Noisy neighbour prevention | Lower resource utilisation per cell (idle headroom) |
| Tenant data residency | Cross-tenant queries become impossible or expensive |
| Progressive rollouts | Cell routing adds a hop + latency (1–5ms) |
| Enterprise dedicated cells | Tenant migration is a hard operational problem |
If your auth service, feature flag system, or notification queue is global, a failure there takes out all cells. The router and cell mapping store must themselves be extremely HA — they are your true global SPOFs.
If a feature needs data across all tenants (e.g. "show activity across all your org's sub-teams"), cells make this hard. You must replicate a summary to a global analytics store, do scatter-gather, or forbid the feature.
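A scatter-gather read across cells might look like the following sketch, assuming each cell exposes a query callable (`scatter_gather` and `cell_clients` are hypothetical names):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def scatter_gather(cell_clients: dict, query: str, timeout: float = 2.0) -> list:
    """Fan a read-only query out to every cell and merge the results.

    cell_clients maps cell name -> callable(query) returning a list of rows.
    A real implementation needs per-cell failure handling and partial-result
    semantics — one slow or down cell otherwise stalls the whole feature."""
    results = []
    with ThreadPoolExecutor(max_workers=len(cell_clients)) as pool:
        futures = {pool.submit(fn, query): name for name, fn in cell_clients.items()}
        for fut in as_completed(futures, timeout=timeout):
            results.extend(fut.result())
    return results
```

The cost is visible in the signature: latency is bounded by the slowest cell, which is why replicating a summary to a global analytics store is often preferred.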
If each cell has <100 tenants, you have hundreds of cells — each with its own DB, queue, cache. The operational burden becomes untenable without strong automation. Cells need to be large enough that the overhead amortises.
Good fit: B2B SaaS with clear tenant boundaries. Enterprise isolation. >100k tenants. Strict SLA tiers. Regulatory data residency.
Partial fit: B2C apps where cells map to geographic regions or user cohorts. Less clean cell boundaries — risk of cross-cell reads.
Poor fit: consumer apps with no tenant concept. Early-stage products. Teams without platform engineering maturity.
Staff-level questions — the kind you'll face at OpenAI, Anthropic, Databricks, or any company asking you to design a multi-tenant platform.