The Problem — Why This Is Hard
Start with a single database. Transactions just work. The moment you split data across two systems, everything you took for granted breaks. This chapter builds the intuition for why distributed transactions are fundamentally difficult.
The Single-Database World
In a single database, transactions are your superpower. You write BEGIN, do your work across multiple tables, and call COMMIT. Either all changes persist, or none do. The database guarantees atomicity — the A in ACID. Crash in the middle? The database rolls back. Network hiccup? The transaction either committed or it didn't. You never see half a transaction.
Consider a simple bank transfer: debit $100 from Account A, credit $100 to Account B. In a single database this is trivial — wrap it in a transaction and it's atomic. The balance never disappears and never doubles.
```sql
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 'account_A';
UPDATE accounts SET balance = balance + 100 WHERE id = 'account_B';
COMMIT;
-- Either both updates happen, or neither does. Always.
```
Now Split Across Two Systems
Modern systems rarely keep all data in one place. You might have Bank A's database in one service and Bank B's database in another — entirely separate systems that don't share a transaction log. How do you transfer money now?
The naive approach: call service A to debit, then call service B to credit. This works most of the time. But what happens when the process crashes between those two calls? Or when service B's network is momentarily unavailable?
These are the failure scenarios that keep engineers up at night:

- The process crashes after the debit but before the credit: $100 has left Account A and never arrives at Account B.
- Service B times out: did the credit happen? You don't know, and a blind retry risks crediting twice.
- Service B credits successfully but its response is lost: a retry double-credits the account.
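A minimal sketch makes the failure window concrete. The service stubs below (`AccountService`, `transfer`) are hypothetical in-memory stand-ins, not a real client API:

```python
# Sketch of the naive cross-service transfer. The gap between the two
# calls is the failure window: a crash or outage there loses money.
class Unavailable(Exception):
    pass

class AccountService:
    def __init__(self, balances):
        self.balances = balances
        self.fail_next = False  # simulate a momentary outage

    def debit(self, account, amount):
        if self.fail_next:
            raise Unavailable("service temporarily unreachable")
        self.balances[account] -= amount

    def credit(self, account, amount):
        if self.fail_next:
            raise Unavailable("service temporarily unreachable")
        self.balances[account] += amount

def transfer(service_a, service_b, amount):
    service_a.debit("account_A", amount)   # call 1 succeeds...
    service_b.credit("account_B", amount)  # ...call 2 may never happen

a = AccountService({"account_A": 500})
b = AccountService({"account_B": 0})
b.fail_next = True                         # service B goes down between the calls
try:
    transfer(a, b, 100)
except Unavailable:
    pass
# Money has left A but never arrived at B: $100 is simply gone.
print(a.balances["account_A"], b.balances["account_B"])  # 400 0
```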
The Two Solutions Engineers Reached For
The distributed systems community has spent decades on this problem. Two major approaches emerged, each making a different philosophical bet:

- Two-Phase Commit (2PC): preserve atomicity across systems with a coordinated commit protocol, at the cost of blocking when things fail.
- The Saga pattern: give up distributed atomicity, break the work into local transactions, and undo completed steps with compensating actions when a later step fails.
Neither is universally better. The right choice depends on your consistency requirements, your tolerance for complexity, and whether you can afford blocking. The chapters ahead explore both in depth.
Two-Phase Commit
2PC is the classical approach to distributed atomicity. A coordinator orchestrates two phases across all participants: first check that everyone can commit, then tell everyone to commit. Let's walk through the protocol step by step.
The Participants
2PC involves two roles. The Coordinator drives the protocol — it decides when to start, collects votes, and announces the final decision. The Participants (also called Resource Managers in XA) are the individual databases or services that will do the actual work. A typical 2PC setup might have one coordinator and two to five participants.
Phase 1 — Prepare (Vote)
The coordinator sends a PREPARE message to all participants. This is the most important phase to understand deeply.
When a participant receives PREPARE, it must decide: "Can I guarantee I will commit this transaction if asked to?" To answer yes, the participant must:

- Execute the transaction's work and acquire all necessary locks.
- Write the transaction's redo and undo records to its write-ahead log (WAL).
- Durably write a PREPARED record, so the promise survives a crash.
- Only then reply VOTE-YES. From this point on, it can no longer unilaterally abort.
Phase 2 — Commit or Abort
If the coordinator received VOTE-YES from every participant, it writes COMMIT to its own WAL and broadcasts COMMIT to all participants. Each participant commits the transaction, releases locks, and sends ACK.
If any participant voted NO (or didn't respond within a timeout), the coordinator writes ABORT to its WAL and broadcasts ABORT. Each participant rolls back using its undo log and releases locks.
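Both phases compress into a short sketch. This uses in-memory stand-ins, not a real XA implementation; `Participant` and `two_phase_commit` are illustrative names:

```python
# Minimal 2PC sketch: Phase 1 collects votes, Phase 2 broadcasts the
# decision. The decision is appended to a WAL stand-in before it is acted on.
class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "IDLE"

    def prepare(self):              # Phase 1: vote
        if self.can_commit:
            self.state = "PREPARED" # real system: WAL write + locks held
            return "VOTE-YES"
        self.state = "ABORTED"
        return "VOTE-NO"

    def commit(self):               # Phase 2a
        self.state = "COMMITTED"

    def abort(self):                # Phase 2b
        self.state = "ABORTED"

def two_phase_commit(participants, wal):
    votes = [p.prepare() for p in participants]      # Phase 1
    if all(v == "VOTE-YES" for v in votes):
        wal.append("COMMIT")                         # decision durable first
        for p in participants:
            p.commit()
        return "COMMITTED"
    wal.append("ABORT")                              # any NO (or timeout) aborts
    for p in participants:
        p.abort()
    return "ABORTED"

wal = []
ok = two_phase_commit([Participant("A"), Participant("B")], wal)
bad = two_phase_commit([Participant("A"), Participant("B", can_commit=False)], wal)
print(ok, bad, wal)  # COMMITTED ABORTED ['COMMIT', 'ABORT']
```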
The Role of the Write-Ahead Log
Every decision in 2PC is written to a WAL before it is acted upon. This is the durability guarantee. If the coordinator crashes after writing COMMIT to its log but before sending the message, it will re-read the log on recovery and re-send COMMIT. If it crashes before writing COMMIT, it will re-read the log on recovery, see only PREPARE, and send ABORT.
Participants that receive a COMMIT or ABORT and crash mid-application will also recover from their WAL — the PREPARED entry tells them what transaction is in flight, and they re-contact the coordinator to get the final verdict.
```sql
-- Coordinator side (simplified)
-- Phase 1: Prepare both participants
BEGIN;
-- execute work on participant 1...
PREPARE TRANSACTION 'txn-abc-001';  -- writes to WAL, holds locks

-- Phase 2a: If all voted yes
COMMIT PREPARED 'txn-abc-001';

-- Phase 2b: If any voted no
ROLLBACK PREPARED 'txn-abc-001';

-- Check for stuck prepared transactions
SELECT * FROM pg_prepared_xacts;  -- should usually be empty!
```
Why 2PC Fails — The Blocking Problem
2PC works beautifully when nothing goes wrong. But distributed systems are defined by the things that go wrong. The coordinator is a single point of failure that can leave your entire system frozen — holding locks, unable to proceed.
The Uncertainty Window
The dangerous window in 2PC is the gap between Phase 1 completion and Phase 2 delivery. Once all participants have voted YES, they are in the PREPARED state — they must commit or rollback based on the coordinator's decision, and they cannot decide on their own.
Now imagine the coordinator crashes right after collecting all votes but before sending the Phase 2 message. What happens?
Crash Scenarios

- Coordinator crashes before sending PREPARE: participants time out and abort. Safe.
- A participant crashes before voting: the coordinator times out, treats the silence as VOTE-NO, and aborts. Safe.
- Coordinator crashes after collecting all YES votes but before broadcasting the decision: every participant is stuck in PREPARED, holding locks, unable to commit or abort on its own. This is the blocking scenario.
- Coordinator crashes after writing COMMIT to its WAL but before broadcasting: recovery re-reads the log and re-sends COMMIT.
Coordinator Recovery
When the coordinator restarts, it re-reads its WAL to determine what decisions it had reached. This is why the coordinator must write its final decision to durable storage before sending Phase 2 messages.
The recovery logic follows a simple rule: if the WAL says COMMIT, re-send COMMIT to all participants. If the WAL says ABORT or has no record of Phase 2, re-send ABORT. Participants that receive a re-sent message check their own WAL — if they already completed the transaction, they simply send ACK and move on.
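The recovery rule fits in a few lines. This is a sketch; real recovery replays per-transaction WAL records rather than a flat list:

```python
# Coordinator recovery rule: re-send the last durable decision.
# Absent a durable Phase 2 record, the only safe answer is ABORT.
def recover_decision(wal_entries):
    """wal_entries: ordered records for one transaction, e.g. ['PREPARE', 'COMMIT']."""
    if "COMMIT" in wal_entries:
        return "COMMIT"   # decision was durable: re-send COMMIT
    return "ABORT"        # explicit ABORT, or no Phase 2 record at all

print(recover_decision(["PREPARE", "COMMIT"]))  # COMMIT
print(recover_decision(["PREPARE"]))            # ABORT
print(recover_decision(["PREPARE", "ABORT"]))   # ABORT
```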
Three-Phase Commit (3PC) — The Theoretical Fix
3PC adds a PRE-COMMIT phase between the vote and the final commit. The idea: before sending COMMIT, the coordinator tells participants "I'm about to commit." Participants acknowledge. Only then does the coordinator send the actual COMMIT.
This allows participants to make a safe assumption if the coordinator goes silent after PRE-COMMIT: if every participant has received PRE-COMMIT, they can safely commit on their own. If any participant has not received PRE-COMMIT, they can safely abort.
In practice, 3PC's extra round trip and its inability to tolerate network partitions (as opposed to clean crashes) kept it out of mainstream use, and the industry moved away from 2PC and 3PC entirely for most use cases. The Saga pattern — which accepts that distributed atomicity is too expensive — became the dominant approach. 2PC is still used, but only in narrow circumstances.
When 2PC is Actually Fine
Despite its problems, 2PC is entirely reasonable in several scenarios:
- All participants live in the same data center on a low-latency network.
- Transactions are short-lived and involve a small number of participants.
- The resources support XA (or, in PostgreSQL, PREPARE TRANSACTION) and you can run a reliable, monitored coordinator.

In those settings, PREPARE TRANSACTION + a reliable coordinator is a legitimate solution.

The Saga Pattern
Instead of one distributed atomic transaction, design a sequence of local transactions — each one with a compensating action that can undo it. If any step fails, run the compensations backwards. No coordinator that can block. No distributed locks.
Compensating Transactions
Every step in a Saga must have a compensating transaction — an operation that semantically reverses the effect of the forward step. Compensations are not database rollbacks. They are new, positive forward-moving operations that undo the business effect.
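The rule "run the compensations of completed steps in reverse order" can be sketched as a generic runner. This is a minimal sketch with stand-in step functions; a real system would call remote services and persist progress:

```python
# Generic Saga runner: each step is a (forward, compensate) pair.
# On failure, completed steps are compensated in reverse order.
def run_saga(steps):
    done = []
    for forward, compensate in steps:
        try:
            forward()
            done.append(compensate)
        except Exception:
            for comp in reversed(done):
                comp()               # undo the business effect, newest first
            return "CANCELLED"
    return "CONFIRMED"

log = []

def book_shipping():
    raise RuntimeError("no courier available")   # simulate the last step failing

steps = [
    (lambda: log.append("order PENDING"),   lambda: log.append("order CANCELLED")),
    (lambda: log.append("stock reserved"),  lambda: log.append("stock released")),
    (lambda: log.append("payment charged"), lambda: log.append("payment refunded")),
    (book_shipping,                         lambda: log.append("shipment cancelled")),
]
outcome = run_saga(steps)
print(outcome)  # CANCELLED
print(log)      # forward steps, then compensations in reverse
```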
Worked Example — E-Commerce Order
An order involves four separate services, each with its own database. Here is the full happy path and the compensation path for each step:
1. Order service: create the order in state PENDING. Assign order ID. Write to local DB — single, atomic, local transaction. Compensation: mark the order CANCELLED.
2. Inventory service: reserve the stock for the order. Compensation: release the reservation.
3. Payment service: charge the customer. Compensation: issue a refund.
4. Shipping service: book the shipment. Compensation: cancel the booking.
5. Order service: mark the order CONFIRMED. Saga is complete — no more compensations needed.

Choreography vs Orchestration
There are two ways to implement a Saga: either services talk to each other directly via events (choreography), or a central coordinator tells each service what to do next (orchestration). Both implement the same Saga — they differ in who knows about the overall flow.
Choreography: Order created → Inventory listens → reserves stock → publishes StockReserved → Payment listens → charges → publishes PaymentCharged → Shipping listens → books shipment.
Pros: No central brain, services are loosely coupled, each owns its logic.
Cons: Hard to see the overall flow, hard to debug, cyclic dependencies can emerge.
Orchestrator: call Inventory → wait → call Payment → wait → call Shipping → done.
Pros: The flow is visible and owned, easier to debug, compensations are explicit.
Cons: Orchestrator becomes a dependency, can become a God service.
Saga Deep Dive — Isolation & Pitfalls
Sagas trade atomicity for availability. The hidden cost is isolation — or rather, the complete lack of it. Two concurrent Sagas can see each other's intermediate state, leading to subtle bugs that are hard to reproduce. Here's what that looks like and how to defend against it.
The Isolation Problem
In a normal database transaction, isolation means your in-progress work is invisible to other transactions. In a Saga, each local transaction commits immediately — its effects are visible to the entire world before the Saga is complete.
Consider two customers ordering the last item in stock simultaneously. Saga A reserves the item (step 2) and is about to charge payment. Saga B also reads the inventory and sees... 1 item available. Both proceed. One will eventually fail at shipping, but not before both have charged payment.
Dirty Read: Saga B acts on Saga A's committed-but-intermediate state — exactly the double-sell scenario above.

Lost Update: Saga A reads a value, Saga B reads the same value, both update it based on what they read — one update is lost.
Non-repeatable Read: Saga A reads a value, Saga B updates it, Saga A reads it again — sees a different value mid-execution.
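A few lines make the lost update concrete. A plain Python dict stands in for the inventory table:

```python
# Lost update: two interleaved sagas each read the stock, then each
# write an absolute value computed from its stale read.
stock = {"item": 5}

read_a = stock["item"]      # Saga A reads 5
read_b = stock["item"]      # Saga B reads 5 (A hasn't written yet)
stock["item"] = read_a - 1  # Saga A writes 4
stock["item"] = read_b - 1  # Saga B also writes 4: A's decrement is lost
print(stock["item"])        # 4, even though two units were sold
```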
Countermeasures
Semantic lock: When a Saga starts modifying a resource, mark it as in-flight with a sentinel flag. Other transactions that encounter this flag must either wait, fail fast, or handle the pending state explicitly.
```sql
-- Step 1: Mark item as pending (semantic lock)
UPDATE inventory
SET status = 'RESERVING', saga_id = 'saga-123'
WHERE item_id = 'item-456'
  AND status = 'AVAILABLE';  -- optimistic: only if still available

-- Other sagas see status='RESERVING' and back off
-- On completion:   UPDATE status='RESERVED'
-- On compensation: UPDATE status='AVAILABLE', saga_id=NULL
```
Pivot transaction: Identify the point of no return in your Saga — the step after which you are committed to moving forward (no more compensations). Design around this: make steps before the pivot compensatable, steps after the pivot retryable. For the order example, the pivot is payment being charged — you don't cancel after collecting money, you only refund.
Commutative updates: Design updates so that order doesn't matter. Instead of SET stock = 5 (absolute), use SET stock = stock - 1 (relative). Two concurrent decrements both succeed without one overwriting the other. This eliminates lost updates for many inventory and balance operations.
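For contrast with the lost-update anomaly, the same interleaving with relative updates keeps both decrements (again a plain dict standing in for the inventory row):

```python
# Commutative (relative) updates: each saga applies a delta instead of
# writing an absolute value, so interleaving cannot lose an update.
stock = {"item": 5}
stock["item"] -= 1   # Saga A: SET stock = stock - 1
stock["item"] -= 1   # Saga B: SET stock = stock - 1
print(stock["item"]) # 3: both decrements survive
```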
Idempotency: Network retries are normal in distributed systems. Every Saga step must be safely retryable — calling it twice must produce the same result as calling it once. The standard technique is an idempotency key: a unique ID included with every request. The receiver checks if it has already processed this ID. If yes, return the cached result without re-executing.
```sql
-- Idempotency table: deduplication at DB level
CREATE TABLE idempotency_keys (
    key        TEXT PRIMARY KEY,  -- e.g. 'saga-123:charge-payment'
    result     JSONB,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

-- On every payment request:
INSERT INTO idempotency_keys (key, result)
VALUES ('saga-123:charge-payment', do_charge())
ON CONFLICT (key) DO NOTHING;
-- If key exists, SELECT result → return cached. Never charge twice.
```
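The same deduplication logic, sketched at the application level. An in-memory dict stands in for the idempotency_keys table, and `charge_payment` is an illustrative name:

```python
# Idempotency-key sketch: cache results by key so a retried request
# returns the original result instead of executing the side effect twice.
results = {}   # stands in for the idempotency_keys table
charges = []   # the real side effect: actual charges made

def charge_payment(key, amount):
    if key in results:              # already processed: return cached result
        return results[key]
    charges.append(amount)          # side effect happens exactly once
    results[key] = {"status": "CHARGED", "amount": amount}
    return results[key]

first = charge_payment("saga-123:charge-payment", 100)
retry = charge_payment("saga-123:charge-payment", 100)  # network retry
print(len(charges), first == retry)  # 1 True
```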
The Orchestrator as a Durable State Machine
An orchestration-based Saga requires the orchestrator to be durable. If the orchestrator crashes mid-saga, it must be able to resume from exactly where it left off. This means the orchestrator's state must be persisted to a database at every step transition.
```sql
CREATE TABLE sagas (
    id         UUID PRIMARY KEY,
    type       TEXT,         -- 'ORDER_SAGA'
    state      TEXT,         -- current step name
    status     TEXT,         -- RUNNING | COMPENSATING | DONE | FAILED
    context    JSONB,        -- all data the saga needs
    updated_at TIMESTAMPTZ
);

-- State transitions for ORDER_SAGA:
--   PENDING → INVENTORY_RESERVING → PAYMENT_CHARGING → SHIPPING_BOOKING → DONE
-- Failure path (compensating):
--   SHIPPING_FAILED → PAYMENT_REFUNDING → INVENTORY_RELEASING → CANCELLED
```
Each step: (1) load saga state from DB, (2) call the service, (3) on success, update state to next step, (4) on failure, update state to the compensation path. Because each transition is an atomic local DB write, crashes are safe — the orchestrator restarts and resumes from the last saved state.
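That loop can be sketched as follows. The transition tables and function names are illustrative, and `save_state` stands in for the atomic UPDATE of the sagas table:

```python
# Orchestrator step loop: call the service, then persist the next state
# (or the compensation state) as one atomic local write.
TRANSITIONS = {
    "PENDING": "INVENTORY_RESERVING",
    "INVENTORY_RESERVING": "PAYMENT_CHARGING",
    "PAYMENT_CHARGING": "SHIPPING_BOOKING",
    "SHIPPING_BOOKING": "DONE",
}
COMPENSATION = {
    "SHIPPING_BOOKING": "PAYMENT_REFUNDING",
    "PAYMENT_CHARGING": "INVENTORY_RELEASING",
    "INVENTORY_RESERVING": "CANCELLED",
}

def advance(saga, call_service, save_state):
    try:
        call_service(saga["state"])                    # (2) call the service
        saga["state"] = TRANSITIONS[saga["state"]]     # (3) move to next step
        saga["status"] = "DONE" if saga["state"] == "DONE" else "RUNNING"
    except Exception:
        saga["state"] = COMPENSATION[saga["state"]]    # (4) compensation path
        saga["status"] = "COMPENSATING"
    save_state(saga)  # atomic local DB write: crash-safe resume point

saga = {"id": "saga-1", "state": "PENDING", "status": "RUNNING"}
persisted = []
while saga["status"] == "RUNNING":                     # (1) loop on loaded state
    advance(saga, call_service=lambda step: None, save_state=persisted.append)
print(saga["state"], len(persisted))  # DONE 4
```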
The Outbox Pattern
Choreography-based Sagas depend on reliable event publishing. But publishing to a message broker and committing to a database are two separate operations — one can fail without the other. The Outbox pattern solves this with a simple but powerful trick.
The Dual-Write Problem
The naive implementation of choreography looks like this: commit your business data, then publish an event to Kafka. Two separate operations. What can go wrong?

- The commit succeeds, then the process crashes before publishing: the order exists but no event is ever emitted, so downstream services never react.
- The publish succeeds, then the commit fails: downstream services react to an order that doesn't exist.
- Flipping the order (publish first, commit second) just flips the failure mode. No ordering makes two separate operations atomic.
The Outbox Solution
The fix is elegant: add an outbox table to your database. When your business logic runs, write both the business data AND the event message to this outbox table, in the same local transaction. Local transactions are atomic. If the commit succeeds, both the data and the outbox row exist. If it fails, neither does.
A separate relay process (called a message relay or outbox poller) reads unpublished rows from the outbox table and publishes them to Kafka. On success, it marks the row as published. This publish step can be retried indefinitely — the consumer must be idempotent to handle at-least-once delivery.
```sql
CREATE TABLE outbox (
    id           UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    aggregate_id TEXT,     -- e.g. order_id
    event_type   TEXT,     -- e.g. 'OrderCreated'
    payload      JSONB,
    published    BOOLEAN DEFAULT FALSE,
    created_at   TIMESTAMPTZ DEFAULT NOW()
);

-- Application writes both atomically:
BEGIN;
INSERT INTO orders (id, customer_id, total) VALUES (...);
INSERT INTO outbox (aggregate_id, event_type, payload)
VALUES ('order-123', 'OrderCreated', '{"order_id":"order-123",...}');
COMMIT;  -- both succeed or both fail. Always.
```
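A minimal sketch of the polling relay described above. In-memory rows stand in for the outbox table, and `publish` stands in for the Kafka producer:

```python
# Polling relay: read unpublished rows, publish each, then mark it published.
# Delivery is at-least-once: a crash between publish and mark re-publishes
# the row on the next pass, so consumers must be idempotent.
def relay_once(outbox_rows, publish):
    for row in outbox_rows:
        if not row["published"]:
            publish(row["event_type"], row["payload"])
            row["published"] = True

outbox = [
    {"event_type": "OrderCreated",
     "payload": {"order_id": "order-123"},
     "published": False},
]
delivered = []
relay_once(outbox, publish=lambda t, p: delivered.append(t))
relay_once(outbox, publish=lambda t, p: delivered.append(t))  # already marked: no-op
print(delivered)  # ['OrderCreated']
```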
CDC-Based Relay with Debezium
Instead of polling the outbox table, a more elegant approach is Change Data Capture (CDC). Debezium tails the PostgreSQL WAL and publishes any INSERT to the outbox table directly to Kafka — without any application-level polling code. The WAL is already the durable log of what happened; Debezium just reads it.
Polling relay: query the outbox with WHERE published = false on an interval. Works well at low-to-medium volume. Adds DB load from polling. Risk of delay between commit and publish.

CDC relay: Debezium reads events from the WAL, so there are no polling queries, no extra read load, and lower latency — at the cost of operating the Debezium connector infrastructure.

When to Use What
2PC, Saga, or just eventual consistency — the choice is never about which is "better." It's about matching the mechanism to your consistency requirements, operational tolerance, and business domain. Here's the decision framework.
The Full Comparison
| Dimension | 2PC | Saga (Choreography) | Saga (Orchestration) | Eventual Only |
|---|---|---|---|---|
| Atomicity | Strong | Eventual | Eventual | None |
| Isolation | Full ACID | None (ACD) | None (ACD) | None |
| Availability | Blocking on failure | Non-blocking | Non-blocking | Always available |
| Latency | 2 round trips minimum | Async, variable | Async, variable | Single write |
| Observability | Synchronous, clear | Scattered events | Central state machine | Simple |
| Complexity | Protocol complexity | Compensation logic | Orchestrator + compensation | Simple |
| Best for | Same DC, small txns, XA | Loose coupling, simple flows | Complex flows, visibility | Likes, counts, feeds |
Decision Guide

- Need true atomicity across a handful of co-located, XA-capable databases? Use 2PC.
- Long-running business flow across services that must fully complete or be undone? Use a Saga — orchestrated if the flow is complex, choreographed if loose coupling matters most.
- Publishing events as part of a local transaction? Add the Outbox pattern.
- Can the business tolerate a short inconsistency window? Prefer plain eventual consistency and skip the machinery.
Real-World Examples

- Moving money between two internal ledger databases in the same data center: 2PC over XA.
- An order spanning inventory, payment, and shipping services: an orchestrated Saga, as in this chapter's worked example.
- Loosely coupled microservices reacting to domain events: choreographed Sagas with an outbox per service.
- Like counts, view counters, activity feeds: plain eventual consistency.
The One-Sentence Summary
Use 2PC when participants are few, co-located, and XA-capable, and you can run a reliable coordinator.
Use Saga + Orchestration when the flow is complex, cross-team, and you need visibility.
Use Saga + Choreography + Outbox when services are loosely coupled and teams need autonomy.
Use nothing when the business domain naturally tolerates eventual consistency — which is more often than engineers assume.