Chapter 00 · Foundation

The Problem —
Why This Is Hard

Start with a single database. Transactions just work. The moment you split data across two systems, everything you took for granted breaks. This chapter builds the intuition for why distributed transactions are fundamentally difficult.

Key terms: ACID · Atomicity · Distributed Systems · Failure Modes
Mental Model
"A transaction across two databases is like two people simultaneously signing two different contracts in different rooms — you can't make both signatures happen at exactly the same instant."

The Single-Database World

In a single database, transactions are your superpower. You write BEGIN, do your work across multiple tables, and call COMMIT. Either all changes persist, or none do. The database guarantees atomicity — the A in ACID. Crash in the middle? The database rolls back. Network hiccup? The transaction either committed or it didn't. You never see half a transaction.

Consider a simple bank transfer: debit $100 from Account A, credit $100 to Account B. In a single database this is trivial — wrap it in a transaction and it's atomic. The balance never disappears and never doubles.

SQL — Safe in a single database
BEGIN;
  UPDATE accounts SET balance = balance - 100 WHERE id = 'account_A';
  UPDATE accounts SET balance = balance + 100 WHERE id = 'account_B';
COMMIT;
-- Either both updates happen, or neither does. Always.

Now Split Across Two Systems

Modern systems rarely keep all data in one place. You might have Bank A's database in one service and Bank B's database in another — entirely separate systems that don't share a transaction log. How do you transfer money now?

The naive approach: call service A to debit, then call service B to credit. This works most of the time. But what happens when the process crashes between those two calls? Or when service B's network is momentarily unavailable?

The Core Problem
There is no shared "transaction manager" watching both systems. You cannot atomically commit to two independent processes. The instant you have two separate systems, you have introduced the possibility of partial failure — one system succeeds and the other fails.

These are the failure scenarios that keep engineers up at night:

1. Crash after debit, before credit: Account A loses $100. Account B never receives it. Money disappears from the system.
2. Network timeout on the credit call: Did service B receive the request or not? You don't know. Retrying blindly might credit twice. Not retrying risks losing the money.
3. Service B fails mid-write: Service B started writing but crashed. The debit on A is committed. The credit on B is in an unknown state.
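The first failure scenario can be made concrete with a small simulation. This is an illustrative sketch, not real service code — the in-memory dicts stand in for two independent databases, and the injected `Crash` exception models a process dying between the two calls:

```python
# Naive distributed transfer: two independent "services", no shared
# transaction. A crash between the calls makes money vanish.
balances_a = {"account_A": 500}   # service A's database
balances_b = {"account_B": 200}   # service B's database

class Crash(Exception):
    pass

def transfer(amount, crash_between=False):
    balances_a["account_A"] -= amount          # local commit on service A
    if crash_between:
        raise Crash("process died after debit, before credit")
    balances_b["account_B"] += amount          # local commit on service B

try:
    transfer(100, crash_between=True)
except Crash:
    pass  # in real life: the process is simply gone

total = balances_a["account_A"] + balances_b["account_B"]
print(total)  # 600 — the system started with 700; $100 has vanished
```

The debit committed locally on A, the credit never ran on B, and no rollback exists that spans both: exactly scenario 1 above.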

The Two Solutions Engineers Reached For

The distributed systems community has spent decades on this problem. Two major approaches emerged, each making a different philosophical bet:

Two-Phase Commit (2PC)
Try to preserve ACID atomicity across distributed systems. A coordinator orchestrates a two-phase protocol — all participants promise to commit before anyone actually does. Strong guarantee, but the coordinator becomes a single point of failure that can leave the system blocked.
The Saga Pattern
Give up on distributed atomicity. Instead, break the transaction into a sequence of local transactions, each with a compensating action that undoes it. If something fails, run the compensations backwards. Resilient and scalable, but you lose the I in ACID — isolation.

Neither is universally better. The right choice depends on your consistency requirements, your tolerance for complexity, and whether you can afford blocking. The chapters ahead explore both in depth.

Chapter 01 · Protocol

Two-Phase Commit

2PC is the classical approach to distributed atomicity. A coordinator orchestrates two phases across all participants: first check that everyone can commit, then tell everyone to commit. This chapter walks through the protocol step by step.

Key terms: 2PC · XA Transactions · Coordinator · Prepared State · WAL
Mental Model
"2PC is like a wedding ceremony. The officiant (coordinator) asks everyone 'do you all agree?' before pronouncing anything. But if the officiant faints after hearing 'I do' from everyone but before saying 'you're married' — everyone is stuck waiting forever."

The Participants

2PC involves two roles. The Coordinator drives the protocol — it decides when to start, collects votes, and announces the final decision. The Participants (also called Resource Managers in XA) are the individual databases or services that will do the actual work. A typical 2PC setup might have one coordinator and two to five participants.

Real-world context
In practice, 2PC is baked into database protocols. XA (eXtended Architecture) is the standard interface — Java's JTA uses it, and databases like PostgreSQL, MySQL, and Oracle implement XA. The application server acts as coordinator. Most engineers never write 2PC from scratch — they configure it.

Phase 1 — Prepare (Vote)

The coordinator sends a PREPARE message to all participants. This is the most important phase to understand deeply.

When a participant receives PREPARE, it must decide: "Can I guarantee I will commit this transaction if asked to?" To answer yes, the participant must:

1. Execute all the work — run the SQL, validate constraints, acquire locks. The actual database mutations happen here.
2. Write to disk (WAL) — flush the undo log and redo log to durable storage. This is critical: if the participant crashes and recovers, it must be able to complete the commit or rollback from this log entry.
3. Hold the locks — the participant keeps all row/table locks acquired during execution. Other transactions cannot touch these rows until 2PC completes.
4. Reply VOTE-YES or VOTE-NO — if everything succeeded, send VOTE-YES. If anything failed (constraint violation, disk full, etc.), send VOTE-NO.
What "Prepared" means on disk
A participant in the PREPARED state has made an irrevocable promise. The work is done. The logs are flushed. The locks are held. The only thing missing is the coordinator's final verdict. This state can survive a crash — on restart, the participant checks its WAL, finds the prepared transaction, and re-contacts the coordinator to ask "what did you decide?"

Phase 2 — Commit or Abort

If the coordinator received VOTE-YES from every participant, it writes COMMIT to its own WAL and broadcasts COMMIT to all participants. Each participant commits the transaction, releases locks, and sends ACK.

If any participant voted NO (or didn't respond within a timeout), the coordinator writes ABORT to its WAL and broadcasts ABORT. Each participant rolls back using its undo log and releases locks.
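The two phases reduce to a small coordinator loop: collect votes, decide, broadcast. A minimal in-memory sketch — the `Participant` class, state names, and vote strings are invented for illustration, not a wire protocol:

```python
# Minimal 2PC coordinator: unanimous VOTE-YES → COMMIT, anything else → ABORT.
class Participant:
    def __init__(self, name, will_succeed=True):
        self.name = name
        self.will_succeed = will_succeed
        self.state = "IDLE"

    def prepare(self):
        # Phase 1: do the work, flush the WAL, hold locks — then promise.
        if self.will_succeed:
            self.state = "PREPARED"
            return "VOTE-YES"
        self.state = "ABORTED"
        return "VOTE-NO"

    def finish(self, decision):
        # Phase 2: act on the coordinator's verdict, release locks.
        self.state = "COMMITTED" if decision == "COMMIT" else "ABORTED"

def run_2pc(participants):
    votes = [p.prepare() for p in participants]                    # Phase 1
    decision = "COMMIT" if all(v == "VOTE-YES" for v in votes) else "ABORT"
    # A real coordinator writes `decision` to its own WAL here, THEN broadcasts.
    for p in participants:
        p.finish(decision)                                         # Phase 2
    return decision

ok = run_2pc([Participant("db1"), Participant("db2")])
bad = run_2pc([Participant("db1"), Participant("db2", will_succeed=False)])
print(ok, bad)  # COMMIT ABORT
```

Note the asymmetry: one NO vote is enough to abort everyone, but commit requires unanimity — that is the whole protocol.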


The Role of the Write-Ahead Log

Every decision in 2PC is written to a WAL before it is acted upon. This is the durability guarantee. If the coordinator crashes after writing COMMIT to its log but before sending the message, it will re-read the log on recovery and re-send COMMIT. If it crashes before writing COMMIT, it will re-read the log on recovery, see only PREPARE, and send ABORT.

Participants that receive a COMMIT or ABORT and crash mid-application will also recover from their WAL — the PREPARED entry tells them what transaction is in flight, and they re-contact the coordinator to get the final verdict.

PostgreSQL — Distributed 2PC via XA
-- Commands the coordinator issues over each participant's connection (simplified)
-- Phase 1: execute the work, then prepare (repeated on each participant)
BEGIN;
  -- execute work on participant 1...
PREPARE TRANSACTION 'txn-abc-001';  -- writes to WAL, holds locks

-- Phase 2a: If all voted yes
COMMIT PREPARED 'txn-abc-001';

-- Phase 2b: If any voted no
ROLLBACK PREPARED 'txn-abc-001';

-- Check for stuck prepared transactions
SELECT * FROM pg_prepared_xacts;  -- should usually be empty!
Chapter 02 · Failure Analysis

Why 2PC Fails —
The Blocking Problem

2PC works beautifully when nothing goes wrong. But distributed systems are defined by the things that go wrong. The coordinator is a single point of failure that can leave your entire system frozen — holding locks, unable to proceed.

Key terms: Blocking · Coordinator Failure · 3PC · Uncertainty Window
Mental Model
"Participants in the PREPARED state are like soldiers awaiting orders. They have loaded their weapons, are ready to fire — but they cannot act without the order. If the commanding officer (coordinator) goes silent, they are stuck indefinitely. They cannot stand down because they promised to fire."

The Uncertainty Window

The dangerous window in 2PC is the gap between Phase 1 completion and Phase 2 delivery. Once all participants have voted YES, they are in the PREPARED state — they must commit or rollback based on the coordinator's decision, and they cannot decide on their own.

Now imagine the coordinator crashes right after collecting all votes but before sending the Phase 2 message. What happens?

The Blocking Problem
All participants are stuck holding their locks indefinitely. They cannot commit (they haven't received COMMIT). They cannot rollback (they voted YES — doing so unilaterally would violate the protocol). They are blocked. Every row they locked is inaccessible to other transactions. The system is partially frozen until the coordinator recovers.

Crash Scenarios

Coordinator Failure — What Actually Happens

Coordinator Recovery

When the coordinator restarts, it re-reads its WAL to determine what decisions it had reached. This is why the coordinator must write its final decision to durable storage before sending Phase 2 messages.

The recovery logic follows a simple rule: if the WAL says COMMIT, re-send COMMIT to all participants. If the WAL says ABORT, or contains no decision record at all, send ABORT. Participants that receive a re-sent message check their own WAL — if they already completed the transaction, they simply send ACK and move on.
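The recovery rule reduces to a single lookup against the coordinator's own log. A sketch, with the WAL modeled as a plain list of record strings (purely illustrative):

```python
# Coordinator recovery: the WAL is authoritative. A COMMIT record means
# the decision was durable before any Phase 2 message went out. No
# decision record means the decision never became durable, so the only
# safe choice is ABORT.
def recover(wal_records):
    if "COMMIT" in wal_records:
        return "COMMIT"   # decision was durable — re-send COMMIT
    return "ABORT"        # only PREPARE (or ABORT) logged — send ABORT

assert recover(["PREPARE", "COMMIT"]) == "COMMIT"
assert recover(["PREPARE"]) == "ABORT"
assert recover(["PREPARE", "ABORT"]) == "ABORT"
```

This is why the write-ordering rule in Chapter 01 matters: decision to WAL first, messages second — otherwise recovery could contradict a COMMIT some participant already received.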

The catch
Recovery only works if the coordinator eventually comes back. If the coordinator is permanently lost (hardware destroyed, data corrupted), participants can be stuck forever. Some systems implement a timeout after which participants can try to contact each other to see if anyone knows the verdict — but this is complex and still not guaranteed to resolve.

Three-Phase Commit (3PC) — The Theoretical Fix

3PC adds a PRE-COMMIT phase between the vote and the final commit. The idea: before sending COMMIT, the coordinator tells participants "I'm about to commit." Participants acknowledge. Only then does the coordinator send the actual COMMIT.

This allows participants to make a safe assumption if the coordinator goes silent after PRE-COMMIT: if every participant has received PRE-COMMIT, they can safely commit on their own. If any participant has not received PRE-COMMIT, they can safely abort.

3PC advantage
Non-blocking in the failure case. If the coordinator dies after PRE-COMMIT, participants can reach a decision without waiting for coordinator recovery.
Why nobody uses it
3PC assumes a synchronous network — no indefinite delays. In real distributed systems (subject to network partitions), 3PC can still block or produce inconsistent results. It also adds latency and complexity for a problem most teams solve differently.

In practice, the industry moved away from 2PC and 3PC entirely for most use cases. The Saga pattern — which accepts that distributed atomicity is too expensive — became the dominant approach. 2PC is still used, but only in narrow circumstances.

When 2PC is Actually Fine

Despite its problems, 2PC is entirely reasonable in several scenarios:

Same datacenter, low latency: The coordinator crash window is very small when RTT is under 1ms. The blocking duration is correspondingly short.
Small number of participants: 2–3 participants, short-lived transactions. Lock contention stays bounded.
When you need synchronous confirmation: You need to know, right now, that all systems committed. Saga's eventual nature is unacceptable.
Database-level XA: Many teams use XA to span two databases in the same datacenter. PostgreSQL's PREPARE TRANSACTION + a reliable coordinator is a legitimate solution.
Chapter 03 · Pattern

The Saga Pattern

Instead of one distributed atomic transaction, design a sequence of local transactions — each one with a compensating action that can undo it. If any step fails, run the compensations backwards. No coordinator that can block. No distributed locks.

Key terms: Saga · Choreography · Orchestration · Compensating Transactions · Eventual Consistency
Mental Model
"A Saga is like booking a trip: you book the flight, then the hotel, then the rental car. If the rental car is unavailable, you cancel the hotel and the flight. Each booking has a cancellation. You never held all three in a single atomic reservation — but you can always undo the steps you completed."

Compensating Transactions

Every step in a Saga must have a compensating transaction — an operation that semantically reverses the effect of the forward step. Compensations are not database rollbacks. They are new, positive forward-moving operations that undo the business effect.

Compensation is not rollback
A database rollback erases a transaction as if it never happened. A compensation acknowledges the original transaction happened and creates a new record that reverses the business effect. This distinction matters for audit logs, accounting ledgers, and anything that needs a complete history.

Worked Example — E-Commerce Order

An order involves four separate services, each with its own database. Here is the full happy path and the compensation path for each step:

1. Order Service — Create Order
Insert order record with status PENDING. Assign order ID. Write to local DB — single, atomic, local transaction.
Compensation: UPDATE orders SET status='CANCELLED' WHERE id=?
2. Inventory Service — Reserve Items
Decrement available stock. Write reservation record. If stock is insufficient, return failure — compensation of step 1 is triggered.
Compensation: Increment stock back, delete reservation record
3. Payment Service — Charge Customer
Call payment processor. Record transaction. If payment fails, compensate steps 2 and 1 in reverse order.
Compensation: Issue refund via payment processor, mark payment as refunded
4. Shipping Service — Book Shipment
Create shipment record. Notify warehouse. Update order status to CONFIRMED. Saga is complete — no more compensations needed.
Compensation: Cancel shipment booking, restock warehouse queue
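The forward steps and their reverse-order compensations can be sketched as a tiny executor. The step names mirror the order example above; the failure injection and no-op actions are invented for the sketch:

```python
# Saga executor: run steps in order; on a failure, run the compensations
# of the steps that already completed, in reverse order.
def run_saga(steps, fail_at=None):
    done, log = [], []
    for name, action, compensate in steps:
        if name == fail_at:
            log.append(f"{name}: FAILED")
            for cname, _, comp in reversed(done):   # unwind completed steps
                comp()
                log.append(f"compensate {cname}")
            return log
        action()
        done.append((name, action, compensate))
        log.append(name)
    return log

noop = lambda: None   # stands in for the real service call / compensation
steps = [(n, noop, noop) for n in
         ["create_order", "reserve_items", "charge_payment", "book_shipment"]]

print(run_saga(steps))
# ['create_order', 'reserve_items', 'charge_payment', 'book_shipment']
print(run_saga(steps, fail_at="book_shipment"))
# ['create_order', 'reserve_items', 'charge_payment', 'book_shipment: FAILED',
#  'compensate charge_payment', 'compensate reserve_items', 'compensate create_order']
```

Note that each forward action commits locally before the next runs — this is precisely why the compensation list is needed, and why (as Chapter 04 covers) intermediate states are visible to everyone.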

Choreography vs Orchestration

There are two ways to implement a Saga: either services talk to each other directly via events (choreography), or a central coordinator tells each service what to do next (orchestration). Both implement the same Saga — they differ in who knows about the overall flow.

Choreography — Event-Driven
Each service listens for events and publishes its own.

Order created → Inventory listens → reserves stock → publishes StockReserved → Payment listens → charges → publishes PaymentCharged → Shipping listens → books shipment.

Pros: No central brain, services are loosely coupled, each owns its logic.
Cons: Hard to see the overall flow, hard to debug, cyclic dependencies can emerge.
Orchestration — State Machine
A central Saga Orchestrator service drives each step by calling services and reacting to their responses.

Orchestrator: call Inventory → wait → call Payment → wait → call Shipping → done.

Pros: The flow is visible and owned, easier to debug, compensations are explicit.
Cons: Orchestrator becomes a dependency, can become a God service.
Chapter 04 · Deep Dive

Saga Deep Dive —
Isolation & Pitfalls

Sagas trade atomicity for availability. The hidden cost is isolation — or rather, the complete lack of it. Two concurrent Sagas can see each other's intermediate state, leading to subtle bugs that are hard to reproduce. Here's what that looks like and how to defend against it.

Key terms: ACD · Lost Update · Dirty Read · Semantic Locks · Idempotency
Mental Model
"A Saga is ACD — Atomic (eventually), Consistent, Durable — but deliberately NOT Isolated. The I in ACID is the one property distributed systems can't afford. Acknowledging this explicitly is the first step to building safe Sagas."

The Isolation Problem

In a normal database transaction, isolation means your in-progress work is invisible to other transactions. In a Saga, each local transaction commits immediately — its effects are visible to the entire world before the Saga is complete.

Consider two customers ordering the last item in stock simultaneously. Saga A reserves the item (step 2) and is about to charge payment. Saga B also reads the inventory and sees... 1 item available. Both proceed. One will eventually fail at shipping, but not before both have charged payment.

Anomalies Sagas Can Produce
Dirty Read: Saga B reads data written by Saga A that hasn't completed yet. If A compensates, B has read data that "never happened."

Lost Update: Saga A reads a value, Saga B reads the same value, both update it based on what they read — one update is lost.

Non-repeatable Read: Saga A reads a value, Saga B updates it, Saga A reads it again — sees a different value mid-execution.

Countermeasures

1. Semantic Locks

When a Saga starts modifying a resource, mark it as in-flight with a sentinel flag. Other transactions that encounter this flag must either wait, fail fast, or handle the pending state explicitly.

SQL — Semantic lock on inventory reservation
-- Step 1: Mark item as pending (semantic lock)
UPDATE inventory
   SET status = 'RESERVING', saga_id = 'saga-123'
 WHERE item_id = 'item-456'
   AND status = 'AVAILABLE';  -- optimistic: only if still available

-- Other sagas see status='RESERVING' and back off
-- On completion: UPDATE status='RESERVED'
-- On compensation: UPDATE status='AVAILABLE', saga_id=NULL
2. The Pivot Transaction

Identify the point of no return in your Saga — the step after which you are committed to moving forward (no more compensations). Design around this: make steps before the pivot compensatable, steps after the pivot retryable. For the order example, the pivot is payment being charged — you don't cancel after collecting money, you only refund.

3. Commutative Updates

Design updates so that order doesn't matter. Instead of SET stock = 5 (absolute), use SET stock = stock - 1 (relative). Two concurrent decrements both succeed without one overwriting the other. This eliminates lost updates for many inventory and balance operations.
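The difference between absolute and relative writes is easy to demonstrate. A sketch with two interleaved "sagas" acting on the same counter (in-memory dicts stand in for the database):

```python
# Lost update vs. commutative update.
# Absolute writes: both sagas read stock=10, both write back their own
# computed value — the second write silently overwrites the first.
stock_abs = {"count": 10}
a_read = stock_abs["count"]        # Saga A reads 10
b_read = stock_abs["count"]        # Saga B reads 10
stock_abs["count"] = a_read - 1    # A: SET stock = 9
stock_abs["count"] = b_read - 1    # B: SET stock = 9  ← A's decrement is lost

# Relative writes commute: both decrements survive regardless of order.
stock_rel = {"count": 10}
stock_rel["count"] -= 1            # A: SET stock = stock - 1
stock_rel["count"] -= 1            # B: SET stock = stock - 1

print(stock_abs["count"], stock_rel["count"])  # 9 8
```

Two items were sold in both cases, but the absolute-write version records only one — the classic lost update from the anomaly list above.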

4. Idempotency at Every Step

Network retries are normal in distributed systems. Every Saga step must be safely retryable — calling it twice must produce the same result as calling it once. The standard technique is an idempotency key: a unique ID included with every request. The receiver checks if it has already processed this ID. If yes, return the cached result without re-executing.

Pattern — Idempotency key on payment charge
-- Idempotency table: deduplication at DB level
CREATE TABLE idempotency_keys (
  key        TEXT PRIMARY KEY,        -- e.g. 'saga-123:charge-payment'
  result     JSONB,
  created_at TIMESTAMPTZ DEFAULT NOW()
);

-- On every payment request: claim the key BEFORE doing the work.
-- (Putting the charge call directly in the VALUES clause would still
-- execute it even when the key conflicts — claim first, then act.)
INSERT INTO idempotency_keys (key)
VALUES ('saga-123:charge-payment')
ON CONFLICT (key) DO NOTHING;
-- 1 row inserted  → first attempt: perform the charge, then store its
--                   outcome: UPDATE idempotency_keys SET result = ...
-- 0 rows inserted → duplicate: SELECT result → return cached.
-- Never charge twice.

The Orchestrator as a Durable State Machine

An orchestration-based Saga requires the orchestrator to be durable. If the orchestrator crashes mid-saga, it must be able to resume from exactly where it left off. This means the orchestrator's state must be persisted to a database at every step transition.

SQL — Saga state machine table
CREATE TABLE sagas (
  id          UUID PRIMARY KEY,
  type        TEXT,                   -- 'ORDER_SAGA'
  state       TEXT,                   -- current step name
  status      TEXT,                   -- RUNNING | COMPENSATING | DONE | FAILED
  context     JSONB,                  -- all data the saga needs
  updated_at  TIMESTAMPTZ
);

-- State transitions for ORDER_SAGA:
-- PENDING → INVENTORY_RESERVING → PAYMENT_CHARGING → SHIPPING_BOOKING → DONE
-- Failure path (compensating):
-- SHIPPING_FAILED → PAYMENT_REFUNDING → INVENTORY_RELEASING → CANCELLED

Each step: (1) load saga state from DB, (2) call the service, (3) on success, update state to next step, (4) on failure, update state to the compensation path. Because each transition is an atomic local DB write, crashes are safe — the orchestrator restarts and resumes from the last saved state.
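The load/act/persist loop can be sketched in a few lines. This is an illustrative model — `saga_db` stands in for the sagas table, and the transition map mirrors the ORDER_SAGA happy path above (the real service calls are elided):

```python
# Durable orchestrator sketch: persist the state after every transition,
# so a restarted orchestrator resumes from the last saved state.
NEXT = {"PENDING": "INVENTORY_RESERVING",
        "INVENTORY_RESERVING": "PAYMENT_CHARGING",
        "PAYMENT_CHARGING": "SHIPPING_BOOKING",
        "SHIPPING_BOOKING": "DONE"}

saga_db = {"saga-123": "PENDING"}   # stands in for the sagas table row

def step(saga_id):
    """One transition: load state, do the work, persist the next state."""
    state = saga_db[saga_id]             # (1) load saga state from DB
    if state == "DONE":
        return state
    # (2) call the service for this step here ...
    saga_db[saga_id] = NEXT[state]       # (3) atomic local write of new state
    return saga_db[saga_id]

step("saga-123")
# 💥 orchestrator crashes and restarts here: the state survived in the DB
assert saga_db["saga-123"] == "INVENTORY_RESERVING"
while step("saga-123") != "DONE":       # resume from the persisted state
    pass
print(saga_db["saga-123"])  # DONE
```

Because the service call and the state write are not atomic with each other, each service call must itself be idempotent — a crash between (2) and (3) means the step is retried on resume.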

Chapter 05 · Infrastructure

The Outbox Pattern

Choreography-based Sagas depend on reliable event publishing. But publishing to a message broker and committing to a database are two separate operations — one can fail without the other. The Outbox pattern solves this with a simple but powerful trick.

Key terms: Outbox · Dual-Write · Debezium · CDC · At-Least-Once
Mental Model
"Instead of writing to the database AND sending a message (two separate things that can fail independently), write the message into the database as a row — in the same transaction as your business data. Then a separate process reads that row and publishes it. The publish is now retryable without risk of losing the event."

The Dual-Write Problem

The naive implementation of choreography looks like this: commit your business data, then publish an event to Kafka. Two separate operations. What can go wrong?

❌ Broken: Dual-Write — process crashes between the two operations
App runs business logic → DB commit succeeds (order created ✓) → 💥 CRASH → Kafka event never sent ✗
Result: Order exists in DB, but Inventory service never heard about it. Order is stuck forever.

❌ Also broken: Reversed order — Kafka first, then DB crash
App publishes to Kafka (event sent ✓) → 💥 CRASH → DB commit never happens ✗
Result: Inventory service processes an order that doesn't exist in the Order DB yet. Phantom order.

The Outbox Solution

The fix is elegant: add an outbox table to your database. When your business logic runs, write both the business data AND the event message to this outbox table, in the same local transaction. Local transactions are atomic. If the commit succeeds, both the data and the outbox row exist. If it fails, neither does.

A separate relay process (called a message relay or outbox poller) reads unpublished rows from the outbox table and publishes them to Kafka. On success, it marks the row as published. This publish step can be retried indefinitely — the consumer must be idempotent to handle at-least-once delivery.
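The polling relay is a simple loop: select unpublished rows, publish, mark published only after the broker acks. A sketch — the outbox is modeled as a list of dicts and `publish` stands in for a real Kafka producer send-and-ack:

```python
# Outbox relay sketch. Delivery is at-least-once: a crash between
# publish and mark-sent just means the row is retried (duplicated),
# which idempotent consumers absorb.
outbox = [
    {"id": 1, "event_type": "OrderCreated", "published": False},
    {"id": 2, "event_type": "OrderCreated", "published": False},
]
delivered = []

def publish(row):
    # stands in for producer.send(...) + waiting for the broker ack
    delivered.append(row["id"])

def relay_once():
    for row in outbox:
        if not row["published"]:      # SELECT ... WHERE published = false
            publish(row)              # retryable until the broker acks
            row["published"] = True   # mark sent only AFTER the ack

relay_once()
relay_once()                          # second pass finds nothing to do
print(delivered)  # [1, 2]
```

The ordering inside `relay_once` is the whole point: marking a row published before the ack would reintroduce the lost-event failure the outbox exists to prevent.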

✓ Outbox Pattern — atomically safe
App writes the orders table and the outbox table in a single DB transaction → Relay polls the outbox → publishes to Kafka (event published ✓) → marks outbox.published = true
Result: Event is published if and only if the DB transaction committed. The relay retries until Kafka acks. Consumers handle duplicates via idempotency keys.
SQL — Outbox table + application write
CREATE TABLE outbox (
  id           UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  aggregate_id TEXT,             -- e.g. order_id
  event_type   TEXT,             -- e.g. 'OrderCreated'
  payload      JSONB,
  published    BOOLEAN DEFAULT FALSE,
  created_at   TIMESTAMPTZ DEFAULT NOW()
);

-- Application writes both atomically:
BEGIN;
  INSERT INTO orders (id, customer_id, total) VALUES (...);
  INSERT INTO outbox (aggregate_id, event_type, payload)
    VALUES ('order-123', 'OrderCreated', '{"order_id":"order-123",...}');
COMMIT;  -- both succeed or both fail. Always.

CDC-Based Relay with Debezium

Instead of polling the outbox table, a more elegant approach is Change Data Capture (CDC). Debezium tails the PostgreSQL WAL and publishes any INSERT to the outbox table directly to Kafka — without any application-level polling code. The WAL is already the durable log of what happened; Debezium just reads it.

Polling-based relay
Simple to implement. App polls WHERE published=false on an interval. Works well at low-to-medium volume. Adds DB load from polling. Risk of delay between commit and publish.
CDC-based (Debezium)
Reads directly from Postgres WAL. Near-zero latency. No polling overhead. More infrastructure to operate. The production-grade choice at scale.
Chapter 06 · Decision Framework

When to Use What

2PC, Saga, or just eventual consistency — the choice is never about which is "better." It's about matching the mechanism to your consistency requirements, operational tolerance, and business domain. Here's the decision framework.

Key terms: Decision Framework · Trade-offs · Production Guidance
Mental Model
"Most engineers reach for Saga when they need distributed consistency. But the right first question is: do you actually need distributed transactions at all? The best distributed transaction is the one you don't have."

The Full Comparison

Dimension     | 2PC                     | Saga (Choreography)          | Saga (Orchestration)        | Eventual Only
Atomicity     | Strong                  | Eventual                     | Eventual                    | None
Isolation     | Full ACID               | None (ACD)                   | None (ACD)                  | None
Availability  | Blocking on failure     | Non-blocking                 | Non-blocking                | Always available
Latency       | 2 round trips minimum   | Async, variable              | Async, variable             | Single write
Observability | Synchronous, clear      | Scattered events             | Central state machine       | Simple
Complexity    | Protocol complexity     | Compensation logic           | Orchestrator + compensation | Simple
Best for      | Same DC, small txns, XA | Loose coupling, simple flows | Complex flows, visibility   | Likes, counts, feeds

Decision Guide

1. Can you avoid it entirely? Can you redesign to put the data in one service with one database? Or accept that one system lags slightly behind? Many "distributed transaction" problems disappear with a different service boundary. Try this first.
2. Do you need synchronous confirmation? Does the API caller need to know, right now, that all systems committed? And are you within a single datacenter with <5ms RTT? → Use 2PC / XA.
3. Is the flow simple with few steps? 2–3 services, clear ownership, teams prefer autonomy? → Saga with Choreography + Outbox pattern.
4. Is the flow complex or high-stakes? 4+ services, you need visibility into in-flight sagas, you need to handle partial failures gracefully? → Saga with Orchestration. Accept the orchestrator as a dependency.
5. Can the business tolerate eventual consistency? View counts, like counts, recommendations, feeds — these are inherently eventually consistent. Don't add transaction machinery to them. Use simple async writes.

Real-World Examples

Stripe: Uses idempotency keys + event sourcing. Every payment state change is an immutable ledger event. The ledger is the source of truth. Saga-like compensation via explicit refund events.
Uber: Trip state machine is an orchestrated Saga. Each state transition (REQUESTED → DRIVER_ASSIGNED → STARTED → COMPLETED) is durable. Failures trigger well-defined compensations.
Amazon: An early large-scale adopter of Saga-style workflows (the pattern itself dates to the 1987 "Sagas" paper by Garcia-Molina and Salem). Each order-workflow step is owned by a separate team. Choreography via SQS/SNS events, with a central workflow tracker for visibility.
Traditional banks: Still use 2PC via XA for core ledger operations. The latency is acceptable (<10ms), the number of participants is small (2–3 databases), and the regulatory requirement for atomicity outweighs the complexity.

The One-Sentence Summary

Take Away
Use 2PC when you need synchronous distributed atomicity and can tolerate coordinator risk in a controlled environment.

Use Saga + Orchestration when the flow is complex, cross-team, and you need visibility.

Use Saga + Choreography + Outbox when services are loosely coupled and teams need autonomy.

Use nothing when the business domain naturally tolerates eventual consistency — which is more often than engineers assume.