Module 06

Distributed
Systems

The hardest problems in engineering. Where junior and senior engineers truly diverge.

Advanced Senior-Level

Distributed 01

The Distributed Consensus Problem

Distributed consensus is the problem of getting multiple nodes in a network to agree on a single value, even when some nodes crash or messages are delayed. This seems simple but is provably hard: the FLP impossibility result shows that in a fully asynchronous system, no deterministic algorithm can guarantee consensus terminates if even one node may crash.

Imagine 5 generals surrounding a city. They must agree on whether to attack or retreat. Some generals might be traitors (Byzantine failures). Messages can be delayed or lost. How do the loyal generals agree on the same action? This is the Byzantine Generals Problem, the most famous problem in distributed systems.

Why Consensus Matters in Practice

  • Leader election: Which database node is the primary? Which Kafka broker owns this partition?
  • Distributed locking: Only one process should hold the lock at a time.
  • Atomic broadcast: All nodes receive the same events in the same order.
  • Configuration management: etcd/ZooKeeper store cluster config, and all nodes must see the same config.
โฆ

Distributed 02

Raft Consensus Algorithm

Raft was designed to be understandable, unlike Paxos. It's used by etcd (Kubernetes), CockroachDB, TiKV, Consul, and many more. The Raft paper's goal: "a consensus algorithm for replicated state machines that is as easy to understand as possible."

Three Roles

  • Leader: Handles all client requests. Replicates log entries to followers. Sends periodic heartbeats (at an interval well below the election timeout) to prevent new elections.
  • Follower: Passive. Accepts log entries from leader. If no heartbeat received within election timeout, becomes candidate.
  • Candidate: Requests votes from peers. If majority votes received, becomes leader. If another leader is seen, reverts to follower.

Leader Election

Normal operation:
  Leader --heartbeat--> All Followers (periodically, e.g. every 50ms, well below the election timeout)

Leader crashes:
  Followers wait for election timeout (150–300ms, randomised)
  First follower to timeout becomes Candidate
  Candidate increments term, votes for itself
  Candidate --RequestVote(term=5)--> Other nodes
  If node hasn't voted this term AND candidate log is up-to-date:
    → Grant vote
  If candidate receives majority votes:
    → Becomes leader for term 5
    → Sends heartbeat to all followers
    → Normal operation resumes

Why randomised election timeouts? If all followers had the same timeout, they'd all start an election simultaneously. Multiple candidates split votes and no one wins. Randomisation ensures one node starts first with high probability.
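
The argument above can be tested with a toy model: call an election "split" whenever more than one node's timer fires before the first candidate's RequestVote could reach the others. The 5ms message delay, the 150–300ms range, and the helper names here are illustrative assumptions, not Raft constants.

```python
import random

def single_candidate(timeouts, msg_delay=5.0):
    """A node becomes a (spoiler) candidate if its timer fires before the
    earliest candidate's RequestVote could reach it."""
    earliest = min(timeouts)
    candidates = [t for t in timeouts if t < earliest + msg_delay]
    return len(candidates) == 1

def split_vote_rate(randomised, trials=10_000, n=5, seed=1):
    """Fraction of trial elections that produce more than one candidate."""
    rng = random.Random(seed)
    splits = 0
    for _ in range(trials):
        if randomised:
            timeouts = [rng.uniform(150, 300) for _ in range(n)]
        else:
            timeouts = [150.0] * n          # identical timeouts
        if not single_candidate(timeouts):
            splits += 1
    return splits / trials

print("split-vote rate, fixed timeouts:     ", split_vote_rate(False))
print("split-vote rate, randomised timeouts:", split_vote_rate(True))
```

With identical timeouts every election is contested; with randomised timeouts the earliest timer usually fires well before any other, so one candidate almost always runs unopposed.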

Log Replication

Client → Leader: "SET x = 5"

1. Leader appends entry to its log (uncommitted)
2. Leader sends AppendEntries to all followers
3. Majority of followers append entry to their logs
4. Leader commits the entry (applies to state machine)
5. Leader notifies followers: entry committed
6. Followers apply entry to their state machines
7. Leader responds to client: "OK"

If minority of followers are slow/dead:
  → Still commits (majority is enough)
  → Slow followers catch up when they reconnect
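
A minimal sketch of the majority rule in steps 3–4 above, with a hypothetical `replicate` helper that simply counts acknowledgements against the replication factor:

```python
def replicate(entry, followers_alive, n=5):
    """One toy AppendEntries round: the leader counts itself plus every
    follower that acknowledged, and commits on a majority of n."""
    acks = 1 + sum(followers_alive)        # leader + acked followers
    committed = acks > n // 2              # strict majority required
    return acks, committed

# 2 of 4 followers are down; 3/5 acks is still a majority, so the
# entry commits and the dead followers catch up later.
acks, committed = replicate("SET x = 5", [True, True, False, False])
print(acks, committed)   # 3 True
```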
โฆ

Distributed 03

Vector Clocks & Causality

In distributed systems, there's no global clock. Two nodes can't agree on the exact time something happened. Vector clocks give us a way to reason about causality: did event A happen before event B?

A vector clock is a list of counters, one per node. Before sending a message, a node increments its own counter and includes the full vector. When a node receives a message, it takes the element-wise max of its vector and the received one, then increments its own counter.

Nodes: A, B, C

A sends to B:   A=[1,0,0]  "write x=1"
B receives:     B=[1,1,0]  (max of [1,0,0] and B's prior [0,0,0], then increment B's own entry)
B sends to C:   B=[1,2,0]
C receives:     C=[1,2,1]

Now: did A's event causally precede C's event?
A=[1,0,0] ≤ C=[1,2,1] in every position, strictly less in at least one → YES, A happened before C ✓

Concurrent events (neither caused the other):
  X=[2,0,0]  Y=[0,2,0]
  Neither vector dominates → concurrent → potential conflict
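
The increment/merge/compare rules can be sketched as a small Python class. `VectorClock`, `happened_before`, and `concurrent` are illustrative names rather than any particular library's API:

```python
class VectorClock:
    """Minimal vector clock for a fixed set of node ids."""
    def __init__(self, nodes, me):
        self.clock = {n: 0 for n in nodes}
        self.me = me

    def tick(self):
        """Local event or send: increment own counter, return a snapshot."""
        self.clock[self.me] += 1
        return dict(self.clock)

    def receive(self, other):
        """Merge: element-wise max with the incoming vector, then tick."""
        for n, c in other.items():
            self.clock[n] = max(self.clock[n], c)
        self.tick()

def happened_before(a, b):
    """a -> b iff a <= b in every position and a < b in at least one."""
    return all(a[k] <= b[k] for k in a) and any(a[k] < b[k] for k in a)

def concurrent(a, b):
    return not happened_before(a, b) and not happened_before(b, a)

nodes = ["A", "B", "C"]
A, B, C = (VectorClock(nodes, n) for n in nodes)
msg = A.tick()            # A = [1,0,0], sent to B
B.receive(msg)            # B = [1,1,0]
msg = B.tick()            # B = [1,2,0], sent to C
C.receive(msg)            # C = [1,2,1]
print(happened_before({"A": 1, "B": 0, "C": 0}, C.clock))            # True
print(concurrent({"A": 2, "B": 0, "C": 0}, {"A": 0, "B": 2, "C": 0}))  # True
```

The final two lines reproduce the two comparisons from the trace above: A's event causally precedes C's, while X and Y are concurrent.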

Amazon's Dynamo (the internal system described in the 2007 paper, a different design from DynamoDB) used vector clocks to detect concurrent writes. When two replicas have diverged versions with concurrent vector clocks, the application layer must resolve the conflict (e.g., shopping cart: merge both, take the union).

โฆ

Distributed 04

Gossip Protocols

Gossip (epidemic) protocols spread information through a cluster by having each node periodically share its state with a few random peers. Eventually, everyone knows everything โ€” without a central coordinator.

Think of office gossip: you tell 3 colleagues a secret, and each of them tells 3 more people. Within a few rounds, the entire office knows. Information spreads exponentially without any central notice board.

What Gossip Is Used For

  • Membership: Cassandra uses gossip to track which nodes are alive or dead. Each node gossips with 1–3 random peers every second. Within O(log N) rounds, everyone knows about a new node or a failure.
  • Failure detection: If a node isn't gossiping back, it's probably dead. Probabilistic failure detection without a central heartbeat server.
  • State dissemination: Spread configuration changes, routing table updates, key-value state across a cluster.

Convergence Time

In a cluster of N nodes, gossip propagates in O(log N) rounds. 1000 nodes converge in ~10 rounds. 1 million nodes in ~20 rounds. This is why gossip scales so well.
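
A toy push-gossip simulation makes the O(log N) claim concrete. The fanout of 3, uniform random peer selection, and lossless delivery are simplifying assumptions:

```python
import math
import random

def gossip_rounds(n, fanout=3, seed=0):
    """Rounds until every node has the rumour, when each informed node
    pushes it to `fanout` uniformly random peers per round."""
    rng = random.Random(seed)
    informed = {0}                        # node 0 starts the rumour
    rounds = 0
    while len(informed) < n:
        for _ in range(len(informed)):    # every informed node pushes
            informed.update(rng.randrange(n) for _ in range(fanout))
        rounds += 1
    return rounds

for n in (1_000, 100_000):
    print(f"{n} nodes: {gossip_rounds(n)} rounds (log2 ≈ {round(math.log2(n))})")
```

Multiplying the cluster size by 100 adds only a handful of rounds, which is the logarithmic scaling the text describes.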

โฆ

Distributed 05

Quorum Systems

A quorum is the minimum number of nodes that must agree for an operation to be considered successful. The key insight: if W + R > N (the replica count), every read set and write set overlap in at least one node, so a read quorum always includes at least one copy of the latest write.

N = 5 replicas (replication factor)
W = 3 (must write to 3 nodes for success)
R = 3 (must read from 3 nodes and take latest)

W + R = 6 > N = 5 → Strong consistency guaranteed ✓
(At least 1 node is always in both write and read set)

Trade-off configurations:
W=1, R=5: Fast writes, slow reads (scan all for latest)
W=5, R=1: Slow writes, fast reads
W=3, R=3: Balanced (Cassandra QUORUM setting)
W=1, R=1: Fastest, no consistency guarantee
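
For small N the overlap guarantee can be checked exhaustively. `quorum_overlaps` is a hypothetical helper that intersects every possible write set with every possible read set:

```python
import itertools

def quorum_overlaps(n, w, r):
    """True iff every w-node write set shares at least one node with
    every r-node read set; guaranteed whenever w + r > n."""
    nodes = range(n)
    return all(set(ws) & set(rs)
               for ws in itertools.combinations(nodes, w)
               for rs in itertools.combinations(nodes, r))

print(quorum_overlaps(5, 3, 3))   # True:  3 + 3 > 5, sets must intersect
print(quorum_overlaps(5, 2, 3))   # False: 2 + 3 = 5, sets can be disjoint
```

The second call shows why W=2, R=3 is unsafe with N=5: a write to {0,1} and a read from {2,3,4} never meet, so the read can miss the latest value entirely.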
โฆ

Distributed 06

Split Brain

Split brain occurs when a network partition divides a cluster into two groups, and both groups elect a leader; now you have two leaders accepting writes simultaneously. When the partition heals, you have conflicting state.

Scenario: 5-node database cluster. A network partition splits it: [Node 1, 2] | [Node 3, 4, 5]. The larger partition (3, 4, 5) elects Node 3 as leader. But Nodes 1 and 2 lose contact, and Node 1 was the old leader; it doesn't know it's been replaced. Now both Node 1 and Node 3 accept writes. When the partition heals: two different versions of the truth. Which writes win?

Prevention

  • Quorum majority requirement: A leader is only valid if it has majority (N/2 + 1) nodes. If [1,2] can't reach 3 out of 5, they refuse to operate. The majority partition (3,4,5) remains operational. This is Raft's approach.
  • Fencing tokens: When a new leader is elected, it gets a monotonically increasing token. Any write must include the token. Storage systems reject writes with old tokens. Old leader's writes are rejected.
  • STONITH (Shoot The Other Node In The Head): Winning partition forcibly powers off the losing partition nodes. Brutal but effective in some environments.
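
A fencing-token check fits in a few lines. `FencedStore` is an illustrative in-memory stand-in for a real storage service:

```python
class FencedStore:
    """Storage that rejects writes carrying a stale leadership token."""
    def __init__(self):
        self.highest_token = 0
        self.data = {}

    def write(self, token, key, value):
        if token < self.highest_token:
            return False                  # stale leader: write fenced out
        self.highest_token = token        # remember the newest token seen
        self.data[key] = value
        return True

store = FencedStore()
old_leader, new_leader = 7, 8             # tokens grow monotonically per election
assert store.write(new_leader, "x", "from new leader")
assert not store.write(old_leader, "x", "from old leader")  # rejected
print(store.data["x"])    # from new leader
```

Once the store has seen token 8, the deposed leader's writes with token 7 are rejected no matter when they arrive, which closes the split-brain window on the storage side.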
โฆ

Distributed 07

Two-Phase Commit (2PC)

2PC is a protocol to achieve atomicity across multiple nodes: either all nodes commit a transaction or none do.

Phase 1 (Prepare):
  Coordinator → all Participants: "PREPARE"
  Each participant: write to WAL, lock resources, reply "YES" or "NO"

Phase 2 (Commit or Abort):
  If ALL replied YES:
    Coordinator → all Participants: "COMMIT"
    Participants commit, release locks, reply "ACK"
  If ANY replied NO:
    Coordinator → all Participants: "ABORT"
    Participants roll back, release locks

The blocking problem: if the coordinator crashes after sending PREPARE but before COMMIT, participants are stuck. They've said YES and locked resources, but they can't commit or abort without the coordinator's decision, so they block indefinitely. This is 2PC's fatal flaw: it is a blocking protocol.
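
The protocol's happy path can be sketched with in-process participants standing in for remote nodes; note that this toy version deliberately sidesteps the blocking problem, since the "coordinator" here cannot crash between phases:

```python
class Participant:
    def __init__(self, can_commit=True):
        self.can_commit, self.state = can_commit, "INIT"
    def prepare(self):
        # Phase 1: lock resources, vote YES/NO (WAL write omitted here)
        self.state = "PREPARED" if self.can_commit else "ABORTED"
        return self.can_commit
    def commit(self): self.state = "COMMITTED"
    def abort(self):  self.state = "ABORTED"

def two_phase_commit(participants):
    """Coordinator side of 2PC: commit only if every vote is YES."""
    votes = [p.prepare() for p in participants]     # Phase 1
    decision = "COMMIT" if all(votes) else "ABORT"  # Phase 2
    for p in participants:
        p.commit() if decision == "COMMIT" else p.abort()
    return decision

print(two_phase_commit([Participant(), Participant()]))         # COMMIT
print(two_phase_commit([Participant(), Participant(False)]))    # ABORT
```

A single NO vote aborts everyone, which is exactly the all-or-nothing atomicity 2PC exists to provide.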

When to Use 2PC

2PC is appropriate for database clustering (internal to a single DB system), for XA transactions in enterprise systems, and when you control all participants and the network is reliable. Avoid it in high-availability distributed systems that can't afford blocking.

โฆ

Distributed 08

Sagas: Distributed Transactions Without 2PC

A Saga is a sequence of local transactions, where each step publishes an event or message to trigger the next step. If any step fails, compensating transactions are run to undo previous steps.

Order Saga (e-commerce):
  1. OrderService: Create order (PENDING)
  2. PaymentService: Charge card
     → If fails: emit OrderFailed, cancel order
  3. InventoryService: Reserve items
     → If fails: emit PaymentRefund, refund card, cancel order
  4. ShippingService: Schedule delivery
     → If fails: emit InventoryRelease, PaymentRefund, cancel order
  5. OrderService: Mark order CONFIRMED

Each step is a local transaction. No distributed lock needed.
Failure triggers backward-running compensating transactions.
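
The backward-running compensation loop can be sketched generically; `run_saga` and the step names are illustrative, mirroring the order saga above rather than any framework's API:

```python
def run_saga(steps):
    """Run each (action, compensation) pair in order; on failure, run the
    compensations of all completed steps in reverse order."""
    done = []
    for action, compensation in steps:
        try:
            action()
            done.append(compensation)
        except Exception:
            for comp in reversed(done):   # undo newest-first
                comp()
            return "ROLLED_BACK"
    return "CONFIRMED"

log = []
def step(name, fail=False):
    def action():
        if fail:
            raise RuntimeError(name)      # simulate a failed local transaction
        log.append(name)
    return action

saga = [
    (step("create order"),               lambda: log.append("cancel order")),
    (step("charge card"),                lambda: log.append("refund card")),
    (step("reserve items", fail=True),   lambda: log.append("release items")),
]
result = run_saga(saga)
print(result)   # ROLLED_BACK
print(log)      # ['create order', 'charge card', 'refund card', 'cancel order']
```

Because step 3 fails, only the compensations for steps 1 and 2 run, newest first: the card is refunded before the order is cancelled, and the failed reservation needs no undo.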

Choreography Sagas

  • Services emit events and react to events
  • No central coordinator
  • Decoupled, scalable
  • Hard to visualise the overall flow
  • Good for simple workflows

Orchestration Sagas

  • Central saga orchestrator commands services
  • Clear control flow visible in one place
  • Easier to add error handling
  • Orchestrator is a potential bottleneck
  • Good for complex workflows (Uber, Netflix)
โฆ

Distributed 09

Interactive Raft Simulator

Watch Raft leader election in real-time. Click a node to kill it and trigger a new election.
