In Search of an Understandable Consensus Algorithm (Extended Version)🔗

Features🔗

  • Strong leader: logs only flow out from leader to other servers
  • Leader election: uses randomized timers, simple and rapid conflict resolution
  • Membership changes: new joint consensus approach allows normal operation during configuration changes

Replicated State Machines🔗

Collection of servers computing identical copies of the same state, typically using a replicated log.

Server's consensus modules ensure eventual constistency of the replicated logs, including ordering.

Typical guarentees:

  • Correctness under non-Byzantine conditions (in the presence of byzantine faults)
  • Available if any majority of servers are operational
  • Not dependent on time (asynchronous). Bad clocks & delays do not cause errors, only impact availability.

Paxos🔗

"Single decree Paxos"🔗

Proven protocol for single decision (eg. single log entry). Uses peer-to-peer architecture (no leader).

Difficult to understand even in this form.

"multi-Paxos"🔗

Practical systems require ability to agree on many decisions. Must extend original Paxos protocol, however:

  • There is no canonical multi-Paxos.
  • Most implementations deviate signifigantly in ways that are complex and error-prone.

Understandability🔗

Raft focuses on understandability. When making a design decision, understandability is the number one factor.

It's acheived by:

  • Decomposition / modularity. Leader election, log replication, safety, and membership changes are handled separately in Raft.
  • Smaller state space: focus on reducing possible states.
  • Reducing nondeterminism (where applicable)

Raft🔗

Raft decomposes the consensus problem into three subproblems:

  • Leader election
  • Log replication
  • Safety (ensuring consistency between machines)

Cluster of several (typically 5 - tolerate 2 failures) servers. Each server is either leader, follower, or candidate.

Terms (period with 1 leader) serve as a logical clock.

Basic consensus requires only 2 RPCs:

  • RequestVote (candidate -> others)
  • AppendEntries (leader -> followers)

Conflicts handled by randomized election timeout. Servers will transition to new election in random amount of time, servers which transistion earlier will have higher likelyhood of being leader. Vanishingly low probability of repeated conflicts.

Safety🔗

Logs must be present on a majority of nodes to be committed. New leader must be more up-to-date than the majority of nodes which vote for it in order to get elected. -> Therefore, new leader must be up to date.

Voters deny vote if candidate is out-of-date relative to its own log.

Configuration Change🔗

Config is determined by latest Config entry in servers log (does not need to be committed).

Transistion with transitional config state C_old,new. C_old -> C_old,new -> C_new

New servers with empty logs join as non-voting members to prevent availability gaps.

Leader may not belong to C_new, in this case it steps down after transition to C_new. While replicating C_new it is managing the cluster but does not vote / count itself in determining majorities for voting.

To prevent removed servers from triggering elections, servers disregard RequestVote RPCs when they believe a leader exists (have recieved a heartbeat from a leader within the the last $(election timeout period) amount of time).

Log compaction🔗

Approaches🔗
  • Snapshotting: snapshot the committed machine state up to a certain point and discard all logs leading up to that point.
  • Log cleaning: too complicated.
  • Log-Structured Merge Tree: seen in designing data-intensive applications

Servers snapshot independantly, however leader may send its snapshot to very out-dated followers.

Snapshot performance considerations:

  • Frequency:
    • too frequent wastes disk bandwidth (high churn)
    • not frequent enough could exhaust capacity (log too big)
  • Availability delay: taking a snapshot can take a long time, updates cannot be accepted during snapshotting. Solution is to use copy-on-write? eg. fork() CoW allows for "instant" in-memory snapshot of whole machine.

Client Interaction🔗

Send all requests to leader. (First to random server, then redirected to most recent known leader)

Uses unique serial numbers per-request to ensure linearlizability semantics. To prevent stale reads, leader must commit a no-op to confirm which of its entries have been committed at the start of its term.

References🔗

ROSENBLUM, M., AND OUSTERHOUT, J. K. The design and implementation of a log-structured file system. ACM Trans. Comput. Syst. 10 (February 1992), 26–52.