A step by step approach to raft consensus algorithm

This is my second time to read through Raft Algorithm, and it is hard to recall what I have learned in first reading. This time I decide to record my thoughts for future recall. Hope it is useful to newbies in distributed systems like me.

What does consensus algorithm mean ?

Consensus algorithm is the process used to achieve agreement on shared state among faulty processes in distributed system.

Introduce Replicated State Machines

Here, we define state machine as State' = Machine(State, Input) for simplicity. A state machine can be defined with its start state, and makes progress with sequence of inputs, produces sequence of intermediate states. For examples:

1
2
3
State' = Machine(State, Input)
State'' = Machine(State', Input')
State''' = Machine(State'', Input'')

Given a state machine, how can we figure out that it is a replicated state machine ?

First, replicated state machine is deterministic. Given same start state with same sequence of inputs, the state machine always produce same intermediate states.

Second, two state machines built from same logic on possibly different processes or nodes and even possibly different languages must be same. Here we define two state machines as same based on deterministic: given same start state and same sequence of inputs, if two state machines produce same sequence of intermediate states, we say these two state machine are same. Thus given multiple copies of same state machines with same start state, feeding with same sequence of inputs, they must produces same sequence of intermediate states.

Third, states, including start state and intermediate states, and inputs must be self-contained, thus can be replicated to other processes or nodes with help of serialization and deserialization.

The word replicated in replicated state machine has duple means: deterministic and replication. Occasionally, replicated state machine is used in non distributed systems to validate processes using its deterministic. More often, replicated state machines are used to validate or repliate processes among processes or nodes in distributed systems.

How replicated state machines implemented in distributed systems ?

Replicated state machines are typically implemented using a replicated log, this is particular due to the fact that state machine process sequence of inputs one by one in order, non batch, non concurrency, I guess. Here, replicated in replicated log means replication cross processes or nodes.

In replciated log, inputs to state machines are appended as sequence of entries or records in order, these entries will be replicated to replicated logs reside in other nodes in later time, and feeded to state machines reside in their nodes for apply eventually.

Thus if given identical sequence entries from replicated log, multiple copies of same state machine will produce same sequence of states. But how can we promise identical sequence cross different faulty nodes in distributed systems ? That is consensus algorithm’s job.

Raft

Raft implements consensus by first electing a server as leader, then giving the leader complete responsibility for managing the replicated log. The leader accepts log entries from clients, replicates them on other servers, and tells servers when it is safe to apply log entries to their state machines.

Raft decomposes the consensus problem into two relatively independent subproblems:

  • Leader election: a new leader must be chosen when starting the cluster and when an existing leader fails.
  • Log replication: the leader must accept log entries from clients and replicate them across the cluster, forcing the other logs to agree with its own.

Prerequisites

Majority and minority

For a raft cluster contains odd number of servers, say 2*N + 1, or even number of servers, say 2*N, we say majority of the cluster comprises at least N + 1 servers, and minority of the cluster comprises at most N servers.

Server roles

At any given time each server is in one of three states: leader, follower, or candidate. In normal operation there is exactly one leader and all of the other servers are followers. Followers are passive: they issue no requests on their own but simply respond to requests from leaders and candidates. The leader handles all client requests (if a client contacts a follower, the follower redirects it to the leader). The third state, candidate, is used to elect a new leader. The following figure shows the states and their transitions:

Terms

Raft divides time into terms of arbitrary length. Terms are numbered with consecutive integers. Each term begins with an election, in which one or more candidates attempt to become leader. If a candidate wins the election, then it serves as leader for the rest of the term. In some situations an election will result in a split vote. In this case the term will end with no leader; a new term (with a new election) will begin shortly. Raft ensures that there is at most one leader in a given term.

Logs

Logs are composed of entries, which are numbered sequentially as index. Each entry contains the term in which it was created and a command for the state machine. Raft determines which of two logs is more up-to-date by comparing the index and term of the last entries in the logs. If the logs have last entries with different terms, then the log with the later term is more up-to-date. If the logs end with the same term, then whichever log is longer is more up-to-date.

Leader election

Raft uses a heartbeat mechanism to trigger leader election. When servers start up, they begin as followers. A server remains in follower state as long as it receives valid RPCs from a leader or candidate. Leaders send periodic heartbeats to all followers in order to maintain their authority. If a follower receives no communication over a period of time called the election timeout, then it assumes there is no viable leader and begins an election to choose a new leader.

To begin an election, a follower increments its current term and transitions to candidate state. It then votes for itself and issues RequestVote RPCs in parallel to each of the other servers in the cluster. A candidate continues in this state until one of three things happens:

  • It wins the election.

    A candidate wins an election if it receives votes from a majority of the servers in the full cluster for the same term. Each server will vote for at most one candidate in a given term, on a first-come-first-served basis with restriction: the voter denies its vote if its own log is more up-to-date than that of the candidate. The majority rule ensures that at most one candidate can win the election for a particular term. Once a candidate wins an election, it becomes leader. It then sends heartbeat messages to all of the other servers to establish its authority and prevent new elections.

  • Another server establishes itself as leader,

    While waiting for votes, a candidate may receive an AppendEntries RPC from another server claiming to be leader. If the leader’s term (included in its RPC) is at least as large as the candidate’s current term, then the candidate recognizes the leader as legitimate and returns to follower state. If the term in the RPC is smaller than the candidate’s current term, then the candidate rejects the RPC and continues in candidate state.

  • Another election timeout goes by with no winner.

    If many followers become candidates at the same time, votes could be split so that no candidate obtains a majority. When this happens, each candidate will time out and start a new election by incrementing its term and initiating another round of RequestVote RPCs. Raft uses randomized election timeouts to ensure that split votes are rare and that they are resolved quickly. To prevent split votes in the first place, election timeouts are chosen randomly from a fixed interval (e.g., 150–300 ms). This spreads out the servers so that in most cases only a single server will time out; it wins the election and sends heartbeats before any other servers time out. The same mechanism is used to handle split votes. Each candidate restarts its randomized election timeout at the start of an election, and it waits for that timeout to elapse before starting the next election; this reduces the likelihood of another split vote in the new election.

Log replication

Once a leader has been elected, it begins servicing client requests. Each client request contains a command to be executed by the replicated state machine. The leader appends the command to its log as a new entry, then issues AppendEntries RPCs in parallel to each of the other servers to replicate the entry. When the entry has been safely replicated, the leader applies the entry to its state machine and returns the result of that execution to the client. If followers crash or run slowly, or if network packets are lost, the leader retries AppendEntries RPCs indefinitely (even after it has responded to the client) until all followers eventually store all log entries.

The leader decides when it is safe to apply a log entry to the state machines; such an entry is called committed. A log entry is committed once the leader that created the entry has replicated it on a majority of the servers. This also commits all preceding entries in the leader’s log, including entries created by previous leaders. The leader keeps track of the highest index it knows to be committed, and it includes that index in future AppendEntries RPCs (including heartbeats) so that the other servers eventually find out. Once a follower learns that a log entry is committed, it applies the entry to its local state machine (in log order).

Raft log mechanism was designed to maintain a high level of coherency between the logs on different servers. It maintains the following properties, which together constitute the Log Matching Property:

  1. If two entries in different logs have the same index and term, then they store the same command.

    This property follows from the fact that a leader creates at most one entry with a given log index in a given term, and log entries never change their position in the log. New leader is elected with a larger term than previous leaders.

  2. If two entries in different logs have the same index and term, then the logs are identical in all preceding entries.

    This property is guaranteed by a consistency check performed by AppendEntries. When sending an AppendEntries RPC, the leader includes the index and term of the entry in its log that immediately precedes the new entries. If the follower does not find an entry in its log with the same index and term, then it refuses the new entries. The consistency check acts as an induction step: the initial empty state of the logs satisfies the Log Matching Property, and the consistency check preserves the Log Matching Property whenever logs are extended. As a result, whenever AppendEntries returns successfully, the leader knows that the follower’s log is identical to its own log up through the new entries.

During normal operation, the logs of the leader and followers stay consistent, so the AppendEntries consistency check never fails. However, leader crashes can leave the logs inconsistent (the old leader may not have fully replicated all of the entries in its log). These inconsistencies can compound over a series of leader and follower crashes. A follower may be missing entries that are present on the leader, it may have extra entries that are not present on the leader, or both. Missing and extraneous entries in a log may span multiple terms.

In Raft, the leader handles inconsistencies by forcing the followers’ logs to duplicate its own. This means that conflicting entries in follower logs will be overwritten with entries from the leader’s log. To bring a follower’s log into consistency with its own, the leader must find the latest log entry where the two logs agree, delete any entries in the follower’s log after that point, and send the follower all of the leader’s entries after that point. All of these actions happen in response to the consistency check performed by AppendEntries RPCs. The leader maintains a nextIndex for each follower, which is the index of the next log entry the leader will send to that follower. When a leader first comes to power, it initializes all nextIndex values to the index just after the last one in ts log. If a follower’s log is inconsistent with the leader’s, the AppendEntries consistency check will fail in the next AppendEntries RPC. After a rejection, the leader decrements the follower’s nextIndex and retries the AppendEntries RPC. Eventually the nextIndex will reach a point where the leader and follower logs match. When this happens, AppendEntries will succeed, which removes any conflicting entries in the follower’s log and appends entries from the leader’s log (if any). Once AppendEntries succeeds, the follower’s log is consistent with the leader’s, and it will remain that way for the rest of the term.

Safety

Raft guarantees that each of these properties is true at all times:

  • Election Safety: At most one leader can be elected in a given term.

    Each server will vote for at most on candidate in a given term, and a candidate must win votes from majority of the cluster to be elected as leader.

  • Leader Append-Only: A leader never overwrites or deletes entries in its log; it only appends new entries.

  • Log Matching Property: If two logs contain an entry with the same index and term, then the logs are identical in all entries up through the given index.
  • Leader Completeness Property: If a log entry is committed in a given term, then that entry will be present in the logs of the leaders for all higher-numbered terms.

    A log entry is committed once the leader that created the entry has replicated it on a majority of the servers (at least N+1). New leader elected afterward wins votes from majority (at least N+1) servers. From inequation (N+1) + (N+1) = 2*N + 2 > 2*N + 1, we knows that at least one server from later majority contains that committed log entry. With up-to-date election restriction, we can conclude that the new leader has that committed log entry.

  • State Machine Safety: If a server has applied a log entry at a given index to its state machine, no other server will ever apply a different log entry for the same index.

    At the time a server applies a log entry to its state machine, its log must be identical to the leader’s log up through that entry, and the entry must be committed. Consider the lowest term in which any server applies a given log index; the Leader Completeness Property guarantees that the leaders for all higher terms will store that same log entry, so servers that apply the index in later terms will apply the same value. Thus, the State Machine Safety Property holds.

  • Raft requires servers to apply entries in log index order.

    Combined with the State Machine Safety Property, this means that all servers will apply exactly the same set of log entries to their state machines, in the same order.

Raft guarantees that committed entries are durable and will eventually be executed by all of the available state machines.