-
Peter J. Keleher authoredPeter J. Keleher authored
RAFT
v1.0
Due: Nov 19, 2023, 11:59:59 pm
Setup
Download startup files here.
Part I: Failure-free operation
Implement raft as described in Figure 2 of the paper.
In particular:
- implement the log
- implement leader election
- election timeouts
- election restriction
- appendentries
- heartbeat
- "catching up" a lagging follower
- committing
- informing the client
- failure resilience
Failures you need to handle
-
Failure of a client: Client failures should not affect anything.
-
Failure of the leader: Your system should detect failures, elect a new leader (with server selection), and continue processing commands from clients while a quorum is still available. For example: a 5-node system should be able to recover from leader failure and continue committing client operations.
-
Slow followers: i.e. followers that are behind and need to be "caught up" to the leader's log state. We will implement slow followers by "pausing" containers use the command
docker compose pause <hostname>
, and later un-pausing them.
Command Syntax and Observable Output
Your server must be implemented in a single file at the top level named
raft.go
. Your client must be at the same level and named
client.go
. The only other files you need are Dockerfile
and
compose.yaml
. You will also need the /pb
subdirectory.
Your servers and clients are called as follows:
-
raft.go -s <listen addr this server>,<server 0>,<server 1>,.... [-d]
: For example, we might start a server withgo run raft.go -s localhost:8002,localhost:8000,localhost:8001,localhost:8002 -d
. This will be a single raft server instance that listens at address:8002
, in a system consisting of three servers listening atlocalhost:8000
(id 0),localhost:8001
(id 1), andlocalhost:8002
(id 2). The-s
argument is mandatory, but the-d
flag is not. -
client.go <leader> <command>
: The "leader" is the presumed leader of the system and the command is a simple uninterpreted string. For example:go run client.go localhost:8002 one
submits the command "one" tolocalhost:8002
. If this host is not the leader it returns false to the client, which prints this out. The RPC to the server will not return until the command is committed, at which point the client will print out "true".
Raft Server Output
Your server may print anything with the -d
flag. Without this flag,
however, it should print at most one line acknowledging when startup
is complete, and then only the following output:
At each heartbeat the leader should print:
AE [<match index array>] ci:<commit index>, <log len>[<term>-"<cmd>",....]
For example, your leader might print out:
AE [-1 -1] ci:4, 5[8-"L0",8-"one",8-"two",9-"L1",10-"L1"]
The matchindex is an integer slice described as in Figure 2. A
null/nil or "no match" is signified w/ a -1. ci
is the current
leader commit index. The number before the left bracket ('[') is the
current log length. The entire log is then printed, surrounded by
brackets. Entries have format <term>-"<cmd>"
, and are separated by
commas.
Followers should print the same, but use " " instead of "AE".
When a follower converts to candidacy it should print out "election" alone on a line.
During catchup, a server should print a single line for each time it goes backwards:
lastLogIndex <index> <(address:port)
There should be no other output, and this output must be followed exactly.
Details
- Use whatever timeouts you want during debugging, but you should use those in the startup file when submitting.
Testing.
Test 1 (10 pts) Two servers.
- Two-server election, both have to agree.
- Submit a command and verify that the client does not return until the command has been committed and applied.
Test 2 (10 pts) Continuing the above...
- Kill the leader and immediately restart; the follower must become the new leader because it's log is more up-to-date.
- Commit another command and power-cycle the leader again, again verifying that the follower, with it's more up-to-date log, become leader.
Test 3 (10 pts) Three servers.
Test 4 (10 pts) Compose w/ three servers.
Test 5 (10 pts) Compose w/ three servers:
- pause a follower, verify that the commands can still be committed
- unpause the follower, verify that the follower catches up
Test 6 (10 pts) Compose w/ three servers:
- pause the leader, verify that the commands can still be committed after new election
- unpause the previous leader, verify that it rejoins, reverts to a follower, catches up
Test 7 (10 pts) Compose w/ five servers.
> cli localhost:5501 put one
redirect localhost:5502
> cli localhost:5502 put one
term 1, index 0: commit "one"
Leader fails:
> cli localhost:5502 put two
redirect localhost:5501
> cli localhost:5502 put two
term 2, index 1: commit "two"