A Database, and Anti-Entropy

Due: Oct 10, 2021, 11:59:59 pm, v1

Recent Changes:

Overview

This project implements a relatively straightforward one-way sync of one server to another, but there are a few tricky aspects, and you will need to spend a bit of time learning gRPC. Our filesystem's state has two parts: the log and the file chunks. The log is easy to synchronize (for our purposes, at least), but we are going to synchronize the chunk database using Merkle trees exchanged over gRPC.

Links

  • p3.tgz: provided files. Your p3 directory in the repository should have the structure given in this tarball.
  • gRPC (go quick start)

Anti-Entropy and Merkle Trees

Merkle trees are a form of hash tree in which each leaf is the hash of a single chunk, and each non-leaf is the hash of the concatenation of the hashes of its child nodes. We are going to take severe liberties with the definition of a hash or Merkle tree in order to make the project tractable. For this project, we define the hash of a chunk or a string as the base32 encoding of the sha1 of that chunk or string.
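
In Go, that hashing convention might look like the following minimal sketch (the helper name hashString is illustrative, not part of the provided skeleton):

    import (
        "crypto/sha1"
        "encoding/base32"
    )

    // hashString returns the sig of a chunk or string: the base32
    // encoding of its sha1 digest.
    func hashString(data []byte) string {
        sum := sha1.Sum(data) // 20-byte sha1 digest
        return base32.StdEncoding.EncodeToString(sum[:])
    }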

Salient points:

  • The tree is a full balanced tree:
    • all leaves at same depth
    • all nodes except leaves have exactly 32 children
    • depth is an integer greater than 0
  • The tree's purpose is to summarize the set of all chunks locally present at a server.
  • Each node has a sig: a hash summarizing the data in the subtree rooted at that node.
  • We divide the space of chunk hashes into uniform-sized buckets whose width is determined solely by the tree's height.
  • Each leaf represents one bucket, and summarizes all chunks present at the server in that bucket.
  • The sig of a leaf represents the hash of the sigs of the chunks under the leaf, where the chunk sigs are sorted alphanumerically and concatenated. For example, given sig values "3B", "F3", "7A", and "K4", the hash would be computed as hash(3B|7A|F3|K4).
  • The sig of an empty bucket is the hash of an empty string.
  • The sig of a non-leaf is the hash of the concatenation of the sigs of all of its children, in child order (not sorted). See the sketch after this list.
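
The following hedged Go sketch shows how leaf and interior sigs could be computed under these rules. The helper names are illustrative, it reuses hashString from the sketch above, and it assumes the "|" in the example above is only a visual separator (i.e., the concatenation itself uses no separator):

    import (
        "sort"
        "strings"
    )

    // leafSig hashes the chunk sigs in a leaf's bucket, sorted
    // alphanumerically and concatenated; an empty bucket hashes the
    // empty string.
    func leafSig(chunkSigs []string) string {
        sorted := append([]string(nil), chunkSigs...)
        sort.Strings(sorted)
        return hashString([]byte(strings.Join(sorted, "")))
    }

    // interiorSig hashes the concatenation of the 32 child sigs, in
    // child order (not sorted).
    func interiorSig(childSigs []string) string {
        return hashString([]byte(strings.Join(childSigs, "")))
    }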

For this project we have defined the fan-out of the tree as 32, which is conveniently the base of the encoding we use to create signatures, allowing the tree to be visualized easily.

(Figure: a small portion of a depth 3 Merkle tree.)

This figure shows only a small portion of a depth (height) 3 tree. There are 32 interior nodes on the second level and 32² (1024) leaves on the third level.

The label of a leaf node is the hash of the concatenation of the hashes of the chunks in that leaf's bucket (buckets are not strictly relevant for interior nodes). Buckets get narrower as you go down the tree. If this were a depth 1 tree (consisting only of the root), the root's bucket would contain all possible hash values.

If this were a depth 2 tree, Node 0, Node 1, etc. would be leaves, and their buckets would contain chunks whose hashes start with "A", "B", etc. In the depth 3 tree shown, Node 0-0's bucket holds the chunks whose hashes start with "AA", Node 0-1's bucket those whose hashes start with "AB", etc.

Since this is a depth 3 tree, only the last level has "buckets" (the ability to hold chunks). The buckets at Node 0, Node 1, and the root are empty; they are shown only for explanatory purposes, and to show how chunks would be assigned to buckets in depth 1 or depth 2 trees.

The match between our fanout and the encoding means that it will be easy to determine the correct bucket for each chunk.
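
For example, a chunk's leaf bucket can be read directly from the leading base32 characters of its sig. A minimal sketch, assuming the standard base32 alphabet and a helper name of my own choosing:

    // bucketIndex maps a chunk sig to its leaf bucket index in a tree
    // of the given depth by treating the first depth-1 characters of
    // the sig as a base-32 number. With depth 1, every chunk lands in
    // bucket 0 (the root).
    func bucketIndex(sig string, depth int) int {
        const alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567" // base32 StdEncoding alphabet
        idx := 0
        for i := 0; i < depth-1; i++ {
            idx = idx*32 + strings.IndexByte(alphabet, sig[i])
        }
        return idx
    }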

Merkle Workflow

Note: most of the input and output messages for the gRPC interface are not yet implemented.

The following is an example anti-entropy session initiated from the command line. The exchange is started by the command-line client in the "cmd" subdirectory issuing a SyncChunks message to a server. The following tells the server at port 8000 to pull all chunks from the server at port 8001, using a depth 3 tree at both servers:

   go run main.go -L ":8000" -R ":8001" -h 3 -m

Call "localhost:8000" s1, and "localhost:8001" s2. As a result of the above:

  1. s1 sends a "TreeGet" msg to s2, specifying a tree height of 3.
  2. s2 locally creates a complete Merkle tree of height 3, placing signatures of all chunks at the appropriate leaf nodes of the fully balanced tree, and replies to s1 with the root. This struct defines the root's hash, as well as the hashes of its 32 children in a ChildSigs field. s2 caches its tree, as s1 might need to query it again.
  3. Meanwhile, s1 has created a tree describing its own local chunks, and compares its root hash with that returned by its request to s2. If the two are equal, the exchange is complete.
  4. If not, s1 determines which branch(es) under the root are different by comparing ChildSigs[i] on s1 and s2, for i in the range 0..31.
  5. For each difference found, s1 sends a "TreeGet" message to s2, specifying the desired node by supplying the sig. A sig of "" is interpreted as a request for the root. Otherwise, the tree is searched for the sig.
  6. When the above recursive procedure reaches a leaf, the TreeSig node returned has no ChildSigs defined, but may have defined ChunkSigs, representing chunks in that leaf's bucket.
  7. For each such chunk sig that is not already present locally at s1, s1 issues an ordinary "get" message to s2, bringing that chunk back and storing it in the local store.

In the absence of failure or other operations proceeding concurrently (both of which we ignore), this procedure will terminate with s1's set of chunks including all those at s2. In the absence of any other changes, a second identical sync will not need to retrieve any additional chunks.

Note: This does not, however, imply that such a second sync will not require multiple request-response pairs to establish this set inclusion. s1 might well have chunks that s2 does not, so their trees and hashes will continue to differ because our anti-entropy is uni-directional. In the absence of ongoing new data, however, syncing s2 to s1, followed by s1 to s2, should always result in identical chunk databases.
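
A rough Go sketch of the pulling side of this exchange, corresponding to steps 1-7 above. Every type and method used here (PeerClient, Store, TreeGet, ChunkGet, HasChunk, PutChunk, and so on) is an illustrative stand-in for whatever interface you define, not a prescribed one; Node is assumed to look like the struct sketched under Task 2 below:

    // syncFrom walks the local and remote Merkle trees in lockstep,
    // pulling only the remote nodes and chunks that differ.
    func syncFrom(local *Store, remote PeerClient, height int) error {
        rroot, err := remote.TreeGet("", height) // "" requests the remote root
        if err != nil {
            return err
        }
        return syncNode(local, remote, local.Root(height), rroot, height)
    }

    func syncNode(local *Store, remote PeerClient, lnode, rnode *Node, height int) error {
        if lnode.Sig == rnode.Sig {
            return nil // identical subtrees: nothing below can differ
        }
        if len(rnode.ChildSigs) == 0 {
            // Leaf: fetch any chunks in the remote bucket that we are missing.
            for _, cs := range rnode.ChunkSigs {
                if local.HasChunk(cs) {
                    continue
                }
                data, err := remote.ChunkGet(cs)
                if err != nil {
                    return err
                }
                local.PutChunk(cs, data)
            }
            return nil
        }
        // Interior node: recurse only into children whose sigs differ.
        for i, childSig := range rnode.ChildSigs {
            if childSig == lnode.ChildSigs[i] {
                continue
            }
            rchild, err := remote.TreeGet(childSig, height)
            if err != nil {
                return err
            }
            if err := syncNode(local, remote, lnode.Children[i], rchild, height); err != nil {
                return err
            }
        }
        return nil
    }

The important property is that equal sigs prune entire subtrees, which is what makes a repeated pull of an unchanged remote cheap.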

Your Tasks.

Task 1: Define the RPC Interface

I have given you the layout of your p3 directory in the repository. The pb directory once again holds protobuf definitions, but that same file also now defines the RPC interface between servers (TreeGet, ChunkGet, LogGet), and between servers and the command line tool (SyncChunks, SyncLog, Die).

I have not created all the service definitions, only the first. However, cmd/main.go should run unaltered with your code, giving you some help in figuring out what the service definitions should look like.

Task 2: Implement Merkle Trees in your Code

Synchronizing the sets of chunks present at different servers requires first describing those sets, which we do with Merkle trees. You must implement the Merkle trees in your own code; do not use third-party libraries for this.
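
One possible in-memory shape for these trees, as a hedged sketch: the field names simply mirror the ChildSigs/ChunkSigs terminology above, and it assumes the bucketIndex, leafSig, and interiorSig helpers sketched earlier:

    // Node is one Merkle tree node. Leaves carry ChunkSigs; interior
    // nodes carry ChildSigs and Children.
    type Node struct {
        Sig       string   // hash of the subtree rooted here
        ChildSigs []string // sigs of the 32 children (interior nodes only)
        ChunkSigs []string // sorted chunk sigs in this bucket (leaves only)
        Children  []*Node  // nil for leaves
    }

    // buildTree constructs a full, balanced tree of the given depth over
    // the chunk sigs currently in the local store.
    func buildTree(chunkSigs []string, depth int) *Node {
        nLeaves := 1
        for i := 1; i < depth; i++ {
            nLeaves *= 32
        }
        buckets := make([][]string, nLeaves)
        for _, cs := range chunkSigs { // assign each chunk to its leaf bucket
            b := bucketIndex(cs, depth)
            buckets[b] = append(buckets[b], cs)
        }
        level := make([]*Node, nLeaves) // build the leaves, then hash upward
        for i, b := range buckets {
            sort.Strings(b)
            level[i] = &Node{Sig: leafSig(b), ChunkSigs: b}
        }
        for len(level) > 1 {
            next := make([]*Node, 0, len(level)/32)
            for i := 0; i < len(level); i += 32 {
                kids := level[i : i+32]
                sigs := make([]string, 32)
                for j, k := range kids {
                    sigs[j] = k.Sig
                }
                next = append(next, &Node{Sig: interiorSig(sigs), ChildSigs: sigs, Children: kids})
            }
            level = next
        }
        return level[0]
    }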

Task 3: Implement the RPC Communication Routines

Both synchronizing servers must create Merkle trees. The RPC routines compare the two trees by requesting individual nodes from the remote server. Only remote nodes differing from the corresponding local nodes need be retrieved. Likewise, only remote chunks not present in the local database need be retrieved.

The remote log is copied in its entirety, and completely overwrites the local log. FUSE calls could be used to invalidate nodes of the local file system so that refreshes would change the view to the revised version; however, this is more trouble than it is worth for this project. Therefore, we assume that the local (pulling) server is always restarted before inspecting its new namespace.

Factoids:

  • A tree of height 1 consists solely of the root, which is a leaf, and has all chunks in its 'ChunkSigs' field.
  • Each TreeGet response from s2 contains only a single tree node, but that node may have multiple hashes in either ChildSigs or ChunkSigs.

Testing

I will run through the tests shown in the video, plus a few others. In particular, I will:

  • Create a single file on one server, synchronize it to the other server, verify chunk sizes and total data transferred.
  • Do the same with redis.
  • Verify that repeated pulls of identical systems only require a single node, and no chunks, to be transferred. In particular, appropriate data should be printed out by the tool in the cmd directory.
  • Verify that the above works for any height (though I will probably just check height 3 and 1).

Notes

  • The maximum size of a gRPC message is 4MB by default. We should not run up against this limit in my testing.
  • The _opt flavor of arguments is only supported on newer versions of the protobuf compiler. Should work fine on lagoon, where the compiler version is 3.6.1.
  • You do not need to worry about changes happening concurrently with synchronization, or simultaneous synchronization with multiple partners.

Grading

I will use the following rough guide to assign grades:

  • 60 pts: applying the -m, -l, and -D commands from the tool results in identical namespaces (after a restart)
  • 20 pts: minimal number of chunks and nodes are transferred
  • 20 pts: Merkle tree works with different depths (but still the same depth on both communicating servers).

Create a README.md file in your p3 directory describing what works, what doesn't, and any divergences from this document.

Extra credit

  • 20 pts: Get your system to immediately reflect changes without a restart.

Watch the video