diff --git a/notes/lin1.png b/notes/lin1.png
new file mode 100644
index 0000000000000000000000000000000000000000..67d953b9f24a95601d5c6d44bcec2c0017ffa37c
Binary files /dev/null and b/notes/lin1.png differ
diff --git a/notes/lin2.png b/notes/lin2.png
new file mode 100644
index 0000000000000000000000000000000000000000..8839735d7a5fe90f89caea0c6c62e5de0ede4ee3
Binary files /dev/null and b/notes/lin2.png differ
diff --git a/notes/linCrash.png b/notes/linCrash.png
new file mode 100644
index 0000000000000000000000000000000000000000..66acebd0dea14f28530066ce05283209df091809
Binary files /dev/null and b/notes/linCrash.png differ
diff --git a/notes/linRam.md b/notes/linRam.md
new file mode 100644
index 0000000000000000000000000000000000000000..9dda56916948f30c0ac666f721da5d45d75f515f
--- /dev/null
+++ b/notes/linRam.md
@@ -0,0 +1,122 @@
+# Implementing Linearizability at Large Scale and Low Latency
+
+## RAMCloud
+- basically all RAM
+- "durable writes" get replicated in other RAM
+- log-structured, *cleaner* etc.
+- massively parallel, 5 usec end-to-end RPCs.
+
+**Linearizability**:
+- collection of operations is *linearizable* if each operation appears to occur **instantaneously** and exactly once at **some point in time between its invocation and its completion**.
+- **must not be possible** for any client of the system, either the one initiating an operation or other clients operating concurrently, **to observe contradictory behavior**
+
+Good:
+
+
+Bad:
+
+
+Retry of idempotent operation bad:
+
+
+Bottom line:
+- *at-least-once* bad
+- *exactly-once* good
+  - detect and stop retry of completed op
+  - return same value as first execution
+
+
+## Architecture
+RPCs needed so that clients can be notified of completed operations.
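
The at-least-once vs. exactly-once contrast from the bottom line above can be made concrete with a toy counter. This is only a sketch under assumptions: the `Server` class, its method names, and the in-memory `completions` dict are hypothetical and are not RAMCloud code. It shows a retried increment being applied twice without duplicate detection, and being detected and answered with the first execution's value when a completion record is kept.

```python
class Server:
    """Toy single-object server; every name here is hypothetical."""

    def __init__(self):
        self.counter = 0
        self.completions = {}  # rpc_id -> result of the first execution

    def increment_at_least_once(self):
        # No duplicate detection: every delivery mutates state.
        self.counter += 1
        return self.counter

    def increment_exactly_once(self, rpc_id):
        # A retry of a completed op is detected and not re-executed;
        # it receives the same value the first execution returned.
        if rpc_id in self.completions:
            return self.completions[rpc_id]
        self.counter += 1
        self.completions[rpc_id] = self.counter
        return self.counter


# Pretend the reply to the first attempt is lost, so the client retries.
naive = Server()
naive.increment_at_least_once()
naive.increment_at_least_once()              # the retry is applied a second time

safe = Server()
rpc_id = ("client-1", 1)                     # (client ID, sequence number)
first = safe.increment_exactly_once(rpc_id)
retry = safe.increment_exactly_once(rpc_id)  # detected as a retry, not re-executed
```

In a real system the completion record has to be as durable as the mutation itself; the dict above ignores that entirely.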
+
+Problems / solutions:
+- RPC identification
+  - globally unique names
+- **completion record** durability
+  - must be atomic wrt the actual mutation
+  - store completion record w/ object
+- retry rendezvous
+  - find record, even if sent to different server
+  - store completion record w/ object
+  - for transactions, pick one datum
+- garbage collection: when we know request will never be retried
+  - after client acks response
+  - after client crashes
+
+## Client failure detection
+- leases
+  - must renew
+  - essentially a heartbeat
+  - want to scale to **a million clients**???!
+
+## Lifetime of an RPC
+
+When received by server:
+- `checkDuplicate()` in ResultTracker (on *server*)
+  - normal case returns new, proceeds
+  - completed previously, returns previous value
+  - in-progress (toss the request or nack the client)
+  - stale retry - error to client
+- *normal case of new RPC*
+  - execute the RPC
+  - create completion record
+  - return to client
+- *assumes some local durable log*
+
+## Design
+
+`RequestTracker` (on client)
+- tracks active RPCs
+- `firstIncomplete` sequence number added to outgoing RPCs to servers
+  - server deletes records for earlier RPCs
+- only 512 outstanding RPCs from single client
+
+`LeaseTracker` (on clients and servers)
+
+`ResultTracker` (on server)
+
+
+### Lifetime of RPC
+- new RPC - unique identifier using client ID w/ new seq num from RequestTracker (from server?)
+- server calls `checkDuplicate`: new / completed / in_progress / stale
+- server executes
+  1. creates RPC identifier
+  1. creates object identifier (completion record will migrate w/ object)
+  1. result returns
+  1. operation side effects and completion record **appended to a log atomically**
+  1. `recordCompletion` on ResultTracker, system-dependent
+  1. return result to client
+
+### Lease management
+- ZooKeeper
+- renewal overhead
+  - low because *stable storage not updated on renewal*
+- validation overhead
+  - in-memory
+  - *cluster clock* for servers to do most lease validation
+  - ask lease server only when close to expiration
+
+## Transactions with RIFL (RAMCloud implementation)
+- Sinfonia-ish two-phase commit:
+  - `prepare` (version/lock checks)
+  - `decision` (this phase in background, client already notified)
+  ("the transaction is effectively committed once a durable lock record has been written for each of the objects")
+- updates deferred until commit request
+- reads executed normally, *versions recorded*
+- writes on commit phase, possibly w/ expected version
+  - fail *if version-check fails, or locked by another transaction*
+- fast case
+  - *single server* owns all objects in transaction
+  - *read-only*, even in distributed case only a single round
+- on client crash, recovery coordinator finishes, hoping to abort unless already committed
+
+## Issues
+- worried that storing completion records will not scale well
+  - local operation
+  - either disk or (for RAMCloud) replicating elsewhere
+- why optional version checking in transactions? (atomic operation primitive)
+
+
+## Questions/comments
+
+
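
As a comment on the "Lifetime of an RPC" section: the four `checkDuplicate` outcomes (new / completed / in_progress / stale) and the `firstIncomplete` cleanup can be sketched as a small state table. This is a minimal sketch under assumptions, not the paper's implementation: the Python class layout, the `_RUNNING` sentinel, and the in-memory dicts are hypothetical stand-ins for completion records that would really live in the durable log.

```python
from enum import Enum


class Status(Enum):
    NEW = "new"
    COMPLETED = "completed"
    IN_PROGRESS = "in_progress"
    STALE = "stale"


_RUNNING = object()  # hypothetical marker: RPC accepted but not yet completed


class ResultTracker:
    """In-memory stand-in for the per-server tracker (assumed layout)."""

    def __init__(self):
        self.first_incomplete = {}  # client_id -> lowest seq num not yet acked
        self.records = {}           # (client_id, seq) -> result or _RUNNING

    def check_duplicate(self, client_id, seq, first_incomplete):
        # The client piggybacks first_incomplete on each RPC, so records
        # for earlier sequence numbers can be deleted.
        known = max(self.first_incomplete.get(client_id, 0), first_incomplete)
        self.first_incomplete[client_id] = known
        for key in [k for k in self.records if k[0] == client_id and k[1] < known]:
            del self.records[key]

        if seq < known:
            return Status.STALE, None        # already acked: error to client
        key = (client_id, seq)
        if key not in self.records:
            self.records[key] = _RUNNING     # normal case: proceed to execute
            return Status.NEW, None
        if self.records[key] is _RUNNING:
            return Status.IN_PROGRESS, None  # toss the request or nack the client
        return Status.COMPLETED, self.records[key]

    def record_completion(self, client_id, seq, result):
        # In the real design this goes with the atomic log append.
        self.records[(client_id, seq)] = result


tracker = ResultTracker()
s1 = tracker.check_duplicate("c1", 1, first_incomplete=1)
s2 = tracker.check_duplicate("c1", 1, first_incomplete=1)
tracker.record_completion("c1", 1, "ok")
s3 = tracker.check_duplicate("c1", 1, first_incomplete=1)
s4 = tracker.check_duplicate("c1", 2, first_incomplete=2)  # acks seq 1
s5 = tracker.check_duplicate("c1", 1, first_incomplete=2)
```

The sketch says nothing about leases or crash recovery; it only illustrates how the four outcomes fall out of one lookup plus the `firstIncomplete` watermark.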