diff --git a/notes/spannerCockroach.md b/notes/spannerCockroach.md new file mode 100644 index 0000000000000000000000000000000000000000..c51e1dc846a4a8791d37a589e3bcbadef427ce1a --- /dev/null +++ b/notes/spannerCockroach.md @@ -0,0 +1,164 @@ + +# Spanner + +### Construction: + +- a single *universe* +- universe has many *zones*, each either a server farm or a part of +one. The unit of physical replication. +- each zone has 100 .. several thousand *spanservers* +- a spanserver consists of multiple *replicas* + +## Spanserver: + + +- implements between 100 and 1000 *tablets* +- a tablet is a bag of mappings: + + (key:string, timestamp:int64) --> string + +- runs a *Paxos* (long-lived leaders) instance on top of each tablet. + + +### Paxos group + +- leader replica implements *lock table* +- two-phase locking +- leader also implements a *transaction manager* + +### Transaction implementation + +- If only a single Paxos instance (single tablet) involved, +transaction manager skipped. +- Otherwise, two-phase commit across the paxos leaders. + +### Directories (buckets) + +- tags in a tablet that have common prefixes +- smallest unit of placement by apps + +### Data Model + +- schematized semi-relational tables +- query language +- general-purpose read/write transactions + +Motivated by: + +- popularity of megastore, even though it's slow +- popularity of Dremel (data analysis tool) +- complaints about bigtable: even cons across data centers, no + "cross-row" transactions (Percolator built to address this) + +## TrueTime + + + +Node time: + +- *TimeMaster* machines per data center (GPS and/or atomic clocks) +- *timeslave daemon* per machine +- Marzullo's algorithm to reject outlying masters +- slowly increasing uncertainty between time syncs: ε between +1 and 7ms + +Operations: + +- *read-write transactions* +- *read-only transactions* +- *snapshot reads* +- transactions are `internally` re-tried (client not involved) +- implicit extension of leases by successfull writes + + +### Assigning timestamps to RW transactions + +- two-phase locking +- timestamp can be any time after all locks acquired, before any locks +released +- spanners assigns trans the timestamp paxos gives the trans commit +write + +> If start of T2 is after commit of T1, then the commit timestamp of +> T2 must be greater than that of T1. + +**Commit wait**: +Coordinator leader ensure that clients cannot see any data committed +by Ti until TT.after(s\_i) is true + +- expected wait is at least 2 * Ε +- wait is usually overlapped w/ Paxos execution + + + +[The Talk](https://www.usenix.org/conference/osdi12/technical-sessions/presentation/corbett) + +Thoughts: +- ian: synergy between long-lived leaders and two-phase locking. +- would love to know how active `MoveDir` is, and more about it's policies. + + + +# CockroachDB: Life Without Atomic Clocks + +## Quotes +- "our challenge was in providing similar guarantees of external consistency +without atomic clocks." +- "A system which provides an absolute ordering of causally related events, +regardless of observer, provides for a particularly strong guarantee of +external consistency known as “linearizability.†+- "in a non-distributed database, serializability implies linearizability" +- "CockroachDB’s external consistency guarantee is by default only + serializability, though with some features that can help bridge the gap in + practice." + + +"Before a node is allowed to report that a transaction has committed, it must +wait 7ms. Because all clocks in the system are within 7ms of each other, +waiting 7ms means that no subsequent transaction may commit at an earlier +timestamp, even if the earlier transaction was committed on a node with a +clock which was fast by the maximum 7ms" + +Approach: +- "The possibility of reordering commit timestamps for causally related + transactions is likely a marginal problem in practice." + +- "What *could* happen is that examining the database at a historical timestamp + might yield paradoxical situations where transaction A is not yet visible + while transaction B is, even though transaction A is known to have preceded + B, as they’re causally related. However, this can only happen if there’s no + overlap between the keys read or written during the transactions." + +## Approach + +**causality token** - max timestamp encountered (anywhere) in a transaction. +- used to ensure that causal chains maintained by serving as minimum commit + timestamp for that transaction. + + +**Problem:** +- *Ti* reading data from mult nodes might fail to read already-committed data + +**Solution:** +- choose xtion timestamp based on node wall time: [*commit timestamp , commit timestamp + max uncertainty*] +- define upper bound by adding maxoffset +- read quickly as long as no version seen s.t.: + - version after commit timestamp + - but before upper bound +- if so, re-start transaction w/ version's timestamp + - upper bound doesn't change. + - lower bound gets *higher*, just above the problematic node write (window shrinks) + - never restart based on same-key version read from same node in one + transaction. + + + +So: +- spanner always short-waits on writes for short interval (~7ms) +- cockroachdb sometimes long waits on reads (~250ms) +- causal reverse still possible (though probably not a problem in practice) + +Strict serializability makes all of the guarantees of one-copy serializability +that we discussed above. In addition, it guarantees that if a transaction X +completed before transaction Y started (in real time) then X will be placed +before Y in the serial order