diff --git a/notes/ram.md b/notes/ram.md
index 5eb5b80d8720e709e3617136c76dea606cace599..d733a169cfada4e0be687660d981ce31f972e584 100644
--- a/notes/ram.md
+++ b/notes/ram.md
@@ -118,10 +118,18 @@ Concurrency:
 **End-to-end recovery times of 1-2 seconds for 35 GB.**
 
-# Interesting notes
+## Interesting notes
 - scattered backups increase the possibility of data loss vs. all segment backups on the same replicas (compare: in a system w/ 2 backups, the probability that all three specific copies fail, vs. any three random failures hitting scattered backups)
 - availability from fast recovery rather than weakened consistency
 
+## Comments
+- I didn't see the paper touch on many limitations of their approach. While it seems good, are there any less obvious drawbacks?
+  RPC latency is one, but can they really control that if they're using TCP?
+- Are there any security risks with this approach? Could ORAM be used to make it more secure?
+- If we wanted to construct a worst case for recovery in their DRAM-based approach, what would the absolute worst case be?
+  Would it be all the data stored on a single disk that then failed, even with the data partitioned across multiple recovery leaders?
+
+
 # Linearizability
 
 # Implementing Linearizability at Large Scale and Low Latency
diff --git a/notes/spannerCockroach.md b/notes/spannerCockroach.md
index e59c05a9177c8706696a2490319242ff73f7860c..35bb99a4594890ff464960b35ecd90951dd022f4 100644
--- a/notes/spannerCockroach.md
+++ b/notes/spannerCockroach.md
@@ -84,7 +84,7 @@ write
 **Commit wait**: Coordinator leader ensures that clients cannot see any data committed
-by Ti until TT.after(s\_i) is true
+by *T<sub>i</sub>* until TT.after(s\_i) is true
 - expected wait is at least 2 * ε
 - wait is usually overlapped w/ Paxos execution
@@ -145,11 +145,10 @@ Approach:
 - doesn't help w/ independent causal chains
 
 **Another Problem:**
-- *Ti* reading data from mult nodes might fail to read already-committed data
+- *T<sub>i</sub>* reading data from mult nodes might fail to read already-committed data
 
 **Solution:**
 - choose xtion timestamp based on node wall time: [*commit timestamp, commit timestamp + max uncertainty*]
-- define upper bound by adding maxoffset
 - read quickly as long as no version is seen s.t.:
   - version is after the commit timestamp
   - but before the upper bound
diff --git a/notes/time.md~ b/notes/time.md~
new file mode 100755
index 0000000000000000000000000000000000000000..a868a914d7199a20c0e4c62194b98961fcb29ea9
--- /dev/null
+++ b/notes/time.md~
@@ -0,0 +1,204 @@
+# Finding global properties
+
+*Consistent Snapshots*
+
+Assume:
+- piecewise deterministic
+- reliable, ordered, uni-directional communication channels
+- no other comm
+
+To get a snapshot (atomically do):
+1. take ckpt at one process
+2. send token on all outgoing edges (as the next msg)
+
+At receipt of a token, if haven't already seen it (atomically do):
+1. take ckpt
+2. send token on all outgoing edges
+
+Want:
+- consistent checkpoints
+- channel contents
+
+(pic)
+
+Consistent state is:
+- A, B, C when they received the snapshot command
+- m_2
+
+Reconstructing this state:
+- might not ever have happened
+- **is** equiv to a state that happened (up to reordering concurrent events)
+
+# Another way: Logical Time for dist systems
+
+notions:
+- wall clock (lots of problems)
+- logical time
+
+## How should causality work?
+
+(motivation)
+- data init 0
+- all values unique
+
+```
+  P1        P2
+  w(x)1
+            w(y)1
+            r(y)1
+            r(x)?
+```
+
+We want:
+```
+  w(x)1 -> r(x)1
+```
+"session semantics"
+
+Also, what about:
+```
+  P1        P2
+  w(x)1     w(y)1
+  r(y)0     r(x)0
+```
+
+w(x)1 -> r(y)0 -> w(y)1 -> r(x)0 -> w(x)1
+
+oops (a cycle)
+
+## Happens-Before
+
+1. program order (`->`)
+2. **send** of a msg before its **receipt**
+3. transitive closure
+
+If time only moves forward:
+- `->` has no cycles
+- *partial* order
+
+if `!(e -> e')` and `!(e' -> e)`:
+- `e, e'` are **concurrent**
+
+## Example for Logical Time
+
+### Assume:
+- **fully-replicated** key-value (KV) store
+- unicast messages (assume they are *writes* to the KV)
+  - reliable, ordered
+- comm *only* through msg-passing
+
+### How to achieve causality?
+- events of one proc linearly ordered
+- send (write) *happens-before* receive (read)
+
+### **Lamport's scalar clocks**:
+1. internal event or send event at Pi: `Ci := Ci + d`
+2. each msg carries its send timestamp
+3. at Pj, on rcv of a msg w/ timestamp t: `Cj := max(Cj, t) + d` (d usually 1)
+
+(many choices: initial value 0, increment at send, set at receive)
+
+```
+ P1  e1       /> e5 \   e6
+ P2  e2    /  e3     \
+ P3  e4 \>            \> e7
+```
+
+Implications:
+- events at diff procs can have the same timestamp
+
+What if we need uniqueness?
+(tie-break on proc #)
+
+What do we lose w/ a scalar clock?
+- can't tell whether `e -> e'` even if `C(e) < C(e')`
+
+---
+How to fix?
+
+## Vector time (vector clocks)
+- each proc knows something about what other procs have seen
+- each proc Pi has a clock Ci that is a vector of len n (# procs)
+
+**Vector clock:**
+- initialized as the zero vector
+- increment local entry at each local event
+- at send:
+  - increment local entry
+  - send the msg w/ the vector
+- at receipt:
+  - increment local entry
+  - `Ci = piecewise-max(Ci, t)`
+
+So Ci[i] shows how many events have occurred at Pi.
+
+What do these relations mean w/ vectors (**u**, **v**):
+- `u <= v` iff forall i: `u[i] <= v[i]`
+- `u < v` iff `(u <= v)` && exists i s.t. `u[i] < v[i]`
+- `u || v` iff `!(u < v) && !(v < u)`
+
+What do we get?
+- can now tell if two points are causally related
+
+Properties:
+- if `u(a) < v(b)`, then `a -> b`
+- antisymmetry: if `u < v`, then `!(v < u)`
+- transitivity
+- if `u(a) < v(b)`, then `wall(a) < wall(b)`
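+
+A minimal sketch of these rules (Python; the `VectorClock` name, the
+process ids, and the fixed process count `n` are illustrative assumptions,
+not from the notes):
+
+```python
+class VectorClock:
+    """Vector clock for process `pid` out of `n` processes."""
+
+    def __init__(self, pid, n):
+        self.pid = pid
+        self.v = [0] * n               # initialized as the zero vector
+
+    def local_event(self):
+        self.v[self.pid] += 1          # increment local entry
+
+    def send(self):
+        self.local_event()             # increment local, then ship a copy
+        return list(self.v)
+
+    def receive(self, t):
+        self.local_event()             # increment local entry
+        self.v = [max(a, b) for a, b in zip(self.v, t)]  # piecewise max
+
+
+def leq(u, v):                         # u <= v, pointwise
+    return all(a <= b for a, b in zip(u, v))
+
+
+def lt(u, v):                          # u < v: u <= v and they differ
+    return leq(u, v) and u != v
+
+
+def concurrent(u, v):                  # u || v: neither happens-before
+    return not lt(u, v) and not lt(v, u)
+```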
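+
+For example (a hypothetical two-process run): a send at P1 yields the
+vector `[1, 0]`, while an independent local event at P2 yields `[0, 1]`;
+`concurrent([1, 0], [0, 1])` is true. After P2 receives that message its
+clock is `[1, 2]`, and `lt([1, 0], [1, 2])` holds, matching the
+properties above.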
+
+---------
+
+### What do we do w/ vector time?
+
+- finding the "earliest" event
+- session semantics
+- resolving deadlocks
+- consistent checkpoints
+  - ensure no msgs received but not sent, no cycles
+- an observer getting notifications from different processes might want an order
+- debugging
+  - can show that some event could not have caused another
+  - reduce the information necessary for replay
+- detecting race conditions
+  - if two procs interact outside of msgs, and the interaction is concurrent,
+    it's a race
+- measuring the "degree of *possible* parallelism"
+
+---
+
+## Matrix clocks
+
+Know what others know of others.
+
+(pic)
+
+    P1   1   2>  3\
+    P2   1   2   3>  4/ \
+    P3   1:"sky is blue"  2/  3>
+
+    3 4 3
+    2 4 n
+    2 4 3
+
+Updates to local Mi:
+- on Pi's local event, including a send: `Mi[i][i]++`
+- on receive of a matrix Mj from Pj:
+  - `Mi[i][i]++`
+  - forall r, k: `Mi[r][k] = max(Mi[r][k], Mj[r][k])` (merge all rows)
+  - forall k: `Mi[i][k] = max(Mi[i][k], Mi[j][k])`
+
+Each row is a lower bound on what the corresponding host knows, and that is
+useful for:
+- checkpointing
+- garbage collection
+
+Performance:
+- scalar is cheap
+- vector and matrix are mostly impractical, but:
+  - incremental updates (tuples) help
+
+---
+
diff --git a/notes/trutime1.png b/notes/trutime1.png
new file mode 100644
index 0000000000000000000000000000000000000000..b2d3e74bb5130aed0e8654ff7a294d415e738d12
Binary files /dev/null and b/notes/trutime1.png differ