**End-to-end recovery times of 1-2 seconds for 35 GB.**
## Interesting notes
- scattered backups increase the possibility of data loss (vs. all segment backups on the same replicas: think of the probability that, in a system w/ 2 backups, all three of those specific machines fail together, vs. three random failures hitting some scattered segment's copies); a rough calculation is sketched below
- availability comes from fast recovery rather than weakened consistency
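A back-of-the-envelope sketch of that probability argument. The cluster size, failure count, and segment count below are made-up numbers for illustration, not from the paper:

```python
from math import comb

N = 100    # nodes in the cluster (assumed)
F = 3      # simultaneous node failures (assumed)
S = 1000   # segments, each stored as 1 primary + 2 backups (assumed)

# Co-located: every segment's 3 copies live on the same fixed 3 nodes,
# so data is lost only if exactly those 3 nodes are among the F failures.
p_colocated = comb(N - 3, F - 3) / comb(N, F)

# Scattered: each segment's 3 copies sit on an independent random 3-node set,
# so *some* segment loses all copies far more often (but only a small slice of data).
p_one_segment = comb(F, 3) / comb(N, 3)
p_scattered = 1 - (1 - p_one_segment) ** S

print(f"P(lose some data) co-located: {p_colocated:.2e}, scattered: {p_scattered:.2e}")
```

Scattering trades a higher chance of losing *some* segment for much faster recovery of each master's data.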
## Comments
- I didn't see the paper touch on many limitations of their approach. While it seems good, are there any less obvious drawbacks?
Latency of RPCs but can they really control that if they're using TCP?
- Are there any security risks with this approach? Could ORAM be used to make it more secure?
- If we wanted to implement a worst-case recovery in their approach to using DRAM, what would be the absolute worst case?
Would it be if all the data was stored on a single disk and that disk failed, and the data was partitioned across multiple recovery leaders?
# Implementing Linearizability at Large Scale and Low Latency
---
**Commit wait**:
Coordinator leader ensures that clients cannot see any data committed
by *T<sub>i</sub>* until TT.after(s\_i) is true
- expected wait is at least 2 * &epsilon;
- wait is usually overlapped w/ Paxos execution
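A minimal sketch of the commit-wait rule, assuming a TrueTime-style interval clock; `tt_now()` and `EPSILON` are stand-in stubs, not the real API:

```python
import time

EPSILON = 0.004  # assumed clock-uncertainty bound (seconds)

def tt_now():
    """Stub for TT.now(): returns an (earliest, latest) interval around real time."""
    t = time.time()
    return (t - EPSILON, t + EPSILON)

def commit_wait():
    s_i = tt_now()[1]              # commit timestamp s_i >= TT.now().latest
    while tt_now()[0] <= s_i:      # wait until TT.after(s_i) is true
        time.sleep(EPSILON / 4)
    return s_i                     # now safe to make T_i's data visible (~2 * epsilon later)
```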
---
- doesn't help w/ independent causal chains
**Another Problem:**
- *T<sub>i</sub>* reading data from mult nodes might fail to read already-committed data
**Solution:**
- choose transaction timestamp based on node wall time: [*commit timestamp, commit timestamp + max uncertainty*]
- define the upper bound by adding the max offset
- read quickly as long as no version seen s.t.:
- version after commit timestamp
- but before upper bound
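A tiny sketch of the read-side check implied by the bullets above; the function and argument names are illustrative only:

```python
def can_read_immediately(commit_ts, max_offset, version_timestamps):
    """Read at commit_ts returns right away unless some version falls inside
    the uncertainty window (commit_ts, commit_ts + max_offset]."""
    upper_bound = commit_ts + max_offset   # upper bound = commit timestamp + max offset
    return not any(commit_ts < v <= upper_bound for v in version_timestamps)
```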
# Finding global properties
*Consistent Snapshots*
Assume:
- piecewise deterministic
- reliable, ordered, uni-directional communication channels
- no other comm
To get snapshot (atomically do):
1. take ckpt at one process,
2. send token on all outgoing edges (next msg)
At receipt of token, if haven't already seen it (atomically do):
1. take ckpt
1. send token on all outgoing edges
Want:
- consistent checkpoints
- channel contents.
![snapshots](snapshots.png)
Consistent state is:
- A, B, C when they received the snapshot command
- m_2
Reconstructing this state:
- might not ever have happened
- **is** equiv to state that happened
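A minimal sketch of the token protocol above (Chandy-Lamport style), assuming reliable, ordered channels; the channel plumbing and application state are stand-ins:

```python
TOKEN = "TOKEN"   # the snapshot marker

class Process:
    """Sketch of the checkpoint/token protocol described above."""

    def __init__(self, pid, out_channels, in_channels):
        self.pid = pid
        self.out_channels = out_channels   # send callables, one per outgoing edge
        self.in_channels = in_channels     # names of incoming channels
        self.state = 0                     # application state (illustrative)
        self.ckpt = None                   # local checkpoint
        self.recording = {}                # channel -> recorded in-flight messages

    def _ckpt_and_send_tokens(self):
        self.ckpt = self.state             # 1. take ckpt
        for send in self.out_channels:     # 2. send token on all outgoing edges
            send(TOKEN)
        # record each incoming channel until its token arrives (channel contents)
        self.recording = {ch: [] for ch in self.in_channels}

    def start_snapshot(self):              # the initiating process
        self._ckpt_and_send_tokens()

    def receive(self, channel, msg):
        if msg == TOKEN:
            if self.ckpt is None:          # first token seen: ckpt + forward
                self._ckpt_and_send_tokens()
            self.recording.pop(channel, None)  # this channel's contents are complete
        else:
            if channel in self.recording:  # message was in flight at snapshot time
                self.recording[channel].append(msg)
            self.state += msg              # apply the message (e.g., a write)
```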
# Another way: Logical Time for dist systems.
notions:
- wall clock (lots of problems)
- logical time
## How should causality work?
(motivation)
- data init 0
- all values unique
```
P1 P2
w(x)1
w(y)1
r(y)1
r(x)?
```
```
w(x)1 -> r(x)1
```
"session semantics"
Also, what about:
```
P1 P2
w(x)1 w(y)1
r(y)0 r(x)0
```
w(x)1 -> r(y)0 -> w(y)1 -> r(x)0 -> w(x)1
oops
## Happens-Before
1. program order ('->')
1. **send** msg before **receipt**
1. transitive closure
If time only moves forward:
- '->' has no cycles
- *partial* order
if `!(e -> e')` and `!(e' -> e)`
- `e,e'` **concurrent**
## Example for Logical Time
### Assume:
- **fully-replicated** key-value (KV) store
- unicast messages (assume they are *writes* to the kv)
- reliable, ordered
- comm *only* through msg-passing.
### How to achieve causality?
- events of one proc linearly ordered
- send (write) *happens-before* receive (read)
### **Lamport's scalar clocks**:
1. internal event or send event at Pi, `Ci := Ci + d`
2. each msg carries send timestamp
3. at Pj, on rcv of msg w/ timestamp t: `Cj := max(Cj, t) + d` (d usually 1)
(many design choices: initial value 0, whether to increment at send, how receive sets the clock)
```
P1 e1 /> e5 \ e6
P2 e2 / e3 \
P3 e4 \> e7
```
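A minimal sketch of the rules above with d = 1; the (clock, proc #) tie-break gives the unique, totally ordered stamps discussed next:

```python
class LamportClock:
    """Scalar clock: one counter per process, d = 1."""

    def __init__(self, pid):
        self.pid = pid
        self.c = 0

    def local_event(self):
        self.c += 1                      # rule 1: internal event
        return self.c

    def send(self):
        self.c += 1                      # rule 1: a send is a local event too
        return self.c                    # rule 2: timestamp carried on the msg

    def receive(self, t):
        self.c = max(self.c, t) + 1      # rule 3: Cj := max(Cj, t) + d
        return self.c

    def stamp(self):
        return (self.c, self.pid)        # (clock, proc #) tie-break => unique total order
```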
Implications:
- events at diff procs can have same timestamp
What if we need uniqueness?
(tie-break of proc #)
What do we lose w/ scalar clock?
- can't conclude `e -> e'` even if `C(e) < C(e')`
---
How to fix?
## Vector time (vector clocks)
- each proc knows something about what other procs have seen
- Each proc Pi has clock Ci that is vector of len n (# procs).
**Vector clock:**
- initialized as zero vector
- increment local at event
- at send:
- increment local
- send w/ vector
- at receipt
- increment local
- `Ci = piecewise-max(Ci, t)`
![vector clocks](vectorclock.png)
So Ci[i] shows how many events have occurred at Pi.
What do these relations mean w/ vectors (**u**, **v**):
- `u <= v` iff forall i: `(u[i] <= v[i])`
- `u < v` iff `(u <= v)` && exists i s.t. `u[i] < v[i]`
- `u || v` iff `!(u < v) && !(v < u)`
What do we get?
- can now tell if two points are causally related
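A minimal sketch of the vector-clock rules and the relations above, assuming n processes with ids 0..n-1:

```python
class VectorClock:
    def __init__(self, pid, n):
        self.pid = pid
        self.v = [0] * n               # initialized as the zero vector

    def local_event(self):
        self.v[self.pid] += 1          # increment local entry at each event

    def send(self):
        self.v[self.pid] += 1          # increment local, then ship the vector
        return list(self.v)

    def receive(self, t):
        self.v[self.pid] += 1          # increment local
        self.v = [max(a, b) for a, b in zip(self.v, t)]   # piecewise max w/ msg vector

# the relations from above
def leq(u, v):        return all(a <= b for a, b in zip(u, v))
def lt(u, v):         return leq(u, v) and any(a < b for a, b in zip(u, v))
def concurrent(u, v): return not lt(u, v) and not lt(v, u)
```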
Properties:
- if `u(a) < v(b)`, then `a -> b`
- antisymmetry: if `u < v`, then `!(v < u)`
- transitivity
- if `u(a) < v(b)`, then `wall(a) < wall(b)`
---------
### What do we do w/ vector time?
- finding "earliest" event
- session semantics
- resolve deadlocks
- consistent checkpoints
- ensure no msgs received but not sent, no cycles
- observer getting notifications from different processes might want order
- debugging
- can show that some event can not have caused another
- reduce information necessary to replay
- detect race conditions
- if two procs interact outside msgs, and interaction concurrent, it's
a race
- measuring "degree of *possible* parallelism"
---
## Matrix clocks
Know what others know of others
(pic)
```
P1   1    2>   3\
P2   1    2    3>   4/ \
P3   1:"sky is blue"   2/   3>

3 4 3
2 4 n
2 4 3
```
Updates to local Mi:
- Pi's local event, including a send: Mi[i][i]++
- On receipt of Pj's matrix Mj:
  - Mi[i][i]++
  - forall k: Mi[i][k] = max(Mi[i][k], Mj[j][k])
Gives a lower bound on what other hosts know, which is useful for:
- checkpointing
- garbage collection
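A sketch of the matrix-clock updates; merging the entire received matrix element-wise is an added assumption here (the notes only show the update to Pi's own row), and the column-minimum helper illustrates the "lower bound on what other hosts know" idea:

```python
class MatrixClock:
    def __init__(self, pid, n):
        self.pid = pid
        self.m = [[0] * n for _ in range(n)]   # row i = what we think Pi has seen

    def local_event(self):
        self.m[self.pid][self.pid] += 1        # Pi's local event, including a send

    def send(self):
        self.local_event()
        return [row[:] for row in self.m]      # the whole matrix rides on the msg

    def receive(self, sender, w):
        n = len(self.m)
        # assumed: element-wise max merge of everything the sender's matrix shows
        for k in range(n):
            for l in range(n):
                self.m[k][l] = max(self.m[k][l], w[k][l])
        # update Pi's own row with the sender's row: Mi[i][k] = max(Mi[i][k], Mj[j][k])
        for k in range(n):
            self.m[self.pid][k] = max(self.m[self.pid][k], w[sender][k])
        self.m[self.pid][self.pid] += 1

    def everyone_knows_of(self, k):
        # lower bound on how much of Pk's history *all* hosts have seen
        return min(row[k] for row in self.m)   # handy for checkpointing / garbage collection
```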
Performance
- scalar cheap
- vector, matrix mostly impractical, but:
- incremental (tuples)
---