Don’t Settle for Eventual: Scalable Causal Consistency for Wide-Area Storage with COPS
Contributions:
- causal+
- linearizable within cluster
- replicated keys using chain replication
- causal across clusters
- bolt-on-ish
- use of explicit dependencies (before bolt-on)
- get (read) transactions
COPS
- individual accesses
- explicit tracking of dependences
- nearest
- contexts (threads, or explicit)
- Lamport Clocks
- version is CLOCK.nodeID
- convergent, not necessarily eventual
- "context" implemented incrementally as set of nearest neighbors
- read:
- return value added to context
- write:
- context added to write
- context zero'd out
- context set to new write
- dependencies propagated with writes to remote clusters
- write installed at remote cluster by issuing depCheck()s to the owner of each write in "nearest" (see sketch below)
- depCheck() blocks until true
- depCheck() re-issued on time-out
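A minimal sketch of the client-side context tracking and the remote install path just described (Python; `replicate_async`, `dep_check`, and the store layout are stand-ins for illustration, not COPS's actual API):

```python
# Sketch only: versions are Lamport timestamps (CLOCK, nodeID); the context is
# the incrementally-maintained set of "nearest" dependencies.

import time


def replicate_async(key, version, value, deps):
    """Stub: a real system would ship the write (with its deps) to remote clusters."""
    pass


def dep_check(key, version):
    """Stub: ask the key's owner whether `version` (or newer) is installed locally."""
    return True


class CopsContext:
    def __init__(self, local_store, node_id):
        self.store = local_store      # key -> (version, value, deps)
        self.node_id = node_id
        self.clock = 0                # Lamport clock
        self.nearest = set()          # the context: set of (key, version)

    def get(self, key):
        version, value, _deps = self.store[key]
        self.clock = max(self.clock, version[0]) + 1      # Lamport receive rule
        self.nearest.add((key, version))                  # read: value added to context
        return value

    def put(self, key, value):
        self.clock += 1
        version = (self.clock, self.node_id)              # version is CLOCK.nodeID
        deps = frozenset(self.nearest)                    # write: context added to write
        self.store[key] = (version, value, deps)
        self.nearest = {(key, version)}                   # context zero'd out, set to new write
        replicate_async(key, version, value, deps)        # deps travel with the write
        return version


def apply_remote_write(local_store, key, version, value, deps):
    """At the receiving cluster: install only after every 'nearest' dep checks out."""
    for dep_key, dep_version in deps:
        while not dep_check(dep_key, dep_version):        # blocks; re-issued on timeout
            time.sleep(0.1)
    current = local_store.get(key)
    if current is None or current[0] < version:           # last-writer-wins => convergent
        local_store[key] = (version, value, deps)
```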
COPS-GT
causal not always enough
Straightforward problems
Alice
s0: ACL == "world"
p0:"all is good"
s1: change ACL to exclude Boss
p2: "hate my job"
Assume reads implemented by ACL check followed by read
Boss:
- check security: see s0, which includes Boss
- read returns p2
- Alice is fired
TOCTOU vulnerability (time-of-check-to-time-of-use)
Another implementation
Might consider re-implementing library to issue GETs in reverse order, checking "nearest" as you go
- works for above example (p2 depends on s1)
But if Alice adds a few more actions:
p3: delete last post
s2: add Boss back to ACL
...the reverse order doesn't help
- could return p2, s2 (both possibly latest version, causally consistent individually when issued)
A causally-consistent set of reads
s0, p0: "world" / "all is good" s1, p0: "no-boss" / "all is good" s1, p2: "no-boss" / "hate job" s1, p3: "no-boss" / "all is good" s2, p3: "world" / "all is good"
GT implementation
- "all dependences"
- really means one version per key that has been modified in the causal history
- "nearest" still big win over this because there might be many keys
- individual writes not coordinated; enabled by keeping all dependences, not just nearest
- clearly need GC
Notes
- Kaitlyn - just measured against themselves
- Elliot - ALPS
- Huayang - CAP, what does paxos give up?
- Nikhil - when not convergent?
Bolt-On Consistency
Fun scenario
- Sally posts "I think Billy is missing!"
- Seconds later, Billy calls mom, says "Safe!", Sally posts "False alarm!"
- James posts "What a relief!"
Henry sees JUST the original post and James's, and is confused.
Shim layer to support causal consistency over weaker (eventually-consistent) systems.
Design and Implementation
Write path:
- Client write performed in local store, then sent to the ECDS w/ metadata.
Read path (causal):
- Client read satisfied by the local store.
Read path (pessimistic):
- Synchronously check the data and its dependencies in the ECDS, rather than relying on the async resolve process to do this for us.
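A rough sketch of these paths, assuming a toy ECDS with put/get, integer versions, and the put(key, value, after) interface listed under Operations below. It checks only direct dependencies, a simplification of the causal-cut check the paper requires; all names are illustrative, not the paper's code:

```python
# Bolt-on shim sketch. The local store is causally consistent; the ECDS is only
# eventually consistent and may expose writes whose dependencies aren't visible yet.

import itertools

_counter = itertools.count(1)


def new_version():
    """Stub version generator (a real shim needs globally unique versions)."""
    return next(_counter)


def shim_put(local_store, ecds, key, value, after):
    """Write path: perform in local store, then send to the ECDS with metadata."""
    deps = {k: v for (k, v) in after}                     # 'after' = previously returned writes
    version = new_version()
    local_store[key] = (version, value, deps)
    ecds.put(key, {"version": version, "value": value, "deps": deps})
    return (key, version)


def shim_get_causal(local_store, key):
    """Causal read path: satisfied entirely by the local store."""
    entry = local_store.get(key)
    return entry[1] if entry else None


def shim_get_pessimistic(local_store, ecds, key):
    """Pessimistic read path: synchronously check the data and its dependencies
    in the ECDS, instead of waiting for the async resolve process."""
    record = ecds.get(key)
    if record is not None:
        for dep_key, dep_version in record["deps"].items():
            dep = ecds.get(dep_key)
            if dep is None or dep["version"] < dep_version:
                break                                     # a dependency isn't visible yet
        else:
            local_store[key] = (record["version"], record["value"], record["deps"])
            return record["value"]
    return shim_get_causal(local_store, key)              # fall back to causally-safe local value
```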
Motivation
- Several NoSQL stores have added stronger consistency, implying need (Vogels: "Choosing Consistency")
- HAT work shows causal consistency is achievable even in the face of partitions
- allows purely local reads (so does eventual!)
- no production-ready store provides it
Challenges
- handling overwritten histories:
- eventually-consistent stores often choose a single winning write; existing causal-consistency algorithms fail because the causal layer might not see all writes
- maintaining availability
Innovations
- alg that buffers writes and ensures writes satisfy causal cut.
Causality
Differentiate between:
- potential causality: what we usually think of, and which they do not consider further, and
- explicit causality, i.e. tags provided by app
Bolt-on separation of concerns
- safety properties (causal shim) from
- liveness (eventually consistent data store (ECDS))
Bolt-on
Operations:
- get(key)
- put(key, value, after), where after is any set of previously returned writes (identifiers)
Consistency cases
Let's assume x1 -> y1 (x1 existed in store when/where y1 was created). So any replica reading y1 should have (or have had) x1 as well. Assume x1, y1 created by different processes, and discuss what happens at P when it wants to read y:
- P has not received any writes
- P has received x1, but not y1
- P has received y1, but not x1
- P has received both
Answers:
- P has not received any writes, returns y=null
- P has received x1, but not y1. x1 put in local store, returns y=null.
- P has received y1, but not x1. P buffers y1, returns y=null.
- P has received both, both put in local store, returns y=y1.
The fun case is 3. We could just go ahead and let P read y1, but we'd have to remember that, and stall the process if it subsequently reads x.
- this means we give up availability (consider network partition)
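A minimal sketch of the buffering rule behind case 3 (illustrative names, not the paper's code), assuming totally ordered versions where a same-or-newer local version covers a dependency:

```python
# Simplification of the paper's algorithm, which checks for a full causal cut.

def covered(local_store, key, version):
    entry = local_store.get(key)
    return entry is not None and entry[0] >= version


def try_apply(local_store, buffer, write):
    """Apply a received write only if all of its dependencies are already covered
    locally; otherwise buffer it (reads stay local, so availability is preserved)."""
    key, version, value, deps = write
    if all(covered(local_store, k, v) for k, v in deps.items()):
        local_store[key] = (version, value, deps)
        return True
    buffer.append(write)                                  # e.g. y1 waits here until x1 arrives
    return False


def drain(local_store, buffer):
    """Re-check buffered writes until no more progress is made."""
    progress = True
    while progress:
        progress = False
        for write in list(buffer):
            buffer.remove(write)
            if try_apply(local_store, buffer, write):
                progress = True
```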
Overwritten Histories
The following is the motivation to NOT just use the underlying store to hold incoming writes directly.
Example (concurrent):
- P1 writes x1 -> y1 -> z1
- P2 writes y2
Assume the shim stores immediate dependencies, and that P has received z1, and then y2. P did know that it needed y1, but doesn't know about x1. The overwrite by y2 loses the dependency on x1, and the transitive dependency x1 -> z1 is also lost.
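A toy illustration of the loss, assuming the shim writes only each update's immediate dependencies into the underlying store:

```python
# An overwrite in the underlying store discards the previous write's metadata.

store = {}


def put(key, version, immediate_deps):
    store[key] = (version, immediate_deps)        # overwrite loses the previous deps


def known_history(key):
    """What a replica can reconstruct by following the deps stored for `key`."""
    seen, stack = set(), [key]
    while stack:
        k = stack.pop()
        if k in store and k not in seen:
            seen.add(k)
            stack.extend(dep_key for dep_key, _v in store[k][1])
    return seen


put("x", 1, [])                    # P1: x1
put("y", 1, [("x", 1)])            # P1: y1 depends on x1
put("z", 1, [("y", 1)])            # P1: z1 depends on y1 (and transitively on x1)
put("y", 2, [])                    # P2's concurrent y2 overwrites y1's metadata
print(known_history("z"))          # {'z', 'y'}: the dependency on x1 is no longer discoverable
```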
Causal Cuts
- Causal cut (paraphrase)
- A set of key-versions, at most one per key, such that all dependencies of any key are either part of the cut, or concurrent with the cut's version of that key.
Some causal cuts:
- w1, x1, y1 - yes
- x1, w2, y1 - yes
- w1, x1, y1, z1 - yes
- x1, z1 - NO: need y1, w1 or w2
- w2, x1, y1, z1 - yes
- w1 - yes
- w1, z1 - NO: need x1, y1
- w2 - yes
- w1, x1 - yes
- y1, z1 - NO: need x1, w1 or w2
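A sketch of the cut check as paraphrased above. The dependency map and the concurrency predicate are explicit inputs, and the example history is inferred from the cuts just listed (w1 -> x1 -> y1 -> z1, with w2 concurrent to w1):

```python
def is_causal_cut(cut, deps, concurrent):
    """
    cut:        dict key -> version (at most one version per key)
    deps:       dict (key, version) -> set of (key, version) dependencies
    concurrent: predicate on two versions of the same key
    """
    for key, version in cut.items():
        for dep_key, dep_version in deps.get((key, version), set()):
            cut_version = cut.get(dep_key)
            if cut_version is None:
                return False                  # the dependency's key is missing entirely
            if cut_version != dep_version and not concurrent(cut_version, dep_version):
                return False                  # neither the dep itself nor concurrent with it
    return True


# Inferred example history: deps list the full causal history of each version.
deps = {
    ("x", 1): {("w", 1)},
    ("y", 1): {("x", 1), ("w", 1)},
    ("z", 1): {("y", 1), ("x", 1), ("w", 1)},
}


def concurrent(a, b):
    return {a, b} == {1, 2}                   # crude: only w1/w2 coexist in this example


print(is_causal_cut({"w": 2, "x": 1, "y": 1, "z": 1}, deps, concurrent))   # True
print(is_causal_cut({"x": 1, "z": 1}, deps, concurrent))                   # False: need y1 and a w
```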
Metadata
Expected size of the dependency set is the number of unique keys in the causal history.
Errata
- Callback on convergence allows key-versions to be dropped from histories once seen everywhere (GC).
- store pruning: drop keys not needed any more from local store; might require dropping other writes that depend on it