# Don’t Settle for Eventual: Scalable Causal Consistency for Wide-Area Storage with COPS

## Contributions
- causal+
  - linearizable within a cluster
    - replicated keys using `chain replication`
  - causal across clusters
- bolt-on-ish
  - use of explicit dependencies (before bolt-on)
- get (read) transactions

## COPS
- individual accesses
- explicit tracking of dependencies
  - *nearest*
- contexts (threads, or explicit)
- Lamport clocks
  - a version is CLOCK.nodeID
  - convergent, not necessarily eventual
- "context" implemented incrementally as a set of nearest dependencies
  - read:
    - returned value added to context
  - write:
    - context added to the write
    - context zeroed out
    - context set to the new write
- propagated with writes to remote clusters
- a write is installed at a remote cluster by:
  - `depCheck()`s to the owner of each write in its "nearest" set
  - `depCheck()` blocks until true
  - `depCheck()` re-issued on timeout

## COPS-GT

*causal is not always enough*

### Straightforward problems

Alice:
```
s0: ACL == "world"
p0: "all is good"
s1: change ACL to exclude Boss
p2: "hate my job"
```

Assume reads are implemented by an ACL check followed by a read.

Boss:
- checks security: sees s0, which includes Boss
- read returns p2
- Alice is fired

*TOCTOU* vulnerability (time-of-check-to-time-of-use)

### Another implementation

Might consider re-implementing the library to issue GETs in reverse order, checking "nearest" as you go:
- works for the example above (p2 depends on s1)

But if Alice adds a few more actions:
```
p3: delete last post
s2: add Boss back to ACL
```
...the reverse order doesn't help:
- could return p2, s2 (each possibly the latest version, and each causally consistent individually when issued)

### A causally-consistent set of reads
- s0, p0: "world" / "all is good"
- s1, p0: "no-boss" / "all is good"
- s1, p2: "no-boss" / "hate job"
- s1, p3: "no-boss" / "all is good"
- s2, p3: "world" / "all is good"

## GT implementation
- "all dependencies"
  - really means one per key that has been locally modified
  - "nearest" is still a big win over this, because there might be *many* keys
- individual writes not coordinated
- works by keeping **all** dependencies, not just nearest
- clearly needs GC

## Notes
- Kaitlyn: just measured against themselves
- Elliot: ALPS
- Huayang: CAP, what does Paxos give up?
- Nikhil: when is it not convergent?

# Bolt-On Consistency

### Fun scenario
- Sally posts "I think Billy is missing!"
- Seconds later, Billy calls mom, says "Safe!", and Sally posts "False alarm!"
- James posts "What a relief!"
- Henry sees JUST the original post and James's reply, and is confused.

**Shim layer to support causal consistency over weaker (eventually-consistent) systems.**

### Design and Implementation
Write path:
- Client write performed in the **local store**, then sent to the ECDS with metadata.

Read path (causal):
- Client read satisfied by the local store.

Read path (pessimistic):
- Synchronous checking of data, and dependencies, in the ECDS, rather than relying on the async *resolve* process to do this for us.

### Motivation
- Several NoSQL stores have added stronger consistency, implying a need: [Vogels: Choosing consistency.](http://www.allthingsdistributed.com/2010/02/)
- HAT work shows causal consistency is achievable even in the face of partitions
  - allows purely local reads (so does eventual!)
- no production-ready store provides it

### Challenges
- handling overwritten histories:
  - eventually-consistent stores often choose a single winning write; existing causal-consistency algorithms fail because the causal layer might not see all writes
- maintaining availability

### Innovations
- an algorithm that buffers writes and ensures the locally visible writes always satisfy a *causal cut* (a small sketch follows).
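A minimal sketch of that buffering idea, assuming each write carries its dependency set as metadata (as the shim's put path would attach). The names here (`ShimReplica`, `Write`, `receive`) are mine for illustration, not the paper's API, and the safety check is simplified to "every dependency is already visible locally."

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Write:
    key: str
    version: int                       # e.g., a Lamport-style timestamp
    value: str
    deps: frozenset = frozenset()      # set of (key, version) pairs this write depends on

class ShimReplica:
    """Toy bolt-on shim: the local store only ever exposes causally-safe writes."""
    def __init__(self):
        self.local = {}        # key -> most recent locally visible Write
        self.buffer = []       # writes received from the ECDS but not yet safe to show

    def _covered(self, dep):
        key, version = dep
        return key in self.local and self.local[key].version >= version

    def receive(self, w: Write):
        """Called as writes arrive (in any order) from the eventually-consistent store."""
        self.buffer.append(w)
        self._drain()

    def _drain(self):
        # Repeatedly apply any buffered write whose dependencies are all visible.
        progress = True
        while progress:
            progress = False
            for w in list(self.buffer):
                if all(self._covered(d) for d in w.deps):
                    self.local[w.key] = w
                    self.buffer.remove(w)
                    progress = True

    def get(self, key):
        """Causal read path: served purely from the local store."""
        w = self.local.get(key)
        return w.value if w else None
```

For example, if y1 (which depends on x1) arrives before x1, `get("y")` keeps returning None until x1 shows up; this matches case 3 of the consistency cases below.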
### Causality
Differentiate between:
- **potential causality**: what we usually think of, and *which they do not consider further*, and
- **explicit causality**, i.e. tags provided by the app

### Bolt-on
*separation of concerns*:
- safety properties (the causal *shim*) from
- liveness (the eventually consistent data store, ECDS)

### Bolt-on Operations
- **get(key)**
- **put(key, value, after)**, where *after* is any set of previously returned writes (identifiers)

### Consistency cases
Assume x1 -> y1 (x1 existed in the store when/where y1 was created), so any replica reading y1 should have (or have had) x1 as well. Assume x1 and y1 were created by different processes, and discuss what happens at P when it wants to read *y*:

1. P has not received any writes
2. P has received x1, but not y1
3. P has received y1, but not x1
4. P has received both

Answers:

1. P has not received any writes; returns y = null.
2. P has received x1, but not y1. x1 is put in the local store; returns y = null.
3. P has received y1, but not x1. P buffers y1; returns y = null.
4. P has received both; both are put in the local store; returns y = y1.

The fun case is 3. We could just go ahead and let P read *y1*, but we'd have to remember that, and stall the process if it subsequently reads *x*.
- this means we give up availability (consider a network partition)

### Overwritten Histories
The following motivates NOT just using the underlying store to hold incoming writes directly.

Example (concurrent):
- P1 writes x1 -> y1 -> z1
- P2 writes y2

Assume the shim stores immediate dependencies, and that *P* has received `z1` and then `y2`. *P* did know that it needed `y1`, but doesn't know about `x1`. The overwrite by `y2` loses the dependency on `x1`, and the transitive dependency `x1 -> z1` is also lost.

### Causal Cuts
Causal cut (paraphrase): a set of key-versions, at most one per key, such that all dependencies of any key are either part of the cut or concurrent with the cut's version of that key.

Example. Some causal cuts (a membership-check sketch appears after the notes below):
- w1, x1, y1: yes
- x1, w2, y1: yes
- w1, x1, y1, z1: yes
- x1, z1: NO, need y1 and w1 (or w2)
- w2, x1, y1, z1: yes
- w1: yes
- w1, z1: NO, need x1, y1
- w2: yes
- w1, x1: yes
- y1, z1: NO, need x1 and w1 (or w2)

### Metadata
Expected size of the dependency set is the number of unique keys in the causal history.

### Errata
- *Callback on convergence* allows key-versions to be dropped from histories once seen everywhere (GC).
- Store pruning: drop keys that are no longer needed from the local store; this might require dropping other writes that depend on them.

## Notes/comments
- Nikhil: first read is null; lots of action, no reads.
- Anubhav: assumes stickiness, which can slow things down under load
- Anubhav: long-range dependencies can add up
- does the shim give eventual consistency?
  - Is it possible to build a system that will never converge, but still lazily claim eventual consistency, since it's impossible to prove otherwise?
- has it caught on?
- surprised that it's faster than eventual
- fate-sharing: how does a rebooting client know what happened?
- scale-out: assume clients scale with replicas (especially since they're co-located)
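To make the cut check concrete, here is a minimal sketch (mine, not the paper's algorithm). The dependency graph is reconstructed from the yes/no answers above: the chain w1 -> x1 -> y1 -> z1, with w2 a concurrent version of key w. `DEPS` holds each version's full causal history, which is the kind of metadata the shim carries.

```python
# Example versions from the notes. The chain w1 -> x1 -> y1 -> z1 is assumed,
# with w2 a concurrent write to key "w"; DEPS maps each version to its full
# causal history.
DEPS = {
    "w1": set(),
    "w2": set(),                               # concurrent with w1
    "x1": {"w1"},
    "y1": {"w1", "x1"},
    "z1": {"w1", "x1", "y1"},
}
CONCURRENT = {("w1", "w2"), ("w2", "w1")}      # the only concurrent pair here

def key_of(version: str) -> str:
    return version[0]                          # "x1" -> key "x"

def is_causal_cut(cut: set) -> bool:
    """At most one version per key, and every dependency of every member is
    either in the cut or concurrent with the cut's version of that key.
    (A real shim would also accept a causally later version of the key;
    that case doesn't arise in this toy graph.)"""
    by_key = {}
    for v in cut:
        if key_of(v) in by_key:
            return False                       # two versions of the same key
        by_key[key_of(v)] = v
    for v in cut:
        for dep in DEPS[v]:
            held = by_key.get(key_of(dep))
            if held is None:
                return False                   # the dependency's key is missing entirely
            if held != dep and (held, dep) not in CONCURRENT:
                return False                   # the cut holds an unrelated version
    return True

# Matches the yes/no answers above:
assert is_causal_cut({"w1", "x1", "y1"})
assert is_causal_cut({"x1", "w2", "y1"})       # w2 is concurrent with the needed w1
assert not is_causal_cut({"x1", "z1"})         # missing y and w
assert not is_causal_cut({"w1", "z1"})         # missing x1, y1
assert not is_causal_cut({"y1", "z1"})         # missing x and w
assert is_causal_cut({"w2", "x1", "y1", "z1"})
```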
"Coordination can only be avoided if all local commit decisions are globally valid" Factoids: - invariants can be expressed as part of schema DDL - replicas eventually consistent Issues: - sometimes programmers are bad - black-and-white model: some communication might be cheap, some expensive ### Invariant confluence Replica is *i-valid* iff invariant I is true. *Globally I-valid system* can execute a set of transactions T w/ coord freedom, trans avail, convergence iff T is I-confluent wrt I. A system provides coordination-free execution for a set of transactions T iff the progress of executing each *t in T* is only dependent on the versions of the items t reads (i.e., t's replica state). ## Example **PRIMARY KEY/UNIQUE constraints**. Assume userids should be unique. {Stat:5} and {Mary:5} are both I-T-reachable, but {Stan:5, Mary:5} is not I-valid. Therefore, uniqueness not I-confluent for inserts of unique values. **HOWEVER:** *reads* and *deletes* are both I-confluent under uniqueness. **AND FURTHER:** could add a function to generate unique IDs s.t. IDs from distinct replicas doesn't overlap, and then uniqueness would be I-confluent.