diff --git a/notes/byz.md b/notes/byz.md new file mode 100644 index 0000000000000000000000000000000000000000..a8b00dbffbb60f4957f60d3622ad2267e464aebc --- /dev/null +++ b/notes/byz.md @@ -0,0 +1,274 @@ +# Byzantine Consensus, and PBFT + +## Simple example: Two Generals + + + +- one is a decider: +- both need to attack same time +- need to agree on: + - time: (easy: msg and an ack) + - agreement to attack (hard) + + <p> + <p> + <p> + <p> + +**Example:** +A sends to B "attack at 10". +But did B get it? Can't go unless sure. +B sends an ack, +but did A get the ack? + + <p> + <p> + <p> + <p> + +### Impossibility + +Look at sequence of msg-ack-ack...... + +Assume there is some subset of i msgs that constitutes a proof, and +that both would attack. + +However, what if the last msg not delivered? + - receiver presumably would not attack + - sender, though, sees same msgs as the i-sequence, and so attacks.... + + <p> + <p> + <p> + <p> + +### Fix? +- A sends a whole bunch, assume one gets through +- A and B send and ack a while + +However, *provable agreement between even two parties in asynchronous +environment not possible*. + + <p> + <p> + <p> + <p> + +## Two Lieutenants Problem + +*Safety*: +- all loyal lieutenants make same decision +- all loyal lieutenants follow loyal general + + <p> + <p> + <p> + <p> + + + <p> +Clearly impossible for both lieutenants to always make same decision, as neither knows if the fault + lies with the general or the other lieutenant. **Therefore, no solution when *n = 3f*.** + + <p> + <p> + <p> + <p> + + + <p> +Each lieutenant decides based on majority of input, done! **So *3f+1* works, at least in the case of *f=1*.** + + <p> + <p> + <p> + <p> + + + + <p> + + + +Byzantine faults means byzantine protocols. *3f+1* nodes needed to tolerate *f* +faults. +- (Above) showed impossibility of *3* nodes to tolerate a single fault +- Proof by contradiction: + - Assume we have a protocol, "BYZ3", that works even if one third of its members are faulty. + - Have *3* Byzantine generals each simulate *m* Albanian generals via the above protocol. + - Since only one Byzantine general is faulty, the other *2m* + generals are simulated correctly, so the situation is identical to + having *2f* non-faulty generals and *f* faulty generals. + - If BYZ3 is correct, this simulation will reach the correct conclusion. + - However, there are **really** only *3* generals, meaning we have + solved the problem for *3* generals with a single faulty node, which we + know to be impossible. + - Contradiction! + + +# Practical Byzantine Fault Tolerence (Castro and Liskov) + +## Assumptions +- operations are deterministic +- replicas start in the same state +Means that correct replicas will return identical results. + +## Differences from fail-stop consensus +- 3*f*+1 replicas instead of 2*f*+1 +- 3 phases instead of two +- cryptographic signatures + +## Clients +- send requests to primary replica +- wait for f+1 identical replies + +## Lying Primary +- might start two distinct ops +- if 3*f*+1 replicas total + - overlap of two quorums is 2*f*+1 + + + +## Phases +- *pre-prepare* and *prepare* phases order request even if primary lying +- a replica is *prepared* when: + - has received a proper pre-prepare from primary, sig, current view + - 2*f* "prepares" from other replicas that match it + - Such a replica then: + - multicasts "commit" to other replicas + - a replica is committed-local iff: + - *committed*, and + - has accepted 2*f*+1 commit msgs (possibly including it's own) + - system is *committed* iff *prepared* true for *f*+1 replicas + - *committed* true iff *committed-local* true for some non-faulty replica + +## Garbage Collection +Messages must be retained until they have been executed by at least *f*+1 +non-faulty replicas, and that this can be proved on *view changes*. +- also some replicas may have missed msgs, must be brought up to date + +All this requires crypto (expensive), and therefore: +- proofs generated only occasionally (a "checkpoint") +- proofs prior to a valid checkpoint can be tossed + +## Optimizations +1. Most replies to clients just hashes of authenticated proof. + - a single replica sends entire proof +1. Replicas execute once *prepared* and reply to clients before commit + - client waits for 2*f*+1 matching replies + - message delays down from 5 to 4 +1. Read-only ops send directly to all replicas + - rep waits until prior ops committed + - client waits for 2*f*+1 replies +1. **authenticators**. Rather than public-key sig among replicas: + - assume symmetric key between each replica pair + - rep *r* signs by encrypting msg digest w/ each other rep + - vector of such sigs is the **authenticator** + +## Performance + +- BFS close to NFS + - but this is NFS3, w/ 4k block sizes + +## Other systems +- speculative execution (replying to client before confirming ordering) + reduces to 3 1-way latencies (Zyzzyva) + + +# Transactional Storage for Geo-Replicated Systems ("Walter") + +## Why? + +- snapshot isolation imposes a total ordering of the commit time of all transactions, even those that do not conflict +- writes of a committed transaction must be immediately visible to later transactions + - means commit happens only after writes propagated everywhere + +## Big things + +**parallel snapshot isolation** +- enables commit and timeline per site +- causal ordering among transactions across sites +- no write-write conflicts + + + +**sets** (counting sets) + - commutative, like multisets but allow negative numbers + +## Features +- per-transaction *site* + - isolation within a site +- per-object *preferred sites* + - vs *primary site* (have to be modified at primary. eh) + - no conflict-resolution logic (**but:** they serialize at preferred sites) +- two-phase commit across multiple preferred sites +- asynchronous propogation across sites + - efficient update-anywhere (kinda-sorta) + + + +Assumes *each user communicates with one site at a time*. +- user can modify *any* object, not just those w/ preferred at that site + - this is the difference with primary, which requires all writes to be done at that + site. Not all that different, though, as preferred must still be consulted. + +## PSI +- snapshot isolation locally +- different commit orderings across sites +- but xactions **with overlapping read sets ordered same everywhere** +- causal propagation across sites after-the-fact + + + + + + +### Transaction startup +- transaction has a *site* where it will commit +- transaction is assigned a vector timestamp *startVTS* when it starts. + - For example, if startVTS = `⟨2, 4, 5⟩` then the transaction reads from the snapshot + containing 2 transactions from site 1, 4 from site 2, and 5 from site 3. + - startVTS contains the sequence number of the latest transactions from *each site that were committed at the local site* + - A version number (or simply version) is a pair ⟨site, seqno⟩ assigned to a + transaction when it commits; it has the site where the transaction executed, and a + sequence number local to that site. +- Paxos configuration server maintains preferred sites, system info + +### Committing +- `startVTS` is vv at local site +- `fast commit` (all objects in write preferred set local) + - check that written objects are *unmodified* since start + - check *none locked* (by slow commit protocol) + - **only abort point** + - transaction then given *per-site sequence number* +- `slow commit` (at least one non-local preferred site) + - local site acts as coordinator in two-phase commit + - remote site says *yes* if object unmodified, unlocked + - and locks object + - second phase says to commit and unlock objects. + - Note: + - "unmodified" means since version used by the trans, which might be old + - entire transaction executed locally, lists of updates pushed at end + +## Performance + +- BerkeleyDB is a straw man, widely known to be very slow, especially across wide area +- *one site per data center!* + + + + + +# Comments +"Across sites, weaker consistency is acceptable, because users can tolerate a small delay for their actions to be seen by other users" +- They seem to be confusing consistency with *latency*. + +- "Sets are interesting and the authors discuss them a lot, but I don't really see their + point...are they just a data structure optimized and intended to be used to implement the applications in the paper?" + + + + + +>>>>>>> 8a47827a91f9801f7857a6ac84285f75fac33021 + diff --git a/notes/byzAlb.png b/notes/byzAlb.png new file mode 100644 index 0000000000000000000000000000000000000000..0088362282fe18438d82d8e096e0525e12987d36 Binary files /dev/null and b/notes/byzAlb.png differ diff --git a/notes/byzL.png b/notes/byzL.png new file mode 100644 index 0000000000000000000000000000000000000000..d0ec5da2eee4c14c263d21591e8268be6ed01ad5 Binary files /dev/null and b/notes/byzL.png differ diff --git a/notes/byzL3.png b/notes/byzL3.png new file mode 100644 index 0000000000000000000000000000000000000000..66a10a7d3f08e0f76fe7a025b41d484f777773dd Binary files /dev/null and b/notes/byzL3.png differ diff --git a/notes/byzMsgs.png b/notes/byzMsgs.png new file mode 100644 index 0000000000000000000000000000000000000000..fa6e30a983ea83942c0d7c300393fb2e84d42586 Binary files /dev/null and b/notes/byzMsgs.png differ diff --git a/notes/byzOld.md b/notes/byzOld.md new file mode 100644 index 0000000000000000000000000000000000000000..ffd11c830f8e9dc8b82b2fabbf2802ff78bb32b3 --- /dev/null +++ b/notes/byzOld.md @@ -0,0 +1,269 @@ +# Byzantine Consensus, and PBFT + +## Simple example: Two Generals + + + +- one is a decider: +- both need to attack same time +- need to agree on: + - time: (easy: msg and an ack) + - agreement to attack (hard) + + <p> + <p> + <p> + <p> + +**Example:** +A sends to B "attack at 10". +But did B get it? Can't go unless sure. +B sends an ack, +but did A get the ack? + + <p> + <p> + <p> + <p> + +### Impossibility + +Look at sequence of msg-ack-ack...... + +Assume there is some subset of i msgs that constitutes a proof, and +that both would attack. + +However, what if the last msg not delivered? + - receiver presumably would not attack + - sender, though, sees same msgs as the i-sequence, and so attacks.... + + <p> + <p> + <p> + <p> + +### Fix? +- A sends a whole bunch, assume one gets through +- A and B send and ack a while + +However, *provable agreement between even two parties in asynchronous +environment not possible*. + + <p> + <p> + <p> + <p> + +## Two Lieutenants Problem + +*Safety*: +- all loyal lieutenants make same decision +- all loyal lieutenants follow loyal general + + <p> + <p> + <p> + <p> + + + + + <p> + <p> + <p> + <p> + + + + + <p> + <p> + <p> + <p> + + + + + + + +Byzantine faults means byzantine protocols. *3f+1* nodes needed to tolerate *f* +faults. +- proved impossibility of *3* nodes to tolerate a single fault +- showed by contradiction that no protocol of *3f* nodes can be correct. + - Assume a solution with *3f* nodes + - Use *3* Byzantine generals to simulate, each Byzantine simulates *f* Albanian generals + - if a solution for *3f* exists, this simulation will be correct + - however, there are **really** only *3* generals, meaning we have solved for *3* + nodes w/ one faulty + - contradiction! + + +# Consensus +Problem is for processes to agree on a value that one or more has proposed. + +## Asynchronous vs synchronous systems: + +In synchronous settings, msgs arrive within fixed amount of time: + +- any longer delay means failure + +Asynchronous: + +- msgs arbitrarily slow (*communication asynchrony*) +- machines arbitrarily slow (*process asynchrony*) +- re-ordered messages (*message order asynchrony*) + +Distributed systems must be assumed `concurrent`, `asynchronous`, and +`subject to failure`. + +> Fischer et al. ('85) showed that no guarantee possible for dist agreement, but: + +Despite real systems being async: + +- they work anyway..... +- can reach agreement with "high probability" + - faults can be masked (flash storage) + - "perfect" failure detectors + - not really, but all agree to abide + - use randomization + - mal manipulates procs, comm so msgs arrive at just the wrong time + - random delays make this tougher + +# Fault Tolerance + +Faults may be: + +- transient +- intermittent +- permanent + +Failures: + +- **fail-stop failures** (also "fail-silent") processors fail by ceasing to +communicate. No incorrect communication is sent. +- **byzantine failures** allows faulty process to continue + communicating, but may send arbitrary messages, maybe collaborate + with other "faulty" messages, etc. + +Approaches to faults: + +- *avoidance* - Formal validations, code inspection, etc. +- *fault removal* - Encounter bugs fix, re-run. +- *fault tolerance* - correct operation in the presence of faults. + - information redundancy : replication, coding + - time redundancy : retries (no good for permanent failures) + - physical redundancy : backups, master-slaves, RAID disks + +Primary backup approach: + +- primary does all the work +- backups detect primary failure w/ heartbeat messages + - cold failover - need to restart everything, loses work + - warm failover - primary informs backups of changes (checkpoints, + or msgs), loses little or nothing + + + + +# Consensus System Model (sync) +### Have: + +- a collection of procs p_i (1 .. N) w/ msg-passing +- assume reliable communication, but processes might "fail" (fail-stop) +- maybe *digital signatures*, otherwise process can *lie* + +### Problem: + +- each proc begins "undecided", and proses a val v_i +- procs communicate w/ each other +- each "decides" a value d_i + +### Solution requirements : + +- *termination*: each correct proc sets a d_i +- *agreement*: decision values of all correct procs is same +- *integrity*: if correct processes all proposed same value, then any correct process has to decided this value + + +### How solve consensus w/ no failures? + +- each proc chooses +- reliably multicasts to everyone else +- decisions chosen through majority (or min, or max.....) + + +# Generalization is the byzantine generals +### (or General and the lieutenants) + +- a general decides, tells everyone + - (this different from consensus) +- lieutenants talk amongst themselves about what they have been told by general, other lieutenants + +### Solution requires: + +- (*termination*): finishes +- (*agreement*): all correct lieutenants decide the same +- (*integrity*): if general correct, all correct lieuts decide correctly + +Note +: easy to tell that lying is happening; *hard to tell who* + + +# First, let's show impossible with three: + +- A tells B go, C nogo. +- B and C tell each other what A said. +- Both B and C know someone is lying, but can't tell who. + + + +So no "agreement" or "integrity" + +# If 3 doesn't work, neither does 3f! +Sketch out extension to N = 3f+1; + +Relies on the impossibility of with one faulty, 3 total: + +Assume have an algorithm that works for N = 3f, where f > 1. + +Then: + +- let's have three procs, two correct, one not +- each internally simulates f generals + - the two correct procs each simulate f correct generals, the bad simulates f bad ones + +Assumption means: + +- consensus must be reached +- and all correct generals (i.e. those in the correct procs) reach the right solution + +But: + +- we really only have three procs +- could construct from this a way to solve for three, by: + - each proc "decides" by majority of its simulated generals + - whole system "decides" by majority of procs. + +But now we have solved the problem for three procs! Contradiction.... + +# But 4 works (oral message solution) + +- Oral implies that messages can not be faked, msg origin always known. +- Multiple rounds, recursively telling each other what they've heard +from other nodes. +- works if less than 1/3 faulty (system size must be 3f+1) + + + +Note that this is a rough approximation of the real algorithm, which +requires rounds proportional to the number of participants. + +# What if signatures? + +Solution for 3 exists if we are just worrying about equivocation: + +- each distributes received msgs to others. +- compare three msgs from command to see if faulty. +- relies on a default choice ("no", or "retreat") + diff --git a/notes/byzTwoG.png b/notes/byzTwoG.png new file mode 100644 index 0000000000000000000000000000000000000000..f1c58ec7b4671a82c23389eed73e32edb3ff881d Binary files /dev/null and b/notes/byzTwoG.png differ diff --git a/notes/pbft.md b/notes/pbft.md new file mode 100644 index 0000000000000000000000000000000000000000..7b3e17b3d9e7d96be3eb4264e73a50d120616728 --- /dev/null +++ b/notes/pbft.md @@ -0,0 +1,197 @@ +# Byzantine Consensus, and PBFT + +## Simple example: Two Generals + + + +- one is a decider: +- both need to attack same time +- need to agree on: + - time: (easy: msg and an ack) + - agreement to attack (hard) + + <p> + <p> + <p> + <p> + +**Example:** +A sends to B "attack at 10". +But did B get it? Can't go unless sure. +B sends an ack, +but did A get the ack? + + <p> + <p> + <p> + <p> + +### Impossibility + +Look at sequence of msg-ack-ack...... + +Assume there is some subset of i msgs that constitutes a proof, and +that both would attack. + +However, what if the last msg not delivered? + - receiver presumably would not attack + - sender, though, sees same msgs as the i-sequence, and so attacks.... + + <p> + <p> + <p> + <p> + +### Fix? +- A sends a whole bunch, assume one gets through +- A and B send and ack a while + +However, *provable agreement between even two parties in asynchronous +environment not possible*. + + <p> + <p> + <p> + <p> + +## Two Lieutenants Problem + +*Safety*: +- all loyal lieutenants make same decision +- all loyal lieutenants follow loyal general + + <p> + <p> + <p> + <p> + + + <p> +Clearly impossible for both lieutenants to always make same decision, as neither knows if the fault + lies with the general or the other lieutenant. **Therefore, no solution when *n = 3f*.** + + <p> + <p> + <p> + <p> + + + <p> +Each lieutenant decides based on majority of input, done! **So *3f+1* works, at least in the case of *f=1*.** + + <p> + <p> + <p> + <p> + + + + <p> + + + +Byzantine faults means byzantine protocols. *3f+1* nodes needed to tolerate *f* +faults. +- (Above) showed impossibility of *3* nodes to tolerate a single fault +- Proof by contradiction: + - Assume we have a protocol, "BYZ3", that works even if one third of its members are faulty. + - Have *3* Byzantine generals each simulate *m* Albanian generals via the above protocol. + - Since only one Byzantine general is faulty, the other *2m* + generals are simulated correctly, so the situation is identical to + having *2f* non-faulty generals and *f* faulty generals. + - If BYZ3 is correct, this simulation will reach the correct conclusion. + - However, there are **really** only *3* generals, meaning we have + solved the problem for *3* generals with a single faulty node, which we + know to be impossible. + - Contradiction! + + +## Different Problems +- Byz generals also called *byzantine broadcast*. + - sync: *f < n-1* + - async *f < n/3* +- Byz agreement: trying to decide the truth by quorum consensus ("sky is blue" vs "sky is red") + - sync: *f < n/2* + - async *f < n/3* + +## Castro's proof of *3n + 1* for Byz Agreement +- must be possible to make progress w/ only *n - f* replicas (because *f* might be faulty and choose not to respond) +- however, the *f* not responding might be correct but slow (asynchrony), so there might be *f* bad responses out of the *n - f* + - implies *(n - f - f) > f*, or *n >= 3f + 1* + + + +# Practical Byzantine Fault Tolerence (Castro and Liskov) + +## Assumptions +- operations are deterministic +- replicas start in the same state +Means that correct replicas will return identical results. + +## Differences from fail-stop consensus +- 3*f*+1 replicas instead of 2*f*+1 +- 3 phases instead of two +- cryptographic signatures + +**Properties** +- safety, liveness (*n >= 3f+1*) + - can't prove both in async environment + - safety guaranteed through protocol + - liveness "guaranteed" with reasonable assumption on msg delays + - performance + - one round trip for reads + - 2 round trips for read/write + +**Independent failure modes** +- different security provisions (root passwd etc) +- different implementations !! + + +**byzantine faults** +- adversary(s) can + - coordinate faulty nodes + - delay communication of non-faulty nodes (but not indefinitely) +- equivocation + - saying "yes" to Larry, "no" to "Moe" +- lying about received messages +- guarantees: + - safety and liveness (assuming only *floor(n-1/3)* faulty, *delay(t)* does not grow faster than *t* indefinitely + +**cool things** +- much more robust/secure than paxos +- one extra round, at most + - little other cost +- authenticators + + + + + +## Phases +- *pre-prepare* and *prepare* phases order request even if primary lying +- a replica is *prepared* when: + - has received a proper pre-prepare from primary, sig, current view + - 2*f* "prepares" from other replicas that match it + - Such a replica then: + - multicasts "commit" to other replicas + - a replica is committed-local iff: + - *committed*, and + - has accepted 2*f*+1 commit msgs (possibly including it's own) + - system is *committed* iff *prepared* true for *f*+1 replicas + - *committed* true iff *committed-local* true for some non-faulty replica + + +## Optimizations +- Most replies to clients just hashes of authenticated proof. + - a single replica sends entire proof +- Replicas execute once *prepared* and reply to clients before commit + - client waits for 2*f*+1 matching replies + - message delays down from 5 to 4 +- Read-only ops send directly to all replicas + - rep waits until prior ops committed + - client waits for 2*f*+1 replies +- **authenticators**. Rather than public-key sig among replicas: + - assume symmetric key between each replica pair + - rep *r* signs by encrypting msg digest w/ each other rep + - vector of such sigs is the **authenticator** + diff --git a/notes/pbftNormal.png b/notes/pbftNormal.png new file mode 100644 index 0000000000000000000000000000000000000000..5bf8613d05ca9224fe3654d78e7217c2c9f8b2c6 Binary files /dev/null and b/notes/pbftNormal.png differ