# Byzantine Consensus, and PBFT
## Simple example: Two Generals
![](byzTwoG.png)
- one is a decider:
- both need to attack same time
- need to agree on:
- time: (easy: msg and an ack)
- agreement to attack (hard)
&nbsp;<p>
&nbsp;<p>
&nbsp;<p>
&nbsp;<p>
**Example:**
A sends to B "attack at 10".
But did B get it? Can't go unless sure.
B sends an ack,
but did A get the ack?
&nbsp;<p>
&nbsp;<p>
&nbsp;<p>
&nbsp;<p>
### Impossibility
Look at the sequence of msg-ack-ack......
Assume there is some finite sequence of *i* delivered msgs that constitutes a proof, and
that both would then attack.
However, what if the last msg of that sequence is not delivered?
- the receiver presumably would not attack
- the sender, though, has seen exactly the same msgs as in the *i*-msg sequence, and so attacks....
&nbsp;<p>
&nbsp;<p>
&nbsp;<p>
&nbsp;<p>
### Fix?
- A sends a whole bunch, assume one gets through
- A and B send and ack a while
However, *provable agreement between even two parties is not possible in an asynchronous
environment*.
&nbsp;<p>
&nbsp;<p>
&nbsp;<p>
&nbsp;<p>
## Two Lieutenants Problem
*Safety*:
- all loyal lieutenants make same decision
- all loyal lieutenants follow loyal general
&nbsp;<p>
&nbsp;<p>
&nbsp;<p>
&nbsp;<p>
![](byzL.png)
&nbsp;<p>
Clearly impossible for both lieutenants to always make same decision, as neither knows if the fault
lies with the general or the other lieutenant. **Therefore, no solution when *n = 3f*.**
&nbsp;<p>
&nbsp;<p>
&nbsp;<p>
&nbsp;<p>
![](byzL3.png)
&nbsp;<p>
Each lieutenant decides based on majority of input, done! **So *3f+1* works, at least in the case of *f=1*.**
&nbsp;<p>
&nbsp;<p>
&nbsp;<p>
&nbsp;<p>
![](byzAlb.png)
&nbsp;<p>
Byzantine faults demand Byzantine fault-tolerant protocols. *3f+1* nodes needed to tolerate *f*
faults.
- (Above) showed the impossibility of *3* nodes tolerating a single fault
- Proof by contradiction for the general case:
- Assume we have a protocol, "BYZ3", that works for *3m* generals even if one third of them (*m*) are faulty.
- Have *3* Byzantine generals each simulate *m* Albanian generals running BYZ3.
- Since only one Byzantine general is faulty, the other *2m* generals are simulated correctly, so the situation is identical to having *2m* non-faulty generals and *m* faulty generals.
- If BYZ3 is correct, this simulation will reach the correct conclusion, and each correct general can decide by the majority of the generals it simulates.
- However, there are **really** only *3* generals, meaning we have solved the problem for *3* generals with a single faulty node, which we know to be impossible.
- Contradiction!
# Practical Byzantine Fault Tolerance (Castro and Liskov)
## Assumptions
- operations are deterministic
- replicas start in the same state
Means that correct replicas will return identical results.
## Differences from fail-stop consensus
- 3*f*+1 replicas instead of 2*f*+1
- 3 phases instead of two
- cryptographic signatures
## Clients
- send requests to primary replica
- wait for f+1 identical replies
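A minimal sketch (not the paper's code) of the client-side rule: since at most *f* replicas can be faulty, *f*+1 identical replies guarantee at least one came from a correct replica.

```python
from collections import Counter

def accept_reply(replies, f):
    """replies: result values received so far, one per replica.
    Returns the accepted result once f+1 of them match, else None (keep waiting)."""
    if not replies:
        return None
    result, count = Counter(replies).most_common(1)[0]
    return result if count >= f + 1 else None
```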
## Lying Primary
- might try to start two distinct orderings for the same op
- with 3*f*+1 replicas total, quorums are of size 2*f*+1
- any two quorums overlap in at least *f*+1 replicas, so at least one correct replica sees both orderings and will not support both
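A quick arithmetic check of the overlap claim, assuming quorums of size 2*f*+1 out of *n* = 3*f*+1 replicas:

```python
# Any two quorums of size 2f+1 out of n = 3f+1 replicas intersect in at least
# f+1 replicas, so at least one *correct* replica sees both of a lying
# primary's conflicting orders.
for f in range(1, 10):
    n = 3 * f + 1
    quorum = 2 * f + 1
    min_overlap = 2 * quorum - n   # worst case: the quorums share as little as possible
    assert min_overlap == f + 1 > f
```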
![](byzMsgs.png)
## Phases
- *pre-prepare* and *prepare* phases order request even if primary lying
- a replica is *prepared* when:
- it has received a properly signed pre-prepare from the primary in the current view
- and 2*f* "prepares" from other replicas that match it
- Such a replica then:
- multicasts "commit" to the other replicas
- a replica is *committed-local* iff:
- it is *prepared*, and
- has accepted 2*f*+1 commit msgs (possibly including its own)
- the system is *committed* iff *prepared* is true for *f*+1 non-faulty replicas
- if *committed-local* is true for some non-faulty replica, then *committed* is true
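A hedged sketch of the *prepared* and *committed-local* predicates above, for a single (view, sequence number, digest) slot; the counts are assumed to come from correctly signed, matching messages.

```python
def prepared(has_valid_pre_prepare, matching_prepares, f):
    # pre-prepare from the primary for this view/seqno/digest, plus 2f matching prepares
    return has_valid_pre_prepare and matching_prepares >= 2 * f

def committed_local(has_valid_pre_prepare, matching_prepares, matching_commits, f):
    # prepared, plus 2f+1 matching commits (possibly including this replica's own)
    return (prepared(has_valid_pre_prepare, matching_prepares, f)
            and matching_commits >= 2 * f + 1)
```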
## Garbage Collection
Messages must be retained until they have been executed by at least *f*+1
non-faulty replicas, and this fact must be provable during *view changes*.
- also some replicas may have missed msgs, must be brought up to date
All this requires crypto (expensive), and therefore:
- proofs generated only occasionally (a "checkpoint")
- proofs prior to a valid checkpoint can be tossed
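A rough sketch of the checkpoint rule: a checkpoint at sequence number *s* becomes *stable* once 2*f*+1 replicas have produced matching checkpoint messages, after which everything at or below *s* can be discarded. (The message and log layout here are illustrative.)

```python
from collections import Counter

def stable_checkpoint(checkpoint_msgs, f):
    """checkpoint_msgs: list of (seqno, state_digest) pairs, one per replica.
    Returns the highest seqno backed by 2f+1 matching messages, or None."""
    counts = Counter(checkpoint_msgs)
    stable = [seqno for (seqno, digest), c in counts.items() if c >= 2 * f + 1]
    return max(stable) if stable else None

def garbage_collect(log, stable_seqno):
    # drop pre-prepare/prepare/commit messages covered by the stable checkpoint
    return {seqno: msgs for seqno, msgs in log.items() if seqno > stable_seqno}
```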
## Optimizations
1. Most replicas send the client just a digest of the reply
- a single designated replica sends the entire reply
1. Replicas execute once *prepared* and reply to clients before commit
- client waits for 2*f*+1 matching replies
- message delays down from 5 to 4
1. Read-only ops send directly to all replicas
- rep waits until prior ops committed
- client waits for 2*f*+1 replies
1. **authenticators**. Rather than public-key sig among replicas:
- assume symmetric key between each replica pair
- rep *r* "signs" by computing a MAC of the msg digest with the key it shares with each other rep
- the vector of such MACs is the **authenticator**
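A sketch of an authenticator using HMACs; the key setup and message layout are assumptions, not the paper's exact wire format.

```python
import hashlib
import hmac

def make_authenticator(digest, shared_keys):
    """digest: bytes to authenticate; shared_keys: {replica_id: symmetric key shared
    with that replica}. The authenticator is the vector of per-replica MACs."""
    return {peer: hmac.new(key, digest, hashlib.sha256).digest()
            for peer, key in shared_keys.items()}

def check_my_entry(digest, authenticator, my_id, my_key):
    # each receiver verifies only its own entry in the vector
    expected = hmac.new(my_key, digest, hashlib.sha256).digest()
    return hmac.compare_digest(authenticator.get(my_id, b""), expected)
```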
## Performance
- BFS close to NFS
- but this is NFS3, w/ 4k block sizes
## Other systems
- speculative execution (replying to client before confirming ordering)
reduces to 3 1-way latencies (Zyzzyva)
# Transactional Storage for Geo-Replicated Systems ("Walter")
## Why?
- snapshot isolation imposes a total ordering of the commit time of all transactions, even those that do not conflict
- writes of a committed transaction must be immediately visible to later transactions
- means commit happens only after writes propagated everywhere
## Big things
**parallel snapshot isolation**
- enables commit and timeline per site
- causal ordering among transactions across sites
- no write-write conflicts
**csets** (counting sets)
- commutative, like multisets but counts may go negative (see the sketch below)
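A minimal sketch of a counting set; the class and method names are illustrative, not Walter's API. Because adds and removes commute, sites can apply the same updates in different orders and still converge.

```python
from collections import defaultdict

class CSet:
    def __init__(self):
        self.counts = defaultdict(int)   # element -> count (may go negative)

    def add(self, x):
        self.counts[x] += 1

    def remove(self, x):
        self.counts[x] -= 1              # allowed to go below zero; this is what makes it commute

    def apply(self, deltas):
        # merge a batch of remote updates: {element: +/- delta}
        for x, d in deltas.items():
            self.counts[x] += d

    def members(self):
        return {x for x, c in self.counts.items() if c > 0}
```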
## Features
- per-transaction *site*
- isolation within a site
- per-object *preferred sites*
- vs *primary site* (have to be modified at primary. eh)
- no conflict-resolution logic (**but:** they serialize at preferred sites)
- two-phase commit across multiple preferred sites
- asynchronous propagation across sites
- efficient update-anywhere (kinda-sorta)
![snapshot](walterIsolation.png)
Assumes *each user communicates with one site at a time*.
- user can modify *any* object, not just those w/ preferred at that site
- this is the difference with primary, which requires all writes to be done at that
site. Not all that different, though, as preferred must still be consulted.
## PSI
- snapshot isolation locally
- different commit orderings across sites
- but xactions **with overlapping read sets ordered same everywhere**
- causal propagation across sites after-the-fact
![PSI](walterParallel.png)
### Transaction startup
- transaction has a *site* where it will commit
- transaction is assigned a vector timestamp *startVTS* when it starts.
- For example, if startVTS = `⟨2, 4, 5⟩` then the transaction reads from the snapshot
containing 2 transactions from site 1, 4 from site 2, and 5 from site 3.
- for each site, startVTS contains the sequence number of the latest transaction from that site that has been committed at the local site (see the visibility sketch after this list)
- A version number (or simply version) is a pair ⟨site, seqno⟩ assigned to a
transaction when it commits; it has the site where the transaction executed, and a
sequence number local to that site.
- Paxos configuration server maintains preferred sites, system info
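A small sketch of the visibility rule implied by startVTS (field names are illustrative): a committed version ⟨site, seqno⟩ is in the snapshot iff the local site had already received that many of that site's transactions when the transaction started.

```python
def visible(version, start_vts):
    """version: (site, seqno); start_vts: {site: number of that site's transactions
    committed at the local site when this transaction started}."""
    site, seqno = version
    return seqno <= start_vts.get(site, 0)

start_vts = {1: 2, 2: 4, 3: 5}            # the <2, 4, 5> example above
assert visible((2, 4), start_vts)          # 4th transaction from site 2: in the snapshot
assert not visible((3, 6), start_vts)      # 6th from site 3: committed after the snapshot
```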
### Committing
- `startVTS` is the version vector at the local site
- `fast commit` (all written objects have their preferred site at the local site; see the sketch after this list)
- check that written objects are *unmodified* since start
- check *none locked* (by slow commit protocol)
- **only abort point**
- transaction then given *per-site sequence number*
- `slow commit` (at least one non-local preferred site)
- local site acts as coordinator in two-phase commit
- remote site says *yes* if object unmodified, unlocked
- and locks object
- second phase says to commit and unlock objects.
- Note:
- "unmodified" means since version used by the trans, which might be old
- entire transaction executed locally, lists of updates pushed at end
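A toy sketch of the fast-commit check (the data layout is an assumption): abort unless every written object is unlocked and unmodified since the transaction's snapshot.

```python
def fast_commit_ok(write_set, locked, latest_version, start_vts):
    """write_set: objects written; locked: {obj: bool} set by in-progress slow commits;
    latest_version: {obj: (site, seqno)} of the newest committed version of each object."""
    for obj in write_set:
        if locked.get(obj, False):
            return False                     # a slow (2PC) commit holds this object
        site, seqno = latest_version[obj]
        if seqno > start_vts.get(site, 0):
            return False                     # modified since the version the transaction read
    return True                              # assign a per-site sequence number and commit
```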
## Performance
- BerkeleyDB is a straw man, widely known to be very slow, especially across wide area
- *one site per data center!*
![anomolies](walterAnomolies.png)
# Comments
"Across sites, weaker consistency is acceptable, because users can tolerate a small delay for their actions to be seen by other users"
- They seem to be confusing consistency with *latency*.
- "Sets are interesting and the authors discuss them a lot, but I don't really see their
point...are they just a data structure optimized and intended to be used to implement the applications in the paper?"
# Consensus
Problem is for processes to agree on a value that one or more has proposed.
## Asynchronous vs synchronous systems:
In synchronous settings, msgs arrive within fixed amount of time:
- any longer delay means failure
Asynchronous:
- msgs arbitrarily slow (*communication asynchrony*)
- machines arbitrarily slow (*process asynchrony*)
- re-ordered messages (*message order asynchrony*)
Distributed systems must be assumed `concurrent`, `asynchronous`, and
`subject to failure`.
> Fischer et al. ('85) showed that no deterministic protocol can guarantee agreement in an asynchronous system with even one faulty process, but:
Despite real systems being async:
- they work anyway.....
- can reach agreement with "high probability"
- faults can be masked (flash storage)
- "perfect" failure detectors
- not really, but all agree to abide
- use randomization
- an adversary manipulates procs and communication so msgs arrive at just the wrong time
- random delays make this tougher
# Fault Tolerance
Faults may be:
- transient
- intermittent
- permanent
Failures:
- **fail-stop failures** (also "fail-silent") processors fail by ceasing to
communicate. No incorrect communication is sent.
- **byzantine failures** allow a faulty process to continue
communicating, but it may send arbitrary messages, maybe collaborate
with other faulty processes, etc.
Approaches to faults:
- *avoidance* - Formal validations, code inspection, etc.
- *fault removal* - Encounter bugs, fix, re-run.
- *fault tolerance* - correct operation in the presence of faults.
- information redundancy : replication, coding
- time redundancy : retries (no good for permanent failures)
- physical redundancy : backups, master-slaves, RAID disks
Primary backup approach:
- primary does all the work
- backups detect primary failure w/ heartbeat messages
- cold failover - need to restart everything, loses work
- warm failover - primary informs backups of changes (checkpoints,
or msgs), loses little or nothing
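A minimal sketch of heartbeat-based failure detection for warm failover; the timeout and API are assumptions:

```python
import time

class HeartbeatMonitor:
    """Run at a backup: the primary's heartbeat messages call on_heartbeat();
    if heartbeats stop for longer than `timeout`, the backup suspects the primary."""
    def __init__(self, timeout=2.0):
        self.timeout = timeout
        self.last_seen = time.monotonic()

    def on_heartbeat(self):
        self.last_seen = time.monotonic()

    def primary_suspected(self):
        return time.monotonic() - self.last_seen > self.timeout
```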
# Consensus System Model (sync)
### Have:
- a collection of procs p_i (1 .. N) w/ msg-passing
- assume reliable communication, but processes might "fail" (fail-stop)
- maybe *digital signatures*, otherwise process can *lie*
### Problem:
- each proc begins "undecided", and proposes a val v_i
- procs communicate w/ each other
- each "decides" a value d_i
### Solution requirements :
- *termination*: each correct proc eventually sets a d_i
- *agreement*: the decision values of all correct procs are the same
- *integrity*: if the correct processes all proposed the same value, then any correct process decides that value
### How solve consensus w/ no failures?
- each proc chooses
- reliably multicasts to everyone else
- decisions chosen through majority (or min, or max.....)
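A sketch of the failure-free case: every process reliably multicasts its proposal, then all apply the same deterministic rule (here, the minimum) to the same set, so they trivially agree.

```python
def decide(proposals):
    """proposals: {proc_id: value}; after reliable multicast every correct process
    holds the same dictionary, so min (or max, or majority) yields the same decision."""
    return min(proposals.values())

assert decide({"p1": 7, "p2": 3, "p3": 9}) == 3
```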
# Generalization: the Byzantine Generals
### (or General and the lieutenants)
- a general decides, tells everyone
- (this is different from consensus)
- lieutenants talk amongst themselves about what they have been told by general, other lieutenants
### Solution requires:
- (*termination*): finishes
- (*agreement*): all correct lieutenants decide the same
- (*integrity*): if general correct, all correct lieuts decide correctly
Note
: easy to tell that lying is happening; *hard to tell who*
# First, let's show impossible with three:
- A tells B go, C nogo.
- B and C tell each other what A said.
- Both B and C know someone is lying, but can't tell who.
![3 generals](generals3-2.png)
So no "agreement" or "integrity"
# If 3 doesn't work, neither does 3f!
Sketch out the extension to N = 3f.
Relies on the impossibility of 3 nodes tolerating one faulty:
Assume we have an algorithm that works for N = 3f, where f > 1.
Then:
- let's have three procs, two correct, one not
- each internally simulates f generals
- the two correct procs each simulate f correct generals, the bad simulates f bad ones
Assumption means:
- consensus must be reached
- and all correct generals (i.e. those in the correct procs) reach the right solution
But:
- we really only have three procs
- could construct from this a way to solve for three, by:
- each proc "decides" by majority of its simulated generals
- whole system "decides" by majority of procs.
But now we have solved the problem for three procs! Contradiction....
# But 4 works (oral message solution)
- Oral implies that messages can not be faked, msg origin always known.
- Multiple rounds, recursively telling each other what they've heard
from other nodes.
- works if less than 1/3 faulty (system size must be 3f+1)
![4 generals](generals4-2.png)
Note that this is a rough approximation of the real algorithm, which
requires a number of rounds proportional to the number of faults tolerated (*f*+1 rounds).
# What if signatures?
Solution for 3 exists if we are just worrying about equivocation:
- each lieutenant forwards the signed order it received to the others.
- compare the signed orders to detect whether the commander equivocated.
- relies on a default choice ("no", or "retreat") when the orders conflict
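A toy sketch of the signed-message idea for three nodes (the `verify` hook and message shape are assumptions, not Lamport's algorithm verbatim): each lieutenant compares the commander's signed order with the signed order forwarded by the other lieutenant, and falls back to the default if they conflict or are missing.

```python
def decide(own_signed_order, forwarded_signed_order, verify, default="retreat"):
    """Signed orders are opaque values; verify() rejects forgeries, so a faulty
    lieutenant cannot invent a different order bearing the commander's signature."""
    orders = {o for o in (own_signed_order, forwarded_signed_order)
              if o is not None and verify(o)}
    if len(orders) == 1:
        return orders.pop()        # consistent (or only one valid) order: follow it
    return default                 # equivocation or nothing valid: take the default
```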
## Different Problems
- Byz generals also called *byzantine broadcast*.
- sync: *f < n-1*
- async *f < n/3*
- Byz agreement: trying to decide the truth by quorum consensus ("sky is blue" vs "sky is red")
- sync: *f < n/2*
- async *f < n/3*
## Castro's proof of *3f + 1* for Byz Agreement
- must be possible to make progress w/ only *n - f* replicas (because *f* might be faulty and choose not to respond)
- however, the *f* not responding might be correct but slow (asynchrony), so there might be *f* bad responses out of the *n - f*
- implies *(n - f - f) > f*, or *n >= 3f + 1*
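The same counting argument as a quick check: wait for *n − f* replies (the other *f* may never arrive), and even if *f* of those replies are from faulty replicas, the correct ones must still outnumber them.

```python
def min_replicas(f):
    # smallest n with (n - f) - f > f, i.e. n > 3f
    n = 3 * f + 1
    assert (n - f) - f > f
    return n

assert [min_replicas(f) for f in (1, 2, 3)] == [4, 7, 10]
```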
# Practical Byzantine Fault Tolerance (Castro and Liskov)
**Properties**
- safety, liveness (*n >= 3f+1*)
- can't guarantee both in an async environment (FLP)
- safety guaranteed through protocol
- liveness "guaranteed" with reasonable assumption on msg delays
- performance
- one round trip for reads
- 2 round trips for read/write
**Independent failure modes**
- different security provisions (root passwd etc)
- different implementations !!
**byzantine faults**
- adversary(s) can
- coordinate faulty nodes
- delay communication of non-faulty nodes (but not indefinitely)
- equivocation
- saying "yes" to Larry, "no" to "Moe"
- lying about received messages
- guarantees:
- safety and liveness (assuming at most *floor((n-1)/3)* replicas are faulty, and *delay(t)* does not grow faster than *t* indefinitely)
**cool things**
- much more robust/secure than paxos
- one extra round, at most
- little other cost
- authenticators
![pbft normal case](pbftNormal.png)