# Byzantine Consensus, and PBFT
## Simple example: Two Generals
![](byzTwoG.png)
- one is a decider:
- both need to attack same time
- need to agree on:
- time: (easy: msg and an ack)
- agreement to attack (hard)
&nbsp;<p>
&nbsp;<p>
&nbsp;<p>
&nbsp;<p>
**Example:**
A sends to B "attack at 10".
But did B get it? Can't go unless sure.
B sends an ack,
but did A get the ack?
&nbsp;<p>
&nbsp;<p>
&nbsp;<p>
&nbsp;<p>
### Impossibility
Look at the sequence of msg-ack-ack......
Assume there is some finite sequence of *i* delivered msgs that constitutes a proof, and
that both would then attack.
However, what if the last msg of that sequence is not delivered?
- the receiver presumably would not attack
- the sender, though, has seen exactly the same msgs as in the *i*-msg sequence, and so attacks....
&nbsp;<p>
&nbsp;<p>
&nbsp;<p>
&nbsp;<p>
### Fix?
- A sends a whole bunch, assume one gets through
- A and B send and ack a while
However, *provable agreement between even two parties is not possible in an asynchronous
environment*.
&nbsp;<p>
&nbsp;<p>
&nbsp;<p>
&nbsp;<p>
## Two Lieutenants Problem
*Safety*:
- all loyal lieutenants make same decision
- all loyal lieutenants follow loyal general
&nbsp;<p>
&nbsp;<p>
&nbsp;<p>
&nbsp;<p>
![](byzL.png)
&nbsp;<p>
Clearly impossible for both lieutenants to always make same decision, as neither knows if the fault
lies with the general or the other lieutenant. **Therefore, no solution when *n = 3f*.**
&nbsp;<p>
&nbsp;<p>
&nbsp;<p>
&nbsp;<p>
![](byzL3.png)
&nbsp;<p>
Each lieutenant decides based on majority of input, done! **So *3f+1* works, at least in the case of *f=1*.**
&nbsp;<p>
&nbsp;<p>
&nbsp;<p>
&nbsp;<p>
![](byzAlb.png)
&nbsp;<p>
Byzantine faults demand Byzantine fault-tolerant protocols. *3f+1* nodes needed to tolerate *f*
faults.
- (Above) showed the impossibility of *3* nodes tolerating a single fault
- Proof by contradiction for the general case:
- Assume we have a protocol, "BYZ3", that works for *3m* generals even if one third of them (*m*) are faulty.
- Have *3* Byzantine generals each simulate *m* Albanian generals running BYZ3.
- Since only one Byzantine general is faulty, the other *2m* generals are simulated correctly, so the situation is identical to having *2m* non-faulty generals and *m* faulty generals.
- If BYZ3 is correct, this simulation will reach the correct conclusion, and each correct general can decide by the majority of the generals it simulates.
- However, there are **really** only *3* generals, meaning we have solved the problem for *3* generals with a single faulty node, which we know to be impossible.
- Contradiction!
# Practical Byzantine Fault Tolerance (Castro and Liskov)
## Assumptions
- operations are deterministic
- replicas start in the same state
Means that correct replicas will return identical results.
## Differences from fail-stop consensus
- 3*f*+1 replicas instead of 2*f*+1
- 3 phases instead of two
- cryptographic signatures
## Clients
- send requests to primary replica
- wait for f+1 identical replies
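A minimal sketch (not the paper's code) of the client-side rule: since at most *f* replicas can be faulty, *f*+1 identical replies guarantee at least one came from a correct replica.

```python
from collections import Counter

def accept_reply(replies, f):
    """replies: result values received so far, one per replica.
    Returns the accepted result once f+1 of them match, else None (keep waiting)."""
    if not replies:
        return None
    result, count = Counter(replies).most_common(1)[0]
    return result if count >= f + 1 else None
```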
## Lying Primary
- might try to start two distinct orderings for the same op
- with 3*f*+1 replicas total, quorums are of size 2*f*+1
- any two quorums overlap in at least *f*+1 replicas, so at least one correct replica sees both orderings and will not support both
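A quick arithmetic check of the overlap claim, assuming quorums of size 2*f*+1 out of *n* = 3*f*+1 replicas:

```python
# Any two quorums of size 2f+1 out of n = 3f+1 replicas intersect in at least
# f+1 replicas, so at least one *correct* replica sees both of a lying
# primary's conflicting orders.
for f in range(1, 10):
    n = 3 * f + 1
    quorum = 2 * f + 1
    min_overlap = 2 * quorum - n   # worst case: the quorums share as little as possible
    assert min_overlap == f + 1 > f
```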
![](byzMsgs.png)
## Phases
- *pre-prepare* and *prepare* phases order request even if primary lying
- a replica is *prepared* when:
- it has received a properly signed pre-prepare from the primary in the current view
- and 2*f* "prepares" from other replicas that match it
- Such a replica then:
- multicasts "commit" to the other replicas
- a replica is *committed-local* iff:
- it is *prepared*, and
- has accepted 2*f*+1 commit msgs (possibly including its own)
- the system is *committed* iff *prepared* is true for *f*+1 non-faulty replicas
- if *committed-local* is true for some non-faulty replica, then *committed* is true
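A hedged sketch of the *prepared* and *committed-local* predicates above, for a single (view, sequence number, digest) slot; the counts are assumed to come from correctly signed, matching messages.

```python
def prepared(has_valid_pre_prepare, matching_prepares, f):
    # pre-prepare from the primary for this view/seqno/digest, plus 2f matching prepares
    return has_valid_pre_prepare and matching_prepares >= 2 * f

def committed_local(has_valid_pre_prepare, matching_prepares, matching_commits, f):
    # prepared, plus 2f+1 matching commits (possibly including this replica's own)
    return (prepared(has_valid_pre_prepare, matching_prepares, f)
            and matching_commits >= 2 * f + 1)
```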
## Garbage Collection
Messages must be retained until they have been executed by at least *f*+1
non-faulty replicas, and this fact must be provable during *view changes*.
- also some replicas may have missed msgs, must be brought up to date
All this requires crypto (expensive), and therefore:
- proofs generated only occasionally (a "checkpoint")
- proofs prior to a valid checkpoint can be tossed
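A rough sketch of the checkpoint rule: a checkpoint at sequence number *s* becomes *stable* once 2*f*+1 replicas have produced matching checkpoint messages, after which everything at or below *s* can be discarded. (The message and log layout here are illustrative.)

```python
from collections import Counter

def stable_checkpoint(checkpoint_msgs, f):
    """checkpoint_msgs: list of (seqno, state_digest) pairs, one per replica.
    Returns the highest seqno backed by 2f+1 matching messages, or None."""
    counts = Counter(checkpoint_msgs)
    stable = [seqno for (seqno, digest), c in counts.items() if c >= 2 * f + 1]
    return max(stable) if stable else None

def garbage_collect(log, stable_seqno):
    # drop pre-prepare/prepare/commit messages covered by the stable checkpoint
    return {seqno: msgs for seqno, msgs in log.items() if seqno > stable_seqno}
```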
## Optimizations
1. Most replicas send the client just a digest of the reply
- a single designated replica sends the entire reply
1. Replicas execute once *prepared* and reply to clients before commit
- client waits for 2*f*+1 matching replies
- message delays down from 5 to 4
1. Read-only ops send directly to all replicas
- rep waits until prior ops committed
- client waits for 2*f*+1 replies
1. **authenticators**. Rather than public-key sig among replicas:
- assume symmetric key between each replica pair
- rep *r* "signs" by computing a MAC of the msg digest with the key it shares with each other rep
- the vector of such MACs is the **authenticator**
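A sketch of an authenticator using HMACs; the key setup and message layout are assumptions, not the paper's exact wire format.

```python
import hashlib
import hmac

def make_authenticator(digest, shared_keys):
    """digest: bytes to authenticate; shared_keys: {replica_id: symmetric key shared
    with that replica}. The authenticator is the vector of per-replica MACs."""
    return {peer: hmac.new(key, digest, hashlib.sha256).digest()
            for peer, key in shared_keys.items()}

def check_my_entry(digest, authenticator, my_id, my_key):
    # each receiver verifies only its own entry in the vector
    expected = hmac.new(my_key, digest, hashlib.sha256).digest()
    return hmac.compare_digest(authenticator.get(my_id, b""), expected)
```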
## Performance
- BFS close to NFS
- but this is NFS3, w/ 4k block sizes
## Other systems
- speculative execution (replying to client before confirming ordering)
reduces to 3 1-way latencies (Zyzzyva)
# Transactional Storage for Geo-Replicated Systems ("Walter")
## Why?
- snapshot isolation imposes a total ordering of the commit time of all transactions, even those that do not conflict
- writes of a committed transaction must be immediately visible to later transactions
- means commit happens only after writes propagated everywhere
## Big things
**parallel snapshot isolation**
- enables commit and timeline per site
- causal ordering among transactions across sites
- no write-write conflicts
**csets** (counting sets)
- commutative, like multisets but counts may go negative (see the sketch below)
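A minimal sketch of a counting set; the class and method names are illustrative, not Walter's API. Because adds and removes commute, sites can apply the same updates in different orders and still converge.

```python
from collections import defaultdict

class CSet:
    def __init__(self):
        self.counts = defaultdict(int)   # element -> count (may go negative)

    def add(self, x):
        self.counts[x] += 1

    def remove(self, x):
        self.counts[x] -= 1              # allowed to go below zero; this is what makes it commute

    def apply(self, deltas):
        # merge a batch of remote updates: {element: +/- delta}
        for x, d in deltas.items():
            self.counts[x] += d

    def members(self):
        return {x for x, c in self.counts.items() if c > 0}
```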
## Features
- per-transaction *site*
- isolation within a site
- per-object *preferred sites*
- vs *primary site* (have to be modified at primary. eh)
- no conflict-resolution logic (**but:** they serialize at preferred sites)
- two-phase commit across multiple preferred sites
- asynchronous propagation across sites
- efficient update-anywhere (kinda-sorta)
![snapshot](walterIsolation.png)
Assumes *each user communicates with one site at a time*.
- user can modify *any* object, not just those w/ preferred at that site
- this is the difference with primary, which requires all writes to be done at that
site. Not all that different, though, as preferred must still be consulted.
## PSI
- snapshot isolation locally
- different commit orderings across sites
- but xactions **with overlapping read sets ordered same everywhere**
- causal propagation across sites after-the-fact
![PSI](walterParallel.png)
### Transaction startup
- transaction has a *site* where it will commit
- transaction is assigned a vector timestamp *startVTS* when it starts.
- For example, if startVTS = `⟨2, 4, 5⟩` then the transaction reads from the snapshot
containing 2 transactions from site 1, 4 from site 2, and 5 from site 3.
- for each site, startVTS contains the sequence number of the latest transaction from that site that has been committed at the local site (see the visibility sketch after this list)
- A version number (or simply version) is a pair ⟨site, seqno⟩ assigned to a
transaction when it commits; it has the site where the transaction executed, and a
sequence number local to that site.
- Paxos configuration server maintains preferred sites, system info
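A small sketch of the visibility rule implied by startVTS (field names are illustrative): a committed version ⟨site, seqno⟩ is in the snapshot iff the local site had already received that many of that site's transactions when the transaction started.

```python
def visible(version, start_vts):
    """version: (site, seqno); start_vts: {site: number of that site's transactions
    committed at the local site when this transaction started}."""
    site, seqno = version
    return seqno <= start_vts.get(site, 0)

start_vts = {1: 2, 2: 4, 3: 5}            # the <2, 4, 5> example above
assert visible((2, 4), start_vts)          # 4th transaction from site 2: in the snapshot
assert not visible((3, 6), start_vts)      # 6th from site 3: committed after the snapshot
```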
### Committing
- `startVTS` is the version vector at the local site
- `fast commit` (all written objects have their preferred site at the local site; see the sketch after this list)
- check that written objects are *unmodified* since start
- check *none locked* (by slow commit protocol)
- **only abort point**
- transaction then given *per-site sequence number*
- `slow commit` (at least one non-local preferred site)
- local site acts as coordinator in two-phase commit
- remote site says *yes* if object unmodified, unlocked
- and locks object
- second phase says to commit and unlock objects.
- Note:
- "unmodified" means since version used by the trans, which might be old
- entire transaction executed locally, lists of updates pushed at end
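A toy sketch of the fast-commit check (the data layout is an assumption): abort unless every written object is unlocked and unmodified since the transaction's snapshot.

```python
def fast_commit_ok(write_set, locked, latest_version, start_vts):
    """write_set: objects written; locked: {obj: bool} set by in-progress slow commits;
    latest_version: {obj: (site, seqno)} of the newest committed version of each object."""
    for obj in write_set:
        if locked.get(obj, False):
            return False                     # a slow (2PC) commit holds this object
        site, seqno = latest_version[obj]
        if seqno > start_vts.get(site, 0):
            return False                     # modified since the version the transaction read
    return True                              # assign a per-site sequence number and commit
```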
## Performance
- BerkeleyDB is a straw man, widely known to be very slow, especially across wide area
- *one site per data center!*
![anomolies](walterAnomolies.png)
# Comments
"Across sites, weaker consistency is acceptable, because users can tolerate a small delay for their actions to be seen by other users"
- They seem to be confusing consistency with *latency*.
- "Sets are interesting and the authors discuss them a lot, but I don't really see their
point...are they just a data structure optimized and intended to be used to implement the applications in the paper?"
# Consensus
Problem is for processes to agree on a value that one or more has proposed.
## Asynchronous vs synchronous systems:
In synchronous settings, msgs arrive within fixed amount of time:
- any longer delay means failure
Asynchronous:
- msgs arbitrarily slow (*communication asynchrony*)
- machines arbitrarily slow (*process asynchrony*)
- re-ordered messages (*message order asynchrony*)
Distributed systems must be assumed `concurrent`, `asynchronous`, and
`subject to failure`.
> Fischer et al. ('85) showed that no deterministic protocol can guarantee agreement in an asynchronous system with even one faulty process, but:
Despite real systems being async:
- they work anyway.....
- can reach agreement with "high probability"
- faults can be masked (flash storage)
- "perfect" failure detectors
- not really, but all agree to abide
- use randomization
- an adversary manipulates procs and communication so msgs arrive at just the wrong time
- random delays make this tougher
# Fault Tolerance
Faults may be:
- transient
- intermittent
- permanent
Failures:
- **fail-stop failures** (also "fail-silent") processors fail by ceasing to
communicate. No incorrect communication is sent.
- **byzantine failures** allow a faulty process to continue
communicating, but it may send arbitrary messages, maybe collaborate
with other faulty processes, etc.
Approaches to faults:
- *avoidance* - Formal validations, code inspection, etc.
- *fault removal* - Encounter bugs, fix, re-run.
- *fault tolerance* - correct operation in the presence of faults.
- information redundancy : replication, coding
- time redundancy : retries (no good for permanent failures)
- physical redundancy : backups, master-slaves, RAID disks
Primary backup approach:
- primary does all the work
- backups detect primary failure w/ heartbeat messages
- cold failover - need to restart everything, loses work
- warm failover - primary informs backups of changes (checkpoints,
or msgs), loses little or nothing
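A minimal sketch of heartbeat-based failure detection for warm failover; the timeout and API are assumptions:

```python
import time

class HeartbeatMonitor:
    """Run at a backup: the primary's heartbeat messages call on_heartbeat();
    if heartbeats stop for longer than `timeout`, the backup suspects the primary."""
    def __init__(self, timeout=2.0):
        self.timeout = timeout
        self.last_seen = time.monotonic()

    def on_heartbeat(self):
        self.last_seen = time.monotonic()

    def primary_suspected(self):
        return time.monotonic() - self.last_seen > self.timeout
```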
# Consensus System Model (sync)
### Have:
- a collection of procs p_i (1 .. N) w/ msg-passing
- assume reliable communication, but processes might "fail" (fail-stop)
- maybe *digital signatures*, otherwise process can *lie*
### Problem:
- each proc begins "undecided", and proposes a val v_i
- procs communicate w/ each other
- each "decides" a value d_i
### Solution requirements :
- *termination*: each correct proc eventually sets a d_i
- *agreement*: the decision values of all correct procs are the same
- *integrity*: if the correct processes all proposed the same value, then any correct process decides that value
### How solve consensus w/ no failures?
- each proc chooses
- reliably multicasts to everyone else
- decisions chosen through majority (or min, or max.....)
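A sketch of the failure-free case: every process reliably multicasts its proposal, then all apply the same deterministic rule (here, the minimum) to the same set, so they trivially agree.

```python
def decide(proposals):
    """proposals: {proc_id: value}; after reliable multicast every correct process
    holds the same dictionary, so min (or max, or majority) yields the same decision."""
    return min(proposals.values())

assert decide({"p1": 7, "p2": 3, "p3": 9}) == 3
```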
# Generalization: the Byzantine Generals
### (or General and the lieutenants)
- a general decides, tells everyone
- (this is different from consensus)
- lieutenants talk amongst themselves about what they have been told by general, other lieutenants
### Solution requires:
- (*termination*): finishes
- (*agreement*): all correct lieutenants decide the same
- (*integrity*): if general correct, all correct lieuts decide correctly
Note
: easy to tell that lying is happening; *hard to tell who*
# First, let's show impossible with three:
- A tells B go, C nogo.
- B and C tell each other what A said.
- Both B and C know someone is lying, but can't tell who.
![3 generals](generals3-2.png)
So no "agreement" or "integrity"
# If 3 doesn't work, neither does 3f!
Sketch out the extension to N = 3f.
Relies on the impossibility of 3 nodes tolerating one faulty:
Assume we have an algorithm that works for N = 3f, where f > 1.
Then:
- let's have three procs, two correct, one not
- each internally simulates f generals
- the two correct procs each simulate f correct generals, the bad simulates f bad ones
Assumption means:
- consensus must be reached
- and all correct generals (i.e. those in the correct procs) reach the right solution
But:
- we really only have three procs
- could construct from this a way to solve for three, by:
- each proc "decides" by majority of its simulated generals
- whole system "decides" by majority of procs.
But now we have solved the problem for three procs! Contradiction....
# But 4 works (oral message solution)
- Oral implies that messages can not be faked, msg origin always known.
- Multiple rounds, recursively telling each other what they've heard
from other nodes.
- works if less than 1/3 faulty (system size must be 3f+1)
![4 generals](generals4-2.png)
Note that this is a rough approximation of the real algorithm, which
requires a number of rounds proportional to the number of faults tolerated (*f*+1 rounds).
# What if signatures?
Solution for 3 exists if we are just worrying about equivocation:
- each lieutenant forwards the signed order it received to the others.
- compare the signed orders to detect whether the commander equivocated.
- relies on a default choice ("no", or "retreat") when the orders conflict
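A toy sketch of the signed-message idea for three nodes (the `verify` hook and message shape are assumptions, not Lamport's algorithm verbatim): each lieutenant compares the commander's signed order with the signed order forwarded by the other lieutenant, and falls back to the default if they conflict or are missing.

```python
def decide(own_signed_order, forwarded_signed_order, verify, default="retreat"):
    """Signed orders are opaque values; verify() rejects forgeries, so a faulty
    lieutenant cannot invent a different order bearing the commander's signature."""
    orders = {o for o in (own_signed_order, forwarded_signed_order)
              if o is not None and verify(o)}
    if len(orders) == 1:
        return orders.pop()        # consistent (or only one valid) order: follow it
    return default                 # equivocation or nothing valid: take the default
```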
## Different Problems
- Byz generals also called *byzantine broadcast*.
- sync: *f < n-1*
- async *f < n/3*
- Byz agreement: trying to decide the truth by quorum consensus ("sky is blue" vs "sky is red")
- sync: *f < n/2*
- async *f < n/3*
## Castro's proof of *3f + 1* for Byz Agreement
- must be possible to make progress w/ only *n - f* replicas (because *f* might be faulty and choose not to respond)
- however, the *f* not responding might be correct but slow (asynchrony), so there might be *f* bad responses out of the *n - f*
- implies *(n - f - f) > f*, or *n >= 3f + 1*
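The same counting argument as a quick check: wait for *n − f* replies (the other *f* may never arrive), and even if *f* of those replies are from faulty replicas, the correct ones must still outnumber them.

```python
def min_replicas(f):
    # smallest n with (n - f) - f > f, i.e. n > 3f
    n = 3 * f + 1
    assert (n - f) - f > f
    return n

assert [min_replicas(f) for f in (1, 2, 3)] == [4, 7, 10]
```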
# Practical Byzantine Fault Tolerance (Castro and Liskov)
**Properties**
- safety, liveness (*n >= 3f+1*)
- can't guarantee both in an async environment (FLP)
- safety guaranteed through protocol
- liveness "guaranteed" with reasonable assumption on msg delays
- performance
- one round trip for reads
- 2 round trips for read/write
**Independent failure modes**
- different security provisions (root passwd etc)
- different implementations !!
**byzantine faults**
- adversary(s) can
- coordinate faulty nodes
- delay communication of non-faulty nodes (but not indefinitely)
- equivocation
- saying "yes" to Larry, "no" to "Moe"
- lying about received messages
- guarantees:
- safety and liveness (assuming at most *floor((n-1)/3)* replicas are faulty, and *delay(t)* does not grow faster than *t* indefinitely)
**cool things**
- much more robust/secure than paxos
- one extra round, at most
- little other cost
- authenticators
![pbft normal case](pbftNormal.png)