# Consensus, Byzantine Consensus, and PBFT

## Simple consensus example: Two Generals

![](byzTwoG.png)

- one is the decider
- both need to attack at the same time
- need to agree on:
  - time (easy: a msg and an ack)
  - agreement to attack (hard)

&nbsp;<p>
&nbsp;<p>
&nbsp;<p>
&nbsp;<p>

**Example:**
A sends to B "attack at 10".
But did B get it? Can't go unless sure.
B sends an ack,
but did A get the ack?

&nbsp;<p>
&nbsp;<p>
&nbsp;<p>
&nbsp;<p>

### Impossibility

Look at the sequence msg-ack-ack...

Assume there is some subset of *i* msgs that would constitute a proof to either general, and
that both would then attack.

However, what if the last msg is not delivered?
  - receiver presumably would not attack (hasn't received the sequence)
  - sender, though, sees all msgs as the i-sequence, and attacks....
  - *consensus is hard*....
&nbsp;<p>
&nbsp;<p>
&nbsp;<p>
&nbsp;<p>
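
The impossibility argument can be replayed mechanically. The sketch below is only an illustration: `two_generals` is a hypothetical helper, and it assumes a general attacks once it has sent or received every one of the `required` messages. Dropping only the last message always leaves the sender attacking and the receiver not.

```python
def two_generals(required: int, delivered: int) -> dict:
    """Simulate a msg-ack-ack... exchange in which `required` messages
    count as proof, but only the first `delivered` messages arrive.

    Assumed rule (hypothetical): a general attacks once it has sent or
    received every one of the `required` messages.
    """
    seen = {"A": 0, "B": 0}  # highest-numbered message each general knows of
    for k in range(1, required + 1):
        # Odd-numbered messages go A -> B, even-numbered ones B -> A.
        sender, receiver = ("A", "B") if k % 2 == 1 else ("B", "A")
        seen[sender] = k          # the sender knows what it sent
        if k > delivered:         # message k is lost, so the exchange stops
            break
        seen[receiver] = k
    return {general: seen[general] == required for general in seen}


for i in range(1, 6):
    # Losing only the last message always splits the decision.
    print(i, two_generals(required=i, delivered=i - 1))
```

Making the sequence longer only moves the problem: whatever length is chosen as "proof", its last message can still be lost.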

### Fix?
- A sends a whole bunch, assume one gets through
- A and B send and ack a while

However, *provable agreement between even two parties in an asynchronous
environment is not possible*. The same general approach as above was the
basis of Fischer, Lynch, and Paterson's impossibility proof for
consensus w/ even a single faulty (fail-stop!) process. Basically, at some
point one process does something to confirm the consensus, but it could
fail just before this happens.

&nbsp;<p>
&nbsp;<p>
&nbsp;<p>
&nbsp;<p>

# PAXOS

*Agree on a single result in network of unreliable processes.*

### Assumptions:
- non-byzantine (fail-stop) faults
- processors have arbitrary speed
- processors w/ stable storage may rejoin after failure

### Safety requirements:
- only a value that has been proposed may be chosen (non-triviality)
- only a single value chosen (consistency)
- process never learns a value is chosen unless it has been (correctness)
- guaranteeing termination on top of these is *impossible*, at least for asynchronous systems (Fischer, Lynch, and Paterson '85)

### Roles
- clients
- proposers
- acceptors
- learners

### Invariants:
1. P1: An acceptor must accept the first proposal it receives
   - works even if only one proposal is ever made
1. P2: If a proposal w/ value v is chosen, *every higher-numbered proposal that is chosen has value v*

## Procedure:
1. Proposer issues PREPARE w/ proposal number n and sends it to the acceptors; each acceptor returns:
   - **PROMISE never again to accept a proposal numbered < n**, with
     - the **highest number < n** it has accepted
     - the value **v** it accepted with that number
2. Proposer sends ACCEPT request w/ n and a value (the value of the highest-numbered proposal reported in the PROMISEs, if any; otherwise its own)
   - acceptor can accept a proposal n iff it has not PROMISEd w/ num > n
   - response is ACCEPTED
   - A value ACCEPTED by a majority is the winner. Safety properties satisfied

Opt:
- don't respond to PREPARE w/ num less than one already seen in a PREPARE (because the acceptor won't accept it anyway)

Progress:
- not guaranteed because p and q can duel w/ ever-increasing n
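
The two phases can be sketched in a few lines. This is a minimal single-process illustration, not a real implementation: `Acceptor` and `propose` are hypothetical names, calls stand in for messages, and nothing is lost or delayed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Acceptor:
    promised: int = -1                 # highest n seen in a PREPARE or ACCEPT
    accepted_n: Optional[int] = None   # number of the last accepted proposal
    accepted_v: Optional[str] = None   # value of the last accepted proposal

    def prepare(self, n: int):
        if n <= self.promised:
            return None                # optimization: ignore a stale PREPARE
        self.promised = n
        return ("PROMISE", n, self.accepted_n, self.accepted_v)

    def accept(self, n: int, v: str):
        if n < self.promised:          # already promised a higher number
            return ("NACK", n)
        self.promised = n
        self.accepted_n, self.accepted_v = n, v
        return ("ACCEPTED", n, v)


def propose(acceptors, n: int, my_value: str) -> Optional[str]:
    """Run both phases; return the chosen value, or None if n lost out."""
    majority = len(acceptors) // 2 + 1
    promises = [p for p in (a.prepare(n) for a in acceptors) if p is not None]
    if len(promises) < majority:
        return None
    # Adopt the value of the highest-numbered proposal already accepted.
    prior = [(p[2], p[3]) for p in promises if p[2] is not None]
    value = max(prior, key=lambda t: t[0])[1] if prior else my_value
    replies = [a.accept(n, value) for a in acceptors]
    accepted = [r for r in replies if r[0] == "ACCEPTED"]
    return value if len(accepted) >= majority else None


acceptors = [Acceptor() for _ in range(3)]
print(propose(acceptors, 1, "V1"))   # first proposal wins with its own value
print(propose(acceptors, 2, "V2"))   # a later proposer must adopt V1
```

The second call shows invariant P2 at work: once a value is chosen, a higher-numbered proposal can only re-propose it.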

### Paxos

```
  Client   Proposer      Acceptor     Learner
   |         |          |  |  |       |  |
   X-------->|          |  |  |       |  |  Request
   |         X--------->|->|->|       |  |  Prepare(1)
   |         |<---------X--X--X       |  |  Promise(1,{null,null,null})
   |         |          |  |  |       |  |
   |         X--------->|->|->|       |  |  Accept!(1,V)
   |         |<---------X--X--X------>|->|  Accepted(1,V)
   |         |          |  |  |       |  |
   |<---------------------------------X--X  Response
   |         |          |  |  |       |  |
```
### Dueling Proposers
```
Client   Leaders        Acceptor     Learner
   |      |  |          |  |  |       |  |
   X----->|  |          |  |  |       |  |  Request
   |      X------------>|->|  |       |  |  Prepare(1)
   |      |<------------X--X  |       |  |  Promise(1)
   |         |                        |  |
   |         X--------->|  |->|       |  |  Prepare(2)
   |         |<---------X  |--X       |  |  Promise(2)
   |                                  |  |
   |      X------------>|->|  |       |  |  Accept(1,  V1)
   |      |<------------X--X  |       |  |  NACK
   |      X------------>|->|  |       |  |  Prepare(3)
   |      |<------------X--X  |       |  |  Promise(3, {null, null, null})
   |                                  |  |
   |         X--------->|  |->|       |  |  Accept(2)
   |         |<---------X  |--X       |  |  NACK
   |         X--------->|  |->|       |  |  Prepare(4)
   |         |<---------X  |--X       |  |  Promise(4)
   |                                  |  |
   |      |  |          |  |  |       |  |  ... and so on ...
```

### Multi-Paxos (collapsed roles)
```
Leaders      Acceptors
   | |       |  |  | --- First Request ---
   X-------->X->|->|  Prepare(N)
   | |       |<-X--X  Promise(N, I, {Va, Vb...})
   | |       |  |  |
   | |       X->|->|  Accept!(N, I, V)
   | |       |<-X--X  Accepted(N, I)
   |<--------X  |  |  Response
   | |       X  |  |
   X-------->X->|->|  Accept!(N, I+1, W)
   | |       |<-X--X  Accepted(N, I+1)
   |<--------X  |  |  Response
   | |       |  |  |
   X-------->X->|->|  Accept!(N, I+2, Z)
   | |       |<-X--X  Accepted(N, I+2)
   |<--------X  |  |  Response
   | |       |  |  |
   | X------>X->|->|  Prepare(N+1)
   | |       |<-X--X  Promise(N+1, I, {Va, Vb...})
   | |       X->|->|  Accept!(N+1, I, V)
   | |       |<-X--X  Accepted(N+1, I)
   |<--------X  |  |  Response
   |
```
Clients are assumed to talk to different leaders.

---------------------------------------------------------------------

# Chubby

- explicitly talk about replicated log as base of dist systems
- basically a key-value store

## Optimizations:
- multipaxos
- batching mult values in single instance
- master leases 
  - ensure reads do not have to go through consensus
- boosting leader seq numbers periodically to avoid old masters trying
  to re-take the lead
- snapshots: log is record of steps to achieve a state
  - that state can be used to substitute for the log
  - but logs are under control of the system, while snapshots under
    control of app
	


## Multi-op
- came out of compare-and-swap requirement
- guard: list of tests of various slots
- t_op: DB operations to execute if guard TRUE
- f_op: DB operations to execute if guard FALSE
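
A multi-op can be pictured as follows. This is only a sketch under assumptions: the database is a plain `dict`, guards and operations are Python callables, and `multi_op` and `compare_and_swap` are hypothetical names. Atomicity is assumed to come from applying each multi-op as a single entry of the replicated log.

```python
def multi_op(db: dict, guard, t_op, f_op) -> bool:
    """Evaluate every guard test, then run t_op if all hold, else f_op."""
    ok = all(test(db) for test in guard)
    for op in (t_op if ok else f_op):
        op(db)
    return ok


def compare_and_swap(db: dict, key, expected, new) -> bool:
    """Compare-and-swap expressed as a multi-op with an empty f_op."""
    return multi_op(
        db,
        guard=[lambda d: d.get(key) == expected],
        t_op=[lambda d: d.__setitem__(key, new)],
        f_op=[],
    )


db = {"x": 1}
print(compare_and_swap(db, "x", 1, 2), db)   # guard holds, x is updated
print(compare_and_swap(db, "x", 1, 3), db)   # guard fails, x is unchanged
```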

---------------------------------------------------------------------




## Byzantine consensus: Two Lieutenants Problem

*Safety*:
- all loyal lieutenants make same decision
- all loyal lieutenants follow loyal general

&nbsp;<p>
&nbsp;<p>
&nbsp;<p>
&nbsp;<p>

![](byzL.png)
&nbsp;<p>
Clearly impossible for both lieutenants to always make the same decision, as neither knows whether the fault
lies with the general or the other lieutenant.  **Therefore, no solution when *n = 3f*.**

&nbsp;<p>
&nbsp;<p>
&nbsp;<p>
&nbsp;<p>

![](byzL3.png)
&nbsp;<p>
Each lieutenant decides based on majority of input, done! **So *3f+1* works, at least in the case of *f=1*.**
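
The majority rule for one general and three lieutenants can be sketched as below. Assumptions: orders are binary ("attack"/"retreat"), each lieutenant relays what it received to the other two, and a traitorous lieutenant relays `lie` to everyone. `om1` is a hypothetical name for this one-round exchange.

```python
from collections import Counter


def om1(general_sends: dict, traitor=None, lie="retreat") -> dict:
    """Return the decision of every loyal lieutenant.

    general_sends maps lieutenant name -> order the general sent it
    (a traitorous general may send different orders to different names).
    traitor names a lying lieutenant, or None if all lieutenants are loyal.
    """
    decisions = {}
    for me in general_sends:
        if me == traitor:
            continue
        votes = [general_sends[me]]              # what the general told me
        for other in general_sends:
            if other != me:                      # what the others relayed
                votes.append(lie if other == traitor else general_sends[other])
        decisions[me] = Counter(votes).most_common(1)[0][0]
    return decisions


# Loyal general, lieutenant L3 is the traitor.
print(om1({"L1": "attack", "L2": "attack", "L3": "attack"}, traitor="L3"))
# Traitorous general sends conflicting orders, all lieutenants loyal.
print(om1({"L1": "attack", "L2": "attack", "L3": "retreat"}))
```

In both cases every loyal lieutenant sees the same majority, so the loyal lieutenants agree; with a loyal general they also follow the general's order.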

&nbsp;<p>
&nbsp;<p>
&nbsp;<p>
&nbsp;<p>


![](byzAlb.png)
&nbsp;<p>




Byzantine faults require Byzantine protocols. *3f+1* nodes are needed to tolerate *f*
faults.
- Proof by contradiction:
    - Assume we have a protocol, "BYZ3", that works for *3f* generals even if *f* of them (one third) are faulty.
    - Have *3* Byzantine generals each simulate *f* Albanian generals running BYZ3.
      - Since only one Byzantine general is faulty, the *2f* Albanian
        generals simulated by the other two are simulated correctly, so the situation is identical to
        having *2f* non-faulty generals and *f* faulty generals.
      - If BYZ3 is correct, this simulation will reach the correct conclusion.
      - However, there are **really** only *3* generals, meaning we have
        solved the problem for *3* generals with a single faulty node, which we
        know to be impossible.
    - Contradiction!


# Practical Byzantine Fault Tolerance (Castro and Liskov)

## Assumptions
- operations are deterministic
- replicas start in the same state

This means that correct replicas will return identical results.

## Differences from fail-stop consensus
- 3*f*+1 replicas instead of 2*f*+1
- 3 phases instead of two
- cryptographic signatures

## Clients
- send requests to primary replica
- wait for f+1 identical replies

## Lying Primary
- might start two distinct ops
- if 3*f*+1 replicas total
  - any two quorums of 2*f*+1 overlap in at least *f*+1 replicas, so at least one non-faulty replica is in both
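
The quorum arithmetic is easy to check: two quorums of size 2*f*+1 drawn from 3*f*+1 replicas share at least 2(2*f*+1) − (3*f*+1) = *f*+1 replicas, and at most *f* of those can be faulty. A small sketch (the helper name `min_quorum_overlap` is made up for illustration):

```python
def min_quorum_overlap(n: int, q: int) -> int:
    """Smallest possible intersection of two size-q subsets of n replicas."""
    return max(0, 2 * q - n)


for f in range(1, 5):
    n, q = 3 * f + 1, 2 * f + 1
    overlap = min_quorum_overlap(n, q)
    # At most f of the shared replicas are faulty, so at least one is honest.
    print(f"f={f}: n={n}, quorum={q}, overlap>={overlap}, honest>={overlap - f}")
```

That one honest replica in the intersection is what stops a lying primary from getting two different ops accepted for the same slot.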

![](byzMsgs.png)

## Phases
- *pre-prepare* and *prepare* phases order requests even if the primary is lying
- a replica is *prepared* when it has:
  - received a proper pre-prepare from the primary (valid sig, current view)
  - received 2*f* "prepares" from other replicas that match it
- such a replica then:
  - multicasts "commit" to the other replicas
- a replica is *committed-local* iff:
  - it is *prepared*, and
  - it has accepted 2*f*+1 commit msgs (possibly including its own)
- system is *committed* iff *prepared* is true for *f*+1 non-faulty replicas
  - *committed* is true if *committed-local* is true for some non-faulty replica
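
The per-replica predicates can be written down as simple counting checks. This is a sketch for a single request slot, assuming *committed-local* builds on *prepared* and that signature, view, and sequence-number checks happen before a message is recorded. `ReplicaLog` is a hypothetical name.

```python
from dataclasses import dataclass, field


@dataclass
class ReplicaLog:
    f: int                                        # number of tolerated faults
    has_pre_prepare: bool = False                 # valid pre-prepare logged
    prepares: set = field(default_factory=set)    # ids of *other* replicas
    commits: set = field(default_factory=set)     # ids, may include our own

    def prepared(self) -> bool:
        # Pre-prepare plus 2f matching prepares from other replicas.
        return self.has_pre_prepare and len(self.prepares) >= 2 * self.f

    def committed_local(self) -> bool:
        # Prepared plus 2f+1 matching commits.
        return self.prepared() and len(self.commits) >= 2 * self.f + 1


log = ReplicaLog(f=1, has_pre_prepare=True)
log.prepares.update({1, 2})
print(log.prepared(), log.committed_local())   # prepared, not yet committed-local
log.commits.update({0, 1, 2})
print(log.prepared(), log.committed_local())
```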
	
## Garbage Collection
Messages must be retained until they have been executed by at least *f*+1
non-faulty replicas, and it must be possible to prove this on *view changes*.
- also some replicas may have missed msgs, must be brought up to date

All this requires crypto (expensive), and therefore:
- proofs generated only occasionally (a "checkpoint")
- proofs prior to a valid checkpoint can be tossed

## Optimizations
1. Most replies to clients are just hashes of the authenticated proof.
   - a single replica sends the entire proof
1. Replicas execute once *prepared* and reply to clients before commit
   - client waits for 2*f*+1 matching replies
   - message delays down from 5 to 4
1. Read-only ops are sent directly to all replicas
   - rep waits until prior ops committed
   - client waits for 2*f*+1 replies
1. **authenticators**. Rather than public-key sigs among replicas:
   - assume a symmetric key between each replica pair
   - rep *r* "signs" by computing a MAC of the msg digest w/ the key it shares with each other rep
   - vector of such MACs is the **authenticator**
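
An authenticator can be sketched with the Python standard library. This is an illustration, not the paper's construction: HMAC-SHA256 is an assumed choice of MAC, and `make_authenticator` / `verify_entry` are hypothetical names.

```python
import hashlib
import hmac


def make_authenticator(msg: bytes, keys: dict) -> dict:
    """Build one MAC per peer over the message digest.

    keys maps peer id -> symmetric key shared with that peer.
    """
    digest = hashlib.sha256(msg).digest()
    return {peer: hmac.new(key, digest, hashlib.sha256).digest()
            for peer, key in keys.items()}


def verify_entry(msg: bytes, mac: bytes, key: bytes) -> bool:
    """A receiver checks only its own entry of the authenticator."""
    digest = hashlib.sha256(msg).digest()
    expected = hmac.new(key, digest, hashlib.sha256).digest()
    return hmac.compare_digest(mac, expected)


keys = {1: b"key-r-1", 2: b"key-r-2", 3: b"key-r-3"}   # made-up pairwise keys
auth = make_authenticator(b"PREPARE v=0 n=7", keys)
print(verify_entry(b"PREPARE v=0 n=7", auth[2], keys[2]))
print(verify_entry(b"PREPARE v=0 n=8", auth[2], keys[2]))
```

Each receiver can verify only its own entry, which is the trade-off against public-key signatures: cheap to compute, but not transferable proof to third parties.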

## Performance

- BFS close to NFS
  - but this is NFS3, w/ 4k block sizes

## Other systems
- speculative execution (replying to client before confirming ordering)
  reduces to 3 1-way latencies (Zyzzyva)


# Transactional Storage for Geo-Replicated Systems  ("Walter")

## Why?

- snapshot isolation imposes a total ordering of the commit time of all transactions, even those that do not conflict
- writes of a committed transaction must be immediately visible to later transactions
   - means commit happens only after writes propagated everywhere

## Big things

**parallel snapshot isolation**
- enables commit and timeline per site
- causal ordering among transactions across sites
- no write-write conflicts



**sets** (counting sets)
  - commutative, like multisets but allow negative numbers
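
A counting set is small enough to sketch directly. The class below is a hypothetical illustration (`CSet`, `add`, `rem`, `count` are names chosen here): each element maps to an integer count that may go negative, so adds and removes commute.

```python
from collections import Counter


class CSet:
    """Counting set: element -> integer count, negative counts allowed."""

    def __init__(self):
        self.counts = Counter()

    def add(self, x):
        self.counts[x] += 1

    def rem(self, x):
        self.counts[x] -= 1      # may drop below zero

    def count(self, x) -> int:
        return self.counts[x]


# The same operations applied in two different orders at two sites.
site_a, site_b = CSet(), CSet()
site_a.add("photo"); site_a.rem("photo"); site_a.rem("photo")
site_b.rem("photo"); site_b.rem("photo"); site_b.add("photo")
print(site_a.count("photo"), site_b.count("photo"))
```

Because the order of application does not matter, updates to a cset from different sites never produce a write-write conflict.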

## Features
- per-transaction *site*
  - isolation within a site
- per-object *preferred sites*
  - vs *primary site* (have to be modified at primary. eh)
  - no conflict-resolution logic (**but:** they serialize at preferred sites)
- two-phase commit across multiple preferred sites
- asynchronous propagation across sites
  - efficient update-anywhere (kinda-sorta)

![snapshot](walterIsolation.png)

Assumes *each user communicates with one site at a time*.
- user can modify *any* object, not just those w/ preferred at that site
  - this is the difference with primary, which requires all writes to be done at that
  site. Not all that different, though, as preferred must still be consulted.

## PSI
- snapshot isolation locally
- different commit orderings across sites
- but xactions **with overlapping write sets ordered same everywhere**
- causal propagation across sites after-the-fact

![PSI](walterParallel.png)




### Transaction startup
- transaction has a *site* where it will commit
- transaction is assigned a vector timestamp *startVTS*  when it starts.
   - For example, if startVTS = `⟨2, 4, 5⟩` then the transaction reads from the snapshot
     containing 2 transactions from site 1, 4 from site 2, and 5 from site 3.
      - startVTS contains the sequence number of the latest transactions from *each site that were committed at the local site*
   - A version number (or simply version) is a pair ⟨site, seqno⟩ assigned to a
     transaction when it commits; it has the site where the transaction executed, and a
     sequence number local to that site.
- Paxos configuration server maintains preferred sites, system info
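
Which versions a transaction can read follows directly from `startVTS`. A minimal sketch, assuming sites are numbered from 1 as in the example above and using a hypothetical helper name `visible`:

```python
def visible(version: tuple, start_vts: list) -> bool:
    """Is the version (site, seqno) inside the snapshot given by start_vts?"""
    site, seqno = version
    return seqno <= start_vts[site - 1]   # sites are numbered from 1


start_vts = [2, 4, 5]
print(visible((1, 2), start_vts))   # 2nd transaction from site 1
print(visible((1, 3), start_vts))   # 3rd transaction from site 1
print(visible((3, 5), start_vts))   # 5th transaction from site 3
```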

### Committing
- `startVTS` is the version vector at the local site
- `fast commit` (all objects in the write set have the local site as preferred site)
  - check that written objects are *unmodified* since start
  - check *none locked* (by slow commit protocol)
  - **only abort point**
  - transaction then given *per-site sequence number*
- `slow commit` (at least one non-local preferred site)
  - local site acts as coordinator in two-phase commit
    - remote site says *yes* if object unmodified, unlocked
      - and locks object
    - second phase says to commit and unlock objects.
  - Note:
    - "unmodified" means since the version used by the transaction, which might be old
    - entire transaction executed locally, lists of updates pushed at end

## Performance

- BerkeleyDB is a straw man, widely known to be very slow, especially across wide area
- *one site per data center!*
	 
![anomalies](walterAnomolies.png)



# Comments
"Across sites, weaker consistency is acceptable, because users can tolerate a small delay for their actions to be seen by other users"
- They seem to be confusing consistency with *latency*.

- "Sets are interesting and the authors discuss them a lot, but I don't really see their
  point...are they just a data structure optimized and intended to be used to implement the applications in the paper?"






