# State-Machine Replication for Planet-Scale Systems (ATLAS)

EPaxos, but faster.

- fast-path quorum size: *floor(n/2) + f*
- can *sometimes* use the fast path even w/ concurrent non-commuting ops
- *always* takes the fast path for *f=1*, where the fast-path quorum is a minimal majority (*floor(n/2) + 1*)
- faster non-fault-tolerant reads "when the conflict relation between commands is transitive".
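
To make the quorum arithmetic concrete, here is a small sketch that tabulates the sizes quoted in these notes (fast path *floor(n/2) + f*, slow path *f+1*, recovery *n-f*). The function name `quorum_sizes` and the bound *f <= floor((n-1)/2)* are assumptions made for illustration, not something taken from the paper's pseudocode.

```python
def quorum_sizes(n: int, f: int) -> dict:
    """Quorum sizes quoted in these notes for n sites tolerating f failures.

    Hypothetical helper; the bound on f is an assumption of this sketch.
    """
    if not 1 <= f <= (n - 1) // 2:
        raise ValueError("need 1 <= f <= floor((n-1)/2)")
    return {
        "fast": n // 2 + f,      # fast-path quorum
        "slow": f + 1,           # slow-path (consensus) quorum
        "recovery": n - f,       # recovery quorum, including the sender
        "majority": n // 2 + 1,  # minimal majority, for comparison
    }


if __name__ == "__main__":
    for n, f in [(5, 1), (5, 2), (13, 1), (13, 2)]:
        print(f"n={n} f={f} -> {quorum_sizes(n, f)}")
```

With *f=1* the `fast` and `majority` entries coincide for any *n*, which is the "minimal majority" point in the list above.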

![protocol](atlasFig1.png)

## Protocol
![protocol](atlasProtocol.png)


### Slow Path
Slow-path (consensus) quorum size is only *f+1*! (as in Flexible Paxos)

## Recovery

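Background: the fast path is taken iff every dependency reported by some fast-quorum process was reported by at least *f* of them:

&nbsp;&nbsp;&nbsp;   *U<sub>Q</sub>dep = U(f)<sub>Q</sub>dep*        (line 15 in protocol)

Here *U<sub>Q</sub>* is the plain union over the fast quorum *Q*, and *U(f)<sub>Q</sub>* keeps only the deps reported by at least *f* processes in *Q*.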
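
The condition can be sketched in a few lines. This is a minimal illustration, assuming the MCOLLECTAck replies are modelled as a dict from process id to a set of dependency names; `threshold_union` and `can_take_fast_path` are hypothetical names, and the example replies are made up rather than read off the figures.

```python
from collections import Counter


def union(acks):
    """Plain union of all dependency sets reported by the fast quorum."""
    return set().union(*acks.values())


def threshold_union(acks, f):
    """Dependencies reported by at least f fast-quorum processes."""
    counts = Counter(d for deps in acks.values() for d in deps)
    return {d for d, k in counts.items() if k >= f}


def can_take_fast_path(acks, f):
    """Fast-path check: the plain union equals the threshold union."""
    return union(acks) == threshold_union(acks, f)


if __name__ == "__main__":
    f = 2  # n = 5, so the fast quorum has floor(5/2) + 2 = 4 processes
    ok = {1: {"a"}, 2: {"a", "b"}, 3: {"a", "b"}, 4: {"a"}}
    bad = {1: {"a"}, 2: {"a", "c"}, 3: {"a"}, 4: {"a"}}
    print(can_take_fast_path(ok, f))   # every dep is reported at least twice
    print(can_take_fast_path(bad, f))  # "c" is reported only once
```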

![protocol](atlasFig2.png)


It is easy to see that a fast-path dep proposal can be recovered
after the coordinator and one other replica die, but how can recovery
tell whether the fast path was actually taken?

How to distinguish 2a from this version:
- "2" didn't report "c", so fast path should not be taken
- assume (1) and (4) die

![noC](atlasNoC.png)

The slow-path quorum is *f+1* = 3, *but all f+1 need to reply*: with (1) and (4) dead,
2, 3, and 5 all have to respond, and the resulting *D* does not include *c*.



## Recovery, continued
The recovery quorum is bigger (*n-f*, including the sender) to compensate for the smaller consensus quorum (*f+1*).

**Property 2:** *Any fast-path proposal can be obtained by taking the union of the dependencies sent in MCOLLECTAck by at least* **floor(n/2)** *fast-quorum processes that are not the initial coordinator*.
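
Property 2 can be checked by brute force on a small example. The sketch below assumes the same dict-of-sets model of MCOLLECTAck replies, and that every non-coordinator reply already contains the coordinator's own deps (since those travel in MCOLLECT). The name `recoverable_from_any_subset` and the sample replies are hypothetical.

```python
from itertools import combinations


def recoverable_from_any_subset(acks, coordinator, n):
    """True if every group of floor(n/2) non-coordinator fast-quorum
    processes yields the full proposal when their deps are unioned."""
    proposal = set().union(*acks.values())
    others = [p for p in acks if p != coordinator]
    return all(
        set().union(*(acks[p] for p in subset)) == proposal
        for subset in combinations(others, n // 2)
    )


if __name__ == "__main__":
    n = 5  # f = 2, fast quorum {1, 2, 3, 4}, process 1 coordinates
    # Fast path was possible: "a" and "b" are each reported at least f times.
    fast = {1: {"a"}, 2: {"a", "b"}, 3: {"a", "b"}, 4: {"a"}}
    # Fast path was not possible: "c" is reported by process 2 only.
    slow = {1: {"a"}, 2: {"a", "c"}, 3: {"a"}, 4: {"a"}}
    print(recoverable_from_any_subset(fast, coordinator=1, n=n))
    print(recoverable_from_any_subset(slow, coordinator=1, n=n))
```

In the second case the pair {3, 4} never saw "c", which is why a dep seen by fewer than *f* processes cannot be part of a fast-path proposal.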

### How to recover *D* at *f=2* after just fast path w/ two failures?

1. One failure leaves at least one replica that reported the dep, so it will be seen afterwards. 
1. If the other failure is the coordinator, we can still recover because the coordinator appends any deps it knows to its MCOLLECT messages. Therefore, if the coordinator was one of the *f* that reported a specific dependency, the other participants of the collect phase will know about it.
1. Analogous for other *f*.

Recovery states:
- at least one non-failed replica has a commit: sends to all others
- none 

### What if the two failures


*The initial coordinator of a command can avoid consensus when it can ensure that any process performing recovery will propose the same set of dependencies to consensus.* 

![recovery](atlasRecovery.png?1)

## Performance

![recovery](atlasPerf1.png?1)

Note that EPaxos w/ *f=1* also always takes the fast path, though the
authors don't mention this.

![recovery](atlasPerf2.png?1)

A bit apples to oranges here: EPaxos's fault tolerance increases w/
more sites, but Atlas's does not.


## Issues
- "violating the assumption ...number of failures may only compromise liveness, but never safety"
  - assumes replicas recoverable
