# CORFU: A Shared Log Design for Flash Clusters

# Key features
- many SSDs
  - must support *single-copy* semantics
  - **trim** indicates page no longer used
  - **seal** associates epoch numbers with pages, rejects writes from lower epochs
  - "infinite address space"
- "hundreds of client machines append to the tail of a single log .. concurrently"
- flash written *and read* in page units, writes are usually blocks of multiple pages

![arch](corfu1.png)

## Why
-  why not attach many SSDs to one powerful machine?
   - bandwidth-limited

## What
- *shared log abstraction implemented over a cluster of flash units*
- each position in shared log mapped to a set of pages on different units
   - consistent mapping maintained at clients

### Functions:
- **to read:**
   - use local map to find a page w/ the position
   - directly issue the read
- **to append:**
   - get next avail position in log from *sequencer*
   - use local map to find set of pages mapped to it
   - write them
   - *ordering decoupled from I/O*
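
A minimal sketch of the read and append paths above, assuming an in-memory dict per flash unit and a round-robin local map. `Sequencer`, `Client`, and `_locate` are hypothetical names for illustration, not the paper's API.

```python
from itertools import count


class Sequencer:
    """Hands out the next free log position; it stores no data itself."""

    def __init__(self):
        self._next = count()

    def next_position(self):
        return next(self._next)


class Client:
    """Hypothetical client: resolves positions locally, then does I/O directly."""

    def __init__(self, sequencer, units):
        self.sequencer = sequencer
        self.units = units  # list of dicts standing in for flash units

    def _locate(self, position):
        # Local map (assumed round robin): position -> (unit, page on that unit).
        return self.units[position % len(self.units)], position // len(self.units)

    def append(self, data):
        position = self.sequencer.next_position()  # ordering is decided here
        unit, page = self._locate(position)
        if page in unit:
            raise RuntimeError("page already written")
        unit[page] = data  # the I/O goes straight to the unit, not via the sequencer
        return position

    def read(self, position):
        unit, page = self._locate(position)
        return unit[page]
```

Two clients sharing one `Sequencer` get distinct positions, and their writes land on different units without passing through a central data path.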

Or:
- a **mapping function**
- a **tail-finding function**
- a **replication protocol**

![arch](corfu2.png)

## Mapping

![arch](corfu3.png)

- *projection* splits log into disjoint ranges
- each range mapped to a set of extents on flash drives
- log positions mapped to extent pages deterministically (round robin)
- log position might be mapped to multiple pages (**replication**)
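
The mapping can be sketched as a pure function from a log position to a list of replica pages. The `Range` layout, the unit names, and the range sizes below are assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    start: int           # first log position in the range (inclusive)
    end: int             # one past the last log position in the range
    replica_sets: tuple  # each entry is a tuple of unit names holding copies


def locate(projection, position):
    """Return [(unit, page), ...], one pair per replica of a log position."""
    for r in projection:
        if r.start <= position < r.end:
            offset = position - r.start
            units = r.replica_sets[offset % len(r.replica_sets)]  # round robin
            page = offset // len(r.replica_sets)  # page index within the extent
            return [(unit, page) for unit in units]
    raise KeyError(f"position {position} is not mapped")


# Hypothetical projection: two disjoint ranges, each striped over two
# replica sets of two units.
projection = [
    Range(0, 40_000, (("F0", "F1"), ("F2", "F3"))),
    Range(40_000, 80_000, (("F4", "F5"), ("F6", "F7"))),
]
```

Because the function is deterministic, any client holding the same projection resolves a position to the same pages without talking to anyone.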

**Reconfigurations:**
- sealing current projection
   - seal some subset of flash units (if a page is no longer mapped to that
     unit in new epoch)
   - ensures rejection of inflight messages
- writing new projection in auxiliary structure
   - contention happens, w/ backoff
- on flash failure
   - move to new projection using the remaining replica
   - copy data from the remaining replica to new replica 
   - move to new configuration w/ the replication
- maybe implemented w/ Paxos
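
A rough sketch of how sealing makes a unit reject in-flight messages from an old projection. The `FlashUnit` class, its method names, and the "strictly lower epoch is rejected" rule are assumptions for illustration.

```python
class SealedError(Exception):
    """Raised when a message carries an epoch the unit no longer accepts."""


class FlashUnit:
    def __init__(self):
        self.epoch = 0
        self.pages = {}

    def seal(self, new_epoch):
        # Move the unit forward; messages tagged with older epochs now fail.
        if new_epoch <= self.epoch:
            raise SealedError(f"already at epoch {self.epoch}")
        self.epoch = new_epoch

    def write(self, epoch, page, data):
        if epoch < self.epoch:
            raise SealedError(f"stale epoch {epoch}, unit is at {self.epoch}")
        if page in self.pages:
            raise ValueError("page already written")
        self.pages[page] = data
```

A client whose write fails with `SealedError` would fetch the new projection and retry under the new epoch.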


## Sequencer
- dirt simple
- just an optimization

## Replication
- safety-under-contention
- durability
- chain replication to avoid conflicting writes to distinct replicas
- reads can be distributed across replicas if durability is known, otherwise read from the chain end
- hole-filling might cause contention between slow host and fillers. What
  happens if filler wins?
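
One way to picture the race between a slow writer and a hole filler, assuming write-once pages and a fixed chain order. `Unit`, `write_once`, `chain_write`, and the `JUNK` marker are hypothetical; a real unit would return an error rather than the stored value.

```python
JUNK = "<junk>"  # hypothetical marker a filler writes into a hole


class Unit:
    def __init__(self):
        self.pages = {}

    def write_once(self, page, data):
        """Store data only if the page is unwritten; return what the page holds."""
        return self.pages.setdefault(page, data)


def chain_write(chain, page, data):
    """Write replicas in chain order; whoever wins the first unit wins the page."""
    winner = chain[0].write_once(page, data)
    for unit in chain[1:]:
        # Propagate the winning value, which also completes a half-done write.
        unit.write_once(page, winner)
    return winner
```

In this sketch, if the filler wins, the slow writer gets `JUNK` back instead of its own data and would have to append again at a new position. If the writer had already reached the first unit, the filler ends up completing the writer's value down the chain.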

## Garbage collection
- ever-growing address space
- app "trims" sections no longer needed
- sparse address space
- maybe proactive movement for long-lived pages (keeps projection small)


### What's hard
- holes
- failures
   - *vertical PAXOS*
- consistent map

![arch](corfu4.png)


## Questions/Comments
- latency of chain replication (nao)
- why holes need to be filled, and what with (katura)
- vertical what? (rebecca)
- latency of sequencer might be reduced (gang)


### PAXOS stuff

[PAXOS variants](http://paxos.systems/variants.html)

Vertical Paxos: allows reconfiguration while the state machine is active.




# Tango: Distributed Data Structures over a Shared Log

## Why?
Existing systems build abstractions for computing over massive data sets:
- Hadoop
- Spark


Need **"application metadata"**, with *persistence* and *high availability*.
- maps
- counters
- queues
- graphs
- job assignments
- network topologies
.....


## How?

- client modifies object by appending an update to the log
- accesses the object by sync'ing local view w/ log
- *elasticity* - scaling throughput of linearizable reads by adding new views,
  w/o slowing write throughput.  ("until saturation")
- transaction **atomicity** and **isolation** from log
- *streams* to filter log seen at clients
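
A toy version of the append-then-sync pattern, assuming an in-memory list as the log. `SharedLog` and `LogBackedMap` are hypothetical names, not Tango's API.

```python
class SharedLog:
    """In-memory stand-in for the shared log."""

    def __init__(self):
        self.entries = []

    def append(self, entry):
        self.entries.append(entry)
        return len(self.entries) - 1

    def read_from(self, start):
        return self.entries[start:]


class LogBackedMap:
    """Hypothetical map object: the log is the truth, the dict is only a view."""

    def __init__(self, log):
        self.log = log
        self.view = {}
        self.applied = 0  # number of log entries already played into the view

    def put(self, key, value):
        # Mutators never touch the view directly; they only append an update.
        self.log.append((key, value))

    def get(self, key):
        self.sync()  # catch up with the log before answering
        return self.view.get(key)

    def sync(self):
        for key, value in self.log.read_from(self.applied):
            self.view[key] = value
            self.applied += 1
```

Adding another `LogBackedMap` on the same log adds read capacity: each view replays the log independently, and writers are not slowed by the extra readers.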

![tango](tango1.png)

### Transactions: ###
![tango](tangoTrans.png)
- optimistic concurrency control
  - writes entered in log as speculative
  - commit record contains a read set w/ versions.
  - transaction *succeeds* if read objects current at commit record.
  - each reader deterministically evaluates commit record
  - **read transactions**:
    - nothing inserted in log
    - local record of read versions
    - commit/abort by getting log tail from sequencer and checking for updates
      to readset
- can use fine-grained per-app versions
  - opaque key parameters in helper funcs
- crashed client's transaction aborted by others appending crash record
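
The commit check can be sketched as a deterministic replay. This is a simplification under stated assumptions: an object's version is the log position of its last update, and the speculative writes are bundled into the commit record instead of appearing as separate log entries. The entry layout is invented for the example.

```python
def replay(log):
    """Replay a log in order; return (state, versions, decisions)."""
    state, versions, decisions = {}, {}, []
    for position, entry in enumerate(log):
        if entry["kind"] == "update":
            state[entry["key"]] = entry["value"]
            versions[entry["key"]] = position
        elif entry["kind"] == "commit":
            # Succeeds only if every object read is still at the version seen.
            ok = all(
                versions.get(key, -1) == seen
                for key, seen in entry["read_set"].items()
            )
            decisions.append(ok)
            if ok:
                for key, value in entry["writes"].items():
                    state[key] = value
                    versions[key] = position
    return state, versions, decisions
```

Every client that replays the same log prefix reaches the same commit/abort decisions, so no extra coordination is needed to agree on the outcome.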


### Streams

- per-object stream
- transaction commits *multi-appended* to all relevant streams
- *remote-write* transactions
- decision records
- "a client executing a transaction must insert a decision record for a transaction if there’s some other client in the system that hosts an object in its write set but not all the objects in its read set."
- generating clients cannot do a remote read in a transaction (would require RPCs...)
![tango](tangoDecisions.png)
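
A small sketch of stream filtering and of the decision-record rule quoted above. `StreamLog`, `multi_append`, `read_stream`, and `needs_decision_record` are hypothetical names, and streams are modelled as tags on entries of one in-memory list.

```python
class StreamLog:
    """One shared log whose entries are tagged with the streams they belong to."""

    def __init__(self):
        self.entries = []

    def multi_append(self, stream_ids, payload):
        # A single entry becomes visible in every listed stream.
        self.entries.append((frozenset(stream_ids), payload))
        return len(self.entries) - 1

    def read_stream(self, stream_id):
        # A client hosting one object only plays entries tagged with its stream.
        return [
            (position, payload)
            for position, (ids, payload) in enumerate(self.entries)
            if stream_id in ids
        ]


def needs_decision_record(read_set, write_set, other_clients):
    """other_clients maps a client name to the set of objects it hosts."""
    return any(
        bool(hosted & write_set) and not read_set <= hosted
        for hosted in other_clients.values()
    )
```

A client that hosts a written object but not everything in the read set cannot evaluate the commit record on its own, which is why the generating client has to publish the outcome for it.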


### Comments/Questions

- (katura) key challenge is playback bottleneck
- objects can contain pointers to other objects in the log
  - `apply()` upcall has optional offset into the log of each new update
- multiple sequencers no longer tolerated because stream backpointers must be
  correct.
  - sequencer getting complicated
