# CalvinFS
- metadata (small-file) only
- linearizable
- shared-nothing


## Existing state-of-the-world
- ...by leveraging replicated block stores...rely on assumption that the
  average file size is very large (GoogleFS, HDFS)


## Partitioning
- *horizontal partitioning*: different chunks of rows live at different sites
- *vertical partitioning*: splitting the table schema into multiple tables
  (some columns go left, some go right), as in normalization
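
To make the two definitions concrete, here is a minimal sketch on a toy in-memory table. The column names, the `id % num_sites` placement rule, and both helper functions are assumptions made up for illustration; they are not from the paper.

```python
# Toy table: each row is a dict. Column names are illustrative only.
rows = [
    {"id": 1, "name": "a.txt", "size": 10, "owner": "ann"},
    {"id": 2, "name": "b.txt", "size": 20, "owner": "bob"},
    {"id": 3, "name": "c.txt", "size": 30, "owner": "ann"},
    {"id": 4, "name": "d.txt", "size": 40, "owner": "cat"},
]


def horizontal_partition(rows, num_sites):
    """Assign whole rows to sites by key (assumed rule: id modulo num_sites)."""
    sites = [[] for _ in range(num_sites)]
    for row in rows:
        sites[row["id"] % num_sites].append(row)
    return sites


def vertical_partition(rows, key, left_cols, right_cols):
    """Split the schema in two; both halves keep the key so rows can be rejoined."""
    left = [{key: r[key], **{c: r[c] for c in left_cols}} for r in rows]
    right = [{key: r[key], **{c: r[c] for c in right_cols}} for r in rows]
    return left, right


by_site = horizontal_partition(rows, 2)
names, attrs = vertical_partition(rows, "id", ["name"], ["size", "owner"])
```

In the horizontal case every site holds complete rows for a subset of keys; in the vertical case every site holds all keys but only some columns.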

## Calvin Design

- log
  - design
    - large number of front-end servers
    - async-replicated block store
    - small Paxos group of metalog servers
  - requests added by clients
    - send to a front end
    - front end batches w/ other requests and writes batch to block store
    - wait until block store sufficiently replicates (2 of 3) (how is this async?)
    - send block-id to Paxos metalog
  - global totally-ordered sequence of transaction requests (*read-modify-write
    operation* on storage layer)
- storage layer
  - set of multi-version k-v stores
  - consistent hashing for balance
- scheduling layer
  - deterministic locking: 2-phase locking but
    - all locks requested at trans start
    - in order needed
    - transaction ordering deterministic
  - results:
    - no deadlocks
    - no transaction aborts?
    - no distributed commit protocol
- distributed transaction at each site:
  - do all *local* reads
  - forward local read values to all sites
  - wait for needed remote reads
  - execute transaction (ignore writes to non-local data)
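
The four per-site steps above can be sketched in a single process. This is a toy model under stated assumptions, not Calvin's implementation: sites are plain dicts, "forwarding" is a dict merge, and `owner`, `execute`, and the transaction layout are hypothetical names. Missing keys are assumed to read as 0.

```python
def owner(key, num_sites):
    # Hypothetical placement rule. Summing bytes (rather than hash()) keeps
    # the result identical across Python processes.
    return sum(key.encode()) % num_sites


def execute(txn, stores):
    """Run one already-ordered transaction at every participating site."""
    num_sites = len(stores)
    keys = txn["reads"] | txn["writes"]
    participants = sorted({owner(k, num_sites) for k in keys})

    # Step 1: each participant performs only its local reads.
    local_reads = {
        s: {k: stores[s].get(k, 0)
            for k in txn["reads"] if owner(k, num_sites) == s}
        for s in participants
    }

    # Steps 2 and 3: every participant forwards its local reads and waits for
    # the remote ones. Here that exchange is modeled as one merged dict.
    all_reads = {}
    for s in participants:
        all_reads.update(local_reads[s])

    # Step 4: every participant runs the same logic on the same inputs and
    # applies only the writes it owns, ignoring writes to non-local data.
    for s in participants:
        writes = txn["logic"](dict(all_reads))
        for k, v in writes.items():
            if owner(k, num_sites) == s:
                stores[s][k] = v


# Usage: move 3 units from "a" (owned by site 1) to "b" (owned by site 0).
stores = [{"b": 5}, {"a": 10}]
transfer = {
    "reads": {"a", "b"},
    "writes": {"a", "b"},
    "logic": lambda r: {"a": r["a"] - 3, "b": r["b"] + 3},
}
execute(transfer, stores)
```

Because every site sees the same read values and runs the same deterministic logic, each one reaches the same outcome independently, which is why no distributed commit protocol is needed.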

## Back to CalvinFS
- Optimistic Lock Location Prediction (OLLP)
  - needed because some ops (recursive ops) can't be completely annotated in advance
  - maybe static
  - maybe just execute trans w/o writes ("no isolation")
  - abort/retry if touch data not predicted
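
A minimal sketch of the OLLP retry loop, using a recursive delete as the operation whose key set cannot be known in advance. The flat path-keyed dict, the helper names, and the `interleave` hook (which stands in for another transaction running between the reconnaissance pass and the real execution) are all assumptions for illustration.

```python
def keys_under(store, prefix):
    """All keys at or below a directory prefix in a flat path-keyed store."""
    return {k for k in store if k == prefix or k.startswith(prefix + "/")}


def recursive_delete(store, prefix, interleave=None, max_retries=3):
    """Delete a subtree OLLP-style; returns the number of attempts used."""
    for attempt in range(1, max_retries + 1):
        # Reconnaissance: read without isolation to predict the key set.
        predicted = keys_under(store, prefix)

        # Hypothetical hook simulating a concurrent transaction that commits
        # before our transaction reaches its place in the log.
        if interleave is not None:
            interleave(store, attempt)

        # Real execution: locks are held only on `predicted`, so first check
        # that the operation does not touch anything outside that set.
        actual = keys_under(store, prefix)
        if not actual <= predicted:
            continue  # touched unpredicted data: abort and retry

        for k in actual:
            del store[k]
        return attempt
    raise RuntimeError("gave up after repeated mispredictions")


# Usage: a file appears under /d after the first reconnaissance pass.
def add_file_once(store, attempt):
    if attempt == 1:
        store["/d/y"] = 3


fs = {"/d": "dir", "/d/x": 1, "/e": 2}
attempts = recursive_delete(fs, "/d", interleave=add_file_once)
```

The first attempt aborts because `/d/y` was not in the predicted set; the second attempt predicts it and succeeds.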

Reads:
- linearizable (go through log-ordering)
- slightly stale, maybe 400 ms
- client-specified stale read

### Block store
- chunkservers. done.

## Comments/Questions
- block stores become slow when there are lots of small files
- greg: "However, natural disasters, configuration errors, hunters, and
  squirrels sometimes render entire datacenters unavailable for spans of time
  ranging from minutes to days."
- guowei: assumption that we only have a WAN for connecting our nodes seems to
  be unrealistic

