
# File Systems Unfit (as distributed Storage Backends)

Why implement *BlueStore*:
- hard to implement transactions on existing FS
  - many transaction solutions for FS exist, but they are slow, limited in functionality, or complex in interface or implementation
  - other options slow:
    - implementing Write-Ahead logging in user space
    - key-value store
- local FS metadata performance very bad
  - enumerating directories w/ millions of entries
  - no ordering of the returned entries
- rigidity of mature FS prevents them from adopting emerging storage hardware that doesn't use block interface
  - Shingled Magnetic Recording (SMR) increases HDD capacity, but works best w/ a *zone* interface
  - *Zoned namespace* (ZNS) SSDs eliminate the long tail latency caused by the translation layer

# BlueStore
Built from 2015 - 2019 on raw devices.

- Novelties:
  - key-value store
  - optimizing clone operations
  - user-space fs
  - space allocator w/ fixed memory usage per TB of disk space
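
To get a feel for what "fixed memory usage per TB" can mean, here is a minimal sketch of a flat bitmap allocator with one bit per block. It is an illustration only and not BlueStore's actual allocator; the 4 KiB block size, the class `BitmapAllocator`, and the helper `bitmap_bytes_per_tib` are assumptions made for the example.

```python
class BitmapAllocator:
    """Toy allocator: one bit per fixed-size block (hypothetical, not BlueStore's code)."""

    def __init__(self, device_bytes, block_bytes=4096):
        self.block_bytes = block_bytes
        self.num_blocks = device_bytes // block_bytes
        # Memory use depends only on device size, never on fragmentation.
        self.bits = bytearray((self.num_blocks + 7) // 8)

    def _is_used(self, block):
        return bool(self.bits[block // 8] & (1 << (block % 8)))

    def allocate(self):
        """Return the index of the first free block and mark it used."""
        for block in range(self.num_blocks):
            if not self._is_used(block):
                self.bits[block // 8] |= 1 << (block % 8)
                return block
        raise MemoryError("device full")

    def free(self, block):
        """Clear the block's bit so it can be handed out again."""
        self.bits[block // 8] &= ~(1 << (block % 8)) & 0xFF


def bitmap_bytes_per_tib(block_bytes=4096):
    """Bitmap size needed to track one TiB at the given block size."""
    return (2**40 // block_bytes) // 8
```

With 4 KiB blocks, `bitmap_bytes_per_tib()` comes to 32 MiB per TiB, regardless of how fragmented the device is.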


But:
- none of the cited papers on systems implementing strong consistency is later than 2008 [41, 74, 98, 102]
- "data and metadata are first inserted to RocksDB as promises of future I/O, and then asynchronously written to disk after the transaction commits."???


## Options for building transactions in a distributed back-end
### Leveraging in-kernel transaction framework
- designed for consistency of in-kernel FS, not usually available for users
- no rollback (not needed for FS)
  - snapshots for rollback expensive

### Write-Ahead Logging in user space
What?
- log is the ground truth, table storage just a snapshot
- Xtion commits after commit op written to log

**read-modify-write slow**
- three steps for each transaction:
  - write log
  - fsync()
  - push to tables
- read of second Xtion delayed until all three steps performed by first
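
The three steps can be sketched as a toy user-space WAL. This is a minimal illustration under stated assumptions, not Ceph's FileStore code: the class `UserSpaceWAL`, the JSON record format, and the in-memory dict standing in for the tables are all made up for the example.

```python
import json
import os


class UserSpaceWAL:
    """Toy key-value table protected by a write-ahead log (hypothetical names)."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.table = {}  # stands in for the on-disk tables
        self.log = open(log_path, "a+b")

    def commit(self, updates):
        # Step 1: serialize the transaction and append it to the log.
        record = json.dumps(updates).encode() + b"\n"
        self.log.write(record)
        self.log.flush()
        # Step 2: fsync() so the record is durable; the transaction is now committed.
        os.fsync(self.log.fileno())
        # Step 3: apply the operations to the tables.
        self.table.update(updates)

    def read(self, key):
        # A read only sees a transaction's effects after step 3 has run.
        return self.table.get(key)

    def recover(self):
        """Rebuild the tables by replaying the log from the start."""
        self.table = {}
        self.log.seek(0)
        for line in self.log:
            self.table.update(json.loads(line))
```

Because `read` looks at the tables and not the log, a transaction that reads before writing has to wait for the previous transaction's step 3, which is the delay noted above.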

**non-idempotent ops**
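- replaying the log after a crash re-applies ops that already ran; ops such as clone or rename give a different result the second time, so replay can corrupt data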
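
A minimal sketch of why replaying non-idempotent operations from a log can go wrong. It assumes a transaction of three operations (clone a to b, update a, update c) and a crash after the second one; the `apply` helper and the tuple encoding of operations are invented for the illustration.

```python
def apply(ops, store, stop_after=None):
    """Apply ops in order; optionally 'crash' after `stop_after` ops."""
    for i, op in enumerate(ops):
        if stop_after is not None and i >= stop_after:
            break
        kind = op[0]
        if kind == "clone":      # ("clone", src, dst)
            store[op[2]] = store[op[1]]
        elif kind == "update":   # ("update", name, value)
            store[op[1]] = op[2]
    return store


txn = [("clone", "a", "b"), ("update", "a", "new"), ("update", "c", "x")]

# Crash-free run: b keeps the old contents of a.
expected = apply(txn, {"a": "old", "c": "-"})

# Crash after the second op, then replay the whole transaction from the log.
crashed = apply(txn, {"a": "old", "c": "-"}, stop_after=2)
replayed = apply(txn, crashed)
```

In the crash-free run `b` holds `"old"`. After the crash and replay, the clone runs again on the already-updated `a`, so `b` ends up as `"new"`: the clone is not idempotent.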

**double writes**
- log + tables means avail bandwidth halved

### Key-Value Store as WAL
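Creating an object means writing the data to a file and the metadata to the key-value store (RocksDB), each followed by its own fsync(). On a journaling FS each fsync() triggers two cache flushes, so four flushes per object. Slow.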

## New device technology
Zone interfaces (sequential writes within a zone) vs block interfaces (overwrite in place)

## BlueStore
Goals:
- fast metadata operations
  - RocksDB, which is only copy of metadata
- no consistency overhead
  - writes data directly to raw disk, single cache flush per write (vs 4 w/ key-value store as WAL)
- space-efficient checksums
  - per extent vs per-file
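
A rough sketch of checksumming at a finer granularity than the whole file. The 4 KiB block size and both function names are assumptions for the example, and `zlib.crc32` is used only because it ships with Python; it is not meant to mirror BlueStore's checksum algorithm or on-disk layout.

```python
import zlib

BLOCK = 4096  # assumed checksum granularity


def block_checksums(data, block=BLOCK):
    """One CRC per fixed-size block instead of one for the whole object."""
    return [zlib.crc32(data[i:i + block]) for i in range(0, len(data), block)]


def find_corrupt_blocks(data, checksums, block=BLOCK):
    """Return indices of blocks whose current CRC differs from the stored one."""
    current = block_checksums(data, block)
    return [i for i, (now, stored) in enumerate(zip(current, checksums))
            if now != stored]
```

A larger `block` means fewer checksums to store, at the cost of re-reading more data to verify or locate a single corrupted byte.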

![cephTimeline](cephTimeline.png)

//=====================================================================

# Lineage Stash

Motivation
- large systems, failure common case
- customer-facing systems prioritize latency as much as throughput

Fault tolerance
- checkpoints:
  - fast during normal ops, slow in recovery (*all processes roll back*)
- lineage:
  - opposite: slow during normal ops, fast in recovery (*durable logging before message sent*)

![lineageCheckpoint](lineageCheckpoint.png)


**Contributions:**
1. analysis of nondeterministic events that must be logged
1. log storage architecture
1. lineage stash: causal logging w/ low runtime and recovery overheads


Lineage stash:
*decentralized causal logging technique that reduces runtime overhead of lineage approaches w/o impacting recovery speed*

Big idea:
*one can forward the full lineage along with every task invocation*
- asynchronously store the lineage at a central repository
- only forward the part that is not yet pushed to repository
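
The forwarding idea can be sketched as follows. This is a simplified model under stated assumptions, not the paper's implementation: `CentralStore`, `Process`, and the `(sender, sequence number)` event ids are hypothetical, and `flush` is called synchronously here although it would run in the background in a real system.

```python
class CentralStore:
    """Stand-in for the central lineage repository."""

    def __init__(self):
        self.entries = {}

    def put(self, entries):
        self.entries.update(entries)


class Process:
    def __init__(self, name, store):
        self.name = name
        self.store = store
        self.stash = {}  # lineage entries not yet known to be in the store
        self.next_seq = 0

    def submit(self, task, receiver):
        """Record the task locally, then forward it with the unflushed lineage."""
        event_id = (self.name, self.next_seq)
        self.next_seq += 1
        self.stash[event_id] = task
        receiver.receive(task, dict(self.stash))

    def receive(self, task, forwarded):
        # Keep the sender's unflushed lineage so it survives the sender's failure.
        self.stash.update(forwarded)

    def flush(self):
        """Push the stash to the store; flushed entries no longer need forwarding."""
        self.store.put(self.stash)
        self.stash = {}
```

After a flush, later submissions carry only the entries created since that flush, which keeps the forwarded lineage small.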

Minimize asynchronous events recorded:
- piecewise deterministic task execution helps
- still need to record all asynchronous/external events
  - clock reads
  - *order of task submission*
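
A small record/replay sketch for one kind of nondeterministic event, clock reads. The `EventLog` class and `task` function are hypothetical; the point is only that a piecewise deterministic task produces the same result on replay once its nondeterministic inputs come from the log.

```python
import time


class EventLog:
    """Records nondeterministic values on the first run, replays them on recovery."""

    def __init__(self, recorded=None):
        self.replaying = recorded is not None
        self.events = list(recorded) if recorded else []
        self.cursor = 0

    def clock(self):
        if self.replaying:
            value = self.events[self.cursor]  # reuse the logged value
            self.cursor += 1
            return value
        value = time.time()  # nondeterministic: must be logged
        self.events.append(value)
        return value


def task(log):
    """Deterministic apart from its clock reads."""
    start = log.clock()
    end = log.clock()
    return (start, end)
```

Running `task(EventLog())` once and then `task(EventLog(recorded=first.events))` with the events of that first run returns the same tuple both times.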
  
## Overview

**Guarantee:**
- if a task fails, any messages it received since the last checkpoint will be replayed in the same order.

**Invariant:**
- if another task sees the result of a failed task, the failed task's inputs and event ordering must be available

## Applicability

### Deterministic applications
![deterministic](deterministic.png)
- lineage not forwarded in normal operation because it can be deterministically recreated after failure.
- Each process just has to remember the tasks it submitted in local stash.
- On recovery, each task that had been submitted *to* the failed process is re-submitted.
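
A minimal sketch of the deterministic case, assuming a single sender to keep it short (it does not model interleaving across several senders). `Worker`, `resubmit_to`, and `recover` are hypothetical names.

```python
from collections import defaultdict


class Worker:
    def __init__(self, name):
        self.name = name
        self.submitted = defaultdict(list)  # local stash: receiver name -> tasks sent
        self.executed = []

    def submit(self, task, receiver):
        # Remember the submission locally; nothing extra is sent to the receiver.
        self.submitted[receiver.name].append(task)
        receiver.executed.append(task)

    def resubmit_to(self, restarted):
        """Replay, in original order, every task this worker sent to the failed one."""
        for task in self.submitted[restarted.name]:
            restarted.executed.append(task)


def recover(failed_name, survivors):
    """Start a fresh worker and let each survivor re-submit from its local stash."""
    restarted = Worker(failed_name)
    for worker in survivors:
        worker.resubmit_to(restarted)
    return restarted
```

Since the tasks are deterministic, re-executing them on the restarted worker recreates its state without any lineage having been forwarded during normal operation.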

### Nondeterministic applications

![nondeterministic](nondeterministic.png)
- lineage forwarded
- on recovery, any other proc that received message *from* failed proc has lineage (ordering info) that can let the failed proc recover.
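
A sketch of the nondeterministic case, where the order in which a process receives messages is the event that must survive its failure. The `Proc` class and `recover_order` helper are assumptions for the example; each process forwards its whole ordering log here, without the flushing optimization.

```python
class Proc:
    def __init__(self, name):
        self.name = name
        self.order = []  # own nondeterministic events: order of received messages
        self.held = {}   # lineage received from other processes: name -> order

    def receive(self, msg, sender, lineage=None):
        self.order.append((sender, msg))
        if lineage:
            # Merge forwarded lineage; a longer log supersedes a shorter prefix.
            for name, order in lineage.items():
                if len(order) > len(self.held.get(name, [])):
                    self.held[name] = list(order)

    def send(self, msg, receiver):
        # Forward own ordering log so the receiver can supply it after a failure.
        receiver.receive(msg, self.name, {self.name: self.order})


def recover_order(failed_name, survivors):
    """Longest ordering log for the failed process held by any survivor."""
    logs = [p.held.get(failed_name, []) for p in survivors]
    return max(logs, key=len, default=[])
```

If P receives `m2` from B, then `m1` from A, and then sends a message to C, C holds P's receive order. When P fails, `recover_order("P", [a, b, c])` returns that order so P's inputs can be replayed as they originally arrived.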

### Centralized lineage store
- during failure recovery, all other processes flush to central store (can be optimized)
- recovering process has all information in one place


-----

*Mootaz*
