# Archival Intermemory

Intermemory

- non-commercial
- self-organizing
- market incentive

## System

Erasure codes:
Evaluate a polynomial of degree N-1 at r * N points. Any N of those points suffice to recover the polynomial, and hence the data.

Coding:

- r = 2 (doubles the stored size)
- level 1: 32 shares, any 16 suffice
- level 2: 1024 shares
- a failed processor is detected, and its data recovered, by its neighbors
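
A minimal sketch of the polynomial scheme, assuming arithmetic in the prime field GF(257) and evaluation points x = 1..r*N. Neither choice comes from these notes, and the names `encode` and `decode` are hypothetical. The data symbols are used as polynomial coefficients, and any N shares are turned back into coefficients by Lagrange interpolation.

```python
P = 257  # assumed prime modulus; large enough for byte values 0..255


def encode(data, r=2):
    """Use the N symbols in `data` as coefficients of a degree N-1
    polynomial and evaluate it at the r*N points x = 1..r*N."""
    shares = []
    for x in range(1, r * len(data) + 1):
        y = 0
        for coeff in reversed(data):  # Horner's rule, highest degree first
            y = (y * x + coeff) % P
        shares.append((x, y))
    return shares


def decode(shares, n):
    """Recover the N coefficients from any N shares (Lagrange interpolation)."""
    pts = shares[:n]
    coeffs = [0] * n
    for i, (xi, yi) in enumerate(pts):
        basis = [1]  # coefficients of prod_{j != i} (x - xj), lowest degree first
        denom = 1    # prod_{j != i} (xi - xj)
        for j, (xj, _) in enumerate(pts):
            if j == i:
                continue
            nxt = [0] * (len(basis) + 1)
            for k, b in enumerate(basis):  # multiply basis by (x - xj)
                nxt[k] = (nxt[k] - xj * b) % P
                nxt[k + 1] = (nxt[k + 1] + b) % P
            basis = nxt
            denom = (denom * (xi - xj)) % P
        scale = yi * pow(denom, -1, P) % P  # modular inverse, Python 3.8+
        for k, b in enumerate(basis):
            coeffs[k] = (coeffs[k] + scale * b) % P
    return coeffs
```

With N = 16 and r = 2 this yields the 32 shares of level 1, and `decode(subset, 16)` should return the original symbols for any 16-share subset.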

Payment:
- each subscriber "invests" chunk of space for a specific time
- interest on the investment gives subscriber eternal storage of a
  smaller amount

## Formulae

- let P\_t be the number of units participating at time t
- P\_t = P\_{t-1} - deaths + subscriptions
- P\_t = P\_{t-1} - P\_{t-1} * Rd + P\_{t-1} * Rs
- P\_t = (1 - Rd + Rs)P\_{t-1}

Then:

- g = 1 - Rd + Rs
- P\_t = g * P\_{t-1}
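
The recurrence unrolls to P\_t = g^t * P\_0. A small sketch that iterates the step-by-step form and compares it with the closed form; the function names are hypothetical.

```python
def population(p0, rd, rs, steps):
    """Iterate P_t = P_{t-1} - P_{t-1}*Rd + P_{t-1}*Rs for `steps` steps."""
    p = p0
    for _ in range(steps):
        p = p - p * rd + p * rs
    return p


def population_closed_form(p0, rd, rs, steps):
    """Closed form P_t = g^t * P_0 with g = 1 - Rd + Rs."""
    g = 1 - rd + rs
    return p0 * g ** steps
```

The two functions should agree up to floating-point rounding for any rates and starting population.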

## Bottom Line

Over its expected lifetime of 1/Rd time steps, a unit may spend this much perpetual storage:
(1/Rd) * ((g-1)/g)

Example:

- expect units to "survive" 1000 days (Rd = .001)
- assume Rs = 0.0015
- (1 - .001 + .0015)^365 = 1.20, so 20% annual growth
- g = 1.0005, so (g-1)/g is about .0005; multiplied by the 1000-day
  lifetime, a unit accrues about .5 units of perpetual storage space
  after donating 1 unit for 1000 days.
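
The arithmetic in this example can be checked with a few lines, assuming one time step is one day. `perpetual_storage` is a hypothetical helper name.

```python
def perpetual_storage(rd, rs):
    """Perpetual storage earned over an expected lifetime of 1/Rd steps."""
    g = 1 - rd + rs
    lifetime = 1 / rd
    return lifetime * (g - 1) / g


rd, rs = 0.001, 0.0015
annual_growth = (1 - rd + rs) ** 365  # daily growth compounded over a year
earned = perpetual_storage(rd, rs)

print(f"annual growth factor: {annual_growth:.2f}")
print(f"perpetual storage earned: {earned:.2f} units")
```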

## Final thoughts

- relies on expansion
- or increasing amounts of donated space

# Spanner

## Construction

- a single *universe*
- a universe has many *zones*, each either a server farm or a part of
  one; the zone is the unit of physical replication
- each zone has 100 .. several thousand *spanservers*
- a spanserver consists of multiple *replicas*

## Spanserver

![Spanserver](http://dss.kelehers.me/talks/spanserverSoftwareStack.png)
- implements between 100 and 1000 *tablets*
- a tablet is a bag of mappings:

    (key:string, timestamp:int64) --> string

- runs a *Paxos* instance (with long-lived leaders) on top of each tablet.
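
A toy sketch of the tablet mapping, kept in memory. The class name and the read rule (return the newest version written at or before the requested timestamp) are assumptions for illustration, not something these notes spell out.

```python
import bisect


class Tablet:
    """Toy multi-version map: (key: str, timestamp: int) -> str."""

    def __init__(self):
        # key -> (sorted list of timestamps, parallel list of values)
        self._versions = {}

    def write(self, key, timestamp, value):
        stamps, values = self._versions.setdefault(key, ([], []))
        i = bisect.bisect_left(stamps, timestamp)
        if i < len(stamps) and stamps[i] == timestamp:
            values[i] = value  # overwrite an existing version
        else:
            stamps.insert(i, timestamp)
            values.insert(i, value)

    def read(self, key, timestamp):
        """Return the newest value written at or before `timestamp`, else None."""
        stamps, values = self._versions.get(key, ([], []))
        i = bisect.bisect_right(stamps, timestamp)
        return values[i - 1] if i else None
```

Keeping every version, rather than overwriting in place, is what lets a read at an old timestamp see a consistent past state.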


### Paxos group

- leader replica implements *lock table*
- two-phase locking
- leader also implements a *transaction manager*

### Transaction implementation

- If only a single Paxos instance (single tablet) is involved, the
  transaction manager is skipped.
- Otherwise, two-phase commit runs across the Paxos leaders.
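
The two commit paths can be sketched as follows. `PaxosGroup` is a hypothetical in-memory stand-in for a group leader, and the sketch ignores failures, timeouts, and logging of the coordinator's decision.

```python
from dataclasses import dataclass, field


@dataclass
class PaxosGroup:
    """Hypothetical stand-in for one Paxos group's leader."""
    name: str
    can_prepare: bool = True
    log: list = field(default_factory=list)

    def prepare(self, txn):
        self.log.append(("prepare", txn))
        return self.can_prepare  # this participant's vote

    def commit(self, txn):
        self.log.append(("commit", txn))

    def abort(self, txn):
        self.log.append(("abort", txn))


def commit_transaction(txn, groups):
    """Return True if `txn` commits on every participating group."""
    if len(groups) == 1:
        # Single group: no cross-group coordination is needed.
        groups[0].commit(txn)
        return True
    # Phase 1: every participant leader votes.
    votes = [g.prepare(txn) for g in groups]
    # Phase 2: commit only on a unanimous yes, otherwise abort everywhere.
    if all(votes):
        for g in groups:
            g.commit(txn)
        return True
    for g in groups:
        g.abort(txn)
    return False
```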

### Directories (buckets)

- keys in a tablet that share a common prefix
- smallest unit of placement by apps

### Data Model

- schematized semi-relational tables
- query language
- general-purpose read/write transactions

Motivated by:

- popularity of Megastore, even though it's slow
- popularity of Dremel (data analysis tool)
- complaints about Bigtable: only eventual consistency across data
  centers, no "cross-row" transactions (Percolator built to address this)

## TrueTime

![Truetime API](http://dss.kelehers.me/talks/trutime1.png)

Node time:

- *TimeMaster* machines per data center (GPS and/or atomic clocks)
- *timeslave daemon* per machine
- Marzullo's algorithm to reject outlying masters
- slowly increasing uncertainty between time syncs: &epsilon; between
1 and 7ms
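
A sketch of the textbook form of Marzullo's algorithm, which finds the smallest interval covered by the largest number of sources; sources that do not contain it are the outliers. The exact variant Spanner runs is not given in these notes, so this is illustrative only, and `marzullo` is a hypothetical name.

```python
def marzullo(intervals):
    """Given (lo, hi) time estimates from several masters, return
    (lo, hi, count): the interval agreed on by the most sources."""
    edges = []
    for lo, hi in intervals:
        edges.append((lo, -1))  # -1 marks the start of an interval
        edges.append((hi, +1))  # +1 marks the end of an interval
    edges.sort()  # on ties, starts sort before ends, so touching intervals overlap

    best = count = 0
    best_lo = best_hi = None
    for i, (offset, kind) in enumerate(edges):
        count -= kind  # entering an interval raises the overlap count
        if count > best:
            best = count
            best_lo = offset
            best_hi = edges[i + 1][0]  # overlap lasts until the next edge
    return best_lo, best_hi, best
```

For example, `marzullo([(8, 12), (11, 13), (14, 15)])` returns `(11, 12, 2)`: the third source agrees with nobody and would be rejected.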

Operations:

- *read-write transactions*
- *read-only transactions*
- *snapshot reads*
- transactions are *internally* retried (client not involved)
- implicit extension of leases by successful writes


### Assigning timestamps to RW transactions

- two-phase locking
- the timestamp can be any time after all locks are acquired and
  before any locks are released
- Spanner assigns the transaction the timestamp that Paxos gives to
  the transaction's commit write

> If start of T2 is after commit of T1, then the commit timestamp of
> T2 must be greater than that of T1.

**Commit wait**:
The coordinator leader ensures that clients cannot see any data committed
by Ti until TT.after(s\_i) is true

- expected wait is at least 2 * &epsilon;
- wait is usually overlapped with Paxos execution
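
A sketch of commit wait on top of a simulated TrueTime, assuming a fixed &epsilon; and a local clock. `TT.now()`, `TT.after()`, and `TT.before()` follow the API named in the Spanner paper; the `TrueTime` class, the `commit_wait` helper, and choosing s as `TT.now().latest` are assumptions of this sketch.

```python
import time
from dataclasses import dataclass


@dataclass
class TTInterval:
    earliest: float
    latest: float


class TrueTime:
    """Simulated TrueTime: the true time lies within epsilon of the local clock."""

    def __init__(self, epsilon, clock=time.monotonic):
        self.epsilon = epsilon
        self.clock = clock

    def now(self):
        t = self.clock()
        return TTInterval(t - self.epsilon, t + self.epsilon)

    def after(self, t):
        """True once t has definitely passed."""
        return self.now().earliest > t

    def before(self, t):
        """True while t has definitely not arrived."""
        return self.now().latest < t


def commit_wait(tt, sleep=time.sleep):
    """Pick a commit timestamp s, then block until TT.after(s) holds."""
    s = tt.now().latest
    while not tt.after(s):
        sleep(tt.epsilon / 10)  # polling step is arbitrary
    return s
```

Because s sits &epsilon; above the local clock and `after(s)` needs the clock to be &epsilon; past s, the loop runs for roughly 2 * &epsilon;, which matches the expected wait above.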



[The Talk](https://www.usenix.org/conference/osdi12/technical-sessions/presentation/corbett)
