# Elephant

### Goals:
- undo vs long-term history

### Files:
- read-only
- derived
- cached
- temporary
- user-modified  <-- need versioning

### Policies:
- keep one  (browser cache, core, /tmp)
- keep all  
- keep safe   (just undo)
- keep landmarks <--
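
A minimal sketch of how the four policies might decide which versions of a file survive. The `Version` type, the `landmark` flag, and the `undo_window` parameter are hypothetical names chosen for illustration; how landmarks actually get picked is not covered in these notes.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Version:
    timestamp: float        # seconds since some epoch
    landmark: bool = False  # hypothetical flag: this version was marked as a landmark

def retained(versions, policy, now, undo_window=3600.0):
    """Return the versions a policy would keep, oldest first."""
    versions = sorted(versions, key=lambda v: v.timestamp)
    if not versions:
        return []
    current = versions[-1]
    if policy == "keep_one":        # only the current version
        return [current]
    if policy == "keep_all":        # full history
        return versions
    recent = [v for v in versions
              if v is current or now - v.timestamp <= undo_window]
    if policy == "keep_safe":       # just enough for undo
        return recent
    if policy == "keep_landmarks":  # undo window plus older landmarks (assumed)
        return [v for v in versions if v in recent or v.landmark]
    raise ValueError(f"unknown policy: {policy}")
```

Under this reading, keep landmarks behaves like keep safe for recent history and thins out everything older down to the landmark versions.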

### Metadata:
- imap
- ptr into
    - inode log (versioned per-file) or 
    - multiple non-versioned file inodes
- temperature
- policies
![imap](imap2.png)

### Directories
- ordinary file
    - versioning inside file, not outside
- basically a log of changes:
    - create
    - mutate
    - delete
- history partition
    - move old deletes into a second inode so as to not slow current operations	

![dirs](dirs.png)
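
A rough sketch of the "directory as a log" idea: entries are never removed, only stamped with a deletion time, so a listing can be computed for any past instant. The class and field names are assumptions for illustration, not Elephant's on-disk layout, and the history partition (moving old deletes to a second inode) is left out.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class DirEntry:
    name: str
    inode: int
    created: float
    deleted: Optional[float] = None  # None while the name is still live

@dataclass
class VersionedDir:
    entries: List[DirEntry] = field(default_factory=list)  # append-only

    def create(self, name, inode, now):
        self.entries.append(DirEntry(name, inode, created=now))

    def delete(self, name, now):
        # Mark the live entry as deleted instead of removing it.
        for entry in reversed(self.entries):
            if entry.name == name and entry.deleted is None:
                entry.deleted = now
                return
        raise FileNotFoundError(name)

    def listing(self, at):
        """Names visible in the directory at time `at`."""
        return sorted(e.name for e in self.entries
                      if e.created <= at and (e.deleted is None or at < e.deleted))
```

For example, after `create("a", 1, now=10)` and `delete("a", now=30)`, `listing(at=25)` still shows `a` while `listing(at=35)` does not.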

### Useful:

    cd foo/@12-nov-1999:11:30
    tls      'ls @v'
    tgrep
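
To make the `@` syntax concrete, here is a small sketch that splits a time-tagged path such as `foo/@12-nov-1999:11:30` into the plain path and the requested instant. The function name `split_time_tag` is hypothetical, and the date format is inferred from the single example above; month-name parsing with `%b` assumes an English (or C) locale.

```python
from datetime import datetime

def split_time_tag(path):
    """Split 'dir/@DD-mon-YYYY:HH:MM' into (dir, datetime); untagged paths give (path, None)."""
    head, sep, tag = path.rpartition("@")
    if not sep:
        return path, None
    when = datetime.strptime(tag, "%d-%b-%Y:%H:%M")
    return head.rstrip("/"), when
```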


### Application policies:
- user-level process called when the cleaner comes across high-temp file

### Downsides:
- less locality in inodes, data blocks
- pressure on buffer cache



# GFS

#### Motivating context:
- failures are norm
- files are huge
- most mutations by appends
- co-design w/ apps

![Architecture](gfs-arch.png)

#### Features:
- relaxed consistency
- atomic record append (without locking)
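- no caching of file data, on clients or chunkservers (Linux's buffer cache still works underneath); fits the streaming type of app
    - so cache consistency is not much of an issue
    - clients do cache chunkserver locations, which can go stale; a stale replica usually returns just a prefix of the chunk, not wrong data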
#### By:
- single master
    - maintains all metadata in memory
    - big log that is checkpointed
    - shadow masters (for slightly stale read access)
- multiple chunkservers
    - chunks replicated
    - no caches (streaming!)
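
The in-memory metadata plus checkpointed log can be sketched roughly as follows. This is a toy model under stated assumptions: `checkpoint` and `log` are plain Python objects standing in for durable storage, and the three operation kinds are invented for the example.

```python
class Master:
    def __init__(self):
        self.metadata = {}    # in memory: path -> list of chunk handles
        self.checkpoint = {}  # last snapshot of metadata (stands in for disk)
        self.log = []         # operations since the checkpoint (stands in for disk)

    def _apply(self, op):
        kind, path, *rest = op
        if kind == "create":
            self.metadata[path] = []
        elif kind == "add_chunk":
            self.metadata[path].append(rest[0])
        elif kind == "delete":
            self.metadata.pop(path, None)
        else:
            raise ValueError(f"unknown op: {kind}")

    def mutate(self, op):
        self.log.append(op)   # record the operation first, then apply it in memory
        self._apply(op)

    def take_checkpoint(self):
        self.checkpoint = {p: list(c) for p, c in self.metadata.items()}
        self.log.clear()      # older log entries are no longer needed

    def recover(self):
        # Rebuild memory state: load the checkpoint, then replay the log tail.
        self.metadata = {p: list(c) for p, c in self.checkpoint.items()}
        for op in self.log:
            self._apply(op)
```

Checkpointing keeps recovery time bounded, since only the operations logged after the last checkpoint need replaying.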

#### Guarantees:
- namespace mutations atomic
- client caches may return stale data
    - limited to timeout window
    - mostly appends anyway 
- file region "consistent" if same on all replicas
- file region "defined" if consistent and writes therein seen in entirety
    - a write is broken into multiple write operations if too big, or if it crosses a chunk boundary
    - write might fail on some replicas, be re-tried
- writes sent to replicas in same order, so **consistency common**
- *Applications responsible for checking all data*
    - checksums to find undefined regions
    - unique identifiers for duplication
    - library for the above
	
#### Inconsistencies:
- concurrent writes or failed writes lead to *undefined regions*
- need to tell the difference between defined (each mutation in entirety) and undefined
- no need for consistency (a region is same on all machines)
- apps check for partial fragments, duplication (at-least-once)
- everything in protocol buffers, integrity built-in
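
A sketch of what the application-side checking could look like: each record carries a checksum (to skip padding and partial fragments) and a unique id (to drop duplicates from retried appends). The record layout, `MAGIC` value, and function names are made up for this example; they are not the GFS library's format or the protocol buffer encoding.

```python
import struct
import zlib

MAGIC = 0x47465352                # arbitrary marker, assumed for this sketch
HEADER = struct.Struct("<IQII")   # magic, record id, payload length, CRC32 of payload

def encode(record_id, payload):
    return HEADER.pack(MAGIC, record_id, len(payload), zlib.crc32(payload)) + payload

def read_records(buf):
    """Yield (record_id, payload) for each valid record, skipping junk and duplicates."""
    seen = set()
    pos = 0
    while pos + HEADER.size <= len(buf):
        magic, record_id, length, crc = HEADER.unpack_from(buf, pos)
        start = pos + HEADER.size
        payload = buf[start:start + length]
        valid = (magic == MAGIC and len(payload) == length
                 and zlib.crc32(payload) == crc)
        if not valid:
            pos += 1              # padding or fragment: resynchronize byte by byte
            continue
        if record_id not in seen:  # at-least-once append: drop repeats
            seen.add(record_id)
            yield record_id, payload
        pos = start + length
```

A reader built this way never needs to know which regions were undefined; anything that fails the checksum is simply skipped.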

#### Read op:
- ask master, which responds w/ chunkservers
- client caches the chunkserver locations, so further reads of that chunk need not talk to the master
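
The read path can be sketched as below, with in-process objects standing in for the master and chunkservers. `FakeMaster`, `Client`, and the dictionary layouts are hypothetical; the 64 MB chunk size is the one GFS used. Reads that span a chunk boundary and cache expiry are left out.

```python
CHUNK_SIZE = 64 * 2**20  # 64 MB chunks

class FakeMaster:
    def __init__(self, table):
        self.table = table    # (path, chunk index) -> (chunk handle, [chunkserver names])
        self.lookups = 0

    def locate(self, path, chunk_index):
        self.lookups += 1
        return self.table[(path, chunk_index)]

class Client:
    def __init__(self, master, chunkservers):
        self.master = master
        self.chunkservers = chunkservers  # name -> {chunk handle: bytes}
        self.locations = {}               # cached answers from the master

    def read(self, path, offset, length):
        index = offset // CHUNK_SIZE      # file offset -> chunk index
        key = (path, index)
        if key not in self.locations:     # only a cache miss goes to the master
            self.locations[key] = self.master.locate(path, index)
        handle, replicas = self.locations[key]
        start = offset % CHUNK_SIZE
        data = self.chunkservers[replicas[0]][handle]  # any replica would do
        return data[start:start + length]
```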

#### Write op:
- master grants a per-chunk lease to one chunkserver
- client gets chunk replica list from master, 
    - sends data everywhere,  (stored in buffer caches until used)
    - waits for acks
- sends write request to primary, which
    - serializes, 
    - sends write records to secondaries, 
    - gathers acks
![Write](gfs-data.png)
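
The two phases of the write (push data everywhere, then let the primary order it) can be sketched like this. Everything runs in one process; the class names and the `data_id` handle are assumptions for illustration, and leases, failures, and retries are omitted.

```python
class Replica:
    def __init__(self):
        self.staged = {}          # data_id -> bytes pushed but not yet applied
        self.chunk = bytearray()

    def push(self, data_id, data):
        self.staged[data_id] = data
        return True               # ack to the client

    def apply(self, serial, data_id, offset):
        # `serial` is the order assigned by the primary; here calls arrive in order.
        data = self.staged.pop(data_id)
        end = offset + len(data)
        if len(self.chunk) < end:
            self.chunk.extend(b"\x00" * (end - len(self.chunk)))
        self.chunk[offset:end] = data
        return True               # ack to the primary

class Primary(Replica):
    def __init__(self, secondaries):
        super().__init__()
        self.secondaries = secondaries
        self.next_serial = 0

    def write(self, data_id, offset):
        serial = self.next_serial  # serialize: one order for all replicas
        self.next_serial += 1
        self.apply(serial, data_id, offset)
        acks = [s.apply(serial, data_id, offset) for s in self.secondaries]
        return all(acks)

def client_write(primary, data_id, data, offset):
    replicas = [primary, *primary.secondaries]
    if not all(r.push(data_id, data) for r in replicas):  # phase 1: data flow
        return False
    return primary.write(data_id, offset)                 # phase 2: control flow
```

Separating the data push from the write request lets the bulk transfer happen in any order while the primary alone decides the mutation order.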

#### Snapshot
- revoke outstanding leases
- copy meta-data (including chunk names)

#### Locking
- no per-dir state
- lookup table mapping full pathnames to metadata
- to modify a leaf node (metainfo for a file)
    - start at root, read-locking all the way down
    - exclusive lock on the file
![Architecture](gfs-locks.png)
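
A sketch of the locking rule: read locks on every ancestor pathname, an exclusive lock on the leaf. `lock_plan` and `conflicts` are hypothetical helpers that only compute which locks would be taken and whether two operations collide; they do not implement real reader-writer locks.

```python
def lock_plan(path):
    """Locks to acquire, in order, before mutating the metadata at `path`."""
    parts = [p for p in path.split("/") if p]
    ancestors = ["/"] + ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]
    leaf = "/" + "/".join(parts)
    return [(a, "read") for a in ancestors] + [(leaf, "write")]

def conflicts(plan_a, plan_b):
    """Two plans conflict if they share a pathname and at least one side wants it exclusively."""
    held = dict(plan_a)
    return any(name in held and "write" in (mode, held[name])
               for name, mode in plan_b)
```

Because there is no per-directory state, two creations in the same directory take only read locks on the directory name and can proceed concurrently, whereas an operation that write-locks the directory's own pathname blocks both.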

#### Errata
- New multi-master is "Colossus".
- Quinlan: "in retrospect I think the consensus is that it proved to be more painful than it was worth"
