# GFS

#### Motivating context:
- failures are the norm
- files are huge
- most mutations are appends
- co-design w/ apps

![Architecture](gfs-arch.png)

#### Features:
- relaxed consistency
- atomic record append (without locking)
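- no file-data caches anywhere, on clients or chunkservers (though Linux's buffer cache does this underneath on chunkservers); suits the type of app (streaming)
    - so cache consistency is not much of an issue
    - clients do cache chunk locations, which can go out of date, but a stale replica returns just a prefix of the data, not wrong data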
#### By:
- single master
    - maintains all metadata in memory
    - big log that is checkpointed
    - shadow masters (for slightly stale read access)
    - chunk locations not persisted (master asks the chunkservers at startup)
- multiple chunkservers
    - chunks replicated
    - no caches (streaming!)
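
A minimal sketch of the master's recovery scheme listed above: metadata lives in memory, every mutation is appended to a log before it is applied, and a checkpoint means recovery only replays the log tail. The class name `MasterMetadata`, the JSON record format, and the in-memory list and string standing in for disk are assumptions for illustration, not GFS's actual formats. Chunk locations are deliberately absent, since the master does not persist them.

```python
import json


class MasterMetadata:
    """Toy in-memory namespace with an operation log and checkpoints."""

    def __init__(self):
        self.files = {}          # path -> list of chunk handles (in memory)
        self.log = []            # stands in for the on-disk operation log
        self.checkpoint = "{}"   # stands in for the latest on-disk checkpoint

    def _apply(self, op):
        if op["type"] == "create":
            self.files[op["path"]] = []
        elif op["type"] == "add_chunk":
            self.files[op["path"]].append(op["handle"])

    def mutate(self, op):
        # Log first, then apply, so an acknowledged change survives a crash.
        self.log.append(json.dumps(op))
        self._apply(op)

    def take_checkpoint(self):
        # Save the full state and drop the log entries it already covers.
        self.checkpoint = json.dumps(self.files)
        self.log = []

    @classmethod
    def recover(cls, checkpoint, log):
        # Load the checkpoint, then replay only the log written after it.
        m = cls()
        m.files = json.loads(checkpoint)
        m.checkpoint = checkpoint
        for line in log:
            m._apply(json.loads(line))
        m.log = list(log)
        return m
```

Recovery cost is bounded by the log written since the last checkpoint, which is why the log can stay "big" without slowing restarts.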

#### Read op:
- client asks master, which responds w/ the chunkservers holding the chunk
- client caches chunk locations, so no need to talk to master again for that chunk
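
The read path can be sketched roughly as follows. The 64 MB fixed chunk size is from the GFS paper; `ToyMaster`, `ToyClient`, and the lookup-table layout are hypothetical names made up for this sketch, and the actual data transfer from a chunkserver is left out.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # GFS uses fixed-size 64 MB chunks


class ToyMaster:
    """Hypothetical master: maps (path, chunk index) to (handle, replicas)."""

    def __init__(self, table):
        self.table = table
        self.lookups = 0  # counts round trips, to show the effect of caching

    def lookup(self, path, chunk_index):
        self.lookups += 1
        return self.table[(path, chunk_index)]


class ToyClient:
    def __init__(self, master):
        self.master = master
        self.cache = {}  # (path, chunk index) -> (handle, replica list)

    def locate(self, path, offset):
        # Translate the byte offset into a chunk index and an offset inside it.
        chunk_index, chunk_offset = divmod(offset, CHUNK_SIZE)
        key = (path, chunk_index)
        if key not in self.cache:
            # Only a cache miss costs a round trip to the master.
            self.cache[key] = self.master.lookup(path, chunk_index)
        handle, replicas = self.cache[key]
        return handle, replicas, chunk_offset
```

Repeated reads within the same chunk never touch the master, which is what keeps a single master from becoming the bottleneck.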

#### Write op:
- master grants a per-chunk lease to one chunkserver
- client gets chunk replica list from master
    - sends data to all replicas (stored in buffer caches until used)
    - waits for acks
- client sends write request to primary, which
    - serializes
    - sends write records to secondaries
    - gathers acks
![Write](gfs-data.png)
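
A toy version of the two-phase write flow above: data is pushed to every replica first, then a small write request goes to the primary, which fixes the order and forwards it. All class and function names here (`Replica`, `Primary`, `client_write`) are hypothetical, the lease grant is assumed to have already happened, and network failures are not modelled.

```python
class Replica:
    """Toy chunkserver replica holding one chunk."""

    def __init__(self):
        self.buffer = {}          # data id -> bytes pushed but not yet applied
        self.chunk = bytearray()

    def push(self, data_id, data):
        self.buffer[data_id] = data
        return True  # ack for the data push

    def apply(self, data_id, offset):
        data = self.buffer.pop(data_id)
        end = offset + len(data)
        if len(self.chunk) < end:
            self.chunk.extend(b"\0" * (end - len(self.chunk)))
        self.chunk[offset:end] = data
        return True  # ack for the applied write


class Primary(Replica):
    """The replica holding the lease; it picks the mutation order."""

    def __init__(self, secondaries):
        super().__init__()
        self.secondaries = secondaries
        self.serial = 0

    def write(self, data_id, offset):
        # Assign the next serial number; here the order is simply call order.
        self.serial += 1
        self.apply(data_id, offset)
        # Forward the same write, in the same order, and gather acks.
        acks = [s.apply(data_id, offset) for s in self.secondaries]
        return all(acks)


def client_write(primary, data_id, data, offset):
    replicas = [primary] + primary.secondaries
    # Step 1: push the data to every replica and wait for all acks.
    if not all(r.push(data_id, data) for r in replicas):
        return False
    # Step 2: send the (small) write request to the primary only.
    return primary.write(data_id, offset)
```

Separating the bulk data push from the control message lets the data flow along whatever network path is cheapest while ordering stays with one replica.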


#### Guarantees:
- namespace mutations atomic
- client caches may return stale data
    - limited to timeout window
    - *mostly appends anyway*
- file region "consistent" if same on all replicas
- file region "defined" if consistent and writes therein seen in entirety
    - writes broken into multiple operations if too big, or if they cross chunk boundaries
    - write might fail on some replicas, be re-tried
- writes sent to replicas in same order, so **consistency common**
- *Applications responsible for checking all data*
    - checksums to find undefined regions
    - unique identifiers to detect duplicates
    - library for the above
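
One way an application-side library could do this checking is sketched below. The record layout (length, record id, CRC32, payload) and the names `frame` and `read_records` are assumptions for illustration; the notes do not specify the real format.

```python
import struct
import zlib

# Hypothetical header: payload length, record id, CRC32 of the payload.
HEADER = struct.Struct(">IQI")


def frame(record_id, payload):
    """Writer side: prefix the payload with a self-validating header."""
    return HEADER.pack(len(payload), record_id, zlib.crc32(payload)) + payload


def read_records(data):
    """Yield valid, first-seen payloads; skip padding, fragments, duplicates."""
    seen = set()
    pos = 0
    while pos + HEADER.size <= len(data):
        length, record_id, crc = HEADER.unpack_from(data, pos)
        start = pos + HEADER.size
        payload = data[start:start + length]
        # length > 0 keeps an all-zero padding header from looking valid.
        if length > 0 and len(payload) == length and zlib.crc32(payload) == crc:
            if record_id not in seen:
                seen.add(record_id)
                yield payload
            pos = start + length
        else:
            pos += 1  # not a valid record here: resynchronize byte by byte
```

For example, a region holding a record, some zero padding, a retried duplicate, and a partial fragment followed by a complete record reads back as just the two distinct payloads:

```python
a, b = frame(1, b"alpha"), frame(2, b"beta")
region = a + b"\0" * 7 + a + b[:10] + b
print(list(read_records(region)))  # [b'alpha', b'beta']
```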
	
#### Inconsistencies:
- concurrent writes or failed writes lead to *undefined regions*
- apps need to tell the difference between defined (each mutation in entirety) and undefined regions
- no need to further distinguish kinds of undefined region by consistency (consistent = a region is the same on all replicas)
- apps check for partial fragments, duplication (at-least-once)
- everything in protocol buffers, integrity built-in

### Snapshot
- revoke outstanding leases
- copy meta-data (including chunk names)
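
A rough sketch of the two steps above. The copy-on-write step in `grant_lease` follows the GFS paper's description of snapshots rather than these notes, and the class name `SnapshotMaster`, integer chunk handles, and reference-count table are assumptions; copying the chunk bytes on the chunkservers is omitted.

```python
class SnapshotMaster:
    """Toy master state showing a metadata-only snapshot."""

    def __init__(self):
        self.files = {}       # path -> list of chunk handles
        self.refcount = {}    # chunk handle -> number of files pointing at it
        self.leases = set()   # chunk handles with an outstanding lease
        self.next_handle = 0

    def _new_handle(self):
        h = self.next_handle
        self.next_handle += 1
        self.refcount[h] = 1
        return h

    def create(self, path, n_chunks):
        self.files[path] = [self._new_handle() for _ in range(n_chunks)]

    def snapshot(self, src, dst):
        # Revoke leases so the next write has to come back to the master.
        self.leases -= set(self.files[src])
        # Copy only metadata: the new file shares the same chunk handles.
        self.files[dst] = list(self.files[src])
        for h in self.files[dst]:
            self.refcount[h] += 1

    def grant_lease(self, path, index):
        h = self.files[path][index]
        if self.refcount[h] > 1:
            # Shared chunk: give this file a private handle before writing.
            self.refcount[h] -= 1
            h = self._new_handle()
            self.files[path][index] = h
        self.leases.add(h)
        return h
```

Revoking leases first is what makes the scheme safe: no primary can keep accepting writes to a chunk that a snapshot now shares.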

#### Locking
- no per-dir state
- lookup table mapping full pathnames to metadata
- to modify a leaf node (metainfo for a file)
    - start at root, read-locking all the way down
    - exclusive lock on the file
![Locking](gfs-locks.png)
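
The lock set for a path can be computed straight from the full pathname, as in this sketch. The function names `locks_for` and `conflicts` are made up here, the root itself is left out for brevity, and real lock acquisition (ordering, blocking) is not modelled.

```python
def locks_for(path):
    """Return (read-locked ancestor names, write-locked leaf name)."""
    parts = path.strip("/").split("/")
    prefixes = ["/" + "/".join(parts[:i]) for i in range(1, len(parts) + 1)]
    return prefixes[:-1], prefixes[-1]


def conflicts(path_a, path_b):
    """Two mutations conflict if one write-locks a name the other locks."""
    reads_a, write_a = locks_for(path_a)
    reads_b, write_b = locks_for(path_b)
    return write_a == write_b or write_a in reads_b or write_b in reads_a
```

Because there is no per-directory state, two files in the same directory can be modified concurrently: `conflicts("/home/user/a", "/home/user/b")` is `False`, since both take only read locks on `/home/user`, while `conflicts("/home/user", "/home/user/a")` is `True`.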

#### Errata
- Multi-master is "Colossus" (CFS).
- Quinlan: "in retrospect I think the consensus is that it proved to be more painful than it was worth"

### Comments
- "global total ordering on all operations, across different clients, i.e. 'logical time' vs. 'real time'"
- fixed-sized chunks
- lfs vs gfs
- *Borg*


### What went wrong?  (column)
- apps scaled from hundreds of terabytes (no worries....) to tens of petabytes
- `map-reduce`
- file counts:
  - each file had to have an identifier, and the set of backing chunks
  - apps hit file count quotas before storage quotas
- YouTube
- BigTable (loosely consistent key-value store) built on top of logs, on top of GFS ("sort of like a log-structured file system")
- majority of GFS data ends up being serialized in protocol buffers 
- "Generally, our approach is just to get things working reasonably well and then turn our focus to scalability"
