# Replication, history, and grafting in the Ori file system

### Big Trends
"disk space has outgrown increase in wide-area bandwidth"
- 1990: 60 MB disk, 9600 baud: 14 hours
- now: 3 TB disk, 1 Mbps: 278 days! (to be fair, maybe 10 Mbps, but still a month)
- devices?
  (might have 16 GB free on phone: 3.5 hours at 10 Mbps uplink in my house, <40 minutes downlink)
  (1 TB disk: 9.25 days, or 1.7 days at downlink speed)
- "sneakernet" : "Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway."

All this means that blindly broadcasting like Dropbox will not scale.
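
The transfer times in the bullets above are plain size-over-bandwidth arithmetic. A minimal sketch, assuming decimal units (1 MB = 10^6 bytes), treating 9600 baud as 9600 bits per second, and ignoring protocol overhead; the helper name `transfer_seconds` is made up for illustration:

```python
MB, GB, TB = 10**6, 10**9, 10**12   # decimal units (assumption)
HOUR, DAY = 3600, 86400


def transfer_seconds(size_bytes, link_bits_per_sec):
    """Time to push size_bytes through a link, ignoring all overhead."""
    return size_bytes * 8 / link_bits_per_sec


cases = [
    ("60 MB at 9600 bps", 60 * MB, 9600, HOUR, "hours"),
    ("3 TB at 1 Mbps", 3 * TB, 1e6, DAY, "days"),
    ("3 TB at 10 Mbps", 3 * TB, 10e6, DAY, "days"),
    ("16 GB at 10 Mbps", 16 * GB, 10e6, HOUR, "hours"),
    ("1 TB at 10 Mbps", 1 * TB, 10e6, DAY, "days"),
]

for label, size, rate, unit, unit_name in cases:
    print(f"{label}: {transfer_seconds(size, rate) / unit:.1f} {unit_name}")
```

The results match the figures in the notes to within rounding. The "<40 minutes" and "1.7 days" numbers both correspond to a downlink of roughly 55 Mbps.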

### Big files dominate
- metadata *space* becomes less important


### Main concepts:
- save everything
- sync P2P w/ other replicas of the same FS
- version, backup for free
- grafting other filesystems
- "shallow" mounts

This is basically what we are building in this class.

![Ori Commands](oriCmds.png)

### Objects, and Packfiles

![Objects](oriObjects.png)
- commit objects: snapshot headers
- tree objects
  - both directory content names and inodes
- blobs, large blobs, file data
- packfile (git)
  - repository for all objects
  - log-structured
  - objects written in *clusters*, w/ headers at front
  - sub-file data-deduplication
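
The object model above can be sketched as a toy content-addressed store. This is only an illustration under stated assumptions, not Ori's on-disk format: objects are keyed by SHA-256, metadata is JSON, large files are split into fixed-size chunks (real systems typically pick chunk boundaries from the content instead), and every name here (`ObjectStore`, `put_file`, `put_tree`, `put_commit`, `CHUNK`) is hypothetical.

```python
import hashlib
import json

CHUNK = 4  # absurdly small chunk size so the demo shows deduplication


class ObjectStore:
    """Toy content-addressed store: the key is the SHA-256 of the bytes."""

    def __init__(self):
        self.objects = {}

    def put(self, data):
        oid = hashlib.sha256(data).hexdigest()
        self.objects.setdefault(oid, data)  # identical content is stored once
        return oid

    def get(self, oid):
        return self.objects[oid]


def put_file(store, data):
    """Store a file as chunk blobs plus a 'large blob' listing the chunk ids."""
    chunks = [store.put(data[i:i + CHUNK]) for i in range(0, len(data), CHUNK)]
    meta = {"type": "largeblob", "chunks": chunks}
    return store.put(json.dumps(meta, sort_keys=True).encode())


def read_file(store, file_id):
    meta = json.loads(store.get(file_id))
    return b"".join(store.get(c) for c in meta["chunks"])


def put_tree(store, entries):
    """entries maps a name to the object id of a file or subtree."""
    meta = {"type": "tree", "entries": entries}
    return store.put(json.dumps(meta, sort_keys=True).encode())


def put_commit(store, tree_id, parents, message):
    """A commit is a snapshot header pointing at a root tree and its parents."""
    meta = {"type": "commit", "tree": tree_id, "parents": parents,
            "message": message}
    return store.put(json.dumps(meta, sort_keys=True).encode())


store = ObjectStore()
v1 = put_file(store, b"aaaabbbbcccc")
v2 = put_file(store, b"aaaabbbbdddd")  # shares two chunks with v1
root = put_tree(store, {"old.txt": v1, "new.txt": v2})
head = put_commit(store, root, [], "first snapshot")
print(len(store.objects), "objects stored")
```

Because the two files share their first two chunks, only four chunk blobs are stored for six chunks of file data; that is the sub-file deduplication the packfile bullet refers to.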

### Takeaways:
- background fetches
- nearby fetches
- ori-sync
  - periodic broadcasts of root, lists of replicas with shared key
  - others hear, and initiate pulls
  - no consistency
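
The ori-sync bullets describe an announce-and-pull loop, which can be simulated in memory. A rough sketch only: the `Replica` class, its fields, the message layout, and the plain comparison of a shared key are all assumptions made for illustration, and no real networking or authentication is modeled.

```python
class Replica:
    """Toy replica: a bag of objects plus the id of its latest commit."""

    def __init__(self, name, cluster_key):
        self.name = name
        self.cluster_key = cluster_key
        self.head = None
        self.objects = {}        # object id -> bytes
        self.remote_heads = {}   # peer name -> last head pulled from that peer

    def commit(self, oid, data):
        self.objects[oid] = data
        self.head = oid

    def announce(self):
        """The periodic broadcast: who I am and what my root is."""
        return {"sender": self, "key": self.cluster_key, "head": self.head}

    def on_announce(self, msg):
        """Hear a broadcast; start a pull if it names a commit we lack."""
        sender = msg["sender"]
        if sender is self or msg["key"] != self.cluster_key:
            return False
        if msg["head"] is None or msg["head"] in self.objects:
            return False
        self.pull(sender)
        return True

    def pull(self, peer):
        # Copy missing objects and remember the peer's head. Nothing is
        # merged here, so replicas may keep different heads after a pull.
        for oid, data in peer.objects.items():
            self.objects.setdefault(oid, data)
        self.remote_heads[peer.name] = peer.head


def broadcast_round(replicas):
    """Every replica announces once; returns the number of pulls started."""
    pulls = 0
    for sender in replicas:
        msg = sender.announce()
        pulls += sum(r.on_announce(msg) for r in replicas)
    return pulls


laptop = Replica("laptop", "k1")
desktop = Replica("desktop", "k1")
stranger = Replica("stranger", "k2")   # different key, ignores the others
laptop.commit("c1", b"snapshot 1")
print(broadcast_round([laptop, desktop, stranger]))
```

Only the replica sharing the key pulls, and its own head is left untouched: the sketch moves data around but makes no attempt at agreement, which is the "no consistency" point.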

### Consistency and Replication
- full replication of namespace
- optionally just current data
- eventual consistency

### Nice thought
"versioning subsumes backup"  (maybe for undo, not for replication)

### Borrowings:
- venti: Content-Addressable-Store
- git: history
- 3-way merges
- packfiles: git
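
A three-way merge at the granularity of whole files can be sketched as below. This illustrates the general technique borrowed from version control, not Ori's actual merge code; trees are modeled as plain dicts from path to content hash, and the function name `three_way_merge` is hypothetical.

```python
def three_way_merge(base, ours, theirs):
    """Merge two trees against their common ancestor.

    Each tree maps path -> content hash; a missing path means the file
    does not exist in that tree. Returns (merged_tree, conflicting_paths).
    """
    merged, conflicts = {}, []
    for path in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(path), ours.get(path), theirs.get(path)
        if o == t:          # both sides agree (including both deleted)
            result = o
        elif o == b:        # only their side changed
            result = t
        elif t == b:        # only our side changed
            result = o
        else:               # both changed, differently: needs a human
            conflicts.append(path)
            continue
        if result is not None:
            merged[path] = result
    return merged, conflicts


base = {"a": "h1", "b": "h2", "c": "h3"}
ours = {"a": "h1-ours", "b": "h2", "c": "h3-ours"}
theirs = {"a": "h1", "c": "h3-theirs"}   # deleted b, edited c
print(three_way_merge(base, ours, theirs))
```

Here `a` takes our edit, `b` is deleted because only their side touched it, and `c` is reported as a conflict and left out of the merged tree for the user to resolve.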

### Design

![system](oriSys.png)

- commit objs: snapshots and context
- trees: directories+file inodes
- blobs: data
- eventual consistency
 
### Takeaways
- what if different users solve the conflicts in different ways?
- why weren't the authors able to implement individual file permissions for grafts?
- reference counting in GC is as old as the hills
- storing information on "unrelated but nearby repositories"
- many cases where a single modification to a file could entirely change the underlying
  data storage in the device, high network latencies
- forcing users to manually handle three-way merge conflicts
- who the intended audience is
- GC rarely done because of write amplification. Not really full versioning, only snapshots saved.

# WTF (Warp Transactional file system)

### Big ideas:
- zero-copy file-slicing
- transactions across dist FS

![System](wtfSys.png)

### Slices
- immutable
- references stored in key-value store (HyperDex)
- metadata also in HyperDex
- files are sequences of slice/offsets
- slice ptrs contain slice id, offset, length, id of server
  - coupled w/ offset into the file
- each prefix of meta-data list is a snapshot in time
![meta-data](wtfMeta.png)
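
The slice bullets can be made concrete with a small in-memory model. It is a sketch under stated assumptions rather than WTF's real data layout: a file's metadata is an append-only list of `(file_offset, SlicePtr)` entries, later entries overlay earlier ones, and the names `SlicePtr`, `SliceStore`, and `materialize` are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SlicePtr:
    server: str     # id of the storage server holding the slice
    slice_id: str
    offset: int     # offset within the slice
    length: int


class SliceStore:
    """Stand-in for the storage servers; slices are immutable once written."""

    def __init__(self):
        self.slices = {}

    def create(self, server, data):
        slice_id = f"s{len(self.slices)}"
        self.slices[(server, slice_id)] = bytes(data)
        return SlicePtr(server, slice_id, 0, len(data))

    def read(self, ptr):
        data = self.slices[(ptr.server, ptr.slice_id)]
        return data[ptr.offset:ptr.offset + ptr.length]


def materialize(store, metadata, upto=None):
    """Replay the first `upto` metadata entries; later ones overlay earlier."""
    entries = metadata if upto is None else metadata[:upto]
    buf = bytearray()
    for file_offset, ptr in entries:
        data = store.read(ptr)
        end = file_offset + len(data)
        if len(buf) < end:
            buf.extend(b"\0" * (end - len(buf)))
        buf[file_offset:end] = data
    return bytes(buf)


store = SliceStore()
meta = []
meta.append((0, store.create("srv1", b"hello world")))  # initial write
meta.append((6, store.create("srv2", b"WORLD")))        # overwrite at offset 6

print(materialize(store, meta))           # current contents
print(materialize(store, meta, upto=1))   # the prefix is an older snapshot

# Zero-copy slicing: a second file points into the first file's slice.
first = meta[0][1]
excerpt = [(0, SlicePtr(first.server, first.slice_id, 0, 5))]
print(materialize(store, excerpt))
```

The overwrite never touches the original slice; it only appends a metadata entry, which is why any prefix of the list still describes a valid earlier version and why a new file can reuse existing bytes without copying them.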

### Smaller things
- pathname to inode maps: directories bypassed (a la [gfs](http://triffid.cs.umd.edu/dss/blog.rb?tag=gfs))
  - dirs still exist for enumeration
- append calls
  - bypass seeks, finding end of files
- locality-aware slice placement (sequential)
- "collisions in the hash space do inevitably occur"  WTF? SHA256!
- compaction/defrag of both metadata and slices, if necessary
- GC periodically scans entire system
- 30k lines for WTF, 85k for HyperDex, 37k more for libraries
- FUSE!

### Ugliness
- file metadata can be partitioned into *regions* for long-lived, heavily modified files
- multi-region transactions

### Transactions
- across multiple files, multiple keys, distributed
- abort only when concurrent transactions generate app-visible changes in FS
- only across metadata (data is immutable), which is a log
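
One way to picture "abort only on app-visible changes" is an append that is retried under optimistic concurrency. The sketch below is generic optimistic concurrency control over a versioned key-value store, not HyperDex's API or WTF's actual protocol; `VersionedKV`, `put_if_version`, `append`, and the `interfere` test hook are all hypothetical.

```python
class VersionedKV:
    """Toy key-value store with a conditional put (compare-and-swap)."""

    def __init__(self):
        self.data = {}  # key -> (version, tuple of metadata entries)

    def get(self, key):
        return self.data.get(key, (0, ()))

    def put_if_version(self, key, expected_version, value):
        version, _ = self.get(key)
        if version != expected_version:
            return False            # someone else committed first
        self.data[key] = (version + 1, value)
        return True


def append(kv, path, slice_ref, length, max_retries=10, interfere=None):
    """Append a slice reference at the current end of file.

    The caller only asked for "the end", so a lost race is retried here
    and never surfaces as an abort. Returns (offset, attempts).
    """
    for attempt in range(1, max_retries + 1):
        version, entries = kv.get(path)
        end = max((off + n for off, n, _ in entries), default=0)
        if interfere is not None:
            interfere(attempt)      # test hook simulating a concurrent writer
        new_entries = entries + ((end, length, slice_ref),)
        if kv.put_if_version(path, version, new_entries):
            return end, attempt
    raise RuntimeError("too many conflicting writers")


kv = VersionedKV()
print(append(kv, "/log", "sliceA", 3))


def rival(attempt):
    # A competing append sneaks in during our first attempt only.
    if attempt == 1:
        append(kv, "/log", "sliceB", 4)


print(append(kv, "/log", "sliceC", 2, interfere=rival))
print(kv.get("/log"))
```

Only the metadata list is contended; the slice data was already written and is immutable, so a retry just recomputes the offset and tries the conditional put again.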

### Nits
- uses client library, no native interface  (like HDFS)


### Takeaways
- what is the intended usage?
- authors say that multiple concurrent writes will always succeed (sometimes with
  retrying), but they never discuss how they determine which write will succeed first
- larger metadata, impact performance?
- "call out the number of lines of code for each component"  
- doesn't seem like anyone is clamoring for HDFS with better consistency...
- "potential for data locality effects for co-located data processing applications
  (MR/Spark/etc)". Does this not get mitigated to a degree with the slice approach?
- single node for metadata?       (no 3-node cluster, at least for WTF)
  
