
# The Case for RAMCloud

Claims:
- 100-1000x lower latency than disk-based systems
- 100-1000x greater throughput than disk-based systems
- 5-10 usec remote reads
- 1,000X gap between disk and RAM means a 1% cache miss rate causes an overall slowdown of ~10X.
- replication needed only for durability, not performance.
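
The 10X figure follows from a weighted average of hit and miss costs. A minimal sketch of the arithmetic, assuming a RAM access costs 1 unit and a disk access costs 1,000 units (the function and parameter names are mine, not from the paper):

```python
def average_slowdown(miss_rate: float, disk_to_ram_ratio: float) -> float:
    """Average access cost relative to an all-RAM access (RAM cost = 1)."""
    hit_cost = 1.0
    miss_cost = disk_to_ram_ratio
    return (1.0 - miss_rate) * hit_cost + miss_rate * miss_cost


# 99% of accesses cost 1 unit, 1% cost 1,000 units.
slowdown = average_slowdown(miss_rate=0.01, disk_to_ram_ratio=1000.0)
print(f"{slowdown:.2f}x")  # 10.99x
```

So even a cache that hits 99% of the time runs roughly an order of magnitude slower than keeping everything in RAM.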

![ram1](ram1.png)

Flash vs RAM boundary:
- flash side limited by cost/query/sec
- RAM side limited by cost/bit

For all three technologies, **cost/bit improving faster than cost/query**.

![ram2](ram2.png)


Flash latencies:
- 20-50 usecs for read
- >= 200 usecs for writes
- 5-10X slower than RAM for queries, probably also for throughput

**All data kept in memory *all* the time.**

## Buffered Logging
- single copy stored in RAM
- object mods copied to multiple remote DRAMs before returning
- async flush to disk
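
A toy sketch of the write path described above, assuming in-process objects stand in for remote machines. `Master`, `Backup`, and their methods are hypothetical names for illustration, not RAMCloud's API:

```python
from collections import deque


class Backup:
    """Hypothetical backup: buffers log entries in DRAM, flushes to disk later."""

    def __init__(self):
        self.buffer = deque()  # entries held in the backup's DRAM
        self.disk = []         # stand-in for the on-disk log

    def append(self, entry):
        # Returns as soon as the entry is in DRAM; no disk I/O here.
        self.buffer.append(entry)

    def flush(self):
        # Asynchronous in a real system; called explicitly in this sketch.
        while self.buffer:
            self.disk.append(self.buffer.popleft())


class Master:
    """Hypothetical master: holds the single in-RAM copy of each object."""

    def __init__(self, backups):
        self.table = {}
        self.backups = backups

    def write(self, key, value):
        self.table[key] = value
        # Copy the modification to every backup's DRAM before returning.
        for backup in self.backups:
            backup.append((key, value))
        return "ok"  # acknowledged before any disk write


backups = [Backup() for _ in range(3)]
master = Master(backups)
master.write("k", "v1")
```

The point of the design is that disk latency never appears on the write path: the write returns once the entry sits in several DRAMs, and `flush` runs later.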

![ram3](ram3.png)


----
----

# Fast Crash Recovery in RAMCloud

### Replication
Single copy, w/ backups on disk
- how to minimize impact on regular performance
- how to enable fast recovery

![ram4](ram4.png)

## Problems
- RAM expensive, don't want to keep three copies
- disk slow, don't want in access fast path.

### Claim
- the system reconstructs the entire contents of a lost server’s memory (64 GB or more) from disk and resumes full service in 1-2 seconds.
  - backup data scattered across all machines
  - thousands of disks participate in recovery
  - hundreds of *recovery masters* work together
  - log-structured approach used both on disk *and in RAM*
  - randomized protocols
  - **tablet profiling** tracks dist of data in tables, helps w/ partitioning
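
Back-of-envelope arithmetic for why recovery has to be spread over many disks and many recovery masters. The bandwidth figures are assumptions for illustration (about 100 MB/s per disk, about 1 GB/s usable on a 10 Gbps NIC):

```python
GB = 10**9
MB = 10**6


def transfer_seconds(total_bytes, per_device_bytes_per_sec, devices):
    """Time to move total_bytes when the load is spread evenly over devices."""
    return total_bytes / (per_device_bytes_per_sec * devices)


data = 64 * GB
disk_bw = 100 * MB  # assumed sequential read bandwidth of one disk
nic_bw = 1 * GB     # assumed usable bandwidth of one 10 Gbps NIC

one_disk = transfer_seconds(data, disk_bw, 1)         # 640 s
many_disks = transfer_seconds(data, disk_bw, 1000)    # 0.64 s
one_master = transfer_seconds(data, nic_bw, 1)        # 64 s
many_masters = transfer_seconds(data, nic_bw, 100)    # 0.64 s
```

Under these assumptions a single disk or a single recovering machine is one to three orders of magnitude too slow, which is why both the read side and the replay side are parallelized.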

![ramWrite](ramWrite.png)
- backup machines have power-backed-up RAM
- overall throughput limited to that of backup storage




![ramMasters](ramMasters.png)
- maybe 100 recovery masters

### Complications
- segment masters and backups should be on different racks
- each backup should be given an amount of data s.t. all backups take the same time to read during recovery
- masters are writing simultaneously, should coordinate
- system membership constantly changing

So:
- each RAMCloud master decides independently where to place each replica,
  - randomization
  - refinement
- each segment has a primary
  - only primary read during recovery
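
A rough sketch of "randomization + refinement" as I read it: sample a few backups at random, then keep the one that would finish reading its primary replicas soonest, skipping the master's own rack. The dictionary layout, the `candidates` count, and the 8 MB segment size are assumptions for illustration:

```python
import random


def choose_backup(backups, master_rack, candidates=5, segment_mb=8.0, rng=random):
    """Pick a backup for one new primary replica.

    backups: dict name -> {"rack": str, "disk_mb_per_s": float, "primary_mb": float}
    """
    # Never place a replica in the same rack as the master.
    eligible = [name for name, b in backups.items() if b["rack"] != master_rack]
    # Randomization: look at only a few randomly chosen candidates.
    sample = rng.sample(eligible, min(candidates, len(eligible)))

    def read_time_after_placement(name):
        b = backups[name]
        return (b["primary_mb"] + segment_mb) / b["disk_mb_per_s"]

    # Refinement: among the candidates, take the shortest expected read time.
    return min(sample, key=read_time_after_placement)


backups = {
    "b1": {"rack": "r1", "disk_mb_per_s": 100.0, "primary_mb": 80.0},
    "b2": {"rack": "r2", "disk_mb_per_s": 100.0, "primary_mb": 160.0},
    "b3": {"rack": "r3", "disk_mb_per_s": 200.0, "primary_mb": 160.0},
    "b4": {"rack": "r2", "disk_mb_per_s": 50.0, "primary_mb": 80.0},
}
chosen = choose_backup(backups, master_rack="r1", candidates=3)
```

With all three eligible backups sampled, `chosen` is `"b3"`: it holds the most data but has the fastest disk, so its read time stays lowest. No central coordinator is involved, which matters because membership keeps changing.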

## Recovery flow

Setup:
- Finding segment replicas
  - no centralized map
  - broadcast to all backups
- Detecting incomplete logs
  - each log self-describing
  - each segment includes a log digest, a list of all segments in the log
- Starting partition recoveries
  - master pre-makes a *will*, i.e. even partition of assets
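
A small sketch of how a log digest lets recovery detect an incomplete log: take the digest from the newest segment that was found and check that every segment it lists has a replica. The data layout, and the assumption that segment ids increase over time, are mine:

```python
def missing_segments(replicas):
    """Return the segment ids listed in the newest digest but not found.

    replicas: dict mapping segment id -> log digest (list of segment ids)
    for every segment that some backup reported.
    """
    if not replicas:
        raise ValueError("no replicas found")
    head = max(replicas)  # assumption: the highest id is the newest segment
    digest = set(replicas[head])
    return sorted(digest - set(replicas))


found = {1: [1], 2: [1, 2], 4: [1, 2, 3, 4]}
print(missing_segments(found))  # [3]
```

An empty result means the log is complete and recovery can proceed; otherwise the listed segments still have to be located.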

![ram5](ram5.png)
Replay, for each segment:
1. read from disk into backup memory
1. backup divides records according to partition
1. records for each partition sent to *recovery master* for that partition
1. recovery master incorporates into local log and hash table
1. recovery master replicates recovered segments to backups as normal ops
1. backups write new segments to disk
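
Steps 2-4 above can be sketched as follows. The will is modeled as a list of integer key ranges, and records carry a version number so that segments can be replayed in any order; both are simplifying assumptions, and all names are hypothetical:

```python
from collections import defaultdict


def divide_segment(records, will):
    """Backup side: bucket a segment's records by recovery master.

    records: list of (key, version, value)
    will: list of (first_key, last_key, recovery_master), inclusive ranges
    """
    buckets = defaultdict(list)
    for record in records:
        key = record[0]
        for first_key, last_key, recovery_master in will:
            if first_key <= key <= last_key:
                buckets[recovery_master].append(record)
                break
    return buckets


def replay(hash_table, records):
    """Recovery master side: keep only the newest version of each key."""
    for key, version, value in records:
        current = hash_table.get(key)
        if current is None or version > current[0]:
            hash_table[key] = (version, value)


will = [(0, 49, "rm0"), (50, 99, "rm1")]
older_segment = [(10, 1, "a"), (60, 1, "b")]
newer_segment = [(10, 2, "c")]

tables = {"rm0": {}, "rm1": {}}
# Replay the newer segment first to show that order does not matter.
for segment in (newer_segment, older_segment):
    for recovery_master, records in divide_segment(segment, will).items():
        replay(tables[recovery_master], records)
```

After both segments are replayed, `tables["rm0"][10]` holds version 2 even though the older record arrived last.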

Concurrency:
- All stages *pipelined* (happen in parallel)
  - stalls avoided by having backups forward retrieval schedules to recovery masters
- Data parallelism: many recovery masters.

**Recovery master now responsible for assigned partition.**

## Performance

![ram6](ram6.png)

![ram7](ram7.png)

**End-to-end recovery times of 1-2 seconds for 35 GB.**


# Comments
- (katura) how is locality handled w/ hashed keys?
- (katura) what happens when a master reboots and rejoins?
- (rebecca) avail from fast recovery rather than weakened consistency
- "if crashes happen infrequently"
- need to keep that map (which would probably need anyway)
- cleaning


- tablet profiles
- is it used?
