# Ceph

Enormous HPC file system:
- tens or hundreds of thousands of OSDs
- more than 250k metadata operations/sec
- scientific workloads, think GlusterFS, HadoopFS, or GFS

Terms:
- OSD: intelligent object storage device

Motivation:
- metadata operations make up as much as half of FS workloads
- metadata operations don't scale
- petabyte-scale systems are inherently dynamic

![ceph system](cephSys.png)

What they did:
- huge engineering effort
- CRUSH: data placement computed by a distribution function rather than looked up in stored allocation state
- replication, failure, recovery handled by OSDs
- dynamic subtree partitioning


Consistency:
- generally strict, but
- dir entries + inodes returned together (e.g. `readdir` followed by `stat`), inodes cached briefly
- O_LAZY allows read and write buffering w/ multiple clients


Data layout w/ *CRUSH*:
- object names are just the inumber plus a *stripe* number; a file is striped across a sequence of objects
- objects assigned to *placement group* w/ hash
- placement groups mapped to OSDs w/ CRUSH and *OSD cluster map*

**Any party can find any object in distributed fashion**
- and cluster map updated rarely
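
A minimal sketch of the lookup chain above (object name, then placement group, then OSDs). This is not the real CRUSH algorithm: the PG-to-OSD step uses rendezvous (highest-random-weight) hashing as a simple stand-in, and the name format, the `cluster_map` layout, and all function names are assumptions made for illustration.

```python
import hashlib

def _h(*parts) -> int:
    """Stable 64-bit hash of the given parts (hypothetical helper)."""
    data = "/".join(str(p) for p in parts).encode()
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

def object_name(ino: int, stripe_no: int) -> str:
    # Assumed format: inode number and stripe number, both in hex.
    return f"{ino:x}.{stripe_no:08x}"

def placement_group(oid: str, pg_count: int) -> int:
    # Objects are assigned to a placement group by hashing the name.
    return _h(oid) % pg_count

def pg_to_osds(pgid: int, cluster_map: dict, replicas: int = 3) -> list:
    # Stand-in for CRUSH: rank the OSDs that are up by a per-(PG, OSD)
    # hash and take the top `replicas`. Deterministic given the map.
    up = [osd for osd, is_up in cluster_map["osds"].items() if is_up]
    ranked = sorted(up, key=lambda osd: _h(pgid, osd), reverse=True)
    return ranked[:replicas]

def locate(ino: int, stripe_no: int, cluster_map: dict, replicas: int = 3):
    """Any party holding the cluster map can compute this locally."""
    oid = object_name(ino, stripe_no)
    pgid = placement_group(oid, cluster_map["pg_count"])
    return oid, pgid, pg_to_osds(pgid, cluster_map, replicas)

if __name__ == "__main__":
    cmap = {"pg_count": 64, "osds": {f"osd.{i}": True for i in range(8)}}
    print(locate(0x1234, 2, cmap))
```

The point of the design: no lookup table is consulted, so clients, OSDs, and MDSs all reach the same answer from the same (small, rarely changing) cluster map.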

## Dynamic Subtree Partitioning

![subtree](cephMDS.png)
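
A rough sketch of how a path could be resolved to its authoritative MDS, assuming authority belongs to the nearest delegated ancestor directory. The `subtree_map` dict, the MDS names, and `authoritative_mds` are hypothetical; load measurement and the migration protocol itself are not modeled.

```python
from pathlib import PurePosixPath

def authoritative_mds(path: str, subtree_map: dict) -> str:
    """Return the MDS owning the deepest delegated subtree containing path."""
    p = PurePosixPath(path)
    # Walk from the path itself up to the root; first delegation wins.
    for candidate in [p, *p.parents]:
        mds = subtree_map.get(str(candidate))
        if mds is not None:
            return mds
    raise KeyError(f"no MDS is authoritative for {path}")

if __name__ == "__main__":
    # Hypothetical partition: the root on mds0, /home delegated to mds1.
    subtree_map = {"/": "mds0", "/home": "mds1"}
    print(authoritative_mds("/home/bob/notes.txt", subtree_map))

    # "Dynamic": a busy subtree is re-delegated by updating the map.
    subtree_map["/home/bob"] = "mds2"
    print(authoritative_mds("/home/bob/notes.txt", subtree_map))
```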

## Replication

![replication](cephWrite.png)


## EBOFS

"existing kernel interface limits our ability to understand when object updates are safely committed on disk"

So:
- EBOFS (Extent and B-tree based Object File System) runs entirely in userspace on each OSD


