
# Map-Reduce

"the canonical"

![map-reduce execution](map.png)

MapReduce makes numerous contributions:
- simple abstraction for huge data
- fault tolerance is a first-order consideration
   - coordinator restarts work chunks if necessary.
      - relies on *piecewise determinism*
- locality exploited:
   - when scheduling tasks on nodes that already hold the input data
   - when replacing stragglers with backup tasks
- *pipelining*
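
To make the abstraction concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases as a word count. It is not the real distributed system: `run_with_retry` is a hypothetical stand-in for the coordinator restarting a work chunk, and the worker failure is simulated. The retry is only safe because `map_fn` is deterministic, so re-running a chunk yields the same pairs.

```python
from collections import defaultdict


def map_fn(chunk):
    """Map: emit a (word, 1) pair for every word in one input chunk."""
    return [(word, 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_fn(key, values):
    """Reduce: combine all values that share one key."""
    return key, sum(values)


def run_with_retry(task, arg, fail_first=False, max_attempts=3):
    """Hypothetical coordinator helper: re-run a work chunk after a failure.

    Correct only because the task is deterministic (same input, same output).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            if fail_first and attempt == 1:
                raise RuntimeError("simulated worker failure")
            return task(arg)
        except RuntimeError:
            continue  # the coordinator would reschedule the chunk elsewhere
    raise RuntimeError("task failed on every attempt")


def word_count(chunks):
    pairs = []
    for i, chunk in enumerate(chunks):
        # Simulate a failure on the first chunk to exercise the retry path.
        pairs.extend(run_with_retry(map_fn, chunk, fail_first=(i == 0)))
    return dict(reduce_fn(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["the quick fox", "the lazy dog", "the end"])
```

In a real deployment each chunk would be mapped on a different node and the shuffle would move data across the network; only the shape of the computation is the same here.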
   
Big plusses:
- absolutely enormous datasets
  - both storage *and* computation distributed
- simple, flexible model, does not require specific data formats

Drawbacks:
- disk-based
- single coordinator
- only works for *embarrassingly parallel* applications
- hard to fine-tune this beast for the environment and for overall efficiency

# Spark

![spark execution](spark.png)

The big thing is (mostly) in-memory Resilient Distributed Datasets (RDDs):
- immutable
- series of stackable transformations
  - lazy evaluation:
    - records operations as meta-data
    - reduces number of passes
- cached in memory (when the data fits)
- *data lineage* for fault tolerance instead of storing all data
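
The properties above can be illustrated with a toy class. This is a sketch under stated assumptions, not Spark's API: `LazyDataset`, `collect`, and `lose_cache` are hypothetical names invented here. Transformations only record metadata, all recorded steps run in a single pass when a result is requested, and a lost cache is rebuilt from the lineage instead of from a stored copy of the derived data.

```python
class LazyDataset:
    """Toy stand-in for an RDD: immutable, lazy, and lineage-tracked."""

    def __init__(self, source, lineage=()):
        self._source = tuple(source)  # base data, never mutated
        self._lineage = lineage       # recorded (kind, fn) steps, i.e. metadata
        self._cache = None            # filled on first collect()

    def map(self, fn):
        # Returns a new dataset; nothing is computed yet.
        return LazyDataset(self._source, self._lineage + (("map", fn),))

    def filter(self, fn):
        return LazyDataset(self._source, self._lineage + (("filter", fn),))

    def _compute(self):
        # One pass over the source: every recorded step is applied per record.
        for record in self._source:
            keep = True
            for kind, fn in self._lineage:
                if kind == "map":
                    record = fn(record)
                elif not fn(record):
                    keep = False
                    break
            if keep:
                yield record

    def collect(self):
        if self._cache is None:
            self._cache = list(self._compute())
        return list(self._cache)

    def lose_cache(self):
        """Simulate losing the in-memory partition (for example, a node failure)."""
        self._cache = None


calls = []  # records every invocation of square, to show when work happens


def square(x):
    calls.append(x)
    return x * x


evens = LazyDataset(range(6)).map(square).filter(lambda x: x % 2 == 0)
ran_before_collect = len(calls)  # 0: the transformations were only recorded

result = evens.collect()         # triggers the single pass
evens.lose_cache()
recomputed = evens.collect()     # rebuilt by replaying the lineage
```

The design choice this mimics is that only the base data and the recipe need to survive a failure; derived results are cheap to drop because they can always be replayed.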

Advantages:
- hugely faster than MapReduce if there is enough memory, much faster even if not

Drawbacks:
- maybe scalability

