…the 2-stage Map-Reduce paradigm (originally proposed by Google and popularized by …
…of items, that can be created in a variety of ways. Spark provides a set of operations to transform one or more RDDs into an output RDD, and analysis tasks are written as chains of these operations.
RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset. For example, map is a transformation that passes each dataset element through a function and returns a new RDD representing the results. On the other hand, reduce is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program (although there is also a parallel reduceByKey that returns a distributed dataset).
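As a rough illustration (not part of the project code), here is a minimal PySpark sketch assuming a local `SparkContext` and a placeholder input file `data.txt`: `map` and `flatMap` are transformations that build new RDDs, `reduce` is an action that returns a single value to the driver, and `reduceByKey` produces a distributed dataset of per-key sums.

```python
# Minimal sketch; assumes pyspark is installed and "data.txt" is a placeholder input file.
from pyspark import SparkContext

sc = SparkContext("local", "rdd-demo")

lines = sc.textFile("data.txt")                # base RDD created from a file

# map is a transformation: it describes a new RDD of line lengths.
line_lengths = lines.map(lambda line: len(line))

# reduce is an action: it aggregates the elements and returns a value
# to the driver program.
total_chars = line_lengths.reduce(lambda a, b: a + b)
print(total_chars)

# reduceByKey is the distributed counterpart: it returns an RDD of
# (key, aggregated value) pairs rather than a single driver-side value.
word_counts = (lines.flatMap(lambda line: line.split())
                    .map(lambda word: (word, 1))
                    .reduceByKey(lambda a, b: a + b))
```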
All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program. This design enables Spark to run more efficiently. For example, we can realize that a dataset created through map will be used in a reduce and return only the result of the reduce to the driver, rather than the larger mapped dataset.
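Continuing the sketch above (same `sc` and placeholder file), the transformation lines return immediately because Spark only records the lineage; the work happens when the action runs, and only the small reduced value is sent back to the driver.

```python
# No computation happens here: Spark just records textFile -> map in the lineage.
line_lengths = sc.textFile("data.txt").map(lambda line: len(line))

# The action triggers the actual computation; only the final sum reaches the driver,
# not the full mapped dataset.
total_chars = line_lengths.reduce(lambda a, b: a + b)
```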
By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicated across multiple nodes.
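A hedged sketch of caching, again reusing the `sc` and `data.txt` placeholders from above: `cache()` marks an RDD for in-memory persistence so that later actions reuse it instead of recomputing the lineage, while `persist()` accepts other storage levels such as spilling to disk.

```python
from pyspark import StorageLevel

words = sc.textFile("data.txt").flatMap(lambda line: line.split())
words.cache()                    # equivalent to persist() with the default memory-only level

print(words.count())             # first action computes the RDD and caches it
print(words.distinct().count())  # subsequent actions read the cached elements

# persist() can also keep data on disk or replicate it, via other storage levels.
lowered = words.map(lambda w: w.lower()).persist(StorageLevel.MEMORY_AND_DISK)
```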
Spark can be used with the Hadoop ecosystem, including the HDFS file system and the YARN resource manager.
### Installing Spark