diff --git a/assign9.md b/assign9.md
index e90fe1c490c1a392a5d0c1a64e0a4f9f1c08cc18..4eede1df317769f4180e167e203b31b2e6677328 100644
--- a/assign9.md
+++ b/assign9.md
@@ -6,7 +6,7 @@
 tasks. For this assignment, we will use relatively small datasets and we won't
 run anything in distributed mode; however Spark can be easily used to run the
 same programs on much larger datasets.
 
-### Setup
+## Setup
 
 Download files for Assignment 9 <a href="https://ceres.cs.umd.edu/424/assign/assignment9Dist.tgz?1">here</a>.
@@ -27,9 +27,6 @@
 tasks are written as chains of these operations. Spark can be used with the
 Hadoop ecosystem, including the HDFS file system and the YARN resource
 manager.
 
-Note that bash is the default shell everywhere, but the `.cshrc` is set up
-correctly if you feel like dropping into `tcsh`.
-
 ### Vagrant
 This is the **recommended** way to do this project.
@@ -56,7 +53,8 @@
 approach has been checked out (soon).
 
 ### Docker
 Probably before Thanksgiving.
-### Spark and Python
+
+## Spark and Python
 Spark primarily supports three languages: Scala (Spark is written in Scala),
 Java, and Python. We will use Python here -- you can follow the instructions at
@@ -118,7 +116,7 @@
 The `lambda` representation is more compact and preferable, especially for
 small functions, but for large functions, it is better to separate out the
 definitions.
 
-### Running it as an Application
+### Running as an Application
 Instead of using a shell, you can also write your code as a python file, and
 *submit* that to the spark cluster. The `assignment9` directory contains a