We encourage you to look at the [Spark Programming
Guide](https://spark.apache.org/docs/latest/programming-guide.html) and play
with the other RDD manipulation commands. You should also try out the Scala and
Java interfaces.
## Assignment Details
We have provided a Python file, `assignment.py`, that initializes the following
RDDs:
* An RDD consisting of lines from a Shakespeare play (`play.txt`)
* An RDD consisting of lines from a log file (`NASA_logs_sample.txt`)
* An RDD consisting of 2-tuples indicating user-product ratings from an Amazon
dataset (`amazon-ratings.txt`)
* An RDD consisting of JSON documents pertaining to all the Nobel Laureates over
the last few years (`prize.json`)
The file also contains some examples of operations on these RDDs.
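Spark transformations behave much like Python's built-in sequence operations, which makes it easy to prototype a pipeline locally before running it on an RDD. The sketch below emulates a classic `flatMap` + `map` + `reduceByKey` word count in plain Python; the sample lines are invented for illustration and are not from the provided files:

```python
from collections import defaultdict

# Made-up lines standing in for elements of the play RDD.
lines = ["to be or not to be", "that is the question"]

# flatMap: each input line yields many output words.
words = [w for line in lines for w in line.split()]

# map(lambda w: (w, 1)) followed by reduceByKey(add): sum the 1s per word.
counts = defaultdict(int)
for w in words:
    counts[w] += 1

print(dict(counts))  # 'to' and 'be' each appear twice
```

On an actual RDD the equivalent pipeline is `rdd.flatMap(lambda l: l.split()).map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)`.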
Your task is to fill in the 5 functions defined in the `functions.py` file
(their names start with `task`). The amount of code you need to write is
typically small (several are one-liners), with the exception of the last one.
The first 4 tasks are worth 1 point each, and task 5 is worth 2 points.
- **Task 1**: This function takes the `amazonInputRDD` as input and calculates the
proportion of 1.0-rating reviews out of all reviews made by each customer. The
output will be an RDD where the key is the customer's user ID, and the value
is the proportion as a decimal. This can be completed by using `aggregateByKey`,
or `reduceByKey` along with `map`.
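The per-key aggregation this task asks for can be sketched in plain Python before translating it to Spark. The ratings below are invented sample data; in Spark, the accumulation loop corresponds to `aggregateByKey` carrying a `(ones, total)` pair per user, and the final division corresponds to a `map` over the aggregated pairs:

```python
from collections import defaultdict

# Hypothetical (user_id, rating) pairs standing in for amazonInputRDD elements.
ratings = [("u1", 1.0), ("u1", 5.0), ("u2", 1.0), ("u1", 1.0), ("u2", 3.0)]

# aggregateByKey-style accumulation: per user, count 1.0 ratings and all ratings.
acc = defaultdict(lambda: (0, 0))  # user -> (ones, total)
for user, rating in ratings:
    ones, total = acc[user]
    acc[user] = (ones + (1 if rating == 1.0 else 0), total + 1)

# The final map step: divide to get each user's proportion of 1.0 ratings.
proportions = {user: ones / total for user, (ones, total) in acc.items()}
print(proportions)  # u1 -> 2/3, u2 -> 1/2
```

Keeping the count and the total together in one tuple is exactly why `aggregateByKey` fits here: the RDD value type (a rating) differs from the accumulator type (a pair of counts).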
- **Task 2**: Write just the flatmap function (`task2_flatmap`) that takes in a parsed JSON document (from `prize.json`) and returns the surnames of the Nobel Laureates. In other words, the following command should create an RDD with all the surnames. We will use `json.loads` to parse the JSONs (this is already done). Make sure to look at what it returns so you know how to access the information inside the parsed JSONs (these are basically nested dictionaries). (https://docs.python.org/2/library/json.html)
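The exact lookups depend on the layout of `prize.json`, but assuming each parsed document is a prize record containing a `laureates` list of dicts with a `surname` field (the layout used by the public Nobel Prize API), the flatmap function might look like the sketch below. The sample document is invented to mimic that assumed structure:

```python
import json

def task2_flatmap(doc):
    """Return the surnames of all laureates in one parsed prize document.

    Assumes the document is a dict with a 'laureates' list whose entries
    carry a 'surname' key; entries without one (e.g. organizations) are
    skipped.
    """
    return [l["surname"] for l in doc.get("laureates", []) if "surname" in l]

# A made-up document mimicking the assumed prize.json structure.
doc = json.loads(
    '{"year": "2020", "category": "physics", '
    '"laureates": [{"firstname": "Ada", "surname": "Lovelace"}, '
    '{"firstname": "Alan", "surname": "Turing"}]}'
)
print(task2_flatmap(doc))  # ['Lovelace', 'Turing']
```

Because `task2_flatmap` returns a list, applying it with `flatMap` flattens the per-document lists into a single RDD of surnames.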