24/08/23 17:28:04 INFO ResourceProfile: Limiting resource is cpu
24/08/23 17:28:04 INFO ResourceProfileManager: Added ResourceProfile id: 0
24/08/23 17:28:04 INFO SecurityManager: Changing view acls to: root
....
root@c509f18fe2e3:/424# ls -l output
total 4
-rw-r--r-- 1 root root 0 Aug 23 17:25 _SUCCESS
-rw-r--r-- 1 root root 1477 Aug 23 17:25 part-00000
root@c509f18fe2e3:/424# cat output/part-00000
('', 35)
('"alias', 2)
('#', 9)
('&&', 5)
('-C', 1)
('-rf', 1)
('-xzf', 1)
('-y', 1)
('/', 1)
('/424', 1)
('/root/.bashrc', 2)
....
root@c509f18fe2e3:/424#
```
### More...
We encourage you to look at the [Spark Programming
Guide](https://spark.apache.org/docs/latest/programming-guide.html) and play
with the other RDD manipulation commands. You should also try out the Scala and
Java interfaces.
## Assignment Details
We have provided a Python file, `assignment.py`, that initializes the following
RDDs:
* An RDD consisting of lines from a Shakespeare play (`play.txt`)
* An RDD consisting of lines from a log file (`NASA_logs_sample.txt`)
* An RDD consisting of 2-tuples indicating user-product ratings from an Amazon
dataset (`amazon-ratings.txt`)
* An RDD consisting of JSON documents pertaining to all the Nobel Laureates over
the last few years (`prize.json`)
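
Since one of these RDDs holds parsed JSON documents, it may help to see what `json.loads` hands back before writing any access code. The sketch below uses a made-up document (the `library`, `shelves`, and `books` fields are hypothetical and are not the layout of `prize.json`); it only illustrates that JSON objects become Python dictionaries and JSON arrays become lists.

```python
import json

# Hypothetical document for illustration only; NOT the real prize.json layout.
raw = (
    '{"library": {"city": "Oslo", "shelves": ['
    '{"topic": "physics", "books": ["A", "B"]},'
    '{"topic": "peace", "books": ["C"]}]}}'
)

doc = json.loads(raw)

# Inspect the top-level type and keys before writing any access code.
top_level_type = type(doc).__name__
top_level_keys = sorted(doc.keys())

# JSON objects become dicts and JSON arrays become lists, so reaching nested
# values mixes key lookups with iteration.
all_books = [
    book
    for shelf in doc["library"]["shelves"]
    for book in shelf["books"]
]

print(top_level_type, top_level_keys, all_books)
```

Printing the type and keys of a single parsed document from the real file is usually the quickest way to find the right access path.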
Your task is to fill out the six functions defined in
`functions.py` (those whose names start with `task`). The amount of code that you write
will typically be small (several are one-liners), with the exception of
the last one.
All tasks are worth a single point each.
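
Task 1 below suggests `aggregateByKey` or `reduceByKey` followed by `map`. As a rough, Spark-free sketch of that pattern, the plain-Python code below computes a per-key average by carrying a `(sum, count)` accumulator, which is approximately what `aggregateByKey(zeroValue, seqFunc, combFunc)` does within and across partitions. The data, the two "partitions", and all helper names are assumptions made up for illustration; this is not a solution to any task.

```python
from collections import defaultdict
from functools import reduce

# Toy (key, value) pairs split into two hypothetical "partitions".
partitions = [
    [("a", 2.0), ("b", 4.0), ("a", 4.0)],
    [("b", 1.0), ("a", 3.0)],
]

ZERO = (0.0, 0)  # accumulator: (running sum, running count)

def seq_op(acc, value):
    """Fold one value into an accumulator (within a partition)."""
    return (acc[0] + value, acc[1] + 1)

def comb_op(left, right):
    """Merge two accumulators for the same key (across partitions)."""
    return (left[0] + right[0], left[1] + right[1])

def aggregate_partition(pairs):
    """Build one accumulator per key for a single partition."""
    accs = defaultdict(lambda: ZERO)
    for key, value in pairs:
        accs[key] = seq_op(accs[key], value)
    return dict(accs)

def merge(left, right):
    """Combine the per-key accumulators of two partitions."""
    merged = dict(left)
    for key, acc in right.items():
        merged[key] = comb_op(merged[key], acc) if key in merged else acc
    return merged

totals = reduce(merge, map(aggregate_partition, partitions))

# Final "map" step: turn each (sum, count) accumulator into an average.
averages = {key: total / count for key, (total, count) in totals.items()}
print(averages)
```

The point of the two-function design is that the accumulator type (a tuple) can differ from the value type (a float), which a plain `reduceByKey` only allows if the values are first mapped into that accumulator shape.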
- **Task 1**: This function takes the `amazonInputRDD` as input and calculates, for each customer, the
proportion of that customer's reviews that have a 1.0 rating. The
output should be an RDD where the key is the customer's user ID and the value
is the proportion as a decimal. This can be completed using `aggregateByKey`
or `reduceByKey` along with `map`.
- **Task 2**: Write just the `flatMap` function (`task2_flatmap`) that takes a parsed JSON document (from `prize.json`) and returns the surnames of the Nobel Laureates. In other words, the following command should create an RDD with all the surnames. We use `json.loads` to parse the JSON documents (this is already done). Make sure to look at what it returns so that you know how to access the information inside the parsed documents, which are essentially nested dictionaries (see https://docs.python.org/2/library/json.html).