Commit c2d1645c authored by Zhichao

add wordcount and edit README
…write would typically be small (several would be one-liners), with the exception…
```
stop_words = ['is','am','are','the','for','a','of','in']
```
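The task description above is truncated in this hunk, but assuming the `stop_words` list is meant to filter tokens out of a word stream, a minimal sketch (the helper name is hypothetical):

```
stop_words = ['is', 'am', 'are', 'the', 'for', 'a', 'of', 'in']

def remove_stop_words(words):
    # keep only the tokens that are not in the stop word list
    return [w for w in words if w not in stop_words]
```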
- **Task 2 (4pt)**: Write just the flatmap function (`task2_flatmap`) that takes in a parsed JSON document (from `prize.json`) and returns the surnames of the Nobel Laureates. In other words, the following command should create an RDD with all the surnames (the `distinct()` is to remove duplicate empty surnames). We will use `json.loads` to parse the JSONs (this is already done). Make sure to look at what it returns so you know how to access the information inside the parsed JSONs (these are basically nested dictionaries). (https://docs.python.org/2/library/json.html)
```
task2_result = nobelRDD.map(json.loads).flatMap(task2_flatmap).distinct()
```
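A minimal sketch of what `task2_flatmap` could look like, assuming each parsed document contains a `laureates` list whose entries carry a `surname` field (verify the actual structure of `prize.json` before relying on this):

```
def task2_flatmap(prize):
    # 'prize' is one parsed JSON document (a nested dictionary);
    # emit the surname of every laureate, defaulting to '' when absent
    return [laureate.get('surname', '')
            for laureate in prize.get('laureates', [])]
```

Returning a list lets `flatMap` flatten the per-document surname lists into a single RDD of surnames.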
- **Task 3 (4pt)**: Write the flatmap function that takes in a parsed JSON document (from `prize.json`) and returns the surnames of the Nobel Laureates in each year from 2004 to 2016. It should create an RDD where the key is the year and the value is a list of all the surnames of the Nobel Laureates in that year.
We will use `json.loads` to parse the JSONs (this is already done). Make sure to look at what it returns so you know how to access the information inside the parsed JSONs (these are basically nested dictionaries). (https://docs.python.org/2/library/json.html)
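One possible shape for the per-document step, again assuming a `year` field and a `laureates` list (the helper name is hypothetical; collecting one list per year would then be done with `reduceByKey`):

```
def year_surnames(prize):
    # emit a (year, [surnames]) pair for prizes awarded 2004-2016,
    # nothing otherwise
    year = int(prize.get('year', 0))
    if 2004 <= year <= 2016:
        return [(year, [l.get('surname', '')
                        for l in prize.get('laureates', [])])]
    return []

# pipeline sketch:
# nobelRDD.map(json.loads).flatMap(year_surnames) \
#         .reduceByKey(lambda a, b: a + b)
```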
- **Task 4 (4pt)**: This function operates on the logsRDD. It takes as input a list of logs and returns an RDD where the key is the host and the value is the latest date and time (without the time zone) in the log at which that host was visited.
The format of the log entries should be self-explanatory, but here are more details if you need them: [NASA Logs](http://ita.ee.lbl.gov/html/contrib/NASA-HTTP.html).
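A sketch of the per-line parsing, assuming entries follow the Common Log Format used by the NASA trace (`host - - [dd/Mon/yyyy:HH:MM:SS -0400] "GET /path ..." ...`); the helper name is made up:

```
from datetime import datetime

def host_and_time(line):
    # the host is the first whitespace-separated token; the timestamp
    # sits between '[' and ']', with the time zone after a space
    # (dropped here, per the task statement)
    host = line.split(' ')[0]
    raw = line[line.index('[') + 1 : line.index(']')].split(' ')[0]
    return (host, datetime.strptime(raw, '%d/%b/%Y:%H:%M:%S'))

# latest visit per host:
# logsRDD.map(host_and_time).reduceByKey(max)
```

Parsing into `datetime` objects (rather than comparing the raw strings) makes `reduceByKey(max)` chronological.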
- **Task 5 (4pt)**: Complete a function to group all ratings of products and calculate the degree distribution of product nodes in the Amazon graph. In other words, calculate the degree of each product node (i.e., the number of ratings each product has received). Use a `groupByKey` to find the list of ratings each product has received and `reduceByKey` (or `aggregateByKey`) to find the degree of each product. The output should be an RDD where the key is the product, and the value is the degree and a list of all ratings the product has received. Make sure to make all the ratings a list and join the two RDDs.
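A plain-Python reference for the expected output shape, with the corresponding Spark pipeline sketched in comments (the names are hypothetical; `ratingsRDD` is assumed to hold (product, rating) pairs):

```
from collections import defaultdict

def product_degrees(pairs):
    # pairs: iterable of (product, rating); returns
    # {product: (degree, [ratings])}, mirroring:
    #   grouped = ratingsRDD.groupByKey().mapValues(list)
    #   degrees = ratingsRDD.mapValues(lambda r: 1) \
    #                       .reduceByKey(lambda a, b: a + b)
    #   result  = degrees.join(grouped)
    groups = defaultdict(list)
    for product, rating in pairs:
        groups[product].append(rating)
    return {p: (len(rs), rs) for p, rs in groups.items()}
```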
- **Task 6 (4pt)**: On the logsRDD, for two given times, use a `cogroup` to create the following RDD: the key of the RDD will be a host, and the value will be a 2-tuple, where the first element is a list of all URLs fetched from that host before the first time, and the second element is a list of all URLs fetched from that host after the second time. Use `filter` to first create two RDDs from the input logsRDD.
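A plain-Python analogue of the cogroup step, assuming the two filtered RDDs have been mapped to (host, url) pairs (function and variable names are illustrative only):

```
def cogroup_urls(before_pairs, after_pairs):
    # mirrors:
    #   before = logsRDD.filter(...earlier than t1...).map(to_host_url)
    #   after  = logsRDD.filter(...later than t2...).map(to_host_url)
    #   before.cogroup(after) \
    #         .mapValues(lambda ba: (list(ba[0]), list(ba[1])))
    hosts = {h for h, _ in before_pairs} | {h for h, _ in after_pairs}
    return {h: ([u for x, u in before_pairs if x == h],
                [u for x, u in after_pairs if x == h])
            for h in hosts}
```

Note that, like `cogroup`, this keeps a host even when one of its two lists is empty.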
- **Task 7 (8pt)**: [Bigrams](http://en.wikipedia.org/wiki/Bigram) are sequences of two consecutive words. For example, the previous sentence contains the following bigrams: "Bigrams are", "are sequences", "sequences of", etc. Your task is to write a bigram counting application for counting the bigrams in the `motivation`s of the Nobel Prizes (i.e., the reasons they were given the Nobel Prize). The return value should be a PairRDD where the key is a bigram, and the value is its count, i.e., in how many different `motivation`s it appeared. Don't assume `motivation` is always present.
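A sketch of the per-motivation step: since the count is the number of distinct motivations a bigram appears in, duplicates within one motivation are removed with a set (the helper name is hypothetical):

```
def bigrams(motivation):
    # distinct bigrams within a single motivation string
    words = motivation.split()
    return set(zip(words, words[1:]))

# pipeline sketch, guarding against missing 'motivation' fields:
# motivations.flatMap(bigrams).map(lambda bg: (bg, 1)) \
#            .reduceByKey(lambda a, b: a + b)
```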
A simple Spark word count application:

```
from pyspark import SparkContext

sc = SparkContext("local", "Simple App")
textFile = sc.textFile("README.md")
# split each line into words, then count occurrences of each word
counts = textFile.flatMap(lambda line: line.split(" ")) \
                 .map(lambda word: (word, 1)) \
                 .reduceByKey(lambda a, b: a + b)
print(counts.sortByKey().take(100))
# counts.saveAsTextFile("output")
```