From 028a1ec14c4d1b64af7724663867ac409ad8c1d8 Mon Sep 17 00:00:00 2001
From: Zhichao <liuzceecs@gmail.com>
Date: Fri, 25 Oct 2019 15:49:07 -0400
Subject: [PATCH] add hint

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index e6a1c56..4bcc127 100644
--- a/README.md
+++ b/README.md
@@ -118,7 +118,7 @@ The format of the log entries should be self-explanatory, but here are more deta
 Note: In this question, both list of ratings and users can be accepted as the values in the answer RDD(the result.txt is taking users as the value). We highly recommend you to implement the solution with list of users, since the input of Task5 function is a tuple of (user, product), which means you don't need to make any change to assignment.py
-- **Task 6 (4pt)**: On the logsRDD, for two given times, use a 'cogroup' to create the following RDD: the key of the RDD will be a host, and the value will be a 2-tuple, where the first element is a list of all URLs fetched from that host before the first time, and the second element is the list of all URLs fetched from that host after the second time. Use filter to first create two RDDs from the input logsRDD.
+- **Task 6 (4pt)**: On the logsRDD, for two given times, use a 'cogroup'(function in Spark) to create the following RDD: the key of the RDD will be a host, and the value will be a 2-tuple, where the first element is a list of all URLs fetched from that host before the first time, and the second element is the list of all URLs fetched from that host after the second time. Use filter to first create two RDDs from the input logsRDD.
 - **Task 7 (8pt)**: Your task is to write a name counting application for counting the first names of the Nobel Prizes for each category. The return value should be a PairRDD where the key is a string, of which the format is "[Category]:[Firstname]" (i.e., "chemistry:Michael"), and the value is its count, i.e., in how many times did that combination appear.
-- GitLab
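
Reviewer note: the hint added for Task 6 names Spark's `cogroup`, so a small illustration of the filter-then-cogroup shape may help. The sketch below emulates `cogroup` in plain Python so it runs without a Spark installation. The `(host, timestamp, url)` record layout, the sample data, and the helper names `cogroup` and `urls_before_and_after` are assumptions made for illustration; the real logsRDD holds raw log lines that would need parsing first.

```python
from collections import defaultdict

# Hypothetical, simplified log records: (host, timestamp, url).
# These are illustrative values, not entries from the assignment's log file.
LOGS = [
    ("a.example", 5, "/index.html"),
    ("a.example", 50, "/about.html"),
    ("b.example", 3, "/img/logo.gif"),
    ("c.example", 99, "/late.html"),
    ("a.example", 20, "/middle.html"),
]


def cogroup(left, right):
    """Emulate cogroup on two lists of (key, value) pairs.

    Every key found in either input maps to a 2-tuple of lists:
    (values from left, values from right). A missing side is an empty list.
    """
    grouped = defaultdict(lambda: ([], []))
    for key, value in left:
        grouped[key][0].append(value)
    for key, value in right:
        grouped[key][1].append(value)
    return dict(grouped)


def urls_before_and_after(logs, first_time, second_time):
    # Two filters over the same input, each keyed by host.
    before = [(host, url) for host, ts, url in logs if ts < first_time]
    after = [(host, url) for host, ts, url in logs if ts > second_time]
    return cogroup(before, after)


result = urls_before_and_after(LOGS, 10, 40)
for host, (before_urls, after_urls) in sorted(result.items()):
    print(host, before_urls, after_urls)
```

With PySpark RDDs the same shape would likely be two `filter(...).map(...)` calls producing `(host, url)` pairs, followed by `before.cogroup(after)`. PySpark returns iterables rather than lists for each side, so a `mapValues` step converting both to `list` would be needed to match the output described in the task.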