From 0e319f8f4b9a51d0e8cafe8bd91dfe2e7035588b Mon Sep 17 00:00:00 2001
From: "Peter J. Keleher" <keleher@cs.umd.edu>
Date: Sat, 2 Dec 2023 07:45:45 -0500
Subject: [PATCH] auto

---
 assign9.md | 29 +++++++++++++++++++++++++++--
 1 file changed, 27 insertions(+), 2 deletions(-)

diff --git a/assign9.md b/assign9.md
index e6b592d..6980a35 100644
--- a/assign9.md
+++ b/assign9.md
@@ -82,7 +82,7 @@ a bunch of stuff about what Spark is doing).
 The relevant variables are initialized in this python shell, but otherwise it
 is just a standard Python shell.
 
-2. `>>> textFile = sc.textFile("README.md")`: This creates a new RDD, called
+2. `>>> textFile = sc.textFile("Dockerfile")`: This creates a new RDD, called
 `textFile`, by reading data from a local file. The `sc.textFile` commands
 create an RDD containing one entry per line in the file.
 
@@ -97,12 +97,37 @@ the Word Count application.
 #### Word Count Application
 
 The following command (in the pyspark shell) does a word count, i.e., it counts
-the number of times each word appears in the file `README.md`. Use
+the number of times each word appears in the file `Dockerfile`. Use
 `counts.take(5)` to see the output.
 
 `>>> counts = textFile.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)`
 
 
+In more detail:
+```
+root@d36910b1feb0:/assign9# $SPARKHOME/bin/pyspark
+Python 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] on linux
+Type "help", "copyright", "credits" or "license" for more information.
+Setting default log level to "WARN".
+To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
+23/12/02 12:35:16 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
+Welcome to
+      ____              __
+     / __/__  ___ _____/ /__
+    _\ \/ _ \/ _ `/ __/  '_/
+   /__ / .__/\_,_/_/ /_/\_\   version 3.5.0
+      /_/
+
+Using Python version 3.10.12 (main, Jun 11 2023 05:26:28)
+Spark context Web UI available at http://d36910b1feb0:4040
+Spark context available as 'sc' (master = local[*], app id = local-1701520517201).
+SparkSession available as 'spark'.
+>>> textFile = sc.textFile("Dockerfile")
+>>> counts = textFile.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)
+>>> counts.take(5)
+[('#', 9), ('Use', 1), ('as', 1), ('image', 1), ('', 35)]
+```
+
 Here is the same code without the use of `lambda` functions.
 
 ```
-- 
GitLab
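
For reference, here is a minimal sketch of the same word count written with named functions instead of `lambda`s; the assignment's own no-lambda listing falls outside the hunk above, so this is not necessarily its exact code. It assumes the `sc` SparkContext provided by the pyspark shell, and the helper names (`split_line`, `to_pair`, `add`) are illustrative.

```
# Sketch: word count with named functions instead of lambdas.
# Assumes `sc` from the pyspark shell; helper names are illustrative.

def split_line(line):
    # One entry per word; flatMap flattens the per-line lists into one RDD.
    return line.split(" ")

def to_pair(word):
    # Pair each word with an initial count of 1.
    return (word, 1)

def add(a, b):
    # Sum the counts accumulated for a given word.
    return a + b

textFile = sc.textFile("Dockerfile")
counts = textFile.flatMap(split_line).map(to_pair).reduceByKey(add)
print(counts.take(5))
```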