Data parallel computing with Spark

Hands-on: Data analytics in Spark

Download the movie dataset.
Unzip the movie data file.
Open a terminal.
Activate the pyspark conda environment, then launch Jupyter Notebook:

$ conda activate pyspark
$ jupyter notebook

Create a new notebook using the pyspark kernel, then change the notebook’s name to spark-2.
Copy the code from spark-1 to set up and launch a Spark application.
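If the spark-1 notebook is not at hand, the setup cell may look roughly like the sketch below. This is an assumption, not the exact code from spark-1: the application name, the `local[*]` master setting, and the helper name `create_spark_session` are illustrative choices, so adjust them to match your earlier notebook.

```python
# Hypothetical settings; spark-1 may use different values.
APP_NAME = "spark-2"
MASTER = "local[*]"  # run Spark locally on all available cores


def create_spark_session(app_name=APP_NAME, master=MASTER):
    """Build a SparkSession, or reuse one that is already running."""
    # Imported inside the function so this cell still loads
    # in an environment where pyspark is not installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName(app_name)
        .master(master)
        .getOrCreate()
    )
```

In a notebook cell, `spark = create_spark_session()` starts the application, and `sc = spark.sparkContext` gives access to the underlying SparkContext for RDD operations.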