Data parallel computing with Spark

Hands-on: Data analytics in Spark

  • Download the Movie Dataset.

  • Unzip the movie data file.

  • Open a terminal.

  • Activate the pyspark conda environment, then launch Jupyter Notebook:

$ conda activate pyspark
$ jupyter notebook
  • Create a new notebook using the pyspark kernel, then change the notebook’s name to spark-2.

  • Copy the code from spark-1 to set up and launch a Spark application.
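
If you prefer to do the unzip step from inside the notebook rather than with a desktop tool, a minimal sketch using only Python's standard library could look like the following. The helper name `unzip_dataset` and the file and folder names in the usage comment are placeholders chosen for illustration, not names from this hands-on; substitute the actual name of the archive you downloaded.

```python
import zipfile
from pathlib import Path


def unzip_dataset(zip_path, dest_dir):
    """Extract every member of zip_path into dest_dir.

    Returns the sorted list of extracted file paths so you can
    check what the archive contained before loading it in Spark.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)  # create the target folder if needed
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        # Directory entries end with "/"; keep only regular files.
        names = [n for n in archive.namelist() if not n.endswith("/")]
    return sorted(dest / n for n in names)


# Hypothetical usage (file and folder names are assumptions):
# for path in unzip_dataset("movie-data.zip", "data"):
#     print(path)
```

Returning the extracted paths makes it easy to confirm where the data files ended up before pointing Spark at them.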