Description

Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.

Version     Module name
3.2.0       spark/3.2.0
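
Assuming the cluster uses a standard environment-modules setup, the module listed above can be loaded before running any of the commands below (a sketch; the exact command may differ on your system):

Code Block
languagebash
# Load the Spark module from the table above (run `module avail spark` to confirm the name).
module load spark/3.2.0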

...

Code Block
languagebash
# The 9th field of the "Starting Spark master at ..." log line is the master URL.
SPARK_MASTER=$(grep "Starting Spark master" ${SPARK_LOG_DIR}/master.err | cut -d " " -f 9)
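
As a quick sanity check, the extracted value should be a URL of the form spark://<host>:<port> (7077 is the default port for a standalone Spark master; the hostname will be the node running the master):

Code Block
languagebash
# Confirm the master URL was extracted correctly before connecting to it.
echo "${SPARK_MASTER}"
# Expected output looks like: spark://<hostname>:7077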


Connect to the master using the Spark interactive shell in Scala:
Code Block
languagebash
spark-shell --master ${SPARK_MASTER}
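
A non-interactive application can be submitted to the same master with spark-submit. A minimal sketch using the SparkPi example bundled with Spark (the exact jar path is an assumption based on a standard Spark 3.2.0 installation):

Code Block
languagebash
# Submit the bundled SparkPi example to the standalone master started above.
spark-submit --master ${SPARK_MASTER} \
  --class org.apache.spark.examples.SparkPi \
  ${SPARK_HOME}/examples/jars/spark-examples_2.12-3.2.0.jar 100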

...

  • Use the dropdown menu under Interactive Apps and select Spark + Jupyter or Spark + RStudio
  • Spark + Jupyter: uses the spark conda environment provided by the anaconda3/2020.07 module.
    • Select a different conda environment OR
    • Enter the commands to launch a conda environment of your choice in the text box
    • If you prefer Jupyter Lab, check the radio button for "Use JupyterLab instead of Jupyter Notebook?"
  • Spark + RStudio: As of Nov 2021, only R 4.1.2 is available for use.
  • Enter the resources you wish to use for your Spark Job
    • By default, one task will be launched per node.
    • Enter the number of CPUs you want per task.
    • For example, to run a 2-node Spark job on the health partition with 36 worker cores on each node:
      • Enter 36 in the "Number of cpus per task" box, and
      • 2 in the "Number of nodes" box.
  • Click the "Launch" button to submit your job and wait for resources to become available.
  • When resources are available, a standalone Spark cluster will be created for you. Setting up the cluster takes a while - go grab a coffee.
  • If you only want to monitor the Spark cluster:
    • Click the link next to Session ID
    • Open the output.log file (this is created only when your job starts, which may take up to a few minutes) to see information on the Spark Master, Master WebUI, and History WebUI.
  • Once the cluster is set up (this could take a few minutes):
  • Spark + Jupyter: Click on the "Connect to Jupyter" button to start Jupyter Lab or Notebook.
  • Spark + RStudio: Click on the "Connect to RStudio Server" button to start RStudio.
    • Due to the way RStudio handles user-created environment variables, only the following variables are available in the RStudio session via the Sys.getenv command (see the sketch after this list):
      • SPARK_HOME, HIVE_HOME, JAVA_HOME, HADOOP_HOME, SPARK_MASTER_HOST, SPARK_MASTER_PORT, SPARK_WORKER_CORES, SPARK_WORKER_MEMORY, SPARK_EXECUTOR_MEMORY, SPARK_HISTORY_OPTS
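
Because these are ordinary environment variables, the master URL can also be rebuilt from them without opening output.log. A minimal sketch, assuming SPARK_MASTER_HOST and SPARK_MASTER_PORT are exported into the job environment as listed above:

Code Block
languagebash
# Rebuild the master URL from the variables exported by the cluster setup.
SPARK_MASTER="spark://${SPARK_MASTER_HOST}:${SPARK_MASTER_PORT}"
echo "${SPARK_MASTER}"
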
Jupyter Lab/Notebooks

This video is in real time; expect similar startup times for your own jobs. Startup times will increase with the number of nodes requested. Either speed up the playback or skip ahead.



RStudio Server

This video is in real time; expect similar startup times for your own jobs. Startup times will increase with the number of nodes requested. Either speed up the playback or skip ahead.