Description
Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.
Version | modulename
---|---
3.2.0 | spark/3.2.0
...
```bash
# Extract the spark:// master URL from the master log
SPARK_MASTER=$(grep "Starting Spark master" ${SPARK_LOG_DIR}/master.err | cut -d " " -f 9)
```
Connect to the master using the Spark interactive shell in Scala:
```bash
spark-shell --master ${SPARK_MASTER}
```
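If you prefer to work from Python rather than the Scala shell, the same master URL can be handed to a PySpark session. The sketch below is illustrative rather than site-specific: it assumes PySpark is available in your environment and that SPARK_MASTER is set as shown above; the small count at the end is only a quick check that the workers respond.

```python
# Minimal sketch: connect a PySpark session to the standalone master.
# Assumes SPARK_MASTER holds the spark://host:port URL extracted above.
import os
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master(os.environ["SPARK_MASTER"])
    .appName("connectivity-check")
    .getOrCreate()
)

# Trivial job to confirm the executors are reachable.
print(spark.sparkContext.parallelize(range(1000)).count())

spark.stop()
```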
...
- Use the dropdown menu under Interactive Apps and select Spark + Jupyter or Spark + RStudio
- Spark + Jupyter: uses the spark environment provided by the anaconda3/2020.07 module.
- Select a different conda environment OR
- Enter the commands to launch a conda environment of your choice in the text box
- If you prefer Jupyter Lab, check the radio button for "Use JupyterLab instead of Jupyter Notebook?"
- Spark + RStudio: As of Nov 2021, only R 4.1.2 is available for use.
- Enter the resources you wish to use for your Spark Job
- By default, one task will be launched per node.
- Enter the number of CPUs you want per task
- For example, to run a 2-node Spark job on the health partition with 36 workers on each node,
- Enter 36 in the "Number of cpus per task" box, and
- 2 in the "Number of nodes" box.
- Click the "Launch" button to submit your job and wait for resources to become available.
- When resources are available, a standalone Spark cluster will be created for you. Setting up a Spark cluster takes a while - go grab a coffee.
- If you only want to monitor the Spark cluster
- Click the link next to Session ID
- Open the output.log file (this is created only when your job starts and may take up to a few minutes to appear) to see information on the Spark Master, Master WebUI, and History WebUI.
- Once the cluster is set up (this can take a few minutes):
- Spark + Jupyter: Click on the "Connect to Jupyter" button to start Jupyter Lab or Notebook (see the PySpark sketch after this list for connecting to the cluster from a notebook).
- Spark + RStudio: Click on the "Connect to RStudio Server" button to start RStudio.
- Due to the way RStudio handles user-created environment variables, only the following variables are available from the RStudio session via the Sys.getenv command:
- SPARK_HOME, HIVE_HOME, JAVA_HOME, HADOOP_HOME, SPARK_MASTER_HOST, SPARK_MASTER_PORT, SPARK_WORKER_CORES, SPARK_WORKER_MEMORY, SPARK_EXECUTOR_MEMORY, SPARK_HISTORY_OPTS
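For the Jupyter case, a first notebook cell might look like the sketch below. This is only an illustration: it builds the master URL from the SPARK_MASTER_HOST and SPARK_MASTER_PORT variables listed above and assumes they are also visible in the Jupyter session; the "4g" executor-memory fallback is an arbitrary placeholder, not a site recommendation.

```python
# Rough sketch of a first notebook cell; assumes the Spark-related
# environment variables listed above are visible in the Jupyter session.
import os
from pyspark.sql import SparkSession

master_url = "spark://{}:{}".format(
    os.environ["SPARK_MASTER_HOST"], os.environ["SPARK_MASTER_PORT"]
)

spark = (
    SparkSession.builder
    .master(master_url)
    .appName("jupyter-session")
    # "4g" is a placeholder default, not a site-recommended value
    .config("spark.executor.memory", os.environ.get("SPARK_EXECUTOR_MEMORY", "4g"))
    .getOrCreate()
)

# Small job to confirm the executors are doing work.
spark.range(1_000_000).selectExpr("sum(id)").show()
```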
Video: Jupyter Lab/Notebooks
This video is in real time; expect similar startup times for your jobs. Startup times will increase with the number of nodes requested. Either speed up playback or skip ahead.