Description
Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.
Version | modulename
---|---
3.2.0 | spark/3.2.0
...
```bash
# Extract the spark:// master URL from the master log
SPARK_MASTER=$(grep "Starting Spark master" ${SPARK_LOG_DIR}/master.err | cut -d " " -f 9)
```
Connect to the master using the Spark interactive shell in Scala:
```bash
spark-shell --master ${SPARK_MASTER}
```
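If you prefer to work from Python rather than the Scala shell, the same master URL can be handed to a PySpark session. The sketch below is illustrative rather than site-specific: it assumes PySpark is available in your environment and that SPARK_MASTER is set as shown above; the small count at the end is only a quick check that the workers respond.

```python
# Minimal sketch: connect a PySpark session to the standalone master.
# Assumes SPARK_MASTER holds the spark://host:port URL extracted above.
import os
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master(os.environ["SPARK_MASTER"])
    .appName("connectivity-check")
    .getOrCreate()
)

# Trivial job to confirm the executors are reachable.
print(spark.sparkContext.parallelize(range(1000)).count())

spark.stop()
```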
...
- Use the dropdown menu under Interactive Apps and select Spark + Jupyter or Spark + RStudio
- Spark + Jupyter: uses the spark environment provided by the anaconda3/2020.07 module.
- Select a different conda environment OR
- Enter the commands to launch a conda environment of your choice in the text box
- If you prefer Jupyter Lab, check the radio button for "Use JupyterLab instead of Jupyter Notebook?"
- Spark + RStudio: As of Nov 2021, only R 4.1.2 is available for use.
- Enter the resources you wish to use for your Spark Job
- By default, one task will be launched per node.
- Enter the number of CPUs you want per task
- For example, to run a 2-node Spark job on the health partition with 36 workers on each node,
- Enter 36 in the "Number of cpus per task" box, and
- 2 in the "Number of nodes" box.
- Click the "Launch" button to submit your job and wait for resources to become available.
- When resources are available, a standalone Spark cluster will be created for you. Setting up a Spark cluster takes a while - go grab a coffee.
- If you only want to monitor the Spark cluster
- Click the link next to Session ID
- Open the output.log file (this is created only when your job starts and may take up to a few minutes to appear) to see information on the Spark Master, Master WebUI, and History WebUI.
- Once the cluster is set up (this can take a few minutes):
- Spark + Jupyter: Click on the "Connect to Jupyter" button to start Jupyter Lab or Notebook (see the PySpark sketch after this list for connecting to the cluster from a notebook).
- Spark + RStudio: Click on the "Connect to RStudio Server" button to start RStudio.
- Due to the way RStudio handles user-created environment variables, only the following variables are available from the RStudio session via the Sys.getenv command:
- SPARK_HOME, HIVE_HOME, JAVA_HOME, HADOOP_HOME, SPARK_MASTER_HOST, SPARK_MASTER_PORT, SPARK_WORKER_CORES, SPARK_WORKER_MEMORY, SPARK_EXECUTOR_MEMORY, SPARK_HISTORY_OPTS
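For the Jupyter case, a first notebook cell might look like the sketch below. This is only an illustration: it builds the master URL from the SPARK_MASTER_HOST and SPARK_MASTER_PORT variables listed above and assumes they are also visible in the Jupyter session; the "4g" executor-memory fallback is an arbitrary placeholder, not a site recommendation.

```python
# Rough sketch of a first notebook cell; assumes the Spark-related
# environment variables listed above are visible in the Jupyter session.
import os
from pyspark.sql import SparkSession

master_url = "spark://{}:{}".format(
    os.environ["SPARK_MASTER_HOST"], os.environ["SPARK_MASTER_PORT"]
)

spark = (
    SparkSession.builder
    .master(master_url)
    .appName("jupyter-session")
    # "4g" is a placeholder default, not a site-recommended value
    .config("spark.executor.memory", os.environ.get("SPARK_EXECUTOR_MEMORY", "4g"))
    .getOrCreate()
)

# Small job to confirm the executors are doing work.
spark.range(1_000_000).selectExpr("sum(id)").show()
```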
Video: Jupyter Lab/Notebooks
This video is in real time; expect similar startup times for your jobs. Startup times will increase with the number of nodes requested. Either speed up playback or skip ahead.