SLURM
The SLURM scheduler (Simple Linux Utility for Resource Management) manages and allocates all of Sol's compute nodes. All of your computing must be done on Sol's compute nodes. The following is an abbreviated user guide for SLURM. Please visit the SLURM website for more detailed documentation of its tools and capabilities.
- 1 Partitions
- 1.1 Limitations
- 1.2 Priorities
- 2 Current Status
- 3 Usage
- 4 File Systems
- 5 Running Jobs on Sol and Hawk
- 5.1 Special Instructions
- 5.2 Interactive Jobs
- 5.3 Batch Jobs
- 5.3.1 Sample Scripts for Batch Jobs
- 5.3.1.1 Serial Job
- 5.3.1.2 OpenMP Job
- 5.3.1.3 MPI Job
- 5.3.1.4 GPU Job
- 5.3.2 Submitting Jobs
- 5.4 Submitting Dependency jobs
- 5.5 Abbreviated Notations
- 6 Monitoring Jobs
- 7 Monitoring Queues
Partitions
SLURM uses the term partition instead of queue. There are several partitions available on Sol and Hawk for running jobs:
lts : 20-core nodes purchased as part of the original cluster by LTS.
Two 2.3GHz 10-core Intel Xeon E5-2650 v3, 25M Cache, 128GB 2133MHz RAM
lts-gpu: 1 core per lts node is reserved for launching gpu jobs
im1080 : 24-core nodes purchased by Wonpil Im, Department of Biological Sciences. Users can request a max of 20 cores per node.
im1080-gpu : 2 cores per im1080 node are reserved for launching gpu jobs.
Two 2.3GHz 12-core Intel Xeon E5-2670 v3, 30M Cache, 128GB 2133MHz RAM, Two EVGA Geforce GTX 1080 PCIE 8GB GDDR5
eng : 24-core nodes purchased by various RCEAS faculty.
eng-gpu : 2 cores per eng node are reserved for launching gpu jobs, i.e. 1 core for each gpu.
Two 2.3GHz 12-core Intel Xeon E5-2670 v3, 30M Cache, 128GB 2133MHz RAM, EVGA Geforce GTX 1080 PCIE 8GB GDDR5. Four nodes have two cards while the other nodes have one card.
engc : 24-core nodes based on Broadwell CPUs purchased by ChemE Faculty. Users can request a max of 24 cores per node until GPUs are added to these nodes.
Two 2.2GHz 12-core Intel Xeon E5-2650 v4, 30M Cache, 64GB 2133MHz RAM
himem : 16-core node purchased by Economics Faculty with 512GB RAM.
Two 2.6GHz 8-core Intel Xeon E5-2640 v3, 20M Cache, 512GB 2400MHz RAM
Users utilizing this node will be charged a higher rate of SU consumption (3 SUs/core-hour). Please evaluate the memory consumption of your job before submitting jobs to this partition. If you need to use this partition, please contact Ryan Bradley.
enge,engi: 36-core node purchased by MEM faculty and ISE Department
Two 2.3GHz 18-core Intel Xeon Gold 6140, 24.75M Cache, 192GB 2666MHz RAM
This node features the newer AVX512 vector extension that provides twice the FLOPS of earlier generation Haswell/Broadwell CPUs at the expense of CPU speed.
im2080: 36-core nodes purchased by Wonpil Im, Department of Biological Sciences. Users can request a max of 28 cores per node.
im2080-gpu : 8 cores per im2080 node are reserved for launching gpu jobs, i.e. 2 cores per gpu.
Two 2.3GHz 18-core Intel Xeon Gold 6140, 24.75M Cache, 192GB 2666MHz RAM, Four ASUS GeForce RTX 2080TI PCIE 11GB GDDR6
chem: 36-core Skylake (2) and Cascade Lake (4) nodes purchased by Lisa Fredin, Department of Chemistry
(2) Two 2.3GHz 18-core Intel Xeon Gold 6140, 24.75M Cache, 192GB 2666MHz RAM
(4) Two 2.6GHz 18-core Intel Xeon Gold 6240, 24.75M Cache, 192GB 2933MHz RAM
health: 36-core nodes purchased by the College of Health
Two 2.6GHz 18-core Intel Xeon Gold 6240, 24.75M Cache, 192GB 2933MHz RAM
hawkcpu: CPU nodes on Hawk
Two 2.1GHz 26-core Intel Xeon Gold 6230R, 384GB RAM
hawkgpu: GPU nodes on Hawk
Two 2.2GHz 24-core Intel Xeon Gold 5220R, 192GB RAM, 8 nVIDIA Tesla T4
hawkmem: Big Memory nodes on Hawk
Two 2.1GHz 26-core Intel Xeon Gold 6230R, 1536GB RAM
infolab: 2 52-core Cascade Lake refresh nodes purchased by Brian Chen, CSE faculty (identical to Hawk CPU nodes)
Two 2.1GHz 26-core Intel Xeon Gold 6230R, 384GB RAM
pisces: 48-core node with A100 GPUs purchased by Keith Moored, Department of Mechanical Engineering and Mechanics.
Two 3.0GHz 24-core Intel Xeon Gold 6248R, 35.75M Cache, 192GB RAM, 5 NVIDIA A100 40GB HBM2 GPUs
Each A100 GPU is charged 48 SUs/hour. A maximum of 10 CPUs can be requested per A100.
ima40-gpu: 32-core nodes purchased by Wonpil Im, Department of Biological Sciences.
Two 3.0GHz 16-core AMD EPYC 7302, 128M Cache, 256GB RAM, 8 NVIDIA A40 48GB GDDR6 GPUs
Each A40 GPU is charged 24 SUs/hour. A maximum of 4 CPUs can be requested per A40.
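The SU cost of a GPU job follows directly from the rates above: GPUs requested × hours of walltime × per-GPU rate. A minimal sketch with made-up numbers (2 A100s for 6 hours on pisces) illustrates the arithmetic:

```shell
# Hypothetical example: estimate the SU cost of a pisces job
# using 2 A100 GPUs (charged 48 SUs/hour each) for 6 hours.
gpus=2
hours=6
rate=48   # SUs charged per A100 GPU per hour
total=$(( gpus * hours * rate ))
echo "SUs consumed: ${total}"
```

The same calculation applies to the A40 nodes with a rate of 24 SUs/hour per GPU.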
Limitations
| Partition | Max Wallclock in hours | Min/Max Cores/Node per Job | Max SUs/Node consumed per hour | Max memory in GB per core |
|---|---|---|---|---|
| lts | 72 | 1/19 | 19 | 6 |
| lts-gpu | 72 | 1/20 | 20 | 6 |
| im1080 | 48 | 1/20 | 20 | 5 |
| im1080-gpu | 48 | 1/24 | 24 | 5 |
| eng | 72 | 1/22 | 22 | 5 |
| eng-gpu | 72 | 1/24 | 24 | 5 |
| engc | 72 | 1/24 | 24 | 2.5 |
| enge | 72 | 1/36 | 36 | 5 |
| engi | 72 | 1/36 | 36 | 5 |
| himem | 72 | 1/16 | 48 | 32 |
| im2080 | 48 | 1/28 | 28 | 5 |
| im2080-gpu | 48 | 1/36 | 36 | 5 |
| chem | 48 | 1/36 | 36 | 5 |
| health | 48 | 1/36 | 36 | 5 |
| hawkcpu | 72 | 1/52 | 52 | 7.3 |
| hawkmem | 72 | 1/52 | 52 | 29.3 |
| hawkgpu | 72 | 1/48 | 48 | 4.0 |
| infolab | 72 | 1/52 | 52 | 7.3 |
| pisces (GPU only) | 24 | 1/10 | 58 | 4.0 |
| ima40-gpu | 48 | 1/4 | 28 | 8.0 |
| rapids | 72 | 1/64 | 64 | 8.0 |
The himem partition is for running high-memory jobs, i.e. those requiring more than 6GB/core, or for using the Artelys Knitro software. Do not submit jobs that require lower memory per core to the himem partition. All jobs in the himem partition are charged 3 SUs per core-hour of computing, irrespective of how many cores you request or how much memory you consume.
For hawkgpu, ideally request a maximum of 6 CPUs for every GPU you want to consume. Single-core workflows will not be allowed on hawkgpu: you must request a minimum of 1 GPU with 6 CPUs per GPU, i.e. a minimum of 6 SUs will be consumed per hour. This is not yet enforced during the user-friendly phase, so feel free to test how your application scales.
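The walltime and per-node limits in the table above can also be queried directly from the scheduler. Assuming a standard SLURM installation, `sinfo` with an output format string reports each partition's time limit and node sizes:

```shell
# Show partition name, time limit, CPUs per node, and memory per node
sinfo -o "%P %l %c %m"
# Restrict the query to a single partition, e.g. lts
sinfo -p lts -o "%P %l %c %m"
```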
Priorities
To ensure investors receive their allocation of resources while still maintaining a shared resource, each investor receives a priority boost on their investment. Every investor, hotel or condo, receives a base priority of 1 on all partitions. A priority boost of 100 is provided to investors and their collaborators on their investment. This ensures that an investor's job will always start before other users' jobs. Jobs accumulate a priority of 1 for each day in the queue, so a non-investor's job in a different partition would have to be in the queue for 100 days before it can have a higher priority than an investor's job. Below is a table listing the various investors and the partitions where they have priority. All Hotel investors get priority access on the lts partition.
| Investor | Partition |
|---|---|
| Hotel | lts |
| Dimitrios Vavylonis | lts |
| Wonpil Im | im1080, im1080-gpu, im2080, im2080-gpu, ima40-gpu |
| Anand Jagota | eng |
| Brian Chen | eng, infolab |
| Edmund Webb III | eng |
| Alparslan Oztekin | eng |
| Jeetain Mittal | lts-gpu, eng-gpu |
| Srinivas Rangarajan | engc |
| Seth Richards-Shubik | himem |
| Ganesh Balasubramanian | enge |
| Industrial and Systems Engineering | engi |
| Lisa Fredin | chem |
| Paolo Bocchini | engc |
| Hannah Dailey | enge |
| Keith Moored | pisces |
Current Status
Please use the idlecores command on the cluster to see an estimate of the current traffic. We will replace the status page at the earliest opportunity; in the meantime, this command provides most of the same information.
The current status of partitions and load on nodes is updated every 15 minutes. Accessible only on campus or via VPN; do not bookmark for off-campus use.
Usage
Usage reports for current and past allocation cycles. Accessible only on campus or via VPN; do not bookmark for off-campus use.
Detailed annual reports with consumption of resources by users and research groups. Accessible only on campus or via VPN; do not bookmark for off-campus use. Some pages may take a while to load due to the amount of data reported.
File Systems
There are three distinct file spaces on Sol and Hawk.
HOME, your home directory.
SCRATCH, scratch storage on the local disk associated with your running job.
CEPHFS, global parallel scratch for running jobs with a lifetime of 7 days.
CEPH, Ceph project space for research groups that have purchased a minimum 1TB Ceph project
HOME Storage
All users are provided with a 150GB storage quota at /home/username, accessible using the environment variable $HOME. Home storage is a large Ceph project that is not backed up. It is the user's responsibility to maintain backups of their data in $HOME. $HOME directories are not deleted as long as annual user account fees are paid by the HPC PIs.
SCRATCH Storage
SCRATCH provides 500GB of storage on the local disk of the nodes associated with running jobs. This space is not backed up or snapshotted and is deleted when jobs complete. A user can access this space while running jobs at /scratch/$SLURM_JOB_USER/$SLURM_JOB_ID. Since compute nodes are shared among different users, the available disk space could be less than 500GB. Users who use the SCRATCH space need to make sure that data is copied back at the end of their jobs. Since the scheduler purges the SCRATCH storage at the end of a job, data that hasn't been copied cannot be recovered. See below for a sample script using SCRATCH storage.
All modules define the variable LOCAL_SCRATCH to point to SCRATCH when loaded within your submit script.
Using Local Scratch for MD simulation
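As a minimal sketch (a submit-script fragment, not a complete job), the LOCAL_SCRATCH variable can be used as follows; the program name `myjob` and its input file are placeholders:

```shell
# Fragment of a submit script: run in node-local scratch, then copy back.
# Modules set LOCAL_SCRATCH to /scratch/${SLURM_JOB_USER}/${SLURM_JOB_ID}
cd ${LOCAL_SCRATCH}
cp ${SLURM_SUBMIT_DIR}/filename.in .
${SLURM_SUBMIT_DIR}/myjob < filename.in > filename.out
# Copy results home before the scheduler purges this directory
rsync -avtz filename.out ${SLURM_SUBMIT_DIR}/
```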
CEPHFS global parallel scratch
CEPHFS provides 22TB of global parallel scratch storage. This space is not backed up or snapshotted, and all files older than 7 days are deleted. A user can access this space at /share/ceph/scratch/$USER/$SLURM_JOB_ID for running jobs and for 7 days after the job has completed. The SLURM scheduler automatically creates this directory. Users can use this space for writing parallel job output that needs a longer lifetime than that provided by SCRATCH. Since this storage is serviced by SSDs on the Ceph storage cluster, using CEPHFS provides better read/write performance than the HOME and CEPH storage spaces. It is the user's responsibility to back up data within 7 days of job completion.
All modules define the variable CEPHFS_SCRATCH to point to CEPHFS when loaded within your submit script.
CEPH Storage
Lehigh Research Computing provides Ceph projects for research groups that require more storage than the 150GB provided to each HPC account. HPC PIs can add their collaborators to their Ceph project, which can be used as a storage space located at /share/ceph/projectname on Sol. Users should keep in mind that all Ceph projects, including $HOME, are networked file systems, and writing job output to these filesystems could affect the performance of your jobs. Ceph projects should be used for storage; all workloads with intense input and output should use the SCRATCH or CEPHFS global scratch storage.
Running Jobs on Sol and Hawk
You must be allocated at least one compute node by SLURM to run jobs. Running compute-intensive workloads (i.e. anything other than editing files, submitting jobs, and monitoring jobs) on the head/login node is strictly prohibited. Users need to write a script requesting the desired resources from SLURM.
Special Instructions
To run jobs, add one of the following lines to your submit script to load modules that are optimized for the underlying CPU (for the debug, enge, chem, im2080, health, hawk, and infolab partitions):
source /etc/profile.d/zlmod.sh
#OR
source /share/Apps/compilers/etc/lmod/zlmod.sh
If you use tcsh, you need to add the following line to put LMOD in your path before loading any modules:
source /share/Apps/compilers/etc/lmod/zlmod.csh
There are two types of jobs that can be run on Sol:
Interactive Jobs
Batch Jobs
Interactive Jobs
These are jobs that provide an interactive environment or command-line prompt in which users can enter commands to run simulations. They are best used for testing and debugging and are not appropriate for long-running production jobs. Resources can be requested using the srun command with at least one option to launch a pseudo-terminal, --pty /bin/bash. Other options include partition, number of nodes, tasks per node, and time.
Interactive job on the lts partition requesting 1 CPU for 1 hour:
srun --partition=lts --nodes=1 --ntasks-per-node=1 --time=60 --pty /bin/bash
When a resource becomes available, SLURM will provide you with a command prompt on the compute node you are allocated. Until then, you will not have access to the command prompt in the shell where the above command was executed. If you cancel the command using CTRL-C, your interactive job request will be cancelled. Depending on how busy the cluster is, your wait could be a few minutes to a few days.
All compute nodes have a naming convention sol-[a-e][1-6][00-18], for e.g. sol-a104. Do not run jobs on the head/login node i.e. sol.
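The same pattern extends to GPU partitions by adding a --gres request. A hypothetical interactive GPU session (the partition and counts here are illustrative) would look like:

```shell
# Interactive job on the im1080-gpu partition requesting
# 1 CPU and 1 GPU for 1 hour
srun --partition=im1080-gpu --nodes=1 --ntasks-per-node=1 \
     --gres=gpu:1 --time=60 --pty /bin/bash
```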
Batch Jobs
These are jobs that require writing a series of commands in a shell script that SLURM will execute on the compute node. Resources can be requested in the script or as options to the sbatch command when submitting the script to the SLURM scheduler.
Sample Scripts for Batch Jobs
Serial Job
#!/bin/bash
# Sample script for submitting a serial job
# on lts partition using 1 core per node
# for 1 hour.
# Use lts Partition (using PBS convention, lts queue)
#SBATCH --partition=lts
# Request 1 hour of computing time
#SBATCH --time=1:00:00
# Request 1 core. Serial jobs cannot use more than 1 core.
# However, if the memory required exceeds RAM/core then request
# more tasks but do not use more than 1 core
# Partition:Max RAM/Core in Gb
# lts: 6.4
# eng/im1080/im1080-gpu: 5.3
# engc: 2.6
#SBATCH --ntasks=1
# Give a name to your job to aid in monitoring
#SBATCH --job-name myjob
# Write Standard Output and Error
#SBATCH --output="myjob.%j.%N.out"
# Notify user at events
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<username>@lehigh.edu
# cd to directory where you submitted the job
# or directory where you want to run the job
cd ${SLURM_SUBMIT_DIR}
# launch job
./myjob < filename.in > filename.out
# Alternatively, you can run your jobs through srun
# However, if your serial job requires more memory than
# that allotted per core and you have requested > 1 core,
# then add -n 1 flag to srun to avoid running multiple copies
srun -n 1 ./myjob < filename.in > filename.out
exit
OpenMP Job
#!/bin/bash
# Sample script for submitting OpenMP job
# on im1080 partition using 1 node, 12 cores per node
# for 1 hour. Users can request up to 20 cores per node
# in the im1080 partition
# Use im1080 Partition (using PBS convention, im1080 queue)
#SBATCH --partition=im1080
# Request 1 hour of computing time
#SBATCH --time=1:00:00
# Request 1 node, OpenMP cannot use more than 1 node
#SBATCH --nodes=1
# Request up to 20 cores on the node
# The im1080 partition has 2 GPUs per node and
# one core is reserved for each GPU
# You can use up to 19 cores on the lts partition and
# up to 22 cores on the eng partition
#SBATCH --ntasks-per-node=12
# Give a name to your job to aid in monitoring
#SBATCH --job-name=myjob
# Write Standard Output and Error
#SBATCH --output="myjob.%j.%N.out"
# Notify user at events
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<username>@lehigh.edu
# Setup Environment for OpenMP
# Specify number of OpenMP Threads
export OMP_NUM_THREADS=12
# cd to directory where you submitted the job
# or directory where you want to run the job
cd /scratch/${SLURM_JOB_USER}/${SLURM_JOB_ID}
# launch job assuming myjob is present at ${SLURM_SUBMIT_DIR}
${SLURM_SUBMIT_DIR}/myjob < ${SLURM_SUBMIT_DIR}/filename.in > filename.out
# Alternatively, you can specify number of OpenMP Threads at launch
OMP_NUM_THREADS=12 ${SLURM_SUBMIT_DIR}/myjob < ${SLURM_SUBMIT_DIR}/filename.in > filename.out
# Copy output file at the end of your job
# For jobs that contain only one output
cp filename.out ${SLURM_SUBMIT_DIR}
# If you are creating multiple output files,
# you can use wildcards or rsync at the end of your job
rsync -avtz * ${SLURM_SUBMIT_DIR}/
exit
MPI Job
#!/bin/bash
# Sample script for submitting MPI job
# on lts partition using 2 nodes, 19 cores per node
# for 1 hour
# Use lts Partition (using PBS convention, lts queue)
#SBATCH --partition=lts
# Request 1 hour of computing time
#SBATCH --time=1:00:00
# Request 2 nodes
#SBATCH --nodes=2
# Request 19 cores per node (the maximum allowed on lts)
#SBATCH --ntasks-per-node=19
# Give a name to your job to aid in monitoring
#SBATCH --job-name myjob
# Write Standard Output and Error
#SBATCH --output="myjob.%j.%N.out"
# Notify user at events
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<username>@lehigh.edu
# By default mvapich2 is loaded on infiniband nodes (i.e. except infolab and hawk partitions)
module load mvapich2
# cd to directory where you submitted the job
# or directory where you want to run the job
cd ${SLURM_SUBMIT_DIR}
# Use SLURM's srun command.
# It contains information about allocated nodes and processors
# and can launch job without the need to specify them
srun ./myjob < filename.in > filename.out
exit
GPU Job
#!/bin/tcsh
#SBATCH --partition=im1080-gpu
# Directives can be combined on one line
#SBATCH --time=1:00:00
#SBATCH --nodes=1
# 1 CPU can be paired with only 1 GPU
# GPU jobs can request all 24 CPUs
#SBATCH --ntasks-per-node=1
# Request one GPU for your workload
#SBATCH --gres=gpu:1
# Need both GPUs, use --gres=gpu:2
#SBATCH --job-name myjob
# Source zlmod.csh script to get LMOD in your path
source /share/Apps/compilers/etc/lmod/zlmod.csh
# Copy input and miscellaneous files to run directory
cp ${SLURM_SUBMIT_DIR}/* .
# Load LAMMPS Module
module load lammps
# Most modules set LOCAL_SCRATCH to /scratch/${SLURM_JOB_USER}/${SLURM_JOB_ID}
# and CEPHFS_SCRATCH to /share/ceph/scratch/${USER}/${SLURM_JOB_ID}
cd ${CEPHFS_SCRATCH}
# Run LAMMPS for input file in.lj
srun $(which lammps) -in in.lj -sf gpu -pk gpu 1 gpuID ${CUDA_VISIBLE_DEVICES} ${CUDA_VISIBLE_DEVICES}
# Copy output back to ${SLURM_SUBMIT_DIR} in a subfolder
cd ${SLURM_SUBMIT_DIR}/
mv /share/ceph/scratch/${USER}/${SLURM_JOB_ID} .
# Note that there is no guarantee which device will be assigned to your job.
# If you use 0 or 1 instead of ${CUDA_VISIBLE_DEVICES}, your jobs will be utilizing
# GPUs assigned to another user
# NAMD: Add "+devices ${CUDA_VISIBLE_DEVICES}" as a command line flag to charmrun
# GROMACS: Add "-gpu_id ${CUDA_VISIBLE_DEVICES}" as a command line flag to mdrun
# If you request both GPUs, then
# LAMMPS: -pk gpu 2 gpuID 0 1
# NAMD: +devices 0,1
# GROMACS: -gpu_id 01
Submitting Jobs
To submit a job, run the command
sbatch slurmjob.sh
sbatch can take command line arguments that would otherwise be added to the submit script. For example, to request a job for 12 hours and 4 nodes on the lts partition:
sbatch --time=12:00:00 --partition=lts --nodes=4 --ntasks-per-node=19 slurmjob.sh
Command line options to sbatch override #SBATCH directives in the submit script.
Submitting Dependency jobs
Suppose you want to run a long simulation that is split into multiple sequential runs to fit within the maximum walltime of a partition. One common method is to create a job submission script for each sequential step, which is either submitted by the previous job or submitted manually when the previous job completes. The former method is not recommended: some systems do not allow job submission from compute nodes (you may encounter the same issue on national resources, since very few systems have queue walltimes larger than 7 days), and if the job runs out of walltime, the subsequent job may never be submitted. With the latter method, you lose valuable time if you are not monitoring your jobs and are not available to submit the next one.
The recommended method is to submit jobs with a dependency attribute for the second and subsequent jobs. On Sol and any system that uses the SLURM job scheduler, dependency jobs are created by adding the --dependency=... flag to the sbatch command.
sbatch --dependency=afterok:<JobID> <Submit Script>
Here, you are submitting a SLURM script <Submit Script> that depends on a previous job with ID <JobID>. Options that can be added to the dependency argument are:
afterok:<JobID> Job will be scheduled to run only if Job <JobID> has completed with no errors
afternotok:<JobID> Job will be scheduled to run only if Job <JobID> has completed with errors
afterany:<JobID> Job will be scheduled to run after Job <JobID> has completed, with or without errors
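Putting this together, a chain of sequential runs can be submitted in one go. This sketch assumes your submit script is named slurmjob.sh and uses sbatch's --parsable flag, which makes sbatch print only the job ID so it can be captured in a variable:

```shell
#!/bin/bash
# Submit the first job and capture its ID
jobid=$(sbatch --parsable slurmjob.sh)
# Submit two more runs, each waiting for the previous to finish cleanly
for step in 2 3; do
    jobid=$(sbatch --parsable --dependency=afterok:${jobid} slurmjob.sh)
done
echo "Final job in the chain: ${jobid}"
```

If an intermediate job fails, the remaining jobs in the chain will stay in the queue; cancel them with scancel.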
Abbreviated Notations
SLURM also accepts abbreviated notation for most sbatch options.
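For instance, using SLURM's standard short options (-p for --partition, -N for --nodes, -t for --time), the earlier submission example can be shortened:

```shell
# Long form
sbatch --partition=lts --nodes=4 --ntasks-per-node=19 --time=12:00:00 slurmjob.sh
# Abbreviated form (--ntasks-per-node has no single-letter abbreviation)
sbatch -p lts -N 4 --ntasks-per-node=19 -t 12:00:00 slurmjob.sh
```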