SLURM


The SLURM scheduler (Simple Linux Utility for Resource Management) manages and allocates all of Sol's compute nodes. All of your computing must be done on Sol's compute nodes. The following is an abbreviated user guide for SLURM. Please visit the SLURM website for more detailed documentation of its tools and capabilities.

Partitions

SLURM uses the term partition instead of queue. There are several partitions available on Sol and Hawk for running jobs:

  • lts : 20-core nodes purchased as part of the original cluster by LTS.
    • Two 2.3GHz 10-core Intel Xeon E5-2650 v3, 25M Cache, 128GB 2133MHz RAM
  • lts-gpu: 1 core per lts node is reserved for launching gpu jobs
  • im1080 : 24-core nodes purchased by Wonpil Im, Department of Biological Sciences. Users can request a max of 20 cores per node.
  • im1080-gpu : 2 cores per im1080 node are reserved for launching gpu jobs.
    • Two 2.3GHz 12-core Intel Xeon E5-2670 v3, 30M Cache, 128GB 2133MHz RAM, Two EVGA Geforce GTX 1080 PCIE 8GB GDDR5
  • eng : 24-core nodes purchased by various RCEAS faculty.
  • eng-gpu : 2 cores per eng node are reserved for launching gpu jobs i.e. 1 core for each gpu.
    • Two 2.3GHz 12-core Intel Xeon E5-2670 v3, 30M Cache, 128GB 2133MHz RAM, EVGA Geforce GTX 1080 PCIE 8GB GDDR5. Four nodes have two cards while other nodes have one card
  • engc : 24-core nodes based on Broadwell CPUs purchased by ChemE Faculty. Users can request a max of 24 cores per node until GPUs are added to these nodes.
    • Two 2.2GHz 12-core Intel Xeon E5-2650 v4, 30M Cache, 64GB 2133MHz RAM
  • himem : 16-core node purchased by Economics Faculty with 512GB RAM.
    • Two 2.6GHz 8-core Intel Xeon E5-2640 v3, 20M Cache, 512GB 2400MHz RAM
    • Users utilizing this node will be charged a higher rate of SU consumption ( 3 SU/core hour). Please evaluate memory consumption of your job before submitting jobs to this partition. If you need to use this partition, please contact Ryan Bradley.
  • enge,engi: 36-core node purchased by MEM faculty and ISE Department
    • Two 2.3GHz 18-core Intel Xeon Gold 6140, 24.75M Cache, 192GB 2666MHz RAM
    • This node features the newer AVX512 vector extension that provides twice the FLOPS of earlier generation Haswell/Broadwell CPUs at the expense of CPU speed.
  • im2080: 36-core nodes purchased by Wonpil Im, Department of Biological Sciences. Users can request a max of 28 cores per node.
  • im2080-gpu : 8 cores per im2080 node are reserved for launching gpu jobs i.e. 2 cores per gpu
    • Two 2.3GHz 18-core Intel Xeon Gold 6140, 24.75M Cache, 192GB 2666MHz RAM, Four ASUS GeForce RTX 2080TI PCIE 11GB GDDR6
  • chem: 36-core Skylake (2) and Cascade Lake (4) nodes purchased by Lisa Fredin, Department of Chemistry
    • (2) Two 2.3GHz 18-core Intel Xeon Gold 6140, 24.75M Cache, 192GB 2666MHz RAM
    • (4) Two 2.6GHz 18-core Intel Xeon Gold 6240, 24.75M Cache, 192GB 2933MHz RAM
  • health: 36-core nodes purchased by the College of Health
    • Two 2.6GHz 18-core Intel Xeon Gold 6240, 24.75M Cache, 192GB 2933MHz RAM
  • hawkcpu: CPU nodes on Hawk
    • Two 2.1GHz 26-core Intel Xeon Gold 6230R, 384GB RAM
  • hawkgpu: GPU nodes on Hawk
    • Two 2.2GHz 24-core Intel Xeon Gold 5220R, 192GB RAM, 8 NVIDIA Tesla T4
  • hawkmem: Big Memory nodes on Hawk
    • Two 2.1GHz 26-core Intel Xeon Gold 6230R, 1536GB RAM
  • infolab: Two 52-core Cascade Lake refresh nodes purchased by Brian Chen, CSE faculty (identical to Hawk CPU nodes)
    • Two 2.1GHz 26-core Intel Xeon Gold 6230R, 384GB RAM
  • pisces: 48-core node with A100 GPUs purchased by Keith Moored, Department of Mechanical Engineering and Mechanics.
    • Two 3.0GHz 24-core Intel Xeon Gold 6248R, 35.75M Cache, 192GB RAM, 5 NVIDIA A100 40GB HBM2 GPUs
      • Each A100 GPU is charged 48SUs/hour. A maximum of 10CPUs can be requested per A100.
  • ima40-gpu: 32-core nodes purchased by Wonpil Im, Department of Biological Sciences. 
    • Two 3.0GHz 16-core AMD EPYC 7302, 128M Cache, 256GB RAM, 8 NVIDIA A40 48GB GDDR6 GPUs
      • Each A40 GPU is charged 24SUs/hour. A maximum of 4CPUs can be requested per A40.

Limitations

Partition            Max Wallclock   Min/Max Cores/Node   Max SUs/Node        Max memory
                     in hours        per Job              consumed per hour   in GB per core
lts                  72              1/19                 19                  6
lts-gpu              72              1/20                 20                  6
im1080               48              1/20                 20                  5
im1080-gpu           48              1/24                 24                  5
eng                  72              1/22                 22                  5
eng-gpu              72              1/24                 24                  5
engc                 72              1/24                 24                  2.5
enge                 72              1/36                 36                  5
engi                 72              1/36                 36                  5
himem                72              1/16                 48                  32
im2080               48              1/28                 28                  5
im2080-gpu           48              1/36                 36                  5
chem                 48              1/36                 36                  5
health               48              1/36                 36                  5
hawkcpu              72              1/52                 52                  7.3
hawkmem              72              1/52                 52                  29.3
hawkgpu              72              1/48                 48                  4.0
infolab              72              1/52                 52                  7.3
pisces (GPU only)    24              1/10                 58                  4.0
ima40-gpu            48              1/4                  28                  8.0
rapids               72              1/64                 64                  8.0

The himem partition is for running high memory jobs i.e. those requiring more than 6GB/core or for using the Artelys Knitro software. Do not submit jobs to the himem partition for running jobs that require lower memory per core. All jobs in the himem partition are charged 3 SUs per core hour of computing irrespective of how many cores or memory you consume.
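As a rough illustration of the charging rule above, the following sketch estimates the SUs a job would consume. The function name su_estimate is hypothetical. The default rate of 1 SU per core hour is an assumption inferred from the limits table, where the maximum SUs per node per hour equals the maximum cores on most partitions; only the himem rate of 3 SUs per core hour is stated explicitly in this guide.

```bash
#!/bin/bash
# Hypothetical helper: estimate SUs as cores x hours x rate.
# The rate defaults to 1 SU per core hour (an assumption); use 3 for himem.
su_estimate() {
    local cores="$1" hours="$2" rate="${3:-1}"
    echo $(( cores * hours * rate ))
}

su_estimate 16 10 3   # 16 himem cores for 10 hours: prints 480
su_estimate 19 10     # 19 lts cores for 10 hours at the assumed rate: prints 190
```

Comparing the two numbers before submitting gives a quick sense of whether a job's memory needs justify the higher himem rate.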

For hawkgpu, request at most 6 CPUs for every GPU you want to use. Single-core workflows will not be allowed on hawkgpu: each job has to take a minimum of 1 GPU with 6 CPUs per GPU, i.e. a minimum of 6 SUs will be consumed per hour. This limit is not implemented in the user friendly phase, so feel free to test how your application scales.
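A submit script header that follows the hawkgpu rule might look like the sketch below. It reuses the --gres=gpu syntax from the GPU sample script in this guide; the one-hour time limit is an arbitrary placeholder, and the guide does not confirm that these exact directives are what the site expects.

```bash
#!/bin/bash
# Sketch only: request 1 GPU together with 6 CPUs on hawkgpu
#SBATCH --partition=hawkgpu
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=6
#SBATCH --gres=gpu:1
# Placeholder time limit, adjust to your workload
#SBATCH --time=1:00:00
```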

Priorities

To ensure investors receive their allocation of resources while still maintaining a shared resource, each investor receives a priority boost on their investment. Every investor, hotel or condo, receives a base priority of 1 on all partitions. A priority boost of 100 is provided to investors and their collaborators on their investment. This ensures that an investor's job will always start before other users' jobs. Jobs accumulate a priority of 1 for each day in the queue. A non-investor's job in a different partition would have to be in the queue for 100 days before it can have a higher priority than an investor's job. Below is a table listing the various investors and the partitions where they have priority. All Hotel investors get priority access on the lts partition.
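The arithmetic behind the 100 day figure can be checked with a small sketch. This is a simplified model of the rule as described in the paragraph above (base priority plus boost plus one point per day in queue), not the scheduler's actual priority calculation; the function name job_priority is hypothetical.

```bash
#!/bin/bash
# Simplified model: priority = base + investor boost + days waited in queue
job_priority() {
    local base="$1" boost="$2" days_in_queue="$3"
    echo $(( base + boost + days_in_queue ))
}

job_priority 1 100 0   # investor job, just submitted: prints 101
job_priority 1 0 100   # non-investor job after 100 days in queue: prints 101
```

Under this model the two jobs tie after 100 days, so the non-investor job only moves ahead once it has waited longer than that.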


Investor                              Partition
Hotel                                 lts
Dimitrios Vavylonis                   lts
Wonpil Im                             im1080, im1080-gpu, im2080, im2080-gpu, ima40-gpu
Anand Jagota                          eng
Brian Chen                            eng, infolab
Edmund Webb III                       eng
Alparslan Oztekin                     eng
Jeetain Mittal                        lts-gpu, eng-gpu
Srinivas Rangarajan                   engc
Seth Richards-Shubik                  himem
Ganesh Balasubramanian                enge
Industrial and Systems Engineering    engi
Lisa Fredin                           chem
Paolo Bocchini                        engc
Hannah Dailey                         enge
Keith Moored                          pisces

   

Current Status

Current status of partitions and load on nodes is updated every 15 mins. Do not bookmark for off campus use, accessible on campus and VPN.

Usage

Usage reports for current and past allocation cycles. Do not bookmark for off campus use, accessible on campus and VPN.

Detailed Annual Reports with consumption of resources by users and research groups. Do not bookmark for off campus use, accessible on campus and VPN. Some pages may take a while to load due to the amount of data reported.


File Systems

There are four distinct file spaces on Sol and Hawk.

  • HOME, your home directory.
  • SCRATCH, scratch storage on the local disk associated with your running job.
  • CEPHFS, global parallel scratch for running jobs with a lifetime of 7 days.
  • CEPH, Ceph project space for research groups that have purchased a minimum 1TB Ceph project

HOME Storage

All users are provided with a 150GB storage quota at /home/username, accessible using the environment variable $HOME. Home storage is a large Ceph project that is not backed up. It is the user's responsibility to maintain backups of their data in $HOME. $HOME directories are not deleted as long as annual user account fees are paid by the HPC PIs.

SCRATCH Storage

SCRATCH provides a 500GB storage on the local disk on the nodes associated with running jobs. This space is not backed up or snapshotted and is deleted when jobs are completed. A user can access this space while running jobs at /scratch/$SLURM_JOB_USER/$SLURM_JOB_ID. Since compute nodes are shared among different users, the available disk space could be less than 500GB. Users who use the SCRATCH space need to make sure that data is copied back at the end of their jobs. Since the scheduler purges the SCRATCH storage at the end of a job, data that hasn't been copied cannot be recovered. See below for a sample script using SCRATCH storage.

All modules define the variable LOCAL_SCRATCH to point to SCRATCH when loaded within your submit script.

Using Local Scratch for MD simulation
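A minimal sketch of a submit script that stages data through SCRATCH is shown below. It is not the original MD sample: the executable myjob, the file names, the results directory name and the helper copy_tree are placeholders introduced for illustration. It assumes the scheduler has already created /scratch/$SLURM_JOB_USER/$SLURM_JOB_ID, as described above.

```bash
#!/bin/bash
#SBATCH --partition=lts
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --time=1:00:00

# Hypothetical helper: copy the contents of directory $1 into directory $2
copy_tree() {
    mkdir -p "$2" && cp -r "$1"/. "$2"/
}

# Run the workflow only inside a SLURM job
if [ -n "${SLURM_JOB_ID:-}" ]; then
    rundir="/scratch/${SLURM_JOB_USER}/${SLURM_JOB_ID}"

    # Stage input files from the submit directory to local scratch
    copy_tree "${SLURM_SUBMIT_DIR}" "${rundir}"
    cd "${rundir}" || exit 1

    # Placeholder executable and file names
    ./myjob < filename.in > filename.out

    # Copy results back before the job ends; SCRATCH is purged afterwards
    copy_tree "${rundir}" "${SLURM_SUBMIT_DIR}/results.${SLURM_JOB_ID}"
fi
```

Copying results into a separate subfolder per job ID keeps repeated runs from overwriting each other; jobs that can hit the walltime limit should copy intermediate results back periodically, since anything left in SCRATCH at the end of the job cannot be recovered.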

CEPHFS global parallel scratch

CEPHFS provides 22TB of global parallel scratch storage. This space is not backed up or snapshotted and all files older than 7 days are deleted. A user can access this space at /share/ceph/scratch/$USER/$SLURM_JOB_ID for running jobs and for 7 days after the job has completed. The SLURM scheduler automatically creates this directory. Users can use this space for writing parallel job output that needs a longer lifetime than that provided by SCRATCH. Since this storage is serviced by SSDs on the Ceph storage cluster, using CEPHFS provides better read/write performance than HOME and CEPH storage spaces. It is the user's responsibility to back up data within 7 days of the job completing.

All modules define the variable CEPHFS_SCRATCH to point to CEPHFS when loaded within your submit script.

CEPH Storage

Lehigh Research Computing provides Ceph projects for research groups that require more storage than the 150GB provided to each HPC account. HPC PIs can add their collaborators to their Ceph project, which can be used as a storage space located at /share/ceph/projectname on Sol. Users should keep in mind that all Ceph projects, including $HOME, are networked file systems and writing job output to these filesystems could affect the performance of your jobs. Ceph projects should be used for storage, and all workloads with intense input and output should use the SCRATCH or CEPHFS global scratch storage.


Running Jobs on Sol and Hawk

You must be allocated at least one compute node by SLURM to run jobs. Running compute intensive workload (i.e. anything other than editing files, submitting and monitoring jobs) on the head/login node is strictly prohibited. Users will need to write a script requesting desired resources from SLURM.

Special Instructions

To run jobs, add one of the following lines to your submit script to load modules that are optimized for the underlying CPU (for debug, enge, chem, im2080, health, hawk and infolab partitions)

source /etc/profile.d/zlmod.sh
#OR
source /share/Apps/compilers/etc/lmod/zlmod.sh

If you use tcsh, then you need to add the following line to add LMOD to your path before loading any modules

source /share/Apps/compilers/etc/lmod/zlmod.csh

There are two types of jobs that can be run on Sol

  1. Interactive Jobs
  2. Batch Jobs


Interactive Jobs

These are jobs that provide an interactive environment or command line prompt on which users can enter commands to run simulations. These are best when used for testing and debugging and are not appropriate for long running production jobs. Resources can be requested using the srun command with at least one option to launch a pseudo terminal, --pty /bin/bash. Other options include partition, number of nodes, tasks per node and time.

Interactive Job on lts partition requesting 1 cpu for 1 hour
srun --partition=lts --nodes=1 --ntasks-per-node=1 --time=60 --pty /bin/bash

When a resource becomes available, SLURM will provide you with a command prompt on the compute node you are allocated. Until a resource is available, you will not be able to use the command prompt in the shell where the above command was executed. If you cancel the command using Ctrl-C, your interactive job request will be cancelled. Depending on how busy the cluster is, your wait could be a few minutes to a few days.

All compute nodes follow the naming convention sol-[a-e][1-6][00-18], e.g. sol-a104. Do not run jobs on the head/login node i.e. sol.

Batch Jobs

These are jobs that require writing a series of commands in a shell script that SLURM will execute on the compute node. Resources can be requested in the script or as options to the sbatch command while submitting the script to the SLURM scheduler.

Sample Scripts for Batch Jobs


Serial Job
#!/bin/bash
# Sample script for submitting a serial job 
# on lts partition using 1 core per node
#  for 1 hour. 

# Use lts Partition (using PBS convention, lts queue)
#SBATCH --partition=lts

# Request 1 hour of computing time
#SBATCH --time=1:00:00

# Request 1 core. Serial jobs cannot use more than 1 core.
# However, if the memory required exceeds RAM/core then request
# more tasks but do not use more than 1 core
# Partition: Max RAM/Core in GB
# lts: 6.4
# eng/im1080/im1080-gpu: 5.3
# engc: 2.6
#SBATCH --ntasks=1


# Give a name to your job to aid in monitoring
#SBATCH --job-name myjob

# Write Standard Output and Error
#SBATCH --output="myjob.%j.%N.out"

# Notify user at events
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<username>@lehigh.edu


# cd to directory where you submitted the job
# or directory where you want to run the job
cd ${SLURM_SUBMIT_DIR}

# launch job
./myjob < filename.in > filename.out

# Alternatively, you can run your job through srun
# (use this instead of the launch line above, not in addition to it).
# If your serial job requires more memory than 
# that allotted per core and you have requested > 1 core,
# then add -n 1 flag to srun to avoid running multiple copies
# srun -n 1 ./myjob < filename.in > filename.out


exit
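The sizing rule in the comments of the serial script above (request more tasks when a serial job needs more memory than one core provides) amounts to dividing the memory needed by the memory per core and rounding up. A small sketch follows; the function name cores_for_memory is hypothetical, and the per-core value should be taken from the script comments or the Limitations table.

```bash
#!/bin/bash
# Hypothetical helper: number of cores to request so that
# cores x memory-per-core covers the memory the job needs.
# Usage: cores_for_memory <GB needed> <GB per core>
cores_for_memory() {
    awk -v need="$1" -v per_core="$2" 'BEGIN {
        n = int(need / per_core)        # whole cores that fit
        if (n * per_core < need) n++    # round up for the remainder
        if (n < 1) n = 1                # always request at least 1 core
        print n
    }'
}

cores_for_memory 20 6.4   # a 20GB serial job on lts: prints 4
```

The result would be used as the value of --ntasks, while the job itself still runs on a single core.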
OpenMP Job
#!/bin/bash
# Sample script for submitting OpenMP job 
# on im1080 partition using 1 node, 12 cores per node
#  for 1 hour. Users can request up to 20 cores per node
#  in the im1080 partition

# Use im1080 Partition (using PBS convention, im1080 queue)
#SBATCH --partition=im1080

# Request 1 hour of computing time
#SBATCH --time=1:00:00

# Request 1 node, OpenMP cannot use more than 1 node
#SBATCH --nodes=1

# Request up to 20 cores on the node
# The im1080 partition has 2 GPUs per node and 
#   one core is reserved for each GPU
# You can use up to 19 cores on the lts partition and
#   up to 22 cores on the eng partition  
#SBATCH --ntasks-per-node=12


# Give a name to your job to aid in monitoring
#SBATCH --job-name=myjob

# Write Standard Output and Error
#SBATCH --output="myjob.%j.%N.out"

# Notify user at events
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<username>@lehigh.edu

# Setup Environment for OpenMP
# Specify number of OpenMP Threads
export OMP_NUM_THREADS=12

# cd to the local scratch directory created for this job
# (SCRATCH is deleted when the job ends, so output is copied back below)
cd /scratch/${SLURM_JOB_USER}/${SLURM_JOB_ID}

# launch job assuming myjob is present at ${SLURM_SUBMIT_DIR}
${SLURM_SUBMIT_DIR}/myjob < ${SLURM_SUBMIT_DIR}/filename.in > filename.out


# Alternatively, you can specify number of OpenMP Threads at launch
# (use this instead of the launch line above, not in addition to it)
# OMP_NUM_THREADS=12 ${SLURM_SUBMIT_DIR}/myjob < ${SLURM_SUBMIT_DIR}/filename.in > filename.out


# Copy output file at the end of your job
# For jobs that contain only one output
cp filename.out ${SLURM_SUBMIT_DIR}
# If you are creating multiple output files, 
# you can use wildcards or rsync at the end of your job
rsync -avtz * ${SLURM_SUBMIT_DIR}/


exit
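To avoid keeping OMP_NUM_THREADS and --ntasks-per-node in sync by hand, the thread count can be derived from the environment that SLURM provides. The sketch below assumes the job was submitted with --ntasks-per-node, as in the script above, in which case SLURM sets SLURM_NTASKS_PER_NODE; the function name set_omp_threads is hypothetical.

```bash
#!/bin/bash
# Hypothetical helper: set the OpenMP thread count from the SLURM request.
# SLURM_NTASKS_PER_NODE is only set when --ntasks-per-node was requested,
# so fall back to 1 thread otherwise (e.g. when testing outside a job).
set_omp_threads() {
    export OMP_NUM_THREADS="${SLURM_NTASKS_PER_NODE:-1}"
}

set_omp_threads
echo "Running with ${OMP_NUM_THREADS} OpenMP threads"
```

With this in place, changing the number of cores requested on the command line (for example through sbatch --ntasks-per-node=20) changes the thread count without editing the script.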
MPI Job
#!/bin/bash
# Sample script for submitting MPI job 
# on lts partition using 2 nodes, 19 cores per node
#  for 1 hour

# Use lts Partition (using PBS convention, lts queue)
#SBATCH --partition=lts

# Request 1 hour of computing time
#SBATCH --time=1:00:00

# Request 2 nodes
#SBATCH --nodes=2

# Request 19 cores per node, the maximum allowed on the lts partition
#SBATCH --ntasks-per-node=19

# Give a name to your job to aid in monitoring
#SBATCH --job-name myjob

# Write Standard Output and Error
#SBATCH --output="myjob.%j.%N.out"

# Notify user at events
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<username>@lehigh.edu

# By default mvapich2 is loaded on infiniband nodes (i.e. except infolab and hawk partitions)
module load mvapich2

# cd to directory where you submitted the job
# or directory where you want to run the job
cd ${SLURM_SUBMIT_DIR}

# Use SLURM's srun command. 
# It contains information about allocated nodes and processors
# and can launch job without the need to specify them
srun ./myjob < filename.in > filename.out


exit
GPU Job
#!/bin/tcsh
#SBATCH --partition=im1080-gpu
# Directives can be combined on one line
#SBATCH --time=1:00:00
#SBATCH --nodes=1
# 1 CPU can be paired with only 1 GPU
# GPU jobs can request all 24 CPUs
#SBATCH --ntasks-per-node=1
# Request one GPU for your workload
#SBATCH --gres=gpu:1
# Need both GPUs, use --gres=gpu:2
#SBATCH --job-name myjob

# Source zlmod.csh script to get LMOD in your path
source /share/Apps/compilers/etc/lmod/zlmod.csh

# Load LAMMPS Module
module load lammps 

# Most modules set LOCAL_SCRATCH to /scratch/${SLURM_JOB_USER}/${SLURM_JOB_ID}
# and CEPHFS_SCRATCH to /share/ceph/scratch/${USER}/${SLURM_JOB_ID}
cd ${CEPHFS_SCRATCH} 
# Copy input and miscellaneous files to run directory
cp ${SLURM_SUBMIT_DIR}/* .
# Run LAMMPS for input file in.lj
# (tcsh uses backticks for command substitution; $(...) is bash-only)
srun `which lammps` -in in.lj -sf gpu -pk gpu 1 gpuID ${CUDA_VISIBLE_DEVICES} ${CUDA_VISIBLE_DEVICES}

# Copy output back to ${SLURM_SUBMIT_DIR} in a subfolder
cd ${SLURM_SUBMIT_DIR}/
mv /share/ceph/scratch/${USER}/${SLURM_JOB_ID} .

# Note that there is no guarantee which device will be assigned to your job.
# If you use 0 or 1 instead of ${CUDA_VISIBLE_DEVICES}, your jobs may be utilizing
#  GPUs assigned to another user
# NAMD: Add "+devices ${CUDA_VISIBLE_DEVICES}" as a command line flag to charmrun
# GROMACS: Add "-gpu_id ${CUDA_VISIBLE_DEVICES}" as a command line flag to mdrun
# If you request both GPUs, then
# LAMMPS: -pk gpu 2 gpuID 0 1
# NAMD: +devices 0,1
# GROMACS: -gpu_id 01

Submitting Jobs

To submit a job, run the command 
sbatch slurmjob.sh
sbatch can take command line arguments that would otherwise be added to the submit script. For example, to request a job for 12 hours and 4 nodes on the lts partition
sbatch --time=12:00:00 --partition=lts --nodes=4 --ntasks-per-node=19 slurmjob.sh

Command line options to sbatch override #SBATCH commands in the submit script.


Submitting Dependency jobs

Suppose you want to run a long simulation that is split into multiple sequential runs to fit within the maximum walltimes of the partitions. One common method is to create a job submission script for each of the sequential steps, which is either submitted by the previous job or submitted manually when the previous job is complete. The former method is not recommended: some systems do not allow job submission from the compute nodes (you might encounter the same issue on national resources, as very few systems have queue walltimes larger than 7 days), and if you run out of walltime, the subsequent job may not be submitted. With the latter method, you lose valuable time if you are not monitoring your jobs and are not available to submit the subsequent job.

The recommended method is to submit jobs with a dependency attribute for the second and subsequent jobs. On Sol and any system that uses the SLURM job scheduler, dependency jobs are created by adding the --dependency=... flag to the sbatch command.

sbatch --dependency=afterok:<JobID> <Submit Script>

Here, you are submitting a SLURM script <Submit Script> that depends on a previous job with ID <JobID>. Options that can be added to the dependency argument are

  • afterok:<JobID> Job will be scheduled to run only if Job <JobID> has completed with no errors
  • afternotok:<JobID> Job will be scheduled to run only if Job <JobID> has completed with errors
  • afterany:<JobID> Job will be scheduled to run after Job <JobID> has completed, with or without errors
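A chain of sequential runs can be submitted in one go by capturing each job ID and passing it to the next submission. The sketch below is one way to do this, not a site-provided tool: the function name submit_chain is hypothetical, and it relies on the --parsable option of sbatch, which prints only the job ID (optionally followed by a semicolon and the cluster name).

```bash
#!/bin/bash
# Hypothetical helper: submit <steps> copies of <submit script>, each one
# starting only after the previous job has completed without errors.
# Usage: submit_chain <submit script> <number of steps>
submit_chain() {
    local script="$1" steps="$2"
    local jobid="" i
    for ((i = 1; i <= steps; i++)); do
        if [ -z "$jobid" ]; then
            # The first step has no dependency
            jobid=$(sbatch --parsable "$script") || return 1
        else
            jobid=$(sbatch --parsable --dependency=afterok:"$jobid" "$script") || return 1
        fi
        # --parsable prints "jobid" or "jobid;cluster"; keep only the job ID
        jobid="${jobid%%;*}"
        echo "step $i: job $jobid"
    done
}

# Example (run on a login node): submit_chain slurmjob.sh 3
```

Using afterok means a failed step leaves the later steps pending with an unsatisfied dependency, so they should be cancelled or resubmitted once the failure is understood.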

Abbreviated Notations

SLURM also accepts abbreviated notation for sbatch command

Long Format                 Short Format
--partition=name            -p name
--time=mm:ss                -t mm:ss
--nodes=number              -N number
--ntasks=total procs        -n total procs
--dependency=attributes     -d attributes

Monitoring Jobs

SLURM provides various tools for monitoring and manipulating jobs

Check queue status

squeue <Options>


Options

  • -u <username>: show status of all jobs for a particular user
  • -j <jobid>: show status for jobid
  • -l: show long format of queue status
  • -p <name>: show status of all jobs in partition name
  • --start: show estimated start time (squeue's -s option lists job steps instead)

Use --help option to see a full list of allowed options and usage

checkq is a script accessible through the soltools modules which provides squeue with some useful defaults and can accept the above options.

Cancel/delete a job

You can delete only your own jobs, whether they are in queue or already running

scancel <jobid>

Manipulate Jobs in Queue

A user or admin can manipulate jobs that are in queue i.e. not running yet.

Hold a job
scontrol hold <jobid>
Release a held job
scontrol release <jobid>

You can only release jobs that you have held. If an admin has held your job, only the admin can release it.

Show job details
scontrol show job <jobid>
Modify a job after submission
scontrol update SPECIFICATION jobid=<jobid>


Examples of SPECIFICATION are

  • add dependency after a job has been submitted: dependency=<attributes>
  • change job name: jobname=<name>
  • change partition: partition=<name>
  • modify requested runtime: timelimit=<hh:mm:ss>
  • request gpus (when changing to one of the gpu partitions): gres=gpu:<1,2,3 or 4>
SPECIFICATIONs can be combined, e.g. the command to move queued job 123456 to the im1080 partition and change its timelimit to 48 hours is
scontrol update partition=im1080 timelimit=48:00:00 jobid=123456


Monitoring Queues

Display queue/partition names, runtimes and available nodes
alp514.sol(511): sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
lts* up 3-00:00:00 9 idle sol-a101-109
im1080 up 2-00:00:00 24 alloc sol-b401-413,501-511
im1080 up 2-00:00:00 1 idle sol-b512
Display runtimes and available nodes for a particular queue/partition
alp514.sol(512): sinfo -p lts,im1080
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
lts* up 3-00:00:00 9 idle sol-a101-109
im1080 up 2-00:00:00 24 alloc sol-b401-413,501-511
im1080 up 2-00:00:00 1 idle sol-b512
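The sinfo output above can be summarized with standard text tools. The sketch below counts idle nodes per partition; the function name idle_nodes is hypothetical, and it assumes the default sinfo column layout shown above (PARTITION, AVAIL, TIMELIMIT, NODES, STATE, NODELIST).

```bash
#!/bin/bash
# Hypothetical helper: read sinfo output on stdin and print
# "<partition> <number of idle nodes>" for each partition with idle nodes.
idle_nodes() {
    awk 'NR > 1 && $5 == "idle" {
             sub(/\*$/, "", $1)      # drop the "*" marking the default partition
             idle[$1] += $4          # column 4 is the node count
         }
         END { for (p in idle) print p, idle[p] }' | sort
}

# Example (on the cluster): sinfo | idle_nodes
```

For the sample output shown above, this would report 1 idle node for im1080 and 9 for lts.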


checkload is a script accessible through the soltools modules which provides sinfo with some useful defaults and can accept the above options.

Click Here for status of Sol partitions - updated every 15 mins, accessible at Lehigh and VPN only. This page is generated from output of checkq and checkload for partition status and node usage respectively.