SLURM
The SLURM scheduler (Simple Linux Utility for Resource Management) manages and allocates all of Sol's compute nodes. All of your computing must be done on Sol's compute nodes. The following is an abbreviated user guide for SLURM. Please visit the SLURM website for more detailed documentation of its tools and capabilities.
Partitions
SLURM uses the term partition instead of queue. There are several partitions available on Sol and Hawk for running jobs:
- lts : 20-core nodes purchased as part of the original cluster by LTS.
- Two 2.3GHz 10-core Intel Xeon E5-2650 v3, 25M Cache, 128GB 2133MHz RAM
- lts-gpu: 1 core per lts node is reserved for launching gpu jobs
- im1080 : 24-core nodes purchased by Wonpil Im, Department of Biological Sciences. Users can request a max of 20 cores per node.
- im1080-gpu : 2 cores per im1080 node are reserved for launching gpu jobs.
- Two 2.3GHz 12-core Intel Xeon E5-2670 v3, 30M Cache, 128GB 2133MHz RAM, Two EVGA Geforce GTX 1080 PCIE 8GB GDDR5
- eng : 24-core nodes purchased by various RCEAS faculty.
- eng-gpu : 2 cores per eng node are reserved for launching gpu jobs, i.e. 1 core for each gpu.
- Two 2.3GHz 12-core Intel Xeon E5-2670 v3, 30M Cache, 128GB 2133MHz RAM, EVGA Geforce GTX 1080 PCIE 8GB GDDR5. Four nodes have two cards while other nodes have one card
- engc : 24-core nodes based on Broadwell CPUs purchased by ChemE Faculty. Users can request a max of 24 cores per node until GPUs are added to these nodes.
- Two 2.2GHz 12-core Intel Xeon E5-2650 v4, 30M Cache, 64GB 2133MHz RAM
- himem : 16-core node purchased by Economics Faculty with 512GB RAM.
- Two 2.6GHz 8-core Intel Xeon E5-2640 v3, 20M Cache, 512GB 2400MHz RAM
- Users utilizing this node will be charged a higher rate of SU consumption (3 SUs per core-hour). Please evaluate the memory consumption of your job before submitting it to this partition. If you need to use this partition, please contact Ryan Bradley.
- enge,engi: 36-core node purchased by MEM faculty and ISE Department
- Two 2.3GHz 18-core Intel Xeon Gold 6140, 24.75M Cache, 192GB 2666MHz RAM
- This node features the newer AVX512 vector extension that provides twice the FLOPS of earlier generation Haswell/Broadwell CPUs at the expense of CPU speed.
- im2080: 36-core nodes purchased by Wonpil Im, Department of Biological Sciences. Users can request a max of 28 cores per node.
- im2080-gpu : 8 cores per im2080 node are reserved for launching gpu jobs, i.e. 2 cores per gpu.
- Two 2.3GHz 18-core Intel Xeon Gold 6140, 24.75M Cache, 192GB 2666MHz RAM, Four ASUS GeForce RTX 2080TI PCIE 11GB GDDR6
- chem: 36-core Skylake (2) and Cascade Lake (4) nodes purchased by Lisa Fredin, Department of Chemistry
- (2) Two 2.3GHz 18-core Intel Xeon Gold 6140, 24.75M Cache, 192GB 2666MHz RAM
- (4) Two 2.6GHz 18-core Intel Xeon Gold 6240, 24.75M Cache, 192GB 2933MHz RAM
- health: 36-core nodes purchased by the College of Health
- Two 2.6GHz 18-core Intel Xeon Gold 6240, 24.75M Cache, 192GB 2933MHz RAM
- hawkcpu: CPU nodes on Hawk
- Two 2.1GHz 26-core Intel Xeon Gold 6230R, 384GB RAM
- hawkgpu: GPU nodes on Hawk
- Two 2.2GHz 24-core Intel Xeon Gold 5220R, 192GB RAM, 8 NVIDIA Tesla T4
- hawkmem: Big Memory nodes on Hawk
- Two 2.1GHz 26-core Intel Xeon Gold 6230R, 1536GB RAM
- infolab: Two 52-core Cascade Lake refresh nodes purchased by Brian Chen, CSE faculty (identical to Hawk CPU nodes)
- Two 2.1GHz 26-core Intel Xeon Gold 6230R, 384GB RAM
- pisces: 48-core node with A100 GPUs purchased by Keith Moored, Department of Mechanical Engineering and Mechanics.
- Two 3.0GHz 24-core Intel Xeon Gold 6248R, 35.75M Cache, 192GB RAM, 5 NVIDIA A100 40GB HBM2 GPUs
- Each A100 GPU is charged 48 SUs/hour. A maximum of 10 CPUs can be requested per A100.
- ima40-gpu: 32-core nodes purchased by Wonpil Im, Department of Biological Sciences.
- Two 3.0GHz 16-core AMD EPYC 7302, 128M Cache, 256GB RAM, 8 NVIDIA A40 48GB GDDR6 GPUs
- Each A40 GPU is charged 24 SUs/hour. A maximum of 4 CPUs can be requested per A40.
Limitations
Partition | Max Wallclock in hours | Min/Max Cores/Node per Job | Max SUs/Node consumed per hour | Max memory in GB per core |
---|---|---|---|---|
lts | 72 | 1/19 | 19 | 6 |
lts-gpu | 72 | 1/20 | 20 | 6 |
im1080 | 48 | 1/20 | 20 | 5 |
im1080-gpu | 48 | 1/24 | 24 | 5 |
eng | 72 | 1/22 | 22 | 5 |
eng-gpu | 72 | 1/24 | 24 | 5 |
engc | 72 | 1/24 | 24 | 2.5 |
enge | 72 | 1/36 | 36 | 5 |
engi | 72 | 1/36 | 36 | 5 |
himem | 72 | 1/16 | 48 | 32 |
im2080 | 48 | 1/28 | 28 | 5 |
im2080-gpu | 48 | 1/36 | 36 | 5 |
chem | 48 | 1/36 | 36 | 5 |
health | 48 | 1/36 | 36 | 5 |
hawkcpu | 72 | 1/52 | 52 | 7.3 |
hawkmem | 72 | 1/52 | 52 | 29.3 |
hawkgpu | 72 | 1/48 | 48 | 4.0 |
infolab | 72 | 1/52 | 52 | 7.3 |
pisces (GPU only) | 24 | 1/10 | 58 | 4.0 |
ima40-gpu | 48 | 1/4 | 28 | 8.0 |
rapids | 72 | 1/64 | 64 | 8.0 |
The himem partition is for running high-memory jobs, i.e. those requiring more than 6GB/core, or for using the Artelys Knitro software. Do not submit jobs that need less memory per core to the himem partition. All jobs in the himem partition are charged 3 SUs per core-hour of computing, irrespective of how many cores or how much memory you consume.
For hawkgpu, ideally request a max of 6 CPUs for every GPU you want to consume. We will not be allowing single-core workflows on hawkgpu: you have to take a minimum of 1 GPU with 6 CPUs per GPU, i.e. a minimum of 6 SUs will be consumed per hour. This is not enforced during the user-friendly phase, so feel free to test how your application scales.
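As a rough illustration of the charging rules above, the sketch below estimates SU consumption for a job. The function name `estimate_su` is hypothetical (it is not a SLURM or soltools command), and it assumes a charge of cores × hours × rate, with a rate of 1 SU per core-hour unless stated otherwise (3 on himem). GPU partitions such as pisces and ima40-gpu are charged per GPU and are not covered here.

```shell
#!/bin/sh
# Hypothetical helper: estimate the SUs consumed by a job.
# Assumption: SUs = cores * hours * rate, where rate is SUs per core-hour.
# Integer arithmetic only, so hours must be a whole number.
estimate_su() {
    local cores="$1" hours="$2" rate="${3:-1}"
    echo $(( cores * hours * rate ))
}

# A 16-core, 24-hour job on himem (3 SUs per core-hour)
estimate_su 16 24 3

# The hawkgpu minimum request: 1 GPU with 6 CPUs for 1 hour
estimate_su 6 1
```

The two calls print 1152 and 6 respectively.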
Priorities
To ensure investors receive their allocation of resources while still maintaining a shared resource, each investor receives a priority boost on their investment. Every investor, hotel or condo, receives a base priority of 1 on all partitions. A priority boost of 100 is provided to investors and their collaborators on their investment. This ensures that an investor's job will always start before other users' jobs. Jobs accumulate a priority of 1 for each day in the queue. A non-investor's job in a different partition would have to be in the queue for 100 days before it could have a higher priority than an investor's job. Below is a table listing the various investors and the partitions where they have priority. All Hotel investors get priority access on the lts partition.
Investor | Partition |
---|---|
Hotel | lts |
Dimitrios Vavylonis | lts |
Wonpil Im | im1080, im1080-gpu, im2080, im2080-gpu, ima40-gpu |
Anand Jagota | eng |
Brian Chen | eng, infolab |
Edmund Webb III | eng |
Alparslan Oztekin | eng |
Jeetain Mittal | lts-gpu, eng-gpu |
Srinivas Rangarajan | engc |
Seth Richards-Shubik | himem |
Ganesh Balasubramanian | enge |
Industrial and Systems Engineering | engi |
Lisa Fredin | chem |
Paolo Bocchini | engc |
Hannah Dailey | enge |
Keith Moored | pisces |
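The priority arithmetic described above can be sketched as follows. This is a simplified model of the text only (base priority 1, a boost of 100 on the investor's own partition, plus 1 per day queued); the function `job_priority` is hypothetical, and the scheduler's real priority calculation may include other factors.

```shell
#!/bin/sh
# Hypothetical illustration of the priority rule; not a SLURM command.
# priority = base (1) + investor boost (100 on own partition, else 0)
#            + days spent in the queue
job_priority() {
    local is_investor="$1" days_in_queue="$2"
    local boost=0
    if [ "$is_investor" = "yes" ]; then
        boost=100
    fi
    echo $(( 1 + boost + days_in_queue ))
}

job_priority yes 0     # investor job, just submitted
job_priority no 100    # non-investor job after 100 days in the queue
```

Under this model both calls print 101: a non-investor job only draws level with a freshly submitted investor job after 100 days in the queue.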
Current Status
The current status of partitions and the load on nodes are updated every 15 minutes. These pages are accessible only on campus or over VPN, so do not bookmark them for off-campus use.
- lts
- im1080
- im1080-gpu
- eng
- eng-gpu
- engc
- himem
- enge
- engi
- im2080
- im2080-gpu
- chem
- health
- hawkcpu
- hawkgpu
- hawkmem
- infolab
- pisces
- ima40-gpu
Usage
Usage reports for current and past allocation cycles. These pages are accessible only on campus or over VPN, so do not bookmark them for off-campus use.
- Last 2 weeks
- Current Month
- Previous Month
- Allocation Year 2021-22 Report
- Allocation Year 2020-21 Report
- Allocation Year 2019-20 Report
- Allocation Year 2018-19 Report
- Allocation Year 2017-18 Report
- Allocation Year 2016-17 Report
Detailed Annual Reports with consumption of resources by users and research groups. These pages are accessible only on campus or over VPN, so do not bookmark them for off-campus use. Some pages may take a while to load due to the amount of data reported.
- Allocation Year 2021-22 Report
- Allocation Year 2020-21 Report
- Allocation Year 2019-20 Report
- Allocation Year 2018-19 Report
- Allocation Year 2017-18 Report
- Allocation Year 2016-17 Report
File Systems
There are four distinct file spaces on Sol and Hawk.
- HOME, your home directory.
- SCRATCH, scratch storage on the local disk associated with your running job.
- CEPHFS, global parallel scratch for running jobs with a lifetime of 7 days.
- CEPH, Ceph project space for research groups that have purchased a minimum 1TB Ceph project
HOME Storage
All users are provided with a 150GB storage quota at /home/username, accessible using the environment variable $HOME. Home storage is a large Ceph project that is not backed up. It is the user's responsibility to maintain backups of their data in $HOME. $HOME directories are not deleted as long as annual user account fees are paid by the HPC PIs.
SCRATCH Storage
SCRATCH provides 500GB of storage on the local disk of the nodes associated with running jobs. This space is not backed up or snapshotted and is deleted when jobs are completed. A user can access this space while running jobs at /scratch/$SLURM_JOB_USER/$SLURM_JOB_ID. Since compute nodes are shared among different users, the available disk space could be less than 500GB. Users who use the SCRATCH space need to make sure that data is copied back at the end of their jobs. Since the scheduler purges the SCRATCH storage at the end of a job, data that hasn't been copied cannot be recovered. See below for a sample script using SCRATCH storage.
All modules define the variable LOCAL_SCRATCH to point to SCRATCH when loaded within your submit script.
Using Local Scratch for MD simulation
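A minimal sketch of a submit script that works in SCRATCH and copies results back before the job ends. It assumes the partition and resource values shown, and the "simulation" is a placeholder `echo`; the helper names `stage_out` and `run_in_scratch` and the `results-<jobid>` directory are hypothetical. `SLURM_SUBMIT_DIR` is the directory from which the job was submitted.

```shell
#!/bin/sh
#SBATCH --partition=lts
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --time=01:00:00
#SBATCH --job-name=scratch-demo

# Copy everything in the scratch directory to a permanent location.
stage_out() {
    local from="$1" to="$2"
    mkdir -p "$to" && cp -r "$from"/. "$to"/
}

# Run the workload inside scratch, then copy the results back.
run_in_scratch() {
    local scratch="$1" dest="$2"
    (
        cd "$scratch" || exit 1
        # Placeholder for the real simulation command (assumption)
        echo "simulation output" > result.dat
    ) || return 1
    stage_out "$scratch" "$dest"
}

# Only act when running inside a SLURM job.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    # LOCAL_SCRATCH is defined once a module is loaded; otherwise fall
    # back to the documented per-job path.
    workdir="${LOCAL_SCRATCH:-/scratch/$SLURM_JOB_USER/$SLURM_JOB_ID}"
    run_in_scratch "$workdir" "$SLURM_SUBMIT_DIR/results-$SLURM_JOB_ID"
fi
```

The copy-back step runs as the last command of the script, so request enough walltime for both the computation and the copy; data left in SCRATCH when the job ends cannot be recovered.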
CEPHFS global parallel scratch
CEPHFS provides 22TB of global parallel scratch storage. This space is not backed up or snapshotted, and all files older than 7 days are deleted. A user can access this space at /share/ceph/scratch/$USER/$SLURM_JOB_ID for running jobs and for 7 days after the job has completed. The SLURM scheduler automatically creates this directory. Users can use this space for writing parallel job output that needs a longer lifetime than that provided by SCRATCH. Since this storage is serviced by SSDs on the Ceph storage cluster, using CEPHFS provides better read/write performance than the HOME and CEPH storage spaces. It is the user's responsibility to back up data within 7 days of the job completing.
All modules define the variable CEPHFS_SCRATCH to point to CEPHFS when loaded within your submit script.
CEPH Storage
Lehigh Research Computing provides Ceph projects for research groups that require more storage than the 150GB provided to each HPC account. HPC PIs can add their collaborators to their Ceph project, which can be used as a storage space located at /share/ceph/projectname on Sol. Users should keep in mind that all Ceph projects, including $HOME, are networked file systems, and writing job output to these file systems could affect the performance of your jobs. Ceph projects should be used for storage, and all workloads with intense input and output should use the SCRATCH or CEPHFS global scratch storage.
Running Jobs on Sol and Hawk
You must be allocated at least one compute node by SLURM to run jobs. Running compute-intensive workloads (i.e. anything other than editing files, submitting and monitoring jobs) on the head/login node is strictly prohibited. Users will need to write a script requesting the desired resources from SLURM.
Special Instructions
To run jobs, add one of the following lines to your submit script to load modules that are optimized for the underlying CPU (for debug, enge, chem, im2080, health, hawk and infolab partitions)
source /etc/profile.d/zlmod.sh #OR source /share/Apps/compilers/etc/lmod/zlmod.sh
If you use tcsh, then you need to add the following line to add LMOD to your path before loading any modules
source /share/Apps/compilers/etc/lmod/zlmod.csh
There are two types of jobs that can be run on Sol:
- Interactive Jobs
- Batch Jobs
Interactive Jobs
These are jobs that provide an interactive environment or command line prompt on which users can enter commands to run simulations. They are best used for testing and debugging and are not appropriate for long-running production jobs. Resources can be requested using the srun command with at least one option to launch a pseudo terminal, --pty /bin/bash. Other options include partition, number of nodes, tasks per node and time:
srun --partition=lts --nodes=1 --ntasks-per-node=1 --time=60 --pty /bin/bash
When a resource becomes available, SLURM will provide you with a command prompt on the compute node you are allocated. Until a resource is available, the shell where you executed the above command will not return a prompt. If you cancel the command using Ctrl-C, your interactive job request will be cancelled. Depending on how busy the cluster is, your wait could be anywhere from a few minutes to a few days.
All compute nodes follow the naming convention sol-[a-e][1-6][00-18], e.g. sol-a104. Do not run jobs on the head/login node, i.e. sol.
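One way to guard against accidentally running work on the login node is to check the hostname against the naming convention above before starting. The helper `is_compute_node` below is a hypothetical sketch, not a Sol tool; it covers only the sol-[a-e][1-6][00-18] pattern, so Hawk nodes or nodes named differently would need their own patterns.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a hostname follows the compute-node
# naming convention sol-[a-e][1-6][00-18], e.g. sol-a104.
is_compute_node() {
    case "$1" in
        sol-[a-e][1-6]0[0-9] | sol-[a-e][1-6]1[0-8]) return 0 ;;
        *) return 1 ;;
    esac
}

# Short hostname of the current machine (strip any domain suffix)
host=$(uname -n)
host=${host%%.*}

if is_compute_node "$host"; then
    echo "$host looks like a Sol compute node"
else
    echo "$host is not a Sol compute node; do not run compute-intensive work here"
fi
```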
Batch Jobs
These are jobs that require writing a series of commands in a shell script that SLURM will execute on the compute node. Resources can be requested in the script or as options to the sbatch command while submitting the script to the SLURM scheduler.
Sample Scripts for Batch Jobs
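A minimal sketch of a submit script, assuming a single-node job on the lts partition. The job name, output file name and resource values are placeholders to adapt, and the commented-out `srun ./my_program` stands in for your real workload. The defaults on the SLURM variables only exist so the script can be dry-run outside a job.

```shell
#!/bin/sh
#SBATCH --partition=lts
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --time=01:00:00
#SBATCH --job-name=myjob
# %j in the output file name expands to the job ID
#SBATCH --output=myjob-%j.out

# Make the module system available (path from the Special Instructions)
if [ -r /etc/profile.d/zlmod.sh ]; then
    . /etc/profile.d/zlmod.sh
fi

# SLURM sets these variables inside a job; the fallbacks let the script
# run outside SLURM for a quick check.
ntasks=${SLURM_NTASKS:-1}
submit_dir=${SLURM_SUBMIT_DIR:-$PWD}

cd "$submit_dir" || exit 1
echo "Job ${SLURM_JOB_ID:-none} running $ntasks task(s) in $submit_dir"

# Placeholder for the real workload (assumption), e.g. an MPI program:
# srun ./my_program
```

Lines beginning with `#SBATCH` are read by sbatch as resource requests; to the shell they are ordinary comments.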
Submitting Jobs
sbatch slurmjob.sh
sbatch --time=12:00:00 --partition=lts --nodes=4 --ntasks-per-node=19 slurmjob.sh
Command line options to sbatch override #SBATCH commands in the submit script.
Submitting Dependency jobs
Suppose you want to run a long simulation that is split into multiple sequential runs to fit within the maximum walltimes of the partitions. One common method is to create a job submission script for each sequential step, which is either submitted by the previous job or submitted manually when the previous job is complete. The former method is not recommended: some systems do not allow job submission from the compute nodes (you might encounter the same issue on national resources, as very few systems have queue walltimes larger than 7 days), and if a job runs out of walltime, the subsequent job may never be submitted. With the latter method, you lose valuable time if you are not monitoring your jobs and are not available to submit the subsequent job.
The recommended method is to submit jobs with a dependency attribute for the second and subsequent jobs. On Sol and any system that uses the SLURM job scheduler, dependency jobs are created by adding the --dependency=... flag to the sbatch command.
sbatch --dependency=afterok:<JobID> <Submit Script>
Here, you are submitting a SLURM script <Submit Script> that depends on a previous job with ID <JobID>. Options that can be added to the dependency argument are
- afterok:<JobID> Job will be scheduled to run only if Job <JobID> has completed with no errors
- afternotok:<JobID> Job will be scheduled to run only if Job <JobID> has completed with errors
- afterany:<JobID> Job will be scheduled to run after Job <JobID> has completed, with or without errors
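The dependency options above can be combined into a small loop that submits a whole chain at once. This is a sketch, not a site-provided tool: `submit_chain` is a hypothetical function and `step1.sh` etc. are placeholder script names. It relies on sbatch's `--parsable` option, which prints just the job ID (optionally followed by `;cluster`) instead of the usual "Submitted batch job" message.

```shell
#!/bin/sh
# Hypothetical helper: submit scripts as a chain in which each job starts
# only after the previous one has completed without errors (afterok).
submit_chain() {
    local prev="" script jobid
    for script in "$@"; do
        if [ -z "$prev" ]; then
            jobid=$(sbatch --parsable "$script") || return 1
        else
            jobid=$(sbatch --parsable --dependency=afterok:"$prev" "$script") || return 1
        fi
        # --parsable prints "jobid" or "jobid;cluster"; keep only the ID
        prev=${jobid%%;*}
        echo "submitted $script as job $prev"
    done
}

# Usage, from a login node:
# submit_chain step1.sh step2.sh step3.sh
```

With afterok, a failure in any step leaves the later jobs pending with an unsatisfied dependency, so check the queue and cancel them if a step fails.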
Abbreviated Notations
SLURM also accepts abbreviated notation for the sbatch command:
Long Format | Short Format |
---|---|
--partition=name | -p name |
--time=hh:mm:ss | -t hh:mm:ss |
--nodes=number | -N number |
--ntasks=total procs | -n total procs |
--dependency=attributes | -d attributes |
Monitoring Jobs
SLURM provides various tools for monitoring and manipulating jobs
Check queue status
squeue <Options>
Options
- -u <username>: show status of all jobs for a particular user
- -j <jobid>: show status for jobid
- -l: show long format of queue status
- -p <name>: show status of all jobs in partition name
- --start: show estimated start time
Use --help option to see a full list of allowed options and usage
checkq is a script accessible through the soltools modules which provides squeue with some useful defaults and can accept the above options.
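As an example of combining these options, the sketch below summarizes a user's jobs by state. The function `jobs_by_state` is hypothetical; it uses the squeue options `-h` (no header) and `-o "%T"` (print only the job state) together with `-u`.

```shell
#!/bin/sh
# Hypothetical helper: count a user's jobs by state (RUNNING, PENDING, ...).
# Defaults to the current user when no username is given.
jobs_by_state() {
    squeue -h -u "${1:-$USER}" -o "%T" | sort | uniq -c
}

# Usage, on a login node:
# jobs_by_state            # your own jobs
# jobs_by_state alp514     # another user's jobs
```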
Cancel/delete a job
You can delete only your own jobs, whether they are in the queue or already running.
scancel <jobid>
Manipulate Jobs in Queue
A user or admin can manipulate jobs that are in queue i.e. not running yet.
scontrol hold <jobid>
scontrol release <jobid>
You can only release jobs that you have held. If an admin has held your job, only the admin can release it.
scontrol show job <jobid>
scontrol update SPECIFICATION jobid=<jobid>
Examples of SPECIFICATION are
- add dependency after a job has been submitted: dependency=<attributes>
- change job name: jobname=<name>
- change partition: partition=<name>
- modify requested runtime: timelimit=<hh:mm:ss>
- request gpus (when changing to one of the gpu partitions): gres=gpu:<1,2,3 or 4>
scontrol update partition=im1080 timelimit=48:00:00 jobid=123456
Monitoring Queues
alp514.sol(511): sinfo
PARTITION AVAIL  TIMELIMIT  NODES STATE NODELIST
lts*      up     3-00:00:00     9 idle  sol-a101-109
im1080    up     2-00:00:00    24 alloc sol-b401-413,501-511
im1080    up     2-00:00:00     1 idle  sol-b512
alp514.sol(512): sinfo -p lts,im1080
PARTITION AVAIL  TIMELIMIT  NODES STATE NODELIST
lts*      up     3-00:00:00     9 idle  sol-a101-109
im1080    up     2-00:00:00    24 alloc sol-b401-413,501-511
im1080    up     2-00:00:00     1 idle  sol-b512
checkload is a script accessible through the soltools modules which provides sinfo with some useful defaults and can accept the above options.
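For scripting, sinfo output can be reduced to a single number. The sketch below counts the idle nodes in a partition; `idle_nodes` is a hypothetical helper that uses the sinfo options `-h` (no header), `-p` (partition), `-t idle` (filter by state) and `-o "%D"` (print only the node count).

```shell
#!/bin/sh
# Hypothetical helper: print the number of idle nodes in a partition.
# sinfo may print several lines (one per node group), so sum them.
idle_nodes() {
    sinfo -h -p "$1" -t idle -o "%D" | awk '{ n += $1 } END { print n + 0 }'
}

# Usage, on a login node:
# idle_nodes lts
```

An idle node count is only a snapshot; a job may still wait if higher-priority jobs are queued for the same partition.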
Click Here for status of Sol partitions - updated every 15 mins, accessible at Lehigh and VPN only. This page is generated from output of checkq and checkload for partition status and node usage respectively.