This user guide is designed to help the Lehigh HPC community continue their work during and after the operating system upgrade of our clusters Sol and Hawk in Spring 2025.
This upgrade guide contains specific details about the timeline and usability of the cluster during the Spring, while our new quickstart guide is designed to help users familiarize themselves with the new system.
Summary
Any new users, or users building new calculations on our cluster, should rely on our modules system (Lmod) and scheduler (SLURM) to locate software and schedule their calculations on either the `hawkcpu` or `rapids` partitions, which are the largest on our system. If you are using our legacy software or partitions, read below to learn how to migrate your workflow to our new software modules and upgraded nodes. Our team can provide consultation and answer questions if you submit an HPC ticket.
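As a starting point, you can survey the cluster with the standard SLURM and Lmod commands sketched below; these are generic tools, and the exact partitions and modules they report will depend on our current configuration:

```shell
# Summarize the available partitions, their states, and node counts (SLURM)
sinfo -s

# List the software modules visible under your currently loaded architecture (Lmod)
module avail
```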
Schedule
The following table reports our upgrade progress and anticipated schedule. The "Lmod architecture" refers to the highest-possible architecture module that you can load in Lmod when using the new modules system explained in the quickstart guide.
| Upgrade Date | Old Partition | New Partition | Lmod Architecture | Comments |
|---|---|---|---|---|
| Completed | – | | | – |
| Completed | – | | | – |
| Completed | – | | | – |
| Completed | – | | | – |
| Completed | – | | | – |
| Completed | – | | | – |
| Completed | | – | | – |
| Completed | | – | | – |
| Completed | | | | – |
| 12 May | | | | – |
| 12 May | | | | – |
| 12 May | | | | absorbed into Haswell |
| 14 May | | | | |
| 19 May | | | | |
| 21 May | | | | – |
| 21 May | | | | absorbed into Hawkcpu |
| 21 May | | | | – |
| 21 May | | | | – |
| 21 May | | | | – |
| 21 May | | | | |
When the upgrade is complete, the vast majority of the older hardware on the cluster will be centralized in a single `haswell` partition with a corresponding Lmod architecture module called `haswell25v1`. Users will be able to target specific hardware within this partition by using the typical SLURM flags to select the number of nodes, cores, and amount of memory, if necessary.
Using a single, large partition will help increase our utilization rate, since the vast majority of computations can run on a single node in this partition.
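As a sketch, a job targeting specific hardware inside the consolidated haswell partition might use the standard SLURM resource flags; the resource amounts below are illustrative, not recommendations:

```shell
#!/bin/bash
#SBATCH -p haswell        # the consolidated partition for older hardware
#SBATCH -N 1              # one node
#SBATCH -n 16             # sixteen cores
#SBATCH --mem=32G         # 32 GB of memory
#SBATCH -t 12:00:00       # twelve-hour wall time

# Load the architecture module for this partition, then your applications
module load haswell25v1
```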
We will update the table above through the end of the transition period. A complete explanation of the Lmod architecture scheme is provided in the quickstart guide.
Milestones
Since we started the upgrade in January, we have reached the following milestones:
1. We upgraded our modules system (Lmod) and installed dozens of upgraded software packages using Spack.
2. The new software can be accessed exclusively through the modules system, which now spans three architectures (Haswell, Cascade Lake, and Ice Lake). We have also eliminated a bug in which the modules system would sometimes fail silently, improving its usability.
3. SLURM has been upgraded to a more recent version (23.02.8). All historical accounting data were retained.
4. The head node has been upgraded to the Cascade Lake architecture and now runs as a virtual machine (VM), which will allow us to add more resiliency later in 2025. Please note that the login node has fewer resources for each user (50% of one core, 512MB RAM), but you can sidestep these limits by using our high-availability partition (see item 5).
5. We reserved a full node for high-availability access. Users can submit interactive jobs to the `hawkcpu-express` partition with virtually no wait time. Users are restricted to running one job at a time, with up to 6 cores for up to 6 hours. This partition is now the default for applications on the web portal (see item 6).
6. We upgraded the Open OnDemand web portal at `hpcportal.cc.lehigh.edu` to the latest version (4.0) and installed new Jupyter, Matlab, and Virtual Desktop applications. When used with the `hawkcpu-express` partition, users should experience virtually no wait time for interactive jobs, prototyping, and development.
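A hedged sketch of requesting an interactive session on the high-availability partition from the command line, staying within the stated limits (one job at a time, up to 6 cores, up to 6 hours); the core counts and times below are examples:

```shell
# Request an interactive allocation on the express partition
salloc -p hawkcpu-express -n 4 -t 2:00:00

# Or launch an interactive shell directly with srun
srun -p hawkcpu-express -n 1 -t 0:30:00 --pty bash
```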
Last Call for Legacy Software
In our midpoint update message, we explained that we will soon upgrade all of the remaining partitions on the cluster. This means that anyone who is still using the legacy software described below MUST migrate their work to an upgraded partition.
Before the "last call", the date at which our schedule concludes, any users who are not yet running their calculations on the already-upgraded partitions will need to take two steps:
Migrate your work to an upgraded partition, for example
hawkcpu.If you want to use a partition marked
haswellon the schedule, open an HPC ticket to ask us to add a software module (see instructions for using Haswell below).
For many users, step one is simple: change the partition specified by `#SBATCH -p` in your SLURM scripts and update your `module load` commands to find the new software. To explain further, we will work through an example.
How can you migrate to an upgraded partition?
Imagine that you are using the `enge` partition to run a LAMMPS calculation. Your SLURM job script might include these lines (along with other commands):

```bash
#!/bin/bash
#SBATCH -p enge
module load lammps/20200303
```

The `#SBATCH -p enge` line tells SLURM to send this job to the `enge` partition, which is not yet upgraded. The `module load` line loads a piece of our legacy software, installed before the upgrade.
To upgrade this workflow, you can search for the lammps module:
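A sketch of that search using Lmod's `module spider` command; the module versions it reports will depend on which architecture module you have loaded:

```shell
# Search all architectures for available LAMMPS modules
module spider lammps
```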