Upgrade

This content is archived.
The upgrade is complete.

This user guide is designed to help the Lehigh HPC community continue their work during and after the Spring 2025 operating system upgrade of our clusters, Sol and Hawk.

This upgrade guide contains specific details about the timeline and usability of the cluster in the Spring, while our new quickstart guide is designed to help users familiarize themselves with the new system.

Summary

New users, and any users building new calculations on our cluster, should rely on our modules system (Lmod) and scheduler (SLURM) to locate software and schedule their calculations on either the hawkcpu or rapids partitions, since these are the largest on our system; a sketch of a typical job script follows below. If you are using our legacy software or partitions, read on to learn how to migrate your workflow to our new software modules and upgraded nodes. Our team can provide consultation and answer questions if you submit an HPC ticket.
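For illustration, a minimal batch script for an upgraded partition might look like the sketch below; the module name, program, and resource requests are placeholders, so substitute software that you find with module avail:

#!/bin/bash
#SBATCH -p hawkcpu        # an upgraded partition
#SBATCH -N 1              # one node
#SBATCH -n 4              # four cores
#SBATCH -t 1:00:00        # one hour of walltime

# placeholder software; list real modules with "module avail"
module load python
srun python my_script.py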

Schedule

The following table reports our upgrade progress and anticipated schedule. The "Lmod architecture" refers to the highest-possible architecture module that you can load in Lmod when using the new modules system explained in the quickstart guide.

| Upgrade Date | Old Partition | New Partition | Lmod Architecture | Comments |
| --- | --- | --- | --- | --- |
| Completed | rapids | - | ice24v2 | - |
| Completed | lake-gpu | - | ice24v2 | - |
| Completed | hawkcpu | - | cascade24v2 | - |
| Completed | hawkgpu | - | cascade24v2 | - |
| Completed | hawkmem | - | cascade24v2 | - |
| Completed | ima40-gpu | - | haswell24v2 | - |
| Completed | pisces | - | cascade24v2 | - |
| Completed | eng-gpu | haswell | haswell24v2 | - |
| 12 May | chem (part) | hawkcpu | cascade24v2 | - |
| 12 May | chem (part) | haswell | haswell24v2 | - |
| 12 May | health | haswell | haswell24v2 | absorbed into Haswell |
| 14 May | im2080-gpu | haswell | haswell24v2 | #SBATCH --constraint=gpu:2080 |
| 19 May | im1080-gpu | haswell | haswell24v2 | #SBATCH --constraint=gpu:1080 |
| 21 May | engi | haswell | haswell24v2 | - |
| 21 May | engece | hawkcpu | cascade24v2 | absorbed into Hawkcpu |
| 21 May | infolab | haswell | haswell24v2 | - |
| 21 May | engc | haswell | haswell24v2 | - |
| 21 May | enge | haswell | haswell24v2 | - |
| 21 May | lts-gpu | haswell | haswell24v2 | #SBATCH --constraint=gpu:1080 |

When the upgrade is complete, the vast majority of the older hardware on the cluster will be centralized in a single haswell partition with a corresponding Lmod architecture module called haswell25v1. Users will be able to target specific hardware within this partition by using the typical SLURM flags to select the number of nodes, cores, and amount of memory, if necessary.
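For example, a job targeting specific hardware within the haswell partition might include directives like these (the resource values are illustrative; the GPU constraint comes from the table above):

#SBATCH -p haswell                # the consolidated partition
#SBATCH -N 1                      # one node
#SBATCH -n 16                     # sixteen cores
#SBATCH --mem=64G                 # memory per node
#SBATCH --constraint=gpu:1080     # optional: select a specific GPU type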

Using a single, large partition will make it possible to increase our utilization rate, since the vast majority of computations can run on a single node in this partition.

We will update the table above through the end of the transition period. A complete explanation of the Lmod architecture scheme is provided in the quickstart guide.

Milestones

Since we started the upgrade in January, we have reached the following milestones:

  1. We upgraded our modules system (Lmod) and installed dozens of upgraded software packages using Spack.

  2. The new software is available exclusively through the modules system, which now spans three architectures (Haswell, Cascade Lake, and Ice Lake). We also eliminated a bug in which the modules system would sometimes fail silently, which improves its usability.

  3. SLURM has been upgraded to a more recent version (23.02.8). All historical accounting data were retained.

  4. The head node has been upgraded to a Cascade Lake architecture and now runs as a virtual machine (VM), which will allow us to add more resiliency later in 2025. Please note that the login node provides fewer resources to each user (50% of one core, 512 MB RAM), but this limit can be sidestepped by using our high-availability partition (see item 5).

  5. We reserved a full node for high-availability access. Users can submit interactive jobs to the hawkcpu-express partition with virtually no wait time. Users are restricted to one running job at a time, with up to 6 cores for up to 6 hours. This partition is now the default for applications on the web portal (see item 6 and the example after this list).

  6. We upgraded the Open OnDemand web portal at hpcportal.cc.lehigh.edu to the latest version (4.0), and installed new Jupyter, Matlab, and Virtual Desktop applications. When used with the hawkcpu-express partition, users should experience virtually no wait time for interactive jobs, prototyping, and development.
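As a sketch of the high-availability workflow in item 5, you could request an interactive session with SLURM's salloc command; the core count and walltime below are the partition's stated maximums:

salloc -p hawkcpu-express -c 6 -t 6:00:00

Once the allocation is granted, commands launched with srun execute on the reserved node with virtually no queue wait.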

Last Call for Legacy Software

In our midpoint update message, we explained that we will soon be upgrading all of the remaining partitions on the cluster. This means that anyone who is still using the legacy software described below MUST migrate their work to an upgraded partition.

Before the "last call", the date at which our schedule concludes, any users who are not yet running their calculations on the already-upgraded partitions will need to take two steps:

  1. Migrate your work to an upgraded partition, for example hawkcpu.

  2. If you want to use a partition marked haswell on the schedule, open an HPC ticket to ask us to add a software module (see instructions for using Haswell below).

For many users, step one is simple: you can change the partition specified by #SBATCH -p in your SLURM scripts and update your module load commands to find new software. To explain further, we will work through an example.

How can you migrate to an upgraded partition?

Imagine that you are using the enge partition to run a LAMMPS calculation. Your SLURM job script might include these two lines (along with other commands):

#!/bin/bash
#SBATCH -p enge
module load lammps/20200303

The #SBATCH line tells SLURM to send this job to the enge partition, which is not yet upgraded. The module load line loads a piece of our legacy software, which was installed before the upgrade.

To upgrade this workflow, you can search for the lammps module:
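module spider lammps

Lmod's spider command searches every module tree, so its output typically lists the available lammps versions along with any architecture modules that must be loaded first. (Exact module names on the upgraded system may differ; this is the standard Lmod search pattern.)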