Upgrade
This user guide is designed to help the Lehigh HPC community continue their work during and after the Spring 2025 operating system upgrade of our clusters, Sol and Hawk.
This upgrade guide contains specific details about the timeline and usability of the cluster in the Spring, while a new quickstart guide is designed to help users familiarize themselves with the new system.
Summary
New users, and users building new calculations on our cluster, should rely on our modules system (Lmod) and scheduler (SLURM) to locate software and schedule their calculations on either the hawkcpu or rapids partitions, because these are the largest partitions on our system. If you are using our legacy software or partitions, read below to learn how to migrate your workflow to our new software modules and upgraded nodes. Our team can provide consultation and answer questions if you send them in an HPC ticket.
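For example, you can check the status of the two largest partitions and search for a package before writing a job script; the package name lammps below is only an illustration:

```
# Check the state of the two largest partitions, then search the modules system.
# "lammps" is only an example package name; substitute your own software.
sinfo -p hawkcpu,rapids
module spider lammps
```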
Schedule
The following table reports our upgrade progress and anticipated schedule. The "Lmod architecture" refers to the highest-possible architecture module that you can load in Lmod when using the new modules system explained in the quickstart guide.
| Upgrade Date | Old Partition | New Partition | Lmod Architecture | Comments |
|---|---|---|---|---|
| 6 January | – | | | – |
| 6 January | – | | | – |
| 6 January | – | | | – |
| 6 January | – | | | – |
| 6 January | – | | | – |
| 6 January | – | | | – |
| 10 March | | – | | – |
| 10 March | | – | | – |
| 10 March | | | | – |
| 17 March | | | | – |
| 20 March | | | | – |
| 20 March | | | | – |
| 20 March | | | | – |
| 24 March | | | | absorbed into Hawkcpu |
| 24 March | | | | absorbed into Haswell |
| 24 March | | | | – |
| 27 March | | | | – |
| 31 March | | | | – |
| 3 April | | | | – |
| 7 April | | | | – |
When the upgrade is complete, the vast majority of the older hardware on the cluster will be centralized in a single haswell partition with a corresponding Lmod architecture module called haswell25v1. Users will be able to target specific hardware within this partition using feature flags that we will add after each upgrade. The use of a single shared partition will make it possible to increase our utilization rate, since the vast majority of computations can run on a single node in this partition.
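For example, once the feature flags are published, a job could target a specific node type with SLURM's constraint option. This is only a sketch: the feature name shown below is purely illustrative, and the real flag names will be announced after each upgrade.

```
#!/bin/bash
#SBATCH -p haswell
#SBATCH -C enge    # illustrative feature flag; the real flag names will be published after each upgrade
```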
We will update the table above through the end of the transition period. A complete explanation of the Lmod architecture scheme is provided in the quickstart guide.
Milestones
Since we started the upgrade in January, we have reached the following milestones:
- We upgraded our modules system (Lmod) and installed dozens of upgraded software packages using Spack.
- The new software can be accessed exclusively through the modules system, which now spans three architectures (Haswell, Cascade Lake, and Ice Lake). We have also eliminated a bug in which the modules system would sometimes fail silently, which improves its usability.
- SLURM has been upgraded to a more recent version (23.02.8). All historical accounting data were retained.
- The head node has been upgraded to a Cascade Lake architecture. We are now using a virtual machine (VM) to prepare for implementing failover, which will make it easier to update the login node without downtime. Please note that the login node now has fewer resources per user (50% of one core, 512 MB RAM), but this can be sidestepped by using our high-availability partition (see below).
- We reserved a full node for high-availability access. Users can submit interactive jobs to the hawkcpu-express partition with nearly zero wait time. Users are restricted to running one job at a time, with up to 6 cores for up to 6 hours. This partition is now the default for applications on the web portal (see below); an example interactive job appears after this list.
- We upgraded the Open OnDemand web portal at hpcportal.cc.lehigh.edu to the latest version (4.0) and installed new Jupyter, Matlab, and Virtual Desktop applications. When used with the hawkcpu-express partition, users should experience virtually no wait time for interactive jobs, prototyping, and development.
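For example, a short interactive session on the high-availability partition could be requested as follows; the core count and time limit are only one choice within the stated caps:

```
# Request an interactive shell on the high-availability partition
# (limits: one job at a time, up to 6 cores, up to 6 hours)
srun -p hawkcpu-express -n 1 -c 4 -t 2:00:00 --pty /bin/bash
```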
Last Call for Legacy Software
In our midpoint update message, we explained that we will be upgrading all of the remaining partitions on the cluster soon. This means that anyone who is still using our legacy software explained below MUST migrate their work to an upgraded partition.
Before the "last call", the date at which our schedule concludes, any users who are not yet running their calculations on the already-upgraded partitions will need to take two steps:
- Migrate your work to an upgraded partition, for example hawkcpu.
- If you want to use a partition marked haswell on the schedule, open an HPC ticket to ask us to add a software module (see instructions for using Haswell below).
For many users, step one is simple: you can change the partition specified by #SBATCH -p in your SLURM scripts and update your module load commands to find the new software. To explain further, we will work through an example.
How can you migrate to an upgraded partition?
Imagine that you are using the enge partition to run a LAMMPS calculation. Your SLURM job script might include these two lines (along with other commands):
```
#!/bin/bash
#SBATCH -p enge
module load lammps/20200303
```
The #SBATCH -p line tells SLURM to send this job to the enge partition, which has not yet been upgraded. The module load line loads a piece of so-called legacy software, installed before the upgrade.
To upgrade this workflow, you can search for the lammps module:
```
$ module spider lammps

    You will need to load all module(s) on any one of the lines below before the
    "lammps/20240829.1" module is available to load.

      arch/cascade24v2  gcc/12.4.0  openmpi/5.0.5
```
Note that the arch/cascade24v2 module indicates that the other modules are compiled for the cascade architecture. According to the list of partitions in our schedule, this means that this code will work on hawkcpu. The cascade24v2 Lmod architecture is forwards-compatible with ice24v2, meaning that you can use this software on the rapids partition as well.
To use the newest version of lammps, from August 2024, you can run these commands:
```
module load arch/cascade24v2
module load gcc/12.4.0 openmpi/5.0.5
module load lammps/20240829.1
```
The first two commands are usually redundant because these modules are already loaded by default. If you have changed the defaults with a module collection, you should use all three commands to find this software. Most users can just use the third command.
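If you do use module collections, the following sketch shows the typical Lmod commands; the collection name my-lammps is a hypothetical placeholder:

```
# Save the currently loaded modules under a name of your choosing
module save my-lammps

# In a later session, restore that exact set of modules
module restore my-lammps
```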
Now that we know that we can use a recent version of lammps on an upgraded node, we can change our submission script to request hawkcpu:
```
#!/bin/bash
#SBATCH -p hawkcpu
module load lammps/20240829.1
```
As long as your research is compatible with a newer version of LAMMPS, this method will allow you to continue your work on the upgraded partitions. If you want to use a partition marked haswell in the schedule, you should read the section below for further instructions.
What if the module is missing?
If you ask for software that does not exist in the new modules system, you will see this error:
```
$ module load software-that-does-not-exist
Lmod has detected the following error:  The following module(s) are unknown:
"software-that-does-not-exist"

Please check the spelling or version number. Also try "module spider ..."
It is also possible your cache file is out-of-date; it may help to try:

  $ module --ignore_cache load "software-that-does-not-exist"

Also make sure that all modulefiles written in TCL start with the string #%Module
```
This means that we have not compiled this software after the upgrade. To ask us to add new software, please open an HPC ticket and specify the partition or architecture you want to use.
Some software is highly specialized or unique to a specific research group. In those cases, we might give you instructions for installing a private copy instead of adding it to the centralized software tree.
You might also see this message:
```
$ module load python/3.9.19
Lmod has detected the following error:  These module(s) or extension(s) exist but
cannot be loaded as requested: "python/3.9.19"

Try: "module spider python/3.9.19" to see how to load the module(s).
```
This means that the module is available if you load some other prerequisites. Since our site uses hierarchical modules, our software depends on specific architectures, compilers, and MPI implementations.
You should follow the instructions provided by Lmod to "see how to load the module(s)". If these instructions tell you to load arch/haswell24v2, then this software is forwards-compatible with the entire cluster, and you will be able to use any partition to run your calculation. We explain this further in the next section.
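As a hedged illustration, following those instructions might look like the sequence below; the prerequisite modules shown here are assumptions, so rely on what module spider actually reports on your account:

```
# Hypothetical example: the prerequisites printed by "module spider python/3.9.19"
# on your account may differ from the ones assumed here.
module spider python/3.9.19    # shows which modules must be loaded first
module load arch/haswell24v2   # load the architecture named by the spider output
module load gcc openmpi        # load the compiler and MPI prerequisites
module load python/3.9.19      # the module can now be loaded
```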
Haswell
Most researchers will find the software they need by using the standard Lmod commands to search for and then load the software. The default software will work on the largest partitions, namely hawkcpu and rapids. However, our cluster has a large number of nodes that entered service before 2021. Using any nodes in the haswell partition will require extra steps. After the upgrade, this partition will absorb several older ones, for example enge and engc.
If you want to use any nodes in the haswell partition, we recommend that you first migrate to hawkcpu and test your workflow there, then read the remainder of this section, and then request software for the haswell partition if necessary.
In this section we will first explain how the architectures work ("Understanding Architecture") and then tell you how to use the haswell partition ("Using the Haswell Partition").
Understanding Architecture
First, we need to explain the architecture system. The schedule lists one of three architectures for each partition; each architecture maps onto a set of SLURM partitions and is associated with an Lmod module. The architectures are:
- Ice Lake uses the arch/ice24v2 module and can be used on the rapids and lake-gpu partitions.
- Most of the cluster, including the head node, uses the arch/cascade24v2 module, and this provides software that can be used on any of the Ice Lake partitions above, as well as the Hawk partitions, for example hawkcpu.
- Most hardware installed before 2021 is compatible with the Haswell architecture and will be added to a large haswell partition. To use this partition, you will need to load the arch/haswell24v2 module. We will explain this process below.
Most users will log onto the cluster, search for their software, and then load it. In this example, we find that R version 4.4 is available.
```
module spider r
module load r/4.4.1
```
We can add the latter command to a SLURM script to use this code on the login node, on the high-availability partition hawkcpu-express, and on the hawkcpu and rapids partitions.
Our system uses hierarchical Lmod modules, which means that the available software is constrained by the architecture, compiler, and MPI you have already loaded.
You can inspect the default modules after you log in:
```
$ module list

Currently Loaded Modules:
  1) gcc/12.4.0     4) arch/cascade24v2   7) py-setuptools/69.2.0
  2) openmpi/5.0.5  5) tcl/8.6.12         8) python/3.13.0
  3) helpers        6) py-pip/23.1.2
```
You will see arch/cascade24v2, which indicates that these modules will work on items 1 and 2 on the architecture list above.
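If you want to preview what would be available under a different architecture, you can load the corresponding arch module and list the software. This is only a sketch; the exact listing depends on what we have installed for that architecture.

```
# Swap to the Haswell software tree and list the modules available under it
module load arch/haswell24v2
module avail
```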
Using the Haswell Partition
If you send a calculation to the new haswell partition and receive an error that says "The following module(s) are unknown", then you need to either install this software on your own or open an HPC help ticket to request it.
All calculations submitted to the haswell partition will automatically load the arch/haswell24v2 module, which restricts the available modules to those compatible with the partition. As a result, if you try to use modules that are only available with the arch/cascade24v2 module explained above, then you will see this error.
In the example above, if you need to use the r/4.4.1 module on enge, and you open a ticket to request this software, then we will give you instructions for using it on enge. Your SLURM script would look like this:
```
#!/bin/bash
#SBATCH -p enge
module load arch/haswell24v2
module load gcc openmpi
module load r/4.4.1
```
This is the fully explicit method for loading the modules available on the haswell partition. As you can see, the only difference between haswell and the newer architectures is the extra module load arch/haswell24v2 command.
If you want to streamline this process, we can help you build an entrypoint script that loads all of the software required by a specific project, which is also compatible with your target partitions. Open an HPC help ticket to schedule a consultation.
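As a rough sketch, such an entrypoint script might look like the following; the file name and module list are hypothetical placeholders that we would tailor to your project:

```
#!/bin/bash
# Hypothetical entrypoint script (for example ~/my-project/env.sh): load everything
# one project needs on the haswell partition. Replace the modules with your own.
module purge
module load arch/haswell24v2
module load gcc openmpi
module load r/4.4.1
```

Your SLURM script would then source this single file, for example with source ~/my-project/env.sh, instead of repeating a list of module load commands.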
To summarize, every calculation that does not use custom-installed software must request a SLURM partition and use the modules to access the software. The available software is constrained by the version of the arch module revealed by module list. If you cannot find the exact software you need for a specific partition, you must either compile it yourself or ask us to install it for you by opening an HPC help ticket.
The remainder of the upgrade guide includes instructions on legacy software, and a history of our communications about the upgrade.
Accessing Legacy Software
If you are accessing legacy software, please review the LAST CALL section for further instructions. We are retaining the following instructions for reference. The legacy software will be removed after the upgrade schedule is complete.
We call software that was installed before the upgrade "legacy", a euphemism for "old". Our goal for this upgrade is to upgrade as much software as possible so that our users can continue computing.
Our operating system upgrade requires brand-new software. While some legacy software might still work properly after the upgrade, this is not guaranteed. The best strategy is to rebuild everything.
The quickstart guide explains how to use the upgraded module system. During the transitional period, we expect that many users will still need to use the older software in order to continue using the cluster, or to inspect and modify their workflows so they work properly after the upgrade.
The schedule has a list of partitions which have not been upgraded yet. The legacy software will work on these partitions, however it is critical for everyone to move their workflow to an upgraded partition before the end of the upgrade schedule.
There are two commands required to access the legacy software.
- Clear the current Lmod tree with the command: clearLmod
- Load the old modules with another command: source /share/Apps/legacy.sh
We can combine these into one line that you can run in a terminal session:
clearLmod && source /share/Apps/legacy.sh
If you use this method to access older software, you will need a fresh terminal session to use the new software while submitting jobs to the upgraded nodes. Alternatively, you could submit jobs with old software in a bash subshell, which can be exited to return to a fresh terminal.
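A minimal sketch of the subshell approach follows; the job script name is a placeholder:

```
bash                                        # start a subshell
clearLmod && source /share/Apps/legacy.sh   # switch the subshell to the legacy modules
sbatch legacy-job.sh                        # submit a job that uses the legacy software
exit                                        # leave the subshell and return to a fresh environment
```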
The steps above will activate the older software modules. If you try to use legacy software without following these instructions, you will see one of two errors. One possibility is that Lmod might tell you that it cannot find your modules:
```
Lmod has detected the following error:  The following module(s) are unknown:
"SOME_NAME"

Please check the spelling or version number. Also try "module spider ..."
It is also possible your cache file is out-of-date; it may help to try:

  $ module --ignore_cache load "sol"

Also make sure that all modulefiles written in TCL start with the string #%Module
```
Or, you might see a glibc error like this one:
```
/lib64/libc.so.6: version `GLIBC_2.34' not found
```
The latter error is characteristic of a mismatch between the operating system and your software. All of our compiled software depends on glibc, and upgrading the operating system also upgrades glibc; this is precisely why we must recompile all of the software on the cluster after the upgrade.
Note that you should NOT use source /share/Apps/lake.sh except as a temporary measure. Instead, you should switch to the new software tree, which uses module commands exclusively. This requires no additional steps, since rapids and lake-gpu have now been upgraded.
While this method allows you to access the legacy software, it will be retired at the end of the upgrade schedule above.
Users will need to select one of two options before these partitions are upgraded (see the schedule for exact dates):
- Compile your own code or ask that we add your software to the new modules system by opening an HPC ticket.
- Port the software to a Singularity container.
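For the second option, a container workflow might look roughly like the sketch below; the image and input file names are purely illustrative, and we can help you select or build an appropriate image:

```
# Pull a container image that provides the application (the image name is illustrative)
singularity pull lammps.sif docker://lammps/lammps:stable

# Run the containerized application from inside a SLURM job on an upgraded partition
singularity exec lammps.sif lmp -in in.lammps
```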
If you have questions about this process or require guidance for migrating your workflow to the new system, please open an HPC help ticket and let us know the details.
Communications History
The Research Computing group sent the following communications to the HPC mailing list during the upgrade:
Advance Notice
On November 25, 2024, our team sent the following message to the HPC mailing list:
To all HPC Users:
The Sol cluster, including the Hawk expansion, will have a scheduled downtime on January 6-8, 2025. During the downtime, our team will upgrade the scheduler (SLURM), the operating system (OS), and make a large set of new software available.
All users should expect to modify or update their workflows in order to continue computing after the OS upgrade is complete (after January 31), since these upgrades will almost certainly disrupt your existing compiled codes. To explain all of the important details and actions you should take, we have published a guide in our knowledge base. All users should review this guide before the downtime.
We encourage users to get in touch with us, by opening a help ticket, in order to (1) request software modules you will need after the upgrade, (2) share instructions with us for testing existing codes on the new software, and (3) answer any questions about the upgrade. We will repeat this announcement in the coming weeks to remind you. We ask you to provide requests for software or advance testing before December 20, 2024 to ensure we can respond to these requests before the downtime.
We expect this change to improve the security of our system while also giving us an opportunity to streamline our software and SLURM partitions in order to make it more usable to our research community. Thank you in advance for helping us prepare for a seamless transition for Sol.
Midpoint Update
On March 7, 2025, our team provided this update:
To all HPC Users:
This email includes specific instructions for HPC users along with links to our upgrade guide and ticket system. I strongly encourage any users who plan to actively use Sol and Hawk in the near future to read these instructions carefully.
I am writing to provide a progress update on the Sol operating system upgrade that began in January. So far, we have upgraded the newest parts of our cluster, namely our largest partitions ("rapids" and "hawkcpu"). The cluster appears to be well-utilized since the upgrade. Many users have updated their workflows, SLURM scripts, and software so they remain compatible with our systems.
On Monday we plan to continue upgrading the remaining SLURM partitions. We have published our target schedule in the upgrade guide. We originally planned for the transition period to end on January 31, however we have extended this schedule to give everyone more time to adjust to the changes.
If you have NOT YET migrated your calculations to one of the already-upgraded partitions on our schedule, then you MUST take the following action:
Migrate your calculations to the "hawkcpu
" or "rapids
" partitions to confirm that they still work properly. This will help you figure out if you need us to install more software. If you need software that is not yet available in the modules system (use "module spider <name>
" to search our modules), then please open an HPC ticket to request the new software.
Our team has compiled new software that targets the nodes with Lmod architectures cascade24v2 and ice24v2. To improve performance, this software is not backwards-compatible with partitions installed before Hawk was added in 2021. These partitions are marked haswell24v2 in the schedule above. If you want to use these partitions after the upgrade, then you MUST take the following action:
After you migrate your calculations to an already-upgraded partition such as "hawkcpu", you should open an HPC ticket to tell us exactly which software you need us to add to the haswell24v2 partition. You can send a module load command that summarizes your requirements.
This process is outlined in greater detail in the upgrade guide. If you are not sure how to proceed, or you would like to discuss this process with us, please open an HPC ticket that tells us which software and partitions you are using.
Many thanks to everyone for providing useful feedback, identifying new issues, and helping us improve our HPC environment.