Our goal is to provide a simple introduction to both the hardware and software on this machine so that researchers can start using it as quickly as possible.

Hardware

Sol is a highly heterogeneous cluster, meaning that it is composed of many different types of commodity hardware with multiple computing architectures and instruction sets. For our purposes, the hardware on the cluster has three distinguishing features:

  1. Architecture a.k.a. instruction set
  2. High-speed Infiniband (IB) networking available for many nodes
  3. Specialized graphics processing units (GPUs) available on some nodes

Architecture

Architecture is the most important feature of our hardware, because it determines the set of software that you can use. We have whittled down the number of architectures into three categories. We list the architectures in reverse chronological order and give each of them an Lmod architecture name explained in the software section below.

...

Each architecture provides a distinct instruction set, and all compiled software on our cluster depends on these instructions. The architectures are backwards compatible, meaning that you can always use software compiled for an older architecture on newer hardware.

Specialized Hardware

Besides architecture, there are two remaining pieces of specialized hardware that may be relevant to your workflows. First, most of the cluster has access to high-speed Infiniband (IB) networking. This network makes it possible to run massively parallel calculations across multiple nodes.

The main exception comes from the Hawk partitions: hawkcpu, hawkgpu, and hawkmem. These partitions should be used for single-node jobs only, because the ethernet network is shared with our storage system and cannot accommodate fast communication between nodes.
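As a minimal sketch of the single-node rule, the following writes a batch script that pins a job to one hawkcpu node. The file name hawk-single-node.sh and the program ./my_program are hypothetical placeholders, and the task count assumes the 52 cores per node listed for hawkcpu in the partition table in this guide.

```shell
# Write a single-node batch script for a Hawk partition.
# hawk-single-node.sh and ./my_program are hypothetical names.
cat > hawk-single-node.sh <<'EOF'
#!/bin/bash
#SBATCH --partition=hawkcpu
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=52
#SBATCH --time=01:00:00
# default modules for the Cascade Lake architecture
module load arch/cascade24v2 gcc openmpi
srun ./my_program
EOF
# Submit it later with: sbatch hawk-single-node.sh
```

Keeping --nodes=1 in the script header means the job never depends on the shared ethernet network for communication between nodes.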

Second, partitions with a gpu suffix, such as lake-gpu and hawkgpu, provide GPUs.

SLURM partitions

We segment the hardware on the cluster by SLURM partitions listed below. SLURM is our scheduler, and it allows each user to carve off a section of the cluster for their exclusive use. Note that the cluster is currently undergoing an upgrade. We report only the upgraded partitions here, but the full guide is available.

| Partition | Lmod Architecture | Infiniband | GPUs | Cores per node | Memory per core |
|---|---|---|---|---|---|
| rapids | ice24v2 | yes | - | 64 | 8 GB |
| lake-gpu | ice24v2 | yes | 8x NVIDIA L40S | 64 | 8 GB |
| hawkcpu | cascade24v2 | - | - | 52 | 7.3 GB |
| hawkgpu | cascade24v2 | - | 8x NVIDIA T4 | 48 | 4 GB |
| hawkmem | cascade24v2 | - | - | 52 | 29.75 GB |
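The pairing between partitions and Lmod architecture names in the table above can be captured in a small lookup function. This is only a sketch: arch_for_partition is a hypothetical helper, not something installed on the cluster, and it covers only the partitions listed in this guide.

```shell
# Hypothetical helper: print the Lmod architecture name for a partition,
# using only the rows of the partition table in this guide.
arch_for_partition() {
  case "$1" in
    rapids|lake-gpu)         echo "ice24v2" ;;
    hawkcpu|hawkgpu|hawkmem) echo "cascade24v2" ;;
    *) echo "unknown partition: $1" >&2; return 1 ;;
  esac
}

# Example: print the arch module that matches the rapids partition
echo "arch/$(arch_for_partition rapids)"
```

A job script could use the printed name in a module load command so that the architecture always follows the partition.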

In the next section, we will explain how to use the Lmod architecture when selecting your software.

Software

There are two kinds of software on our system:

  1. System-wide software provided by the Lmod modules system.
  2. User-installed and user-compiled software.

Even users with custom codes will use the system-wide compilers and software to build their own software. It is particularly important to use the system-wide compilers and MPI (message-passing interface) implementations on our system to fully leverage the HPC hardware.

Using Lmod

Users who log on to the head node at sol.cc.lehigh.edu will have access to the Lmod module command. Our default modules include arch/cascade24v2, which matches the Cascade Lake architecture of the head node and the Hawk partitions, along with a default compiler (gcc) and a default MPI implementation (openmpi).

Users can list their loaded software with module list or just ml:

No Format

$ module list
Currently Loaded Modules:
  1) gcc/12.4.0   2) openmpi/5.0.5   3) helpers   4) arch/cascade24v2

Users can search for software using the module spider command. It is very important to use module spider to search for software because the menu provided by module avail will hide software which is not compatible with your architecture, compiler, or MPI. To find the hidden software, use module spider:

No Format

$ module spider lammps

---------------------------------
  lammps: lammps/20240829.1
---------------------------------

    You will need to load all module(s) on any one of the lines below before
    the "lammps/20240829.1" module is available to load.

      arch/cascade24v2  gcc/12.4.0  openmpi/5.0.5
 
    Help:
      LAMMPS stands for Large-scale Atomic/Molecular Massively Parallel
      Simulator.

You need to follow the instructions precisely to load this software using the module load command:

No Format

$ module load arch/cascade24v2 gcc openmpi lammps

You can string together multiple modules in one command. Version numbers are optional; if you omit one, Lmod selects the default module, typically the highest version.
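Because defaults can change as new versions are installed, a reproducible workflow may prefer to pin every version. The sketch below uses the versions reported by module spider above. The fallback branch is an addition for illustration only, so the snippet also runs on a machine without Lmod.

```shell
# Versions copied from the module spider output above
MODULES="arch/cascade24v2 gcc/12.4.0 openmpi/5.0.5 lammps/20240829.1"

if type module >/dev/null 2>&1; then
  # $MODULES is left unquoted on purpose so each name is a separate argument
  module load $MODULES
else
  echo "Lmod not found; on Sol this would run: module load $MODULES"
fi
```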

The software hierarchy

We reviewed our hardware at the top of this guide because it significantly restricts the types of software that you can use on each of our SLURM partitions. As a result, users should align the following choices when configuring their workflows:

  1. The hardware requirements, for example the presence of the Infiniband fabric, GPUs, memory requirements, etc, determine the SLURM partitions that are compatible with your workflow.
  2. The SLURM partitions are linked to an architecture name given in the table above. Once you select a partition, you can use the associated architecture name (and any older ones, because the architectures are backwards compatible).
  3. Each architecture has a set of compatible software.

The Lmod modules system provides a "software hierarchy" feature that allows us to deliver software for a specific architecture, compiler, and MPI. We exclude

Architectures are exclusive

Loading or unloading the arch modules will change the available software so it matches one of two architectures: either Ice Lake or Cascade Lake. Note that the Hawk partitions use the older Cascade Lake architecture, which is also the default on the head node.

When you load an architecture module, it will unload the modules which are exclusive to your current state. For example, when we log on to Sol, the arch/cascade24v2 module is automatically loaded. Imagine that we load Python to do a simple calculation. Our modules might look like this:

No Format

$ module list

Currently Loaded Modules:
  1) gcc/12.4.0         7) tcl/8.6.12
  2) openmpi/5.0.5      8) bzip2/1.0.8
  3) helpers            9) py-pip/23.1.2
  4) arch/cascade24v2  10) py-setuptools/69.2.0
  5) libxcb/1.17.0     11) python/3.13.0
  6) libx11/1.8.10

Our MPI uses Cascade Lake:

No Format

$ which mpirun
/share/Apps/cascade24v2/gcc-12.4.0/openmpi-5.0.5-<hash>/bin/mpirun

Later, imagine that we want to compile some code for the newer nodes in the rapids partition. This partition is compatible with the Ice Lake architecture (even though it happens to be the slightly newer Sapphire Rapids architecture). If we switch to this architecture, it will unload our Python module, because the arch/ice24v2 software does not include Python:

No Format

$ module load arch/ice24v2

Inactive Modules:
  1) bzip2/1.0.8       4) py-pip            7) tcl/8.6.12
  2) libx11/1.8.10     5) py-setuptools
  3) libxcb/1.17.0     6) python/3.13.0

Due to MODULEPATH changes, the following have been reloaded:
  1) gcc/12.4.0     2) openmpi/5.0.5

The following have been reloaded with a version change:
  1) arch/cascade24v2 => arch/ice24v2

As you can see, Lmod helpfully reports that some of our software has been unloaded. We can see that we now have access to the newer software tree:

No Format

$ which mpirun
/share/Apps/ice24v2/gcc-12.4.0/openmpi-5.0.5-6thd6mkhodcoqrpolw35qosoqels7vak/bin/mpirun

The upshot of this system is that users are encouraged to develop explicit recipes that match their software, architecture, and SLURM partition. If you need to run high-performance codes on the newer nodes, while also using some of the large arch/cascade24v2 software library provided by our modules system, you might want to build module collections using guidance in the next section.
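One way to make such a recipe explicit is to check which software tree a binary comes from before running it. The sketch below assumes that installed software follows the /share/Apps/<architecture>/... layout seen in the which mpirun output above. The function name tree_of is hypothetical.

```shell
# Hypothetical helper: report the software tree that a binary path belongs to,
# assuming the /share/Apps/<architecture>/... layout shown above.
tree_of() {
  case "$1" in
    /share/Apps/*/*)
      rest=${1#/share/Apps/}   # drop the common prefix
      echo "${rest%%/*}"       # keep only the first remaining path component
      ;;
    *) echo "unknown" ;;
  esac
}

# Example with the path reported after loading arch/ice24v2
tree_of /share/Apps/ice24v2/gcc-12.4.0/openmpi-5.0.5-6thd6mkhodcoqrpolw35qosoqels7vak/bin/mpirun
```

In a job script, comparing the result of tree_of "$(which mpirun)" against the architecture you intended to load is a cheap guard against mixing software trees.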

Saving modules

As we explained in the exclusive architectures section above, the arch module will limit the available software to a specific Lmod architecture name, typically either arch/cascade24v2 or arch/ice24v2, corresponding to our late-2024 editions of Cascade Lake or Ice Lake software.

Software available under the Ice Lake architecture may sometimes provide up to double the performance for certain workflows, by leveraging the new instruction set available in this architecture. To organize sets of modules for different workflows, you can save a module collection.

No Format

$ module save hawk_md_project
Saved current collection of modules to: "hawk_md_project"

At the top of SLURM scripts that you develop for this project, you can then load all of the modules with this line:

No Format

module restore hawk_md_project

This allows you to abstract the software details away from your SLURM scripts, meaning you can upgrade the software, add new modules, etc, without editing many individual SLURM scripts.
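To show where the restore line sits in practice, here is a minimal sketch that writes a batch script for this project. The script name md-job.sh, the program ./run_md.sh, and the resource requests are hypothetical placeholders. Only the collection name comes from the example above.

```shell
# Write a batch script that restores the saved collection before running.
# md-job.sh and ./run_md.sh are hypothetical names.
cat > md-job.sh <<'EOF'
#!/bin/bash
#SBATCH --partition=hawkcpu
#SBATCH --nodes=1
#SBATCH --time=02:00:00
# load every module saved in the collection
module restore hawk_md_project
./run_md.sh
EOF
# Submit it later with: sbatch md-job.sh
```

If the collection is later saved again with different modules, this script picks up the change without being edited.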

Custom modules

Users are welcome to extend the module system with their own custom modules.

No Format

ml help own

This reports the help file for the own module. It provides a link to the Lmod documentation to explain how it works. We use this feature when building custom virtual environments.

Where is my software?

Following the January 2025 upgrade we are rebuilding large sets of software for our users. You can expect to see the list of available modules grow in the coming months. In the meantime, we have a transitional period in which the legacy software is still available. This feature is documented on our upgrade page.

In short, users can run source /share/Apps/legacy.sh to return to the previous Lmod tree or source /share/Apps/lake.sh to use the software tree from the rapids and lake-gpu expansion in Spring, 2024.

Users with new software requests or general questions should open a ticket.

Interactive jobs

Users are welcome to use our web portal to access the cluster. This portal is based on the Open OnDemand project and can be found behind the VPN or on the campus network at hpcportal.cc.lehigh.edu.

The vast majority of research calculations are executed on HPC clusters in a non-interactive way, hence we encourage all users to try to design their calculations so they can be completed in an automated fashion. Most of the SLURM partitions will host these asynchronous batch jobs and as a result, they will have lengthy wait times.

Some work requires interactive use of a compute node, for example compiling code for your target architecture. In those cases, users should use the express partition. This partition has a short time limit of 4 hours and a low maximum of 4 cores. Because we oversubscribe it, wait times are low and it behaves much like a head node: many users may share the same cores on these nodes, so performance may be limited.

The best way to get an interactive job is with the combination of salloc … srun. Both are SLURM commands, and when used together, they will allow you to enter a SLURM job and use the compute nodes directly. Here is an example:

No Format

salloc -c 4 -p hawkcpu -t 2:0:0 srun --pty bash

Users can select the number of cores and time limit (up to 4 hours) using the usual SLURM flags. It is also possible to use other partitions for interactive jobs with longer limits, but we cannot predict your wait times in advance.
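The express limits described above can be checked before a request is submitted. The sketch below assumes that the partition's SLURM name is literally express, which this guide does not confirm. The function express_cmd is a hypothetical helper that only prints the command rather than running it.

```shell
# Hypothetical helper: print an interactive-job command for the express
# partition, refusing requests above the limits stated in this guide
# (4 cores, 4 hours). The partition name "express" is an assumption.
express_cmd() {
  cores=$1
  hours=$2
  if [ "$cores" -gt 4 ] || [ "$hours" -gt 4 ]; then
    echo "express allows at most 4 cores and 4 hours" >&2
    return 1
  fi
  echo "salloc -c $cores -p express -t $hours:0:0 srun --pty bash"
}

# Example: a 2-core, 1-hour request
express_cmd 2 1
```

Printing the command first makes it easy to review the request, then paste it into the terminal once it looks right.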

Python

We have started to add popular Python packages directly to the Lmod modules system, so that quick calculations require zero extra installation steps. For example:

Code Block
bash
bash

$ module load python py-numpy
$ python
Python 3.13.0 (main, Jan  6 2025, 19:35:53) [GCC 12.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> np.random.randint(100)
97

If you need to install your own environments, we recommend the following procedure.

Note that we are using a trick to write a file from the command line using cat…EOF. This command should be copied through the second EOF and executed directly in the terminal. You could just as easily write the text into a new file with your favorite text editor.

Code Block
bash
bash

module load python
# name your ceph project
CEPH_PROJECT=hpctraining_proj
# go to the shared folder
cd "$HOME/$CEPH_PROJECT/shared"
# the cat command below writes a spec file
# you should copy the entire multi-line command through the second EOF
cat > venv-spec-project-a01.txt <<EOF
scipy==1.15
seaborn
EOF
python -m venv ./venv-project-a01
source ./venv-project-a01/bin/activate
pip install -r venv-spec-project-a01.txt

Later, you can use this environment by using the absolute path to your virtual environment:

No Format

my_share_folder=/share/ceph/hawk/hpctraining_proj/shared/
source $my_share_folder/venv-project-a01/bin/activate

This procedure is the best way to add software to your Python-based software environment when module spider <some_package_name> does not find the package you need.
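In a batch job, a mistyped environment path can go unnoticed until Python fails to import a package. As a sketch, the hypothetical helper below checks for the activate script before sourcing it, so the job can stop early with a clear message.

```shell
# Hypothetical helper: activate a virtual environment only if it exists.
activate_venv() {
  if [ -f "$1/bin/activate" ]; then
    # source the activate script into the current shell
    . "$1/bin/activate"
  else
    echo "no virtual environment found at $1" >&2
    return 1
  fi
}

# Usage in a job script, with the path from the example above:
#   activate_venv /share/ceph/hawk/hpctraining_proj/shared/venv-project-a01 || exit 1
```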

Virtual environment modules

If you want to use the Python virtual environment above with less text, you can make a custom modulefile.

No Format

MY_PROJECT_NAME=project-a01
make_venv_module.py 

After you run this command, you can access the module (without the helpful terminal prompt, however), with this command:

No Format

ml own project-a01

Be sure to replace project-a01 throughout the instructions above with a meaningful name. After you create this module, you can add additional software and save a module collection, for example:

No Format

module load own project-a01
module load intel-oneapi-mkl
module save

This saves your custom module to the default collection so it is always available without adding any module commands to your scripts. If you have many projects, you could use a collection name:

No Format

module load own project-a01
module load intel-oneapi-mkl
module save project-a01

In this case, you could access this specific project with a single command:

No Format

module restore project-a01

The goal of this method is to leverage the module system, apply it to custom virtual environments, and make your SLURM scripts as simple as possible.

Anaconda

We have installed a miniconda3 module so that users can build their own Anaconda environments. Please note that a standard Python virtual environment is the preferred way to build virtual environments unless you need access to the broader set of packages provided by conda.

NOTE: We have customized the miniconda3 module so that you should NOT need to modify your ~/.bashrc file to use conda environments. Such ~/.bashrc modifications are extremely confusing on HPC systems that use Lmod to manage software. To avoid troubleshooting headaches later, we strongly recommend that you use module load miniconda3 instead of modifying your ~/.bashrc.

We recommend building your environments in the shared folder in your research group’s Ceph project. The following procedure will allow you to build a shared environment from a single specification (yaml) file.

Note that we are using a trick to write a file from the command line using cat…EOF. This command should be copied through the second EOF and executed directly in the terminal. You could just as easily write the text into a new file with your favorite text editor.

Code Block
bash
bash

# instructions for maintaining a shared conda env
module load miniconda3
# name your ceph project
CEPH_PROJECT=hpctraining_proj
# go to the shared folder
cd "$HOME/$CEPH_PROJECT/shared"
# the cat command below writes a spec file
# you should copy the entire multi-line command through the second EOF
cat > env-conda-myenv-spec.yaml <<EOF
name: stats2
channels:
  - javascript
dependencies:
  - python=3.9
  - bokeh=2.4.2
  - conda-forge::numpy=1.21.*
  - nodejs=16.13.*
  - flask
  - pip
  - pip:
    - Flask-Testing
EOF
conda env update -f env-conda-myenv-spec.yaml -p ./env-conda-myenv
# the following is the path to this environment
echo $PWD/env-conda-myenv
# activate any arbitrary conda environment by using this path
module load miniconda3
conda activate ~/$CEPH_PROJECT/shared/env-conda-myenv

You can add any conda or pip packages to the Anaconda environment file we used above. The conda env update procedure above ensures that you can easily update or reproduce this environment on other systems.

Other topics

Our documentation is in a transitional state. Besides the upgrade guide and this quickstart, we also maintain tutorial-style notes from the HPC sessions offered as part of the LTS seminar series, which can be found at go.lehigh.edu/rcnotes.