Quickstart

This "quickstart" user guide is designed to familiarize users with the latest upgrades on the Sol and Hawk clusters. Since these clusters share a head node, storage system, and software, we will refer to them as Sol throughout the guide.

Our goal is to provide a simple introduction to both the hardware and software on this machine so that researchers can start using it as quickly as possible.

Hardware

Sol is a highly heterogeneous cluster, meaning that it is composed of many different types of hardware. Three hardware features matter most to users:

  1. Architecture a.k.a. instruction set

  2. High-speed Infiniband (IB) networking available for many nodes

  3. Specialized graphics processing units (GPUs) available on some nodes

Architecture

Architecture is the most important feature of our hardware because it determines the set of software that you can use. We have whittled the architectures down to three categories, listed in chronological order. Each has an Lmod architecture name, explained in the software section below.

  1. Intel Haswell (2013) uses arch/haswell24v2

  2. Intel Cascade Lake (2019) uses arch/cascade24v2

  3. Intel Ice Lake (2020) and newer use arch/ice24v2

Each architecture provides a distinct instruction set, and all compiled software on our cluster depends on these instructions. The architectures are backwards compatible, meaning that you can always run software compiled for an older architecture on newer hardware.
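One quick way to see which tier a node belongs to is to inspect its CPU flags. The sketch below, assuming a Linux node, checks for the AVX-512 instruction set, which Cascade Lake and Ice Lake support but Haswell does not:

```shell
# Hedged sketch: Cascade Lake and Ice Lake CPUs expose the avx512f flag,
# while Haswell (AVX2 only) does not, so the flags hint at the tier.
if grep -q avx512f /proc/cpuinfo 2>/dev/null; then
    echo "AVX-512 available (cascade24v2 or ice24v2 tier)"
else
    echo "no AVX-512 (e.g. haswell24v2 tier)"
fi
```

Run this on a compute node (for example, inside an interactive job), since the head node's CPU may differ from your target partition.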

Specialized Hardware

Besides architecture, there are two remaining pieces of specialized hardware that may be relevant to your workflows. First, most of the cluster has access to high-speed Infiniband (IB) networking. This network makes it possible to run massively parallel calculations across multiple nodes.

The main exception comes from the Hawk partitions: hawkcpu, hawkgpu, and hawkmem. These partitions should be used for single-node jobs only, because their ethernet network is shared with our storage system and cannot accommodate fast communication between nodes.

Second, many partitions carry the -gpu suffix and provide specialized graphics processing units (GPUs) for accelerated workloads.
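As an illustration, a single-node GPU job on one of these partitions might be requested with a batch script like the sketch below. The partition name comes from this guide; the resource syntax assumes a standard SLURM GRES configuration, and the time limit is illustrative:

```shell
#!/bin/bash
# Hedged sketch of a single-node GPU job, assuming a standard SLURM
# GRES setup; partition name from this guide, time limit illustrative.
#SBATCH --partition=hawkgpu
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1
#SBATCH --time=01:00:00

nvidia-smi   # report which GPU was allocated to the job
```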

SLURM partitions

We segment the hardware on the cluster into the SLURM partitions listed below. SLURM is our scheduler; it allows each user to reserve a section of the cluster for their exclusive use. Note that the cluster is currently undergoing an upgrade, so we list only the upgraded partitions here; the full guide covers the remaining partitions.

Partition   Lmod Architecture   Infiniband   GPUs             Cores per node   Memory per core
rapids      ice24v2             yes          none             64               8 GB
lake-gpu    ice24v2             yes          8x NVIDIA L40S   64               8 GB
hawkcpu     cascade24v2         no           none             52               7.3 GB
hawkgpu     cascade24v2         no           8x NVIDIA T4     48               4 GB
hawkmem     cascade24v2         no           none             52               29.75 GB
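A note on reading the memory column: a node's total memory is roughly cores per node times memory per core. For example, using the hawkmem row:

```shell
# Total node memory = cores per node x memory per core
# (hawkmem row: 52 cores x 29.75 GB per core).
awk 'BEGIN { printf "%.0f GB\n", 52 * 29.75 }'   # prints 1547 GB
```

This is why hawkmem, despite its ordinary core count, is the right choice for memory-hungry single-node jobs.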

In the next section, we will explain how to use the Lmod architecture when selecting your software.

Software

There are two kinds of software on our system:

  1. System-wide software provided by the Lmod modules system.

  2. User-installed and user-compiled software.

Even users with custom codes will use the system-wide compilers and software to build their own software. It is particularly important to use the system-wide compilers and MPI (message-passing interface) implementations on our system to fully leverage the HPC hardware.
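As a sketch of how the second category builds on the first, a user-compiled MPI code might be built and run inside a job script like this. The module names are the defaults described below; hello.c and the resource requests are illustrative:

```shell
#!/bin/bash
# Hedged sketch: compile and run a user MPI code with the system
# toolchain; hello.c is an illustrative file name.
#SBATCH --ntasks=4

module load gcc openmpi      # system compiler and MPI from Lmod
mpicc -O2 -o hello hello.c   # link against the module's MPI libraries
srun ./hello                 # launch across the allocated ranks
```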

Using Lmod

Users who log on to the head node at sol.cc.lehigh.edu will have access to the Lmod module command. Our default modules include arch/cascade24v2, which matches the Cascade Lake architecture of the head node and the Hawk partitions, along with a default compiler (gcc) and a default MPI implementation (openmpi).

Users can list their loaded software with module list or just ml:

$ module list

Currently Loaded Modules:
  1) gcc/12.4.0   2) openmpi/5.0.5   3) helpers   4) arch/cascade24v2

Users can search for software using the module spider command. It is important to use module spider rather than module avail, because the menu provided by module avail hides software that is not compatible with your currently loaded architecture, compiler, or MPI. To find this hidden software, use module spider:

$ module spider lammps

---------------------------------
  lammps: lammps/20240829.1
---------------------------------

    You will need to load all module(s) on any one of the lines below
    before the "lammps/20240829.1" module is available to load.

      arch/cascade24v2  gcc/12.4.0  openmpi/5.0.5

    Help:
      LAMMPS stands for Large-scale Atomic/Molecular Massively
      Parallel Simulator.

You need to follow the instructions precisely to load this software using the module load command:

$ module load arch/cascade24v2 gcc openmpi lammps

You can string together multiple modules in a single command. Version numbers are optional; when they are omitted, Lmod selects the default module, typically the highest version.
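Putting this together, a batch job can load the same modules before running. The sketch below follows the spider output above; it relies on backwards compatibility to run the cascade24v2 build on the newer rapids nodes, and the input file name is illustrative:

```shell
#!/bin/bash
# Hedged sketch of a multi-node LAMMPS job. Module line follows the
# spider output above; cascade24v2 software runs on the newer rapids
# nodes thanks to backwards compatibility. Input file is illustrative.
#SBATCH --partition=rapids
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=64

module load arch/cascade24v2 gcc openmpi lammps
mpirun lmp -in in.melt
```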

The software hierarchy

We reviewed our hardware at the top of this guide because it significantly restricts the types of software that you can use on each of our SLURM partitions. As a result, users should align the following choices when configuring their workflows:

  1. The hardware requirements (for example, the Infiniband fabric, GPUs, or memory) determine which SLURM partitions are compatible with your workflow.

  2. The SLURM partitions are linked to an architecture name given in the table above. Once you select a partition, you can use the associated architecture name (and any lower ones).

  3. Each architecture has a set of compatible software.
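The alignment above can be sketched as a simple lookup, using the partition and architecture names from the table earlier in this guide:

```shell
# Sketch: map a SLURM partition (step 2) to the highest Lmod arch
# module it supports; names are taken from the partition table above.
partition_arch() {
    case "$1" in
        rapids|lake-gpu)         echo "arch/ice24v2" ;;
        hawkcpu|hawkgpu|hawkmem) echo "arch/cascade24v2" ;;
        *)                       echo "unknown partition: $1" >&2; return 1 ;;
    esac
}

partition_arch rapids    # prints arch/ice24v2
partition_arch hawkmem   # prints arch/cascade24v2
```

Because the architectures are backwards compatible, a partition can also use any older arch module than the one returned here.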

The Lmod modules system provides a "software hierarchy" feature that allows us to deliver software for a specific architecture, compiler, and MPI. We exclude software that is incompatible with your currently loaded modules from the module avail listing, which is why module spider is the right tool for searching.

Architectures are exclusive

Loading or unloading the arch modules changes the available software to match one of two architectures: Ice Lake or Cascade Lake. Note that the Hawk partitions use the older Cascade Lake architecture, which is also the default on the head node.

When you load an architecture module, it will unload the modules which are exclusive to your current state. For example, when we log on to Sol, the arch/cascade24v2 module is automatically loaded. Imagine that we load Python to do a simple calculation. Our modules might look like this:

$ module list

Currently Loaded Modules:
  1) gcc/12.4.0          7) tcl/8.6.12
  2) openmpi/5.0.5       8) bzip2/1.0.8
  3) helpers             9) py-pip/23.1.2
  4) arch/cascade24v2   10) py-setuptools/69.2.0
  5) libxcb/1.17.0      11) python/3.13.0
  6) libx11/1.8.10

Our MPI uses Cascade Lake:

$ which mpirun
/share/Apps/cascade24v2/gcc-12.4.0/openmpi-5.0.5-6thd6mkhodcoqrpolw35qosoqels7vak/bin/mpirun

Later, imagine that we want to compile some code for the newer nodes in the rapids partition. This partition is compatible with the Ice Lake architecture (even though it happens to be the slightly newer Sapphire Rapids architecture). If we switch to this architecture, it will unload our Python module, because the arch/ice24v2 software does not include Python:

$ module load arch/ice24v2

Inactive Modules:
  1) bzip2/1.0.8      4) py-pip          7) tcl/8.6.12
  2) libx11/1.8.10    5) py-setuptools
  3) libxcb/1.17.0    6) python/3.13.0

Due to MODULEPATH changes, the following have been reloaded:
  1) gcc/12.4.0     2) openmpi/5.0.5

The following have been reloaded with a version change:
  1) arch/cascade24v2 => arch/ice24v2

As you can see, Lmod helpfully reports that some of our software has been marked inactive. We can also see that we now have access to the newer software tree:

$ which mpirun
/share/Apps/ice24v2/gcc-12.4.0/openmpi-5.0.5-6thd6mkhodcoqrpolw35qosoqels7vak/bin/mpirun

The upshot of this system is that users are encouraged to develop explicit recipes that match their software, architecture, and SLURM partition. If you need to run high-performance codes on the newer nodes, while also using some of the large arch/cascade24v2 software library provided by our modules system, you might want to build module collections using guidance in the next section.