Rapids and Lake Partitions
This guide reviews the motivation, policy, and implementation of the Spring 2024 edition of our high-performance computing (HPC) condo (condominium) program. Any Sol users interested in using the newest compute and GPU hardware should review this guide before migrating their workflows to the new hardware.
Contents
Executive Summary
Our condo expansion includes two new partitions, rapids and lake-gpu. Both include 64 cores per node, 8GB memory per core, and InfiniBand. The GPU nodes include 8x NVIDIA L40S GPUs. We have deployed a completely new Lmod tree which users can access by running the sol_lake command on the head node. This switches their current session to the new tree, and ensures that any SLURM scripts submitted during that session inherit access to the new modules, which are otherwise accessible via standard Lmod commands. To make the change persistent across sessions, or to devolve the software selection to your individual SLURM scripts, please review our modules usage guide below.
All condo investors receive a full subscription to their hardware which renews annually on October 1 and is prorated in the meantime. GPU investors receive six-fold more core-hours to match the higher billing rate for these resources.
The condo program is a cooperative effort supported by faculty condo investors, LTS, and the University leadership. Since our objective is to deliver maximum utilization of our shared infrastructure, we expect that faculty and student researchers will make efforts to use this resource efficiently and responsibly. Our team is available to assist in this process. Users with any HPC questions should open a ticket. Researchers with large computational needs should schedule a consultation to learn about the best ways to seek funding and support for their projects.
The Condo Program
In Spring 2024, Research Computing, with support from our colleagues at LTS, coordinated a condo expansion of the Sol cluster, funded with a mixture of equipment funding and startup funds provided by nine Lehigh faculty (a.k.a. the condo investors). Under the condo program, faculty provide funds for new hardware while LTS and the University fund the infrastructure required to support it, including rack space in the datacenter, power, cooling, and administration.
Condo investors and their research groups receive high-priority access to the hardware they contributed to the expansion (see allocations below); however, all users on Sol are welcome to use the new hardware. Since Sol users who are not part of the condo program do not have high-priority access to the nodes, they should review the sharing section below before planning their resource usage.
The condo expansion is integrated into the Sol cluster, and therefore requires up to three minimal changes to your existing workflows:
Since the expansion nodes have access to existing storage, you should not need to move any data.
Since most existing software is compatible with the newer architecture, you can continue to run existing workflows on the new nodes; however, we recommend that you consider recompiling your code, per the software guide.
Users need to request the new nodes by selecting the right partition name according to the scheduler instructions.
In the remainder of this guide we will provide instructions for scheduling jobs on the new hardware, using software optimized for the new hardware, and sharing the cluster.
Scheduler
The most immediate change required to use the new hardware involves the scheduler, SLURM. Alongside the many partitions hosted on Sol, we have added four new partitions:
rapids: 64-core Sapphire Rapids nodes with 512GB memory
lake-gpu: 64-core Ice Lake nodes with 8x NVIDIA L40S GPUs
rapids-express: a high-availability queue for running a single four-core job on rapids
lake-gpu-express: a similar high-availability queue for lake-gpu
Most users should expect to use the rapids partition for any single-node or multi-node calculations, unless they have a calculation that specifically benefits from the GPUs.
The GPUs have a higher billing rate due to their greater cost. When allocating memory, we bind cores to memory and require that all users request a number of cores equivalent to their memory request. This ensures that neither resource is underutilized. In practice, this means that your jobs may be held with a warning similar to BadConstraints. If you see this, you should adjust your memory and core requests so they match the 8GB default memory per core constraint in SLURM (DefMemPerCPU).
For example, if you need 210GB of memory, you should round this up to the nearest multiple of 8GB (216GB) and request 27 cores. You could also include an explicit request for the equivalent 216GB of memory, but this will be set automatically. In short, users should be able to meet their memory requirements by requesting more cores.
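As a minimal sketch, the following sbatch header satisfies the 210GB example above by requesting 27 cores on rapids. The job time and executable are placeholders, and the memory line is optional because DefMemPerCPU already supplies 8GB per core:
#!/bin/bash
#SBATCH --partition=rapids       # new Sapphire Rapids partition
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=27       # 27 cores x 8GB/core = 216GB, covering the 210GB requirement
#SBATCH --time=12:00:00          # placeholder time limit
# #SBATCH --mem=216G             # optional; set automatically from DefMemPerCPU
./my_program                     # placeholder executable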
As we explain in the architectures section, the head node has an older architecture (Haswell). This means that users who want to compile code optimized for the new hardware should use the -express partitions to get quick access to up to four cores for re-compiling their codes.
It’s important to note that the scheduler provides fungible access to nodes, cores, memory, and sometimes GPUs, but that the computational power of this hardware differs from the existing nodes on the cluster. Therefore, all users are strongly encouraged to profile their codes and workflows with fresh benchmarks on the new hardware before committing to a larger project. The responsible use of this valuable hardware requires that users justify their specific resource requests. Users with questions about benchmarking and profiling should open a ticket.
Software Guide
In this section, we review best practices for configuring your software on the new nodes. While adjusting your resource requests for the scheduler is relatively easy, software can be much more individualized. Below we provide some cautions related to architecture and InfiniBand, followed by a usage guide for the modules system.
Architectures
Our condo expansion spans two Intel architectures: Ice Lake and Sapphire Rapids. This choice was a function of the available hardware supporting GPUs, which use the (very slightly) older Ice Lake architecture.
The Sol cluster has an Intel Haswell head node, and this now-aging architecture provides only AVX2 instructions. This maximizes compatibility with many other portions of the cluster, meaning that modules you access on the head node, along with code that you compile there, will be compatible with many partitions. The downside is that code compiled for Haswell with AVX2 might not take advantage of newer instruction sets available on newer hardware, for example the AVX512 instructions provided by the Cascade Lake architecture in the Hawk nodes.
Note: You should almost always use the maximum instruction set for your hardware, however there is a minor edge case in which some Intel chips have either one or two AVX512 fused multiply add (FMA) units. This means that you may need to decide between two instruction sets, and depending on the way the use of the FMA units affects the clock, you might see a different result. A useful example of this is documented by GROMACS. We cite it here as an example of the complex relationship between instruction set and performance.
The upshot is that you must decide between compiling code with a lowest-common-denominator architecture on the head node, and possibly forgoing some performance gains, versus custom compilations for the hardware you are targeting. The most performant option is to compile for the best instruction set for each architecture. You can do this by starting an interactive session:
salloc -c 4 -t 60 -p hawkcpu srun --pty bash
This starts a one-hour session with 4 cores on hawkcpu; however, depending on traffic, you might have to wait for these resources to become available. Any researchers who plan to use the new condo nodes should simply use the -express partitions and compile their code directly on the Sapphire Rapids or Ice Lake nodes using one of these two commands:
salloc -c 4 -t 60 -p rapids-express srun --pty bash
salloc -c 4 -t 60 -p lake-gpu-express --gres=gpu:1 srun --pty bash
After you enter an interactive session, you must manually switch to the new software tree using the sol_lake command. Your terminal should look like this:
$ salloc -c 4 -t 60 -p rapids-express srun --pty bash
$ sol_lake
$ module list
Currently Loaded Modules:
1) gcc/12.3.0 3) zlib-ng/2.1.4-h2egiei 5) libiconv/1.17-x7ahpvf 7) cuda/12.2.1 9) standard/2024.02
2) openmpi/4.1.6 4) xz/5.4.1-26uvl5w 6) libxml2/2.10.3-akjmxiw 8) helpers
Note that this extra step differs from the sessions method below due to a quirk in the way that SLURM works with BASH.
Targeting a specific architecture helps ensure the best possible performance from your hardware.
InfiniBand
The new nodes are distinguished by NDR InfiniBand, making it possible to run multi-node calculations. To use these features you must compile any new codes using one of two MPI implementations provided by the new modules:
Intel MPI via module load intel intel-oneapi-mpi
OpenMPI via module load gcc openmpi (this is the default)
These modules are revealed with the sol_lake command explained below.
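For illustration only, a minimal MPI workflow on the new nodes might look like the following; the source file and executable names are placeholders, and we assume the default gcc and openmpi modules from the new tree:
# after starting an interactive -express session and running sol_lake:
module load gcc openmpi              # default compiler and MPI stack in the new tree
mpicc -O3 -o hello_mpi hello_mpi.c   # hello_mpi.c is a placeholder source file
# inside an sbatch script, srun launches the program across nodes over InfiniBand:
srun ./hello_mpi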
Modules
Our cluster, like most academic HPC clusters, provides centralized software using the Lmod modules system. We have populated the Lmod tree with commonly used software, libraries, and middleware. LURC is available to compile software as long as it is used by a critical mass of our users; otherwise, we encourage users to take some ownership of their workflows by compiling their own software, especially if this software is highly customized. We expect that the vast majority of our users can get most of their supporting software from Lmod and compile only the most highly specific packages for their projects.
Modules Redesign
Because Sol contains multiple hardware generations, the module tree includes over 1,300 packages which apply to different SLURM partitions. In order to provide a more elegant user experience for the new nodes, we have decided to hard-fork the software tree and start with a new one. We will call this the "new module tree" in contrast to the "legacy module tree". Starting fresh provides a number of benefits, and one challenge:
Optimization. All of the new software is optimized for the Ice Lake architecture, which matches the GPU nodes in lake-gpu but is also compatible with the newer rapids architecture.
Uniformity and combinations. A new module tree avoids any downstream complexity when selecting software, because it is guaranteed to work effectively on the new nodes.
Usability improvements. The legacy modules included non-standard behavior in which module load <name> commands might silently fail. The new modules avoid this problem.
To elaborate further on the question of uniformity (item 2), we should review the two options for building hardware-optimized software. One common method is to use so-called "fat binaries" which have separate copies of each program designed for execution on different hardware. These take up extra storage and can be prohibitively complex on large HPC clusters.
The second method is to build an Lmod tree that presents a single uniform software tree and then automatically switches to hardware-optimized software when a job is dispatched from the head node to a compute node. While this method offers both uniformity and performance, we have not yet implemented this strategy. To do so on Sol would require at least four separate, redundant software trees with identical configurations and distinct optimizations. The use of so many hardware generations in one cluster creates an enormous number of combinations of usable software. To reduce redundancy, we therefore build software that targets specific SLURM partitions. As a result, we rely on the end-users to confirm that they are using the best optimizations for their workflow. We find that it is better to field requests for consultations than to maintain an excessive software stack.
If you have questions about whether your software is optimized properly, you should open a ticket to ask for our assistance.
The only challenge associated with a separate software tree is that users will need to opt in to the new software in a careful way. We explain this in the usage notes below.
Using the new modules
Recall that two primary distinguishing features of an HPC environment are the use of Lmod modules for managing software and the use of SLURM for scheduling access to specific computational resources. These two pieces of software work in unison to make it easy to build repeatable workflows. They do this with the BASH environment. When you load a piece of software with Lmod, you alter the BASH environment:
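As a simple illustration (the module name here is only an example), loading a module prepends entries to environment variables such as PATH in your current BASH session:
$ module load gcc            # example: load a compiler module
$ which gcc                  # now resolves to the module's copy of the compiler
$ echo $PATH                 # the module's bin directory has been prepended to PATH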
When you submit a SLURM job with sbatch, your remote session receives hardware elsewhere on the cluster, and importantly, it inherits the BASH variables in the session used to submit the job. This elegant design means that you can customize a BASH environment on the head node and ensure that it can be used on the compute nodes.
This presents a challenge for maintaining multiple software trees. We need a way for the user to signal that they want to switch to a new tree. While it is possible to include this natively and automatically (we review this in item 2 in the redesign notes above), there are many benefits to building a fresh tree. We offer the following methods to use the new software.
Method 0: Interactive
Whenever we start an interactive session, we need to opt in to the new module tree after the session starts:
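For example, on the rapids-express partition (the lake-gpu-express command shown above works the same way):
$ salloc -c 4 -t 60 -p rapids-express srun --pty bash
$ sol_lake
$ module avail               # browse the new tree from the compute node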
In the following two methods for batch jobs, we opt in to the new module tree either before submitting the job (Method 1) or within the job script itself (Method 2).
Method 1: Sessions
The easiest way to use the software is to switch to the new module tree on the head node, and then build and submit your calculations using SLURM. You can switch to the new tree by running a single command:
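On the head node, that command is simply:
$ sol_lake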
You can see that when we run this on the head node, we have to confirm our decision. We also receive a warning that the newer software (which uses Ice Lake) may not work on the older (Haswell) head node. The benefit is that we can now review the entire software tree with the usual module commands:
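For example (the specific packages listed will depend on the currently loaded compiler and MPI):
$ module avail               # browse the new tree
$ module spider openmpi      # search the tree for a package across toolchains
$ module list                # confirm what is currently loaded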
You can also switch to the new modules for all future sessions, for example when you log in on another day, by touching a hidden file:
You can remove this file to revert to the legacy tree. You can also use the sol_legacy command to switch to the original tree at any time.
The benefit of this method is that users who are exclusively using the new nodes can make a persistent change: they use the new modules on the head node, compile and install new code on the -express partitions, and build simple sbatch scripts that use the module tree in the usual way.
The only downside is that users will need to remember whether a workflow uses the new nodes, and make sure they have used sol_lake before running sbatch. If they fail to do this, then submitting an sbatch script that uses a newer module, for example python/3.11.6, without running sol_lake first will cause a module error, since this module is unique to the new module tree.
If you are mindful of which tree you are using, this should not be a problem. If you want to devolve instructions to individual SLURM scripts, you should use the second method.
Method 2: Modularity
You should review the sessions method before using this method. If you are dispatching SLURM jobs to multiple partitions, using both the new (sol_lake) and legacy (sol_legacy) module trees, then you can devolve the tree-switching code to the sbatch script itself. This ensures that each script is self-contained and does not depend on the module tree you are using on the head node when you submit jobs. To use this method, you should add the following line to your SLURM scripts:
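The exact line is site-specific and is not reproduced here; the sketch below is purely hypothetical, with placeholder paths, and only illustrates the general Lmod pattern of resetting the module search path:
# hypothetical sketch -- substitute the actual line used by the sol_lake alias
module --force purge                          # clear modules loaded from the legacy tree
module unuse /path/to/legacy/modulefiles      # placeholder legacy tree path
module use /path/to/lake/modulefiles          # placeholder new tree path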
This command is also executed by the sol_lake alias (which you can review with which sol_lake), and follows the Lmod instructions for switching to a new software tree.
Selecting modules
Once you select the new module tree with the sol_lake
for your sessions on the head node, or by writing this into your scripts, the module tree provides bog-standard Lmod functionality. Switching compilers and MPI modules will reveal new software compiled with these tools using module avail
, while module spider
can help you identify software that might depend on a different compiler or MPI.
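For example, switching toolchains changes which packages module avail reports; the hdf5 search below is only a hypothetical illustration:
$ module load intel intel-oneapi-mpi   # switch to the Intel compiler and MPI
$ module avail                         # now lists software built with this toolchain
$ module spider hdf5                   # reports which compiler/MPI combinations provide a package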
Sharing
This section will review our allocation and priority policies for the condo expansion.
Factors that affect sharing
As with any high-performance computing infrastructure, we have twin design objectives:
Maximum utilization. We want to make sure that we dedicate these resources towards the greatest research productivity. In short, this means we seek to minimize both downtime and idle resources.
Condo investor access. Since many faculty have funded the condo nodes from grants dedicated to a specific research project, we must ensure that these faculty can use the resources effectively.
As you will see below, our allocation policy is designed to fully subscribe the condo hardware without any over-subscription. Under ideal conditions, this means that condo investors are granted an allocation equivalent to their contribution. As with any other HPC resource allocation, we expect that faculty will take steps to consume these resources at a regular pace.
There are a few obstacles to meeting our sharing objectives:
Traffic may be irregular. It can be very difficult to plan for five years of continuous, regular usage of a machine.
It can be difficult to fully utilize a condo investment. Since acquiring new hardware requires advance notice, we must necessarily either over-provision our hardware or shrink our research objectives whenever we cannot perfectly predict our resource needs.
Many researchers require access to different resources.
These obstacles motivate our efforts to effectively share the cluster among many different researchers. Besides making the most of our shared datacenter and networking infrastructure, building a large collective resource can help to average out or smooth over the irregular demand profiles created by the obstacles listed above.
In practice, this means that we can improve the degree of sharing on the new condos in three ways.
Condo investors who purchased GPU nodes are free to use CPU-only nodes for their non-GPU workflows. It is extremely rare for one principal investigator (PI) to have a 100% GPU-oriented research program.
Similarly, condo investors who did not purchase GPUs are free to use the GPU partitions, subject to some restrictions described below.
Lastly, all users of Sol and Hawk are free to use their allocations on the new hardware. In short, this means that we are not restricting condo access to the investors.
In order for our community to maximize utilization while meeting the needs of our condo investors, we have implemented higher priority for the condo investors. We hope that the effective cross-traffic from one resource to another will balance out in the long-term.
This might mean, for example, that a condo investor with privileged access to GPU nodes might run some single-node jobs on the Hawk nodes (which lack InfiniBand) or some multi-node CPU-only calculations on the rapids partition. Whenever this happens, it frees up GPU cycles for other members of our community to use the newer GPU nodes. These cycles could be used to complete small projects or to benchmark new workflows for future grant proposals.
There is a caveat. Any users with access to discretionary hours, for example as part of the Hawk grant, have no firm guarantee that they can spend their allocations on the new condo hardware. In the event that the investors fully-book this hardware, then we cannot guarantee an exchange between Hawk cycles and cycles on the new condo nodes.
To summarize our sharing objectives, we hope to manage the inherently uncertain and irregular usage patterns of many research projects by allowing the researchers who use Sol to autonomously select the best compute hardware for their calculations. Insofar as we allow cycles to be exchangeable across partitions, we hope to maximize utilization and increase access to specialized hardware for each research project.
Allocations
The condo expansion continues our historical allocation strategy in which condo investors receive a yearly allocation starting each year on October 1. Allocations are tagged by year; for example, rpb222_2425 would refer to an allocation term of October 1, 2024, through September 30, 2025. We will prorate allocations when we begin production. As always, we encourage faculty to take steps to plan their annual resource usage. We can provide consultations to make this process easier.
We will allocate the condo nodes using a full-subscription model. This means that each condo investor will receive the exact number of hours provided by their hardware. This simplifies the accounting system; however, unannounced downtime or outages may impact your ability to consume your entire allocation.
Given the significant expense required to purchase GPU nodes, we will take two steps to balance GPU usage.
Faculty condo investors who purchased GPUs will receive an allocation which is six times larger than the number of core-hours provided by their hardware. This accounts for the difference in dollar value between a GPU node and a compute node.
Any researcher using a GPU will be charged at a higher rate, specifically six core-hours per effective core-hour. This ensures that there is an appropriate cost for using the GPU portion of these nodes.
This accounting strategy guarantees maximum flexibility for all users.
Condo investors who purchased GPUs are free to use their nodes with no penalty, even if they don’t use the GPUs.
Non-condo investors and other at-large Sol users can get access to GPUs at a cost commensurate with the dollar value of these nodes. This gives users an incentive to make sure their workflow benefits from GPUs.
This strategy requires no access restrictions for any users, thereby helping to maximize the extent to which we share the machine.
If we find that GPU traffic is high, we may enforce an access policy described below in which GPU users must be vetted to ensure their code makes use of the GPUs.
This means that condo investors will receive:
approximately 560,000 annual core-hours (64*24*365) per compute node in the rapids partition
approximately 3,350,000 annual core-hours (6*64*24*365) per compute node in the lake-gpu partition
Allocations can be viewed using the alloc-summary.sh command.
GPU Allocations
Sol currently allocates the cluster according to core-hours (a.k.a. cycles, a.k.a. service units or SUs). A core-hour provides the use of one CPU core for one hour. Our scheduler, SLURM, can charge according to many different trackable resources.
Question: how can we provide broad access to both compute nodes and GPU nodes while ensuring that both are used efficiently and effectively?
As a first principle, the GPU nodes must receive a higher weight or value in our billing mechanism, since they are roughly six-fold more expensive than a standard compute node. There are two common ways to assign this value:
Provide a fine-grained allocation of GPU hours in parallel to core-hours.
Allow core-hours to be converted into higher-value GPU hours.
The first method above has two downsides. First, it requires a separate accounting system in which condo investors who purchase GPUs receive a separate allocation. Second, it prevents condo investors from using core-hours as a fungible resource. If we explicitly allocated the GPUs, then we would need to build an exchange mechanism for situations in which Sol users who did not purchase GPUs want to use them, and vice versa, for situations in which a condo investor who purchased GPUs might want to run a compute-node job. In practice, it is rare for a single research group to require only a single type of hardware; shared HPC clusters provide the benefit of many different types of resources.
For this reason, we take the second approach, and select a higher value for the GPU nodes. This allows condo investors who purchased a GPU node to share it more easily with other users, and eliminates additional bookkeeping and gatekeeping.
Our GPU allocations are implemented as follows:
We assign a 6x scaling factor to GPU nodes to reflect their extra costs.
GPU condo investors receive a 6x core-hour allocation for their nodes.
Any users who access the GPU nodes will be charged a 6x rate for core-hours. This means that an 8-core, single-GPU calculation that lasts one hour will cost 48 core-hours. All allocations are denominated in core-hours; the GPU nodes consume these at a rate which is six times higher than the rest of the cluster.
If a GPU condo investor uses GPUs exclusively, they will have access to an allocation that reflects their contribution. If they decide to "spend" their core-hours on compute nodes, they will have six times as many effective core-hours.
We expect traffic to balance in both directions. That is, some condo investors will spend their allocations on compute nodes, while some at-large Sol users will access the GPUs.
All users are encouraged to book 8 cores with each GPU on lake-gpu, since they will be charged for the maximum equivalent resource. GPUs are bound to 8 cores, hence a request for 1 GPU will bill for 8 cores. This means we cannot oversubscribe CPU calculations alongside the GPUs, since oversubscription often degrades performance.
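As a minimal sketch, an sbatch header that follows this advice might look like the following; the time limit and executable are placeholders:
#!/bin/bash
#SBATCH --partition=lake-gpu
#SBATCH --gres=gpu:1             # one GPU
#SBATCH --cpus-per-task=8        # match the 8 cores bound to each GPU
#SBATCH --time=01:00:00          # placeholder time limit
./my_gpu_program                 # placeholder executable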
Here are some example billing outcomes:
1 GPU + 6 CPU is billed as 6x8 or 48 core-hours per hour (6x for the use of the GPU partition, and we see that 8 cores per GPU is the maximum requested resource)
2 GPU + 12 CPU is billed as 6x(2x8) or 96 core-hours per hour
1 GPU + 12 CPU is billed as 6x12 or 72 core-hours per hour
256GB + 1 CPU on lake-gpu is billed as 6x32 or 192 core-hours per hour (at 8GB/core)
As you can see from these examples, we are billing for the maximum trackable resource in SLURM. All users are advised to convert their GPU requirements and memory requirements into cores, so they can take advantage of the idle cores. For example, if you request 1 core and 1 GPU, you are being billed for 7 additional cores that might go idle.
We recommend the same approach for memory. If you were to request 256GB of memory for a serial calculation, you would be billed for 31 cores you are not using, because we bind 8GB per core and bill for the maximum resource. Even if your calculation is strictly serial, you should still request the additional cores for full transparency. We may enforce this as a rule: if a high-memory job is held by a SLURM condition, for example BadConstraints, then you should resubmit with the correct number of cores to reach 8GB/core, which is the default memory (DefMemPerCPU) setting in SLURM.
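As a sketch of the high-memory serial case above, the following request reaches 256GB by asking for 32 cores on rapids; the executable is a placeholder:
#!/bin/bash
#SBATCH --partition=rapids       # rapids nodes provide 512GB, so 256GB fits on one node
#SBATCH --cpus-per-task=32       # 32 cores x 8GB/core = 256GB; the job is billed for all 32 cores
#SBATCH --time=01:00:00          # placeholder time limit
./my_serial_program              # placeholder serial executable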
To summarize, all GPUs are allocated at a 6x higher rate and cost 6x as many core-hours. Users are billed for the maximum of three resources: cores, GPUs, and memory. As a result, we recommend that you bind your requests for GPUs and memory to the equivalent number of cores, specifically 8 cores per GPU and 8GB memory per core.
GPU Access
We are not certain that the GPU nodes will be fully booked, though we expect them to be well utilized. If they are not, we reserve the option to add a preemptible partition that will allow users to run non-GPU calculations on the nodes to increase utilization. Users with GPU workflows would provide documentation to our group to get access to a non-preemptible group. This strategy will hedge against both under- and over-utilization.
By restricting access to the GPUs to workflows that we have vetted, we can guarantee that we don’t waste any core-hours on these nodes while also making sure that GPU workflows take precedence. As of April 2024, we have not implemented this feature.
Priority
Our allocation policy ensures that each researcher is entitled to the number of core-hours equivalent to their contribution. As we explain above, it is unlikely that each group will fully book their hardware continuously for a long period. Research proceeds at an inherently irregular pace. If each of the partitions in Sol has a large enough scale and usage, and this usage is fully decorrelated on average, then each researcher can theoretically consume their entire allocation. Since this is rare in practice, we provide a priority policy.
The priority policy ensures that we meet our second sharing objective, namely that the condo investors can have privileged access to their hardware. Members of research groups led by the condo investors will receive a hidden "priority boost" which is applied to each SLURM job.
This high priority resembles the SLURM fairshare algorithm with one minor difference: to enforce a full-subscription model with annual resets on October 1, we apply this algorithm on a three-week timescale.
This rewards researchers who regularly submit their calculations while also providing higher priority to researchers who have taken a short break from regular usage. As always, the priority value for each job is also affected by the age of the submission. This means that the high condo priority coexists with a first-come, first-serve model. We plan to tune the weights on this priority system to ensure that both regular Sol users and condo investors can work together to maximize utilization.
Questions
As always, users are free to open a ticket with technical questions or schedule a consultation for broader conversations. Feedback about the priority system, GPU access, or sharing considerations is most welcome, particularly from the condo investors and the members of the HPC steering committee.