
Sol Quickstart User Guide

This "quickstart" user guide is designed to familiarize users with the latest upgrades on the Sol and Hawk clusters. Since these clusters share a head node, storage system, and software, we will refer to them as Sol throughout the guide.

Our goal is to provide a simple introduction to both the hardware and software on this machine so that researchers can start using it as quickly as possible.

Hardware

Sol is a highly heterogeneous cluster, meaning that it is composed of many different types of hardware. Three features of this hardware matter most to users:

  1. Architecture, a.k.a. instruction set
  2. High-speed Infiniband (IB) networking, available on many nodes
  3. Specialized graphics processing units (GPUs), available on some nodes

Architecture

Architecture is the most important feature of our hardware because it determines the set of software that you can use. We have whittled the architectures down to three categories. We list them in chronological order and give each one an Lmod architecture name, which is explained in the software section below.

  1. Intel Haswell (2013) uses arch/haswell24v2
  2. Intel Cascade Lake (2019) uses arch/cascade24v2
  3. Intel Ice Lake (2020) and higher uses arch/ice24v2

Each architecture provides a distinct instruction set, and all compiled software on our cluster depends on these instructions. The architectures are backwards compatible, meaning that you can always use software compiled for an older architecture on newer hardware.
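For example, on a node you can list the available architecture modules and load the one that matches (or predates) your hardware. This is a minimal sketch using standard Lmod commands and the arch/ names above:

    # list the architecture modules visible on this node
    module avail arch

    # load an architecture; older architectures also work on newer hardware
    module load arch/cascade24v2

    # confirm what is loaded
    module list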

Specialized Hardware

Besides architecture, there are two remaining pieces of specialized hardware that may be relevant to your workflows. First, most of the cluster has access to high-speed Infiniband (IB) networking. This network makes it possible to run massively parallel calculations across multiple nodes.

The main exception is the Hawk partitions: hawkcpu, hawkgpu, and hawkmem. These partitions should be used for single-node jobs only, because their ethernet network is shared with our storage system and cannot accommodate fast communication between nodes.
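For instance, a batch script targeting a Hawk partition should request a single node. The sketch below assumes the hawkcpu partition from the table further down; the task count, walltime, and executable are placeholders you would replace with your own:

    #!/bin/bash
    #SBATCH --partition=hawkcpu   # Hawk partitions: single-node jobs only
    #SBATCH --nodes=1             # do not span nodes on the ethernet-only partitions
    #SBATCH --ntasks=16           # placeholder core count
    #SBATCH --time=01:00:00       # placeholder walltime

    module load arch/cascade24v2  # Hawk hardware is Cascade Lake
    srun ./my_program             # placeholder executable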

Second, specialized graphics processing units (GPUs) are available on some nodes; the partitions that provide them are suffixed -gpu.

Map of our partitions

We segment the hardware on the cluster into the SLURM partitions listed below. SLURM is our scheduler, and it allows each user to carve off a section of the cluster for their exclusive use. Note that the cluster is currently undergoing an upgrade; we list only the upgraded partitions here, but the full guide is available.

Partition   Lmod Architecture   Infiniband   GPUs
rapids      ice24v2             yes          none
lake-gpu    ice24v2             yes          8x NVIDIA L40S
hawkcpu     cascade24v2         no           none
hawkgpu     cascade24v2         no           8x NVIDIA T4
hawkmem     cascade24v2         no           none
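As an illustration, the hedged sketch below requests one GPU on the lake-gpu partition; the exact GRES string, along with any account or QOS flags your group needs, may differ on our system:

    #!/bin/bash
    #SBATCH --partition=lake-gpu  # Ice Lake nodes with NVIDIA L40S GPUs
    #SBATCH --nodes=1
    #SBATCH --gres=gpu:1          # request one GPU; exact GRES syntax may differ
    #SBATCH --time=01:00:00       # placeholder walltime

    module load arch/ice24v2      # matches the partition's architecture
    nvidia-smi                    # confirm the GPU is visible inside the job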

In the software section below, we will explain how to use the Lmod architecture names above.

Software

We explained the hardware first because it significantly restricts the software available on each SLURM partition. As a result, users should load the Lmod architecture module that matches the partition they are running on before searching for or building software.
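As a rough sketch (assuming the arch/ modules gate the rest of the software tree), the set of software modules you can see changes with the loaded architecture:

    # software visibility follows the loaded architecture module
    module load arch/ice24v2
    module avail                  # lists builds for Ice Lake and older architectures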

UNDER CONSTRUCTION
