Environments

This page reviews the best practices for building custom environments. Recall that we have two ways to use software on the cluster:

  1. Use the Lmod modules system to find pre-installed software.
  2. Compile your own software into a custom environment.

Our team typically installs shared software modules for general use whenever the software is sufficiently popular. If several researchers or more than one research group needs a package, we can try to add it to the module system.

Most sufficiently complicated research projects require some customization, and most mature programming languages provide a virtual environment mechanism, a piece of software that offers two features:

  1. A virtual environment functions as a package manager, allowing you to install software from a centralized repository.
  2. Virtual environments are isolated from the system and other environments, so that you can precisely control the software that you use on a specific project.

This page focuses on the wide range of Python virtual environments, some of which are explained well in a well-read Stack Overflow answer. (Note that Stack Overflow sometimes falls into the awkward position of hosting the canonical documentation for an important question that cannot easily be answered elsewhere. Ironically, despite having rules against opinionated answers, it can often be the best public square for comparing competing technologies; see the comments on pipenv and conda.)

This guide covers the following environments:

  1. Python
  2. R packages

But first, let’s review best practices for building a lab notebook.

Recipes and Scripts

Researchers who use an HPC cluster should maintain a lab notebook that provides detailed instructions for repeating their work and reproducing their findings. Using a terminal makes this easy, since almost every step can be captured at the command line and added to a script.

Our group often provides instructions to users in a text block that can be executed directly in the terminal, for example:

# get an interactive session
salloc -c 6 -t 6:0:0 -p hawkcpu-express srun --pty bash
# set a variable to name your Ceph space
MY_CEPH=hpctraining_proj
# go to your shared folder
SPOT=$HOME/$MY_CEPH/shared
cd $SPOT

In this example, we are setting a BASH variable called MY_CEPH to name our research group’s Ceph project. When writing scripts and recipes that can be shared with others, it can be useful to abstract some of the personal details, in this case a storage location, into a single variable that we can reference later with $MY_CEPH. There are other variables such as $HOME and $USER which are predefined. The more we rely on these, the easier it will be to share our recipes with others.
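
As a quick check, you can print these predefined variables in your own session (the values will differ for each user):

# print a few predefined variables
echo $USER
echo $HOME
# combine them with the variable we set above
echo $HOME/$MY_CEPH/shared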

The text above includes hash marks (#) to provide in-line comments so the reader understands each step. You don’t need to execute the comment lines, but if you do, nothing will happen.

By convention, some documentation uses a dollar sign ($) to represent the command prompt, distinguishing the command from its output, for example:

$ ls $CONDA_PREFIX/etc/profile.d/
gawk.csh  gawk.sh

Since there is no Linux command named "$", you should be able to infer from context which lines are commands and which are output.

The cat trick

In the spirit of building self-contained, easy-to-read blocks of text to describe our work, we sometimes want to tell a user to "save some text in a file". To avoid unnecessary exposition, it can be convenient to include this inline with the commands instead of separating the text of the file from the rest of the procedure. We can accomplish this by writing a text file directly from the terminal. Continuing from the example above, we can write a file into our shared Ceph space:

cd $SPOT
cat > README.txt <<EOF
This is the shared folder for our research group.
The current path is "$PWD".
Users in a research group can share the data in this space, and this data can include shared software environments.
EOF
# next we can continue to run some other commands, first by reviewing the contents of the file
cat README.txt

In the "cat trick" above, the user must copy from the first cat through the second EOF, which must appear immediately following the newline character, that is, with no preceding spaces. If you copy this entire block of text into a terminal, it effectively writes a file, saving you the effort of opening a text editor. The command it self uses redirection operators to send data from the terminal into a file (via the > operator) until it notices the second end-of-file operator (EOF).

There are two caveats:

  1. The closing EOF must appear on a line by itself, with no spaces before or after it (otherwise it won’t signal the end of the file).
  2. You can use BASH variables, for example $PWD, inside the text, but they will be expanded as the file is written. If you want the file itself to keep a literal variable reference that is dereferenced later, escape the dollar sign with a backslash (see the example below).
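
For example, the following block (the file name show-escape.txt is arbitrary) writes one line where the variable is expanded immediately and one line where the escaped variable is kept literally for later use:

cat > show-escape.txt <<EOF
expanded as the file is written: $PWD
kept literally in the file: \$PWD
EOF
# review the result
cat show-escape.txt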

Once your reader knows about the cat trick, or if they are a Linux expert already, you can easily share elaborate installation procedures with them. Our group uses it extensively when sharing instructions with our users.

Python

While there are many different python packaging systems (for example pipenv and poetry), we focus exclusively on venv and conda, with a strong preference for using venv whenever possible, because it is the simplest method. Nevertheless, many different packages are exclusively distributed on conda channels, so users are encouraged to use their discretion when deciding between these two options.

Use preinstalled packages

In the example below, we will install numpy. If you only need numpy, however, you can skip the virtual environment entirely and just use our Lmod modules system:

module load python py-numpy

This one-line command loads our optimized numpy package, which should provide faster performance than a custom-installed version because we built this copy against Intel’s MKL library, whereas the default backing library is OpenBLAS, which might be slower on our Intel-based systems.
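
If you want to confirm which BLAS backend a given numpy build uses, you can ask numpy to print its build configuration (the exact output format varies between numpy versions):

module load python py-numpy
python -c "import numpy; numpy.show_config()"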

All central python packages are visible when you load the primary python module and run ml av:

module load gcc openmpi python
module avail
# or use the concise shortcut
ml av

The python packages use the py- prefix (while R packages use the r- prefix).
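
Since Lmod can filter the listing by a substring, you can narrow the output to just these packages:

# list only the python packages
ml av py-
# list only the R packages
ml av r-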

Python virtual environments

Building a python virtual environment is covered in the standard library documentation for the venv module. First, make sure that you have access to the Lmod python module:

$ ml --terse
gcc/12.4.0
openmpi/5.0.5
helpers
arch/cascade24v2
tcl/8.6.12
py-pip/23.1.2
py-setuptools/69.2.0
python/3.13.0

We can see that the modules system provides python version 3.13. If this version is compatible with your software, you don’t need to take any other steps. If you need a different version, you should search for it first with module spider python. If it doesn’t exist, the easiest solution is to use a conda environment.
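
For example, you can list every Python version available in the module tree, and then ask for the load instructions for one of them (the specific version in the comment below is only an illustration):

# list every python version provided by the module system
module spider python
# then ask for details on a specific version, for example:
# module spider python/3.12.4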

Our head node has limited resources, so we prefer to build a virtual environment in an interactive session:

salloc -p hawkcpu-express -c 4 -t 60 srun --pty bash

You can build the virtual environment directly in your home directory, but if the environment will be large, or you want to share it with other colleagues in your research group, you should name your Ceph project and change to a subdirectory. Note that most Ceph projects are named by your advisor’s Lehigh ID, with an expiration date, for example abc123_123125. In this example, we use the Ceph project for HPC training, but you should avoid this, because we periodically delete this data.

# set a variable to name your Ceph project
MY_CEPH=hpctraining_proj
# change to the shared directory if you want to share
cd $HOME/$MY_CEPH/shared
# alternately, change to your individual, private directory
cd $HOME/$MY_CEPH/$USER
# find out where you are by printing the path to the "present working directory"
echo $PWD
# or more concisely
pwd

Once you pick a place to install the environment, the instructions are simple:

python -m venv ./venv
source ./venv/bin/activate

The first command creates a virtual environment in a subfolder named venv. This is common practice, but if you plan to install more than one, you should give it a more meaningful name.

The second command "activates" the environment, basically by giving you access to it. You will need to run this command every time you want to use the software.
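
A quick way to confirm that the activation worked is to check which python is first on your path; when you are finished, deactivate returns you to the system python:

# the path should point inside your venv folder
which python
# leave the environment when you are done
deactivate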

We might want to write a single command to use this environment from any location. Recall that we used echo $PWD to check our path before installing the software. In this example, the path was:

$ echo $PWD
/share/ceph/hawk/hpctraining_proj/shared

We can combine the present working directory with the named subfolder holding our environment, to create a single command to access the virtual environment:

source /share/ceph/hawk/hpctraining_proj/shared/venv/bin/activate

To make this more concise, and easier to change later, we can write an "entrypoint" script in our home directory:

cat > ~/entry-project-v01.sh <<EOF
source /share/ceph/hawk/hpctraining_proj/shared/venv/bin/activate
EOF

Then, in the future, we can use this environment by adding a shorter command to our SLURM scripts:

source ~/entry-project-v01.sh

By writing our environment in one location we are following the DRY principle, "don’t repeat yourself". If we want to move the environment later, we can change this script without having to update many SLURM scripts.
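
As a sketch of this pattern, a minimal SLURM script might look like the following; the resource requests and the script name analysis.py are placeholders you should adapt to your own work:

cat > run-analysis.slurm <<EOF
#!/bin/bash
#SBATCH -p hawkcpu-express
#SBATCH -c 4
#SBATCH -t 60
# load the environment via the entrypoint
source ~/entry-project-v01.sh
# run your code (analysis.py is a placeholder)
python analysis.py
EOF
# submit the job
sbatch run-analysis.slurm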

After we enter the environment, our terminal prompt changes:

[rpb222@hawk-a120 shared]$ source /share/ceph/hawk/hpctraining_proj/shared/venv/bin/activate
(venv) [rpb222@hawk-a120 shared]$ 

Now we can install packages directly into the virtual environment with pip:

pip install tqdm

Installing packages one-by-one can often lead to problems. A more repeatable method is to write down your packages in a single text file and then install them all at once:

cat > reqs.txt <<EOF
numpy==2.2.2
EOF
python -m pip install -r reqs.txt

Repeatable venv method

We can combine all of the features above into a single, concise set of instructions that uses the cat trick for deploying our software environment:

# STEP 0: get an interactive session and load the python module
salloc -p hawkcpu-express -c 4 -t 60 srun --pty bash
module load python/3.13.0
# STEP 1: choose a location
# set a variable to name your Ceph project
MY_CEPH=hpctraining_proj
# change to your individual, private directory
cd $HOME/$MY_CEPH/$USER
# alternately, change to the shared directory if you want to share
cd $HOME/$MY_CEPH/shared
# check our location
pwd
# NOTE the paths
#   our environment will be installed in a path with a meaningful name:
#     /share/ceph/hawk/hpctraining_proj/shared/venv-project-v01
#   you should carefully select the Ceph project, 
#      choose between a "shared" and individual directory, 
#      and make sure to record the path to the environment
# STEP 2: build the environment
python -m venv ./venv-project-v01
# STEP 3: activate the environment
source ./venv-project-v01/bin/activate
# STEP 4: write a requirements file
cat > reqs.txt <<EOF
numpy==2.2.2
EOF
python -m pip install -r reqs.txt
# STEP 5: record the exact versions for posterity
# export the installed versions for reference; save the output to a file for later
# (the file name venv-project-v01-freeze.txt is just a suggestion)
pip freeze > venv-project-v01-freeze.txt
# STEP 6: build an entrypoint script
cat > ~/entry-project-v01.sh <<EOF
source $PWD/venv-project-v01/bin/activate
EOF
# STEP 7: use the environment from any SLURM script with:
source ~/entry-project-v01.sh

These instructions are portable to other systems, and should be included in any researcher’s lab notebook.

Conda environments

Some authors distribute their code exclusively on conda channels, for example bioconda. While we typically prefer the bog-standard python and pip installation process, conda often provides additional distributions. Before presenting a concise build method below, we will review some important context.

Requirements for using conda

Our cluster uses the miniconda3 module to provide access to Anaconda environments. Our current understanding of the Anaconda terms of service is that higher-education non-profit organizations can use Anaconda for teaching purposes, while organizations with more than 200 members are subject to licensing fees.

Our solution to this problem is to constrain miniconda3 to use conda-forge, which provides the underlying open-source packages without incurring licensing fees. The upshot is that all of our users can use miniconda3 and the conda command out of the box without any extra steps.
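
If you would like to confirm which channels your conda will use, you can print the active configuration after loading the module (the channel list you see depends on our site configuration):

module load miniconda3
conda config --show channels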

Warning: our cluster uses a modified version of conda provided by module load miniconda3 which allows you to use conda activate and conda deactivate commands without any modification to your ~/.bashrc. It is typical for a conda provider to ask you to add some commands to this file, but in our experience, this pollutes your user environment and makes it extremely confusing to use more than a single Python at one time. Our modifications to conda make it possible to use these environments without any extra customization. DO NOT modify your ~/.bashrc unless you know what you’re doing!

Readers should review the Python virtual environments section before using the following method, because the conda workflow is effectively a superset of the pip workflow with a few substitutions. Understanding this pattern benefits anyone using an environment.

Repeatable conda environments

We recommend building all conda environments from a YAML-based requirements file written into our instructions below using the cat trick. You can use the following method to customize your own environment as long as you select a location and list all of your dependencies.

# STEP 0: get an interactive session and load the python module
salloc -p hawkcpu-express -c 4 -t 60 srun --pty bash
module load miniconda3
# STEP 1: choose a location
# set a variable to name your Ceph project
MY_CEPH=hpctraining_proj
# change to your individual, private directory
cd $HOME/$MY_CEPH/$USER
# alternately, change to the shared directory if you want to share
cd $HOME/$MY_CEPH/shared
# check our location
pwd
# NOTE the paths
#   our environment will be installed in a path with a meaningful name:
#     /share/ceph/hawk/hpctraining_proj/shared/cenv-project-v01
#   you should carefully select the Ceph project, 
#      choose between a "shared" and individual directory, 
#      and make sure to record the path to the environment
# STEP 2: write the requirements file
cat > cenv-project-v01-reqs.yaml <<EOF
dependencies:
- python==3.12
- "libblas=*=*mkl"
- conda-forge::numpy
- conda-forge::scipy
- pip
- pip:
  - tqdm
EOF
# STEP 3: build the environment
conda env update -f cenv-project-v01-reqs.yaml -p ./cenv-project-v01
# STEP 4: activate the environment
conda activate ./cenv-project-v01
# STEP 5: record the exact versions for posterity
# export the environment for reference, and save the results for later
conda env export -f cenv-project-v01-export.yaml
# STEP 6: build an entrypoint script
cat > ~/entry-project-v01.sh <<EOF
module load miniconda3
conda activate $PWD/cenv-project-v01
EOF
# STEP 7: use the environment from any SLURM script with:
source ~/entry-project-v01.sh

This method takes inputs from cenv-project-v01-reqs.yaml and builds an environment from them. You can include both conda and pip packages. The export file, cenv-project-v01-export.yaml records the exact versions that conda found, so you can reproduce this environment on another system if you want.
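
For example, to rebuild the same environment elsewhere from the export file, you could run something like the following (the target path cenv-project-v01-copy is arbitrary):

# recreate the environment from the fully-pinned export file
conda env create -f cenv-project-v01-export.yaml -p ./cenv-project-v01-copy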

R packages

Before installing an R package, it can be useful to see whether it is already available. Imagine you are looking for ggplot2. You can use the Lmod system to find it:

module spider ggplot2
module load r/4.4.1 r-ggplot2

We don’t need to reload arch/cascade24v2 because this is the default. We currently provide over 30 popular R packages. If you can’t find the one you need, you can install it into your home directory using install.packages. Our default R module (r/4.4.1) modifies R_LIBS_USER so that new packages are installed to a compiler-specific directory in your home directory.
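
You can confirm where your personal R packages will be installed by printing the library paths after loading the module; the exact directory depends on the compiler used to build R:

module load r/4.4.1
# the personal library location set by the module
echo $R_LIBS_USER
# the full set of library paths that R will search
Rscript -e '.libPaths()'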

As an example, we will install the ncdf4 package into R. This package typically requires a system-level library, which on a personal machine you might install via sudo apt-get install -y libnetcdf-dev. Since our users are unprivileged, and hence lack sudo, they cannot install software into the operating system. Thankfully, the netcdf library is provided by Spack. You can find it in the usual way:

module spider netcdf
module load netcdf-c/4.9.2

It helps to put the module load commands on one line for clarity and completeness:

ml r/4.4.1 netcdf-c/4.9.2

Next we start an R session and install it:

> install.packages("ncdf4")
Installing package into ‘/share/Apps/cascade24v2/gcc-12.4.0/r-ape-5.8-air57podx6r2ynqs2jrachqhkltywrg3/rlib/R/library’
(as ‘lib’ is unspecified)
Warning in install.packages("ncdf4") :
  'lib = "/share/Apps/cascade24v2/gcc-12.4.0/r-ape-5.8-air57podx6r2ynqs2jrachqhkltywrg3/rlib/R/library"' is not writable
Would you like to use a personal library instead? (yes/No/cancel) yes
Would you like to create a personal library
‘~/R/4.4/gcc/12.4’
to install packages into? (yes/No/cancel) yes

We are not reproducing the full output here, but thanks to the Lmod system, our Spack-installed netcdf-c package populates the right BASH variables so that the R ncdf4 package can find the libraries and development headers. You can review these with another module command:

ml show netcdf-c/4.9.2

We can see that our modulefile sets CPLUS_INCLUDE_PATH and PKG_CONFIG_PATH so that the build process can find the software. This example demonstrates a pattern in which we use Spack to install common middleware and foundational libraries that we might otherwise install with sudo, and then use the package manager inside another scripting language, in this case R, to extend its functionality by linking against these libraries.
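
One way to see these variables in action is to query them directly after loading the module; the pkg-config name netcdf below is an assumption based on the standard netcdf-c install:

module load netcdf-c/4.9.2
# confirm the headers are discoverable
echo $CPLUS_INCLUDE_PATH
# ask pkg-config for the compile and link flags
pkg-config --cflags --libs netcdf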
