Environments
This page reviews the best practices for building custom environments. Recall that we have two ways to use software on the cluster:
Use the Lmod modules system to find pre-installed software.
Compile your own software into a custom environment.
Our team typically installs shared software modules for general use whenever this software is sufficiently popular. If several researchers or more than one research group requires the software, we can try to add it to the module system.
Most sufficiently complicated research projects will require customization, and most mature programming languages have a mechanism for providing a virtual environment, a piece of software that provides two features:
A virtual environment functions as a package manager, allowing you to install software from a centralized repository.
Virtual environments are isolated from the system and other environments, so that you can precisely control the software that you use on a specific project.
This page focuses on the wide range of Python virtual environments, some of which are explained well in a well-read Stack Overflow answer. (Note that Stack Overflow sometimes falls into the awkward position of hosting the canonical documentation for an important question that cannot easily be answered elsewhere. Ironically, despite having rules against opinionated answers, it can often be the best public square to compare competing technologies; see the comments on pipenv and conda.)
This guide covers the following environments:
But first, let’s review best practices for building a lab notebook.
Recipes and Scripts
Researchers who use an HPC cluster should maintain a lab notebook that provides detailed instructions for repeating their work and reproducing their findings. Using a terminal makes this easy, since almost every step can be captured at the command line and added to a script.
Our group often provides instructions to users in a text block that can be executed directly in the terminal, for example:
# get an interactive session
salloc -c 6 -t 6:0:0 -p hawkcpu-express srun --pty bash
# set a variable to name your Ceph space
MY_CEPH=hpctraining_proj
# go to your shared folder
SPOT=$HOME/$MY_CEPH/shared
cd $SPOT
In this example, we are setting a BASH variable called MY_CEPH to name our research group’s Ceph project. When writing scripts and recipes that can be shared with others, it can be useful to abstract some of the personal details, in this case a storage location, into a single variable that we can reference later with $MY_CEPH. There are other variables such as $HOME and $USER which are predefined. The more we rely on these, the easier it will be to share our recipes with others.
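One way to make such a recipe easier to share is to let the reader override the variable without editing the text. The sketch below assumes the standard shell default-value expansion `${VAR:-default}`, which the example above does not use; the fallback name is simply the training project from that example.

```shell
# keep any MY_CEPH the reader has already set, otherwise fall back to the training project
MY_CEPH=${MY_CEPH:-hpctraining_proj}
# build the shared path from a predefined variable ($HOME) and our own variable
SPOT=$HOME/$MY_CEPH/shared
echo "shared folder: $SPOT"
```

A colleague could then set MY_CEPH to their own project name, for example abc123_123125, before pasting the block, and the rest of the recipe would follow along.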
The text above includes hash marks (#) to provide in-line comments so the reader understands the following step. You don’t need to execute these, but if you do, nothing will happen.
By convention, some documentation uses a dollar sign ($) to represent the command prompt, which distinguishes the command from its output, for example:
$ ls $CONDA_PREFIX/etc/profile.d/
gawk.csh gawk.sh
Since there is no Linux command "$", you should be able to infer the correct commands from context.
The cat trick
In the spirit of building self-contained, easy-to-read blocks of text to describe our works, we sometimes want to tell a user to "save some text in a file". To avoid unnecessary exposition, it can be convenient to include this inline with the commands instead of separating the text of the file from the rest of the procedure. We can accomplish this by writing a text file directly from the terminal. If we continue from our example above, we can write a file into our shared Ceph space:
cd $SPOT
cat > README.txt <<EOF
This is the shared folder for our research group.
The current path is "$PWD".
Users in a research group can share the data in this space, and this data can include shared software environments.
EOF
# next we can continue to run some other commands, first by reviewing the contents of the file
cat README.txt
In the "cat trick" above, the user must copy from the first cat through the second EOF, which must appear at the very start of its own line, that is, with no preceding spaces. If you copy this entire block of text into a terminal, it effectively writes a file, saving you the effort of opening a text editor. The command itself uses redirection operators: the here-document operator (<<EOF) feeds the lines that follow to cat until it reaches the second EOF marker, and the > operator sends that text into a file.
There are two caveats:
You cannot include any spaces after the second EOF (or else it won’t signal the end of the file).
You can use BASH variables, for example $PWD, inside the script, but they will be expanded when the file is written. If you want the script to dynamically dereference a BASH variable, you should escape the dollar sign with a backslash.
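The second caveat is easiest to see side by side. This is a minimal sketch, not part of our standard recipes: the file names expanded.txt and literal.txt are hypothetical, and the quoted delimiter ('EOF') is a standard shell feature that the text above does not mention, offered here as an alternative to escaping each dollar sign.

```shell
# work in a scratch directory so the demo does not clutter your project
cd "$(mktemp -d)"

# unquoted delimiter: $PWD is expanded now, while \$HOME is written literally as $HOME
cat > expanded.txt <<EOF
written from: $PWD
home is: \$HOME
EOF

# quoted delimiter: nothing is expanded, so no backslashes are needed
cat > literal.txt <<'EOF'
written from: $PWD
EOF

# review both files
cat expanded.txt literal.txt
```

Escaping is convenient when only one or two variables should stay literal; the quoted delimiter may be easier when the whole file is itself a script.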
Once your reader knows about the cat trick, or if they are a Linux expert already, then you can easily share elaborate installation procedures with them. Our group uses it extensively when sharing instructions with our users.
Python
While there are many different python packaging systems (for example pipenv and poetry), we focus exclusively on venv and conda, with a strong preference for using venv whenever possible, because it is the simplest method. Nevertheless, many different packages are exclusively distributed on conda channels, so users are encouraged to use their discretion when deciding between these two options.
Use preinstalled packages
We will use numpy as our example package. If you only need numpy, however, you can avoid installing a virtual environment and just use our Lmod modules system:
module load python py-numpy
This one-line command loads our optimized numpy package, which should provide faster performance than the custom-installed version because we built this copy on Intel’s MKL library, whereas the default supporting library is openblas, which might be slower on our Intel-based system.
All central python packages are visible when you load the primary python module and run ml av:
module load gcc openmpi python
module avail
# or use the concise shortcut
ml av
The python packages use the py- prefix (while R packages use the r- prefix).
Python virtual environments
Building a python virtual environment is covered in the standard library documentation for the venv module. First, make sure that you have access to the Lmod python module:
$ ml --terse
gcc/12.4.0
openmpi/5.0.5
helpers
arch/cascade24v2
tcl/8.6.12
py-pip/23.1.2
py-setuptools/69.2.0
python/3.13.0
We can see that the modules system provides python version 3.13. If this version is compatible with your software, we don’t need to take any other steps. If you need a different version, you should search for it first with module spider python. If it doesn’t exist, the easiest solution is to use a conda environment.
Our head node has limited resources, so we prefer to build a virtual environment in an interactive session:
salloc -p hawkcpu-express -c 4 -t 60 srun --pty bash
You can build the virtual environment directly in your home directory, but if the environment will be large, or you want to share it with other colleagues in your research group, you should name your Ceph project and change to a subdirectory. Note that most Ceph projects are named by your advisor’s Lehigh ID, with an expiration date, for example abc123_123125. In this example, we use the Ceph project for HPC training, but you should avoid this, because we periodically delete this data.
# set a variable to name your Ceph project
MY_CEPH=hpctraining_proj
# change to the shared directory if you want to share
cd $HOME/$MY_CEPH/shared
# alternately, change to your individual, private directory
cd $HOME/$MY_CEPH/$USER
# find out where you are by printing the path to the "present working directory"
echo $PWD
# or more concisely
pwd
Once you pick a place to install the environment, the instructions are simple:
python -m venv ./venv
source ./venv/bin/activate
The first command creates a virtual environment in a subfolder named venv. This is common practice, but if you plan to install more than one, you should give it a more meaningful name.
The second command "activates" the environment, basically by giving you access to it. You will need to run this command every time you want to use the software.
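If you are unsure whether an environment is currently active, a quick check can help. The sketch below relies on the VIRTUAL_ENV variable, which the venv activate script sets; the function name venv_status is hypothetical and only used for illustration.

```shell
# report whether a python virtual environment is active in this shell
venv_status() {
    if [ -n "${VIRTUAL_ENV:-}" ]; then
        echo "active environment: $VIRTUAL_ENV"
        # the first python on the PATH should live inside the environment
        command -v python || echo "python not found on PATH"
    else
        echo "no virtual environment is active"
    fi
}

venv_status
```

When you are finished, the deactivate command (defined by the activate script) returns your shell to its previous state.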
We might want to write a single command to use this environment from any location. Recall that we used echo $PWD to check our path before installing the software. In this example, the path was:
$ echo $PWD
/share/ceph/hawk/hpctraining_proj/shared
We can combine the present working directory with the named subfolder holding our environment, to create a single command to access the virtual environment:
source /share/ceph/hawk/hpctraining_proj/shared/venv/bin/activate
To make this more concise, and maybe easier to change later, we can write an "entrypoint" in our home directory:
cat > ~/entry-project-v01.sh <<EOF
source /share/ceph/hawk/hpctraining_proj/shared/venv/bin/activate
EOF
Then, in the future, we can use this environment by adding a shorter command to our SLURM scripts:
source ~/entry-project-v01.sh
By writing our environment in one location we are following the DRY principle, "don’t repeat yourself". If we want to move the environment later, we can change this script without having to update many SLURM scripts.
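To illustrate, here is a sketch of a SLURM script that uses the entrypoint. The file name job-example.sh is hypothetical, and the #SBATCH values simply mirror the interactive session requested earlier, so adjust them for your own work. Depending on how the environment was built, the job may also need to load the python module before the source line; treat that as something to verify on your system.

```shell
# write a small batch script; the quoted 'EOF' keeps everything inside it literal
cat > job-example.sh <<'EOF'
#!/bin/bash
#SBATCH --partition=hawkcpu-express
#SBATCH --cpus-per-task=4
#SBATCH --time=60
# enter the shared environment through the entrypoint
source ~/entry-project-v01.sh
# confirm which interpreter the job will use
which python
python --version
EOF
# submit it later with: sbatch job-example.sh
```

If the environment moves, only ~/entry-project-v01.sh changes; job scripts like this one stay the same.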
After we enter the environment, our terminal prompt changes:
[rpb222@hawk-a120 shared]$ source /share/ceph/hawk/hpctraining_proj/shared/venv/bin/activate
(venv) [rpb222@hawk-a120 shared]$