TensorFlow

These notes show you how to use TensorFlow with an NVIDIA T4 GPU on our hawkgpu partition, including a connection to a Jupyter Notebook in our Open OnDemand portal at hpcportal.cc.lehigh.edu. The result is a GPU-enabled environment with an interface similar to Google’s Colab.

Background

As of Fall 2024, the Hawk expansion to the Sol cluster provides the easiest access to GPUs for users with a discretionary allocation. Note that some faculty condo users have privileged access to somewhat newer GPUs (NVIDIA L40S) in the lake-gpu partition. The hawkgpu partition has no privileged users; however, both partitions are available to all. The hawkgpu partition provides access to NVIDIA T4 GPUs, which tend to work best for inference.

As a word of caution, anyone who uses GPUs for scientific computing must make sure that the CUDA driver, the CUDA toolkit, and user-land code (e.g. TensorFlow, Torch) are all mutually compatible. Incompatible versions tend to fail in one of two ways. First, you might notice extremely long delays before your first calculation starts. This is evidence that CUDA is trying to bridge a compatibility gap with a just-in-time (JIT) compiler. Second, you might notice generally poor performance if your code is not actually using the GPUs.
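One rough way to look for such a mismatch, once an environment exists, is to compare the CUDA version reported in the nvidia-smi header with the CUDA version TensorFlow was built against. The sketch below is only a heuristic, not a full compatibility matrix. The helper names (parse_smi_header, as_tuple, toolkit_newer_than_driver) are illustrative, and the header format is assumed to match the line "NVIDIA-SMI ... Driver Version: ... CUDA Version: ..." that nvidia-smi prints on Hawk.

```python
import re
import subprocess


def parse_smi_header(text):
    """Return (driver_version, cuda_version) from nvidia-smi output, or None."""
    # Assumed header form: "Driver Version: 515.105.01   CUDA Version: 11.7"
    match = re.search(r"Driver Version:\s*([\d.]+)\s+CUDA Version:\s*([\d.]+)", text)
    return (match.group(1), match.group(2)) if match else None


def as_tuple(version):
    """Turn '11.7' into (11, 7) so that versions compare numerically."""
    return tuple(int(part) for part in version.split(".")[:2])


def toolkit_newer_than_driver(build_cuda, driver_cuda):
    """True if the build toolkit version exceeds the version the driver reports."""
    return as_tuple(build_cuda) > as_tuple(driver_cuda)


if __name__ == "__main__":
    try:
        smi = subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout
    except FileNotFoundError:
        smi = ""  # not on a GPU node
    driver = parse_smi_header(smi)

    try:
        import tensorflow as tf
        # The 'cuda_version' key is only present in GPU-enabled builds.
        build_cuda = tf.sysconfig.get_build_info().get("cuda_version")
    except ImportError:
        build_cuda = None

    print("driver version, CUDA version from nvidia-smi:", driver)
    print("CUDA version TensorFlow was built with:", build_cuda)
    if driver and build_cuda and toolkit_newer_than_driver(build_cuda, driver[1]):
        print("Toolkit is newer than the driver reports; watch for slow startup.")
```

If the script prints the warning, that does not prove anything is broken, but it is a reason to check startup time and throughput carefully.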

Method

We typically distribute instructions for installing software or building environments in a flat text file. Users should review each line, including comments (prefixed by #). To build simple instructions that are easy to copy, we often use a "cat" trick to write a file directly from the terminal. We explain this in the comments below.

In the following method, we first start an interactive session on Hawk. After installing the environment, you are welcome to start an interactive Jupyter session on our Open OnDemand portal at hpcportal.cc.lehigh.edu. Note that you must be on the campus network or VPN to access this page. When using Jupyter on the portal, you should request a single GPU along with 6 cores, since these are usually bundled together on our hawkgpu nodes.

# get an interactive session from the terminal
# alternately use the terminal in an Open OnDemand session via hpcportal.cc.lehigh.edu
salloc -c 6 --gres=gpu:1 -p hawkgpu -t 120 srun --pty bash
# check the driver
nvidia-smi
# NVIDIA-SMI 515.105.01 Driver Version: 515.105.01 CUDA Version: 11.7
# select a ceph project, ideally associated with your advisor
CEPH=hpctraining_proj
SPOT=$HOME/$CEPH/$USER/test-gpu-hawk
mkdir -p $SPOT
cd $SPOT
module load anaconda3
# use the following block to write a new requirements file (copy everything through the second EOF into a single command in the terminal, or add the text to a yaml file manually)
cat > reqs.yaml <<EOF
dependencies:
  - python==3.12
  - pip
  - pip:
    # note that hawk has NVIDIA driver 515.105.01 which pairs with CUDA 11.7
    # see notes above regarding CUDA toolkit and driver compatibility
    # we review previous versions to constrain things https://pytorch.org/get-started/previous-versions/
    - ipykernel
    - tensorflow[and-cuda]
    - tensorboard
    - matplotlib
    - scikit-learn
    - h5py
EOF
time conda env update -f reqs.yaml -p ./cenv
conda activate ./cenv
python -c 'import tensorflow as tf;print(tf.config.list_physical_devices())'
# result is: [PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
python -c 'import tensorflow as tf;print(tf.__version__)'
# result is version 2.18
# note there are several warnings about various CUDA features which might not be available, hence we recommend paying careful attention to performance in case one of these components is important for your workflow
# after installing this, you need to add the kernel to jupyter before using it in Open OnDemand
python -m ipykernel install --user --name my-tensorflow-environment --display-name my-tensorflow-environment
# see further instructions for using this in the portal or terminal
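Listing the GPU as a physical device does not show how well it performs. As a minimal sketch, the script below times a repeated matrix multiplication so you can compare the first call, where any JIT delay would appear, against later calls. The time_calls helper, the 2048 x 2048 matrix size, and the repeat count are arbitrary choices for illustration, not part of our tested recipe.

```python
import time


def time_calls(fn, repeats=5):
    """Call fn() repeatedly and return the wall-clock seconds of each call."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return timings


def main():
    try:
        import tensorflow as tf
    except ImportError:
        print("TensorFlow is not importable; activate the environment first.")
        return

    print("GPUs visible to TensorFlow:", tf.config.list_physical_devices("GPU"))
    a = tf.random.normal((2048, 2048))  # matrix size is an arbitrary choice

    def step():
        # .numpy() copies the result to the host, so the call blocks until done
        tf.linalg.matmul(a, a).numpy()

    timings = time_calls(step)
    print("device used:", tf.linalg.matmul(a, a).device)
    print("first call: %.3f s" % timings[0])
    print("fastest later call: %.3f s" % min(timings[1:]))


if __name__ == "__main__":
    main()
```

A first call that is far slower than the later ones, or a device string that names the CPU rather than the GPU, would point to the two failure modes described in the Background section.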

Once the kernel is installed, "my-tensorflow-environment" should appear in the list of available kernels in Jupyter at hpcportal.cc.lehigh.edu. Similarly, you can access the Anaconda environment from a terminal session or a SLURM script with these commands:

module load anaconda3
conda activate $HOME/hpctraining_proj/$USER/test-gpu-hawk/cenv
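If the kernel appears in Jupyter but imports fail, it can help to confirm which Python interpreter the registered kernel launches. The sketch below reads the kernel spec written by the ipykernel install step. It assumes the default per-user Jupyter data directory on Linux (~/.local/share/jupyter), which may differ if JUPYTER_DATA_DIR or XDG_DATA_HOME is set. The kernel_python helper is a hypothetical name used only for this example.

```python
import json
from pathlib import Path


def kernel_python(name, data_dir=None):
    """Return the interpreter path recorded in a user-level kernel spec, or None."""
    if data_dir is None:
        # Assumed default per-user Jupyter data directory on Linux.
        data_dir = Path.home() / ".local" / "share" / "jupyter"
    spec_file = Path(data_dir) / "kernels" / name / "kernel.json"
    if not spec_file.is_file():
        return None
    spec = json.loads(spec_file.read_text())
    # The first argv entry is the interpreter used to start the kernel.
    return spec["argv"][0]


if __name__ == "__main__":
    print(kernel_python("my-tensorflow-environment"))
```

The printed path should point inside the environment you built, for example a python executable under the cenv directory.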

We recommend using the Ceph project associated with your faculty advisor; however, you may also wish to use an alternate path. Note that this method does not include all of the CUDA-associated components that TensorFlow might use.
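To see which CUDA-related components pip did install, you can list the packages in the active environment whose names start with "nvidia-", which is the prefix we expect the tensorflow[and-cuda] extra to use for its CUDA libraries. This is a minimal sketch using only the Python standard library. The installed_with_prefix helper is an illustrative name, and an empty result simply means no matching packages were found in the interpreter that ran the script.

```python
from importlib import metadata


def installed_with_prefix(prefix):
    """Return sorted 'name==version' strings for installed packages matching prefix."""
    found = set()
    for dist in metadata.distributions():
        meta = dist.metadata
        name = meta.get("Name") if meta is not None else None
        if name and name.lower().startswith(prefix):
            found.add(f"{name}=={dist.version}")
    return sorted(found)


if __name__ == "__main__":
    # Run this with the environment's own python so it inspects the right packages.
    for line in installed_with_prefix("nvidia-"):
        print(line)
```

Comparing this list against the warnings TensorFlow prints at import time can help you decide whether a missing component matters for your workflow.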