Compilers
Research Computing offers several tools to compile and debug software. All compiler suites provide compilers
- GNU Compiler (versions 8.3.1 and 9.3.0)
- Intel Compiler (versions 19.0.3 and 20.0.3)
- Intel OneAPI Base and HPC Toolkit (version 2021.3.0)
- NVIDIA HPC SDK (version 20.9)
for the following languages:
- C
- C++
- Fortran
All compilers except GNU Compiler 8.3.1 (the system default) are available via the module command:
```
[2022-04-28 09:27.51] ~ [alp514.sol](1046): module av gcc intel nvhpc oneapi

---------------- /share/Apps/lusoft/share/spack/lmod/avx2/linux-centos8-x86_64/Core ----------------
   gcc/9.3.0                 intel-mkl/2021.3.0    (D)   intel-tbb/2021.3.0 (D)   intel/20.0.3 (D)
   nvhpc/20.9                oneapi-inspector/2021.3.0   oneapi-mpi/2021.3.0      oneapi/2021.3.0
   intel-mkl/2020.3.279      intel-tbb/2020.3            intel/19.0.3             intel/2021.3.0
   oneapi-advisor/2021.3.0   oneapi-itac/2021.3.0        oneapi-vtune/2021.5.0

  Where:
   D:  Default Module

Use "module spider" to find all possible modules and extensions.
Use "module keyword key1 key2 ..." to search for all possible modules matching any of the "keys".
```
To compile code, first load the appropriate module and then use the correct command to compile Fortran, C, or C++ code (a short example follows the note below).
Language | GNU | Intel | OneAPI | NVIDIA HPC SDK |
---|---|---|---|---|
Fortran | gfortran | ifort | ifx | nvfortran |
C | gcc | icc | icx | nvc |
C++ | g++ | icpc | icpx | nvc++ |
Note
- Intel OneAPI also provides the legacy Intel compilers for Fortran, C and C++. The descriptions below for the Intel Compilers are also applicable to the legacy tools provided by OneAPI.
- The intel/2021.3.0 and oneapi/2021.3.0 modules are the same. However, intel/2021.3.0 is used by spack to build applications and needs to be loaded for the module files of those applications to be displayed.
- nvc is the C compiler, while nvcc is the CUDA compiler included in the NVIDIA HPC SDK.
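For example, a minimal sketch of loading a compiler module and compiling a C source file (`hello.c` is a hypothetical file name; module versions are from the listing above):

```
# Intel compiler suite
module load intel/20.0.3
icc -o hello hello.c     # hello.c is a hypothetical example source file

# NVIDIA HPC SDK
module load nvhpc/20.9
nvc -o hello hello.c
```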
Compiling and Running Serial Code
To compile code, provide the source file as an argument to the compiler command.

```
<compiler> <source code>
```

This creates an executable named a.out.
All compilers accept options (compiler flags) to specify libraries and produce an optimized executable.
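For example (a minimal sketch; `saxpy.c` is the example source file used later on this page):

```
gcc saxpy.c              # no options: the executable is named a.out
gcc -o saxpyc saxpy.c    # -o (described below) names the executable saxpyc
```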
Compiler Flags
Common to all compilers
Compiler flags vary depending on the compiler. The following flags are common across the compiler suites available on Sol; an example combining several of them follows the list.
- `-o myexec`: compile code and create an executable named `myexec`. If this option is not given, a default `a.out` is created.
- `-On`: optimize code to level n
  - n = 0: no optimization
  - n = 1: optimization for code size and execution time (default when n is omitted)
  - n = 2: more extensive optimization
  - n = 3: aggressive optimization compared to n = 2; increases compile time. Recommended for codes with loops that do intensive floating point calculations
- `-g`: tells the compiler to generate debugging information in the object file. Superseded by the -O option; use only while debugging code, not for production code.
- `-mcmodel=mem_model`: tells the compiler to use a specific memory model to generate code and store data
  - mem_model=small: restricts code and data to the first 2GB of address space
  - mem_model=medium: restricts code to the first 2GB of address space, with no restriction on data; recommended when data exceeds 2GB
  - mem_model=large: no restriction on code or data
- `-m64`: generate code for 64-bit architectures
- `-fpic`/`-fPIC`: generate position independent code (PIC) suitable for use in a shared library
- `-l{libname}`: link compiled code against a library called libname, e.g. to use the LAPACK libraries add `-llapack` as a compiler flag
- `-L{directory path}`: directory to search for libraries, e.g. `-L/usr/lib64 -llapack` will search for LAPACK libraries in /usr/lib64
- `-I{directory path}`: directory to search for include files and Fortran modules
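A sketch combining several of the flags above; the source file name, include path, and LAPACK library location are illustrative and may differ on your system:

```
# optimize aggressively, keep debug info, name the executable, and link LAPACK
gcc -O3 -g -m64 -o myexec -I$HOME/include mycode.c -L/usr/lib64 -llapack
```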
For example, to compile the serial saxpy example codes:

```
ifort -o saxpyf saxpy.f90
gcc -o saxpyc saxpy.c
```
The above options create a serial executable that can run on only one processor or CPU core. Code can also be compiled to run in parallel: either shared-memory parallel (multiple cores on the same node) using OpenMP, or distributed-memory parallel (multiple CPUs on the same or multiple nodes) using Message Passing Interface (MPI) libraries.
```
[2018-02-22 08:47.27] ~/Workshop/2017XSEDEBootCamp/OpenMP [alp514.sol-d118](842): icc -o laplacec laplace_serial.c
[2018-02-22 08:47.46] ~/Workshop/2017XSEDEBootCamp/OpenMP [alp514.sol-d118](843): ./laplacec
Maximum iterations [100-4000]? 1000
---------- Iteration number: 100 ------------
[995,995]: 63.33  [996,996]: 72.67  [997,997]: 81.40  [998,998]: 88.97  [999,999]: 94.86  [1000,1000]: 98.67
---------- Iteration number: 200 ------------
[995,995]: 79.11  [996,996]: 84.86  [997,997]: 89.91  [998,998]: 94.10  [999,999]: 97.26  [1000,1000]: 99.28
---------- Iteration number: 300 ------------
[995,995]: 85.25  [996,996]: 89.39  [997,997]: 92.96  [998,998]: 95.88  [999,999]: 98.07  [1000,1000]: 99.49
---------- Iteration number: 400 ------------
[995,995]: 88.50  [996,996]: 91.75  [997,997]: 94.52  [998,998]: 96.78  [999,999]: 98.48  [1000,1000]: 99.59
---------- Iteration number: 500 ------------
[995,995]: 90.52  [996,996]: 93.19  [997,997]: 95.47  [998,998]: 97.33  [999,999]: 98.73  [1000,1000]: 99.66
---------- Iteration number: 600 ------------
[995,995]: 91.88  [996,996]: 94.17  [997,997]: 96.11  [998,998]: 97.69  [999,999]: 98.89  [1000,1000]: 99.70
---------- Iteration number: 700 ------------
[995,995]: 92.87  [996,996]: 94.87  [997,997]: 96.57  [998,998]: 97.95  [999,999]: 99.01  [1000,1000]: 99.73
---------- Iteration number: 800 ------------
[995,995]: 93.62  [996,996]: 95.40  [997,997]: 96.91  [998,998]: 98.15  [999,999]: 99.10  [1000,1000]: 99.75
---------- Iteration number: 900 ------------
[995,995]: 94.21  [996,996]: 95.81  [997,997]: 97.18  [998,998]: 98.30  [999,999]: 99.17  [1000,1000]: 99.77
---------- Iteration number: 1000 ------------
[995,995]: 94.68  [996,996]: 96.15  [997,997]: 97.40  [998,998]: 98.42  [999,999]: 99.22  [1000,1000]: 99.78
Max error at iteration 1000 was 0.034767
Total time was 4.099030 seconds.
```
Intel Compiler
The following are common optimization flags to use when compiling with the Intel Compiler
Flag | Description |
---|---|
-ipo | Enables interprocedural optimization between files. |
-ax&lt;code&gt; / -x&lt;code&gt; | Tells the compiler to generate multiple, feature-specific auto-dispatch code paths for Intel processors if there is a performance benefit. code can be COMMON-AVX512, CORE-AVX512, CORE-AVX2, CORE-AVX-I or AVX |
-xHost | Tells the compiler to generate instructions for the highest instruction set available on the compilation host processor. DO NOT USE THIS OPTION |
-march=processor | Tells the compiler to generate code for processors that support certain features. processor can be SANDYBRIDGE, IVYBRIDGE, HASWELL, BROADWELL or SKYLAKE-AVX512. |
-mtune=processor | Performs optimizations for specific processors but does not cause extended instruction sets to be used. processor can be SANDYBRIDGE, IVYBRIDGE, HASWELL, BROADWELL or SKYLAKE-AVX512. |
-fast | Maximizes speed across the entire program. Also sets -ipo, -O3, -no-prec-div, -static, -fp-model fast=2, and -xHost. Not recommended |
-funroll-all-loops | Unroll all loops even if the number of iterations is uncertain when the loop is entered. |
-mkl | Tells the compiler to link to certain libraries in the Intel Math Kernel Library (Intel MKL) |
-shared-intel / -shared-libgcc | Links to the Intel/GNU libgcc libraries dynamically; use -shared to link all libraries dynamically |
-static-intel / -static-libgcc | Links to the Intel/GNU libgcc libraries statically; use -static to link all libraries statically |
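A sketch of an Intel compile line using a few of the flags above (`saxpy.f90` is the example source used elsewhere on this page; the flag choice is illustrative):

```
# interprocedural optimization, an AVX2 auto-dispatch code path, and Intel MKL
ifort -O3 -ipo -axCORE-AVX2 -mkl -o saxpy saxpy.f90
```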
For more details and additional compiler options, see
- https://software.intel.com/en-us/articles/step-by-step-optimizing-with-intel-c-compiler
- https://software.intel.com/en-us/fortran-compiler-developer-guide-and-reference
- https://software.intel.com/en-us/cpp-compiler-developer-guide-and-reference
- https://software.intel.com/en-us/articles/intel-mkl-link-line-advisor
GNU Compiler
The following are common optimization flags to use when compiling with the GNU Compiler
Flag | Description |
---|---|
-march=processor | Generate instructions for the machine type processor . processor can be sandybridge, ivybridge, haswell, broadwell or skylake-avx512. |
-mtune=processor | Tune everything applicable about the generated code for processor, except for the ABI and the set of available instructions. processor can be sandybridge, ivybridge, haswell, broadwell or skylake-avx512. Some older programs/makefiles might use -mcpu, which is deprecated. |
-Ofast | Disregard strict standards compliance. -Ofast enables all -O3 optimizations. It also enables optimizations that are not valid for all standard-compliant programs. It turns on -ffast-math and the Fortran-specific -fstack-arrays, unless -fmax-stack-var-size is specified, and -fno-protect-parens. |
-funroll-all-loops | Unroll all loops even if the number of iterations is uncertain when the loop is entered. |
-shared | Produce a shared object file; libraries are linked dynamically by default |
-static | Links to libraries statically |
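A sketch of a GNU compile line using the flags above (the source file name matches the earlier serial example):

```
# optimize for the Haswell base architecture of Sol and unroll loops
gcc -O3 -march=haswell -funroll-all-loops -o saxpyc saxpy.c
```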
See https://gcc.gnu.org/onlinedocs/gcc-8.1.0/gcc/Invoking-GCC.html#Invoking-GCC for more detailed description and other options
NVIDIA HPC SDK Compiler
The following are common optimization flags to use when compiling with the NVIDIA HPC SDK (formerly PGI) compilers
Flag | Description |
---|---|
-acc | Enable OpenACC directives. |
-ta=tesla(:tesla_suboptions),host | Specify the target accelerator. |
-tp <processor> | Specify the type(s) of the target processor(s). processor can be sandybridge-64, haswell-64, or skylake-64. |
-mtune=processor | Tune everything applicable about the generated code for processor, except for the ABI and the set of available instructions. processor can be sandybridge, ivybridge, haswell, broadwell or skylake-avx512. Some older programs/makefiles might use -mcpu, which is deprecated. |
-fast | Generally optimal set of flags. |
-fastsse | Generally optimal set of flags for targets that include SSE/SSE2 capability. |
-Mipa | Invokes interprocedural analysis and optimization. |
-Munroll | Controls loop unrolling. |
-Minfo | Prints informational messages regarding optimization and code generation to standard output as compilation proceeds. |
-shared | Passed to the linker. Instructs the linker to generate a shared object file. Implies -fpic. |
-Bstatic | Statically link all libraries, including the PGI runtime. |
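A sketch of an NVIDIA HPC SDK compile line using the flags above (source file name is illustrative):

```
# generally optimal flags, target Haswell, and print optimization messages
nvc -fast -tp=haswell-64 -Minfo -o saxpyc saxpy.c
```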
See https://www.pgroup.com/resources/docs/19.5/x86/pgi-ref-guide/index.htm for more detailed description and other options
Multi-architecture CPU optimization
CPU architectures change from generation to generation, affecting data/instruction processing and adding or modifying CPU instructions. A common recent trend has been improving the vectorization capabilities of CPUs. Currently, Research Computing supports three generations of Intel CPUs (Sandy Bridge, Haswell/Broadwell, and Skylake), each of which shows an incremental improvement in vectorization processing power. Starting with AVX, Intel CPUs feature increasingly complex logic for adjusting clock speed depending on how many CPU cores and vector units are being used. Each core can have its frequency adjusted independently, allowing multiple users to run different workloads. In general, if fewer cores and vector units are utilized, the CPU can run faster than when all cores and vector units are in use. It is therefore important to optimize code for the architecture of the CPU being used.
Intel and NVIDIA (formerly PGI) compilers support building multiple optimized code paths for various architectures into a single executable; GNU compilers do not support this. Applications (except GROMACS) built with the Intel and NVIDIA compilers are optimized for SandyBridge/IvyBridge, Haswell/Broadwell, and Skylake CPUs. Applications built with GNU compilers are optimized to run on Haswell (the base CPU architecture of Sol). The GROMACS compile options do not permit building multi-architecture executables; by default, GROMACS is built for Haswell/Broadwell CPUs, with Skylake-optimized builds available (see modules with the -avx512 suffix).
Intel Compilers
The Intel compilers build executables optimized for particular architectures using the -ax flag, also known as automatic CPU dispatch. To build executables that vectorize optimally on Sol (Haswell/Broadwell and Skylake) and the supported faculty clusters (SandyBridge/IvyBridge), add -axCORE-AVX512,CORE-AVX2,AVX as a compiler option.
```
[2019-07-16 15:11.16] ~/Workshop/sum2017/saxpy/solution [alp514.sol](1029): ifort -axCOMMON-AVX512,CORE-AVX512,CORE-AVX2,CORE-AVX-I,AVX -o saxpy saxpy.f90
saxpy.f90(1): (col. 9) remark: MAIN__ has been targeted for automatic cpu dispatch
```
NVIDIA HPC SDK Compilers
The NVIDIA compilers build an executable optimized for multiple architectures by bundling different architecture names into the -tp flag, also known as a unified binary. To build executables that vectorize optimally on Sol (Haswell/Broadwell and Skylake) and the supported faculty clusters (SandyBridge/IvyBridge), add -tp=sandybridge-64,haswell-64,skylake-64 as a compiler option. To check whether the code is being vectorized, add the -Minfo=vect flag.
```
[2021-07-22 15:47.47] ~/Workshop/2021HPC/parprog/solution/saxpy [alp514.sol](1134): nvfortran -fastsse -tp=haswell-64 -Minfo -o saxpy saxpy.f90
saxpy:
     11, Memory set idiom, loop replaced by call to __c_mset4
     12, Memory set idiom, loop replaced by call to __c_mset4
     16, Generated vector simd code for the loop
```
GNU Compilers
GNU compilers do not permit building a single optimized executable for multiple architectures. You need to build a separate executable for each CPU architecture using the -march=cpuarch flag, where cpuarch can be sandybridge, ivybridge, haswell, broadwell or skylake, as sketched below.
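A sketch of building one executable per target architecture with the GNU compilers (executable names are illustrative):

```
# one build per CPU generation; run the one that matches the node's architecture
gcc -O3 -march=haswell -o saxpy_haswell saxpy.c
gcc -O3 -march=skylake-avx512 -o saxpy_skylake saxpy.c
```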
OpenMP
OpenMP is an Application Program Interface (API) for thread-based parallelism. It supports Fortran, C and C++ and uses a fork-join execution model. OpenMP structures are built with program directives, runtime libraries and environment variables. OpenMP is implemented in all major compiler suites and no separate module needs to be loaded. OpenMP permits incremental parallelization of serial code by adding compiler directives that appear as comments and are only activated when the appropriate flag is added to the compile command.
Compiling OpenMP Code
Different compilers have different OpenMP compile flags.
Compiler | OpenMP Flag |
---|---|
GNU | -fopenmp |
Intel | -qopenmp |
NVIDIA | -mp |
To compile OpenMP code, add the correct flag as listed above to the compile command.
```
[2018-02-22 08:47.56] ~/Workshop/2017XSEDEBootCamp/OpenMP/Solutions [alp514.sol-d118](845): icc -qopenmp -o laplacec laplace_omp.c
```
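Sketches of the equivalent GNU and NVIDIA HPC SDK compile lines for the same source file, using the flags from the table above:

```
gcc -fopenmp -o laplacec laplace_omp.c
nvc -mp -o laplacec laplace_omp.c
```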
Running OpenMP Code
To run OpenMP code, you need to specify the number of OpenMP threads. By default, the number of OpenMP threads is equal to the total number of cores available on a node; depending on your resource request, this may be greater than the number of cores you requested. If the Intel module is loaded, the number of OpenMP threads is instead set to 1. You can set the number of threads using the environment variable OMP_NUM_THREADS.
```
BASH    : export OMP_NUM_THREADS=4
CSH/TCSH: setenv OMP_NUM_THREADS 4
```
Alternatively, you can set the OMP_NUM_THREADS on the command line while running the executable.
```
[2018-02-22 08:48.09] ~/Workshop/2017XSEDEBootCamp/OpenMP/Solutions [alp514.sol-d118](846): OMP_NUM_THREADS=4 ./laplacec
Maximum iterations [100-4000]? 1000
---------- Iteration number: 100 ------------
[995,995]: 63.33  [996,996]: 72.67  [997,997]: 81.40  [998,998]: 88.97  [999,999]: 94.86  [1000,1000]: 98.67
---------- Iteration number: 200 ------------
[995,995]: 79.11  [996,996]: 84.86  [997,997]: 89.91  [998,998]: 94.10  [999,999]: 97.26  [1000,1000]: 99.28
---------- Iteration number: 300 ------------
[995,995]: 85.25  [996,996]: 89.39  [997,997]: 92.96  [998,998]: 95.88  [999,999]: 98.07  [1000,1000]: 99.49
---------- Iteration number: 400 ------------
[995,995]: 88.50  [996,996]: 91.75  [997,997]: 94.52  [998,998]: 96.78  [999,999]: 98.48  [1000,1000]: 99.59
---------- Iteration number: 500 ------------
[995,995]: 90.52  [996,996]: 93.19  [997,997]: 95.47  [998,998]: 97.33  [999,999]: 98.73  [1000,1000]: 99.66
---------- Iteration number: 600 ------------
[995,995]: 91.88  [996,996]: 94.17  [997,997]: 96.11  [998,998]: 97.69  [999,999]: 98.89  [1000,1000]: 99.70
---------- Iteration number: 700 ------------
[995,995]: 92.87  [996,996]: 94.87  [997,997]: 96.57  [998,998]: 97.95  [999,999]: 99.01  [1000,1000]: 99.73
---------- Iteration number: 800 ------------
[995,995]: 93.62  [996,996]: 95.40  [997,997]: 96.91  [998,998]: 98.15  [999,999]: 99.10  [1000,1000]: 99.75
---------- Iteration number: 900 ------------
[995,995]: 94.21  [996,996]: 95.81  [997,997]: 97.18  [998,998]: 98.30  [999,999]: 99.17  [1000,1000]: 99.77
---------- Iteration number: 1000 ------------
[995,995]: 94.68  [996,996]: 96.15  [997,997]: 97.40  [998,998]: 98.42  [999,999]: 99.22  [1000,1000]: 99.78
Max error at iteration 1000 was 0.034767
Total time was 2.459961 seconds.
```
MPI
Message Passing Interface, or MPI, is a distributed-memory parallel programming paradigm. MPI is a library that needs to be built for each available compiler. On Sol, the default MPI library, used to build applications, is MVAPICH2. Users who develop their own code can choose to use other MPI libraries: OpenMPI (not to be confused with OpenMP above) and MPICH.
Compiling MPI codes
To compile or run MPI-enabled applications, you need to load the appropriate module. All MPI modules follow the nomenclature library/version and are only available when the corresponding compiler/version module is loaded. If no compiler module is loaded, the MPI builds based on the system compiler (GCC 8.3.1) are available.
Library | Version | Module |
---|---|---|
MVAPICH2 | 2.3.4 | mvapich2/2.3.4 |
MPICH | 3.3.2 | mpich/3.3.2 |
OpenMPI | 4.0.5 | openmpi/4.0.5 |
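For example, a sketch of loading a compiler together with its matching MPI build (this intel/mvapich2 pairing matches the `mpif90 -show` output further below):

```
module load intel/20.0.3
module load mvapich2/2.3.4
```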
The MPI libraries provide compiler wrappers that invoke the underlying compilers and link against the MPI libraries. The compile command depends only on the source language, irrespective of the MPI library being used:
Language | Compile Command |
---|---|
Fortran | mpif90 |
C | mpicc |
C++ | mpicxx |
```
[2021-07-22 15:45.52] ~ [alp514.sol](1122): mpif90 -show
/share/Apps/intel/2020/compilers_and_libraries_2020.3.275/linux/bin/intel64/ifort -lmpifort -lmpi -I/share/Apps/lusoft/opt/spack/linux-centos8-haswell/intel-20.0.3/mvapich2/2.3.4-wguydha/include -I/share/Apps/lusoft/opt/spack/linux-centos8-haswell/intel-20.0.3/mvapich2/2.3.4-wguydha/include -L/share/Apps/lusoft/opt/spack/linux-centos8-haswell/intel-20.0.3/mvapich2/2.3.4-wguydha/lib -Wl,-rpath -Wl,/share/Apps/lusoft/opt/spack/linux-centos8-haswell/intel-20.0.3/mvapich2/2.3.4-wguydha/lib
```
```
[2017-10-30 08:40.30] ~/Workshop/2017XSEDEBootCamp/MPI/Solutions [alp514.sol](1096): mpif90 -o laplace_f90 laplace_mpi.f90
[2017-10-30 08:40.45] ~/Workshop/2017XSEDEBootCamp/MPI/Solutions [alp514.sol](1097): mpicc -o laplace_c laplace_mpi.c
```
Running MPI Programs
- Every MPI implementation comes with its own job launcher: `mpiexec` (MPICH, OpenMPI & MVAPICH2), `mpirun` (OpenMPI) or `mpirun_rsh` (MVAPICH2)
- Example: `mpiexec [options] <program name> [program options]`
- Required options: the number of processes and the list of hosts on which to run the program
Option | mpiexec | mpirun | mpirun_rsh |
---|---|---|---|
run on x cores | -n x | -np x | -n x |
location of the hostfile | -f filename | -machinefile filename | -hostfile filename |
- To run an MPI code, you need to use the launcher from the same implementation that was used to compile the code.
- For example, you cannot compile code with OpenMPI and run it using the MPICH or MVAPICH2 launcher.
- Since MVAPICH2 is based on MPICH, you can launch MVAPICH2-compiled code using MPICH's launcher.
- The SLURM scheduler provides `srun` as a wrapper around all MPI launchers (see the sketch below).
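A sketch of launching an MPI executable with an explicit hostfile and under SLURM's `srun` (the hostfile name is hypothetical):

```
# mpiexec with a process count and a hostfile (options from the table above)
mpiexec -n 4 -f hosts.txt ./laplacec

# inside a SLURM job, srun can replace the MPI launcher
srun -n 4 ./laplacec
```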
```
[2018-02-22 08:48.27] ~/Workshop/2017XSEDEBootCamp/MPI/Solutions [alp514.sol-d118](848): mpicc -o laplacec laplace_mpi.c
[2018-02-22 08:48.41] ~/Workshop/2017XSEDEBootCamp/MPI/Solutions [alp514.sol-d118](849): mpiexec -n 4 ./laplacec
Maximum iterations [100-4000]? 1000
---------- Iteration number: 100 ------------
[995,995]: 63.33  [996,996]: 72.67  [997,997]: 81.40  [998,998]: 88.97  [999,999]: 94.86  [1000,1000]: 98.67
---------- Iteration number: 200 ------------
[995,995]: 79.11  [996,996]: 84.86  [997,997]: 89.91  [998,998]: 94.10  [999,999]: 97.26  [1000,1000]: 99.28
---------- Iteration number: 300 ------------
[995,995]: 85.25  [996,996]: 89.39  [997,997]: 92.96  [998,998]: 95.88  [999,999]: 98.07  [1000,1000]: 99.49
---------- Iteration number: 400 ------------
[995,995]: 88.50  [996,996]: 91.75  [997,997]: 94.52  [998,998]: 96.78  [999,999]: 98.48  [1000,1000]: 99.59
---------- Iteration number: 500 ------------
[995,995]: 90.52  [996,996]: 93.19  [997,997]: 95.47  [998,998]: 97.33  [999,999]: 98.73  [1000,1000]: 99.66
---------- Iteration number: 600 ------------
[995,995]: 91.88  [996,996]: 94.17  [997,997]: 96.11  [998,998]: 97.69  [999,999]: 98.89  [1000,1000]: 99.70
---------- Iteration number: 700 ------------
[995,995]: 92.87  [996,996]: 94.87  [997,997]: 96.57  [998,998]: 97.95  [999,999]: 99.01  [1000,1000]: 99.73
---------- Iteration number: 800 ------------
[995,995]: 93.62  [996,996]: 95.40  [997,997]: 96.91  [998,998]: 98.15  [999,999]: 99.10  [1000,1000]: 99.75
---------- Iteration number: 900 ------------
[995,995]: 94.21  [996,996]: 95.81  [997,997]: 97.18  [998,998]: 98.30  [999,999]: 99.17  [1000,1000]: 99.77
---------- Iteration number: 1000 ------------
[995,995]: 94.68  [996,996]: 96.15  [997,997]: 97.40  [998,998]: 98.42  [999,999]: 99.22  [1000,1000]: 99.78
Max error at iteration 1000 was 0.034767
Total time was 1.030180 seconds.
```