Compilers

Research Computing offers several tools to compile and debug software. The following compiler suites are available:

  • GNU Compiler (versions 8.3.1 and 9.3.0)
  • Intel Compiler (versions 19.0.3 and 20.0.3)
  • Intel OneAPI Base and HPC Toolkit (version 2021.3.0)
  • NVIDIA HPC SDK (version 20.9)

Each suite provides compilers for Fortran, C and C++ (see the table below).

All compilers except GNU Compiler 8.3.1 (the system default) are available via the module command:

[2022-04-28 09:27.51] ~
[alp514.sol](1046): module av gcc intel nvhpc oneapi

------------------------------------------------------------------------------ /share/Apps/lusoft/share/spack/lmod/avx2/linux-centos8-x86_64/Core -------------------------------------------------------------------------------
   gcc/9.3.0               intel-mkl/2021.3.0 (D)    intel-tbb/2021.3.0 (D)    intel/20.0.3   (D)    nvhpc/20.9                 oneapi-inspector/2021.3.0    oneapi-mpi/2021.3.0      oneapi/2021.3.0
   intel-mkl/2020.3.279    intel-tbb/2020.3          intel/19.0.3              intel/2021.3.0        oneapi-advisor/2021.3.0    oneapi-itac/2021.3.0         oneapi-vtune/2021.5.0

  Where:
   D:  Default Module

Use "module spider" to find all possible modules and extensions.
Use "module keyword key1 key2 ..." to search for all possible modules matching any of the "keys".

To compile code, first load the appropriate module and then use the command corresponding to the language of your Fortran/C/C++ source code:

Language    GNU         Intel    OneAPI    NVIDIA HPC SDK
Fortran     gfortran    ifort    ifx       nvfortran
C           gcc         icc      icx       nvc
C++         g++         icpc     icpx      nvc++

Note

  1. The Intel OneAPI toolkits also provide the legacy Intel compilers for Fortran, C and C++. The description below for the Intel compilers is applicable to the legacy tools provided by OneAPI.
  2. The intel/2021.3.0 and oneapi/2021.3.0 modules are the same. However, intel/2021.3.0 is used by Spack to build applications and needs to be loaded for those module files to be displayed.
  3. nvc is the C compiler, while nvcc is the CUDA compiler included in the NVIDIA HPC SDK.

Compiling and Running Serial Code

To compile code, provide the source file as an argument to the compiler command.

Usage
<compiler> <source code>

This creates an executable named a.out. All compilers accept options, or compiler flags, to link against libraries and produce an optimized executable.

Compiler Flags

Common to all compilers

Compiler flags vary depending on the compiler. The following flags are common across the compiler suites available on Sol:

  • -o myexec: compile code and create an executable named myexec. If this option is not given, a default a.out is created.
  • -On: optimize code to level n.
    • n = 0: no optimization
    • n = 1: optimization for code size and execution time. Default when n is omitted
    • n = 2: more extensive optimization
    • n = 3: aggressive optimization compared to n=2; will increase compile time. Recommended for code with loops performing intensive floating-point calculations
  • -g: tells the compiler to generate debugging information in the object file. Superseded by the -O option; use only while debugging code, not for production code.
  • -mcmodel=mem_model: tells the compiler to use a specific memory model to generate code and store data.
    • mem_model=small: restricts code and data to the first 2GB of address space
    • mem_model=medium: restricts code to the first 2GB of address space, with no restriction on data (recommended)
    • mem_model=large: no restriction on code or data
  • -m64: generate code for 64-bit architectures.
  • -fpic/-fPIC: generate position-independent code (PIC) suitable for use in a shared library.
  • -l{libname}: link compiled code to a library called libname, e.g. to use the LAPACK libraries, add -llapack as a compiler flag.
  • -L{directory path}: directory to search for libraries, e.g. -L/usr/lib64 -llapack will search for the LAPACK libraries in /usr/lib64.
  • -I{directory path}: directory to search for include files and Fortran modules.
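For example, several of these flags are often combined in a single compile line. The source file, include directory and executable names below are placeholders, not files that exist on Sol:

Combining common compiler flags (illustrative)
gcc -O3 -m64 -I${HOME}/include -o myexec mycode.c -L/usr/lib64 -llapack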
Compiling Serial code with Intel and GNU compilers
ifort -o saxpyf saxpy.f90
gcc -o saxpyc saxpy.c

The above commands create a serial executable that can run on only one processor or CPU core. Code can also be compiled to run in parallel: either shared-memory parallel, i.e. multiple processors on the same node using OpenMP, or distributed-memory parallel, i.e. multiple CPUs on the same or multiple nodes using Message Passing Interface (MPI) libraries.

Running Serial Code
[2018-02-22 08:47.27] ~/Workshop/2017XSEDEBootCamp/OpenMP
[alp514.sol-d118](842): icc -o laplacec laplace_serial.c
[2018-02-22 08:47.46] ~/Workshop/2017XSEDEBootCamp/OpenMP
[alp514.sol-d118](843): ./laplacec
Maximum iterations [100-4000]?
1000
---------- Iteration number: 100 ------------
[995,995]: 63.33  [996,996]: 72.67  [997,997]: 81.40  [998,998]: 88.97  [999,999]: 94.86  [1000,1000]: 98.67
---------- Iteration number: 200 ------------
[995,995]: 79.11  [996,996]: 84.86  [997,997]: 89.91  [998,998]: 94.10  [999,999]: 97.26  [1000,1000]: 99.28
---------- Iteration number: 300 ------------
[995,995]: 85.25  [996,996]: 89.39  [997,997]: 92.96  [998,998]: 95.88  [999,999]: 98.07  [1000,1000]: 99.49
---------- Iteration number: 400 ------------
[995,995]: 88.50  [996,996]: 91.75  [997,997]: 94.52  [998,998]: 96.78  [999,999]: 98.48  [1000,1000]: 99.59
---------- Iteration number: 500 ------------
[995,995]: 90.52  [996,996]: 93.19  [997,997]: 95.47  [998,998]: 97.33  [999,999]: 98.73  [1000,1000]: 99.66
---------- Iteration number: 600 ------------
[995,995]: 91.88  [996,996]: 94.17  [997,997]: 96.11  [998,998]: 97.69  [999,999]: 98.89  [1000,1000]: 99.70
---------- Iteration number: 700 ------------
[995,995]: 92.87  [996,996]: 94.87  [997,997]: 96.57  [998,998]: 97.95  [999,999]: 99.01  [1000,1000]: 99.73
---------- Iteration number: 800 ------------
[995,995]: 93.62  [996,996]: 95.40  [997,997]: 96.91  [998,998]: 98.15  [999,999]: 99.10  [1000,1000]: 99.75
---------- Iteration number: 900 ------------
[995,995]: 94.21  [996,996]: 95.81  [997,997]: 97.18  [998,998]: 98.30  [999,999]: 99.17  [1000,1000]: 99.77
---------- Iteration number: 1000 ------------
[995,995]: 94.68  [996,996]: 96.15  [997,997]: 97.40  [998,998]: 98.42  [999,999]: 99.22  [1000,1000]: 99.78

Max error at iteration 1000 was 0.034767
Total time was 4.099030 seconds.


Intel Compiler

The following are common optimization flags to use when compiling with the Intel Compiler

  • -ipo: Enables interprocedural optimization between files.
  • -ax<code>/-x<code>: Tells the compiler to generate multiple, feature-specific auto-dispatch code paths for Intel processors if there is a performance benefit. code can be COMMON-AVX512, CORE-AVX512, CORE-AVX2, CORE-AVX-I or AVX.
  • -xHost: Tells the compiler to generate instructions for the highest instruction set available on the compilation host processor. DO NOT USE THIS OPTION.
  • -march=processor: Tells the compiler to generate code for processors that support certain features. processor can be sandybridge, ivybridge, haswell, broadwell or skylake-avx512.
  • -mtune=processor: Performs optimizations for specific processors but does not cause extended instruction sets to be used. processor can be sandybridge, ivybridge, haswell, broadwell or skylake-avx512.
  • -fast: Maximizes speed across the entire program. Also sets -ipo, -O3, -no-prec-div, -static, -fp-model fast=2, and -xHost. Not recommended.
  • -funroll-all-loops: Unroll all loops even if the number of iterations is uncertain when the loop is entered.
  • -mkl: Tells the compiler to link to certain libraries in the Intel Math Kernel Library (Intel MKL).
  • -shared-intel/-shared-libgcc: Links the Intel/GNU libgcc libraries dynamically; use -shared to link all libraries dynamically.
  • -static-intel/-static-libgcc: Links the Intel/GNU libgcc libraries statically; use -static to link all libraries statically.
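For example, an optimized Intel build for the CPU generations available on Sol might combine these flags as shown below; the source and executable names are placeholders:

Compiling with Intel optimization flags (illustrative)
ifort -O3 -ipo -axCORE-AVX512,CORE-AVX2,AVX -o myexec mycode.f90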

For more details and additional compiler options, see the Intel compiler documentation.

GNU Compiler

The following are common optimization flags to use when compiling with the GNU Compiler

  • -march=processor: Generate instructions for the machine type processor. processor can be sandybridge, ivybridge, haswell, broadwell or skylake-avx512.
  • -mtune=processor: Tune to processor everything applicable about the generated code, except for the ABI and the set of available instructions. processor can be sandybridge, ivybridge, haswell, broadwell or skylake-avx512. Some older programs/makefiles might use -mcpu, which is deprecated.
  • -Ofast: Disregard strict standards compliance. -Ofast enables all -O3 optimizations. It also enables optimizations that are not valid for all standards-compliant programs. It turns on -ffast-math and the Fortran-specific -fstack-arrays (unless -fmax-stack-var-size is specified) and -fno-protect-parens.
  • -funroll-all-loops: Unroll all loops even if the number of iterations is uncertain when the loop is entered.
  • -shared: Links to libraries dynamically.
  • -static: Links to libraries statically.
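As an illustration, a GNU build targeting the Haswell CPUs on Sol might look like the following; the source and executable names are placeholders:

Compiling with GNU optimization flags (illustrative)
gcc -O3 -march=haswell -funroll-all-loops -o myexec mycode.c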

See https://gcc.gnu.org/onlinedocs/gcc-8.1.0/gcc/Invoking-GCC.html#Invoking-GCC for a more detailed description and other options.

NVIDIA HPC SDK Compiler

The following are common optimization flags to use when compiling with the NVIDIA HPC SDK (formerly PGI) compilers

  • -acc: Enable OpenACC directives.
  • -acc=tesla(:tesla_suboptions)|host: Specify the target accelerator.
  • -tp <processor>: Specify the type(s) of the target processor(s). processor can be sandybridge-64, haswell-64, or skylake-64.
  • -mtune=processor: Tune to processor everything applicable about the generated code, except for the ABI and the set of available instructions. processor can be sandybridge, ivybridge, haswell, broadwell or skylake-avx512. Some older programs/makefiles might use -mcpu, which is deprecated.
  • -fast: Generally optimal set of flags.
  • -fastsse: Generally optimal set of flags for targets that include SSE/SSE2 capability.
  • -Mipa: Invokes interprocedural analysis and optimization.
  • -Munroll: Controls loop unrolling.
  • -Minfo: Prints informational messages regarding optimization and code generation to standard output as compilation proceeds.
  • -shared: Passed to the linker. Instructs the linker to generate a shared object file. Implies -fpic.
  • -Bstatic: Statically link all libraries, including the PGI runtime.
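For example, an optimized NVIDIA HPC SDK build for the Haswell nodes might look like the following; the source and executable names are placeholders:

Compiling with NVIDIA HPC SDK optimization flags (illustrative)
nvfortran -fastsse -tp=haswell-64 -Minfo -o myexec mycode.f90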

See https://www.pgroup.com/resources/docs/19.5/x86/pgi-ref-guide/index.htm for a more detailed description and other options.

Multi-architecture CPU optimization

CPU architectures change from generation to generation, affecting data/instruction processing and adding or modifying CPU instructions. A recent trend has been improving the vectorization capabilities of CPUs. Currently, Research Computing supports three generations of Intel CPUs - Sandybridge, Haswell/Broadwell and Skylake - each of which shows incremental improvement in vectorization processing power. Starting with AVX, Intel CPUs feature increasingly complex logic for clock speed adjustments depending on how many CPU cores and vector units are being used. Each core can have its frequency adjusted independently, allowing multiple users to run different workloads. In general, if fewer cores and vector units are utilized, the CPU can run faster than when all cores and vector units are used. It is therefore important to optimize the code for the architecture of the CPU being used.


Intel and PGI (NVIDIA) compilers support building multiple optimized code paths for various architectures into a single executable. GNU compilers do not support this option. Applications (except GROMACS) built with the Intel and PGI compilers are optimized for SandyBridge/IvyBridge, Haswell/Broadwell and Skylake CPUs. Applications built with GNU compilers are optimized to run on Haswell (the base CPU architecture of Sol). The GROMACS compile options do not permit building multiple-architecture executables; by default, GROMACS is built for Haswell/Broadwell CPUs, with Skylake-optimized builds available (see modules with the -avx512 suffix).


Intel Compilers

The Intel compilers build executables optimized for multiple architectures using the -ax flag, also known as automatic CPU dispatch. To build executables that vectorize optimally on Sol (Haswell/Broadwell and Skylake) and supported faculty clusters (SandyBridge/IvyBridge), add -axCORE-AVX512,CORE-AVX2,AVX as a compiler option.

[2019-07-16 15:11.16] ~/Workshop/sum2017/saxpy/solution
[alp514.sol](1029): ifort -axCOMMON-AVX512,CORE-AVX512,CORE-AVX2,CORE-AVX-I,AVX -o saxpy saxpy.f90
saxpy.f90(1): (col. 9) remark: MAIN__ has been targeted for automatic cpu dispatch


NVIDIA HPC SDK Compilers

The NVIDIA compilers build executables optimized for multiple architectures by bundling different architecture names into the -tp flag, producing what is known as a unified binary. To build executables that vectorize optimally on Sol (Haswell/Broadwell and Skylake) and supported faculty clusters (SandyBridge/IvyBridge), add -tp=sandybridge-64,haswell-64,skylake-64 as a compiler option. To check whether the code is being vectorized, add the -Minfo=vect flag.

[2021-07-22 15:47.47] ~/Workshop/2021HPC/parprog/solution/saxpy
[alp514.sol](1134): nvfortran -fastsse -tp=haswell-64 -Minfo -o saxpy saxpy.f90
saxpy:
     11, Memory set idiom, loop replaced by call to __c_mset4
     12, Memory set idiom, loop replaced by call to __c_mset4
     16, Generated vector simd code for the loop


GNU Compilers

GNU compilers do not permit building a single optimized executable for multiple architectures. You need to build a separate executable for each CPU architecture using the -march=cpuarch flag, where cpuarch can be sandybridge, ivybridge, haswell, broadwell or skylake-avx512, as shown in the example below.
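For example, producing one executable per architecture might look like the following; the executable names are placeholders:

Building separate executables per CPU architecture (illustrative)
gcc -O3 -march=haswell -o saxpy_haswell saxpy.c
gcc -O3 -march=skylake-avx512 -o saxpy_skylake saxpy.c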

OpenMP

OpenMP is an Application Program Interface (API) for thread-based parallelism. It supports Fortran, C and C++ and uses a fork-join execution model. OpenMP structures are built with program directives, runtime libraries and environment variables. OpenMP is implemented in all major compiler suites, so no separate module needs to be loaded. OpenMP permits incremental parallelization of serial code by adding compiler directives that appear as comments and are only activated when the appropriate flag is added to the compile command, as sketched in the example below.
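As a minimal sketch (not one of the workshop codes used elsewhere on this page), a single directive parallelizes a loop; without an OpenMP compile flag the directive is treated as an ordinary pragma/comment and the code runs serially:

A minimal OpenMP example in C (illustrative)
#include <stdio.h>

#define N 10000000

int main(void) {
    static double x[N], y[N];
    const double a = 2.0;

    /* This directive is ignored unless OpenMP is enabled at compile time
       with -fopenmp (GNU), -qopenmp (Intel) or -mp (NVIDIA). */
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        x[i] = 1.0;
        y[i] = a * x[i] + 1.0;   /* saxpy-style update */
    }

    printf("y[0] = %f, y[N-1] = %f\n", y[0], y[N - 1]);
    return 0;
}

Compiled with an OpenMP flag (e.g. gcc -fopenmp), the loop iterations are shared across threads; compiled without the flag, the same source builds and runs serially.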

Compiling OpenMP Code

Different compilers have different OpenMP compile flags.

Compiler    OpenMP Flag
GNU         -fopenmp
Intel       -qopenmp
NVIDIA      -mp

To compile OpenMP code, add the correct flag as listed above to the compile command.

Compiling OpenMP code with Intel Compiler
[2018-02-22 08:47.56] ~/Workshop/2017XSEDEBootCamp/OpenMP/Solutions
[alp514.sol-d118](845): icc -qopenmp -o laplacec laplace_omp.c

Running OpenMP Code

To run OpenMP code, you need to specify the number of OpenMP threads used to run the code. By default, the number of OpenMP threads is equal to the total number of cores available on a node. Depending on your resource request, this may be greater than the number of cores requested. If the Intel module is loaded, the number of OpenMP threads is instead set to 1. You can set the number of threads using the environment variable OMP_NUM_THREADS:

BASH: export OMP_NUM_THREADS=4
CSH/TCSH: setenv OMP_NUM_THREADS 4

Alternatively, you can set OMP_NUM_THREADS on the command line when running the executable.

Running OpenMP code on 4 threads
[2018-02-22 08:48.09] ~/Workshop/2017XSEDEBootCamp/OpenMP/Solutions
[alp514.sol-d118](846): OMP_NUM_THREADS=4 ./laplacec
Maximum iterations [100-4000]?
1000
---------- Iteration number: 100 ------------
[995,995]: 63.33  [996,996]: 72.67  [997,997]: 81.40  [998,998]: 88.97  [999,999]: 94.86  [1000,1000]: 98.67
---------- Iteration number: 200 ------------
[995,995]: 79.11  [996,996]: 84.86  [997,997]: 89.91  [998,998]: 94.10  [999,999]: 97.26  [1000,1000]: 99.28
---------- Iteration number: 300 ------------
[995,995]: 85.25  [996,996]: 89.39  [997,997]: 92.96  [998,998]: 95.88  [999,999]: 98.07  [1000,1000]: 99.49
---------- Iteration number: 400 ------------
[995,995]: 88.50  [996,996]: 91.75  [997,997]: 94.52  [998,998]: 96.78  [999,999]: 98.48  [1000,1000]: 99.59
---------- Iteration number: 500 ------------
[995,995]: 90.52  [996,996]: 93.19  [997,997]: 95.47  [998,998]: 97.33  [999,999]: 98.73  [1000,1000]: 99.66
---------- Iteration number: 600 ------------
[995,995]: 91.88  [996,996]: 94.17  [997,997]: 96.11  [998,998]: 97.69  [999,999]: 98.89  [1000,1000]: 99.70
---------- Iteration number: 700 ------------
[995,995]: 92.87  [996,996]: 94.87  [997,997]: 96.57  [998,998]: 97.95  [999,999]: 99.01  [1000,1000]: 99.73
---------- Iteration number: 800 ------------
[995,995]: 93.62  [996,996]: 95.40  [997,997]: 96.91  [998,998]: 98.15  [999,999]: 99.10  [1000,1000]: 99.75
---------- Iteration number: 900 ------------
[995,995]: 94.21  [996,996]: 95.81  [997,997]: 97.18  [998,998]: 98.30  [999,999]: 99.17  [1000,1000]: 99.77
---------- Iteration number: 1000 ------------
[995,995]: 94.68  [996,996]: 96.15  [997,997]: 97.40  [998,998]: 98.42  [999,999]: 99.22  [1000,1000]: 99.78

Max error at iteration 1000 was 0.034767
Total time was 2.459961 seconds.

MPI

Message Passing Interface (MPI) is a distributed-memory parallel programming paradigm. MPI is a library that needs to be built for each available compiler. On Sol, the default MPI library is MVAPICH2, which is used to build applications. Users who develop their own code can choose other MPI libraries: OpenMPI (not to be confused with OpenMP above) and MPICH.

Compiling MPI codes

To compile or run MPI-enabled applications, you need to load the appropriate module. All MPI modules follow the nomenclature library/version and are only available when the corresponding compiler/version module is loaded (see the example after the table below). If no compiler module is loaded, the MPI builds based on the system compiler, GCC 8.3.1, are available.

Library     Version    Module
MVAPICH2    2.3.4      mvapich2/2.3.4
MPICH       3.3.2      mpich/3.3.2
OpenMPI     4.0.5      openmpi/4.0.5
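For example, to compile with MVAPICH2 built against the Intel 20.0.3 compilers, a typical sequence might be:

Loading an MPI module (illustrative)
module load intel/20.0.3
module load mvapich2/2.3.4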

The MPI libraries provide compiler wrappers around the underlying compilers that handle linking against the MPI libraries. The wrapper command used to compile code depends on the source language and is the same irrespective of the MPI library being used:

Language    Compile Command
Fortran     mpif90
C           mpicc
C++         mpicxx

[2021-07-22 15:45.52] ~
[alp514.sol](1122): mpif90 -show
/share/Apps/intel/2020/compilers_and_libraries_2020.3.275/linux/bin/intel64/ifort -lmpifort -lmpi -I/share/Apps/lusoft/opt/spack/linux-centos8-haswell/intel-20.0.3/mvapich2/2.3.4-wguydha/include -I/share/Apps/lusoft/opt/spack/linux-centos8-haswell/intel-20.0.3/mvapich2/2.3.4-wguydha/include -L/share/Apps/lusoft/opt/spack/linux-centos8-haswell/intel-20.0.3/mvapich2/2.3.4-wguydha/lib -Wl,-rpath -Wl,/share/Apps/lusoft/opt/spack/linux-centos8-haswell/intel-20.0.3/mvapich2/2.3.4-wguydha/lib
Compiling Fortran and C code with MPI
[2017-10-30 08:40.30] ~/Workshop/2017XSEDEBootCamp/MPI/Solutions
[alp514.sol](1096): mpif90 -o laplace_f90 laplace_mpi.f90 
[2017-10-30 08:40.45] ~/Workshop/2017XSEDEBootCamp/MPI/Solutions
[alp514.sol](1097): mpicc -o laplace_c laplace_mpi.c

Running MPI Programs

  • Every MPI implementation comes with its own job launcher: mpiexec (MPICH, OpenMPI & MVAPICH2), mpirun (OpenMPI) or mpirun_rsh (MVAPICH2)
  • Example: mpiexec [options] <program name> [program options]
  • Required options: the number of processes and the list of hosts on which to run the program


Option                      mpiexec        mpirun                  mpirun_rsh
run on x cores              -n x           -np x                   -n x
location of the hostfile    -f filename    -machinefile filename   -hostfile filename

  • To run an MPI code, you need to use the launcher from the same implementation that was used to compile the code.
  • For example, you cannot compile code with OpenMPI and run it using the MPICH or MVAPICH2 launchers.
    • Since MVAPICH2 is based on MPICH, you can launch MVAPICH2-compiled code using MPICH's launcher.
  • The SLURM scheduler provides srun as a wrapper around all MPI launchers (see the example at the end of this section).
[2018-02-22 08:48.27] ~/Workshop/2017XSEDEBootCamp/MPI/Solutions
[alp514.sol-d118](848): mpicc -o laplacec laplace_mpi.c
[2018-02-22 08:48.41] ~/Workshop/2017XSEDEBootCamp/MPI/Solutions
[alp514.sol-d118](849): mpiexec -n 4 ./laplacec
Maximum iterations [100-4000]?
1000
---------- Iteration number: 100 ------------
[995,995]: 63.33  [996,996]: 72.67  [997,997]: 81.40  [998,998]: 88.97  [999,999]: 94.86  [1000,1000]: 98.67
---------- Iteration number: 200 ------------
[995,995]: 79.11  [996,996]: 84.86  [997,997]: 89.91  [998,998]: 94.10  [999,999]: 97.26  [1000,1000]: 99.28
---------- Iteration number: 300 ------------
[995,995]: 85.25  [996,996]: 89.39  [997,997]: 92.96  [998,998]: 95.88  [999,999]: 98.07  [1000,1000]: 99.49
---------- Iteration number: 400 ------------
[995,995]: 88.50  [996,996]: 91.75  [997,997]: 94.52  [998,998]: 96.78  [999,999]: 98.48  [1000,1000]: 99.59
---------- Iteration number: 500 ------------
[995,995]: 90.52  [996,996]: 93.19  [997,997]: 95.47  [998,998]: 97.33  [999,999]: 98.73  [1000,1000]: 99.66
---------- Iteration number: 600 ------------
[995,995]: 91.88  [996,996]: 94.17  [997,997]: 96.11  [998,998]: 97.69  [999,999]: 98.89  [1000,1000]: 99.70
---------- Iteration number: 700 ------------
[995,995]: 92.87  [996,996]: 94.87  [997,997]: 96.57  [998,998]: 97.95  [999,999]: 99.01  [1000,1000]: 99.73
---------- Iteration number: 800 ------------
[995,995]: 93.62  [996,996]: 95.40  [997,997]: 96.91  [998,998]: 98.15  [999,999]: 99.10  [1000,1000]: 99.75
---------- Iteration number: 900 ------------
[995,995]: 94.21  [996,996]: 95.81  [997,997]: 97.18  [998,998]: 98.30  [999,999]: 99.17  [1000,1000]: 99.77
---------- Iteration number: 1000 ------------
[995,995]: 94.68  [996,996]: 96.15  [997,997]: 97.40  [998,998]: 98.42  [999,999]: 99.22  [1000,1000]: 99.78

Max error at iteration 1000 was 0.034767
Total time was 1.030180 seconds.
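
Within a SLURM job, the same executable can also be launched with srun instead of the MPI-specific launcher. A minimal sketch, assuming a job allocation with at least 4 tasks:

Running MPI code with srun (illustrative)
srun -n 4 ./laplacec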