Page Comparison

...

All compilers except GNU Compiler 8.3.1 (the system default) are available via the module command.

Code Block

language	bash

[2022-04-28 09:27.51] ~
[alp514.sol](1046): module av gcc intel nvhpc oneapi

------------------------------------------------------------------------------ /share/Apps/lusoft/share/spack/lmod/avx2/linux-centos8-x86_64/Core -------------------------------------------------------------------------------
   gcc/9.3.0               intel-mkl/2021.3.0 (D)    intel-tbb/2021.3.0 (D)    intel/20.0.3   (D)    nvhpc/20.9                 oneapi-inspector/2021.3.0    oneapi-mpi/2021.3.0      oneapi/2021.3.0
   intel-mkl/2020.3.279    intel-tbb/2020.3          intel/19.0.3              intel/2021.3.0        oneapi-advisor/2021.3.0    oneapi-itac/2021.3.0         oneapi-vtune/2021.5.0

  Where:
   D:  Default Module

Use "module spider" to find all possible modules and extensions.
Use "module keyword key1 key2 ..." to search for all possible modules matching any of the "keys".

...

The following are common optimization flags to use when compiling with the PGI Compiler

Flag	Description
-acc	Enable OpenACC directives.
-ta=tesla(:tesla_suboptions)|host	Specify the target accelerator.
-tp <processor>	Specify the type(s) of the target processor(s). processor can be sandybridge-64, haswell-64, or skylake-64.
-mtune=processor	Tune everything applicable about the generated code to processor, except for the ABI and the set of available instructions. processor can be sandybridge, ivybridge, haswell, broadwell or skylake-avx512. Some older programs or makefiles might use -mcpu, which is deprecated.
-fast	Generally optimal set of flags.
-fastsse	Generally optimal set of flags for targets that include SSE/SSE2 capability.
-Mipa	Invokes interprocedural analysis and optimization.
-Munroll	Controls loop unrolling.
-Minfo	Prints informational messages regarding optimization and code generation to standard output as compilation proceeds.
-shared	Passed to the linker. Instructs the linker to generate a shared object file. Implies -fpic.
-Bstatic	Statically link all libraries, including the PGI runtime.

See https://www.pgroup.com/resources/docs/19.5/x86/pgi-ref-guide/index.htm for more detailed descriptions and other options.

Multi-architecture CPU optimization

CPU architectures change from generation to generation, affecting data and instruction processing and adding or modifying CPU instructions. A common recent trend has been improving the vectorization capabilities of CPUs. Currently, Research Computing supports three generations of Intel CPUs (Sandybridge, Haswell/Broadwell and Skylake), each of which shows an incremental improvement in vector processing power. Starting with AVX, Intel CPUs feature increasingly complex clock speed adjustment logic that depends on how many CPU cores and vector units are being used. Each core can have its frequency adjusted independently, allowing multiple users to run different workloads. In general, when fewer cores and vector units are in use, the CPU can run at a higher clock speed than when all cores and vector units are in use. It is therefore important to optimize the code for the architecture of the CPU being used.
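To see which of these vector instruction sets a given node supports, one option is to inspect the CPU flags that Linux reports. The following is a minimal sketch, not a Research Computing tool: the helper name best_simd is hypothetical, and the script assumes a Linux system that exposes /proc/cpuinfo.

```shell
#!/bin/sh
# Hypothetical helper: report the widest vector instruction set named in a
# space-separated list of CPU flags (as found in /proc/cpuinfo on Linux).
best_simd() {
    case " $1 " in
        *" avx512f "*) echo "avx512" ;;  # Skylake server class (AVX-512 foundation)
        *" avx2 "*)    echo "avx2"   ;;  # Haswell/Broadwell class
        *" avx "*)     echo "avx"    ;;  # Sandybridge/Ivybridge class
        *)             echo "none"   ;;
    esac
}

# Read the flags of the first core; empty if /proc/cpuinfo is unavailable.
cpu_flags=$(grep -m1 '^flags' /proc/cpuinfo 2>/dev/null || true)
echo "Widest vector extension on this node: $(best_simd "$cpu_flags")"
```

Running this on a compute node rather than the login node shows which architecture the job will actually execute on.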

Intel and PGI compilers support building code optimized for several architectures into a single executable. GNU compilers do not support this option. Any application (except GROMACS) built with the Intel or PGI compilers is optimized for Sandybridge, Haswell and Skylake CPUs. Applications built with GNU compilers are optimized to run on Haswell (the base CPU architecture of Sol). GROMACS compile options do not permit building multi-architecture executables. By default, GROMACS is built for Haswell/Broadwell CPUs, with Skylake-optimized builds also available (see modules with the -avx512 suffix).

Intel Compilers

Intel compilers build executables optimized for several architectures when given the -ax flag, also known as automatic CPU dispatch. To build executables that vectorize optimally on Sol (Haswell/Broadwell and Skylake) and on supported faculty clusters (SandyBridge/IvyBridge), add -axCORE-AVX512,CORE-AVX2,AVX as a compiler option.

Code Block

language	bash

[2019-07-16 15:11.16] ~/Workshop/sum2017/saxpy/solution
[alp514.sol](1029): ifort -axCOMMON-AVX512,CORE-AVX512,CORE-AVX2,CORE-AVX-I,AVX -o saxpy saxpy.f90
saxpy.f90(1): (col. 9) remark: MAIN__ has been targeted for automatic cpu dispatch

NVIDIA HPC SDK Compilers

NVIDIA compilers build executables optimized for several architectures when given a comma-separated list of architecture names in the -tp flag, producing what is known as a unified binary. To build executables that vectorize optimally on Sol (Haswell/Broadwell and Skylake) and on supported faculty clusters (SandyBridge/IvyBridge), add -tp=sandybridge-64,haswell-64,skylake-64 as a compiler option. To check whether the code is being vectorized, add the -Minfo=vect flag.

Code Block

language	bash

[2021-07-22 15:47.47] ~/Workshop/2021HPC/parprog/solution/saxpy
[alp514.sol](1134): nvfortran -fastsse -tp=haswell-64 -Minfo -o saxpy saxpy.f90
saxpy:
     11, Memory set idiom, loop replaced by call to __c_mset4
     12, Memory set idiom, loop replaced by call to __c_mset4
     16, Generated vector simd code for the loop
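The session above targets a single architecture (haswell-64). A unified binary for all three supported generations needs the comma-separated form of -tp. The sketch below only assembles and prints the command line (a dry run), so it can be checked without the NVIDIA HPC SDK loaded; the variable names are illustrative, and whether a given SDK release accepts several -tp targets is worth confirming in its documentation.

```shell
#!/bin/sh
# Target names taken from the text above; adjust to the nodes you run on.
targets="sandybridge-64 haswell-64 skylake-64"

# Join the space-separated list with commas to form the -tp argument.
tp_flag="-tp=$(echo "$targets" | tr ' ' ',')"

# Dry run: print the compile command instead of executing it.
compile_cmd="nvfortran -fastsse $tp_flag -Minfo=vect -o saxpy saxpy.f90"
echo "$compile_cmd"
```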

GNU Compilers

GNU compilers do not permit building a single optimized executable for multiple architectures. You need to build a separate executable for each CPU architecture using the -march=cpuarch flag where cpuarch can be sandybridge, ivybridge, haswell, broadwell or skylake.
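One way to manage the separate builds is to give each executable an architecture suffix. The following is a sketch under stated assumptions: the helper build_cmd, the -O2 optimization level and the saxpy.<arch> naming scheme are illustrative choices, not site conventions. It prints the commands as a dry run rather than running the compiler.

```shell
#!/bin/sh
# Hypothetical helper: print the gfortran command for one CPU architecture.
# The executable is suffixed with the architecture so builds do not collide.
build_cmd() {
    echo "gfortran -O2 -march=$1 -o saxpy.$1 saxpy.f90"
}

# Dry run: list one build command per architecture instead of executing them.
for arch in sandybridge haswell skylake; do
    build_cmd "$arch"
done
```

A job script can then pick the executable that matches the node it lands on.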

OpenMP

OpenMP is an Application Program Interface (API) for thread-based parallelism. It supports Fortran, C and C++ and uses a fork-join execution model. OpenMP constructs are built from program directives, runtime libraries and environment variables. OpenMP is implemented in all major compiler suites, so no separate module needs to be loaded. OpenMP permits incremental parallelization of serial code by adding compiler directives that appear as comments and are only activated when the appropriate flags are added to the compile command.

Compiling OpenMP Code

Different compilers have different OpenMP compile flags.
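As a rough illustration, the commonly documented flags are -fopenmp for GNU, -qopenmp for the classic Intel compilers and -mp for the NVIDIA HPC SDK. The sketch below wraps that mapping in a hypothetical helper, omp_flag, and prints compile commands as a dry run; the source file name saxpy_omp.f90 is an assumption made for the example.

```shell
#!/bin/sh
# Hypothetical helper: map a compiler command to its OpenMP compile flag.
omp_flag() {
    case "$1" in
        gcc|g++|gfortran)    echo "-fopenmp" ;;  # GNU
        icc|icpc|ifort)      echo "-qopenmp" ;;  # Intel (classic compilers)
        nvc|nvc++|nvfortran) echo "-mp"      ;;  # NVIDIA HPC SDK
        *) echo "unknown compiler: $1" >&2; return 1 ;;
    esac
}

# Dry run: print an OpenMP compile command for each Fortran compiler.
for fc in gfortran ifort nvfortran; do
    echo "$fc $(omp_flag "$fc") -o saxpy_omp saxpy_omp.f90"
done
```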

...

Versions Compared

Old Version 25

New Version 26
