5.3.3. KOKKOS package

The KOKKOS package was developed primarily by Christian Trott (Sandia) with contributions of various styles by others, including Sikandar Mashayak (UIUC), Stan Moore (Sandia), and Ray Shan (Sandia). The underlying Kokkos library was written primarily by Carter Edwards, Christian Trott, and Dan Sunderland (all Sandia).

The KOKKOS package contains versions of pair, fix, and atom styles that use data structures and macros provided by the Kokkos library, which is included with LAMMPS in lib/kokkos.

The Kokkos library is part of Trilinos and can also be downloaded from Github. Kokkos is a templated C++ library that provides two key abstractions for an application like LAMMPS. First, it allows a single implementation of an application kernel (e.g. a pair style) to run efficiently on different kinds of hardware, such as a GPU, Intel Phi, or many-core CPU.

The Kokkos library also provides data abstractions to adjust (at compile time) the memory layout of basic data structures like 2d and 3d arrays and allow the transparent utilization of special hardware load and store operations. Such data structures are used in LAMMPS to store atom coordinates or forces or neighbor lists. The layout is chosen to optimize performance on different platforms. Again this functionality is hidden from the developer, and does not affect how the kernel is coded.

These abstractions are set at build time, when LAMMPS is compiled with the KOKKOS package installed. All Kokkos operations occur within the context of an individual MPI task running on a single node of the machine. The total number of MPI tasks used by LAMMPS (one or multiple per compute node) is set in the usual manner via the mpirun or mpiexec commands, and is independent of Kokkos.

Kokkos currently provides support for 3 modes of execution (per MPI task). These are OpenMP (for many-core CPUs), Cuda (for NVIDIA GPUs), and OpenMP (for Intel Phi). Note that the KOKKOS package supports running on the Phi in native mode, not offload mode like the USER-INTEL package supports. You choose the mode at build time to produce an executable compatible with specific hardware.

Here is a quick overview of how to use the KOKKOS package for CPU acceleration, assuming one or more 16-core nodes. More details follow.

use a C++11 compatible compiler
make yes-kokkos
make mpi KOKKOS_DEVICES=OpenMP                 # build with the KOKKOS package
make kokkos_omp                                # or Makefile.kokkos_omp already has variable set
Make.py -v -p kokkos -kokkos omp -o mpi -a file mpi   # or one-line build via Make.py

mpirun -np 16 lmp_mpi -k on -sf kk -in in.lj              # 1 node, 16 MPI tasks/node, no threads
mpirun -np 2 -ppn 1 lmp_mpi -k on t 16 -sf kk -in in.lj   # 2 nodes, 1 MPI task/node, 16 threads/task
mpirun -np 2 lmp_mpi -k on t 8 -sf kk -in in.lj           # 1 node, 2 MPI tasks/node, 8 threads/task
mpirun -np 32 -ppn 4 lmp_mpi -k on t 4 -sf kk -in in.lj   # 8 nodes, 4 MPI tasks/node, 4 threads/task

specify variables and settings in your Makefile.machine that enable OpenMP, GPU, or Phi support
include the KOKKOS package and build LAMMPS
enable the KOKKOS package and its hardware options via the “-k on” command-line switch use KOKKOS styles in your input script

Here is a quick overview of how to use the KOKKOS package for GPUs, assuming one or more nodes, each with 16 cores and a GPU. More details follow.

discuss use of NVCC, which Makefiles to examine

use a C++11 compatible compiler
KOKKOS_DEVICES = Cuda, OpenMP
KOKKOS_ARCH = Kepler35
make yes-kokkos
make machine
Make.py -p kokkos -kokkos cuda arch=31 -o kokkos_cuda -a file kokkos_cuda

mpirun -np 1 lmp_cuda -k on t 6 -sf kk -in in.lj          # one MPI task, 6 threads on CPU
mpirun -np 4 -ppn 1 lmp_cuda -k on t 6 -sf kk -in in.lj   # ditto on 4 nodes

mpirun -np 2 lmp_cuda -k on t 8 g 2 -sf kk -in in.lj           # two MPI tasks, 8 threads per CPU
mpirun -np 32 -ppn 2 lmp_cuda -k on t 8 g 2 -sf kk -in in.lj   # ditto on 16 nodes

Here is a quick overview of how to use the KOKKOS package for the Intel Phi:

use a C++11 compatible compiler
KOKKOS_DEVICES = OpenMP
KOKKOS_ARCH = KNC
make yes-kokkos
make machine
Make.py -p kokkos -kokkos phi -o kokkos_phi -a file mpi

host=MIC, Intel Phi with 61 cores (240 threads/phi via 4x hardware threading):
mpirun -np 1 lmp_g++ -k on t 240 -sf kk -in in.lj           # 1 MPI task on 1 Phi, 1*240 = 240
mpirun -np 30 lmp_g++ -k on t 8 -sf kk -in in.lj            # 30 MPI tasks on 1 Phi, 30*8 = 240
mpirun -np 12 lmp_g++ -k on t 20 -sf kk -in in.lj           # 12 MPI tasks on 1 Phi, 12*20 = 240
mpirun -np 96 -ppn 12 lmp_g++ -k on t 20 -sf kk -in in.lj   # ditto on 8 Phis

Required hardware/software:

Kokkos support within LAMMPS must be built with a C++11 compatible compiler. If using gcc, version 4.8.1 or later is required.

To build with Kokkos support for CPUs, your compiler must support the OpenMP interface. You should have one or more multi-core CPUs so that multiple threads can be launched by each MPI task running on a CPU.

To build with Kokkos support for NVIDIA GPUs, NVIDIA Cuda software version 6.5 or later must be installed on your system. See the discussion for the GPU package for details of how to check and do this.

Note

For good performance of the KOKKOS package on GPUs, you must have Kepler generation GPUs (or later). The Kokkos library exploits texture cache options not supported by Telsa generation GPUs (or older).

To build with Kokkos support for Intel Xeon Phi coprocessors, your sysmte must be configured to use them in “native” mode, not “offload” mode like the USER-INTEL package supports.

Building LAMMPS with the KOKKOS package:

You must choose at build time whether to build for CPUs (OpenMP), GPUs, or Phi.

You can do any of these in one line, using the src/Make.py script, described in Section 2.4 of the manual. Type “Make.py -h” for help. If run from the src directory, these commands will create src/lmp_kokkos_omp, lmp_kokkos_cuda, and lmp_kokkos_phi. Note that the OMP and PHI options use src/MAKE/Makefile.mpi as the starting Makefile.machine. The CUDA option uses src/MAKE/OPTIONS/Makefile.kokkos_cuda.

The latter two steps can be done using the “-k on”, “-pk kokkos” and “-sf kk” command-line switches respectively. Or the effect of the “-pk” or “-sf” switches can be duplicated by adding the package kokkos or suffix kk commands respectively to your input script.

Or you can follow these steps:

CPU-only (run all-MPI or with OpenMP threading):

cd lammps/src
make yes-kokkos
make kokkos_omp

CPU-only (only MPI, no threading):

cd lammps/src
make yes-kokkos
make kokkos_mpi

Intel Xeon Phi (Intel Compiler, Intel MPI):

cd lammps/src
make yes-kokkos
make kokkos_phi

CPUs and GPUs (with MPICH):

cd lammps/src
make yes-kokkos
make kokkos_cuda_mpich

These examples set the KOKKOS-specific OMP, MIC, CUDA variables on the make command line which requires a GNU-compatible make command. Try “gmake” if your system’s standard make complains.

Note

If you build using make line variables and re-build LAMMPS twice with different KOKKOS options and the *same* target, e.g. g++ in the first two examples above, then you *must* perform a “make clean-all” or “make clean-machine” before each build. This is to force all the KOKKOS-dependent files to be re-compiled with the new options.

Note

Currently, there are no precision options with the KOKKOS package. All compilation and computation is performed in double precision.

There are other allowed options when building with the KOKKOS package. As above, they can be set either as variables on the make command line or in Makefile.machine. This is the full list of options, including those discussed above, Each takes a value shown below. The default value is listed, which is set in the lib/kokkos/Makefile.kokkos file.

#Default settings specific options #Options: force_uvm,use_ldg,rdc

KOKKOS_DEVICES, values = OpenMP, Serial, Pthreads, Cuda, default = OpenMP
KOKKOS_ARCH, values = KNC, SNB, HSW, Kepler, Kepler30, Kepler32, Kepler35, Kepler37, Maxwell, Maxwell50, Maxwell52, Maxwell53, ARMv8, BGQ, Power7, Power8, default = none
KOKKOS_DEBUG, values = yes, no, default = no
KOKKOS_USE_TPLS, values = hwloc, librt, default = none
KOKKOS_CUDA_OPTIONS, values = force_uvm, use_ldg, rdc

KOKKOS_DEVICE sets the parallelization method used for Kokkos code (within LAMMPS). KOKKOS_DEVICES=OpenMP means that OpenMP will be used. KOKKOS_DEVICES=Pthreads means that pthreads will be used. KOKKOS_DEVICES=Cuda means an NVIDIA GPU running CUDA will be used.

If KOKKOS_DEVICES=Cuda, then the lo-level Makefile in the src/MAKE directory must use “nvcc” as its compiler, via its CC setting. For best performance its CCFLAGS setting should use -O3 and have a KOKKOS_ARCH setting that matches the compute capability of your NVIDIA hardware and software installation, e.g. KOKKOS_ARCH=Kepler30. Note the minimal required compute capability is 2.0, but this will give signicantly reduced performance compared to Kepler generation GPUs with compute capability 3.x. For the LINK setting, “nvcc” should not be used; instead use g++ or another compiler suitable for linking C++ applications. Often you will want to use your MPI compiler wrapper for this setting (i.e. mpicxx). Finally, the lo-level Makefile must also have a “Compilation rule” for creating *.o files from *.cu files. See src/Makefile.cuda for an example of a lo-level Makefile with all of these settings.

KOKKOS_USE_TPLS=hwloc binds threads to hardware cores, so they do not migrate during a simulation. KOKKOS_USE_TPLS=hwloc should always be used if running with KOKKOS_DEVICES=Pthreads for pthreads. It is not necessary for KOKKOS_DEVICES=OpenMP for OpenMP, because OpenMP provides alternative methods via environment variables for binding threads to hardware cores. More info on binding threads to cores is given in Section 5.3.

KOKKOS_ARCH=KNC enables compiler switches needed when compling for an Intel Phi processor.

KOKKOS_USE_TPLS=librt enables use of a more accurate timer mechanism on most Unix platforms. This library is not available on all platforms.

KOKKOS_DEBUG is only useful when developing a Kokkos-enabled style within LAMMPS. KOKKOS_DEBUG=yes enables printing of run-time debugging information that can be useful. It also enables runtime bounds checking on Kokkos data structures.

KOKKOS_CUDA_OPTIONS are additional options for CUDA.

For more information on Kokkos see the Kokkos programmers’ guide here: /lib/kokkos/doc/Kokkos_PG.pdf.

Run with the KOKKOS package from the command line:

The mpirun or mpiexec command sets the total number of MPI tasks used by LAMMPS (one or multiple per compute node) and the number of MPI tasks used per node. E.g. the mpirun command in MPICH does this via its -np and -ppn switches. Ditto for OpenMPI via -np and -npernode.

When using KOKKOS built with host=OMP, you need to choose how many OpenMP threads per MPI task will be used (via the “-k” command-line switch discussed below). Note that the product of MPI tasks * OpenMP threads/task should not exceed the physical number of cores (on a node), otherwise performance will suffer.

When using the KOKKOS package built with device=CUDA, you must use exactly one MPI task per physical GPU.

When using the KOKKOS package built with host=MIC for Intel Xeon Phi coprocessor support you need to insure there are one or more MPI tasks per coprocessor, and choose the number of coprocessor threads to use per MPI task (via the “-k” command-line switch discussed below). The product of MPI tasks * coprocessor threads/task should not exceed the maximum number of threads the coproprocessor is designed to run, otherwise performance will suffer. This value is 240 for current generation Xeon Phi(TM) chips, which is 60 physical cores * 4 threads/core. Note that with the KOKKOS package you do not need to specify how many Phi coprocessors there are per node; each coprocessors is simply treated as running some number of MPI tasks.

You must use the “-k on” command-line switch to enable the KOKKOS package. It takes additional arguments for hardware settings appropriate to your system. Those arguments are documented here. The two most commonly used options are:

-k on t Nt g Ng

The “t Nt” option applies to host=OMP (even if device=CUDA) and host=MIC. For host=OMP, it specifies how many OpenMP threads per MPI task to use with a node. For host=MIC, it specifies how many Xeon Phi threads per MPI task to use within a node. The default is Nt = 1. Note that for host=OMP this is effectively MPI-only mode which may be fine. But for host=MIC you will typically end up using far less than all the 240 available threads, which could give very poor performance.

The “g Ng” option applies to device=CUDA. It specifies how many GPUs per compute node to use. The default is 1, so this only needs to be specified is you have 2 or more GPUs per compute node.

The “-k on” switch also issues a “package kokkos” command (with no additional arguments) which sets various KOKKOS options to default values, as discussed on the package command doc page.

Use the “-sf kk” command-line switch, which will automatically append “kk” to styles that support it. Use the “-pk kokkos” command-line switch if you wish to change any of the default package kokkos optionns set by the “-k on” command-line switch.

Note that the default for the package kokkos command is to use “full” neighbor lists and set the Newton flag to “off” for both pairwise and bonded interactions. This typically gives fastest performance. If the newton command is used in the input script, it can override the Newton flag defaults.

However, when running in MPI-only mode with 1 thread per MPI task, it will typically be faster to use “half” neighbor lists and set the Newton flag to “on”, just as is the case for non-accelerated pair styles. You can do this with the “-pk” command-line switch.

Or run with the KOKKOS package by editing an input script:

The discussion above for the mpirun/mpiexec command and setting appropriate thread and GPU values for host=OMP or host=MIC or device=CUDA are the same.

You must still use the “-k on” command-line switch to enable the KOKKOS package, and specify its additional arguments for hardware options appopriate to your system, as documented above.

Use the suffix kk command, or you can explicitly add a “kk” suffix to individual styles in your input script, e.g.

pair_style lj/cut/kk 2.5

You only need to use the package kokkos command if you wish to change any of its option defaults, as set by the “-k on” command-line switch.

Speed-ups to expect:

The performance of KOKKOS running in different modes is a function of your hardware, which KOKKOS-enable styles are used, and the problem size.

Generally speaking, the following rules of thumb apply:

When running on CPUs only, with a single thread per MPI task, performance of a KOKKOS style is somewhere between the standard (un-accelerated) styles (MPI-only mode), and those provided by the USER-OMP package. However the difference between all 3 is small (less than 20%).
When running on CPUs only, with multiple threads per MPI task, performance of a KOKKOS style is a bit slower than the USER-OMP package.
When running large number of atoms per GPU, KOKKOS is typically faster than the GPU package.
When running on Intel Xeon Phi, KOKKOS is not as fast as the USER-INTEL package, which is optimized for that hardware.

See the Benchmark page of the LAMMPS web site for performance of the KOKKOS package on different hardware.

Guidelines for best performance:

Here are guidline for using the KOKKOS package on the different hardware configurations listed above.

Many of the guidelines use the package kokkos command See its doc page for details and default settings. Experimenting with its options can provide a speed-up for specific calculations.

Running on a multi-core CPU:

If N is the number of physical cores/node, then the number of MPI tasks/node * number of threads/task should not exceed N, and should typically equal N. Note that the default threads/task is 1, as set by the “t” keyword of the “-k” command-line switch. If you do not change this, no additional parallelism (beyond MPI) will be invoked on the host CPU(s).

You can compare the performance running in different modes:

run with 1 MPI task/node and N threads/task
run with N MPI tasks/node and 1 thread/task
run with settings in between these extremes

Examples of mpirun commands in these modes are shown above.

When using KOKKOS to perform multi-threading, it is important for performance to bind both MPI tasks to physical cores, and threads to physical cores, so they do not migrate during a simulation.

If you are not certain MPI tasks are being bound (check the defaults for your MPI installation), binding can be forced with these flags:

OpenMPI 1.8: mpirun -np 2 -bind-to socket -map-by socket ./lmp_openmpi ...
Mvapich2 2.0: mpiexec -np 2 -bind-to socket -map-by socket ./lmp_mvapich ...

For binding threads with the KOKKOS OMP option, use thread affinity environment variables to force binding. With OpenMP 3.1 (gcc 4.7 or later, intel 12 or later) setting the environment variable OMP_PROC_BIND=true should be sufficient. For binding threads with the KOKKOS pthreads option, compile LAMMPS the KOKKOS HWLOC=yes option, as discussed in Section 2.3.4 of the manual.

Running on GPUs:

Insure the -arch setting in the machine makefile you are using, e.g. src/MAKE/Makefile.cuda, is correct for your GPU hardware/software (see this section of the manual for details).

The -np setting of the mpirun command should set the number of MPI tasks/node to be equal to the # of physical GPUs on the node.

Use the “-k” command-line switch to specify the number of GPUs per node, and the number of threads per MPI task. As above for multi-core CPUs (and no GPU), if N is the number of physical cores/node, then the number of MPI tasks/node * number of threads/task should not exceed N. With one GPU (and one MPI task) it may be faster to use less than all the available cores, by setting threads/task to a smaller value. This is because using all the cores on a dual-socket node will incur extra cost to copy memory from the 2nd socket to the GPU.

Examples of mpirun commands that follow these rules are shown above.

Note

When using a GPU, you will achieve the best performance if your input script does not use any fix or compute styles which are not yet Kokkos-enabled. This allows data to stay on the GPU for multiple timesteps, without being copied back to the host CPU. Invoking a non-Kokkos fix or compute, or performing I/O for thermo or dump output will cause data to be copied back to the CPU.

You cannot yet assign multiple MPI tasks to the same GPU with the KOKKOS package. We plan to support this in the future, similar to the GPU package in LAMMPS.

You cannot yet use both the host (multi-threaded) and device (GPU) together to compute pairwise interactions with the KOKKOS package. We hope to support this in the future, similar to the GPU package in LAMMPS.

Running on an Intel Phi:

Kokkos only uses Intel Phi processors in their “native” mode, i.e. not hosted by a CPU.

As illustrated above, build LAMMPS with OMP=yes (the default) and MIC=yes. The latter insures code is correctly compiled for the Intel Phi. The OMP setting means OpenMP will be used for parallelization on the Phi, which is currently the best option within Kokkos. In the future, other options may be added.

Current-generation Intel Phi chips have either 61 or 57 cores. One core should be excluded for running the OS, leaving 60 or 56 cores. Each core is hyperthreaded, so there are effectively N = 240 (4*60) or N = 224 (4*56) cores to run on.

The -np setting of the mpirun command sets the number of MPI tasks/node. The “-k on t Nt” command-line switch sets the number of threads/task as Nt. The product of these 2 values should be N, i.e. 240 or 224. Also, the number of threads/task should be a multiple of 4 so that logical threads from more than one MPI task do not run on the same physical core.

Examples of mpirun commands that follow these rules are shown above.

5.3.3.1. Restrictions

As noted above, if using GPUs, the number of MPI tasks per compute node should equal to the number of GPUs per compute node. In the future Kokkos will support assigning multiple MPI tasks to a single GPU.

Currently Kokkos does not support AMD GPUs due to limits in the available backend programming models. Specifically, Kokkos requires extensive C++ support from the Kernel language. This is expected to change in the future.