Civil-Comp Conferences, ISSN 2753-3239, CCC: 2
PROCEEDINGS OF THE ELEVENTH INTERNATIONAL CONFERENCE ON ENGINEERING COMPUTATIONAL TECHNOLOGY Edited by: B.H.V. Topping and P. Iványi
Paper 8.5
MRT Lattice Boltzmann Method on multiple Graphics Processing Units with halo sharing over PCI-e for non-contiguous memory
T.O.M. Forslund
Division of Fluid and Experimental Mechanics, Luleå University of Technology, Sweden

T.O.M. Forslund, "MRT Lattice Boltzmann Method on multiple Graphics Processing Units with halo sharing over PCI-e for non-contiguous memory", in B.H.V. Topping, P. Iványi (Editors), "Proceedings of the Eleventh International Conference on Engineering Computational Technology", Civil-Comp Press, Edinburgh, UK, Online volume: CCC 2, Paper 8.5, 2022, doi:10.4203/ccc.2.8.5
Keywords: lattice Boltzmann method, GPU, multi-GPU programming.
Abstract
The Lattice Boltzmann Method (LBM) has been shown to be well suited for implementation on Graphics Processing Units (GPUs). The benefit of GPU implementations compared to CPU implementations is a reduction of computational time by as much as two orders of magnitude. This staggering difference is due to the LBM computations being both explicit and local, so that, like most other cellular automata methods, the method can make full use of a GPU's capabilities. Although a GPU offers significantly higher performance in terms of floating-point operations per second (FLOPS) than a CPU, it has two significant drawbacks: first, the complexity of the calculations is limited by the relative simplicity of the GPU core design compared to a CPU; second, the memory of a GPU is usually limited in comparison, ranging from a few GB up to ~100 GB for high-end enterprise cards. Because the LBM is well suited for execution on GPUs, the first point need not be considered, but the second point becomes a limitation as larger, or more highly resolved, computational domains are of interest. This can be remedied by distributing the computations across several GPUs executing in parallel. The GPUs share values in overlapping regions, called halo values, that need to be transferred each time step. If the memory is contiguous, each transfer can be executed as a single memory-transfer call that utilizes the PCI-e lanes efficiently. If this is not the case, support exists for copying so-called strided memory, which has a constant offset between values, for either single-strided (2D) or double-strided (3D) layouts. In practice, these functions result in poor PCI-e lane utilization. To remedy this, a method is proposed in which the halo values are gathered and packed into a contiguous memory buffer that is then communicated between the GPUs via the PCI-e lanes. It is shown that the method introduces some additional overhead compared to single-GPU execution but maintains a reasonable 70% of the performance of the single-GPU case.
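The pack-then-transfer idea can be sketched in host-side C++ (a minimal illustration, not the paper's actual CUDA implementation; the array layout, function names, and the choice of an x-normal halo face are assumptions made for the example). On the GPU the pack loop would be a kernel writing to a device-side buffer, after which a single contiguous copy replaces many strided ones:

```cpp
#include <cstddef>
#include <vector>

// Pack the halo layer at x == x_slice of a 3D field stored in row-major
// order (index = (z * NY + y) * NX + x) into a contiguous buffer.
// Reading this slice in place touches memory with a constant stride of NX,
// which is the access pattern that strided 2D/3D copy routines handle
// inefficiently over PCI-e; the packed buffer can instead be sent in one
// contiguous transfer.
std::vector<double> pack_halo_x(const std::vector<double>& field,
                                std::size_t NX, std::size_t NY, std::size_t NZ,
                                std::size_t x_slice) {
    std::vector<double> halo(NY * NZ);
    for (std::size_t z = 0; z < NZ; ++z)
        for (std::size_t y = 0; y < NY; ++y)
            halo[z * NY + y] = field[(z * NY + y) * NX + x_slice];
    return halo;
}

// Unpack a received halo buffer into the ghost layer at x == x_slice
// of the neighbouring GPU's field.
void unpack_halo_x(std::vector<double>& field,
                   std::size_t NX, std::size_t NY, std::size_t NZ,
                   std::size_t x_slice, const std::vector<double>& halo) {
    for (std::size_t z = 0; z < NZ; ++z)
        for (std::size_t y = 0; y < NY; ++y)
            field[(z * NY + y) * NX + x_slice] = halo[z * NY + y];
}
```

In a CUDA setting the same pattern would use a packing kernel on each device followed by one `cudaMemcpy`/`cudaMemcpyPeer` call per face, rather than a strided `cudaMemcpy2D`/`cudaMemcpy3D` per halo.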