Civil-Comp Proceedings
ISSN 1759-3433
CCP: 107
PROCEEDINGS OF THE FOURTH INTERNATIONAL CONFERENCE ON PARALLEL, DISTRIBUTED, GRID AND CLOUD COMPUTING FOR ENGINEERING
Paper 30

Mesh Renumbering Methods and Performance with OpenMP/MPI in Code Saturne

P. Trespeuch1, Y. Fournier2, C. Evangelinos3 and P. Vezolle4

1CS Information Systems, Le Plessis Robinson, France
2EDF R&D, Département Mécanique des Fluides, Energies et Environnement, Chatou Cedex, France
3IBM Research, Cambridge, Massachusetts, United States of America
4IBM France, Montpellier, France

Full Bibliographic Reference for this paper
P. Trespeuch, Y. Fournier, C. Evangelinos, P. Vezolle, "Mesh Renumbering Methods and Performance with OpenMP/MPI in Code Saturne", in "Proceedings of the Fourth International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering", Civil-Comp Press, Stirlingshire, UK, Paper 30, 2015. doi:10.4203/ccp.107.30
Keywords: computational fluid dynamics, Code_Saturne, OpenMP, sparse matrix vector product, benchmarks, renumbering algorithms.

Summary
The scale of computational fluid dynamics (CFD) simulation problems is rapidly increasing as a result of requirements for higher spatial resolution, varied turbulence models, and more detailed physics. As is the case with many Navier-Stokes CFD tools, EDF's Code_Saturne, which is also one of the two CFD software packages of the PRACE benchmark suite, is parallelized using domain partitioning and MPI. On large systems with thousands of compute nodes, even for simulations employing multi-billion cell meshes, a pure MPI approach will not be able to take full advantage of the multiple levels of parallelism and of the steady increase in the number of cores per processor. The most popular way to tackle this problem is a hybrid MPI/OpenMP approach.

Code_Saturne implements a three-dimensional general finite volume solver supporting conformal and non-conformal meshes. Its computation time is dominated by the linear equation solvers, mainly for the pressure, and to a lesser degree by gradient reconstructions. Thread-level parallelism was therefore applied mainly to the computational loops which iterate over the cells or faces of the cell-centred formulation. A general loop transformation was implemented to support a wide range of methods for avoiding indirect-addressing conflicts between threads, while minimizing code changes. In this paper, different mesh renumbering algorithms for generating independent work sets for threads are presented (a multipass approach using METIS or SCOTCH partitioning or space-filling Morton curves, and a Cuthill-McKee approach), while also exploiting communication overlap.

Performance, scalability and comparison results are presented on an Intel x86 cluster (with three generations of Intel Xeon processors: Westmere, Ivy Bridge and Haswell) and on IBM Blue Gene/Q systems. A very significant part of the total execution time is spent in sparse matrix-vector products. It is shown that this product can behave like a STREAM-type kernel benchmark and therefore depends on memory system performance. Significant per-core performance degradation is observed depending on the number of cores used per node. Results on several Intel Xeon generations are provided, together with a hardware counter analysis.
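To illustrate the loop transformation described above, the sketch below shows one common way of letting threads update cell values from face loops without indirect-addressing conflicts: faces are renumbered into groups such that, within a group, the face ranges assigned to different threads touch disjoint cells. This is a minimal sketch and not Code_Saturne's actual data structures; the function name, the group_index layout and the array names are assumptions for illustration only.

    #include <omp.h>

    /* Renumbered face loop (hypothetical layout): faces are grouped so
     * that within one group, the face range assigned to each thread
     * touches cells no other thread of that group touches.  group_index
     * holds, for group g and thread t, the start and past-the-end face
     * ids at positions (g*n_threads + t)*2 and (g*n_threads + t)*2 + 1. */
    static void
    face_gather_sketch(int           n_groups,
                       int           n_threads,
                       const int     group_index[],
                       const int     face_cells[][2],  /* cells adjacent to each face */
                       const double  face_flux[],
                       double        cell_rhs[])
    {
      for (int g = 0; g < n_groups; g++) {
        #pragma omp parallel for num_threads(n_threads)
        for (int t = 0; t < n_threads; t++) {
          int s_id = group_index[(g*n_threads + t)*2];
          int e_id = group_index[(g*n_threads + t)*2 + 1];
          for (int f_id = s_id; f_id < e_id; f_id++) {
            int c0 = face_cells[f_id][0];
            int c1 = face_cells[f_id][1];
            cell_rhs[c0] += face_flux[f_id];  /* no atomics needed: c0, c1  */
            cell_rhs[c1] -= face_flux[f_id];  /* are private to this thread */
          }
        }
      }
    }

Groups are processed sequentially, with an implicit barrier at the end of each parallel loop; within a group no synchronization or atomic updates are needed, which is why the quality of the renumbering directly controls the OpenMP efficiency.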

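The observation that the sparse matrix-vector product behaves like a STREAM-type kernel follows from its structure: each nonzero contributes one multiply-add but requires streaming the coefficient and index arrays and gathering from the input vector, so arithmetic intensity is low and throughput is set by memory bandwidth. The generic CSR sketch below shows this; the actual matrix storage in Code_Saturne may differ, and the format and names here are assumptions.

    #include <stddef.h>

    /* y = A*x with A in CSR format.  Each nonzero costs one multiply-add
     * but streams val[] and col_id[] and gathers from x[], so the loop
     * moves several words of memory per two flops: throughput is bounded
     * by memory bandwidth, much as in the STREAM benchmark. */
    void
    spmv_csr(size_t        n_rows,
             const size_t  row_index[],  /* size n_rows + 1 */
             const size_t  col_id[],
             const double  val[],
             const double  x[],
             double        y[])
    {
      #pragma omp parallel for schedule(static)
      for (size_t i = 0; i < n_rows; i++) {
        double s = 0.0;
        for (size_t j = row_index[i]; j < row_index[i+1]; j++)
          s += val[j] * x[col_id[j]];
        y[i] = s;
      }
    }

Because such a kernel can saturate a node's memory bandwidth well before it saturates its cores, it is consistent with the per-core performance degradation reported as more cores per node are used.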