Civil-Comp Proceedings
ISSN 1759-3433, CCP: 111
PROCEEDINGS OF THE FIFTH INTERNATIONAL CONFERENCE ON PARALLEL, DISTRIBUTED, GRID AND CLOUD COMPUTING FOR ENGINEERING Edited by:
Paper 38
Exploiting Functional Directives to Achieve MPI Parallelization
D. Rubio Bonilla and C.W. Glass
HLRS - University of Stuttgart, Germany

D. Rubio Bonilla, C.W. Glass, "Exploiting Functional Directives to Achieve MPI Parallelization", in (Editors), "Proceedings of the Fifth International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering", Civil-Comp Press, Stirlingshire, UK, Paper 38, 2017. doi:10.4203/ccp.111.38
Keywords: high performance computing, functional programming, MPI, parallelization,
programming models.
Summary
In recent years CPU manufacturers have not been able to substantially increase the instructions per cycle of CPU cores. To overcome this situation, manufacturers have increased the raw performance of HPC systems by increasing the number of processors, multiplying the number of cores in each processor, and integrating specialized accelerators such as GPGPUs, FPGAs and other ASICs with specialized instruction sets. To exploit these new hardware capabilities, applications have to be written explicitly with parallelism in mind, to cope with the growing number of cores, and also need parts of their source code written in specialized languages to make use of the integrated accelerators.
This creates a major paradigm shift from compute-centric to communication-centric execution, to which most programming models are not yet properly aligned: classical models are geared towards optimizing computational operations and assume that data access is almost free. Languages like C, for example, assume that variables are immediately available and are accessed synchronously. The new situation implies that data is distributed across the system, and communication latency has a large impact. A bad data distribution results in continuous data exchange while processing units sit idle waiting for the data they need to compute. Most programming models do not convey the necessary dependency information to the compiler, which must therefore be careful not to make wrong assumptions.

There are successful attempts, such as OpenMP, to exploit parallelism by introducing structural information about the application. Research projects such as POLCA have developed means to introduce functional-like semantics, in the form of directives, into procedural code to describe the structural behaviour of the application, with the aim of allowing compilers to perform aggressive code transformations that increase performance and enable portability across different architectures. These functional semantics are based on Higher-Order Functions (HOFs), which are functions that can take other functions as parameters, or return them as results, instead of values. Thanks to this property the directives can be interlinked to create hierarchical structures that can be analyzed at different levels (in contrast to the flat structure created by OpenMP). At the same time, HOFs have a very clear execution structure that is well understood and can be manipulated into different execution structures that are equivalent but have different properties (memory usage, degree of parallelization, or communication pattern, among others).

In this paper we present how functional directives based on Higher-Order Functions can be applied to procedural code to obtain the application's hierarchical structure. We then demonstrate how this structure is analyzed to find the parallelism and its data dependencies. After that, we follow the process that compilers can take to exploit this information and adapt the original source code to port it to HPC clusters without a shared memory address space. These steps involve the detection of the communication pattern based on the data flow, partitioning of the data, modification of the data structures, and the introduction of the MPI (Message Passing Interface) calls. Finally, we present the execution results of the MPI code generated from non-parallelized C code (N-Body and 3D Heat Diffusion) following this process. The results show that the parallelization on large HPC clusters is correct and yields performance comparable to hand-tuned versions, with almost equivalent scalability and energy consumption.
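To illustrate the kind of annotation the summary describes, the sketch below marks a plain C loop with a map-style directive. The pragma name and syntax ("#pragma polca map ...") are assumptions for illustration only; the exact POLCA notation is not given in this summary.

    /* Hypothetical HOF-style annotation of procedural C code.
     * The pragma name and syntax are illustrative assumptions, not the exact
     * POLCA notation. "map" declares that each iteration applies the same
     * function independently to one element, so the loop carries no
     * cross-iteration data dependencies. */
    static double scale(double x, double factor) { return x * factor; }

    void scale_all(double *out, const double *in, double factor, int n)
    {
        /* map scale over "in", producing "out" */
        #pragma polca map scale in out
        for (int i = 0; i < n; i++)
            out[i] = scale(in[i], factor);
    }

From such an annotation a compiler can read off that the loop is a map, without having to prove the absence of dependencies itself.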
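The hierarchical, interlinked use of the directives can be sketched on a stencil code in the spirit of the 3D Heat Diffusion benchmark; the 1D kernel and the "itn" (iterate) directive below are again illustrative assumptions rather than the paper's actual annotations.

    /* Hypothetical nesting of HOF directives: an outer "itn" (iterate) wraps an
     * inner "map" over the grid points, giving the compiler a two-level
     * structure (a time loop around a data-parallel sweep) instead of two
     * opaque loops. Boundary cells are assumed fixed and pre-initialized. */
    static double stencil(const double *grid, int i)
    {
        /* three-point average as a 1D stand-in for the paper's 3D heat kernel */
        return 0.25 * grid[i - 1] + 0.5 * grid[i] + 0.25 * grid[i + 1];
    }

    void diffuse(double *grid, double *next, int nx, int steps)
    {
        #pragma polca itn steps                    /* outer HOF: repeat the sweep */
        for (int t = 0; t < steps; t++) {
            #pragma polca map stencil grid next    /* inner HOF: map over the points */
            for (int i = 1; i < nx - 1; i++)
                next[i] = stencil(grid, i);
            double *tmp = grid; grid = next; next = tmp;   /* swap buffers */
        }
    }

Analyzed at the outer level the structure is sequential in time; analyzed at the inner level it is a data-parallel map with a neighbour access pattern, which is what determines the communication once the data is partitioned.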
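For the MPI generation step, a minimal hand-written sketch of a block-partitioned version of the mapped loop is given below. It assumes MPI has already been initialized, that the problem size divides evenly among the ranks, and that a simple scatter/compute/gather pattern suffices; the transformation described in the paper instead derives the communication pattern, data partitioning and data-structure changes from the directive information.

    /* Minimal MPI sketch of the mapped loop on a non-shared-memory cluster:
     * the input is block-partitioned across the ranks, each rank applies the
     * mapped function to its own chunk, and the results are gathered back on
     * rank 0. Assumes MPI_Init has been called and n % size == 0. */
    #include <mpi.h>
    #include <stdlib.h>

    static double scale(double x, double factor) { return x * factor; }

    void scale_all_mpi(double *out, double *in, double factor, int n)
    {
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int chunk = n / size;                      /* block partition of the data */
        double *local_in  = malloc(chunk * sizeof(double));
        double *local_out = malloc(chunk * sizeof(double));

        MPI_Scatter(in, chunk, MPI_DOUBLE, local_in, chunk, MPI_DOUBLE,
                    0, MPI_COMM_WORLD);
        for (int i = 0; i < chunk; i++)            /* each rank maps over its chunk */
            local_out[i] = scale(local_in[i], factor);
        MPI_Gather(local_out, chunk, MPI_DOUBLE, out, chunk, MPI_DOUBLE,
                   0, MPI_COMM_WORLD);

        free(local_in);
        free(local_out);
    }

After the transformation each rank holds only n/size elements, which corresponds to the data partitioning and data-structure modification mentioned in the summary; for a pure map there is no neighbour communication, whereas a stencil such as the heat diffusion kernel would additionally require halo exchanges.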