Civil-Comp Proceedings
ISSN 1759-3433
CCP: 95
PROCEEDINGS OF THE SECOND INTERNATIONAL CONFERENCE ON PARALLEL, DISTRIBUTED, GRID AND CLOUD COMPUTING FOR ENGINEERING
Paper 54

Timing Collective Communications in an Empirical Optimization Framework

K. Benkert (1), E. Gabriel (2) and S. Roller (3)

(1) High Performance Computing Center Stuttgart (HLRS), Stuttgart, Germany
(2) Parallel Software Technologies Laboratory, Department of Computer Science, University of Houston, United States of America
(3) German Research School for Simulation Sciences, Aachen, Germany

Full Bibliographic Reference for this paper
K. Benkert, E. Gabriel, S. Roller, "Timing Collective Communications in an Empirical Optimization Framework", in , (Editors), "Proceedings of the Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering", Civil-Comp Press, Stirlingshire, UK, Paper 54, 2011. doi:10.4203/ccp.95.54
Keywords: empirical optimization, abstract data and communication library, collective communication, NAS parallel benchmarks.

Summary
This paper investigates how to assess the performance of different implementation alternatives for an empirical optimization library for collective MPI communications, namely the Abstract Data and Communication Library (ADCL). The first and simplest timing technique is to bracket the collective communication with calls to timing routines and to use the maximum local execution time as an estimator for the total execution time of the collective communication. A second solution adds synchronizations before the calls to the timing routines. These synchronizations create additional overhead during the search phase and may introduce systematic errors, since application features such as process arrival patterns are eliminated before the measurement starts.
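
The difference between the first two techniques can be illustrated with a small simulation. The sketch below is not ADCL code and makes no MPI calls. It assumes a toy model in which the collective completes on every process a fixed time after the last process has arrived, and in which a barrier releases all processes at the same instant. The function names, the arrival times and the cost value are hypothetical and chosen only for illustration.

```python
def measure_unsynchronized(arrivals, collective_cost):
    """Technique 1: each process times its own call; report the maximum."""
    # Toy model (assumption): the collective finishes on every process
    # collective_cost seconds after the last process has arrived.
    finish = max(arrivals) + collective_cost
    local_times = [finish - arrival for arrival in arrivals]
    return max(local_times)


def measure_synchronized(arrivals, collective_cost):
    """Technique 2: a barrier before the timer aligns all processes first."""
    # Idealized barrier (assumption): everyone leaves it when the last
    # process arrives, so the arrival pattern is no longer visible.
    start = max(arrivals)
    finish = start + collective_cost
    local_times = [finish - start for _ in arrivals]
    return max(local_times)


# Hypothetical arrival times (seconds) of four processes at the collective.
arrivals = [0.0, 0.5, 2.0, 0.25]
cost = 1.0

unsync = measure_unsynchronized(arrivals, cost)
sync = measure_synchronized(arrivals, cost)
print("unsynchronized estimate:", unsync)
print("synchronized estimate:  ", sync)
```

Under these assumptions the unsynchronized estimate includes the waiting time caused by the skewed arrivals, while the synchronized estimate reports only the cost of the collective itself. This is the sense in which synchronization removes the process arrival pattern from the measurement.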

The third option is based on the new timer object in ADCL. The timer object allows the measurement of the codelet together with its environment, e.g. one or more iterations of the code. Its purpose is to capture the original execution behavior of the program and to avoid distortions caused by synchronizations directly before and after the execution of the codelet. However, the user now has the responsibility to select the right code portion to measure.
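
The idea of timing a whole code region rather than the codelet alone can be sketched as follows. This is a hypothetical stand-in and not the ADCL timer API: the class `RegionTimer`, the helper `FakeClock` and the implementation names `impl_a` and `impl_b` are all assumptions made for illustration. A simulated clock replaces real time so that the example is deterministic.

```python
import time


class RegionTimer:
    """Hypothetical region timer; it does not reproduce the ADCL interface."""

    def __init__(self, clock=time.perf_counter):
        self.clock = clock
        self.samples = {}  # implementation name -> list of region times

    def measure(self, name, region):
        # Time the whole region: the codelet plus the code around it.
        start = self.clock()
        region()
        elapsed = self.clock() - start
        self.samples.setdefault(name, []).append(elapsed)
        return elapsed

    def best(self):
        # Pick the implementation with the lowest mean region time.
        def mean(name):
            values = self.samples[name]
            return sum(values) / len(values)
        return min(self.samples, key=mean)


class FakeClock:
    """Simulated clock so the sketch runs deterministically."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


clock = FakeClock()
timer = RegionTimer(clock=clock)


def make_iteration(compute, communicate):
    def iteration():
        clock.advance(compute)      # surrounding computation
        clock.advance(communicate)  # the codelet (collective communication)
    return iteration


# Hypothetical costs: both candidates share the same computation time.
for _ in range(3):
    timer.measure("impl_a", make_iteration(2.0, 0.5))
    timer.measure("impl_b", make_iteration(2.0, 0.25))

print("selected implementation:", timer.best())
```

Which statements are placed inside the measured region is decided by the caller, which mirrors the responsibility the text assigns to the user.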

To investigate the different timing techniques, we use the MPI FFT benchmark from the NAS Parallel Benchmarks 3.0, which involves an all-to-all communication operation. Tests were executed on a wide variety of parallel platforms. For each system, the FFT benchmark was executed for various problem classes, different numbers of processes, and multiple MPI implementations. We demonstrate that performance predictions based on data generated with this timer object are more accurate than those obtained with the other techniques. In stable test environments, an enhanced version of the timer object gave a nearly exact quantitative reproduction of the performance data obtained from longer runs.
