Civil-Comp Proceedings
ISSN 1759-3433 CCP: 95
PROCEEDINGS OF THE SECOND INTERNATIONAL CONFERENCE ON PARALLEL, DISTRIBUTED, GRID AND CLOUD COMPUTING FOR ENGINEERING Edited by:
Paper 51
Refactoring of the Basic BLAS Library Routines for Automatic Optimal Performance on Different Multicore PC Platforms
J. Magiera and M. Chmielik
Department of Civil Engineering, Cracow University of Technology, Cracow, Poland
J. Magiera, M. Chmielik, "Refactoring of the Basic BLAS Library Routines for Automatic Optimal Performance on Different Multicore PC Platforms", in , (Editors), "Proceedings of the Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering", Civil-Comp Press, Stirlingshire, UK, Paper 51, 2011. doi:10.4203/ccp.95.51
Keywords: multicore systems, Win32 multithreaded programming, OpenMP, Parallel Extensions to .NET, parallel BLAS libraries.
Summary
The key observation that prompted this work is that obtaining satisfactory performance from parallelised versions of applications is not easy, and the many frameworks and compilers available today do not guarantee that the effort will succeed at the first attempt. With proper care and a deep understanding of a particular architecture it is, of course, possible to tune an application and get very good results, but the same tuning will not necessarily produce a comparable performance increase on other multicore systems.
Because scientific and engineering software in the majority of cases relies heavily on standard matrix algebra libraries such as BLAS, and because matrix computations offer natural parallelism, it was assumed here that a performance gain can be provided for all legacy applications built around the BLAS libraries at low cost, simply by supplying new parallelised versions of the libraries that scale automatically across a range of multicore architectures. The key issue is thus a mechanism for auto-tuning the performance of the kernel library on a great variety of multicore hardware. This is done here not by optimising a particular routine for particular hardware, a strategy used for instance in the ATLAS libraries [1], but by providing a parallel system profiler application (PSPA). The PSPA implements basic BLAS routines parallelised with a number of parallel frameworks and software engineering tools, and measures system performance for a number of cases of varying BLAS levels, problem sizes (granularities), operating system settings and so on. Once a profile has been created for a system, it is used to select the version of the parallelised BLAS routine guaranteeing the best performance for the problem considered and the particular features of the system. The paper presents profiling results for a number of systems, which confirmed that this strategy might at present serve to attain reasonable results at almost no programming effort, and that it might therefore serve to speed up parallelised legacy applications on a variety of multicore hardware.
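
The profile-then-select idea described in the summary can be illustrated with a minimal sketch. It is not the authors' PSPA: every name in it (`axpy_serial`, `axpy_chunked`, `build_profile`, `select_variant`) is hypothetical, only one level 1 routine (y := a*x + y) is covered, and the "parallel" variant uses a Python thread pool purely as a stand-in for the Win32, OpenMP and .NET versions named in the keywords. In standard CPython builds this variant is unlikely to run faster than the serial one, so the sketch shows the selection mechanism only, not a realistic speed-up.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def axpy_serial(a, x, y):
    """Reference y := a*x + y (BLAS level 1) on a single thread."""
    return [a * xi + yi for xi, yi in zip(x, y)]


def axpy_chunked(a, x, y, workers=4):
    """Same result, computed in contiguous chunks on a thread pool."""
    n = len(x)
    step = max(1, -(-n // workers))  # ceiling division, at least 1
    bounds = [(i, min(i + step, n)) for i in range(0, n, step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(
            lambda b: axpy_serial(a, x[b[0]:b[1]], y[b[0]:b[1]]), bounds
        )
        return [value for part in parts for value in part]


# Hypothetical registry of candidate implementations of one routine.
VARIANTS = {"serial": axpy_serial, "chunked": axpy_chunked}


def _time_once(fn, x, y):
    """Wall-clock time of a single call to one variant."""
    start = time.perf_counter()
    fn(2.0, x, y)
    return time.perf_counter() - start


def build_profile(sizes, repeats=3):
    """Time every variant at every problem size; keep the fastest name."""
    profile = {}
    for n in sizes:
        x, y = [1.0] * n, [2.0] * n
        best_name, best_time = None, float("inf")
        for name, fn in VARIANTS.items():
            elapsed = min(_time_once(fn, x, y) for _ in range(repeats))
            if elapsed < best_time:
                best_name, best_time = name, elapsed
        profile[n] = best_name
    return profile


def select_variant(profile, n):
    """Pick the variant recorded for the profiled size nearest to n."""
    nearest = min(profile, key=lambda size: abs(size - n))
    return VARIANTS[profile[nearest]]


def axpy(profile, a, x, y):
    """Dispatching entry point: callers never name a variant themselves."""
    return select_variant(profile, len(x))(a, x, y)
```

A caller would build the profile once per machine, for example `profile = build_profile([1_000, 100_000])`, and then call `axpy(profile, a, x, y)` unchanged. Nearest-size lookup is one simple choice made for this sketch; a profile could equally be keyed on BLAS level or operating system settings, which the summary lists among the measured cases.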