Computational Technology Resources - CCP

Keywords: multicore systems, Win32 multithreaded programming, OpenMP, Parallel Extensions to .NET, parallel BLAS libraries.

Summary

The effort was prompted by a key observation: obtaining satisfactory results from parallelised versions of applications is not easy, and the many frameworks and compilers available today do not guarantee that the effort will succeed from the outset. With proper care and attention, and a deep understanding of a particular architecture, it is of course possible to tune an application and obtain very good results, but the same tuning will not necessarily yield a comparable performance increase on other multicore systems.

Because scientific and engineering software in the majority of cases relies heavily on standard matrix algebra libraries such as BLAS, and because matrix computations offer natural parallelism, it was assumed here that a performance gain can be provided at low cost for all legacy applications built around the BLAS libraries, simply by supplying new parallelised versions of the libraries that scale automatically across a range of multicore architectures. The key issue is thus to provide a mechanism for auto-tuning the kernel library performance on a great variety of multicore hardware. This is done here not by optimising a particular routine for particular hardware, a strategy used for instance in the ATLAS libraries [1], but by providing a parallel system profiler application (PSPA). The PSPA implements basic BLAS routines parallelised with a number of parallel frameworks and software engineering tools, and measures system performance for a number of cases with varying BLAS levels, problem sizes (granularities), operating system settings and so on. Once a profile has been created for a system, this information is used to select the version of the parallelised BLAS routine guaranteeing the best performance for the problem considered and the particular features of the system.
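The profile-then-select idea described above can be illustrated with a minimal sketch. It is not the authors' PSPA: the variant names, the choice of a dot product as the level-1 routine, the nearest-size lookup and all function names are assumptions made for illustration only. Python threads are also limited by the interpreter lock, so the sketch shows the mechanism (time each variant per problem size, record the winner, dispatch on the record), not realistic speed-ups.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def dot_serial(x, y):
    """Plain serial dot product (hypothetical baseline variant)."""
    return sum(a * b for a, b in zip(x, y))


def dot_threaded(x, y, workers=4):
    """Chunked dot product on a thread pool (hypothetical parallel variant)."""
    n = len(x)
    step = max(1, (n + workers - 1) // workers)
    chunks = [(x[i:i + step], y[i:i + step]) for i in range(0, n, step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(lambda c: dot_serial(*c), chunks))


# Candidate implementations of the same routine, keyed by an illustrative name.
VARIANTS = {"serial": dot_serial, "threaded": dot_threaded}


def _time_once(fn, x, y):
    """Return the wall-clock time of a single call."""
    start = time.perf_counter()
    fn(x, y)
    return time.perf_counter() - start


def build_profile(sizes, repeats=3):
    """Time every variant at every size; record the fastest variant per size."""
    profile = {}
    for n in sizes:
        x, y = [1.0] * n, [2.0] * n
        timings = {
            name: min(_time_once(fn, x, y) for _ in range(repeats))
            for name, fn in VARIANTS.items()
        }
        profile[n] = min(timings, key=timings.get)
    return profile


def select_variant(profile, n):
    """Pick the variant recorded for the profiled size nearest to n."""
    nearest = min(profile, key=lambda size: abs(size - n))
    return VARIANTS[profile[nearest]]


def dot(profile, x, y):
    """Dispatch to whichever variant the system profile recommends."""
    return select_variant(profile, len(x))(x, y)


if __name__ == "__main__":
    system_profile = build_profile([100, 10_000])
    print(system_profile)
    print(dot(system_profile, [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))
```

In this sketch the profile is a plain dictionary from problem size to variant name. A real profile, as the text indicates, would also be keyed by BLAS level and operating system settings, and would presumably be stored per machine so that legacy applications can use it without re-profiling.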

The paper presents profiling results for a number of systems. They confirm that this strategy may at present serve to attain reasonable results with almost no programming effort, and thus that it may serve to speed up parallelised legacy applications on a variety of multicore hardware.

References

1: J. Demmel, J. Dongarra, V. Eijkhout, E. Fuentes, A. Petitet, R. Vuduc, R. Whaley, K. Yelick, "Self-adapting linear algebra algorithms and software", Proceedings of the IEEE, 93(2), 293-312, 2005. doi:10.1109/JPROC.2004.840848

Civil-Comp Proceedings, ISSN 1759-3433
CCP: 95
PROCEEDINGS OF THE SECOND INTERNATIONAL CONFERENCE ON PARALLEL, DISTRIBUTED, GRID AND CLOUD COMPUTING FOR ENGINEERING

Paper 51
Refactoring of the Basic BLAS Library Routines for Automatic Optimal Performance on Different Multicore PC Platforms
J. Magiera and M. Chmielik
Department of Civil Engineering, Cracow University of Technology, Cracow, Poland
doi:10.4203/ccp.95.51

Full Bibliographic Reference for this paper
J. Magiera, M. Chmielik, "Refactoring of the Basic BLAS Library Routines for Automatic Optimal Performance on Different Multicore PC Platforms", in , (Editors), "Proceedings of the Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering", Civil-Comp Press, Stirlingshire, UK, Paper 51, 2011. doi:10.4203/ccp.95.51
©Civil-Comp Limited 2023