Computational & Technology Resources
an online resource for computational,
engineering & technology publications
Civil-Comp Proceedings
ISSN 1759-3433 CCP: 101
PROCEEDINGS OF THE THIRD INTERNATIONAL CONFERENCE ON PARALLEL, DISTRIBUTED, GRID AND CLOUD COMPUTING FOR ENGINEERING Edited by:
Paper 40
Programming Finite Element Methods for ccNUMA Processors
E. Borin (1) and P. Devloo (2)
(1) Institute of Computing, University of Campinas, Brazil
E. Borin, P. Devloo, "Programming Finite Element Methods for ccNUMA Processors", in , (Editors), "Proceedings of the Third International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering", Civil-Comp Press, Stirlingshire, UK, Paper 40, 2013. doi:10.4203/ccp.101.40
Keywords: parallel programming, parallel processing, cache-coherent non-uniform memory access, finite element methods, multi-core, shared memory.
Summary
Recent multi-core designs have migrated from symmetric multiprocessing (SMP)
to cache-coherent non-uniform memory access (ccNUMA) architectures. In this
paper we discuss performance issues that arise when designing parallel finite
element method programs for a 64-core ccNUMA computer and explore solutions to
these issues. First, we present an overview of the computer architecture and
show that highly parallel code that does not take the system's memory
organization into account scales poorly, achieving only a 2.8x speedup when
running with 64 threads. Then we discuss how we identified the sources of
overhead and evaluate two possible solutions to the problem. The first consists
of distributing the data evenly among the memory banks using the numactl tool.
The second consists of using the libnuma library to schedule threads and their
related data on local CPUs and memory banks, which takes advantage of the
memory subsystem's parallelism and reduces the average memory access latency.
We show that the first approach boosts performance by 10.6x merely by changing
the way we invoke the program on the command line, and that the second approach
further boosts performance by 30.9x at the expense of changing the
application's code. Finally, we argue that the issues reported only arise for
large data sets and conclude with recommendations to help programmers design
algorithms and programs that perform well on this type of machine.
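
The second approach can be illustrated with a minimal C sketch. It is not the authors' code: the names chunk_t, node_chunk, alloc_chunk and free_chunk, and the USE_LIBNUMA build switch, are hypothetical and introduced only for illustration. The sketch assumes the element data is an array of doubles that is split into one contiguous chunk per NUMA node, and that each worker thread allocates its own chunk. When built with libnuma, the calling thread is restricted to the CPUs of its node and the chunk is placed in that node's memory; otherwise the code falls back to malloc so it still compiles on machines without libnuma.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#ifdef USE_LIBNUMA
#include <numa.h> /* optional; build with -DUSE_LIBNUMA and link with -lnuma */
#endif

/* Half-open range [begin, end) of element indices owned by one NUMA node. */
typedef struct {
    size_t begin;
    size_t end;
} chunk_t;

/* Split n_elems into n_nodes contiguous chunks whose sizes differ by at most
 * one element; the first (n_elems % n_nodes) nodes receive the extra one. */
chunk_t node_chunk(size_t n_elems, int n_nodes, int node)
{
    assert(n_nodes > 0 && node >= 0 && node < n_nodes);
    size_t nodes = (size_t)n_nodes;
    size_t k = (size_t)node;
    size_t base = n_elems / nodes;
    size_t rem = n_elems % nodes;
    chunk_t c;
    c.begin = k * base + (k < rem ? k : rem);
    c.end = c.begin + base + (k < rem ? 1 : 0);
    return c;
}

/* Allocate storage for one chunk. Intended to be called by the worker thread
 * that will process the chunk (an assumption of this sketch). */
double *alloc_chunk(chunk_t c, int node)
{
    size_t bytes = (c.end - c.begin) * sizeof(double);
#ifdef USE_LIBNUMA
    if (numa_available() != -1) {
        numa_run_on_node(node);                /* run this thread on the node's CPUs */
        return numa_alloc_onnode(bytes, node); /* place the pages on the same node */
    }
#endif
    (void)node;
    return malloc(bytes); /* fallback: no explicit NUMA placement */
}

/* Release a chunk with the deallocator that matches alloc_chunk. */
void free_chunk(double *p, chunk_t c)
{
#ifdef USE_LIBNUMA
    if (numa_available() != -1) {
        numa_free(p, (c.end - c.begin) * sizeof(double));
        return;
    }
#endif
    (void)c;
    free(p);
}
```

In this sketch a thread assigned to node k would call node_chunk(n, n_nodes, k), then alloc_chunk on the result, and would touch only its own chunk afterwards. The first approach needs no code change; a typical invocation would be numactl --interleave=all ./program, where ./program is a placeholder for the application binary.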