Civil-Comp Proceedings
ISSN 1759-3433, CCP: 101
Proceedings of the Third International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering
Paper 6
A Comparison of FETI Natural Coarse Space Projector Implementation Strategies
V. Hapla1,2 and D. Horak1,2
1Department of Applied Mathematics, VSB-Technical University of Ostrava, Ostrava, Czech Republic
V. Hapla, D. Horak, "A Comparison of FETI Natural Coarse Space Projector Implementation Strategies", in , (Editors), "Proceedings of the Third International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering", Civil-Comp Press, Stirlingshire, UK, Paper 6, 2013. doi:10.4203/ccp.101.6
Keywords: domain decomposition, FETI, Total FETI, TFETI, FLLOP, PETSc, parallel direct solver, pseudoinverse, natural coarse space matrix, coarse problem.
Summary
In domain decomposition methods, the natural approach for large problems solved on massively parallel computers is to maximize the number of subdomains, so that the subdomain stiffness matrices shrink and the primal operations are accelerated. The negative effect of this, on the other hand, is the growing dimension of the objects that couple the subdomains, i.e. the operators whose domain or image is either the dual space (whose dimension is the number of Lagrange multipliers on the subdomain interfaces) or the kernel of the stiffness matrix.
We found that the application of the projector onto the natural coarse space can become a severe bottleneck of the FETI method, in particular the part called the coarse problem (CP) solution. For very large problems these operations can start to dominate the computation time, destroy scalability, or even cause out-of-memory errors if they are not parallelized. According to our observations, the matrix-vector and matrix-transpose-vector multiplications take approximately the same time for different distributions of the coarse space matrix, so the action time and the amount of communication depend primarily on the implementation of the CP solution. The CP cannot be solved sequentially because of local memory requirements, but, on the other hand, using all processes leads to a substantial communication overhead. Furthermore, the CP must be solved to a much higher precision than the required precision of the dual solution, otherwise the top-level FETI dual solver diverges.

In this paper, we compare the effect of the choice of the parallel direct LU solver, from the set available in PETSc on HECToR (MUMPS, SuperLU), on the performance of the CP solution using two strategies:
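For concreteness, the projector onto the natural coarse space has the form P = I - G^T (G G^T)^{-1} G, so each application amounts to one multiplication by G, one coarse problem solve with the matrix G G^T, and one multiplication by G^T. The sketch below shows, in plain PETSc, how such a projector application with the CP handled by a parallel direct LU solver (MUMPS or SuperLU_DIST) might look. It is only an illustration under assumptions: it uses the error-checking macros and solver-selection call of a recent PETSc release, and the names (CoarseProjector, CoarseProjectorSetUp, ...) are hypothetical, not the FLLOP API used in the paper.

    /* Sketch (assumed, not the paper's FLLOP code): application of the
     * natural coarse space projector  P x = x - G^T (G G^T)^{-1} G x,
     * with the coarse problem solved by a parallel direct LU solver
     * (MUMPS or SuperLU_DIST) selected through PETSc. */
    #include <petscksp.h>

    typedef struct {
      Mat G;       /* natural coarse space matrix G             */
      Vec Gx, z;   /* work vectors of coarse (kernel) dimension */
      KSP ksp_cp;  /* coarse problem solver: LU factor of G G^T */
    } CoarseProjector;

    /* Assemble G G^T explicitly and factorize it once. */
    PetscErrorCode CoarseProjectorSetUp(CoarseProjector *p)
    {
      Mat Gt, GGt;
      PC  pc;

      PetscFunctionBeginUser;
      PetscCall(MatTranspose(p->G, MAT_INITIAL_MATRIX, &Gt));
      PetscCall(MatMatMult(p->G, Gt, MAT_INITIAL_MATRIX, PETSC_DEFAULT, &GGt));

      PetscCall(KSPCreate(PetscObjectComm((PetscObject)p->G), &p->ksp_cp));
      PetscCall(KSPSetOperators(p->ksp_cp, GGt, GGt));
      PetscCall(KSPSetType(p->ksp_cp, KSPPREONLY));            /* direct solve only        */
      PetscCall(KSPGetPC(p->ksp_cp, &pc));
      PetscCall(PCSetType(pc, PCLU));
      PetscCall(PCFactorSetMatSolverType(pc, MATSOLVERMUMPS)); /* or MATSOLVERSUPERLU_DIST */
      PetscCall(KSPSetUp(p->ksp_cp));                          /* triggers factorization   */

      PetscCall(MatCreateVecs(p->G, NULL, &p->Gx));
      PetscCall(VecDuplicate(p->Gx, &p->z));
      PetscCall(MatDestroy(&Gt));
      PetscCall(MatDestroy(&GGt));
      PetscFunctionReturn(PETSC_SUCCESS);
    }

    /* y = P x = x - G^T (G G^T)^{-1} G x */
    PetscErrorCode CoarseProjectorApply(CoarseProjector *p, Vec x, Vec y)
    {
      PetscFunctionBeginUser;
      PetscCall(MatMult(p->G, x, p->Gx));           /* G x                            */
      PetscCall(KSPSolve(p->ksp_cp, p->Gx, p->z));  /* coarse problem (G G^T) z = G x */
      PetscCall(MatMultTranspose(p->G, p->z, y));   /* y = G^T z                      */
      PetscCall(VecAYPX(y, -1.0, x));               /* y = x - y                      */
      PetscFunctionReturn(PETSC_SUCCESS);
    }

In this sketch the KSP, and hence the parallel LU factorization of G G^T, lives on the same communicator as G; the strategies compared in the paper differ precisely in how the coarse problem data and its solver are distributed across the processes.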