Richard Vondráček Department of Structural Mechanics Faculty of Civil Engineering

The Use of The Sparse Direct Solver in The Engineering Applications of The Finite Element MethodPoužití přímého řídkého řešiče v inženýrských aplikacích metody konečných prvků Richard Vondráček Department of Structural Mechanics Faculty of Civil Engineering Czech Technical University in Prague

Outline • Solutions of SLE • Direct sparse solver • Matrix storage formats • Ordering algorithms in general • Block storage • Symbolical factorization • Sparse numerical factorization • Preliminary results • Future goals

Methods for the solution of the sparse system of the linear equations • Iterative • Jacobi, Gauss-Seidel, SOR • CG, GMRES • Multigrid • Direct • Factorization type • LU, LDLT, LLT ( left-looking / right-looking ) • QR • Algorithm (matrix storage scheme) • Skyline (L1) • Frontal (L3) • Multifrontal (L3) • Sparse (L0..L3) • Combined • Incomplete factorizations ( preconditioned CG ) • Domain decompositions - FETI, DP-FETI

Sparse direct solver motivation • Fast factorization • Robust implementation • Unsymmetrical, indefinite, poorly conditioned problems • Lower memory consumption • Enable factorization of bigger tasks • Practical with existing FEM solvers • Compatible with domain decomposition algorithms • Freely available source and documentation

Sparse direct solver • Sparse LDLT, LLT, LU decomposition • Compressed columns storage • Block storage scheme • less indices • efficiency in floating point operations • locality of data is good for memory access caching • Symbolic factorization • AMD ordering (T.A.Davis, P.Amestoy and I.S.Duff) • Minimizes directly the fill-in • Quotient graph • Pattern of matrix L

Matrix ordering algorithms • Profile minimizing • Cuthill-McKee • Sloan • Spectral envelope reduction • Fill-in minimizing • Minimum degree (local → global) • AMD • MMD • Nested Disection (global → local) • Recursive Graph Bisection • Recursive Coordinate Bisection • Recursive Spectral Bisection • Multisection

Matrix storage schemes - orderings Example 2D plane stress mechanical problem Sparse stiffness matrix

Matrix storage schemes - orderings Original ordering + fill-in skyline Original ordering + fill-in sparse direct

Matrix storage schemes - orderings Reversed Cuthill-McKee skyline AMD + fill-in sparse direct

Matrix storage schemes - orderings Recursive Graph Bisection sparse direct Principle of the bisection

Block sparsematrix storage scheme The size of the blocks corresponds with the number of DOFs ofmesh node. The size depends on the physics of the underlying problem. • less indices • efficiency in floating point operations • locality of data is good for memory access caching • performance is tradeoff ( fast arith. vs. wasted arith. ) b×n 2×2, 3×3,.. ..,b×b

Block sparsematrix storage scheme Care must be taken about how the individual equations of the sparse system fit into the regular block structure. b×n Equations eliminated due to the DBC should not destroy the block structure !!

Block sparsematrix storage scheme u1a u2a pa k11ab k12ab c1ab u1b Indefinite systems which have regular “indefinite contributions” from each mesh node can be a-priory preordered to ensure the numerical stability of the decomposition k21ab k22ab c2ab u2b c1ab c2ab 0 pb b

Sparse solution strategy clear matrix L select DOFs permutation and blocksize create coarse connectivity matrix load nonzero matrix A entries into prepared matrix L symbolical factorization finds the block ordering and L pattern numerical factorization forward / backward substitutions for different RHSs allocate memory for the block sparse matrix L

Symbolical factorization • Based on a modified Approximate Minimum Degree algorithm • AMD ordering (T.A.Davis, P.Amestoy and I.S.Duff) (colamd) • Principle of the minimum degree • Finds the optimal equations preordering • Determines the final memory requirements • Gives us the structure of the factored matrix L

Symbolical factorization Connectivity graph used for symbolic factorization Original graph Elimination graph Quotient graph eliminate

Approximate minimum degree Quotient graph in k-th step of factorization

Approximate minimum degree Important sets True external degree (minimum degree)

Numerical factorization • Left-looking decomposition • Three embeded “for” loops • Indirect addressing in the inner loop due to the compressed column format • No further memory allocation • Preoptimized block multiplications • Use of temporary index maps

Numerical factorization Order in which the left-looking factorization updates the matrix entries in the two outer loops

Numerical factorization The purpose of the factorization inner loop in a dense matrix

Numerical factorization Computational effort of the skyline solver in the inner loop

Numerical factorization Computational effort of the sparse direct solver in the inner loop

Numerical experiments • Regular domains and structured meshes • General plate problem (Mindlin theory) • General threedimensional problem

Numerical experiments Parameters of used cluster

Regular domains and structuredmeshes

Regular domains and structuredmeshes Regular domains and structuredmeshes Memory demands of the Schur complement method

Regular domains and structuredmeshes Regular domains and structuredmeshes Time demands of the Schur complement method in seconds

Regular domains and structuredmeshes Regular domains and structuredmeshes Memory demands of the DP-FETI method

Regular domains and structuredmeshes Regular domains and structuredmeshes Time requirements of the DP-FETI method in seconds

General plate problem (Mindlin theory)

General threedimensional problem

Decomposition into subdomains

General threedimensional problem

General threedimensional problem General domain and unstructuredmeshes The Schur complement method

Preliminary conclusions • Competitive with skyline already for small tasks (but not for all) • Clear winner for larger tasks • If the system matrix fits well into the blocks we obtain optimal performance • Always faster RHS solution (forw., back. substitution) than skyline • Can be used as IC preconditioner for CG • Lower memory consumption • Allows bigger tasks without pagefile swapping

Future work and possible research • Study and find best practices for • block size influence to the performance • domain shape influence to the sparse direct vs. skyline effectivity • performance in nonlinear mechanical problems • performance in indefinite problems (incompressible media) • Try variable block size storage schema • Implement automatic block utilization algorithm • LU for primary domain decomposition • Implement block skyline and compare performance • ? Combined storage schema sparse + skyline ?

Thank you !

Richard Vondráček Department of Structural Mechanics Faculty of Civil Engineering