Massively parallel implementation of Total-FETI DDM with application to medical image registration
Michal Merta, Alena Vašatová, Václav Hapla, David Horák
DD21, Rennes, France
Motivation
• solution of large-scale scientific and engineering problems
• possibly hundreds of millions of DOFs
• linear problems
• non-linear problems
• non-overlapping FETI methods with up to tens of thousands of subdomains
• usage of PRACE Tier-1 and Tier-0 HPC systems
PETSc (Portable, Extensible Toolkit for Scientific computation)
• developed by Argonne National Laboratory
• data structures and routines for the scalable parallel solution of scientific applications modeled by PDEs
• coded primarily in C, but with good Fortran support; can also be called from C++ and Python codes
• current version is 3.2 – www.mcs.anl.gov/petsc
• petsc-dev (development branch) is intensively evolving
• code and mailing lists open to anybody
PETSc components (sequential / parallel)
Trilinos
• developed by Sandia National Laboratories
• collection of relatively independent packages
• toolkit for basic linear algebra operations, direct and iterative solvers for linear systems, PDE discretization utilities, mesh generation tools, etc.
• object-oriented design, high modularity, use of modern C++ features (templating)
• written mainly in C++ (Fortran and Python bindings)
• current version 10.10 – trilinos.sandia.gov
Both PETSc and Trilinos…
• are parallelized on the data level (vectors & matrices) using MPI
• use BLAS and LAPACK – the de facto standard for dense linear algebra
• have their own implementations of sparse BLAS
• include robust preconditioners, linear solvers (direct and iterative) and nonlinear solvers
• can cooperate with many other external solvers and libraries (e.g. MATLAB, MUMPS, UMFPACK, …)
• support CUDA and hybrid parallelization
• are licensed as open source
Primal discretized formulation The FEM discretization with a suitable numbering of nodes results in the QP problem:
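The slide's formula did not survive extraction; as a hedged reconstruction, the standard Total-FETI primal QP (notation assumed, following the usual TFETI papers rather than taken from the slide) reads:

```latex
\min_{u} \; \tfrac{1}{2}\, u^{T} K u - f^{T} u
\quad \text{subject to} \quad B u = c ,
```

where K = diag(K_1, …, K_s) is the block-diagonal stiffness matrix of the subdomains (each K_i singular, since in TFETI even the Dirichlet conditions are enforced by the constraints), f is the load vector, and B combines the interface "gluing" and Dirichlet constraints; often c = o.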
Dual discretized formulation (homogenized)
• a QP problem again, but with lower dimension and simpler constraints
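A hedged sketch of that dual problem in the standard TFETI notation (assumed, not recovered from the slide):

```latex
\min_{\lambda} \; \tfrac{1}{2}\,\lambda^{T} F \lambda - \lambda^{T} d
\quad \text{subject to} \quad G\lambda = e ,
```

with F = B K⁺ Bᵀ, d = B K⁺ f, G = Rᵀ Bᵀ and e = Rᵀ f, where K⁺ is a generalized inverse of K and the columns of R span ker K. Homogenization shifts the equality constraint to Gλ = o.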
Primal data distribution, F action
• very sparse
• straightforward matrix distribution, given by the decomposition
• block-diagonal ⇒ embarrassingly parallel
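Since F = B K⁺ Bᵀ is never assembled, its action reduces to one independent local solve per subdomain, which is why it is embarrassingly parallel. A minimal pure-Python sketch with toy dense data (shapes assumed for illustration; the K_i are taken diagonal here so that K_i⁺ is trivial, whereas the real code uses distributed PETSc/Trilinos objects and factorized singular blocks):

```python
# Toy sketch of the embarrassingly parallel F action (assumed shapes, not the
# actual TFETI code): F*lam = sum_i B_i K_i^+ B_i^T lam, one term per subdomain.

def matvec(A, x):                      # dense mat-vec: y = A x
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def matvec_T(A, x):                    # transposed mat-vec: y = A^T x
    n = len(A[0])
    return [sum(A[i][j] * x[i] for i in range(len(A))) for j in range(n)]

# Two subdomains; K_i chosen diagonal SPD here so K_i^+ is just 1/diagonal.
# (In TFETI each K_i is singular and K_i^+ comes from a regularized factorization.)
K_diag = [[2.0, 4.0], [1.0, 5.0]]
B = [                                  # B_i: constraints x local DOFs (3 x 2 each)
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[-1.0, 0.0], [0.0, -1.0], [0.5, 0.0]],
]

def F_action(lam):
    """Apply F = sum_i B_i K_i^+ B_i^T without ever assembling F."""
    y = [0.0] * len(lam)
    for Bi, Ki in zip(B, K_diag):      # independent per subdomain -> parallel
        z = matvec_T(Bi, lam)          # B_i^T lam
        z = [zj / kj for zj, kj in zip(z, Ki)]   # K_i^+ (diagonal inverse here)
        w = matvec(Bi, z)              # B_i (...)
        y = [yi + wi for yi, wi in zip(y, w)]
    return y

print(F_action([1.0, 2.0, 3.0]))
```

In the real implementation each term of the sum lives on its own group of processes and only the short vector of Lagrange multipliers is communicated.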
Coarse projector action
• can easily take 85 % of computation time if not properly parallelized!
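The coarse projector P = I − Gᵀ(GGᵀ)⁻¹G applied in every iteration hides a global coarse solve with GGᵀ, which is why its parallelization dominates the cost. A toy pure-Python sketch of one P action (data and shapes assumed for illustration only):

```python
# Hedged sketch of the coarse projector action P*lam = lam - G^T (G G^T)^{-1} G lam
# on toy dense data; the real G G^T factorization is distributed (PPAM 2011 variants).

def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def solve(A, b):
    """Tiny Gaussian elimination with partial pivoting for the coarse problem."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            f = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= f * M[k][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

G = [[1.0, 0.0, 1.0, 0.0],             # toy G (coarse rows x dual DOFs)
     [0.0, 1.0, 0.0, 1.0]]

GGt = [[sum(gi * gj for gi, gj in zip(ri, rj)) for rj in G] for ri in G]

def P_action(lam):
    mu = solve(GGt, matvec(G, lam))    # coarse solve: (G G^T) mu = G lam
    corr = [sum(G[i][j] * mu[i] for i in range(len(G))) for j in range(len(lam))]
    return [l - c for l, c in zip(lam, corr)]

print(matvec(G, P_action([1.0, 2.0, 3.0, 4.0])))   # projected out: zero
```

Unlike the F action, the coarse solve couples all subdomains, so it is not embarrassingly parallel – hence the different parallelization strategies compared in the PPAM 2011 paper.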
G preprocessing and action
• preprocessing
• action
Coarse problem preprocessing and action
• preprocessing
• action
• currently used variant: B2 (PPAM 2011)
HECToR phase 3 (XE6)
• the UK's largest, fastest and most powerful supercomputer
• supplied by Cray Inc., operated by EPCC
• uses the latest AMD "Bulldozer" multicore processor architecture
• 704 compute blades, each blade with 4 compute nodes, giving a total of 2816 compute nodes
• each node with two 16-core AMD Opteron 2.3 GHz Interlagos processors → 32 cores per node
• total of 90112 cores
• each 16-core processor shares 16 GB of memory, 90 TB in total
• theoretical peak performance over 800 TFLOPS
• www.hector.ac.uk
Benchmark
• K⁺ implemented as a direct solve (LU) of the regularized K
• built-in CG routines used (PETSc KSP, Trilinos Belos)
• E = 1e6, ν = 0.3, g = 9.81 m·s⁻²
• computed on HECToR
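The first benchmark bullet can be illustrated on a toy problem: if the singular stiffness K is regularized as K + ρRRᵀ with R spanning ker K, then an LU solve with the regularized matrix acts as a generalized inverse K⁺ for any right-hand side in Im K. A hedged pure-Python sketch (1D Neumann Laplacian standing in for the elasticity blocks; the real benchmark uses a parallel LU of the regularized subdomain matrices):

```python
def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def lu_solve(A, b):
    """Small Gaussian elimination with partial pivoting, standing in for
    the LU factorization used in the actual benchmark."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            f = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= f * M[k][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

# Toy singular "stiffness": 1D Laplacian with pure Neumann BC; ker K = span{R}.
K = [[1.0, -1.0, 0.0],
     [-1.0, 2.0, -1.0],
     [0.0, -1.0, 1.0]]
R = [1.0, 1.0, 1.0]
rho = 1.0
Kreg = [[K[i][j] + rho * R[i] * R[j] for j in range(3)] for i in range(3)]

f = [1.0, 0.0, -1.0]        # f orthogonal to ker K, i.e. f in Im K
x = lu_solve(Kreg, f)       # one K^+ action via the regularized LU
print(matvec(K, x))         # reproduces f: K K^+ f = f on Im K
```

Multiplying (K + ρRRᵀ)x = f by Rᵀ shows Rᵀx = 0 (since RᵀK = 0 and Rᵀf = 0), hence Kx = f, which is exactly the generalized-inverse property needed for the F action.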
Application to image registration
• the process of integrating information from two (or more) different images
• images from different sensors, from different angles and/or times
Application to image registration
• In medicine:
• monitoring of tumour growth
• therapy evaluation
• comparison of patient data with an anatomical atlas
• data from magnetic resonance (MR), computed tomography (CT), positron emission tomography (PET)
Elastic registration
• the task is to minimize the distance between the two images
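The minimized functional is the classical elastic registration objective; a hedged reconstruction in the usual notation (SSD distance plus the linear-elastic regularizer, assumed rather than recovered from the slide):

```latex
\min_{u}\; \mathcal{J}[u]
= \tfrac{1}{2}\int_{\Omega} \bigl(T(x + u(x)) - R(x)\bigr)^{2}\,dx
+ \alpha \int_{\Omega} \frac{\mu}{4}\sum_{i,j}\bigl(\partial_{x_j}u_i + \partial_{x_i}u_j\bigr)^{2}
+ \frac{\lambda}{2}\bigl(\operatorname{div} u\bigr)^{2}\,dx ,
```

where R is the reference image, T the template, u the sought displacement field, α the regularization weight, and μ, λ the Lamé constants. The Euler–Lagrange equations of the second term are the equations of linear elasticity, which is what makes TFETI applicable.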
Elastic registration
• parallelization using the TFETI method
Results
• stopping criterion: ‖r_k‖ / ‖r_0‖ < 1e-5
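For reference, the stopping criterion above plugged into a textbook CG loop – a pure-Python sketch on a tiny SPD system, not the PETSc KSP / Trilinos Belos implementations actually benchmarked:

```python
import math

def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cg(A, b, rtol=1e-5, maxit=1000):
    """Plain CG with the relative-residual stopping test ||r_k||/||r_0|| < rtol."""
    x = [0.0] * len(b)
    r = b[:]                           # r_0 = b (zero initial guess)
    p = r[:]
    rs = dot(r, r)
    r0 = math.sqrt(rs)
    for _ in range(maxit):
        if math.sqrt(rs) / r0 < rtol:  # the slide's criterion
            break
        Ap = matvec(A, p)
        alpha = rs / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x

A = [[4.0, 1.0], [1.0, 3.0]]           # small SPD test matrix (assumed data)
b = [1.0, 2.0]
print(cg(A, b))
```

In the benchmarks the operator applied inside the loop is PF P (projected, preconditioned F action) rather than a plain matrix, but the stopping logic is the same.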
Conclusion and future work
• consolidate the PETSc & Trilinos TFETI implementations into the form of extensions or packages
• further optimize the codes using core-hours on Tier-1/Tier-0 systems (PRACE DECI Initiative, HPC-Europa2)
• extend image registration to 3D data
References
• KOZUBEK, T. et al. Total FETI domain decomposition method and its massively parallel implementation. Accepted for publication in Advances in Engineering Software.
• HORAK, D.; HAPLA, V. TFETI coarse space projectors parallelization strategies. Accepted for publication in the proceedings of PPAM 2011, Springer LNCS, 2012.
• ZITOVA, B.; FLUSSER, J. Image registration methods: a survey. Image and Vision Computing, Vol. 21, No. 11, 2003, pp. 977–1000.