
Massively parallel implementation of Total-FETI DDM with application to medical image registration


Presentation Transcript


  1. Massively parallel implementation of Total-FETI DDM with application to medical image registration. Michal Merta, Alena Vašatová, Václav Hapla, David Horák. DD21, Rennes, France

  2. Motivation • solution of large-scale scientific and engineering problems • possibly hundreds of millions of DOFs • linear problems • non-linear problems • non-overlapping FETI methods with up to tens of thousands of subdomains • usage of PRACE Tier-1 and Tier-0 HPC systems

  3. PETSc (Portable, Extensible Toolkit for Scientific Computation) • developed by Argonne National Laboratory • data structures and routines for the scalable parallel solution of scientific applications modeled by PDEs • coded primarily in the C language, but with good Fortran support; can also be called from C++ and Python codes • current version is 3.2, www.mcs.anl.gov/petsc • petsc-dev (development branch) is intensively evolving • code and mailing lists open to anybody

  4. PETSc components (seq. / par.)

  5. Trilinos • developed by Sandia National Laboratories • collection of relatively independent packages • toolkit for basic linear algebra operations, direct and iterative solvers for linear systems, PDE discretization utilities, mesh generation tools, etc. • object-oriented design, high modularity, use of modern C++ features (templating) • mainly in C++ (Fortran and Python bindings) • current version 10.10 • trilinos.sandia.gov

  6. Trilinos components

  7. Both PETSc and Trilinos… • are parallelized on the data level (vectors & matrices) using MPI • use BLAS and LAPACK, the de facto standard for dense LA • have their own implementation of sparse BLAS • include robust preconditioners, linear solvers (direct and iterative) and nonlinear solvers • can cooperate with many other external solvers and libraries (e.g. MATLAB, MUMPS, UMFPACK, …) • support CUDA and hybrid parallelization • are licensed as open-source

  8. Problem of elastostatics

  9. TFETI decomposition

  10. Primal discretized formulation The FEM discretization with a suitable numbering of nodes results in the QP problem:
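The formula itself was a slide image and is not reproduced in the transcript; as a sketch, in the usual TFETI notation (K the block-diagonal stiffness matrix of the floating subdomains, f the discretized load vector, B the constraint matrix enforcing both the gluing and the Dirichlet conditions, c the constraint right-hand side, typically c = o), the QP problem reads

    min_u (1/2) u^T K u - f^T u   subject to   B u = c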

  11. Dual discretized formulation (homogenized) • QP problem again, but with lower dimension and simpler constraints
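A sketch of the dual problem (not reproduced in the transcript), in the standard TFETI notation with λ the vector of Lagrange multipliers, R a matrix whose columns span the kernel of K, and K^+ any generalized inverse of K:

    F = B K^+ B^T,   G = R^T B^T,   d = B K^+ f - c,   e = R^T f

    min_λ (1/2) λ^T F λ - λ^T d   subject to   G λ = e

After homogenization of the equality constraint the iterates stay in the null space of G, which is where the coarse projector of the following slides enters.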

  12. Primal data distribution, F action • very sparse • straightforward matrix distribution, given by the decomposition • block diagonal • embarrassingly parallel
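A minimal serial NumPy/SciPy sketch of the F action, F λ = Σ_i B_i K_i^+ B_i^T λ; the function names, the dense storage and the kernel-based regularization below are illustrative assumptions, not the PETSc/Trilinos code, and in the real implementation each subdomain term is evaluated by the MPI process that owns it:

    import numpy as np
    from scipy.linalg import lu_factor, lu_solve

    def make_Kplus_action(K_i, R_i, rho=1.0):
        """Generalized-inverse action for one floating subdomain: K_i is
        symmetric positive semidefinite with kernel spanned by the orthonormal
        columns of R_i, so K_i + rho * R_i R_i^T is regular and its LU solve
        acts as a generalized inverse K_i^+ of K_i."""
        lu = lu_factor(K_i + rho * (R_i @ R_i.T))
        return lambda x: lu_solve(lu, x)

    def F_action(lmbda, B_blocks, Kplus_actions):
        """F * lambda = sum_i B_i K_i^+ B_i^T lambda; the block-diagonal K
        makes this embarrassingly parallel over the subdomains."""
        return sum(B_i @ Kplus(B_i.T @ lmbda)
                   for B_i, Kplus in zip(B_blocks, Kplus_actions))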

  13. Coarse projector action • can easily take 85 % of computation time if not properly parallelized!
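For orientation, the coarse projector is P = I - G^T (G G^T)^{-1} G, the orthogonal projector onto the null space of G. Below is a minimal serial NumPy sketch of its action (illustrative only, assuming G has full row rank; it is not one of the parallelization variants compared in the talk):

    import numpy as np
    from scipy.linalg import cho_factor, cho_solve

    def make_projector_action(G):
        """P x = x - G^T (G G^T)^{-1} G x, with the small coarse-problem
        matrix G G^T factorized once during preprocessing."""
        coarse = cho_factor(G @ G.T)   # Cholesky of the coarse problem
        def P(x):
            y = G @ x                  # reduce to the coarse space
            z = cho_solve(coarse, y)   # coarse-problem solve: the step whose parallelization decides scalability
            return x - G.T @ z         # expand back to the dual space
        return P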

  14. G preprocessing and action

  15. Coarse problem preprocessing and action • currently used variant: B2 (PPAM 2011)

  16. Coarse problem

  17. HECToR phase 3 (XE6) • the UK's largest, fastest and most powerful supercomputer • supplied by Cray Inc., operated by EPCC • uses the latest AMD "Bulldozer" multicore processor architecture • 704 compute blades, each blade with 4 compute nodes, giving a total of 2816 compute nodes • each node with two 16-core AMD Opteron 2.3 GHz Interlagos processors → 32 cores per node, 90112 cores in total • each 16-core processor shares 16 GB of memory, 60 TB in total • theoretical peak performance over 800 TFLOPS • www.hector.ac.uk

  18. Benchmark • K+ implemented as a direct solve (LU) of the regularized K • built-in CG routine used (PETSc.KSP, Trilinos.Belos) • E = 1e6, ν = 0.3, g = 9.81 m/s^2 • computed @ HECToR
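To make the "built-in CG routine" item concrete, here is a minimal petsc4py sketch of a KSP conjugate-gradient solve on a toy tridiagonal matrix; everything below (matrix, right-hand side, tolerance) is an illustrative assumption and is not the actual HECToR benchmark code:

    # Toy example: assemble a distributed 1D Laplacian and solve it with
    # PETSc's built-in CG through the KSP interface (petsc4py bindings).
    from petsc4py import PETSc

    n = 1000
    A = PETSc.Mat().createAIJ([n, n])
    A.setUp()
    A.setOption(PETSc.Mat.Option.NEW_NONZERO_ALLOCATION_ERR, False)  # relax preallocation checks for this toy example
    rstart, rend = A.getOwnershipRange()       # rows owned by this MPI process
    for i in range(rstart, rend):
        A.setValue(i, i, 2.0)
        if i > 0:
            A.setValue(i, i - 1, -1.0)
        if i < n - 1:
            A.setValue(i, i + 1, -1.0)
    A.assemble()

    b = A.createVecLeft()
    b.set(1.0)                                 # arbitrary right-hand side
    x = A.createVecRight()

    ksp = PETSc.KSP().create()
    ksp.setOperators(A)
    ksp.setType('cg')                          # conjugate gradients
    ksp.setTolerances(rtol=1e-5)               # relative residual tolerance (illustrative)
    ksp.solve(b, x)

Run with mpiexec to spread the matrix rows over several processes; the same script works unchanged in serial.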

  19. Results

  20. Application to image registration • Process of integrating information from two (or more) different images • Images from different sensors, from different angles and/or at different times

  21. Application to image registration • In medicine: • Monitoring of tumour growth • Therapy evaluation • Comparison of patient data with an anatomical atlas • Data from magnetic resonance (MR), computed tomography (CT), positron emission tomography (PET)

  22. Elastic registration The task is to minimize the distance between two images
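The slide's formula is not in the transcript; a common form of the minimized functional in elastic registration (the symbols below are an assumption following the usual literature, with R the reference image, T the template image, u the sought displacement field and α > 0 the regularization weight) is

    J[u] = (1/2) ∫_Ω ( T(x + u(x)) - R(x) )^2 dx  +  α S[u]

where the first term is the sum-of-squared-differences image distance and S[u] is the linear-elastic potential of u; minimizing J leads to elasticity problems of exactly the type handled by the TFETI machinery above (sign conventions for the warp T(x ± u(x)) vary between authors).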

  23. Elastic registration • Parallelization using the TFETI method

  24. Results • stopping criterion: ||r_k|| / ||r_0|| < 1e-5
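For completeness, the quoted stopping test is the usual relative-residual criterion; the following plain NumPy conjugate-gradient loop (an illustrative sketch, not the PETSc/Trilinos routines actually used) shows exactly where it enters:

    import numpy as np

    def cg(A, b, tol=1e-5, maxit=1000):
        """Conjugate gradients with the relative-residual stopping test
        ||r_k|| / ||r_0|| < tol quoted on the slide."""
        x = np.zeros_like(b)
        r = b - A @ x
        p = r.copy()
        r0_norm = np.linalg.norm(r)
        rs = r @ r
        for _ in range(maxit):
            Ap = A @ p
            alpha = rs / (p @ Ap)
            x += alpha * p
            r -= alpha * Ap
            if np.linalg.norm(r) / r0_norm < tol:   # stopping criterion from the slide
                break
            rs_new = r @ r
            p = r + (rs_new / rs) * p
            rs = rs_new
        return x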

  25. Solution

  26. Conclusion and future work • To consolidate the PETSc & Trilinos TFETI implementations into the form of extensions or packages • To further optimize the codes using core-hours on Tier-1/Tier-0 systems (PRACE DECI Initiative, HPC-Europa2) • To extend image registration to 3D data

  27. References • KOZUBEK, T. et al. Total FETI domain decomposition method and its massively parallel implementation. Accepted for publication in Advances in Engineering Software. • HORAK, D.; HAPLA, V. TFETI coarse space projectors parallelization strategies. Accepted for publication in the proceedings of PPAM 2011, Springer LNCS, 2012. • ZITOVA, B.; FLUSSER, J. Image registration methods: a survey. Image and Vision Computing, Vol. 21, No. 11, 2003, pp. 977-1000.

  28. Thank you for your attention!
