
Massively parallel implementation of Total-FETI DDM with application to medical image registration

This research focuses on the solution of large-scale scientific and engineering problems in medical image registration using the Total-FETI domain decomposition method (DDM). The implementation is parallelized for high-performance computing systems and uses data structures and routines from the PETSc and Trilinos libraries.


Presentation Transcript


  1. Massively parallel implementation of Total-FETI DDM with application to medical image registration
  Michal Merta, Alena Vašatová, Václav Hapla, David Horák
  DD21, Rennes, France

  2. Motivation
  • solution of large-scale scientific and engineering problems
  • possibly hundreds of millions of DOFs
  • linear problems
  • non-linear problems
  • non-overlapping (FETI) methods with up to tens of thousands of subdomains
  • usage of PRACE Tier-1 and Tier-0 HPC systems

  3. PETSc (Portable, Extensible Toolkit for Scientific Computation)
  • developed by Argonne National Laboratory
  • data structures and routines for the scalable parallel solution of scientific applications modeled by PDEs
  • coded primarily in C, but with good Fortran support; can also be called from C++ and Python codes
  • current version is 3.2, www.mcs.anl.gov/petsc
  • petsc-dev (development branch) is intensively evolving
  • code and mailing lists open to anybody

  4. PETSc components (seq. / par.)

  5. Trilinos
  • developed by Sandia National Laboratories
  • collection of relatively independent packages
  • toolkit for basic linear algebra operations, direct and iterative solvers for linear systems, PDE discretization utilities, mesh generation tools, etc.
  • object-oriented design, high modularity, use of modern C++ features (templating)
  • mainly in C++ (Fortran and Python bindings)
  • current version 10.10, trilinos.sandia.gov

  6. Trilinos components

  7. Both PETSc and Trilinos…
  • are parallelized on the data level (vectors & matrices) using MPI
  • use BLAS and LAPACK, the de facto standard for dense LA
  • have their own implementation of sparse BLAS
  • include robust preconditioners, linear solvers (direct and iterative) and nonlinear solvers
  • can cooperate with many other external solvers and libraries (e.g. MATLAB, MUMPS, UMFPACK, …)
  • support CUDA and hybrid parallelization
  • are licensed as open source

  8. Problem of elastostatics

  9. TFETI decomposition

  10. Primal discretized formulation
  The FEM discretization with a suitable numbering of nodes results in the QP problem:
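
  The QP problem itself did not survive the transcript; a reconstruction in the standard TFETI notation (as in Kozubek et al., see the references), where the stiffness matrix K is block diagonal over the subdomains and B enforces inter-subdomain continuity and the Dirichlet conditions, is:

```latex
% Standard TFETI primal QP (reconstruction in the usual notation, not the lost slide):
\min_{u} \; \tfrac{1}{2}\, u^{\top} K u - f^{\top} u
\quad \text{subject to} \quad B u = c,
\qquad K = \operatorname{diag}(K_1, \dots, K_s)
```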

  11. Dual discretized formulation (homogenized)
  QP problem again, but with lower dimension and simpler constraints
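
  The formulas are again missing from the transcript; the standard FETI/TFETI dual objects, with K⁺ a generalized inverse of K and R a basis of ker K (the rigid body modes), read:

```latex
% Standard FETI/TFETI dual QP (reconstruction in the usual notation):
\min_{\lambda} \; \tfrac{1}{2}\, \lambda^{\top} F \lambda - \lambda^{\top} d
\quad \text{subject to} \quad G \lambda = e,
\\
F = B K^{+} B^{\top}, \qquad
G = R^{\top} B^{\top}, \qquad
d = B K^{+} f - c, \qquad
e = R^{\top} f
```

  After homogenization, i.e. shifting λ by a particular solution of Gλ = e, the equality constraint becomes Gλ = 0.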

  12. Primal data distribution, F action
  • straightforward matrix distribution, given by the decomposition
  • B very sparse, K block diagonal
  • F action embarrassingly parallel
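
  For illustration only, a minimal PETSc sketch in C (not the authors' code; the names FCtx, FMult and CreateF are assumed) of how the F action can be realized matrix-free, with K⁺ applied as the LU solve of the regularized K mentioned on slide 18:

```c
/* Minimal sketch of the dual operator F = B K^+ B^T as a PETSc shell
 * matrix; struct and function names are illustrative only. */
#include <petscksp.h>

typedef struct {
  Mat B;            /* sparse jump/gluing operator                  */
  KSP Kplus;        /* LU solve with the regularized K, acts as K^+ */
  Vec tmp1, tmp2;   /* primal work vectors                          */
} FCtx;

static PetscErrorCode FMult(Mat F, Vec lambda, Vec y)
{
  FCtx          *ctx;
  PetscErrorCode ierr;

  PetscFunctionBeginUser;
  ierr = MatShellGetContext(F, &ctx);CHKERRQ(ierr);
  ierr = MatMultTranspose(ctx->B, lambda, ctx->tmp1);CHKERRQ(ierr); /* B^T lambda                    */
  ierr = KSPSolve(ctx->Kplus, ctx->tmp1, ctx->tmp2);CHKERRQ(ierr);  /* K^+ (B^T lambda), block diagonal, subdomain-wise */
  ierr = MatMult(ctx->B, ctx->tmp2, y);CHKERRQ(ierr);               /* B K^+ B^T lambda              */
  PetscFunctionReturn(0);
}

/* Registration of the shell matrix; m/M are the local/global numbers
 * of Lagrange multipliers. */
PetscErrorCode CreateF(MPI_Comm comm, PetscInt m, PetscInt M, FCtx *ctx, Mat *F)
{
  PetscErrorCode ierr;

  PetscFunctionBeginUser;
  ierr = MatCreateShell(comm, m, m, M, M, ctx, F);CHKERRQ(ierr);
  ierr = MatShellSetOperation(*F, MATOP_MULT, (void (*)(void))FMult);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}
```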

  13. Coarse projector action
  • can easily take 85 % of computation time if not properly parallelized!
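
  The projector formula was lost with the slide graphics; in the usual TFETI notation it is the orthogonal projector onto ker G:

```latex
% Coarse projector (standard TFETI notation, reconstruction):
P = I - G^{\top} \bigl( G G^{\top} \bigr)^{-1} G
```

  Each application of P requires a solve with the coarse matrix G Gᵀ (the coarse problem of slide 16), which is why its parallelization matters so much.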

  14. G preprocessing and action

  15. Coarse problem preprocessing and action
  • currently used variant: B2 (PPAM 2011)

  16. Coarse problem

  17. HECToR phase 3 (XE6)
  • the UK's largest, fastest and most powerful supercomputer
  • supplied by Cray Inc., operated by EPCC
  • uses the latest AMD "Bulldozer" multicore processor architecture
  • 704 compute blades, each blade with 4 compute nodes, giving a total of 2816 compute nodes
  • each node with two 16-core AMD Opteron 2.3 GHz Interlagos processors → 32 cores per node, total of 90112 cores
  • each 16-core processor shares 16 GB of memory, 60 TB in total
  • theoretical peak performance over 800 TFlops
  • www.hector.ac.uk

  18. Benchmark
  • K⁺ implemented as a direct solve (LU) of the regularized K
  • built-in CG routine used (PETSc KSP, Trilinos Belos)
  • E = 1e6, ν = 0.3, g = 9.81 m s⁻²
  • computed @ HECToR
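
  A minimal sketch of the dual CG solve with PETSc's built-in KSP (illustrative names only; F is the shell operator from the sketch above, d the dual right-hand side; the Trilinos counterpart would use a Belos CG solver manager). The 1e-5 relative tolerance mirrors the stopping criterion quoted on slide 24, not a value stated for this benchmark:

```c
/* Minimal sketch of the dual CG solve using PETSc's KSP (recent PETSc,
 * two-argument KSPSetOperators); names are assumed, not the authors' code. */
#include <petscksp.h>

PetscErrorCode SolveDual(Mat F, Vec d, Vec lambda)
{
  KSP            ksp;
  PetscErrorCode ierr;

  PetscFunctionBeginUser;
  ierr = KSPCreate(PETSC_COMM_WORLD, &ksp);CHKERRQ(ierr);
  ierr = KSPSetOperators(ksp, F, F);CHKERRQ(ierr);
  ierr = KSPSetType(ksp, KSPCG);CHKERRQ(ierr);   /* conjugate gradient method */
  /* relative residual tolerance; other tolerances left at defaults */
  ierr = KSPSetTolerances(ksp, 1e-5, PETSC_DEFAULT, PETSC_DEFAULT, PETSC_DEFAULT);CHKERRQ(ierr);
  ierr = KSPSolve(ksp, d, lambda);CHKERRQ(ierr); /* solve F lambda = d */
  ierr = KSPDestroy(&ksp);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}
```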

  19. Results

  20. Application to image registration
  • process of integrating information from two (or more) different images
  • images from different sensors, from different angles and/or times

  21. Application to image registration
  In medicine:
  • monitoring of tumour growth
  • therapy evaluation
  • comparison of patient data with an anatomical atlas
  • data from magnetic resonance (MR), computed tomography (CT), positron emission tomography (PET)

  22. Elastic registration
  The task is to minimize the distance between the two images.
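
  The objective functional is not reproduced in the transcript; a sketch of the standard elastic registration formulation, with reference image R, template T, displacement u, regularization parameter α and Lamé constants μ, λ (none of these symbols or values are taken from the slides), is:

```latex
% Standard elastic registration functional (reconstruction, not the lost slide):
\min_{u} \; \mathcal{J}[u] =
  \frac{1}{2} \int_{\Omega} \bigl( T(x - u(x)) - R(x) \bigr)^{2} \, dx
  + \alpha \int_{\Omega} \Bigl( \frac{\mu}{4} \sum_{i,j}
      \bigl( \partial_{x_j} u_i + \partial_{x_i} u_j \bigr)^{2}
      + \frac{\lambda}{2} \, (\nabla \cdot u)^{2} \Bigr) dx
```

  Its Euler-Lagrange equations are the Navier-Lamé equations of linear elasticity with an image-difference force term, which is what allows the TFETI elastostatics solver above to be reused.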

  23. Elastic registration
  Parallelization using the TFETI method

  24. Results
  stopping criterion: ||r_k|| / ||r_0|| < 1e-5

  25. Solution

  26. Conclusion and future work
  • to consolidate the PETSc & Trilinos TFETI implementations into the form of extensions or packages
  • to further optimize the codes using core-hours on Tier-1/Tier-0 systems (PRACE DECI Initiative, HPC-Europa2)
  • to extend image registration to 3D data

  27. References
  • Kozubek, T. et al.: Total FETI domain decomposition method and its massively parallel implementation. Accepted for publication in Advances in Engineering Software.
  • Horák, D.; Hapla, V.: TFETI coarse space projectors parallelization strategies. Accepted for publication in the proceedings of PPAM 2011, Springer LNCS, 2012.
  • Zitová, B.; Flusser, J.: Image registration methods: a survey. Image and Vision Computing, Vol. 21, No. 11, 2003, pp. 977-1000.

  28. Thank you for your attention!
