Tomographic mammography parallelization Juemin Zhang (NU) Tao Wu (MGH) Waleed Meleis (NU) David Kaeli (NU)

Parallelization of SSI Applications
• We have developed profile-guided parallelization techniques that rapidly characterize program control flow and data flow, and we use this information to guide parallelization
• We have already sped up a number of CenSSIS applications, including:
  • finite-difference time domain
  • steepest descent fast multi-pole method
  • photo simulation
  • ellipsoid algorithm
• We target Beowulf clusters running Linux
• We use MPICH as our middleware

Tomographic mammography
• 3D image reconstruction from x-ray projections
• Used to detect and diagnose breast cancer
• Based on well-developed mammography techniques
• Exposes tissue structure using multiple projections from different angles
• Advantages
  • Accuracy: provides at least as much useful information as x-ray film
  • Flexibility: digital image manipulation, digital storage
  • Provides structural information: using layered images
  • Safe: low-dose x-ray
  • Lower cost: compared to MRI

Image acquisition and reconstruction process
• Acquisition: 11 uniform angular samples along Y-axis
• X-ray projection: breast tissue density absorption radiograph
• Algorithm: constrained non-linear convergence and iterative process
• [Figure: acquisition geometry showing the x-ray source, the detector, and the X, Y, Z axes]
• [Flowchart: Initialization (set 3D volume) → Forward (compute projections from 3D volume) → Backward (correct 3D volume from x-ray projections) → Satisfied? No: repeat; Yes: exit]

Reconstruction and Parallelization
• Reconstruction algorithm: maximum likelihood expectation maximization (ML-EM)
  • High-resolution image
  • Computationally intensive: 3 hours of serial execution on a 2.2 GHz Pentium 4 workstation, using 2 GB of memory
• The need for speed:
  • Large number of medical cases
  • Execution time increases as a function of breast size
  • Real-time application: computer-guided needle biopsy breast surgery
• Research motivation
  • Computation vs. communication
  • Platforms vs. parallelization methods
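The forward/backward/correct loop behind ML-EM can be illustrated with a toy example. This is a minimal sketch, not the MGH code: it uses the textbook emission-form multiplicative update as a stand-in, because the slides do not give the exact update rule used for the x-ray data. The dense `Matrix` type and the names `forward_project` and `mlem_step` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical dense system matrix: a[i][j] is the weight of voxel j in ray i.
// A real reconstruction would use a sparse or on-the-fly projector instead.
using Matrix = std::vector<std::vector<double>>;

// Forward step: project the current volume estimate x onto the detector.
std::vector<double> forward_project(const Matrix& a, const std::vector<double>& x) {
    std::vector<double> p(a.size(), 0.0);
    for (std::size_t i = 0; i < a.size(); ++i)
        for (std::size_t j = 0; j < x.size(); ++j)
            p[i] += a[i][j] * x[j];
    return p;
}

// One iteration of the emission-form ML-EM update (an assumed stand-in):
//   x_j <- x_j * (sum_i a_ij * y_i / p_i) / (sum_i a_ij)
// where y holds the measured projections and p the computed ones.
void mlem_step(const Matrix& a, const std::vector<double>& y, std::vector<double>& x) {
    assert(a.size() == y.size());
    const std::vector<double> p = forward_project(a, x);
    for (std::size_t j = 0; j < x.size(); ++j) {
        double sens = 0.0;  // sensitivity of voxel j: sum of its weights over all rays
        double back = 0.0;  // backprojected ratio of measured to computed projections
        for (std::size_t i = 0; i < a.size(); ++i) {
            sens += a[i][j];
            if (p[i] > 0.0) back += a[i][j] * y[i] / p[i];
        }
        if (sens > 0.0) x[j] *= back / sens;  // correct the volume estimate
    }
}
```

Calling `mlem_step` repeatedly until the computed projections are close enough to the measured ones corresponds to the "Satisfied?" test in the reconstruction loop. Each voxel update needs projections that depend on other voxels along the same ray, which is why the choice of partitioning matters for communication.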

Parallelization approaches
• Segmentation along Y-axis
• Segmenting along x-ray beam
• Using redundant computation to replace communication
• First approach: no inter-node communication (more computation, no communication)
• Second approach: overlap with inter-node communication
  • Reduce communication data
• Third approach: non-overlapped with inter-node communication (no redundant computation, more communication)
• [Figure labels: overlap area, exchange data]
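The difference between the overlapped and non-overlapped approaches comes down to which Y rows each process holds. The sketch below is an illustration under assumptions: the `Slab` struct, the `partition_y` function, and the `halo` parameter (the number of extra rows kept on each side) are hypothetical names, and the slides do not state how wide the overlap region actually is.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Half-open range of Y indices [begin, end) held by one process.
struct Slab {
    std::size_t begin;
    std::size_t end;
};

// Split ny rows into `parts` nearly equal slabs along Y.
// halo == 0 gives the non-overlapped layout, where boundary data would have to
// be exchanged between neighbors. halo > 0 extends each slab by `halo` rows on
// both sides (clipped to the volume), trading redundant computation for less
// communication.
std::vector<Slab> partition_y(std::size_t ny, std::size_t parts, std::size_t halo) {
    assert(parts > 0 && parts <= ny);
    std::vector<Slab> slabs;
    const std::size_t base = ny / parts;   // rows every slab gets
    const std::size_t extra = ny % parts;  // first `extra` slabs get one more row
    std::size_t start = 0;
    for (std::size_t r = 0; r < parts; ++r) {
        const std::size_t len = base + (r < extra ? 1 : 0);
        const std::size_t lo = start > halo ? start - halo : 0;
        const std::size_t hi = std::min(ny, start + len + halo);
        slabs.push_back({lo, hi});
        start += len;
    }
    return slabs;
}
```

For example, `partition_y(10, 3, 0)` yields the disjoint ranges [0,4), [4,7), [7,10), while `partition_y(10, 3, 1)` yields [0,5), [3,8), [6,10), where neighboring slabs share rows. In an MPI program each rank would pick `slabs[rank]` as its working range.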

Implementation and tests
• Serial code provided by T. Wu at MGH
• Programming model
  • C++ and Message Passing Interface (MPI)
  • Globus Toolkit: MPICH-G2 over NPACI Grid, in progress
• Test input data sets
  • Phantom data set: 1600x2034x45
  • A large patient data set: 1040x2034x70
• Test platforms

Partitioning methods comparison
• Input data set: phantom 1600x2034x45
• Platform: UIUC NCSA Titan cluster
• The non-overlap method outperforms the other two methods
• The best parallel runtime is under 3 minutes using 64 processors
• The three methods show very similar speedup trends
• Given additional processors, the non-overlap method yields a larger performance increase than the other methods

Platform performance comparison using non-overlap method
• Input data set: phantom 1600x2034x45
• Platforms:
  • SGI Altix system
  • UIUC NCSA Titan cluster
  • UIUC NCSA IBM p690
  • Pentium 4 cluster at MGH
• Number of processors: 32
• Algorithm: non-overlap with inter-node communication partition method
• Computation: SGI Altix with Itanium 2 processors outperforms the other CPUs
• Communication: shared-memory platforms have very low communication overhead
• Over 2 times performance difference between the SGI Altix and the Pentium 4 cluster

Platform performance comparison using no inter-node communication
• Input data set: phantom 1600x2034x45
• Platforms:
  • SGI Altix system
  • UIUC NCSA Titan cluster
  • UIUC NCSA IBM p690
  • Pentium 4 cluster at MGH
• Number of processors: 32
• Algorithm: overlap without inter-node communication
• Computation: significant differences between the Titan, IBM p690 and Pentium 4 clusters
• Synchronization: more waiting time accumulates at the end of iterations
• SGI Altix performance remains similar to that of the non-overlap method

Platform and parallel partitioning method performance comparison
• Input data set: phantom 1600x2034x45
• Platforms:
  • Pentium 4 cluster at MGH
  • UIUC NCSA IBM p690
  • UIUC NCSA Titan cluster
  • SGI Altix
• Number of processors: 32
• Computational power dominates performance
• Inter-node communication and non-overlap methods lead to higher performance on some platforms

Summary and future work
• Over 180X speedup vs. serial implementation
  1. Phantom data set: 1600x2034x45
     • 1 minute using 64 processors on SGI Altix
  2. A large patient data set: 1040x2034x70
     • 1.5 minutes using 64 processors on SGI Altix
• Joint SPIE paper with T. Wu at MGH: “A parallel reconstruction method for digital tomosynthesis mammography,” 2004 SPIE Workshop on Medical Imaging
• Future work:
  • Real-time application: computer-guided needle biopsy
  • Goal: a delay of 5 to 10 seconds or less
  • Evaluation of computation reduction effects on image quality
  • Move code to a Grid environment (underway)
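The headline speedup can be checked against the figures quoted in the slides. This is a rough worked calculation that assumes the 3-hour serial run refers to the phantom data set (the slides do not say which input it used) and treats the 1-minute parallel time as a rounded value:

```latex
S = \frac{T_{\text{serial}}}{T_{\text{parallel}}}
  = \frac{180\ \text{min}}{1\ \text{min}}
  = 180,
\qquad
\frac{S}{64} \approx 2.8
```

The ratio per processor exceeds 1 because the serial baseline was measured on a 2.2 GHz Pentium 4 workstation while the parallel run used the SGI Altix, so the figure combines the gain from parallelization with the difference between platforms.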
