Power-Efficient Medical Image Processing using PUMA
Ganesh Dasika, Kevin Fan¹, Scott Mahlke
University of Michigan, Advanced Computer Architecture Laboratory
¹ Parakinetics, Inc.
The Advent of the GPGPU
• Increasingly popular substrate for HPC
  • Astrophysics
  • Weather prediction
  • EDA
  • Financial instrument pricing
  • Medical imaging
Advantages of GPGPUs
• High degree of parallelism
  • Data-level
  • Thread-level
• High bandwidth
• Commodity products
• Increasingly programmable
Disadvantages of GPGPUs
• Gap between computation and bandwidth
  • 933 GFLOPS vs. 142 GB/s of bandwidth (0.15 B of data per FLOP, ~26:1 compute:mem ratio)
• Very high power consumption
  • Graphics-specific hardware
  • Multiple thread contexts
  • Large register files and memories
  • Fully general datapath
Inefficiencies in all general-purpose architectures
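The two ratios quoted for the compute/bandwidth gap can be reproduced with a few lines of arithmetic. This is a minimal sketch; the function names are illustrative, and it assumes the "26:1" figure counts FLOPs per 32-bit (4-byte) word fetched, which is what makes the numbers line up.

```python
PEAK_GFLOPS = 933.0   # peak throughput quoted on the slide, GFLOP/s
MEM_BW_GBPS = 142.0   # off-chip bandwidth quoted on the slide, GB/s
WORD_BYTES = 4        # assumption: operands are 32-bit floats


def bytes_per_flop(gflops, gbps):
    """Bytes of off-chip data available for each floating-point operation."""
    return gbps / gflops


def flops_per_word(gflops, gbps, word_bytes=WORD_BYTES):
    """FLOPs the chip can issue in the time it takes to fetch one word."""
    words_per_second = gbps / word_bytes   # in units of 1e9 words/s
    return gflops / words_per_second


print(round(bytes_per_flop(PEAK_GFLOPS, MEM_BW_GBPS), 2))   # 0.15
print(round(flops_per_word(PEAK_GFLOPS, MEM_BW_GBPS), 1))   # 26.3
```

A kernel that performs fewer than roughly 26 operations per word it loads cannot keep such a chip's arithmetic units busy, which is the mismatch the slide points at.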
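Programmability vs. Efficiency?
[Figure: design space plotted as flexibility against efficiency. Labels: general-purpose processors, DSPs, FPGAs, domain-specific accelerators and GPGPUs, loop accelerators and ASICs. Annotation: "Highly efficient, some programmability". An open region is marked "???".]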
Medical Image Reconstruction
• Compute-intensive loops
• 32-bit floating-point code
• High data/bandwidth requirements
• Increased demand for portability, low power
• Much current research focuses on using GPGPUs for this domain
CT Image Reconstruction
• X-ray emitters and receptors on opposite sides of the patient
• Received X-ray intensity corresponds to tissue density
• Multiple scans ("slices") taken around the patient are put together to reconstruct one 2D image
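Projection & Sinogram
• Projection: all ray-sums in a direction
• Sinogram: all projections
[Figure: X-rays passing through an object f(x,y) in the x-y plane produce a projection P(t) along the detector axis t; the projections together form the sinogram.]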
Example: Backprojection
[Figure: a sinogram and the backprojected image obtained from it.]
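To make projection and backprojection concrete, here is a minimal sketch on a 3×3 image using only two views (rays along rows and rays along columns). The phantom and the function names are illustrative assumptions, not data from the slides; a real scanner uses hundreds of angles.

```python
def project(image):
    """Ray-sums for two views: along each row and along each column."""
    row_sums = [sum(row) for row in image]
    col_sums = [sum(col) for col in zip(*image)]
    return row_sums, col_sums


def backproject(row_sums, col_sums):
    """Smear every ray-sum back over all pixels that its ray crossed."""
    return [[r + c for c in col_sums] for r in row_sums]


# Hypothetical phantom: a single dense pixel in the centre.
phantom = [[0, 0, 0],
           [0, 1, 0],
           [0, 0, 0]]

rows, cols = project(phantom)
recon = backproject(rows, cols)
for line in recon:
    print(line)
# [0, 1, 0]
# [1, 2, 1]
# [0, 1, 0]
```

The dense pixel is recovered as the brightest value, but it is smeared along every ray that passed through it. Filtering the sinogram before backprojecting is what suppresses that blur.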
Example: Filtered Backprojection
[Figure: a filtered sinogram and the reconstructed image obtained from it.]
Reconstruction: Solve for m's
[Figure: an X-ray emitter, a small grid of densities labeled "Human Body", and the detector values. Numbers shown: 22, 12, 10, 15, 16, 22, 11, 10.]
Real Reconstruction Problem
• Intensity measured
• Rays transmitted through multiple "pixels"
• Find individual "pixel" values from transmission data
[Figure: a grid of unknown pixel values ("?"), 512 values on each side, crossed by 100's of diagonals at 100's of angles. Sample detector readings: 712, 199, 255, 534, 417, 364, 555, 501, 355.]
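One standard way to find pixel values from ray-sums is the algebraic reconstruction technique (the Kaczmarz method), which alternates a forward projection with a backward correction for each ray. The sketch below is an illustration of that general technique, not necessarily the algorithm used in the benchmarks; the 2×2 densities and all names are hypothetical.

```python
def kaczmarz(rays, measurements, n_pixels, sweeps=100):
    """Estimate pixel values from ray-sums.

    rays[k] lists the pixel indices crossed by ray k (unit weights);
    measurements[k] is the measured sum along that ray.
    """
    x = [0.0] * n_pixels
    for _ in range(sweeps):
        for ray, measured in zip(rays, measurements):
            # Forward projection: ray-sum predicted by the current estimate.
            predicted = sum(x[i] for i in ray)
            # Backward projection: spread the residual evenly over the ray.
            correction = (measured - predicted) / len(ray)
            for i in ray:
                x[i] += correction
    return x


# Hypothetical 2x2 "body", flattened row by row: [[1, 2], [3, 4]].
truth = [1.0, 2.0, 3.0, 4.0]
rays = [[0, 1], [2, 3],   # two horizontal rays
        [0, 2], [1, 3],   # two vertical rays
        [0, 3], [1, 2]]   # two diagonal rays
measurements = [sum(truth[i] for i in ray) for ray in rays]

estimate = kaczmarz(rays, measurements, n_pixels=4)
print([round(v, 3) for v in estimate])
```

Each sweep touches every ray once, so the cost grows with the number of rays times the pixels per ray. At 512 values per side and hundreds of angles, that is the compute-intensive, bandwidth-hungry loop nest the talk is concerned with.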
Medical Imaging Applications
• Image reconstruction for MRI/CT/PET scans
• Large amounts of vector/thread-level parallelism
• FP-intensive kernels
  • Often requiring math library functions
• Data-intensive (~5:1 compute:mem ratio)
Current Concerns: Portability/Power
• Currently, most scans require moving the patient to an imaging room
  • Consumes time
  • Stress on patient
• Studies show benefits of portable, bedside scanners:
  • 86% increase in patients suitable for post-stroke thrombolytic therapy [Weinreb et al., RSNA]
  • 80-100% drop in scan-related complications [Gunnarsson et al., J. of Neurosurgery]
• New X-Ray emitters push for mAs of current use
Current Concerns: Performance
• High-accuracy CT algorithms take too long
  • Iterative forward/backward projection
  • ~Hours on modern CT scanners instead of minutes
• Interventional radiology
  • Scans currently take minutes, but should take seconds
• CT fluoroscopy
  • Several scans done in succession
Flexibility
• Software algorithms change over time
• NRE
• Time-to-market
PUMA
• Tiled architecture
• Bandwidth-matched for improved efficiency
• Each tile is a "Programmable Loop Accelerator"
[Figure: an array of tiles attached through an external interface to disk, memory, and CPU.]
Programmable Loop Accelerator
• Generalize accelerator without losing efficiency
[Figure: design space plotted as flexibility against efficiency and performance. Labels: general-purpose processors, DSPs, FPGAs, domain-specific accelerators and GPGPUs, "???", programmable loop accelerators, loop accelerators and ASICs.]
Designing Loop Accelerators
[Figure: C code for a loop alongside the hardware generated for it. Hardware labels: function units (<<, *, +, &, MEM, BR), CRF, local memories, point-to-point connections.]
Loop Accelerator Architecture
• Hardware realization of modulo-scheduled loop
• Parameterized hardware:
  • FUs
  • Shift register files
  • Static control
  • Point-to-point interconnect
[Figure: loop accelerator datapath. Labels: function units (+, &, MEM, BR), CRF, local memory, FSM driving control signals, point-to-point connections.]
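In a modulo-scheduled loop a new iteration starts every II cycles (the initiation interval), so each function unit can serve at most II operations per iteration. The sketch below shows the standard resource bound in both directions: sizing the FU set for a target II, and computing the best II another loop could reach on that fixed FU set. The operation counts and names are hypothetical, and recurrence constraints are ignored.

```python
import math


def fus_needed(op_counts, ii):
    """FUs of each kind so that every operation gets a slot within II cycles."""
    return {kind: math.ceil(n / ii) for kind, n in op_counts.items()}


def res_mii(op_counts, fu_counts):
    """Resource-constrained lower bound on II for a loop on a fixed FU set.

    Returns None when the loop needs an FU kind the hardware lacks.
    """
    bound = 1
    for kind, n in op_counts.items():
        if n == 0:
            continue
        available = fu_counts.get(kind, 0)
        if available == 0:
            return None   # the loop cannot be scheduled on this hardware
        bound = max(bound, math.ceil(n / available))
    return bound


# Hypothetical loop the accelerator is designed around, at a target II of 2.
design_loop = {"add": 6, "mul": 4, "mem": 3}
fus = fus_needed(design_loop, ii=2)
print(fus)                                               # {'add': 3, 'mul': 2, 'mem': 2}

# A different hypothetical loop mapped onto the same hardware.
print(res_mii({"add": 9, "mul": 2, "mem": 4}, fus))      # 3
print(res_mii({"div": 1}, fus))                          # None
```

A hardwired loop accelerator only ever runs the first case. Once the hardware is made programmable, the second case decides whether another loop keeps its II, runs at a larger II, or cannot be scheduled at all.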
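Programmable Loop-Accelerator Architecture
Loop accelerator (LA) vs. programmable loop accelerator (PLA):
• Functionality: LA has a custom FU set; PLA has generalized FUs + MOVs
• Storage: LA has shift register files (SRF) of limited size with no addressing; PLA has rotating register files (RR)
• Connectivity: LA is point-to-point; PLA adds a ring + port-swapping
• Control: LA has hardwired control; PLA has a literal register file + control memory
[Figure: PLA datapath. Labels: generalized function units (+/-, &/|, MEM, BR), CRF, literals, control memory, FSM driving control signals, local memory, point-to-point connections, ring.]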
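A rotating register file gives software-pipelined loops addressable storage in which each iteration's values shift by one register name per iteration without any data being copied. The sketch below models the general technique with a hypothetical class; it is not a description of PUMA's actual register file design.

```python
class RotatingRegisterFile:
    """Register file whose logical-to-physical mapping rotates each iteration."""

    def __init__(self, size):
        self.regs = [0] * size
        self.base = 0   # rotation offset, updated once per loop iteration

    def _physical(self, logical):
        return (self.base + logical) % len(self.regs)

    def read(self, logical):
        return self.regs[self._physical(logical)]

    def write(self, logical, value):
        self.regs[self._physical(logical)] = value

    def rotate(self):
        """Advance one iteration: the value named r[k] is now named r[k+1]."""
        self.base = (self.base - 1) % len(self.regs)


rrf = RotatingRegisterFile(4)
rrf.write(0, 10)     # iteration 0 produces a value in r0
rrf.rotate()
rrf.write(0, 20)     # iteration 1 writes r0 without clobbering the old value
rrf.rotate()
print(rrf.read(1), rrf.read(2))   # 20 10
```

Unlike a shift register file, any live value can be read by name at any time, which is what lets schedules for different loops share the same storage.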
MRI.FH PLA
• ~0.6 mm² per tile
• 38 FUs
• 128 32-bit registers
• Inter-FU BW 1 TB/sec
Performance on MRI.FH PLA
[Figure: chart of performance on the MRI.FH PLA. Categories shown: II preserved, II doubled, unschedulable.]
PUMA System Design
• 5 systems designed around 5 benchmarks
• Each composed of identical tiles
• Assume same B/W as GTX280 (142 GB/s)
• # Tiles based on B/W requirements of benchmark
[Figure: an array of tiles attached through an external interface to disk, memory, and CPU.]
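Sizing a system by bandwidth means adding tiles until their combined memory traffic reaches the off-chip budget. The sketch below illustrates that rule under stated assumptions: the 142 GB/s budget comes from the slide, while the clock rate, II, and bytes moved per iteration are hypothetical placeholders rather than values from the talk.

```python
SYSTEM_BW = 142_000_000_000   # bytes/s, the GTX280-equivalent budget from the slide


def tile_bandwidth(bytes_per_iteration, clock_hz, ii):
    """Off-chip bytes/s one tile consumes when it starts an iteration every II cycles."""
    return bytes_per_iteration * clock_hz // ii


def tiles_for_budget(system_bw, per_tile_bw):
    """Largest number of identical tiles the memory system can keep fed."""
    return system_bw // per_tile_bw


# Hypothetical tile: 16 bytes per loop iteration, 500 MHz clock, II = 4.
per_tile = tile_bandwidth(bytes_per_iteration=16, clock_hz=500_000_000, ii=4)
print(per_tile)                                # 2000000000
print(tiles_for_budget(SYSTEM_BW, per_tile))   # 71
```

A benchmark that moves more data per iteration gets fewer tiles under the same budget, which is why each of the five systems ends up with its own tile count.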
System Performance
[Figure: performance chart for the five systems. Power labels: 4 W, 3 W, 2.8 W, 2.3 W, 2.7 W.]
Performance vs. GPGPU
[Figure: performance comparison against GPGPUs. Annotations: 2X performance of GTS 250; 63% performance of GTX 295.]
Efficiency vs. GPGPU
[Figure: efficiency comparison against GPGPUs. Annotations: 54X, 22X.]
Conclusions
• Power-efficient accelerator for medical imaging
• ASIC-like efficiency with programmability
• 63-201% of GPU performance
• 22-54X GPU performance/power efficiency