Parallel Beam Back Projection: Implementation

Parallel Beam Back Projection: Implementation
Srdjan Coric, Miriam Leeser, Eric Miller

Outline
• Annapolis Wildstar
• “Simple Architecture”
  • algorithm
  • datapath
• Performance
• Results
• Parallelism extraction
• “Advanced Architecture 4x”
  • datapath
• Performance
• Results
• Implementation issues
• Future directions

Data Flow
• Sinogram data address generation
• Sinogram data retrieval
• Sinogram data prefetch
• Linear interpolation
• Data accumulation
• Data read
• Data write
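The stages listed above map onto a pixel-driven back projection loop. The following is a minimal floating-point sketch in Python, not the authors' hardware design; the function name `back_project`, the centring convention, and the skipping of out-of-range samples are assumptions made for illustration.

```python
import math

def back_project(sinogram, angles, size):
    """Pixel-driven parallel-beam back projection with linear interpolation.

    sinogram: list of projections, each a list of detector samples
    angles:   projection angles in radians, one per projection
    size:     width and height of the square output image
    """
    n_samples = len(sinogram[0])
    centre_img = (size - 1) / 2.0       # assumed image centre
    centre_det = (n_samples - 1) / 2.0  # assumed detector centre
    image = [[0.0] * size for _ in range(size)]

    for proj, theta in zip(sinogram, angles):
        c, s = math.cos(theta), math.sin(theta)
        for row in range(size):
            y = row - centre_img
            for col in range(size):
                x = col - centre_img
                # Sinogram data address generation
                t = x * c + y * s + centre_det
                i = math.floor(t)
                if i < 0 or i + 1 >= n_samples:
                    continue  # ray falls outside the detector
                frac = t - i
                # Sinogram data retrieval: two neighbouring samples
                a, b = proj[i], proj[i + 1]
                # Linear interpolation
                value = a + frac * (b - a)
                # Data accumulation: read, add, write
                image[row][col] += value
    return image
```

For example, `back_project([[float(i) for i in range(16)]], [0.0], 4)` back-projects a single ramp projection at angle 0, so each image row simply samples the ramp.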

Quantization error sources (figure): LUT1 starting position; corner starting position; LUT1, LUT2 and LUT3 quantization errors; bit reduction error; interpolation factor error; critical error-accumulation path.
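The interpolation factor error named above can be illustrated with a small fixed-point sketch. It assumes the 3-bit interpolation factor and 9-bit sinogram data listed with the performance parameters in this deck; truncation as the rounding mode and all function names are assumptions, and the code is not taken from the hardware design.

```python
FRAC_BITS = 3           # 3-bit interpolation factor (from the deck's parameters)
SCALE = 1 << FRAC_BITS  # 8 quantization levels

def quantize_factor(frac):
    """Truncate a fraction in [0, 1) to FRAC_BITS bits (truncation is assumed)."""
    return int(frac * SCALE)  # integer in 0..SCALE-1

def interp_fixed(a, b, frac):
    """Integer linear interpolation between samples a and b.

    The result is scaled by SCALE; divide by SCALE to compare with floats.
    """
    f = quantize_factor(frac)
    return a * (SCALE - f) + b * f

def interp_float(a, b, frac):
    """Floating-point reference interpolation."""
    return a + frac * (b - a)

def worst_case_error(a, b, steps=1000):
    """Largest absolute difference between fixed and float results."""
    return max(
        abs(interp_fixed(a, b, k / steps) / SCALE - interp_float(a, b, k / steps))
        for k in range(steps)
    )
```

With truncation, the error is bounded by the sample difference divided by 8, so for the extreme 9-bit pair (0, 511) `worst_case_error(0, 511)` stays below 511 / 8.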

“Simple Architecture” Datapath

Performance Results: Software vs. FPGA Hardware
• Software - Floating point - 450 MHz Pentium: ~240 s
• Software - Floating point - 1 GHz Dual Pentium: ~94 s
• Software - Fixed point - 450 MHz Pentium: ~50 s
• Software - Fixed point - 1 GHz Dual Pentium: ~28 s
• Hardware - 50 MHz: ~5.4 s
Parameters:
• 1024 projections
• 1024 samples per projection
• 512 × 512 pixel image
• 9-bit sinogram data
• 3-bit interpolation factor

Original image vs. hardware output image. Zoom: ~200%. Grayscale range < pixel value range (heart features in focus).

Original image vs. hardware output image. Zoom: ~200%. Grayscale range < pixel value range (lung features in focus).

Difference image: original image minus hardware output image

Parallelism Issues
• Case 1: No parallelism extracted
• Case 2: Pixel level parallelism extracted
• Case 3: Projection level parallelism extracted
Memory bandwidth requirements at 50 MHz (for data accumulation)
• Case 1: 0.4 GB/s
• Case 2: 1.6 GB/s
• Case 3: 0.4 GB/s
• Memory bandwidth limit: 1.2 GB/s
Figure: processing volumes V1, V2 and V3 drawn over the axes projections, image columns and image rows.
T ~ k1*V1, T ~ k1*V2, T ~ k2*V3, where k1 < k2, V2 = V3 = V1/4, and T = execution time.
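The bandwidth figures above follow from simple arithmetic: 0.4 GB/s at 50 MHz is 8 bytes per clock cycle. The sketch below reproduces the three cases under the assumption, not stated in the slides, that each accumulation is one 4-byte read plus one 4-byte write per updated pixel. The factor of 4 is inferred from V2 = V3 = V1/4; all names are illustrative.

```python
CLOCK_HZ = 50e6        # 50 MHz design clock (from the slides)
BYTES_PER_ACCESS = 4   # assumed 32-bit accumulator word
LIMIT_BYTES_PER_S = 1.2e9  # memory bandwidth limit (from the slides)

def accumulation_bandwidth(pixels_per_cycle):
    """Bytes per second for one read and one write per updated pixel per cycle."""
    return pixels_per_cycle * 2 * BYTES_PER_ACCESS * CLOCK_HZ

# Number of distinct pixels updated per clock cycle in each case.
cases = {
    "Case 1: no parallelism": 1,
    "Case 2: pixel level (4 pixels per cycle)": 4,
    # Four projections contribute to the same pixel, so only one
    # accumulator word is read and written per cycle.
    "Case 3: projection level (4 projections, 1 pixel)": 1,
}

def report():
    """Return (name, GB/s, fits_within_limit) for each case."""
    rows = []
    for name, pixels in cases.items():
        bw = accumulation_bandwidth(pixels)
        rows.append((name, bw / 1e9, bw <= LIMIT_BYTES_PER_S))
    return rows

if __name__ == "__main__":
    for name, gb_per_s, fits in report():
        print(f"{name}: {gb_per_s:.1f} GB/s, within limit: {fits}")
```

Under these assumptions only the pixel-level case exceeds the 1.2 GB/s limit, which is consistent with projection-level parallelism being the one extracted in the Advanced Architecture.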

Advanced Architecture - Data Path: the Simple Architecture with projection parallelism extracted

Performance Results: Software vs. FPGA Hardware
• Software - Floating point - 450 MHz Pentium: ~240 s
• Software - Floating point - 1 GHz Dual Pentium: ~94 s
• Software - Fixed point - 450 MHz Pentium: ~50 s
• Software - Fixed point - 1 GHz Dual Pentium: ~28 s
• Hardware - 50 MHz: ~5.4 s
• Hardware (Advanced Architecture) - 50 MHz: ~1.3 s
Parameters:
• 1024 projections
• 1024 samples per projection
• 512 × 512 pixel image
• 9-bit sinogram data
• 3-bit interpolation factor
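The speedups implied by these timings can be computed directly. The sketch below uses only the approximate run times quoted above, so the ratios are approximate as well; the dictionary keys and the `speedup` helper are illustrative names.

```python
# Approximate run times in seconds, copied from the list above.
RUN_TIMES_S = {
    "sw_float_450MHz": 240.0,
    "sw_float_1GHz_dual": 94.0,
    "sw_fixed_450MHz": 50.0,
    "sw_fixed_1GHz_dual": 28.0,
    "hw_simple_50MHz": 5.4,
    "hw_advanced_50MHz": 1.3,
}

def speedup(baseline, target):
    """How many times faster `target` runs than `baseline`."""
    return RUN_TIMES_S[baseline] / RUN_TIMES_S[target]

if __name__ == "__main__":
    for name in RUN_TIMES_S:
        if name != "hw_advanced_50MHz":
            print(f"{name} -> hw_advanced_50MHz: {speedup(name, 'hw_advanced_50MHz'):.1f}x")
```

The ratio between the two hardware designs, `speedup("hw_simple_50MHz", "hw_advanced_50MHz")`, comes out at about 4.2, close to the factor of 4 expected from 4-way projection parallelism.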

Implementation Issues - fanout - prj_num(3) fanout = 1565 ! routing delay = 7.913 ns (~39.99%)

Implementation Issues - fanout - odd_2_A_4[4] fanout = 144 !

Memory Bridges Stuff
3 architectures implemented:
• A: “Simple Architecture” = non-parallel (on slide 6)
• B: “Advanced Architecture” = 4-way parallel (slide 12)
• C: “Bridge Free Advanced Arch” = same as B, but with all design buffers in BlockRAMs and without the memory bridges from the PCI bus to the memory banks that are required for host-memory communication. The bridges are a separate design that is downloaded before (after) design C is downloaded, so that input data can be stored to (output data read from) the memories on the WildStar board.
Virtex1000 resource utilization:
• A: 11% logic, 90% BlockRAMs (with bridges)
• B: 39% logic, 100% BlockRAMs
• C: 21% logic, 100% BlockRAMs

Floorplan of the “Bridge Free Advanced Architecture” (design C on the previous slide)

Future Directions • Graduate
