Improving the Performance of Parallel Backprojection on a Reconfigurable Supercomputer

Ben Cordes, Miriam Leeser, Eric Miller
Northeastern University, Boston MA
{bcordes,mel,elmiller}@ece.neu.edu

Richard Linderman
Air Force Research Laboratory, Rome NY
richard.linderman@rl.af.mil
Abstract

Backprojection is an image reconstruction algorithm used in a number of applications, including synthetic aperture radar (SAR). Backprojection for SAR contains a high degree of parallelism, which makes it well suited to implementation on reconfigurable devices. We have previously reported results of an implementation of backprojection for the AFRL Heterogeneous High Performance Cluster (HHPC), a supercomputer that combines traditional CPUs with reconfigurable computing resources. Using 32 hybrid (CPU+FPGA) nodes, we achieved a 26x performance gain over a single CPU-only node. In this work we achieve significant speedup by eliminating the performance bottlenecks that were previously experienced. A number of improvements to the system are projected to provide greater than 500x speedup over single-node software when included in a highly parallelized implementation.

[Figure: Old and new block diagrams of the FPGA design. Labeled components: input BlockRAMs, address generator, swath LUT, target memories, target output FIFO, and the PCI bus; staging BlockRAMs appear in the old design only. Node interfaces shown: Gigabit Ethernet and Myrinet.]

HHPC Cluster Architecture
• 48-node Beowulf cluster
• Dual 2.2 GHz Xeon CPUs
• Linux operating system
• Annapolis Microsystems Wildstar II/PCI FPGA board
• Gigabit Ethernet interface board
• Myrinet (MPI) interface board

Key Improvements to Single-Node Performance

Old Design
• PIO transfer mode: requires two-step staged transfers because of the limited address space
• Processing controlled by the host PC: PIO data transfers require blocking API calls, and the host issues "run" commands to the FPGA at each processing step
• FPGA processes four projections per iteration: the complete image requires 1024 projections, so 1024/4 = 256 iterations are needed

New Design
• DMA transfer mode: allows direct memory loading, makes more efficient use of the PCI bus, shifts the performance bottleneck onto data processing, and improves transfer speeds 15x host-to-FPGA and 45x FPGA-to-host
• Processing controlled by the FPGA: DMA transfers are mastered by the FPGA, freeing the host PC to perform other work during processing; 30% speedup in processing time
• FPGA processes eight projections per iteration: single-step processing time increases only slightly even though twice as much data is processed, and overall data transfer time is unaffected; 32% speedup in overall processing time

[Figure: Data transfer timelines. Old design: each data block is moved in two steps, host to staging RAM (alternating banks a and b) and then staging RAM to input RAM. New design: data blocks are loaded directly from the host into the input RAMs by DMA.]

Parallel Implementation Improvements
• Removed the dependency on the clustered filesystem: the previous system read input data from disk, a significant source of non-deterministic runtime. Impact: as much as 10x overall performance improvement.
• Improved data distribution model (Swathbuckler project): previous work distributed projection data from a single node, while the new model streams incoming data to each node, allowing the non-recurring setup time to be amortized across multiple runs. Output images are not collected; they remain on the individual nodes, and a publish-subscribe system provides consumers with the images they need (see the per-node sketch below). Impact: approximately 2x improvement.
• Improved single-node implementation (detailed above). Impact: approximately 2.5x improvement.
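To make the parallelism described above concrete, the following is a minimal CPU-side sketch of the SAR backprojection accumulation. The array names, sizes, and nearest-neighbour range-to-bin mapping are simplified assumptions for illustration (a full SAR kernel adds interpolation and phase correction); this is not the HHPC datapath itself.

/*
 * Minimal nearest-neighbour backprojection sketch (software reference).
 * Array names, sizes, and the range-to-bin mapping are illustrative
 * assumptions, not the HHPC/Wildstar II datapath.
 */
#include <math.h>

#define NUM_PROJ 1024   /* projections per image, as in the design above */
#define NUM_BINS 2048   /* samples per range-compressed projection       */
#define IMG_DIM   512   /* output image is IMG_DIM x IMG_DIM pixels      */

void backproject(const float proj[NUM_PROJ][NUM_BINS],  /* range-compressed data */
                 const float ant_x[NUM_PROJ],            /* antenna x per pulse   */
                 const float ant_y[NUM_PROJ],            /* antenna y per pulse   */
                 float bin_scale,                         /* range -> bin index    */
                 float image[IMG_DIM][IMG_DIM])           /* accumulated output    */
{
    /* Each projection contributes independently to every pixel, so the outer
     * loop can be unrolled across projections (the FPGA handles 4 or 8 at a
     * time) and the pixel loops split across target memories or nodes. */
    for (int p = 0; p < NUM_PROJ; p++) {
        for (int y = 0; y < IMG_DIM; y++) {
            for (int x = 0; x < IMG_DIM; x++) {
                float dx = (float)x - ant_x[p];
                float dy = (float)y - ant_y[p];
                float range = sqrtf(dx * dx + dy * dy);
                int bin = (int)(range * bin_scale);      /* nearest-neighbour lookup */
                if (bin >= 0 && bin < NUM_BINS)
                    image[y][x] += proj[p][bin];          /* accumulate contribution  */
            }
        }
    }
}

Because every (projection, pixel) pair is independent, the same loop nest maps naturally onto several projections per FPGA iteration and onto separate image swaths per cluster node.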
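The per-node driver below is a hedged sketch of how the streaming (Swathbuckler-style) distribution model and the FPGA-driven iteration loop fit together. MPI is used only to identify the node; receive_projection_block(), fpga_backproject(), and publish_image() are hypothetical placeholders for the streamed input, the FPGA kernel, and the publish-subscribe output, not actual HHPC or Annapolis APIs.

/*
 * Per-node driver sketch for the streaming data distribution model.
 * Each node consumes its own projection stream, backprojects locally,
 * and keeps the finished image in place for publish-subscribe consumers.
 * The three helper functions are hypothetical placeholders.
 */
#include <mpi.h>

#define NUM_PROJ      1024  /* projections per image                   */
#define PROJ_PER_ITER    8  /* new design: 8 projections per iteration */
#define NUM_BINS      2048  /* samples per projection (illustrative)   */
#define IMG_DIM        512  /* output image dimension (illustrative)   */

/* Hypothetical placeholders for streamed input, FPGA kernel, and publishing. */
void receive_projection_block(int node, int iter,
                              float block[PROJ_PER_ITER][NUM_BINS]);
void fpga_backproject(float block[PROJ_PER_ITER][NUM_BINS],
                      float image[IMG_DIM][IMG_DIM]);
void publish_image(int node, float image[IMG_DIM][IMG_DIM]);

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);     /* which cluster node am I? */

    static float image[IMG_DIM][IMG_DIM];     /* this node's output image (zero-initialised) */

    /* 1024 / 8 = 128 iterations; each one moves a block to the FPGA and runs it.
     * In the real system the FPGA masters its own DMA transfers, so the host
     * can overlap other work with each iteration. */
    for (int iter = 0; iter < NUM_PROJ / PROJ_PER_ITER; iter++) {
        float block[PROJ_PER_ITER][NUM_BINS];
        receive_projection_block(rank, iter, block);  /* data streamed to this node  */
        fpga_backproject(block, image);               /* accumulate into local image */
    }

    /* Output images are not gathered to a root node: each node publishes its
     * own image so consumers can subscribe to exactly the images they need. */
    publish_image(rank, image);

    MPI_Finalize();
    return 0;
}

This mirrors the improvements listed above: streamed per-node input, an FPGA-driven iteration loop, and locally published output images in place of a central gather.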
[Figure: Processing timelines. Old design: the host alternates blocking "transfer data" and "run iteration" steps, moving 4 projections at a time and repeating 1024/4 = 256 times; single-node performance is 2.1x over software. New design: the host prepares the data and issues a single "go", after which the FPGA masters its own transfers, moving 8 projections per iteration and repeating 1024/8 = 128 times; single-node performance is 5.3x over software.]

Projected Parallel Performance

[Figure: Projected performance of the fully parallelized implementation (see abstract: greater than 500x speedup over single-node software).]

References
• A. Conti, B. Cordes, M. Leeser, E. Miller, and R. Linderman, "Adapting Parallel Backprojection to an FPGA Enhanced Distributed Computing Environment," Ninth Annual Workshop on High-Performance Embedded Computing (HPEC), September 2005.
• S. Tucker, R. Vienneau, J. Corner, and R. Linderman, "Swathbuckler: HPC Processing and Information Exploitation," Proceedings of the 2006 IEEE Radar Conference, April 2006.