Improving the Performance of Parallel Backprojection on a Reconfigurable Supercomputer

Ben Cordes, Miriam Leeser, Eric Miller (Northeastern University, Boston, MA) {bcordes,mel,elmiller}@ece.neu.edu
Richard Linderman (Air Force Research Laboratory, Rome, NY) richard.linderman@rl.af.mil

Backprojection is an image reconstruction algorithm that is used in a number of applications, including synthetic aperture radar (SAR). Backprojection for SAR contains a high degree of parallelism, which makes it well-suited for implementation on reconfigurable devices. We have previously reported results of an implementation of backprojection for the AFRL Heterogeneous High Performance Cluster (HHPC), a supercomputer that combines traditional CPUs with reconfigurable computing resources. Using 32 hybrid (CPU+FPGA) nodes, we achieved a 26x performance gain over a single CPU-only node. In this work we achieve significant speedup by eliminating the performance bottlenecks that were previously experienced. A number of improvements to the system are projected to provide greater than 500x speedup over single-node software when included in a highly parallelized implementation.

[Block diagrams, old and new FPGA designs: input BlockRAMs, address generators, swath LUTs, staging BlockRAMs, PCI bus, target memories, target output FIFO, Gigabit Ethernet, Myrinet]
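To make the parallelism concrete, here is a minimal software sketch of the backprojection loop described above: each output pixel independently accumulates one sample from every projection, selected by the pixel's range to the sensor. This is an illustrative, single-threaded C sketch under simplifying assumptions (real-valued data, flat geometry, nearest-neighbor range lookup, and assumed sizes other than the 1024 projections quoted on this poster); it is not the FPGA design presented here.

```c
#include <math.h>

#define NUM_PROJ  1024   /* projections per image, as quoted on this poster */
#define NUM_RANGE 512    /* range bins per projection (assumed value)       */
#define IMG_SIZE  512    /* output image is IMG_SIZE x IMG_SIZE (assumed)   */

/* Accumulate every projection into every pixel of the output image.
 * proj[p][r]     : sample r of projection p
 * sensor_x/y[p]  : sensor position when projection p was collected
 * range_res      : meters per range bin
 * Every pixel is computed independently, which is the parallelism the
 * FPGA design exploits by processing several projections per iteration. */
void backproject(const float proj[NUM_PROJ][NUM_RANGE],
                 float image[IMG_SIZE][IMG_SIZE],
                 const float sensor_x[NUM_PROJ],
                 const float sensor_y[NUM_PROJ],
                 float range_res)
{
    for (int y = 0; y < IMG_SIZE; y++) {
        for (int x = 0; x < IMG_SIZE; x++) {
            float acc = 0.0f;
            for (int p = 0; p < NUM_PROJ; p++) {
                float dx = (float)x - sensor_x[p];
                float dy = (float)y - sensor_y[p];
                float range = sqrtf(dx * dx + dy * dy);
                int bin = (int)(range / range_res);   /* nearest range bin */
                if (bin >= 0 && bin < NUM_RANGE)
                    acc += proj[p][bin];
            }
            image[y][x] += acc;
        }
    }
}
```

A production SAR backprojector works on complex samples and applies a per-contribution phase correction; those details are omitted here because the only point is that each pixel, and each projection's contribution to it, can be processed independently.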
HHPC Cluster Architecture
• 48-node Beowulf cluster
• Dual 2.2 GHz Xeon CPUs per node
• Linux operating system
• Annapolis Microsystems Wildstar II/PCI FPGA board
• Gigabit Ethernet interface board
• Myrinet (MPI) interface board

Key Improvements to Single-Node Performance

Old Design
• PIO transfer mode
• Requires two-step staged transfers because of the limited address space
• Processing controlled by the host PC
• PIO data transfers require blocking API calls
• Host issues "run" commands to the FPGA at each processing step
• FPGA processes four projections per iteration
• Total processing requires 1024 projections; at 4 projections per iteration, 1024/4 = 256 iterations are required

New Design
• DMA transfer mode
• Allows direct memory loading and more efficient utilization of the PCI bus
• Shifts the performance bottleneck onto data processing
• Improves transfer speeds 15x host-to-FPGA and 45x FPGA-to-host
• Processing controlled by the FPGA
• DMA data transfers are mastered by the FPGA; the host PC is freed to perform other work during processing
• 30% speedup in processing time
• FPGA processes eight projections per iteration
• Single-step processing time increases only slightly even though twice as much data is processed; overall data transfer time is unaffected
• 32% speedup in overall processing time

[Data-transfer diagrams: in the old design each data block is staged from the host through staging RAMs (a/b) and then copied into the input RAMs; in the new design blocks move directly from the host into the FPGA input BlockRAMs.]

[Timing diagrams: old design, host-driven "transfer data / run iteration" steps of 4 projections each, repeated 1024/4 = 256 times, single-node performance 2.1x over software; new design, host prepares the data and issues a single "go", after which the FPGA transfers and processes 8 projections per iteration, repeated 1024/8 = 128 times, single-node performance 5.3x over software.]
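The control-flow change summarized above can be illustrated as host-side pseudocode. The sketch below uses hypothetical driver calls (pio_write_block, pio_copy_block, dma_map_buffer, fpga_start, and so on) and assumed buffer sizes; it is not the Annapolis Wildstar API, and the staging details are simplified. The old design keeps the host in the per-iteration loop with blocking PIO transfers; the new design maps buffers for DMA, issues a single "go", and lets the FPGA master its own transfers.

```c
/* Hypothetical driver interface, declarations only (illustration, not the
 * real Annapolis Wildstar II API). */
#define PROJ_LEN 4096                     /* floats per projection (assumed) */
#define IMG_LEN  (512 * 512)              /* floats in the output image      */
enum region { STAGE_RAM, INPUT_RAM, TARGET_MEM };
void pio_write_block(enum region dst, const float *src, int nfloats);
void pio_copy_block(enum region src, enum region dst, int nfloats);
void pio_read_block(enum region src, float *dst, int nfloats);
void dma_map_buffer(const void *buf, int nfloats);
void fpga_run_iteration(void);            /* blocks until the step finishes  */
void fpga_start(void);
void fpga_wait_done(void);

static float image_buffer[IMG_LEN];

/* Old design: host-driven PIO. Each iteration the host pushes 4 projections
 * through a two-step staged transfer (host -> staging RAM -> input RAM),
 * then issues a "run" command and waits.                                   */
void run_old_design(const float *projections)
{
    for (int i = 0; i < 1024 / 4; i++) {                  /* 256 iterations  */
        pio_write_block(STAGE_RAM, projections + i * 4 * PROJ_LEN,
                        4 * PROJ_LEN);                    /* blocking PIO    */
        pio_copy_block(STAGE_RAM, INPUT_RAM, 4 * PROJ_LEN);
        fpga_run_iteration();                             /* host blocks     */
    }
    pio_read_block(TARGET_MEM, image_buffer, IMG_LEN);    /* collect result  */
}

/* New design: FPGA-mastered DMA. The host maps its buffers, issues one "go",
 * and is free to do other work while the FPGA fetches 8 projections per
 * iteration and writes the finished image back.                            */
void run_new_design(const float *projections)
{
    dma_map_buffer(projections, 1024 * PROJ_LEN);         /* input data      */
    dma_map_buffer(image_buffer, IMG_LEN);                /* output image    */
    fpga_start();                                         /* single "go"     */
    /* ... useful host work can happen here (1024/8 = 128 FPGA iterations) ... */
    fpga_wait_done();
}
```

Per the comparison above, the DMA path improves transfer speeds roughly 15x host-to-FPGA and 45x FPGA-to-host, and moving iteration control onto the FPGA contributes about a 30% processing-time speedup, for an overall single-node gain of 5.3x over software versus 2.1x for the old design.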

Parallel Implementation Improvements
• Removed the dependency on the clustered filesystem
• The previous system read input data from disk, a significant source of non-deterministic runtime
• Impact: as much as 10x overall performance improvement
• Improved data distribution model (Swathbuckler project)
• Previous work distributed projection data from a single node; the new model allows incoming data to be streamed to each node
• Non-recurring setup time can be amortized across multiple runs
• Output images are not collected; they remain on the individual nodes, and a publish-subscribe system provides consumers with the images they need
• Impact: approximately 2x improvement
• Improved single-node implementation (see "Key Improvements to Single-Node Performance" above)
• Impact: approximately 2.5x improvement

Projected Parallel Performance
Combining these improvements in a highly parallelized implementation is projected to provide greater than 500x speedup over single-node software.

References
• A. Conti, B. Cordes, M. Leeser, E. Miller, and R. Linderman, "Adapting Parallel Backprojection to an FPGA Enhanced Distributed Computing Environment," Ninth Annual Workshop on High-Performance Embedded Computing (HPEC), September 2005.
• S. Tucker, R. Vienneau, J. Corner, and R. Linderman, "Swathbuckler: HPC Processing and Information Exploitation," Proceedings of the 2006 IEEE Radar Conference, April 2006.