Reconfigurable Computing: Current Status and Potential for Spacecraft Computing Systems
Rod Barto
NASA/GSFC Office of Logic Design
Spacecraft Digital Electronics
3312 Moonlight
El Paso, Texas 79904
127-MAPLD2005
Reconfigurable Computing is…
• A design methodology by which computational components can be arranged in several ways to perform various computing tasks
• Two types of reconfigurable computing:
  • Static, i.e., the computing system is configured before launch
  • Dynamic, i.e., the computing system can be reconfigured after launch
Static Reconfigurability
• Several examples exist, e.g., Cray
• Typically processing modules connected by an intercommunication mechanism, e.g., Ethernet
• Goals are
  • To reduce system development costs
  • To provide higher performance computing
Dynamic Reconfigurability (DR)
• Processing modules that can be reconfigured in flight
• Goal is to use reduced amounts of hardware to provide processing support for algorithms that do not map well onto general purpose computers
Outline of Paper
• Discuss the computation of a series of algorithms on general purpose, special purpose, and DR computers
• Calculate the execution time of an image processing algorithm on a concept DR computer
• Compare the reconfiguration time of a Xilinx FPGA with the algorithm execution time calculated in Section 2
• Obtain an extremely rough estimate of image processing algorithm execution time on a flight computer
• Conclude that the DR computer described offers higher performance than does the flight computer
Section 1: Algorithm Execution on General Purpose (GP), Special Purpose (SP), and DR Computers
Processing example
[Figure: Input -> f1 -> f2 -> … -> fn -> Output]
• A computing function is the composition of n algorithms executed serially
• Can be executed on a general purpose computer (GP) or a special purpose computer (SP)
Execution on a GP Computer
[Figure: Input -> f1 -> f2 -> … -> fn -> Output]
• Processing time of each stage = ti, i=1..n
• Total processing time = Latency time = $\sum_{i=1}^{n} t_i$
• GP computer must execute processing stages sequentially, and cannot exploit parallelism in overall computing function
Processing on an SP Processor
[Figure: Input -> f1 -> f2 -> … -> fn -> Output]
• Each stage is an independently operating processor designed specifically for the algorithm it executes
• Processing time of each stage = ti, i=1..n
• Results appear at rate of one per max(ti), i=1..n
• Latency time = [formula missing in source]
• Performance increase comes from two factors:
  • Pipelining of constituent algorithms exploiting parallelism
  • Processors being designed specifically for their algorithms
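The GP and SP cases can be compared numerically. The following is a minimal sketch, not taken from the paper: the function names and the stage times are hypothetical, and it models only what the slides state, namely that a GP computer runs the stages back to back while an SP pipeline delivers one result per max(ti).

```python
def gp_time_per_result(stage_times):
    """GP computer: stages run one after another, so their times add up."""
    return sum(stage_times)


def sp_result_interval(stage_times):
    """SP pipeline: in steady state the slowest stage sets the output rate."""
    return max(stage_times)


# Hypothetical stage times in milliseconds (illustrative values only).
stage_times = [4, 10, 6]

gp = gp_time_per_result(stage_times)   # 4 + 10 + 6 = 20 ms per result
sp = sp_result_interval(stage_times)   # one result every 10 ms
speedup = gp / sp                      # gain from pipelining alone
```

This counts only the pipelining gain. The second gain named on the slide, processors designed specifically for their algorithms, would appear as smaller ti values.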
Processing on a DR Computer
• Two processing elements alternately process and reconfigure, i.e., fodd executes one algorithm while feven reconfigures for the next algorithm, etc.
[Figure: Input, processing elements fodd and feven, Output]
DR Computer Processing Flow
[Figure: timeline. fodd: f1, R, f3, R, …, fn, R; feven: f2, R, f4, R; R = reconfiguration]
• Results appear at rate of one per [formula missing in source]
• Latency = [formula missing in source]
• Performance increase comes from configuring processors specifically for the algorithm they are executing
• Do not get increase from exploiting parallelism.
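The alternating flow can be traced with a small scheduling sketch. This is an illustration under assumptions that are not in the slides: both elements are already configured at time 0, a stage cannot start until the previous stage has finished, an element begins reconfiguring as soon as it finishes a stage, and every reconfiguration takes the same time R. The function name and the numbers are hypothetical.

```python
def dr_schedule(stage_times, reconfig_time):
    """Return (schedule, finish_time) for two alternating processing elements.

    schedule is a list of (stage_number, element_name, start, end).
    """
    names = ["fodd", "feven"]
    ready = [0, 0]      # time at which each element is configured for its next stage
    prev_end = 0        # time at which the previous stage delivered its result
    schedule = []
    for k, exec_time in enumerate(stage_times):
        elem = k % 2                            # f1, f3, ... on fodd; f2, f4, ... on feven
        start = max(prev_end, ready[elem])      # wait for the data and for the configuration
        end = start + exec_time
        schedule.append((k + 1, names[elem], start, end))
        ready[elem] = end + reconfig_time       # reconfigure for the stage after next
        prev_end = end
    return schedule, prev_end


# Hypothetical values: four stages of 5 time units each.
_, hidden = dr_schedule([5, 5, 5, 5], reconfig_time=3)    # R shorter than a stage
_, exposed = dr_schedule([5, 5, 5, 5], reconfig_time=8)   # R longer than a stage
```

In this model reconfiguration is fully hidden when R is no longer than the stage running on the other element, so `hidden` equals the sum of the stage times. When R is longer, the next stage has to wait and the total grows.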
Section 2: Execution Time of an Image Processing Algorithm on a Concept DR Computer
DR Computer Concept
[Figure: FPGA0 and FPGA1, each connected to RAM0 and RAM1]
• RAM0 is source for FPGA0, destination for FPGA1, etc.
• Processing elements are implemented in FPGAs
• FPGA0 and FPGA1 alternately process and reconfigure, as previously discussed.
• Input and output not shown
Algorithm Example: 3x3 Image Convolution
• Rows are shifted in one at a time, pixel-serial, then parallel-shifted up into the upper 3 row registers, which circulate through the convolution processor.
• All the row registers and processing are inside the FPGA.
• The results are written to the destination RAM after a latency of 3 row reads.
[Figure: Source RAM feeds row i+2, one pixel at a time; rows are parallel-shifted up into rows i-1, i, i+1, which circular-shift through the 3x3 convolution processor; each row register is one image width in pixels; results go to Destination RAM]
Convolution Operation
[Figure: pixel array and convolution mask]
• Used, for example, to compute the intensity gradient (derivative) at pixel (i,j)
• Result = P(i-1,j-1)*m11+P(i-1,j)*m12+P(i-1,j+1)*m13+…+P(i+1,j+1)*m33
Convolution Calculation
[Figure: nine multipliers, each pairing one pixel P(i-1..i+1, j-1..j+1) with its mask coefficient m11..m33, combined into Result(i,j)]
• Arithmetic processing may require some pipelining
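The convolution sum and the three-row buffering can be sketched in software. This is a plain Python illustration, not the FPGA design: the function names are hypothetical, `mask[0][0]` stands for m11 and `mask[2][2]` for m33, a `deque` of length 3 stands in for the three row registers, and border pixels are simply skipped (the slides do not say how borders are handled).

```python
from collections import deque


def convolve_pixel(rows, j, mask):
    """Sum of P(i+di-1, j+dj-1) * m(di+1, dj+1) over the 3x3 neighbourhood.

    rows holds three consecutive image rows (i-1, i, i+1); j is the column.
    """
    return sum(
        rows[di][j + dj - 1] * mask[di][dj]
        for di in range(3)
        for dj in range(3)
    )


def stream_convolve(image, mask):
    """Yield (i, j, result) for interior pixels, holding only 3 rows at a time."""
    window = deque(maxlen=3)          # models the three row registers
    for r, row in enumerate(image):
        window.append(row)            # shift the next row in; the oldest drops out
        if len(window) == 3:
            i = r - 1                 # centre row of the current window
            for j in range(1, len(row) - 1):
                yield i, j, convolve_pixel(window, j, mask)


# Tiny illustrative image and a horizontal-gradient mask (example values only).
image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
gradient_mask = [[-1, 0, 1],
                 [-2, 0, 2],
                 [-1, 0, 1]]
results = list(stream_convolve(image, gradient_mask))
```

No result is produced until three rows have been read, which mirrors the 3-row latency described for the hardware.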
Convolution Timing
• Total time = latency + processing = 20.971 msec
• This assumes we can get pixels into the FPGA at a 20 nsec/pixel rate
• Latency = time to read 3 rows:
  • 1024 pixels * 3 rows * 20 nsec/pixel = 61 usec
• Processing = time to stream remaining 1021 rows through and process:
  • 1024 * 1021 * 20 nsec = 20.910 msec
• Larger convolutions (e.g., 7x7) have longer latencies, but same computation time
• Calculation is for a mono image; a stereo image would take twice as long.
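The timing arithmetic can be checked with a few lines of Python. The function and parameter names are hypothetical; the 1024-row image height is inferred from the 3 + 1021 rows on the slide, and all times are kept in integer nanoseconds.

```python
def convolution_timing_ns(width=1024, height=1024, kernel_rows=3, pixel_ns=20):
    """Return (latency, processing, total) in nanoseconds.

    latency    = time to read the first kernel_rows rows
    processing = time to stream the remaining rows through the processor
    """
    latency = width * kernel_rows * pixel_ns
    processing = width * (height - kernel_rows) * pixel_ns
    return latency, processing, latency + processing


latency, processing, total = convolution_timing_ns()
# latency    = 61,440 ns      (about 61 usec)
# processing = 20,910,080 ns  (about 20.910 msec)
# total      = 20,971,520 ns  (about 20.971 msec)

stereo_total = 2 * total      # the slide's "twice as long" for a stereo pair
```

With `kernel_rows=7` the latency term grows to 143,360 ns while the total stays at 20,971,520 ns in this model, since the same number of pixels is streamed either way.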
Section 3: Comparing the Reconfiguration Time of a Xilinx FPGA With the Algorithm Execution Time Calculated in Section 2
DR Computer Processing Element: Virtex-4 LX FPGA
• Eight versions:
  • XC4VLX15, -25, -40, -60, -80, -100, -160, -200
• Logic hierarchically arranged:
  • 2 flip-flops per slice
  • 4 slices per CLB
Time to Configure FPGA
• FPGA Configuration Sequence
[Figure: timing diagram of PROG_B, INIT_B, CCLK, and DONE, with intervals Tpl and Tconfig spanning the Total Configuration Time]
Configuration Timing: Tpl
• Tpl = 0.5 usec/frame
• “frame” is a unit of configuration RAM
• Tpl period clears configuration RAM
Configuration Timing: Tconfig
• FPGA programmed by bitstream
• CCLK (programming CLK) can run at 100 MHz
• Parallel mode loads 8 bits per CCLK
Total Configuration Time
• Total configuration time = Tpl + Tconfig
• Plus some extra time amounting to a few CCLK cycles (@ 10 nsec each)
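The configuration-time estimate can be written as a short calculation. This is a sketch under the figures quoted on the slides (0.5 usec per frame for Tpl, 100 MHz CCLK, 8 bits per CCLK). The function name is hypothetical, and the frame count and bitstream length in the example are made-up round numbers, not data for any real device. The few extra CCLK cycles are ignored.

```python
def configuration_time_ns(frames, bitstream_bits,
                          tpl_ns_per_frame=500,   # 0.5 usec per frame
                          cclk_period_ns=10,      # 100 MHz CCLK
                          bits_per_cclk=8):       # parallel mode
    """Return (t_pl, t_config, total) in nanoseconds."""
    t_pl = frames * tpl_ns_per_frame                 # clearing configuration RAM
    cclk_cycles = -(-bitstream_bits // bits_per_cclk)  # ceiling division
    t_config = cclk_cycles * cclk_period_ns          # loading the bitstream
    return t_pl, t_config, t_pl + t_config


# Hypothetical device: 1,000 frames and an 8,000,000-bit bitstream.
t_pl, t_config, total = configuration_time_ns(frames=1_000,
                                              bitstream_bits=8_000_000)
# t_pl = 500,000 ns, t_config = 10,000,000 ns, total = 10,500,000 ns (10.5 msec)
```

Substituting the frame count and bitstream length from the device documentation would give the estimate for a particular Virtex-4 LX part.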
Processing and Reconfiguration Time Comparison
• Convolution execution is faster than reconfiguration
  • Convolution = 21 msec mono, 42 msec stereo
  • Reconfiguration = 81 msec
  • Assuming -200 device
• Processing shown is well within FPGA’s capabilities
• More complex algorithms may require use of FPGA performance features
  • Much higher internal clock rates
  • Large internal RAM
  • Dedicated arithmetic support in -SX series
• What this shows is that it’s reasonable to consider alternating execution and reconfiguration of two FPGAs
ROUGH ESTIMATE
Section 4: An Extremely Rough Estimate of Image Processing Algorithm Execution Time on a Flight Computer
GP Computing Performance Estimate
• DANGER: really rough estimate!
• Based on data from this paper:
  • “Stereo Vision and Rover Navigation Software for Planetary Exploration”, Steven B. Goldberg, Indelible Systems; Mark Maimone, Larry Matthies, JPL; 2002 IEEE Aerospace Conference
  • Available at robotics.jpl.nasa.gov/people/mwm/visnavsw/aero.pdf
• Describes processing and algorithms to be used on 2004 Rover missions, and Rover requirements.
Published Vision Algorithm Timing
• Timed on Pentium III 700 MHz CPU, 32K L1 cache, 256K L2 cache, 512M RAM, Win2K
• Algorithms explicitly timed (names from paper):
[table missing in source]
• The Gaussian and most vision algorithms involve neighborhood operations that are comparable to an image convolution of some size
Flight Computer Performance
• Flight processor is RAD6000
• GESTALT Navigation algorithm timed on 3 processors:
[table missing in source]
• Assume that the RAD6000 takes 7 times as long as the 500 MHz Pentium
Final Performance Estimate
• Assume RAD6000 time = 7 times the 500 MHz Pentium time
• Assume 500 MHz Pentium time = 7/5 = 1.4 times the 700 MHz Pentium time
• Then, RAD6000 time is 1.4*7 = 9.8 times the 700 MHz Pentium time
• Vision algorithm timing can be estimated as follows:
[table missing in source]
• Remember: This is a really rough estimate!!
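The scaling chain above amounts to multiplying two assumed ratios. The sketch below restates it in Python. The function name is hypothetical, and the 100 msec input is a placeholder rather than a timing from the cited paper. Like the slides, it treats execution time as inversely proportional to clock rate between the two Pentiums, which is itself a rough assumption.

```python
RAD6000_VS_P500 = 7.0          # slide assumption: RAD6000 takes 7x the 500 MHz Pentium time
P500_VS_P700 = 700 / 500       # slide assumption: time scales with the clock ratio (1.4)


def rad6000_time_estimate(p700_time):
    """Scale a time measured on the 700 MHz Pentium to a RAD6000 estimate."""
    return p700_time * P500_VS_P700 * RAD6000_VS_P500


factor = rad6000_time_estimate(1.0)         # overall factor, about 9.8
example_ms = rad6000_time_estimate(100.0)   # placeholder 100 msec becomes about 980 msec
```

The result carries the same caveat as the slides: it is an order-of-magnitude indication only.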
Section 5: Conclusions
What We Have Shown
• We have shown that the concept DR computer presented executes a 3x3 neighborhood-type algorithm “a lot” faster than it appears that a RAD6000 executes what are probably a bunch of neighborhood algorithms.
• The reader is cautioned not to try to quantify what “a lot” means based on the data given here.
• But, it’s a good enough estimate to tell us that this is worth looking into in more detail.
Conclusions
• Xilinx-based DR computer shows promise for performance enhancement of a vision system
• By extension, the DR computer shows promise for the performance enhancement of other algorithms.