Reconfigurable Computing: Current Status and Potential for Spacecraft Computing Systems

Rod Barto
NASA/GSFC Office of Logic Design
Spacecraft Digital Electronics
3312 Moonlight
El Paso, Texas 79904
127-MAPLD2005

Reconfigurable Computing is…
• A design methodology by which computational components can be arranged in several ways to perform various computing tasks
• Two types of reconfigurable computing:
  • Static, i.e., the computing system is configured before launch
  • Dynamic, i.e., the computing system can be reconfigured after launch

Static Reconfigurability
• Several examples exist, e.g., Cray
• Typically processing modules connected by an intercommunication mechanism, e.g., Ethernet
• Goals are
  • To reduce system development costs
  • To provide higher performance computing

Dynamic Reconfigurability (DR)
• Processing modules that can be reconfigured in flight
• Goal is to provide processing support, using reduced amounts of hardware, for algorithms that do not map well onto general purpose computers

Outline of Paper
• Discuss the computation of a series of algorithms on general purpose, special purpose, and DR computers
• Calculate the execution time of an image processing algorithm on a concept DR computer
• Compare the reconfiguration time of a Xilinx FPGA with the algorithm execution time calculated in Section 2
• Obtain an extremely rough estimate of image processing algorithm execution time on a flight computer
• Conclude that the DR computer described offers higher performance than does the flight computer

Section 1: Algorithm Execution on General Purpose (GP), Special Purpose (SP), and DR Computers

Processing example
[Figure: Input → f1 → f2 → … → fn → Output]
• A computing function is the composition of n algorithms executed serially
• Can be executed on a general purpose computer (GP) or a special purpose computer (SP)

Execution on a GP Computer
[Figure: Input → f1 → f2 → … → fn → Output]
Processing time of each stage = ti, i=1..n
Total processing time = $\sum_{i=1}^{n} t_i$
Latency time = $\sum_{i=1}^{n} t_i$
GP computer must execute processing stages sequentially, and cannot exploit parallelism in overall computing function

Processing on an SP Processor
[Figure: Input → f1 → f2 → … → fn → Output]
Each stage is an independently operating processor designed specifically for the algorithm it executes
Processing time of each stage = ti, i=1..n
Results appear at rate of one per max(ti), i=1..n
Latency time = [equation not preserved in transcript]
• Performance increase comes from two factors:
  • Pipelining of constituent algorithms exploiting parallelism
  • Processors being designed specifically for their algorithms
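The GP and SP timing relationships can be sketched numerically. This is a minimal illustration, assuming that sequential execution on the GP computer costs the sum of the stage times and that the SP pipeline delivers one result per slowest stage, as the slides state. The function names and the stage times are hypothetical and are not taken from the presentation.

```python
def gp_total_time(stage_times):
    """GP computer: stages run one after another, so the times add up."""
    return sum(stage_times)


def sp_result_interval(stage_times):
    """SP pipeline: a new result appears once per slowest stage time."""
    return max(stage_times)


# Hypothetical stage times t1..t3 in arbitrary time units (illustration only).
stage_times = [4.0, 10.0, 6.0]

print("GP time per result:    ", gp_total_time(stage_times))
print("SP interval per result:", sp_result_interval(stage_times))
```

With these made-up stage times the GP computer needs 20 time units per result, while the pipelined SP processor produces a result every 10 time units once the pipeline is full.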

Processing on a DR Computer
• Two processing elements alternately process and reconfigure, i.e., fodd executes one algorithm while feven reconfigures for the next algorithm, etc.
[Figure: Input, processing elements fodd and feven, Output]

DR Computer Processing Flow
[Figure: timeline. fodd: f1, R, f3, R, …, fn, R. feven: f2, R, f4, R, …. R = reconfiguration]
Results appear at rate of one per [equation not preserved in transcript]
Latency = [equation not preserved in transcript]
• Performance increase comes from configuring processors specifically for the algorithm they are executing
• Do not get increase from exploiting parallelism
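The alternating process/reconfigure flow can be explored with a small timeline model. The slide's own rate and latency formulas did not survive in this transcript, so the sketch below is one possible reading and not the author's formula. It assumes that both elements start out configured for f1 and f2, that every reconfiguration takes the same time R, and that a stage starts only when the previous stage has finished and its element has finished reconfiguring. The function name and the example numbers are hypothetical.

```python
def dr_latency(stage_times, reconfig_time):
    """Time for one input to pass through all stages on two alternating elements.

    Assumed model (not from the slides): element 0 runs f1, f3, ...,
    element 1 runs f2, f4, ..., and each element begins reconfiguring
    for its next stage as soon as it finishes the current one.
    """
    ready = [0.0, 0.0]  # time at which each element is configured for its next stage
    now = 0.0           # completion time of the most recently executed stage
    for index, stage_time in enumerate(stage_times):
        element = index % 2
        start = max(now, ready[element])  # wait for data and for reconfiguration
        now = start + stage_time
        ready[element] = now + reconfig_time
    return now


# Hypothetical example: four stages of 5 time units each.
print(dr_latency([5.0, 5.0, 5.0, 5.0], reconfig_time=3.0))   # reconfiguration hidden
print(dr_latency([5.0, 5.0, 5.0, 5.0], reconfig_time=12.0))  # reconfiguration exposed
```

Under this model, reconfiguration costs nothing extra when R is no longer than the stage running on the other element. When R is longer, the pipeline stalls waiting for the element to become ready.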

Section 2: Execution Time of an Image Processing Algorithm on a Concept DR Computer

DR Computer Concept
[Figure: FPGA0, FPGA1, RAM0, RAM1]
• RAM0 is source for FPGA0, destination for FPGA1, etc.
• Processing elements are implemented in FPGAs
• FPGA0 and FPGA1 alternately process and reconfigure, as previously discussed
• Input and output not shown

Algorithm Example: 3x3 Image Convolution
• Rows are shifted in from the source RAM one at a time, pixel-serial, and are parallel-shifted up into the three upper row registers. The rows in those registers are circularly shifted through the convolution processor. All of the row registers and processing are inside the FPGA. The results are written to the destination RAM after a latency of 3 row reads.
[Figure labels: Source RAM; row i+2; parallel shift rows up; row i-1, row i, row i+1; circular shift rows through convolution processor; one pixel; 3x3 convolution processor; image width in pixels; Destination RAM]

Convolution Operation
[Figure: pixel array and convolution mask]
• Used, for example, to compute the intensity gradient (derivative) at pixel (i,j)
Result = P(i-1,j-1)*m11 + P(i-1,j)*m12 + P(i-1,j+1)*m13 + … + P(i+1,j+1)*m33

Convolution Calculation
[Figure: each of the nine pixels P(i-1,j-1) … P(i+1,j+1) is multiplied by its mask coefficient m11 … m33 to form Result(i,j)]
• Arithmetic processing may require some pipelining
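The multiply-and-sum in the Result formula can be written out in software as a reference model. This is a minimal sketch that follows the slide's indexing, where mask entry m11 pairs with P(i-1,j-1) and m33 pairs with P(i+1,j+1). The function name, the choice to leave border pixels at zero, and the example mask are assumptions made for illustration. It does not model the FPGA row-register datapath.

```python
def convolve3x3(pixels, mask):
    """Apply a 3x3 mask to every interior pixel of a 2-D list of numbers.

    Border pixels have no full 3x3 neighborhood; leaving them at 0 is an
    assumption, since the slides do not say how borders are handled.
    """
    rows, cols = len(pixels), len(pixels[0])
    result = [[0] * cols for _ in range(rows)]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            acc = 0
            for a in (-1, 0, 1):        # row offset: i-1, i, i+1
                for b in (-1, 0, 1):    # column offset: j-1, j, j+1
                    acc += pixels[i + a][j + b] * mask[a + 1][b + 1]
            result[i][j] = acc
    return result


# Hypothetical test image: intensity rises by 1 per column.
image = [[j for j in range(4)] for _ in range(4)]
# Example horizontal-gradient mask (illustrative; not given in the slides).
gradient_mask = [[-1, 0, 1],
                 [-1, 0, 1],
                 [-1, 0, 1]]

for row in convolve3x3(image, gradient_mask):
    print(row)
```

On this ramp image every interior pixel gets the same gradient value, which is a quick sanity check that the mask indices line up with the pixel offsets.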

Convolution Timing
• Total time = latency + processing = 20.971 msec
  • This assumes we can get pixels into the FPGA at a 20 nsec/pixel rate
• Latency = time to read 3 rows:
  • 1024 pixels * 3 rows * 20 nsec/pixel = 61 usec
• Processing = time to stream remaining 1021 rows through and process:
  • 1024 * 1021 * 20 nsec = 20.910 msec
• Larger convolutions (e.g., 7x7) have longer latencies, but same computation time
• Calculation is for a mono image; a stereo image would take twice as long
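The timing arithmetic above can be reproduced directly. The sketch below uses the slide's figures: 1024-pixel rows, 3 latency rows plus 1021 remaining rows, and 20 nsec per pixel. The function and variable names are illustrative only.

```python
PIXEL_TIME_NS = 20   # from the slide: 20 nsec per pixel
WIDTH = 1024         # pixels per row
HEIGHT = 3 + 1021    # 3 latency rows plus the remaining 1021 rows


def convolution_timing_ns(width, height, pixel_time_ns, latency_rows=3):
    """Return (latency, processing, total) in nanoseconds."""
    latency = width * latency_rows * pixel_time_ns
    processing = width * (height - latency_rows) * pixel_time_ns
    return latency, processing, latency + processing


latency, processing, total = convolution_timing_ns(WIDTH, HEIGHT, PIXEL_TIME_NS)
print(f"latency    = {latency} nsec")
print(f"processing = {processing} nsec")
print(f"total      = {total} nsec (mono), {2 * total} nsec (stereo)")
```

The total comes to 20,971,520 nsec, which the slide quotes as 20.971 msec. Doubling it for a stereo pair gives roughly 42 msec.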

Section 3: Comparing the Reconfiguration Time of a Xilinx FPGA With the Algorithm Execution Time Calculated in Section 2

DR Computer Processing Element: Virtex-4 LX FPGA
• Eight versions:
  • XC4VLX15, -25, -40, -60, -80, -100, -160, -200
• Logic hierarchically arranged:
  • 2 flip-flops per slice
  • 4 slices per CLB

Time to Configure FPGA
• FPGA Configuration Sequence
[Figure: timing diagram of PROG_B, INIT_B, CCLK, and DONE, showing the intervals Tpl and Tconfig within the Total Configuration Time]

Configuration Timing: Tpl
• Tpl = 0.5 usec/frame
• “frame” is a unit of configuration RAM
• Tpl period clears configuration RAM

Configuration Timing: Tconfig
• FPGA programmed by bitstream
• CCLK (programming CLK) can run at 100 MHz
• Parallel mode loads 8 bits per CCLK

Total Configuration Time
• Plus some extra time amounting to a few CCLK cycles (@ 10 nsec each)
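The configuration-time figures on the preceding slides can be combined into a rough calculator. This is a sketch under stated assumptions: it reads Tpl as 0.5 usec multiplied by the number of configuration frames, reads Tconfig as the bitstream length loaded 8 bits per 100 MHz CCLK cycle, and ignores the few extra CCLK cycles. The per-device frame counts and bitstream lengths are not preserved in this transcript, so the example values are hypothetical placeholders and not Virtex-4 data.

```python
def configuration_time_s(frames, bitstream_bits,
                         tpl_per_frame_s=0.5e-6,  # from the slide: 0.5 usec/frame
                         cclk_hz=100e6,           # from the slide: 100 MHz CCLK
                         bits_per_cclk=8):        # from the slide: parallel mode
    """Return (Tpl, Tconfig, total) in seconds, ignoring the few extra CCLK cycles."""
    t_pl = frames * tpl_per_frame_s
    t_config = bitstream_bits / bits_per_cclk / cclk_hz
    return t_pl, t_config, t_pl + t_config


# Hypothetical device: 10,000 frames and an 8,000,000-bit bitstream.
t_pl, t_config, t_total = configuration_time_s(frames=10_000,
                                               bitstream_bits=8_000_000)
print(f"Tpl     = {t_pl * 1e3:.1f} msec")
print(f"Tconfig = {t_config * 1e3:.1f} msec")
print(f"Total   = {t_total * 1e3:.1f} msec")
```

Substituting the frame count and bitstream length of a real device from the vendor's configuration guide would give the device-specific total.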

Processing and Reconfiguration Time Comparison
• Convolution execution is faster than reconfiguration
  • Convolution = 21 msec mono, 42 msec stereo
  • Reconfiguration = 81 msec
    • Assuming -200 device
• Processing shown is well within FPGA’s capabilities
• More complex algorithms may require use of FPGA performance features
  • Much higher internal clock rates
  • Large internal RAM
  • Dedicated arithmetic support in -SX series
• What this shows is that it’s reasonable to consider alternating execution and reconfiguration of two FPGAs

ROUGH ESTIMATE
Section 4: An Extremely Rough Estimate of Image Processing Algorithm Execution Time on a Flight Computer

GP Computing Performance Estimate
• DANGER: really rough estimate!
• Based on data from this paper:
  • “Stereo Vision and Rover Navigation Software for Planetary Exploration”, Steven B. Goldberg, Indelible Systems; Mark Maimone, Larry Matthies, JPL; 2002 IEEE Aerospace Conference
  • Available at robotics.jpl.nasa.gov/people/mwm/visnavsw/aero.pdf
  • Describes processing and algorithms to be used on 2004 Rover missions, and Rover requirements

Published Vision Algorithm Timing
• Timed on Pentium III 700 MHz CPU, 32K L1 cache, 256K L2 cache, 512M RAM, Win2K
• Algorithms explicitly timed (names from paper):
  [Table not preserved in transcript]
• The Gaussian and most vision algorithms involve neighborhood operations that are comparable to an image convolution of some size

Flight Computer Performance
• Flight processor is RAD6000
• GESTALT Navigation algorithm timed on 3 processors:
  [Table not preserved in transcript]
• Assume that the RAD6000 takes 7 times as long as the 500 MHz Pentium

Final Performance Estimate
• Assume RAD6000 time = 7 times the 500 MHz Pentium time
• Assume 500 MHz Pentium time = 7/5 = 1.4 times the 700 MHz Pentium time
• Then, RAD6000 time is 1.4 * 7 = 9.8 times the 700 MHz Pentium time
• Vision algorithm timing can be estimated as follows:
  [Table not preserved in transcript]
Remember: This is a really rough estimate!!
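The scaling chain above amounts to a single multiplication. The sketch below applies the slide's two assumed factors to a measured 700 MHz Pentium III time. The function name and the sample input are hypothetical, because the published per-algorithm timings are not preserved in this transcript.

```python
RAD6000_VS_P500 = 7.0      # slide assumption: RAD6000 takes 7x the 500 MHz Pentium time
P500_VS_P700 = 700 / 500   # slide assumption: clock-ratio scaling, 7/5 = 1.4


def rad6000_estimate(p700_time):
    """Scale a 700 MHz Pentium III time to a very rough RAD6000 time (same units)."""
    return p700_time * RAD6000_VS_P500 * P500_VS_P700


# Hypothetical input: an algorithm that took 100 msec on the 700 MHz machine.
print(f"{rad6000_estimate(100.0):.0f} msec")
```

Any output inherits the roughness of both assumptions, so it is useful only for order-of-magnitude comparison.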

Section 5: Conclusions

What We Have Shown
• We have shown that the concept DR computer presented executes a 3x3 neighborhood-type algorithm “a lot” faster than it appears that a RAD6000 executes what are probably a bunch of neighborhood algorithms.
• The reader is cautioned to not try to quantify what “a lot” means based on the data given here.
• But, it’s a good enough estimate to tell us that this is worth looking into in more detail.

Conclusions
• Xilinx-based DR computer shows promise for performance enhancement of a vision system
• By extension, the DR computer shows promise for the performance enhancement of other algorithms
