Implementation of High-Rate JPEG2000 Coding on a Virtex-2 Pro Reconfigurable Computing Board. Presented by Damon Van Buren, SEAKR Engineering. MAPLD 2004, Submission 133.
The Sensor Bandwidth Problem • Commercial satellite imaging systems are experiencing growth in imaging capability... • Higher resolution: < 1 m • Larger images: >10k image width and height • More spectral components • Panchromatic • Red/Green/Blue • Multi-spectral • Improved capabilities are leading to high sensor data rates • Data output rates > 2 Gbps for some systems • Providing storage and downlink bandwidth for the data is becoming a significant challenge for system designers • The largest data recorders can store less than 20 minutes of data at 2 Gbps • Downlinks must be several hundred Mbps to downlink 15 minutes of data in under an hour • Data storage and high-bandwidth downlinks require lots of power • By reducing the amount of image data, compression provides a solution to the bandwidth problem!
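The storage and downlink figures on this slide follow from simple rate arithmetic. The sketch below reproduces them; the function names are illustrative, and the 2 Gbps, 20 minute, 15 minute, and one hour values are taken from the bullets above.

```python
def recorder_capacity_bits(rate_bps, minutes):
    """Bits produced by a sensor running at rate_bps for the given minutes."""
    return rate_bps * minutes * 60


def required_downlink_bps(sensor_rate_bps, record_minutes, downlink_minutes):
    """Downlink rate needed to send record_minutes of data in downlink_minutes."""
    return sensor_rate_bps * record_minutes / downlink_minutes


SENSOR_RATE_BPS = 2_000_000_000  # 2 Gbps, from the slide

# 20 minutes at 2 Gbps: the upper bound quoted for the largest recorders.
capacity_bits = recorder_capacity_bits(SENSOR_RATE_BPS, 20)
capacity_gbytes = capacity_bits / 8 / 1e9

# 15 minutes of data sent in 60 minutes of downlink time.
downlink_bps = required_downlink_bps(SENSOR_RATE_BPS, 15, 60)

print(f"20 min at 2 Gbps = {capacity_gbytes:.0f} GB")
print(f"Downlink needed  = {downlink_bps / 1e6:.0f} Mbps")
```

Twenty minutes at 2 Gbps is 2.4 Tbit (300 GB), and moving 15 minutes of data in one hour needs 500 Mbps, which matches the "several hundred Mbps" bullet.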
Desired Compressor Features • Real Time • Compression must be performed in real time, prior to storage. • High throughput (> 2 Gbps) • Excellent Performance in Lossy and Lossless Modes • Purchasers of satellite imagery are sensitive to reductions in image quality caused by lossy compression. • Scientific users prefer undistorted data (bit true). • Space-Qualified • Must survive hazards of launch and space operation, including radiation. • Low Risk • Satellite imaging companies seek high reliability solutions. • Low Cost • Commercial customers require cost effective solutions. • Flexible • The ability to support varying compression ratios and contents would allow more effective use of available storage and bandwidth.
JPEG2000 Algorithm • JPEG2000 is an excellent choice for satellite image compression. • Latest still image compression standard from the JPEG committee • Meets two key requirements for satellite image compression: • Excellent performance in both lossy and lossless modes. • ~1.7 to 1 lossless compression for typical satellite imagery - 70% improvement! • Visually lossless compression > 2 to 1 - 100% improvement in storage and downlink performance. • Very flexible: • Many options for compressed images. • Other advantages: • International Standard • Wavelet based • High quality lossy images with comp. ratios > 100:1 • Packet oriented • Allows random access to the compressed code stream. • Makes compressed data more robust in the presence of bit errors. • Allows selection of image quality, spatial region, resolution, and color component after compression.
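The "70% improvement" and "100% improvement" figures read most naturally as the extra image data that fits in a fixed storage or downlink budget at a given compression ratio. A minimal sketch of that interpretation (the function name is illustrative, not from the presentation):

```python
def extra_capacity_percent(compression_ratio):
    """Extra uncompressed image data that fits in the same storage or
    downlink budget when data is compressed by compression_ratio:1."""
    return (compression_ratio - 1.0) * 100.0


lossless_gain = extra_capacity_percent(1.7)          # ~1.7:1 lossless
visually_lossless_gain = extra_capacity_percent(2.0)  # 2:1 visually lossless

print(f"1.7:1 -> {lossless_gain:.0f}% more image data")
print(f"2.0:1 -> {visually_lossless_gain:.0f}% more image data")
```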
JPEG2000 Implementation Challenges • JPEG2000 is a very complex algorithm. • More Features = More Complexity. • Operation intensive • Several hundred operations per pixel, because each bit must be processed many times, for the wavelet transform, entropy coding, MQ coding, packet generation, etc. • Complex • Many different stages to produce compressed output. • Wavelet transform. • Quantization. • Context generation. • Arithmetic coding. • Packet generation. • Many parameters must be tracked individually for each code block (64x64). • Memory intensive • Each pixel must be accessed many times, so many small buffers are needed to get good throughput. • Few processors are capable of implementing JPEG2000 at high rates!
High-Performance Processing Using Xilinx FPGAs • Xilinx FPGAs have many advantages for fast parallel processing: • Millions of gates. • System clocks of several hundred MHz. • High speed I/O • 622 Mbps LVDS • Multi-Gigabit serial I/O • Hundreds of internal block RAMs. • Hundreds of internal 18 bit multipliers. • Xilinx FPGAs are available in space-qualified versions: • Radiation testing is complete on the Virtex and Virtex-II devices. • ~200 kRad total dose, latchup immune. • Radiation testing to begin on the Virtex-II Pro devices soon. • Xilinx FPGAs are very flexible, reducing risk: • May be re-programmed an effectively unlimited number of times. • Configurations may be uploaded at any time during the mission to fix errors or add new capability. • Xilinx FPGAs are the best solution for fast compression in space!
Challenges for Xilinx Use in Space • The effects of radiation in spacecraft electronics are well known. • Caused primarily by charged particles. • May cause permanent damage over time by ionizing SiO2 (total dose). • May also cause errors in digital logic by upsetting registers (single event effects). • Mitigation techniques are used to reduce or eliminate the effect of radiation upsets. • Triple Modular Redundancy (TMR) uses voting to select the correct output from 3 separate instances of the design. • Mitigation of radiation effects in SRAM-based FPGAs presents an additional challenge: • As with other digital electronics, the functional logic of the device is susceptible to upset, however... • Another layer of logic (configuration logic) controls the routing of the part, giving the device its capability to be reprogrammed to perform different functions. • Configuration logic is also susceptible to radiation upsets. • Xilinx FPGAs require system level mitigation strategies in addition to the device level mitigation techniques (such as TMR) that are commonly used for space electronics. • Configuration data must be continuously re-written, or scrubbed using a read-and-correct approach.
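The TMR voting mentioned above can be illustrated with a bitwise majority function. This is a software sketch of the idea only; in hardware the voter is combinational logic, and the name `tmr_vote` is illustrative.

```python
def tmr_vote(a, b, c):
    """Bitwise majority vote over three redundant copies of a value.

    Each output bit takes the value held by at least two of the inputs,
    so an upset confined to one copy is masked.
    """
    return (a & b) | (a & c) | (b & c)


good = 0b10110010
upset = good ^ 0b00001000  # one copy with a single flipped bit

voted = tmr_vote(good, upset, good)
print(f"voted = {voted:#010b}")
```

The vote masks any error confined to a single copy. If two copies are corrupted in the same bit position, the voter outputs the wrong value, which is one reason configuration scrubbing is needed in addition to TMR.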
SEAKR’s RCC Board Processing Solutions • SEAKR has developed a line of Reconfigurable Computing (RCC) products based on the Xilinx FPGAs. • RCC 1 – 4x Virtex 1000s • RCC 2 – 4x Virtex II 6000s • RCC 3 (NTRCC) – 4x Virtex II Pro 70/100s • Boards include system-level upset mitigation (scrub) for the Xilinx devices. • Configuration data is continuously read and checked for errors. • Errors are corrected by overwriting the corrupted frames, without interrupting the operation of the device. • Other devices on board employ radiation mitigation strategies as well: • Radiation hardened • EDAC • Boards also have dedicated resources to support high-performance processing: • High speed I/O. • External memories. • Industry standard form-factor: 6U Compact PCI.
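The read-and-correct scrub described above can be sketched in software as a loop over configuration frames. The slide does not say how errors are detected, so this sketch assumes a frame-by-frame comparison against a known-good ("golden") copy; the function name, the frame representation, and the data values are all hypothetical.

```python
def scrub_pass(config_frames, golden_frames):
    """One scrub pass: read every frame, compare it with the golden copy,
    and overwrite only the frames that differ.

    Returns the indices of the frames that were corrected.
    """
    corrected = []
    for index, golden in enumerate(golden_frames):
        if config_frames[index] != golden:
            config_frames[index] = golden  # rewrite just the bad frame
            corrected.append(index)
    return corrected


# Hypothetical 4-frame configuration with one upset injected in frame 2.
golden = [bytes([i] * 8) for i in range(4)]
live = list(golden)
live[2] = bytes([2, 2, 2, 0x12, 2, 2, 2, 2])

first_pass = scrub_pass(live, golden)
second_pass = scrub_pass(live, golden)
print(first_pass, second_pass)
```

Only the corrupted frame is rewritten, which mirrors the slide's point that correction happens without interrupting the operation of the device.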
Network RCC (NTRCC) • Four Xilinx XC2VP70-6FF1704 FPGA CO-Processors • Design compatible with XC2VP100-6FF1706 and V2P-X • (4) banks of 1Mx36 Quad Data Rate (QDR) SRAMs for each COP • 512MB of DDRII Shared SDRAM memory for prototype • 1GB of 128M x 64 EDAC (R-S) Protected DDRII SDRAM shared memory (19.2 Gbps @ 150 MHz) using 1Gbit memory • Network IF • (2) parallel 16bit RapidIO ports to front panel (8 Gbps) • (1) 4x3.125 Gbps serial port to front panel (>10 Gbps) • 4x3.125 Gbps ports from NIC to each COP (>10 Gbps) • 4x3.125 Gbps ports from each COP to each neighbor COP (>10 Gbps) • Shared Data Buses • COP Interconnect Bus (~4.224 Gbps) • cPCI 32bit 33 MHz • Read and write COP configurations via cPCI • Extended 6U form factor • Configuration RAM SEU detection and correction • DDRII SDRAM on configuration controller for shadow config program storage • Non-Volatile memory for 16 different configurations (1 Gbit Flash)
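Several of the bandwidth figures on this slide can be checked from bus width and clock rate. The sketch below is a rough check, not a statement of how the board was specified. It assumes the DDRII memory transfers data on both clock edges, and it assumes 8b/10b line coding when estimating serial payload (the slide does not say which coding is used).

```python
def parallel_bus_gbps(width_bits, clock_mhz, transfers_per_clock=1):
    """Peak rate of a parallel bus in Gbps."""
    return width_bits * clock_mhz * transfers_per_clock / 1000.0


# 64-bit DDRII shared memory at 150 MHz, two transfers per clock.
ddr2_gbps = parallel_bus_gbps(64, 150, transfers_per_clock=2)

# 32-bit, 33 MHz cPCI bus (peak, ignoring protocol overhead).
cpci_gbps = parallel_bus_gbps(32, 33)

# Four 3.125 Gbps serial lanes: raw line rate, and payload if 8b/10b
# coding is assumed (an assumption, not stated on the slide).
serial_raw_gbps = 4 * 3.125
serial_payload_gbps = serial_raw_gbps * 8 / 10

print(ddr2_gbps, cpci_gbps, serial_raw_gbps, serial_payload_gbps)
```

The DDRII result matches the 19.2 Gbps quoted on the slide. The four-lane serial ports come out at 12.5 Gbps raw, consistent with the ">10 Gbps" bullets.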
NTRCC Layout • 24 Layer board • MicroVias, blind vias, via-in-pad • High speed 3.125 Gbps Serial links • 82 pages of schematic capture • 10 weeks of PCB layout time
Implementation of the JPEG2000 Algorithm • The JPEG2000 core has been in development for over a year. • Eventual target data rate 600 Mbps/device. • Written in VHDL. • Simulations performed in Modelsim. • Synthesis in Synplify_Pro. • Targeted to the NTRCC-R summer ‘04. • Targeted to a reduced version of the NTRCC with a single coprocessor. • Take advantage of improved external memory throughput. • Ultimately use the high-speed serial I/O to move image information on the board. • Designed for high throughput. • Cycle efficient coding style. • Highly parallel design. • Pipelined architecture. • Rolling wavelet transform. • Designed for flexible output file format. • Output is divided into quality layers for easy selection of compression ratio.
JPEG2000 Coding Steps • Image is broken into tiles • Tiles are wavelet transformed • 5/3 reversible or 9/7 irreversible, also user defined. • Selectable number of transform levels. • Each subband from the transform is further broken up into code blocks (typically 32x32 or 64x64) for entropy coding. • Each code block is entropy coded, starting from the top bit plane and working down. • The current bit of each pixel is passed to an arithmetic coder, along with context information. • The MQ encoder takes advantage of any skewing of the probability for each context, and adapts contexts as the coding progresses. • Packets are formed by combining the entropy coder outputs from a single resolution. • Tile parts are formed from all the packets in a given bit plane.
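The 5/3 reversible transform named above is usually computed with integer lifting steps, which is what makes lossless reconstruction possible. The following is a minimal one-dimensional, single-level sketch with symmetric boundary extension, restricted to even-length inputs for brevity. It illustrates the technique in software and is not the VHDL implementation described in this presentation.

```python
def forward_53(x):
    """One level of the reversible 5/3 lifting transform on an even-length
    list of integers. Returns (low, high) subband coefficients."""
    assert len(x) >= 2 and len(x) % 2 == 0
    even, odd = x[0::2], x[1::2]
    half = len(even)
    # Predict step: high-pass = odd sample minus the floor of the mean
    # of its two even neighbours (mirrored at the right edge).
    high = []
    for i in range(half):
        right = even[i + 1] if i + 1 < half else even[i]
        high.append(odd[i] - ((even[i] + right) >> 1))
    # Update step: low-pass = even sample plus a rounded quarter of the
    # neighbouring high-pass values (mirrored at the left edge).
    low = []
    for i in range(half):
        left = high[i - 1] if i > 0 else high[0]
        low.append(even[i] + ((left + high[i] + 2) >> 2))
    return low, high


def inverse_53(low, high):
    """Exact inverse of forward_53: undo the update, then the predict."""
    half = len(low)
    even = []
    for i in range(half):
        left = high[i - 1] if i > 0 else high[0]
        even.append(low[i] - ((left + high[i] + 2) >> 2))
    x = []
    for i in range(half):
        right = even[i + 1] if i + 1 < half else even[i]
        x.extend([even[i], high[i] + ((even[i] + right) >> 1)])
    return x


samples = [10, 12, 14, 13, 9, 7, 8, 20]
low, high = forward_53(samples)
restored = inverse_53(low, high)
print(low, high, restored == samples)
```

Because every step uses integer arithmetic that the inverse undoes exactly, the round trip is bit true. The irreversible 9/7 transform uses floating-point filter coefficients and does not have this property.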
JPEG2000 Architecture Drivers • To achieve high data rates, the processing must be parallelized as much as possible. • The “tall pole in the tent” is the arithmetic coding, because the coding of a single data bit with its context can take several clock cycles. • Significance propagation coding is also a challenge, because each coefficient must be accessed many times, as each bit plane is processed. • Other operations, such as wavelet transform, code block loading, and packet generation are much more efficient, and require fewer parallel paths. • A pipelined architecture with many entropy coders in parallel was used to achieve the required throughput.
Architecture Description • Processes 256x256 tiles. • Pipelined architecture, using separate external memories for image, tile, and compressed data storage. • 19 Entropy coders working in parallel to improve throughput, one for each code block. • 64x64 code blocks. • FIFO buffering between the stages improves data flow efficiency. • A rolling wavelet transform is used to reduce memory accesses and improve efficiency. • Entropy coder outputs are formed into layers, giving each tile a progressive output format. • Tile parts are interleaved as the image tiles are processed. • Performs lossy or lossless compression.
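One way to arrive at 19 code blocks per tile is to count the blocks produced by a 256x256 tile after three wavelet decomposition levels with a nominal 64x64 code block size. The number of levels is not stated on the slide, so the three-level figure is an assumption that happens to reproduce the count. Subbands smaller than 64x64 are assumed to form a single, smaller code block.

```python
import math


def code_blocks_per_tile(tile_size=256, levels=3, block_size=64):
    """Count code blocks in a square tile after a dyadic wavelet transform.

    Each level contributes three detail subbands (HL, LH, HH); the final
    level also leaves one LL subband. A subband smaller than the nominal
    block size still needs one code block.
    """
    total = 0
    for level in range(1, levels + 1):
        subband_size = tile_size >> level
        blocks_per_side = math.ceil(subband_size / block_size)
        total += 3 * blocks_per_side ** 2
    ll_size = tile_size >> levels
    total += math.ceil(ll_size / block_size) ** 2
    return total


blocks = code_blocks_per_tile()
print(blocks)
```

Under these assumptions the first level contributes 12 blocks, the second and third levels 3 each, and the final LL subband 1.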
NTRCC-R Implementation Results • The JPEG2000 encoder was targeted to the V2Pro 70 FPGA on the NTRCC-R. • Lossless or Lossy compression. • Data precision up to 13 bits. • Simulation and Routing Results: • Slices: 30043 out of 33088, 90% • Block RAMs: 148 out of 328, 45% • Max system clock ~43 MHz without optimization. • Hardware Throughput: • ~140 Mbps w/ 33 MHz clock (depending on image). • ~180 Mbps w/ 43 MHz clock.
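The reported numbers can be cross-checked with a few lines of arithmetic. The sketch below recomputes device utilization and derives the average number of image bits processed per clock cycle at each reported clock rate. All input values come from the slide; the variable names are illustrative.

```python
# Resource utilization reported for the V2Pro 70.
slice_utilization = 30043 / 33088   # about 90.8%, shown as 90% on the slide
bram_utilization = 148 / 328        # about 45.1%

# Average image bits processed per clock cycle at each reported clock.
bits_per_cycle_33 = 140e6 / 33e6
bits_per_cycle_43 = 180e6 / 43e6

print(f"Slices: {slice_utilization:.1%}, block RAMs: {bram_utilization:.1%}")
print(f"Bits/cycle: {bits_per_cycle_33:.2f} at 33 MHz, "
      f"{bits_per_cycle_43:.2f} at 43 MHz")
```

Both operating points work out to a little over 4 bits per clock, which suggests that throughput in these measurements scales roughly with clock rate.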
JPEG2000 Floorplan • The Pro 70 Device is quite full!
Planned Improvements • Optimize design to hit 66 MHz. • Un-optimized design will operate at up to 43 MHz. • Use of asynchronous FIFOs will allow optimal clocking of various parts of the design. • Improve pipelining of code block loader and wavelet transform. • Allow “autonomous” operation of each stage, so that operations take place as soon as input data and output buffers are ready. • Make use of additional QDR SRAMs available to each coprocessor by creating separate buffers for wavelet transform and packetizer output. • NTRCC has 4 QDR memories for each coprocessor. • Arithmetic coder bypass. • Arithmetic coder requires > 2 cycles per bit coded, on average. • 9/7 wavelet transform with quantization. • Use of the 9/7 wavelet results in better SNR and max error performance for lossy compression. • Add RapidIO serial interface to Network Interface Chip (NIC).
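A rough projection shows why the plan lists pipelining improvements alongside the clock-rate increase. The sketch assumes throughput scales linearly with clock rate, which is a simplification. It uses the ~180 Mbps at 43 MHz result and the 600 Mbps per device target quoted elsewhere in this presentation.

```python
MEASURED_MBPS = 180.0   # reported throughput at 43 MHz
MEASURED_MHZ = 43.0
TARGET_MHZ = 66.0       # planned clock rate
TARGET_MBPS = 600.0     # eventual per-device target

# Throughput at 66 MHz if bits per cycle stay unchanged (an assumption).
projected_mbps = MEASURED_MBPS * TARGET_MHZ / MEASURED_MHZ

# Improvement in bits per cycle still needed to reach the target at 66 MHz.
current_bits_per_cycle = MEASURED_MBPS / MEASURED_MHZ
needed_bits_per_cycle = TARGET_MBPS / TARGET_MHZ
per_cycle_gain_needed = needed_bits_per_cycle / current_bits_per_cycle

print(f"Projected at 66 MHz: {projected_mbps:.0f} Mbps")
print(f"Per-cycle gain still needed: {per_cycle_gain_needed:.2f}x")
```

Under this assumption the clock increase alone yields roughly 276 Mbps, so the remaining factor of about 2.2 would have to come from the per-cycle efficiency items in the list, such as better pipelining, autonomous stage operation, and arithmetic coder bypass.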
Conclusions • The JPEG2000 core is expected to provide a valuable option for satellite imagery systems. • Compression will result in a dramatic improvement in system performance. • Lossless compression will allow ~70% more image data to be stored and downlinked by a system. • Lossy compression will allow even greater improvements. • NTRCC hardware is an excellent platform for the compressor. • High bandwidth interconnect and I/O (several Gbps). • High bandwidth external memories. • Excellent processing capability with the Virtex-II Pro devices. • The sky’s the limit! • Target rate of 600 Mbps per device appears to be a realistic goal. • Some improvements are left to be made to the clock rate and pipelining of the design.