
Design and Tradeoff Analysis of JPEG-2000 on Hardware-Reconfigurable Systems

Ryan DeVille, Vikas Aggarwal, Ian Troxel, and Alan D. George. High-performance Computing and Simulation (HCS) Research Laboratory, Department of Electrical and Computer Engineering, University of Florida.



Presentation Transcript

  1. Design and Tradeoff Analysis of JPEG-2000 on Hardware-Reconfigurable Systems • Ryan DeVille, Vikas Aggarwal, Ian Troxel, and Alan D. George • High-performance Computing and Simulation (HCS) Research Laboratory, Department of Electrical and Computer Engineering, University of Florida

  2. Introduction • JPEG-2000 Encoding • State-of-the-art low bit-rate compression algorithm • Progressive transmission by quality, resolution, component, or spatial locality • Spatially random access to bitstream • Region of interest coding • Motivation for porting JPEG-2000 to RC systems • High-performance and low-cost solution is attractive for airborne and satellite imaging systems • Speedup readily available with fine-grain and coarse-grain parallelism opportunities

  3. Related Research • EBCOT Encoder designs • Group of Column optimization method • Previous RC Designs • Space systems prototype [5] • Scalable Entropy Encoder [6] • Dual Processing Elements Architecture [7] • 2D Discrete Wavelet Transform designs • Several mimic early VLSI designs [8, 9] • Multiple architecture design classifications [10] • Direct • 1D transform, transpose, then another 1D transform • Intrinsically slow • Separate serial and parallel filters, or parallel row and parallel column filters • Processes along rows and columns • Represents significant performance improvement • Symmetrically extended • Improves processing efficiency, especially towards center of image

  4. JPEG-2000 Encoder Design & Develop. • Software code profiling first used to determine effort distribution • Previous research efforts show that DWT and Tier-1 encoding consume 80-85% of execution time • Current profiling results with JasPer and OpenJPEG show that >90% of execution time is spent in DWT and Tier-1 • Benchmark images selected from Kodak Lossless True Color Image Suite, JasPer benchmark images, standard image processing images (lena, etc.) • [Figure: JasPer execution time profile]

  5. Discrete Wavelet Transform (DWT) • Features • Second-most computationally intensive block in compression process • Transforms each component tile data into coefficients • Reversible transform involves all integer operations • Represents high- and low-frequency components of image • Amenable to compression: results in better compression ratios • Recursive application yields frequency bands at multiple resolutions • Operation • 2D transform achieved by successively applying 1D transform in X and Y directions • Each 1D transform consists of • Filtering step • De-interleave step: reorganizing of data into bands • Available data and functional parallelism can be exploited • [Figure: three-level subband decomposition with bands LL, HL, LH, HH at level 3 and HL, LH, HH at levels 2 and 1]
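
The filtering and de-interleave steps listed above can be sketched in software. This is a minimal sketch, assuming the reversible transform is the integer 5/3 lifting filter that JPEG-2000 uses for lossless coding, with whole-sample symmetric extension at the borders. The slides do not give the filter details, and the function names below are illustrative rather than taken from the authors' design.

```python
def _mirror(i, n):
    """Map an out-of-range index into [0, n) by whole-sample symmetric extension."""
    if n == 1:
        return 0
    period = 2 * (n - 1)
    i %= period
    return i if i < n else period - i


def dwt53_forward_1d(x):
    """One level of the integer 5/3 lifting transform. Returns (low, high)."""
    n = len(x)
    y = list(x)
    if n == 1:
        return y, []
    # Filtering, predict step: odd samples become high-pass coefficients.
    for i in range(1, n, 2):
        y[i] -= (y[_mirror(i - 1, n)] + y[_mirror(i + 1, n)]) >> 1
    # Filtering, update step: even samples become low-pass coefficients.
    for i in range(0, n, 2):
        y[i] += (y[_mirror(i - 1, n)] + y[_mirror(i + 1, n)] + 2) >> 2
    # De-interleave step: gather the two bands into separate runs.
    return y[0::2], y[1::2]


def dwt53_inverse_1d(low, high):
    """Undo dwt53_forward_1d exactly (integer arithmetic, so no rounding loss)."""
    n = len(low) + len(high)
    if n == 1:
        return list(low)
    y = [0] * n
    y[0::2] = low
    y[1::2] = high
    for i in range(0, n, 2):
        y[i] -= (y[_mirror(i - 1, n)] + y[_mirror(i + 1, n)] + 2) >> 2
    for i in range(1, n, 2):
        y[i] += (y[_mirror(i - 1, n)] + y[_mirror(i + 1, n)]) >> 1
    return y


def _analyze(line):
    low, high = dwt53_forward_1d(line)
    return low + high


def dwt53_forward_2d(tile):
    """One 2D level: 1D transform along every row, then along every column."""
    rows = [_analyze(row) for row in tile]
    cols = [_analyze(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]
```

After one 2D level the low-pass band sits in the top-left quadrant of the result. Applying `dwt53_forward_2d` again to that quadrant gives the next resolution level, which is the recursive application the slide refers to.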

  6. DWT Hardware Architecture • Challenges presented by DWT • Parallel processing limited by memory bandwidth requirements • Some sequential nature in processing involved • Design features • Data-level parallelism exploited by operating on multiple “tiles” • Function-level parallelism exploited by pipelining different processing steps • Data reuse eliminates extra read cycles • Internal architecture • Each tile is entirely stored in single Block RAM to minimize data movement • Overlapped processing to further reduce latency
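
The data-level parallelism described above rests on tiles being independent of one another. As a rough software illustration, the sketch below cuts an image into tiles and deals them out to a number of DWT modules. The tile dimensions, the round-robin policy, and both function names are assumptions made for this example; the slides do not state how tiles are scheduled.

```python
def split_into_tiles(image, tile_height, tile_width):
    """Cut a 2D list of samples into independent tiles, in raster order.

    Each entry is ((top, left), tile). Tiles on the right and bottom
    edges are smaller when the image size is not a multiple of the tile size.
    """
    tiles = []
    for top in range(0, len(image), tile_height):
        for left in range(0, len(image[0]), tile_width):
            tile = [row[left:left + tile_width]
                    for row in image[top:top + tile_height]]
            tiles.append(((top, left), tile))
    return tiles


def assign_round_robin(tiles, num_modules):
    """Deal tiles out to num_modules work queues (hypothetical scheduling policy)."""
    queues = [[] for _ in range(num_modules)]
    for index, tile in enumerate(tiles):
        queues[index % num_modules].append(tile)
    return queues
```

Because no tile reads another tile's samples, each queue could in principle be processed by its own module without synchronization. The memory bandwidth needed to feed every module then becomes the limit, as the slide notes.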

  7. Embedded Block Coding with Optimized Truncation (EBCOT): Tier-1 • Features • Specially adapted arithmetic coder • Four bit-plane coding primitives • Three coding passes for each bit-plane (except the most significant) • Operation • Coding passes: cleanup pass (CUP) begins at most significant bit-plane • Iteratively perform coding passes over remaining bit-planes • Coding-pass-generated context and bit data serially encoded and compressed by arithmetic encoder • Flush and reset arithmetic coder at completion
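
The pass structure above can be made concrete with a small sketch that splits a code block into magnitude bit-planes and lists the passes that would run over them. It assumes the standard JPEG-2000 pass names (significance propagation, magnitude refinement, cleanup) and sign-magnitude coefficients. Context formation and the arithmetic coder are left out, and the function names are illustrative.

```python
PASSES = ("significance propagation", "magnitude refinement", "cleanup")


def magnitude_bit_planes(block):
    """Return the magnitude bit-planes of a code block, most significant first."""
    top = max(abs(c) for row in block for c in row).bit_length()
    return [[[(abs(c) >> plane) & 1 for c in row] for row in block]
            for plane in range(top - 1, -1, -1)]


def coding_pass_schedule(num_planes):
    """List (bit-plane index, pass name) in coding order.

    The most significant plane gets only the cleanup pass; every
    remaining plane gets all three passes.
    """
    schedule = []
    for index in range(num_planes):
        plane = num_planes - 1 - index
        passes = PASSES[-1:] if index == 0 else PASSES
        schedule.extend((plane, name) for name in passes)
    return schedule


block = [[5, -3], [0, 1]]          # toy 2x2 code block
planes = magnitude_bit_planes(block)
schedule = coding_pass_schedule(len(planes))
```

For a block with P non-empty bit-planes the schedule has 3P - 2 entries, so the toy block here (P = 3) needs seven passes.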

  8. Tier-1 Encoding Hardware Architecture • Challenges presented by Tier-1 encoding • Serial process: creation of current MQ context data directly depends upon previous pass results • “Bursty” communication: contextual data from a pass arrives in short, semi-continuous bursts • Large amounts of data and flags must be stored through multiple iterations of algorithm, requiring high memory bandwidth • Internal architecture (high-level) • Retrieve current stripe from memory for processing • Data is operated on in a pipelined fashion through registers • Context and data information sent to queues • Serializing agent: arithmetic entropy encoder • MQ Input Controller regulates input to arithmetic entropy encoder, ensuring correct operation • Data from arithmetic entropy encoder is written to a separate, final buffer • Note: the design decision to use the MQ encoder as serializing agent saves area and BlockRAM space without sacrificing too much performance.
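
The stripe retrieval step above follows the scan pattern that JPEG-2000 defines for Tier-1 coding: a code block is visited in stripes of four rows, and within a stripe the samples are read column by column. The generator below is a minimal sketch of that visiting order only. The function name is illustrative, and nothing here reflects the authors' register or queue design.

```python
def stripe_scan_order(height, width, stripe_height=4):
    """Yield (row, col) positions of a code block in stripe-oriented scan order.

    Stripes are stripe_height rows tall (four in JPEG-2000); the last
    stripe is shorter when height is not a multiple of stripe_height.
    """
    for top in range(0, height, stripe_height):
        bottom = min(top + stripe_height, height)
        for col in range(width):
            for row in range(top, bottom):
                yield (row, col)
```

Every coding pass walks the block in this order, which is one reason the per-sample state flags have to be kept between passes.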

  9. Target HPEC Platform • High-Perf. Embedded Computing: Nallatech BenNUEY w/ BenBLUE-II • Three FPGAs (all Xilinx Virtex2 6000, -4 speed grade) • Single “user” FPGA on BenNUEY PCI board • Dual FPGAs on BenBLUE-II daughter card • Low bandwidth to system memory through 64-bit/66 MHz PCI bus connection • Large memory storage capability with 12 MB SRAM (166 MHz, ZBT) • Advantages/Disadvantages • High configuration time (PCI bus + chained JTAG interface) • Large memory storage helps alleviate strain on PCI bus • Very good IO interface support with proprietary tools * Diagram shown here only reflects those buses actually used in the design; other communication schemes are available.

  10. DWT Single FPGA Results • [Slide graphic: results for single DWT module design on BenNUEY board operating at 80 MHz] • [Slide graphic: resource utilization on Virtex2 6000 -4] • [Slide graphic: results for eight-DWT-module design on BenNUEY board operating at 40 MHz] • Note: software solution comes from execution on server with 2.4 GHz Xeon CPU

  11. Tier-1 Encoding Current Results • [Slide graphic: results for Tier-1 module design on BenNUEY board operating at 90 MHz] • Note: software solution comes from execution on server with 2.4 GHz Xeon processor • Profiling shows performance projections with DMA transfer times included. * Results synthesized with Synplify Pro 7.7.1, PAR with Xilinx ISE 6.3

  12. Conclusions from HPEC Platform • Multi-chip system offers resources for increased parallelism or a multi-component application • Order of magnitude improvement in total computation time • Faster computation times on FPGA • But communication overhead severely hinders performance improvement • Low-bandwidth PCI interconnect not amenable to designs with challenging memory demands

  13. Target HPC Platform: SGI Altix w/ RASC Extension • High-Performance Computing: SGI Altix 350 with FPGA Brick • Single FPGA: Virtex2 6000 (-6 speed grade) • Approximately 33% of chip used for SGI’s RASC system layer • Two algorithm clock speeds: 200 MHz and 100 MHz • High bandwidth to system memory through proprietary NUMAlink interconnect (12.8 GB/s) through Scalable System Port (6.4 GB/s) • 3 banks of QDR SRAM (6 MB each) with a full bandwidth of 9.6 GB/s (1.6 GB/s for each read and write) • Advantages/Disadvantages • Extremely low reconfiguration time • High memory bandwidth greatly helps memory-intensive apps, such as JPEG-2K * Diagram shown here only reflects those buses actually used in the design; other communication schemes are available.
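
The SRAM figures quoted above are consistent with one another if the 1.6 GB/s rate is read as applying per bank and per direction. That reading is an assumption; the slide does not spell it out.

```latex
\[
B_{\text{SRAM}}
  = 3~\text{banks} \times \bigl(1.6~\text{GB/s}_{\text{read}} + 1.6~\text{GB/s}_{\text{write}}\bigr)
  = 9.6~\text{GB/s}
\]
```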

  14. Performance Projections Profile shows projections for no-latency, infinite-bandwidth interconnect. • NUMAlink interconnect • Approximate order-of-magnitude improvement of transfers in similar designs • Mitigates communication overhead bottleneck
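
The order-of-magnitude figure above can be sanity-checked against the peak rates quoted for the two platforms. The sketch below assumes the HPEC board's PCI bus is 64 bits wide at 66 MHz and compares its theoretical peak with the 6.4 GB/s Scalable System Port rate. The 768 x 512, 24-bit image size is only an example, and sustained rates on real hardware would be lower than these peaks.

```python
PCI_BUS_WIDTH_BITS = 64        # assumed reading of the "64/66 MHz" PCI bus
PCI_CLOCK_HZ = 66e6
PCI_PEAK_BYTES_PER_S = PCI_BUS_WIDTH_BITS / 8 * PCI_CLOCK_HZ   # 528 MB/s peak

SSP_PEAK_BYTES_PER_S = 6.4e9   # Scalable System Port rate quoted for the Altix


def transfer_time_s(num_bytes, bytes_per_second):
    """Ideal transfer time, ignoring latency and protocol overhead."""
    return num_bytes / bytes_per_second


image_bytes = 768 * 512 * 3    # example image size, not a measured workload
pci_time = transfer_time_s(image_bytes, PCI_PEAK_BYTES_PER_S)
ssp_time = transfer_time_s(image_bytes, SSP_PEAK_BYTES_PER_S)
peak_ratio = SSP_PEAK_BYTES_PER_S / PCI_PEAK_BYTES_PER_S

print(f"PCI: {pci_time * 1e3:.3f} ms, SSP: {ssp_time * 1e3:.3f} ms, "
      f"ratio: {peak_ratio:.1f}x")
```

Under these assumptions the peak ratio is roughly 12, which is in line with the approximate order-of-magnitude improvement in transfers cited on the slide.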

  15. Lessons Learned and Conclusions • Lessons Learned • HW/SW codesign • Shared-memory systems more amenable to closely-coupled processing associated with communication-sensitive RC applications • PCI boards for servers effective when tasks are offloaded for processing with minimal or masked communication • Memory bandwidth constrains parallelism in DWT design • Serializing agent (arithmetic coder) in Tier-1 design is key limit to performance improvement • Conclusions • Identifying and accelerating key components yields better system performance (with a wary eye on Amdahl’s Law) • Performance enhancements achieved mostly through functional parallelism due to sequential processing constraints
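
The reference to Amdahl's Law above can be tied to the profiling result reported earlier in the talk, where DWT and Tier-1 together account for more than 90% of software execution time. The sketch below uses 0.90 as the accelerated fraction; the kernel speedup values in the loop are illustrative inputs, not measured results.

```python
def amdahl_speedup(accelerated_fraction, kernel_speedup):
    """Overall speedup when only part of the run time is accelerated."""
    serial_part = 1.0 - accelerated_fraction
    return 1.0 / (serial_part + accelerated_fraction / kernel_speedup)


def amdahl_limit(accelerated_fraction):
    """Upper bound on overall speedup as the kernel speedup grows without limit."""
    return 1.0 / (1.0 - accelerated_fraction)


fraction = 0.90   # share of time in DWT + Tier-1, from the profiling slide
for kernel_speedup in (2, 10, 100):   # illustrative values only
    overall = amdahl_speedup(fraction, kernel_speedup)
    print(f"kernel {kernel_speedup:>3}x -> overall {overall:.2f}x")
print(f"limit -> {amdahl_limit(fraction):.1f}x")
```

With 90% of the time accelerated, the whole application cannot run more than about 10 times faster however fast the FPGA kernels become. This is why the remaining stages (MCT and Tier-2) and the communication overhead matter for a full system.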

  16. Future Work and Acknowledgments • Future Work: • Full system implementation on SGI Altix with RASC • Region of Interest capability • Lossy encoding and rate capability • MCT and Tier-2 encoding on FPGA as well • Single FPGA JPEG-2000 encoding application • Acknowledgments • We wish to thank the following vendors for equipment and/or tools in support of this research: • SGI • Nallatech • Xilinx • Aldec • Special thanks to SGI Digital Media group, SGI RASC engineers for their help and suggestions

  17. References
      [1] Adams, M.D. and Ward, R.K., “JasPer: a portable flexible open-source software tool kit for image coding/processing”, in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP’04), pp. 241-244, May 2004.
      [2] OpenJPEG. http://www.opegjpeg.org/
      [3] Liu, L., Li, D., Li, Z., Wang, Z. and Chen, H., “A VLSI architecture of EBCOT encoder for JPEG2000”, in 5th International Conference on ASIC, pp. 882-885, Oct. 2003.
      [4] Chen, K., Lian, C., Chen, H., and Chen, L., “Analysis and architecture design of EBCOT for JPEG-2000”, in IEEE International Symposium on Circuits and Systems, vol. 2, pp. 765-768, May 2001.
      [5] Van Buren, D., “A high-rate JPEG2000 compression system for space”, in IEEE Aerospace Conference, March 2005.
      [6] Aouadi, I. and Hammami, O., “Analysis and hardware design of a scalable dual JPEG-2000 entropy coder”, in Euromicro Symposium on Digital System Design (DSD 2004), pp. 227-233, Sept. 2004.
      [7] Gangadhar, M. and Bhatia, D., “FPGA based EBCOT architecture for JPEG 2000”, in IEEE International Conference on Field-Programmable Technology (FPT’03), pp. 228-233, Dec. 2003.
      [8] Hung, K., Huang, Y., Truong, T. and Wang, C., “FPGA implementation for 2D discrete wavelet transform”, in Electronics Letters, pp. 639-640, April 1998.
      [9] Lakshminarayanan, G., Venkataramani, B., Senthil Kumar, J., Yousuf, A.K. and Sriram, G., “Design and FPGA implementation of image block encoders with 2D-DWT”, in Conference on Convergent Technologies for Asia-Pacific Region (TENCON 2003), pp. 1015-1019, Oct. 2003.
      [10] McCanny, P., Masud, S. and McCanny, J., “Design and implementation of the symmetrically extended 2-D wavelet transform”, in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP’02), vol. 3, pp. 3108-3111, May 2002.
      [11] Taubman, D., “High performance scalable image compression with EBCOT”, in IEEE Trans. Image Processing, vol. 9, pp. 1158-1170, July 2000.
      [12] Richardson, I.E.G., Video Codec Design: Developing Image and Video Compression Systems. Chichester, West Sussex, New York: John Wiley and Sons, Ltd (UK), 2002.
      [13] Acharya, T. and Tsai, P.-S., JPEG 2000 Standard for Image Compression: Concepts, Algorithms, and VLSI Architectures. Hoboken, New Jersey: John Wiley and Sons, Inc., 2005.
