Design and Tradeoff Analysis of JPEG-2000 on Hardware-Reconfigurable Systems
Ryan DeVille, Vikas Aggarwal, Ian Troxel, and Alan D. George
High-performance Computing and Simulation (HCS) Research Laboratory
Department of Electrical and Computer Engineering
University of Florida

Introduction
• JPEG-2000 Encoding
  • State-of-the-art low bit-rate compression algorithm
  • Progressive transmission by quality, resolution, component, or spatial locality
  • Spatially random access to bitstream
  • Region of interest coding
• Motivation for porting JPEG-2000 to RC systems
  • High-performance and low-cost solution is attractive for airborne and satellite imaging systems
  • Speedup readily available with fine-grain and coarse-grain parallelism opportunities

Related Research
• EBCOT encoder designs
  • Group of Column optimization method
  • Previous RC designs
    • Space systems prototype [5]
    • Scalable entropy encoder [6]
    • Dual processing elements architecture [7]
• 2D Discrete Wavelet Transform designs
  • Several mimic early VLSI designs [8, 9]
  • Multiple architecture design classifications [10]
    • Direct
      • 1D transform, transpose, then another 1D transform
      • Intrinsically slow
    • Separate serial and parallel filters, or parallel row and parallel column filters
      • Processes along rows and columns
      • Represents significant performance improvement
    • Symmetrically extended
      • Improves processing efficiency, especially towards center of image

JPEG-2000 Encoder Design & Development
• Software code profiling first used to determine effort distribution
  • Previous research efforts show that DWT and Tier-1 encoding consume 80-85% of execution time
  • Current profiling results with JasPer and OpenJPEG show that >90% of execution time is spent in DWT and Tier-1
• Benchmark images selected from Kodak Lossless True Color Image Suite, JasPer benchmark images, and standard image processing images (lena, etc.)
[Figure: JasPer execution time profile]

Discrete Wavelet Transform (DWT)
• Features
  • Second-most computationally intensive block in compression process
  • Transforms each component's tile data into coefficients
  • Reversible transform involves only integer operations
  • Represents high- and low-frequency components of image
  • Amenable to compression: results in better compression ratios
  • Recursive application yields frequency bands at multiple resolutions
• Operation
  • 2D transform achieved by successively applying 1D transform in X and Y directions
  • Each 1D transform consists of
    • Filtering step
    • De-interleave step: reorganizing of data into bands
  • Available data and functional parallelism can be exploited
[Figure: three-level subband decomposition (a3LL, a3HL, a3LH, a3HH, a2HL, a2LH, a2HH, a1HL, a1LH, a1HH)]
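
The filtering and de-interleave steps above can be illustrated with a minimal software sketch. It assumes the reversible integer 5/3 lifting filter with symmetric boundary extension, which is the reversible transform JPEG-2000 defines; the slides do not state which filter the hardware implements. The function names and the row-then-column order are illustrative choices, not the authors' design.

```python
def fwd53_1d(x):
    """One level of the reversible 5/3 lifting transform on a sequence of ints.

    Returns (low, high): the de-interleaved low-pass and high-pass bands.
    """
    if len(x) < 2:
        return list(x), []
    even, odd = x[0::2], x[1::2]          # split samples by position
    ne, no = len(even), len(odd)
    # Predict step (filtering): high-pass from odd samples.
    # min(...) implements symmetric extension at the right edge.
    high = [odd[k] - ((even[k] + even[min(k + 1, ne - 1)]) >> 1)
            for k in range(no)]
    # Update step (filtering): low-pass from even samples.
    # max/min implement symmetric extension at both edges.
    low = [even[k] + ((high[max(k - 1, 0)] + high[min(k, no - 1)] + 2) >> 2)
           for k in range(ne)]
    return low, high


def inv53_1d(low, high):
    """Exact inverse of fwd53_1d: undo update, undo predict, re-interleave."""
    ne, no = len(low), len(high)
    if no == 0:
        return list(low)
    even = [low[k] - ((high[max(k - 1, 0)] + high[min(k, no - 1)] + 2) >> 2)
            for k in range(ne)]
    odd = [high[k] + ((even[k] + even[min(k + 1, ne - 1)]) >> 1)
           for k in range(no)]
    x = [0] * (ne + no)
    x[0::2], x[1::2] = even, odd
    return x


def fwd53_2d(img):
    """One 2D level: 1D transform on every row, then on every column.

    Output layout: LL top-left, horizontal-high band top-right,
    vertical-high band bottom-left, HH bottom-right.
    """
    row_pass = [lo + hi for lo, hi in (fwd53_1d(r) for r in img)]
    col_pass = [lo + hi for lo, hi in (fwd53_1d(c) for c in zip(*row_pass))]
    return [list(r) for r in zip(*col_pass)]   # transpose back to rows


if __name__ == "__main__":
    signal = [3, -7, 12, 0, 5]
    low, high = fwd53_1d(signal)
    print(low, high, inv53_1d(low, high) == signal)
```

Because every step uses only integer additions and shifts, the inverse recovers the input exactly, which is what makes the transform usable for lossless coding. Applying `fwd53_2d` again to the LL quadrant gives the next resolution level.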

DWT Hardware Architecture
• Challenges presented by DWT
  • Parallel processing limited by memory bandwidth requirements
  • Some of the processing is inherently sequential
• Design features
  • Data-level parallelism exploited by operating on multiple “tiles”
  • Function-level parallelism exploited by pipelining different processing steps
  • Data reuse eliminates extra read cycles
• Internal architecture
  • Each tile is stored entirely in a single Block RAM to minimize data movement
  • Overlapped processing to further reduce latency

Embedded Block Coding with Optimized Truncation (EBCOT): Tier-1
• Features
  • Specially adapted arithmetic coder
  • Four bit-plane coding primitives
  • Three coding passes for each bit-plane (except the most significant)
• Operation
  • Coding passes: cleanup pass (CUP) begins at most significant bit-plane
  • Iteratively perform coding passes over remaining bit-planes
  • Context and bit data generated by the coding passes are serially encoded and compressed by the arithmetic encoder
  • Flush and reset arithmetic coder at completion
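
The pass ordering described above can be sketched in a few lines. This sketch only enumerates the schedule of passes and the stripe scan order; it does not perform context formation or arithmetic coding. The pass abbreviations SPP and MRP (significance propagation and magnitude refinement), the 4-row stripe height, and all function names are assumptions taken from the JPEG-2000 standard's usual description rather than from the slides.

```python
def coding_pass_schedule(coeffs):
    """Return [(bit_plane, pass_name), ...] in coding order for one code-block.

    coeffs is a 2D list of signed integer wavelet coefficients.
    The most significant bit-plane gets only a cleanup pass (CUP);
    every lower bit-plane gets SPP, MRP, then CUP.
    """
    max_mag = max(abs(c) for row in coeffs for c in row)
    num_planes = max_mag.bit_length()      # 0 if the block is all zeros
    schedule = []
    for plane in range(num_planes - 1, -1, -1):
        if plane != num_planes - 1:
            schedule.append((plane, "SPP"))
            schedule.append((plane, "MRP"))
        schedule.append((plane, "CUP"))
    return schedule


def magnitude_bit(coeff, plane):
    """Bit of |coeff| at the given bit-plane (sign is coded separately)."""
    return (abs(coeff) >> plane) & 1


def stripe_scan(height, width, stripe_height=4):
    """Yield (row, col) in stripe order: stripes top to bottom,
    columns left to right within a stripe, rows top to bottom within a column."""
    for top in range(0, height, stripe_height):
        for col in range(width):
            for row in range(top, min(top + stripe_height, height)):
                yield row, col


if __name__ == "__main__":
    block = [[5, -3], [0, 1]]
    for plane, name in coding_pass_schedule(block):
        print(plane, name)
    print(list(stripe_scan(5, 2)))
```

For a block with N non-empty bit-planes the schedule has 3N - 2 passes, and each pass visits every sample position in stripe order, which is why the amount of context data handed to the arithmetic coder varies from pass to pass.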

Tier-1 Encoding Hardware Architecture
• Challenges presented by Tier-1 encoding
  • Serial process: creation of current MQ context data directly depends upon previous pass results
  • “Bursty” communication: contextual data from a pass arrives in short, semi-continuous bursts
  • Large amounts of data and flags must be stored through multiple iterations of the algorithm, requiring high memory bandwidth
• Internal architecture (high-level)
  • Retrieve current stripe from memory for processing
  • Data is operated on in a pipelined fashion through registers
  • Context and data information sent to queues
  • Serializing agent: arithmetic entropy encoder
  • MQ Input Controller regulates input to arithmetic entropy encoder, ensuring correct operation
  • Data from arithmetic entropy encoder is written to a separate, final buffer
Note: the design decision to use the MQ encoder as the serializing agent saves area and BlockRAM space without sacrificing too much performance.

Target HPEC Platform
• High-Performance Embedded Computing: Nallatech BenNUEY w/ BenBLUE-II
  • Three FPGAs (all Xilinx Virtex2 6000, -4 speed grade)
    • Single “user” FPGA on BenNUEY PCI board
    • Dual FPGAs on BenBLUE-II daughter card
  • Low bandwidth to system memory through 64-bit/66 MHz PCI bus connection
  • Large memory storage capability with 12 MB SRAM (166 MHz, ZBT)
• Advantages/Disadvantages
  • High configuration time (PCI bus + chained JTAG interface)
  • Large memory storage helps alleviate strain on PCI bus
  • Very good I/O interface support with proprietary tools
* Diagram shown here only reflects those buses actually used in the design; other communication schemes are available.

DWT Single FPGA Results
[Table: results for single DWT module design on BenNUEY board operating at 80 MHz]
Note: software solution comes from execution on a server with a 2.4 GHz Xeon CPU
[Table: resource utilization on Virtex2 6000 (-4 speed grade)]
[Table: results for eight-DWT-module design on BenNUEY board operating at 40 MHz]

Tier-1 Encoding Current Results
[Table: results for Tier-1 module design on BenNUEY board operating at 90 MHz]
Note: software solution comes from execution on a server with a 2.4 GHz Xeon processor
[Figure: profile showing performance projections with DMA transfer times included]
* Results synthesized with Synplify Pro 7.7.1, PAR with Xilinx ISE 6.3

Conclusions from HPEC Platform
• Multi-chip system offers resources for increased parallelism or a multi-component application
• Order of magnitude improvement in total computation time
  • Faster computation times on FPGA
  • But communication overhead severely hinders performance improvement
• Low-bandwidth PCI interconnect not amenable to designs with challenging memory demands
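
To see why communication overhead matters so much here, a back-of-the-envelope model helps. The sketch below assumes a simple non-overlapped model in which total accelerated time is FPGA compute time plus bus transfer time, and it uses the theoretical peak of a 64-bit, 66 MHz PCI bus (8 bytes x 66 MHz = 528 MB/s). The timings and data size in the example are hypothetical placeholders, not measurements from this work.

```python
# Theoretical peak of a 64-bit PCI bus clocked at 66 MHz (sustained rates are lower).
PCI_PEAK_BYTES_PER_S = 8 * 66e6


def effective_speedup(t_software_s, t_fpga_compute_s, bytes_moved,
                      bandwidth_bytes_per_s=PCI_PEAK_BYTES_PER_S):
    """Speedup over software when transfers are not overlapped with compute."""
    t_transfer_s = bytes_moved / bandwidth_bytes_per_s
    return t_software_s / (t_fpga_compute_s + t_transfer_s)


if __name__ == "__main__":
    # Hypothetical numbers for illustration only.
    t_sw, t_hw = 1.0, 0.1            # seconds: software vs. FPGA compute
    moved = 24e6                     # bytes sent to and from the board
    print("compute only :", effective_speedup(t_sw, t_hw, 0))
    print("with PCI     :", effective_speedup(t_sw, t_hw, moved))
```

Even at the bus's theoretical peak, the transfer term erodes a tenfold compute speedup in this example, and real sustained PCI throughput is lower than the peak used here.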

Target HPC Platform: SGI Altix w/ RASC Extension
• High-Performance Computing: SGI Altix 350 with FPGA Brick
  • Single FPGA: Virtex2 6000 (-6 speed grade)
  • Approximately 33% of chip used for SGI’s RASC system layer
  • Two algorithm clock speeds: 200 MHz and 100 MHz
  • High bandwidth to system memory over proprietary NUMAlink interconnect (12.8 GB/s) via Scalable System Port (6.4 GB/s)
  • 3 banks of QDR SRAM (6 MB each) with a full bandwidth of 9.6 GB/s (1.6 GB/s each for read and write per bank)
• Advantages/Disadvantages
  • Extremely low reconfiguration time
  • High memory bandwidth greatly helps memory-intensive apps, such as JPEG-2000
* Diagram shown here only reflects those buses actually used in the design; other communication schemes are available.

Performance Projections
[Figure: profile showing projections for a no-latency, infinite-bandwidth interconnect]
• NUMAlink interconnect
  • Approximate order-of-magnitude improvement of transfers in similar designs
  • Mitigates communication overhead bottleneck

Lessons Learned and Conclusions
• Lessons Learned
  • HW/SW codesign
    • Shared-memory systems more amenable to closely-coupled processing associated with communication-sensitive RC applications
    • PCI boards for servers effective when tasks are offloaded for processing with minimal or masked communication
  • Memory bandwidth constrains parallelism in DWT design
  • Serializing agent (arithmetic coder) in Tier-1 design is key limit to performance improvement
• Conclusions
  • Identifying and accelerating key components yields better system performance (with a wary eye on Amdahl’s Law)
  • Performance enhancements achieved mostly through functional parallelism due to sequential processing constraints
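
The Amdahl's Law caveat can be made concrete with a short calculation. The sketch takes the accelerated fraction as 0.9, matching the ">90% of execution time" profiling figure for DWT and Tier-1 reported earlier in the deck; the per-component acceleration factors are hypothetical.

```python
def amdahl_speedup(accel_fraction, accel_factor):
    """Overall speedup when accel_fraction of runtime is sped up by accel_factor."""
    return 1.0 / ((1.0 - accel_fraction) + accel_fraction / accel_factor)


def amdahl_limit(accel_fraction):
    """Upper bound on overall speedup as accel_factor grows without bound."""
    return 1.0 / (1.0 - accel_fraction)


if __name__ == "__main__":
    fraction = 0.9                      # share of runtime in DWT + Tier-1
    for factor in (2, 10, 100):         # hypothetical hardware speedups
        print(factor, round(amdahl_speedup(fraction, factor), 2))
    print("limit:", round(amdahl_limit(fraction), 2))
```

With 90% of the runtime accelerated, the whole encoder can never run more than about ten times faster, however fast the FPGA kernels become, so the remaining stages (such as MCT and Tier-2) eventually dominate.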

Future Work and Acknowledgments
• Future Work
  • Full system implementation on SGI Altix with RASC
    • Region of Interest capability
    • Lossy encoding and rate capability
    • MCT and Tier-2 encoding on FPGA as well
  • Single FPGA JPEG-2000 encoding application
• Acknowledgments
  • We wish to thank the following vendors for equipment and/or tools in support of this research:
    • SGI
    • Nallatech
    • Xilinx
    • Aldec
  • Special thanks to the SGI Digital Media group and SGI RASC engineers for their help and suggestions

References
[1] Adams, M.D. and Ward, R.K., “JasPer: a portable flexible open-source software tool kit for image coding/processing”, in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP’04), pp. 241-244, May 2004.
[2] OpenJPEG. http://www.opegjpeg.org/
[3] Liu, L., Li, D., Li, Z., Wang, Z., and Chen, H., “A VLSI architecture of EBCOT encoder for JPEG2000”, in 5th International Conference on ASIC, pp. 882-885, Oct. 2003.
[4] Chen, K., Lian, C., Chen, H., and Chen, L., “Analysis and architecture design of EBCOT for JPEG-2000”, in IEEE International Symposium on Circuits and Systems, vol. 2, pp. 765-768, May 2001.
[5] Van Buren, D., “A high-rate JPEG2000 compression system for space”, in IEEE Aerospace Conference, March 2005.
[6] Aouadi, I. and Hammami, O., “Analysis and hardware design of a scalable dual JPEG-2000 entropy coder”, in Euromicro Symposium on Digital System Design (DSD 2004), pp. 227-233, Sept. 2004.
[7] Gangadhar, M. and Bhatia, D., “FPGA based EBCOT architecture for JPEG 2000”, in IEEE International Conference on Field-Programmable Technology (FPT’03), pp. 228-233, Dec. 2003.
[8] Hung, K., Huang, Y., Truong, T., and Wang, C., “FPGA implementation for 2D discrete wavelet transform”, in Electronics Letters, pp. 639-640, April 1998.
[9] Lakshminarayanan, G., Venkataramani, B., Senthil Kumar, J., Yousuf, A.K., and Sriram, G., “Design and FPGA implementation of image block encoders with 2D-DWT”, in Conference on Convergent Technologies for Asia-Pacific Region (TENCON 2003), pp. 1015-1019, Oct. 2003.
[10] McCanny, P., Masud, S., and McCanny, J., “Design and implementation of the symmetrically extended 2-D wavelet transform”, in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP’02), vol. 3, pp. 3108-3111, May 2002.
[11] Taubman, D., “High performance scalable image compression with EBCOT”, in IEEE Trans. Image Processing, vol. 9, pp. 1158-1170, July 2000.
[12] Richardson, I.E.G., Video Codec Design: Developing Image and Video Compression Systems. Chichester, West Sussex, New York: John Wiley and Sons, Ltd (UK), 2002.
[13] Acharya, T. and Tsai, P.-S., JPEG 2000 Standard for Image Compression: Concepts, Algorithms, and VLSI Architectures. Hoboken, New Jersey: John Wiley and Sons, Inc., 2005.