Data Compression for PDS4
Lisa Gaddis, Sue LaVoie, Jeff Anderson, Elizabeth Rye
PDS Imaging Node
March 26, 2010
Syntax
• Data Compression
  • Encodes information using fewer bits
  • Reduces consumption of expensive resources
    • Data storage and/or transmission bandwidth
  • Requires decompression
  • Trade-offs
    • degree of compression
    • amount of ‘distortion’ introduced
    • computational resources required for decompression
• Image Compression
  • Application of data compression to digital images
  • Reduces redundancy in images to improve efficiency of storage and transmission
  • Lossless and lossy methods
  • Preserve image quality at a given bit- or compression-rate
• File Compression
  • Reduces redundancy at the file level
  • Many available tools
    • ZIP
    • GZIP
    • BZIP2
Data Compression
Why image compression?
• Image compression for data providers and archivists
  • NASA missions deliver significant numbers of large image files
  • Need to support and/or reduce storage costs and data transmission times of images
  • Promotes exchange between different users and systems
  • Although falling in cost, storage is expensive for many TB of data and multiple copies
    • FY10: ~$750/TB for RAID storage with network infrastructure
Image Compression
• Lossless compression
  • Exploits data redundancy
  • Image can be recovered exactly
  • ‘Run-length encoding’ makes use of redundant patterns or ‘runs’
  • ‘LZW (Lempel-Ziv-Welch) encoding’ also addresses strings of characters; it builds up a table of strings and their corresponding codes
  • ‘Huffman coding’ uses a binary encoding tree to represent commonly occurring values in few bits and less frequently occurring values in more bits
  • Best for documents, computer programs, line drawings, etc.
  • JPEG2000 has a lossless option, approved for use by PDS
• Lossy compression
  • Exploits data redundancy and ‘irrelevant’ data
  • Image data are not recovered exactly
  • JPEG
  • JPEG2000 (lossy)
  • Best for digital images, audio, video
  • Not approved for PDS archive data
    • Exceptions: Browse and some EDR images (e.g., Clementine UVVIS and NIR) are lossy JPEG images (average compression ratio of 5.5)
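The run-length idea on this slide can be illustrated with a minimal Python sketch. It is not the PDS3 Run Length format (whose byte layout is not described here); the function names `rle_encode` and `rle_decode` and the sample row are hypothetical, chosen only to show that each run collapses to a (count, value) pair and that the original data come back exactly.

```python
from itertools import groupby

def rle_encode(pixels):
    """Collapse each run of identical values into a (count, value) pair."""
    return [(len(list(run)), value) for value, run in groupby(pixels)]

def rle_decode(pairs):
    """Expand (count, value) pairs back into the original sequence."""
    out = []
    for count, value in pairs:
        out.extend([value] * count)
    return out

# Hypothetical image row with runs of null (0) pixels
row = [0, 0, 0, 0, 7, 7, 3, 0, 0, 0]
encoded = rle_encode(row)
print(encoded)  # [(4, 0), (2, 7), (1, 3), (3, 0)]
assert rle_decode(encoded) == row  # lossless: the row is recovered exactly
```

Ten values shrink to four pairs here because the row is dominated by runs; data without runs would grow instead, which is one reason the choice of method depends on the data.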
MRO and LRO images
• Not your typical images
• MESSENGER MDIS, Viking Orbiter, Galileo SSI, etc.
  • Framing cameras
  • 800 samples x 800 lines to 1024 samples x 1024 lines
  • Roughly one megabyte (MB) per observation
  • PDS Imaging Node combined archive requirement for all missions other than LRO and MRO is <25 TB
• MRO/HiRISE, LRO/LROC
  • Line-scan cameras
  • 10,000-20,000 samples x 50,000-100,000 lines
  • Roughly 500 to 2,000 MB per observation
  • Combined expected archive total for MRO and LRO is 500 TB
  • 20X larger than sum total of all other Imaging Node holdings
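The per-observation sizes quoted above follow from simple arithmetic. The sketch below assumes one byte per pixel and 1 MB = 10^6 bytes; neither is stated on the slide, but together they reproduce its round numbers. The helper name `raw_size_mb` is illustrative.

```python
def raw_size_mb(samples, lines, bytes_per_pixel=1):
    """Uncompressed image size in megabytes (assuming 1 MB = 10**6 bytes)."""
    return samples * lines * bytes_per_pixel / 1e6

# Framing cameras
print(raw_size_mb(800, 800))         # 0.64
print(raw_size_mb(1024, 1024))       # 1.048576
# Line-scan cameras
print(raw_size_mb(10_000, 50_000))   # 500.0
print(raw_size_mb(20_000, 100_000))  # 2000.0
```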
Image Compression for HiRISE RDRs
• Why image compression was needed
  • Enormous volume of HiRISE archive, 1 yr
    • EDR – 12,100 Gb (~1.5 TB)
    • RDR – 92,500 Gb (11.3 TB)
  • Very large Standard Data Products
    • EDR (2048 x 64,000, 16-bit) = 262 MB
    • RDR (40,000 x 64,000, after reprojection, 16-bit) = ~500 to 1000 MB
• Advantages for delivery of RDR data in JPEG2000 format
  • Losslessly recompressed format
  • Wavelet compression greatly improves speed of web access
  • Fast browse, zoom, pan capabilities for handling large files
• Volume projections
  • EDR DVD volumes: 321 (losslessly recompressed) vs 482 (uncompressed) (1.5 compression ratio)
  • RDR DVD volumes: 2400 (losslessly compressed) vs 7300 (uncompressed) (assuming 3.0 compression ratio)
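The figures on this slide can be cross-checked with a few lines of Python. This is only a sketch: it assumes "Gb" means gigabits, 8 bits per byte, and 1 TB = 1024 GB (the convention that reproduces the 11.3 TB figure); the helper names are hypothetical.

```python
def gigabits_to_terabytes(gigabits):
    """Gigabits -> terabytes, assuming 8 bits per byte and 1 TB = 1024 GB."""
    return gigabits / 8 / 1024

def compression_ratio(uncompressed, compressed):
    """Uncompressed size (or volume count) divided by the compressed one."""
    return uncompressed / compressed

print(round(gigabits_to_terabytes(12_100), 1))  # 1.5  (EDR, TB per year)
print(round(gigabits_to_terabytes(92_500), 1))  # 11.3 (RDR, TB per year)
print(round(2048 * 64_000 * 2 / 1e6))           # 262  (EDR product, MB at 2 bytes per pixel)
print(round(compression_ratio(482, 321), 1))    # 1.5  (EDR DVD volumes)
print(round(compression_ratio(7300, 2400), 1))  # 3.0  (RDR DVD volumes)
```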
HiRISE Example
• JPEG2000 image compression applied to map-projected RDR images only
  • Lots of null pixels
  • Nulls are highly compressed as a result of the lossless compression using JPEG2000
• Projected ~3:1 compression ratios
  • Achieved 15:1 in recent tests
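Why null pixels raise the compression ratio can be shown with a small experiment. The sketch below does not use JPEG2000 (Python's standard library has no JPEG2000 codec); it substitutes zlib's DEFLATE as a stand-in lossless compressor, and the "image" is hypothetical random noise padded with zeros, so the resulting numbers say nothing about real HiRISE products.

```python
import random
import zlib

rng = random.Random(0)  # fixed seed so the run is repeatable
n_pixels = 300_000

# Hypothetical 8-bit data: noise stands in for valid (non-null) pixels.
valid = bytes(rng.randrange(1, 256) for _ in range(n_pixels))

# The same valid pixels padded with twice as many null (zero) pixels,
# as might surround a map-projected image.
padded = valid + bytes(2 * n_pixels)

def ratio(data):
    """Uncompressed size divided by DEFLATE-compressed size."""
    return len(data) / len(zlib.compress(data, 9))

print(f"valid only: {ratio(valid):.2f}:1")
print(f"with nulls: {ratio(padded):.2f}:1")
assert zlib.decompress(zlib.compress(padded, 9)) == padded  # lossless
```

The noise itself is essentially incompressible, so nearly all of the gain in the padded case comes from the run of nulls collapsing to almost nothing.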
Past Experience
• Problems with compression
  • Voyager, Viking, and MGS-MOC PDS archives contain losslessly compressed data
  • Decompression algorithms (e.g., in ISIS) break due to
    • New compilers
    • New operating systems
    • Changes in hardware architecture (32-bit vs 64-bit)
  • JPEG2000 compressed HiRISE RDR images are supported by ISIS3
    • But, when JPEG2000 format reaches end-of-life, software maintenance to read this format will be much more difficult than the existing Voyager/Viking/MGS-MOC algorithms
  • A proliferation of image compression formats in PDS would be a problem for long-term archiving and usability of the images
Data Storage Costs: MRO & LRO
• Expected PDS storage requirements for the MRO nominal mission are 75 TB
  • High capacity RAID storage & network infrastructure costs ~$750 per TB
  • The hardware cost to store a single copy of the MRO data is ~$56K
    • Only one copy of the three required by PDS
    • Does not include data from an extended mission
    • Archive includes JPEG2000 compressed images
• LRO archive volume is projected to be ~400 TB
  • Hardware cost for one copy is ~$300K
  • Same caveats as above apply
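The cost figures above are volume times unit price. A short sketch makes the arithmetic explicit; the three-copy totals assume every copy is stored at the same ~$750/TB rate, which the slide does not state, and the names used are illustrative.

```python
COST_PER_TB = 750    # USD, FY10 RAID storage with network infrastructure
COPIES_REQUIRED = 3  # PDS requires three copies

def hardware_cost(volume_tb, copies=1):
    """Hardware cost in USD for the given volume and number of copies."""
    return volume_tb * COST_PER_TB * copies

print(hardware_cost(75))                    # 56250  (MRO, one copy, ~$56K)
print(hardware_cost(400))                   # 300000 (LRO, one copy, ~$300K)
# Assumption: all three copies cost the same per TB
print(hardware_cost(75, COPIES_REQUIRED))   # 168750
print(hardware_cost(400, COPIES_REQUIRED))  # 900000
```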
PDS3 Compressed Image Formats
• Clem-JPEG (not in PDS Standards Reference)
• Huffman First Difference (not in PDS Standards Reference)
• JPEG2000
  • Improved compression efficiency (vs. JPEG)
  • Highly scalable embedded data streams
  • Progressive lossy to lossless compression within a single data stream
  • Arbitrarily crop images in the compressed domain
  • Selectively enhance quality of spatial “regions of interest”
  • Support for very large images
  • Used for HiRISE & LROC RDRs
• Previous Pixel (not in PDS Standards Reference)
• Run Length (not in PDS Standards Reference)
• Zip, gzip (GNU zip)
  • Widely used open-source tool
  • Runs on a variety of common computer platforms
  • Available since 1992
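As a rough illustration of how differencing and Huffman coding can work together, the sketch below stores each pixel as its difference from the previous one and then builds a Huffman code over those differences. It is a generic textbook construction, not the PDS3 "Huffman First Difference" format, whose details are not given here; all function names and the sample row are hypothetical.

```python
import heapq
from collections import Counter

def first_differences(pixels):
    """Keep the first pixel, then store each pixel minus its predecessor."""
    return [pixels[0]] + [b - a for a, b in zip(pixels, pixels[1:])]

def undo_differences(diffs):
    """Rebuild the pixels with a running sum of the differences."""
    pixels = [diffs[0]]
    for d in diffs[1:]:
        pixels.append(pixels[-1] + d)
    return pixels

def huffman_code(symbols):
    """Return {symbol: bit string}; frequent symbols get shorter codes."""
    counts = Counter(symbols)
    if len(counts) == 1:  # degenerate case: only one distinct symbol
        return {next(iter(counts)): "0"}
    # Heap entries: (weight, unique tie-breaker, {symbol: partial code})
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w0, _, left = heapq.heappop(heap)
        w1, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w0 + w1, tie, merged))
        tie += 1
    return heap[0][2]

# Hypothetical smoothly varying image row
row = [100, 101, 101, 102, 103, 103, 104, 104, 105, 106]
diffs = first_differences(row)[1:]  # the first pixel would be stored as-is
code = huffman_code(diffs)
bits = "".join(code[d] for d in diffs)
print(diffs)      # [1, 0, 1, 1, 0, 1, 0, 1, 1]
print(len(bits))  # 9
assert undo_differences([row[0]] + diffs) == row  # lossless round trip
```

Neighbouring pixels in a smooth image differ by small amounts, so the differences take only a few distinct values and receive very short codes: 9 bits here against 72 bits for the same nine pixels stored as raw bytes (ignoring the cost of storing the code table).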
Possible Solution for PDS4
• Allow File Compression
  • Use standard, non-patented algorithms (e.g., Lempel-Ziv 77, Huffman coding)
  • Use stable, open-source, well-maintained software (e.g., gzip)
• Tests using gzip, HiRISE data
  • RDRs
    • HiRISE RDR, JPEG2000 = 454 MB
    • Uncompressed, converted to raw format = 6.6 GB (15x larger)
    • Compressed using gzip = 1.1 GB (2.5x larger)
  • EDRs
    • Not compressed, typical file size = 250 MB
    • gzipped versions = 100 MB (2.5x smaller)
  • Overall the HiRISE archive would be 5% smaller
    • gzip EDRs
    • Convert RDRs to raw, then gzip
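File-level compression of the kind proposed here can be exercised with Python's standard `gzip` module. This is a minimal sketch: the byte pattern below is a hypothetical stand-in for a raw image file, so the size reduction it prints is not representative of HiRISE data.

```python
import gzip

# Hypothetical stand-in for a raw image file: a repetitive byte pattern.
raw = bytes(range(256)) * 4096  # 1 MiB

packed = gzip.compress(raw)         # gzip container around DEFLATE (LZ77 + Huffman)
restored = gzip.decompress(packed)

assert restored == raw  # lossless round trip
print(f"{len(raw)} bytes -> {len(packed)} bytes "
      f"({len(raw) / len(packed):.1f}x smaller)")
```

The same round trip works on files written by the `gzip` command-line tool, since the module reads and writes the same format.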
Recommendation
• Allow file-based compression (such as gzip, bzip2) in PDS4
  • Stable, free, widely used open-source software tool
  • Works on a variety of common computer platforms
    • Macs, PCs, Solaris, MS-DOS, VAX, etc.
  • Maintained by open-source community
  • Consistent with PDS3 history, PDS4 plans for simplification
  • Reduces storage costs
  • Improves data transfer rates over internet
  • Supports management and delivery of high-volume data sets for providers and users
Policy Questions
• Do we permit compression at all in the PDS4 archive?
• If so:
  • Do we want a mixture of compressed and uncompressed data?
    • One copy is uncompressed, two are compressed
  • Do we distinguish between EDRs and RDRs and other derived products?
  • Do we distinguish between frequently accessed data and those offline and/or in ‘deep archive’ storage?
    • Store deep archive data in uncompressed form or use one approved compression format (e.g., gzip)
    • Permit nodes to use and maintain other compression methods as needed for one or more copies
• Whatever we decide, do we require older, compressed data to be ‘restored’ to meet requirements of the new compression policy?