Integrating Online Compression to Accelerate Large-Scale Data Analytics Applications
Tekin Bicer†, Jian Yin‡, David Chiu, Gagan Agrawal†, and Karen Schuchardt‡ — †Ohio State University, Washington State University, ‡Pacific Northwest National Laboratory
Introduction • Scientific simulations and instruments can generate large amounts of data • E.g., the Global Cloud Resolving Model (GCRM): 1 PB of data at a 4 km grid-cell resolution • Higher resolutions produce even more data • I/O operations become the bottleneck • Problems: storage capacity and I/O performance • Our approach: compression
Motivation • Generic compression algorithms work well on low-entropy byte sequences • Scientific datasets are hard to compress • Floating-point numbers consist of an exponent and a mantissa, and the mantissa can be highly entropic • Using compression in applications is challenging: choosing a suitable compression algorithm, utilizing the available resources, and integrating the algorithm into the application
Outline • Introduction • Motivation • Compression Methodology • Online Compression Framework • Experimental Results • Related Work • Conclusion
Compression Methodology • Common properties of scientific datasets: multidimensional arrays, consisting of floating-point numbers, with strong relationships between neighboring values • Domain-specific solutions can exploit these properties • Approach: prediction-based differential compression — predict the values of neighboring cells and store only the difference
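The predict-and-store-difference idea can be sketched as follows. The slides do not give the exact predictor, so this is a minimal lossless illustration using a last-value predictor and XOR residuals over the IEEE 754 bit patterns (function names are ours, not the framework's); when neighbors are close, the residuals have many leading zero bits, which a real encoder would pack compactly.

```python
import struct

def xor_compress(values):
    """Prediction-based differential compression sketch (lossless).

    Predictor: each value is predicted to equal its left neighbor.
    We store the XOR of the two bit patterns; correlated neighbors
    yield residuals with many leading zeros."""
    prev = 0
    residuals = []
    for v in values:
        bits = struct.unpack("<Q", struct.pack("<d", v))[0]
        residuals.append(bits ^ prev)   # small when the prediction is good
        prev = bits
    return residuals

def xor_decompress(residuals):
    """Invert xor_compress: rebuild each value from its predecessor."""
    prev = 0
    values = []
    for r in residuals:
        bits = r ^ prev
        values.append(struct.unpack("<d", struct.pack("<Q", bits))[0])
        prev = bits
    return values
```

Because XOR is its own inverse and `struct` round-trips doubles exactly, the scheme is fully lossless; a lossy variant would additionally drop low-order residual bits before packing.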
Example: GCRM Temperature Variable Compression • E.g., a temperature record: the values of neighboring cells are highly correlated • X′ table (after prediction) • X′′: compressed values, using 5 bits for the prediction code plus the difference • Supports both lossless and lossy compression • Fast, with good compression ratios
Compression Framework • Improve end-to-end application performance • Minimize application I/O time by pipelining I/O and (de)compression operations • Hide computational overhead by overlapping application computation with the compression framework • Easy implementation of different compression algorithms • Easy integration with applications through an API similar to POSIX I/O
A Compression Framework for Data-Intensive Applications • Chunk Resource Allocation (CRA) Layer: initializes the system; generates chunk requests and enqueues them for processing; converts original offset and size requests into their compressed counterparts • Parallel I/O Layer (PIOL): creates parallel chunk requests to the storage medium; each chunk request is handled by a group of threads; provides an abstraction over different data transfer protocols • Parallel Compression Engine (PCE): applies the encode()/decode() functions to chunks; manages an in-memory cache with informed prefetching; creates I/O requests
Compression Framework API • User-defined functions: encode_t(…): (R) code for compression; decode_t(…): (R) code for decompression; prefetch_t(…): (O) informed prefetching function • Applications use the following functions: • comp_read: applies decode_t to a compressed chunk • comp_write: applies encode_t to an original chunk • comp_seek: mimics fseek and also utilizes prefetch_t • comp_init: initializes the system (thread pools, cache, etc.)
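The slides list the API without signatures; the toy Python mock below (all names except encode_t/decode_t/prefetch_t, comp_read, and comp_write are ours, and the real framework presumably exposes a C interface) shows how the user-supplied callbacks plug into reads and writes of chunks.

```python
import zlib

class CompressionFramework:
    """Toy stand-in: the user registers encode/decode callbacks at
    init time; comp_write/comp_read apply them per chunk."""
    def __init__(self, encode_t, decode_t, prefetch_t=None):
        self.encode_t = encode_t        # (R) compression callback
        self.decode_t = decode_t        # (R) decompression callback
        self.prefetch_t = prefetch_t    # (O) informed prefetching hint
        self.storage = {}               # chunk_id -> compressed bytes

    def comp_write(self, chunk_id, data):
        """Compress a chunk with the user's encoder, then store it."""
        self.storage[chunk_id] = self.encode_t(data)

    def comp_read(self, chunk_id):
        """Load a compressed chunk and decode it for the application."""
        return self.decode_t(self.storage[chunk_id])

# Usage: here zlib stands in for a domain-specific codec.
fw = CompressionFramework(encode_t=zlib.compress, decode_t=zlib.decompress)
fw.comp_write(0, b"temperature chunk bytes")
```

Swapping zlib for a prediction-based encoder requires no change on the application side, which is the point of the callback design.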
Prefetching and In-Memory Cache • Overlap application-layer computation with I/O • Reusability of already-accessed data is low, so prospective chunks are prefetched and cached • Default replacement policy is LRU • For informed prefetching via prefetch(…), the user can analyze the access history and provide a list of prospective chunks • The cache uses a row-based locking scheme for efficient consecutive chunk requests
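A minimal sketch of the default-LRU cache with a user-supplied prefetch hint (class and method names are illustrative; the real cache also does row-based locking, omitted here):

```python
from collections import OrderedDict

class ChunkCache:
    """LRU chunk cache with an optional informed-prefetch hint."""
    def __init__(self, capacity, fetch, prefetch_hint=None):
        self.capacity = capacity
        self.fetch = fetch                  # loads + decodes a chunk on miss
        self.prefetch_hint = prefetch_hint  # chunk_id -> list of next ids
        self.cache = OrderedDict()

    def _put(self, cid, chunk):
        self.cache[cid] = chunk
        self.cache.move_to_end(cid)
        while len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict least recently used

    def get(self, cid):
        if cid in self.cache:
            self.cache.move_to_end(cid)     # refresh LRU position
            return self.cache[cid]
        chunk = self.fetch(cid)
        self._put(cid, chunk)
        if self.prefetch_hint:              # pull prospective chunks in early
            for nxt in self.prefetch_hint(cid):
                if nxt not in self.cache:
                    self._put(nxt, self.fetch(nxt))
        return chunk
```

With a hint like `lambda cid: [cid + 1]`, a sequential scan turns every second access into a cache hit, which is how prefetching overlaps I/O with application computation.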
Integration with a Data-Intensive Computing System • MapReduce-style API • Remote data processing • Sensitive to I/O bandwidth • Processes data in a local cluster, in the cloud, or in both (hybrid cloud)
Outline • Introduction • Motivation • Compression Methodology • Online Compression Framework • Experimental Results • Related Work • Conclusion
Experimental Setup • Two datasets: GCRM: 375 GB (L: 270 GB + R: 105 GB); NPB: 237 GB (L: 166 GB + R: 71 GB) • 16 nodes × 8 cores (Intel Xeon 2.53 GHz) • Storage of datasets: Lustre FS (14 storage nodes) and Amazon S3 (Northern Virginia) • Compression algorithms: CC, FPC, LZO, bzip2, gzip, LZMA • Applications: AT, MMAT, KMeans
Performance of MMAT • Breakdown of performance: Overhead (local): 15.41% • Read speedup: 1.96×
Lossy Compression (MMAT) • #e: number of dropped bits • Error bound: 5×10⁻⁵
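The slide's lossy variant drops #e low-order bits before encoding; the slides do not give the exact procedure, but for IEEE 754 doubles the step can be sketched as below (function name ours). Zeroing e mantissa bits perturbs a value by less than 2^e ulps, so the error bound follows directly from the choice of e.

```python
import struct

def drop_bits(value, e):
    """Lossy step: zero the e lowest mantissa bits of a double.
    More dropped bits -> more compressible residuals, at the cost
    of a bounded reconstruction error."""
    bits = struct.unpack("<Q", struct.pack("<d", value))[0]
    bits &= ~((1 << e) - 1)     # clear the e low-order mantissa bits
    return struct.unpack("<d", struct.pack("<Q", bits))[0]
```

Truncation is idempotent, so decompressed data re-compresses losslessly from then on.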
Performance of KMeans • NPB dataset • Compression ratio: 24.01% (180 GB) • KMeans performs more computation, giving more opportunity to overlap fetching and decompression with it
Conclusion • Management and analysis of scientific datasets are challenging • Generic compression algorithms are inefficient for scientific datasets • We proposed a compression framework and methodology • Domain-specific compression algorithms are fast and space efficient: 51.68% compression ratio and 53.27% improvement in execution time • Easy plug-and-play of compression algorithms • Integrated the proposed framework and methodology with a data analysis middleware
Multithreading & Prefetching • Varying the numbers of PCE and I/O threads, e.g. 2P–4IO: 2 PCE threads, 4 I/O threads • One core is assigned to the compression framework
Related Work • (Scientific) data management: NetCDF, PNetCDF, HDF5; Nicolae et al. (BlobSeer): distributed data management service for efficient reading, writing, and appending operations • Compression — Generic: LZO, bzip2, gzip, szip, LZMA, etc. • Scientific: Schendel and Jin et al. (ISOBAR): organizes highly entropic data into compressible data chunks; Burtscher et al. (FPC): efficient double-precision floating-point compression; Lakshminarasimhan et al. (ISABELA)