Integrating Online Compression to Accelerate Large-Scale Data Analytics Applications
Tekin Bicer†, Jian Yin‡, David Chiu, Gagan Agrawal†, and Karen Schuchardt‡ — †Ohio State University, Washington State University, ‡Pacific Northwest National Laboratory
Introduction • Scientific simulations and instruments can generate large amounts of data • E.g., the Global Cloud Resolving Model (GCRM): 1 PB of data at a 4 km grid-cell resolution • Higher resolutions produce even more data • I/O operations become the bottleneck • Problems: storage capacity and I/O performance • Our approach: compression
Motivation • Generic compression algorithms work well on low-entropy byte sequences • Scientific datasets are hard to compress • Floating-point numbers consist of an exponent and a mantissa, and the mantissa can be highly entropic • Using compression in applications is challenging: choosing a suitable compression algorithm, utilizing the available resources, and integrating the algorithm into the application
Outline • Introduction • Motivation • Compression Methodology • Online Compression Framework • Experimental Results • Related Work • Conclusion
Compression Methodology • Common properties of scientific datasets: multidimensional arrays, consisting of floating-point numbers, with strong relationships between neighboring values • Domain-specific solutions can exploit these properties • Approach: prediction-based differential compression — predict the values of neighboring cells and store only the difference
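The predict-and-store-difference idea can be sketched as follows. The slides do not give the exact predictor, so this is a minimal lossless illustration using a last-value predictor and XOR residuals over the IEEE 754 bit patterns (function names are ours, not the framework's); when neighbors are close, the residuals have many leading zero bits, which a real encoder would pack compactly.

```python
import struct

def xor_compress(values):
    """Prediction-based differential compression sketch (lossless).

    Predictor: each value is predicted to equal its left neighbor.
    We store the XOR of the two bit patterns; correlated neighbors
    yield residuals with many leading zeros."""
    prev = 0
    residuals = []
    for v in values:
        bits = struct.unpack("<Q", struct.pack("<d", v))[0]
        residuals.append(bits ^ prev)   # small when the prediction is good
        prev = bits
    return residuals

def xor_decompress(residuals):
    """Invert xor_compress: rebuild each value from its predecessor."""
    prev = 0
    values = []
    for r in residuals:
        bits = r ^ prev
        values.append(struct.unpack("<d", struct.pack("<Q", bits))[0])
        prev = bits
    return values
```

Because XOR is its own inverse and `struct` round-trips doubles exactly, the scheme is fully lossless; a lossy variant would additionally drop low-order residual bits before packing.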
Example: GCRM Temperature Variable Compression • E.g., a temperature record: the values of neighboring cells are highly correlated • X′ table (after prediction) • X′′: compressed values, using 5 bits for the prediction code plus the difference • Supports both lossless and lossy compression • Fast, with good compression ratios
Compression Framework • Improve end-to-end application performance • Minimize application I/O time by pipelining I/O and (de)compression operations • Hide computational overhead by overlapping application computation with the compression framework • Easy implementation of different compression algorithms • Easy integration with applications through an API similar to POSIX I/O
A Compression Framework for Data-Intensive Applications • Chunk Resource Allocation (CRA) Layer: initializes the system; generates chunk requests and enqueues them for processing; converts original offset and size requests into their compressed counterparts • Parallel I/O Layer (PIOL): creates parallel chunk requests to the storage medium; each chunk request is handled by a group of threads; provides an abstraction over different data transfer protocols • Parallel Compression Engine (PCE): applies the encode()/decode() functions to chunks; manages an in-memory cache with informed prefetching; creates I/O requests
Compression Framework API • User-defined functions: encode_t(…): (R) code for compression; decode_t(…): (R) code for decompression; prefetch_t(…): (O) informed prefetching function • Applications use the following functions: • comp_read: applies decode_t to a compressed chunk • comp_write: applies encode_t to an original chunk • comp_seek: mimics fseek and also utilizes prefetch_t • comp_init: initializes the system (thread pools, cache, etc.)
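The slides list the API without signatures; the toy Python mock below (all names except encode_t/decode_t/prefetch_t, comp_read, and comp_write are ours, and the real framework presumably exposes a C interface) shows how the user-supplied callbacks plug into reads and writes of chunks.

```python
import zlib

class CompressionFramework:
    """Toy stand-in: the user registers encode/decode callbacks at
    init time; comp_write/comp_read apply them per chunk."""
    def __init__(self, encode_t, decode_t, prefetch_t=None):
        self.encode_t = encode_t        # (R) compression callback
        self.decode_t = decode_t        # (R) decompression callback
        self.prefetch_t = prefetch_t    # (O) informed prefetching hint
        self.storage = {}               # chunk_id -> compressed bytes

    def comp_write(self, chunk_id, data):
        """Compress a chunk with the user's encoder, then store it."""
        self.storage[chunk_id] = self.encode_t(data)

    def comp_read(self, chunk_id):
        """Load a compressed chunk and decode it for the application."""
        return self.decode_t(self.storage[chunk_id])

# Usage: here zlib stands in for a domain-specific codec.
fw = CompressionFramework(encode_t=zlib.compress, decode_t=zlib.decompress)
fw.comp_write(0, b"temperature chunk bytes")
```

Swapping zlib for a prediction-based encoder requires no change on the application side, which is the point of the callback design.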
Prefetching and In-Memory Cache • Overlap application-layer computation with I/O • Reusability of already-accessed data is low, so prospective chunks are prefetched and cached • Default replacement policy is LRU • For informed prefetching via prefetch(…), the user can analyze the access history and provide a list of prospective chunks • The cache uses a row-based locking scheme for efficient consecutive chunk requests
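A minimal sketch of the default-LRU cache with a user-supplied prefetch hint (class and method names are illustrative; the real cache also does row-based locking, omitted here):

```python
from collections import OrderedDict

class ChunkCache:
    """LRU chunk cache with an optional informed-prefetch hint."""
    def __init__(self, capacity, fetch, prefetch_hint=None):
        self.capacity = capacity
        self.fetch = fetch                  # loads + decodes a chunk on miss
        self.prefetch_hint = prefetch_hint  # chunk_id -> list of next ids
        self.cache = OrderedDict()

    def _put(self, cid, chunk):
        self.cache[cid] = chunk
        self.cache.move_to_end(cid)
        while len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict least recently used

    def get(self, cid):
        if cid in self.cache:
            self.cache.move_to_end(cid)     # refresh LRU position
            return self.cache[cid]
        chunk = self.fetch(cid)
        self._put(cid, chunk)
        if self.prefetch_hint:              # pull prospective chunks in early
            for nxt in self.prefetch_hint(cid):
                if nxt not in self.cache:
                    self._put(nxt, self.fetch(nxt))
        return chunk
```

With a hint like `lambda cid: [cid + 1]`, a sequential scan turns every second access into a cache hit, which is how prefetching overlaps I/O with application computation.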
Integration with a Data-Intensive Computing System • MapReduce-style API • Remote data processing • Sensitive to I/O bandwidth • Processes data in a local cluster, in the cloud, or in both (hybrid cloud)
Outline • Introduction • Motivation • Compression Methodology • Online Compression Framework • Experimental Results • Related Work • Conclusion
Experimental Setup • Two datasets: GCRM: 375 GB (L: 270 GB + R: 105 GB); NPB: 237 GB (L: 166 GB + R: 71 GB) • 16 nodes × 8 cores (Intel Xeon 2.53 GHz) • Storage of datasets: Lustre FS (14 storage nodes) and Amazon S3 (Northern Virginia) • Compression algorithms: CC, FPC, LZO, bzip2, gzip, LZMA • Applications: AT, MMAT, KMeans
Performance of MMAT • Breakdown of performance: Overhead (local): 15.41% • Read speedup: 1.96×
Lossy Compression (MMAT) • #e: number of dropped bits • Error bound: 5×10⁻⁵
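The slide's lossy variant drops #e low-order bits before encoding; the slides do not give the exact procedure, but for IEEE 754 doubles the step can be sketched as below (function name ours). Zeroing e mantissa bits perturbs a value by less than 2^e ulps, so the error bound follows directly from the choice of e.

```python
import struct

def drop_bits(value, e):
    """Lossy step: zero the e lowest mantissa bits of a double.
    More dropped bits -> more compressible residuals, at the cost
    of a bounded reconstruction error."""
    bits = struct.unpack("<Q", struct.pack("<d", value))[0]
    bits &= ~((1 << e) - 1)     # clear the e low-order mantissa bits
    return struct.unpack("<d", struct.pack("<Q", bits))[0]
```

Truncation is idempotent, so decompressed data re-compresses losslessly from then on.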
Performance of KMeans • NPB dataset • Compression ratio: 24.01% (180 GB) • KMeans performs more computation, giving more opportunity to overlap fetching and decompression with it
Conclusion • Management and analysis of scientific datasets are challenging • Generic compression algorithms are inefficient for scientific datasets • We proposed a compression framework and methodology • Domain-specific compression algorithms are fast and space efficient: 51.68% compression ratio and 53.27% improvement in execution time • Easy plug-and-play of compression algorithms • Integrated the proposed framework and methodology with a data analysis middleware
Multithreading & Prefetching • Varying the numbers of PCE and I/O threads, e.g. 2P–4IO: 2 PCE threads, 4 I/O threads • One core is assigned to the compression framework
Related Work • (Scientific) data management: NetCDF, PNetCDF, HDF5; Nicolae et al. (BlobSeer): distributed data management service for efficient reading, writing, and appending operations • Compression — Generic: LZO, bzip2, gzip, szip, LZMA, etc. • Scientific: Schendel and Jin et al. (ISOBAR): organizes highly entropic data into compressible data chunks; Burtscher et al. (FPC): efficient double-precision floating-point compression; Lakshminarasimhan et al. (ISABELA)