
Integrating Online Compression to Accelerate Large-Scale Data Analytics Applications


Presentation Transcript


  1. Integrating Online Compression to Accelerate Large-Scale Data Analytics Applications. Tekin Bicer, Jian Yin, David Chiu, Gagan Agrawal, and Karen Schuchardt. Ohio State University, Washington State University, Pacific Northwest National Laboratory.

  2. Introduction • Scientific simulations and instruments can generate large amounts of data • E.g., the Global Cloud Resolving Model (GCRM): 1 PB of data at 4 km grid-cell resolution • Higher resolutions mean more and more data • I/O operations become the bottleneck • Problems: storage capacity, I/O performance • Compression can address both

  3. Motivation • Generic compression algorithms work well on low-entropy byte sequences • Scientific datasets are hard to compress • Floating-point numbers split into exponent and mantissa • The mantissa can be highly entropic • Using compression in applications is challenging • Choosing suitable compression algorithms • Utilizing available resources • Integrating compression algorithms into applications

  4. Outline • Introduction • Motivation • Compression Methodology • Online Compression Framework • Experimental Results • Related Work • Conclusion

  5. Compression Methodology • Common properties of scientific datasets • Multidimensional arrays • Consist of floating-point numbers • Neighboring values are closely related • Domain-specific solutions can help • Approach: prediction-based differential compression • Predict the values of neighboring cells • Store only the difference (see the sketch below)
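The slides do not spell out a predictor, so the following is a minimal sketch of the general idea, assuming a simple previous-neighbor predictor and an XOR residual over the raw bit patterns; the actual GCRM predictors are domain specific.

```c
/* Minimal sketch of prediction-based differential compression over a
 * 1-D array of doubles. The previous-neighbor predictor and the XOR
 * residual are illustrative assumptions, not the paper's algorithm. */
#include <stdint.h>
#include <string.h>

static uint64_t bits_of(double d) {
    uint64_t u;
    memcpy(&u, &d, sizeof u);      /* reinterpret the bits without UB */
    return u;
}

/* When neighboring values are close, cur ^ predicted has long runs of
 * leading zero bits, which a back-end coder can store in few bits. */
void diff_encode(const double *in, uint64_t *out, size_t n) {
    uint64_t prev = 0;             /* predictor state: last seen value */
    for (size_t i = 0; i < n; i++) {
        uint64_t cur = bits_of(in[i]);
        out[i] = cur ^ prev;       /* store only the residual */
        prev = cur;
    }
}
```

Decoding replays the XORs in the same order, so this variant is lossless; the compression comes from encoding the mostly-zero residuals compactly.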

  6. Example: GCRM Temperature Variable Compression • E.g., a temperature record: the values of neighboring cells are highly related • X' table (after prediction) • X'': compressed values • 5 bits for prediction + difference • Supports both lossless and lossy compression • Fast, with good compression ratios
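Slide 15 reports lossy results in terms of a number of dropped bits. Assuming (this is an interpretation, not stated on the slide) that the lossy mode zeroes low-order mantissa bits before the difference is taken, a one-function sketch:

```c
/* Hypothetical lossy pre-step: zero the e least significant mantissa
 * bits (0 < e < 52 assumed) before differencing. This bounds the error
 * while making the residuals even more compressible. */
#include <stdint.h>

static uint64_t drop_low_bits(uint64_t u, int e) {
    return u & ~((1ULL << e) - 1);
}
```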

  7. Compression Framework • Improve end-to-end application performance • Minimize application I/O time by pipelining I/O and (de)compression operations • Hide computational overhead by overlapping application computation with the compression framework (a double-buffering sketch follows) • Easy implementation of different compression algorithms • Easy integration with applications: API similar to POSIX I/O
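As a rough illustration of the pipelining/overlap goal, here is a double-buffering sketch with POSIX threads; read_chunk and decompress are hypothetical placeholders, not framework calls.

```c
/* Sketch: overlap reading compressed chunk i+1 with decompressing
 * chunk i, using two buffers and one helper thread per iteration. */
#include <pthread.h>

#define CHUNKS 8
#define CHUNK_SZ 4096

static char bufs[2][CHUNK_SZ];

static void read_chunk(int id, char *buf) { (void)id; (void)buf; /* ... I/O ...    */ }
static void decompress(char *buf)         { (void)buf;           /* ... decode ... */ }

static void *reader(void *arg) {
    int id = *(int *)arg;
    read_chunk(id, bufs[id % 2]);          /* fetch the next chunk */
    return NULL;
}

int main(void) {
    read_chunk(0, bufs[0]);
    for (int i = 0; i < CHUNKS; i++) {
        pthread_t t;
        int next = i + 1;
        if (next < CHUNKS)                 /* start reading chunk i+1 ... */
            pthread_create(&t, NULL, reader, &next);
        decompress(bufs[i % 2]);           /* ... while decoding chunk i  */
        if (next < CHUNKS)
            pthread_join(t, NULL);
    }
    return 0;
}
```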

  8. A Compression Framework for Data-Intensive Applications • Chunk Resource Allocation (CRA) Layer • Initializes the system • Generates chunk requests and enqueues them for processing • Converts original offset and data-size requests into compressed ones (see the index sketch below) • Parallel I/O Layer (PIOL) • Creates parallel chunk requests to the storage medium • Each chunk request is handled by a group of threads • Provides an abstraction over different data-transfer protocols • Parallel Compression Engine (PCE) • Applies the encode() and decode() functions to chunks • Manages the in-memory cache with informed prefetching • Creates I/O requests
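The CRA layer's offset translation suggests a per-chunk index mapping uncompressed extents to compressed ones. The struct and lookup below are assumptions for illustration, not the framework's actual data structures.

```c
/* Hypothetical chunk index the CRA layer could use to translate an
 * application's (offset, size) in the original data into compressed
 * chunk requests. All names here are illustrative assumptions. */
#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint64_t orig_off;   /* chunk offset in the uncompressed stream */
    uint64_t orig_len;   /* uncompressed chunk length               */
    uint64_t comp_off;   /* chunk offset in the compressed file     */
    uint64_t comp_len;   /* compressed chunk length                 */
} chunk_entry;

/* Return the chunk covering an uncompressed offset (linear scan for
 * clarity; a sorted index would use binary search). */
const chunk_entry *lookup(const chunk_entry *idx, size_t n, uint64_t off) {
    for (size_t i = 0; i < n; i++)
        if (off >= idx[i].orig_off && off - idx[i].orig_off < idx[i].orig_len)
            return &idx[i];
    return NULL;
}
```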

  9. Compression Framework API • User-defined functions: • encode_t(…): (required) code for compression • decode_t(…): (required) code for decompression • prefetch_t(…): (optional) informed-prefetching function • Applications can use the functions below (a usage sketch follows) • comp_read: applies decode_t to a compressed chunk • comp_write: applies encode_t to an original chunk • comp_seek: mimics fseek, also utilizes prefetch_t • comp_init: initializes the system (thread pools, cache, etc.)
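The function names come from the slide, but the slide gives no signatures, so every prototype and the opaque handle below are assumptions; this only sketches how an application might plug in its codec and read data.

```c
/* How an application might use the framework. Function names are from
 * the slides; all signatures and the comp_file handle are assumptions. */
#include <stddef.h>

typedef struct comp_file comp_file;                   /* opaque handle (assumed) */
typedef size_t (*encode_t)(const void *in, size_t n, void *out);
typedef size_t (*decode_t)(const void *in, size_t n, void *out);
typedef void   (*prefetch_t)(comp_file *f, long pos); /* optional hint */

/* Assumed prototypes for the POSIX-like calls named on the slide. */
comp_file *comp_init(const char *path, encode_t enc, decode_t dec, prefetch_t pf);
long       comp_seek(comp_file *f, long off, int whence);
size_t     comp_read(comp_file *f, void *buf, size_t n);

void analyze(const char *path, encode_t enc, decode_t dec) {
    comp_file *f = comp_init(path, enc, dec, NULL);   /* default LRU prefetching */
    double buf[1024];
    comp_seek(f, 0, 0 /* SEEK_SET */);                /* may trigger prefetch_t  */
    while (comp_read(f, buf, sizeof buf) > 0) {
        /* ... application computation on the decompressed values ... */
    }
}
```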

  10. Prefetching and In-Memory Cache • Overlaps application-layer computation with I/O • Reusability of already-accessed data is small, so the framework prefetches and caches prospective chunks instead • The default replacement policy is LRU (sketched below) • For informed prefetching, the user can analyze the access history and provide a list of prospective chunks via prefetch(…) • The cache uses a row-based locking scheme for efficient consecutive chunk requests
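A minimal sketch of the default LRU policy over cached chunk slots, using a recency-ordered doubly linked list; the row-based locking from the slide is omitted and all names are illustrative.

```c
/* Move an accessed cache slot to the most-recently-used position.
 * Eviction would take the slot at c->tail. Locking omitted. */
typedef struct slot {
    int chunk_id;
    struct slot *prev, *next;
} slot;

typedef struct { slot *head, *tail; } lru_list;

void lru_touch(lru_list *c, slot *s) {
    if (c->head == s) return;             /* already most recent */
    if (s->prev) s->prev->next = s->next; /* unlink s            */
    if (s->next) s->next->prev = s->prev;
    if (c->tail == s) c->tail = s->prev;
    s->prev = NULL;                       /* relink at the head  */
    s->next = c->head;
    if (c->head) c->head->prev = s;
    c->head = s;
    if (!c->tail) c->tail = s;
}
```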

  11. Integration with a Data-Intensive Computing System • MapReduce-style API • Remote data processing • Sensitive to I/O bandwidth • Processes data in… • a local cluster • the cloud • or both (hybrid cloud)

  12. Outline • Introduction • Motivation • Compression Methodology • Online Compression Framework • Experimental Results • Related Work • Conclusion

  13. Experimental Setup • Two datasets: • GCRM: 375 GB (L: 270 + R: 105) • NPB: 237 GB (L: 166 + R: 71) • 16 nodes × 8 cores (Intel Xeon, 2.53 GHz) • Storage of datasets • Lustre FS (14 storage nodes) • Amazon S3 (Northern Virginia) • Compression algorithms: CC, FPC, LZO, bzip, gzip, LZMA • Applications: AT, MMAT, KMeans

  14. Performance of MMAT • Breakdown of performance • Overhead (local): 15.41% • Read speedup: 1.96×

  15. Lossy Compression (MMAT) • #e: number of dropped bits • Error bound: 5 × 10^-5

  16. Performance of KMeans • NPB dataset • Compression ratio: 24.01% (180 GB) • More computation gives more opportunity to overlap fetching and decompression

  17. Conclusion • Management and analysis of scientific datasets are challenging • Generic compression algorithms are inefficient for scientific datasets • We proposed a compression framework and methodology • Domain-specific compression algorithms are fast and space efficient • 51.68% compression ratio • 53.27% improvement in execution time • Easy plug-and-play of compression algorithms • Integrated the proposed framework and methodology with a data-analysis middleware

  18. Thanks!

  19. Multithreading & Prefetching • Different numbers of PCE and I/O threads • "2P – 4IO" denotes 2 PCE threads and 4 I/O threads • One core is assigned to the compression framework

  20. Related Work • (Scientific) data management • NetCDF, PNetCDF, HDF5 • Nicolae et al. (BlobSeer): distributed data-management service for efficient reading, writing, and appending operations • Compression • Generic: LZO, bzip, gzip, szip, LZMA, etc. • Scientific • Schendel, Jin et al. (ISOBAR): organizes highly entropic data into compressible data chunks • Burtscher et al. (FPC): efficient double-precision floating-point compression • Lakshminarasimhan et al. (ISABELA): error-bounded lossy compression of scientific floating-point data
