
SciMATE: A Novel MapReduce-Like Framework for Multiple Scientific Data Formats

Yi Wang, Wei Jiang, Gagan Agrawal. The Ohio State University. CCGrid 2012, May 15th, Ottawa, Canada. Outline: Introduction, System Design, System Optimization, Experimental Results.


Presentation Transcript


  1. SciMATE: A Novel MapReduce-Like Framework for Multiple Scientific Data Formats Yi Wang, Wei Jiang, Gagan Agrawal The Ohio State University CCGrid 2012 May 15th, Ottawa, Canada

  2. Outline • Introduction • System Design • System Optimization • Experimental Results

  3. Scientific Data Analysis Today • Increasingly data-intensive • Volume approximately doubles each year • Stored in certain specialized formats • NetCDF, HDF5, ADIOS… • Popularity of MapReduce and its variants • Free accessibility • Easy programmability • Good scalability • Built-in fault tolerance

  4. Scientific Data Analysis Today (Cont’d) • “Store-first-analyze-after” • Reload data into another file system • E.g., load data from PVFS to HDFS • Reload data into another data format • E.g., load NetCDF/HDF5 data into a specific format • Problem • Long data migration/transformation time • Stresses network and disks

  5. Outline • Introduction • System Design • System Optimization • Experimental Results

  6. SciMATE Framework • “In-situ data analysis” (No data reloading!) • Extend MATE for scientific data analysis [Wei Jiang et al., CCGRID’10] • Customizable data format adaption API • Can be adapted to support processing of any (or even new) scientific data format • Optimized by • Access strategies: full read/partial read • Access patterns for partial read

  7. System Overview • Key feature • scientific data processing module

  8. Scientific Data Processing Module

  9. Scientific Data Processing Module • Function • Data partitioning • Data loading (+ Data restructuring) • Block loader • Data format selector • Transparently translate the calls to block loader interface into the calls to scientific data library • System-defined adapters: NetCDF, HDF5, Flat-file • Access strategy selector
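The data format selector on this slide can be pictured as a small dispatch table. The following is only a minimal sketch under stated assumptions: `register_adapter`, `load_block`, the loader signature, and the stand-in loader are all hypothetical names invented for illustration, and no real NetCDF or HDF5 library call is made.

```python
from typing import Callable, Dict, List

# Hypothetical loader signature (an assumption, not SciMATE's API):
# (path, element offset, element count) -> list of values.
BlockLoader = Callable[[str, int, int], List[float]]

# Registry mapping a lower-cased format name to its adapter.
_ADAPTERS: Dict[str, BlockLoader] = {}


def register_adapter(fmt: str, loader: BlockLoader) -> None:
    """Register an adapter for a data format (case-insensitive name)."""
    _ADAPTERS[fmt.lower()] = loader


def load_block(fmt: str, path: str, offset: int, count: int) -> List[float]:
    """Translate a generic block-load call into a format-specific call."""
    try:
        loader = _ADAPTERS[fmt.lower()]
    except KeyError:
        raise ValueError(f"no adapter registered for format {fmt!r}") from None
    return loader(path, offset, count)


def _stand_in_netcdf_loader(path: str, offset: int, count: int) -> List[float]:
    # Stand-in only: a real adapter would call the NetCDF library here.
    return [float(i) for i in range(offset, offset + count)]


register_adapter("NetCDF", _stand_in_netcdf_loader)
```

In this sketch the caller never touches a format-specific library; it only names the format and the block it wants, which is the kind of transparency the slide describes.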

  10. Integrating a New Data Format • Data adaption layer is customizable • Insert a third-party adapter • Open for extension but closed for modification • Have to implement the generic block loader interface • Partitioning function and auxiliary functions • E.g., partition, get_dimensionality • Full read function and partial read functions • E.g., full_read, partial_read, partial_read_by_block
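A third-party adapter along these lines might look roughly like the sketch below. The function names `partition`, `get_dimensionality`, `full_read`, and `partial_read` come from the slide; their parameters, return types, and the in-memory `InMemoryAdapter` class are assumptions made for illustration. `partial_read_by_block` is left out because the slide does not describe its semantics.

```python
from abc import ABC, abstractmethod
from typing import List, Sequence, Tuple


class BlockLoader(ABC):
    """Hypothetical generic block loader interface (signatures assumed)."""

    @abstractmethod
    def get_dimensionality(self) -> int: ...

    @abstractmethod
    def partition(self, num_blocks: int) -> List[Tuple[int, int]]: ...

    @abstractmethod
    def full_read(self, start_row: int, num_rows: int) -> List[List[float]]: ...

    @abstractmethod
    def partial_read(self, start_row: int, num_rows: int,
                     columns: Sequence[int]) -> List[List[float]]: ...


class InMemoryAdapter(BlockLoader):
    """Toy adapter over a row-major list of rows, standing in for a file."""

    def __init__(self, rows: List[List[float]]) -> None:
        self._rows = rows

    def get_dimensionality(self) -> int:
        return len(self._rows[0]) if self._rows else 0

    def partition(self, num_blocks: int) -> List[Tuple[int, int]]:
        # Split rows into nearly equal (start_row, num_rows) ranges.
        base, extra = divmod(len(self._rows), num_blocks)
        blocks, start = [], 0
        for i in range(num_blocks):
            count = base + (1 if i < extra else 0)
            blocks.append((start, count))
            start += count
        return blocks

    def full_read(self, start_row: int, num_rows: int) -> List[List[float]]:
        return [list(r) for r in self._rows[start_row:start_row + num_rows]]

    def partial_read(self, start_row: int, num_rows: int,
                     columns: Sequence[int]) -> List[List[float]]:
        # Keep only the requested columns of each row in the block.
        block = self._rows[start_row:start_row + num_rows]
        return [[r[c] for c in columns] for r in block]
```

Because the framework would depend only on the abstract interface, a new format could be added by writing one more subclass, leaving existing code unmodified.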

  11. Outline • Introduction • System Design • System Optimization • Experimental Results

  12. Data Access Strategies and Patterns • Full Read • Partial Read • Strided pattern: read equal-size segments separated by regular strides • Column pattern: read a set of arbitrary columns that corresponds to a subset of dataset dimensions • Discrete point pattern: read a sequence of discrete points
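To make the strided pattern concrete, here is a minimal sketch that reads equal-size segments separated by a regular stride from a flat in-memory array. The function name `strided_read` and its parameters are hypothetical; a real adapter would issue the equivalent request to the underlying data library rather than slice a Python list.

```python
from typing import List


def strided_read(data: List[float], start: int, segment_len: int,
                 stride: int, count: int) -> List[float]:
    """Read `count` segments of `segment_len` elements each.

    Consecutive segments begin `stride` elements apart, starting at `start`.
    """
    if segment_len <= 0 or count < 0 or stride < segment_len:
        raise ValueError("need segment_len > 0, count >= 0, stride >= segment_len")
    out: List[float] = []
    for i in range(count):
        begin = start + i * stride
        out.extend(data[begin:begin + segment_len])
    return out
```

For example, with `start=1`, `segment_len=2`, `stride=4`, and `count=3`, the segments begin at offsets 1, 5, and 9.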

  13. Access Pattern Optimization • Strided pattern: directly supported by API • Discrete point pattern: rarely used, so no optimization for now • Column pattern • Fixed-size column read • Read a fixed number of columns at a time • Load balanced • Contiguous column read (our choice) • Read a contiguous column set at a time • Reduce the overhead of frequent small I/O requests
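The difference between the two column-read strategies can be sketched by grouping the requested column indices in two ways. Everything here is illustrative: the function names are hypothetical, and counting one I/O request per contiguous run of columns is a simplifying assumption, not a measurement from the talk.

```python
from typing import Iterable, List


def fixed_size_groups(columns: Iterable[int], group_size: int) -> List[List[int]]:
    """Fixed-size column read: the same number of columns per group."""
    cols = sorted(set(columns))
    return [cols[i:i + group_size] for i in range(0, len(cols), group_size)]


def contiguous_groups(columns: Iterable[int]) -> List[List[int]]:
    """Contiguous column read: each group is one run of adjacent columns."""
    groups: List[List[int]] = []
    for c in sorted(set(columns)):
        if groups and c == groups[-1][-1] + 1:
            groups[-1].append(c)   # extends the current run
        else:
            groups.append([c])     # starts a new run
    return groups


def request_count(groups: List[List[int]]) -> int:
    """Assumed cost model: one I/O request per contiguous run in each group."""
    return sum(len(contiguous_groups(g)) for g in groups)
```

For the columns `[0, 1, 2, 5, 6, 9]`, fixed-size groups of two give `[[0, 1], [2, 5], [6, 9]]`, which this cost model counts as five requests, while contiguous grouping gives `[[0, 1, 2], [5, 6], [9]]` and three requests. The contiguous groups are uneven in size, which reflects the load-balance trade-off noted on the slide.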

  14. Outline • Introduction • System Design • System Optimization • Experimental Results

  15. Evaluation • System functionality and scalability • Use 16 GB datasets • Data processing times (K-means, PCA, kNN) • Thread scalability • Node scalability • Data loading times (K-means, PCA) • Node scalability • Compare partial read with full read • Compare fixed-size column read with contiguous column read

  16. Evaluating Thread Scalability • Data processing times for K-means (on 1 node)

  17. Evaluating Thread Scalability • Data processing times for PCA (on 1 node)

  18. Evaluating Thread Scalability • Data processing times for kNN (on 1 node)

  19. Evaluating Node Scalability • Data processing times for K-means (8 threads per node)

  20. Evaluating Node Scalability • Data processing times for PCA (8 threads per node)

  21. Evaluating Node Scalability • Data processing times for kNN (8 threads per node)

  22. Data Processing Evaluation • Good scalability as either the number of threads or the number of nodes varies • Data processing time is independent of data format

  23. Evaluating Node Scalability • Data loading times for K-means (8 threads per node)

  24. Evaluating Node Scalability • Data loading times for PCA (8 threads per node)

  25. Data Loading Evaluation • Good node scalability • Loading flat-file datasets is the slowest • The highly structured nature of NetCDF and HDF5 facilitates parallel I/O • Loading NetCDF datasets is slightly faster than loading HDF5 datasets • Related to the libraries, unrelated to SciMATE • Compared with the HDF5 hierarchical data layout, the NetCDF linear data layout favors MPI-IO • NetCDF has a smaller header I/O overhead

  26. Full Read vs. Partial Read • Data loading time for PCA (4 nodes and 1 thread per node, 8 GB dataset)

  27. Fixed-size Column Read vs. Contiguous Column Read • Loading NetCDF data for kNN (1 node with 2 threads, 8 GB dataset)

  28. Fixed-size Column Read vs. Contiguous Column Read • Loading HDF5 data for kNN (1 node with 2 threads, 8 GB dataset)

  29. Conclusion • The SciMATE framework avoids bulk data transfers and large-scale data transformation, and hence reduces analysis time • The customizable data format adaption API helps integrate any scientific data format • Supports optimized reads by using proper access strategies and patterns
