SciMATE: A Novel MapReduce-Like Framework for Multiple Scientific Data Formats Yi Wang, Wei Jiang, Gagan Agrawal The Ohio State University CCGrid 2012 May 15th, Ottawa, Canada
Outline • Introduction • System Design • System Optimization • Experimental Results
Scientific Data Analysis Today • Increasingly data-intensive • Volume approximately doubles each year • Stored in certain specialized formats • NetCDF, HDF5, ADIOS… • Popularity of MapReduce and its variants • Free accessibility • Easy programmability • Good scalability • Built-in fault tolerance
Scientific Data Analysis Today (Cont'd) • "Store-first-analyze-after" • Reload data into another file system • E.g., load data from PVFS into HDFS • Reload data into another data format • E.g., convert NetCDF/HDF5 data into a framework-specific format • Problem • Long data migration/transformation time • Heavy stress on the network and disks
Outline • Introduction • System Design • System Optimization • Experimental Results
SciMATE Framework • "In-situ" data analysis (no data reloading!) • Extends MATE for scientific data analysis [Wei Jiang et al., CCGRID'10] • Customizable data format adaptation API • Can be adapted to process any existing (or even new) scientific data format • Optimized by • Access strategies: full read / partial read • Access patterns for partial read
System Overview • Key feature • Scientific data processing module
Scientific Data Processing Module • Functions • Data partitioning • Data loading (+ data restructuring) • Block loader • Data format selector • Transparently translates calls to the block loader interface into calls to the scientific data library • System-defined adapters: NetCDF, HDF5, flat-file • Access strategy selector
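The data format selector above can be pictured as a small dispatch table that maps a dataset's file type to a registered adapter. This is a minimal sketch, not SciMATE's actual (C++) implementation; the `FormatSelector` class and its method names are illustrative assumptions.

```python
class FormatSelector:
    """Illustrative sketch of a data format selector: maps a file
    extension to a registered adapter so block-loader calls can be
    routed to the right scientific data library transparently."""

    def __init__(self):
        self._adapters = {}

    def register(self, extension, adapter):
        # System-defined adapters (NetCDF, HDF5, flat-file) and
        # third-party adapters would be registered the same way.
        self._adapters[extension] = adapter

    def select(self, path):
        if "." not in path:
            raise ValueError(f"cannot infer data format of {path!r}")
        ext = path[path.rfind("."):]
        if ext not in self._adapters:
            raise ValueError(f"no adapter registered for {ext!r}")
        return self._adapters[ext]
```

In this sketch a call such as `selector.select("climate.nc")` would return whatever adapter was registered for `".nc"`, keeping the caller unaware of which library performs the actual reads.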
Integrating a New Data Format • Data adaptation layer is customizable • Insert a third-party adapter • Open for extension but closed for modification • Must implement the generic block loader interface • Partitioning function and auxiliary functions • E.g., partition, get_dimensionality • Full read function and partial read functions • E.g., full_read, partial_read, partial_read_by_block
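The generic block loader interface a third-party adapter must implement can be sketched as an abstract base class. The function names (`partition`, `get_dimensionality`, `full_read`, `partial_read`) come from the slide; the signatures, and the toy in-memory adapter, are assumptions for illustration only.

```python
import abc

class BlockLoader(abc.ABC):
    """Sketch of the generic block loader interface an adapter implements."""

    @abc.abstractmethod
    def get_dimensionality(self):
        """Auxiliary function: number of dimensions of the dataset."""

    @abc.abstractmethod
    def partition(self, num_parts):
        """Split the dataset into num_parts roughly equal blocks."""

    @abc.abstractmethod
    def full_read(self):
        """Load the entire dataset."""

    @abc.abstractmethod
    def partial_read(self, selection):
        """Load only the elements named by selection."""

class ListAdapter(BlockLoader):
    """Toy 1-D in-memory adapter showing the shape of an implementation;
    a real adapter would call into the NetCDF/HDF5 library instead."""

    def __init__(self, data):
        self.data = data

    def get_dimensionality(self):
        return 1

    def partition(self, num_parts):
        step = -(-len(self.data) // num_parts)  # ceiling division
        return [self.data[i:i + step] for i in range(0, len(self.data), step)]

    def full_read(self):
        return list(self.data)

    def partial_read(self, selection):
        return [self.data[i] for i in selection]
```

Because the framework only talks to the `BlockLoader` interface, plugging in a new format means writing one such class ("open for extension, closed for modification").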
Outline • Introduction • System Design • System Optimization • Experimental Results
Data Access Strategies and Patterns • Full Read • Partial Read • Strided pattern: read equal-size segments separated by regular strides • Column pattern: read a set of arbitrary columns that correspond to a subset of the dataset's dimensions • Discrete point pattern: read a sequence of discrete points
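The three partial-read patterns can be illustrated with index arithmetic over a small 2-D dataset. This is a pure-Python sketch of the selection logic only (no real I/O); the helper names are hypothetical.

```python
# A 4x4 grid standing in for a 2-D scientific dataset.
data = [[r * 4 + c for c in range(4)] for r in range(4)]

def strided_read(row, start, segment, stride):
    """Strided pattern: equal-size segments separated by a regular stride."""
    out, i = [], start
    while i < len(row):
        out.extend(row[i:i + segment])
        i += stride
    return out

def column_read(dataset, columns):
    """Column pattern: an arbitrary set of columns, i.e. a subset of
    the dataset's dimensions."""
    return [[row[c] for c in columns] for row in dataset]

def point_read(dataset, points):
    """Discrete point pattern: a sequence of individual (row, col) points."""
    return [dataset[r][c] for r, c in points]
```

For example, a strided read of every other element of the first row selects indices 0 and 2, while a column read of columns 1 and 3 returns those two columns from every row.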
Access Pattern Optimization • Strided pattern: directly supported by the API • Discrete point pattern: rarely used, so not optimized for now • Column pattern • Fixed-size column read • Reads a fixed number of columns at a time • Load-balanced • Contiguous column read (our choice) • Reads a contiguous set of columns at a time • Reduces the overhead of frequent small I/O requests
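The contiguous-column idea can be sketched as merging the requested column indices into maximal contiguous runs, so each run becomes a single larger I/O request instead of many small ones. A minimal sketch (the function name is illustrative, not SciMATE's API):

```python
def contiguous_runs(columns):
    """Group requested column indices into (start, length) contiguous
    runs; each run can then be fetched with one I/O request instead
    of one request per column (or per fixed-size group)."""
    runs = []
    for c in sorted(columns):
        if runs and c == runs[-1][0] + runs[-1][1]:
            # Extends the current run by one column.
            runs[-1] = (runs[-1][0], runs[-1][1] + 1)
        else:
            # Gap in the indices: start a new run.
            runs.append((c, 1))
    return runs
```

Requesting columns {0, 1, 2, 5, 6, 9} yields three runs rather than six single-column requests; a fixed-size read would issue a fixed number of equal requests regardless of how the columns cluster.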
Outline • Introduction • System Design • System Optimization • Experimental Results
Evaluation • System functionality and scalability • Use 16 GB datasets • Data processing times (K-means, PCA, kNN) • Thread scalability • Node scalability • Data loading times (K-means, PCA) • Node scalability • Compare partial read with full read • Compare fixed-size column read with contiguous column read
Evaluating Thread Scalability • Data processing times for K-means ( on 1 node)
Evaluating Thread Scalability • Data processing times for PCA (on 1 node)
Evaluating Thread Scalability • Data processing times for kNN (on 1 node)
Evaluating Node Scalability • Data processing times for K-means (8 threads per node)
Evaluating Node Scalability • Data processing times for PCA (8 threads per node)
Evaluating Node Scalability • Data processing times for kNN (8 threads per node)
Data Processing Evaluation • Good scalability as either the number of threads or the number of nodes varies • Data processing time is independent of the data format
Evaluating Node Scalability • Data loading times for K-means (8 threads per node)
Evaluating Node Scalability • Data loading times for PCA (8 threads per node)
Data Loading Evaluation • Good node scalability • Loading flat-file datasets is the slowest: the highly structured nature of NetCDF/HDF5 facilitates parallel I/O • Loading NetCDF datasets is slightly faster than loading HDF5 datasets • Due to the libraries, not to SciMATE • Compared with HDF5's hierarchical data layout, NetCDF's linear data layout favors MPI-IO • NetCDF has a smaller header I/O overhead
Full Read vs. Partial Read • Data loading times for PCA (4 nodes and 1 thread per node, 8 GB dataset)
Fixed-size Column Read vs. Contiguous Column Read • Loading NetCDF data for kNN (1 node with 2 threads, 8 GB dataset)
Fixed-size Column Read vs. Contiguous Column Read • Loading HDF5 data for kNN (1 node with 2 threads, 8 GB dataset)
Conclusion • The SciMATE framework avoids bulk data transfers and costly data transformations, and hence reduces analysis time • The customizable data format adaptation API helps integrate any scientific data format • Supports optimized reads through proper access strategies and patterns