SciMATE: A Novel MapReduce-Like Framework for Multiple Scientific Data Formats Yi Wang, Wei Jiang, Gagan Agrawal The Ohio State University CCGrid 2012 May 15th, Ottawa, Canada
Outline • Introduction • System Design • System Optimization • Experimental Results
Scientific Data Analysis Today • Increasingly data-intensive • Volume approximately doubles each year • Stored in certain specialized formats • NetCDF, HDF5, ADIOS… • Popularity of MapReduce and its variants • Free accessibility • Easy programmability • Good scalability • Built-in fault tolerance
Scientific Data Analysis Today (Cont'd) • "Store-first-analyze-after" • Reload data into another file system • E.g., load data from PVFS into HDFS • Reload data into another data format • E.g., convert NetCDF/HDF5 data into a framework-specific format • Problem • Long data migration/transformation time • Heavy stress on the network and disks
Outline • Introduction • System Design • System Optimization • Experimental Results
SciMATE Framework • "In-situ" data analysis (no data reloading!) • Extends MATE for scientific data analysis [Wei Jiang et al., CCGRID'10] • Customizable data format adaptation API • Can be adapted to process any existing (or even new) scientific data format • Optimized by • Access strategies: full read / partial read • Access patterns for partial read
System Overview • Key feature • Scientific data processing module
Scientific Data Processing Module • Functions • Data partitioning • Data loading (+ data restructuring) • Block loader • Data format selector • Transparently translates calls to the block loader interface into calls to the scientific data library • System-defined adapters: NetCDF, HDF5, flat-file • Access strategy selector
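The data format selector above can be pictured as a small dispatch table that maps a dataset's file type to a registered adapter. This is a minimal sketch, not SciMATE's actual (C++) implementation; the `FormatSelector` class and its method names are illustrative assumptions.

```python
class FormatSelector:
    """Illustrative sketch of a data format selector: maps a file
    extension to a registered adapter so block-loader calls can be
    routed to the right scientific data library transparently."""

    def __init__(self):
        self._adapters = {}

    def register(self, extension, adapter):
        # System-defined adapters (NetCDF, HDF5, flat-file) and
        # third-party adapters would be registered the same way.
        self._adapters[extension] = adapter

    def select(self, path):
        if "." not in path:
            raise ValueError(f"cannot infer data format of {path!r}")
        ext = path[path.rfind("."):]
        if ext not in self._adapters:
            raise ValueError(f"no adapter registered for {ext!r}")
        return self._adapters[ext]
```

In this sketch a call such as `selector.select("climate.nc")` would return whatever adapter was registered for `".nc"`, keeping the caller unaware of which library performs the actual reads.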
Integrating a New Data Format • Data adaptation layer is customizable • Insert a third-party adapter • Open for extension but closed for modification • Must implement the generic block loader interface • Partitioning function and auxiliary functions • E.g., partition, get_dimensionality • Full read function and partial read functions • E.g., full_read, partial_read, partial_read_by_block
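The generic block loader interface a third-party adapter must implement can be sketched as an abstract base class. The function names (`partition`, `get_dimensionality`, `full_read`, `partial_read`) come from the slide; the signatures, and the toy in-memory adapter, are assumptions for illustration only.

```python
import abc

class BlockLoader(abc.ABC):
    """Sketch of the generic block loader interface an adapter implements."""

    @abc.abstractmethod
    def get_dimensionality(self):
        """Auxiliary function: number of dimensions of the dataset."""

    @abc.abstractmethod
    def partition(self, num_parts):
        """Split the dataset into num_parts roughly equal blocks."""

    @abc.abstractmethod
    def full_read(self):
        """Load the entire dataset."""

    @abc.abstractmethod
    def partial_read(self, selection):
        """Load only the elements named by selection."""

class ListAdapter(BlockLoader):
    """Toy 1-D in-memory adapter showing the shape of an implementation;
    a real adapter would call into the NetCDF/HDF5 library instead."""

    def __init__(self, data):
        self.data = data

    def get_dimensionality(self):
        return 1

    def partition(self, num_parts):
        step = -(-len(self.data) // num_parts)  # ceiling division
        return [self.data[i:i + step] for i in range(0, len(self.data), step)]

    def full_read(self):
        return list(self.data)

    def partial_read(self, selection):
        return [self.data[i] for i in selection]
```

Because the framework only talks to the `BlockLoader` interface, plugging in a new format means writing one such class ("open for extension, closed for modification").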
Outline • Introduction • System Design • System Optimization • Experimental Results
Data Access Strategies and Patterns • Full Read • Partial Read • Strided pattern: read equal-size segments separated by regular strides • Column pattern: read a set of arbitrary columns that correspond to a subset of the dataset's dimensions • Discrete point pattern: read a sequence of discrete points
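The three partial-read patterns can be illustrated with index arithmetic over a small 2-D dataset. This is a pure-Python sketch of the selection logic only (no real I/O); the helper names are hypothetical.

```python
# A 4x4 grid standing in for a 2-D scientific dataset.
data = [[r * 4 + c for c in range(4)] for r in range(4)]

def strided_read(row, start, segment, stride):
    """Strided pattern: equal-size segments separated by a regular stride."""
    out, i = [], start
    while i < len(row):
        out.extend(row[i:i + segment])
        i += stride
    return out

def column_read(dataset, columns):
    """Column pattern: an arbitrary set of columns, i.e. a subset of
    the dataset's dimensions."""
    return [[row[c] for c in columns] for row in dataset]

def point_read(dataset, points):
    """Discrete point pattern: a sequence of individual (row, col) points."""
    return [dataset[r][c] for r, c in points]
```

For example, a strided read of every other element of the first row selects indices 0 and 2, while a column read of columns 1 and 3 returns those two columns from every row.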
Access Pattern Optimization • Strided pattern: directly supported by the API • Discrete point pattern: rarely used, so not optimized for now • Column pattern • Fixed-size column read • Reads a fixed number of columns at a time • Load-balanced • Contiguous column read (our choice) • Reads a contiguous set of columns at a time • Reduces the overhead of frequent small I/O requests
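The contiguous-column idea can be sketched as merging the requested column indices into maximal contiguous runs, so each run becomes a single larger I/O request instead of many small ones. A minimal sketch (the function name is illustrative, not SciMATE's API):

```python
def contiguous_runs(columns):
    """Group requested column indices into (start, length) contiguous
    runs; each run can then be fetched with one I/O request instead
    of one request per column (or per fixed-size group)."""
    runs = []
    for c in sorted(columns):
        if runs and c == runs[-1][0] + runs[-1][1]:
            # Extends the current run by one column.
            runs[-1] = (runs[-1][0], runs[-1][1] + 1)
        else:
            # Gap in the indices: start a new run.
            runs.append((c, 1))
    return runs
```

Requesting columns {0, 1, 2, 5, 6, 9} yields three runs rather than six single-column requests; a fixed-size read would issue a fixed number of equal requests regardless of how the columns cluster.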
Outline • Introduction • System Design • System Optimization • Experimental Results
Evaluation • System functionality and scalability • Use 16 GB datasets • Data processing times (K-means, PCA, kNN) • Thread scalability • Node scalability • Data loading times (K-means, PCA) • Node scalability • Compare partial read with full read • Compare fixed-size column read with contiguous column read
Evaluating Thread Scalability • Data processing times for K-means ( on 1 node)
Evaluating Thread Scalability • Data processing times for PCA (on 1 node)
Evaluating Thread Scalability • Data processing times for kNN (on 1 node)
Evaluating Node Scalability • Data processing times for K-means (8 threads per node)
Evaluating Node Scalability • Data processing times for PCA (8 threads per node)
Evaluating Node Scalability • Data processing times for kNN (8 threads per node)
Data Processing Evaluation • Good scalability as either the number of threads or the number of nodes varies • Data processing time is independent of the data format
Evaluating Node Scalability • Data loading times for K-means (8 threads per node)
Evaluating Node Scalability • Data loading times for PCA (8 threads per node)
Data Loading Evaluation • Good node scalability • Loading flat-file datasets is the slowest: the highly structured nature of NetCDF/HDF5 facilitates parallel I/O • Loading NetCDF datasets is slightly faster than loading HDF5 datasets • Due to the libraries, not to SciMATE • Compared with HDF5's hierarchical data layout, NetCDF's linear data layout favors MPI-IO • NetCDF has a smaller header I/O overhead
Full Read vs. Partial Read • Data loading times for PCA (4 nodes and 1 thread per node, 8 GB dataset)
Fixed-size Column Read vs. Contiguous Column Read • Loading NetCDF data for kNN (1 node with 2 threads, 8 GB dataset)
Fixed-size Column Read vs. Contiguous Column Read • Loading HDF5 data for kNN (1 node with 2 threads, 8 GB dataset)
Conclusion • The SciMATE framework avoids bulk data transfers and costly data transformations, and hence reduces analysis time • The customizable data format adaptation API helps integrate any scientific data format • Supports optimized reads through proper access strategies and patterns