SAGA: Array Storage as a DB with Support for Structural Aggregations SSDBM 2014 June 30th, Aalborg, Denmark Yi Wang, Arnab Nandi, Gagan Agrawal The Ohio State University
Outline • Introduction • Grid Aggregations • Overlapping Aggregations • Experimental Results • Conclusion
Big Data Is Often Big Arrays • Array data is everywhere Molecular Simulation: Molecular Data Life Science: DNA Sequencing Data (Microarray) Earth Science: Ocean and Climate Data Space Science: Astronomy Data
How to Process Big Arrays? • Use relational databases? • Poor Expressiveness • Loses the natural positional/structural information • Most complex operations are naturally defined in terms of arrays: e.g., correlations, convolution, curve fitting … • Poor Performance • Cumbersome data transformations • Too heavyweight: e.g., transactions • One size does not fit all! [Figure: Input Table → Mapping → Input Array → Manipulation → Output Array → Rendering → Output Table]
Array Databases • Examples: SciDB, RasDaMan, and MonetDB • Arrays as First-Class Citizens • Everything is defined in the array dialect • Lightweight or No ACID Maintenance • No write conflicts: ACID is inherently guaranteed • Other Desired Functionality • Structural aggregations, array join, provenance…
The Upfront Cost of Using SciDB • High-Level Data Flow • Requires data ingestion • Data Ingestion Steps • Raw files (e.g., HDF5) -> CSV • Load CSV files into SciDB “EarthDB: scalable analysis of MODIS data using SciDB” - G. Planthaber et al.
Array Storage as a DB • A Paradigm Similar to NoDB • Still maintains DB functionality • But no data ingestion • DB and Array Storage as a DB: Friends or Foes? • When to use DB? • Load once, and query frequently • When to directly use array storage? • Query infrequently, so avoid loading • Our System • Focuses on a set of special array operations - Structural Aggregations
Traditional Value-Based Aggregation • Value-based aggregation: aggregates the elements of the same value at a time • SELECT COUNT(Member) AS Num, Nationality FROM T1 GROUP BY Nationality; [Figure: input table T1 and its aggregation result]
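The GROUP BY above can be mimicked in a few lines of Python; the member names and nationalities below are hypothetical stand-ins for the slide's T1 table.

```python
from collections import Counter

# Hypothetical rows mirroring the slide's T1 table: (Member, Nationality).
rows = [
    ("Alice", "US"), ("Bob", "US"),
    ("Chen", "CN"), ("Dana", "DK"),
]

# GROUP BY Nationality, COUNT(Member): one result per distinct value,
# regardless of where each row sits in the table.
counts = Counter(nationality for _, nationality in rows)
print(counts)  # e.g. Counter({'US': 2, 'CN': 1, 'DK': 1})
```

Note that only the values matter here; the rows' positions are irrelevant, which is exactly what structural aggregation changes.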
Structural Aggregation • Aggregate the elements based on positional relationships, i.e., the elements in the same square at a time • E.g., moving average: calculates the average of each 2 × 2 square from left to right [Figure: input array and its aggregation result]
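Both flavors of 2 × 2 square aggregation can be sketched in NumPy; this is a minimal illustration on a 4 × 4 array, not the paper's implementation.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# 4x4 input array with values 0..15.
a = np.arange(16, dtype=float).reshape(4, 4)

# Grid (non-overlapping) aggregation: reshape into 2x2 blocks and
# average each block; axis 1 and 3 index within a block.
grid = a.reshape(2, 2, 2, 2).mean(axis=(1, 3))            # shape (2, 2)

# Sliding (overlapping) aggregation: a 2x2 moving average, left to
# right and top to bottom, yielding one result per window position.
sliding = sliding_window_view(a, (2, 2)).mean(axis=(2, 3))  # shape (3, 3)
```

The grid variant touches each element once; the sliding variant touches interior elements up to four times, which is why data reuse matters later.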
Structural Aggregation Types Non-Overlapping Aggregation Overlapping Aggregation
Grid Aggregation • Parallelization: Easy after Partitioning • Considerations • Data contiguity which affects the I/O performance • Communication cost • Load balancing for skewed data • Partitioning Strategies • Coarse-grained, fine-grained, hybrid, and auto-grained • Why not use dynamic repartitioning? • Runtime overhead • Poor data contiguity • Redundant data loads
Coarse-Grained Partitioning • Pros • Low I/O cost • Low communication cost • Cons • Workload imbalance for skewed data
Fine-Grained Partitioning • Pros • Excellent workload balance for skewed data • Cons • Relatively high I/O cost • High communication cost
Hybrid Partitioning • Pros • Low communication cost • Good workload balance for skewed data • Cons • High I/O cost
Auto-Grained Partitioning • 2 Steps • Estimate the grid density (after filtering) by uniform sampling, and hence estimate the computation cost (based on the computation complexity) • For each grid, total processing cost = constant loading cost + variable computation cost • Partition the cost array - Balanced Contiguous Multi-Way Partitioning • Dynamic programming (a small number of grids) • Greedy (a large number of grids)
Auto-Grained Partitioning (Cont’d) • Pros • Low I/O cost • Low communication cost • Great workload balance for skewed data • Cons • Overhead of sampling and runtime partitioning
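The greedy variant of balanced contiguous multi-way partitioning can be sketched as follows; the function name and the closing rule (cut a run once it reaches the ideal average share) are illustrative assumptions, not the paper's exact algorithm. Contiguity is preserved because the cost array is only ever cut into consecutive runs.

```python
def greedy_contiguous_partition(costs, k):
    """Split `costs` into k contiguous runs of roughly equal total cost.

    Greedy heuristic (illustrative): close the current run once it
    reaches the ideal average share, while leaving at least one
    element for each remaining run.
    """
    n = len(costs)
    target = sum(costs) / k
    parts, start, acc = [], 0, 0.0
    for i, c in enumerate(costs):
        acc += c
        remaining = k - len(parts) - 1
        # Cut here if the run has reached its share and enough
        # elements remain for the runs still to be formed.
        if remaining > 0 and acc >= target and n - i - 1 >= remaining:
            parts.append(costs[start:i + 1])
            start, acc = i + 1, 0.0
    parts.append(costs[start:])
    return parts

# Skewed per-grid costs: the expensive grids (4) end up spread
# across partitions instead of piling onto one worker.
parts = greedy_contiguous_partition([4, 1, 1, 4, 1, 1, 4], k=3)
print(parts)  # [[4, 1, 1], [4, 1, 1], [4]]
```

The dynamic-programming variant would minimize the maximum run cost exactly, which is affordable when the number of grids is small.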
Partitioning Strategy Summary Our partitioning strategy decider can help choose the best strategy
Partitioning Strategy Decider • Cost Model: analyze load cost and computation cost separately • Load cost • Loading factor × data amount • Computation cost • Exception - Auto-Grained: take load cost and computation cost as a whole
Overlapping Aggregation • I/O Cost • Reuse the data already in the memory • Reduce the disk I/O to enhance the I/O performance • Memory Accesses • Reuse the data already in the cache • Reduce cache misses to accelerate the computation • Aggregation Approaches • Naïve approach • Data-reuse approach • All-reuse approach
Example: Hierarchical Aggregation • Aggregate 3 grids in a 6 × 6 array • The innermost 2 × 2 grid • The middle 4 × 4 grid • The outermost 6 × 6 grid • (Parallel) sliding aggregation is much more complicated
Naïve Approach Load the innermost grid Aggregate the innermost grid Load the middle grid Aggregate the middle grid Load the outermost grid Aggregate the outermost grid For N grids: N loads + N aggregations
Data-Reuse Approach Load the outermost grid Aggregate the outermost grid Aggregate the middle grid Aggregate the innermost grid For N grids: 1 load + N aggregations
All-Reuse Approach Load the outermost grid Once an element is accessed, accumulatively update the aggregation results it contributes to For N grids: 1 load + 1 aggregation Only update the outermost aggregation result Update both the outermost and the middle aggregation results Update all the 3 aggregation results
All-Reuse Approach (Cont’d) • Key Insight • # of aggregation results ≤ # of queried elements • More computationally efficient to iterate over elements and update the associated aggregation results • More Benefits • Load balance (for hierarchical/circular aggregations) • More speedup for compound array elements • The data type of an aggregation result is usually primitive, but this is not always true for an array element
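For the 6 × 6 hierarchical example above, the all-reuse idea can be sketched in a few lines: a single pass over the elements, where each element accumulatively updates every concentric grid it lies in. The function name and the use of SUM as the aggregate are illustrative assumptions.

```python
import numpy as np

def hierarchical_sums_all_reuse(a):
    """One pass over a 2k x 2k array: 1 load + 1 aggregation.

    Each element updates the running sum of every nested grid that
    contains it, so no grid is ever re-scanned.
    """
    n = a.shape[0]
    levels = n // 2                 # grids: outermost n x n ... innermost 2 x 2
    sums = [0.0] * levels
    for i in range(n):
        for j in range(n):
            # Ring index: 0 = only the outermost grid,
            # levels-1 = inside the innermost 2 x 2 grid.
            ring = min(i, j, n - 1 - i, n - 1 - j)
            # The element contributes to the outermost grid and to
            # every nested grid down to its own ring.
            for lvl in range(min(ring, levels - 1) + 1):
                sums[lvl] += a[i, j]
    return sums                     # sums[0]: outermost, sums[-1]: innermost

sums = hierarchical_sums_all_reuse(np.ones((6, 6)))
print(sums)  # [36.0, 16.0, 4.0]
```

This is also why the approach balances load for hierarchical/circular aggregations: work is proportional to elements scanned, not to the sizes of the individual grids.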
Parallel Performance vs. SciDB • No preprocessing cost is included for SciDB • Array slab/data size (8 GB) ratio: from 12.5% to 100% • Coarse-grained partitioning for the grid aggregation • All-reuse approach for the sliding aggregation • SciDB stores ‘chunked’ arrays: it can even support overlapping chunking to accelerate the sliding aggregation
Parallel Sliding Aggregation Performance • # of nodes: from 1 to 16 • 8 GB data • Sliding grid size: from 3 × 3 to 7 × 7
Conclusion • Support efficient structural aggregations over native array storage • Different partitioning strategies and a cost model for grid aggregations • All-reuse approach for overlapping aggregations