
SAGA: Array Storage as a DB with Support for Structural Aggregations

SSDBM 2014, June 30th, Aalborg, Denmark. Yi Wang, Arnab Nandi, Gagan Agrawal (The Ohio State University).


Presentation Transcript


  1. SAGA: Array Storage as a DB with Support for Structural Aggregations SSDBM 2014, June 30th, Aalborg, Denmark Yi Wang, Arnab Nandi, Gagan Agrawal The Ohio State University

  2. Outline • Introduction • Grid Aggregations • Overlapping Aggregations • Experimental Results • Conclusion

  3. Big Data Is Often Big Arrays • Array data is everywhere Molecular Simulation: Molecular Data Life Science: DNA Sequencing Data (Microarray) Earth Science: Ocean and Climate Data Space Science: Astronomy Data

  4. How to Process Big Arrays? • Use relational databases? • Poor Expressibility • Loses the natural positional/structural information • Most complex operations are naturally defined in terms of arrays: e.g., correlations, convolution, curve fitting … • Poor Performance • Cumbersome data transformations • Too heavyweight: e.g., transactions • One size does not fit all! [Diagram: Input Table → Mapping → Input Array → Manipulation → Output Array → Rendering → Output Table]

  5. Array Databases • Examples: SciDB, RasDaMan and MonetDB • Treat Arrays as First-Class Citizens • Everything is defined in the array dialect • Lightweight or No ACID Maintenance • No write conflict: ACID is inherently guaranteed • Other Desired Functionality • Structural aggregations, array join, provenance…

  6. The Upfront Cost of Using SciDB • High-Level Data Flow • Requires data ingestion • Data Ingestion Steps • Raw files (e.g., HDF5) -> CSV • Load CSV files into SciDB “EarthDB: scalable analysis of MODIS data using SciDB” - G. Planthaber et al.

  7. Array Storage as a DB • A Paradigm Similar to NoDB • Still maintains DB functionality • But no data ingestion • DB and Array Storage as a DB: Friends or Foes? • When to use DB? • Load once, and query frequently • When to directly use array storage? • Query infrequently, so avoid loading • Our System • Focuses on a set of special array operations - Structural Aggregations

  8. Traditional Value-Based Aggregation • Value-based aggregation: SELECT COUNT(Member) AS Num, Nationality FROM T1 GROUP BY Nationality; • Aggregates the elements of the same value at a time [Figure: table T1 and its aggregation result]
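A minimal Python sketch of the value-based GROUP BY above; the table T1 and its rows are invented for illustration, not data from the slides:

```python
# Value-based aggregation (GROUP BY) sketch: count members per nationality.
from collections import Counter

t1 = [
    ("Alice", "Denmark"),
    ("Bob", "Denmark"),
    ("Chen", "China"),
]

# COUNT(Member) AS Num ... GROUP BY Nationality
num_by_nationality = Counter(nationality for _, nationality in t1)
print(num_by_nationality["Denmark"])  # 2
```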

  9. Structural Aggregation • Aggregate the elements based on positional relationships • E.g., moving average: calculates the average of each 2 × 2 square from left to right • Aggregates the elements in the same square at a time [Figure: input array and aggregation result]
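The 2 × 2 grid aggregation above can be sketched in plain Python; the 4 × 4 input array and the helper name `grid_average` are invented for illustration:

```python
# Structural (grid) aggregation sketch: average of each non-overlapping
# 2x2 square of a 4x4 array.

array = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]

def grid_average(a, size):
    n = len(a)
    result = []
    for i in range(0, n, size):
        row = []
        for j in range(0, n, size):
            cells = [a[x][y] for x in range(i, i + size)
                             for y in range(j, j + size)]
            row.append(sum(cells) / len(cells))
        result.append(row)
    return result

print(grid_average(array, 2))  # [[3.5, 5.5], [11.5, 13.5]]
```

Unlike the value-based GROUP BY, the grouping here is purely positional: which square an element falls in, not what value it holds.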

  10. Structural Aggregation Types Non-Overlapping Aggregation Overlapping Aggregation

  11. Grid Aggregation • Parallelization: Easy after Partitioning • Considerations • Data contiguity which affects the I/O performance • Communication cost • Load balancing for skewed data • Partitioning Strategies • Coarse-grained, fine-grained, hybrid, and auto-grained • Why not use dynamic repartitioning? • Runtime overhead • Poor data contiguity • Redundant data loads

  12. Coarse-Grained Partitioning • Pros • Low I/O cost • Low communication cost • Cons • Workload imbalance for skewed data

  13. Fine-Grained Partitioning • Pros • Excellent workload balance for skewed data • Cons • Relatively high I/O cost • High communication cost

  14. Hybrid Partitioning • Pros • Low communication cost • Good workload balance for skewed data • Cons • High I/O cost
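The trade-off between the coarse- and fine-grained strategies above can be sketched as follows; the per-grid costs are invented to illustrate skew:

```python
# Contrast coarse- vs fine-grained partitioning of grids across processes.
# Per-process load is the sum of the costs of its assigned grids.

costs = [1, 1, 1, 1, 10, 10, 10, 10]  # skewed: heavy grids cluster at the end
nprocs = 2

# Coarse-grained: each process takes one contiguous block of grids.
block = len(costs) // nprocs
coarse = [sum(costs[p * block:(p + 1) * block]) for p in range(nprocs)]

# Fine-grained: grids dealt out round-robin, one at a time.
fine = [sum(costs[p::nprocs]) for p in range(nprocs)]

print(coarse)  # [4, 40] -- badly imbalanced under skew
print(fine)    # [22, 22] -- balanced, at the price of contiguity
```

This is the tension the slides describe: contiguous blocks give cheap I/O and communication but poor balance on skewed data, while round-robin assignment balances the load but fragments the reads.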

  15. Auto-Grained Partitioning • 2 Steps • Estimate the grid density (after filtering) by uniform sampling, and hence estimate the computation cost (based on the computation complexity) • For each grid, total processing cost = constant loading cost + variable computation cost • Partition the cost array - Balanced Contiguous Multi-Way Partitioning • Dynamic programming (a small number of grids) • Greedy (a large number of grids)
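The second step above can be sketched with the greedy variant of balanced contiguous multi-way partitioning; this is a simplified heuristic (close a part once it reaches the average target), not the paper's exact algorithm, and the cost values are invented:

```python
# Greedy sketch: split a cost array into k contiguous parts of roughly
# equal total cost.

def contiguous_partition(costs, k):
    target = sum(costs) / k
    parts, current = [], []
    for c in costs:
        current.append(c)
        if sum(current) >= target and len(parts) < k - 1:
            parts.append(current)
            current = []
    parts.append(current)
    return parts

print(contiguous_partition([3, 1, 4, 1, 5, 9, 2, 6], 3))
# [[3, 1, 4, 1, 5], [9, 2], [6]]
```

Contiguity is the point: each part is a run of adjacent grids, so the loads stay sequential even though the part boundaries adapt to the estimated costs.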

  16. Auto-Grained Partitioning (Cont’d) • Pros • Low I/O cost • Low communication cost • Great workload balance for skewed data • Cons • Overhead of sampling and runtime partitioning

  17. Partitioning Strategy Summary Our partitioning strategy decider can help choose the best strategy

  18. Partitioning Strategy Decider • Cost Model: analyze load cost and computation cost separately • Load cost • Loading factor × data amount • Computation cost • Exception - Auto-Grained: take load cost and computation cost as a whole
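A hedged sketch of the decider's cost model: total cost per strategy is loading factor × data amount plus an estimated computation cost, and the cheapest strategy wins. The numeric factors below are invented placeholders, not measured constants from the paper:

```python
# Pick the partitioning strategy with the lowest modeled total cost.

def total_cost(loading_factor, data_amount, compute_cost):
    return loading_factor * data_amount + compute_cost

strategies = {
    "coarse": total_cost(1.0, 8.0, 3.0),  # contiguous reads: low load factor
    "fine": total_cost(1.5, 8.0, 1.0),    # fragmented reads: higher load factor
    "hybrid": total_cost(1.4, 8.0, 1.5),
}
best = min(strategies, key=strategies.get)
print(best)  # coarse
```

For auto-grained partitioning the slides note the exception: load and computation costs are estimated jointly per grid rather than as separate terms.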

  19. Overlapping Aggregation • I/O Cost • Reuse the data already in the memory • Reduce the disk I/O to enhance the I/O performance • Memory Accesses • Reuse the data already in the cache • Reduce cache misses to accelerate the computation • Aggregation Approaches • Naïve approach • Data-reuse approach • All-reuse approach

  20. Example: Hierarchical Aggregation • Aggregate 3 grids in a 6 × 6 array • The innermost 2 × 2 grid • The middle 4 × 4 grid • The outermost 6 × 6 grid • (Parallel) sliding aggregation is much more complicated

  21. Naïve Approach • Load the innermost grid, aggregate the innermost grid • Load the middle grid, aggregate the middle grid • Load the outermost grid, aggregate the outermost grid • For N grids: N loads + N aggregations

  22. Data-Reuse Approach • Load the outermost grid • Aggregate the outermost grid • Aggregate the middle grid • Aggregate the innermost grid • For N grids: 1 load + N aggregations

  23. All-Reuse Approach • Load the outermost grid • Once an element is accessed, accumulatively update every aggregation result it contributes to: only the outermost result, the outermost and middle results, or all 3 results, depending on the element's position • For N grids: 1 load + 1 aggregation
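A minimal Python sketch of this single-pass idea for the hierarchical example: one load of the 6 × 6 array, then one sweep over the elements updates all three concentric grid sums. The array values are invented:

```python
# All-reuse sketch: 1 load + 1 aggregation pass computes the sums of the
# innermost 2x2, middle 4x4, and outermost 6x6 grids simultaneously.

N = 6
array = [[i * N + j for j in range(N)] for i in range(N)]  # the single load

sums = [0, 0, 0]  # innermost, middle, outermost
for i in range(N):
    for j in range(N):
        v = array[i][j]
        if 2 <= i < 4 and 2 <= j < 4:   # element inside the innermost 2x2
            sums[0] += v
        if 1 <= i < 5 and 1 <= j < 5:   # element inside the middle 4x4
            sums[1] += v
        sums[2] += v                    # every element is in the outermost 6x6

print(sums)  # [70, 280, 630]
```

An element in the innermost grid falls through all three updates, one on the middle ring updates two results, and one on the border updates only the outermost: exactly the three cases the slide lists.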

  24. All-Reuse Approach (Cont’d) • Key Insight • # of aggregation results ≤ # of queried elements • More computationally efficient to iterate over elements and update the associated aggregation results • More Benefits • Load balance (for hierarchical/circular aggregations) • More speedup for compound array elements • The data type of an aggregation result is usually primitive, but this is not always true for an array element

  25. Parallel Performance vs. SciDB • No preprocessing cost is included for SciDB • Array slab/data size (8 GB) ratio: from 12.5% to 100% • Coarse-grained partitioning for the grid aggregation • All-reuse approach for the sliding aggregation • SciDB stores 'chunked' arrays and can even support overlapping chunking to accelerate the sliding aggregation

  26. Parallel Sliding Aggregation Performance • # of nodes: from 1 to 16 • 8 GB data • Sliding grid size: from 3 × 3 to 7 × 7

  27. Conclusion • Support efficient structural aggregations over native array storage • Different partitioning strategies and a cost model for grid aggregations • All-reuse approach for overlapping aggregations
