
SAGA: Array Storage as a DB with Support for Structural Aggregations

SSDBM 2014, June 30th, Aalborg, Denmark. Yi Wang, Arnab Nandi, Gagan Agrawal (The Ohio State University).


Presentation Transcript


  1. SAGA: Array Storage as a DB with Support for Structural Aggregations SSDBM 2014, June 30th, Aalborg, Denmark Yi Wang, Arnab Nandi, Gagan Agrawal The Ohio State University

  2. Outline • Introduction • Grid Aggregations • Overlapping Aggregations • Experimental Results • Conclusion

  3. Big Data Is Often Big Arrays • Array data is everywhere Molecular Simulation: Molecular Data Life Science: DNA Sequencing Data (Microarray) Earth Science: Ocean and Climate Data Space Science: Astronomy Data

  4. How to Process Big Arrays? • Use relational databases? • Poor Expressibility • Loses the natural positional/structural information • Most complex operations are naturally defined in terms of arrays: e.g., correlations, convolution, curve fitting … • Poor Performance • Cumbersome data transformations • Too heavyweight: e.g., transactions • One size does not fit all! [Diagram: Input Table → Mapping → Input Array → Manipulation → Output Array → Rendering → Output Table]

  5. Array Databases • Examples: SciDB, RasDaMan and MonetDB • Treat Arrays as First-Class Citizens • Everything is defined in the array dialect • Lightweight or No ACID Maintenance • No write conflict: ACID is inherently guaranteed • Other Desired Functionality • Structural aggregations, array join, provenance…

  6. The Upfront Cost of Using SciDB • High-Level Data Flow • Requires data ingestion • Data Ingestion Steps • Raw files (e.g., HDF5) -> CSV • Load CSV files into SciDB “EarthDB: scalable analysis of MODIS data using SciDB” - G. Planthaber et al.

  7. Array Storage as a DB • A Paradigm Similar to NoDB • Still maintains DB functionality • But no data ingestion • DB and Array Storage as a DB: Friends or Foes? • When to use DB? • Load once, and query frequently • When to directly use array storage? • Query infrequently, so avoid loading • Our System • Focuses on a set of special array operations - Structural Aggregations

  8. Traditional Value-Based Aggregation • Value-based aggregation: SELECT COUNT(Member) AS Num, Nationality FROM T1 GROUP BY Nationality; • Aggregates the elements of the same value at a time [Figure: table T1 and its aggregation result]
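A minimal Python sketch of the value-based GROUP BY above; the table T1 and its rows are invented for illustration, not data from the slides:

```python
# Value-based aggregation (GROUP BY) sketch: count members per nationality.
from collections import Counter

t1 = [
    ("Alice", "Denmark"),
    ("Bob", "Denmark"),
    ("Chen", "China"),
]

# COUNT(Member) AS Num ... GROUP BY Nationality
num_by_nationality = Counter(nationality for _, nationality in t1)
print(num_by_nationality["Denmark"])  # 2
```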

  9. Structural Aggregation • Aggregate the elements based on positional relationships • E.g., moving average: calculates the average of each 2 × 2 square from left to right • Aggregates the elements in the same square at a time [Figure: input array and aggregation result]
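The 2 × 2 grid aggregation above can be sketched in plain Python; the 4 × 4 input array and the helper name `grid_average` are invented for illustration:

```python
# Structural (grid) aggregation sketch: average of each non-overlapping
# 2x2 square of a 4x4 array.

array = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]

def grid_average(a, size):
    n = len(a)
    result = []
    for i in range(0, n, size):
        row = []
        for j in range(0, n, size):
            cells = [a[x][y] for x in range(i, i + size)
                             for y in range(j, j + size)]
            row.append(sum(cells) / len(cells))
        result.append(row)
    return result

print(grid_average(array, 2))  # [[3.5, 5.5], [11.5, 13.5]]
```

Unlike the value-based GROUP BY, the grouping here is purely positional: which square an element falls in, not what value it holds.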

  10. Structural Aggregation Types Non-Overlapping Aggregation Overlapping Aggregation

  11. Grid Aggregation • Parallelization: Easy after Partitioning • Considerations • Data contiguity which affects the I/O performance • Communication cost • Load balancing for skewed data • Partitioning Strategies • Coarse-grained, fine-grained, hybrid, and auto-grained • Why not use dynamic repartitioning? • Runtime overhead • Poor data contiguity • Redundant data loads

  12. Coarse-Grained Partitioning • Pros • Low I/O cost • Low communication cost • Cons • Workload imbalance for skewed data

  13. Fine-Grained Partitioning • Pros • Excellent workload balance for skewed data • Cons • Relatively high I/O cost • High communication cost

  14. Hybrid Partitioning • Pros • Low communication cost • Good workload balance for skewed data • Cons • High I/O cost
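The trade-off between the coarse- and fine-grained strategies above can be sketched as follows; the per-grid costs are invented to illustrate skew:

```python
# Contrast coarse- vs fine-grained partitioning of grids across processes.
# Per-process load is the sum of the costs of its assigned grids.

costs = [1, 1, 1, 1, 10, 10, 10, 10]  # skewed: heavy grids cluster at the end
nprocs = 2

# Coarse-grained: each process takes one contiguous block of grids.
block = len(costs) // nprocs
coarse = [sum(costs[p * block:(p + 1) * block]) for p in range(nprocs)]

# Fine-grained: grids dealt out round-robin, one at a time.
fine = [sum(costs[p::nprocs]) for p in range(nprocs)]

print(coarse)  # [4, 40] -- badly imbalanced under skew
print(fine)    # [22, 22] -- balanced, at the price of contiguity
```

This is the tension the slides describe: contiguous blocks give cheap I/O and communication but poor balance on skewed data, while round-robin assignment balances the load but fragments the reads.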

  15. Auto-Grained Partitioning • 2 Steps • Estimate the grid density (after filtering) by uniform sampling, and hence estimate the computation cost (based on the computation complexity) • For each grid, total processing cost = constant loading cost + variable computation cost • Partition the cost array - Balanced Contiguous Multi-Way Partitioning • Dynamic programming (a small number of grids) • Greedy (a large number of grids)
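The second step above can be sketched with the greedy variant of balanced contiguous multi-way partitioning; this is a simplified heuristic (close a part once it reaches the average target), not the paper's exact algorithm, and the cost values are invented:

```python
# Greedy sketch: split a cost array into k contiguous parts of roughly
# equal total cost.

def contiguous_partition(costs, k):
    target = sum(costs) / k
    parts, current = [], []
    for c in costs:
        current.append(c)
        if sum(current) >= target and len(parts) < k - 1:
            parts.append(current)
            current = []
    parts.append(current)
    return parts

print(contiguous_partition([3, 1, 4, 1, 5, 9, 2, 6], 3))
# [[3, 1, 4, 1, 5], [9, 2], [6]]
```

Contiguity is the point: each part is a run of adjacent grids, so the loads stay sequential even though the part boundaries adapt to the estimated costs.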

  16. Auto-Grained Partitioning (Cont’d) • Pros • Low I/O cost • Low communication cost • Great workload balance for skewed data • Cons • Overhead of sampling and runtime partitioning

  17. Partitioning Strategy Summary Our partitioning strategy decider can help choose the best strategy

  18. Partitioning Strategy Decider • Cost Model: analyze load cost and computation cost separately • Load cost • Loading factor × data amount • Computation cost • Exception - Auto-Grained: take load cost and computation cost as a whole
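A hedged sketch of the decider's cost model: total cost per strategy is loading factor × data amount plus an estimated computation cost, and the cheapest strategy wins. The numeric factors below are invented placeholders, not measured constants from the paper:

```python
# Pick the partitioning strategy with the lowest modeled total cost.

def total_cost(loading_factor, data_amount, compute_cost):
    return loading_factor * data_amount + compute_cost

strategies = {
    "coarse": total_cost(1.0, 8.0, 3.0),  # contiguous reads: low load factor
    "fine": total_cost(1.5, 8.0, 1.0),    # fragmented reads: higher load factor
    "hybrid": total_cost(1.4, 8.0, 1.5),
}
best = min(strategies, key=strategies.get)
print(best)  # coarse
```

For auto-grained partitioning the slides note the exception: load and computation costs are estimated jointly per grid rather than as separate terms.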

  19. Overlapping Aggregation • I/O Cost • Reuse the data already in the memory • Reduce the disk I/O to enhance the I/O performance • Memory Accesses • Reuse the data already in the cache • Reduce cache misses to accelerate the computation • Aggregation Approaches • Naïve approach • Data-reuse approach • All-reuse approach

  20. Example: Hierarchical Aggregation • Aggregate 3 grids in a 6 × 6 array • The innermost 2 × 2 grid • The middle 4 × 4 grid • The outermost 6 × 6 grid • (Parallel) sliding aggregation is much more complicated

  21. Naïve Approach • Load the innermost grid, aggregate the innermost grid • Load the middle grid, aggregate the middle grid • Load the outermost grid, aggregate the outermost grid • For N grids: N loads + N aggregations

  22. Data-Reuse Approach • Load the outermost grid • Aggregate the outermost grid • Aggregate the middle grid • Aggregate the innermost grid • For N grids: 1 load + N aggregations

  23. All-Reuse Approach • Load the outermost grid • Once an element is accessed, accumulatively update every aggregation result it contributes to: only the outermost result, the outermost and middle results, or all 3 results, depending on the element's position • For N grids: 1 load + 1 aggregation
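A minimal Python sketch of this single-pass idea for the hierarchical example: one load of the 6 × 6 array, then one sweep over the elements updates all three concentric grid sums. The array values are invented:

```python
# All-reuse sketch: 1 load + 1 aggregation pass computes the sums of the
# innermost 2x2, middle 4x4, and outermost 6x6 grids simultaneously.

N = 6
array = [[i * N + j for j in range(N)] for i in range(N)]  # the single load

sums = [0, 0, 0]  # innermost, middle, outermost
for i in range(N):
    for j in range(N):
        v = array[i][j]
        if 2 <= i < 4 and 2 <= j < 4:   # element inside the innermost 2x2
            sums[0] += v
        if 1 <= i < 5 and 1 <= j < 5:   # element inside the middle 4x4
            sums[1] += v
        sums[2] += v                    # every element is in the outermost 6x6

print(sums)  # [70, 280, 630]
```

An element in the innermost grid falls through all three updates, one on the middle ring updates two results, and one on the border updates only the outermost: exactly the three cases the slide lists.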

  24. All-Reuse Approach (Cont’d) • Key Insight • # of aggregation results ≤ # of queried elements • More computationally efficient to iterate over elements and update the associated aggregation results • More Benefits • Load balance (for hierarchical/circular aggregations) • More speedup for compound array elements • The data type of an aggregation result is usually primitive, but this is not always true for an array element

  25. Parallel Performance vs. SciDB • No preprocessing cost is included for SciDB • Array slab/data size (8 GB) ratio: from 12.5% to 100% • Coarse-grained partitioning for the grid aggregation • All-reuse approach for the sliding aggregation • SciDB stores 'chunked' arrays and can even support overlapping chunking to accelerate the sliding aggregation

  26. Parallel Sliding Aggregation Performance • # of nodes: from 1 to 16 • 8 GB data • Sliding grid size: from 3 × 3 to 7 × 7

  27. Conclusion • Support efficient structural aggregations over native array storage • Different partitioning strategies and a cost model for grid aggregations • All-reuse approach for overlapping aggregations
