The Gamma Operator for Big Data Summarization on an Array DBMS
Carlos Ordonez
Acknowledgments
• Michael Stonebraker from MIT
• My PhD students: Yiqun Zhang, Wellington Cabrera
• SciDB team: Paul Brown, Bryan Lewis, Alex Polyakov
Why SciDB?
• Large matrices beyond RAM size
• Storage by row or column is not good enough
• Matrices are natural in statistics, engineering, and science
• Multidimensional arrays map to matrices, but they are not the same thing
• Parallel shared-nothing architecture is best for big data analytics
• Closer to DBMS technology, but with some similarity to Hadoop
• Feasible to create array operators that take matrices as input and return a matrix as output
• Combines processing with the R package and LAPACK
New: Generalizing and unifying Sufficient Statistics: Z=[1,X,Y]
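In matrix form (a sketch of the algebra implied by the slide, assuming X is the d×n matrix with points x_i as columns, Y is the 1×n output vector, L = X·1 is the vector of linear sums, and z_i is the i-th column of Z):

```latex
Z = \begin{bmatrix} 1 \cdots 1 \\ X \\ Y \end{bmatrix}, \qquad
\Gamma = Z Z^T = \sum_{i=1}^{n} z_i z_i^T =
\begin{bmatrix}
  n                    & L^T    & \textstyle\sum_i y_i \\
  L                    & X X^T  & X Y^T \\
  \textstyle\sum_i y_i & Y X^T  & Y Y^T
\end{bmatrix}
```

One pass over the points therefore yields the counts, the linear sums, and the quadratic sums simultaneously.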
Important property of Γ: computation is fully PARALLEL
In SciDB we store the points of X as a 2D array.
[Figure: workers scan their partition of X in parallel]
Array storage and processing in SciDB
• Assuming d << n, it is natural to hash-partition X by i = 1..n
• Gamma computation is fully parallel, maintaining local Gamma versions in RAM (sketched below)
• X can be read with a fully parallel scan
• No need to write Gamma to disk during the scan, unless fault tolerance is required
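A minimal sketch of the per-worker accumulation, assuming a dense chunk with points stored column-wise and Γ kept as a (d+2)×(d+2) array in RAM; the names and layout here are illustrative, not the actual SciDB operator source:

```cpp
#include <cstddef>
#include <vector>

// Illustrative sketch (not the actual operator source): one worker
// accumulates its local Gamma = sum_i z_i * z_i^T in a single pass
// over its chunk of X, where z_i = (1, x_i, y_i).
struct GammaAccumulator {
  std::size_t p;           // p = d + 2: constant 1, the d features, y
  std::vector<double> G;   // p x p matrix, row-major, initialized to 0

  explicit GammaAccumulator(std::size_t d) : p(d + 2), G(p * p, 0.0) {}

  // x points to the d coordinates of one point; y is its output value.
  void add(const double* x, std::size_t d, double y) {
    std::vector<double> z(p);
    z[0] = 1.0;                                  // yields n and linear sums
    for (std::size_t k = 0; k < d; ++k) z[1 + k] = x[k];
    z[p - 1] = y;
    // Rank-1 update G += z * z^T: no sort, no join, everything in RAM.
    for (std::size_t a = 0; a < p; ++a)
      for (std::size_t b = 0; b < p; ++b)
        G[a * p + b] += z[a] * z[b];
  }
};
```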
A point must fit in one chunk; otherwise a join is needed (slow).
[Figure: chunk layouts at the Coordinator and Worker 1 — a point split across chunks (NO!) vs. a point contained in one chunk (OK)]
Parallel computation
[Figure: Worker 1 and Worker 2 each compute a partial Γ and send it to the Coordinator]
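The merge step at the coordinator is then just an element-wise sum of the partial matrices; a sketch under the same assumptions as the worker code above:

```cpp
#include <cstddef>
#include <vector>

// Sketch: the coordinator sums the partial Gamma matrices sent by the
// workers. Each partial is a p x p row-major matrix, so the global
// Gamma = Gamma[1] + ... + Gamma[N] is an element-wise addition.
// Assumes at least one worker reported a partial.
std::vector<double> mergeGamma(
    const std::vector<std::vector<double>>& partials) {
  std::vector<double> G(partials.front().size(), 0.0);
  for (const auto& Gw : partials)
    for (std::size_t k = 0; k < G.size(); ++k)
      G[k] += Gw[k];
  return G;
}
```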
Pros: algorithm evaluation with physical array operators
• Since xi fits in one chunk, we do not need to compute joins
• Since xi*xiT can be computed in RAM, we avoid an aggregation that would require sorting points by i
• No need to store X twice (X and XT): half the I/O
• No need to transpose X, a costly reorganization even in RAM
• Operator runs as C++ compiled code: fast
System issues and limitations
• Gamma is not efficiently computable in AQL or AFL: hence an operator is required
• Arrays of tuples in SciDB are more general, but cumbersome for matrix manipulation: we use arrays of a single attribute (double)
• Points must be stored completely inside a chunk: wide rectangular chunks, which may not be I/O optimal
• Slow: arrays must be pre-processed to SciDB load format, loaded to a 1D array, and re-dimensioned => the load must be optimized
• Multiple SciDB instances per node improve I/O speed by interleaving CPU and I/O
• Larger chunks are better: 8 MB, especially for dense matrices; avoid shuffling; avoid joins
• Dense (alpha) and sparse (beta) versions
Benchmark: scale-up
• Small: cluster with 2 Intel Quadcore servers, 4 GB RAM, 3 TB disk
• Large: Amazon cloud 2
Conclusions
• One-pass parallel summarization operator for a large matrix
• Optimization of the outer matrix multiplication as a sum (aggregation) of vector outer products
• Operator compatible with any parallel shared-nothing system
• The Gamma matrix must fit in RAM, but n is unlimited
• The summarization matrix can be exploited in many intermediate computations (with appropriate projections) in linear models, as in the example below
• Simplifies many methods to two phases:
  • Summarization
  • Computing model parameters
• Requires arrays, but can still work with SQL or MapReduce
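As one example (a sketch using the blocks of Γ defined earlier, not a result stated on this slide): the least-squares coefficients of a linear model can be obtained from projections of Γ alone, with no further pass over X:

```latex
\hat{\beta} =
\begin{bmatrix} n & L^T \\ L & X X^T \end{bmatrix}^{-1}
\begin{bmatrix} \sum_i y_i \\ X Y^T \end{bmatrix}
```

Every block in this formula is a sub-matrix of Γ, so phase 2 (computing model parameters) works on a (d+1)×(d+1) problem whose size is independent of n.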
Future work
• Theory
  • Use Gamma in other models like logistic regression, clustering, factor analysis, HMMs
  • Connection to frequent itemsets
  • Sampling
  • Higher expected moments, co-variates
  • Unlikely: numeric stability with unnormalized sorted data
• Systems
  • DONE: Sparse matrices: layout, compression
  • DONE: Beat LAPACK on high d
  • Online model learning (cursor interface needed)
  • Unlimited d (currently d > 8000); join required for high d? Parallel processing of high d is more complicated, chunked
  • PENDING: Interface with BLAS and MKL
  • Faster than UDFs in a columnar DBMS?