A Comparison of Column, Row and Array DBMSs to Process Recursive Queries

A Comparison of Column, Row and Array DBMSs to Process Recursive Queries Carlos Ordonez University of Houston

Acknowledgments Michael Stonebraker Visited MIT 2013-2015 Wellington Cabrera; Achyuth Gurram Divesh Srivastava, sabbatical at ATT Labs

Introduction Classical topic in Datalog Recursion defined in ANSI SQL Graph algorithms: paths, reachability, neighborhood analysis Complexity: Cubic, NP-completeness Before: Deductive databases: datalog Harder query optimization than traditional SPJ queries

Directed Graphs Definitions: Directed Graph G=(V,E), maybe cyclic! A vertex in V : i or j and i,j=1..n. and edge (i,j ) has a direction and weight v storage: adjacency matrix E : |E|=N Problems: Transitive closure: vertices j reachable from i Power matrix Ek

Examples V=cities,E=roads. Is there some path from San Diego to NYC?: path from i to j? shortest one? V=employees, E=manager -> employee q1: all employees under i q2: Is j supervised by i? Bill of materials: The well-known part/subpart manufacturing DB: all subparts Y from part X

Recursive view Rk: recursion depth n=|V|,N=|E|

Transitive Closure: G+=(V,E’)

Power matrix

Technical details Linear recursion; Intuition: R=R*E Inner joins; no negation No GROUP-BY, DISTINCT, HAVING, NOT IN, OUTER JOIN clauses inside R Any SQL query on R is valid Seminaive; recursion depth k: loop with k-1 joins or (rarely) fixpoint

Stonebraker: One size does not fit all! • Storage in a DBMS: • Row: OLTP, point queries, cubes • Column: cube queries, ad-hoc queries • Array: math, science • Other: • Stream: one pass; in-RAM • MMDB: OLTP • Hadoop/noSQL: yawn (but evolving)

DBMS storage elevator storyrow | column | array Row: old, single file, block, B-trees/hash, hash horizontal partitioning Column: new, multiple files, var. size blocks, ordered values, compressed, no row-level index!, hash-segment Array: very different storage; attributes={dimensions|columns}; chunk==subarray; multidimensional; grid index in RAM; still hash but on chunk

Algorithms Semi-naïve (Datalog): classical, general, reasonably efficient (Delta), expressive Direct: very efficient; TC only; in-place update; matrix-based; requires arrays; not good for SQL; not used today! [TKDE 2010, Teradata DBMS]

Semi-naive

Seminaive in SQL

Optimizations: SPJ Relational algebra + physical operators Join: hash or sort-merge (nested loop does not make sense with E) Projection: push dup elimination & aggreg. Selection: push filter To be explored later Left outer joins External joins Secondary indexes: row-level only?

Join: hash versus sort-mergeGoal: O(N) • Main computation: • Join optimization: • Column: projection={unordered, ordered values} • Row: unordered, ordered versus index • Array: default={ordered, indexed} choice={sparse,dense}

Projection • Duplicate elimination • reachability • binary edges • Aggregation • shortest/longest path • count # paths • length vs weight/cost

SelectionReduce |Rd|, correctness

Issues with select operator Incorrect to use a predicate involving a join expression column in recursive step Cycles => Infinite recursion Monotonically increasing v, OK to prune Recursion depth k: required in practice

Benchmark with graphsgoal: understand O() and plan • Real • Skewed • Complex structure; sample==different • But Fixed size • Synthetic • Vary n,N • Vary shape • NEW: Cliques!

Simulating realistic Gsocial nets, Internet

Join (stop=30 mins; array 2 chunk sizes req’d)

ProjectionDuplicate elimination

ProjectionAggregation

Selectionsimple filter i=1

Ultimate benchmark (fair?)tuned storage, best join, aggr

Conclusions Query optimizations Confirm decades of research: required But impact definitely varies G knowledge helps (catch 22) Benchmark with tuned query processing Column DBMS faster; cliques/skewed degree OK Array DBMS competitive for dense/clique G Row DBMS reasonable

Conclusions • Graph features impacting time and I/O • Density: avg vertex degree; deg(i) skew • Cliques: K • Cycles: deep k • Lessons • Tune storage • Tune join (skewed hash join) • Beware of large cliques and short cycles • Increment recursion depth k gradually

Future work Develop operators for Array DBMS Time complexity based on G Beyond semi-naïve: Direct, logarithmic Different non-path G problems like CC, Neighborhood analysis Beyond paths => clique detection Beyond O(): Query plan cost model

References on RQs • Many Datalog papers • Hadoop: SPARQL gaining interest • SQL much less • IBM • Direct: Agrawal, but not used today • My group: TKDE 2010 paper; Teradata papers; DOLAP 2014: IS journal: revisiting RQs in Columnar DBMS

A Comparison of Column, Row and Array DBMSs to Process Recursive Queries

A Comparison of Column, Row and Array DBMSs to Process Recursive Queries

Presentation Transcript

Decidable Containment of Recursive Queries

Column Oriented Database Vs Row Oriented Databases

Row and column matrices are sometimes called row vectors and column vectors

Some Recursive Queries

Evaluation of Recursive Queries Part 2: Pushing Selections

5.5 Row Space, Column Space, and Nullspace

Massive Support Vector Regression (via Row and Column Chunking)

Section 5.5 Row Space, Column Space, and Nullsapce

Databases and DBMSs

Row Space and Column Space

Row n Column r

Optimizing recursive queries in sql

Row-Column Scanning Demonstration

Architecture and Classification of DBMSs

Recursive Graph Deduction and Reachability Queries

Massive Support Vector Regression (via Row and Column Chunking)

4.7 Row Space, Column Space, and Null Space

Recursive Queries

Optimizing Recursive Queries in SQL