300 likes | 317 Views
This study examines the performance of column, row, and array-based database management systems in processing recursive queries, focusing on graphs, data storage types, query optimization, and algorithms. The research compares the three DBMS models in terms of query processing, optimizations, and practical implications for various types of databases.
E N D
A Comparison of Column, Row and Array DBMSs to Process Recursive Queries Carlos Ordonez University of Houston
Acknowledgments Michael Stonebraker Visited MIT 2013-2015 Wellington Cabrera; Achyuth Gurram Divesh Srivastava, sabbatical at ATT Labs
Introduction Classical topic in Datalog Recursion defined in ANSI SQL Graph algorithms: paths, reachability, neighborhood analysis Complexity: Cubic, NP-completeness Before: Deductive databases: datalog Harder query optimization than traditional SPJ queries
Directed Graphs Definitions: Directed Graph G=(V,E), maybe cyclic! A vertex in V : i or j and i,j=1..n. and edge (i,j ) has a direction and weight v storage: adjacency matrix E : |E|=N Problems: Transitive closure: vertices j reachable from i Power matrix Ek
Examples V=cities,E=roads. Is there some path from San Diego to NYC?: path from i to j? shortest one? V=employees, E=manager -> employee q1: all employees under i q2: Is j supervised by i? Bill of materials: The well-known part/subpart manufacturing DB: all subparts Y from part X
Technical details Linear recursion; Intuition: R=R*E Inner joins; no negation No GROUP-BY, DISTINCT, HAVING, NOT IN, OUTER JOIN clauses inside R Any SQL query on R is valid Seminaive; recursion depth k: loop with k-1 joins or (rarely) fixpoint
Stonebraker: One size does not fit all! • Storage in a DBMS: • Row: OLTP, point queries, cubes • Column: cube queries, ad-hoc queries • Array: math, science • Other: • Stream: one pass; in-RAM • MMDB: OLTP • Hadoop/noSQL: yawn (but evolving)
DBMS storage elevator storyrow | column | array Row: old, single file, block, B-trees/hash, hash horizontal partitioning Column: new, multiple files, var. size blocks, ordered values, compressed, no row-level index!, hash-segment Array: very different storage; attributes={dimensions|columns}; chunk==subarray; multidimensional; grid index in RAM; still hash but on chunk
Algorithms Semi-naïve (Datalog): classical, general, reasonably efficient (Delta), expressive Direct: very efficient; TC only; in-place update; matrix-based; requires arrays; not good for SQL; not used today! [TKDE 2010, Teradata DBMS]
Optimizations: SPJ Relational algebra + physical operators Join: hash or sort-merge (nested loop does not make sense with E) Projection: push dup elimination & aggreg. Selection: push filter To be explored later Left outer joins External joins Secondary indexes: row-level only?
Join: hash versus sort-mergeGoal: O(N) • Main computation: • Join optimization: • Column: projection={unordered, ordered values} • Row: unordered, ordered versus index • Array: default={ordered, indexed} choice={sparse,dense}
Projection • Duplicate elimination • reachability • binary edges • Aggregation • shortest/longest path • count # paths • length vs weight/cost
Issues with select operator Incorrect to use a predicate involving a join expression column in recursive step Cycles => Infinite recursion Monotonically increasing v, OK to prune Recursion depth k: required in practice
Benchmark with graphsgoal: understand O() and plan • Real • Skewed • Complex structure; sample==different • But Fixed size • Synthetic • Vary n,N • Vary shape • NEW: Cliques!
Conclusions Query optimizations Confirm decades of research: required But impact definitely varies G knowledge helps (catch 22) Benchmark with tuned query processing Column DBMS faster; cliques/skewed degree OK Array DBMS competitive for dense/clique G Row DBMS reasonable
Conclusions • Graph features impacting time and I/O • Density: avg vertex degree; deg(i) skew • Cliques: K • Cycles: deep k • Lessons • Tune storage • Tune join (skewed hash join) • Beware of large cliques and short cycles • Increment recursion depth k gradually
Future work Develop operators for Array DBMS Time complexity based on G Beyond semi-naïve: Direct, logarithmic Different non-path G problems like CC, Neighborhood analysis Beyond paths => clique detection Beyond O(): Query plan cost model
References on RQs • Many Datalog papers • Hadoop: SPARQL gaining interest • SQL much less • IBM • Direct: Agrawal, but not used today • My group: TKDE 2010 paper; Teradata papers; DOLAP 2014: IS journal: revisiting RQs in Columnar DBMS