TDD: Research Topics in Distributed Databases
Querying Big Data
• Tractability revisited for querying big data
• BD-tractability
• Reductions, complete problems, separation results
• Querying big data
• Scale independence
• Making big data “small”
• Approximate query answering
• Relaxing query semantics
• Data-driven approximation
TDD (LN 5)
Big data
• Volume: in PB (10^15 B) or EB (10^18 B) or …
• Variety: heterogeneous, semi-structured or unstructured
• Velocity: dynamic
• Veracity: trust in its quality
• In fact, big data is any data that cannot be handled with your available resources
The new challenges introduced by big data?
• Computer science is the study of computing a function f(x)
• Here x is big: PB (10^15 B) or EB (10^18 B)
A new complexity theory for big data
The good, the bad and the ugly
• Traditional computational complexity theory of almost 50 years:
• The good: polynomial time computable (PTIME)
• The bad: NP-hard (intractable)
• The ugly: PSPACE-hard, EXPTIME-hard, undecidable…
What happens when it comes to big data?
• Assume an SSD with a scan rate of 6 GB/s. A linear scan of a data set D would take
• 1.9 days when D is of 1 PB (10^15 B)
• 5.28 years when D is of 1 EB (10^18 B)
• O(n) time is already beyond reach on big data in practice!
Polynomial time queries become intractable on big data
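The scan-time figures can be checked with a few lines of arithmetic. This is a minimal sketch that assumes "6G/s" means 6 × 10^9 bytes per second and a 365-day year; the names `scan_seconds`, `days_for_pb` and `years_for_eb` are illustrative only.

```python
# Assumption: the scan rate "6G/s" is read as 6 * 10**9 bytes per second.
SCAN_RATE_BYTES_PER_SEC = 6e9
PB = 10**15  # bytes in one petabyte
EB = 10**18  # bytes in one exabyte


def scan_seconds(size_bytes, rate=SCAN_RATE_BYTES_PER_SEC):
    """Seconds needed for one sequential pass over size_bytes."""
    return size_bytes / rate


days_for_pb = scan_seconds(PB) / 86_400          # 86,400 seconds per day
years_for_eb = scan_seconds(EB) / 86_400 / 365   # assumes a 365-day year

print(f"1 PB: {days_for_pb:.2f} days")
print(f"1 EB: {years_for_eb:.2f} years")
```

Under these assumptions the results agree with the roughly 1.9 days and 5.28 years quoted on the slide.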
Tractability revisited for queries on big data
• A class 𝒬 of queries is BD-tractable if there exists a PTIME preprocessing function Π such that
• for any database D on which queries of 𝒬 are defined, D' = Π(D) (hence D' is of polynomial size), and
• for all queries Q in 𝒬 defined on D, Q(D) can be computed by evaluating Q (possibly rewritten) on D' in parallel polylog time (NC), i.e., in parallel log^k(|D|, |Q|) time
• Diagram: D is preprocessed once into Π(D); queries Q1, Q2, … are then evaluated on Π(D)
• Does it work? If a linear scan of D could be done in log(|D|) time:
• 15 seconds when D is of 1 PB instead of 1.9 days
• 18 seconds when D is of 1 EB rather than 5.28 years
BD-tractable queries are feasible on big data
BD-tractable queries
• A class 𝒬 of queries is BD-tractable if there exists a PTIME preprocessing function Π such that
• for any database D on which queries of 𝒬 are defined, D' = Π(D), and
• for all queries Q in 𝒬 defined on D, Q(D) can be computed by evaluating Q on D' in parallel polylog time (NC), i.e., in parallel with more resources
• TQ0: the set of all BD-tractable query classes
• Preprocessing:
• a one-time process, offline, once for all queries in 𝒬
• indices, compression, views, incremental computation, …
• does not necessarily reduce the size of D
Preprocessing: a common practice of database people
What query classes are BD-tractable?
• Boolean selection queries
• Input: A dataset D
• Query: Does there exist a tuple t in D such that t[A] = c?
• Build a B+-tree on the A-column values in D. Then all such selection queries can be answered in O(log(|D|)) time.
• Graph reachability queries (NL-complete)
• Input: A directed graph G
• Query: Does there exist a path from node s to t in G?
• What else? Relational algebra + set recursion on ordered relational databases
Some natural query classes are BD-tractable
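The Boolean selection case can be illustrated with a small sketch. It swaps the B+-tree for a sorted Python list searched with `bisect`, which gives the same logarithmic lookup after a one-time preprocessing step; the names `build_index` and `exists` and the dictionary encoding of tuples are assumptions made for illustration.

```python
from bisect import bisect_left


def build_index(dataset, attr):
    """Preprocessing (done once): sort the values of column `attr`.

    A sorted list stands in for the B+-tree mentioned on the slide.
    """
    return sorted(t[attr] for t in dataset)


def exists(index, c):
    """Answer "is there a tuple t with t[attr] = c?" by binary search."""
    i = bisect_left(index, c)
    return i < len(index) and index[i] == c


# Usage: tuples are modelled as dictionaries (an assumption of this sketch).
D = [{"A": 7}, {"A": 3}, {"A": 11}]
index_on_A = build_index(D, "A")
print(exists(index_on_A, 7), exists(index_on_A, 8))
```

Each query costs O(log |D|) comparisons, while the sort is paid once for all queries in the class.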
Deal with queries that are not BD-tractable
Many query classes are not BD-tractable.
• Breadth-Depth Search (BDS)
• Input: An undirected graph G = (V, E) with a numbering on its nodes, and a pair (u, v) of nodes in V
• Question: Is u visited before v in the breadth-depth search of G?
• Breadth-depth search starts at a node s and visits all its children, pushing them onto a stack in the reverse order induced by the vertex numbering. After all of s's children are visited, it continues with the node on the top of the stack, which plays the role of s.
Is this problem (query class) BD-tractable?
• Here D is empty, and Q is (G, (u, v))
• No. The problem is well known to be P-complete!
• We need PTIME to process each query (G, (u, v))!
• Preprocessing does not help us answer such queries.
Can we make it BD-tractable?
Make queries BD-tractable
Factorization: partition instances to identify a data part D for preprocessing, and a query part Q for operations
• Breadth-Depth Search (BDS)
• Input: An undirected graph G = (V, E) with a numbering on its nodes, and a pair (u, v) of nodes in V
• Question: Is u visited before v in the breadth-depth search of G?
• Factorization: D is G = (V, E), Q is (u, v)
• Preprocessing: Π(G) performs BDS on G, and returns a list M consisting of the nodes in V in the order in which they are visited
• For all queries (u, v), whether u occurs before v can be decided by a binary search on M, in log(|M|) time
TQ: the set of all query classes that can be made BD-tractable after proper factorization
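The factorization for BDS can be sketched as follows. The sketch assumes a connected undirected graph given as an edge list, comparable node numbers, and a caller-supplied start node (the slide does not fix one). It also replaces the binary search on M with a dictionary from node to visiting position, which serves the same purpose; all function names are illustrative.

```python
from collections import defaultdict


def breadth_depth_order(edges, start):
    """Return the nodes in the order a breadth-depth search visits them."""
    adj = defaultdict(set)
    for a, b in edges:          # undirected: store both directions
        adj[a].add(b)
        adj[b].add(a)
    visited = {start}
    order = [start]
    stack = [start]
    while stack:
        s = stack.pop()
        # Visit all unvisited children of s in increasing node number.
        children = [c for c in sorted(adj[s]) if c not in visited]
        for c in children:
            visited.add(c)
            order.append(c)
        # Push in reverse order so the lowest-numbered child is on top.
        stack.extend(reversed(children))
    return order


def preprocess(edges, start):
    """Data part: run BDS once and record each node's visiting position."""
    return {node: i for i, node in enumerate(breadth_depth_order(edges, start))}


def visited_before(position, u, v):
    """Query part: is u visited before v? Needs only two lookups."""
    return position[u] < position[v]


edges = [(1, 2), (1, 3), (2, 4), (4, 5), (3, 6)]
position = preprocess(edges, start=1)
print(visited_before(position, 5, 6), visited_before(position, 6, 4))
```

The PTIME search runs once over G; every later query (u, v) touches only the precomputed positions.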
Fundamental problems for BD-tractability
BD-tractable queries help practitioners determine what query classes are tractable on big data. Are we done yet?
• No, a number of questions arise in connection with a complexity class!
• Reductions: how to transform a problem to another in the class that we know how to solve, and hence make it BD-tractable?
• Complete problems: Is there a natural problem (a class of queries) that is the hardest one in the complexity class, i.e., a problem to which all problems in the complexity class can be reduced? (Analogous to our familiar NP-complete problems.)
• How large is TQ? TQ0? Compared to P? NC?
Why do we care? These questions are fundamental to any complexity class: P, NP, …
Reductions: transformations for making queries BD-tractable
Departing from our familiar polynomial-time reductions, we need reductions that are in NC, and deal with both data D and query Q!
• NC-factor reductions ≤NC: a pair of NC functions that allow re-factorizations (repartitioning the data and query parts), for TQ
• F-reductions ≤F: a pair of NC functions that do not allow re-factorizations, for TQ0; used to determine whether a query class is BD-tractable
• Properties:
• transitivity: if Q1 ≤NC Q2 and Q2 ≤NC Q3, then Q1 ≤NC Q3 (also for ≤F)
• compatibility:
• if Q1 ≤NC Q2 and Q2 is in TQ, then so is Q1.
• if Q1 ≤F Q2 and Q2 is in TQ0, then so is Q1.
Reductions transform a given problem to one that we know how to solve
Complete problems
• A query class 𝒬 is complete for TQ if 𝒬 is in TQ, and moreover, for any query class 𝒬' in TQ, 𝒬' ≤NC 𝒬
• A query class 𝒬 is complete for TQ0 if 𝒬 is in TQ0, and for any query class 𝒬' in TQ0, 𝒬' ≤F 𝒬
Are there complete problems for TQ (TQ0)?
• There exists a natural query class 𝒬 that is complete for TQ
• Not for TQ0
• Unless P = NC, a query class complete for TQ0 is a witness for P \ NC (finding one is as hard as the big open question of whether P = NC)
• Whether P = NC is as hard as whether P = NP
It is hard to find a complete problem for TQ0
Comparing with P and NC
How large is TQ? How large is TQ0?
• NC ⊆ TQ = P
• All PTIME query classes can be made BD-tractable! (We need proper factorizations to answer PTIME queries on big data.)
• Unless P = NC, NC ⊆ TQ0 ⊂ P (separation: TQ0 is properly contained in P)
• Unless P = NC, not all PTIME query classes are BD-tractable
• Diagram: within PTIME, the BD-tractable query classes and those that are not BD-tractable
What can we get from BD-tractability?
Guidelines for the following.
• What query classes are feasible on big data? TQ0
• What query classes can be made feasible to answer on big data? TQ
• How to determine whether it is feasible to answer a class 𝒬 of queries on big data?
• Reduce 𝒬 to a complete problem 𝒬c for TQ via ≤NC
• If so, how to answer queries in 𝒬?
• Identify factorizations (NC reductions) such that 𝒬 ≤NC 𝒬c
• Compose the reduction and the algorithm for answering queries of 𝒬c
A revision of the classical computational complexity theory
Making big data small
Scale independence
• The scale independence problem
• Input: A dataset D, a query Q, and a bound M
• Query: Does there exist a subset DQ of D such that
• |DQ| ≤ M, and
• Q(D) = Q(DQ)?
• A more general setting:
• Input: A query Q defined over a schema R, and a bound M
• Query: Is it the case that for all instances D of R, there exists a subset DQ of D such that
• |DQ| ≤ M, and
• Q(D) = Q(DQ)?
Why do we care?
• The cost of query processing is “independent” of |D|!
• Scalable with big data D, when D grows!
Scale independent queries in practice?
Personalized social search queries (Facebook Graph Search)
• Find me all my friends who live in Edinburgh and like cycling
• Find all restaurants rated A that are in King of Prussia Mall
• Find me all restaurants in Edinburgh where my friends dined in 2013.
• Each query touches a bounded number of tuples. Why bounded?
• Facebook: at most 5000 friends per person
• At most K restaurants in a mall
• At most 5000 friends, 365 days each year, and each (normal) person dines at most once per day
To answer a query, we need to access a bounded amount of data
Query processing
• Access schemas: (R, X, N)
• an index on X for instances D of R
• there exist at most N tuples sharing the same X values in D (e.g., 365 days per year), and these tuples can be fetched efficiently
• Find a query plan that visits a bounded amount of data
• Decide whether a query is scale independent
• Complexity: the scale independence problem is
• Σ_3^p-complete for conjunctive queries (SPC);
• PSPACE-complete for first-order logic queries (SQL); but
• in O(1) time for Boolean conjunctive queries if |Q| ≤ M!
• There are sufficient conditions for scale independence, based on rules
Incremental scale independence? Using views?
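An access schema (R, X, N) can be pictured as an index that promises at most N tuples per X value. The sketch below is one hypothetical way to model it: the class `BoundedIndex`, the relation layouts friend(person, friend) and person(id, city, hobby), and the function `cycling_friends_in` are all assumptions chosen to mirror the "friends who live in Edinburgh and like cycling" query.

```python
from collections import defaultdict


class BoundedIndex:
    """Index on one attribute with at most `bound` tuples per key value."""

    def __init__(self, tuples, key, bound):
        self.bound = bound
        self.accessed = 0                     # tuples fetched so far
        self.buckets = defaultdict(list)
        for t in tuples:
            self.buckets[t[key]].append(t)
        if any(len(b) > bound for b in self.buckets.values()):
            raise ValueError("data violates the access schema bound")

    def fetch(self, value):
        rows = self.buckets.get(value, [])
        self.accessed += len(rows)
        return rows


def cycling_friends_in(city, me, friend_idx, person_idx):
    """Query plan: fetch my friends, then look up each friend's profile."""
    result = []
    for _, friend in friend_idx.fetch(me):
        for _, lives_in, hobby in person_idx.fetch(friend):
            if lives_in == city and hobby == "cycling":
                result.append(friend)
    return result


friends = [("ann", "bob"), ("ann", "cat"), ("dan", "eve")]
persons = [("bob", "Edinburgh", "cycling"), ("cat", "Edinburgh", "chess"),
           ("eve", "Edinburgh", "cycling")]
friend_idx = BoundedIndex(friends, key=0, bound=5000)  # (friend, person, 5000)
person_idx = BoundedIndex(persons, key=0, bound=1)     # (person, id, 1)
answer = cycling_friends_in("Edinburgh", "ann", friend_idx, person_idx)
print(answer, friend_idx.accessed + person_idx.accessed)
```

With these two access schemas the plan fetches at most 5000 + 5000 tuples for any person, however large the two relations grow.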
How to make a query tractable on big data?
• Querying big data:
• Input: Query Q, and big data G
• Output: Q(G), the set of answers to Q in G
• The cost of query processing is a function of |G| and |Q|: too costly, since O(|G|) time is already beyond reach in practice!
Can we effectively query big data?
• Approximate or inexact algorithms
• Exact algorithms? Make the cost of query processing “independent” of |G|!
• A number of techniques:
• Distributed query processing
• Query preserving data compression
• Query answering using views
• Bounded incremental evaluation
• Top-k query answering with early termination
• …
MapReduce is not the only solution, and is not even the best one!
Distributed query processing
• The cost of an evaluation algorithm: f(|G|, |Q|); O(n^2) or O(n^3) is too costly
• It is unlikely that we can lower its complexity, but can we reduce the size of its parameter |G|?
Divide and conquer
• partition G into fragments (G1, …, Gn) of manageable sizes, distributed to various sites
• upon receiving a query Q,
• evaluate Q(Gi) in parallel, i.e., evaluate Q on the smaller Gi
• collect partial answers at a coordinator site, and assemble them to find the answer Q(G) in the entire G
Performance guarantees for evaluating regular reachability queries based on partial evaluation
• Network traffic and response time: independent of |G|
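The divide-and-conquer idea can be sketched for plain reachability. This is a simplified single-process illustration, not the algorithm from the cited work: it assumes each edge (a, b) is stored at the site owning a, runs the per-site step in a loop where a real system would run it in parallel, and uses the hypothetical names `partial_eval` and `distributed_reachable`.

```python
from collections import defaultdict, deque


def partial_eval(site_edges, site, owner, sources, t):
    """Local step at one site: for each source, which foreign nodes
    (or the target t) can be reached using only this site's edges?"""
    adj = defaultdict(list)
    for a, b in site_edges:
        adj[a].append(b)
    partial = {}
    for src in sources:
        seen, queue = {src}, deque([src])
        while queue:
            x = queue.popleft()
            for y in adj[x]:
                if y not in seen:
                    seen.add(y)
                    queue.append(y)
        partial[src] = {y for y in seen
                        if y != src and (owner[y] != site or y == t)}
    return partial


def distributed_reachable(fragments, owner, s, t):
    """Coordinator: assemble the partial answers into a small graph."""
    if s == t:
        return True
    in_nodes = defaultdict(set)   # nodes entered from another fragment
    for site, edges in fragments.items():
        for a, b in edges:
            if owner[b] != site:
                in_nodes[owner[b]].add(b)
    in_nodes[owner[s]].add(s)
    summary = {}
    for site, edges in fragments.items():   # would run in parallel
        summary.update(partial_eval(edges, site, owner, in_nodes[site], t))
    seen, queue = {s}, deque([s])
    while queue:                  # search over boundary nodes only
        x = queue.popleft()
        for y in summary.get(x, ()):
            if y == t:
                return True
            if y not in seen:
                seen.add(y)
                queue.append(y)
    return False


owner = {"a": 0, "b": 0, "c": 1, "d": 1, "e": 2}
fragments = {0: [("a", "b"), ("b", "c")], 1: [("c", "d"), ("d", "e")], 2: []}
print(distributed_reachable(fragments, owner, "a", "e"))
```

Each site is visited once, and what it ships to the coordinator depends on the number of boundary nodes rather than on the size of its fragment.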
Query preserving data compression
• The cost of query processing: f(|G|, |Q|). Can we reduce the parameter |G|?
• Query preserving compression <R, P> for a class L of queries
• For any data collection G, Gc = R(G)
• For any Q in L, Q(G) = P(Q, Gc)
• Diagram: R compresses G into Gc; the answer Q(G) is obtained from Q(Gc) via P
Compress big G into a smaller Gc
What is new about query preserving compression?
Query preserving compression <R, P> for a class L of queries
• For any dataset G, Gc = R(G)
• For any Q in L, Q(G) = P(Q, Gc)
• Relative to a class L of queries of users' choice
• Better compression ratio: Gc retains only the information needed for queries in L
• No need to decompress Gc: for any Q in L, Q(Gc) can be directly computed
• Any algorithms and indexing structures for G can be used for Gc
• In contrast to lossless compression, no need to restore the original graph G
• Gc is computed once for all queries Q in L, and is incrementally maintained
Reduction: 95% on average for reachability queries
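A simple instance of a pair <R, P> for reachability queries is to collapse every strongly connected component into a single node. This is a stand-in chosen for brevity, not the compression scheme of the cited paper; `compress` plays the role of R, `answer_on_compressed` the role of P, and both names are hypothetical.

```python
from collections import defaultdict


def compress(edges):
    """R: map each node to a component representative (Kosaraju's
    two-pass algorithm) and return the edges between components."""
    adj, radj, nodes = defaultdict(list), defaultdict(list), set()
    for a, b in edges:
        adj[a].append(b)
        radj[b].append(a)
        nodes.update((a, b))
    order, seen = [], set()
    for root in nodes:                       # pass 1: finishing order
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, iter(adj[root]))]
        while stack:
            node, it = stack[-1]
            for nxt in it:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(adj[nxt])))
                    break
            else:                            # all successors explored
                order.append(node)
                stack.pop()
    comp = {}
    for root in reversed(order):             # pass 2: reversed graph
        if root in comp:
            continue
        comp[root] = root
        stack = [root]
        while stack:
            x = stack.pop()
            for y in radj[x]:
                if y not in comp:
                    comp[y] = root
                    stack.append(y)
    gc_edges = {(comp[a], comp[b]) for a, b in edges if comp[a] != comp[b]}
    return comp, gc_edges


def reachable(edge_set, s, t):
    """Plain depth-first reachability; works on G and on Gc alike."""
    adj = defaultdict(list)
    for a, b in edge_set:
        adj[a].append(b)
    seen, stack = {s}, [s]
    while stack:
        x = stack.pop()
        if x == t:
            return True
        for y in adj[x]:
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return False


def answer_on_compressed(comp, gc_edges, s, t):
    """P: rewrite the query (s, t) to component ids and run it on Gc."""
    return reachable(gc_edges, comp[s], comp[t])


edges = [(1, 2), (2, 3), (3, 1), (3, 4), (4, 5), (5, 4), (6, 1)]
comp, gc_edges = compress(edges)
print(len(edges), "edges compressed to", len(gc_edges))
```

The same `reachable` routine runs unchanged on Gc, and the original graph is never restored.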
Answering queries using views
• The cost of query processing: f(|G|, |Q|). Can we compute Q(G) without accessing G, i.e., independent of |G|?
• Query answering using views: given a query Q in a language L and a set V of views, find another query Q' such that
• Q and Q' are equivalent: for any G, Q(G) = Q'(G)
• Q' only accesses V(G)
• Answering graph pattern queries on big social graphs:
• Regardless of how big G is, the cost is “independent” of G
• V(G) is often much smaller than G (4% to 12% on real-life data)
• Improvement: 97% for graph pattern matching
The complexity is no longer a function of |G|
Incremental query answering
• Real-life data is dynamic: it constantly changes (∆G), e.g., 5%/week in Web graphs
• Re-compute Q(G⊕∆G) starting from scratch?
• Changes ∆G are typically small
• Compute Q(G) once, and then incrementally maintain it
Incremental query processing:
• Input: Q, G, Q(G) (the old output), ∆G (changes to the input)
• Output: ∆M (changes to the output) such that Q(G⊕∆G) = Q(G) ⊕ ∆M, the new output
When changes ∆G to the data G are small, typically so are the changes ∆M to the output Q(G⊕∆G)
Minimizing unnecessary recomputation
Complexity of incremental problems
Incremental query answering
• Input: Q, G, Q(G), ∆G
• Output: ∆M such that Q(G⊕∆G) = Q(G) ⊕ ∆M
• The cost of query processing: a function of |G| and |Q|
• Incremental algorithms: complexity analysis in terms of |CHANGED|, the size of changes in
• the input: ∆G, and
• the output: ∆M
• |CHANGED| is the updating cost that is inherent to the incremental problem itself: the amount of work absolutely necessary to perform for any incremental algorithm
• Bounded: the cost is expressible as f(|CHANGED|)?
• Optimal: in O(|CHANGED|)?
Effective on graph pattern matching
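As a small illustration of incremental evaluation, consider maintaining the set of nodes reachable from a fixed source when edges are inserted. The sketch handles insertions only (deletions need more machinery), and the names `reachable_from` and `insert_edge` are hypothetical.

```python
def reachable_from(adj, source):
    """Batch computation of Q(G): all nodes reachable from source."""
    reach, stack = {source}, [source]
    while stack:
        x = stack.pop()
        for y in adj.get(x, ()):
            if y not in reach:
                reach.add(y)
                stack.append(y)
    return reach


def insert_edge(adj, reach, a, b):
    """Apply the unit update ∆G = {(a, b)} and return ∆M, the nodes that
    become reachable. Both adj and reach are updated in place."""
    adj.setdefault(a, set()).add(b)
    if a not in reach or b in reach:
        return set()              # the output is unaffected
    delta = {b}
    reach.add(b)
    stack = [b]
    while stack:                  # explore only newly reachable nodes
        x = stack.pop()
        for y in adj.get(x, ()):
            if y not in reach:
                reach.add(y)
                delta.add(y)
                stack.append(y)
    return delta


adj = {1: {2}, 3: {4}}
reach = reachable_from(adj, 1)            # old output Q(G)
delta = insert_edge(adj, reach, 2, 3)     # ∆M for the new edge (2, 3)
print(sorted(delta), sorted(reach))
```

The update visits only the nodes in ∆M and their outgoing edges, so its work is governed by the size of the change rather than by |G|.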
Top-k query answering
Traditional query answering: compute Q(G)
• It is expensive to compute when G is large
• The result Q(G) is excessively large for the users to inspect, even larger than G
Top-k query answering:
• Input: Query Q, dataset G and a positive integer k.
• Output: A top-ranked set of k elements in Q(G)
• Early termination: return top-k matches without computing Q(G)
Improvement: 65% on graph pattern matching
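Early termination is easiest to see in a deliberately simplified setting. The sketch assumes the candidates can already be enumerated in descending rank order (for example from a sorted index) and that a match test is available; `top_k` and `is_match` are hypothetical names, and real top-k pattern matching needs bounds on the scores of unseen matches.

```python
from itertools import islice


def top_k(ranked_candidates, is_match, k):
    """Return the first k matches from a rank-ordered candidate stream.

    The generator is lazy, so candidates after the k-th match are never
    examined and Q(G) is never materialised in full.
    """
    matches = (c for c in ranked_candidates if is_match(c))
    return list(islice(matches, k))


# Usage: scores 100, 99, ..., 1 in rank order; a "match" is an even score.
candidates = iter(range(100, 0, -1))
best = top_k(candidates, lambda score: score % 2 == 0, k=3)
print(best)
```

Only the candidates down to the third match are consumed from the stream; the rest are left untouched.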
Answering queries on big data
Yes, MapReduce is useful, but it is not the only way!
• Partial evaluation for distributed query processing: can we get performance guarantees?
• Query preserving compression: convert big data to small data
• Query answering using views: make big data small
• Bounded incremental query answering: depending on the size of the changes rather than the size of the original big data
• Top-k query answering and early termination: find answers without traversing the entire data set
Preprocessing methods: make big data small
Combinations of these can do better than MapReduce!
Further reading
• W. Fan, J. Li, X. Wang, and Y. Wu. Query Preserving Graph Compression, SIGMOD, 2012.
• W. Fan. Graph Pattern Matching Revised for Social Network Analysis, ICDT, 2012 (invited).
• W. Fan, X. Wang, and Y. Wu. Performance Guarantees for Distributed Reachability Queries, VLDB, 2012.
• W. Fan, X. Wang, and Y. Wu. Diversified Top-k Graph Pattern Matching, VLDB, 2014.
• W. Fan, J. Li, X. Wang, and Y. Wu. Incremental Graph Pattern Matching, SIGMOD, 2011 (TODS 38(3), 2013).
• W. Fan, J. Li, S. Ma, H. Wang, and Y. Wu. Graph Homomorphism Revisited for Graph Matching, VLDB, 2010.
• W. Fan, J. Li, S. Ma, N. Tang, and Y. Wu. Graph Pattern Matching: From Intractable to Polynomial Time, VLDB, 2010.
Approximate query answering
Graph Pattern Matching
Given a pattern graph Q and a data graph G, find all the matches of Q in G.
• Subgraph isomorphism: a bijective function f on nodes such that (u, u') ∈ Q iff (f(u), f(u')) ∈ G
• Widely used in social network analysis
• Applications
• pattern recognition
• knowledge discovery
• intelligence analysis
• transportation network analysis
• Web site classification
• social position detection
• user targeted advertising
• …
Problems
• Real-life social graphs are typically large (Facebook: 1B users, 140B links)
• Subgraph isomorphism
• What is the complexity of determining whether there exists a match of a pattern Q in a graph G? NP-complete
• Given a pattern Q and a graph G, how many matches of Q can possibly exist in G? Possibly exponentially many
• O(|G|) time is already beyond reach in practice!
• Nonetheless, we need to conduct graph pattern matching on social networks, among other things
What can we do if a class of queries is NOT BD-tractable?
Subgraph isomorphism is too costly for social network analysis
Relaxing the semantics of queries
Graph simulation: a binary relation S on nodes such that
• for each node u in Q, there exists v in G such that (u, v) ∈ S,
• for each pair (u, v) ∈ S, each edge (u, u') in Q is mapped to an edge (v, v') in G, such that (u', v') ∈ S
• Much cheaper
• Complexity of computing the set of matches: quadratic time
• The number of matches of Q in G: 1 (there exists a unique, maximum match relation S)
• Effective:
• Social position detection
• User targeted advertising, …
• A variety of extensions to capture topology, with low complexity
So, graph simulation for social data analysis, instead of subgraph isomorphism
Quadratic time is still too expensive! How to deal with it?
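The maximum match relation S can be computed by a straightforward fixpoint: start from all candidate pairs and delete pairs that violate the edge condition until nothing changes. The sketch below is this naive refinement, not an optimised quadratic-time algorithm, and it assumes node labels that must agree between matched nodes (the slide leaves labels implicit); `graph_simulation` is a hypothetical name.

```python
from collections import defaultdict


def graph_simulation(q_nodes, q_edges, g_nodes, g_edges):
    """Return the maximum simulation as {pattern node: set of data nodes},
    or {} if some pattern node has no match.

    q_nodes and g_nodes map node -> label; the edge arguments are
    iterables of directed (source, target) pairs.
    """
    g_succ = defaultdict(set)
    for v, v2 in g_edges:
        g_succ[v].add(v2)
    # Initial candidates: data nodes carrying the same label.
    sim = {u: {v for v, label in g_nodes.items() if label == q_nodes[u]}
           for u in q_nodes}
    changed = True
    while changed:
        changed = False
        for u, u2 in q_edges:
            # Keep v only if some successor of v still matches u2.
            keep = {v for v in sim[u] if g_succ[v] & sim[u2]}
            if keep != sim[u]:
                sim[u] = keep
                changed = True
    if any(not matches for matches in sim.values()):
        return {}
    return sim


Q_nodes = {"a": "A", "b": "B"}
Q_edges = [("a", "b")]
G_nodes = {1: "A", 2: "B", 3: "A", 4: "B"}
G_edges = [(1, 2), (4, 3)]
S = graph_simulation(Q_nodes, Q_edges, G_nodes, G_edges)
print(S)
```

Pairs are only ever removed, so the loop converges to the unique maximum relation; node 3 is dropped because it has no successor labelled B.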
The approximation theory revisited
• If a query class is not BD-tractable and its semantics can't be relaxed, is it still feasible to answer such queries on big data? Yes, approximation
• When exact algorithms are infeasible, we find inexact algorithms with performance guarantees: the answers can't be too far from the exact ones!
• Data-driven approximation:
• feasible on big data: reducing big data to small data
• performance guarantees whenever possible
The need for revising the traditional approximation theory, for querying big data
Data-driven approximation
• Resource-bounded query answering
• Input: A dataset D, a class 𝒬 of queries, a resource ratio α ∈ [0, 1)
• Question: Develop an algorithm that, given any query Q ∈ 𝒬, computes Q(D) by accessing at most α|D| amount of data
Make big data “small”!
• Personalized social searches and reachability queries:
• Find me all my friends who live in Nanjing and like cycling
• Does Michael connect to Lady Gaga through social links?
We can do personalized social search with α = 0.0015%!
• 1.5 × 10^-5 × 1 PB (10^15 B) = 15 × 10^9 B = 15 GB
• We are making big data of PB size as small as 15 GB!
We can make big data of PB size fit into our memory!
Summing up
Summary and Review
• What is BD-tractability? Why do we care about it?
• What is scale independence?
• How to make big data “small”?
• Is MapReduce the only way for querying big data? Can we do better than it?
• What is query preserving data compression? Query answering using views? Bounded incremental query answering? Top-k query answering?
• If a class of queries is known not to be BD-tractable, how can we process the queries in the context of big data?
• Develop an algorithm for processing a class of queries on big data, by combining various methods discussed