Explore challenges and solutions in querying big data by accessing small data efficiently. Learn about bounded evaluability, graph search optimization, query plans, and formalization of queries. Discover the complexities and strategies for deciding bounded evaluability.
Querying Big Data by Accessing Small Data
Wenfei Fan, University of Edinburgh & Beihang University
Floris Geerts, University of Antwerp
Yang Cao, University of Edinburgh & Beihang University
Ting Deng, Beihang University
Ping Lu, Beihang University

Challenges introduced by big data
• Traditional computational complexity theory of the past 50 years:
  • The ugly: PSPACE-hard, EXPTIME-hard, …, undecidable
  • The bad: NP-hard (intractable)
  • The good: polynomial-time computable (PTIME)
What happens when it comes to big data?
• Using an SSD that reads 6 GB/s, a linear scan of a data set D would take
  • 1.9 days when D is 1 PB (10^15 B)
  • 5.28 years when D is 1 EB (10^18 B)
• O(n) time is already beyond reach on big data in practice!
Can we still answer queries on big data with limited resources?
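
The scan times quoted above follow from simple arithmetic. The sketch below reproduces them, assuming a constant read rate of 6 × 10^9 bytes per second and a 365.25-day year (both assumptions of this illustration, not stated on the slide).

```python
# Back-of-the-envelope check of the linear-scan times quoted on the slide.
SCAN_RATE = 6 * 10**9                         # bytes per second (assumed: 6 GB/s)
SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 365.25 * SECONDS_PER_DAY   # assumed year length


def scan_seconds(size_bytes: int, rate: float = SCAN_RATE) -> float:
    """Time in seconds to read size_bytes once at a constant rate."""
    return size_bytes / rate


pb_days = scan_seconds(10**15) / SECONDS_PER_DAY     # 1 PB
eb_years = scan_seconds(10**18) / SECONDS_PER_YEAR   # 1 EB

print(f"1 PB: {pb_days:.1f} days")     # 1 PB: 1.9 days
print(f"1 EB: {eb_years:.2f} years")   # 1 EB: 5.28 years
```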

Bounded evaluability
• Input: a class L of queries
• Question: can we find, for any query Q ∈ L and any (possibly big) dataset D, a fraction D_Q of D such that
  • Q(D) = Q(D_Q), and
  • D_Q can be identified in time determined by Q?
[Diagram: Q(D) = Q(D_Q), where D_Q is a small fraction of D]
• Scales with D no matter how big D grows
Making the cost of computing Q(D) independent of |D|!

Graph Search (Facebook)
• Find me restaurants in New York my friends have been to in 2014
  1.38 billion person tuples, and over 140 billion friend tuples

  select rid
  from friend(pid1, pid2), person(pid, name, city), dine(pid, rid, dd, mm, yy)
  where pid1 = p0 and pid2 = person.pid and pid2 = dine.pid and city = NYC and yy = 2014

Data semantics in constraints
• Facebook: at most 5000 friends per person
• Each year has at most 366 days
• Each person dines at most once per day
• pid is a key for relation person
Build an index from pid1 to pid2 for friend(pid1, pid2)
Boundedly evaluable with indices under constraints?

Bounded query evaluation
• Find me restaurants in New York my friends have been to in 2014
  Q(rid) = ∃p, p1, n, c, dd, mm, yy (friend(p, p1) ∧ person(p1, n, c) ∧ dine(p1, rid, dd, mm, yy) ∧ p = p0 ∧ c = NYC ∧ yy = 2014)
A query plan under the constraints + indices
• Fetch at most 5000 pids for friends of p0 (5000 friends per person)
• For each pid, check whether she lives in NYC (at most 5000 person tuples)
• For pids living in NYC, find restaurants where they dined in 2014 (at most 5000 * 366 tuples)
Accessing at most 5000 + 5000 + 5000 * 366 = 1,840,000 tuples in total,
in contrast to 1.38 billion person tuples and over 140 billion friend tuples
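
The three-step plan above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the toy tuples, the dictionary-based indices, and the function name `restaurants` are all made up for the example. The function also counts how many tuples it touches, which is the quantity the slide bounds.

```python
from collections import defaultdict

# Toy instance; every tuple here is invented for illustration.
friend = [("p0", "a"), ("p0", "b"), ("x", "c")]
person = [("a", "Ann", "NYC"), ("b", "Bob", "LA"), ("c", "Cy", "NYC")]
dine = [("a", "r1", 1, 5, 2014), ("a", "r2", 2, 5, 2013),
        ("b", "r3", 3, 5, 2014), ("c", "r4", 4, 5, 2014)]

# Indices backing the three access constraints.
friend_idx = defaultdict(set)                  # pid1 -> {pid2}
for pid1, pid2 in friend:
    friend_idx[pid1].add(pid2)
person_idx = {pid: city for pid, _name, city in person}   # pid -> city (key)
dine_idx = defaultdict(set)                    # (pid, yy) -> {rid}
for pid, rid, _dd, _mm, yy in dine:
    dine_idx[(pid, yy)].add(rid)


def restaurants(p0, city="NYC", year=2014):
    """Run the bounded plan; return (answer, number of tuples accessed)."""
    accessed = 0
    friends = friend_idx.get(p0, set())        # step 1: friends of p0
    accessed += len(friends)
    in_city = []
    for pid in friends:                        # step 2: one person tuple each
        if pid in person_idx:
            accessed += 1
            if person_idx[pid] == city:
                in_city.append(pid)
    result = set()
    for pid in in_city:                        # step 3: restaurants in `year`
        rids = dine_idx.get((pid, year), set())
        accessed += len(rids)
        result |= rids
    return result, accessed


# Worst case under the constraints, independent of the size of the instance.
WORST_CASE = 5000 + 5000 + 5000 * 366
print(restaurants("p0"), WORST_CASE)
```

No relation is ever scanned: each step goes through an index, so the work depends on the constants in the constraints rather than on the number of stored tuples.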

Overview
• Formalization of bounded query plans and queries
• The complexity of deciding bounded evaluability for
  • CQ (SPJ), UCQ, FO+ (SPJU), FO
• Effective syntax for boundedly evaluable queries
• Approximate query answering with bounded evaluability
  • Bounded envelopes
  • Bounded query specialization
Previous work: bounded query plans are not properly defined. We only know that bounded evaluability is
• undecidable for FO [PODS 2014]
• in PTIME for CQ with very restricted query plans [VLDB 2014]

Access constraints to capture data semantics
Combining cardinality constraints and indices
On a relation schema R: X → (Y, N)
• X, Y: sets of attributes of R
• for any X-value, there exist at most N distinct Y-values
• Index on X for Y: given an X-value, find the relevant Y-values
Examples
• friend(pid1, pid2): pid1 → (pid2, 5000)
  5000 friends per person
• dine(pid, rid, dd, mm, yy): pid, yy → (rid, 366)
  each year has at most 366 days and each person dines at most once per day
• person(pid, name, city): pid → (city, 1)
  pid is a key for person
Discovery: functional dependencies, simple aggregate queries
Access schema: a set of access constraints
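
One way to picture an access constraint is as a small record plus a check that an instance respects the cardinality bound. The sketch below is illustrative only: the class name `AccessConstraint`, the use of attribute positions instead of attribute names, and the method `holds_on` are assumptions of this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessConstraint:
    """X -> (Y, N) on a relation; X and Y are given as attribute positions."""
    relation: str
    x: tuple
    y: tuple
    n: int

    def holds_on(self, tuples) -> bool:
        """True if every X-value has at most n distinct Y-values."""
        groups = defaultdict(set)
        for t in tuples:
            x_val = tuple(t[i] for i in self.x)
            y_val = tuple(t[i] for i in self.y)
            groups[x_val].add(y_val)
        return all(len(ys) <= self.n for ys in groups.values())


# The three example constraints from the slide.
friend_c = AccessConstraint("friend", x=(0,), y=(1,), n=5000)
dine_c = AccessConstraint("dine", x=(0, 4), y=(1,), n=366)
person_c = AccessConstraint("person", x=(0,), y=(2,), n=1)

ok_person = [("a", "Ann", "NYC"), ("b", "Bob", "LA")]
bad_person = [("a", "Ann", "NYC"), ("a", "Ann", "LA")]   # pid "a" has two cities
print(person_c.holds_on(ok_person), person_c.holds_on(bad_person))
```

The index half of the constraint (given an X-value, retrieve its Y-values) corresponds to the `groups` mapping built inside `holds_on`; a real system would maintain it persistently rather than rebuild it per check.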

Bounded plans for query Q
In the presence of access schema A
ξ(Q, R): T1 = δ1, …, Tn = δn, where δi is one of
• { a }: a constant in query Q
• Fetch(X ∈ Tj, R, Y): via an access constraint R: X → (Y′, N) with Y ⊆ X ∪ Y′, j < i
  (fetches data by making use of the indices in A)
• π_Y(Tj), σ_C(Tj), ρ(Tj): projection, selection, renaming, j < i
• Tj × Tk, Tj ∪ Tk, Tj − Tk: Cartesian product, union, set difference, for j < i, k < i
The length of ξ(Q, R): bounded by an exponential in |R|, |Q| and |A|
(plans longer than exponential are not very practical)
Independent of the size of instances D of R

Boundedly evaluable queries Q
Q has a bounded query plan ξ(Q, R) under an access schema A
• CQ: only { a }, Fetch(X ∈ Tj, R, Y), π_Y(Tj), σ_C(Tj), ρ(Tj), Tj × Tk
• UCQ: additionally ∪, at the end only
• FO+: { a }, Fetch, π, σ, ρ, ×, ∪
• FO: { a }, Fetch, π, σ, ρ, ×, ∪, −
Coping with big data

The bounded evaluability problem (BEP(L))
• Input: a relational schema R, an access schema A, and a query Q in a query language L
• Question: is Q boundedly evaluable under A, i.e., does Q have a bounded query plan under A?
Undecidable for FO [PODS 2014]
• Is BEP decidable for CQ? UCQ? FO+?
• If so, what is the complexity?
The bounded evaluability analysis is nontrivial

Example of boundedly evaluable queries
• Schema: R(A, B, C)
• Access schema A: R(∅ → C, 1), R(AB → C, N)
• A CQ query: Q(x, y) = ∃x1, x2, z1, z2, z3 (R(x1, x2, x) ∧ R(z1, z2, y) ∧ R(x, y, z3) ∧ x1 = 1 ∧ x2 = 1)
Is Q boundedly evaluable?
Yes, Q is A-equivalent to Q′(x, x) = R(1, 1, x) ∧ R(x, x, x), which is boundedly evaluable:
• x = y = z3, since x, y and z3 are all C-values and R(∅ → C, 1) allows only one C-value
• ∃z1, z2 (R(1, 1, x) ∧ R(z1, z2, y)) is entailed by R(1, 1, x)
With indices in A,
• “nontrivial” variables are fetchable;
• combinations are indexed
We need to reason about A-equivalence and “nontrivial” variables
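
A-equivalence in this example can be sanity-checked by brute force on small instances. The sketch below enumerates every instance of R(A, B, C) over the two-value domain {1, 2} that satisfies the access schema and compares Q with the candidate rewriting R(1, 1, x) ∧ R(x, x, x) with y = x; the conjunct R(x, x, x) is what R(x, y, z3) becomes once x = y = z3. The domain size and all function names are choices made for this illustration, and agreement on a finite domain is evidence, not a proof.

```python
from itertools import combinations, product

DOM = (1, 2)                                  # small domain, chosen for illustration
ALL_TUPLES = list(product(DOM, repeat=3))     # all possible R(A, B, C) tuples


def satisfies_A(d):
    """R(∅ → C, 1): at most one C-value overall.
    R(AB → C, N) then holds automatically for any N >= 1."""
    return len({c for _a, _b, c in d}) <= 1


def q(d):
    """Q(x, y) = ∃x1,x2,z1,z2,z3 (R(x1,x2,x) ∧ R(z1,z2,y) ∧ R(x,y,z3) ∧ x1=1 ∧ x2=1)."""
    return {(x, y)
            for (x1, x2, x) in d if x1 == 1 and x2 == 1
            for (_z1, _z2, y) in d
            if any(a == x and b == y for (a, b, _z3) in d)}


def q_prime(d):
    """Candidate rewriting: R(1, 1, x) ∧ R(x, x, x), returned as pairs (x, x)."""
    return {(x, x) for (a, b, x) in d if a == 1 and b == 1 and (x, x, x) in d}


def instances():
    """All instances over DOM that satisfy the access schema."""
    for k in range(len(ALL_TUPLES) + 1):
        for combo in combinations(ALL_TUPLES, k):
            d = set(combo)
            if satisfies_A(d):
                yield d


agree = all(q(d) == q_prime(d) for d in instances())
print(agree)
```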

The complexity of BEP
BEP is EXPSPACE-complete for CQ, UCQ and FO+
• Good news: decidable
• Bad news: too expensive to be practical
Lower bound: by reduction from the non-emptiness problem for parameterized regular expressions
Upper bound: a characterization of boundedly evaluable queries based on A-equivalence and “nontrivial” variables
Can we make practical use of bounded evaluability?

An effective syntax for bounded CQ
A form of queries covered by an access schema A
• A CQ is boundedly evaluable under A iff it is A-equivalent to a CQ covered by A
• All CQ queries covered by A are boundedly evaluable under A
• Checking syntactically whether a CQ is covered by A is in PTIME in |Q|, |A| and |R|
A CQ Q is covered by A if
• all free variables and variables that participate in “selection / join” of Q are accessible via indices in A
• the combination of such variables in each atom R(x) is indexed by a single access constraint
A syntactic characterization of boundedly evaluable CQ

More on covered queries
• Schema: R(A, B, C)
• Access schema A: R(∅ → C, 1), R(AB → C, N)
• Q(x, y) = ∃x1, x2, z1, z2, z3 (R(x1, x2, x) ∧ R(z1, z2, y) ∧ R(x, y, z3) ∧ x1 = 1 ∧ x2 = 1): covered
A query Q in FO+ is covered by A if for each CQ-subquery Qi of Q
• either Qi is covered by A,
• or for each A-instance θ(Ti) of Qi, there exists a CQ-subquery Qj of Q such that Qi(θ(Ti)) ⊆ Qj(θ(Ti)) and Qj is covered
It is Π₂ᵖ-complete to decide whether a query in FO+ is covered

Bounded envelopes
What can we do if a query Q in L is not boundedly evaluable under A?
We find Q_L and Q_U in the same language L such that
• Q_L and Q_U are boundedly evaluable under A
• for all instances D that satisfy A:
  • Q_L(D) ⊆ Q(D) ⊆ Q_U(D), and
  • |Q(D) − Q_L(D)| ≤ N_L and |Q_U(D) − Q(D)| ≤ N_U, where N_L and N_U are constants
Q_L and Q_U: lower and upper envelopes of Q
  S. Chaudhuri and P. G. Kolaitis. Can datalog be approximated? JCSS 55(2), 1997
Q_L(D) and Q_U(D) are not too far from Q(D)
Approximate query answering

Example bounded envelopes
• Schema: R(A, B)
• Access schema A: R(A → B, N)
• Q(x) = ∃y, z, w (R(w, x) ∧ R(y, w) ∧ R(x, z) ∧ w = 1): not boundedly evaluable
Bounded envelopes
• Upper (relaxation): Q_U(x) = ∃y, z (R(1, x) ∧ R(x, z))
• Lower (expansion): Q_L(x) = ∃y, z (R(1, x) ∧ R(y, 1) ∧ R(x, y) ∧ R(x, z))
Bounded envelopes may not exist, e.g., for
• Q(x, y) = ∃w (R(w, x) ∧ R(y, w) ∧ w = 1)
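
The containment between the lower envelope, the query, and the upper envelope in this example can be checked mechanically. The sketch below evaluates the three queries directly over sets of pairs and verifies the containment on every instance over the domain {1, 2, 3}. The domain and function names are assumptions of this illustration, and it checks containment only, not the constant bounds on the differences.

```python
from itertools import combinations, product

DOM = (1, 2, 3)                              # small domain, chosen for illustration
ALL_PAIRS = list(product(DOM, repeat=2))     # all possible R(A, B) tuples


def q(d):
    """Q(x) = ∃y,z,w (R(w,x) ∧ R(y,w) ∧ R(x,z) ∧ w = 1)."""
    has_pred_of_1 = any(b == 1 for _a, b in d)          # ∃y R(y, 1)
    return {x for (w, x) in d
            if w == 1 and has_pred_of_1 and any(a == x for a, _z in d)}


def q_upper(d):
    """Relaxation: drop the atom R(y, w)."""
    return {x for (w, x) in d if w == 1 and any(a == x for a, _z in d)}


def q_lower(d):
    """Expansion: add the atom R(x, y), so y becomes reachable from x."""
    return {x for (w, x) in d
            if w == 1
            and any((y, 1) in d for (a, y) in d if a == x)   # R(x, y) ∧ R(y, 1)
            and any(a == x for a, _z in d)}                  # R(x, z)


def all_instances():
    for k in range(len(ALL_PAIRS) + 1):
        for combo in combinations(ALL_PAIRS, k):
            yield set(combo)


sandwiched = all(q_lower(d) <= q(d) <= q_upper(d) for d in all_instances())
print(sandwiched)

example = {(1, 2), (2, 1), (2, 3), (1, 3), (3, 4), (4, 1)}
print(q_lower(example), q(example), q_upper(example))
```

On the `example` instance the lower envelope misses the answer 2, because no y satisfies both R(2, y) and R(y, 1), while the upper envelope coincides with Q.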

The bounded envelope problems
UEP(L):
• Input: a relational schema R, an access schema A, and a query Q in a query language L
• Question: does Q have a bounded upper envelope under A?
Similarly LEP(L) for lower envelopes.
We consider covered envelopes when Q is in CQ, UCQ or FO+
Complexity bounds
• For CQ, UEP and LEP are NP-complete
• For UCQ, UEP is Π₂ᵖ-complete and LEP is NP-complete
• For FO+, UEP is Π₂ᵖ-complete and LEP is DP-complete
• For FO, UEP and LEP are undecidable

Bounded query specialization
Access schema A, and query Q with a set X of parameters (variables)
• Q(x = c): Q ∧ x = c, where x ⊆ X and the valuation c is a constant tuple
• boundedly evaluable under A for all valuations c
Consider covered queries when Q is in CQ, UCQ or FO+
• Find me restaurants in New York my friends have been to in 2014
  Q(p, rid) = ∃p1, n, c, dd, mm, yy (friend(p, p1) ∧ person(p1, n, c) ∧ dine(p1, rid, dd, mm, yy) ∧ p = p0 ∧ c = NYC ∧ yy = 2014)
  for all valuations p0 of p
Instantiate a minimum set of parameters and make Q bounded

The bounded specialization problem (QSP(L))
• Input: a relational schema R, an access schema A, a query Q in a query language L, a set X of parameters of Q, and a positive integer k
• Question: does Q have a bounded specialization Q(x = c) with |x| ≤ k?
Complexity bounds
• NP-complete for CQ
• Π₂ᵖ-complete for UCQ and FO+
• undecidable for FO
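
To give a feel for the search behind QSP, the sketch below tries parameter subsets of increasing size on the restaurant query and stops at the first one that makes every needed variable accessible. It is a deliberate simplification: it only tests accessibility of variables through the indices (a closure computation) and ignores the second condition for covered queries, that each atom's variable combination be indexed by a single constraint. The encoding of atoms and constraints, the parameter set `PARAMS`, and all names are assumptions of this illustration.

```python
from itertools import combinations

# Query atoms of the restaurant query: (relation, variables by position).
ATOMS = [("friend", ("p", "p1")),
         ("person", ("p1", "n", "c")),
         ("dine", ("p1", "rid", "dd", "mm", "yy"))]
# Access constraints as (relation, X positions, Y positions); bounds N omitted.
CONSTRAINTS = [("friend", (0,), (1,)),
               ("person", (0,), (2,)),
               ("dine", (0, 4), (1,))]
BOUND_BY_QUERY = {"c", "yy"}                  # c = NYC and yy = 2014
NEEDED = {"p", "p1", "c", "yy", "rid"}        # free, selection and join variables
PARAMS = ("p", "p1", "rid")                   # hypothetical parameter set X


def accessible(instantiated):
    """Variables reachable from constants via the indices (fixpoint)."""
    known = set(instantiated) | BOUND_BY_QUERY
    changed = True
    while changed:
        changed = False
        for rel, terms in ATOMS:
            for crel, xs, ys in CONSTRAINTS:
                if crel == rel and all(terms[i] in known for i in xs):
                    new = {terms[i] for i in ys} - known
                    if new:
                        known |= new
                        changed = True
    return known


def min_specialization(k):
    """Smallest set of at most k parameters to instantiate, or None."""
    for size in range(k + 1):
        for combo in combinations(PARAMS, size):
            if NEEDED <= accessible(combo):
                return set(combo)
    return None


print(min_specialization(1))   # {'p'}
print(min_specialization(0))   # None
```

Instantiating p alone suffices here: p1 becomes fetchable through the friend index, and rid through the dine index on (p1, yy). The exhaustive subset search is exponential in the number of parameters, which is in line with the NP-hardness stated for CQ.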

Bounded evaluability of queries
Challenges: querying big data is cost-prohibitive
• Bounded evaluability allows us to make big data small
• However, the bounded evaluability analysis is expensive
• Nonetheless, we can make practical use of bounded evaluability
  • Effective syntax: covered queries for CQ, UCQ and FO+
  • Approximate query answering:
    • Bounded envelopes with a constant bound
    • Bounded specialization for parameterized queries
Decidability and complexity
An approach to effectively querying big data