Querying Big Data by Accessing Small Data

Wenfei Fan, University of Edinburgh & Beihang University
Floris Geerts, University of Antwerp
Yang Cao, University of Edinburgh & Beihang University
Ting Deng, Beihang University
Ping Lu, Beihang University

Challenges introduced by big data
• Traditional computational complexity theory of 50 years:
  • The ugly: PSPACE-hard, EXPTIME-hard, …, undecidable
  • The bad: NP-hard (intractable)
  • The good: polynomial time computable (PTIME)
What happens when it comes to big data?
• Using an SSD that reads 6 GB/s, a linear scan of a data set D would take
  • 1.9 days when D is of 1 PB (10^15 B)
  • 5.28 years when D is of 1 EB (10^18 B)
• O(n) time is already beyond reach on big data in practice!
Can we still answer queries on big data with limited resources?
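
The scan times on this slide follow from dividing the data size by the read rate. A minimal sketch of that arithmetic, assuming a sustained 6 GB/s and a 365.25-day year (the slide does not state the year length; the function and constant names are made up for illustration):

```python
SECONDS_PER_DAY = 86_400
DAYS_PER_YEAR = 365.25  # assumption: the slide does not say which year length it uses


def scan_seconds(size_bytes: float, bytes_per_second: float = 6e9) -> float:
    """Time for one linear scan of size_bytes at a sustained read rate."""
    return size_bytes / bytes_per_second


# 1 PB = 10^15 bytes, 1 EB = 10^18 bytes
pb_days = scan_seconds(1e15) / SECONDS_PER_DAY
eb_years = scan_seconds(1e18) / SECONDS_PER_DAY / DAYS_PER_YEAR

print(f"1 PB: {pb_days:.1f} days")
print(f"1 EB: {eb_years:.2f} years")
```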

Bounded evaluability
• Input: A class L of queries
• Question: Can we find, for any query Q ∈ L and any (possibly big) dataset D, a fraction D_Q of D such that
  • Q(D) = Q(D_Q), and
  • D_Q can be identified in time determined by Q?
[Figure: Q evaluated over all of D versus Q evaluated over the small fraction D_Q]
• Scales with D no matter how big D grows
Making the cost of computing Q(D) independent of |D|!

Graph Search (Facebook)
• Find me restaurants in New York my friends have been to in 2014
  (1.38 billion person tuples, and over 140 billion friend tuples)
select rid
from friend(pid1, pid2), person(pid, name, city), dine(pid, rid, dd, mm, yy)
where pid1 = p0 and pid2 = person.pid and pid2 = dine.pid and city = NYC and yy = 2014
Data semantics in constraints
• Facebook: at most 5000 friends per person
• Each year has at most 366 days
• Each person dines at most once per day
• pid is a key for relation person
Build an index from pid1 to pid2 for friend(pid1, pid2)
Boundedly evaluable with indices under constraints?

Bounded query evaluation
• Find me restaurants in New York my friends have been to in 2014
Q(rid) = ∃ p, p1, n, c, dd, mm, yy (friend(p, p1) ∧ person(p1, n, c) ∧ dine(p1, rid, dd, mm, yy) ∧ p = p0 ∧ c = NYC ∧ yy = 2014)
A query plan under the constraints + indices
• Fetch at most 5000 pid's for friends of p0 (5000 friends per person)
• For each pid, check whether she lives in NYC (at most 5000 person tuples)
• For pid's living in NYC, find restaurants where they dined in 2014 (at most 5000 * 366 tuples)
Accessing at most 5000 + 5000 + 5000 * 366 = 1,840,000 tuples in total, in contrast to 1.38 billion person tuples and over 140 billion friend tuples
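
The three-step plan can be sketched with plain hash indices. This is an illustration only, not the authors' implementation: the toy tuples, the `build_index` helper and the `restaurants` function are all made up here, and the indices simply mirror the constraints stated on the slides (pid1 to pid2, pid to city, and pid plus year to rid).

```python
from collections import defaultdict

# Toy instance; every value below is invented for illustration.
friend = [("p0", "a"), ("p0", "b"), ("x", "a")]
person = [("a", "Ann", "NYC"), ("b", "Bob", "LA"), ("x", "Xi", "NYC")]
dine = [("a", "r1", 3, 5, 2014), ("a", "r2", 9, 9, 2013), ("b", "r3", 1, 1, 2014)]


def build_index(rows, key, value):
    """Hash index mapping key(row) to the list of value(row)."""
    index = defaultdict(list)
    for row in rows:
        index[key(row)].append(value(row))
    return index


friend_idx = build_index(friend, lambda r: r[0], lambda r: r[1])      # pid1 -> pid2
person_idx = build_index(person, lambda r: r[0], lambda r: r[2])      # pid -> city
dine_idx = build_index(dine, lambda r: (r[0], r[4]), lambda r: r[1])  # (pid, yy) -> rid

# Worst-case number of tuples touched under the constraints on the slides.
WORST_CASE_TUPLES = 5000 + 5000 + 5000 * 366


def restaurants(p0, city="NYC", year=2014):
    """Return (answer, number of tuples accessed) using index lookups only."""
    accessed = 0
    friends = friend_idx.get(p0, [])            # step 1: friends of p0
    accessed += len(friends)
    in_city = []
    for pid in friends:                         # step 2: keep friends living in `city`
        cities = person_idx.get(pid, [])
        accessed += len(cities)
        if city in cities:
            in_city.append(pid)
    answer = set()
    for pid in in_city:                         # step 3: restaurants visited in `year`
        rids = dine_idx.get((pid, year), [])
        accessed += len(rids)
        answer.update(rids)
    return answer, accessed


print(restaurants("p0"), "worst case:", WORST_CASE_TUPLES)
```

No relation is ever scanned at query time, so the number of tuples accessed depends on the constraints and the query, not on the size of the three relations.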

Overview
• Formalization of bounded query plans and queries
• The complexity of deciding the bounded evaluability for
  • CQ (SPJ), UCQ, FO+ (SPJU), FO
• Effective syntax for boundedly evaluable queries
• Approximate query answering with bounded evaluability
  • Bounded envelopes
  • Bounded query specialization
Previous work: we only know that bounded evaluability is
• undecidable for FO [PODS 2014]
• in PTIME for CQ with very restricted query plans [VLDB 2014]
In previous work, bounded query plans are not properly defined

Boundedly evaluable queries: formulation

Access constraints to capture data semantics
Combining cardinality constraints and index
On a relation schema R: X → (Y, N)
• X, Y: sets of attributes of R
• for any X-value, there exist at most N distinct Y values
• Index on X for Y: given an X value, find relevant Y values
Examples
• friend(pid1, pid2): pid1 → (pid2, 5000)
  at most 5000 friends per person
• dine(pid, rid, dd, mm, yy): pid, yy → (rid, 366)
  each year has at most 366 days and each person dines at most once per day
• person(pid, name, city): pid → (city, 1)
  pid is a key for person
Discovery: functional dependencies, simple aggregate queries
Access schema: A set of access constraints
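
An access constraint pairs a cardinality bound with an index, and both parts can be checked on a concrete instance. A minimal sketch, assuming relations are lists of tuples and attributes are addressed by column position (`build_index`, `satisfies` and the sample tuples are hypothetical names and data, not taken from the talk):

```python
from collections import defaultdict


def build_index(rows, x_cols, y_cols):
    """Map each X-value to the set of distinct Y-values that co-occur with it."""
    index = defaultdict(set)
    for row in rows:
        x = tuple(row[i] for i in x_cols)
        y = tuple(row[i] for i in y_cols)
        index[x].add(y)
    return dict(index)


def satisfies(rows, x_cols, y_cols, n):
    """True if every X-value has at most n distinct Y-values."""
    return all(len(ys) <= n for ys in build_index(rows, x_cols, y_cols).values())


# person(pid, name, city) and dine(pid, rid, dd, mm, yy); the tuples are invented.
person = [("a", "Ann", "NYC"), ("b", "Bob", "LA")]
dine = [("a", "r1", 3, 5, 2014), ("a", "r2", 4, 5, 2014), ("a", "r1", 9, 9, 2013)]

person_ok = satisfies(person, x_cols=(0,), y_cols=(2,), n=1)     # pid -> (city, 1)
dine_ok = satisfies(dine, x_cols=(0, 4), y_cols=(1,), n=366)     # pid, yy -> (rid, 366)

# A second city for the same pid violates pid -> (city, 1).
violating_person = person + [("a", "Ann", "LA")]
violation_ok = satisfies(violating_person, x_cols=(0,), y_cols=(2,), n=1)

print(person_ok, dine_ok, violation_ok)
```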

Bounded plans for query Q
In the presence of access schema A
ξ(Q, R): T1 = δ1, …, Tn = δn, where each δi is one of
• { a }: a constant in query Q
• Fetch(X ∈ Tj, R, Y): via access constraint R: X → (Y’, N) with Y ⊆ X ∪ Y’, j < i (fetch data by making use of indices in A)
• π_Y(Tj), σ_C(Tj), ρ(Tj): projection, selection, renaming
• Tj × Tk, Tj ∪ Tk, Tj − Tk: Cartesian product, union, set difference, for j < i, k < i
The length of ξ(Q, R): bounded by an exponential in |R|, |Q| and |A| (plans beyond exponential length are not very practical)
Independent of the size of instances D of R
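
To make the plan operators concrete, here is a small sketch of a plan written as a sequence of steps, each using one operator. It assumes tables are sets of tuples with named columns and that each index is a plain dict from X-values to sets of Y-values; the `Table` class, the function names and the toy indices are hypothetical and only illustrate the shape of such a plan for the Graph Search query.

```python
from itertools import product as cartesian


class Table:
    """A relation with named columns: attrs is a tuple of names, rows a set of tuples."""

    def __init__(self, attrs, rows):
        self.attrs = tuple(attrs)
        self.rows = set(rows)


def const(attr, value):
    return Table((attr,), {(value,)})


def fetch(t, x_attrs, index, y_attrs):
    """Look up each X-value of t in an index (dict: X-value -> set of Y-values)."""
    pos = [t.attrs.index(a) for a in x_attrs]
    rows = set()
    for row in t.rows:
        x = tuple(row[p] for p in pos)
        for y in index.get(x, ()):
            rows.add(x + tuple(y))
    return Table(tuple(x_attrs) + tuple(y_attrs), rows)


def project(t, attrs):
    pos = [t.attrs.index(a) for a in attrs]
    return Table(attrs, {tuple(r[p] for p in pos) for r in t.rows})


def select(t, predicate):
    return Table(t.attrs, {r for r in t.rows if predicate(dict(zip(t.attrs, r)))})


def rename(t, mapping):
    return Table(tuple(mapping.get(a, a) for a in t.attrs), t.rows)


def product(t1, t2):
    return Table(t1.attrs + t2.attrs, {a + b for a, b in cartesian(t1.rows, t2.rows)})


def union(t1, t2):
    return Table(t1.attrs, t1.rows | t2.rows)


def difference(t1, t2):
    return Table(t1.attrs, t1.rows - t2.rows)


# Toy indices standing in for the access schema (invented data).
friend_idx = {("p0",): {("a",), ("b",)}}
person_idx = {("a",): {("NYC",)}, ("b",): {("LA",)}}
dine_idx = {("a", 2014): {("r1",)}, ("b", 2014): {("r3",)}}

T1 = const("pid1", "p0")
T2 = fetch(T1, ("pid1",), friend_idx, ("pid2",))
T3 = project(T2, ("pid2",))
T4 = rename(T3, {"pid2": "pid"})
T5 = fetch(T4, ("pid",), person_idx, ("city",))
T6 = select(T5, lambda r: r["city"] == "NYC")
T7 = project(T6, ("pid",))
T8 = const("yy", 2014)
T9 = product(T7, T8)
T10 = fetch(T9, ("pid", "yy"), dine_idx, ("rid",))
T11 = project(T10, ("rid",))

print(T11.rows)
```

Data enters the plan only through `const` and `fetch`, which is what keeps the amount of data accessed independent of the size of the underlying instance.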

Boundedly evaluable queries Q
Q has a bounded query plan ξ(Q, R) under an access schema A
• CQ: only { a }, Fetch(X ∈ Tj, R, Y), π_Y(Tj), σ_C(Tj), ρ(Tj), Tj × Tk
• UCQ: ∪ at the end only
• FO+: { a }, Fetch, π, σ, ρ, ×, ∪
• FO: { a }, Fetch, π, σ, ρ, ×, ∪, −
Coping with big data

Deciding bounded evaluability

The bounded evaluability problem (BEP(L))
• Input: A relational schema R, an access schema A, and a query Q in a query language L
• Question: Is Q boundedly evaluable under A, i.e., does Q have a bounded query plan under A?
Undecidable for FO [PODS 2014]
• Is BEP decidable for CQ? UCQ? FO+?
• If so, what is the complexity?
The bounded evaluability analysis is nontrivial

Example of boundedly evaluable queries
• Schema: R(A, B, C)
• Access schema A: R(∅ → C, 1), R(AB → C, N)
• A CQ query: Q(x, y) = ∃ x1, x2, z1, z2, z3 (R(x1, x2, x) ∧ R(z1, z2, y) ∧ R(x, y, z3) ∧ x1 = 1 ∧ x2 = 1)
Is Q boundedly evaluable?
Yes, Q is A-equivalent to Q’(x, x) = R(1, 1, x), which is boundedly evaluable:
• x = y = z3
• ∃ z1, z2 (R(1, 1, x) ∧ R(z1, z2, y)) is entailed by R(1, 1, x)
With indices in A,
• “nontrivial” variables are fetchable;
• combinations are indexed
We need to reason about A-equivalence and “nontrivial” variables

The complexity of BEP
BEP is EXPSPACE-complete for CQ, UCQ and FO+
• good news: decidable
• bad news: too expensive to be practical
Lower bound: by reduction from the non-emptiness problem for parameterized regular expressions
Upper bound: a characterization based on A-equivalence and “nontrivial” variables for boundedly evaluable queries
Can we make practical use of bounded evaluability?

Effective syntax for boundedly evaluable queries

An effective syntax for bounded CQ
A form of queries covered by an access schema A
• A CQ is boundedly evaluable under A iff it is A-equivalent to a CQ covered by A
• All CQ queries covered by A are boundedly evaluable under A
• It is in PTIME in |Q|, |A| and |R| to syntactically check whether a CQ is covered by A
A CQ Q is covered by A if
• all free variables and variables that participate in “selection / join” of Q are accessible via indices in A
• the combination of such variables in each atom R(x) is indexed by a single access constraint
A syntactic characterization of boundedly evaluable CQ
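
The first coverage condition (variables being accessible via indices) can be approximated by a fixpoint computation, much like attribute closure for functional dependencies. The sketch below is a simplified reading, not the talk's full definition: it ignores the second condition and A-equivalence, treats variables equated to constants as initially known, and uses hypothetical names (`accessible`, `atoms`, `constraints`) with column positions standing in for attributes.

```python
def accessible(atoms, constraints, known):
    """Variables reachable from `known` by repeatedly applying access constraints.

    atoms:       list of (relation, tuple of variable names)
    constraints: list of (relation, X positions, Y positions)
    """
    reached = set(known)
    changed = True
    while changed:
        changed = False
        for rel, terms in atoms:
            for c_rel, x_pos, y_pos in constraints:
                if c_rel != rel:
                    continue
                # If all X-position variables are known, the index yields the Y ones.
                if all(terms[i] in reached for i in x_pos):
                    new = {terms[i] for i in y_pos} - reached
                    if new:
                        reached |= new
                        changed = True
    return reached


# The Graph Search query, with p = p0, c = NYC and yy = 2014 fixed by constants.
atoms = [
    ("friend", ("p", "p1")),
    ("person", ("p1", "n", "c")),
    ("dine", ("p1", "rid", "dd", "mm", "yy")),
]
constraints = [
    ("friend", (0,), (1,)),    # pid1 -> pid2
    ("person", (0,), (2,)),    # pid -> city
    ("dine", (0, 4), (1,)),    # pid, yy -> rid
]
# Free variable rid plus the variables used in selections and joins.
needed = {"rid", "p", "p1", "c", "yy"}

with_p0 = accessible(atoms, constraints, known={"p", "c", "yy"})
without_p0 = accessible(atoms, constraints, known={"c", "yy"})

print(needed <= with_p0, needed <= without_p0)
```

With p fixed to p0, every needed variable is reached; without it, no index can be entered and the closure stays at the constants.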

More on covered queries
• Schema: R(A, B, C)
• Access schema A: R(∅ → C, 1), R(AB → C, N)
• Q(x, y) = ∃ x1, x2, z1, z2, z3 (R(x1, x2, x) ∧ R(z1, z2, y) ∧ R(x, y, z3) ∧ x1 = 1 ∧ x2 = 1): covered
A query Q in FO+ is covered by A if for each CQ-subquery Qi
• either Qi is covered by A,
• or for each A-instance θ(Ti) of Qi, there exists a CQ-subquery Qj of Q such that Qi(θ(Ti)) ⊆ Qj(θ(Ti)) and Qj is covered
Π₂ᵖ-complete to decide whether a query in FO+ is covered

Bounded envelopes

Bounded envelopes
What can we do if query Q in L is not boundedly evaluable under A?
We find Q_L and Q_U in the same language L such that
• Q_L and Q_U are boundedly evaluable under A
• for all instances D that satisfy A
  • Q_L(D) ⊆ Q(D) ⊆ Q_U(D), and
  • |Q(D) − Q_L(D)| ≤ N_L and |Q_U(D) − Q(D)| ≤ N_U, where N_L and N_U are constants
Q_L and Q_U: lower and upper envelopes of Q
Q_L(D) and Q_U(D) are not too far from Q(D)
Approximate query answering
S. Chaudhuri and P. G. Kolaitis. Can Datalog be approximated? JCSS 55(2), 1997

Example bounded envelopes
• Schema: R(A, B)
• Access schema A: R(A → B, N)
• Q(x) = ∃ y, z, w (R(w, x) ∧ R(y, w) ∧ R(x, z) ∧ w = 1): not boundedly evaluable
Bounded envelopes
• Upper (relaxation): Q_U(x) = ∃ y, z (R(1, x) ∧ R(x, z))
• Lower (expansion): Q_L(x) = ∃ y, z (R(1, x) ∧ R(y, 1) ∧ R(x, y) ∧ R(x, z))
Bounded envelopes may not exist, e.g., for Q(x, y) = ∃ w (R(w, x) ∧ R(y, w) ∧ w = 1)
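
The containment between the query and its two envelopes can be checked on small instances. The sketch below evaluates the three queries from this slide by brute force over a binary relation stored as a set of pairs; the function names and the two toy instances are made up, and brute-force evaluation is used only to compare answers, not as a bounded plan.

```python
def q(r):
    """Q(x): R(1, x), some R(y, 1) exists, and some R(x, z) exists."""
    has_edge_into_1 = any(b == 1 for _, b in r)
    return {x for a, x in r
            if a == 1 and has_edge_into_1 and any(s == x for s, _ in r)}


def q_upper(r):
    """Upper envelope: drop the atom R(y, 1) (relaxation)."""
    return {x for a, x in r if a == 1 and any(s == x for s, _ in r)}


def q_lower(r):
    """Lower envelope: add the atom R(x, y), so y is reached from x (expansion)."""
    return {x for a, x in r
            if a == 1 and any(s == x and (y, 1) in r for s, y in r)}


# Two toy instances of R(A, B), invented for illustration.
r1 = {(1, 2), (2, 3), (3, 1), (1, 4), (4, 5)}
r2 = {(1, 2), (2, 3), (1, 4), (4, 5)}  # no tuple of the form (y, 1)

for r in (r1, r2):
    lower, exact, upper = q_lower(r), q(r), q_upper(r)
    assert lower <= exact <= upper
    print(sorted(lower), sorted(exact), sorted(upper))
```

On the second instance the exact answer is empty while the upper envelope still returns the values reachable from 1, which shows why the gap is bounded by a constant depending on N rather than being zero.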

The bounded envelope problems
UEP(L):
• Input: A relational schema R, an access schema A, and a query Q in a query language L
• Question: Does Q have a bounded upper envelope under A?
Similarly LEP(L) for lower envelopes.
We consider covered envelopes when Q is in CQ, UCQ or FO+
Complexity bounds
• For CQ, UEP and LEP are NP-complete
• For UCQ, UEP is Π₂ᵖ-complete and LEP is NP-complete
• For FO+, UEP is Π₂ᵖ-complete and LEP is DP-complete
• For FO, UEP and LEP are undecidable

Bounded specialized queries

Bounded query specialization
Access schema A, and query Q with a set X of parameters (variables)
• Q(x = c): Q ∧ x = c, where x ⊆ X and the valuation c is a constant tuple
• boundedly evaluable under A for all valuations c
Consider covered queries when Q is in CQ, UCQ or FO+
• Find me restaurants in New York my friends have been to in 2014
Q(p, rid) = ∃ p1, n, c, dd, mm, yy (friend(p, p1) ∧ person(p1, n, c) ∧ dine(p1, rid, dd, mm, yy) ∧ p = p0 ∧ c = NYC ∧ yy = 2014), for all valuations p0
Instantiate a minimum set of parameters and make Q bounded

The bounded specialization problem (QSP(L))
• Input: A relational schema R, an access schema A, a query Q in a query language L, a set X of parameters of Q, and a positive integer k
• Question: Does Q have a bounded specialization Q(x = c) with |x| ≤ k?
Complexity bounds
• NP-complete for CQ
• Π₂ᵖ-complete for UCQ and FO+
• undecidable for FO
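
For intuition, the search for a small set of parameters to instantiate can be sketched by brute force over subsets of X of size at most k. This is not the decision procedure behind the stated complexity bounds: boundedness is replaced here by a simplified proxy (all needed variables become reachable through the indices once the chosen parameters are fixed), and every name in the code is hypothetical.

```python
from itertools import combinations


def reachable(atoms, constraints, known):
    """Variables reachable from `known` via access constraints (simplified proxy)."""
    reached = set(known)
    while True:
        new = {
            terms[i]
            for rel, terms in atoms
            for c_rel, x_pos, y_pos in constraints
            if c_rel == rel and all(terms[j] in reached for j in x_pos)
            for i in y_pos
        } - reached
        if not new:
            return reached
        reached |= new


def smallest_specialization(atoms, constraints, constants, needed, params, k):
    """Smallest subset of params, of size <= k, that makes all needed variables reachable.

    Returns None if no such subset exists.
    """
    for size in range(k + 1):
        for chosen in combinations(sorted(params), size):
            if needed <= reachable(atoms, constraints, set(constants) | set(chosen)):
                return set(chosen)
    return None


# Graph Search query with p left as a parameter; c and yy are fixed by constants.
atoms = [
    ("friend", ("p", "p1")),
    ("person", ("p1", "n", "c")),
    ("dine", ("p1", "rid", "dd", "mm", "yy")),
]
constraints = [
    ("friend", (0,), (1,)),    # pid1 -> pid2
    ("person", (0,), (2,)),    # pid -> city
    ("dine", (0, 4), (1,)),    # pid, yy -> rid
]
needed = {"p", "p1", "c", "yy", "rid"}

best = smallest_specialization(atoms, constraints, {"c", "yy"}, needed, {"p", "rid"}, k=2)
none_allowed = smallest_specialization(atoms, constraints, {"c", "yy"}, needed, {"p", "rid"}, k=0)

print(best, none_allowed)
```

Under this proxy, instantiating p alone suffices, matching the slide's example where the query is bounded for every valuation p0.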

Summing up

Bounded evaluability of queries
Challenges: querying big data is cost-prohibitive
• Bounded evaluability allows us to make big data small
• However, the bounded evaluability analysis is expensive
• Nonetheless, we can make practical use of bounded evaluability
  • Effective syntax: covered queries for CQ, UCQ and FO+
  • Approximate query answering:
    • Bounded envelopes with a constant bound
    • Bounded specialization for parameterized queries
Decidability and complexity
An approach to effectively querying big data
