The Complexity of Massive Data Set Computations
Ziv Bar-Yossef
Computer Science Division, U.C. Berkeley
Ph.D. Dissertation Talk, May 6, 2002
What Are Massive Data Sets?
Examples
• The Web
• IP packets
• Supermarket transactions
• Telephone call graph
• Astronomical observations
Characterizing properties
• Huge collections of raw data
• Data is generated and modified continuously
• Distributed over many sites
• Stored on slow storage devices
• Data is not organized / indexed
Nontraditional Computational Challenges
Traditionally: cope with the difficulty of the problem.
Massive data sets: cope with the size of the data and with the restricted access to it.
Restricted access to the data
• Random access: expensive
• "Streaming" access: more feasible
• Some data may be unavailable
• Fetching data is expensive
Sub-linear running time
• Ideally, independent of data size
Sub-linear space
• Ideally, logarithmic in data size
Basic Framework
Massive data set computations are typically:
• Approximate
• Randomized
• Subject to a restricted access regime
[Diagram: the algorithm reaches the input data only through a restricted access regime, uses randomness ($$), and produces an approximate output.]
Prominent Computational Models for Massive Data Sets
• Sampling computations
  • Sub-linear running time & space
  • Suitable for "insensitive" functions
• Data stream computations
  • Linear running time, sub-linear space
  • Can compute sensitive functions
• Sketch computations
  • Suitable for distributed data
Sampling Computations
[Diagram: the input x1,…,xn sits in external storage; a randomized sampling algorithm ($$) queries it at a few locations and outputs an approximation of f(x1,…,xn).]
• Query the input at random locations
• Can choose the query distribution and can query adaptively
• Complexity measure: query complexity
Applications
• Statistical parameter estimation
• Computational and statistical learning [Valiant 84, Vapnik 98]
• Property testing [RS96, GGR96]
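To make the model concrete, here is a minimal sketch (my illustration, not part of the talk) of a sampling computation: an (ε,δ)-approximation of the mean of n numbers in [0,1] using O(1/ε² · log(1/δ)) uniform queries, matching the lower bound discussed later up to constants. The function name and data are illustrative.

```python
import math
import random

def sample_mean(x, eps, delta, rng=random):
    """(eps, delta)-approximate the mean of x[0..n-1], each in [0, 1], by sampling.

    Queries the input at uniformly random locations; by Hoeffding's inequality,
    k = O(1/eps^2 * log(1/delta)) queries suffice.
    """
    k = math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))
    total = sum(x[rng.randrange(len(x))] for _ in range(k))  # k random queries
    return total / k

# Usage: a data set of one million numbers with true mean 0.3.
data = [1.0] * 300000 + [0.0] * 700000
print(sample_mean(data, eps=0.05, delta=0.01))
```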
Data Stream Computations [HRR98, AMS96, FKSV99]
[Diagram: the items x1, x2, x3, …, xn stream past a randomized algorithm ($$) with a small memory, which outputs an approximation of f(x1,…,xn).]
• Input arrives in a one-way stream, in arbitrary order
• Complexity measures: space and time per data item
Applications
• Databases (frequency moments [AMS96])
• Networking (Lp distance [AMS96, FKSV99, FS00, Indyk 00])
• Web information retrieval (web crawling, Google query logs [CCF02])
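As an illustration (again mine, not from the talk), a minimal one-pass, small-space data stream algorithm: estimating the number of distinct items (the frequency moment F0) with a "k minimum values" sketch. The constants and helper names are assumptions made for the example.

```python
import hashlib

def distinct_estimate(stream, k=256):
    """One-pass, small-space estimate of the number of distinct items (F0).

    Keeps only the k smallest hash values seen so far; if v is the k-th
    smallest hash mapped into (0,1], then (k - 1) / v estimates F0.
    Space is O(k), regardless of the stream length.
    """
    def h(item):  # deterministic hash into (0, 1]
        digest = hashlib.sha1(str(item).encode()).digest()
        return (int.from_bytes(digest[:8], "big") + 1) / 2 ** 64

    smallest = []  # sorted list of the k smallest hash values
    for item in stream:
        v = h(item)
        if len(smallest) < k or v < smallest[-1]:
            if v not in smallest:
                smallest.append(v)
                smallest.sort()
                del smallest[k:]
    if len(smallest) < k:          # fewer than k distinct items: exact count
        return len(smallest)
    return (k - 1) / smallest[-1]

# Usage: a stream with 10,000 distinct values, each repeated 5 times.
print(distinct_estimate(i % 10000 for i in range(50000)))
```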
Sketch Computations [GM98, BCFM98, FKSV99]
[Diagram: the data (x11,…,x1k), (x21,…,x2k), …, (xt1,…,xtk) reside at t sites; each site sends a randomized compressed "sketch" ($$) of its data to a sketch algorithm, which outputs an approximation of f(x11,…,xtk).]
• The algorithm computes from "sketches" sent by the sites
• Complexity measure: sketch lengths
Applications
• Web information retrieval (identifying document similarities [BCFM98])
• Networking (Lp distance [FKSV99])
• Lossy compression, approximate nearest neighbor, …
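A minimal sketch-computation example in the spirit of [BCFM98] (an illustration, not the talk's construction): each site compresses its document into a short MinHash sketch, and the similarity of two documents is estimated from their sketches alone. Python's built-in hash stands in for a min-wise independent hash family, so the guarantee is only approximate.

```python
import random

def minhash_sketch(doc_shingles, k=128, seed=0):
    """Compress a set of shingles into a length-k MinHash sketch.

    Two sketches agree in a coordinate with probability (roughly) equal to
    the Jaccard similarity of the underlying sets -- the idea behind [BCFM98].
    """
    rng = random.Random(seed)
    salts = [rng.getrandbits(64) for _ in range(k)]
    return [min(hash((salt, s)) for s in doc_shingles) for salt in salts]

def estimated_similarity(sketch_a, sketch_b):
    """Referee side: estimate Jaccard similarity from the two sketches alone."""
    return sum(a == b for a, b in zip(sketch_a, sketch_b)) / len(sketch_a)

# Usage: two documents whose Jaccard similarity is 0.5.
doc1 = set(range(0, 3000))
doc2 = set(range(1000, 4000))
print(estimated_similarity(minhash_sketch(doc1), minhash_sketch(doc2)))
```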
Main Objective
Explore the limitations of the above computational models:
• Develop general lower bound techniques
• Obtain lower bounds for specific functions
Thesis Blueprint
[Diagram of the thesis structure:]
• Sampling computations: lower bounds for general functions [BKS01, B02], built on statistical decision theory
• Data stream computations: lower bounds by reduction from one-way CC
• Sketch computations: lower bounds by reduction from simultaneous CC
• Communication complexity, via information theory and statistical decision theory:
  • General CC lower bounds [BJKS02b]
  • One-way and simultaneous CC lower bounds [BJKS02a]
Sampling Lower Bounds
(with R. Kumar and D. Sivakumar; STOC 2001 and manuscript, 2002)
• Combinatorial lower bound [BKS01]
  • bounds the expected query complexity of every function
  • tends to be weak
  • based on a generalization of Boolean block sensitivity [Nisan 89]
• Statistical lower bounds
  • bound the query complexity of symmetric functions
  • via Hellinger distance: worst-case query complexity [BKS01]
  • via KL distance: expected query complexity [B02]
  • tend to be tight
  • work by a reduction from statistical hypothesis testing
• Information theory lower bound [B02]
  • bounds the worst-case query complexity of symmetric functions
  • has better dependence on the domain size
Main Idea
[Diagram: the approximation sets of inputs x, y, and w; for ε-disjoint inputs the approximation sets do not intersect.]
(ε,δ)-approximation: for every input x, the algorithm's output T(x) lies in the approximation set of x with probability ≥ 1 − δ.
ε-disjoint inputs: inputs whose approximation sets are disjoint.
Main observation: since for all x, T(x) lands in the approximation set of x w.p. 1 − δ, then for ε-disjoint x, y the output distributions T(x), T(y) are "far" from each other.
Main Result
Theorem: For any symmetric f, any ε-disjoint inputs x, y, and any algorithm that (ε,δ)-approximates f,
• Worst-case # of queries ≥ Ω((1/h²(Ux,Uy)) · log(1/δ))
• Expected # of queries ≥ Ω((1/KL(Ux,Uy)) · log(1/δ))
where
• Ux – uniform query distribution on x (induced by: pick i u.a.r., output xi)
• Hellinger: h²(Ux,Uy) = 1 − Σa (Ux(a) · Uy(a))^½
• KL: KL(Ux,Uy) = Σa Ux(a) · log(Ux(a) / Uy(a))
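For concreteness, a small sketch (my illustration) that computes the two distances for finite distributions given as dictionaries and applies them to the uniform query distributions of the mean example on the next slide.

```python
import math

def hellinger2(p, q):
    """Squared Hellinger distance h^2(p, q) = 1 - sum_a sqrt(p(a) * q(a))."""
    return 1.0 - sum(math.sqrt(p[a] * q.get(a, 0.0)) for a in p)

def kl(p, q):
    """KL divergence KL(p || q) = sum_a p(a) * log(p(a) / q(a))."""
    return sum(p[a] * math.log(p[a] / q[a]) for a in p if p[a] > 0)

# Usage: the uniform query distributions of the mean example --
# x has a (1/2 + eps) fraction of 1s, y a (1/2 - eps) fraction.
eps, delta = 0.01, 0.05
ux = {1: 0.5 + eps, 0: 0.5 - eps}
uy = {1: 0.5 - eps, 0: 0.5 + eps}
print("h^2 =", hellinger2(ux, uy), "KL =", kl(ux, uy))       # both are O(eps^2)
print("implied query lower bound ~", math.log(1 / delta) / kl(ux, uy))
```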
Example: Mean
Theorem (originally [CEG95]): Approximating the mean of n numbers in [0,1] to within ε additive error requires Ω((1/ε²) · log(1/δ)) queries.
Proof sketch: take the ε-disjoint inputs
• x: a (½ + ε) fraction of 1s and a (½ − ε) fraction of 0s
• y: a (½ − ε) fraction of 1s and a (½ + ε) fraction of 0s
Then h²(Ux,Uy) = KL(Ux,Uy) = O(ε²).
Other applications: selection functions, frequency moments, extractors and dispersers
Proof Outline
• For symmetric functions, WLOG all queries are uniform without replacement
• If the # of queries is ≤ n^½, we can further assume the queries are uniform with replacement
• For any ε-disjoint inputs x, y: an (ε,δ)-approximation of f with k queries yields a hypothesis test of Ux against Uy with error δ and k samples
• Hypothesis testing lower bounds
  • via Hellinger distance (worst-case)
  • via KL distance (expected) (cf. [Siegmund 85])
Statistical Hypothesis Testing
[Diagram: a black box containing either P or Q produces k i.i.d. samples, which are fed to a hypothesis test.]
• The black box contains either P or Q
• The test has to decide: "P" or "Q"
• Allowed error probability: δ
• Goal: minimize k
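For reference, a minimal hypothesis test for two known discrete distributions (an illustration, not from the talk): the maximum-likelihood rule, which compares the log-likelihood of the k samples under P and under Q.

```python
import math
import random

def likelihood_ratio_test(samples, p, q):
    """Decide whether the i.i.d. `samples` came from distribution p or q.

    Returns "P" if the samples are more likely under p than under q
    (the maximum-likelihood rule for two simple hypotheses).
    """
    log_ratio = sum(math.log(p[s]) - math.log(q[s]) for s in samples)
    return "P" if log_ratio >= 0 else "Q"

# Usage: distinguishing two slightly biased coins from k samples.
p = {1: 0.55, 0: 0.45}
q = {1: 0.45, 0: 0.55}
k = 500
samples = [1 if random.random() < p[1] else 0 for _ in range(k)]  # box holds P
print(likelihood_ratio_test(samples, p, q))
```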
Sampling Algorithm ⇒ Hypothesis Test
[Diagram: the black box contains either Ux or Uy and produces k i.i.d. samples, which are fed to the sampling algorithm as its queries; the algorithm's output determines the test's answer.]
• x, y: ε-disjoint inputs
• The test answers "Ux" if the sampling algorithm's output lands in the approximation set of x, and "Uy" otherwise
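A toy version of this reduction (illustrative; the mean estimator stands in for a general sampling algorithm): the hypothesis test runs the estimator on the black box's samples and answers according to which approximation set the output lands in.

```python
import random

def sampling_to_test(black_box_sample, k):
    """Use a mean-estimating sampling algorithm as a hypothesis test.

    black_box_sample() returns one i.i.d. sample from either Ux or Uy, where
    x has a (1/2 + eps) fraction of 1s and y a (1/2 - eps) fraction.  We run
    the sampling algorithm (here: the empirical mean of k samples) and answer
    "Ux" if its output lands in x's approximation set (i.e. above 1/2).
    """
    estimate = sum(black_box_sample() for _ in range(k)) / k
    return "Ux" if estimate > 0.5 else "Uy"

# Usage: the black box actually holds Uy.
eps, k = 0.05, 2000
box = lambda: 1 if random.random() < 0.5 - eps else 0
print(sampling_to_test(box, k))
```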
Lower Bound via Hellinger Distance
A hypothesis test for Ux against Uy with error δ and k samples implies V(Ux^k, Uy^k) ≥ 1 − 2δ (Ux^k is the k-fold product distribution).
Lemma (cf. [Le Cam, Yang 90]):
1. 1 − V(P,Q) ≥ ½ · (1 − h²(P,Q))²
2. 1 − h²(P^k, Q^k) = (1 − h²(P,Q))^k
Corollary: k ≥ Ω((1/h²(Ux,Uy)) · log(1/δ))
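The tensorization fact in item 2 is easy to check numerically; the following sketch (my illustration) verifies that the Hellinger affinity of k-fold products is the k-th power of the single-sample affinity.

```python
import math
from itertools import product

def affinity(p, q):
    """Hellinger affinity sum_a sqrt(p(a) * q(a)); note h^2 = 1 - affinity."""
    return sum(math.sqrt(p[a] * q[a]) for a in p)

def product_dist(p, k):
    """k-fold product distribution of p, over tuples of outcomes."""
    return {t: math.prod(p[a] for a in t) for t in product(p.keys(), repeat=k)}

p = {0: 0.45, 1: 0.55}
q = {0: 0.55, 1: 0.45}
k = 5
# 1 - h^2(P^k, Q^k) = (1 - h^2(P, Q))^k: the two printed numbers agree.
print(affinity(product_dist(p, k), product_dist(q, k)), affinity(p, q) ** k)
```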
Communication Complexity [Yao 79]
f: X × Y → Z
[Diagram: Alice holds x ∈ X, Bob holds y ∈ Y; using randomness ($$) they exchange messages to compute f(x,y).]
Rδ(f) = randomized communication complexity of f with error δ
Multi-Party Communication
f: X1 × … × Xt → Z
[Diagram: players P1,…,Pt hold x1,…,xt respectively and communicate to compute f(x1,…,xt).]
Example: Set-Disjointness
t-party set-disjointness: Pi gets Si ⊆ [n], and the players must determine whether the sets intersect.
Theorem [KS87, R90]: Rδ(Disj2) = Ω(n)
Theorem [AMS96]: Rδ(Disjt) = Ω(n/t⁴)
Best upper bound: Rδ(Disjt) = O(n/t)
Restricted Communication Models
One-way communication [PS84, Ablayev 93, KNR95]
[Diagram: P1 → P2 → … → Pt; the last player announces f(x1,…,xt).]
• Reduces to data stream computations (so one-way CC lower bounds yield data stream space lower bounds)
Simultaneous communication [Yao 79]
[Diagram: each of P1,…,Pt sends a single message to a Referee, who outputs f(x1,…,xt).]
• Reduces to sketch computations (so simultaneous CC lower bounds yield sketch length lower bounds)
Example: Disjointness ⇒ Frequency Moments
Input stream: a1,…,am ∈ [n]; for j ∈ [n], fj = # of occurrences of j in a1,…,am
k-th frequency moment: Fk(a1,…,am) = Σ_{j∈[n]} (fj)^k
Theorem [AMS96]: a small-space data stream algorithm for Fk yields a low-communication t-party one-way protocol for Disjt (for a suitable t), so the Disjt lower bound transfers.
Corollary: DS(Fk) = n^Ω(1) for k > 5
Best upper bounds: DS(Fk) = n^O(1) for k > 2; DS(Fk) = O(log n) for k = 0, 1, 2
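For the flavor of the upper bounds, here is a minimal sketch (my illustration) of the classic [AMS96] F2 estimator in the data stream model. For clarity it stores explicit random sign tables; the actual algorithm uses 4-wise independent hash functions so that the space stays polylogarithmic.

```python
import random
from collections import Counter

def ams_f2_estimate(stream, n, trials=64, seed=0):
    """One-pass estimate of the second frequency moment F2 = sum_j fj^2.

    The basic [AMS96] estimator: for each trial pick random signs s(j) in
    {-1,+1}, maintain Z = sum_j s(j) * fj while streaming, and use Z^2 as an
    unbiased estimate of F2; averaging over trials reduces the variance.
    """
    rng = random.Random(seed)
    signs = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(trials)]
    z = [0] * trials
    for item in stream:                 # single pass over the stream
        for t in range(trials):
            z[t] += signs[t][item]
    return sum(zt * zt for zt in z) / trials

# Usage: stream over [n] = {0,...,999} with each item appearing 3 times.
n = 1000
stream = [i % n for i in range(3 * n)]
print("estimate:", ams_f2_estimate(stream, n))
print("exact:   ", sum(c * c for c in Counter(stream).values()))
```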
Information Statistics Approach to Communication Complexity
(with T.S. Jayram, R. Kumar, and D. Sivakumar; manuscript, 2002)
A novel lower bound technique for randomized CC, based on statistics and information theory
Applications
• General CC lower bounds
  • t-party set-disjointness: Ω(n/t²) (improving on [AMS96])
  • Lp (solving an open problem of [Saks-Sun 02])
  • Inner product
• One-way CC lower bounds
  • t-party set-disjointness: Ω(n/t^(1+ε)) for any ε > 0
• Space lower bounds in the data stream model
  • frequency moments: n^Ω(1) for k > 2 (proving a conjecture of [AMS96])
  • Lp distance
Statistical View of Communication Complexity
• P – a δ-error randomized protocol for f: X × Y → Z
• P(x,y) – the distribution over transcripts of P on input (x,y)
Lemma: For any two input pairs (x,y), (x',y') with f(x,y) ≠ f(x',y'),
V(P(x,y), P(x',y')) ≥ 1 − 2δ
Proof: by reduction from hypothesis testing.
Corollary: h²(P(x,y), P(x',y')) ≥ 1 − 2δ^½
Information Cost [Ablayev 93, Chakrabarti et al. 01, Saks-Sun 02]
For a protocol P that computes f, how much information does P(x,y) have to reveal about (x,y)?
μ = (X,Y) – a distribution over the inputs of f
Definition (μ-information cost):
icost_μ(P) = I(X,Y ; P(X,Y))
icost_{μ,δ}(f) = min_P { icost_μ(P) }
Since I(X,Y ; P(X,Y)) ≤ H(P(X,Y)) ≤ |P(X,Y)|, an information cost lower bound implies a CC lower bound.
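A tiny illustration (mine, not the thesis') of the definition: the μ-information cost of a toy deterministic protocol for AND under the uniform input distribution, together with the chain I(X,Y ; P(X,Y)) ≤ H(P(X,Y)) ≤ |P(X,Y)|.

```python
import math
from collections import defaultdict

def transcript(x, y):
    """A toy deterministic 2-bit protocol for AND: Alice announces x,
    then Bob announces x AND y."""
    return (x, x & y)

def H(dist):
    """Shannon entropy (in bits) of a distribution {outcome: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Joint distribution of ((X, Y), T) under the uniform input distribution mu.
joint = defaultdict(float)
for x in (0, 1):
    for y in (0, 1):
        joint[((x, y), transcript(x, y))] += 0.25

marg_xy = defaultdict(float)
marg_t = defaultdict(float)
for ((xy, t), p) in joint.items():
    marg_xy[xy] += p
    marg_t[t] += p

# icost = I(X,Y ; T) = H(X,Y) + H(T) - H(X,Y,T)  <=  H(T)  <=  |T| = 2 bits.
icost = H(marg_xy) + H(marg_t) - H(joint)
print(icost, "<=", H(marg_t), "<= 2")
```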
Direct Sum for Information Cost
Decomposable functions: f(x,y) = g(h(x1,y1),…,h(xn,yn)), where h: Xi × Yi → {0,1} and g: {0,1}^n → {0,1}
Example: set-disjointness, Disj2(x,y) = (x1 ∧ y1) ∨ … ∨ (xn ∧ yn)
Theorem (direct sum): For appropriately chosen μ, μ',
icost_{μ,δ}(f) ≥ n · icost_{μ',δ}(h)
So a lower bound on icost(h) yields a lower bound on icost(f).
Information Cost of Single-Bit Functions
In Disj2, μ' = ½ μ'1 + ½ μ'2, where μ'1 = ½(1,0) + ½(0,0) and μ'2 = ½(0,1) + ½(0,0)
Lemma 1: For any protocol P for AND, icost_μ'(P) ≥ Ω(h²(P(0,1), P(1,0)))
Lemma 2: h²(P(0,1), P(1,0)) = h²(P(1,1), P(0,0))
Corollary 1: icost_{μ',δ}(AND) ≥ Ω(1 − 2δ^½)
Corollary 2: icost_{μ,δ}(Disj2) ≥ Ω(n · (1 − 2δ^½))
Proof of Lemma 2
"Rectangle" property of deterministic protocols: for any transcript a, the set of all (x,y) with P(x,y) = a is a "combinatorial rectangle" S × T, where S ⊆ X and T ⊆ Y.
"Rectangle" property of randomized protocols: for all x ∈ X, y ∈ Y, there exist functions px: {0,1}* → [0,1] and qy: {0,1}* → [0,1] such that for any possible transcript a,
Pr(P(x,y) = a) = px(a) · qy(a)
Therefore:
h²(P(0,1), P(1,0)) = 1 − Σa (Pr(P(0,1) = a) · Pr(P(1,0) = a))^½
  = 1 − Σa (p0(a) · q1(a) · p1(a) · q0(a))^½
  = h²(P(0,0), P(1,1))
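A quick numerical check of Lemma 2 (an illustration; for simplicity the transcript consists of one message from each player, a special case of the rectangle property): because Pr(P(x,y) = a) factors as px(a) · qy(a), exchanging which player holds which input bit leaves the Hellinger affinity unchanged.

```python
import math
import random

def hellinger2(p, q):
    """Squared Hellinger distance between two distributions given as lists."""
    return 1.0 - sum(math.sqrt(pa * qa) for pa, qa in zip(p, q))

def random_dist(rng, m):
    w = [rng.random() for _ in range(m)]
    s = sum(w)
    return [wi / s for wi in w]

# Product ("rectangle") structure: Alice's message distribution depends only
# on x, Bob's only on y; the transcript a = (a1, a2) has Pr = p_x(a1) * q_y(a2).
rng = random.Random(1)
p = {x: random_dist(rng, 4) for x in (0, 1)}   # p_x over Alice's messages
q = {y: random_dist(rng, 4) for y in (0, 1)}   # q_y over Bob's messages

def P(x, y):
    return [p[x][a1] * q[y][a2] for a1 in range(4) for a2 in range(4)]

# The "cut-and-paste" identity: h^2(P(0,1), P(1,0)) = h^2(P(0,0), P(1,1)).
print(hellinger2(P(0, 1), P(1, 0)), hellinger2(P(0, 0), P(1, 1)))
```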
Conclusions
• Studied limitations of computing on massive data sets:
  • Sampling computations
  • Data stream computations
  • Sketch computations
• Lower bound methodologies are based on:
  • Information theory
  • Statistical decision theory
  • Communication complexity
• The lower bound techniques:
  • Reveal novel aspects of the models
  • Present a "template" for obtaining specific lower bounds
Open Problems
• Sampling
  • Lower bounds for non-symmetric functions
  • Property testing lower bounds
• Communication complexity
  • Study the communication complexity of approximations
  • Tight lower bound for t-party set-disjointness
  • Under what circumstances are one-way and simultaneous communication equivalent?
Yao's Lemma [Yao 83]
• A convenient technique for proving randomized CC lower bounds
Definition (μ-distributional CC, D_{μ,δ}(f)): the complexity of the best deterministic protocol that computes f with error δ on inputs drawn according to μ
Yao's Lemma: Rδ(f) ≥ max_μ D_{μ,δ}(f)
Communication Complexity Lower Bounds via Information Theory
(with T.S. Jayram, R. Kumar, and D. Sivakumar; Complexity 2002)
• A novel information theory paradigm for proving CC lower bounds
• Applications
  • Characterization results (w.r.t. product distributions):
    • 1-way ≈ simultaneous
    • 2-party 1-way ≈ t-party 1-way
  • VC dimension characterization of t-party 1-way CC
  • Optimal lower bounds for simultaneous CC
    • t-party set-disjointness: Ω(n/t)
    • Generalized addressing function
Information Theory
[Diagram: a sender transmits a message m through a noisy channel; the receiver gets r.]
• M – distribution of transmitted messages
• R – distribution of received messages
• Goal of the receiver: reconstruct m from r
• δg – error probability of a reconstruction function g
Fano's Inequality: for every g, H2(δg) ≥ H(M | R)
MLE Principle: for a Boolean M, δMLE ≤ H(M | R)
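A tiny numerical illustration (mine) of both inequalities on a binary symmetric channel with a uniform input bit, where H(M | R) and the optimal decoding error have closed forms.

```python
import math

def h2(p):
    """Binary entropy (in bits)."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Binary symmetric channel: M is a uniform bit, R flips it with probability eps.
# Then H(M | R) = h2(eps), and the MLE decoder outputs R itself, erring w.p. eps.
eps = 0.11
cond_entropy = h2(eps)        # H(M | R)
mle_error = eps               # error of the maximum-likelihood reconstruction

print("Fano:", h2(mle_error) >= cond_entropy)        # H2(delta_g) >= H(M | R)
print("MLE principle:", mle_error <= cond_entropy)   # delta_MLE <= H(M | R)
```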
Information Theory View of Distributional CC
[Diagram: "God" transmits f(x,y) through the "noisy channel" formed by the CC protocol; Alice & Bob receive the transcript P(x,y).]
• x, y are distributed according to μ = (X,Y)
• "God" transmits f(x,y) to Alice & Bob
• Alice & Bob receive the transcript P(x,y)
• Fano's inequality: for any δ-error protocol P for f, H2(δ) ≥ H(f(X,Y) | P(X,Y))
Simultaneous CC vs. One-Way CC
Theorem: For every product distribution μ = X × Y and every Boolean f,
D_{μ,2H2(δ),sim}(f) ≤ D_{μ,δ,A→B}(f) + D_{μ,δ,B→A}(f)
Proof:
A(x) – message of A on x in a δ-error A → B protocol for f
B(y) – message of B on y in a δ-error B → A protocol for f
Construct a SIM protocol for f:
• A → Referee: A(x)
• B → Referee: B(y)
• Referee outputs MLE(f(X,Y) | A(x), B(y))
Simultaneous CC vs. One-Way CC
Proof (cont.):
By the MLE Principle, Prμ(MLE(f(X,Y) | A(X),B(Y)) ≠ f(X,Y)) ≤ H(f(X,Y) | A(X),B(Y))
By Fano, H(f(X,Y) | A(X),Y) ≤ H2(δ) and H(f(X,Y) | X,B(Y)) ≤ H2(δ)
Lemma: For independent X, Y,
H(f(X,Y) | A(X),B(Y)) ≤ H(f(X,Y) | A(X),Y) + H(f(X,Y) | X,B(Y))
Hence our protocol errs with probability at most 2H2(δ). □