
The Complexity of Massive Data Set Computations


Presentation Transcript


  1. The Complexity of Massive Data Set Computations
  Ziv Bar-Yossef, Computer Science Division, U.C. Berkeley
  Ph.D. Dissertation Talk, May 6, 2002

  2. What Are Massive Data Sets?
  Examples:
  • The Web
  • IP packets
  • Supermarket transactions
  • Telephone call graph
  • Astronomical observations
  Characterizing properties:
  • Huge collections of raw data
  • Data is generated and modified continuously
  • Distributed over many sites
  • Slow storage devices
  • Data is not organized / indexed

  3. Nontraditional Computational Challenges
  Traditionally: cope with the difficulty of the problem.
  Massive data sets: cope with the size of the data and the restricted access to it.
  Restricted access to the data:
  • Random access: expensive
  • "Streaming" access: more feasible
  • Some data may be unavailable
  • Fetching data is expensive
  Resource goals:
  • Sub-linear running time (ideally, independent of the data size)
  • Sub-linear space (ideally, logarithmic in the data size)

  4. Basic Framework
  Massive data set computations are typically:
  • Approximate
  • Randomized
  • Restricted in their access regime to the input data
  [Diagram: input data → access regime → algorithm ($$ = randomness) → approximate output]

  5. Prominent Computational Models for Massive Data Sets
  • Sampling computations: sub-linear running time & space; suitable for "insensitive" functions
  • Data stream computations: linear running time, sub-linear space; can compute sensitive functions
  • Sketch computations: suitable for distributed data

  6. Sampling Computations
  [Diagram: the sampling algorithm ($$ = randomness) queries the input x1,…,xn at random locations and outputs an approximation of f(x1,…,xn).]
  • Query the input at random locations
  • Can choose the query distribution and can query adaptively
  • Complexity measure: query complexity
  Applications:
  • Statistical parameter estimation
  • Computational and statistical learning [Valiant 84, Vapnik 98]
  • Property testing [RS96, GGR96]
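
As an illustration of this model (a minimal sketch, not from the talk): estimating the mean of a large array by querying it at uniformly random locations; the function and variable names below are made up for the example.

```python
import random

def sample_mean(data, num_queries, rng=random.Random(0)):
    """Sampling computation: approximate the mean of a massive array by
    querying it at uniformly random locations (query complexity = num_queries)."""
    picks = [data[rng.randrange(len(data))] for _ in range(num_queries)]
    return sum(picks) / num_queries

# Toy data set: n numbers in [0, 1]; the estimate uses k << n queries.
n = 1_000_000
data = [random.random() for _ in range(n)]
print("true mean       :", sum(data) / n)
print("estimate, k=1000:", sample_mean(data, 1000))
```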

  7. Data Stream Computations [HRR98, AMS96, FKSV99]
  [Diagram: the stream x1, x2, x3, …, xn is read once by the algorithm ($$ = randomness), which keeps a small memory and outputs an approximation of f(x1,…,xn).]
  • Input arrives in a one-way stream, in arbitrary order
  • Complexity measures: space and time per data item
  Applications:
  • Databases (frequency moments [AMS96])
  • Networking (Lp distance [AMS96, FKSV99, FS00, Indyk 00])
  • Web information retrieval (Web crawling, Google query logs [CCF02])
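
To make the model concrete, here is a hedged sketch (not from the talk) in the spirit of the [AMS96] second-frequency-moment estimator: one pass over the stream, a few counters, and a hash-based sign function standing in for the 4-wise independent signs the actual analysis requires.

```python
import hashlib, random
from collections import Counter

def sign(seed, item):
    """Pseudo-random +/-1 sign for an item (stand-in for a 4-wise independent family)."""
    digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
    return 1 if digest[0] & 1 else -1

def ams_f2_estimate(stream, num_copies=64):
    """One-pass estimate of F2 = sum_j f_j^2 in the spirit of [AMS96]:
    keep Z_r = sum_j f_j * s_r(j) for random signs s_r and average Z_r^2."""
    z = [0] * num_copies
    for item in stream:                     # single pass, sub-linear memory
        for r in range(num_copies):
            z[r] += sign(r, item)
    return sum(v * v for v in z) / num_copies

stream = [random.randrange(10) for _ in range(10_000)]
exact = sum(c * c for c in Counter(stream).values())
print("exact F2  :", exact)
print("estimated :", ams_f2_estimate(stream))
```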

  8. Sketch Computations [GM98, BCFM98, FKSV99]
  [Diagram: the data (x11,…,x1k), (x21,…,x2k), …, (xt1,…,xtk) resides at t sites; each site sends a compressed "sketch" of its share, and the sketch algorithm ($$ = randomness) outputs an approximation of f(x11,…,xtk).]
  • The algorithm computes from "sketches" of the data sent from the sites
  • Complexity measure: sketch lengths
  Applications:
  • Web information retrieval (identifying document similarities [BCFM98])
  • Networking (Lp distance [FKSV99])
  • Lossy compression, approximate nearest neighbor
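
A hedged illustration of sketch computations (mine, not from the talk), loosely following the min-wise hashing idea behind the document-similarity application [BCFM98]: each site compresses its set into a short sketch, and similarity is estimated from the sketches alone.

```python
import hashlib

def minhash_sketch(items, num_hashes=128):
    """Constant-size sketch of a set: for each of num_hashes hash functions,
    keep the minimum hash value over the set (min-wise hashing)."""
    return [min(int.from_bytes(hashlib.sha256(f"{i}:{x}".encode()).digest()[:8], "big")
                for x in items)
            for i in range(num_hashes)]

def estimated_jaccard(sketch_a, sketch_b):
    """The fraction of agreeing coordinates estimates |A intersect B| / |A union B|."""
    return sum(a == b for a, b in zip(sketch_a, sketch_b)) / len(sketch_a)

doc_a = set("the quick brown fox jumps over the lazy dog".split())
doc_b = set("the quick brown fox leaps over a lazy dog".split())
print(estimated_jaccard(minhash_sketch(doc_a), minhash_sketch(doc_b)))
```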

  9. Main Objective
  Explore the limitations of the above computational models:
  • Develop general lower bound techniques
  • Obtain lower bounds for specific functions

  10. Thesis Blueprint
  [Diagram: statistical decision theory underlies sampling lower bounds for general functions [BKS01, B02]; information theory and communication complexity underlie general CC lower bounds [BJKS02b] and one-way and simultaneous CC lower bounds [BJKS02a]; these transfer to data stream computations by a reduction from one-way CC, and to sketch computations by a reduction from simultaneous CC.]

  11. Sampling Lower Bounds (with R. Kumar and D. Sivakumar; STOC 2001 and manuscript, 2002)
  • Combinatorial lower bound [BKS01]
    • bounds the expected query complexity of every function
    • tends to be weak
    • based on a generalization of Boolean block sensitivity [Nisan 89]
  • Statistical lower bounds
    • bound the query complexity of symmetric functions
    • via Hellinger distance: worst-case query complexity [BKS01]
    • via KL distance: expected query complexity [B02]
    • tend to be tight
    • work by a reduction from statistical hypothesis testing
  • Information theory lower bound [B02]
    • bounds the worst-case query complexity of symmetric functions
    • has better dependence on the domain size

  12. Main Idea
  [Diagram: the approximation sets of ε-disjoint inputs x, y, w are pairwise disjoint.]
  (ε,δ)-approximation: for every input x, with probability ≥ 1 − δ the output lies in the approximation set of x.
  Main observation: since, for all x, the output lands in the approximation set of x w.p. ≥ 1 − δ, if x, y are ε-disjoint then T(x), T(y) must be "far" from each other.

  13. Main Result
  Theorem: For any symmetric f, any ε-disjoint inputs x, y, and any algorithm that (ε,δ)-approximates f:
  • Worst-case # of queries = Ω((1/h²(Ux,Uy)) · log(1/δ))
  • Expected # of queries = Ω((1/KL(Ux,Uy)) · log(1/δ))
  where:
  • Ux – the uniform query distribution induced by x (pick i u.a.r., output xi)
  • Hellinger: h²(Ux,Uy) = 1 − Σa (Ux(a) · Uy(a))^½
  • KL: KL(Ux,Uy) = Σa Ux(a) · log(Ux(a) / Uy(a))

  14. Example: Mean
  Theorem (originally [CEG95]): Approximating the mean of n numbers in [0,1] to within ±ε additive error requires Ω((1/ε²) · log(1/δ)) queries.
  [Diagram: x has a (½ + ε)-fraction of 1's and a (½ − ε)-fraction of 0's; y has a (½ − ε)-fraction of 1's and a (½ + ε)-fraction of 0's.]
  h²(Ux,Uy) = KL(Ux,Uy) = O(ε²)
  Other applications: selection functions, frequency moments, extractors and dispersers.
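
A small numeric check of the theorem on this example (my own sketch, not from the talk): compute h² and KL between the induced query distributions of the two inputs and the query bounds they imply.

```python
import math

def hellinger2(p, q):
    """Squared Hellinger distance: h^2(P,Q) = 1 - sum_a sqrt(P(a) * Q(a))."""
    return 1.0 - sum(math.sqrt(p[a] * q[a]) for a in p)

def kl(p, q):
    """KL divergence: KL(P,Q) = sum_a P(a) * log(P(a) / Q(a))."""
    return sum(p[a] * math.log(p[a] / q[a]) for a in p if p[a] > 0)

eps, delta = 0.01, 0.05
# Induced query distributions of the two eps-disjoint inputs from the slide:
# x has a (1/2 + eps)-fraction of 1's, y has a (1/2 - eps)-fraction of 1's.
Ux = {1: 0.5 + eps, 0: 0.5 - eps}
Uy = {1: 0.5 - eps, 0: 0.5 + eps}

print("h^2(Ux,Uy) =", hellinger2(Ux, Uy))   # ~ 2 * eps^2
print("KL(Ux,Uy)  =", kl(Ux, Uy))           # ~ 8 * eps^2
print("worst-case query bound ~", math.log(1 / delta) / hellinger2(Ux, Uy))
print("expected   query bound ~", math.log(1 / delta) / kl(Ux, Uy))
```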

  15. Proof Outline
  • For symmetric functions, WLOG all queries are uniform, without replacement
  • If the # of queries is ≤ n^½, we can further assume the queries are uniform with replacement
  • For any ε-disjoint inputs x, y: an (ε,δ)-approximation of f with k queries yields a hypothesis test of Ux against Uy with error δ and k samples
  • Hypothesis testing lower bounds:
    • via Hellinger distance (worst-case)
    • via KL distance (expected) (cf. [Siegmund 85])

  16. Statistical Hypothesis Testing
  [Diagram: a black box containing either P or Q emits k i.i.d. samples to the hypothesis test.]
  • The black box contains either P or Q
  • The test has to decide: "P" or "Q"
  • Allowed error probability: δ
  • Goal: minimize k
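
A hedged sketch of such a test (not from the talk): the maximum-likelihood (log-likelihood-ratio) rule for two known distributions, together with a small experiment measuring its empirical error for a fixed number of samples k.

```python
import math, random

def likelihood_ratio_test(samples, P, Q):
    """Decide "P" or "Q" from i.i.d. samples by comparing log-likelihoods
    (the maximum-likelihood test for two simple hypotheses)."""
    llr = sum(math.log(P[a] / Q[a]) for a in samples)
    return "P" if llr >= 0 else "Q"

P = {1: 0.55, 0: 0.45}       # the two candidate distributions in the black box
Q = {1: 0.45, 0: 0.55}
k, trials, errors = 300, 2000, 0
for _ in range(trials):
    truth = random.choice(["P", "Q"])
    dist = P if truth == "P" else Q
    samples = [1 if random.random() < dist[1] else 0 for _ in range(k)]
    errors += likelihood_ratio_test(samples, P, Q) != truth
print("empirical error with k =", k, ":", errors / trials)
```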

  17. Sampling Algorithm ⇒ Hypothesis Test
  x, y: ε-disjoint inputs
  [Diagram: the black box contains either Ux or Uy and feeds k i.i.d. samples to the sampling algorithm in place of its queries.]
  Output "Ux" if the algorithm's output lies in the approximation set of x, and "Uy" otherwise.

  18. Lower Bound via Hellinger Distance
  Given: a hypothesis test for Ux against Uy with error δ and k samples.
  Lemma (cf. [Le Cam, Yang 90]): 1. … 2. …
  Corollary: k = Ω((1/h²(Ux,Uy)) · log(1/δ))

  19. Communication Complexity [Yao 79]
  f: X × Y → Z
  [Diagram: Alice holds x ∈ X, Bob holds y ∈ Y; both have private randomness ($$); they exchange messages until one of them outputs f(x,y).]
  Rδ(f) = randomized CC of f with error δ
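
A toy instance of this model (a sketch under my own assumptions, not from the talk): a public-coin randomized protocol for Equality in which Alice sends only a short random fingerprint of her input, so the communication is constant while the error probability is exponentially small in the fingerprint length.

```python
import hashlib, random

def fingerprint(salt, s, bits=32):
    """Short random fingerprint of a string under a shared random salt."""
    digest = hashlib.sha256(f"{salt}:{s}".encode()).digest()
    return int.from_bytes(digest, "big") % (2 ** bits)

def equality_protocol(x, y):
    """Randomized protocol for f(x,y) = [x == y]: Alice sends fingerprint(x);
    Bob compares it with fingerprint(y). One-sided error ~ 2^-bits."""
    salt = random.getrandbits(64)              # the protocol's randomness ($$)
    message_from_alice = fingerprint(salt, x)  # the only communication
    return message_from_alice == fingerprint(salt, y)

print(equality_protocol("massive data", "massive data"))   # True
print(equality_protocol("massive data", "massive datum"))  # False (w.h.p.)
```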

  20. Multi-Party Communication
  f: X1 × … × Xt → Z
  [Diagram: t players P1, P2, …, Pt hold x1, x2, …, xt respectively and communicate to compute f(x1,…,xt).]

  21. Example: Set-Disjointness
  t-party set-disjointness: Pi gets Si ⊆ [n]; Disjt(S1,…,St) indicates whether the sets share a common element.
  Theorem [KS87, R90]: Rδ(Disj2) = Ω(n)
  Theorem [AMS96]: Rδ(Disjt) = Ω(n/t^4)
  Best upper bound: Rδ(Disjt) = O(n/t)

  22. Restricted Communication Models
  One-way communication [PS84, Ablayev 93, KNR95]
  [Diagram: P1 → P2 → … → Pt; Pt outputs f(x1,…,xt).]
  • Reduces to data stream computations
  Simultaneous communication [Yao 79]
  [Diagram: P1, P2, …, Pt each send a single message to a referee, who outputs f(x1,…,xt).]
  • Reduces to sketch computations

  23. Example: Disjointness → Frequency Moments
  Input stream: a1,…,am ∈ [n]; for j ∈ [n], fj = # of occurrences of j in a1,…,am.
  k-th frequency moment: Fk(a1,…,am) = Σ_{j ∈ [n]} (fj)^k
  Theorem [AMS96]: …
  Corollary: DS(Fk) = n^Ω(1) for k > 5
  Best upper bounds: DS(Fk) = n^O(1) for k > 2; DS(Fk) = O(log n) for k = 0, 1, 2
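
For reference, the definition of Fk as code (an illustrative sketch, not from the talk); the exact one-pass computation below keeps a counter per distinct item, i.e. Θ(n) space, which is what the lower bounds discussed here show cannot be substantially improved for k > 2.

```python
from collections import Counter

def frequency_moment(stream, k):
    """Exact k-th frequency moment F_k = sum over j of (f_j)^k, computed in
    one pass but with a counter per distinct item (Theta(n) space)."""
    freq = Counter(stream)
    return sum(f ** k for f in freq.values())

stream = [1, 2, 1, 3, 1, 2]
print(frequency_moment(stream, 0))  # F0 = 3  (number of distinct items)
print(frequency_moment(stream, 1))  # F1 = 6  (length of the stream)
print(frequency_moment(stream, 2))  # F2 = 3**2 + 2**2 + 1**2 = 14
```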

  24. Information Statistics Approach to Communication Complexity (with T.S. Jayram, R. Kumar, and D. Sivakumar; manuscript, 2002)
  A novel lower bound technique for randomized CC, based on statistics and information theory.
  Applications:
  • General CC lower bounds
    • t-party set-disjointness: Ω(n/t²) (improving on [AMS96])
    • Lp (solving an open problem of [Saks-Sun 02])
    • Inner product
  • One-way CC lower bounds
    • t-party set-disjointness: Ω(n/t^(1+ε)) for any ε > 0
  • Space lower bounds in the data stream model
    • frequency moments: n^Ω(1) for k > 2 (proving a conjecture of [AMS96])
    • Lp distance

  25. Statistical View of Communication Complexity
  • P – a δ-error randomized protocol for f: X × Y → Z
  • P(x,y) – the distribution over transcripts of P on input (x,y)
  Lemma: For any two input pairs (x,y), (x',y') with f(x,y) ≠ f(x',y'), the variation distance satisfies V(P(x,y), P(x',y')) ≥ 1 − 2δ.
  Proof: by reduction from hypothesis testing.
  Corollary: h²(P(x,y), P(x',y')) ≥ 1 − 2√δ

  26. Information Cost [Ablayev 93, Chakrabarti et al. 01, Saks-Sun 02]
  For a protocol P that computes f, how much information does P(x,y) have to reveal about (x,y)?
  μ = (X,Y) – a distribution over the inputs of f
  Definition (μ-information cost):
  icost_μ(P) = I(X,Y ; P(X,Y))
  icost_{μ,δ}(f) = min_P { icost_μ(P) }
  Since I(X,Y ; P(X,Y)) ≤ H(P(X,Y)) ≤ |P(X,Y)|, an information cost lower bound implies a CC lower bound.
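
A toy computation of information cost (my own sketch, not from the talk): for a simple deterministic protocol for AND under the uniform input distribution μ, compute icost_μ(P) = I(X,Y ; P(X,Y)) directly from the joint distribution of inputs and transcripts.

```python
import math
from collections import defaultdict

def transcript(x, y):
    """Toy deterministic protocol for AND(x, y): Alice announces x; Bob
    announces y only when x = 1."""
    return f"A:{x}" if x == 0 else f"A:{x} B:{y}"

def mutual_information(joint):
    """I(U; V) in bits, from a joint distribution {(u, v): probability}."""
    pu, pv = defaultdict(float), defaultdict(float)
    for (u, v), p in joint.items():
        pu[u] += p
        pv[v] += p
    return sum(p * math.log2(p / (pu[u] * pv[v]))
               for (u, v), p in joint.items() if p > 0)

# mu = uniform distribution over the four input pairs (x, y).
joint = {((x, y), transcript(x, y)): 0.25 for x in (0, 1) for y in (0, 1)}
print("icost_mu(P) = I(X,Y ; P(X,Y)) =", mutual_information(joint), "bits")  # 1.5
```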

  27. Direct Sum for Information Cost
  Decomposable functions: f(x,y) = g(h(x1,y1),…,h(xn,yn)), where h: Xi × Yi → {0,1} and g: {0,1}^n → {0,1}.
  Example: set-disjointness, Disj2(x,y) = (x1 ∧ y1) ∨ … ∨ (xn ∧ yn)
  Theorem (direct sum): For appropriately chosen μ, μ': icost_{μ,δ}(f) ≥ n · icost_{μ',δ}(h)
  Thus a lower bound on icost(h) yields a lower bound on icost(f).

  28. Information Cost of Single-Bit Functions
  In Disj2, μ' = ½ μ'1 + ½ μ'2, where μ'1 = ½(1,0) + ½(0,0) and μ'2 = ½(0,1) + ½(0,0).
  Lemma 1: For any protocol P for AND, icost_{μ'}(P) = Ω(h²(P(0,1), P(1,0)))
  Lemma 2: h²(P(0,1), P(1,0)) = h²(P(1,1), P(0,0))
  Corollary 1: icost_{μ',δ}(AND) = Ω(1 − 2√δ)
  Corollary 2: icost_{μ,δ}(Disj2) = Ω(n · (1 − 2√δ))

  29. Proof of Lemma 2
  "Rectangle" property of deterministic protocols: for any transcript a, the set of all (x,y) with P(x,y) = a is a "combinatorial rectangle" S × T, where S ⊆ X and T ⊆ Y.
  "Rectangle" property of randomized protocols: for all x ∈ X, y ∈ Y there exist functions p_x: {0,1}* → [0,1] and q_y: {0,1}* → [0,1] such that for any possible transcript a, Pr(P(x,y) = a) = p_x(a) · q_y(a).
  Hence:
  h²(P(0,1), P(1,0)) = 1 − Σa (Pr(P(0,1) = a) · Pr(P(1,0) = a))^½
  = 1 − Σa (p_0(a) · q_1(a) · p_1(a) · q_0(a))^½
  = h²(P(0,0), P(1,1))

  30. Conclusions
  • Studied the limitations of computing on massive data sets:
    • sampling computations
    • data stream computations
    • sketch computations
  • Lower bound methodologies are based on:
    • information theory
    • statistical decision theory
    • communication complexity
  • The lower bound techniques:
    • reveal novel aspects of the models
    • present a "template" for obtaining specific lower bounds

  31. Open Problems
  • Sampling
    • Lower bounds for non-symmetric functions
    • Property testing lower bounds
  • Communication complexity
    • Study the communication complexity of approximations
    • Tight lower bound for t-party set-disjointness
    • Under what circumstances are one-way and simultaneous communication equivalent?

  32. Thank You!

  33. Yao's Lemma [Yao 83]
  • A convenient technique for proving randomized CC lower bounds
  Definition (μ-distributional CC, D_{μ,δ}(f)): the complexity of the best deterministic protocol that computes f with error δ on inputs drawn according to μ.
  Yao's Lemma: Rδ(f) ≥ max_μ D_{μ,δ}(f)

  34. Communication Complexity Lower Bounds via Information Theory (with T.S. Jayram, R. Kumar, and D. Sivakumar; Complexity 2002)
  • A novel information theory paradigm for proving CC lower bounds
  Applications:
  • Characterization results (w.r.t. product distributions):
    • 1-way vs. simultaneous
    • 2-party 1-way vs. t-party 1-way
    • VC dimension characterization of t-party 1-way CC
  • Optimal lower bounds for simultaneous CC:
    • t-party set-disjointness: Ω(n/t)
    • Generalized addressing function

  35. Information Theory
  [Diagram: a sender transmits a message m ∼ M over a noisy channel; the receiver obtains r ∼ R.]
  • M – distribution of transmitted messages
  • R – distribution of received messages
  • Goal of the receiver: reconstruct m from r
  • δ_g – error probability of a reconstruction function g
  For a Boolean M:
  • Fano's Inequality: for all g, H2(δ_g) ≥ H(M | R)
  • MLE Principle: δ_MLE ≤ H(M | R)
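
A small numeric check of these two facts (a sketch under my own assumptions, not from the talk): a uniform Boolean message sent over a binary symmetric channel with flip probability p, for which H(M | R) = H2(p) and the MLE rule "output the received bit" errs with probability exactly p.

```python
import math

def h2(p):
    """Binary entropy H2(p) in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

p = 0.1                      # channel flip probability
H_M_given_R = h2(p)          # for a uniform Boolean M, H(M | R) = H2(p)
delta_mle = p                # the MLE rule errs exactly when the bit was flipped

print("H(M | R) =", H_M_given_R)
print("Fano : H2(delta_g) >= H(M | R):", h2(delta_mle), ">=", H_M_given_R)
print("MLE  : delta_MLE   <= H(M | R):", delta_mle, "<=", H_M_given_R)
```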

  36. Information Theory View of Distributional CC
  [Diagram: "God" transmits f(x,y) through the CC protocol, viewed as a noisy channel; Alice & Bob receive the transcript P(x,y).]
  • x, y are distributed according to μ = (X,Y)
  • "God" transmits f(x,y) to Alice & Bob
  • Alice & Bob receive the transcript P(x,y)
  • Fano's inequality: for any δ-error protocol P for f, H2(δ) ≥ H(f(X,Y) | P(X,Y))

  37. Simultaneous CC vs. One-Way CC
  Theorem: For every product distribution μ = X × Y and every Boolean f,
  D_{μ,2H2(δ)}^{sim}(f) ≤ D_{μ,δ}^{A→B}(f) + D_{μ,δ}^{B→A}(f)
  Proof:
  A(x) – message of A on x in a δ-error A → B protocol for f
  B(y) – message of B on y in a δ-error B → A protocol for f
  Construct a SIM protocol for f:
  A → Referee: A(x)
  B → Referee: B(y)
  The referee outputs MLE(f(X,Y) | A(x), B(y)).

  38. Simultaneous CC vs. One-Way CC
  Proof (cont.):
  By the MLE principle, Pr_μ(MLE(f(X,Y) | A(X),B(Y)) ≠ f(X,Y)) ≤ H(f(X,Y) | A(X),B(Y)).
  By Fano, H(f(X,Y) | A(X),Y) ≤ H2(δ) and H(f(X,Y) | X,B(Y)) ≤ H2(δ).
  Lemma: For independent X, Y, H(f(X,Y) | A(X),B(Y)) ≤ H(f(X,Y) | A(X),Y) + H(f(X,Y) | X,B(Y)).
  ⇒ Our protocol errs with probability at most 2H2(δ). □
