Tight Bounds for Distributed Functional Monitoring

David Woodruff, IBM Almaden
Qin Zhang, Aarhus University, MADALGO

Based on a paper in STOC, 2012
k-party Number-In-Hand Model
• Player-to-player communication
• Protocol transcript always determines who speaks next
[Figure: players P1, …, Pk, each holding an input xi]
Goals:
- compute a function f(x1, …, xk)
- minimize communication complexity
k-party Number-In-Hand Model
[Figure: coordinator C connected to players P1, …, Pk with inputs x1, …, xk]
• Convenient to introduce a "coordinator" C
• All communication goes through the coordinator
• Communication is only affected by a factor of 2 (a toy illustration follows below)
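Why only a factor of 2: any point-to-point message can be relayed through C. A minimal sketch of that simulation (my illustration with hypothetical names, not anything from the paper):

```python
# Simulating player-to-player messages via the coordinator C:
# P_src -> P_dst becomes P_src -> C followed by C -> P_dst,
# so every original message costs at most two messages.

class Coordinator:
    def __init__(self):
        self.cost = 0            # messages sent in the simulated protocol

    def relay(self, src, dst, msg):
        self.cost += 2           # one message in, one message out
        return msg               # delivered unchanged to P_dst

C = Coordinator()
received = C.relay(src=1, dst=3, msg="local count = 17")
print(received, C.cost)          # "local count = 17" 2
```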
Model Motivation
• Data distributed and stored in the cloud
  • Impractical to put data on a single device
• Sensor networks
  • Communication is power-intensive
• Network routers
  • Bandwidth limitations
• Distributed functional monitoring
Authors: Can, Cormode, Huang, Muthukrishnan, Patt-Shamir, Shafrir, Tirthapura, Wang, Yi, Zhao, …
k-Party Number-In-Hand Model
[Figure: coordinator C and players P1, …, Pk with inputs x1, …, xk]
Which functions do we care about?
- ∀i, xi ∈ {0, 1, …, n}^n
- x = x1 + x2 + … + xk
- f(x) = |x|_p = (Σ_j x_j^p)^{1/p}
- |x|_0 is the number of non-zero coordinates
- Talk will focus on |x|_0 and |x|_2 (both computed in the sketch below)
For distributed databases:
- |x|_0 is the number of distinct elements
- |x|_2^2 is known as the self-join size
- |x|_2 is useful for regression and low-rank approximation
Important for applications that the xi are non-negative
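As a concrete baseline, here is a minimal sketch (my illustration, not a protocol from the paper) of the naive exact solution: each player ships its whole vector to the coordinator, costing k·n words, and C evaluates f on the sum:

```python
import numpy as np

def naive_exact(inputs):
    """Coordinator-side computation after receiving all k input vectors."""
    x = np.sum(inputs, axis=0)                  # x = x1 + x2 + ... + xk
    l0 = np.count_nonzero(x)                    # |x|_0: non-zero coordinates
    l2 = np.sqrt(np.sum(x.astype(float) ** 2))  # |x|_2: Euclidean norm
    return l0, l2

k, n = 3, 8
rng = np.random.default_rng(0)
inputs = [rng.integers(0, 2, size=n) for _ in range(k)]  # xi in {0,1}^n
print(naive_exact(inputs))
```

The next slide shows that the exactness requirement itself, not just this naive protocol, forces Ω(n) communication.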
Randomized Communication Complexity
• What is the randomized communication cost of f?
  • i.e., the minimal cost of a protocol which, for every set of inputs, fails to compute f with probability < 1/3
• Ω(n) cost for |x|_0 and |x|_2
  • Reduction from 2-Player Set-Disjointness (DISJ), sketched in code below:
    • Alice has a set S ⊆ [n]
    • Bob has a set T ⊆ [n]
    • Either |S ∩ T| = 0 or |S ∩ T| = 1
    • |S ∩ T| = 1 → DISJ(S,T) = 1; |S ∩ T| = 0 → DISJ(S,T) = 0
  • [KS, R]: Ω(n) communication
• Prohibitive
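A sketch of how exact |x|_0 decides DISJ (the inclusion–exclusion step is standard; the code is my illustration):

```python
def l0_of_sum(S, T, n):
    """|x|_0 for x = (char. vector of S) + (char. vector of T)."""
    x = [0] * n
    for j in S:
        x[j] += 1
    for j in T:
        x[j] += 1
    return sum(1 for v in x if v != 0)

S, T, n = {0, 2, 5}, {1, 5, 7}, 8
# Inclusion-exclusion: |S| + |T| - |x|_0 = |S ∩ T|, which is DISJ(S, T)
# under the promise that the intersection has size 0 or 1.
print(len(S) + len(T) - l0_of_sum(S, T, n))   # 1, so the sets intersect
```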
Approximate Answers
• Compute a relation with probability > 2/3:
  f(x) ∈ (1 ± ε) |x|_0, or
  f(x) ∈ (1 ± ε) |x|_2
• What is the randomized communication cost as a function of k, ε, and n?
  • Will ignore log(nk/ε) factors
• Understanding the dependence on ε is critical, e.g., ε < .01
Previous Results
• |x|_0: Ω(k + ε^{-2}) lower bound and O(k·ε^{-2}) upper bound
• |x|_2: Ω(k + ε^{-2}) lower bound and O(k·ε^{-2}) upper bound
Our Results
• |x|_0: lower bound improved from Ω(k + ε^{-2}) to Ω(k·ε^{-2}), matching the O(k·ε^{-2}) upper bound
• |x|_2: lower bound improved from Ω(k + ε^{-2}) to Ω(k·ε^{-2}), matching the O(k·ε^{-2}) upper bound
• First lower bounds to depend on the product of k and ε^{-2}
• Implications for data streams:
  • First tight space lower bound for estimating the number of distinct elements, without using the Gap-Hamming Problem
  • Improves the lower bound for estimating |x|_p, p > 2
Previous Lower Bounds
• Lower bounds for |x|_0 and |x|_2:
  • [CMY] Ω(k)
  • [ABC] Ω(ε^{-2})
    • Reduction from Gap-Orthogonality (GAP-ORT)
    • P1, P2 have u, v ∈ {0,1}^{1/ε^2}, respectively
    • Decide: |Δ(u, v) − 1/(2ε^2)| < 1/ε or |Δ(u, v) − 1/(2ε^2)| > 2/ε, where Δ is the Hamming distance
    • [CR, S]: Ω(ε^{-2}) communication
Talk Outline
• Lower Bounds
  • |x|_0
  • |x|_2
Lower Bound for |x|_0
• Improve bound to the optimal Ω(k·ε^{-2})
• Study a simpler problem: k-GAP-THRESH (simulated in the sketch below)
  • Each player Pi holds a bit Zi
  • The Zi are i.i.d. Bernoulli(β)
  • Decide if Σ_{i=1}^k Zi > βk + (βk)^{1/2} or Σ_{i=1}^k Zi < βk − (βk)^{1/2}
    • Otherwise, don't care
• Rectangle property: for any correct protocol with transcript τ, Z1, Z2, …, Zk are independent conditioned on τ
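A small simulation of a k-GAP-THRESH instance (my sketch; the parameter values are arbitrary):

```python
import math
import random

def k_gap_thresh_instance(k, beta, seed=1):
    """Sample i.i.d. Bernoulli(beta) bits and classify the sum."""
    rng = random.Random(seed)
    Z = [1 if rng.random() < beta else 0 for _ in range(k)]
    s, gap = sum(Z), math.sqrt(beta * k)
    if s > beta * k + gap:
        return Z, "above threshold"
    if s < beta * k - gap:
        return Z, "below threshold"
    return Z, "don't care"        # the middle region carries no requirement

Z, verdict = k_gap_thresh_instance(k=1000, beta=0.1)
print(sum(Z), verdict)            # the sum concentrates near beta*k = 100
```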
Rectangle Property of Communication
• Let r be the randomness of C, P1, …, Pk
• For any fixed r, the set S of inputs giving rise to a transcript τ is a combinatorial rectangle: S = S1 × S2 × … × Sk
• If the input distribution is a product distribution, then conditioned on τ and r, the inputs are independent
• Since this holds for every r, the inputs are independent conditioned on τ
k-GAP-THRESH
[Figure: coordinator C and players P1, …, Pk holding bits Z1, …, Zk]
• The Zi are i.i.d. Bernoulli(β)
• Coordinator wants to decide if:
  Σ_{i=1}^k Zi > βk + (βk)^{1/2} or Σ_{i=1}^k Zi < βk − (βk)^{1/2}
• By the independence of the Zi | τ, this is equivalent to C having "noisy" independent copies of the Zi
A Key Lemma
• Lemma: For any protocol Π which succeeds with probability > .99, the transcript τ is such that with probability > 1/2, for at least k/2 different i, H(Zi | τ) < H(.01β)
• Proof: Suppose τ does not satisfy this
  • With large probability, βk − O(βk)^{1/2} < E[Σ_{i=1}^k Zi | τ] < βk + O(βk)^{1/2}
  • Since the Zi are independent given τ, Σ_{i=1}^k Zi | τ is a sum of independent Bernoullis
  • Since most H(Zi | τ) are large, by anti-concentration (variance step worked below) both events occur with constant probability:
    Σ_{i=1}^k Zi | τ > βk + (βk)^{1/2} and Σ_{i=1}^k Zi | τ < βk − (βk)^{1/2}
  • So Π can't succeed with large probability
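The anti-concentration step relies on the conditioned sum having large variance. A worked version of that calculation, with illustrative constants (my reconstruction, not the paper's exact derivation):

```latex
% If H(Z_i \mid \tau) \ge H(0.01\beta) for at least k/2 indices i, then each
% such p_i := \Pr[Z_i = 1 \mid \tau] lies in [0.01\beta,\, 1 - 0.01\beta], so
\operatorname{Var}\!\left[\sum_{i=1}^{k} Z_i \,\middle|\, \tau\right]
  = \sum_{i=1}^{k} p_i (1 - p_i)
  \ge \frac{k}{2}\cdot 0.01\beta\,(1 - 0.01\beta)
  = \Omega(\beta k).
% A sum of independent Bernoullis with this variance deviates by
% \Omega\big((\beta k)^{1/2}\big) above and below its conditional mean with
% constant probability, putting mass on both sides of the threshold gap.
```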
Composition Idea
[Figure: coordinator C runs a 2-party DISJ instance with each player Pi, producing the bit Zi]
• The input to Pi in k-GAP-THRESH, denoted Zi, is the output of a 2-party Disjointness (DISJ) instance between C and Pi (construction sketched below)
  - Let S be a random set of size 1/(4ε^2) from {1, 2, …, 1/ε^2}
  - For each i, if Zi = 1, then choose Ti of size 1/(4ε^2) so that DISJ(S, Ti) = 1; else choose Ti so that DISJ(S, Ti) = 0
  - The distributional complexity of solving DISJ with probability 1 − β/100, when DISJ(S, T) = 1 with probability β, is Ω(1/ε^2) [R]
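A sketch of the composed input distribution (my illustration; note that β = Θ(1/(ε²k)) requires k ≥ 1/ε², so the demo parameters are chosen accordingly):

```python
import random

def compose(k, eps, beta, seed=2):
    """Sample S and sets T_1..T_k so that DISJ(S, T_i) realizes Z_i."""
    rng = random.Random(seed)
    m = int(1 / eps**2)                     # universe {0, ..., m-1}
    S = set(rng.sample(range(m), m // 4))   # coordinator's set, size 1/(4 eps^2)
    outside = sorted(set(range(m)) - S)
    Ts, Zs = [], []
    for _ in range(k):
        Zi = 1 if rng.random() < beta else 0
        if Zi:                              # |S ∩ T_i| = 1
            Ti = {rng.choice(sorted(S))} | set(rng.sample(outside, m // 4 - 1))
        else:                               # |S ∩ T_i| = 0
            Ti = set(rng.sample(outside, m // 4))
        Ts.append(Ti)
        Zs.append(Zi)
    return S, Ts, Zs

S, Ts, Zs = compose(k=1000, eps=0.25, beta=1 / (0.25**2 * 1000))
print(sum(Zs), all(len(S & T) == Z for T, Z in zip(Ts, Zs)))  # ~1/eps^2, True
```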
Putting it All Together
• Key Lemma → for most i, H(Zi | τ) < H(.01β)
• Since H(Zi) = H(β) for all i, for most i the protocol Π solves DISJ(S, Ti) with probability ≥ 1 − β/100
• For most i, the communication between C and Pi is Ω(ε^{-2})
  • Otherwise, C could simulate the other players without any communication and contradict the lower bound for DISJ(S, Ti)
• Total communication is Ω(k·ε^{-2})
• Can show a reduction to estimating |x|_0
Reduction to |x|_0
• Think of C as a player
• C's input vector xC is the characteristic vector of the set [1/ε^2] \ S
• Pi's input vector xi is the characteristic vector of the set Ti
• When |Ti ∩ S| = 1, the support of x = xC + Σ_i xi usually increases by 1
• Choose β = Θ(1/(ε^2 k)) so that Σ_{i=1}^k Zi = βk ± (βk)^{1/2} = 1/ε^2 ± 1/ε (checked below)
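A quick check of the parameter choice (arithmetic only; the Θ constants are suppressed):

```latex
\beta = \frac{1}{\varepsilon^2 k}
  \;\Longrightarrow\;
\beta k = \frac{1}{\varepsilon^2},
\qquad
(\beta k)^{1/2} = \frac{1}{\varepsilon},
% so the two cases of k-GAP-THRESH shift the support of x by
% 1/\varepsilon^2 + 1/\varepsilon versus 1/\varepsilon^2 - 1/\varepsilon,
% an additive \Theta(1/\varepsilon) = \Theta(\varepsilon)\cdot|x|_0 difference
% that a (1 \pm \varepsilon)-approximation to |x|_0 (with suitably chosen
% constants) must detect.
```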
Talk Outline
• Lower Bounds
  • |x|_0
  • |x|_2
Lower Bound for Euclidean Norm
• Improve the Ω(k + ε^{-2}) bound to the optimal Ω(k·ε^{-2})
• Use Gap-Orthogonality (GAP-ORT(X, Y)):
  • Alice, Bob have X, Y ∈ {0,1}^{1/ε^2}
  • Decide: |Δ(X, Y) − 1/(2ε^2)| < 1/ε or |Δ(X, Y) − 1/(2ε^2)| > 2/ε
• Consider the uniform distribution on X, Y
• [KLLRX, CKW] For any protocol Π that solves GAP-ORT with constant probability, I(X, Y; Π) = H(X, Y) − H(X, Y | Π) = Ω(1/ε^2)
Information Implications
• By the chain rule, I(X, Y; Π) = Σ_{i=1}^{1/ε^2} I(Xi, Yi; Π | X_{<i}, Y_{<i}) = Ω(ε^{-2})
• For most i, I(Xi, Yi; Π | X_{<i}, Y_{<i}) = Ω(1)
XOR DISJ
• We compose GAP-ORT with a variant of k-Party DISJ (instance construction sketched below)
• Choose a random j ∈ [n] and a random S ∈ {00, 10, 01, 11}:
  • S = 00: j doesn't occur in any Ti
  • S = 10: j occurs only in T1, …, Tk/2
  • S = 01: j occurs only in Tk/2+1, …, Tk
  • S = 11: j occurs in T1, …, Tk
• Every j' ≠ j occurs in at most one set Ti
• Output equals 1 if S ∈ {10, 01}, otherwise the output is 0
• I(Π; T1, …, Tk | j, S, D) = Ω(k) for any Π for which I(Π; S) = Ω(1)
[Figure: players P1, …, Pk/2 and Pk/2+1, …, Pk holding sets T1, …, Tk ⊆ [n]]
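A sketch of the XOR DISJ input distribution (my illustration; the auxiliary variable D from the slide is not modeled):

```python
import random

def xor_disj_instance(k, n, seed=3):
    """Sample (j, S, T_1..T_k) and the XOR DISJ output."""
    rng = random.Random(seed)
    j = rng.randrange(n)
    S = rng.choice(["00", "10", "01", "11"])
    Ts = [set() for _ in range(k)]
    if S[0] == "1":                          # j occurs in the first half
        for i in range(k // 2):
            Ts[i].add(j)
    if S[1] == "1":                          # j occurs in the second half
        for i in range(k // 2, k):
            Ts[i].add(j)
    # Every other occupied coordinate lands in at most one set T_i
    for jp in rng.sample([c for c in range(n) if c != j], min(k, n - 1)):
        Ts[rng.randrange(k)].add(jp)
    output = 1 if S in ("10", "01") else 0   # XOR of the two bits of S
    return j, S, Ts, output

print(xor_disj_instance(k=6, n=10))
```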
GAP-ORT + XOR DISJ
• Take 1/ε^2 independent copies of XOR DISJ
• Ti = (Ti1, …, Tik), ji, Si, Di are the variables for the i-th instance
• Is the number of outputs equal to 1 about 1/(2ε^2) ± 1/ε, or about 1/(2ε^2) ± 2/ε?
[Figure: a stack of 1/ε^2 XOR DISJ instances]
Intuitive Proof
• GAP-ORT is "embedded" inside GAP-ORT + XOR DISJ: each output is the XOR of the bits in S
• Implies for any correct protocol Π: for most i, I(Si; Π | S_{<i}) = Ω(1)
• Implies via a direct sum: for most i, I(Π; Ti | j, S, D, T_{<i}) = Ω(k)
• Implies via the chain rule: I(Π; T1, …, T_{1/ε^2} | j, S, D) = Ω(k/ε^2)
• Implies the communication is Ω(k/ε^2)
Conclusions
• Tight communication lower bounds for estimating |x|_0 and |x|_2
• Techniques imply tight lower bounds for empirical entropy, heavy hitters, and quantiles
• Other results:
  • Model in which the xi undergo poly(n) additive updates to their coordinates
  • Coordinator continually maintains a (1+ε)-approximation
  • Improve k^2/poly(ε) to k/poly(ε) communication for |x|_2