Algorithms for Distributed Functional Monitoring
Ke Yi, HKUST
Joint work with Graham Cormode (AT&T Labs) and S. Muthukrishnan (Google Inc.)
The Model
• Alice observes A(t) by time t
• Bob observes B(t) by time t
• A(t), B(t): multisets
• Carole tries to compute f(A(t) ∪ B(t)) for all t
• All parties have infinite computing power
• Goal is to minimize communication
[Figure: two item streams along a time axis t, one observed by Alice (5 4 3 1 2 4 1) and one observed by Bob (2 1 2 5 3 2)]
The Model
• Continuous Communication Model / Distributed Streaming Model
• k sites
[Figure: k sites, each observing its own stream of items]
Combination of Two Models
• Continuous Communication Model / Distributed Streaming Model
• Combines the communication model (one-shot model) with the streaming model
[Figure: example inputs for the communication model, the streaming model, and the combined model]
Other Models [Gibbons and Tirthapura, 2001]
• Carole tries to compute f(A ∪ B) in the end
• All parties make one pass using small memory and small communication
[Figure: the two item streams along a time axis t, as in the model above]
Applied Motivation: Distributed Monitoring
• Large-scale querying/monitoring: inherently distributed!
• Streams are physically distributed across remote sites, e.g., a stream of UDP packets through routers
• Challenge is "holistic" querying/monitoring
• Queries over the union of distributed streams: Q(S1 ∪ S2 ∪ …)
• Streaming data is spread throughout the network
[Figure: a query site at the Network Operations Center (NOC) evaluating Q(S1 ∪ S2 ∪ …) over bit streams at sites S1 to S6]
Slide from the tutorial "Streaming in a connected world: Querying and tracking distributed data streams" at VLDB'06 and SIGMOD'07 [Cormode and Garofalakis]
Applied Motivation: Distributed Monitoring
• Traditional approach: "pull" based
• Query all nodes once in a while
• Expensive communication, most of which is wasted
• Inaccurate
• Current trend: moving towards a "push" based approach
• The remote sites alert the coordinator when something interesting happens
[Figure: a query site at the Network Operations Center (NOC) evaluating Q(S1 ∪ S2 ∪ …) over bit streams at sites S1 to S6]
Theoretical Questions
• Upper bounds: what are the worst-case communication bounds for a given f?
• Lower bounds: is there a gap in communication complexity between the one-shot model and the continuous model?
The Frequency Moments
• Assume integer domain [n] = {1, …, n}
• Item i appears m_i times in the multiset A
• The p-th frequency moment: $F_p = \sum_{i=1}^{n} m_i^p$
• F1 is the cardinality of A
• F0 is the number of unique items in A (define 0^0 = 0)
• F2 is:
  • Gini's index of homogeneity in statistics
  • the self-join size in databases
• Extensively studied since [Alon, Matias, and Szegedy, 1999]
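As a small illustration of the definition, the moments can be computed directly from item counts. This is a minimal sketch, not part of the talk; the function name `frequency_moment` is a hypothetical helper, and the example stream is the one shown in the model figure.

```python
from collections import Counter


def frequency_moment(items, p):
    """Return F_p = sum of m_i^p over the items that occur in the multiset."""
    counts = Counter(items)
    # Only items with m_i >= 1 are stored in the Counter, so p = 0 counts
    # distinct items and the 0^0 = 0 convention holds automatically.
    return sum(m ** p for m in counts.values())
```

For the stream 5 4 3 1 2 4 1, calling the function with p = 0, 1, 2 gives the number of distinct items, the stream length, and the self-join size respectively.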
Approximate Monitoring
• Must trigger the alarm when Fp > τ
• Must not trigger the alarm when Fp < (1 − ε)τ
• Why approximate: exact monitoring is expensive and unnecessary
• Why monitoring:
  • Most applications only need monitoring
  • Tracking can be simulated by monitoring with τ = 1+ε, (1+ε)^2, (1+ε)^3, …, so tracking is at most an O(1/ε) factor away
[Figure: Fp over time, with the levels (1 − ε)τ and τ and the alarm point marked]
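The reduction from tracking to monitoring uses a geometric ladder of thresholds. The sketch below only builds that ladder; the name `threshold_ladder` and the cap `f_max` (an assumed upper limit on the value being tracked) are illustrative and do not come from the talk.

```python
def threshold_ladder(eps, f_max):
    """Return the thresholds (1+eps), (1+eps)^2, ... that do not exceed f_max."""
    ladder = []
    tau = 1 + eps
    while tau <= f_max:
        ladder.append(tau)
        tau *= 1 + eps  # next rung of the geometric ladder
    return ladder
```

Running one monitoring instance per rung means the last alarm that fired pins the tracked value down to within a (1+ε) factor, up to the slack already allowed by approximate monitoring.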
Prior Work
• Several papers in the database literature
• Mostly heuristic based
• Bad worst-case bounds, no lower bounds
• Bounds, prior work → this talk (Õ() suppresses polylog factors):
  • F1: O((k/ε) log(τ/k)) [SIGMOD'06] → O(k log(1/ε))
  • F0: Õ(k^2/ε^3) [ICDE'06] → Õ(k/ε^2)
  • F2: Õ(k^2/ε^4) [VLDB'05] → Õ(k^2/ε + k^{3/2}/ε^3)
Continuous vs One-Shot
• If there is a continuous monitoring algorithm that communicates X bits, then there is a one-shot algorithm that communicates O(X + k) bits
Our Results
• Good news: all continuous bounds (except F2) are close to their one-shot counterparts
• Bad news: all continuous bounds (except F2) are close to their one-shot counterparts
Talk Outline
• Introduction
• Deterministic F1 algorithm: O(k log(1/ε))
• Randomized F1 algorithm: O(1/ε^2 ∙ log(1/δ))
• Randomized F0 algorithm: Õ(k/ε^2)
• Randomized F2 algorithm: Õ(k^2/ε + k^{3/2}/ε^3)
• Conclusions
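Deterministic F1 Algorithm
• The first round: each site sends a signal to the coordinator for every τ/(2k) items it receives
• The coordinator terminates the round after receiving k signals
• At that point (τ/(2k)) · k = τ/2 < F1 < τ
[Figure: k sites, each with signal threshold τ/(2k), reporting to the coordinator]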
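Deterministic F1 Algorithm
• The second round: each site sends a signal to the coordinator for every τ/(4k) items it receives
• The coordinator terminates the round after receiving k signals
• At that point 3τ/4 < F1 < τ
[Figure: k sites, each with signal threshold τ/(4k), reporting to the coordinator]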
Deterministic F1 Algorithm
• Each round communicates O(k) bits
• Continue until Δ = ετ, where Δ is the remaining gap below τ: O(log(1/ε)) rounds
• After the last round, we have (1 − ε)τ < F1 < τ
• Continuous: total communication O(k log(1/ε)); lower bound Ω(k log(1/(εk)))
• One-shot: O(k log(1/ε)); lower bound Ω(k log(1/(εk)))
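The rounds above can be simulated in a few lines. This is a sketch under stated assumptions rather than the authors' implementation: each site keeps its unsignalled count across rounds, quotas are real-valued (the demo parameters are powers of two so the arithmetic is exact), and only site-to-coordinator signals are counted, not the coordinator's round-change announcements. The name `monitor_f1` is hypothetical.

```python
def monitor_f1(arrivals, k, tau, eps):
    """Simulate the halving-gap rounds; `arrivals` lists the site of each item.

    Returns (items seen when the alarm fires, signals sent), or None if the
    stream ends before the alarm.
    """
    unreported = [0.0] * k      # items each site has seen but not yet signalled
    lower = 0.0                 # coordinator's certified lower bound on F1
    quota = tau / (2 * k)       # items per signal in the current round
    in_round = 0                # signals received in the current round
    signals = 0
    for seen, site in enumerate(arrivals, start=1):
        unreported[site] += 1
        progress = True
        while progress:         # re-scan sites after a quota change
            progress = False
            for s in range(k):
                while unreported[s] >= quota:
                    unreported[s] -= quota
                    signals += 1
                    in_round += 1
                    progress = True
                    if in_round == k:            # round ends after k signals
                        lower += k * quota
                        gap = tau - lower        # the slide's Delta
                        if gap <= eps * tau:
                            return seen, signals  # alarm
                        quota = gap / (2 * k)    # halve the per-site quota
                        in_round = 0
    return None
```

With k = 4, τ = 1024 and ε = 1/8 the gap shrinks 1024 → 512 → 256 → 128, so the simulation runs log2(1/ε) = 3 rounds of k signals each.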
F0: # Distinct Items
• Lower bound: any deterministic (or Las Vegas randomized) algorithm has to communicate Ω(n) bits
• Consider the one-shot case first
• Use "sketches": small-space streaming algorithms
• "Combine" the sketches from the k sites
• FM sketch [Flajolet and Martin, 1985; Alon, Matias, and Szegedy, 1999]
FM Sketch
• Take a pairwise independent random hash function h : {1, …, n} → {1, …, 2^d}, where 2^d > n
• For each incoming element x, compute h(x)
  • e.g., h(5) = 10101100010000
• Count how many trailing zeros h(x) has
• Remember the maximum number of trailing zeros in any h(x)
• Let Y be the maximum number of trailing zeros
• Can show E[2^Y] = # distinct elements
FM Sketch
• So 2^Y is an unbiased estimator for # distinct elements
• However, it has a large variance
• Some recent techniques [Gibbons and Tirthapura, 2001; Bar-Yossef, Jayram, Kumar, Sivakumar, and Trevisan, 2002] produce a good estimator that is within relative error ε with probability 1 − δ
• Space increases to Õ(1/ε^2)
• FM sketch has linearity
  • Y1 from A, Y2 from B, then 2^max{Y1, Y2} estimates # distinct items in A ∪ B
• A one-shot algorithm with communication Õ(k/ε^2)
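The basic sketch and its merge rule can be illustrated as follows. This is a minimal single-copy sketch, not the refined Õ(1/ε^2) estimator cited above; the hash h(x) = (a·x + b) mod P with a Mersenne prime P is an assumed stand-in for the pairwise independent family, and `FMSketch`, `make_hash` and `trailing_zeros` are hypothetical names.

```python
import random

P = (1 << 61) - 1  # assumed prime modulus; any prime larger than n would do


def make_hash(rng):
    """Draw h(x) = (a*x + b) mod P, a standard pairwise independent family."""
    a, b = rng.randrange(1, P), rng.randrange(P)
    return lambda x: (a * x + b) % P


def trailing_zeros(v):
    """Number of trailing zero bits of v (capped at 61 when v == 0)."""
    return 61 if v == 0 else (v & -v).bit_length() - 1


class FMSketch:
    def __init__(self, h):
        self.h = h
        self.y = 0  # maximum number of trailing zeros seen so far

    def add(self, x):
        self.y = max(self.y, trailing_zeros(self.h(x)))

    def merge(self, other):
        # Sketches built with the same h combine by taking the maximum.
        self.y = max(self.y, other.y)

    def estimate(self):
        return 2 ** self.y  # rough estimate of the number of distinct items
```

Because the state is a running maximum, duplicates never change it, and merging the sketches of A and B gives exactly the sketch of A ∪ B as long as both sites share the same hash function.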
Continuously Monitoring F0
• FM sketch is monotone
• Yi is non-decreasing, and Yi < log n
• Whenever Yi increases, notify the coordinator
• The coordinator can always have the up-to-date combined FM sketch
• Total communication: Õ(k/ε^2)
• Lower bound: Ω(k)
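The push rule on this slide can be simulated with one sketch copy per site. This is an illustrative sketch only: the hash (a·x + b) mod P is an assumed stand-in for the pairwise independent family, its 61-bit range replaces the slide's log n cap, and `monitor_f0` is a hypothetical name.

```python
import random

P = (1 << 61) - 1  # assumed prime modulus for the hash


def trailing_zeros(v):
    return 61 if v == 0 else (v & -v).bit_length() - 1


def monitor_f0(stream, k, seed=0):
    """`stream` is a list of (site, item) pairs; returns (coord_y, site_y, messages)."""
    rng = random.Random(seed)
    a, b = rng.randrange(1, P), rng.randrange(P)  # hash shared by all sites
    site_y = [0] * k   # local maxima Y_i
    coord_y = 0        # coordinator's combined maximum
    messages = 0
    for site, x in stream:
        y = trailing_zeros((a * x + b) % P)
        if y > site_y[site]:       # Y_i increased: notify the coordinator
            site_y[site] = y
            messages += 1
            coord_y = max(coord_y, y)
    return coord_y, site_y, messages
```

Each Y_i can only increase as many times as it has possible values, which is what bounds the number of notifications per site and per sketch copy.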
F2: The One-Shot Case
• Lower bound: any deterministic (or Las Vegas randomized) algorithm has to communicate Ω(n) bits
• Consider the one-shot case first
• Use "sketches": small-space streaming algorithms
• "Combine" the sketches from the k sites
• AMS sketch [Alon, Matias, and Szegedy, 1999]
AMS Sketch: "Tug-of-War"
• Take a 4-wise independent random hash function h : {1, …, n} → {−1, +1}
• Compute Y = ∑ h(x) over all x
• Y^2 is an unbiased estimator for F2
• Use O(1/ε^2 ∙ log(1/δ)) copies to guarantee a good estimator that is within relative error ε with probability 1 − δ
• Linearity still holds!
• One-shot case can be solved with communication Õ(k/ε^2)
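A compact version of the tug-of-war sketch, including the merge step that linearity allows, might look like this. It is a sketch under assumptions: the sign hash is the low bit of a random cubic polynomial modulo a Mersenne prime (a common way to approximate 4-wise independence, with a negligible bias), the copies are combined by a median of means, and `AMSSketch` is a hypothetical name.

```python
import random
from statistics import median

P = (1 << 61) - 1  # assumed prime modulus for the polynomial hash


class AMSSketch:
    def __init__(self, copies, seed):
        rng = random.Random(seed)
        # One random cubic polynomial per copy; same seed => same hashes.
        self.coeffs = [[rng.randrange(P) for _ in range(4)] for _ in range(copies)]
        self.y = [0] * copies

    @staticmethod
    def _sign(coeffs, x):
        v = 0
        for c in coeffs:            # Horner evaluation mod P
            v = (v * x + c) % P
        return 1 if v & 1 else -1

    def add(self, x, count=1):
        for j, coeffs in enumerate(self.coeffs):
            self.y[j] += count * self._sign(coeffs, x)

    def merge(self, other):
        # Linearity: counters of sketches with identical hashes simply add.
        self.y = [u + v for u, v in zip(self.y, other.y)]

    def estimate(self, groups=1):
        squares = [v * v for v in self.y]
        size = len(squares) // groups
        means = [sum(squares[g * size:(g + 1) * size]) / size for g in range(groups)]
        return median(means)
```

Two sites that construct their sketches with the same seed can ship the counter vectors to the coordinator, which adds them and reads off an estimate of F2 for the union.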
However…
• Y is not monotone!
• Can't afford to send all changes of the local sketch to the coordinator
F2 Monitoring: Multi-Round Algorithm
• Beginning of a round: the sites send sketches of size Õ(1/ε^2) to the coordinator, which forms an estimate for F2
[Figure: sites sending sketches to the coordinator]
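F2 Monitoring: Multi-Round Algorithm
• During a round: a site sends a signal whenever the F2 of its updates increases by t = (τ − F2)^2/(64k^2τ), where F2 is the coordinator's estimate
[Figure: sites signalling the coordinator, which holds the estimate for F2]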
F2 Monitoring: Multi-Round Algorithm
• End of a round: when k signals are received
• old F2 + (τ − old F2) ∙ ε/k < new F2 < τ
• # rounds: O(k/ε)
• Total cost: Õ(k^2/ε^3)
[Figure: sites signalling the coordinator, which holds the estimate for F2]
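The total cost is the number of rounds times the cost of one round. A worked version, assuming each round costs k sketches of size Õ(1/ε^2) (the beginning-of-round step) plus k single-bit signals:

```latex
\[
\underbrace{O\!\left(\frac{k}{\varepsilon}\right)}_{\text{rounds}}
\times
\underbrace{\left(k \cdot \tilde{O}\!\left(\frac{1}{\varepsilon^{2}}\right) + O(k)\right)}_{\text{cost per round}}
= \tilde{O}\!\left(\frac{k^{2}}{\varepsilon^{3}}\right)
\]
```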
F2: Round / Sub-Round Algorithm
• End of a sub-round: when k signals are received
• The sites send "rough" sketches of size Õ(1); the coordinator combines the sketches to maintain an upper bound on F2
• old F2 + (τ − old F2) ∙ ε/k < new F2 < τ
• Total cost: Õ(k^2/ε + k^{3/2}/ε^3)
• Lower bound: Ω(k)
• One-shot: Õ(k/ε^2)
[Figure: sites sending rough sketches to the coordinator, which holds the estimate for F2]
Open Problems
• Still no clear separation between the one-shot model and the continuous model
  • F2 is an interesting case
• Many other functions f
  • Statistics: entropy, heavy hitters
  • Geometric measures: diameter, width, …
• Variations of the model
  • One-way vs two-way communication
  • Does having a broadcast channel help?
  • Sliding windows?
• "Continuous Communication Complexity"?