
Distributed Top-K Monitoring

Distributed Top-K Monitoring. Brian Babcock & Chris Olston. Presented by Yuval Altman. To be presented at the ACM SIGMOD 2003 International Conference on Management of Data. The problem: continuously report the k largest values obtained from distributed data streams.


Presentation Transcript


  1. Distributed Top-K Monitoring. Brian Babcock & Chris Olston. Presented by Yuval Altman. To be presented at the ACM SIGMOD 2003 International Conference on Management of Data.

  2. The problem Continuously report the k largest values obtained from distributed data streams.

  3. Motivation - • Google is the most popular search engine in the world. • Servers in multiple sites in the world handle millions of queries an hour. • What are the top 20 search terms?

  4. The problem • Continuously report the k largest values obtained from distributed data streams. • Multiple sources - physically far away • Communication is expensive. • Inefficient to transmit large amounts of data • Streaming model • Values change over time • Approximation may be sufficient

  5. Motivation – Detecting DDoS attacks

  6. Formal problem definition • $m+1$ nodes: • Monitor nodes: $N_1, N_2, \ldots, N_m$ • Coordinator node: $N_0$ • Set of $n$ data objects $U = \{O_1, O_2, \ldots, O_n\}$ • e.g. search terms, IP addresses • Objects are associated with real values $V_1, V_2, \ldots, V_n$ • e.g. the number of requests (DNS queries) for an IP address in the last 15 minutes

  7. Distributed streaming model • Updates to values arrive as a sequence of $\langle O_i, N_j, \Delta \rangle$ tuples, where: • $N_j$ detects a change $\Delta$ in the value $V_i$ of $O_i$. • The change is not seen by the other nodes $N_k$ ($k \neq j$). • For each node $N_j$, define partial values $V_{1,j}, V_{2,j}, \ldots, V_{n,j}$: $V_{i,j} = \sum_{\langle O_i, N_j, \Delta \rangle} \Delta$ • The value $V_i$ of an object $O_i$: $V_i = \sum_{j} V_{i,j}$

  8. Model example • U = {O1, O2, O3, O4}
     Updates seen by N1: <O1, N1, 2> <O2, N1, 3> <O4, N1, 4> <O3, N1, 2> <O1, N1, 1>
     Updates seen by N2: <O2, N2, 3> <O4, N2, 5> <O4, N2, -2> <O3, N2, 4> <O3, N2, 5>
     Updates seen by N3: <O2, N3, -1> <O3, N3, 4> <O2, N3, 2> <O3, N3, 3> <O2, N3, 5>
     Partial values at N1: $V_{1,1} = 3$, $V_{2,1} = 3$, $V_{3,1} = 2$, $V_{4,1} = 4$
     Partial values at N2: $V_{1,2} = 0$, $V_{2,2} = 3$, $V_{3,2} = 9$, $V_{4,2} = 3$
     Partial values at N3: $V_{1,3} = 0$, $V_{2,3} = 6$, $V_{3,3} = 7$, $V_{4,3} = 0$
     Values: $V_1 = 3$, $V_2 = 12$, $V_3 = 18$, $V_4 = 7$
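The model example can be replayed in a few lines. This is a minimal sketch, not code from the presentation: the function names `partial_values` and `totals` are made up for illustration, and only the update tuples come from the slide.

```python
from collections import defaultdict

# Update tuples (object, node, change), copied from the model example slide.
updates = [
    ("O1", "N1", 2), ("O2", "N1", 3), ("O4", "N1", 4), ("O3", "N1", 2), ("O1", "N1", 1),
    ("O2", "N2", 3), ("O4", "N2", 5), ("O4", "N2", -2), ("O3", "N2", 4), ("O3", "N2", 5),
    ("O2", "N3", -1), ("O3", "N3", 4), ("O2", "N3", 2), ("O3", "N3", 3), ("O2", "N3", 5),
]


def partial_values(updates):
    """Sum the changes per (object, node) pair: the partial values V_{i,j}."""
    partial = defaultdict(int)
    for obj, node, change in updates:
        partial[(obj, node)] += change
    return partial


def totals(partial):
    """Sum the partial values over all nodes: the values V_i."""
    total = defaultdict(int)
    for (obj, _node), value in partial.items():
        total[obj] += value
    return dict(total)


partial = partial_values(updates)
values = totals(partial)
print(values)
```

Only the coordinator-side `totals` step needs data from every node; each `partial_values` entry can be maintained locally, which is what makes the communication cost the quantity to minimize.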

  9. Using the model • Top-k IP addresses in the last 15 minutes: • Send <IPAddr,Router,1> when receiving a request for an IP address. • Send a cancelling <IPAddr,Router,-1> 15 minutes afterwards. • Can adopt a different strategy: • Send <IPAddr,Router,15> when receiving a request. • Send <IPAddr,Router,-1> 15 times, once per minute.
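The first strategy (a +1 on arrival and a cancelling -1 fifteen minutes later) can be sketched as follows. The helper names `windowed_updates` and `count_at`, the minute-based timestamps, and the sample address are assumptions made for this illustration only.

```python
def windowed_updates(ip, router, arrival_minute, window=15):
    """Return timed update tuples for one request: +1 now, -1 after the window."""
    return [
        (arrival_minute, (ip, router, 1)),
        (arrival_minute + window, (ip, router, -1)),
    ]


def count_at(events, minute):
    """Partial value seen at a given minute: sum of all changes applied so far."""
    return sum(change for when, (_ip, _router, change) in events if when <= minute)


# One request arriving at minute 3 (hypothetical address and router name).
events = windowed_updates("10.0.0.1", "R1", arrival_minute=3)
inside_window = count_at(events, 17)
after_window = count_at(events, 18)
```

With this encoding the partial value at a router is always the number of requests seen in the last 15 minutes, so no separate windowing logic is needed at the coordinator.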

  10. The problem • Example ($\epsilon = 5$), values: 100, 97, 95, 92, 90, 88, 87, 83, 80, 75 • The coordinator node $N_0$ must report a set $T \subseteq U$, $|T| = k$, that represents the top-k data objects. • The set must be correct within $\epsilon$. • Formally, if $O_t \in T$ and $O_s \in U - T$: $V_t + \epsilon \geq V_s$
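The correctness condition is a single comparison between the smallest value inside T and the largest value outside it. The sketch below is illustrative only: the function name and the object names `a` to `f` are hypothetical, and the numbers are borrowed from the example values on the slide.

```python
def is_valid_top_k(values, top_k, epsilon):
    """Check V_t + epsilon >= V_s for every O_t in T and every O_s outside T."""
    inside = [values[obj] for obj in top_k]
    outside = [value for obj, value in values.items() if obj not in top_k]
    if not inside or not outside:
        return True  # nothing to compare against
    return min(inside) + epsilon >= max(outside)


# Hypothetical object names attached to some of the slide's example values.
values = {"a": 100, "b": 97, "c": 95, "d": 92, "e": 90, "f": 88}

exact = is_valid_top_k(values, {"a", "b", "c"}, epsilon=0)
stale_but_ok = is_valid_top_k(values, {"a", "b", "d"}, epsilon=5)
stale_no_slack = is_valid_top_k(values, {"a", "b", "d"}, epsilon=0)
too_stale = is_valid_top_k(values, {"a", "b", "f"}, epsilon=5)
```

The set {a, b, d} is not the true top 3, but it is still an acceptable answer for $\epsilon = 5$ because d trails c by only 3.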

  11. Related work • One time distributed top-k calculation • Bruno, Gravano, Marian 2002 • Fagin, Lotem, Naor 2001 • Much better than transmitting all the values to coordinator node • Not streaming • no means to detect changes to data • Running algorithm continuously is very expensive • Monitor nodes have limited query capabilities • Sorted (GetNext) and random (GetValue)

  12. Related work • Streaming top-k monitoring from single source • Charikar, Chen, Farach-Colton 2002 • Manku, Motwani 2002 • Gibbons, Matias 1998 • Randomized Algorithms • Focus on minimizing space • Reminder: The objective is to minimize communication costs

  13. Overview of algorithm • Initialize a top-k set at the coordinator node • Set arithmetic constraints at monitor nodes • Depend on current top-k set • Constraints valid $\Rightarrow$ no communication • Constraints invalidated $\Rightarrow$ resolution • Possibly new top-k set • Reallocation of constraints

  14. Choosing the constraints • Ideally, data is distributed evenly across the monitor nodes, so that the local top-k sets are all the same • In this case, the global top-k set matches the local top-k sets • It suffices that the local constraints remain valid
     N1 (US): Money=100, Sex=98, Health=94, Mail=92
     N2 (Germany): Sex=30, Money=20, Mail=5, Health=3
     N3 (Japan): Money=50, Sex=5, Mail=4, Health=1
     Global list: Money=170, Sex=133, Mail=101, Health=98

  15. Adjustment factors • In real life, data is not distributed evenly
     Updates: <N1,Sex,-8> <N3,Health,5>
     N1 (US): Money=100, Health=94, Mail=92, Sex=90
     N2 (Germany): Sex=30, Money=20, Mail=5, Health=3
     N3 (Japan): Money=50, Health=6, Sex=5, Mail=4
     Global list: Money=170, Sex=125, Health=103, Mail=101
     • Local constraints are invalidated, but the global top-k set is still valid

  16. Adjustment factors • For each node $N_j$ and object $O_i$, associate an adjustment factor $\delta_{i,j}$ • Constraints are evaluated after adding the adjustment factors • If $O_t \in T$ and $O_s \in U - T$: $V_{t,j} + \delta_{t,j} \geq V_{s,j} + \delta_{s,j}$ • Adjustment factors for each object sum to zero: $\sum_{j} \delta_{i,j} = 0$ • This ensures that the constraint on the summed values remains valid

  17. Adjustment factors example
     Before adjustment:
     N1 (US): Money=100, Health=94, Mail=92, Sex=90
     N2 (Germany): Sex=30, Money=20, Mail=5, Health=3
     N3 (Japan): Money=50, Health=6, Sex=5, Mail=4
     Global list: Money=170, Sex=125, Health=103, Mail=101
     Adjustment factors: $\delta_{\text{Sex},1} = 10$, $\delta_{\text{Sex},2} = -15$, $\delta_{\text{Sex},3} = 5$
     After adjustment:
     N1 (US): Money=100, Sex=100, Health=94, Mail=92
     N2 (Germany): Money=20, Sex=15, Mail=5, Health=3
     N3 (Japan): Money=50, Sex=10, Health=6, Mail=4
     Global list: Money=170, Sex=125, Health=103, Mail=101
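The effect of the adjustment factors in this example can be checked mechanically. The sketch below assumes k = 2 with T = {Money, Sex}, which the slides imply but never state, and `local_constraints_hold` is a name invented for this illustration.

```python
def local_constraints_hold(partial, delta, top_k):
    """True if every adjusted top-k value is >= every adjusted non-top-k value."""
    adjusted = {obj: value + delta.get(obj, 0) for obj, value in partial.items()}
    inside = [adjusted[obj] for obj in top_k]
    outside = [value for obj, value in adjusted.items() if obj not in top_k]
    return min(inside) >= max(outside)


top_k = {"Money", "Sex"}  # assumption: k = 2

# Partial values per node after the updates, as listed in the example.
nodes = {
    "N1": {"Money": 100, "Health": 94, "Mail": 92, "Sex": 90},
    "N2": {"Sex": 30, "Money": 20, "Mail": 5, "Health": 3},
    "N3": {"Money": 50, "Health": 6, "Sex": 5, "Mail": 4},
}
# Adjustment factors for Sex from the example; all other factors are zero.
deltas = {"N1": {"Sex": 10}, "N2": {"Sex": -15}, "N3": {"Sex": 5}}

without_factors = {n: local_constraints_hold(p, {}, top_k) for n, p in nodes.items()}
with_factors = {n: local_constraints_hold(p, deltas[n], top_k) for n, p in nodes.items()}
factor_sum = sum(d["Sex"] for d in deltas.values())
```

Without the factors, N1 and N3 violate their local constraints even though the global top-k set is unchanged; with them, all three nodes are satisfied and the factors for Sex still sum to zero.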

  18. Coordinator adjustment factor • For each object $O_i$, add an adjustment factor $\delta_{i,0}$ at the coordinator node • The factors for each object $O_i$ must still sum to 0 • To allow error, if $O_t \in T$ and $O_s \in U - T$: • Give the $O_t$ values a "bonus" of $\epsilon$ • Let $V_{t,0} = \epsilon$, $V_{s,0} = 0$ • The constraint: $\delta_{t,0} + \epsilon \geq \delta_{s,0}$

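  19. Allowing error – example ($\epsilon = 5$)
     Update: <N3,Health,40>
     N1 (US): Money=100, Sex=98, Health=94, Mail=92
     N2 (Germany): Sex=30, Money=20, Mail=5, Health=3
     N3 (Japan): Money=50, Health=41, Sex=5, Mail=4
     Global list: Money=170, Health=138, Sex=133, Mail=101
     Adjustment factors: $\delta_{\text{Sex},1} = -4$, $\delta_{\text{Sex},2} = -25$, $\delta_{\text{Sex},3} = 29$, $\delta_{\text{Health},2} = 2$, $\delta_{\text{Health},3} = -7$
     Coordinator constraint: $\delta_{\text{Sex},0} + 5 \geq \delta_{\text{Health},0}$
     The trick: $\delta_{\text{Health},0} = 5$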

  20. Why do adjustment factors work? For $O_t \in T$ and $O_s \in U - T$: • As long as, for each node $N_j$, the adjusted constraints and the coordinator constraint are valid: • $V_{t,j} + \delta_{t,j} \geq V_{s,j} + \delta_{s,j}$ • $\delta_{t,0} + \epsilon \geq \delta_{s,0}$ • We can sum over the nodes, add the error constraint, and get: $V_t + \epsilon \geq V_s$
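The summation step can be written out in full. This is a reconstruction of the argument sketched on the slide, under the assumption that the zero-sum condition runs over the coordinator and all monitor nodes, $\sum_{j=0}^{m} \delta_{i,j} = 0$.

```latex
\begin{align*}
\sum_{j=1}^{m} \left( V_{t,j} + \delta_{t,j} \right) + \delta_{t,0} + \epsilon
  &\geq \sum_{j=1}^{m} \left( V_{s,j} + \delta_{s,j} \right) + \delta_{s,0} \\
V_t + \sum_{j=0}^{m} \delta_{t,j} + \epsilon
  &\geq V_s + \sum_{j=0}^{m} \delta_{s,j} \\
V_t + \epsilon &\geq V_s
\end{align*}
```

The first line adds the $m$ monitor constraints to the coordinator constraint, the second uses $V_i = \sum_{j} V_{i,j}$, and the third drops both sums of adjustment factors because each equals zero.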

  21. Algorithm details • Coordinator node $N_0$ maintains • Current approximate top-k set • All adjustment factors $\delta_{i,j}$ • Each monitor node $N_j$ maintains • Current approximate top-k set • For each object $O_i$ • Partial value: $V_{i,j}$ • Relevant adjustment factor: $\delta_{i,j}$

  22. Algorithm details • Initialization. Coordinator: • Computes the approximate top-k set once. • Chooses adjustment factors • Sends adjustment factors and top-k set to monitors • Monitor node constraints: • For $O_t \in T$ and $O_s \in U - T$: $V_{t,j} + \delta_{t,j} \geq V_{s,j} + \delta_{s,j}$ • Adjustment factor constraints: • For each object $O_i$: $\sum_{j} \delta_{i,j} = 0$ • For objects $O_t \in T$ and $O_s \in U - T$: $\delta_{t,0} + \epsilon \geq \delta_{s,0}$

  23. Algorithm for monitor node $N_j$ • While (1) • Read tuple $\langle O_i, N_j, \Delta \rangle$ • $V_{i,j} = V_{i,j} + \Delta$ • Check constraints: for $O_t \in T$ and $O_s \in U - T$: $V_{t,j} + \delta_{t,j} \geq V_{s,j} + \delta_{s,j}$ • If invalid, initiate resolution. • End • To check constraints: use two heaps (or Fibonacci heaps)
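The monitor loop can be sketched as a small class. This is an illustration under stated assumptions, not the presenters' implementation: the class and method names are invented, and it checks constraints with a plain pairwise scan rather than the two heaps the slide recommends. The sample update of +6 to Mail is also hypothetical.

```python
class MonitorNode:
    """Keeps partial values and adjustment factors for one monitor node."""

    def __init__(self, top_k, partial, delta=None):
        self.top_k = set(top_k)
        self.partial = dict(partial)    # object -> partial value V_{i,j}
        self.delta = dict(delta or {})  # object -> adjustment factor

    def adjusted(self, obj):
        return self.partial.get(obj, 0) + self.delta.get(obj, 0)

    def violated(self):
        """Return the objects involved in at least one violated constraint."""
        objects = set(self.partial) | self.top_k
        involved = set()
        for t in objects & self.top_k:
            for s in objects - self.top_k:
                if self.adjusted(t) < self.adjusted(s):
                    involved.update((t, s))
        return involved

    def update(self, obj, change):
        """Apply one update tuple; a non-empty result means resolution is needed."""
        self.partial[obj] = self.partial.get(obj, 0) + change
        return self.violated()


node = MonitorNode(
    top_k={"Money", "Sex"},
    partial={"Money": 50, "Mail": 4, "Sex": 5, "Health": 1, "Love": 0},
)
quiet = node.update("Money", 3)    # constraints still hold, nothing to send
violated = node.update("Mail", 6)  # Mail rises to 10 and overtakes Sex
```

The pairwise scan costs time proportional to k times the number of other objects per update; keeping a min-heap over the adjusted top-k values and a max-heap over the rest reduces the check to comparing the two heap tops.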

  24. Resolution – phase 1 • First, $N_j$ sends a message to $N_0$ with: • $F$: the set of objects involved in violated constraints • All partial values for objects in $R = F \cup T$ • The border value $B_j$: the maximum adjusted value not in the resolution set
     Example, N3 (Japan): Money=50, Mail=10, Sex=5, Health=1, Love=0
     $F_3$ = {Mail, Sex}, $R_3$ = {Money, Mail, Sex}
     $V_{\text{Money},3} = 50$, $V_{\text{Mail},3} = 10$, $V_{\text{Sex},3} = 5$, $B_3 = 1$
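Building the phase 1 message from a node's state is mostly set arithmetic. In this sketch the function name and the dictionary layout of the message are assumptions; the border value follows the slide's definition (the largest adjusted value outside the resolution set).

```python
def phase1_message(partial, delta, top_k, violated):
    """Assemble what the monitor node sends to the coordinator in phase 1."""
    resolution_set = set(violated) | set(top_k)  # R = F union T
    values = {obj: partial.get(obj, 0) for obj in resolution_set}
    rest = [
        value + delta.get(obj, 0)
        for obj, value in partial.items()
        if obj not in resolution_set
    ]
    border = max(rest) if rest else None  # no border if every object is in R
    return {"F": set(violated), "R": resolution_set, "values": values, "border": border}


# State of N3 (Japan) from the example, with all adjustment factors at zero.
message = phase1_message(
    partial={"Money": 50, "Mail": 10, "Sex": 5, "Health": 1, "Love": 0},
    delta={},
    top_k={"Money", "Sex"},
    violated={"Mail", "Sex"},
)
```

The message size depends on |R|, not on the total number of objects, which is why resolution messages stay small even when U is large.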

  25. Resolution – phase 2 • The coordinator $N_0$ attempts to resolve the constraints using the $\delta_{*,0}$ slack • For each violated constraint, $N_0$ tests: • $V_{t,j} + \delta_{t,j} + \delta_{t,0} + \epsilon \geq V_{s,j} + \delta_{s,j} + \delta_{s,0}$ • If all tests succeed, the top-k set is valid, and there is no need to communicate with other nodes. • $N_0$ reallocates adjustment factors. • Resolution is over • If at least one test fails, proceed to phase 3

  26. Phase 2 resolution example ($\epsilon = 5$, $\delta_{*,*} = 0$)
     Before the update:
     N1: Money=100, Sex=98, Mail=96, Health=92
     N2: Money=35, Sex=20, Mail=5, Health=3
     N3: Money=50, Sex=5, Mail=4, Health=1
     Global list: Money=185, Sex=123, Mail=105, Health=96
     Update: <N2,Mail,17>
     N1: Money=100, Sex=98, Mail=96, Health=92
     N2: Money=35, Mail=22, Sex=20, Health=3
     N3: Money=50, Sex=5, Mail=4, Health=1
     Global list: Money=185, Sex=123, Mail=122, Health=96
     To fix: $\delta_{\text{Sex},0} = -2$, $\delta_{\text{Sex},2} = 2$

  27. Phase 2 resolution failure ($\delta_{\text{Sex},0} = -2$, $\delta_{\text{Sex},2} = 2$)
     Update: <N2,Sex,5>
     N1: Money=100, Sex=98, Mail=96, Health=92
     N2: Money=35, Sex=27, Mail=22, Health=3
     N3: Money=50, Sex=5, Mail=4, Health=1
     Global list: Money=185, Sex=128, Mail=122, Health=96
     Update: <N3,Mail,5>
     N1: Money=100, Sex=98, Mail=96, Health=92
     N2: Money=35, Sex=27, Mail=22, Health=3
     N3: Money=50, Mail=9, Sex=5, Health=1
     Global list: Money=185, Sex=128, Mail=127, Health=96
     Can't "loan" 4 from $\delta_{\text{Sex},0}$
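The phase 2 test is one inequality per violated constraint, and both examples can be run through it. The function and parameter names below are invented for this sketch; the numbers are the ones shown on the two slides.

```python
def phase2_test(v_t, d_t, d_t0, v_s, d_s, d_s0, epsilon):
    """Coordinator check: can the slack held at N0 absorb the violation?

    v_* are partial values at the reporting node, d_* its adjustment factors,
    and d_*0 the coordinator's adjustment factors for the two objects.
    """
    return v_t + d_t + d_t0 + epsilon >= v_s + d_s + d_s0


# Success case: Mail (22) overtakes Sex (20) at N2, all factors still zero.
success = phase2_test(v_t=20, d_t=0, d_t0=0, v_s=22, d_s=0, d_s0=0, epsilon=5)

# Failure case: Mail (9) overtakes Sex (5) at N3 after 2 units of the
# coordinator's slack for Sex were already loaned to N2.
failure = phase2_test(v_t=5, d_t=0, d_t0=-2, v_s=9, d_s=0, d_s0=0, epsilon=5)
```

In the failure case the gap is 4, but only 3 units of slack remain at the coordinator (the factor of -2 plus $\epsilon = 5$), so the coordinator has to query the other nodes.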

  28. Resolution – phase 3 • The coordinator $N_0$ contacts all monitor nodes $N_i$ other than $N_j$, requesting: • Partial values for objects in $R = F \cup T$ • Border values $B_i$ • $N_0$ sums the partial values and sorts them to compute the new top-k list $T'$ • $N_0$ reallocates new adjustment factors for $T'$ • $N_0$ sends $T'$ and the adjustment factors to all nodes

  29. Resolution – summary • Phase 1: $N_j$ detects failed constraints and notifies $N_0$. Initiates resolution for $R = F \cup T$ • Phase 2: $N_0$ attempts to resolve the constraints using $\delta_{*,0}$, the "bank" • If successful, reallocate adjustment factors and stop • Phase 3: $N_0$ requests all updated partial values for $R$, sorts, computes the new top-k list • Reallocate adjustment factors

  30. Resolution Performance • Means to measure algorithm performance • Messages are usually small • Only the resolution set $R = F \cup T$ is involved • Two phase resolution • Initiation + reallocation • Only two messages • Three phase resolution • Initiation + query + reallocation • $1 + 2(m-1) + m = 3m - 1$ messages
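The message counts can be tabulated for a given number of monitor nodes. This is a small sketch of the arithmetic on the slide; the function name is made up.

```python
def resolution_messages(m, three_phase):
    """Messages exchanged in one resolution with m monitor nodes."""
    if not three_phase:
        return 2  # initiation + reallocation to the initiating node
    initiation = 1
    queries = 2 * (m - 1)  # request and reply for every other monitor node
    reallocation = m       # new top-k set and factors sent to all monitors
    return initiation + queries + reallocation


counts = {m: resolution_messages(m, three_phase=True) for m in (2, 4, 10)}
```

A two phase resolution costs the same regardless of m, which is why leaving slack at the coordinator pays off when resolutions are frequent.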

  31. Adjustment factor reallocation • Input: • Top-k list $T'$ • Partial values in resolution set $R$ • Border values • Output: • New adjustment factors $\delta_{i,j}$ • Method, for each object: • Meet border value constraints • Calculate leeway • Distribute leeway evenly
     Example node: Money=50, Mail=10, Sex=5, Health=1, Love=0
     $F$ = {Mail, Sex}, $R$ = {Money, Mail, Sex}
     $V_{\text{Money}} = 50$, $V_{\text{Mail}} = 10$, $V_{\text{Sex}} = 5$, $B = 1$

  32. Leeway computation • For each object in $R$, compute the leeway $\lambda$: the slack above the sum of border values • Define: • Sum of border values: $B = \sum_{j} B_j$ • Computed values: $V_i = \sum_{j} V_{i,j}$ • $V_{i,0} = 0$; $B_0 = \max(\delta_{i,0})$ over $O_i \notin R$ • If $O_i \in T'$: $\lambda_i = V_i - B + \epsilon$ • Otherwise: $\lambda_i = V_i - B$

  33. Leeway computation example ($\epsilon = 0$)
     N1 (US): Money=100, Sex=98, Health=94, Mail=92, Love=85
     N2 (Germany): Sex=30, Money=20, Mail=5, Love=5, Health=3
     N3 (Japan): Money=50, Mail=10, Sex=5, Health=1, Love=0
     Global list: Money=170, Sex=133, Mail=107, Health=98, Love=90
     • $B = 94 + 5 + 1 = 100$ • $\lambda_{\text{Money}} = 170 - B = 70$ • $\lambda_{\text{Sex}} = 133 - B = 33$ • $\lambda_{\text{Mail}} = 107 - B = 7$

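  34. Leeway distribution • Initialization: meet constraints • $\delta_{i,j} = B_j - V_{i,j}$ • For $O_i \in T'$, $j = 0$: $\delta_{i,0} = B_0 - \epsilon$ • Leeway distribution: • $\delta_{i,j} = \delta_{i,j} + (\lambda_i / m)$ • Correctness: $V_{t,j} + \delta_{t,j} \geq V_{s,j} + \delta_{s,j}$ • If $O_s \notin R$: follows from $V_{t,j} + \delta_{t,j} \geq B_j$, since $B_j$ bounds every adjusted value outside $R$ • If $O_s \in R$: both adjusted values equal $B_j$ plus a share of leeway, so it follows from $\lambda_t \geq \lambda_s$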

  35. Leeway distribution example
     N1 (US): Money=100, Sex=98, Health=94, Mail=92, Love=85
     N2 (Germany): Sex=30, Money=20, Mail=5, Love=5, Health=3
     N3 (Japan): Money=50, Mail=10, Sex=5, Health=1, Love=0
     Global list: Money=170, Sex=133, Mail=107, Health=98, Love=90
     • $\lambda_{\text{Sex}} = 33$ • $\delta_{\text{Sex},1} = B_1 - V_{\text{Sex},1} + 33/3 = 94 - 98 + 11 = 7$ • $\delta_{\text{Sex},2} = B_2 - V_{\text{Sex},2} + 33/3 = 5 - 30 + 11 = -14$ • $\delta_{\text{Sex},3} = B_3 - V_{\text{Sex},3} + 33/3 = 1 - 5 + 11 = 7$

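  36. Leeway distribution example • $\lambda_{\text{Money}} = 70$ • $\delta_{\text{Money},1} = B_1 - V_{\text{Money},1} + 70/3 = 94 - 100 + 24 = 18$ • $\delta_{\text{Money},2} = B_2 - V_{\text{Money},2} + 70/3 = 5 - 20 + 23 = 8$ • $\delta_{\text{Money},3} = B_3 - V_{\text{Money},3} + 70/3 = 1 - 50 + 23 = -26$ • $\lambda_{\text{Mail}} = 7$ • $\delta_{\text{Mail},1} = B_1 - V_{\text{Mail},1} + 7/3 = 94 - 92 + 3 = 5$ • $\delta_{\text{Mail},2} = B_2 - V_{\text{Mail},2} + 7/3 = 5 - 5 + 2 = 2$ • $\delta_{\text{Mail},3} = B_3 - V_{\text{Mail},3} + 7/3 = 1 - 10 + 2 = -7$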

  37. Reallocation Results
     Partial values:
     N1 (US): Money=100, Sex=98, Health=94, Mail=92, Love=85
     N2 (Germany): Sex=30, Money=20, Mail=5, Love=5, Health=3
     N3 (Japan): Money=50, Mail=10, Sex=5, Health=1, Love=0
     Global list: Money=170, Sex=133, Mail=107, Health=98, Love=90
     Adjusted values after reallocation:
     N1 (US): Money=118, Sex=105, Mail=97, Health=94, Love=85
     N2 (Germany): Money=28, Sex=16, Mail=7, Love=5, Health=3
     N3 (Japan): Money=24, Sex=12, Mail=3, Health=1, Love=0
     Global list: Money=170, Sex=133, Mail=107, Health=98, Love=90
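The whole reallocation example can be reproduced with a short function. This is a sketch under several assumptions: $\epsilon = 0$, no leeway is given to the coordinator, all previous adjustment factors are zero (so border values come from raw partial values), and an uneven leeway is split by giving the remainder to the first nodes, which matches the 24/23/23 and 3/2/2 splits in the example. The function name is invented.

```python
def reallocate(partials, resolution_set):
    """Compute border values and new adjustment factors for each monitor node.

    partials: list of dicts (one per monitor node), object -> partial value.
    """
    m = len(partials)
    # Border value per node: largest value outside the resolution set.
    borders = [
        max(value for obj, value in node.items() if obj not in resolution_set)
        for node in partials
    ]
    total_border = sum(borders)
    deltas = [{} for _ in partials]
    for obj in resolution_set:
        value = sum(node.get(obj, 0) for node in partials)
        leeway = value - total_border           # epsilon assumed to be 0
        share, extra = divmod(leeway, m)        # even split, integer amounts
        for j, node in enumerate(partials):
            portion = share + (1 if j < extra else 0)
            deltas[j][obj] = borders[j] - node.get(obj, 0) + portion
    return borders, deltas


partials = [
    {"Money": 100, "Sex": 98, "Health": 94, "Mail": 92, "Love": 85},  # N1 (US)
    {"Sex": 30, "Money": 20, "Mail": 5, "Love": 5, "Health": 3},      # N2 (Germany)
    {"Money": 50, "Mail": 10, "Sex": 5, "Health": 1, "Love": 0},      # N3 (Japan)
]
borders, deltas = reallocate(partials, {"Money", "Sex", "Mail"})
adjusted_n1 = {obj: partials[0][obj] + deltas[0][obj] for obj in deltas[0]}
```

Each object's new factors sum to zero across the three nodes, and every adjusted value in the resolution set sits at or above its node's border value, so all local constraints hold again after reallocation.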

  38. Leeway distribution to $N_0$ • Leeway is also distributed to the coordinator node • $\epsilon$ is added to the leeway computation for $O_t \in T'$ • Initialization of $\delta_{t,0}$ for $O_t \in T'$ is $B_0 - \epsilon$ • Any addition can be "loaned" to monitor nodes • Amount distributed to $N_0$: • Higher ($\lambda_i / 2$): less chance of phase 3 during resolution • Lower (0): fewer resolutions (more leeway for the monitor nodes)

  39. Proportional leeway distribution • Allocate more leeway to monitor nodes that are updated more often • Top-k likely to change more • Good for monitor nodes that exhibit characteristic behavior • Google locations • Enterprise routers

  40. Experiments • Query 1: • FIFA '98 servers at 4 locations throughout the world. • Top 20 Web pages by hit statistics • Query 2: • Most loaded server in a cluster • Single value per monitor node • Query 3: • Berkeley-to-world WAN link, with 4 monitor points • Top 20 destination hosts by number of outgoing TCP packets

  41. Results – Query 1

  42. Results – Query 2

  43. Results – query 3

  44. Analysis of results • Allowing error improves results dramatically • Leeway for $N_0$ is the dominant factor • Low $\epsilon$: half the leeway to $N_0$ • Low $\epsilon$ $\Rightarrow$ little leeway • Resolutions are bound to happen, so make them less expensive • High $\epsilon$: no leeway to $N_0$

  45. Analysis of results • Even / proportional leeway distribution depends on the query. • Server load: proportional • Berkeley WAN: monitor nodes were simulated, so even distribution is better • FIFA: proportional for lower $\epsilon$, even for higher $\epsilon$.

  46. Comparison to alternative • Caching • Coordinator holds cached partial data values • A monitor must send an update to the coordinator when a partial value deviates by $\epsilon / 2m$ • Summed over the $m$ monitors, the cached values are always correct within $\epsilon / 2$ • Top-k list always correct within $\epsilon$
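The caching baseline is simple enough to sketch in full. The class and attribute names are assumptions made for this illustration, and the sketch treats "deviates by" as "deviates by more than" so that a cached value is never off by more than the threshold.

```python
class CachingMonitor:
    """Baseline monitor: report a partial value once it drifts too far."""

    def __init__(self, epsilon, m):
        self.threshold = epsilon / (2 * m)  # allowed drift per partial value
        self.partial = {}                   # object -> current partial value
        self.reported = {}                  # object -> value cached at coordinator

    def update(self, obj, change):
        """Apply a change; return the value to send, or None if within bounds."""
        self.partial[obj] = self.partial.get(obj, 0) + change
        drift = abs(self.partial[obj] - self.reported.get(obj, 0))
        if drift > self.threshold:
            self.reported[obj] = self.partial[obj]
            return self.partial[obj]
        return None


monitor = CachingMonitor(epsilon=6, m=3)  # threshold = 1.0
first = monitor.update("Money", 1)        # drift 1.0, still within bounds
second = monitor.update("Money", 1)       # drift 2.0, must be reported
```

Every object is tracked to the same tolerance here, whether or not it is anywhere near the top-k boundary, which is the main reason this baseline sends far more messages than the constraint-based approach.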

  47. Results: Note the log scale!

  48. Summary • Problem: find the top-k set within error $\epsilon$ • Distributed: multiple sources • Streaming: frequent updates • Naive approach • Transmit streams to coordinator node • If error is allowed, transmit only when deviation from cached value threatens correctness • New approach offers dramatic improvement over the naive approach for low to medium $\epsilon$.

  49. Summary • Use adjustment factors to establish constraints • Monitor node initiates resolution when a constraint is broken • Resolution • Attempt to use coordinator node leeway. If successful, fix constraints by adjustment factor reallocation. • Otherwise, get partial values for the resolution set from all nodes and compute a new top-k set. Reallocate leeway to all nodes. • Reallocation • Distribute leeway evenly between monitor nodes • Distribute leeway to the coordinator node for low $\epsilon$

  50. Questions?
