
Monitoring Distributed Streams

Presented by: Tsachi Sharfman. Instructor: Assoc. Prof. Assaf Schuster. Co-Instructor: Assoc. Prof. Daniel Keren, Haifa U.


Presentation Transcript


  1. Monitoring Distributed Streams Presented by: Tsachi Sharfman Instructor: Assoc. Prof. Assaf Schuster Co-Instructor: Assoc. Prof. Daniel Keren, Haifa U. Israel Innovation Summit

  2. Problem Definition • A set of distributed data streams • Mirrored web site • Distributed spam filtering system • A sensor network • A data vector is collected from each stream • Stream is infinite • Moving/jumping windows • Given: A function over the average of the data vectors • Given: A predetermined threshold • Question: did the function value cross the threshold?

  3. Example 1: Web Page Frequency Counts • Mirrored web site • Each mirror maintains the frequency with which each page was accessed in the last 5 minutes • We would like to constantly maintain a list of the most frequently accessed web pages (as defined by a threshold)

  4. Example 2: Air Quality Monitoring • Sensors monitor the concentration of air pollutants. • Each sensor holds a data vector comprising the measured concentrations of various pollutants (CO2, SO2, O3, etc.). • A function on the average data vector determines the Air Quality Index (AQI) • Alert in case the AQI exceeds a given threshold.

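  5. Example 3: Variance Alert • Sensors monitoring the temperature in a server room (machine room, conference room, etc.) • Ensure uniform temperature: monitor the variance of the readings • Alert in case the variance exceeds a threshold • Temperature readings by n sensors: $x_1, \dots, x_n$ • Each sensor holds a data vector $v_i = (x_i^2, x_i)^T$ • The average data vector is $v = \frac{1}{n}\sum_{i=1}^{n} v_i = \left(\frac{1}{n}\sum_{i=1}^{n} x_i^2,\ \frac{1}{n}\sum_{i=1}^{n} x_i\right)^T$ • $\mathrm{Var}(\text{all sensors}) = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \left(\frac{1}{n}\sum_{i=1}^{n} x_i\right)^2$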
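The variance example can be sketched in a few lines. This is a minimal illustration, assuming each sensor holds the vector (x_i^2, x_i) and that the variance is the first component of the average vector minus the square of the second. The function names and the temperature readings are hypothetical.

```python
def local_vector(x):
    """Data vector held by one sensor for reading x: (x^2, x)."""
    return (x * x, x)


def average_vector(vectors):
    """Component-wise average of the sensors' data vectors."""
    n = len(vectors)
    return tuple(sum(v[k] for v in vectors) / n for k in range(2))


def variance_from_average(v):
    """Variance of all readings: mean(x^2) - mean(x)^2."""
    return v[0] - v[1] ** 2


# Hypothetical temperature readings from three sensors.
readings = [20.0, 22.0, 24.0]
v = average_vector([local_vector(x) for x in readings])
var = variance_from_average(v)
alert = var > 5.0  # hypothetical threshold
```

The point of the construction is that the variance, a nonlinear quantity, becomes a function of the average of per-sensor vectors, which is the form the monitoring problem requires.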

  6. Previous Work • Focused on linear functions (e.g., sum, average): • M. Dilman and D. Raz. Efficient reactive monitoring. In INFOCOM, pages 1012–1019, 2001. • Previous solutions for arbitrary functions included only naïve algorithms • All data is moved to a central place • Communication overhead • CPU overhead • Power overhead • Privacy issues

  7. Frequency Count Algorithm • Each mirror maintains a frequency count for each web page • One of the mirrors is designated as a coordinator. • Synchronization: upon initialization, and from time to time, as dictated by the algorithm: • Coordinator collects frequency counts from all mirrors • Calculates global frequency, called the estimate value and denoted by e • Sends estimate value to all mirrors

  8. Solution (continued) • Per web page, each mirror maintains: • Reference value, v’, which holds the frequency count collected by the coordinator during the last synchronization event • Estimate value, e, the last estimate value sent by the coordinator • Drift value, u = e + v – v’, where v is the mirror's current frequency count • Observation: the average of the drift values equals the average of the frequency counts • Conclusion: as long as all the drift values are on the same side of the threshold as the estimate value, no communication is required
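The drift-value bookkeeping can be illustrated with a small sketch. It assumes a single web page, three mirrors, and a hypothetical threshold of 100. The function names are illustrative and not taken from the presentation.

```python
THRESHOLD = 100.0  # hypothetical threshold on the global frequency


def synchronize(counts):
    """Coordinator step: returns the estimate e and the reference values v'."""
    e = sum(counts) / len(counts)
    return e, list(counts)


def drift(e, v, v_ref):
    """Drift value u = e + v - v'."""
    return e + v - v_ref


def same_side(a, b, threshold):
    """True if a and b are on the same side of the threshold."""
    return (a > threshold) == (b > threshold)


def needs_sync(e, counts, refs, threshold):
    """A synchronization is needed if any drift value crosses the threshold."""
    drifts = [drift(e, v, r) for v, r in zip(counts, refs)]
    return not all(same_side(u, e, threshold) for u in drifts)


e, refs = synchronize([90.0, 80.0, 70.0])        # e = 80
counts = [95.0, 70.0, 84.0]                      # counts after local updates
drifts = [drift(e, v, r) for v, r in zip(counts, refs)]

quiet = needs_sync(e, counts, refs, THRESHOLD)   # all drifts stay below 100
alarm = needs_sync(e, [95.0, 70.0, 95.0], refs, THRESHOLD)  # one drift is 105
```

In the second case one drift value crosses the threshold even though the global average is still below it. The local test is conservative: it may trigger an unnecessary synchronization, but it does not miss a real crossing.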

  9. Example 4 (running example): Distributed Feature Selection • A distributed spam mail filtering system. • A mail server receives a stream of positive and negative examples. • Select a set of features (words) to be used in order to build a spam classifier. • A feature is good if its information gain is above a threshold.

  10. Distributed Calculation of Information Gain • Each server maintains a contingency table for each feature. • We would like to determine, for each feature, whether the information gain on the average contingency table is above the threshold.

  11. Distributed Calculation of Information Gain – continued • Note that the information gain on the average contingency table cannot be derived from the information gain on each individual contingency table! • Example values shown on the slide: IG(C1) = 1, IG(C2) = 1
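A small numeric illustration of this point follows. It assumes each contingency table is normalized to joint probabilities of (feature present/absent, spam/not spam) and computes information gain as the mutual information between feature and class, in bits. The two tables are hypothetical, chosen so that each has an information gain of 1. They are not necessarily the tables shown on the slide.

```python
import math


def information_gain(table):
    """Mutual information (in bits) between the rows and columns of a joint
    probability table."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    ig = 0.0
    for i, row in enumerate(table):
        for j, p in enumerate(row):
            if p > 0:
                ig += p * math.log2(p / (row_sums[i] * col_sums[j]))
    return ig


def average_table(tables):
    """Cell-wise average of several tables of the same shape."""
    n = len(tables)
    return [[sum(t[i][j] for t in tables) / n for j in range(len(tables[0][0]))]
            for i in range(len(tables[0]))]


# Hypothetical tables: on server 1 the feature always indicates spam,
# on server 2 it always indicates non-spam.
c1 = [[0.5, 0.0], [0.0, 0.5]]
c2 = [[0.0, 0.5], [0.5, 0.0]]

ig1 = information_gain(c1)
ig2 = information_gain(c2)
ig_avg = information_gain(average_table([c1, c2]))
```

Here both local tables are perfectly informative, yet the averaged table has every cell equal to 0.25, so the feature and the class are independent and the information gain is zero.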

  12. Novel Geometric Approach • Each node holds a statistics vector • Coloring the vector space • Grey: function > threshold • White: function <= threshold • Goal: determine the color of the global data vector (average). • Observation: the average is in the convex hull of the drift vectors • If the convex hull is monochromatic, then the average has the same color • How do we know the convex hull is monochromatic? • Without global/central knowledge

  13. Distributively Bounding the Convex Hull • A reference point is known to all nodes • Each node constructs a ball • Theorem: the convex hull is bounded by the union of the balls
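The bounding theorem can be checked numerically. The sketch below assumes each node's ball is the one whose diameter is the segment from the shared reference point e to that node's drift vector u_i. A point p lies in such a ball exactly when (p - e)·(p - u_i) <= 0. The random data, the dimension, and the helper names are hypothetical.

```python
import random


def in_ball(p, e, u, tol=1e-9):
    """True if p lies in the ball whose diameter is the segment [e, u]."""
    return sum((pk - ek) * (pk - uk) for pk, ek, uk in zip(p, e, u)) <= tol


def random_convex_combination(points, rng):
    """A random point in the convex hull of the given points."""
    weights = [rng.random() for _ in points]
    total = sum(weights)
    dim = len(points[0])
    return [sum(w / total * pt[k] for w, pt in zip(weights, points))
            for k in range(dim)]


rng = random.Random(0)
e = [rng.uniform(-10, 10) for _ in range(3)]                 # reference point
drift_vectors = [[rng.uniform(-10, 10) for _ in range(3)] for _ in range(5)]

# Sample points from the convex hull of e and the drift vectors and check
# that each one is covered by at least one node's ball.
all_covered = True
for _ in range(10000):
    p = random_convex_combination([e] + drift_vectors, rng)
    if not any(in_ball(p, e, u) for u in drift_vectors):
        all_covered = False
        break
```

A sampling check like this is no substitute for the proof, but it shows what each node must test: only its own ball, built from information it already has.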

  14. Basic Algorithm • The monitored function and the threshold induce a coloring over the vector space • An initial estimate vector is calculated • Each node checks the color of its drift ball • The segment from the estimate vector to the drift vector is the diameter of the drift ball • If any ball is not monochromatic, synchronize the nodes
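The local test each node performs can be sketched as follows. Deciding whether a ball is monochromatic for an arbitrary function is an optimization problem in general. To keep the example exact, this sketch assumes a linear monitored function f(x) = w·x, whose minimum and maximum over a ball have a closed form. That is a simplification for illustration, not the general method. All names and values are hypothetical.

```python
import math


def drift_ball(e, u):
    """Ball whose diameter is the segment from estimate e to drift vector u."""
    center = [(a + b) / 2 for a, b in zip(e, u)]
    radius = math.dist(e, u) / 2
    return center, radius


def linear_range_on_ball(w, center, radius):
    """Exact (min, max) of f(x) = w . x over the ball."""
    mid = sum(wk * ck for wk, ck in zip(w, center))
    spread = radius * math.hypot(*w)
    return mid - spread, mid + spread


def is_monochromatic(w, threshold, e, u):
    """Grey means f > threshold, white means f <= threshold."""
    lo, hi = linear_range_on_ball(w, *drift_ball(e, u))
    return lo > threshold or hi <= threshold


w = [1.0, 1.0]          # hypothetical linear function f(x) = x1 + x2
threshold = 10.0
e = [2.0, 2.0]          # estimate vector, f(e) = 4 (white)

small_drift = is_monochromatic(w, threshold, e, [3.0, 3.0])  # ball stays white
large_drift = is_monochromatic(w, threshold, e, [8.0, 2.0])  # ball crosses
```

In the second case the drift vector itself is still white (f = 10), but its ball reaches into the grey region, so the node would request a synchronization.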

  15. Reuters Corpus (RCV1-v2) • 800,000+ news stories • Aug 20 1996 -- Aug 19 1997 • Corporate/Industrial tagging simulates spam • n = 10

  16. Trade-off: Accuracy vs. Performance • Inefficiency arises when the value of the function on the average is close to the threshold • Performance can be enhanced at the cost of a less accurate result: • Set an error margin around the threshold value

  17. Scalability • The number of messages per node is constant.

  18. Balancing • Calculating the global average is costly • It is often possible to average only some of the data vectors.

  19. Performance Analysis

  20. Performance Analysis (continued)

  21. Upper Bounds on Probability of Constraint Violation

  22. Tiered Sensor Networks • The network comprises two types of sensors: Macro-Nodes and Motes • Motes: • Simple, inexpensive sensing units • Based on 8-bit processors • Macro-Nodes: • Less resource constrained • Based on 32-bit processors • Support more advanced OS and development tools

  23. Monitoring Sensor Networks (1) • A spanning tree is constructed over the connectivity graph • The initial measurement vector is aggregated over the tree and flooded to all Motes • Motes use the aggregated vector as the estimate vector • An attempt is made to balance constraint violations within the cluster (intra-cluster balancing): • The Cluster Head iteratively selects motes and requests their drift vectors • Balancing succeeds if the average of the drift vectors collected from the motes creates a monochromatic ball with the estimate vector
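The intra-cluster balancing loop might look roughly like the sketch below. It assumes a linear monitored function f(x) = w·x so that the ball test is exact, and it assumes the Cluster Head polls motes in a fixed order. The presentation does not specify the selection order, so that choice, along with every name and value here, is hypothetical.

```python
import math


def ball_is_monochromatic(w, threshold, e, u):
    """Exact test for f(x) = w . x on the ball with diameter [e, u]."""
    center = [(a + b) / 2 for a, b in zip(e, u)]
    radius = math.dist(e, u) / 2
    mid = sum(wk * ck for wk, ck in zip(w, center))
    spread = radius * math.hypot(*w)
    return mid - spread > threshold or mid + spread <= threshold


def try_intra_cluster_balancing(w, threshold, e, violating_drift, other_drifts):
    """Collect drift vectors one mote at a time until the ball defined by the
    estimate vector and their average is monochromatic.

    Returns (average drift vector, number of motes involved) on success,
    or None if the cluster cannot balance the violation on its own.
    """
    collected = [violating_drift]
    for u in other_drifts:
        collected.append(u)
        avg = [sum(col) / len(collected) for col in zip(*collected)]
        if ball_is_monochromatic(w, threshold, e, avg):
            return avg, len(collected)
    return None


w, threshold, e = [1.0, 1.0], 10.0, [2.0, 2.0]
violating = [8.0, 2.0]   # this mote's own ball is not monochromatic

success = try_intra_cluster_balancing(w, threshold, e, violating,
                                      [[1.0, 1.0], [2.0, 3.0]])
failure = try_intra_cluster_balancing(w, threshold, e, violating,
                                      [[9.0, 1.0]])
```

When the function returns None, the violation would have to be handled outside the cluster.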

  24. Monitoring Sensor Networks (2) • If intra-cluster balancing fails, an attempt is made to balance the constraint violation by passing a token among the Cluster Heads (extra-cluster balancing): • The token consists of the average of the drift vectors held by the motes in the clusters the token has visited • Upon receipt of the token, the Cluster Head collects drift vectors from its motes and adds them to the token • If extra-cluster balancing fails, the vector held by the token is flooded to the motes, which use it as the new estimate vector

  25. Monitoring Sensor Networks (3) • Token traversal is implemented as a depth-first search (DFS) • Several tokens may simultaneously traverse the network, in which case they may be required to merge

  26. Data Set • A 144x36 grid of temperature readings in the northern hemisphere • Readings are taken every 6 hours for a period of a year • Strong spatial and temporal correlation among data readings • Average temperature ranges from -3.15 to 15 degrees Celsius

  27. Experimental Results - Threshold

  28. Experimental Results – Error Margin

  29. Experimental Results – Cluster Size

  30. Future Work • Monitoring Top-k items • Additional applications • Large scale networks

  31. Questions?

  32. Bounding Theorem – Proof (1)

  33. Bounding Theorem – Proof (2)

  34. Bounding Theorem – Proof (3)

  35. Bounding Theorem – Proof (4)

  36. Bounding Theorem – Proof (5)

  37. Window Size

  38. Simultaneous Features
