Monitoring Distributed Streams Presented by: Tsachi Sharfman Instructor: Assoc. Prof. Assaf Schuster Co-Instructor: Assoc. Prof. Daniel Keren, Haifa U. Israel Innovation Summit
Problem Definition • A set of distributed data streams • Mirrored web site • Distributed spam filtering system • A sensor network • A data vector is collected from each stream • Stream is infinite • Moving/jumping windows • Given: A function over the average of the data vectors • Given: A predetermined threshold • Question: did the function value cross the threshold?
Example 1: Web Page Frequency Counts • Mirrored web site • Each mirror maintains the frequency with which each page was accessed in the last 5 minutes • We would like to continuously maintain a list of the most frequently accessed web pages (as defined by a threshold)
Example 2: Air Quality Monitoring • Sensors monitor the concentration of air pollutants • Each sensor holds a data vector comprising the measured concentrations of various pollutants (CO2, SO2, O3, etc.) • A function of the average data vector determines the Air Quality Index (AQI) • Alert if the AQI exceeds a given threshold
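Example 3: Variance Alert • Sensors monitoring the temperature in a server room (machine room, conference room, etc.) • Ensure uniform temperature: monitor the variance of the readings • Alert if the variance exceeds a threshold • Temperature readings by n sensors: $x_1, \dots, x_n$ • Each sensor holds a data vector $v_i = (x_i^2, x_i)^T$ • The average data vector is $v = \frac{1}{n}\sum_{i=1}^{n} v_i = \left(\frac{1}{n}\sum_{i=1}^{n} x_i^2,\ \frac{1}{n}\sum_{i=1}^{n} x_i\right)^T$ • Var(all sensors) $= \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \left(\frac{1}{n}\sum_{i=1}^{n} x_i\right)^2$, i.e. the first component of $v$ minus the square of its second component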
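The variance example can be sketched in a few lines of Python. This is only an illustration: the function names and the sample readings are made up here, and the variance is taken to be the population variance, mean(x^2) - mean(x)^2.

```python
from statistics import pvariance


def local_vector(x):
    """Data vector held by one sensor for reading x: (x^2, x)."""
    return (x * x, x)


def average_vector(vectors):
    """Component-wise average of the sensors' data vectors."""
    n = len(vectors)
    return tuple(sum(v[k] for v in vectors) / n for k in range(2))


def variance_from_average(v):
    """Variance as a function of the average vector: v[0] - v[1]^2."""
    return v[0] - v[1] ** 2


# Hypothetical temperature readings from four sensors.
readings = [20.0, 22.0, 21.0, 25.0]
v = average_vector([local_vector(x) for x in readings])
var = variance_from_average(v)

# Cross-check against the standard library's population variance.
reference = pvariance(readings)
```

The point of the construction is that the variance, which is not an average of per-sensor values, becomes a function of the average of the per-sensor vectors.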
Previous Work • Focused on linear functions (e.g., sum, average): • M. Dilman and D. Raz. Efficient reactive monitoring. In INFOCOM, pages 1012–1019, 2001. • Previous solutions for arbitrary functions included only naïve algorithms • All data is moved to a central place • Communication overhead • CPU overhead • Power overhead • Privacy issues
Frequency Count Algorithm • Each mirror maintains a frequency count for each web page • One of the mirrors is designated as a coordinator. • Synchronization: upon initialization, and from time to time, as dictated by the algorithm: • Coordinator collects frequency counts from all mirrors • Calculates global frequency, called the estimate value and denoted by e • Sends estimate value to all mirrors
Solution (continued) • Per web page, each mirror maintains: • Reference value, v’: the frequency count collected by the coordinator during the last synchronization event • Estimate value, e: the last estimate value sent by the coordinator • Drift value, u = e + v - v’, where v is the mirror's current frequency count • Observation: the average of the drift values equals the average of the current frequency counts • Conclusion: as long as all the drift values are on the same side of the threshold as the estimate value, no communication is required
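A minimal sketch of the drift-value bookkeeping for a single web page follows. The function names and the numbers are illustrative assumptions, and `needs_sync` only expresses the local rule from the slide (a mirror triggers synchronization when its drift value is on the other side of the threshold from the estimate).

```python
def drift(e, v, v_ref):
    """Drift value u = e + v - v', with v the current local count."""
    return e + v - v_ref


def needs_sync(e, drifts, threshold):
    """True if some drift value is on the other side of the threshold from e."""
    estimate_side = e > threshold
    return any((u > threshold) != estimate_side for u in drifts)


# Hypothetical state after a synchronization event on three mirrors.
reference_counts = [10, 20, 30]                        # v' at each mirror
e = sum(reference_counts) / len(reference_counts)      # estimate value, 20.0
threshold = 30

# Counts have changed locally since the last synchronization.
counts = [12, 25, 29]
drifts = [drift(e, v, v_ref) for v, v_ref in zip(counts, reference_counts)]

# The observation from the slide: both averages coincide.
avg_drift = sum(drifts) / len(drifts)
avg_count = sum(counts) / len(counts)

quiet = needs_sync(e, drifts, threshold)               # no mirror crossed

# One mirror's count jumps, pushing its drift value across the threshold.
counts_after_jump = [12, 25, 45]
drifts_after_jump = [drift(e, v, v_ref)
                     for v, v_ref in zip(counts_after_jump, reference_counts)]
alarm = needs_sync(e, drifts_after_jump, threshold)
```

Because the average of the drift values equals the true average, an average on the far side of the threshold implies at least one drift value on the far side, so some mirror always notices.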
Example 4 (running example): Distributed Feature Selection • A distributed spam mail filtering system • Each mail server receives a stream of positive and negative examples • Select a set of features (words) to be used in order to build a spam classifier • A feature is good if its information gain is above a threshold
Distributed Calculation of Information Gain • Each server maintains a contingency table for each feature. • We would like to determine, for each feature, whether the information gain on the average contingency table is above the threshold.
Distributed Calculation of Information Gain (continued) • Note that the information gain on the average contingency table cannot be derived from the information gain on each individual contingency table! • IG(C1) = 1, IG(C2) = 1
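The following sketch illustrates why the local information gains say little about the global one. The two contingency tables are invented for illustration (rows: feature present/absent, columns: spam/non-spam); they are chosen so that each has an information gain of 1, as on the slide, but they are not claimed to be the tables shown there. Information gain is computed here as the mutual information between the feature and the class, in bits.

```python
from math import log2


def information_gain(table):
    """Mutual information (in bits) between rows and columns of a count table."""
    total = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    ig = 0.0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            if count > 0:  # empty cells contribute nothing
                p = count / total
                ig += p * log2(p * total * total / (row_sums[i] * col_sums[j]))
    return ig


def average_table(tables):
    """Cell-wise average of several contingency tables of equal shape."""
    n = len(tables)
    return [[sum(cells) / n for cells in zip(*rows)] for rows in zip(*tables)]


# Hypothetical tables: on server 1 the word appears only in spam,
# on server 2 it appears only in non-spam.
c1 = [[50, 0], [0, 50]]
c2 = [[0, 50], [50, 0]]

ig1 = information_gain(c1)
ig2 = information_gain(c2)
ig_avg = information_gain(average_table([c1, c2]))
```

With these tables both local gains are 1 while the gain on the average table is 0, so neither the local values nor their average reveals which side of the threshold the global value is on.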
Novel Geometric Approach • Each node holds a statistics vector • Coloring the vector space • Grey: function > threshold • White: function <= threshold • Goal: determine the color of the global data vector (the average) • Observation: the average is in the convex hull of the drift vectors • If the convex hull is monochromatic, then the average has the same color • How do we know the convex hull is monochromatic? • Without global/central knowledge
Distributively Bounding the Convex Hull • A reference point is known to all nodes • Each node constructs a ball • Theorem: the convex hull is bounded by (contained in) the union of the balls
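The theorem can be checked numerically on a small example. The sketch below assumes that each node's ball has the segment between the shared reference point and that node's drift vector as its diameter, and that the hull in question is the convex hull of the reference point and the drift vectors. The vectors, names, and sample count are illustrative, and random sampling is a sanity check, not a proof.

```python
import random
from math import dist


def drift_ball(e, u):
    """Ball whose diameter is the segment from reference point e to drift vector u."""
    center = tuple((a + b) / 2 for a, b in zip(e, u))
    return center, dist(e, u) / 2


def in_some_ball(p, balls, eps=1e-9):
    """True if point p lies in at least one ball (with a small tolerance)."""
    return any(dist(p, center) <= radius + eps for center, radius in balls)


def random_hull_point(vertices, rng):
    """Random convex combination of the given vertices."""
    weights = [rng.random() for _ in vertices]
    total = sum(weights)
    dim = len(vertices[0])
    return tuple(
        sum(w * v[k] for w, v in zip(weights, vertices)) / total
        for k in range(dim)
    )


# Hypothetical reference point and drift vectors of three nodes.
e = (0.0, 0.0)
drifts = [(4.0, 1.0), (-2.0, 3.0), (1.0, -5.0)]
balls = [drift_ball(e, u) for u in drifts]

rng = random.Random(0)
samples = [random_hull_point([e] + drifts, rng) for _ in range(10_000)]
all_covered = all(in_some_ball(p, balls) for p in samples)
```

Each ball depends only on the reference point and one node's own drift vector, which is what lets every node test its part of the hull without any global knowledge.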
Basic Algorithm • The monitored function and the threshold induce a coloring over the vector space • An initial estimate vector is calculated • Each node checks the color of its drift ball • The segment between the estimate vector and the node's drift vector is the diameter of the drift ball • If any ball is not monochromatic, synchronize the nodes
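A node's local check might look roughly like the sketch below. It approximates the monochromaticity test by sampling random points in the drift ball, which is a simplification chosen for illustration and not the method used in the talk; an exact test would need to bound the monitored function over the whole ball. The monitored function is the variance function from Example 3, and all names and numbers are assumptions.

```python
import random
from math import dist, sqrt


def sample_in_ball(center, radius, rng):
    """Uniform random point in the ball (Gaussian direction, scaled radius)."""
    g = [rng.gauss(0.0, 1.0) for _ in center]
    norm = sqrt(sum(x * x for x in g))
    scale = radius * rng.random() ** (1.0 / len(center)) / norm
    return tuple(c + scale * x for c, x in zip(center, g))


def ball_looks_monochromatic(f, threshold, e, u, rng, samples=2000):
    """Approximate check: do e, u, and sampled ball points share one color?"""
    color = f(e) > threshold
    if (f(u) > threshold) != color:
        return False
    center = tuple((a + b) / 2 for a, b in zip(e, u))
    radius = dist(e, u) / 2
    return all(
        (f(sample_in_ball(center, radius, rng)) > threshold) == color
        for _ in range(samples)
    )


def variance(v):
    """Monitored function from Example 3: v[0] - v[1]^2."""
    return v[0] - v[1] ** 2


rng = random.Random(1)
threshold = 2.0
e = (5.0, 2.0)                      # estimate vector, variance 1.0 (white)

small_drift = (5.1, 2.0)            # ball stays on the white side
stays_quiet = ball_looks_monochromatic(variance, threshold, e, small_drift, rng)

large_drift = (5.0, -2.0)           # endpoints white, but ball center is grey
must_sync = not ball_looks_monochromatic(variance, threshold, e, large_drift, rng)
```

The second case shows why checking the drift vector alone is not enough: both endpoints of the diameter are white, yet the ball contains grey points.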
Reuters Corpus (RCV1-v2) • 800,000+ news stories • Aug 20 1996 to Aug 19 1997 • Corporate/Industrial tagging simulates spam • n = 10
Trade-off: Accuracy vs. Performance • The algorithm is inefficient when the value of the function on the average vector is close to the threshold • Performance can be improved at the cost of a less accurate result: • Set an error margin around the threshold value
Scalability • The number of messages per node is constant.
Balancing • Globally calculating average is costly • Often possible to average only some of the data vectors.
Tiered Sensor Networks • Network comprised of two types of sensors, Macro-Nodes and Motes • Motes: • Simple, inexpensive sensing units • Based on 8-bit processors • Macro-Nodes: • Less resource constrained • Based on 32-bit processors • Support more advanced OS and development tools
Monitoring Sensor Networks (1) • A spanning tree is constructed over the connectivity graph • Initial measurement vector aggregated over the tree, and flooded to all Motes • Motes use aggregated vector as estimate vector • An attempt is made to balance constraint violations within the cluster (intra cluster balancing): • Cluster Head iteratively selects motes and requests their drift vectors • Balancing succeeds if the average of the drift vectors collected from motes creates a monochromatic ball with the estimate vector
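The intra-cluster balancing loop can be sketched in one dimension, where the ball between the estimate and the averaged drift value is just an interval, so it is monochromatic exactly when both endpoints are on the same side of the threshold. The function name, the order in which motes are polled, and the numbers are illustrative assumptions; the slides do not specify how the Cluster Head selects motes.

```python
def try_balance(e, threshold, violating_drift, other_drifts):
    """Collect drift values one by one until their average is balanced.

    Returns (average, number_of_motes_involved) on success, or None if the
    violation cannot be balanced with the available motes.
    """
    estimate_side = e > threshold
    collected = [violating_drift]
    for u in other_drifts:
        collected.append(u)
        avg = sum(collected) / len(collected)
        # 1-D case: the interval [e, avg] is monochromatic if and only if
        # avg is on the same side of the threshold as e.
        if (avg > threshold) == estimate_side:
            return avg, len(collected)
    return None


e = 20.0
threshold = 30.0

# A mote with drift value 35 violates its constraint; one neighbour suffices.
balanced = try_balance(e, threshold, 35.0, [22.0, 19.0])

# Here the only other mote is also above the threshold, so balancing fails
# and the violation would have to be handled beyond the cluster.
failed = try_balance(e, threshold, 35.0, [33.0])
```

Stopping as soon as the average is balanced keeps the number of motes contacted, and hence the number of messages, as small as the polling order allows.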
Monitoring Sensor Networks (2) • In case intra cluster balancing failed, an attempt is made to balance the constraint violation by passing a token among the Cluster Heads (extra cluster balancing): • The token consists of the average of the drift vectors held by the motes in the clusters the token has visited • Upon receipt of token, the Cluster Head collects drift vectors from motes, and adds them to the token • In case extra cluster balancing has failed, the vector held by the token is flooded to the motes, which use it as the new estimate vector
Monitoring Sensor Networks (3) • Token traversal is implemented as a depth-first search (DFS) • Several tokens may simultaneously traverse the network, in which case they may be required to merge
Data Set • A 144x36 grid of temperature readings in the northern hemisphere • Readings are taken every 6 hours over a period of one year • Strong spatial and temporal correlation among the readings • Average temperature ranges from -3.15 to 15 degrees Centigrade
Future Work • Monitoring Top-k items • Additional applications • Large scale networks