Monitoring Distributed Data Streams. Assaf Schuster, Technion. A line of work joint with Tsachi Scharfmann (Technion) and Daniel Keren (Haifa U.). Israel Innovation Summit
The Distributed Systems Laboratory http://dsl.cs.technion.ac.il • High-performance clusters (DSM, Infiniband) • Grid: research, development (Condor), production systems (Superlink, EGEE) • Mobile, ad-hoc, wireless networks • Sensor networks • Peer-to-peer systems • Knowledge extraction from distributed data/streams • In core parallelism, multicore, multithreading, programming paradigms, debugging Sponsors: EC, MOS, ISF, IDF, TAMAS, Intel, IBM, Microsoft, France Telecom, Voltaire, Mellanox, others
Today’s Problem Definition • A set of distributed data streams • Example: a sensor network • A data vector is collected from each stream • Stream is infinite • Moving/jumping windows • Given: A function over the average of the data vectors • Given: A predetermined threshold • Question: did the function value cross the threshold? • Example 1: counting, frequency count, average (e.g. temperature) • sum over all data elements and all streams
Example 2: Monitoring Air Quality • Sensors monitoring the concentration of air pollutants. • Each sensor holds a data vector with the measured concentration of pollutants (CO2, SO2, O3, etc.). • A function on the average data vector determines the Air Quality Index (AQI) • Alert in case the AQI exceeds a given threshold.
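Example 3: Variance Alert • Sensors monitoring the temperature in a server room (machine room, conference room, etc.) • Ensure uniform temperature: monitor the variance of the readings • Alert in case the variance exceeds a threshold • Temperature readings by n sensors: $x_1, \dots, x_n$ • Each sensor holds a data vector $v_i = (x_i^2, x_i)^T$ • The average data vector is $v = \frac{1}{n}\sum_{i=1}^{n} v_i = \left(\frac{1}{n}\sum_{i=1}^{n} x_i^2,\ \frac{1}{n}\sum_{i=1}^{n} x_i\right)^T$ • Var(all sensors) $= \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \left(\frac{1}{n}\sum_{i=1}^{n} x_i\right)^2$, i.e. the first component of $v$ minus the square of the second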
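The variance example can be sketched in a few lines of Python. This is only an illustration: the readings and the threshold are made-up values, and the helper names `local_vector` and `variance_from_average` are hypothetical, not taken from the talk.

```python
import statistics

def local_vector(x):
    """Data vector held by one sensor: (x^2, x)."""
    return (x * x, x)

def average_vector(vectors):
    """Component-wise average of the sensors' data vectors."""
    n = len(vectors)
    return tuple(sum(v[d] for v in vectors) / n for d in range(len(vectors[0])))

def variance_from_average(v):
    """Variance as a function of the average vector v = (mean of x^2, mean of x)."""
    return v[0] - v[1] ** 2

readings = [20.0, 21.0, 19.5, 23.0]   # hypothetical temperature readings
threshold = 2.0                       # hypothetical alert threshold

v = average_vector([local_vector(x) for x in readings])
var = variance_from_average(v)
alert = var > threshold

# Cross-check against the standard-library population variance.
reference_var = statistics.pvariance(readings)
```

The point of the construction is that the monitored quantity depends on the sensors only through the average of their local vectors, which is the form the threshold question assumes.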
Example 4 (running example): Distributed Feature Selection • A collaborative distributed spam mail filtering system. • A mail server receives a stream of positive and negative examples. • Select a set of features (words) to be used in order to build a spam classifier. • A feature is good if its information gain is above a threshold.
Distributed Calculation of Info Gain? • Each server maintains a local contingency table for each feature. • Is the info gain on the global contingency table above the threshold? • The information gain of the average contingency table cannot be derived from the information gains of the individual tables. [Figure: local contingency tables C1 and C2, labeled IG(C1)=1 and IG(C2)=1]
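A small Python sketch may help show why the local information gains are not enough. The tables below are invented for illustration (they are not the ones from the slide's figure), and `info_gain` is a hypothetical helper that computes the standard information gain of a feature-by-class count table.

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * log2(p) for p in probs if p > 0)

def info_gain(table):
    """table[i][j] = number of mails with feature value i and class j."""
    total = sum(sum(row) for row in table)
    class_totals = [sum(col) for col in zip(*table)]
    h_class = entropy([c / total for c in class_totals])
    h_cond = 0.0
    for row in table:
        row_total = sum(row)
        if row_total:
            h_cond += (row_total / total) * entropy([c / row_total for c in row])
    return h_class - h_cond

def average_table(a, b):
    """Cell-wise average of two contingency tables."""
    return [[(x + y) / 2 for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

# Hypothetical local tables: the feature predicts spam perfectly on each
# server, but in opposite directions.
c1 = [[50, 0], [0, 50]]
c2 = [[0, 50], [50, 0]]

ig_c1 = info_gain(c1)
ig_c2 = info_gain(c2)
ig_opposed = info_gain(average_table(c1, c2))   # tables cancel out
ig_aligned = info_gain(average_table(c1, c1))   # tables agree
```

Both pairs of servers report a local information gain of 1, yet the gain of the averaged table is 0 in one case and 1 in the other, so no function of the local gains alone can answer the global question.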
Previous Solutions • Naïve Algorithms • All data is moved to a central place • Communication overhead • CPU overhead • Power overhead • Privacy issues • Can we do better?
A Novel Geometric Approach [Forthcoming SIGMOD 2006] • Coloring the vector space • Grey: function > threshold • White: function <= threshold • Goal: determine the color of the global data vector (the average). • Observation: the average lies in the convex hull of the streams' data vectors • If the convex hull is monochromatic, then the average has the same color • How do we know that the convex hull is monochromatic? • Without global/central knowledge
Distributively Bounding the Convex Hull • A reference point is known to all streams • Each stream constructs a ball • Theorem: the convex hull is bounded by (contained in) the union of the balls
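The covering claim can be checked numerically with a short Python sketch. It assumes that each stream's ball has the segment from the shared reference point to that stream's local point as its diameter, and that the convex hull in question includes the reference point; both are readings of the slides rather than statements quoted from them. The points are random and purely illustrative.

```python
import random
from math import dist

def ball(reference, local_point):
    """Ball whose diameter is the segment from reference to local_point."""
    center = [(r + p) / 2 for r, p in zip(reference, local_point)]
    radius = dist(reference, local_point) / 2
    return center, radius

def in_some_ball(point, balls, eps=1e-9):
    """True if the point lies in at least one of the balls."""
    return any(dist(point, c) <= r + eps for c, r in balls)

def random_hull_point(points, rng):
    """Random convex combination of the given points."""
    weights = [rng.random() for _ in points]
    total = sum(weights)
    dim = len(points[0])
    return [sum(w * p[d] for w, p in zip(weights, points)) / total
            for d in range(dim)]

rng = random.Random(0)
reference = [0.0, 0.0]
local_points = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(5)]
balls = [ball(reference, p) for p in local_points]

# Sample the convex hull of the reference and the local points and check
# that every sample is covered by at least one ball.
samples = [random_hull_point([reference] + local_points, rng) for _ in range(1000)]
all_covered = all(in_some_ball(q, balls) for q in samples)
```

Each ball is built from the reference point and one local point only, which is why a stream can test its own ball without seeing any other stream's data.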
Basic Algorithm • Set reference point = initial average • Drift: the difference between the current local data and the reference • the drift is the diameter of the ball • If a ball becomes non-monochromatic, recalculate the average • it is used as the new reference • drifts become zero
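The steps above can be sketched as a small single-process simulation in Python. Several things here are assumptions made for illustration: the monitored function is the Euclidean norm (chosen because its range over a ball has a closed form), the threshold is arbitrary, drift is interpreted as the change in a stream's local vector since the last recalculation, and the `Monitor` class and its methods are hypothetical names.

```python
from math import dist, hypot

THRESHOLD = 5.0  # hypothetical threshold on f(v) = ||v||

def f_range_on_ball(center, radius):
    """Min and max of f(v) = ||v|| over a ball (closed form for this f only)."""
    norm = hypot(*center)
    return max(0.0, norm - radius), norm + radius

def ball_is_monochromatic(reference, drifted):
    """Ball with the segment reference..drifted as its diameter is one color."""
    center = [(r + p) / 2 for r, p in zip(reference, drifted)]
    radius = dist(reference, drifted) / 2
    f_min, f_max = f_range_on_ball(center, radius)
    return f_max <= THRESHOLD or f_min > THRESHOLD

def average(vectors):
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

class Monitor:
    def __init__(self, initial_vectors):
        self.snapshot = [list(v) for v in initial_vectors]  # local data at last sync
        self.current = [list(v) for v in initial_vectors]
        self.reference = average(initial_vectors)           # initial average
        self.syncs = 0

    def update(self, i, new_vector):
        """Stream i sees new local data and tests only its own ball."""
        self.current[i] = list(new_vector)
        drift = [c - s for c, s in zip(self.current[i], self.snapshot[i])]
        drifted = [r + d for r, d in zip(self.reference, drift)]
        if not ball_is_monochromatic(self.reference, drifted):
            self.sync()

    def sync(self):
        """Recalculate the average; it becomes the new reference, drifts reset."""
        self.reference = average(self.current)
        self.snapshot = [list(v) for v in self.current]
        self.syncs += 1

    def above_threshold(self):
        """While all balls are monochromatic, the reference has the right color."""
        return hypot(*self.reference) > THRESHOLD

m = Monitor([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
m.update(0, [1.5, 0.0])            # small drift: no communication needed
syncs_after_small = m.syncs
above_after_small = m.above_threshold()
m.update(0, [20.0, 0.0])           # large drift: ball straddles the threshold
syncs_after_large = m.syncs
above_after_large = m.above_threshold()
```

In a real deployment the `sync` step is the only point that requires communication; the sketch collapses it into a local call.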
Reuters Corpus (RCV1-v2) • 800,000+ news stories • Aug 20 1996 -- Aug 19 1997 • Identify corporate/industrial stories n=10
Trade-off: Accuracy vs. Performance • Inefficiency: arises when the value of the function on the average is close to the threshold • Performance can be enhanced at the cost of a less accurate result: • Set an error margin around the threshold value
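One way to picture the error margin is as a relaxed ball test, sketched below in Python. This is an assumed formulation, not necessarily the one used in the talk: `needs_sync` is a hypothetical helper that takes the minimum and maximum of the function over a ball and treats values within the margin of the threshold as acceptable for either color.

```python
def needs_sync(f_min, f_max, threshold, margin=0.0):
    """Decide whether a ball forces a recalculation of the average.

    With margin == 0 the ball must lie entirely at or below the threshold,
    or entirely above it. With margin > 0, values within the margin of the
    threshold are accepted as either color, so fewer balls trigger a sync.
    """
    all_white = f_max <= threshold + margin
    all_grey = f_min > threshold - margin
    return not (all_white or all_grey)

# Hypothetical ball whose function values hover around the threshold.
exact = needs_sync(4.8, 5.1, threshold=5.0)                # strict test
relaxed = needs_sync(4.8, 5.1, threshold=5.0, margin=0.2)  # tolerant test
wide = needs_sync(3.0, 7.0, threshold=5.0, margin=0.2)     # still straddles
```

The relaxed test saves communication exactly in the case the slide calls inefficient, at the price of possibly reporting the wrong color when the true value is within the margin.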
Balancing • Globally calculating the average is costly • It is often possible to average only some of the data vectors.
Scalability • The number of messages per stream is constant.
Questions?
Window Size
Simultaneous Features