Probabilistic Data Aggregation
Ling Huang, Ben Zhao, Anthony Joseph, and John Kubiatowicz
{hling, ravenben, adj, kubitron}@eecs.berkeley.edu
June 2004
Background
[Figure: example aggregation tree with nodes A–E and per-link values]
• Aggregate functions
  • MIN, MAX, AVG, COUNT, etc.
• In-network hierarchical processing (sketched below)
  • Query propagation
  • Tree construction
  • Aggregates computed epoch by epoch
• Addressing fault tolerance
  • Multi-root
  • Multi-tree
  • Reliable transmission
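A minimal sketch of the in-network processing idea, assuming a static tree and a single epoch; the tree shape and readings below are illustrative, not taken from the slides' figure.

```python
# Minimal sketch of one epoch of in-network aggregation over a static tree.
# Each node combines its own reading with the partial aggregates of its
# children, so only one (partial) result travels up each link.
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    value: float                       # local reading for this epoch
    children: list = field(default_factory=list)

def aggregate(node, op):
    partials = [aggregate(c, op) for c in node.children]
    if op == "MAX":
        return max([node.value] + partials)
    if op == "COUNT":
        return 1 + sum(partials)
    if op == "SUM":                    # AVG = SUM / COUNT, both computed in-network
        return node.value + sum(partials)
    raise ValueError(op)

# Hypothetical tree loosely mirroring the figure's nodes A-E:
root = Node("A", 5, [Node("B", 1, [Node("D", 2), Node("E", 1)]), Node("C", 3)])
print(aggregate(root, "MAX"))                              # 5
print(aggregate(root, "SUM") / aggregate(root, "COUNT"))   # AVG = 2.4
```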
Motivation
• Data aggregation is an important function for all network infrastructures
  • Exact results are not achievable in the face of loss and faults
  • Adding fault tolerance is costly
• Accurate approximation with low communication overhead is crucial in
  • Sensor networks
  • P2P networks
  • Network monitoring and intrusion detection systems
• But it is difficult to achieve
  • Many problems in existing approaches
Observation: Comparison of Data Streams
[Figure: three real-world data traces and a random trace]
Statistical Properties of Data Streams
The relative increment is defined as r_t = (x_t − x_{t−1}) / x_{t−1}.
There is temporal correlation in real data streams, which we can leverage to maintain aggregate accuracy while reducing communication overhead and recovering from data loss.
[Figure: density estimation of the relative increment]
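As a concrete illustration of the definition above, the following sketch computes relative increments of a trace and estimates their density with a normalized histogram; the trace is synthetic stand-in data, since the real measurement traces are not reproduced here.

```python
# Sketch: relative increments r_t = (x_t - x_{t-1}) / x_{t-1} of a trace, plus
# a normalized-histogram density estimate. The trace is synthetic stand-in data.
import numpy as np

rng = np.random.default_rng(0)
x = 100 + np.cumsum(rng.normal(0, 0.5, 1000))   # synthetic, temporally correlated trace

r = np.diff(x) / x[:-1]                         # relative increments
density, edges = np.histogram(r, bins=50, density=True)
print(f"increment mean={r.mean():.5f}, std={r.std():.5f}")
```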
Problems in Existing Approaches
• Few approaches exploit temporal properties or are designed to handle data loss and corruption
  • Simple last-value algorithm for data loss recovery in TAG
  • Multi-root/multi-tree schemes make things worse by consuming more resources
  • Fragile for large process groups
    • Need all relevant nodes to participate
• Difficult to trade accuracy for communication overhead
  • Good applications need this tradeoff
    • Only need approximation
    • But want to minimize resource consumption
  • Centralized adaptive-filtering solution proposed by Olston et al.
Our Approach
• Probabilistic data aggregation: a scalable and robust approach
• Exploit and leverage statistical properties of the data stream in the temporal domain
• Apply statistical algorithms to data aggregation
• Develop a protocol that handles loss and failures as an essential part of normal operation
  • Nodes participate in aggregation and communication according to a statistical sampling algorithm
  • In the absence of data, estimate values using time-series algorithms
  • Differentiate between voluntary and involuntary loss
Reducing Communication Overhead
• Trade off accuracy against resource consumption
• Allow selective participation of nodes while maintaining aggregate accuracy
  • A node participates in an operation with a certain probability, which is the design parameter of the algorithm
• Sampling strategies (sketched below):
  • Uniform sampling: all nodes use the identical sampling rate
  • Subtree-size based sampling: a node's sampling rate is proportional to the size of its subtree
  • Variance based sampling: a sensor only reports a new value if it deviates from its last reported value by more than a threshold percentage
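The three strategies can be sketched as per-node participation tests; the function names and the default threshold are illustrative, not the slides' parameters.

```python
# Sketch of the three participation rules; p, k, and threshold are
# illustrative design parameters.
import random

def uniform_sample(p):
    """Uniform: every node reports with the same probability p."""
    return random.random() < p

def subtree_sample(subtree_size, k):
    """Subtree-size based: sampling rate proportional to subtree size,
    so nodes carrying larger partial aggregates report more often."""
    return random.random() < min(1.0, k * subtree_size)

def variance_sample(value, last_reported, threshold=0.05):
    """Variance based: report only if the new value deviates from the
    last reported value by more than a threshold fraction."""
    return abs(value - last_reported) > threshold * abs(last_reported)
```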
Performance of Sampling Algorithms
[Figures: MAX operation and AVG operation]
• As fewer nodes participate, overall accuracy decreases for all algorithms
• Uniform sampling performs worst
• Variance based sampling is most accurate
Observation: Long-Term Pattern in Data
[Figure: daily patterns in a weekly data stream]
Data source: bandwidth measurements for the CUDI network interface on an Abilene router, 5-minute averages.
Two-Level Representation of Data
[Figure: Monday data decomposed into two layers]
The data stream can be decomposed into two layers: the long-term trend (pattern), which changes slowly, and the residual, which is low-amplitude and high-frequency.
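A sketch of this decomposition using a smoothed least-squares spline for the trend; the smoothing factor and the trace itself are illustrative choices, since the slides' exact fitting parameters are not given.

```python
# Sketch: split a trace into a slow trend (smoothed B-spline fit) and a
# high-frequency residual. Smoothing factor and trace are illustrative.
import numpy as np
from scipy.interpolate import splrep, splev

t = np.arange(288.0)                               # one day of 5-minute samples
rng = np.random.default_rng(1)
trace = 50 + 20 * np.sin(2 * np.pi * t / 288) + rng.normal(0, 2, 288)

tck = splrep(t, trace, s=288 * 4)                  # smoothed cubic B-spline fit
trend = splev(t, tck)                              # level 1: long-term pattern
residual = trace - trend                           # level 2: low-amplitude, high-frequency
```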
Algorithms for Recovering From Loss
• Traditional approaches (sketched below)
  • Last seen value as approximation for the current epoch
  • Linear prediction
• Two-level data representation and prediction
  • Long-term trend: B-spline estimation
  • High-frequency residual: ARMA modeling
• Statistical estimation: ARMA model for chaotic stationary data
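For reference, the two traditional estimators in the list above amount to one-liners; this is a sketch, and a real implementation would also handle histories shorter than two epochs.

```python
# Sketch of the baseline recovery estimators for a missing epoch.
def last_value(history):
    """Repeat the most recent reported value."""
    return history[-1]

def linear_prediction(history):
    """Extrapolate linearly from the last two reported values."""
    return history[-1] + (history[-1] - history[-2])
```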
The Danger of Prediction
[Figures: prediction with vs. without statistical calibration]
Two-Level Data Prediction
• B-spline modeling for the long-term trend
  • Piecewise-continuous, low-degree B-splines can represent complex shapes
  • Least-squares B-spline regression for the two-level decomposition
  • B-spline extension for future forecasting
• ARMA forecasting for transient oscillation (see the sketch below)
  • System identification to determine the order (p, q)
  • Parameter estimation
  • Low-complexity recursive equation for future forecasting
  • Statistical properties for calibration of prediction results
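A sketch of the ARMA step on the detrended residual, using statsmodels as a stand-in implementation; the order (p=2, q=1) is fixed for brevity where the slides call for system identification, and the residual series here is synthetic.

```python
# Sketch: ARMA forecast of the high-frequency residual, plus a confidence
# band usable for calibrating the prediction. Order (p=2, q=1) is assumed.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

residual = np.random.default_rng(2).normal(0, 1, 200)  # stand-in detrended residual
fit = ARIMA(residual, order=(2, 0, 1)).fit()           # ARMA(2, 1): d = 0
next_residual = fit.forecast(steps=1)[0]               # next-epoch residual estimate
band = fit.get_forecast(steps=1).conf_int(alpha=0.05)  # calibration interval

# Full two-level prediction: extend the B-spline trend one epoch ahead and
# add the forecast residual: x_hat = trend_forecast + next_residual
```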
Performance of Prediction Algorithms
[Figures: uniform sampling rate and subtree-size sampling rate]
Performance of prediction algorithms in a lossless environment.
Performance of Prediction Algorithms
[Figures: uniform sampling rate and subtree-size sampling rate]
Performance of prediction algorithms in lossy environments. The average loss rate of the network is 20%; the ratio of loss rates between wide-area links and local links is 3:1.
Summary of Results
• All prediction algorithms are effective in improving the accuracy of aggregation results
• The two-level prediction approach performs best in all situations
  • Achieves more than 90% accuracy even with per-node non-participation rates up to 60%
Conclusion and Future Work
• Apply statistical algorithms to data aggregation systems
  • Quantify the statistical properties of real-world measurement data
  • Propose probabilistic participation of nodes according to user-defined policies
  • Propose a multi-level prediction mechanism to recover from sampling and data loss
• Uniqueness: multi-level prediction enables high accuracy even under high loss and voluntary non-participation
• Future work
  • Develop online algorithms and explore the tradeoff between prediction accuracy and computation/storage cost
  • Build real systems for applications such as health monitoring, traffic measurement, and router statistics
  • Real-system implementation and deployment