Probabilistic Data Aggregation

Ling Huang, Ben Zhao, Anthony Joseph
Sahara Retreat, January 2004

Motivation
• Definition of data aggregation
• An important function for network infrastructures
• Exact results are not achievable in the face of loss and faults
• Low-overhead, accurate approximation is crucial in
  • Sensor networks
  • P2P networks
  • Network monitoring and intrusion detection systems
• But it is difficult to achieve
• Many problems in existing approaches

Background
• Aggregate functions
  • MIN, MAX, AVG, COUNT, etc.
• In-network hierarchical processing
  • Reduce overhead
  • Query propagation
  • Tree construction
  • Aggregate calculation
• Addressing fault tolerance
  • Multi-root
  • Multi-tree
  • Reliable transmission
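The in-network hierarchical processing listed above can be sketched as a bottom-up merge of partial aggregates. This is a minimal illustration rather than the authors' implementation: `Partial`, `Node`, and `aggregate` are hypothetical names, and carrying (min, max, sum, count) is one common way to keep averages composable across subtrees.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Partial:
    """Composable partial aggregate for one subtree (hypothetical helper)."""
    minimum: float
    maximum: float
    total: float
    count: int

    def merge(self, other: "Partial") -> "Partial":
        # Merging two subtrees needs only their partial states, not raw readings.
        return Partial(
            min(self.minimum, other.minimum),
            max(self.maximum, other.maximum),
            self.total + other.total,
            self.count + other.count,
        )

    @property
    def average(self) -> float:
        return self.total / self.count


@dataclass
class Node:
    value: float
    children: List["Node"] = field(default_factory=list)


def aggregate(node: Node) -> Partial:
    """Combine a node's own reading with its children's partial aggregates."""
    state = Partial(node.value, node.value, node.value, 1)
    for child in node.children:
        state = state.merge(aggregate(child))
    return state


if __name__ == "__main__":
    tree = Node(5, [Node(1, [Node(3), Node(1)]), Node(2, [Node(2)])])
    result = aggregate(tree)
    print(result.minimum, result.maximum, result.count, result.average)
```

Each parent forwards a single fixed-size `Partial` regardless of subtree size, which is where the overhead reduction comes from.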

Problems in Existing Approaches
• Few approaches are designed to handle data loss and corruption
  • Simple algorithms for data loss recovery
  • Fragile for large process groups
  • Need all relevant nodes to participate
• Difficult to trade accuracy for communication overhead
  • Good applications need this tradeoff
  • Only need an approximation
  • But must minimize resource consumption

Our Approach
• Probabilistic data aggregation: a scalable and robust approach
  • Model loss on links and failures on nodes
  • Apply statistical learning theory (SLT) to aggregation
• Develop a protocol that handles loss and failures as an essential part of normal operation
  • Self-repairing algorithm for aggregation tree maintenance
  • Nodes participate in aggregation and communication according to a statistical sampling algorithm
  • In the absence of data, estimate the value using a statistical learning algorithm

Design & System Architecture
[Figure: system architecture with components Tree Constructor, Sampler, Data Predictor, Distribution Estimator, Aggregator]
• Building blocks
  • Spanning tree with a fault-detection and self-repairing algorithm for tree construction and maintenance
  • Statistical sampling for low overhead and scalability without much loss of accuracy
  • Distribution estimation to provide information for workload analysis, data prediction, and outlier detection
  • Data prediction to compensate for data lost through sampling, as well as uncontrolled loss on links
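One way a distribution estimator could feed outlier detection is by tracking a running mean and variance per child. The sketch below is an assumption, not something the slides specify: it uses Welford's online algorithm, and `RunningStats` and `z_score` are hypothetical names.

```python
import math


class RunningStats:
    """Online mean and sample variance via Welford's algorithm (illustrative)."""

    def __init__(self) -> None:
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # Sample variance; defined as 0.0 until two values have been seen.
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0

    def z_score(self, x: float) -> float:
        """Distance of x from the mean in standard deviations (0.0 if undefined)."""
        std = math.sqrt(self.variance)
        return (x - self.mean) / std if std > 0 else 0.0


if __name__ == "__main__":
    stats = RunningStats()
    for reading in [2, 4, 4, 4, 5, 5, 7, 9]:
        stats.update(reading)
    print(stats.mean, stats.variance, stats.z_score(20))
```

The estimator stores three numbers per child, so it fits the low-overhead goal; a parent could treat a large `z_score` as a hint for closer inspection rather than as grounds for discarding the value.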

Statistical Sampling
• A simple approach: sampling on the aggregation tree
  • Every child node reports the aggregation result of its subtree to its parent with a certain probability, which is the design parameter of the algorithm
  • Low overhead in control traffic and easy to implement
  • Might result in high data loss close to the root
• Distribution of the sampling rate on the tree
  • Uniform distribution on each level
  • Linear distribution on each level
  • Proportional to the number of nodes in its subtree
  • Value-based sampling
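The simple sampling scheme can be sketched as follows, assuming a SUM aggregate and a single reporting probability `p` shared by every node. `Node` and `sampled_sum` are hypothetical names, and the uniform `p` is only one of the rate distributions listed above.

```python
import random
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    value: float
    children: List["Node"] = field(default_factory=list)


def sampled_sum(node: Node, p: float, rng: random.Random) -> float:
    """SUM over a subtree where each child reports to its parent with probability p."""
    total = node.value
    for child in node.children:
        child_total = sampled_sum(child, p, rng)
        # The child's whole subtree aggregate is either reported or dropped.
        if rng.random() < p:
            total += child_total
    return total


if __name__ == "__main__":
    tree = Node(5, [Node(1, [Node(3), Node(1)]), Node(2, [Node(2)])])
    rng = random.Random(0)
    print(sampled_sum(tree, 1.0, rng))  # every report delivered: exact sum
    print(sampled_sum(tree, 0.5, rng))  # some subtrees silently missing
```

Because a dropped report removes an entire subtree, one miss directly under the root costs far more data than a miss at a leaf, which is the concern noted in the slide.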

Prediction Algorithm
• Naive algorithm: use the value from the previous epoch as the current one
• Linear prediction: a linear algorithm with minimum mean square error (MMSE) estimation
  [Equation and its "Where:" symbol definitions are missing from the source]
• More sophisticated algorithms, such as the Kalman filter, can be used to achieve better prediction results
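The two predictors can be illustrated with a short sketch. The slide's own equation is not available here, so the linear predictor below is an assumption: a first-order model x_t ≈ a·x_{t-1} + b whose coefficients minimize the mean square error over the observed history. `naive_predict` and `linear_predict` are hypothetical names.

```python
from typing import Sequence


def naive_predict(history: Sequence[float]) -> float:
    """Use the previous epoch's value as the prediction for the current epoch."""
    return history[-1]


def linear_predict(history: Sequence[float]) -> float:
    """Predict the next value from a least-squares fit of x_t = a * x_{t-1} + b."""
    if len(history) < 3:
        return naive_predict(history)  # too little data to fit two coefficients
    xs = history[:-1]  # previous-epoch values
    ys = history[1:]   # the values that followed them
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        return naive_predict(history)  # constant history: slope is undefined
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = cov_xy / var_x
    b = mean_y - a * mean_x
    return a * history[-1] + b


if __name__ == "__main__":
    readings = [1.0, 2.0, 3.0, 4.0, 5.0]
    print(naive_predict(readings), linear_predict(readings))
```

On a steadily rising series the naive predictor lags by one step while the linear fit follows the trend; on noisy or flat data the two tend to behave similarly.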

The Protocol
• Tree construction and query propagation start from the root of the query
• Aggregates are computed in each epoch from the bottom up
  • When a node receives data from a child, it updates the distribution statistics using the distribution estimator
  • If a node receives data from all its children in the epoch, it performs a normal data aggregation
  • If a node has not received data from a child by the end of the epoch, it uses data prediction to estimate a value, and then performs the aggregation
  • Aggregates are reported from children to parents with a certain probability
• If necessary, a node might perform outlier detection on the data from a child. However:
  • It is very dangerous to discard data
  • Assuming neighboring nodes have physical locality, a parent can use both temporal and spatial statistics to do the outlier detection
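A single node's per-epoch behavior might look like the sketch below. It is illustrative only and makes several assumptions: the aggregate is SUM, prediction is the naive previous-epoch rule, a child that has never reported counts as 0.0, and `AggregatorNode` and `end_of_epoch` are hypothetical names.

```python
import random
from typing import Dict, Iterable, Optional


class AggregatorNode:
    """Per-epoch logic for one node computing a SUM aggregate (illustrative)."""

    def __init__(self, own_value: float, child_ids: Iterable[str],
                 report_prob: float = 1.0) -> None:
        self.own_value = own_value
        self.report_prob = report_prob
        # Last value seen per child; assumed 0.0 until the child first reports.
        self.last_seen: Dict[str, float] = {c: 0.0 for c in child_ids}

    def end_of_epoch(self, received: Dict[str, float],
                     rng: random.Random) -> Optional[float]:
        """Aggregate this epoch's data; return it, or None if no report is sent."""
        total = self.own_value
        for child in self.last_seen:
            if child in received:
                # Data arrived: refresh the per-child statistics.
                self.last_seen[child] = received[child]
            # Otherwise the stale entry acts as the predicted value.
            total += self.last_seen[child]
        # Report to the parent only with the configured probability.
        return total if rng.random() < self.report_prob else None


if __name__ == "__main__":
    rng = random.Random(0)
    node = AggregatorNode(own_value=1.0, child_ids=["a", "b"])
    print(node.end_of_epoch({"a": 3.0, "b": 4.0}, rng))  # all children reported
    print(node.end_of_epoch({"a": 5.0}, rng))            # "b" missing, predicted
```

A fuller version would replace the naive rule with a better predictor and consult the distribution statistics before accepting a suspicious value.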

Experimental Results

Future Work
• Integrated optimization by combining tree construction with statistical learning theory
  • Sampling on the graph before tree construction
• Non-linear estimation algorithms for data prediction
• Evaluation of the outlier detector in data aggregation
• System implementation
• System deployment and evaluation in a real environment
