Multisite Internet Data Analysis
Alfred O. Hero, Clyde Shih, David Barsic
University of Michigan - Ann Arbor
hero@eecs.umich.edu | http://www.eecs.umich.edu/~hero
• Network Data Collection
• Distributed Data Analysis
• Dimension Reduction
• Model-Based Data Analysis
• Conclusions
Research supported in part by: NSF CCR-0325571
1. Network Data Collection
• Objectives
  • Global: monitoring centers aggregate statistics from sites distributed around the network to detect, classify, or estimate global network state while ensuring information privacy constraints
  • Local: collection sites gather data relevant to local network state and share information as necessary to enhance local analysis
• Types of data measured
  • Active: queries and requests, packet probes
  • Passive: netflow, router fields, honeypots, backscatter
[Diagram: a monitoring center linked to local data collection and probing sites within ISP 1, ISP 2, and ISP 3; dots mark individual data collectors]
Abilene Netflow Data (by protocol)
[Table: per-protocol flow statistics for Datasets 1 and 2: No. Flows, Avg./Std. Duration, Avg./Std. Packets, Avg./Std. Bytes]
Abilene Netflow Data (by router)
[Table: per-router flow statistics for Datasets 1 and 2: No. Flows, Avg./Std. Duration, Avg./Std. Packets, Avg./Std. Bytes]
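The per-protocol and per-router summaries above are first and second moments of the flow attributes. A minimal sketch of how they could be computed from raw flow records (the FlowRecord container and its field names are illustrative, not the actual Abilene export format):

```python
from dataclasses import dataclass
from collections import defaultdict
from statistics import mean, pstdev

@dataclass
class FlowRecord:
    router: str      # exporting router (illustrative field)
    protocol: str    # e.g. "TCP", "UDP", "ICMP"
    duration: float  # flow duration in seconds
    packets: int
    bytes: int

def summarize(flows, key):
    """Group flows by `key` ("router" or "protocol") and return the
    statistics in the tables above: flow count, mean/std of duration,
    packets, and bytes per group."""
    groups = defaultdict(list)
    for f in flows:
        groups[getattr(f, key)].append(f)
    table = {}
    for k, recs in groups.items():
        table[k] = {
            "num_flows": len(recs),
            "avg_duration": mean(r.duration for r in recs),
            "std_duration": pstdev(r.duration for r in recs),
            "avg_packets": mean(r.packets for r in recs),
            "std_packets": pstdev(r.packets for r in recs),
            "avg_bytes": mean(r.bytes for r in recs),
            "std_bytes": pstdev(r.bytes for r in recs),
        }
    return table
```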
Challenges and Approaches
• Challenges
  • High-dimensional measurement space
  • Non-linear dependencies and non-stationarity
  • Privacy and proprietary concerns
  • Insufficient bandwidth for continuously sampled data
• Approaches
  • Dimension reduction
  • Model-based distributed inference
  • Controlled information sharing
  • Hierarchical and modular collection/analysis
2. Distributed Data Analysis
• Hypothesis: data collected at sites A, B, and C follow a statistical distribution defined over a lower-dimensional manifold
• Overall objective: find distributed strategies that perform reliable statistical inference with the minimum amount of data sharing
[Diagram: collection sites A, B, and C exchanging information]
2.1 Distributed Dimension Reduction
[Diagram: an unknown distribution on an unknown manifold, mapped through an unknown embedding and then sampled, yields the observed sample]
Geodesic Entropic Graphs: A Planar Sample and its Euclidean MST
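A minimal sketch of building the Euclidean MST over a planar sample, using scipy's sparse-graph MST on the dense pairwise distance matrix (quadratic memory, fine for illustration):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 2))        # planar sample: n = 500 points in the unit square

D = squareform(pdist(X))              # dense pairwise Euclidean distance matrix
T = minimum_spanning_tree(D)          # sparse matrix holding the MST edge weights

mst_length = T.sum()                  # total Euclidean length L(X_n) of the MST
edges = np.transpose(T.nonzero())     # (n-1) x 2 array of MST edge endpoints
print(mst_length, len(edges))
```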
GMST Dimension Estimation
[Plot: GMST estimates of intrinsic dimension and entropy: d = 13, H = 120 bits]
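The d and H readouts come from the growth rate of the geodesic MST length. A sketch of the standard GMST recipe (notation assumed; see the Costa & Hero reference), with edge-weight exponent γ and the BHH limit stated on the next slide:

```latex
% Fit a line to the log MST length over subsample sizes n:
\log L_\gamma(X_n) \;\approx\; a \log n + b, \qquad a = \tfrac{d-\gamma}{d},
% giving estimates of the intrinsic dimension and Renyi entropy of order \alpha = (d-\gamma)/d:
\hat d = \operatorname{round}\!\Big(\tfrac{\gamma}{1-\hat a}\Big),
\qquad
\hat H_\alpha = \frac{\hat b - \log \beta_{d,\gamma}}{1-\alpha}.
```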
Distributed GMST Estimator
• Principal MST convergence result: the BHH theorem (stated below)
• Distributed BHH (aggregation rule)
• Tight upper and lower bounds on the limit if rooted dual graphs [Yukich:97] are exchanged among sites
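For reference, a sketch of the BHH limit these bullets rely on (standard statement; notation assumed):

```latex
% BHH theorem for the MST length functional L_\gamma over n i.i.d. samples
% X_n = {X_1, ..., X_n} drawn from a density f on a d-dimensional support (d > \gamma):
\lim_{n\to\infty} \frac{L_\gamma(X_n)}{n^{(d-\gamma)/d}}
  \;=\; \beta_{d,\gamma} \int f(x)^{(d-\gamma)/d}\,dx
  \qquad \text{a.s.}
% The limit is a monotone function of the Renyi entropy of order \alpha = (d-\gamma)/d:
\int f(x)^{\alpha}\,dx \;=\; e^{(1-\alpha)\,H_\alpha(f)}.
```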
2.2 Distributed Model-Based Inference
• Global likelihood model
• Global M-estimator recursion
• Global Fisher score function
• Local Fisher score functions
(these quantities are sketched below)
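A sketch of these quantities, under the assumption (consistent with the independence condition on the Properties slide) that the N sites collect independent data x_i governed by a common parameter θ:

```latex
% Global likelihood: product over the N collection sites
L(\theta) = \prod_{i=1}^{N} f_i(x_i;\theta),
\qquad
l(\theta) = \sum_{i=1}^{N} \log f_i(x_i;\theta)
% Global Fisher score = sum of the local Fisher scores
s(\theta) = \nabla_\theta\, l(\theta) = \sum_{i=1}^{N} s_i(\theta),
\qquad
s_i(\theta) = \nabla_\theta \log f_i(x_i;\theta)
% Global M-estimator recursion with step size \mu_k
\theta^{(k+1)} = \theta^{(k)} + \mu_k\, s\big(\theta^{(k)}\big)
```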
Distributed M-estimator
[Diagram: sites A and B each compute their local score at the current iterate, exchange updates, and increment k; a sketch follows]
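A minimal sketch of the update cycle the diagram depicts: each site evaluates its local score at the shared iterate, only the p-dimensional vectors are exchanged, and the global estimate is updated. The score functions below are placeholders for whatever local model each site uses:

```python
import numpy as np

def distributed_m_estimator(local_scores, theta0, step=0.1, n_iter=100):
    """Gradient-ascent M-estimator: every site evaluates its local Fisher
    score at the shared iterate, the p-vectors are summed, and the global
    parameter estimate is updated."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iter):
        global_score = sum(score(theta) for score in local_scores)  # only p numbers per site
        theta = theta + step * global_score
    return theta

# Example: each site holds Gaussian data with unknown mean theta (unit variance),
# so its local score is sum(x - theta).
rng = np.random.default_rng(1)
data = [rng.normal(loc=2.0, size=100) for _ in range(5)]   # 5 sites
local_scores = [lambda th, x=x: np.sum(x - th) for x in data]
print(distributed_m_estimator(local_scores, theta0=0.0, step=1e-3))
```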
Properties
• Communication requirement: 2p bytes per update per site
• If the data are independent across sites, the recursion attains stationary points of the global likelihood
• All local MLEs are available to each site
• For a multimodal likelihood, aggregation under a mixture model can improve on the local MLEs
Global Likelihood Function
[Plot: a multimodal global likelihood with its global maximum, local maxima, and the scatter of local MLEs]
Key Theoretical Result
• The asymptotic distribution of the local estimates is a Gaussian mixture whose parameters depend on the global likelihood (see the sketch below)
• Proof: asymptotic normal theory of local maxima (Huber:67); see Blatt & Hero:2003
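A sketch of the form of this result (notation assumed; the precise weights and covariances are given in the Blatt & Hero reference). Writing θ₁, …, θ_M for the local maxima of the global likelihood:

```latex
% Conditioned on converging to the m-th local maximum, a local estimate
% \hat\theta_i computed from n_i samples is asymptotically normal:
\sqrt{n_i}\,\big(\hat\theta_i - \theta_m\big) \;\xrightarrow{d}\; \mathcal{N}(0,\ \Sigma_m),
% so unconditionally the local estimates follow a Gaussian mixture:
\hat\theta_i \;\sim\; \sum_{m=1}^{M} \pi_m\, \mathcal{N}\!\big(\theta_m,\ \Sigma_m/n_i\big),
\qquad \sum_{m} \pi_m = 1.
```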
Local Estimator Aggregation Algorithm
[Diagram: Estimators 1, 2, …, N → estimation of Gaussian mixture parameters (FS, EM, …) → sample covariance analysis → aggregation to final estimate; a sketch follows]
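A minimal sketch of the aggregation pipeline, with scikit-learn's EM standing in for whichever fitting step (FS, EM, …) is used on the slide; the dominant-mode rule at the end is an illustrative choice, not the aggregation rule from the paper:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def aggregate_local_estimates(local_estimates, n_modes=2):
    """Fit a Gaussian mixture to the cloud of local MLEs, then take the
    mean and covariance of the heaviest component as the final estimate."""
    est = np.asarray(local_estimates)            # shape (N_sites, p)
    gmm = GaussianMixture(n_components=n_modes, covariance_type="full").fit(est)
    best = np.argmax(gmm.weights_)               # dominant mode ~ global maximum
    return gmm.means_[best], gmm.covariances_[best]
```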
Simple Example
IID observation model:
• Each site observes a 2-component Gaussian mixture
• Identical component variances
• Unknown mixing parameters
• Unknown component means
• 200 data collection sites
• 100 samples per site
• CEM2 algorithm implemented for estimation and aggregation (simulation sketched below)
[Plot: ambiguity function showing the global and local maxima]
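A sketch of the simulated setup, with a standard EM fit standing in for CEM2 (which is not reproduced here); the true means and mixing weight are illustrative values:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
n_sites, n_per_site = 200, 100
true_means, mix = np.array([0.0, 3.0]), 0.3      # illustrative component means and mixing weight

local_mles = []
for _ in range(n_sites):
    z = rng.random(n_per_site) < mix                          # component labels
    x = rng.normal(loc=np.where(z, true_means[1], true_means[0]), scale=1.0)
    fit = GaussianMixture(n_components=2).fit(x.reshape(-1, 1))
    local_mles.append(np.sort(fit.means_.ravel()))            # local estimate of the two means

local_mles = np.array(local_mles)                             # 200 local MLEs, one per site
```

Feeding these 200 local estimates to the aggregation step sketched on the previous slide separates the cluster around the global maximum from the cluster around the local maximum.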
Clustering and Discrimination
[Scatter plot in the (m1, m2) plane: clusters of local estimates around the global and local maxima, with inverse-FIM ellipses and covariances empirically estimated via CEM2]
Validation of Key Result
[QQ plots for Cluster 1 and Cluster 2]
Conclusions
• Lossless distributed dimension reduction and model-based inference require:
  • Reliable local inference methods
  • Aggregation rules for combining local statistics
• Open issues:
  • Information sharing constraints?
  • Effects of bandwidth constraints - data compression?
  • Tracking in dynamical models?
References
• A. O. Hero, B. Ma, O. Michel, and J. D. Gorman, "Application of entropic graphs," IEEE Signal Processing Magazine, Sept. 2002.
• J. Costa and A. O. Hero, "Manifold learning with geodesic minimal spanning trees," accepted in IEEE T-SP (Special Issue on Machine Learning), 2004.
• D. Blatt and A. O. Hero, "Asymptotic distribution of log-likelihood maximization based algorithms and applications," in Energy Minimization Methods in Computer Vision and Pattern Recognition (EMM-CVPR), Eds. M. Figueiredo, A. Rangarajan, J. Zerubia, Springer-Verlag, 2003.
• M. F. Shih and A. O. Hero, "Unicast-based inference of network link delay distributions using mixed finite mixture models," IEEE T-SP, vol. 51, no. 9, pp. 2219-2228, Aug. 2003.
• N. Patwari, A. O. Hero, and B. Sadler, "Hierarchical censoring sensors for change detection," Proc. of SSP, St. Louis, Sept. 2003.
Addition of Other Discriminants
[Plot: value added due to transmission of likelihood values]