Explore outlier detection over large, dynamic networks with epidemic-style algorithms. Learn how nodes converge on global properties and detect anomalies in wireless sensor networks. Discover the challenges and algorithmic assumptions for effective outlier detection.
Data Mining Over a Large, Dynamic Network: An Overview and Algorithm for Outlier Detection
Talk Outline • Introduction • Data mining over a large, dynamic network -- what and why? • Some recent research on “epidemic-style” algorithms • Example problem • Outlier detection over a large, dynamic network • An algorithm and experiments in a wireless sensor network • Wrap-up summary
Large, Dynamic Networks • Internet, peer-to-peer file sharing networks • Kazaa: 5M simultaneous online users (Jan 2004) • Wireless sensor networks • 800-node WSN built at Berkeley (2001) • 10,000+ node WSNs are expected in the near future
Large, Dynamic Networks • Large number of network nodes, each with a local data set Di • Dynamic environment: • nodes may enter/leave at will • nodes' local data, Di, may change at will Distributed (time-varying) dataset: D = U Di
Why Mine a Large, Dynamic Network? • Goal: learn global properties of the dataset D. • Example: does the fraction of nodes in the WSN reporting a problem exceed a threshold? • Example: download patterns in an Internet P2P file-sharing network.
The Challenges • Large network size → no global operators/synchronization; must be communication-efficient • Dynamic environment (changing local data, node joins, departures) → algorithms must be robust to data/network change
Epidemic-Style Algorithms Each node in the network • maintains an estimate of the desired global property (e.g. global average, majority) • selectively communicates with neighboring nodes in the network All nodes’ estimates eventually converge on the true global property.
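To make the epidemic-style pattern concrete, below is a minimal sketch (not an algorithm from this talk) of nodes converging on a global average via pairwise gossip; the Node class, the synchronous rounds, and uniform peer selection are illustrative assumptions. Majority voting follows the same pattern: average 0/1 votes and compare the converged estimate against 1/2.

```python
# A minimal sketch of epidemic-style convergence, assuming a simple
# pairwise-averaging gossip; not the outlier algorithm presented later.
import random

class Node:
    def __init__(self, local_value):
        self.estimate = float(local_value)  # local estimate of the global average

def gossip_round(nodes):
    """Each node averages its estimate with one randomly chosen peer."""
    for node in nodes:
        peer = random.choice(nodes)
        if peer is not node:
            avg = (node.estimate + peer.estimate) / 2.0
            # Pairwise averaging conserves the total, so every estimate
            # converges to the true global mean.
            node.estimate = peer.estimate = avg

nodes = [Node(v) for v in [1.0, 5.0, 9.0, 13.0]]
for _ in range(50):
    gossip_round(nodes)
print([round(n.estimate, 2) for n in nodes])  # all approach 7.0, the global mean
```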
Epidemic-Style Algorithms A good choice for data mining over large, dynamic networks: • highly decentralized • asynchronous • robust to data/network change
Epidemic-Style Algorithm Examples • Majority voting • A useful basic primitive • Facility location, decision tree induction • All using majority voting as a building block • Outlier detection This talk.
Outlier Detection Over a Large, Dynamic Network • Applications: • data cleaning in wireless sensor networks • detecting misconfigured machines in grid networks But first, what do I mean by an "outlier"?
Overview of Outlier Detection on Centralized Data • Outlier – A point in a dataset which appears to be inconsistent with the remainder of the dataset. • Outlier detection is studied widely • Statistics • Machine Learning • Data Mining
Centralized Outlier Detection (cont.) • Outlier detection problem: • Given dataset D, score parameter k, and integer n, find the top n scored points in D.
Outlier Detection Over a Large, Dynamic Network • Problem: • Each node Ni has: a local data set Di (time varying), and parameters k and n. • Solve the unsupervised outlier detection problem over D = U Di. • At the end, each node has O[D], the top n outliers over D.
Algorithm • Assumptions: • Reliable message delivery • Reliable immediate neighborhood knowledge • Choice of score function • Algorithm works for any scoring function satisfying • “anti-monotonicity” and “smoothness”
Nodes send data points to their immediate neighbors based on events: • Local data change (birth or death of points in Di) • Receipt of data points from immediate neighbors • Change in immediate neighborhood
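As a rough sketch, a node's event loop might dispatch on the three triggers above and then re-send any newly necessary points; the event fields and the send_necessary_points helper (developed on the following slides) are hypothetical names, not from the talk.

```python
# A hypothetical event handler illustrating the event-driven protocol;
# all method and field names here are illustrative assumptions.
def on_event(node, event):
    if event.kind == "local_data_change":       # birth/death of points in Di
        node.update_local_data(event.points)
    elif event.kind == "points_received":       # data from an immediate neighbor
        node.integrate(event.sender, event.points)
    elif event.kind == "neighborhood_change":   # a neighbor joined or departed
        node.refresh_neighbors(event.neighbors)
    for neighbor in node.neighbors:             # any event may change which
        node.send_necessary_points(neighbor)    # points are "necessary"
```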
Some Notation Given neighboring nodes N1 and N2, • Let P1→2 denote all the points sent by N1 to N2 • Let P2→1 denote all the points sent by N2 to N1 • Let P1 denote the set of points currently held by N1: P1 = D1 U P2→1
Node Event Response: Basic Idea • Each node maintains an estimate of O[D], the global outliers. • N1 maintains O[P1]; N2 maintains O[P2] • N1 sends any point to N2 that might change N2's estimate. • Call these the necessary points from P1. How does N1 compute its necessary points?
N1 approximates N2's estimate O[P2] as O[P1→2 U P2→1]. The following are necessary points from P1: 1. O[P1] and its k nearest neighbors 2. The k nearest neighbors of O[P1→2 U P2→1] Z0 = union of 1 and 2. If Z0 were sent to N2, its estimate might change.
N1 would approximate N2's new estimate as O[Z0 U P1→2 U P2→1]. The following would be necessary points from P1: 1. Z0 2. The k-nn of O[Z0 U P1→2 U P2→1] Z1 = union of 1 and 2. If Z1 were sent to N2, its estimate might change.
N1 would approximate N2's new estimate as O[Z1 U P1→2 U P2→1] Z2 = Z1 U k-nn of O[Z1 U P1→2 U P2→1] Etc. Let Z* = Zi+1 = Zi for the smallest such i (a fixed point).
Z* is the set of necessary points from P1. N1 actually sends Z* \ (P1→2 U P2→1), since N1 knows N2 already has P1→2 U P2→1 (see the sketch below).
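A minimal sketch of this fixed-point computation, assuming hypothetical helpers outlier_set (the O[.] operator) and knn(P, S) (the k nearest neighbors within P of the points in S); points are assumed hashable.

```python
# Sketch of N1 computing its necessary points Z* for neighbor N2.
# outlier_set and knn are stand-ins for the operators on the slides.
def necessary_points(P1, sent, received, outlier_set, knn):
    """P1: points N1 holds; sent = P_{1->2}; received = P_{2->1}."""
    shared = sent | received                   # points N2 is known to hold
    # Z0: N1's outlier estimate, its k-nn, and the k-nn of N1's
    # approximation of N2's estimate, O[P_{1->2} U P_{2->1}].
    Z = outlier_set(P1) | knn(P1, outlier_set(P1)) | knn(P1, outlier_set(shared))
    while True:                                # Z_{i+1} = Z_i U k-nn of O[Z_i U shared]
        estimate = outlier_set(Z | shared)     # N2's estimate if Z were sent
        Z_next = Z | knn(P1, estimate)
        if Z_next == Z:                        # fixed point reached: Z* = Z
            return Z - shared                  # N2 already has the shared points
        Z = Z_next
```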
Some Key Algorithm Features • Highly decentralized • Asynchronous • Robust to network/data change Each node's estimate is guaranteed to converge to O[D] (the global outliers).
Wrap-Up Summary • Data mining over a large, dynamic network • Motivation (wireless sensor networks, internet P2P information sharing networks) • Some recent research on epidemic-style algorithms • Outlier detection over a large, dynamic network • An algorithm and experiments in a wireless sensor network
THE END Thank you for your attention. Chris Giannella: cgiannel@acm.org Slides Available From: www.cs.nmsu.edu/~cgiannel/Mining_Dynamic_Network2008.ppt
Selected Research Contributions in Distributed Data Mining • DM over a large, dynamic network • K-means clustering (SDM 2006) • Outlier detection (ICDCS 2006) • Decision tree induction (Statistical Analysis and Data Mining 2008) • DDM on astronomy data (NASA funded) • Principal component analysis and outlier detection (SDM 2007) • Privacy preserving data mining • Distance-preserving data perturbation (PKDD 2006)
Mining a Large-Scale, Distributed Dataset The problem: • Each node Ni has: a data set Di (time varying). • Compute some function, F, of the global dataset D = U Di. • Each node learns F(D).
Two Main Centralization Problems • Communication Cost • Centralization creates a communication bottleneck • Privacy Loss • Data from one source may be too sensitive to reveal to others
DDM on Sensitive Data • Privacy Preserving Data Mining (PPDM) • Example: multiple banks • Detect fraudulent transactions based on all the banks' data → Secure Multi-party Computation (cryptography) • Example: census data • Release the data to other organizations for analysis → Data transformation
DDM on Non-Sensitive Data • Problem: data centralization creates a communication bottleneck • Static environment (data and network do not change) → parallel computing • Dynamic environment → remainder of the talk
Some Common PPDM Techniques • Secure multi-party computation (cryptography) • Build complicated distributed algorithms from simple, secure primitives (e.g. secure sum) • Data transformation • Release a transformed dataset
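For concreteness, below is a minimal sketch of one classic secure-sum variant (a ring of semi-honest, non-colluding parties, with the initiator masking its partial sum); the specific protocol shown is an assumption for illustration, and the list stands in for values that would really be held by separate parties.

```python
# A minimal secure-sum sketch: each party adds its private value to a
# masked running total, so no party sees another's individual input.
import random

M = 2**32  # modulus, chosen larger than any possible true sum

def secure_sum(local_values):
    mask = random.randrange(M)       # initiator's random mask hides its input
    running = mask
    for v in local_values:           # each "party" sees only a masked partial sum
        running = (running + v) % M
    return (running - mask) % M      # initiator removes the mask at the end

print(secure_sum([10, 20, 30, 40]))  # 100
```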
DDM Introduction Summary • Basic DDM framework • DDM analysis of sensitive data (Privacy Preserving Data Mining) • Data transformation vs. cryptographic approaches • Analysis of non-sensitive data • Static environment (parallel computing) • Dynamic environment (remainder of the talk)
Centralized Outlier Detection (cont.) • Define a function, S, such that for any set A of data points in Rn and any data point x in Rn, • S(A,x) is the outlier score of x with respect to A • The larger the score, the more x is deemed to be inconsistent with A. • Example: S(A,x) = average Euclidean distance of x to its k nearest neighbors in A (k a fixed constant).
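A minimal sketch of this example score and the resulting top-n selection from the problem statement; the function names are illustrative, and only the standard library is assumed.

```python
# Sketch: S(A, x) = average Euclidean distance from x to its k nearest
# neighbors in A, and O[D] = the n highest-scoring points of D.
import heapq
import math

def score(A, x, k):
    dists = sorted(math.dist(x, a) for a in A if a != x)
    return sum(dists[:k]) / k        # average distance to the k nearest neighbors

def top_n_outliers(D, k, n):
    return heapq.nlargest(n, D, key=lambda x: score(D, x, k))

D = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]  # (10, 10) is far from the cluster
print(top_n_outliers(D, k=2, n=1))               # [(10, 10)]
```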
Epidemic-Style Algorithms Each node in the network • maintains an estimate of the desired global property (e.g. global average, majority) • selectively communicates with immediately neighboring nodes upon detecting an event • local data/network change, message receipt All nodes' estimates eventually converge on the true global property.
Basic Idea • O[P1] is N1's current estimate of O[D]. • N1 sends any point to N2 that might change N2's outlier estimate, O[P2].
N1 should send its outlier estimate and its k nearest neighbors • B0 = O[P1] U KNN(O[P1]) • Assuming B0 is sent, • N1 should also send the k nearest neighbors of N2's outlier estimate, O[P2] • N1 guesses O[P2] to be O[B0 U P1→2 U P2→1] • N1 should send • B1 = B0 U KNN(O[B0 U P1→2 U P2→1])
Assuming B1 is sent, N1 should send • B2 = B0 U KNN(O[B1 U P1→2 U P2→1]) • Assuming B2 is sent, N1 should send • B3 = B0 U KNN(O[B2 U P1→2 U P2→1]) • Etc. • N1 sends a fixed point B*: • B* = B0 U KNN(O[B* U P1→2 U P2→1])
Selected Research Contributions • Privacy preserving data mining • Distance preserving data transformation (PKDD 2006) • Outlier detection using secure multi-party computation (in preparation) • DDM in dynamic environments • K-means clustering (SDM 2006) • Outlier detection (ICDCS 2006) ===> Remainder of the talk
Theorem • If the data and network are fixed, then • The algorithm terminates • i.e., all nodes stop sending messages • Upon termination, O[Pi] = O[D] for every node Ni • i.e., every node's outlier estimate equals the true outliers. • The algorithm is guaranteed to converge on the correct result provided enough time elapses between changes.
References • "K-Means Clustering Over a Large, Dynamic Network", SIAM Conf. Data Mining (SDM), 2006 • "In-Network Outlier Detection in Wireless Sensor Networks", Int. Conf. Distributed Computing Systems (ICDCS), 2006 • "Distributed Decision Tree Induction in Peer-to-Peer Systems", Statistical Analysis and Data Mining, 2008 • "Distributed Top-K Outlier Detection from Astronomy Catalogs using the DEMAC System", SIAM Conf. Data Mining (SDM), 2007 • "An Attacker's View of Distance Preserving Maps for Privacy Preserving Data Mining", European Conf. Principles and Practice of Knowledge Discovery in Databases (PKDD), 2006