A Renewal Theory Approach to Anomaly Detection in Communication Networks

Brian Thompson† (bthom@cs.rutgers.edu), Tina Eliassi-Rad†‡ (eliassirad1@llnl.gov)
†Rutgers University ‡Lawrence Livermore Lab

[Figure labels recovered from the poster: network snapshots at t = 1 through t = 4 with a summary graph; edge-weighted example graphs for Day 220 and Day 250; panels "Sorted by degree", "Recency", and "MCD Analysis"; Bounded Pareto bounds x_min and x_max.]

Introduction/Motivation
• A communication network is a time-evolving graph that models interactions between entities over time
• Pervasive in today's world: phone calls, blog posts, email, social network messages, IP connections
• Volatile: static network analysis tools are not sufficient
• Goal: efficiently identify local or global changes in communication activity or graph structure over time

Model
• Communication across an edge is modeled as a sequence of time-stamped events, which yields a distribution of inter-arrival times (IATs)
• IATs for human interaction frequently follow a power-law distribution
• The Bounded Pareto distribution allows us to model communication concisely and to make updates in real time and in constant space
• The recency function Rec : 2^T × T → [0,1] assigns a weight to edge e at time t based on its age, i.e. the time since the last event, subject to the constraints:
  • Rec(e,t) = 0 at the time an event occurs, 1 when age = x_max, and is increasing in between
  • Rec(e,t) is uniform over [0,1] when sampled uniformly in time
• Rec is uniquely determined by the constraints
• The uniformity property eliminates time-scale bias

Our Approach
• Consider the weighted graph G_t = (V,E) representing a communication network at time t, with w(e) = Rec(e,t)
• For E' ⊆ E, let X_{E',p} = # of edges in E' with w(e) ≤ p
• We define the p-divergence of E' as follows: Div_p(E') = 1 / P(X ≥ X_{E',p}), where X ~ Bin(|E'|, p)
• Intuitively, a p-divergence of d means that the probability of at least X_{E',p} edges occurring p-recently is 1/d
• The max-divergence of E' is the maximum of Div_p(E') over p in [0,1]
• Example: let E' be the set of thick edges in the figure. Then |E'| = 6, X_{E',0.3} = 4, P(X ≥ 4) = 0.07, and Div_0.3(E') = 14.2
• A (maximal) p-component of G = (V,E) is a connected subgraph C = (V',E') such that (1) w(e) ≤ p for all e in E' and (2) w(e) > p for all e not in E' incident to V'
• The set of p-components partitions V, for all p in [0,1]
• The p-components of G_t for p = 0.3 are shown in blue in the figure

Algorithm
• The MCD Algorithm:
  • Calculate edge weights using the recency function
  • Gradually increase the edge threshold, updating components and divergence values as necessary
• Output: disjoint components with max divergence

Experimental Results
• Experiments on 4 datasets: Enron email, LBNL IP traffic, Twitter messages, and Reality Mining Bluetooth proximity
• Clear and intuitive visualization reveals anomalous activity in the Bluetooth dataset at two points in time
• A simple plot of MCD over time (left) identifies hand-labeled scanning activity in the LBNL dataset, as well as other anomalies overlooked by human analysts
• The plot at right shows scalability using the Twitter dataset (263k nodes, 308k edges, 1.1 million timestamps)

Conclusions
• Traditional network analysis is inadequate for dealing with communication networks, which are dynamic and volatile
• Studying the inter-arrival time distributions of edges is a novel approach for analyzing communication networks
• Our algorithms are streaming, and run in O(m) space and O(m log m) time, where m is the # of edges in the dataset
• MCD analysis can be easily visualized and used as a tool for monitoring activity in a variety of real-world domains

This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.
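
The worked example on the poster (|E'| = 6, X_{E',0.3} = 4, P(X ≥ 4) = 0.07, divergence 14.2) can be reproduced with a short sketch. It assumes the p-divergence is the reciprocal of the binomial tail probability P(X ≥ X_{E',p}) with X ~ Bin(|E'|, p), which matches the poster's statement that a p-divergence of d corresponds to a probability of 1/d. The function names, the six edge weights, and the choice to search only the observed weights as thresholds are illustrative assumptions, not taken from the poster.

```python
from math import comb, inf


def binomial_tail(n, k, p):
    """Return P(X >= k) for X ~ Bin(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def p_divergence(weights, p):
    """Reciprocal of the chance that at least as many edges are p-recent.

    `weights` holds the recency weights w(e) of the edges in E'.
    """
    n = len(weights)
    k = sum(1 for w in weights if w <= p)  # X_{E',p}
    tail = binomial_tail(n, k, p)
    return inf if tail == 0 else 1.0 / tail


def max_divergence(weights):
    """Return (divergence, p) maximizing p-divergence over candidate thresholds.

    Assumption: only the observed weights are tried as thresholds. Between
    two consecutive weights the count k is fixed while the tail probability
    grows with p, so the divergence cannot exceed its value at the lower one.
    """
    return max((p_divergence(weights, p), p) for p in set(weights))


# Hypothetical weights: six edges, four of them with w(e) <= 0.3.
example = [0.1, 0.2, 0.3, 0.3, 0.6, 0.8]
print(f"{p_divergence(example, 0.3):.1f}")  # prints 14.2
```

For these hypothetical weights, `max_divergence(example)` also selects the threshold p = 0.3. The MCD algorithm described on the poster applies this idea to p-components of the graph rather than to a fixed edge set, which this sketch does not attempt.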