Streaming Models and Algorithms for Communication and Information Networks Brian Thompson (joint work with James Abello)
Outline
• Introduction and Motivation
• A Streaming Model
• Our Approach
• Algorithms
• Experimental Results
• Conclusions and Future Work
Problem Description
• Data: A network (G; T)
    • G = (V, E) is a graph
    • T is a set of time-stamped events corresponding to nodes or edges in G
• Goals:
    • Identify recent correlated activity
    • Measure influence between entities
• Challenges:
    • Scalability: networks may be very large and space is limited
    • Efficiency: high data rate, time-sensitive information
    • Variability: entities have different temporal dynamics
Related Work
• Time-evolving graph model: a sequence of “snapshots”
• Time series analysis
[Figure: graph snapshots at t = 1, t = 2, t = 3, t = 4]
Related Work
• Cascade model: starting from a set of seed nodes, information (a product, news, a virus) propagates through the network
Data Model
• G is a graph
• T is a set of time-stamped events corresponding to nodes or edges in G
[Figure: example graph with nodes Alice, Bob, Cheng, Devika, Elina]
Data Model (Node-centric)
[Figure: example graph with nodes Alice, Bob, Cheng, Devika, Elina]
Data Model (Edge-centric)
[Figure: example graph with nodes Alice, Bob, Cheng, Devika, Elina]
Renewal Theory
• A renewal process $R$ is a continuous-time Markov process where state transitions occur with holding times sampled independently from a positive distribution $F$.
• Let $X_1, X_2, \ldots$ be samples from $F$, and consider a sequence of events at times $t_1 < t_2 < \cdots$ corresponding to those holding times, so that $X_i = t_i - t_{i-1}$ with $t_0 = 0$.
• We call the $X_i$ inter-arrival times, and refer to the sequence $(t_1, t_2, \ldots)$ as the discrete-event sequence for $R$.
[Figure: timeline of events at times 0, t1, t2, t3, t4, t5]
Renewal Theory
• The age of a renewal process $R$ at time $t$ is the amount of time elapsed since the last event: $A_R(t) = t - \max\{t_i : t_i \le t\}$
[Figure: timeline of events at times 0, t1, ..., t5 with the current time t marked]
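As a minimal sketch of these two definitions, inter-arrival times and age can be computed from a sorted list of event timestamps. The function names `inter_arrival_times` and `age` are illustrative and do not come from the slides.

```python
from bisect import bisect_right


def inter_arrival_times(event_times):
    """Gaps between consecutive events; event_times must be sorted ascending."""
    return [later - earlier for earlier, later in zip(event_times, event_times[1:])]


def age(event_times, t):
    """Time elapsed at time t since the most recent event at or before t."""
    k = bisect_right(event_times, t)  # number of events with timestamp <= t
    if k == 0:
        raise ValueError("no event has occurred by time t")
    return t - event_times[k - 1]


events = [0.0, 2.0, 3.5, 7.0]
print(inter_arrival_times(events))  # [2.0, 1.5, 3.5]
print(age(events, 5.0))             # 1.5
```

The age resets to zero at each event and then grows linearly until the next one, which is why it is a natural raw measure of how recently an entity was active.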
The REWARDS Model
REneWal theory Approach for Real-time Data Streams
• We model a stream of communication data from a node or across an edge as a renewal process
[Figure: a discrete-event sequence t1, ..., t5 and its inter-arrival time distribution, with support from xmin to xmax]
The REWARDS Model
REneWal theory Approach for Real-time Data Streams
• Given a stream of time-stamped events, we estimate the parameters of the renewal process for each node or edge based on the inter-arrival times
[Figure: a discrete-event sequence t1, ..., t5 and its inter-arrival time distribution, with support from xmin to xmax]
Recency
• Goal: highlight recent activity
• Key idea: more recent = more relevant
• Challenge: The most frequent communicators will always seem “recent”, overshadowing others’ behavior. We call this time-scale bias.
[Figure: activity timelines for users alice1337 and bob_iz_kewl, from 8:00 am through 10:00 am and 12:00 pm to now]
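Recency
• We can overcome time-scale bias by using the REWARDS Model
• We first derive the limit distribution of the age function $A_R(t)$: $G_R(x) = \lim_{t \to \infty} \Pr[A_R(t) \le x] = \frac{1}{\mu} \int_0^x (1 - F(y))\,dy$, where $F$ is the inter-arrival time distribution of $R$ and $\mu$ is its mean
• We define the recency of $R$ at time $t$ to be: $\mathrm{Rec}_R(t) = 1 - G_R(A_R(t))$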
Recency
• The recency $\mathrm{Rec}_R(t)$ is a decreasing function on every interval $[t_i, t_{i+1})$ between consecutive events. It also satisfies the uniformity property: for any renewal process $R$, the limit distribution of $\mathrm{Rec}_R(t)$ is Uniform(0,1).
• Recency effectively normalizes the age of a process relative to its own temporal dynamics, making our approach robust to differences in time scale between networks or between entities within the same network.
[Figure: Recency of edge <3,22> in the Bluetooth dataset]
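The exact recency formula did not survive in this transcript, so the following is only a sketch under a stated assumption: recency is taken to be one minus the limiting CDF of the age, which for a renewal process with inter-arrival CDF F and mean μ is (1/μ)∫₀ˣ(1 − F(y))dy. Plugging in the empirical distribution of the observed inter-arrival times turns that integral into a sum of min(xᵢ, x) terms. The name `empirical_recency` is hypothetical.

```python
def empirical_recency(inter_arrivals, current_age):
    """Assumed recency: 1 - G(age), with G estimated from observed gaps.

    For the empirical inter-arrival distribution, the limiting age CDF is
    G(x) = sum(min(x_i, x)) / sum(x_i).
    """
    total = sum(inter_arrivals)
    if total <= 0:
        raise ValueError("need at least one positive inter-arrival time")
    covered = sum(min(x, current_age) for x in inter_arrivals)
    return 1.0 - covered / total


gaps = [2.0, 1.5, 3.5]
print(empirical_recency(gaps, 0.0))   # 1.0 (an event just happened)
print(empirical_recency(gaps, 10.0))  # 0.0 (older than every observed gap)
```

Because the estimate depends only on the entity's own gaps, a slow communicator and a fast one receive comparable scores when each is equally "late" relative to its own history, which is the normalization the slide describes.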
Delay
• Goal: measure influence of entity A on entity B
• Key idea: study pairwise (A,B)-gaps
• Challenge: More frequent communicators will tend to always have shorter “gaps”. This is another example of time-scale bias.
[Figure: activity timelines for users alice1337 and bob_iz_kewl, from 8:00 am through 10:00 am and 12:00 pm to now]
Delay
• Given renewal processes $A$ and $B$ with event times $a_1 < a_2 < \cdots$ and $b_1 < b_2 < \cdots$, we say the ordered pair of events $(a_i, b_j)$ is adjacent if $a_i < b_j$ and no other event of $A$ or $B$ occurs between them. We refer to the elapsed time $b_j - a_i$ as the pairwise gap. We denote by $g_{A,B}(t)$ the most recent such gap at time $t$.
• If $A$ and $B$ are independent processes, then we can derive the limit distribution of pairwise gaps between consecutive event pairs.
• We define the $(A,B)$-delay at time $t$ by normalizing $g_{A,B}(t)$ against this limit distribution.
Delay
• The $(A,B)$-delay is a constant function on every interval between consecutive adjacent event pairs, and also satisfies the uniformity property: for any pair of independent renewal processes $A$ and $B$, the limit distribution of the $(A,B)$-delay is Uniform(0,1).
• By comparing an observed gap to the theoretical joint distribution of inter-arrival times for $A$ and $B$, delay effectively normalizes the gap relative to the temporal dynamics of $A$ and $B$ individually.
• Similarly to the recency function, this makes our approach robust to differences in time scale between networks or between entities within the same network.
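The precise adjacency conditions are missing from this transcript, so the sketch below assumes that an A-event and a B-event are adjacent when the A-event comes first and no other event of either process falls between them. Under that assumption, the (A,B)-gaps can be read off a merged timeline. The name `adjacent_gaps` is illustrative only.

```python
def adjacent_gaps(a_times, b_times):
    """Return (a_time, b_time, gap) for each assumed-adjacent (A, B) event pair.

    Assumption: a pair is adjacent when an A-event is immediately followed
    by a B-event in the merged timeline of both processes.
    """
    merged = sorted([(t, "A") for t in a_times] + [(t, "B") for t in b_times])
    pairs = []
    for (t0, src0), (t1, src1) in zip(merged, merged[1:]):
        if src0 == "A" and src1 == "B":
            pairs.append((t0, t1, t1 - t0))
    return pairs


a = [1.0, 4.0, 9.0]
b = [2.0, 3.0, 10.0]
print(adjacent_gaps(a, b))  # [(1.0, 2.0, 1.0), (9.0, 10.0, 1.0)]
```

The most recent gap at a given time is then the last entry whose B-event has already occurred; it stays fixed until the next adjacent pair completes.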
Divergence
• Based on the Kolmogorov-Smirnov statistic, which compares an empirical distribution function $F_n(x)$ to a hypothesized CDF $F(x)$: $D_n = \sup_x |F_n(x) - F(x)|$
• Recency divergence compares recency values for a set of nodes or edges to the CDF for Uniform(0,1)
• Delay divergence compares delay values for a set of edges, or for all (A,B)-gaps, to the CDF for Uniform(0,1)
[Figure: example EDF plotted against a CDF, with KS = 0.32]
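For a concrete picture of the divergence, here is a small sketch of the one-sample Kolmogorov-Smirnov statistic against Uniform(0,1), whose CDF is simply F(x) = x on [0, 1]. The function name `ks_divergence_uniform` is an illustrative choice, not a name from the slides.

```python
def ks_divergence_uniform(values):
    """KS statistic between the EDF of `values` and the Uniform(0,1) CDF."""
    xs = sorted(values)
    n = len(xs)
    if n == 0:
        raise ValueError("need at least one value")
    d = 0.0
    for i, x in enumerate(xs):
        cdf = min(max(x, 0.0), 1.0)  # Uniform(0,1) CDF, clipped outside [0, 1]
        # The EDF jumps from i/n to (i+1)/n at x; check both sides of the jump.
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d


print(ks_divergence_uniform([0.5]))  # 0.5
```

Values bunched near 1 (many edges with high recency at once) push the EDF far below the diagonal and yield a divergence close to 1, while values spread evenly over (0, 1) yield a divergence close to 0.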
Streaming Node-Centric Algorithm
• Goal: Flag times at which a node exhibits anomalous activity (indicated by an unusually high concentration of recent outgoing communication)
• Approach: Since the recency function is decreasing between consecutive communications, measure the recency divergence at a node only at times at which new activity occurs
The MCD Algorithm
Maximal Component Divergence Algorithm
• Goal: Identify subgraphs with correlated behavior
    • Recency divergence to find recent anomalous activity
    • Delay divergence to identify spheres of influence
• Challenge: How do we overcome the combinatorial explosion?
The MCD Algorithm
Maximal Component Divergence Algorithm
• Calculate edge weights using the recency or delay function
• Gradually decrease the threshold, updating components and divergence values as necessary
• Output: disjoint components with maximum divergence
[Figure: example graph on vertices V1 to V5 with weighted edges]
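The threshold sweep above can be sketched with a union-find structure: processing edges in decreasing weight order is equivalent to lowering the threshold one edge at a time. This is a rough illustration, not the authors' implementation. It assumes edge weights lie in [0, 1], recomputes each component's divergence from scratch after every update, and picks the final disjoint components greedily by divergence, a selection rule the slides do not specify. All names (`mcd_sketch`, `ks_uniform`) are hypothetical.

```python
def ks_uniform(values):
    """KS statistic of `values` (assumed in [0, 1]) against Uniform(0,1)."""
    xs = sorted(values)
    n = len(xs)
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))


def mcd_sketch(edges):
    """edges: iterable of (u, v, weight). Returns [(divergence, node_set), ...]."""
    parent, nodes, weights = {}, {}, {}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    candidates = []
    # Descending weight order simulates a gradually decreasing threshold.
    for u, v, w in sorted(edges, key=lambda e: e[2], reverse=True):
        for x in (u, v):
            if x not in parent:
                parent[x], nodes[x], weights[x] = x, {x}, []
        ru, rv = find(u), find(v)
        if ru != rv:
            if len(nodes[ru]) < len(nodes[rv]):
                ru, rv = rv, ru  # merge the smaller component into the larger
            parent[rv] = ru
            nodes[ru] |= nodes.pop(rv)
            weights[ru] += weights.pop(rv)
        weights[ru].append(w)
        candidates.append((ks_uniform(weights[ru]), frozenset(nodes[ru])))

    # Assumed selection rule: greedily keep the highest-divergence components
    # that do not overlap with anything already chosen.
    chosen, used = [], set()
    for div, comp in sorted(candidates, key=lambda c: c[0], reverse=True):
        if used.isdisjoint(comp):
            chosen.append((div, comp))
            used |= comp
    return chosen


example = [("a", "b", 0.95), ("b", "c", 0.9), ("d", "e", 0.5), ("c", "d", 0.1)]
for div, comp in mcd_sketch(example):
    print(round(div, 2), sorted(comp))
```

Sorting the edges dominates the running time of the sweep itself; recomputing the divergence from scratch at each step is done here only for clarity and would need an incremental update to be efficient on large graphs.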
Sample Output
Robustness to Time Scale
• Simulation: R-MAT model, 128 vertices, average degree 16
• Inter-arrival times (IATs) for edge activity are sampled from Bounded Pareto distributions, with rate parameter between 10 minutes and 1 week
• Every 5 days, a randomly selected node has anomalous activity at 10x its normal rate
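To illustrate how edge activity of this kind might be simulated, the sketch below draws inter-arrival times from a Bounded Pareto distribution by inverting its CDF and accumulates them into event timestamps. The bounds, shape parameter, and horizon are placeholder values chosen for the example, not the settings used in the experiments, and the function names are hypothetical.

```python
import random


def sample_bounded_pareto(rng, low, high, alpha):
    """Inverse-CDF sample from Bounded Pareto on [low, high] with shape alpha."""
    u = rng.random()
    ratio = (low / high) ** alpha
    return low * (1.0 - u * (1.0 - ratio)) ** (-1.0 / alpha)


def simulate_event_times(rng, low, high, alpha, horizon):
    """Event timestamps of a renewal process with Bounded Pareto gaps."""
    t, times = 0.0, []
    while True:
        t += sample_bounded_pareto(rng, low, high, alpha)
        if t > horizon:
            return times
        times.append(t)


rng = random.Random(0)
# Placeholder parameters: gaps between 1 minute and 1 hour, over one day (seconds).
events = simulate_event_times(rng, low=60.0, high=3600.0, alpha=1.5, horizon=86400.0)
print(len(events), "events in one simulated day")
```

Giving each edge its own bounds and shape produces entities that operate on very different time scales, which is the condition the robustness experiment is meant to stress.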
Robustness to Time Scale
• Conclusion: While it takes longer for anomalous activity to be recognized at nodes with lower rates, the magnitude of the peak seems to be independent of activity rate but highly correlated with degree
Accuracy and Precision
• Simulation: star network, 100 trials with only normal activity and 100 trials including a period of anomalous activity
• ROC curves show accuracy and precision for several methods for distinguishing between the two scenarios
• Conclusion: Especially when variability is introduced, our approach outperforms the WtdDeg and Z-Score metrics
Detection Latency
• Data: Enron corpus, 1k nodes, 2k edges, 4k timestamps
• Compare our approach with the GraphScope algorithm
• Conclusion: The two algorithms seem to identify similar times of anomalous activity, but our approach based on the REWARDS model has a shorter response time
Anomaly Detection in IP Traffic
• Data: LBNL network trace, > 9 million timestamps during one hour on December 15, 2004
• Compare our approach with total network volume and with “scanning activity” labeled by LBNL analysts
Anomaly Detection in IP Traffic
• Three of the four times of highest divergence correspond to labeled scanning activity
• The peak in scanning activity at 12:07pm is primarily due to an increase in DNS and NBNS lookups
• The peak at 12:26pm was not flagged by the analysts since the sequence of IP addresses was not monotonic
Complexity Analysis
• Dataset: Twitter messages, Nov. 2008 – Oct. 2009 (263k nodes, 308k edges, 1.1 million timestamps)
• Updates take O(1) time per communication
• The MCD Algorithm takes O(m log m) time, where m = # of edges; it can be approximated in effectively O(m) time
Future Work
• Incorporate duration of communication and other node or edge attributes into our model
• Make use of geographical and textual content
• Use gap divergence to infer links, and compare to the approach of Gomez-Rodriguez et al.
• Develop a streaming algorithm to identify emerging trends
Acknowledgements
Part of this work was conducted at Lawrence Livermore National Laboratory, under the guidance of Tina Eliassi-Rad. This project is partially supported by a DHS Career Development Grant, under the auspices of CCICADA, a DHS Center of Excellence.
Questions?