Measuring Two-Event Structural Correlations on Graphs Ziyu Guan, Nan Li, Xifeng Yan Department of Computer Science UC Santa Barbara
Outline • Motivations • Measuring Two-Event Structural Correlation (TESC) • Efficient Computation • Experiments • Discussions and Future Work
Intrusion • [Figure: intrusion alerts on a computer network — Ping Sweep and SMB Service Sweep alerts attract each other (Attraction).]
Product Sales • What is the relationship between the sales of two products in a social network? Do they attract or repel each other? [Figure: Attraction vs. Repulsion of product occurrences.]
A New Notion of Correlation • Two-Event Structural Correlation (TESC) • Defined on graph structures • Captures the relationship between the distributions of two events on a graph • Events can be different things in different contexts: • Topics or products (social networks) • Viruses (contact networks) • Intrusion alerts (computer networks)
It Is A Nontrivial Problem • Simply computing the average distance between occurrences of two events will not work • The average distance for a positively correlated pair could be longer than that for a negatively correlated pair • gScore cannot be adapted [Z. Guan et al., SIGMOD 2011] • Significance cannot be assessed by randomization!
Outline • Motivations • Measuring Two-Event Structural Correlation (TESC) • Efficient Computation • Experiments • Discussions and Future Work
How To Measure? • Positive correlation: the presence of event A tends to imply the presence of event B; more A also tends to attract more B. • Negative correlation: the presence of one event is likely to imply the absence of the other; more A means less B. • Our idea: employ reference nodes in the graph as observers to capture these characteristics quantitatively, avoiding randomization for significance testing.
Preliminaries • A graph G = (V, E) and an event set Q = {q_i}. Given two events a and b in Q, V_a and V_b are the sets of nodes having a and b, respectively. • Def. (Node h-hop neighborhood): given a node, the subgraph induced by the nodes within distance h from that node. • Def. (Node set h-hop neighborhood): given a node set, the subgraph induced by the union of all nodes that are within distance h from at least one node in the set.
Measuring Concordance • Density function: d_a(r) = the fraction of nodes possessing event a in r's h-hop neighborhood. • Concordance score for a pair of reference nodes (r_i, r_j): s(r_i, r_j) = +1 if the density changes of the two events are consistent, i.e. (d_a(r_i) − d_a(r_j))(d_b(r_i) − d_b(r_j)) > 0; −1 if the density changes are inconsistent (the product is negative); 0 for a tie (the product is zero).
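A minimal Python sketch of the two quantities above, assuming an undirected graph stored as an adjacency dict; the names `h_hood`, `density`, and `concordance` are illustrative, not from the paper's code:

```python
from collections import deque

def h_hood(adj, start, h):
    """Nodes within distance h of `start`, including `start` itself."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        v, d = queue.popleft()
        if d < h:
            for nb in adj[v]:
                if nb not in seen:
                    seen.add(nb)
                    queue.append((nb, d + 1))
    return seen

def density(adj, event_nodes, r, h):
    """Fraction of nodes in r's h-hop neighborhood that carry the event."""
    hood = h_hood(adj, r, h)
    return len(hood & event_nodes) / len(hood)

def concordance(da_i, db_i, da_j, db_j):
    """+1 if the density changes of a and b agree between two reference
    nodes, -1 if they disagree, 0 on a tie."""
    prod = (da_i - da_j) * (db_i - db_j)
    return (prod > 0) - (prod < 0)
```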
Kendall’s Tau as The Measure • Kendall’s Tau rank correlation is used to compute the overall concordance among reference nodes with regard to the density changes of the two events: τ(a, b) = Σ_{i<j} s(r_i, r_j) / (N(N − 1)/2), where N is the number of all reference nodes and s is the concordance score defined above. • τ(a, b) lies in [−1, 1]. A higher positive value means a stronger positive correlation; a lower negative value means a stronger negative correlation; 0 means no correlation.
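Given the per-node densities, the statistic is just the average pairwise concordance score. A small sketch (tau-a form, counting ties as zero-score pairs, which is one reading of the slide's formula):

```python
from itertools import combinations

def kendall_tau(density_a, density_b):
    """Average pairwise concordance over reference nodes.

    density_a, density_b: dicts mapping reference node -> h-hop event density.
    """
    nodes = list(density_a)
    n = len(nodes)
    total = 0
    for ri, rj in combinations(nodes, 2):
        prod = (density_a[ri] - density_a[rj]) * (density_b[ri] - density_b[rj])
        total += (prod > 0) - (prod < 0)  # +1 concordant, -1 discordant, 0 tie
    return total / (n * (n - 1) / 2)
```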
Significance Testing • Impractical to compute τ(a, b) over all reference nodes directly • Testing: choose a uniform sample of n reference nodes, and compute the sample score t(a, b) over this sample • It is proved that the distribution of t(a, b) under the null hypothesis tends to the normal distribution with mean 0 and a variance related to n • Thus, the correlation significance (z-score) is z(a, b) = t(a, b) / √Var[t(a, b)]
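The slide only states that the null variance is "related to n". For the classical Kendall statistic without ties that variance is 2(2n + 5) / (9n(n − 1)); the sketch below assumes that form, and the paper's exact variance term may differ (e.g., with a tie correction):

```python
import math

def z_score(t_sample, n):
    """z-score of a sampled tau under the null hypothesis of no correlation.

    Assumption: the classical no-ties Kendall null variance
    2(2n+5) / (9n(n-1)); the paper's variance may account for ties.
    """
    var = 2.0 * (2 * n + 5) / (9.0 * n * (n - 1))
    return t_sample / math.sqrt(var)
```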
Reference Nodes • The reasons for choosing N_h(V_a ∪ V_b), the h-hop neighborhood of the event nodes, as the set of all reference nodes: • Nodes outside it ("out-of-sight" nodes) cannot reach any event node within h hops • Incorporating them can only increase the number of consistent pairs and increase the number of ties (decreasing the variance in the null case), leading to unexpectedly high z-scores
Outline • Motivations • Measuring Two-Event Structural Correlation (TESC) • Efficient Computation • Experiments • Discussions and Future Work
Efficient Computation • The key problem in efficient computation is how to obtain a uniform sample of reference nodes from N_h(V_a ∪ V_b) when we only have V_a ∪ V_b. • We explore three algorithms for reference node sampling: Batch_BFS, importance sampling, and whole graph sampling.
Batch_BFS • Batch_BFS is just like an h-hop breadth-first search, but with the queue initialized with a set of nodes. • Initialize the queue with all event nodes (V_a ∪ V_b) to enumerate all reference nodes (N_h(V_a ∪ V_b)). • Correctness can be easily verified by imagining that we start with a virtual node connected to all nodes in V_a ∪ V_b and then do an (h+1)-hop BFS.
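A sketch of Batch_BFS as described, again assuming an adjacency-dict graph; it is a single multi-source BFS bounded at depth h (`batch_bfs` is an illustrative name):

```python
from collections import deque

def batch_bfs(adj, event_nodes, h):
    """Enumerate all reference nodes: seed the queue with every event node
    at distance 0, then expand outward by at most h hops."""
    seen = set(event_nodes)
    queue = deque((v, 0) for v in event_nodes)
    while queue:
        node, dist = queue.popleft()
        if dist == h:
            continue  # do not expand beyond h hops
        for nb in adj[node]:
            if nb not in seen:
                seen.add(nb)
                queue.append((nb, dist + 1))
    return seen
```

Seeding every event node at distance 0 is exactly the virtual-node argument on the slide: an (h+1)-hop BFS from a virtual node adjacent to all of V_a ∪ V_b reaches precisely the nodes within h hops of some event node.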
Importance Sampling (1) • The sample size n is usually much smaller than |N_h(V_a ∪ V_b)|. The idea is to sample nodes directly from N_h(V_a ∪ V_b) without enumerating it, so the time cost depends on n rather than |N_h(V_a ∪ V_b)|. • The basic operation is probing the h-hop neighborhood of an event node. Difficulties: • different event nodes have h-hop neighborhoods of different sizes • the neighborhoods can overlap heavily
Importance Sampling (2) • Uniform sampling by rejection sampling: Step 1: select an event node u with probability proportional to the size of its h-hop neighborhood. Step 2: perform an h-hop BFS to retrieve u's h-hop neighborhood. Step 3: randomly sample a node r from u's h-hop neighborhood. Step 4: do an h-hop BFS from r to count how many event nodes it can reach (say, c event nodes). Step 5: with probability 1/c, accept r as a reference node; otherwise this run yields nothing. (A node reachable from c event nodes is drawn with probability proportional to c, so accepting with probability 1/c makes the accepted sample uniform.) • Problem: heavy overlap leads to a high rejection probability!
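The five steps as a sketch, reusing the `h_hood` helper from the density sketch earlier; the membership test in step 4 stands in for the slide's second BFS, which is equivalent on an undirected graph (all names are mine):

```python
import random

def rejection_sample_once(adj, event_nodes, h):
    """One run of the rejection scheme; returns a reference node or None."""
    hoods = {u: h_hood(adj, u, h) for u in event_nodes}  # h_hood as above
    us = list(hoods)
    # Steps 1-2: pick an event node u proportional to |N_h(u)|.
    u = random.choices(us, weights=[len(hoods[x]) for x in us])[0]
    # Step 3: sample r uniformly from u's h-hop neighborhood.
    r = random.choice(list(hoods[u]))
    # Step 4: c = number of event nodes whose h-hop neighborhood contains r.
    c = sum(1 for s in hoods.values() if r in s)
    # Step 5: accept with probability 1/c, canceling the bias toward nodes
    # covered by many event nodes.
    return r if random.random() < 1.0 / c else None
```

(Recomputing every neighborhood per call is only for clarity; a real implementation would cache them.)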
Importance Sampling (3) • Follow the same sampling scheme, but do not reject any node, resulting in a nonuniform distribution p over all reference nodes, where p(r) is proportional to the number of event nodes r can reach within h hops • Intrinsically, w_j / n, where w_j is the number of times r_j is sampled, is an estimator of p(r_j). The goal is to design a proper estimator for t(a, b) which can leverage samples from p as a surrogate for uniform samples • A consistent estimator combines the concordance scores of the sampled nodes with their multiplicities w_j
Importance Sampling (4) • The importance sampling procedure: Step 1: select an event node u with probability proportional to the size of its h-hop neighborhood. Step 2: perform an h-hop BFS to retrieve u's h-hop neighborhood. Step 3: randomly sample a node r from u's h-hop neighborhood. Step 4: if r has been selected before, increment w_r; else add r to the sample set and set w_r = 1.
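The non-rejecting variant as a sketch, continuing the code above (it reuses `h_hood`); it returns each distinct sampled node with its multiplicity w_r:

```python
import random
from collections import Counter

def importance_sample(adj, event_nodes, h, n):
    """Draw n reference nodes without rejection (steps 1-4 on the slide).

    Returns a Counter mapping each distinct sampled node r to w_r, the
    number of times r was drawn; the draw distribution is proportional to
    the number of event nodes reaching r within h hops."""
    hoods = {u: h_hood(adj, u, h) for u in event_nodes}  # h_hood as above
    us = list(hoods)
    sizes = [len(hoods[x]) for x in us]
    w = Counter()
    for _ in range(n):
        u = random.choices(us, weights=sizes)[0]
        w[random.choice(list(hoods[u]))] += 1
    return w
```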
Whole Graph Sampling • When the set of all reference nodes, i.e. N_h(V_a ∪ V_b), is large enough relative to the whole graph, we simply sample nodes uniformly from the graph and keep those that fall inside N_h(V_a ∪ V_b)
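A sketch under the same assumptions: draw uniformly from V and accept a node only if it reaches some event node within h hops (membership test again via `h_hood`):

```python
import random

def whole_graph_sample(adj, all_nodes, event_nodes, h, n):
    """Uniform reference-node sample by rejection against the whole graph.

    all_nodes: a sequence of all graph nodes. A draw is accepted iff it lies
    in N_h(Va ∪ Vb), so the expected number of draws per accepted node is
    |V| / |N_h(Va ∪ Vb)| -- cheap only when that neighborhood is large."""
    events = set(event_nodes)
    sample = []
    while len(sample) < n:
        r = random.choice(all_nodes)
        if h_hood(adj, r, h) & events:  # h_hood from the earlier sketch
            sample.append(r)
    return sample
```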
Complexity Comparison • Space cost is the same for all three algorithms • Reference node sampling: • Batch_BFS: linear in the number of nodes and edges in the h-hop neighborhood of V_a ∪ V_b • Importance sampling: proportional to n times the average cost of an h-hop BFS search • Whole graph sampling: the cost per accepted node is inversely proportional to the size of the h-hop neighborhood of V_a ∪ V_b • Additional costs in common: • Event density computation • Z-score computation • We do not need too many sample reference nodes, since the variance of t(a, b) is upper bounded
Outline • Motivations • Measuring Two-Event Structural Correlation (TESC) • Efficient Computation • Experiments • Discussions and Future Work
Experiments – Datasets • DBLP • Co-author network • Events: keywords in paper titles • 964,677 nodes, 3,547,014 edges, 0.19M keywords • Intrusion • Obtained from a log of intrusion alerts in a computer network • Events: intrusion alerts • 200,858 nodes, 703,020 edges, 545 alerts • Twitter • 20 million nodes and 0.16 billion edges (used for scalability tests)
Experiments – Event Simulation (1) • Simulate positively and negatively correlated event pairs (on the DBLP graph) • Generate pairs for three h levels: 1, 2, 3 • Positive: linked pairs with Gaussian-distributed distances • Negative: every b occurrence is kept at least h+1 hops away from every a occurrence • Noise: break the correlation structure by relocating a fraction of event nodes
Experiments – Event Simulation (2) • Results for the positive case • [Figure: recall vs. noise fraction, for h = 1, 2, and 3.]
Experiments – Real Events (DBLP) • [Tables: highly positive keyword pairs and highly negative keyword pairs; comparison values computed by treating nodes as baskets.]
Experiments – Real Events (Intrusion) • [Tables: highly positive alert pairs and highly negative alert pairs.]
Experiments – Scalability • [Figure: running time as the number of event nodes increases, for h = 1 and h = 3. Results are obtained on the Twitter graph.]
Outline • Motivations • Measuring Two-Event Structural Correlation (TESC) • Efficient Computation • Experiments • Discussions and Future Work
Discussions (based on constructive comments from Dr. Kaplan) • TESC as correlation of local densities • Why a nonparametric statistic: • No distributional assumption, no linearity assumption • Nonparametric statistics are less powerful because they use less information (ranks only) • Modeling nonlinear correlation of data [Kaplan et al., JSTSP, 2009] • Kendall correlation vs. Spearman correlation: • Both can be used • Kendall's Tau is chosen because it has an intuitive interpretation and facilitates importance sampling • Intra-correlation and inter-correlation?
Future Work • Structure helps explain the distribution of events; conversely, events could also help explain structure (e.g., users who discuss very similar topics or buy very similar products may be linked)
Thank You!!! Questions?