Measuring Two-Event Structural Correlations on Graphs Ziyu Guan, Nan Li, Xifeng Yan Department of Computer Science UC Santa Barbara
Outline • Motivations • Measuring Two-Event Structural Correlation (TESC) • Efficient Computation • Experiments • Discussions and Future Work
Intrusion • [Figure: intrusion alerts on a computer network — Ping Sweep and SMB Service Sweep alerts attract each other (Attraction).]
Product Sales • What is the relationship between the sales of two products in a social network? Do they attract or repel each other? [Figure: Attraction vs. Repulsion of product occurrences.]
A New Notion of Correlation • Two-Event Structural Correlation (TESC) • Defined on graph structures • Captures the relationship between the distributions of two events on a graph • Events can be different things in different contexts: • Topics or products (social networks) • Viruses (contact networks) • Intrusion alerts (computer networks)
It Is A Nontrivial Problem • Simply computing the average distance between occurrences of two events will not work • The average distance for a positively correlated pair could be longer than that for a negatively correlated pair • gScore cannot be adapted [Z. Guan et al., SIGMOD 2011] • Significance cannot be assessed by randomization!
Outline • Motivations • Measuring Two-Event Structural Correlation (TESC) • Efficient Computation • Experiments • Discussions and Future Work
How To Measure? • Positive correlation: the presence of event A tends to imply the presence of event B; more A also tends to attract more B. • Negative correlation: the presence of one event is likely to imply the absence of the other; more A means less B. • Our idea: employ reference nodes in the graph as observers to capture these characteristics quantitatively, avoiding randomization for significance testing.
Preliminaries • A graph G = (V, E) and an event set Q = {q_i}. Given two events a and b in Q, V_a and V_b are the sets of nodes having a and b, respectively. • Def. (Node h-hop neighborhood): given a node, the subgraph induced by the nodes within distance h from that node. • Def. (Node set h-hop neighborhood): given a node set, the subgraph induced by the union of all nodes that are within distance h from at least one node in the set.
Measuring Concordance • Density function: d_a(r) = the fraction of nodes possessing event a in r's h-hop neighborhood. • Concordance score for a pair of reference nodes (r_i, r_j): s(r_i, r_j) = +1 if the density changes of the two events are consistent, i.e. (d_a(r_i) − d_a(r_j))(d_b(r_i) − d_b(r_j)) > 0; −1 if the density changes are inconsistent (the product is negative); 0 for a tie (the product is zero).
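A minimal Python sketch of the two quantities above, assuming an undirected graph stored as an adjacency dict; the names `h_hood`, `density`, and `concordance` are illustrative, not from the paper's code:

```python
from collections import deque

def h_hood(adj, start, h):
    """Nodes within distance h of `start`, including `start` itself."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        v, d = queue.popleft()
        if d < h:
            for nb in adj[v]:
                if nb not in seen:
                    seen.add(nb)
                    queue.append((nb, d + 1))
    return seen

def density(adj, event_nodes, r, h):
    """Fraction of nodes in r's h-hop neighborhood that carry the event."""
    hood = h_hood(adj, r, h)
    return len(hood & event_nodes) / len(hood)

def concordance(da_i, db_i, da_j, db_j):
    """+1 if the density changes of a and b agree between two reference
    nodes, -1 if they disagree, 0 on a tie."""
    prod = (da_i - da_j) * (db_i - db_j)
    return (prod > 0) - (prod < 0)
```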
Kendall’s Tau as The Measure • Kendall’s Tau rank correlation is used to compute the overall concordance among reference nodes with regard to the density changes of the two events: τ(a, b) = Σ_{i<j} s(r_i, r_j) / (N(N − 1)/2), where N is the number of all reference nodes and s is the concordance score defined above. • τ(a, b) lies in [−1, 1]. A higher positive value means a stronger positive correlation; a lower negative value means a stronger negative correlation; 0 means no correlation.
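Given the per-node densities, the statistic is just the average pairwise concordance score. A small sketch (tau-a form, counting ties as zero-score pairs, which is one reading of the slide's formula):

```python
from itertools import combinations

def kendall_tau(density_a, density_b):
    """Average pairwise concordance over reference nodes.

    density_a, density_b: dicts mapping reference node -> h-hop event density.
    """
    nodes = list(density_a)
    n = len(nodes)
    total = 0
    for ri, rj in combinations(nodes, 2):
        prod = (density_a[ri] - density_a[rj]) * (density_b[ri] - density_b[rj])
        total += (prod > 0) - (prod < 0)  # +1 concordant, -1 discordant, 0 tie
    return total / (n * (n - 1) / 2)
```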
Significance Testing • Impractical to compute τ(a, b) over all reference nodes directly • Testing: choose a uniform sample of n reference nodes, and compute the sample score t(a, b) over this sample • It is proved that the distribution of t(a, b) under the null hypothesis tends to the normal distribution with mean 0 and a variance related to n • Thus, the correlation significance (z-score) is z(a, b) = t(a, b) / √Var[t(a, b)]
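The slide only states that the null variance is "related to n". For the classical Kendall statistic without ties that variance is 2(2n + 5) / (9n(n − 1)); the sketch below assumes that form, and the paper's exact variance term may differ (e.g., with a tie correction):

```python
import math

def z_score(t_sample, n):
    """z-score of a sampled tau under the null hypothesis of no correlation.

    Assumption: the classical no-ties Kendall null variance
    2(2n+5) / (9n(n-1)); the paper's variance may account for ties.
    """
    var = 2.0 * (2 * n + 5) / (9.0 * n * (n - 1))
    return t_sample / math.sqrt(var)
```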
Reference Nodes • The reasons for choosing N_h(V_a ∪ V_b), the h-hop neighborhood of the event nodes, as the set of all reference nodes: • Nodes outside it ("out-of-sight" nodes) cannot reach any event node within h hops • Incorporating them can only increase the number of consistent pairs and increase the number of ties (decreasing the variance in the null case), leading to unexpectedly high z-scores
Outline • Motivations • Measuring Two-Event Structural Correlation (TESC) • Efficient Computation • Experiments • Discussions and Future Work
Efficient Computation • The key problem in efficient computation is how to obtain a uniform sample of reference nodes from N_h(V_a ∪ V_b) when we only have V_a ∪ V_b. • We explore three algorithms for reference node sampling: Batch_BFS, importance sampling, and whole graph sampling.
Batch_BFS • Batch_BFS is just like an h-hop breadth-first search, but with the queue initialized with a set of nodes. • Initialize the queue with all event nodes (V_a ∪ V_b) to enumerate all reference nodes (N_h(V_a ∪ V_b)). • Correctness can be easily verified by imagining that we start with a virtual node connected to all nodes in V_a ∪ V_b and then do an (h+1)-hop BFS.
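A sketch of Batch_BFS as described, again assuming an adjacency-dict graph; it is a single multi-source BFS bounded at depth h (`batch_bfs` is an illustrative name):

```python
from collections import deque

def batch_bfs(adj, event_nodes, h):
    """Enumerate all reference nodes: seed the queue with every event node
    at distance 0, then expand outward by at most h hops."""
    seen = set(event_nodes)
    queue = deque((v, 0) for v in event_nodes)
    while queue:
        node, dist = queue.popleft()
        if dist == h:
            continue  # do not expand beyond h hops
        for nb in adj[node]:
            if nb not in seen:
                seen.add(nb)
                queue.append((nb, dist + 1))
    return seen
```

Seeding every event node at distance 0 is exactly the virtual-node argument on the slide: an (h+1)-hop BFS from a virtual node adjacent to all of V_a ∪ V_b reaches precisely the nodes within h hops of some event node.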
Importance Sampling (1) • The sample size n is usually much smaller than |N_h(V_a ∪ V_b)|. The idea is to sample nodes directly from N_h(V_a ∪ V_b) without enumerating it, so the time cost depends on n rather than |N_h(V_a ∪ V_b)|. • The basic operation is probing the h-hop neighborhood of an event node. Difficulties: • different event nodes have h-hop neighborhoods of different sizes • the neighborhoods can overlap heavily
Importance Sampling (2) • Uniform sampling by rejection sampling: Step 1: select an event node u with probability proportional to the size of its h-hop neighborhood. Step 2: perform an h-hop BFS to retrieve u's h-hop neighborhood. Step 3: randomly sample a node r from u's h-hop neighborhood. Step 4: do an h-hop BFS from r to count how many event nodes it can reach (say, c event nodes). Step 5: with probability 1/c, accept r as a reference node; otherwise this run yields nothing. (A node reachable from c event nodes is drawn with probability proportional to c, so accepting with probability 1/c makes the accepted sample uniform.) • Problem: heavy overlap leads to a high rejection probability!
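The five steps as a sketch, reusing the `h_hood` helper from the density sketch earlier; the membership test in step 4 stands in for the slide's second BFS, which is equivalent on an undirected graph (all names are mine):

```python
import random

def rejection_sample_once(adj, event_nodes, h):
    """One run of the rejection scheme; returns a reference node or None."""
    hoods = {u: h_hood(adj, u, h) for u in event_nodes}  # h_hood as above
    us = list(hoods)
    # Steps 1-2: pick an event node u proportional to |N_h(u)|.
    u = random.choices(us, weights=[len(hoods[x]) for x in us])[0]
    # Step 3: sample r uniformly from u's h-hop neighborhood.
    r = random.choice(list(hoods[u]))
    # Step 4: c = number of event nodes whose h-hop neighborhood contains r.
    c = sum(1 for s in hoods.values() if r in s)
    # Step 5: accept with probability 1/c, canceling the bias toward nodes
    # covered by many event nodes.
    return r if random.random() < 1.0 / c else None
```

(Recomputing every neighborhood per call is only for clarity; a real implementation would cache them.)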
Importance Sampling (3) • Follow the same sampling scheme, but do not reject any node, resulting in a nonuniform distribution p over all reference nodes, where p(r) is proportional to the number of event nodes r can reach within h hops • Intrinsically, w_j / n, where w_j is the number of times r_j is sampled, is an estimator of p(r_j). The goal is to design a proper estimator for t(a, b) which can leverage samples from p as a surrogate for uniform samples • A consistent estimator combines the concordance scores of the sampled nodes with their multiplicities w_j
Importance Sampling (4) • The importance sampling procedure: Step 1: select an event node u with probability proportional to the size of its h-hop neighborhood. Step 2: perform an h-hop BFS to retrieve u's h-hop neighborhood. Step 3: randomly sample a node r from u's h-hop neighborhood. Step 4: if r has been selected before, increment w_r; else add r to the sample set and set w_r = 1.
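The non-rejecting variant as a sketch, continuing the code above (it reuses `h_hood`); it returns each distinct sampled node with its multiplicity w_r:

```python
import random
from collections import Counter

def importance_sample(adj, event_nodes, h, n):
    """Draw n reference nodes without rejection (steps 1-4 on the slide).

    Returns a Counter mapping each distinct sampled node r to w_r, the
    number of times r was drawn; the draw distribution is proportional to
    the number of event nodes reaching r within h hops."""
    hoods = {u: h_hood(adj, u, h) for u in event_nodes}  # h_hood as above
    us = list(hoods)
    sizes = [len(hoods[x]) for x in us]
    w = Counter()
    for _ in range(n):
        u = random.choices(us, weights=sizes)[0]
        w[random.choice(list(hoods[u]))] += 1
    return w
```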
Whole Graph Sampling • When the set of all reference nodes, i.e. N_h(V_a ∪ V_b), is large enough relative to the whole graph, we simply sample nodes uniformly from the graph and keep those that fall inside N_h(V_a ∪ V_b)
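A sketch under the same assumptions: draw uniformly from V and accept a node only if it reaches some event node within h hops (membership test again via `h_hood`):

```python
import random

def whole_graph_sample(adj, all_nodes, event_nodes, h, n):
    """Uniform reference-node sample by rejection against the whole graph.

    all_nodes: a sequence of all graph nodes. A draw is accepted iff it lies
    in N_h(Va ∪ Vb), so the expected number of draws per accepted node is
    |V| / |N_h(Va ∪ Vb)| -- cheap only when that neighborhood is large."""
    events = set(event_nodes)
    sample = []
    while len(sample) < n:
        r = random.choice(all_nodes)
        if h_hood(adj, r, h) & events:  # h_hood from the earlier sketch
            sample.append(r)
    return sample
```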
Complexity Comparison • Space cost is the same for all three algorithms • Reference node sampling: • Batch_BFS: linear in the number of nodes and edges in the h-hop neighborhood of V_a ∪ V_b • Importance sampling: proportional to n times the average cost of an h-hop BFS search • Whole graph sampling: the cost per accepted node is inversely proportional to the size of the h-hop neighborhood of V_a ∪ V_b • Additional costs in common: • Event density computation • Z-score computation • We do not need too many sample reference nodes, since the variance of t(a, b) is upper bounded
Outline • Motivations • Measuring Two-Event Structural Correlation (TESC) • Efficient Computation • Experiments • Discussions and Future Work
Experiments – Datasets • DBLP • Co-author network • Events: keywords in paper titles • 964,677 nodes, 3,547,014 edges, 0.19M keywords • Intrusion • Obtained from a log of intrusion alerts in a computer network • Events: intrusion alerts • 200,858 nodes, 703,020 edges, 545 alerts • Twitter • 20 million nodes and 0.16 billion edges (used for scalability tests)
Experiments – Event Simulation (1) • Simulate positively and negatively correlated event pairs (on the DBLP graph) • Generate pairs for three h levels: 1, 2, 3 • Positive: linked pairs with Gaussian-distributed distances • Negative: every b occurrence is kept at least h+1 hops away from every a occurrence • Noise: break the correlation structure by relocating a fraction of event nodes
Experiments – Event Simulation (2) • Results for the positive case • [Figure: recall vs. noise fraction, for h = 1, 2, and 3.]
Experiments – Real Events (DBLP) • [Tables: highly positive keyword pairs and highly negative keyword pairs; comparison values computed by treating nodes as baskets.]
Experiments – Real Events (Intrusion) • [Tables: highly positive alert pairs and highly negative alert pairs.]
Experiments – Scalability • [Figure: running time as the number of event nodes increases, for h = 1 and h = 3. Results are obtained on the Twitter graph.]
Outline • Motivations • Measuring Two-Event Structural Correlation (TESC) • Efficient Computation • Experiments • Discussions and Future Work
Discussions (based on constructive comments from Dr. Kaplan) • TESC as correlation of local densities • Why a nonparametric statistic: • No distributional assumption, no linearity assumption • Nonparametric statistics are less powerful because they use less information (ranks only) • Modeling nonlinear correlation of data [Kaplan et al., JSTSP, 2009] • Kendall correlation vs. Spearman correlation: • Both can be used • Kendall's Tau is chosen because it has an intuitive interpretation and facilitates importance sampling • Intra-correlation and inter-correlation?
Future Work • Structure helps explain the distribution of events; conversely, events could also help explain structure (e.g., users who discuss very similar topics or buy very similar products may be linked)
Thank You!!! Questions?