Mining Graph Patterns Efficiently via Randomized Summaries Chen Chen1, Cindy X. Lin1, Matt Fredrikson2, Mihai Christodorescu3, Xifeng Yan4, Jiawei Han1 1University of Illinois at Urbana-Champaign 2University of Wisconsin at Madison 3IBM T. J. Watson Research Center 4University of California at Santa Barbara
Outline • Motivation • The efficiency bottleneck encountered in big networks • Patterns must be preserved • Summarize-Mine • Experiments • Summary
(figure: a query graph matched against a graph database) Frequent Subgraph Mining • Given a graph database D, find all patterns p such that |Dp| >= min_sup, where Dp is the set of graphs in D containing p • Exploits the topological structure of graph data • Useful for many downstream applications
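As a toy illustration of the support definition above (not the authors' implementation), the following sketch counts |Dp| by brute-force subgraph isomorphism; the dict-based graph encoding and the names `embeds`/`support` are our own assumptions:

```python
from itertools import permutations

def embeds(p_nodes, p_edges, g_nodes, g_edges):
    # p_nodes/g_nodes: node id -> label; p_edges/g_edges: set of (u, v).
    # Try every injective mapping of pattern nodes into graph nodes.
    pids = list(p_nodes)
    for image in permutations(g_nodes, len(pids)):
        m = dict(zip(pids, image))
        if all(p_nodes[v] == g_nodes[m[v]] for v in pids) and \
           all((m[u], m[v]) in g_edges or (m[v], m[u]) in g_edges
               for u, v in p_edges):
            return True
    return False

def support(p_nodes, p_edges, database):
    # |Dp|: number of database graphs containing pattern p.
    return sum(embeds(p_nodes, p_edges, gn, ge) for gn, ge in database)
```

The factorial cost of `permutations` is exactly the bottleneck the next slides discuss; real miners such as gSpan prune this search aggressively.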
Challenges • Subgraph isomorphism checking is inevitable for any frequent subgraph mining algorithm • This becomes a severe bottleneck on big networks • Suppose there is only one triangle in the network • But there are 1,000,000 length-2 paths • We must enumerate all 1,000,000 of them, because any one could potentially grow into the full triangle
Too Many Embeddings • Subgraph isomorphism is NP-hard, so running time can blow up as the problem size increases • During mining, large patterns are grown from small subparts • A small subpart may have a huge number of (overlapping) embeddings in a big network • Enumerating these embeddings eventually dominates the computation
Motivating Application • System call graphs from security research • Model dependencies among system calls • Unique subgraph signatures for malicious programs • Compare malicious vs. benign programs • These graphs are very big • Thousands of nodes on average • We tried state-of-the-art mining technologies, but they failed
Our Approach • Subgraph isomorphism checking cannot be done on large networks, so we do it on small graphs instead • Summarize-Mine • Summarize: merge nodes with the same label and collapse the corresponding edges • Mine: on the much smaller summaries, state-of-the-art algorithms work again
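A minimal sketch of the summarize step as described on this slide (randomly group same-label nodes, merge each group, collapse edges); the function name, the per-label group counts `x`, and the data layout are our assumptions:

```python
import random

def summarize(nodes, edges, x, rng):
    # nodes: id -> label; edges: set of (u, v); x: label -> number of groups.
    # Throw each node into a random group of its label, then merge each
    # group into a single summary node and collapse duplicate edges.
    group = {v: (lab, rng.randrange(x[lab])) for v, lab in nodes.items()}
    s_nodes = {g: g[0] for g in group.values()}
    s_edges = {(group[u], group[v]) for u, v in edges
               if group[u] != group[v]}   # edges inside a group vanish
    return s_nodes, s_edges
```

Because the grouping is random, different rounds produce different summaries, which is exactly what the recovery scheme on the later slides exploits.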
Remedy for Pattern Changes • Frequent subgraphs are now found at a different abstraction level • False negatives & false positives, compared to the true patterns mined from the un-summarized database D • False negatives (recover) • Randomized summarization + multiple rounds • False positives (delete) • Verify against D • Substantial verification work can be transferred to the summaries
Outline • Motivation • Summarize-Mine • The algorithm flow-chart • Recovering false negatives • Verifying false positives • Experiments • Summary
False Negatives • For a pattern p, if each of its vertices bears a different label, then the embeddings of p must be preserved after summarization • Since we merge groups of vertices by label, the nodes of p all stay in different groups • Otherwise, two of p's vertices may be merged into one summary node and an embedding can be lost
Missing Prob. of Embeddings • Suppose • xj nodes are assigned to label lj (j = 1, …, L) in the summary Si => xj groups of nodes with label lj in the original graph Gi • Pattern p has mj nodes with label lj • Then the probability that a given embedding of p survives summarization depends only on the mj's and xj's
No “Collision” for Same Labels • Consider a specific embedding f: p -> Gi; f is preserved iff the vertices in f(p) stay in different groups • Randomly assigning mj nodes with label lj to xj groups, the probability that they do not “collide” is xj(xj − 1)⋯(xj − mj + 1) / xj^mj • Multiply these probabilities across labels, since the assignments for different labels are independent
Example • A pattern with 5 labels, each label => 2 vertices • m1 = m2 = m3 = m4 = m5 = 2 • Assign 20 nodes in the summary (i.e., 20 node groups in the original graph) to each label • The summary has 100 vertices • x1 = x2 = x3 = x4 = x5 = 20 • The probability that an embedding persists: (20·19/20²)⁵ = 0.95⁵ ≈ 0.774
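The example's numbers can be checked directly. A sketch of the survival probability as a product, over labels, of the no-collision probability xj(xj − 1)⋯(xj − mj + 1) / xj^mj; the function name is our choice:

```python
from math import prod

def preserve_prob(m, x):
    # For each label j: the m[j] pattern nodes thrown into x[j] groups
    # must all land in distinct groups: x(x-1)...(x-m+1) / x^m.
    return prod(prod(xj - i for i in range(mj)) / xj ** mj
                for mj, xj in zip(m, x))

q = preserve_prob([2] * 5, [20] * 5)   # the 5-label example: 0.95^5
```

With mj = 2 and xj = 20 for all five labels, each factor is 20·19/400 = 0.95, giving q ≈ 0.774 as on the next slide.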
Extend to Multiple Graphs • Set x1, …, xL to the same values across all Gi's in the database • The survival probability then depends only on m1, …, mL, i.e., on pattern p's vertex label distribution • We denote this probability by q(p) • Each of p's support graphs in D continues to support p with probability at least q(p) • Thus, the overall support on the summaries is bounded below by a binomial random variable
Example, Cont. • As above, q(p) = 0.774 • min_sup = 50 • The summarized support of such a pattern is then bounded below by Binomial(50, 0.774), so mining the summaries with a lowered threshold min_sup' < 50 recovers it with high probability
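Using the binomial lower bound from the previous slide, the chance that a truly frequent pattern is missed in one round (its summarized support falling below a lowered threshold min_sup') can be sketched as below; the function names and the particular min_sup' value used in the test are illustrative assumptions, not the paper's parameters:

```python
from math import comb

def miss_prob(s, q, min_sup_prime):
    # P[Binomial(s, q) < min_sup']: a pattern with true support s
    # drops below the summary threshold in one round.
    return sum(comb(s, k) * q ** k * (1 - q) ** (s - k)
               for k in range(min_sup_prime))

def miss_all_rounds(s, q, min_sup_prime, t):
    # Rounds use independent random summaries, so misses multiply.
    return miss_prob(s, q, min_sup_prime) ** t
```

Since each round's miss probability is strictly below 1, running t independent rounds drives the overall false-negative probability down exponentially.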
False Positives • Much easier to handle • Just check against the original database D • Discard if this “actual” support is less than min_sup
The Same Skeleton as gSpan • DFS code tree • Depth-first search • Minimum DFS code? • Check support by isomorphism tests • Record all one-edge extensions along the way • Pass down the projected database and recurse
Integrate Verification Schemes • Two possible orders: top-down and bottom-up • Relevant factors • The amount of false positives • Top-down verification can be performed early • Experiments favor top-down • With transaction ID lists: once p1 is verified with ID list Dp1, any extension p2 only needs to be searched within Dp1; conversely, once p2 is verified with ID list Dp2, p1 only needs to be searched within D − Dp2 and can stop as soon as it becomes frequent
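A toy sketch of top-down verification with transaction ID lists (an extension's true support graphs are a subset of its parent's, so children are only checked inside the parent's verified list); all names, and the use of strings with a substring test as a stand-in for graphs and containment, are our assumptions:

```python
def verify_top_down(p, children, tids, contains, min_sup, out):
    # Verify pattern p against the candidate transaction ids 'tids',
    # then recurse into its extensions using the narrowed id list.
    ids = [t for t in tids if contains(p, t)]
    if len(ids) >= min_sup:
        out.append(p)
        for c in children.get(p, ()):
            verify_top_down(c, children, ids, contains, min_sup, out)

# Toy run: strings as "patterns", substring test as containment.
D = ['ab', 'a', 'b']
out = []
verify_top_down('a', {'a': ['ab']}, range(len(D)),
                lambda p, t: p in D[t], 2, out)
```

Here 'a' is verified on transactions {0, 1}, so its extension 'ab' is only tested on those two transactions, where it falls below min_sup and is discarded.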
Summary-Guided Verification • Substantial verification work can be performed on the summaries, as well
Iterative Summarize-Mine • Use a single pattern tree to hold all results across multiple iterations • No need to combine pattern sets in a final step • Avoid re-verifying patterns already checked in previous iterations • Verified supports are exact, so they can help prune candidates in later iterations • Details omitted
Outline • Motivation • Summarize-Mine • Experiments • Summary
Dataset • Real data • W32.Stration, a family of mass-mailing worms • W32.Virut, W32.Delf, W32.Ldpinch, W32.Poisonivy, etc. • Vertex # up to 20,000 and edge # even higher • Avg. # of vertices: 1,300 • Synthetic data • Size, # of distinct node/edge labels, etc. • Generator details omitted
A Sample Malware Signature • Mined from W32.Stration • A malware program that reads and leaks certain registry settings related to the network devices
Comparison with gSpan • gSpan is an efficient graph pattern mining algorithm • Graphs of different sizes are randomly drawn from the data • Beyond a certain graph size, gSpan fails to finish
The Influence of min_sup' • Total patterns vs. false positives • The gap between the two corresponds to true patterns • It gradually widens as min_sup' decreases
Summarization Ratio • 10 nodes before / 1 node after summarization => ratio = 10 • min_sup' and the number of rounds t are traded off in the inner loop • A range of reasonable parameters lies in the middle
Scalability • On the synthetic data • Parameters are tuned as done above
Outline • Motivation • Summarize-Mine • Experiments • Summary
Summary • We solve the frequent subgraph mining problem for large graphs • We found interesting malware signatures • Our algorithm is much more efficient, whereas state-of-the-art mining technologies do not work at this scale • We show that patterns can be well preserved at a higher abstraction level by a good generalization scheme • Very useful, given the emerging trend of huge networks • The data only has to be preprocessed and summarized
Summary • Our method is orthogonal to much previous work on this topic => combine for further improvement • Efficient pattern space traversal • Other data space reduction techniques, different from our compression within individual transactions • Transaction sampling, merging, etc. • These perform compression across transactions