
Mining Graph Patterns Efficiently via Randomized Summaries

This paper presents a method for efficiently mining graph patterns in large networks using randomized summaries. The approach summarizes each graph, mines patterns on the summarized database, and then verifies and refines the results against the original graphs. Experimental results show that the proposed method sidesteps the subgraph-isomorphism bottleneck on large graphs while keeping both false negatives and false positives under control.


Presentation Transcript


  1. Mining Graph Patterns Efficiently via Randomized Summaries • Chen Chen¹, Cindy X. Lin¹, Matt Fredrikson², Mihai Christodorescu³, Xifeng Yan⁴, Jiawei Han¹ • ¹ University of Illinois at Urbana-Champaign • ² University of Wisconsin at Madison • ³ IBM T. J. Watson Research Center • ⁴ University of California at Santa Barbara

  2. Outline • Motivation • The efficiency bottleneck encountered in big networks • Patterns must be preserved • Summarize-Mine • Experiments • Summary

  3. Frequent Subgraph Mining • [Figure: a query (pattern) graph and a graph database] • Find all subgraph patterns p such that |Dp| >= min_sup, where Dp is the set of database graphs containing p • Digs into the topological structure of graph data • Useful for many downstream applications
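A minimal sketch of the support computation that every frequent-subgraph miner relies on, written with networkx and node labels stored in a hypothetical "label" attribute; the VF2-based matcher stands in for whatever isomorphism test the real miner uses and is only practical on small graphs.

```python
import networkx as nx
from networkx.algorithms import isomorphism

def support(pattern, database):
    """Count how many graphs in `database` contain `pattern` as a subgraph.

    Node labels are assumed to live in the 'label' attribute; VF2 subgraph
    matching is exponential in the worst case, which is exactly the
    bottleneck the talk is about.
    """
    node_match = isomorphism.categorical_node_match("label", None)
    count = 0
    for g in database:
        gm = isomorphism.GraphMatcher(g, pattern, node_match=node_match)
        if gm.subgraph_is_monomorphic():  # pattern embeds somewhere in g
            count += 1
    return count

def is_frequent(pattern, database, min_sup):
    """A pattern is frequent when |Dp| >= min_sup."""
    return support(pattern, database) >= min_sup
```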

  4. Challenges • Subgraph isomorphism checking is unavoidable in any frequent subgraph mining algorithm • This becomes a serious problem on big networks • Suppose the network contains only one triangle • But it contains 1,000,000 length-2 paths • We must enumerate all 1,000,000 of them, because any one of them could potentially grow into the triangle

  5. Too Many Embeddings • Subgraph isomorphism is NP-hard • So, when the problem size increases, … • During mining, large patterns are grown from small subparts • For small subparts, there may be far too many (overlapping) embeddings in a big network • Enumerating all of these embeddings is what ultimately kills the computation

  6. Motivating Application • System call graphs from security research • Model dependencies among system calls • Unique subgraph signatures for malicious programs • Compare malicious/benign programs • These graphs are very big • Thousands of nodes on average • We tried state-of-the-art mining technologies, but they failed

  7. Our Approach • Subgraph isomorphism checking cannot be done on large networks • So we do it on small graphs instead • Summarize-Mine • Summarize: merge nodes that share a label and collapse the corresponding edges • Mine: on the shrunken graphs, state-of-the-art algorithms work again (see the sketch below)
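A minimal sketch of the randomized summarization step under the same networkx and "label"-attribute assumptions as before: each node is placed uniformly at random into one of the groups reserved for its label, and edges are collapsed onto the resulting group nodes. The groups_per_label dictionary is a hypothetical parameter of ours.

```python
import random
import networkx as nx

def summarize(g, groups_per_label, rng=random):
    """Return a randomized summary of graph g.

    groups_per_label maps each label l_j to x_j, the number of summary
    nodes reserved for that label; same-label nodes that land in the
    same group are merged, and parallel edges collapse into one.
    """
    assignment = {}  # original node -> (label, group id) = summary node
    for v, data in g.nodes(data=True):
        label = data["label"]
        assignment[v] = (label, rng.randrange(groups_per_label[label]))

    s = nx.Graph()
    for v, data in g.nodes(data=True):
        s.add_node(assignment[v], label=data["label"])
    for u, v in g.edges():
        if assignment[u] != assignment[v]:  # edges inside one group disappear as self-loops
            s.add_edge(assignment[u], assignment[v])
    return s
```

Because the group assignment is random, each summarization round preserves a different subset of embeddings, which is what the recovery argument on the later slides exploits.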

  8. Mining after Summarization

  9. Remedy for Pattern Changes • Frequent subgraphs are now reported at a different abstraction level • False negatives & false positives arise, compared to the true patterns mined from the un-summarized database D • False negatives (recover) • Randomized summarization + multiple rounds • False positives (delete) • Verify against D • Substantial verification work can be transferred to the summaries

  10. Outline • Motivation • Summarize-Mine • The algorithm flow-chart • Recovering false negatives • Verifying false positives • Experiments • Summary

  11. False Negatives • For a pattern p, if each of its vertices bears a different label, then the embeddings of p must be preserved after summarization • Since we merge vertices only within the same label, the vertices of such an embedding necessarily stay in different groups • Otherwise, two pattern vertices sharing a label may fall into the same group, destroying the embedding and producing a potential false negative

  12. Missing Prob. of Embeddings • Suppose • x_j nodes are assigned to label l_j (j = 1, …, L) in the summary S_i => x_j groups of nodes with label l_j in the original graph G_i • Pattern p has m_j vertices with label l_j • Then a given embedding of p in G_i is missed with probability 1 − ∏_{j=1..L} [x_j (x_j − 1) ⋯ (x_j − m_j + 1)] / x_j^{m_j}

  13. No “Collision” for Same Labels • Consider a specific embedding f: p -> G_i; f is preserved iff the vertices of f(p) stay in different groups • When the m_j nodes with label l_j are randomly assigned to x_j groups, the probability that they do not “collide” is [x_j (x_j − 1) ⋯ (x_j − m_j + 1)] / x_j^{m_j} • Multiply these probabilities over all labels, since the assignments for different labels are independent

  14. Example • A pattern with 5 labels, each label => 2 vertices • m_1 = m_2 = m_3 = m_4 = m_5 = 2 • Assign 20 nodes in the summary (i.e., 20 node groups in the original graph) for each label • The summary has 100 vertices • x_1 = x_2 = x_3 = x_4 = x_5 = 20 • The probability that an embedding will persist is (20 · 19 / 20²)⁵ = 0.95⁵ ≈ 0.774
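A quick numerical check of this example, sketching the preservation-probability formula; the function and variable names are ours, not the paper's.

```python
from math import prod

def preserve_prob(m, x):
    """Probability that one embedding survives summarization:
    for each label j, its m[j] pattern vertices must land in
    distinct groups out of the x[j] groups reserved for that label."""
    return prod((x_j - k) / x_j for m_j, x_j in zip(m, x) for k in range(m_j))

# 5 labels, 2 pattern vertices each, 20 groups per label
print(preserve_prob([2] * 5, [20] * 5))  # 0.95 ** 5, roughly 0.774
```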

  15. Extend to Multiple Graphs • Set x_1, …, x_L to the same values across all G_i’s in the database • The preservation probability then depends only on m_1, …, m_L, i.e., pattern p’s vertex label distribution • We denote this probability by q(p) • Each of p’s supporting graphs in D continues to support p with probability at least q(p) • Thus the overall support in the summarized database is bounded below by a binomial random variable

  16. Support Moves Downward

  17. False Negative Bound • For a pattern p with support at least min_sup in D, a single summarization round misses it with probability at most Pr[Binomial(min_sup, q(p)) < min_sup'], i.e., the chance that the binomial lower bound falls below the lowered threshold min_sup' • Running independent rounds drives this miss probability down exponentially

  18. Example, Cont. • As above, q(p)=0.774 • min_sup=50
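A sketch of how the bound from the previous slides can be evaluated for this example; the lowered thresholds min_sup' and the round count t used below are illustrative values of our own choosing, not numbers from the talk.

```python
from math import comb

def miss_prob_one_round(s, q, min_sup_prime):
    """Pr[Binomial(s, q) < min_sup']: the chance that a pattern with true
    support s drops below the lowered threshold on one randomized summary."""
    return sum(comb(s, k) * q**k * (1 - q)**(s - k) for k in range(min_sup_prime))

q, s = 0.774, 50                        # q(p) and min_sup from the example
for min_sup_prime in (30, 33, 36):      # illustrative lowered thresholds
    p1 = miss_prob_one_round(s, q, min_sup_prime)
    print(min_sup_prime, p1, p1 ** 3)   # with t = 3 independent rounds the miss probability decays geometrically
```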

  19. False Positives • Much easier to handle • Just check against the original database D • Discard if this “actual” support is less than min_sup
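A minimal sketch of this verification pass; `support` is assumed to be a subgraph-counting helper like the one sketched earlier.

```python
def drop_false_positives(candidates, database, min_sup, support):
    """Keep only candidate patterns whose support in the original
    database D truly reaches min_sup; everything else is a false
    positive introduced by summarization."""
    return [p for p in candidates if support(p, database) >= min_sup]
```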

  20. The Same Skeleton as gSpan • DFS code tree • Depth-first search • Is the DFS code minimum? (otherwise the pattern is a duplicate and the branch is pruned) • Check support by isomorphism tests • Record all one-edge extensions along the way • Pass down the projected database and recurse

  21. Integrate Verification Schemes • Top-down and bottom-up • Possible factors • Amount of false positives • Top-down verification can be performed early • Top-down preferred in experiments • [Figure: transaction ID lists. Top-down: verifying parent p1 yields Dp1, so its extension p2 only needs to be searched within Dp1. Bottom-up: verifying child p2 yields Dp2, so p1 only needs to be searched within D − Dp2 and can stop as soon as it is found frequent.]

  22. Summary-Guided Verification • Substantial verification work can be performed on the summaries as well

  23. Iterative Summarize-Mine • Use a single pattern tree to hold all results spanning multiple iterations • No need to combine pattern sets in a final step • Avoid verifying patterns that have already been checked in previous iterations • Verified support graphs are exact, so they can help pre-prune later iterations • Details omitted (a control-flow sketch follows)
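A control-flow sketch of the iteration, assuming callables like the earlier sketches: `summarize` produces one randomized summary, `mine_frequent` stands in for the gSpan-style miner, and `support` counts occurrences in D. The Weisfeiler-Lehman hash below is only an illustrative deduplication key, not the paper's pattern tree.

```python
import networkx as nx

def iterative_summarize_mine(D, summarize, mine_frequent, support,
                             t, min_sup, min_sup_prime):
    """Run t randomized Summarize-Mine rounds over transaction database D."""
    verified = {}                                      # dedup key -> (pattern, support in D)
    for _ in range(t):
        S = [summarize(g) for g in D]                  # 1. summarize every transaction
        for p in mine_frequent(S, min_sup_prime):      # 2. mine on the summaries
            key = nx.weisfeiler_lehman_graph_hash(p, node_attr="label")
            if key in verified:                        # already verified in an earlier round
                continue
            verified[key] = (p, support(p, D))         # 3. verify against the original D
    # drop false positives: keep patterns whose true support reaches min_sup
    return [p for p, s in verified.values() if s >= min_sup]
```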

  24. Outline • Motivation • Summarize-Mine • Experiments • Summary

  25. Dataset • Real data • W32.Stration, a family of mass-mailing worms • W32.Virut, W32.Delf, W32.Ldpinch, W32.Poisonivy, etc. • Vertex # up to 20,000 and edge # even higher • Avg. # of vertices: 1,300 • Synthetic data • Size, # of distinct node/edge labels, etc. • Generator details omitted

  26. A Sample Malware Signature • Mined from W32.Stration • A malware sample that reads and leaks certain registry settings related to network devices

  27. Comparison with gSpan • gSpan is an efficient graph pattern mining algorithm • Graphs of different sizes are randomly drawn • Beyond a certain size, gSpan can no longer finish

  28. The Influence of min_sup' • Total vs. False Positives • The gap corresponds to true patterns • It gradually widens as we decrease min_sup'

  29. Summarization Ratio • 10 original nodes merged into 1 summary node => ratio = 10 • min_sup' and t are traded off in the inner loop • A range of reasonable settings lies in the middle

  30. Scalability • On the synthetic data • Parameters are tuned as done above

  31. Outline • Motivation • Summarize-Mine • Experiments • Summary

  32. Summary • We solve the frequent subgraph mining problem for very large graphs • We found interesting malware signatures • Our algorithm is much more efficient, whereas state-of-the-art mining technologies do not work at all on these graphs • We show that patterns can be well preserved at a higher abstraction level by a good generalization scheme • Very useful, given the emerging trend of huge networks whose data has to be preprocessed and summarized

  33. Summary • Our method is orthogonal to much previous work on this topic => combine them for further improvement • Efficient pattern space traversal • Other data space reduction techniques, different from our compression within individual transactions • Transaction sampling, merging, etc. • These perform compression across transactions

  34. Thank You!
