Graph-based Pattern Learning • Dr. Larry Holder • School of EECS, WSU
Graphs • Social Network • Protein-protein Interaction • Internet • Power Grid • Web
Some Graph Statistics • Web • 10B pages, 1T hyperlinks • Topology storage: 10TB • Google PageRank: Eigenvector on 10Bx10B adjacency matrix (sparse) • MySpace • 100M users, 10B friendship links • Clique/community detection • 300K new users per day
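To make the "eigenvector on a sparse adjacency matrix" bullet concrete, here is a minimal sketch of PageRank computed by power iteration over an adjacency list. The damping factor of 0.85, the iteration count, and the four-page toy web are assumptions for illustration and are not taken from the slides.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Power iteration on a sparse adjacency list {page: [linked pages]}.

    Every link target must also appear as a key of out_links.
    """
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Every page receives the same small "teleport" share.
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            targets = out_links[v]
            if targets:
                share = damping * rank[v] / len(targets)
                for w in targets:
                    new_rank[w] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[v] / n
                for w in nodes:
                    new_rank[w] += share
        rank = new_rank
    return rank


# Toy web (made up): only the non-zero entries of the matrix are stored.
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(web)
best_page = max(ranks, key=ranks.get)
```

Storing only the out-links is what keeps a 10Bx10B matrix tractable: each iteration touches every hyperlink once instead of every matrix cell.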
Graph Problems • Degree • Diameter • Centrality • Shortest path • Cycles/tours • Minimum spanning tree • Traversals/search • Connectivity • Clustering • Partitioning • Cliques • Motifs • Subgraph isomorphism • Frequent subgraphs • Pattern learning • Dynamics
Graph-based Pattern Learning • Unsupervised pattern discovery • Hierarchical conceptual clustering • Supervised pattern learning • Anomaly detection • Dynamic graph pattern learning
Unsupervised Pattern Discovery • Frequency-based (AGM, gSpan, FSG, Gaston) • “Graph-based Data Mining” • Find all subgraphs g within a set of graph transactions G such that |{Gi ∈ G : g ⊆ Gi}| / |G| ≥ t • where ⊆ is subgraph isomorphism and • t is the minimum support • Focus on pruning and fast, code-based graph matching • Still requires subgraph isomorphism
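The support test on this slide can be sketched as follows. This is a brute-force illustration only, not how AGM, gSpan, FSG, or Gaston work: those miners exist precisely to avoid this kind of exhaustive matching. The graph encoding, the helper names (`make_graph`, `contains`, `support`), and the toy transactions are all assumptions.

```python
from itertools import permutations


def make_graph(labels, edge_pairs):
    """labels: {vertex: label}; edges are stored as undirected frozensets."""
    return labels, {frozenset(e) for e in edge_pairs}


def contains(graph, pattern):
    """True if `pattern` is subgraph-isomorphic to `graph` (brute force)."""
    g_labels, g_edges = graph
    p_labels, p_edges = pattern
    p_nodes = list(p_labels)
    # Try every injective assignment of pattern vertices to graph vertices.
    for image in permutations(g_labels, len(p_nodes)):
        mapping = dict(zip(p_nodes, image))
        if any(p_labels[u] != g_labels[mapping[u]] for u in p_nodes):
            continue
        if all(frozenset(mapping[u] for u in e) in g_edges for e in p_edges):
            return True
    return False


def support(transactions, pattern):
    """Fraction of graph transactions that contain the pattern."""
    return sum(contains(g, pattern) for g in transactions) / len(transactions)


# Toy transactions (made up).
transactions = [
    make_graph({1: "C", 2: "O", 3: "H"}, [(1, 2), (1, 3)]),
    make_graph({1: "C", 2: "O"}, [(1, 2)]),
    make_graph({1: "C", 2: "H"}, [(1, 2)]),
]
pattern = make_graph({"x": "C", "y": "O"}, [("x", "y")])
min_support = 0.5
is_frequent = support(transactions, pattern) >= min_support
```

The permutation loop is exponential in the pattern size, which is the cost the "still requires subgraph isomorphism" bullet refers to.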
Unsupervised Pattern Discovery • Graph compression and the minimum description length (MDL) principle • The best theory minimizes the description length of the theory and the description length of the data given the theory • The best graphical pattern S minimizes the description length of S and the description length of the graph G compressed with pattern S: min over S of DL(S) + DL(G|S) • where description length DL(G) is the minimum number of bits needed to represent G (SUBDUE) • Compression can be based on inexact matches to pattern • [Figure: graph with instances of patterns S1 and S2 marked]
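A rough sketch of the compression idea follows. It replaces the bit-level description length with a crude stand-in, the number of vertices plus the number of edges; that proxy, the function names, and the two-triangle example are assumptions, and a real MDL encoding would count bits for labels and adjacency structure.

```python
def size(num_vertices, num_edges):
    """Crude stand-in for description length: vertices plus edges."""
    return num_vertices + num_edges


def compressed_size(vertices, edges, instances):
    """Size of G after each pattern instance (a vertex set) becomes one vertex."""
    owner = {}
    for i, instance in enumerate(instances):
        for v in instance:
            owner[v] = ("S", i)
    new_vertices = {owner.get(v, v) for v in vertices}
    # Edges inside an instance are absorbed by the pattern; the rest survive.
    surviving_edges = sum(1 for u, v in edges if owner.get(u, u) != owner.get(v, v))
    return size(len(new_vertices), surviving_edges)


# Toy graph (made up): two triangles joined by one edge.
vertices = [1, 2, 3, 4, 5, 6]
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
instances = [{1, 2, 3}, {4, 5, 6}]  # non-overlapping instances of a triangle

dl_g = size(len(vertices), len(edges))                       # uncompressed graph
dl_s = size(3, 3)                                            # the triangle pattern
dl_g_given_s = compressed_size(vertices, edges, instances)   # graph after compression
total = dl_s + dl_g_given_s
```

Under this proxy the triangle is worth keeping because `total` (9) is smaller than `dl_g` (13); a pattern with a single instance would usually not pay for its own description.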
Hierarchical Conceptual Clustering • Use iterative process on input graph G • Repeat • Find best pattern S in graph G • Add S to hierarchy • G = G compressed with S • Until no more compression • Clustering is a lattice • Clusters described by pattern • Not just instances as in traditional clustering techniques
DHS Insight Project: Terrorist Group Data • [Figure: processing pipeline; labels: SRA TEES Text Extraction System; Mock Terrorist Scenario Event Generator; Observables; Message Traffic Reports (142); Fund raising, Recruitment, Training, Reconnaissance, ...; Convert to Graph; Entities and Relationships; SUBDUE Pattern Learner; Patterns] • [Figure: hierarchical pattern discovered at 7th iteration of SUBDUE]
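Supervised Learning • Given positive graph G+ and negative graph G- • Find pattern S minimizing DL(G+ | S) / DL(G- | S) • When |G+|,|G-| >> 1, find pattern S maximizing classification accuracy, i.e., the fraction of graphs classified correctly (positive graphs containing S and negative graphs not containing S) • [Figure: positive graphs and negative graphs input to SUBDUE, which outputs pattern(s)]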
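As an illustration of choosing a pattern by classification accuracy, the sketch below scores candidate patterns against sets of positive and negative graphs. To stay short it makes a strong simplifying assumption: each graph is a set of labeled edge triples and "contains" means set inclusion, which stands in for a real subgraph isomorphism test. The labels and examples are invented.

```python
def accuracy(pattern, positives, negatives):
    """Fraction of graphs classified correctly when 'contains pattern' means positive."""
    true_pos = sum(pattern <= g for g in positives)
    true_neg = sum(not (pattern <= g) for g in negatives)
    return (true_pos + true_neg) / (len(positives) + len(negatives))


# Toy data (made up): graphs as sets of (source, relation, target) triples.
transfers = ("person", "transfers", "resource")
communicates = ("person", "communicates", "person")
targets = ("person", "targets", "site")

positives = [frozenset({transfers, communicates}), frozenset({transfers, targets})]
negatives = [frozenset({communicates}), frozenset({communicates, targets})]

candidates = [frozenset({transfers}), frozenset({communicates}), frozenset({targets})]
best = max(candidates, key=lambda p: accuracy(p, positives, negatives))
```

Here the `transfers` edge appears in every positive graph and in no negative graph, so it is selected as the discriminating pattern.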
DARPA/AFRL Evidence Assessment, Grouping, Linking and Evaluation (EAGLE) Program • Evidence DB (EDB) contains simulated data on threat and non-threat activity • Persons, targets, capabilities, resources, transfers, and communications • Convert EDB to SUBDUE graph format • Positive (threat) and negative (non-threat) examples • [Figure: EDB split into threat and non-threat examples, input to SUBDUE; resulting patterns are evaluated]
Graph Regression (with Nikhil Ketkar, WSU) • Learn a model Yi = f(Gi ), where Yi is a real number and Gi is a graph • E.g., solubility or binding activity of chemical compounds • One approach • Apply frequent-graph miner to set of training graphs Gi • Frequent subgraphs form a feature vector V • Input {(Yi, Vi)} to linear support-vector machine • gRegress approach • Prune feature set based on correlation with other features and lack of correlation with Y • Learn model using non-linear SVM or piece-wise regression
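The feature-pruning step described for gRegress can be sketched as below. This is an illustration of the stated idea (drop features that correlate weakly with Y or strongly with an already kept feature), not the published algorithm; the thresholds, function names, and toy feature columns are assumptions.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either column is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def prune_features(columns, y, min_target_corr=0.3, max_feature_corr=0.9):
    """Keep features correlated with y and not redundant with kept features."""
    ranked = sorted(columns, key=lambda name: -abs(pearson(columns[name], y)))
    kept = []
    for name in ranked:
        if abs(pearson(columns[name], y)) < min_target_corr:
            continue  # lacks correlation with the target
        if any(abs(pearson(columns[name], columns[k])) > max_feature_corr for k in kept):
            continue  # redundant with a feature already kept
        kept.append(name)
    return kept


# Toy data (made up): 1 means the frequent subgraph occurs in training graph Gi.
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
columns = {
    "ring": [0, 0, 0, 1, 1, 1],
    "ring_copy": [0, 0, 0, 1, 1, 1],
    "branch": [0, 1, 0, 0, 1, 1],
    "chain": [1, 0, 0, 1, 0, 1],
}
selected = prune_features(columns, y)
```

The surviving columns would then be passed to the regression learner in place of the full feature vector V.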
Anomaly Detection (with Bill Eberle, TTU) • Graph-Based Anomaly Detection (GBAD) • Learn normative patterns of activity • Detect small, unlikely deviations from normative patterns • Present anomalies and their context to analyst • [Figure: activity data → convert to graph → SUBDUE → normative pattern → GBAD → anomaly]
GBAD Approach • Determine normative pattern S using SUBDUE minimum description length (MDL) heuristic that minimizes: M(S,G) = DL(G|S) + DL(S) • Three algorithms for handling each of the different anomaly categories • GBAD-MDL finds anomalous modifications • GBAD-P (Probability) finds anomalous insertions • GBAD-MPS (Maximum Partial Substructure) finds anomalous deletions
DHS Insight Project: Cargo Data • Shipment data from PIERS (Port Import Export Reporting Service) • Only North American imports (U.S., Puerto Rico, Canada) • 65,535 records (shipments) • Information categories: • General • Commodity codes • Countries and ports • U.S. company names and locations • Foreign shipper names and locations • Notification party names and locations • Shipping line, vessel and packaging • Container • Weight and shipment • Financial
Anomaly Detection in Cargo Data • Marijuana seized at a port in Florida [U.S. Customs Service 2000]. • Smuggler did not disclose some financial information, and the ship traversed an extra port. • GBAD-P discovers the extra traversed port; GBAD-MPS discovers the missing financial information.
DHS CyberSecurity R&D Program: Insider Threat Detection using Graphs • Gov’t ID Request Processing • Insider Threat Scenarios (CERT Insider Threat Documents): 1. Frontline staff reviews case (invasion of privacy). 2. Frontline staff submits case directly to a case officer (bypassing the approval officer). 3. Frontline staff recommends or decides case. 4. Approval officer reverses accept/reject recommendation from assigned case officer. 5. Unassigned case officer updates or recommends case. 6. Applicant communicates with approval officer or case officer. 7. Unassigned case officer communicates with applicant. 8. Database access from an external source or after hours. • [Figures: GBAD on Scenario 1; GBAD on Scenario 4] • 1000 cases • Multiple normative patterns • 1-3 anomalies • No false positives
Dynamic Graph Pattern Learning (with Chang hun You, WSU) • Dynamic graph DG = {G1, G2, …, Gn} • Find graph rewrite rules between pairs of graphs Gi / Gi+1 • Find common subgraph between Gi and Gi+1 • Remainder of Gi to be removed (GR) • Remainder of Gi+1 to be added (GA) • Find transformation rules of temporal patterns in rewrite rules • Remove (GR) at time t, then add (GA) at time t+k
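The rewrite-rule step can be sketched under one simplifying assumption: vertices carry unique names, so the common subgraph of Gi and Gi+1 is just the intersection of their edge sets. Without unique names this step would need a maximum-common-subgraph search. The function names and the three-step toy sequence are invented for illustration.

```python
def rewrite_rule(g_before, g_after):
    """Return (common, GR, GA) for two graphs given as sets of (u, v) edges."""
    common = g_before & g_after
    removed = g_before - common  # GR: remainder of Gi, to be removed
    added = g_after - common     # GA: remainder of Gi+1, to be added
    return common, removed, added


def rewrite_rules(dynamic_graph):
    """One (GR, GA) pair per consecutive pair Gi / Gi+1."""
    return [rewrite_rule(a, b)[1:] for a, b in zip(dynamic_graph, dynamic_graph[1:])]


# Toy dynamic graph (made up): edge b-c disappears and later reappears.
dg = [
    {("a", "b"), ("b", "c")},
    {("a", "b"), ("c", "d")},
    {("a", "b"), ("b", "c")},
]
rules = rewrite_rules(dg)
```

A temporal transformation rule would then be read off repeated structure in `rules`, for example the edge `("b", "c")` being removed at one step and added back at the next.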
Example: Circadian Rhythm in Drosophila (Fruit Fly) • Transformation rule (Sub 1): Structure appearing and disappearing in network. • Full temporal transformation rule: Boxes are removals (after 5 hours), and ellipses are additions (after 7 hours) of Sub 1. • Cycles every 12 hours. • Time 6-47 is training; time 54-66 is prediction.
Graph-based Pattern Learning • Algorithms • Pattern discovery and clustering • Supervised learning • Anomaly detection • Dynamic graphs • Applications • Social networks • Biological networks • Computer networks • Process flows • (Semantic) Web • … linkeddata.org
High Performance Computing Issues • Memory bottleneck • Most real-world graphs do not fit in main memory • Patterns of access to graph not sequential • Computational bottleneck • Graph and subgraph isomorphism
High Performance Computing Issues • Functional parallelism • Parallel search over space of candidate subgraph patterns • High communication to avoid redundancy • Child patterns rely on embeddings kept with parent • Hinders parallelism • Computing embeddings from scratch is NP-complete • Data parallelism • Partition graphs, find patterns in each partition, evaluate patterns in other partitions • Edge cuts may break patterns • May require NP-complete subgraph isomorphism
Data-Intensive Scalable Computing • MapReduce [Google] • Dean & Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” OSDI 2004. • Hadoop [Yahoo] • MapReduce • Distributed filesystem • [Figure: Map and Reduce stages]
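To show the shape of a MapReduce job on graph data, here is a single-process sketch that computes vertex degrees from an edge list. It only imitates the programming model (map, group by key, reduce) and does not use Hadoop or any distributed runtime; the function names and the edge list are assumptions.

```python
from collections import defaultdict


def map_edge(edge):
    """Map: emit (vertex, 1) for both endpoints of an edge."""
    u, v = edge
    yield u, 1
    yield v, 1


def reduce_degree(vertex, counts):
    """Reduce: sum the counts emitted for one vertex."""
    return vertex, sum(counts)


def map_reduce(records, mapper, reducer):
    """Run map, group values by key (the shuffle), then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())


# Toy edge list (made up).
edge_list = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
degrees = map_reduce(edge_list, map_edge, reduce_degree)
```

Degree counting fits the model well because each edge can be mapped independently; pattern discovery is harder to express this way because a pattern instance can span many edges.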
Multiscale Issues • Hierarchical networks • Higher-level hyper-nodes summarize detail at lower levels • E.g., Netflix prize (www.netflixprize.com) • 17K movies, 400K users, 100M reviews • E.g., user’s average rating vs. specific ratings • E.g., movie’s average rating vs. specific rating • [Figure: user and movie nodes linked by a review with rating 5; avg. rating attributes on user and movie (3.5, 4.5); movie title “Matrix”; (reviews…)]
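One way to picture the higher-level summary is to collapse individual review edges into per-user and per-movie averages. The sketch below does that for a handful of invented reviews; the tuple layout, the function name, and the data are assumptions and not the Netflix prize format.

```python
from collections import defaultdict


def summarize(reviews):
    """Collapse (user, movie, rating) review edges into average-rating summaries."""
    by_user, by_movie = defaultdict(list), defaultdict(list)
    for user, movie, rating in reviews:
        by_user[user].append(rating)
        by_movie[movie].append(rating)

    def mean(values):
        return sum(values) / len(values)

    user_avg = {user: mean(r) for user, r in by_user.items()}
    movie_avg = {movie: mean(r) for movie, r in by_movie.items()}
    return user_avg, movie_avg


# Toy reviews (made up).
reviews = [("u1", "Matrix", 5), ("u1", "m2", 2), ("u2", "Matrix", 4)]
user_avg, movie_avg = summarize(reviews)
```

A learner working at the coarse scale sees only `user_avg` and `movie_avg`; the specific rating of 5 that `u1` gave "Matrix" is detail that remains available at the lower level.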
Conclusions • Graph representation of relational data • Graph-based pattern learning improves understanding of modeled behavior • Massive, dynamic graphs • Numerous application domains • Graph problems computationally and memory intensive • HPC (data-intensive computing) and multiscale approaches
For More Information • Larry Holder, School of EECS, WSU • Email: holder@wsu.edu • URL: www.eecs.wsu.edu/~holder • SUBDUE • Source code in C • Datasets • www.subdue.org • D. Cook and L. Holder (2006). Mining Graph Data, Wiley. (www.eecs.wsu.edu/mgd)