Efficient Mining of Graph-Based Data
Jesus Gonzalez, Istvan Jonyer, Larry Holder and Diane Cook
University of Texas at Arlington
Department of Computer Science and Engineering
http://cygnus.uta.edu/subdue
SRL Workshop

Motivation
• Structural/relational data
• Ease of graph representation

Graph-Based Discovery
[Figure: three panels labeled "Input Database", "Substructure S1 (graph form)", and "Compressed Database". S1 is drawn as two object vertices, one with shape triangle and one with shape square, joined by an on edge.]

Algorithm
• Create a substructure for each unique vertex label
Substructures: triangle (4), square (4), circle (1), rectangle (1)
[Figure: example input graph of triangle, square, circle, and rectangle vertices connected by on edges.]

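The initialization step above can be sketched in a few lines. This is only an illustration, not Subdue's code: the vertex identifiers and the dictionary encoding are assumptions, chosen so that the label counts agree with the slide.

```python
from collections import Counter

# Hypothetical encoding of the example graph: vertex id -> vertex label.
# Only the label counts (4, 4, 1, 1) come from the slide; the ids are made up.
VERTEX_LABELS = {
    "T1": "triangle", "T2": "triangle", "T3": "triangle", "T4": "triangle",
    "S1": "square", "S2": "square", "S3": "square", "S4": "square",
    "C1": "circle",
    "R1": "rectangle",
}


def initial_substructures(vertex_labels):
    """Return one (label, instance_count) pair per unique vertex label."""
    counts = Counter(vertex_labels.values())
    # most_common() lists the most frequent labels first.
    return counts.most_common()


for label, count in initial_substructures(VERTEX_LABELS):
    print(f"{label} ({count})")
```

Each pair stands for a single-vertex substructure together with its number of instances, which is the starting point for the expansion step.
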
Algorithm
• Expand best substructure by an edge or edge+neighboring vertex
Substructures:
[Figure: candidate substructures obtained by extending a vertex with an on edge to a neighboring triangle, square, circle, or rectangle.]

Algorithm
• Keep only best beam-width substructures on queue
• Terminate when queue is empty or #discovered substructures >= limit
• Compress graph and repeat to generate hierarchical description
Note: polynomially constrained

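The control loop described above (a bounded queue, plus the two termination conditions) can be sketched generically. This is a sketch under assumptions, not Subdue's implementation: `beam_discover`, `expand`, and `score` are hypothetical names, and the toy usage treats substrings of a text as stand-ins for substructures of a chain-shaped graph.

```python
def beam_discover(initial, expand, score, beam_width, limit):
    """Beam search over candidates; all callables are supplied by the caller."""
    # Keep only the best beam_width candidates on the queue.
    queue = sorted(initial, key=score, reverse=True)[:beam_width]
    discovered = []
    # Terminate when the queue is empty or enough candidates were discovered.
    while queue and len(discovered) < limit:
        candidates = []
        for sub in queue:
            discovered.append(sub)
            if len(discovered) >= limit:
                break
            candidates.extend(expand(sub))
        # Drop duplicates while preserving order, then prune to the beam width.
        unique = list(dict.fromkeys(candidates))
        queue = sorted(unique, key=score, reverse=True)[:beam_width]
    return sorted(discovered, key=score, reverse=True)


# Toy stand-in for a graph: a text whose substrings play the role of substructures.
TEXT = "abcabcabd"


def expand(pattern):
    """Extend each occurrence of the pattern by one character to the right."""
    size = len(pattern)
    return sorted({TEXT[i:i + size + 1]
                   for i in range(len(TEXT) - size)
                   if TEXT.startswith(pattern, i)})


def score(pattern):
    """Crude compression proxy: non-overlapping occurrences times pattern length."""
    return TEXT.count(pattern) * len(pattern)


best = beam_discover(sorted(set(TEXT)), expand, score, beam_width=2, limit=6)
print(best)
```

Because each round keeps at most `beam_width` candidates and the total is capped by `limit`, the number of candidates examined stays bounded, which is the sense in which the search is constrained.
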
Evaluation Metric
• Substructures evaluated based on ability to compress input graph
• Compression measured using minimum description length (DL)
• Best substructure S in graph G minimizes: DL(S) + DL(G|S)

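To make the objective DL(S) + DL(G|S) concrete, here is a minimal sketch that replaces the bit-level description length with a much cruder proxy: the number of vertices plus the number of edges. That proxy, the assumption of non-overlapping instances, and the edge count of 9 for the example graph are all assumptions made for illustration.

```python
def size(num_vertices, num_edges):
    """Crude stand-in for description length: vertices plus edges."""
    return num_vertices + num_edges


def total_description(g_vertices, g_edges, s_vertices, s_edges, instances):
    """Approximate DL(S) + DL(G|S) for a substructure with non-overlapping instances.

    Compressing G with S replaces every instance by a single vertex, so each
    instance removes (s_vertices - 1) vertices and s_edges edges from G.
    """
    dl_s = size(s_vertices, s_edges)
    dl_g_given_s = size(g_vertices - instances * (s_vertices - 1),
                        g_edges - instances * s_edges)
    return dl_s + dl_g_given_s


# Example graph: 10 vertices (from the label counts), 9 edges (assumed).
uncompressed = size(10, 9)
# Candidate 1: "triangle on square", 2 vertices, 1 edge, 4 instances.
pair = total_description(10, 9, s_vertices=2, s_edges=1, instances=4)
# Candidate 2: a lone "triangle" vertex, 4 instances.
single = total_description(10, 9, s_vertices=1, s_edges=0, instances=4)
print(uncompressed, pair, single)
```

Under this proxy the two-vertex substructure lowers the total below the uncompressed size, while the single-vertex substructure does not, since replacing a vertex by a vertex saves nothing and the definition of S still has to be stored.
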
Examples

Inexact Graph Match
• Some variations may occur between instances
• Want to abstract over minor differences
• Difference = cost of transforming one graph into a graph isomorphic to the other
• Match if cost/size < threshold

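The threshold test above can be illustrated with a brute-force sketch for very small graphs. It is not Subdue's matcher: unit costs for every edit, the exhaustive search over vertex mappings, and the choice of "size" as the vertex-plus-edge count of the larger graph are all assumptions.

```python
from itertools import permutations


def match_cost(labels1, edges1, labels2, edges2):
    """Minimum unit-cost edit distance between two tiny labeled directed graphs.

    labels*: list of vertex labels (vertex i has label labels[i])
    edges*:  set of (source, target, edge_label) triples
    """
    n = max(len(labels1), len(labels2))
    # Pad with None so a vertex can also map to "no vertex" (insert or delete).
    pad1 = labels1 + [None] * (n - len(labels1))
    pad2 = labels2 + [None] * (n - len(labels2))
    best = None
    for perm in permutations(range(n)):
        # perm[i] is the vertex of graph 2 that vertex i of graph 1 maps to.
        cost = sum(1 for i in range(n) if pad1[i] != pad2[perm[i]])
        mapped = {(perm[u], perm[v], label) for (u, v, label) in edges1}
        # Edges present on only one side must be deleted or inserted.
        # A relabeled edge therefore costs 2 here (one delete, one insert).
        cost += len(mapped ^ edges2)
        if best is None or cost < best:
            best = cost
    return best


def is_match(g1, g2, threshold):
    """Apply the cost/size < threshold test; size is that of the larger graph."""
    (labels1, edges1), (labels2, edges2) = g1, g2
    cost = match_cost(labels1, edges1, labels2, edges2)
    graph_size = max(len(labels1) + len(edges1), len(labels2) + len(edges2))
    return cost / graph_size < threshold


tri_on_square = (["triangle", "square"], {(0, 1, "on")})
tri_on_circle = (["triangle", "circle"], {(0, 1, "on")})
print(match_cost(*tri_on_square, *tri_on_circle))
print(is_match(tri_on_square, tri_on_circle, 0.5))
```

The exhaustive search is factorial in the number of vertices, so it only serves to show what the cost and the threshold mean on toy inputs.
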
Parallel/Distributed Discovery
• Divide graph into P partitions using Metis, distribute to P processors
• Each processor performs serial Subdue on local partition
• Broadcast best substructures, evaluate on other processors
• Master processor stores best global substructures
• Close to linear speedup

Graph-Based Concept Learning
• One graph stores positive examples
• One graph stores negative examples
• Find substructure that compresses positive graph but not negative graph
• Error counted as (PosEgsNotCovered) + (NegEgsCovered)
• Multiple iterations implement a set-covering approach

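The error count and the set-covering loop can be sketched as follows. This is a simplification under stated assumptions: examples and candidate patterns are plain sets of features rather than graphs, the subset test stands in for a subgraph match, and the function names are hypothetical.

```python
def error(pattern, positives, negatives):
    """(PosEgsNotCovered) + (NegEgsCovered) for one candidate pattern."""
    pos_not_covered = sum(1 for example in positives if not pattern <= example)
    neg_covered = sum(1 for example in negatives if pattern <= example)
    return pos_not_covered + neg_covered


def set_cover(candidates, positives, negatives):
    """Repeatedly pick the lowest-error pattern and drop the positives it covers."""
    remaining = list(positives)
    chosen = []
    while remaining:
        best = min(candidates, key=lambda p: error(p, remaining, negatives))
        if not any(best <= example for example in remaining):
            break  # no candidate covers anything that is left
        chosen.append(best)
        remaining = [example for example in remaining if not best <= example]
    return chosen, remaining


# Made-up data: each example is a set of feature names.
positives = [frozenset({"a", "b"}), frozenset({"a", "c"}), frozenset({"d"})]
negatives = [frozenset({"b", "c"})]
candidates = [frozenset({"a"}), frozenset({"b"}), frozenset({"d"})]

rules, uncovered = set_cover(candidates, positives, negatives)
print(rules, uncovered)
```

Each iteration corresponds to one pass of discovery on the positive examples that are still uncovered, so the result reads as a disjunction of patterns.
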
Concept-Learning Example
[Figure: example graph with object vertices, shape edges to triangle and square vertices, and on edges between objects.]

Concept-Learning Results
• Chess endgames (19,257 examples)
• Black King is (+) or is not (-) in check
• 99.8% FOIL, 99.21% Subdue

More Concept-Learning Results
• Tic-Tac-Toe endgames
  • + is win for X (958 examples)
  • 100% Subdue, 92.35% FOIL
• Bach chorales
  • Musical sequences (20 sequences)
  • 100% Subdue, 85.71% FOIL

Graph-Based Clustering
• Iterate Subdue until the graph is compressed to a single vertex
• Each cluster (substructure) inserted into a classification lattice
[Figure: classification lattice with its top node labeled Root.]

Clustering Example: Animals

Name       Body Cover      Heart Chamber   Body Temp.   Fertilization
mammal     hair            four            regulated    internal
bird       feathers        four            regulated    internal
reptile    cornified-skin  imperfect-four  unregulated  internal
amphibian  moist-skin      three           unregulated  external
fish       scales          two             unregulated  external

[Figure: graph representation of the mammal row. An animal vertex has edges Name, BodyCover, HeartChamber, BodyTemp, and Fertilization leading to mammal, hair, four, regulated, and internal.]

Graph-Based Clustering Results
[Figure: Subdue classification lattice for the animals data]
Animals
  HeartChamber: four, BodyTemp: regulated, Fertilization: internal
    Name: mammal, BodyCover: hair
    Name: bird, BodyCover: feathers
  BodyTemp: unregulated
    Name: reptile, BodyCover: cornified-skin, HeartChamber: imperfect-four, Fertilization: internal
    Fertilization: external
      Name: amphibian, BodyCover: moist-skin, HeartChamber: three
      Name: fish, BodyCover: scales, HeartChamber: two

Cobweb Results
[Figure: Cobweb hierarchy. animals splits into amphibian/fish, mammal/bird, and reptile; amphibian/fish splits into fish and amphibian; mammal/bird splits into mammal and bird.]
• Comparison of Subdue and Cobweb results
• Subdue lattice produced better generalization, resulting in fewer clusters at higher levels
• Subdue lattice identifies overlap between (reptile) and (amphibian/fish)

Clustering Example: DNA

Graph-Based Clustering Results
[Figure: classification lattice rooted at DNA, with chemical substructures of the molecule as clusters.]
Coverage
• 61%
• 68%
• 71%

Evaluation of Clusterings
• Traditional evaluation:
  • Not applicable to hierarchical domains
  • Does not make sense to compare clusters in different subtrees
  • Not applicable to relational clusterings

Properties of Good Clusterings
• Small number of clusters
  • Large coverage → good generality
• Big cluster descriptions
  • More features → more inferential power
• Minimal or no overlap between clusters
  • More distinct clusters → better-defined concepts

New Evaluation Heuristic for Hierarchical Clusterings
• Clustering rooted at C with c children H_i having |H_i| instances H_{i,k}
• distance() measured by inexact graph match
• Animals: SubdueCQ=2.6, CobwebCQ=1.7

Graph-Based Data Mining: Application Domains
• Biochemical domains
  • Protein data
  • DNA data
  • Toxicology (cancer) data
• Spatial-temporal domains
  • Earthquake data
  • Aircraft Safety and Reporting System
• Telecommunications data
• Program source code
• Web topology
[Figure: Web topology graph with a home vertex and web_page vertices connected by hyperlink edges.]

Theoretical Analysis
• Galois lattice [Liquiere et al.]
• Conceptual graphs [Sowa et al.]
• PAC analysis [Jappy et al.]

Graph-Based Data Mining
• Pattern (substructure) discovery
• Hierarchical discovery
• Distributed discovery
• Concept learning
• Clustering
• Compression heuristic based on minimum description length

Future Work
• Concept learning
  • Theoretical analysis
  • Comparison to ILP systems
• Clustering
  • Classification lattice
  • Hierarchical relational conceptual clustering evaluation metric
• Probabilistic substructures
• Domains: WWW, source code

Subdue Source Code and Data
http://cygnus.uta.edu/subdue