GRAPH-BASED HIERARCHICAL CONCEPTUAL CLUSTERING by Istvan Jonyer, Lawrence B. Holder and Diane J. Cook, The University of Texas at Arlington
Outline • What is hierarchical conceptual clustering? • Overview of Subdue • Conceptual clustering in Subdue • Evaluation of hierarchical clusterings • Experiments and results • Conclusions
What is hierarchical conceptual clustering? • Unsupervised concept learning • Generating hierarchies to explain data • Applications • Hypothesis generation and testing • Prediction based on groups • Finding taxonomies
Example hierarchical conceptual clustering
Animals
  • HeartChamber: four, BodyTemp: regulated, Fertilization: internal
      • Name: mammal, BodyCover: hair
      • Name: bird, BodyCover: feathers
  • BodyTemp: unregulated
      • Name: reptile, BodyCover: cornified-skin, HeartChamber: imperfect-four, Fertilization: internal
      • Fertilization: external
          • Name: amphibian, BodyCover: moist-skin, HeartChamber: three
          • Name: fish, BodyCover: scales, HeartChamber: two
The Problem • Hierarchical conceptual clustering in discrete-valued structural databases • Existing systems: • Continuous-valued • Discrete but unstructured • We can do better! (The field is underexplored)
Related Work • Cobweb • Labyrinth • AutoClass • Snob • In Euclidean space: Chameleon, Cure • Unsupervised learning algorithms
The Solution • Take Subdue and extend it!
Overview of Subdue • Data mining in graph representations of structural databases (figure: example input graph with labels A to F and a to g)
Overview of Subdue • Iteratively searching for the best substructure using the minimum description length (MDL) heuristic (figure: a candidate substructure with labels A, B, C, D and a, b, c)
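The idea of ranking substructures by how well they compress the graph can be sketched with a much simpler measure than MDL. The snippet below is only an illustration: it counts vertices and edges instead of bits, assumes instances do not overlap, and uses made-up graph sizes. The function names and numbers are hypothetical and are not taken from Subdue.

```python
def graph_size(num_vertices, num_edges):
    """Crude size measure: vertices plus edges (a stand-in for bits)."""
    return num_vertices + num_edges


def compressed_size(total_v, total_e, sub_v, sub_e, instances):
    """Size of the substructure plus size of the graph compressed with it.

    Assumes non-overlapping instances; each instance collapses into one
    vertex and loses its internal edges.
    """
    sub = graph_size(sub_v, sub_e)
    remaining_v = total_v - instances * sub_v + instances
    remaining_e = total_e - instances * sub_e
    return sub + graph_size(remaining_v, remaining_e)


# Hypothetical candidates for a graph with 30 vertices and 30 edges:
# (vertices per instance, edges per instance, number of instances)
candidates = {
    "small, frequent": (3, 3, 6),
    "large, rare": (6, 7, 2),
}
scores = {name: compressed_size(30, 30, *c) for name, c in candidates.items()}
best = min(scores, key=scores.get)  # lower total size means better compression
print(scores, best)
```

In this toy case the small, frequent candidate wins because it removes more of the graph overall, even though each of its instances is smaller.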
Overview of Subdue • Compress using best substructure (figure: the graph after each instance of the best substructure has been replaced by a single vertex S)
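A minimal sketch of the compression step, assuming a graph stored as a dictionary of vertex labels plus a list of labeled edges, and assuming the instances of the substructure are already known and do not overlap. The representation, the function name `compress`, and the example graph are assumptions made for illustration; they are not Subdue's internal data structures.

```python
def compress(vertices, edges, instances, new_label="S"):
    """Replace each instance (a set of vertex ids) with one new vertex.

    vertices: dict mapping vertex id -> label
    edges: list of (source id, target id, label)
    instances: list of disjoint sets of vertex ids
    """
    vertices = dict(vertices)  # work on a copy
    owner = {}  # original vertex id -> id of the vertex that replaces it
    for n, instance in enumerate(instances):
        new_id = f"{new_label}{n}"
        for v in instance:
            owner[v] = new_id
            del vertices[v]
        vertices[new_id] = new_label

    new_edges = []
    for u, v, label in edges:
        cu, cv = owner.get(u, u), owner.get(v, v)
        if cu == cv and u in owner:
            continue  # edge was internal to one instance, so it disappears
        new_edges.append((cu, cv, label))  # other edges are redirected
    return vertices, new_edges


# Hypothetical graph: two A-B-C chains, both attached to a vertex E.
vertices = {1: "A", 2: "B", 3: "C", 4: "A", 5: "B", 6: "C", 7: "E"}
edges = [(1, 2, "a"), (2, 3, "b"), (4, 5, "a"), (5, 6, "b"),
         (7, 1, "e"), (7, 4, "e")]
small_v, small_e = compress(vertices, edges, [{1, 2, 3}, {4, 5, 6}])
print(small_v, small_e)
```

After compression only E and two S vertices remain, which is what lets a later iteration describe new substructures in terms of S.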
Overview of Subdue • Fuzzy match • Inexact matching of subgraphs • Applications: • Defining fuzzy concepts • Evaluation of clusterings
Conceptual Clustering with Subdue • Use Subdue to identify clusters • The best subgraph in an iteration defines a cluster • When to stop within an iteration? • Use -limit option • Use -size option • Use first minimum heuristic (new)
The First Minimum Heuristic • Use subgraph at first local minimum • Detect it using -prune2 option
The First Minimum Heuristic • Not a greedy heuristic! • Although first local minimum is usually the global minimum • First local minimum is caused by a smaller, more frequently occurring subgraph • Subsequent minima are caused by bigger, less frequently occurring subgraphs => First subgraph is more general
The First Minimum Heuristic • A multi-minimum search space (figure)
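Detecting the first local minimum can be sketched as a scan over the values of the best subgraph found at each step of the search. The sequence below is invented for illustration, and the function is a generic local-minimum test, not the logic behind Subdue's pruning option.

```python
def first_local_minimum(values):
    """Return the index of the first value that is followed by a larger one.

    If the sequence never rises, the last index is returned.
    """
    for i in range(len(values) - 1):
        if values[i] < values[i + 1]:
            return i
    return len(values) - 1


# Hypothetical description lengths over successive search steps.
# The first local minimum (85) is not the global minimum (80) here.
description_lengths = [100, 90, 85, 88, 80, 83]
stop_at = first_local_minimum(description_lengths)
print(stop_at, description_lengths[stop_at])
```

The example deliberately shows a case where the first and the global minimum differ; the heuristic still stops at the first one because that subgraph is the more general.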
Lattice vs. Tree • Previous work defined classification trees • Inadequate in structured domains • Better hierarchical description: classification lattice • A cluster can have more than one parent • A parent can be at any level (not only one level above)
Hierarchical Clustering in Subdue • Subdue can compress by a subgraph after each iteration • Subsequent clusters may be defined in terms of previously defined clusters • This results in a hierarchy
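The difference between a classification tree and a classification lattice can be made concrete with a small data structure in which a cluster keeps a list of parents instead of a single one. The class, the helper `level`, and the cluster names S1 to S4 are hypothetical and serve only to illustrate the idea.

```python
class Cluster:
    """A lattice node: unlike a tree node, it may have several parents."""

    def __init__(self, name, parents=()):
        self.name = name
        self.parents = list(parents)
        self.children = []
        for parent in self.parents:
            parent.children.append(self)


def level(cluster):
    """Length of the longest path from the root to this cluster."""
    if not cluster.parents:
        return 0
    return 1 + max(level(p) for p in cluster.parents)


root = Cluster("Root")
s1 = Cluster("S1", [root])
s2 = Cluster("S2", [root])
s3 = Cluster("S3", [s1, s2])   # defined in terms of two earlier clusters
s4 = Cluster("S4", [s1, s3])   # parents sit at different levels (1 and 2)

print([p.name for p in s4.parents], level(s4))
```

S4 shows both lattice properties from the slide: it has more than one parent, and one of those parents is two levels above it.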
Hierarchical Conceptual Clustering of an Artificial Domain (figure: classification lattice starting at a Root node)
Evaluation of Clusterings • Traditional evaluation: • Not applicable to hierarchical domains • No known evaluation for hierarchical clusterings • Most hierarchical evaluations are anecdotal
New Evaluation Heuristic for Hierarchical Clusterings • Properties of a good clustering: • Small number of clusters • Large coverage, hence good generality • Big cluster descriptions • More features, hence more inferential power • Minimal or no overlap between clusters • More distinct clusters, hence better-defined concepts
New Evaluation Heuristic for Hierarchical Clusterings • Big clusters: bigger distance between disjoint clusters • Overlap: less overlap means bigger distance • Few clusters: averaging comparisons
Experiments and Results • Validation in an artificial domain • Validation in unstructured domains • Comparison to existing systems • Real world applications
The Animal Domain
Name       Body Cover      Heart Chamber   Body Temp.   Fertilization
mammal     hair            four            regulated    internal
bird       feathers        four            regulated    internal
reptile    cornified-skin  imperfect-four  unregulated  internal
amphibian  moist-skin      three           unregulated  external
fish       scales          two             unregulated  external
Graph representation of one example (figure): a central "animal" vertex with edges labeled Name, BodyCover, HeartChamber, BodyTemp and Fertilization leading to the vertices mammal, hair, four, regulated and internal.
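The mapping from a table row to a graph can be sketched as follows: each row becomes a star with a central "animal" vertex and one labeled edge per attribute. The vertex and edge lists used here are an assumed in-memory representation, not Subdue's input file format, and `to_star_graph` is a hypothetical helper.

```python
ATTRIBUTES = ["Name", "BodyCover", "HeartChamber", "BodyTemp", "Fertilization"]

ROWS = [
    ["mammal", "hair", "four", "regulated", "internal"],
    ["bird", "feathers", "four", "regulated", "internal"],
    ["reptile", "cornified-skin", "imperfect-four", "unregulated", "internal"],
    ["amphibian", "moist-skin", "three", "unregulated", "external"],
    ["fish", "scales", "two", "unregulated", "external"],
]


def to_star_graph(row, center_label="animal"):
    """Turn one table row into a star graph.

    Returns (vertices, edges): vertices maps id -> label, and each edge is
    (center id, value vertex id, attribute name).
    """
    vertices = {0: center_label}
    edges = []
    for i, (attribute, value) in enumerate(zip(ATTRIBUTES, row), start=1):
        vertices[i] = value
        edges.append((0, i, attribute))
    return vertices, edges


mammal_vertices, mammal_edges = to_star_graph(ROWS[0])
print(mammal_vertices)
print(mammal_edges)
```

Encoding attribute names as edge labels and values as vertex labels means that rows sharing attribute values, such as mammal and bird, share a common subgraph that a substructure search can find.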
Hierarchical Clustering of the Animal Domain
Animals
  • HeartChamber: four, BodyTemp: regulated, Fertilization: internal
      • Name: mammal, BodyCover: hair
      • Name: bird, BodyCover: feathers
  • BodyTemp: unregulated
      • Name: reptile, BodyCover: cornified-skin, HeartChamber: imperfect-four, Fertilization: internal
      • Fertilization: external
          • Name: amphibian, BodyCover: moist-skin, HeartChamber: three
          • Name: fish, BodyCover: scales, HeartChamber: two
Hierarchical Clustering of the Animal Domain by Cobweb
animals
  • amphibian/fish
      • fish
      • amphibian
  • mammal/bird
      • mammal
      • bird
  • reptile
Comparison of Subdue and Cobweb • Quality of Subdue's lattice (tree): 2.60 • Quality of Cobweb's tree: 1.74 • Subdue therefore scores higher under this measure • Reasons for the higher score: • Better generalization, resulting in fewer clusters • Eliminating overlap between (reptile) and (amphibian/fish)
Chemical Application: Clustering of a DNA sequence (figure: lattice rooted at DNA whose clusters are molecular fragments built from C, N, O and P atoms, such as O=P-OH) • Coverage • 61% • 68% • 71%
Conclusions • Goal of hierarchical conceptual clustering of structured databases was achieved • Synthesized classification lattice • Developed new evaluation heuristic for hierarchical clusterings • Good performance in comparison to other systems, even in unstructured domains
Future Work • More experiments on real-world domains • Comparison to other systems • Incorporation of evaluation tool into Subdue