Small World Clustering Algorithms. Brant Chee. Experiments. 3 clustering algorithms Complete Link (Cluto) K means (Cluto) Small World. Test Collections. Experimental Setup. Parameters left at package defaults Clustered with n = 50,100,150 and 200.
Small World Clustering Algorithms Brant Chee
Experiments • 3 clustering algorithms • Complete Link (CLUTO) • K-means (CLUTO) • Small World
Experimental Setup • Parameters left at package defaults • Clustered with n = 50, 100, 150 and 200 • Clusters with fewer than 4 or more than 50 elements were eliminated, and the clustering that resulted in fewer than 40 clusters was chosen for evaluation.
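The size filter and selection rule from this slide can be sketched as follows. This is a minimal illustration, not the code used in the experiments: the function names and the `runs` mapping (requested n to a list of clusters) are hypothetical, and because the slide does not say how to choose when several runs qualify, the sketch simply returns every qualifying run.

```python
MIN_SIZE = 4       # clusters smaller than this are dropped
MAX_SIZE = 50      # clusters larger than this are dropped
MAX_CLUSTERS = 40  # a run qualifies if fewer than this many clusters remain


def filter_clusters(clusters):
    """Keep only clusters with between MIN_SIZE and MAX_SIZE elements."""
    return [c for c in clusters if MIN_SIZE <= len(c) <= MAX_SIZE]


def qualifying_runs(runs):
    """Map each requested n to its filtered clusters, keeping only runs
    that end up with fewer than MAX_CLUSTERS clusters.

    `runs` is a hypothetical structure: {requested n: list of clusters}.
    """
    result = {}
    for n, clusters in runs.items():
        kept = filter_clusters(clusters)
        if len(kept) < MAX_CLUSTERS:
            result[n] = kept
    return result
```

Returning all qualifying runs keeps the sketch free of a tie-breaking rule that the slide does not specify.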
Qualitative Evaluation • 2 criteria: utility and coherence • 3-point scale: 1 good, 2 poor, 3 bad • Good: >60% of articles • Poor: 41-59% • Bad: <40% • Evaluate the terms in a cluster to get context.
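The 3-point scale can be written as a small grading function. This is only a sketch: the function name is hypothetical, and the slide does not say how exactly 60% or exactly 40% should be scored, so the code assumes those boundary values count as good and bad respectively.

```python
def grade(fraction_of_articles):
    """Map the fraction of articles (0.0 to 1.0) to the 3-point scale.

    Assumption: exactly 60% is scored as good and exactly 40% as bad,
    since the slide leaves those two boundary values unassigned.
    """
    if fraction_of_articles >= 0.60:
        return 1  # good
    if fraction_of_articles > 0.40:
        return 2  # poor
    return 3      # bad
```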
Other Approaches: Statistical Methods
Other Clustering Approaches • Can we choose other types of clustering algorithms that could give better-quality results or better cluster labels? • SOM (Self-Organizing Map) • Slow for high numbers of dimensions and large numbers of objects. • Carrot2 • Slow for large numbers of items. • Huge memory consumption.
Random Projection • Can we reduce the dimensionality of vectors (e.g., 50,000 → 1,000) while preserving distances? • Speed up similarity calculations • Various methods: • Random projection • “Latent semantic indexing” • Multidimensional scaling
Very Sparse Random Projections • Let A ∈ R^(n×D) be our n points in D dimensions • Multiply A by a random matrix R ∈ R^(D×k) • R has entries in {−1, 0, 1}, each drawn with a fixed probability • Cost: O(nDk + n²k)
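A pure-Python sketch of such a projection follows. The entry probabilities did not survive on the slide, so the code assumes the scheme commonly associated with very sparse random projections: each entry of R is √s times +1 with probability 1/(2s), 0 with probability 1 − 1/s, and −1 with probability 1/(2s), with B = A·R/√k as the projected data. The function names and the sparse column representation are illustrative choices, not the author's implementation.

```python
import math
import random


def sparse_random_matrix(D, k, s, rng):
    """Return R (D x k) as k sparse columns, each a dict {row: value}.

    Assumed entry distribution (not taken from the slide):
    sqrt(s) * (+1 w.p. 1/(2s), 0 w.p. 1 - 1/s, -1 w.p. 1/(2s)).
    """
    scale = math.sqrt(s)
    columns = []
    for _ in range(k):
        column = {}
        for row in range(D):
            u = rng.random()
            if u < 1 / (2 * s):
                column[row] = scale
            elif u < 1 / s:
                column[row] = -scale
        columns.append(column)
    return columns


def project(A, columns):
    """Compute B = A R / sqrt(k), touching only the non-zero entries of R."""
    norm = 1 / math.sqrt(len(columns))
    return [
        [norm * sum(point[row] * value for row, value in column.items())
         for column in columns]
        for point in A
    ]
```

With s = √D, only about a fraction 1/√D of R is non-zero, which is where the savings in the nDk projection term come from. The n²k term is the cost of computing all pairwise distances in the reduced k-dimensional space.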
Reducing Dimensionality • Bank dataset: 11,000 articles from 11 categories in Dmoz. • 11,000 articles reduced from 30K terms in 11 s with a 1 GB heap. • Purity increased and entropy decreased (both are measures of clustering quality).
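The slide does not define purity or entropy, so the sketch below assumes the commonly used definitions: purity is the fraction of items that belong to the majority class of their cluster, and entropy is the size-weighted average of per-cluster class entropy, normalized by the log of the number of classes. The input format (each cluster given as a list of true category labels) and the function names are hypothetical.

```python
import math
from collections import Counter


def purity(clusters):
    """Fraction of items in the majority class of their cluster.

    `clusters` is a list of non-empty lists of true class labels.
    """
    total = sum(len(c) for c in clusters)
    return sum(max(Counter(c).values()) for c in clusters) / total


def entropy(clusters, num_classes):
    """Size-weighted mean of per-cluster class entropy, scaled to [0, 1].

    Assumes num_classes > 1 and that every cluster is non-empty.
    """
    total = sum(len(c) for c in clusters)
    result = 0.0
    for c in clusters:
        probs = [count / len(c) for count in Counter(c).values()]
        h = -sum(p * math.log(p) for p in probs) / math.log(num_classes)
        result += len(c) / total * h
    return result
```

Higher purity and lower entropy both indicate clusters that line up more closely with the known categories.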
Mutual Information (MI) on Phrases • More context than single words • More meaningful term clusters
Other Approaches: Knowledge-Intensive Approaches
Hypernym • “Is-a” relationship • Shakespeare is an author. • A pug is a dog. • Implicitly hierarchical. • Basis of many ontologies and semantic networks. • WordNet • UMLS
Hypernym Relations • NP such as {NP,}* {(or|and)} NP • Vegetables such as Beets, Carrots and Peas. • such NP as {NP,}* {(or|and)} NP • … works by such authors as Herrick, Goldsmith and Shakespeare. • NP {, NP}* {,} (or|and) other NP • Bruises, …, broken bones or other injuries • NP {,} including {NP,}* {or|and} NP • All common-law countries, including Canada and England … • NP {,} especially {NP,}* {or|and} NP • … most European countries, especially France, England and Spain.
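The first pattern ("NP such as ...") can be approximated with a regular expression. This is a deliberately crude sketch: a real system would use a noun-phrase chunker, whereas here a noun phrase is assumed to be a single word, and the names `NP`, `SUCH_AS` and `extract_such_as` are illustrative.

```python
import re

# Crude stand-in for a noun phrase: a single word that is not "and"/"or".
NP = r"(?!(?:and|or)\b)[A-Za-z][A-Za-z-]*"

# hypernym, then "such as", then a comma-separated list that may end
# with "and NP" or "or NP" (with or without a comma before it).
SUCH_AS = re.compile(
    rf"(?P<hyper>{NP}) such as "
    rf"(?P<list>{NP}(?:, {NP})*(?:,? (?:and|or) {NP})?)"
)


def extract_such_as(sentence):
    """Return (hypernym, hyponym) pairs found by the 'such as' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(sentence):
        hyponyms = re.split(r",? (?:and|or) |, ", match.group("list"))
        pairs.extend((match.group("hyper"), h) for h in hyponyms if h)
    return pairs


pairs = extract_such_as("Vegetables such as Beets, Carrots and Peas.")
```

For the slide's example sentence, `pairs` holds one (hypernym, hyponym) tuple per listed vegetable. The other patterns could be handled the same way with their own expressions.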
Uses of Hypernym Trees • Search • Query Expansion • Faceted metadata • Clustering • Parent node defines a cluster • Keyword assignment
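The idea that a parent node defines a cluster can be sketched by grouping extracted (hypernym, hyponym) pairs by their hypernym. The function name and the input format are assumptions made for illustration.

```python
from collections import defaultdict


def clusters_from_pairs(pairs):
    """Group hyponyms under their hypernym; each hypernym labels a cluster."""
    clusters = defaultdict(list)
    for hypernym, hyponym in pairs:
        clusters[hypernym].append(hyponym)
    return dict(clusters)


example = clusters_from_pairs([
    ("organs", "liver"),
    ("organs", "kidney"),
    ("substances", "zinc"),
])
```

Here `example` maps "organs" to its two hyponyms and "substances" to one, so the hypernym doubles as a readable cluster label.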
Trivial Hypernyms • organic compounds → d-ribose • organic compounds → d-arabinose • organic compounds → l-arabinose • organic compounds → sucrose • substances → cortisone • substances → vitamins a and c • substances → zinc • organs → liver • organs → kidney • sugar-containing products → honey • sugar-containing products → jam • sugar-containing products → glucose • sugar-containing products → fruit juice concentrates • sugar-containing products → tomato • largely populated countries → china • largely populated countries → russia
Bad Hypernyms • suicidal patients → appears • other agents → plasmin • other agents → plasminogen • such common sensations → illness • phenomena → founder effects • phenomena → migration • phenomena → gene flow • clinical manifestations → 80 • chemical agents → homocystine • no other explanation → anencephaly • conditions → azure a-0.5 % nahco3 solution • conditions → ph 8.1 • fewer side-effects → vegetative disfunction • techniques → carpentier • techniques → 's ring
Good? Hypernyms • entirely synthetic steroids → norgestrel and quingestanol • menstrual disorders → metrorrhagia • menstrual disorders → oligoamenorrhea • menstrual disorders → amenorrhea • mild venous disorders → swollen veins • mild venous disorders → heavy limbs • mild venous disorders → varicosities • obstructive pulmonary lung diseases → alveolar proteinosis • obstructive pulmonary lung diseases → pneumonia • obstructive pulmonary lung diseases → asthma • obstructive pulmonary lung diseases → bronchiectasis • obstructive pulmonary lung diseases → cystic fibrosis • choline analogues → n,n'-dimethylethanolamine • choline analogues → n-monomethylethanolamine • choline analogues → ethanolamine • 3alpha-oh-containing steroids → androsterone