The 5th annual UK Workshop on Computational Intelligence London, 5-7 September 2005. Learning Topic Hierarchies from Text Documents using a Scalable Hierarchical Fuzzy Clustering Method. E. Mendes Rodrigues and L. Sacks {mmendes, lsacks}@ee.ucl.ac.uk http://www.ee.ucl.ac.uk/~mmendes/.
Department of Electronic & Electrical Engineering, University College London, UK
Outline • Document clustering process • H-FCM: Hyper-spherical Fuzzy C-Means • H2-FCM: Hierarchical H-FCM • Clustering experiments • Topic hierarchies
Document Clustering Process • Document Encoding: Document Collection, Pre-processing, Document Representation • Document Clustering: Document Similarity, Clustering Method, Document Clusters • Followed by: Cluster Validity, Application • Pre-processing steps: • Identify all unique words in the document collection • Discard common words that are included in the stop list • Apply stemming algorithm and combine identical word stems • Discard terms using pre-processing filters • Apply term weighting scheme to the final set of k indexing terms • Document Vectors: X = [x_ij], i = 1..N, j = 1..k (one row per document, one column per indexing term) • Vector-Space Model of Information Retrieval • Very high-dimensional • Very sparse (more than 95%)
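The encoding steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the stop list, the `crude_stem` helper, the `min_df` filter and the tf-idf weighting are all assumptions chosen for brevity, since the slide does not name a specific stemmer, filter or weighting scheme.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "of", "and", "a", "in", "to", "is"}  # tiny illustrative stop list

def crude_stem(word):
    # Placeholder for a real stemmer (e.g. Porter): only strips a trailing "s".
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def encode(docs, min_df=1):
    """Return (terms, X) where X is an N x k list of unit-length document vectors."""
    tokenised = [
        [crude_stem(w) for w in re.findall(r"[a-z]+", d.lower()) if w not in STOP_WORDS]
        for d in docs
    ]
    # Document frequency of each stem; min_df acts as a simple pre-processing filter.
    df = Counter(t for toks in tokenised for t in set(toks))
    terms = sorted(t for t, n in df.items() if n >= min_df)
    n_docs = len(docs)
    X = []
    for toks in tokenised:
        tf = Counter(toks)
        row = [tf[t] * math.log(n_docs / df[t]) for t in terms]  # assumed tf-idf weighting
        norm = math.sqrt(sum(x * x for x in row)) or 1.0
        X.append([x / norm for x in row])
    return terms, X

docs = ["Fuzzy clusters of documents", "Document clustering methods", "The topic hierarchy"]
terms, X = encode(docs)
```

Even on this toy collection most entries of `X` are zero, which is the sparsity the slide refers to.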
Measures of Document Relationship • FCM applies the Euclidean distance, which is inappropriate for high-dimensional text clustering • non-occurrence of the same terms in both documents is handled in a similar way as the co-occurrence of terms • Cosine (dis)similarity measure: • widely applied in Information Retrieval • represents the cosine of the angle between two document vectors • insensitive to different document lengths, since it is normalised by the length of the document vectors
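For reference, the standard cosine similarity between two k-dimensional document vectors, and the dissimilarity derived from it, can be written as follows. The symbols x_a, x_b, S and D are notation assumed here for illustration.

```latex
S(x_a, x_b) = \cos\theta_{ab}
            = \frac{\sum_{j=1}^{k} x_{aj}\, x_{bj}}
                   {\sqrt{\sum_{j=1}^{k} x_{aj}^{2}}\;\sqrt{\sum_{j=1}^{k} x_{bj}^{2}}},
\qquad
D(x_a, x_b) = 1 - S(x_a, x_b)
```

When both vectors are already scaled to unit length, the denominator is 1 and the similarity reduces to a dot product.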
H-FCM: Hyper-spherical Fuzzy C-Means • Applies the cosine measure to assess document relationships • Modified objective function: • Subject to an additional constraint: • Fuzzy memberships (u) and cluster centroids (v):
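The formulas referred to on this slide did not survive in the transcript. The sketch below assumes the usual FCM-style alternating updates with the cosine dissimilarity D = 1 − x·v substituted for the Euclidean distance and centroids renormalised to unit length (the additional constraint). The function names, the defaults `m=1.5` and `iters=30`, and the random initialisation are illustrative assumptions, not the authors' settings.

```python
import math
import random

def normalise(v):
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def hfcm(X, c, m=1.5, iters=30, seed=0):
    """Return (U, V): memberships U[i][a] and unit-length centroids V[a]."""
    rng = random.Random(seed)
    X = [normalise(x) for x in X]
    V = [list(x) for x in rng.sample(X, c)]  # assumed init: c randomly chosen documents
    e = 1.0 / (m - 1.0)
    U = []
    for _ in range(iters):
        # D[i][a] = 1 - cosine(x_i, v_a), floored to avoid division by zero
        D = [[max(1.0 - sum(p * q for p, q in zip(x, v)), 1e-9) for v in V] for x in X]
        # Membership update: u_ia = 1 / sum_b (D_ia / D_ib)^(1/(m-1))
        U = [[1.0 / sum((da / db) ** e for db in row) for da in row] for row in D]
        # Centroid update: membership-weighted sum of documents, rescaled to unit length
        V = []
        for a in range(c):
            acc = [0.0] * len(X[0])
            for u_row, x in zip(U, X):
                w = u_row[a] ** m
                for j, xj in enumerate(x):
                    acc[j] += w * xj
            V.append(normalise(acc))
    return U, V

X = [[1.0, 0.0, 0.1], [0.9, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 0.9, 0.0]]
U, V = hfcm(X, c=2)
```

Each row of `U` sums to 1 and each centroid in `V` has unit length, which are the two properties the constrained formulation is meant to preserve.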
How many clusters? • Usually the final number of clusters is not known a priori • Run the algorithm for a range of c values • Apply validity measures and determine which c leads to the best partition (cluster compactness, density, separation, etc.) • How compact and dense are clusters in a sparse high-dimensional problem space? • Only a very small percentage of documents within a cluster show high similarity to the respective centroid, so clusters are not compact • However, there is always a clear separation between intra- and inter-cluster similarity distributions
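One way to inspect the separation between intra- and inter-cluster similarities is to collect both sets of values and compare them. The sketch below assumes unit-length vectors and assigns each document to its maximum-membership cluster; that hardening step and the name `similarity_split` are assumptions made for illustration.

```python
def similarity_split(X, U, V):
    """Return (intra, inter): cosine similarities of each document to its own
    (maximum-membership) centroid and to every other centroid."""
    intra, inter = [], []
    for x, u_row in zip(X, U):
        best = max(range(len(V)), key=lambda a: u_row[a])
        for a, v in enumerate(V):
            s = sum(p * q for p, q in zip(x, v))  # cosine, assuming unit-length vectors
            (intra if a == best else inter).append(s)
    return intra, inter

X = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
U = [[0.9, 0.1], [0.2, 0.8]]
intra, inter = similarity_split(X, U, V)
```

Repeating this for each candidate value of c gives two distributions per run, which can then be compared with whatever validity measure is preferred.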
H2-FCM: Hierarchical Hyper-spherical Fuzzy C-Means • Key concepts • Apply partitional algorithm (H-FCM) to obtain a sufficiently large number of clusters • Exploit the granularity of the topics associated with each cluster to link cluster centroids hierarchically • Form a topic hierarchy • Asymmetric similarity measure • Identify parent-child type relationships between cluster centroids • A child should be less similar to its parent than the parent is to the child
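The slide does not give the formula for the asymmetric measure. To show what asymmetry means in practice, the sketch below uses a fuzzy-inclusion style ratio as a hypothetical stand-in: the share of one centroid's term weight that is also present in the other. Which argument order corresponds to "child to parent" depends on the authors' definition and is not settled here.

```python
def asym_similarity(va, vb):
    """Hypothetical measure: share of va's total term weight also present in vb."""
    total = sum(va)
    if not total:
        return 0.0
    return sum(min(p, q) for p, q in zip(va, vb)) / total

broad = [0.5, 0.5, 0.5, 0.5]    # centroid with weight spread over many terms
narrow = [0.7, 0.7, 0.0, 0.0]   # centroid concentrated on a few terms

s_narrow_broad = asym_similarity(narrow, broad)
s_broad_narrow = asym_similarity(broad, narrow)
```

The two values differ, so comparing them for a pair of centroids gives a direction that can be used to decide which of the two is the more general topic.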
The H2-FCM Algorithm • Apply H-FCM (c, m) • All clusters have size ≥ tND? If N: c = c − K and apply H-FCM again • Compute the asymmetric similarity S(va, vb), ∀ a, b • While VF ≠ ∅: select centroid • VH = ∅? If Y: add root • If N: select parent, then S ≥ tPCS? If Y: add child; if N: add root • [Figure: worked example with document clusters C1 to C12 and cluster centroids v1 to v12, showing the selection of vb ∈ VF with S(v5, vb) = max[S(vi, vj)], vi, vj ∈ VF, and the threshold tests S(v1, v5) ≥ tPCS, S(v8, v5) < tPCS, S(v8, v1) < tPCS]
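The linking loop in the flowchart (place each free centroid as a child when its similarity to a candidate parent reaches tPCS, otherwise as a root) can be sketched as below. The order in which centroids are selected and the choice of the most similar placed centroid as the candidate parent are assumptions of this sketch; the transcript does not preserve the exact selection rules.

```python
def link_centroids(S, t_pcs):
    """S[a][b]: asymmetric similarity of centroid a to centroid b.
    Returns a dict parent[a], with None marking a hierarchy root."""
    free = list(range(len(S)))   # V_F: centroids not yet placed
    placed = []                  # V_H: centroids already in the hierarchy
    parent = {}
    while free:
        a = free.pop(0)          # assumed selection order: as given
        # Candidate parent: the placed centroid that a is most similar to (None if V_H is empty)
        best = max(placed, key=lambda b: S[a][b], default=None)
        if best is not None and S[a][best] >= t_pcs:
            parent[a] = best     # add child
        else:
            parent[a] = None     # add root
        placed.append(a)
    return parent

S = [
    [1.0, 0.2, 0.1],
    [0.8, 1.0, 0.1],
    [0.1, 0.2, 1.0],
]
parent = link_centroids(S, t_pcs=0.5)
```

Raising `t_pcs` produces more roots and a flatter hierarchy; lowering it attaches more centroids as children.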
Scalability of the Algorithm • H2-FCM time complexity depends on H-FCM and the centroid linking heuristic • H-FCM computation time is O(Nc²k) • Linking heuristic is at most O(c²k) • Computation of the asymmetric similarity between every pair of cluster centroids: O(c²k) • Generation of the cluster hierarchy: O(c²) • Overall, H2-FCM time complexity is O(Nc²k) • Scales well to large document sets!
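The overall bound follows by summing the three terms and keeping the dominant one, assuming N ≥ 1 documents, c clusters and k indexing terms:

```latex
\underbrace{O(N c^{2} k)}_{\text{H-FCM}}
+ \underbrace{O(c^{2} k)}_{\text{asymmetric similarity}}
+ \underbrace{O(c^{2})}_{\text{hierarchy generation}}
= O(N c^{2} k)
```

For fixed c and k the cost therefore grows linearly with N.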
Description of Experiments • Goal: evaluate the H2-FCM performance • Evaluation measures: clustering Precision (P) and Recall (R) • H2-FCM algorithm run for a range of c values • No. hierarchy roots = No. reference classes, so tPCS is set dynamically • Are sub-clusters of the same topic assigned to the same branch?
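The slide does not say how clustering precision and recall are computed. A common convention, assumed here, matches each cluster to its majority reference class and then takes P as the fraction of the cluster belonging to that class and R as the fraction of that class captured by the cluster. The function name and the toy labels are illustrative only.

```python
from collections import Counter

def cluster_precision_recall(cluster_labels, class_labels):
    """Return {cluster: (precision, recall)} against each cluster's majority class."""
    class_sizes = Counter(class_labels)
    members = {}
    for cl, ref in zip(cluster_labels, class_labels):
        members.setdefault(cl, []).append(ref)
    scores = {}
    for cl, refs in members.items():
        ref, hits = Counter(refs).most_common(1)[0]  # majority reference class
        scores[cl] = (hits / len(refs), hits / class_sizes[ref])
    return scores

clusters = [0, 0, 0, 1, 1]
classes = ["a", "a", "b", "b", "b"]
scores = cluster_precision_recall(clusters, classes)
```

To evaluate a hierarchy under this convention, the cluster label of each document would be the root of the branch its cluster belongs to.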
Test Document Collections • Reuters-21578 test collection: http://www.daviddlewis.com/resources/testcollections/reuters21578/ • Open Directory Project (ODP): http://dmoz.org/ • INSPEC database: http://www.iee.org/publish/inspec/
Clustering Results: H2-FCM Precision and Recall • [Figure: precision and recall plots, one panel per collection: reuters1, reuters2, odp, inspec]
Topic Hierarchy • Each centroid vector consists of a set of weighted terms • Terms describe the topics associated with the document cluster • Centroid hierarchy produces a topic hierarchy • Useful for efficient access to individual documents • Provides context to users in exploratory information access
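Since each centroid is a vector of term weights, a topic label can be read off by taking its highest-weighted terms. The helper below is a minimal sketch; the name `topic_label`, the cut-off `n` and the sample terms are assumptions for illustration.

```python
def topic_label(centroid, terms, n=3):
    """Return up to n terms with the largest positive weights in the centroid."""
    ranked = sorted(zip(centroid, terms), reverse=True)
    return [t for w, t in ranked[:n] if w > 0]

terms = ["cluster", "fuzzy", "network", "topic"]
centroid = [0.1, 0.8, 0.0, 0.5]
label = topic_label(centroid, terms, n=2)
```

Applying this to every node of the centroid hierarchy yields a labelled topic tree that users can browse from general to specific topics.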
Concluding Remarks • H2-FCM clustering algorithm • Partitional clustering (H-FCM) • Linking heuristic organizes centroids hierarchically based on asymmetric similarity • Scales linearly with the number of documents • Exhibits good clustering performance • Topic hierarchy can be extracted from the centroid hierarchy