Hierarchical Document Clustering Using Frequent Itemsets. Benjamin C.M. Fung, Ke Wang, and Martin Ester. Proceedings of the SIAM International Conference on Data Mining, 2003. Presenter: 吳建良
Outline • Hierarchical Document Clustering • Proposed Approach • Frequent Itemset-based Hierarchical Clustering (FIHC) • Experimental Evaluation • Conclusions
Hierarchical Document Clustering • Document Clustering • Automatically organize documents into clusters • Documents within a cluster have high similarity • Documents in different clusters are very dissimilar • Hierarchical Document Clustering • Example hierarchy (figure): Sports → {Soccer, Tennis}; Tennis → {Tennis ball}
Challenges in Hierarchical Document Clustering • High dimensionality • High volume of data • Consistently high clustering quality • Meaningful cluster descriptions
Overview of FIHC • Pipeline (figure): Documents → Preprocessing → (high dimensional doc vectors) → Generate frequent itemsets → (reduced dimensions feature vectors) → Construct clusters → Build a tree → Pruning → Cluster tree
Preprocessing • Remove stop words and apply stemming • Construct the vector model • $doc_i = (if_1, if_2, if_3, \ldots, if_m)$, where $if_k$ is the frequency of item k in $doc_i$ • Ex.: (example table not preserved)
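The preprocessing step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the stop-word list and the suffix-stripping `naive_stem` are hypothetical stand-ins for a real stop list and stemmer.

```python
from collections import Counter

# Hypothetical, tiny stop-word list; a real system would use a much larger one.
STOP_WORDS = {"the", "a", "of", "in", "and", "is", "to"}

def naive_stem(word):
    """Toy stemmer (assumption): strip a few common suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def to_item_counts(text):
    """Lower-case, drop non-alphabetic tokens and stop words, stem, then count."""
    tokens = [t for t in text.lower().split() if t.isalpha()]
    return Counter(naive_stem(t) for t in tokens if t not in STOP_WORDS)

def build_vectors(texts):
    """Return the sorted vocabulary and one item-frequency vector per document."""
    counts = [to_item_counts(t) for t in texts]
    vocab = sorted(set().union(*counts))
    vectors = [[c[item] for item in vocab] for c in counts]
    return vocab, vectors

vocab, vectors = build_vectors(["the patient treatments", "flow of the layers and flow"])
```

Each vector position holds the frequency of one item, matching the doc_i = (if_1, ..., if_m) model above.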
Generate Frequent Itemsets • Use the algorithm proposed by Agrawal et al. to find global frequent itemsets • Minimum global support • A percentage of all documents • Global frequent itemset • A set of items (words) that appear together in more than the minimum global support fraction of the whole document set • Global frequent item • An item that belongs to some global frequent itemset
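A level-wise (Apriori-style) search is one way to find the global frequent itemsets. The sketch below assumes that this is the kind of algorithm meant by "Agrawal et al.", and it uses `>=` for the support test, which is also an assumption. The per-document item sets are reconstructed from the initial-cluster listing later in these slides.

```python
from itertools import combinations

# Global frequent items per document, reconstructed from the slides' cluster listing.
DOC_ITEMS = {
    "cran.1": {"flow", "form", "layer"},
    "cran.2": {"flow", "layer"},
    "cran.3": {"flow", "form", "layer", "result"},
    "cran.4": {"flow", "layer"},
    "cran.5": {"flow", "layer"},
    "cisi.1": {"form"},
    "med.1": {"patient", "result", "treatment"},
    "med.2": {"form", "patient", "result", "treatment"},
    "med.3": {"patient", "treatment"},
    "med.4": {"patient", "result", "treatment"},
    "med.5": {"form", "patient"},
    "med.6": {"patient", "result", "treatment"},
}

def global_frequent_itemsets(doc_items, min_global_support):
    """Return {itemset: support} for every itemset meeting the minimum support."""
    docs = list(doc_items.values())

    def support(itemset):
        return sum(itemset <= items for items in docs) / len(docs)

    frequent, level = {}, []
    for item in sorted(set().union(*docs)):
        single = frozenset([item])
        if support(single) >= min_global_support:
            frequent[single] = support(single)
            level.append(single)
    k = 2
    while level:
        # Join frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        level = []
        for cand in candidates:
            # Prune candidates that have an infrequent (k-1)-subset.
            subsets_ok = all(frozenset(s) in frequent for s in combinations(cand, k - 1))
            if subsets_ok and support(cand) >= min_global_support:
                frequent[cand] = support(cand)
                level.append(cand)
        k += 1
    return frequent

itemsets = global_frequent_itemsets(DOC_ITEMS, 0.35)
```

With a minimum global support of 35%, this yields six 1-itemsets plus {flow, layer} and {patient, treatment}, the same eight labels used for the initial clusters.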
Reduced Dimensions Vector Model • High dimensional vector model • Set the minimum global support to 35% • Store the frequencies only for global frequent items
Construct Initial Clusters • Construct a cluster for each global frequent itemset • All documents containing this itemset are included in the same cluster • The itemset is the cluster label, e.g. the cluster label of C(result) is {result}
C(flow): cran.1 cran.2 cran.3 cran.4 cran.5
C(form): cisi.1 cran.1 cran.3 med.2 med.5
C(layer): cran.1 cran.2 cran.3 cran.4 cran.5
C(patient): med.1 med.2 med.3 med.4 med.5 med.6
C(result): cran.3 med.1 med.2 med.4 med.6
C(treatment): med.1 med.2 med.3 med.4 med.6
C(flow, layer): cran.1 cran.2 cran.3 cran.4 cran.5
C(patient, treatment): med.1 med.2 med.3 med.4 med.6
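Constructing the initial clusters amounts to one containment test per document and label. A minimal sketch, assuming the per-document item sets below (reconstructed from the slide's listing) and hard-coding the eight cluster labels:

```python
# Global frequent items per document, reconstructed from the slide's cluster listing.
DOC_ITEMS = {
    "cran.1": {"flow", "form", "layer"},
    "cran.2": {"flow", "layer"},
    "cran.3": {"flow", "form", "layer", "result"},
    "cran.4": {"flow", "layer"},
    "cran.5": {"flow", "layer"},
    "cisi.1": {"form"},
    "med.1": {"patient", "result", "treatment"},
    "med.2": {"form", "patient", "result", "treatment"},
    "med.3": {"patient", "treatment"},
    "med.4": {"patient", "result", "treatment"},
    "med.5": {"form", "patient"},
    "med.6": {"patient", "result", "treatment"},
}

# The eight global frequent itemsets, used as cluster labels.
LABELS = [
    frozenset({"flow"}), frozenset({"form"}), frozenset({"layer"}),
    frozenset({"patient"}), frozenset({"result"}), frozenset({"treatment"}),
    frozenset({"flow", "layer"}), frozenset({"patient", "treatment"}),
]

def initial_clusters(doc_items, labels):
    """A document joins every cluster whose label it fully contains."""
    return {
        label: sorted(doc for doc, items in doc_items.items() if label <= items)
        for label in labels
    }

clusters = initial_clusters(DOC_ITEMS, LABELS)
```

Because a document can contain several frequent itemsets, these clusters overlap; med.2, for instance, lands in five of them.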
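Cluster Frequent Items • A global frequent item is cluster frequent in a cluster Ci if the item is contained in at least some minimum fraction (the minimum cluster support) of the documents in Ci • Suppose the minimum cluster support is set to 70%
C(patient), items: (flow, form, layer, patient, result, treatment)
med.1 = ( 0 0 0 8 1 2 )
med.2 = ( 0 1 0 4 3 1 )
med.3 = ( 0 0 0 3 0 2 )
med.4 = ( 0 0 0 6 3 3 )
med.5 = ( 0 1 0 4 0 0 )
med.6 = ( 0 0 0 9 1 1 )
• patient (6 of 6 documents, 100%) and treatment (5 of 6, 83%) are cluster frequent in C(patient); result (4 of 6, 67%) and form (2 of 6, 33%) are not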
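The cluster support of each item in C(patient) can be checked directly from the six feature vectors on the slide. A minimal sketch; the function name is illustrative:

```python
ITEMS = ("flow", "form", "layer", "patient", "result", "treatment")

# Feature vectors of the documents in the initial cluster C(patient), from the slide.
C_PATIENT = {
    "med.1": (0, 0, 0, 8, 1, 2),
    "med.2": (0, 1, 0, 4, 3, 1),
    "med.3": (0, 0, 0, 3, 0, 2),
    "med.4": (0, 0, 0, 6, 3, 3),
    "med.5": (0, 1, 0, 4, 0, 0),
    "med.6": (0, 0, 0, 9, 1, 1),
}

def cluster_frequent_items(vectors, min_cluster_support):
    """Return {item: cluster support} for items present in enough of the documents."""
    n_docs = len(vectors)
    supports = {}
    for idx, item in enumerate(ITEMS):
        fraction = sum(1 for vec in vectors if vec[idx] > 0) / n_docs
        if fraction >= min_cluster_support:
            supports[item] = fraction
    return supports

frequent = cluster_frequent_items(list(C_PATIENT.values()), 0.70)
```

Note that cluster support counts documents containing the item, not the item's frequency inside each document.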
Cluster Label vs. Cluster Frequent Items • Cluster label • Use global frequent itemset as cluster label • A set of mandatory items in the cluster • Every document in the cluster must contain all the items in the cluster label • Used in hierarchical structure establishment • Cluster frequent items • Appear in some minimum fraction of documents in the cluster • Used in similarity measurement • Topic description of the cluster
Make Clusters Disjoint • Initial clusters are not disjoint • Remove the overlapping of clusters • Assign a document to the “best” initial cluster • Define the score function Score(Ci ← docj) • Measure the goodness of a cluster Ci for a document docj
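Score Function • Assign each docj to the initial cluster Ci that has the highest score
$$Score(C_i \leftarrow doc_j) = \Big[\sum_{x} n(x) \cdot cluster\_support(x)\Big] - \Big[\sum_{x'} n(x') \cdot global\_support(x')\Big]$$
• x represents a global frequent item in docj that is also cluster frequent in Ci • x' represents a global frequent item in docj that is not cluster frequent in Ci • n(x) is the frequency of x in the feature vector of docj • n(x') is the frequency of x' in the feature vector of docj • cluster_support(x) is the fraction of documents in Ci that contain x; global_support(x') is the fraction of all documents that contain x'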
Score Function (cont.) • If more than one cluster has the highest score • Choose the one with the most items in its cluster label • Key idea: • A cluster Ci is good for a document docj if there are many global frequent items in docj that appear in many documents in Ci
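Score Function - Example • Score every initial cluster for med.6, with items (flow, form, layer, patient, result, treatment) and med.6 = ( 0 0 0 9 1 1 )
Cluster frequent items (cluster support):
C(flow): flow = 100%, layer = 100%
C(form): form = 100%
C(layer): flow = 100%, layer = 100%
C(patient): patient = 100%, treatment = 83%
C(result): result = 100%, patient = 80%, treatment = 80%
C(treatment): patient = 100%, treatment = 100%, result = 80%
C(flow, layer): flow = 100%, layer = 100%
C(patient, treatment): patient = 100%, treatment = 100%, result = 80%
• Scores shown on the slide: -5.34, -5.34, 9.41, 10.6, 10.8, -5.34
• A cluster sharing no cluster frequent item with med.6, e.g. C(flow): 0 + 0 - [(9 × 0.5) + (1 × 0.42) + (1 × 0.42)] = -5.34, where 0.5, 0.42, and 0.42 are the global supports of patient, result, and treatment
• A cluster with patient = 100%, treatment = 100%, result = 80%: (9 × 1) + (1 × 1) + (1 × 0.8) = 10.8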
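The two worked scores for med.6 can be reproduced with a short script. This is a sketch under stated assumptions: the global supports are the rounded values shown on the slide, the cluster supports are the percentages listed per cluster, and the function names are illustrative.

```python
ITEMS = ("flow", "form", "layer", "patient", "result", "treatment")

# Global supports as rounded on the slide (only the items med.6 contains are needed).
GLOBAL_SUPPORT = {"patient": 0.5, "result": 0.42, "treatment": 0.42}

# Cluster frequent items and their cluster supports, as listed on the slide.
CLUSTER_SUPPORT = {
    ("flow",): {"flow": 1.0, "layer": 1.0},
    ("form",): {"form": 1.0},
    ("layer",): {"flow": 1.0, "layer": 1.0},
    ("patient",): {"patient": 1.0, "treatment": 0.83},
    ("result",): {"result": 1.0, "patient": 0.8, "treatment": 0.8},
    ("treatment",): {"patient": 1.0, "treatment": 1.0, "result": 0.8},
    ("flow", "layer"): {"flow": 1.0, "layer": 1.0},
    ("patient", "treatment"): {"patient": 1.0, "treatment": 1.0, "result": 0.8},
}

def score(cluster_support, doc):
    """Reward items that are cluster frequent; penalize the rest by global support."""
    total = 0.0
    for item, n in zip(ITEMS, doc):
        if n == 0:
            continue  # item does not occur in the document
        if item in cluster_support:
            total += n * cluster_support[item]
        else:
            total -= n * GLOBAL_SUPPORT[item]
    return total

def best_cluster(doc):
    """Highest score wins; ties go to the label with the most items."""
    return max(
        CLUSTER_SUPPORT,
        key=lambda label: (round(score(CLUSTER_SUPPORT[label], doc), 9), len(label)),
    )

MED6 = (0, 0, 0, 9, 1, 1)
score_flow = round(score(CLUSTER_SUPPORT[("flow",)], MED6), 2)
score_pt = round(score(CLUSTER_SUPPORT[("patient", "treatment")], MED6), 2)
winner = best_cluster(MED6)
```

C(treatment) and C(patient, treatment) tie on score, so the tie-break on label size assigns med.6 to C(patient, treatment).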
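Recompute the Cluster Frequent Items • Recompute the cluster frequent items of each Ci from the documents in Ci and in all descendants of Ci • A cluster is a descendant of Ci if its cluster label is a superset of the cluster label of Ci • Consider C(patient): after the clusters are made disjoint it holds only med.5, with (flow, form, layer, patient, result, treatment) med.5 = ( 0 1 0 4 0 0 ) • Include its descendant C(patient, treatment) in the recomputation • (Table: disjoint cluster result; clusters left without documents are marked "none")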
Building the Cluster Tree • Put the more specific clusters at the bottom of the tree • Put the more general clusters at the top of the tree
Building the Cluster Tree (cont.) • Tree level • Level 0: the root, labeled “null”, stores the unclustered documents • Level k: the cluster label is a global frequent k-itemset • Bottom-up manner • Start from the cluster Ci with the largest number k of items in its cluster label • Identify all potential parents: the (k-1)-clusters whose cluster label is a subset of Ci’s cluster label • Choose the “best” among the potential parents
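The bottom-up order and the potential-parent test can be sketched with plain set operations. Representing labels as frozensets and the root as the empty label are assumptions made for this illustration.

```python
ROOT = frozenset()  # assumption: the "null" root is modeled as the empty label

LABELS = [
    ROOT,
    frozenset({"flow"}), frozenset({"form"}), frozenset({"layer"}),
    frozenset({"patient"}), frozenset({"result"}), frozenset({"treatment"}),
    frozenset({"flow", "layer"}), frozenset({"patient", "treatment"}),
]

def bottom_up_order(labels):
    """Process clusters with the largest labels first."""
    return sorted(labels, key=len, reverse=True)

def potential_parents(child_label, labels):
    """All (k-1)-clusters whose label is a proper subset of the child's k-item label."""
    k = len(child_label)
    parents = [p for p in labels if len(p) == k - 1 and p < child_label]
    return sorted(parents, key=sorted)  # deterministic order for display

order = bottom_up_order(LABELS)
parents_pt = potential_parents(frozenset({"patient", "treatment"}), LABELS)
parents_flow = potential_parents(frozenset({"flow"}), LABELS)
```

For C(patient, treatment) the candidates are C(patient) and C(treatment); every 1-cluster has only the root as its potential parent.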
Building the Cluster Tree (cont.) • The criterion for selecting the best parent • Similar to choosing the best cluster for a document • Method: (1) Merge all the documents in the subtree of Ci into a single conceptual document doc(Ci) (2) Compute the score of doc(Ci) against each potential parent Cj • The potential parent with the highest score becomes the parent of Ci • All leaf clusters that contain no documents can be removed
Example • Start from the 2-clusters • C(flow, layer) and C(patient, treatment) • C(flow, layer) is empty → remove • C(patient, treatment) • Potential parents: C(patient) and C(treatment) • C(treatment) is empty → remove • C(patient) gets a higher score and becomes the parent of C(patient, treatment)
null
  C(flow): cran.1 cran.2 cran.3 cran.4 cran.5
  C(form): cisi.1
  C(patient): med.5
    C(patient, treatment): med.1 med.2 med.3 med.4 med.6
Prune Cluster Tree • With a small minimum global support • The cluster tree can be broad and deep • Documents of the same topic are distributed over several small clusters • Clustering accuracy is poor • The aim of tree pruning • Produce a natural topic hierarchy for browsing • Increase the clustering accuracy
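Inter-Cluster Similarity • Inter_Sim of Ca and Cb: $Inter\_Sim(C_a \leftrightarrow C_b) = [Sim(C_a \leftarrow C_b) \times Sim(C_b \leftarrow C_a)]^{1/2}$ • Reuse the score function to calculate Sim(Ci←Cj): $Sim(C_i \leftarrow C_j) = \frac{Score(C_i \leftarrow doc(C_j))}{\sum_{x} n(x) + \sum_{x'} n(x')} + 1$, where doc(Cj) is the conceptual document obtained by merging all documents in the subtree of Cj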
Property of Sim(Ci←Cj) • Global support and cluster support are between 0 and 1 • The maximum value of Score is $\sum_{x} n(x)$, the minimum is $-\sum_{x'} n(x')$ • Normalizing Score by $\sum_{x} n(x) + \sum_{x'} n(x')$ gives a value within -1~1 • To avoid negative similarity values, add the term +1 • The range of the Sim function is 0~2, and so is that of Inter_Sim • An Inter_Sim value below 1 means • The weight of dissimilar items has exceeded the weight of similar items • So 1 is a good threshold to distinguish two clusters
Child Pruning • Objective: shorten the depth of the tree • Procedure • Scan the tree in bottom-up order • For each non-leaf node, calculate Inter_Sim between the node and each of its children • If Inter_Sim is above 1, prune the child cluster • If a cluster is pruned, its children become the children of their grandparent • Child pruning is only applicable to level 2 and below
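The pruning procedure can be sketched on a small tree structure. This is an illustration only: the `Node` class is hypothetical, the similarity is passed in as a callable, moving a pruned child's documents into its parent is inferred from the example tree, and a fuller version would recompute similarities after each merge. The stub value 1.81 is the Inter_Sim obtained for C(patient) and C(patient, treatment) from the slide data; 0.5 is a placeholder.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """Hypothetical cluster-tree node: a label, its own documents, its children."""
    label: tuple
    docs: list = field(default_factory=list)
    children: list = field(default_factory=list)

def prune_children(node, inter_sim, level=0):
    """Bottom-up child pruning; clusters at level 1 are never merged into the root."""
    for child in list(node.children):
        prune_children(child, inter_sim, level + 1)
    if level == 0:
        return
    kept = []
    for child in node.children:
        if inter_sim(node, child) > 1.0:
            node.docs.extend(child.docs)   # pruned child's documents move up
            kept.extend(child.children)    # grandchildren are adopted by this node
        else:
            kept.append(child)
    node.children = kept

# Tree from the example slide.
pt = Node(("patient", "treatment"), ["med.1", "med.2", "med.3", "med.4", "med.6"])
patient = Node(("patient",), ["med.5"], [pt])
flow = Node(("flow",), ["cran.1", "cran.2", "cran.3", "cran.4", "cran.5"])
form = Node(("form",), ["cisi.1"])
root = Node((), [], [flow, form, patient])

# Stub similarity (assumption): only the C(patient) / C(patient, treatment) pair exceeds 1.
def stub_inter_sim(parent, child):
    return 1.81 if child.label == ("patient", "treatment") else 0.5

prune_children(root, stub_inter_sim)
```

After the call, C(patient) holds med.1 through med.6 and has no children, matching the pruned tree in the example.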
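Example • Determine whether cluster C(patient, treatment) should be pruned • Compute the inter-cluster similarity between C(patient) and C(patient, treatment) • Sim(C(patient)←C(patient, treatment)) • Combine all the documents in cluster C(patient, treatment) by adding up their feature vectors
Items: (flow, form, layer, patient, result, treatment)
med.1 = ( 0 0 0 8 1 2 )
med.2 = ( 0 1 0 4 3 1 )
med.3 = ( 0 0 0 3 0 2 )
med.4 = ( 0 0 0 6 3 3 )
med.6 = ( 0 0 0 9 1 1 )
Sum   = ( 0 1 0 30 8 9 )
• With patient (100%) and treatment (83%) cluster frequent in C(patient) and a global support of 0.42 for both form and result: Score = (30 × 1) + (9 × 0.83) - [(1 × 0.42) + (8 × 0.42)] = 33.69, so Sim(C(patient)←C(patient, treatment)) = 33.69 / 48 + 1 = 1.70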
Example (cont.) • Sim(C(patient)←C(patient, treatment)) = 1.70 • Sim(C(patient, treatment)←C(patient)) = 1.92 • Inter_Sim(C(patient)↔C(patient, treatment)) = (1.70 × 1.92)^{1/2} ≈ 1.81 • Inter_Sim is above 1, so cluster C(patient, treatment) is pruned
null
  C(flow): cran.1 cran.2 cran.3 cran.4 cran.5
  C(form): cisi.1
  C(patient): med.1 med.2 med.3 med.4 med.5 med.6
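The similarity values in this example can be recomputed from the feature vectors. The sketch below assumes the normalized form Sim = Score / (sum of all item frequencies in the merged document) + 1 and the geometric mean for Inter_Sim. The global support of form (0.42) is derived from 5 of the 12 documents containing it, and doc(C(patient)) is taken to include the subtree of C(patient).

```python
from math import sqrt

ITEMS = ("flow", "form", "layer", "patient", "result", "treatment")

# Rounded global supports; the value for "form" is derived (5 of 12 documents).
GLOBAL_SUPPORT = {"form": 0.42, "patient": 0.5, "result": 0.42, "treatment": 0.42}

# Cluster frequent items (with cluster support) after recomputation.
SUPPORT_PATIENT = {"patient": 1.0, "treatment": 0.83}
SUPPORT_PT = {"patient": 1.0, "treatment": 1.0, "result": 0.8}

# Conceptual documents: summed feature vectors of each cluster's subtree.
DOC_PT = (0, 1, 0, 30, 8, 9)       # med.1, med.2, med.3, med.4, med.6
DOC_PATIENT = (0, 2, 0, 34, 8, 9)  # med.5 plus the documents of the child cluster

def score(cluster_support, doc):
    total = 0.0
    for item, n in zip(ITEMS, doc):
        if n == 0:
            continue
        if item in cluster_support:
            total += n * cluster_support[item]
        else:
            total -= n * GLOBAL_SUPPORT[item]
    return total

def sim(cluster_support, doc):
    """Normalize the score into -1..1, then shift it into 0..2."""
    return score(cluster_support, doc) / sum(doc) + 1

def inter_sim(support_a, doc_a, support_b, doc_b):
    """Geometric mean of the two directed similarities."""
    return sqrt(sim(support_a, doc_b) * sim(support_b, doc_a))

sim_parent_from_child = round(sim(SUPPORT_PATIENT, DOC_PT), 2)
sim_child_from_parent = round(sim(SUPPORT_PT, DOC_PATIENT), 2)
similarity = round(inter_sim(SUPPORT_PATIENT, DOC_PATIENT, SUPPORT_PT, DOC_PT), 2)
```

The second value matches the 1.92 given on the slide, and the resulting Inter_Sim is above the threshold of 1.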
Sibling Merging • Sibling merging is applicable to level 1 • Procedure • Calculate the Inter_Sim for each pair of clusters at level 1 • Merge the cluster pair that has the highest Inter_Sim • Repeat the above steps until • The user-specified number of clusters is reached, or • All cluster pairs at level 1 have Inter_Sim below or equal to 1
null
  C(flow): cran.1 cran.2 cran.3 cran.4 cran.5 cisi.1
  C(patient): med.1 med.2 med.3 med.4 med.5 med.6
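The merging loop can be sketched as follows. Several parts are assumptions: the similarity is an injected callable (a real run would recompute Inter_Sim from the merged clusters), the merged cluster keeps the first label of the pair, child clusters are ignored, and the stub similarity values 1.2 and 0.5 are hypothetical.

```python
from itertools import combinations

def merge_siblings(clusters, inter_sim, target_count):
    """Greedily merge the most similar level-1 pair until a stop condition holds."""
    while len(clusters) > target_count:
        pairs = list(combinations(clusters, 2))
        if not pairs:
            break
        a, b = max(pairs, key=lambda pair: inter_sim(pair[0], pair[1], clusters))
        if inter_sim(a, b, clusters) <= 1.0:
            break  # every remaining pair is at or below the threshold
        clusters[a].extend(clusters.pop(b))  # assumption: keep the first label
    return clusters

# Level-1 clusters after child pruning, from the example slide.
level1 = {
    "flow": ["cran.1", "cran.2", "cran.3", "cran.4", "cran.5"],
    "form": ["cisi.1"],
    "patient": ["med.1", "med.2", "med.3", "med.4", "med.5", "med.6"],
}

# Hypothetical similarities: only the flow/form pair exceeds the threshold.
STUB = {frozenset({"flow", "form"}): 1.2}

def stub_inter_sim(a, b, clusters):
    return STUB.get(frozenset({a, b}), 0.5)

merged = merge_siblings(level1, stub_inter_sim, target_count=1)
```

Even with a target of one cluster, merging stops at two clusters because the remaining pair has Inter_Sim at or below 1.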
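Experimental Evaluation • Dataset (table not preserved) • Clustering Quality (F-measure) • For natural class Ki and cluster Cj, let nij be the number of members of Ki in Cj • Recall $R(K_i, C_j) = n_{ij} / |K_i|$, Precision $P(K_i, C_j) = n_{ij} / |C_j|$ • Corresponding F-measure: $F(K_i, C_j) = \frac{2 \cdot R(K_i, C_j) \cdot P(K_i, C_j)}{R(K_i, C_j) + P(K_i, C_j)}$ • F-measure for the whole clustering result: $F(C) = \sum_{K_i} \frac{|K_i|}{|D|} \max_{C_j} F(K_i, C_j)$, where |D| is the total number of documents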
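The overall F-measure can be computed in a few lines. As an illustration it is applied to the two final clusters of the running example, treating the document-name prefixes (cran, cisi, med) as the natural classes, which is an assumption made only for this sketch.

```python
def overall_f_measure(classes, clusters):
    """Weighted sum, over natural classes, of the best F-measure against any cluster."""
    total_docs = sum(len(docs) for docs in classes.values())
    overall = 0.0
    for class_docs in classes.values():
        best = 0.0
        for cluster_docs in clusters.values():
            n_ij = len(class_docs & cluster_docs)
            if n_ij == 0:
                continue
            recall = n_ij / len(class_docs)
            precision = n_ij / len(cluster_docs)
            best = max(best, 2 * recall * precision / (recall + precision))
        overall += len(class_docs) / total_docs * best
    return overall

# Assumed natural classes (by document-name prefix) and the final clusters.
CLASSES = {
    "cran": {"cran.1", "cran.2", "cran.3", "cran.4", "cran.5"},
    "cisi": {"cisi.1"},
    "med": {"med.1", "med.2", "med.3", "med.4", "med.5", "med.6"},
}
CLUSTERS = {
    "C(flow)": {"cran.1", "cran.2", "cran.3", "cran.4", "cran.5", "cisi.1"},
    "C(patient)": {"med.1", "med.2", "med.3", "med.4", "med.5", "med.6"},
}

f_value = overall_f_measure(CLASSES, CLUSTERS)
```

Here the med class is matched perfectly, while the single cisi document is absorbed into C(flow), which lowers the overall value to about 0.90.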
Conclusions & Discussion • This research exploits frequent itemsets to • Define clusters • Use the score function, construct initial clusters, make clusters disjoint • Organize the cluster hierarchy • Build the cluster tree, prune the cluster tree • Discussion: • Frequent word sets are unordered • A different order of words may convey a different meaning • Documents may cover multiple topics