
Hierarchical Document Clustering Using Frequent Itemsets

Hierarchical Document Clustering Using Frequent Itemsets. Benjamin C.M. Fung, Ke Wang and Martin Ester. Proceedings of the SIAM International Conference on Data Mining, 2003. Presenter: 吳建良.


Presentation Transcript


  1. Hierarchical Document Clustering Using Frequent Itemsets • Benjamin C.M. Fung, Ke Wang and Martin Ester • Proceedings of the SIAM International Conference on Data Mining, 2003 • Presenter: 吳建良

  2. Outline • Hierarchical Document Clustering • Proposed Approach • Frequent Itemset-based Hierarchical Clustering (FIHC) • Experimental Evaluation • Conclusions

  3. Hierarchical Document Clustering • Document Clustering • Automatically organize documents into clusters • Documents within a cluster have high similarity • Documents within different clusters are very dissimilar • Hierarchical Document Clustering • Organize the clusters into a topic hierarchy (figure: example hierarchy with the topics Sports, Soccer, Tennis, Tennis ball)

  4. Challenges in Hierarchical Document Clustering • High dimensionality • High volume of data • Consistently high clustering quality • Meaningful cluster description

  5. Overview of FIHC • Documents → Preprocessing → (high dimensional doc vectors) → Generate frequent itemsets → (reduced dimensions feature vectors) → Construct clusters → Build a Tree → Pruning → Cluster Tree

  6. Preprocessing • Remove stop words and apply stemming • Construct vector model • doci = (item frequency1, if2, if3, …, ifm), where ifk is the frequency of item k in doci • EX: (example not preserved in the transcript)
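The preprocessing step can be sketched in Python as follows. This is only an illustration: the stop-word list, the crude `naive_stem` suffix stripper (a stand-in for a real stemmer) and the sample sentence are assumptions, not part of the original slides.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stop-word list; real systems use a larger one.
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "is", "are", "to", "was", "were", "with"}

def naive_stem(word):
    """Crude suffix stripping, a hypothetical stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lower-case, tokenize, drop stop words, stem, and count item frequencies."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(naive_stem(t) for t in tokens if t not in STOP_WORDS)

def to_vector(item_freq, vocabulary):
    """Build doc_i = (if_1, ..., if_m) over a fixed item order."""
    return [item_freq.get(item, 0) for item in vocabulary]

# Hypothetical sample sentence.
freq = preprocess("The patients were treated and the treatment of the patient is reported")
vocab = sorted(freq)
vec = to_vector(freq, vocab)
```

Here `vocab` fixes the order of the items, so every document vector built with the same `vocab` has comparable positions.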

  7. Generate Frequent Itemsets • Use the algorithm proposed by Agrawal et al. to find global frequent itemsets • Minimum global support • A percentage of all documents • Global frequent itemset • A set of items (words) that appear together in more than a minimum fraction (the minimum global support) of the whole document set • Global frequent item • An item that belongs to some global frequent itemset
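A minimal level-wise (Apriori-style) sketch of this step is shown below. The slides do not spell out the mining algorithm, so treat the code as an assumption about how it could look. The item sets per document are reconstructed from the cluster membership in the deck's "Construct Initial Clusters" example and contain only the words that turn out to be frequent; `frequent_itemsets` is a hypothetical helper name.

```python
from itertools import combinations

# Item sets reconstructed from the deck's initial-cluster example (12 documents).
DOCS = {
    "cran.1": {"flow", "form", "layer"},
    "cran.2": {"flow", "layer"},
    "cran.3": {"flow", "form", "layer", "result"},
    "cran.4": {"flow", "layer"},
    "cran.5": {"flow", "layer"},
    "cisi.1": {"form"},
    "med.1": {"patient", "result", "treatment"},
    "med.2": {"form", "patient", "result", "treatment"},
    "med.3": {"patient", "treatment"},
    "med.4": {"patient", "result", "treatment"},
    "med.5": {"form", "patient"},
    "med.6": {"patient", "result", "treatment"},
}

def frequent_itemsets(docs, min_support):
    """Return {itemset: global support} for all itemsets meeting min_support."""
    n = len(docs)

    def support(itemset):
        return sum(itemset <= items for items in docs.values()) / n

    items = sorted(set().union(*docs.values()))
    level = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
    result = {}
    k = 1
    while level:
        for itemset in level:
            result[itemset] = support(itemset)
        k += 1
        # Join frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == k}
        # Keep candidates whose (k-1)-subsets are all frequent and that are frequent themselves.
        level = [c for c in candidates
                 if all(frozenset(sub) in result for sub in combinations(c, k - 1))
                 and support(c) >= min_support]
    return result

global_frequent = frequent_itemsets(DOCS, min_support=0.35)
```

With a minimum global support of 35%, an itemset needs at least 5 of the 12 documents, which yields the six single items plus {flow, layer} and {patient, treatment}.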

  8. Reduced Dimensions Vector Model • High dimensional vector model • Set the minimum global support to 35% • Store the frequencies only for global frequent items

  9. Construct Initial Clusters • Construct a cluster for each global frequent itemset • All documents containing this itemset are included in the same cluster • The itemset is the cluster label, e.g. the cluster label of C(result) is {result}
      Cluster                  Documents
      C(flow)                  cran.1 cran.2 cran.3 cran.4 cran.5
      C(form)                  cisi.1 cran.1 cran.3 med.2 med.5
      C(layer)                 cran.1 cran.2 cran.3 cran.4 cran.5
      C(patient)               med.1 med.2 med.3 med.4 med.5 med.6
      C(result)                cran.3 med.1 med.2 med.4 med.6
      C(treatment)             med.1 med.2 med.3 med.4 med.6
      C(flow, layer)           cran.1 cran.2 cran.3 cran.4 cran.5
      C(patient, treatment)    med.1 med.2 med.3 med.4 med.6
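Constructing the initial clusters amounts to one subset test per document and itemset. The sketch below assumes the per-document item sets implied by the example on this slide; `initial_clusters` is a hypothetical helper name.

```python
# Item sets per document, as implied by the cluster membership on this slide.
DOCS = {
    "cran.1": {"flow", "form", "layer"},
    "cran.2": {"flow", "layer"},
    "cran.3": {"flow", "form", "layer", "result"},
    "cran.4": {"flow", "layer"},
    "cran.5": {"flow", "layer"},
    "cisi.1": {"form"},
    "med.1": {"patient", "result", "treatment"},
    "med.2": {"form", "patient", "result", "treatment"},
    "med.3": {"patient", "treatment"},
    "med.4": {"patient", "result", "treatment"},
    "med.5": {"form", "patient"},
    "med.6": {"patient", "result", "treatment"},
}

GLOBAL_FREQUENT_ITEMSETS = [
    ("flow",), ("form",), ("layer",), ("patient",), ("result",), ("treatment",),
    ("flow", "layer"), ("patient", "treatment"),
]

def initial_clusters(docs, itemsets):
    """One cluster per global frequent itemset; a document joins every cluster whose label it contains."""
    return {
        label: sorted(name for name, items in docs.items() if set(label) <= items)
        for label in itemsets
    }

clusters = initial_clusters(DOCS, GLOBAL_FREQUENT_ITEMSETS)
# The initial clusters overlap: count how many of them contain med.6.
clusters_with_med6 = [label for label, members in clusters.items() if "med.6" in members]
```

The overlap visible in `clusters_with_med6` is what the later "make clusters disjoint" step removes.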

  10. Cluster Frequent Items • A global frequent item is cluster frequent in a cluster Ci if the item is contained in some minimum fraction of documents in Ci • Suppose the minimum cluster support is set to 70% • Example: C(patient)
              (flow, form, layer, patient, result, treatment)
      med.1 = (   0     0     0      8       1        2     )
      med.2 = (   0     1     0      4       3        1     )
      med.3 = (   0     0     0      3       0        2     )
      med.4 = (   0     0     0      6       3        3     )
      med.5 = (   0     1     0      4       0        0     )
      med.6 = (   0     0     0      9       1        1     )
      Cluster frequent items of C(patient): patient (100%) and treatment (83%)
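The cluster support of an item is the fraction of documents in the cluster that contain it. A small sketch using the C(patient) vectors from this slide (function names are illustrative):

```python
ITEMS = ("flow", "form", "layer", "patient", "result", "treatment")

# Feature vectors of the documents in C(patient), copied from the slide.
C_PATIENT = {
    "med.1": (0, 0, 0, 8, 1, 2),
    "med.2": (0, 1, 0, 4, 3, 1),
    "med.3": (0, 0, 0, 3, 0, 2),
    "med.4": (0, 0, 0, 6, 3, 3),
    "med.5": (0, 1, 0, 4, 0, 0),
    "med.6": (0, 0, 0, 9, 1, 1),
}

def cluster_support(vectors):
    """Fraction of documents in the cluster that contain each item."""
    n = len(vectors)
    return {item: sum(v[i] > 0 for v in vectors) / n for i, item in enumerate(ITEMS)}

def cluster_frequent_items(vectors, min_cluster_support):
    """Items whose cluster support reaches the minimum cluster support."""
    return {item: s for item, s in cluster_support(vectors).items()
            if s >= min_cluster_support}

cf_patient = cluster_frequent_items(list(C_PATIENT.values()), 0.70)
```

In this cluster, form (2 of 6 documents) and result (4 of 6) fall below the 70% threshold, so only patient and treatment remain.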

  11. Initial Cluster (minimum cluster support=70%)

  12. Cluster Label vs. Cluster Frequent Items • Cluster label • Use global frequent itemset as cluster label • A set of mandatory items in the cluster • Every document in the cluster must contain all the items in the cluster label • Used in hierarchical structure establishment • Cluster frequent items • Appear in some minimum fraction of documents in the cluster • Used in similarity measurement • Topic description of the cluster

  13. Make Clusters Disjoint • Initial clusters are not disjoint • Remove the overlapping of clusters • Assign a document to the “best” initial cluster • Define the score function Score(Ci ← docj) • Measure the goodness of a cluster Ci for a document docj

  14. Score Function • Assign each docj to the initial cluster Ci that has the highest score • Score(Ci ← docj) = [Σx n(x) × cluster_support(x)] - [Σx’ n(x’) × global_support(x’)] • x represents a global frequent item in docj that is also cluster frequent in Ci • x’ represents a global frequent item in docj that is not cluster frequent in Ci • n(x) is the frequency of x in the feature vector of docj • n(x’) is the frequency of x’ in the feature vector of docj

  15. Score Function (cont.) • If more than one cluster has the highest score • Choose the one with the largest number of items in its cluster label • Key idea: • A cluster Ci is good for a document docj if there are many global frequent items in docj that appear in many documents in Ci

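  16. Score Function - Example • Score med.6 against every initial cluster
      (flow, form, layer, patient, result, treatment)
      med.6 = ( 0 0 0 9 1 1 )
      Global support of patient, result, and treatment: 0.5, 0.42, 0.42

      Cluster                  Cluster frequent items                         Score(Ci ← med.6)
      C(flow)                  flow=100%, layer=100%                          -5.34
      C(form)                  form=100%                                      -5.34
      C(layer)                 flow=100%, layer=100%                          -5.34
      C(patient)               patient=100%, treatment=83%                    9.41
      C(result)                result=100%, patient=80%, treatment=80%        9.0
      C(treatment)             patient=100%, treatment=100%, result=80%       10.8
      C(flow, layer)           flow=100%, layer=100%                          -5.34
      C(patient, treatment)    patient=100%, treatment=100%, result=80%       10.8

      No item of med.6 is cluster frequent (e.g. C(flow)): 0 + 0 - [(9 × 0.5) + (1 × 0.42) + (1 × 0.42)] = -5.34
      All items of med.6 are cluster frequent (e.g. C(patient, treatment)): (9 × 1) + (1 × 1) + (1 × 0.8) = 10.8
      C(treatment) and C(patient, treatment) tie at 10.8; med.6 is assigned to C(patient, treatment), which has more items in its cluster label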
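The example can be checked with a short sketch. The score is taken as the sum of n(x) × cluster support over cluster frequent items minus the sum of n(x’) × global support over the remaining items, which is the form the worked calculations on this slide use. The global supports are counted over the 12 example documents (5/12 ≈ 0.42, 6/12 = 0.5), so the exact results differ slightly from the slide's rounded figures (about -5.33 instead of -5.34, about 9.42 instead of 9.41). `score` and `best_cluster` are illustrative names.

```python
ITEMS = ("flow", "form", "layer", "patient", "result", "treatment")

# Global supports counted over the 12 example documents.
GLOBAL_SUPPORT = {"flow": 5 / 12, "form": 5 / 12, "layer": 5 / 12,
                  "patient": 6 / 12, "result": 5 / 12, "treatment": 5 / 12}

# Cluster frequent items (with cluster supports) as listed on the slide.
CLUSTER_FREQUENT = {
    ("flow",): {"flow": 1.0, "layer": 1.0},
    ("form",): {"form": 1.0},
    ("layer",): {"flow": 1.0, "layer": 1.0},
    ("patient",): {"patient": 1.0, "treatment": 5 / 6},
    ("result",): {"result": 1.0, "patient": 0.8, "treatment": 0.8},
    ("treatment",): {"patient": 1.0, "treatment": 1.0, "result": 0.8},
    ("flow", "layer"): {"flow": 1.0, "layer": 1.0},
    ("patient", "treatment"): {"patient": 1.0, "treatment": 1.0, "result": 0.8},
}

def score(cluster_frequent, vector):
    """Reward items that are cluster frequent, penalize the other global frequent items."""
    total = 0.0
    for item, n in zip(ITEMS, vector):
        if item in cluster_frequent:
            total += n * cluster_frequent[item]
        else:
            total -= n * GLOBAL_SUPPORT[item]
    return total

def best_cluster(vector):
    """Highest score wins; ties go to the label with more items."""
    return max(CLUSTER_FREQUENT,
               key=lambda label: (round(score(CLUSTER_FREQUENT[label], vector), 9), len(label)))

med6 = (0, 0, 0, 9, 1, 1)
chosen = best_cluster(med6)
```

The tie between C(treatment) and C(patient, treatment) is resolved by the label length, so `chosen` is the two-item cluster.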

  17. Recompute the Cluster Frequent Items • When recomputing the cluster frequent items of Ci, also include the documents of all descendants of Ci • A cluster is a descendant of Ci if its cluster label is a superset of the cluster label of Ci • Example: consider C(patient), which holds only med.5 after the clusters are made disjoint
      (flow, form, layer, patient, result, treatment)
      med.5 = ( 0 1 0 4 0 0 )
      Include descendant: C(patient, treatment)
      (table: disjoint cluster result)
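A sketch of the recomputation, using the disjoint assignment that the deck's later tree example shows (med.5 in C(patient), the other medical documents in C(patient, treatment)). The helper names and the `include_descendants` flag are illustrative assumptions.

```python
ITEMS = ("flow", "form", "layer", "patient", "result", "treatment")

# Disjoint clusters for the medical documents, as shown in the deck's tree example.
DISJOINT = {
    ("patient",): [(0, 1, 0, 4, 0, 0)],                       # med.5
    ("patient", "treatment"): [(0, 0, 0, 8, 1, 2),            # med.1
                               (0, 1, 0, 4, 3, 1),            # med.2
                               (0, 0, 0, 3, 0, 2),            # med.3
                               (0, 0, 0, 6, 3, 3),            # med.4
                               (0, 0, 0, 9, 1, 1)],           # med.6
}

def descendants(label, clusters):
    """Clusters whose label is a proper superset of the given label."""
    return [other for other in clusters if set(label) < set(other)]

def recompute_cluster_frequent(label, clusters, min_cluster_support, include_descendants=True):
    vectors = list(clusters[label])
    if include_descendants:
        for other in descendants(label, clusters):
            vectors.extend(clusters[other])
    n = len(vectors)
    support = {item: sum(v[i] > 0 for v in vectors) / n for i, item in enumerate(ITEMS)}
    return {item: s for item, s in support.items() if s >= min_cluster_support}

own_only = recompute_cluster_frequent(("patient",), DISJOINT, 0.70, include_descendants=False)
with_descendants = recompute_cluster_frequent(("patient",), DISJOINT, 0.70)
```

Without the descendants, C(patient) would be described by med.5 alone (form and patient); including C(patient, treatment) restores patient and treatment as the cluster frequent items.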

  18. Building the Cluster Tree • Put the more specific clusters at the bottom of the tree • Put the more general clusters at the top of the tree

  19. Building the Cluster Tree (cont.) • Tree level • Level 0: root, marked “null”, stores unclustered documents • Level k: cluster label is a global frequent k-itemset • Bottom-up manner • Start from the cluster Ci with the largest number k of items in its cluster label • Identify all potential parents: (k-1)-clusters whose cluster label is a subset of Ci’s cluster label • Choose the “best” among potential parents

  20. Building the Cluster Tree (cont.) • The criterion for selecting the best parent • Similar to choosing the best cluster for a document • Method: (1) Merge all the documents in the subtree of Ci into a single conceptual document doc(Ci) (2) Compute the score of doc(Ci) against each potential parent Cj • The potential parent with the highest score becomes the parent of Ci • All leaf clusters that contain no document can be removed
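Parent selection can be sketched as below for the medical part of the running example. Two assumptions are made here: empty clusters are skipped as potential parents (this follows the order of steps in the deck's example, where the empty C(treatment) is removed), and the cluster frequent items of C(patient) are the recomputed ones (patient 100%, treatment 83%). `merge`, `score` and `choose_parent` are illustrative names.

```python
ITEMS = ("flow", "form", "layer", "patient", "result", "treatment")
GLOBAL_SUPPORT = {"flow": 5 / 12, "form": 5 / 12, "layer": 5 / 12,
                  "patient": 6 / 12, "result": 5 / 12, "treatment": 5 / 12}

# Disjoint clusters (feature vectors) for the medical part of the example.
CLUSTERS = {
    ("patient",): [(0, 1, 0, 4, 0, 0)],
    ("treatment",): [],
    ("patient", "treatment"): [(0, 0, 0, 8, 1, 2), (0, 1, 0, 4, 3, 1), (0, 0, 0, 3, 0, 2),
                               (0, 0, 0, 6, 3, 3), (0, 0, 0, 9, 1, 1)],
}
# Recomputed cluster frequent items of the non-empty potential parent.
CLUSTER_FREQUENT = {("patient",): {"patient": 1.0, "treatment": 5 / 6}}

def merge(vectors):
    """doc(Ci): add up the feature vectors into one conceptual document."""
    return tuple(sum(column) for column in zip(*vectors))

def score(cluster_frequent, vector):
    total = 0.0
    for item, n in zip(ITEMS, vector):
        if item in cluster_frequent:
            total += n * cluster_frequent[item]
        else:
            total -= n * GLOBAL_SUPPORT[item]
    return total

def choose_parent(label, clusters, cluster_frequent):
    """Best non-empty (k-1)-cluster whose label is a subset of `label`; None means the root."""
    k = len(label)
    candidates = [p for p in clusters
                  if len(p) == k - 1 and set(p) < set(label) and clusters[p]]
    if not candidates:
        return None
    doc = merge(clusters[label])  # Ci is a leaf here, so its subtree is just its own documents
    return max(candidates, key=lambda p: score(cluster_frequent[p], doc))

parent = choose_parent(("patient", "treatment"), CLUSTERS, CLUSTER_FREQUENT)
```

For 1-item labels there are no (k-1)-clusters, so `choose_parent` returns `None` and the cluster hangs directly under the root.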

  21. Example • Start from 2-cluster • C(flow, layer) and C(patient, treatment) • C(flow, layer) is empty → remove • C(patient, treatment) • Potential parents: C(patient) and C(treatment) • C(treatment) is empty → remove • C(patient) gets a higher score and becomes the parent of C(patient, treatment)
      Resulting cluster tree:
      null
        C(flow): cran.1 cran.2 cran.3 cran.4 cran.5
        C(form): cisi.1
        C(patient): med.5
          C(patient, treatment): med.1 med.2 med.3 med.4 med.6

  22. Prune Cluster Tree • A small minimum global support • A cluster tree can be broad and deep • Documents of the same topic are distributed over several small clusters • Poor clustering accuracy • The aim of tree pruning • Produce a natural topic hierarchy for browsing • Increase the clustering accuracy

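  23. Inter-Cluster Similarity • Inter_Sim of Ca and Cb: Inter_Sim(Ca ↔ Cb) = [Sim(Ca ← Cb) × Sim(Cb ← Ca)]^(1/2) • Reuse the score function to calculate Sim(Ci ← Cj): Sim(Ci ← Cj) = Score(Ci ← doc(Cj)) / [Σx n(x) + Σx’ n(x’)] + 1, where doc(Cj) merges all documents in the subtree of Cj into a single document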

  24. Property of Sim(Ci←Cj) • Global support and cluster support are between 0 and 1 • Maximum value of Score = Σx n(x), minimum is -Σx’ n(x’) • Normalize Score by Σx n(x) + Σx’ n(x’), so the normalized value is within -1 to 1 • To avoid negative similarity values, add the term +1 • The range of the Sim function is 0 to 2, and so is that of Inter_Sim • An Inter_Sim value below 1 • means the weight of dissimilar items has exceeded the weight of similar items • so 1 is a good threshold to distinguish two clusters

  25. Child Pruning • Objective: shorten the depth of a tree • Procedure • Scan the tree in bottom-up order • For each non-leaf node, calculate Inter_Sim between the node and each of its children • If Inter_Sim is above 1, prune the child cluster and move its documents to the parent • If a cluster is pruned, its children become the children of their grandparent • Child pruning is applied only to clusters at level 2 and below

  26. Example • Determine whether cluster C(patient, treatment) should be pruned • Compute the inter-cluster similarity between C(patient) and C(patient, treatment) • Sim(C(patient)←C(patient, treatment)) • Combine all the documents in cluster C(patient, treatment) by adding up their feature vectors
              (flow, form, layer, patient, result, treatment)
      med.1 = (   0     0     0      8       1        2     )
      med.2 = (   0     1     0      4       3        1     )
      med.3 = (   0     0     0      3       0        2     )
      med.4 = (   0     0     0      6       3        3     )
      med.6 = (   0     0     0      9       1        1     )
      Sum   = (   0     1     0     30       8        9     )

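  27. Example (cont.) • Sim(C(patient)←C(patient, treatment)) = 1.70 • Sim(C(patient, treatment)←C(patient)) = 1.92 • Inter_Sim(C(patient)↔C(patient, treatment)) = (1.70 × 1.92)^(1/2) = 1.81 • Inter_Sim is above 1, so cluster C(patient, treatment) is pruned and its documents move to C(patient)
      Cluster tree after child pruning:
      null
        C(flow): cran.1 cran.2 cran.3 cran.4 cran.5
        C(form): cisi.1
        C(patient): med.1 med.2 med.3 med.4 med.5 med.6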
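The similarity values in this example can be reproduced with a short sketch. It assumes that Sim normalizes the score by the total item frequency of the merged document and adds 1, and that Inter_Sim is the geometric mean of the two directed similarities; with these assumptions the code reproduces the 1.92 given on the slide. The merged document of C(patient) covers its whole subtree, i.e. med.5 plus the documents of C(patient, treatment).

```python
from math import sqrt

ITEMS = ("flow", "form", "layer", "patient", "result", "treatment")
GLOBAL_SUPPORT = {"flow": 5 / 12, "form": 5 / 12, "layer": 5 / 12,
                  "patient": 6 / 12, "result": 5 / 12, "treatment": 5 / 12}

def score(cluster_frequent, vector):
    total = 0.0
    for item, n in zip(ITEMS, vector):
        if item in cluster_frequent:
            total += n * cluster_frequent[item]
        else:
            total -= n * GLOBAL_SUPPORT[item]
    return total

def sim(cluster_frequent_i, doc_j):
    """Sim(Ci <- Cj): normalized score of doc(Cj) against Ci, shifted into [0, 2]."""
    return score(cluster_frequent_i, doc_j) / sum(doc_j) + 1

def inter_sim(cf_a, doc_a, cf_b, doc_b):
    """Geometric mean of the two directed similarities."""
    return sqrt(sim(cf_a, doc_b) * sim(cf_b, doc_a))

CF_PATIENT = {"patient": 1.0, "treatment": 5 / 6}
CF_PATIENT_TREATMENT = {"patient": 1.0, "treatment": 1.0, "result": 0.8}
DOC_PATIENT_TREATMENT = (0, 1, 0, 30, 8, 9)   # sum of med.1, med.2, med.3, med.4, med.6
DOC_PATIENT = (0, 2, 0, 34, 8, 9)             # med.5 plus the subtree documents

sim_parent_from_child = sim(CF_PATIENT, DOC_PATIENT_TREATMENT)
sim_child_from_parent = sim(CF_PATIENT_TREATMENT, DOC_PATIENT)
similarity = inter_sim(CF_PATIENT, DOC_PATIENT, CF_PATIENT_TREATMENT, DOC_PATIENT_TREATMENT)
prune_child = similarity > 1
```

Since `similarity` exceeds the threshold of 1, `prune_child` is true, which matches the pruning decision on the slide.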

  28. Sibling Merging • Sibling merging is applicable to level 1 • Procedure • Calculate the Inter_Sim for each pair of clusters at level 1 • Merge the cluster pair that has the highest Inter_Sim • Repeat above steps until • User-specified number of clusters is reached, or • All cluster pairs at level 1 have Inter_Sim below or equal to 1
      Cluster tree after sibling merging:
      null
        C(flow): cran.1 cran.2 cran.3 cran.4 cran.5 cisi.1
        C(patient): med.1 med.2 med.3 med.4 med.5 med.6

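  29. Experimental Evaluation • Dataset • Clustering Quality (F-measure) • Recall(Ki, Cj) = nij / |Ki|, Precision(Ki, Cj) = nij / |Cj| • Corresponding F-measure: F(Ki, Cj) = 2 × Recall(Ki, Cj) × Precision(Ki, Cj) / [Recall(Ki, Cj) + Precision(Ki, Cj)] • F-measure for whole clustering result: F(C) = Σ over Ki of (|Ki| / |D|) × max over Cj of F(Ki, Cj) • nij is the number of documents of natural class Ki in cluster Cj, and |D| is the total number of documents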
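As an illustration, the overall F-measure can be computed as the class-size-weighted average of the best F value each natural class reaches in any cluster. The sketch below applies it to the final two-cluster result of the running example; treating the document name prefixes (cran, cisi, med) as the natural classes is an assumption made here, not something the slides state.

```python
def f_measure(classes, clusters):
    """Weighted sum over natural classes of the best F value over all clusters."""
    total_docs = sum(len(members) for members in classes.values())
    overall = 0.0
    for members in classes.values():
        best = 0.0
        for docs in clusters.values():
            n_ij = len(members & docs)   # documents of this class inside this cluster
            if n_ij == 0:
                continue
            recall = n_ij / len(members)
            precision = n_ij / len(docs)
            best = max(best, 2 * recall * precision / (recall + precision))
        overall += len(members) / total_docs * best
    return overall

# Assumption: natural classes follow the document name prefixes.
CLASSES = {
    "cran": {"cran.1", "cran.2", "cran.3", "cran.4", "cran.5"},
    "cisi": {"cisi.1"},
    "med": {"med.1", "med.2", "med.3", "med.4", "med.5", "med.6"},
}
# Final clusters of the running example after sibling merging.
CLUSTERS = {
    "C(flow)": {"cran.1", "cran.2", "cran.3", "cran.4", "cran.5", "cisi.1"},
    "C(patient)": {"med.1", "med.2", "med.3", "med.4", "med.5", "med.6"},
}

overall_f = f_measure(CLASSES, CLUSTERS)
```

The single cisi document has no cluster of its own, so its class contributes a low F value and keeps the overall score below 1.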

  30. Efficiency & Scalability

  31. Conclusions & Discussion • This research exploits frequent itemsets to • Define a cluster • Use score function, construct initial clusters, make disjoint clusters • Organize the cluster hierarchy • Build cluster tree, prune cluster tree • Discussion: • The method uses unordered frequent word sets • A different order of words may convey a different meaning • Documents may cover multiple topics
