
Prototype Hierarchy Based Clustering for the Categorization and Navigation of Web Collections

Zhao-Yan Ming, Kai Wang and Tat-Seng Chua, School of Computing, National University of Singapore. SIGIR 2010. Speaker: Tom Chao Zhou. 2010.10.26, Tuesday.

Presentation Transcript


  1. Prototype Hierarchy Based Clustering for the Categorization and Navigation of Web Collections Zhao-Yan Ming, Kai Wang and Tat-Seng Chua School of Computing, National University of Singapore SIGIR 2010 Speaker: Tom Chao Zhou 2010.10.26, Tuesday

  2. Outline • Motivation • Prototype Hierarchy Based Clustering • Problem Formulation and Approach • Experiments

  3. Outline • Motivation • Prototype Hierarchy Based Clustering • Problem Formulation and Approach • Experiments

  4. Motivation • Utility of user-generated content • Quality: distinguishing good-quality from bad-quality content. • Accessibility: question search. • Organizing huge collections of data for information navigation: categorization, hierarchical clustering with labels and descriptions of clusters.

  5. Categorization • Users construct fine-grained topic hierarchies and assign objects to them • Open Directory Project and Wikipedia • Disadvantage: too much manual effort. • Coarse-grained hierarchies • Yahoo! Answers’ categories. • Disadvantage: too coarse; for example, there is no “IPod” category.

  6. Categorization • Supervised techniques. Not appropriate for dynamic Web services. • Unsupervised • Clustering the collections into smaller groups. • Extracting labels for clustered groups.

  7. Prototype Hierarchy based Clustering (PHC) • Tackles the web collection categorization and navigation problem. • PHC utilizes world knowledge in the form of prototype hierarchies, while adapting to the underlying topic structures of the collections.

  8. Prototype Hierarchy based Clustering (PHC) • Advantages • Eliminates the problem of determining the number of clusters and assigning initial clusters by following the structure of the prototype hierarchy. • Results are interpretable, comprehensive, and organized. • Flexible forms of supervision: the prototype hierarchy can come in different levels of granularity.

  9. Outline • Motivation • Prototype Hierarchy Based Clustering • Problem Formulation and Approach • Experiments

  10. Prototype Hierarchy Based Clustering • Prototype Hierarchy (PH) • A hierarchy whose node set V represents a set of <l,p> tuples. p: a prototype serving as the description of concept l. • Data Hierarchy (DH) • A hierarchy that organizes a collection of objects d. Each node represents a category of objects CO.
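
The two definitions above can be pictured as a pair of small tree types. The following is a minimal sketch, not the authors' implementation: the class names, the field names, and the example labels ("Battery", "Sync") are all hypothetical and chosen only for illustration.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class PrototypeNode:
    """One PH node: a <l, p> tuple plus its child nodes."""
    label: str        # l: the concept label
    prototype: str    # p: text that describes concept l
    children: list[PrototypeNode] = field(default_factory=list)


@dataclass
class DataNode:
    """One DH node: a category of objects (CO) plus its child nodes."""
    label: str
    objects: list[str] = field(default_factory=list)
    children: list[DataNode] = field(default_factory=list)


def empty_data_hierarchy(ph: PrototypeNode) -> DataNode:
    """Build a DH that mirrors the PH structure, with no objects assigned yet."""
    return DataNode(ph.label, [], [empty_data_hierarchy(c) for c in ph.children])


# Hypothetical two-level prototype hierarchy on the topic "IPod".
ph = PrototypeNode("IPod", "portable media player", [
    PrototypeNode("Battery", "battery life and charging"),
    PrototypeNode("Sync", "syncing music with a computer"),
])
dh = empty_data_hierarchy(ph)
```

Starting the DH as a structural copy of the PH is one simple way to reflect the idea that the clustering follows the structure of the prototype hierarchy; how the DH is actually initialized is not stated in the slides.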

  11. Problem Formulation • Given a collection D of objects on a topic τ, PHC partitions and maps D into the categories that are predefined by a PH on τ, such that the formed object clusters CO1, CO2,..., COk are organized in a DH with a similar structure.

  12. Some PH nodes have no objects. • Some questions have no appropriate category to be assigned to.

  13. Requirements • The data hierarchy evolves into a compact structure encoding the underlying topics of the collection. • Data and prototype hierarchies are matched at both the node and the relation level. • Distances between objects are measured by appropriate metrics.

  14. Outline • Motivation • Prototype Hierarchy Based Clustering • Problem Formulation and Approach • Experiments

  15. Problem Formulation and Approach • Hierarchy Metric and Information Function • A hierarchy metric is a function that operates on all nodes. • h: V×V → R+, for an adjacent pair of nodes. • The quality of the structure is measured by the amount of information carried in H.

  16. Minimum Evolution • Minimum Evolution (obj1) • Intuition: the DH that compactly “encodes” the collection into topic categories is the best. • Monitor the structural evolution of the data hierarchy. • The optimal DH on a collection is the one that contains the least information.

  17. Matching of Prototype Data Hierarchy • Data Hierarchy Centroid • Centroids of DH nodes are generated in an incremental manner. • A new object in a leaf node automatically becomes a member of its ancestor nodes. • The magnitude of the change decreases with the number of levels from the leaf node.

  18. Prototype Centrality • Prototype centrality (obj2) • Intuition: add a data object to the node for which the updated centroids are most similar to their corresponding prototypes. • A prototype is located at the center of an object cluster.
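
The incremental centroid update described above might look like the sketch below. The slides do not give the update rule, so the running-mean step and the geometric `decay` factor are assumptions; `add_object` and the dict layout of a node are hypothetical names used only for illustration.

```python
def add_object(path, vector, decay=0.5):
    """Add `vector` to a leaf and propagate the change to its ancestors.

    `path` lists the nodes from the leaf (level 0) up to the root. Each node
    is a dict with a 'centroid' (list of floats) and a 'count' (int).
    Assumption: the update is a running mean whose step is damped by
    decay**level, so higher ancestors move less.
    """
    for level, node in enumerate(path):
        node["count"] += 1
        step = (decay ** level) / node["count"]
        node["centroid"] = [
            c + step * (x - c) for c, x in zip(node["centroid"], vector)
        ]


# A leaf and its parent, both empty before the first object arrives.
leaf = {"centroid": [0.0, 0.0], "count": 0}
parent = {"centroid": [0.0, 0.0], "count": 0}
add_object([leaf, parent], [1.0, 0.0])
```

With `decay=0.5`, the leaf centroid moves fully onto the first object while the parent centroid moves only halfway, which illustrates the decreasing magnitude of change with distance from the leaf.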
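
As a rough illustration of the prototype centrality intuition, the sketch below tentatively adds an object to each candidate node and keeps the node whose updated centroid is closest to its prototype. It is a simplification under stated assumptions: cosine similarity over plain vectors stands in for whatever similarity the authors use, only the candidate node itself is scored (not its ancestors), and `best_node` and the example vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_node(obj, nodes):
    """Pick the node whose centroid, after adding `obj`, best matches its prototype.

    `nodes` maps a node name to (prototype_vector, centroid_vector, object_count).
    """
    best_name, best_score = None, -1.0
    for name, (prototype, centroid, count) in nodes.items():
        # Centroid the node would have if `obj` were added to it.
        updated = [(c * count + x) / (count + 1) for c, x in zip(centroid, obj)]
        score = cosine(updated, prototype)
        if score > best_score:
            best_name, best_score = name, score
    return best_name


# Hypothetical two-dimensional example.
nodes = {
    "battery": ([1.0, 0.0], [1.0, 0.0], 2),
    "sync": ([0.0, 1.0], [0.0, 1.0], 2),
}
obj = [0.9, 0.1]
chosen = best_node(obj, nodes)
```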

  19. Prototype-Data Hierarchy Resemblance • Matching between two hierarchies H1, H2 • Full match: V1 = V2 and R1 = R2. • Partial match • Common hierarchy: matched nodes and relations. • Incomplete match: V1 + Vin = V2, R1 + Rin = R2 • Excess match: V1 = V2 + Vin, R1 = R2 + Rin
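
The match types above can be checked with plain set operations. The sketch below is illustrative only: it assumes nodes are identified by label and relations are (parent, child) pairs, and the function names `match_type` and `common_hierarchy` are hypothetical.

```python
def match_type(v1, r1, v2, r2):
    """Classify how hierarchy H1 = (V1, R1) matches H2 = (V2, R2)."""
    v1, r1, v2, r2 = (frozenset(s) for s in (v1, r1, v2, r2))
    if v1 == v2 and r1 == r2:
        return "full"
    if v1 <= v2 and r1 <= r2:
        return "incomplete"   # V1 + Vin = V2, R1 + Rin = R2
    if v2 <= v1 and r2 <= r1:
        return "excess"       # V1 = V2 + Vin, R1 = R2 + Rin
    return "partial"          # only a common sub-hierarchy is shared


def common_hierarchy(v1, r1, v2, r2):
    """Return the matched nodes and relations shared by both hierarchies."""
    return set(v1) & set(v2), set(r1) & set(r2)


# Hypothetical example: the second hierarchy has one extra node and relation.
small_v, small_r = {"a", "b"}, {("a", "b")}
large_v, large_r = {"a", "b", "c"}, {("a", "b"), ("a", "c")}
```

For instance, `match_type(small_v, small_r, large_v, large_r)` reports an incomplete match, and swapping the arguments reports an excess match.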

  20. Prototype-Data Hierarchy Resemblance • Prototype-Data Hierarchy Resemblance (obj3) • Common part of the data hierarchy and the prototype hierarchy.

  21. Partially Matched Prototype Hierarchy • PH is an incomplete match of DH • Adding dummy child nodes to the existing nodes in PH. • Employ label extraction algorithms. • PH is an excess match of DH • Empty nodes will be removed.
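
The two repair steps above can be sketched on a nested-dict hierarchy. This is an assumption-laden illustration: a node is treated as "empty" when neither it nor any descendant holds objects, the dummy child gets a placeholder label (the label extraction step is not shown), and `prune_empty` and `add_dummy_child` are hypothetical helper names.

```python
def prune_empty(node):
    """Return a copy of `node` without object-free subtrees, or None if all empty."""
    kept = [p for p in (prune_empty(c) for c in node["children"]) if p is not None]
    if not node["objects"] and not kept:
        return None
    return {"label": node["label"], "objects": list(node["objects"]), "children": kept}


def add_dummy_child(node, objects, label="(unlabeled)"):
    """Attach a new child holding objects that fit no existing child of `node`."""
    child = {"label": label, "objects": list(objects), "children": []}
    node["children"].append(child)
    return child


# Hypothetical hierarchy: "Sync" received no objects and should disappear.
root = {"label": "IPod", "objects": [], "children": [
    {"label": "Battery", "objects": ["q1"], "children": []},
    {"label": "Sync", "objects": [], "children": []},
]}
pruned = prune_empty(root)
add_dummy_child(pruned, ["q2", "q3"])
```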

  22. Object Metric • M(di,dj) is defined as the similarity between a pair of objects di and dj within a node. • Translation-based Language Model (semantic). • Syntactic Tree Kernel Matching (syntactic).

  23. Category Cohesiveness • Category Cohesiveness (obj4) • Objects in the same category are similar to each other. • Objects in different categories are dissimilar to each other.
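
One simple way to quantify the cohesiveness idea is the mean within-category similarity minus the mean cross-category similarity. The sketch below is not the scoring function from the paper: the difference-of-means form is an assumption, and Jaccard word overlap is a stand-in for the object metric M, used only so that the example runs without external models.

```python
from itertools import combinations


def jaccard(a, b):
    """Word-overlap similarity; a stand-in for the object metric M(di, dj)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def mean(values):
    return sum(values) / len(values) if values else 0.0


def cohesiveness(categories, sim=jaccard):
    """Mean intra-category similarity minus mean inter-category similarity.

    `categories` maps a category label to the list of object texts it holds.
    """
    items = [(label, obj) for label, objs in categories.items() for obj in objs]
    intra, inter = [], []
    for (label_a, a), (label_b, b) in combinations(items, 2):
        (intra if label_a == label_b else inter).append(sim(a, b))
    return mean(intra) - mean(inter)


# Hypothetical questions, grouped sensibly and then deliberately mixed up.
good = {
    "battery": ["battery drains fast", "battery drains overnight"],
    "sync": ["sync fails on itunes", "sync fails on windows"],
}
bad = {
    "battery": ["battery drains fast", "sync fails on itunes"],
    "sync": ["battery drains overnight", "sync fails on windows"],
}
```

Under this score the sensible grouping `good` rates higher than the mixed grouping `bad`, which matches both bullets: similar objects together, dissimilar objects apart.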

  24. Multi-Criterion Optimization Function • Minimum evolution. • Prototype centrality. • Prototype-Data Hierarchy Resemblance. • Category cohesiveness.
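
The slide lists the four objectives but not how they are combined. Purely as an illustration, the sketch below assumes a weighted sum in which the minimum evolution term enters negatively (less information is better) and the other three terms enter positively. The function names, the equal default weights, and the example scores are all hypothetical.

```python
def combined_score(scores, weights=None):
    """Weighted sum of the four objective values for one candidate placement.

    `scores` has the keys 'evolution', 'centrality', 'resemblance' and
    'cohesiveness'. Assumption: 'evolution' is an amount of information to
    be minimized, so it is subtracted; the other terms are added.
    """
    weights = weights or {key: 1.0 for key in scores}
    signs = {"evolution": -1.0}
    return sum(weights[k] * signs.get(k, 1.0) * v for k, v in scores.items())


def pick_best(candidates):
    """Return the candidate node name with the highest combined score."""
    return max(candidates, key=lambda name: combined_score(candidates[name]))


# Hypothetical objective values for placing one object in two candidate nodes.
candidates = {
    "battery": {"evolution": 0.2, "centrality": 0.9,
                "resemblance": 0.8, "cohesiveness": 0.7},
    "sync": {"evolution": 0.5, "centrality": 0.4,
             "resemblance": 0.8, "cohesiveness": 0.3},
}
winner = pick_best(candidates)
```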

  25. Outline • Motivation • Prototype Hierarchy Based Clustering • Problem Formulation and Approach • Experiments

  26. Datasets • Hierarchy • Dental: Wikipedia • IPod: manually constructed by combining Wikipedia articles, WordNet, and product specifications. • Dataset diversity • CS: deep hierarchy. Hierarchies are noisy. • RS: broad hierarchy, abstract domain. Hierarchies are noisy. • IPod: concrete domain. • Dental: hierarchy is well constructed.

  27. Experimental Setting • proKmeans • Prototype hierarchy enhanced K-means divisive hierarchical clustering. • LiveClassifier • PHC • CFC Classifier • Supervised text categorization technique.

  28. Given a prototype hierarchy for a collection, even a simple method can categorize the collection reasonably well. • PHC is superior in terms of utilizing the prototype hierarchy. • PHC is comparable with the supervised method. • PHC introduces new nodes into the predefined hierarchy. • PHC works better in concrete domains than in abstract domains.

  29. Ablation Study on Optimization Objectives • Prototype Centrality (obj2) • Category Cohesiveness (obj4) • Prototype-Data Hierarchy Resemblance (obj3) • Minimum Evolution (obj1) • The data hierarchy varies less from the prototype hierarchy without minimum evolution (new nodes are created with minimum evolution). • The minimum evolution objective leads to a self-contained data hierarchy.

  30. Robustness with Mismatched Prototype Hierarchy • PHC is robust against overfitted prototype hierarchies. • PHC has only limited ability to create categories.
