Prototype Hierarchy Based Clustering for the Categorization and Navigation of Web Collections Zhao-Yan Ming, Kai Wang and Tat-Seng Chua School of Computing, National University of Singapore SIGIR 2010 Speaker: Tom Chao Zhou 2010.10.26, Tuesday
Outline • Motivation • Prototype Hierarchy Based Clustering • Problem Formulation and Approach • Experiments
Motivation • Utility of user-generated content • Quality: distinguishing good-quality content from bad. • Accessibility: • Question search. • Organizing huge collections of data for information navigation: categorization, and hierarchical clustering with labels and descriptions of clusters.
Categorization • Users construct fine-grained topic hierarchies and assign objects to them • Open Directory Project and Wikipedia • Disadvantage: too much manual effort. • Coarse-grained hierarchies • Yahoo! Answers’ categories. • Disadvantage: too coarse; for example, there is no “IPod” category.
Categorization • Supervised techniques: not appropriate for dynamic Web services. • Unsupervised techniques • Clustering the collections into smaller groups. • Extracting labels for the clustered groups.
Prototype Hierarchy based Clustering (PHC) • Tackles the web collection categorization and navigation problem. • PHC utilizes world knowledge in the form of prototype hierarchies while adapting to the underlying topic structures of the collections.
Prototype Hierarchy based Clustering (PHC) • Advantages • Eliminates the problems of determining the number of clusters and assigning initial clusters by following the structure of the prototype hierarchy. • Results are interpretable, comprehensive, and organized. • Flexible forms of supervision: the prototype hierarchy can come at different levels of granularity.
Outline • Motivation • Prototype Hierarchy Based Clustering • Problem Formulation and Approach • Experiments
Prototype Hierarchy Based Clustering • Prototype Hierarchy (PH) • A hierarchy whose node set V represents a set of <l,p> tuples, where p is a prototype serving as the description of concept l. • Data Hierarchy (DH) • A hierarchy that organizes a collection of objects d. Each node represents a category of objects CO.
Problem Formulation • Given a collection D of objects on a topic τ, PHC partitions D and maps it into the categories predefined by a PH on τ, such that the resulting object clusters CO1, CO2, ..., COk are organized in a DH with a similar structure.
Some PH nodes have no objects. • Some questions have no appropriate category to be assigned to.
Requirements • The data hierarchy evolves into a compact structure encoding the underlying topics of the collection. • The data and prototype hierarchies are matched at both the node and relation levels. • Distances between objects are measured by appropriate metrics.
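As a rough illustration of the two definitions above, the PH and DH can be sketched as two small tree types in Python. The class names `PrototypeNode` and `DataNode`, the helper `mirror`, and the iPod example labels and prototype strings are assumptions made for this sketch, not names from the paper.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class PrototypeNode:
    """One PH node: a <l, p> tuple plus links to child nodes."""
    label: str        # l: the concept label
    prototype: str    # p: text describing the concept
    children: List["PrototypeNode"] = field(default_factory=list)

@dataclass
class DataNode:
    """One DH node: a category of objects CO."""
    label: str
    objects: List[str] = field(default_factory=list)
    children: List["DataNode"] = field(default_factory=list)

def mirror(ph: PrototypeNode) -> DataNode:
    """Build an empty DH that copies the structure of the PH."""
    return DataNode(ph.label, [], [mirror(child) for child in ph.children])

# Hypothetical two-level prototype hierarchy for an iPod collection.
ph = PrototypeNode("iPod", "portable media player", [
    PrototypeNode("Battery", "battery life charging power"),
    PrototypeNode("Screen", "display screen resolution"),
])
dh = mirror(ph)
```

Starting the DH as an empty copy of the PH is one simple way to read "a DH with a similar structure"; objects would then be assigned to its nodes one at a time.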
Outline • Motivation • Prototype Hierarchy Based Clustering • Problem Formulation and Approach • Experiments
Problem Formulation and Approach • Hierarchy Metric and Information Function • A hierarchy metric is a function that operates on all nodes. • h: V×V → R+, for an adjacent pair of nodes. • The quality of the structure is measured by the amount of information carried in H.
Minimum Evolution • Minimum Evolution (obj1) • Intuition: the DH that compactly “encodes” the collection into topic categories is the best. • Monitor the structural evolution of the data hierarchy. • The optimal DH on a collection is the one that contains the least information.
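The slides do not give the information function itself. The following is a minimal sketch, assuming the information in a hierarchy is the sum of the hierarchy metric h over adjacent (parent, child) pairs and that h is the Euclidean distance between node centroids. Both choices, along with the names `information` and `least_information` and the toy centroids, are assumptions for illustration only.

```python
import math

def information(edges, centroids):
    """Sum an assumed hierarchy metric h over all adjacent node pairs.

    edges: iterable of (parent, child) labels; centroids: label -> vector.
    h is taken to be Euclidean distance, which is an assumption.
    """
    return sum(math.dist(centroids[p], centroids[c]) for p, c in edges)

def least_information(candidates, centroids):
    """Return the name of the candidate DH that carries the least information."""
    return min(candidates, key=lambda name: information(candidates[name], centroids))

centroids = {"iPod": [0.0, 0.0], "Battery": [3.0, 4.0], "New": [0.0, 1.0]}
candidates = {
    "keep": [("iPod", "Battery")],                    # object joins an existing node
    "grow": [("iPod", "Battery"), ("iPod", "New")],   # object spawns a new node
}
best = least_information(candidates, centroids)
```

In this toy example the candidate that adds a node carries more information, so the more compact hierarchy is preferred.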
Matching of Prototype and Data Hierarchies • Data Hierarchy Centroid • Centroids of DH nodes are generated incrementally. • A new object in a leaf node automatically becomes a member of its ancestor nodes. • The magnitude of the centroid change decreases with the number of levels above the leaf node.
Prototype Centrality • Prototype centrality (obj2) • Intuition: add a data object to a node such that the updated centroids are most similar to their corresponding prototypes. • A prototype is located at the center of an object cluster.
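One way to picture the incremental update is a minimal sketch that assumes each centroid is the running mean of the object vectors under that node (the slides do not state the exact update rule). Because ancestors hold more objects than the leaf, the same new object moves their centroids less. The dictionary layout and the name `add_object` are hypothetical.

```python
def add_object(path, vec):
    """Add one object vector to every node on the leaf-to-root path.

    Each node is a dict with 'centroid' (list of floats) and 'count'.
    Returns the size of each centroid shift, leaf first.
    """
    shifts = []
    for node in path:
        n = node["count"]
        old = node["centroid"]
        # Running-mean update: fold the new vector into the existing mean.
        new = [(c * n + x) / (n + 1) for c, x in zip(old, vec)]
        shifts.append(sum((a - b) ** 2 for a, b in zip(new, old)) ** 0.5)
        node["centroid"], node["count"] = new, n + 1
    return shifts

leaf = {"centroid": [1.0, 0.0], "count": 1}
root = {"centroid": [0.0, 1.0], "count": 3}
shifts = add_object([leaf, root], [1.0, 1.0])
```

Here the leaf centroid shifts by 0.5 and the root centroid by 0.25, so the change shrinks with distance from the leaf.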
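The intuition above can be sketched as a selection rule: try the object in each candidate node, recompute that node's centroid, and keep the node whose updated centroid is closest to its prototype. This sketch assumes vector representations, a running-mean centroid, and cosine similarity, and it scores only the candidate node rather than its whole ancestor path; none of these details come from the slides. The names `cosine` and `best_node` are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def best_node(nodes, vec):
    """Pick the node whose centroid, after adding vec, best matches its prototype."""
    def score(node):
        n = node["count"]
        updated = [(c * n + x) / (n + 1) for c, x in zip(node["centroid"], vec)]
        return cosine(updated, node["prototype"])
    return max(nodes, key=lambda name: score(nodes[name]))

# Toy two-dimensional vectors; axis 0 ~ "battery" terms, axis 1 ~ "screen" terms.
nodes = {
    "Battery": {"centroid": [1.0, 0.0], "count": 1, "prototype": [1.0, 0.0]},
    "Screen": {"centroid": [0.0, 1.0], "count": 1, "prototype": [0.0, 1.0]},
}
choice = best_node(nodes, [0.9, 0.1])
```

An object dominated by battery terms is placed in the "Battery" node, since that placement keeps the centroid nearest its prototype.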
Prototype-Data Hierarchy Resemblance • Matching between two hierarchies H1, H2 • Full match: V1=V2 and R1=R2. • Partial match • Common hierarchy: matched nodes and relations. • Incomplete match: V1+Vin=V2, R1+Rin=R2 • Excess match: V1=V2+Vin, R1=R2+Rin
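The match types above translate directly into set comparisons. The sketch below assumes each hierarchy is a pair (V, R) of a node-label set and a set of (parent, child) relations, and reads "+" as set union with the extra nodes Vin and relations Rin. The function name `match_type` and the example hierarchies are hypothetical.

```python
def match_type(h1, h2):
    """Classify how hierarchy h1 matches h2 and return their common hierarchy.

    Each hierarchy is (V, R): a set of node labels and a set of
    (parent, child) relations.
    """
    (v1, r1), (v2, r2) = h1, h2
    common = (v1 & v2, r1 & r2)        # matched nodes and relations
    if v1 == v2 and r1 == r2:
        kind = "full"
    elif v1 <= v2 and r1 <= r2:
        kind = "incomplete"            # h1 lacks nodes/relations found in h2
    elif v1 >= v2 and r1 >= r2:
        kind = "excess"                # h1 has nodes/relations h2 lacks
    else:
        kind = "partial"               # overlap only
    return kind, common

ph = ({"iPod", "Battery"}, {("iPod", "Battery")})
dh = ({"iPod", "Battery", "Screen"}, {("iPod", "Battery"), ("iPod", "Screen")})
kind, common = match_type(ph, dh)
```

With the PH as H1 and the DH as H2, this example is an incomplete match: the data contain a "Screen" category that the prototype hierarchy does not list.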
Prototype-Data Hierarchy Resemblance • Prototype-Data Hierarchy Resemblance (obj3) • Common part of the data hierarchy and the prototype hierarchy.
Partially Matched Prototype Hierarchy • PH is an incomplete match of DH • Add dummy child nodes to the existing nodes in PH. • Employ label extraction algorithms. • PH is an excess match of DH • Empty nodes are removed.
Object Metric • M(di,dj) is defined as the similarity between a pair of objects di and dj within a node. • Translation-based Language Model: semantic similarity. • Syntactic Tree Kernel Matching: syntactic similarity.
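Both repairs can be sketched on a simple dictionary tree. This is an illustrative sketch only: the node layout (`label`, `objects`, `children`) and the function names `add_dummy_child` and `prune_empty` are assumptions, objects are stored only at the node they were assigned to, and the label extraction step is left out (the dummy label stays `None`).

```python
def add_dummy_child(node):
    """Attach a placeholder child; its label is left for a label extractor."""
    dummy = {"label": None, "objects": [], "children": []}
    node["children"].append(dummy)
    return dummy

def prune_empty(node):
    """Drop subtrees that hold no objects; return None if the node itself is empty."""
    kept = [c for c in map(prune_empty, node["children"]) if c is not None]
    node["children"] = kept
    return node if node["objects"] or kept else None

tree = {"label": "iPod", "objects": [], "children": [
    {"label": "Battery", "objects": ["battery drains fast"], "children": []},
    {"label": "Screen", "objects": [], "children": []},
]}
tree = prune_empty(tree)           # the empty "Screen" node is removed
dummy = add_dummy_child(tree)      # room for objects that fit no existing child
```

An internal node with no objects of its own is kept as long as at least one of its descendants holds objects.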
Category Cohesiveness • Category Cohesiveness (obj4) • Objects in the same category are similar to each other. • Objects in different categories are dissimilar to each other.
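The two bullets above suggest a score that rewards similarity within categories and penalizes similarity across them. The slides do not give the formula, so the sketch below assumes one simple form: mean within-category similarity minus mean cross-category similarity. Word-overlap (Jaccard) similarity stands in for the object metric M; it is a placeholder, not the translation-based language model or tree kernel named in the slides. The names `jaccard`, `mean`, and `cohesiveness` are hypothetical.

```python
from itertools import combinations

def jaccard(a, b):
    """Word-overlap similarity; a stand-in for the object metric M."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def mean(values):
    return sum(values) / len(values) if values else 0.0

def cohesiveness(categories, metric=jaccard):
    """Mean within-category similarity minus mean cross-category similarity."""
    intra = [metric(a, b)
             for objs in categories.values()
             for a, b in combinations(objs, 2)]
    inter = [metric(a, b)
             for xs, ys in combinations(categories.values(), 2)
             for a in xs for b in ys]
    return mean(intra) - mean(inter)

good = {"battery": ["battery drains fast", "battery drains overnight"],
        "screen": ["screen is cracked", "screen is dim"]}
mixed = {"a": ["battery drains fast", "screen is cracked"],
         "b": ["battery drains overnight", "screen is dim"]}
```

On these toy questions the topic-consistent grouping `good` scores higher than the `mixed` grouping, which is the behavior the objective is meant to reward.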
Multi-Criterion Optimization Function • Minimum evolution. • Prototype centrality. • Prototype-Data Hierarchy Resemblance. • Category cohesiveness.
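The slide lists the four objectives without showing how they are combined. A common way to handle several criteria is a weighted sum, and the sketch below assumes that form purely for illustration; the paper's actual optimization function may differ. Minimum evolution is something to minimize, so it enters here with a negative sign. All names, weights, and scores are made up.

```python
def combined_score(scores, weights):
    """Weighted sum of per-objective scores (an assumed scalarization)."""
    return sum(weights[name] * scores[name] for name in weights)

def choose(candidates, weights):
    """Return the candidate assignment with the highest combined score."""
    return max(candidates, key=lambda c: combined_score(candidates[c], weights))

weights = {"min_evolution": 1.0, "centrality": 1.0,
           "resemblance": 1.0, "cohesiveness": 1.0}

# Hypothetical scores for placing one object in an existing node vs. a new node.
candidates = {
    "Battery": {"min_evolution": -0.1, "centrality": 0.9,
                "resemblance": 1.0, "cohesiveness": 0.5},
    "new node": {"min_evolution": -0.6, "centrality": 0.4,
                 "resemblance": 0.5, "cohesiveness": 0.7},
}
winner = choose(candidates, weights)
```

Changing the weights shifts the balance between following the prototype hierarchy and adapting to the data.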
Outline • Motivation • Prototype Hierarchy Based Clustering • Problem Formulation and Approach • Experiments
Datasets • Hierarchy • Dental: Wikipedia • IPod: manually constructed by combining Wikipedia article, WordNet, and product spec. • Dataset diversity • CS: deep hierarchy. Hierarchies are noisy. • RS: broad hierarchy, abstract domain. Hierarchies are noisy. • IPod: concrete domain. • Dental: hierarchy is well constructed.
Experimental Setting • proKmeans • Prototype hierarchy enhanced K-means divisive hierarchical clustering. • LiveClassifier • PHC • CFC Classifier • Supervised text categorization technique.
Given a prototype hierarchy for a collection, even a simple method can categorize the collection reasonably well. • PHC is superior in terms of utilizing the prototype hierarchy. • PHC is comparable with the supervised method. • PHC introduces new nodes into the predefined hierarchy. • PHC works better in concrete domains than in abstract domains.
Ablation Study on Optimization Objectives • Prototype Centrality (obj2) • Category Cohesiveness (obj4) • Prototype-Data Hierarchy Resemblance (obj3) • Minimum Evolution (obj1) • Without minimum evolution, the data hierarchy varies less from the prototype hierarchy (new nodes are created with minimum evolution). • The minimum evolution objective leads to a self-contained data hierarchy.
Robustness with Mismatched Prototype Hierarchy • PHC is robust against overfitted prototype hierarchies. • PHC has only limited ability to create categories.