Clustering Web Search Results


Presentation Transcript


    Slide 1:Clustering Web Search Results

    Iwona Bialynicka-Birula

    Slide 2:Overview

    What is clustering?
    Applying clustering to web search results
    Clustering algorithms
    Case studies
    Related topics not covered
        Clustering
            Clustering in general
            Document clustering in general
        Other search and browsing aids
            Classification
            Visualization
            Query expansion

    Slide 3:What is clustering?

    Clustering: the act of grouping similar objects into sets
    In the web search context: organizing web pages (search results) into groups, so that different groups correspond to different user needs
    E.g. the query "engine" may refer to:
        search engine
        car part
        Engine Corp.

    Slide 4:Clustering vs. Classification

    Classification assigns objects to predefined groups
    Clustering infers groups based on the clustered objects

    Slide 5:Why cluster web search results?

    Flat ranked list not enough
        Documents pertaining to different topics cannot be compared
    Relationships between the results
        Cluster Hypothesis (van Rijsbergen 1979): "Closely related documents tend to be relevant to the same requests."
    Aids user-engine interaction
        Browsing
        Help user express his need

    Slide 6:Why not just document clustering?

    Web search results clustering is a version of document clustering, but…
        Billions of pages
        Constantly changing
        Data mainly unstructured and heterogeneous
        Additional information to consider (e.g. links, click-through data)

    Slide 7:Some requirements

    Fast
        Immediate response to query
    Flexible
        Web content changes constantly
    User-oriented
        Main goal is to aid the user in finding sought information

    Slide 8:Main issues

    Online or offline clustering?
    What to use as input
        Entire documents
        Snippets
        Structure information (links)
        Other data (e.g. click-through)
        Use stop word lists, stemming, etc.
    How to define similarity?
        Content (e.g. vector-space model)
        Link analysis
        Usage statistics
    How to group similar documents?
    How to label the groups?
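
The content-based similarity mentioned above (vector-space model) can be sketched as follows. This is a minimal illustration, not the method of any system discussed here: the tiny stop word list, the whitespace tokenizer, and the names `term_vector` and `cosine_similarity` are assumptions made for the example, and stemming is omitted.

```python
import math
from collections import Counter

# Hypothetical, deliberately tiny stop word list (illustration only).
STOP_WORDS = {"a", "an", "the", "of", "for", "and", "in", "to"}


def term_vector(text):
    """Lowercase, split on whitespace, drop stop words, count the remaining terms."""
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    return Counter(tokens)


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty snippet is similar to nothing
    return dot / (norm_a * norm_b)


snippets = [
    "search engine for the web",
    "web search engine ranking",
    "car engine repair parts",
]
vectors = [term_vector(s) for s in snippets]
# The two web-search snippets score higher with each other than with the car snippet.
print(cosine_similarity(vectors[0], vectors[1]))
print(cosine_similarity(vectors[0], vectors[2]))
```

Snippets are used as input here because they are short and available immediately; whole documents, links, or usage data would need a different representation.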

    Slide 9:Clustering algorithms

    Flat or hierarchical?
    Overlapping?
    Hard or soft?
    Incremental?
    Predefined cluster number?
    Requiring explicit similarity measure?
    Distance measure?

    Slide 10:Clustering algorithms

    Distance-based
        Hierarchical
            Agglomerative Hierarchical Clustering (AHC)
        Flat
            K-means (can be fuzzy)
            Single-pass (incremental)
    Other
        Suffix Tree Clustering (Grouper)
        Self-organizing (Kohonen) maps (neural networks)
        Latent Semantic Indexing (LSI) (reducing the dimensionality of the vector-space)

    Slide 11:Agglomerative hierarchical clustering

    Slide 12:Clustering result: dendrogram

    Slide 13:AHC variants

    Various ways of calculating cluster similarity
        Single-link (minimum)
        Complete-link (maximum)
        Group-average (average)
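
The three variants can be sketched as a naive agglomerative loop that differs only in how the distance between two clusters is computed. This is an illustrative sketch over 2-D points with Euclidean distance, not an efficient implementation; the names `LINKAGE`, `cluster_distance`, and `agglomerate` are assumptions made for the example, and the linkage is expressed over distances (single-link takes the minimum pairwise distance, complete-link the maximum).

```python
import math
from itertools import combinations

# Each linkage reduces the list of pairwise distances between two clusters to one number.
LINKAGE = {
    "single": min,                               # closest pair
    "complete": max,                             # farthest pair
    "average": lambda ds: sum(ds) / len(ds),     # mean over all pairs
}


def cluster_distance(c1, c2, linkage):
    """Distance between two clusters (lists of points) under the given linkage."""
    pairwise = [math.dist(p, q) for p in c1 for q in c2]
    return LINKAGE[linkage](pairwise)


def agglomerate(points, num_clusters, linkage="single"):
    """Start with one cluster per point and merge the closest pair until num_clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > num_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]], linkage),
        )
        clusters[i] = clusters[i] + clusters[j]  # i < j, so deleting j keeps i valid
        del clusters[j]
    return clusters


points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
for name in LINKAGE:
    print(name, agglomerate(points, 3, linkage=name))
```

Recording the sequence of merges instead of stopping at a fixed cluster count would give the dendrogram; the fixed `num_clusters` cut-off is a simplification.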

    Slide 14:K-means clustering (k=3)
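
A minimal sketch of standard k-means (alternating assignment and centroid update) with k=3, on 2-D points. The sample points, the caller-supplied initial centers, and the names `mean` and `kmeans` are assumptions made for the example; real systems would cluster document vectors and choose initial centers differently.

```python
import math


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def kmeans(points, initial_centers, max_iterations=100):
    """Return (centers, clusters); clusters[i] holds the points assigned to centers[i]."""
    centers = list(initial_centers)
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
        if new_centers == centers:
            break  # converged: assignments can no longer change
        centers = new_centers
    return centers, clusters


points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9), (1, 8), (2, 9), (1, 9)]
centers, clusters = kmeans(points, initial_centers=[(1, 1), (8, 8), (1, 8)])
for center, members in zip(centers, clusters):
    print(center, members)
```

The number of clusters is fixed in advance by the number of initial centers, which is the "predefined cluster number" property listed earlier.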

    Slide 15:Single-pass

    threshold
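
A minimal sketch of single-pass (incremental) clustering: each item is seen once and either joins the nearest existing cluster, if it lies within the threshold, or starts a new cluster. Using 2-D points, Euclidean distance, and a centroid as the cluster representative are assumptions made for the example, as is the name `single_pass`.

```python
import math


def single_pass(points, threshold):
    """Cluster points in one pass; the result depends on the order of the input."""
    clusters = []   # clusters[i] is the list of points in cluster i
    centroids = []  # centroids[i] is the current mean of clusters[i]
    for p in points:
        # Find the nearest existing centroid, if any.
        best, best_dist = None, None
        for i, c in enumerate(centroids):
            d = math.dist(p, c)
            if best_dist is None or d < best_dist:
                best, best_dist = i, d
        if best is not None and best_dist <= threshold:
            # Close enough: join that cluster and recompute its centroid.
            clusters[best].append(p)
            members = clusters[best]
            centroids[best] = tuple(sum(x) / len(members) for x in zip(*members))
        else:
            # Too far from every cluster (or no cluster yet): start a new one.
            clusters.append([p])
            centroids.append(p)
    return clusters


points = [(0, 0), (1, 0), (10, 0), (11, 0), (0.5, 0.5)]
print(single_pass(points, threshold=2))
```

No cluster count is given in advance; the threshold alone decides how many clusters emerge.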

    Slide 16:Selected systems

    Scatter/Gather
    Grouper
    Carrot2
    Vivisimo
    Mapuccino
    (Su et al. 2001)
    SHOC

    Slide 17:Scatter/Gather

    (Cutting et al. 1992)
    Designed for browsing
    Based on two novel clustering algorithms
        Buckshot: fast for online clustering
        Fractionation: accurate for offline initial clustering of the entire set

    Slide 18:Grouper

    (Zamir and Etzioni 1997, 1999)
    Online
    Operates on query result snippets
    Clusters together documents with large common subphrases
    Suffix Tree Clustering (STC)
    STC induces labeling

    Slide 19:Suffix Tree Clustering (STC)

    Linear
    Incremental
    Overlapping
    Can be extended to hierarchical

    Slide 20:STC algorithm

    Step 1: Cleaning
        Stemming
        Sentence boundary identification
        Punctuation elimination
    Step 2: Suffix tree construction
        Produces base clusters (internal nodes)
        Base clusters are scored based on size and phrase score (which depends on length and word "quality")
    Step 3: Merging base clusters
        Highly overlapping clusters are merged
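
The three steps can be approximated in a few lines. This is a simplified sketch, not the STC algorithm itself: a dictionary of word n-grams stands in for the suffix tree (so it is not linear-time), cleaning is reduced to lowercasing, the score is a hypothetical size-times-length stand-in for the real phrase score, and the 0.5 mutual-overlap merge threshold is an assumption of this example. All function names are illustrative.

```python
from collections import defaultdict


def phrases(words, max_len=3):
    """Yield every contiguous word sequence of up to max_len words."""
    for start in range(len(words)):
        for end in range(start + 1, min(start + max_len, len(words)) + 1):
            yield tuple(words[start:end])


def base_clusters(snippets, max_len=3):
    """Map each phrase shared by at least two snippets to the ids of the snippets containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(snippets):
        for phrase in phrases(text.lower().split(), max_len):
            index[phrase].add(doc_id)
    return {p: docs for p, docs in index.items() if len(docs) >= 2}


def score(phrase, docs):
    """Hypothetical score: cluster size times phrase length (ignores word "quality")."""
    return len(docs) * len(phrase)


def merge(clusters, overlap=0.5):
    """Merge base clusters whose snippet sets overlap by more than `overlap` in both directions."""
    items = list(clusters.items())
    parent = list(range(len(items)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            a, b = items[i][1], items[j][1]
            common = len(a & b)
            if common / len(a) > overlap and common / len(b) > overlap:
                parent[find(i)] = find(j)  # link the two groups of base clusters

    merged = defaultdict(lambda: (set(), set()))
    for i, (phrase, docs) in enumerate(items):
        labels, members = merged[find(i)]
        labels.add(" ".join(phrase))  # the shared phrases double as cluster labels
        members |= docs
    return list(merged.values())


snippets = [
    "cheap car engine parts",
    "car engine repair guide",
    "web search engine ranking",
    "best web search engine",
]
for labels, members in merge(base_clusters(snippets)):
    print(sorted(members), sorted(labels))
```

A snippet can end up in several merged clusters (here every snippet shares the phrase "engine"), which reflects the overlapping property, and the shared phrases serve directly as labels.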

    Slide 21:Carrot2

    (Stefanowski and Weiss 2003)
    http://www.cs.put.poznan.pl/dweiss/carrot/
    Component framework
    Allows substituting components for
        Input (e.g. snippets from other search engines)
        Filter
        Stemming
        Distance measure
        Clustering
        Output

    Slide 22:Vivísimo

    Commercial
    http://www.vivisimo.com/
    Online
    Hierarchical
    Conceptual

    Slide 23:Other

    Mapuccino (IBM)
        (Maarek et al. 2000)
        http://www.alphaworks.ibm.com/tech/mapuccino
        Relatively efficient AHC (O(n^2))
        Similarity based on vector-space model
    (Su et al. 2001)
        Only usage statistics used as input
        Recursive Density Based Clustering
    SHOC
        (Zhang and Dong 2004)
        Grouper-like
        Key phrase discovery

    Slide 24:References

    Douglass Cutting, David Karger, Jan Pedersen, and John W. Tukey. Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections, 1992. In Proceedings of the 15th Annual International ACM/SIGIR Conference, Copenhagen.
    O. Zamir and O. Etzioni. Grouper: A Dynamic Clustering Interface to Web Search Results, May 1999. In Proceedings of the Eighth International World Wide Web Conference, Toronto, Canada.
    M. Steinbach, G.
    Y.S. Maarek, R. Fagin, I.Z. Ben-Shaul, D. Pelleg. Ephemeral Document Clustering for Web Applications, 2000. Technical Report RJ 10186, IBM Research.
    Zhong Su, Qiang Yang, HongJiang Zhang, Xiaowei Xu and Yuhen Hu. Correlation-based Document Clustering using Web Logs, 2001.
    J. Stefanowski, D. Weiss. Carrot2 and Language Properties in Web Search Results Clustering, 2003. In Lecture Notes in Artificial Intelligence: Advances in Web Intelligence, Proceedings of the First International Atlantic Web Intelligence Conference, Madrid, Spain, vol. 2663, pp. 240-249.
    Dell Zhang, Yisheng Dong. Semantic, Hierarchical, Online Clustering of Web Search Results, Apr 2004. In Proceedings of the 6th Asia Pacific Web Conference (APWEB), Hangzhou, China.

    Slide 25:Thank you

    Questions? http://www.di.unipi.it/~iwona/Clustering.ppt
