Slide 1:Clustering Web Search Results
Iwona Bialynicka-Birula
Slide 2:Overview
- What is clustering?
- Applying clustering to web search results
- Clustering algorithms
- Case studies
Related topics not covered:
- Clustering
  - Clustering in general
  - Document clustering in general
- Other search and browsing aids
  - Classification
  - Visualization
  - Query expansion
Slide 3:What is clustering?
Clustering: the act of grouping similar objects into sets.
In the web search context: organizing web pages (search results) into groups, so that different groups correspond to different user needs. For example, results for the query "engine" may cover: search engine, car part, Engine Corp.
Slide 4:Clustering vs. Classification
- Classification assigns objects to predefined groups
- Clustering infers groups from the objects being clustered
Slide 5:Why cluster web search results?
- Flat ranked list not enough
  - Documents pertaining to different topics cannot be compared
  - Does not show relationships between the results
- Cluster Hypothesis (van Rijsbergen 1979): "Closely related documents tend to be relevant to the same requests."
- Aids user-engine interaction
  - Browsing
  - Helps the user express their need
Slide 6:Why not just document clustering?
Web search results clustering is a version of document clustering, but...
- Billions of pages
- Constantly changing
- Data mainly unstructured and heterogeneous
- Additional information to consider (e.g. links, click-through data)
Slide 7:Some requirements
- Fast: immediate response to query
- Flexible: web content changes constantly
- User-oriented: main goal is to aid the user in finding sought information
Slide 8:Main issues
- Online or offline clustering?
- What to use as input
  - Entire documents
  - Snippets
  - Structure information (links)
  - Other data (e.g. click-through)
  - Use stop word lists, stemming, etc.
- How to define similarity?
  - Content (e.g. vector-space model)
  - Link analysis
  - Usage statistics
- How to group similar documents?
- How to label the groups?
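As an illustration of content-based similarity in the vector-space model, the sketch below compares snippets by the cosine of their term-frequency vectors. It is a minimal example under stated assumptions: the function names `term_vector` and `cosine_similarity`, the whitespace tokenization, and the sample snippets are all hypothetical, and no stop word removal or stemming is applied.

```python
import math
from collections import Counter

def term_vector(text):
    """Turn a snippet into a bag-of-words term-frequency vector."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty snippet is treated as similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical snippets for the ambiguous query "engine".
snippets = [
    "search engine ranking algorithms",
    "web search engine results",
    "car engine repair parts",
]
vectors = [term_vector(s) for s in snippets]
print(cosine_similarity(vectors[0], vectors[1]))  # two "search engine" snippets
print(cosine_similarity(vectors[0], vectors[2]))  # "search engine" vs. "car engine"
```

The two search-engine snippets share more terms, so they score higher than the search-engine and car-engine pair; a real system would typically add term weighting such as TF-IDF.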
Slide 9:Clustering algorithms
- Flat or hierarchical?
- Overlapping?
- Hard or soft?
- Incremental?
- Predefined cluster number?
- Requiring explicit similarity measure? Distance measure?
Slide 10:Clustering algorithms
- Distance-based
  - Hierarchical
    - Agglomerative Hierarchical Clustering (AHC)
  - Flat
    - K-means (can be fuzzy)
    - Single-pass (incremental)
- Other
  - Suffix Tree Clustering (Grouper)
  - Self-organizing (Kohonen) maps (neural networks)
  - Latent Semantic Indexing (LSI) (reducing the dimensionality of the vector space)
Slide 11:Agglomerative hierarchical clustering
Slide 12:Clustering result: dendrogram
Slide 13:AHC variants
Various ways of calculating cluster similarity:
- Single-link (minimum pairwise distance)
- Complete-link (maximum pairwise distance)
- Group-average (average pairwise distance)
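The AHC variants can be sketched as one merge loop with a pluggable linkage rule. This is a naive illustration, not an efficient implementation: the names `cluster_similarity` and `agglomerative` and the toy similarity matrix are assumptions. Because it works on similarities rather than distances, single-link takes the maximum pairwise similarity (the minimum distance) and complete-link the minimum.

```python
def cluster_similarity(c1, c2, sim, linkage):
    """Similarity between two clusters of item indices under a linkage rule."""
    scores = [sim[i][j] for i in c1 for j in c2]
    if linkage == "single":    # closest pair: max similarity = min distance
        return max(scores)
    if linkage == "complete":  # farthest pair: min similarity = max distance
        return min(scores)
    if linkage == "average":   # group-average over all cross pairs
        return sum(scores) / len(scores)
    raise ValueError(f"unknown linkage: {linkage}")

def agglomerative(sim, num_clusters, linkage="average"):
    """Start with singletons and merge the most similar pair until
    num_clusters remain. sim is a symmetric similarity matrix."""
    clusters = [[i] for i in range(len(sim))]
    while len(clusters) > num_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = cluster_similarity(clusters[a], clusters[b], sim, linkage)
                if best is None or s > best[0]:
                    best = (s, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Hypothetical pairwise similarities for four documents.
sim = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.3, 0.1],
    [0.1, 0.3, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
for rule in ("single", "complete", "average"):
    print(rule, agglomerative(sim, 2, rule))
```

Recording the sequence of merges instead of stopping at a fixed number of clusters would yield the dendrogram.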
Slide 14:K-means clustering (k=3)
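A minimal sketch of k-means with k=3 follows. It assumes documents have already been mapped to numeric vectors (2-D points here, purely for illustration) and that the initial centroids are supplied by the caller; the names `kmeans` and `squared_distance` and the sample points are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(p, q))

def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its assigned points."""
    groups = []
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda c: squared_distance(p, centroids[c]))
            groups[nearest].append(p)
        # An empty group keeps its previous centroid.
        centroids = [
            tuple(sum(dim) / len(g) for dim in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
    return centroids, groups

# Three visually separate groups of hypothetical 2-D points.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9), (1, 8), (2, 9), (1, 9)]
centroids, groups = kmeans(points, [points[0], points[3], points[6]])
print(groups)
```

The number of clusters must be fixed in advance, and the result depends on the initial centroids, which is why seeding strategies matter in practice.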
Slide 15:Single-pass
threshold
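Single-pass clustering can be sketched as follows: each document is compared with the existing clusters once, joins the most similar one if the similarity reaches the threshold, and otherwise starts a new cluster. The sketch makes several assumptions that are not from the slides: Jaccard word overlap as the similarity measure, the first member of a cluster as its representative, and the names `jaccard` and `single_pass`.

```python
def jaccard(a, b):
    """Word-set overlap between two snippets, between 0 and 1."""
    a, b = set(a.split()), set(b.split())
    return len(a & b) / len(a | b) if a | b else 0.0

def single_pass(docs, threshold):
    """Cluster docs in one pass over the input, in arrival order."""
    clusters = []  # each cluster is a list; its first doc is the representative
    for doc in docs:
        best, best_sim = None, 0.0
        for cluster in clusters:
            s = jaccard(doc, cluster[0])
            if s > best_sim:
                best, best_sim = cluster, s
        if best is not None and best_sim >= threshold:
            best.append(doc)
        else:
            clusters.append([doc])
    return clusters

docs = [
    "search engine ranking",
    "search engine results",
    "car engine parts",
    "car engine repair",
]
print(single_pass(docs, threshold=0.4))
```

The outcome depends on both the threshold and the order in which documents arrive, which is the price paid for being incremental.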
Slide 16:Selected systems
- Scatter/Gather
- Grouper
- Carrot2
- Vivisimo
- Mapuccino
- (Su et al. 2001)
- SHOC
Slide 17:Scatter/Gather
(Cutting et al. 1992)
- Designed for browsing
- Based on two novel clustering algorithms
  - Buckshot: fast, for online clustering
  - Fractionation: accurate, for offline initial clustering of the entire set
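The idea behind Buckshot can be sketched roughly as: cluster only a small random sample (about sqrt(k*n) items) with a slower agglomerative method, then use the resulting k centers to assign every item. The code below is a simplified illustration under assumptions, not the published algorithm: it works on 2-D points, merges sample groups by centroid distance rather than group-average similarity, and the names `buckshot`, `centroid`, `sq_dist` and the `seed` parameter are hypothetical.

```python
import math
import random

def centroid(points):
    """Mean of a non-empty list of points."""
    return tuple(sum(dim) / len(points) for dim in zip(*points))

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(p, q))

def buckshot(points, k, seed=0):
    """Find k centers from a sqrt(k*n)-sized sample, then assign all points."""
    rng = random.Random(seed)
    size = min(len(points), max(k, math.ceil(math.sqrt(k * len(points)))))
    groups = [[p] for p in rng.sample(points, size)]
    # Agglomerate the sample: merge the two groups with the closest centroids.
    while len(groups) > k:
        a, b = min(
            ((i, j) for i in range(len(groups)) for j in range(i + 1, len(groups))),
            key=lambda ij: sq_dist(centroid(groups[ij[0]]), centroid(groups[ij[1]])),
        )
        groups[a] += groups[b]
        del groups[b]
    centers = [centroid(g) for g in groups]
    # Single assignment pass over the full collection.
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))
        clusters[nearest].append(p)
    return clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9), (1, 8), (2, 9), (1, 9)]
clusters = buckshot(points, k=3)
print(clusters)
```

Because the quadratic agglomerative step runs on only about sqrt(k*n) items, its cost stays roughly proportional to k*n, which is what makes the approach suitable for online use.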
Slide 18:Grouper
(Zamir and Etzioni 1997, 1999)
- Online
- Operates on query result snippets
- Clusters together documents with large common subphrases
- Suffix Tree Clustering (STC)
- STC induces labeling
Slide 19:Suffix Tree Clustering (STC)
- Linear
- Incremental
- Overlapping
- Can be extended to hierarchical
Slide 20:STC algorithm
Step 1: Cleaning
- Stemming
- Sentence boundary identification
- Punctuation elimination
Step 2: Suffix tree construction
- Produces base clusters (internal nodes)
- Base clusters are scored based on size and phrase score (which depends on length and word "quality")
Step 3: Merging base clusters
- Highly overlapping clusters are merged
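Steps 2 and 3 can be approximated without an actual suffix tree. The sketch below swaps the suffix tree for a plain dictionary of word n-grams, so it is not linear-time, and it omits cleaning and base cluster scoring. The names `base_clusters` and `merge_clusters`, the n-gram length limit, and the 0.5 overlap threshold are assumptions made for illustration.

```python
from collections import defaultdict

def base_clusters(snippets, max_len=3, min_docs=2):
    """Map each phrase (word n-gram) shared by at least min_docs snippets
    to the set of snippet indices containing it."""
    phrase_docs = defaultdict(set)
    for doc_id, text in enumerate(snippets):
        words = text.lower().split()
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                phrase_docs[" ".join(words[i:i + n])].add(doc_id)
    return {p: d for p, d in phrase_docs.items() if len(d) >= min_docs}

def merge_clusters(clusters, overlap=0.5):
    """Merge base clusters whose document sets overlap heavily in both
    directions; merging is transitive (connected components)."""
    phrases = list(clusters)
    parent = {p: p for p in phrases}

    def find(p):
        while parent[p] != p:
            p = parent[p]
        return p

    for i, a in enumerate(phrases):
        for b in phrases[i + 1:]:
            common = len(clusters[a] & clusters[b])
            if (common / len(clusters[a]) > overlap
                    and common / len(clusters[b]) > overlap):
                parent[find(a)] = find(b)

    merged = {}
    for p in phrases:
        docs, labels = merged.setdefault(find(p), (set(), []))
        docs.update(clusters[p])
        labels.append(p)  # shared phrases double as cluster labels
    return list(merged.values())

snippets = [
    "jaguar car review",
    "jaguar car dealer",
    "jaguar cat habitat",
    "big cat habitat",
]
for docs, labels in merge_clusters(base_clusters(snippets)):
    print(sorted(docs), labels)
```

In this example the third snippet lands in both resulting clusters, which illustrates the overlapping property, and the shared phrases show how the method induces labels.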
Slide 21:Carrot2
(Stefanowski and Weiss 2003)
http://www.cs.put.poznan.pl/dweiss/carrot/
- Component framework
- Allows substituting components for:
  - Input (e.g. snippets from other search engines)
  - Filter
  - Stemming
  - Distance measure
  - Clustering
  - Output
Slide 22:Vivísimo
http://www.vivisimo.com/
- Commercial
- Online
- Hierarchical
- Conceptual
Slide 23:Other
- Mapuccino (IBM) (Maarek et al. 2000)
  - http://www.alphaworks.ibm.com/tech/mapuccino
  - Relatively efficient AHC (O(n^2))
  - Similarity based on vector-space model
- (Su et al. 2001)
  - Only usage statistics used as input
  - Recursive Density Based Clustering
- SHOC (Zhang and Dong 2004)
  - Grouper-like
  - Key phrase discovery
Slide 24:References
- Douglass Cutting, David Karger, Jan Pedersen, and John W. Tukey. Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections, 1992. In Proceedings of the 15th Annual International ACM/SIGIR Conference, Copenhagen.
- O. Zamir and O. Etzioni. Grouper: A Dynamic Clustering Interface to Web Search Results, May 1999. In Proceedings of the Eighth International World Wide Web Conference, Toronto, Canada.
- M. Steinbach, G.
- Y.S. Maarek, R. Fagin, I.Z. Ben-Shaul, D. Pelleg. Ephemeral Document Clustering for Web Applications, 2000. Technical Report RJ 10186, IBM Research.
- Zhong Su, Qiang Yang, HongJiang Zhang, Xiaowei Xu, and Yuhen Hu. Correlation-based Document Clustering using Web Logs, 2001.
- J. Stefanowski, D. Weiss. Carrot2 and Language Properties in Web Search Results Clustering, 2003. In: Lecture Notes in Artificial Intelligence: Advances in Web Intelligence, Proceedings of the First International Atlantic Web Intelligence Conference, Madrid, Spain, vol. 2663, pp. 240-249.
- Dell Zhang, Yisheng Dong. Semantic, Hierarchical, Online Clustering of Web Search Results, Apr 2004. In Proceedings of the 6th Asia Pacific Web Conference (APWEB), Hangzhou, China.
Slide 25:Thank you
Questions? http://www.di.unipi.it/~iwona/Clustering.ppt