1. Clustering Web Search Results Iwona Bialynicka-Birula
2. Overview
What is clustering?
Applying clustering to web search results
Clustering algorithms
Case studies
Related topics not covered
Clustering
Clustering in general
Document clustering in general
Other search and browsing aids
Classification
Visualization
Query expansion
3. What is clustering?
Clustering: the act of grouping similar objects into sets
In the web search context: organizing web pages (search results) into groups, so that different groups correspond to different user needs
e.g. for the query "engine": search engine, car part, Engine Corp.
4. Clustering vs. Classification
Classification assigns objects to predefined groups
Clustering infers groups based on clustered objects
5. Why cluster web search results?
Flat ranked list not enough
Documents pertaining to different topics cannot be compared
Relationships between the results
Cluster Hypothesis (van Rijsbergen 1979): "Closely related documents tend to be relevant to the same requests."
Aids user-engine interaction
Browsing
Helps the user express their need
6. Why not just document clustering?
Web search results clustering is a version of document clustering, but…
Billions of pages
Constantly changing
Data mainly unstructured and heterogeneous
Additional information to consider (e.g. links, click-through data)
7. Some requirements
Fast
Immediate response to query
Flexible
Web content changes constantly
User-oriented
Main goal is to aid the user in finding sought information
8. Main issues
Online or offline clustering?
What to use as input
Entire documents
Snippets
Structure information (links)
Other data (e.g. click-through)
Use stop word lists, stemming, etc.
How to define similarity?
Content (e.g. vector-space model)
Link analysis
Usage statistics
How to group similar documents?
How to label the groups?
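One way to make "content similarity" concrete is cosine similarity over term-frequency vectors built from snippets. The sketch below is only an illustration under stated assumptions: the tiny stop word list, the tokenizer, and the sample snippets are invented here, and no stemming or TF-IDF weighting is applied.

```python
import math
import re
from collections import Counter

# Illustrative stop word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the", "to"}

def tokenize(text):
    """Lowercase, keep alphabetic runs, drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def cosine_similarity(text_a, text_b):
    """Cosine of the angle between raw term-frequency vectors."""
    a, b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical snippets for the ambiguous query "engine".
snippets = [
    "Search engine optimization tips for the web",
    "Web search engine ranking and optimization",
    "Car engine repair and spare parts",
]

print(cosine_similarity(snippets[0], snippets[1]))  # the two search-engine snippets
print(cosine_similarity(snippets[0], snippets[2]))  # search engine vs. car engine
```

The two search-engine snippets share four of their five terms, while the car snippet shares only the word "engine", so the first pair scores higher.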
9. Clustering algorithms
Flat or hierarchical?
Overlapping?
Hard or soft?
Incremental?
Predefined cluster number?
Requiring explicit similarity measure? Distance measure?
10. Clustering algorithms
Distance-based
Hierarchical
Agglomerative Hierarchical Clustering (AHC)
Flat
K-means (can be fuzzy)
Single-pass (incremental)
Other
Suffix Tree Clustering (Grouper)
Self-organizing (Kohonen) maps (neural networks)
Latent Semantic Indexing (LSI) (reducing the dimensionality of the vector-space)
11. Agglomerative hierarchical clustering
12. Clustering result: dendrogram
13. AHC variants
Various ways of calculating cluster similarity
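A minimal sketch of agglomerative hierarchical clustering with three common ways of measuring the distance between clusters (single, complete, and average link). It assumes 2-D points and Euclidean distance for brevity; a document system would plug in a document similarity instead. The function names are illustrative, and the naive pair search is far slower than optimized implementations.

```python
import math

# Linkage variants: how to reduce all pairwise distances to one number.
LINKAGES = {
    "single": min,                            # closest pair of members
    "complete": max,                          # farthest pair of members
    "average": lambda ds: sum(ds) / len(ds),  # mean over all pairs
}

def cluster_distance(a, b, linkage):
    """Distance between clusters a and b under the chosen linkage."""
    distances = [math.dist(p, q) for p in a for q in b]
    return LINKAGES[linkage](distances)

def ahc(points, k, linkage="single"):
    """Start with singleton clusters and merge the closest pair until k remain.

    Returns the final clusters and the merge history; the history is the
    information a dendrogram draws.
    """
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > k:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: cluster_distance(
            clusters[ij[0]], clusters[ij[1]], linkage))
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters, merges

points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
clusters, merges = ahc(points, k=3, linkage="complete")
print(clusters)
```

Running with k=1 and keeping `merges` yields the full hierarchy; cutting it at different levels gives different numbers of clusters.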
14. K-means clustering (k=3)
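A minimal sketch of hard k-means with k=3, assuming 2-D points and Euclidean distance. Taking the first k points as initial centroids is a simplifying assumption made here so the example is deterministic; practical implementations usually pick seeds more carefully.

```python
import math

def kmeans(points, k, iterations=100):
    """Alternate between assigning points to the nearest centroid and
    recomputing each centroid as the mean of its members."""
    centroids = list(points[:k])  # naive seeding (assumption for this sketch)
    assignment = None
    for _ in range(iterations):
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:  # converged: nothing moved
            break
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return assignment, centroids

# Three small groups; the first three points happen to lie in different groups.
points = [(0, 0), (10, 10), (20, 0),
          (0, 1), (1, 0), (10, 11), (11, 10), (20, 1), (21, 0)]
assignment, centroids = kmeans(points, k=3)
print(assignment)
print(centroids)
```

Unlike AHC, the number of clusters must be chosen in advance, and the result depends on the initial centroids.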
15. Single-pass
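A minimal sketch of single-pass (incremental) clustering, assuming 2-D points, Euclidean distance, and a fixed distance threshold. Each item is seen once: it joins the nearest existing cluster if that cluster's centroid is close enough, otherwise it starts a new cluster. The threshold value and the dictionary layout are illustrative choices.

```python
import math

def single_pass(points, threshold):
    """Cluster points in one pass; the result depends on the input order."""
    clusters = []  # each cluster: {"centroid": tuple, "members": list}
    for p in points:
        best = min(clusters, key=lambda c: math.dist(p, c["centroid"]),
                   default=None)
        if best is not None and math.dist(p, best["centroid"]) <= threshold:
            best["members"].append(p)
            n = len(best["members"])
            # Recompute the centroid as the mean of all members.
            best["centroid"] = tuple(sum(x) / n for x in zip(*best["members"]))
        else:
            clusters.append({"centroid": p, "members": [p]})
    return clusters

points = [(0, 0), (0, 1), (10, 10), (1, 0), (10, 11)]
for cluster in single_pass(points, threshold=3):
    print(cluster["members"])
```

No cluster count is fixed in advance, which suits results that arrive incrementally, at the cost of sensitivity to arrival order and to the threshold.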
16. Selected systems
Scatter/Gather
Grouper
Carrot2
Vivisimo
Mapuccino
(Su et al. 2001)
SHOC
17. Scatter/Gather (Cutting et al. 1992)
Designed for browsing
Based on two novel clustering algorithms
Buckshot – fast for online clustering
Fractionation – accurate for offline initial clustering of the entire set
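A rough sketch of the Buckshot idea: cluster only a random sample of about sqrt(k·n) items with a slower hierarchical method, then use the resulting centroids as seeds and assign every item to its nearest seed. The 2-D points, Euclidean distance, average-link merging, and all function names are assumptions made for a compact example; Scatter/Gather itself works on documents, and refinement passes are omitted.

```python
import math
import random

def centroid(points):
    return tuple(sum(x) / len(points) for x in zip(*points))

def average_link_ahc(points, k):
    """Naive agglomerative clustering with average-link distance."""
    clusters = [[p] for p in points]

    def avg_dist(a, b):
        return sum(math.dist(p, q) for p in a for q in b) / (len(a) * len(b))

    while len(clusters) > k:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: avg_dist(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

def buckshot(points, k, seed=0):
    """Seed k centers from a sqrt(k*n)-sized sample, then assign all points."""
    rng = random.Random(seed)
    sample_size = min(len(points), max(k, math.ceil(math.sqrt(k * len(points)))))
    sample = rng.sample(points, sample_size)
    centers = [centroid(c) for c in average_link_ahc(sample, k)]
    groups = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
        groups[nearest].append(p)
    return groups

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]
for group in buckshot(points, k=3):
    print(group)
```

Because the expensive quadratic step runs only on the sample, the overall cost grows roughly linearly in k·n, which is what makes the approach usable online.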
18. Grouper (Zamir and Etzioni 1997, 1999)
Online
Operates on query result snippets
Clusters together documents with large common subphrases
Suffix Tree Clustering (STC)
STC induces labeling
19. Suffix Tree Clustering (STC)
Linear
Incremental
Overlapping
Can be extended to hierarchical
20. STC algorithm
Step 1: Cleaning
Stemming
Sentence boundary identification
Punctuation elimination
Step 2: Suffix tree construction
Produces base clusters (internal nodes)
Base clusters are scored based on size and phrase score (which depends on length and word "quality")
Step 3: Merging base clusters
Highly overlapping clusters are merged
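The steps above can be sketched as follows. This is a simplification, not the Grouper implementation: a plain n-gram index stands in for the suffix tree (same base clusters for short phrases, but not linear time), stemming is skipped, the score is just cluster size times phrase length, and the 0.5 overlap threshold is an assumption of this sketch.

```python
import re
from collections import defaultdict

def phrases(snippet, max_len=3):
    """All word n-grams of length 1..max_len, lowercased, punctuation removed."""
    words = re.findall(r"[a-z]+", snippet.lower())
    return {tuple(words[i:i + n])
            for n in range(1, max_len + 1)
            for i in range(len(words) - n + 1)}

def base_clusters(snippets, min_docs=2):
    """Map each phrase to the set of snippets containing it."""
    index = defaultdict(set)
    for doc_id, snippet in enumerate(snippets):
        for phrase in phrases(snippet):
            index[phrase].add(doc_id)
    return {p: docs for p, docs in index.items() if len(docs) >= min_docs}

def score(phrase, docs):
    """Simplified score: more documents and longer phrases rank higher."""
    return len(docs) * len(phrase)

def merge(bases, threshold=0.5):
    """Join base clusters whose document sets overlap heavily in both directions."""
    items = list(bases.items())
    parent = list(range(len(items)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            a, b = items[i][1], items[j][1]
            common = len(a & b)
            if common / len(a) > threshold and common / len(b) > threshold:
                parent[find(i)] = find(j)
    merged = defaultdict(lambda: (set(), []))
    for i, (phrase, docs) in enumerate(items):
        group = merged[find(i)]
        group[0].update(docs)
        group[1].append(" ".join(phrase))  # shared phrases double as labels
    return list(merged.values())

snippets = ["jaguar car dealer", "jaguar car price",
            "jaguar cat habitat", "big cat habitat"]
bases = base_clusters(snippets)
ranked = sorted(bases.items(), key=lambda kv: score(*kv), reverse=True)
for docs, labels in merge(bases):
    print(sorted(docs), labels)
```

Snippet 2 ("jaguar cat habitat") ends up in both final clusters, which illustrates the overlapping property, and the shared phrases serve directly as cluster labels.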
21. Carrot2 (Stefanowski and Weiss 2003)
http://www.cs.put.poznan.pl/dweiss/carrot/
Component framework
Allows substituting components for
Input (e.g. snippets from other search engines)
Filter
Stemming
Distance measure
Clustering
Output
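The component idea can be illustrated with a chain of interchangeable stages. This is a hypothetical sketch only: none of the names below come from Carrot2, and its real components communicate differently. Every stage here is an invented stand-in (fixed input, lowercase filter, crude stemmer, trivial grouping).

```python
def run_pipeline(data, stages):
    """Feed the output of each stage into the next one."""
    for stage in stages:
        data = stage(data)
    return data

def fixed_input(_query):
    # Stand-in for an input component that would fetch snippets for a query.
    return ["Jaguar cars", "Jaguar cats", "Used cars"]

def lowercase_filter(snippets):
    return [s.lower() for s in snippets]

def naive_stemmer(snippets):
    # Crude plural stripping; a real stemmer is far more careful.
    return [" ".join(w[:-1] if w.endswith("s") else w for w in s.split())
            for s in snippets]

def group_by_last_word(snippets):
    # Toy "clustering" stage: group snippets by their final word.
    clusters = {}
    for s in snippets:
        clusters.setdefault(s.split()[-1], []).append(s)
    return clusters

pipeline = [fixed_input, lowercase_filter, naive_stemmer, group_by_last_word]
result = run_pipeline("jaguar", pipeline)
print(result)
```

Swapping `naive_stemmer` or `group_by_last_word` for another callable with the same input and output shape changes the behaviour without touching the rest of the chain, which is the point of a component framework.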
22. Vivísimo
Commercial
http://www.vivisimo.com/
Online
Hierarchical
Conceptual
23. Other
Mapuccino (IBM)
(Maarek et al. 2000)
http://www.alphaworks.ibm.com/tech/mapuccino
Relatively efficient AHC (O(n^2))
Similarity based on vector-space model
(Su et al. 2001)
Only usage statistics used as input
Recursive Density Based Clustering
SHOC
(Zhang and Dong 2004)
Grouper-like
Key phrase discovery
24. References
Douglass Cutting, David Karger, Jan Pedersen, and John W. Tukey. Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections. In Proceedings of the 15th Annual International ACM/SIGIR Conference, Copenhagen, 1992.
O. Zamir and O. Etzioni. Grouper: A Dynamic Clustering Interface to Web Search Results. In Proceedings of the Eighth International World Wide Web Conference, Toronto, Canada, May 1999.
M. Steinbach, G.
Y.S. Maarek, R. Fagin, I.Z. Ben-Shaul, D. Pelleg. Ephemeral Document Clustering for Web Applications. Technical Report RJ 10186, IBM Research, 2000.
Zhong Su, Qiang Yang, HongJiang Zhang, Xiaowei Xu, and Yuhen Hu. Correlation-based Document Clustering using Web Logs. 2001.
J. Stefanowski, D. Weiss. Carrot2 and Language Properties in Web Search Results Clustering. In Lecture Notes in Artificial Intelligence: Advances in Web Intelligence, Proceedings of the First International Atlantic Web Intelligence Conference, Madrid, Spain, vol. 2663, pp. 240-249, 2003.
Dell Zhang, Yisheng Dong. Semantic, Hierarchical, Online Clustering of Web Search Results. In Proceedings of the 6th Asia Pacific Web Conference (APWEB), Hangzhou, China, Apr 2004.
25. Thank you
Questions?
http://www.di.unipi.it/~iwona/Clustering.ppt