
A Personalized Search Engine Based on Web Snippet Hierarchical Clustering



Presentation Transcript


  1. A Personalized Search Engine Based on Web Snippet Hierarchical Clustering
     • Paolo Ferragina, Antonio Gulli
     • Presented by Bin Tan

  2. Clustering Web Search Results
     • Challenges:
       • Works on short snippets instead of whole documents
       • Clustering must be done on the fly
       • Clusters should be labeled with meaningful text (accurate and intelligible)
       • Clusters need to be distinctive
     • Vivisimo
     • SNAKET

  3. Categorization of Works
     • Flat clustering vs. hierarchical clustering
     • Label representation: bag of words vs. contiguous phrase vs. non-contiguous phrase (“gapped sentence”)

  4. Preprocessing
     • Fetch snippets from 16 search engines
     • Enrich snippets with anchor texts from a crawled database of 200M web pages

  5. Identification of Candidate Phrases for Labels
     • Enumerate all pairs of words within a certain proximity window (of size 4) in snippets
     • Score them based on:
       • NLP features: part of speech (PoS), named entities (NE)
       • ODP (Open Directory Project) occurrences: term frequency (collection frequency × inverse category frequency?), containing category
     • Discard low-score pairs
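The pair enumeration on this slide can be sketched in a few lines of Python. This is a minimal sketch, not SNAKET's implementation: the function name `candidate_pairs`, the whitespace tokenizer, and the `min_score` cutoff are hypothetical, and the score used here (the number of snippets containing the pair) is only a stand-in for the NLP and ODP features the slide lists. It also assumes that "window of size 4" means both words fall inside a span of four consecutive tokens.

```python
from collections import Counter

WINDOW = 4  # proximity window size given on the slide


def candidate_pairs(snippets, window=WINDOW, min_score=2):
    """Return ordered word pairs that co-occur within `window` tokens.

    Stand-in score (an assumption): number of snippets containing the pair.
    """
    scores = Counter()
    for text in snippets:
        tokens = text.lower().split()  # naive tokenizer, for illustration only
        seen = set()
        for i, first in enumerate(tokens):
            # Pair each token with the tokens that follow it inside the window.
            for second in tokens[i + 1 : i + window]:
                if first != second:
                    seen.add((first, second))
        scores.update(seen)  # count each pair at most once per snippet
    # Discard low-score pairs.
    return {pair: s for pair, s in scores.items() if s >= min_score}
```

Pairs keep the word order seen in the snippet, which matters later when pairs are chained into longer ordered phrases.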

  6. Identification of Candidate Phrases for Labels (cont.)
     • Word pairs are atomic phrases (how about single words?)
     • Incrementally merge word pairs into longer phrases (preserve ordering and limit size)
     • Score phrases based on their constituents' scores
     • Discard low-score phrases
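One way to picture the incremental merge is to chain pairs that overlap on a word. The sketch below is illustrative only: `merge_phrases`, `max_len`, and `min_score` are hypothetical names, the input is assumed to be a dict mapping ordered word pairs to scores, and averaging the two constituent scores is an assumed scoring rule, since the slide does not say how constituent scores are combined.

```python
def merge_phrases(pair_scores, max_len=4, min_score=2.0):
    """Grow ordered phrases by chaining pairs that overlap on one word."""
    phrases = dict(pair_scores)   # the atomic phrases are the word pairs
    frontier = dict(pair_scores)  # phrases created in the previous round
    while frontier:
        grown = {}
        for phrase, score in frontier.items():
            if len(phrase) >= max_len:  # limit phrase size
                continue
            for (a, b), pair_score in pair_scores.items():
                # Extend only when the pair continues the phrase,
                # which preserves word order.
                if phrase[-1] == a and b not in phrase:
                    merged = phrase + (b,)
                    # Assumed rule: mean of the two constituents' scores.
                    merged_score = (score + pair_score) / 2
                    if merged_score >= min_score:  # discard low-score phrases
                        grown[merged] = max(merged_score, grown.get(merged, 0.0))
        phrases.update(grown)
        frontier = grown
    return phrases
```

Each round lengthens phrases by one word, so the loop stops after at most `max_len - 2` rounds.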

  7. Hierarchical Clustering
     • Group all snippets containing a candidate phrase into an atomic cluster (overlap allowed)
     • Primary label: the aforementioned candidate phrase
     • Secondary labels: other candidate phrases occurring in 80% of the snippets in the cluster
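The atomic-cluster step can be sketched as follows. This is a hedged illustration rather than the authors' code: `contains_phrase` and `atomic_clusters` are hypothetical helpers, phrases are assumed to be tuples of words, a phrase is assumed to match when its words appear in order with gaps allowed, and "80% of the snippets" is read as "at least 80%".

```python
def contains_phrase(tokens, phrase):
    """True if the words of `phrase` appear in `tokens` in order (gaps allowed)."""
    it = iter(tokens)
    return all(word in it for word in phrase)


def atomic_clusters(snippets, phrases, threshold=0.8):
    """Build one (possibly overlapping) cluster per candidate phrase."""
    tokenized = [s.lower().split() for s in snippets]
    # Snippet indices that contain each candidate phrase.
    members = {
        p: {i for i, toks in enumerate(tokenized) if contains_phrase(toks, p)}
        for p in phrases
    }
    clusters = []
    for primary, ids in members.items():
        if not ids:
            continue
        # Secondary labels: other phrases found in >= threshold of the cluster.
        secondary = [
            q for q, q_ids in members.items()
            if q != primary and len(ids & q_ids) / len(ids) >= threshold
        ]
        clusters.append({"primary": primary, "secondary": secondary, "snippets": ids})
    return clusters
```

Because membership is computed per phrase, a snippet can belong to several clusters at once, which matches the slide's note that overlap is allowed.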

  8. Hierarchical Clustering (cont.)
     • Merge atomic clusters into candidate second-level clusters if they share primary/secondary labels
       • Primary label: the shared label
       • Secondary labels: other labels occurring in 80% of the snippets in the cluster
     • Prune second-level clusters that have similar coverage or similar labels
     • Recursively produce third-level clusters
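A rough sketch of the second-level merge is shown below. Everything named here is an assumption for illustration: atomic clusters are taken to be dicts with `primary`, `secondary`, and `snippets` keys, `second_level`, `min_children`, and `max_overlap` are hypothetical, and Jaccard similarity over snippet sets stands in for the "similar coverage" test, which the slide does not define. Secondary labels for the parent clusters and pruning by label similarity are left out for brevity.

```python
from collections import defaultdict


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def second_level(atomic, min_children=2, max_overlap=0.9):
    """Group atomic clusters under every label they share, then prune."""
    by_label = defaultdict(list)
    for idx, cluster in enumerate(atomic):
        for label in [cluster["primary"], *cluster["secondary"]]:
            by_label[label].append(idx)

    parents = []
    for label, children in by_label.items():
        if len(children) < min_children:
            continue  # a label held by a single cluster is not "shared"
        covered = set().union(*(atomic[i]["snippets"] for i in children))
        parents.append({"primary": label, "children": children, "snippets": covered})

    # Prune: visit larger clusters first and drop any whose coverage is
    # nearly identical to a cluster that was already kept.
    kept = []
    for parent in sorted(parents, key=lambda p: len(p["snippets"]), reverse=True):
        if all(jaccard(parent["snippets"], k["snippets"]) < max_overlap for k in kept):
            kept.append(parent)
    return kept
```

Third-level clusters could be produced by feeding the returned parents back through the same kind of grouping, in line with the recursive step on the slide.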

  9. How SNAKET Can Be Used
     • Hierarchical browsing for knowledge extraction
     • Hierarchical browsing for result selection
     • Query reformulation
     • Personalized ranking (?)

  10. Evaluation

  11. Evaluation (cont.)

  12. Clustering technology: PageRank of the future?
     • Pros:
       • Ambiguous query: narrow down the result list
       • Less-ambiguous query: get a bird’s-eye view of different aspects
     • Cons:
       • Clustering is slow and often unnecessary
       • It takes time to look at the clusters
       • Cluster and label quality still leaves much to be desired
