
Matching DOM Trees to Search Logs for Accurate Webpage Clustering



Presentation Transcript


  1. Matching DOM Trees to Search Logs for Accurate Webpage Clustering. Deepayan Chakrabarti and Rupesh Mehta

  2. Data extraction • Webpages from a site → website-specific wrappers → structured DB of tuples such as (product_name, price, rating)

  3. Data Extraction • Building wrappers [Muslea+/98, Crescenzi+/01, Cohen+/02, Hogue+/05, Irmak+/06] • Cluster pages from the website based on similarity of DOM structure • Pick a few example pages per cluster • Manually annotate the DOM nodes which contain the data • Automatic wrapper induction using these annotations

  4. Data Extraction • Clustering affects quality • Too few clusters: • Heterogeneity of clusters • Imperfect wrappers, or even inability to build wrappers • Too many clusters: • Significant editorial effort required to build wrappers • We want to automatically get a good clustering, for any website

  5. Main Idea • “Useful” info on a page: wrappers extract it, and users search for it • Search terms match page content, followed by a click (e.g., a query matching text under the paths “html h1” and “html h1 b”) • DOM paths repeatedly referenced by search terms are “key” paths (here, “html h1” and “html h1 b”)

  6. Main Idea • Clustering using key paths • Pre-processing step (for each site) • Given a large sample of pages and search logs • Identify key paths • Run-time (for that website) • Given a new webpage • Find which key paths exist on the page • Map page to cluster using its key paths

  7. Mapping pages to clusters • Pages in a cluster should have similar tree structure • and hence, similar paths • Represent a page by a shingle of its paths [Buttler/04] • Using key paths: • Shingle preferentially picks key paths in the page • Requires a global ranking of key paths
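The path representation above starts from the root-to-node tag sequences of a page's DOM tree. A minimal sketch of collecting such paths, using only Python's standard `html.parser` (the class name `PathCollector` and the sample page are illustrative, not from the paper):

```python
from html.parser import HTMLParser

class PathCollector(HTMLParser):
    """Collect root-to-node tag paths (e.g. 'html body h1') from a page."""

    def __init__(self):
        super().__init__()
        self.stack = []      # open tags from the root down to the current node
        self.paths = set()   # every distinct root-to-node path seen

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        self.paths.add(" ".join(self.stack))

    def handle_endtag(self, tag):
        # Pop up to and including the matching open tag.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

collector = PathCollector()
collector.feed("<html><body><h1><b>Widget</b></h1><span>$9.99</span></body></html>")
```

After feeding a page, `collector.paths` contains paths like `"html body h1 b"`, which are the units that shingles and key-path ranking operate on.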

  8. Mapping pages to clusters • One cluster per shingle: all pages in a cluster share the same k “key” paths

  9. Main Idea • Clustering using key paths • Pre-processing step (for each site) • Given a large sample of pages and search logs • Identify key paths • Run-time (for that website) • Given a new webpage • Find which key paths exist on the page • Map page to cluster using its key paths

  10. Identify key paths • For every (query, webpage) pair, match query terms to the text of a DOM path • This yields precision and recall for every path • Aggregate over all queries and webpages into an expected precision and recall for each path • High if the path appears on many queried pages, and has high precision/recall in most of them
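One plausible per-pair formulation of the precision/recall match described above (the function name and token-level matching are illustrative assumptions; the paper aggregates these values over all (query, page) pairs into expected precision/recall):

```python
def path_precision_recall(query_terms, path_text):
    """Match a query against the text under one DOM path.

    Illustrative sketch: precision = how much of the path's text the
    query explains; recall = how much of the query the text covers.
    """
    q = set(t.lower() for t in query_terms)
    t = set(path_text.lower().split())
    if not q or not t:
        return 0.0, 0.0
    hits = q & t
    return len(hits) / len(t), len(hits) / len(q)

precision, recall = path_precision_recall(["html", "h1"], "html h1 tutorial")
```

Here two of the three text tokens match the query (precision 2/3), and every query term is covered (recall 1.0).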

  11. Identify key paths • How can we combine expected precision and recall into one ranking of key paths? • F-measure, but • Precision typically more important than recall • Precision and recall may be in completely different scales • This scaling factor varies among websites

  12. Identify key paths • How can we combine expected precision and recall into one ranking of key paths? • Borda method [Borda/1781] • Create two rankings of paths, one by precision and one by recall • Combine rankings into one ranking, using relative importance of precision to recall • Immune to varying scales of precision/recall values among websites
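A minimal sketch of the Borda-style combination described above: rank paths separately by precision and by recall, then sum the rank positions with a weight expressing the relative importance of precision (the weight value and function name are illustrative, not the paper's exact parameters):

```python
def borda_combine(precision, recall, w_precision=2.0):
    """Combine two per-path score dicts into one ranking via Borda counts.

    Ranks (positions, not raw scores) are summed, so the method is
    immune to precision and recall living on different scales.
    """
    paths = list(precision)
    rank_p = {p: i for i, p in enumerate(sorted(paths, key=lambda x: -precision[x]))}
    rank_r = {p: i for i, p in enumerate(sorted(paths, key=lambda x: -recall[x]))}
    # Lower combined rank = better; precision ranks weighted more heavily.
    return sorted(paths, key=lambda p: w_precision * rank_p[p] + rank_r[p])

ranking = borda_combine({"a": 0.9, "b": 0.5, "c": 0.7},
                        {"a": 0.1, "b": 0.9, "c": 0.8})
```

Because only rank positions enter the sum, a site whose recall values are uniformly tiny still contributes comparable information to the final ordering.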

  13. Main Idea • Clustering using key paths • Pre-processing step (for each site) • Given a large sample of pages and search logs • Identify key paths, but • Key paths can be dependent • Run-time (for that website) • Given a new webpage • Find which key paths exist on the page • Map page to cluster using its key paths

  14. Handling dependent paths • Consider the following two paths: • html body div div table tr td h1 span (“product name”) • html body div div table tr td h1 • If one is a key path, probably the other is too • The shingle can get “swamped” • The shingle of a page becomes (product_name, product_name_parent, product_name_ancestor) • instead of (product_name, buy_button, rating)

  15. Handling dependent paths • Several sources of dependence • Multiple paths may have similar content • “product name” header and its parent • product name mentioned in a header and in the text • Multiple paths may always co-occur • “product name” header and “price”

  16. Handling dependent paths • Identify key independent paths • Build a graph of dependencies between paths • Pick an independent set of paths, i.e., a set of paths where no one is connected to another • Computation is weighted strongly towards the top-ranked paths • Under our weighting scheme, greedily picking an independent set is optimal
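The greedy selection above can be sketched as follows (a minimal illustration; the dependency edges and path names are hypothetical, and the paper's weighting scheme is what makes the greedy choice optimal):

```python
def greedy_independent_paths(ranked_paths, edges):
    """Pick top-ranked paths that are pairwise non-adjacent in the
    dependency graph, scanning in rank order (best first)."""
    adj = {p: set() for p in ranked_paths}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    chosen = []
    for p in ranked_paths:
        # Keep p only if it depends on nothing already chosen.
        if not adj[p] & set(chosen):
            chosen.append(p)
    return chosen

picked = greedy_independent_paths(
    ["html body h1 span", "html body h1", "html body td b"],
    edges=[("html body h1 span", "html body h1")])
```

Here the "product name" span and its parent are dependent, so only the higher-ranked of the two survives alongside the unrelated path.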

  17. Main Idea • Clustering using key paths • Pre-processing step (for each site) • Given a large sample of pages and search logs • Identify key paths • Run-time (for that website) • Given a new webpage • Find which key paths exist on the page • Map page to cluster using its key paths • Several other optimizations (in paper)

  18. Experiments • 10 major websites • Sampled ~20,000 pages each • Built ground truth • Ran an existing clustering algorithm • Manually checked results • Homogeneous clusters: merge when necessary • Heterogeneous clusters: change parameters, repeat • Small sample of search logs • ~5K unique queries per site • Far fewer than the number of pages per site

  19. Experiments • Compared to clustering using well-known tree-similarity metrics • Path Shingles: Shingle of DOM paths without using key paths [Buttler/04] • pq-Grams: Shingle of sub-trees of DOM tree [Augsten+/05] • m/k Path Shingles: Like path shingles, except only m out of k shingle elements need to match

  20. Experiments • Compared clusterings (our algorithm vs. [Buttler/04] and [Augsten+/05]) using the Adjusted RAND index • higher is better, 1.0 is perfect • Search logs give significant lift, with very low variance
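For reference, the Adjusted Rand index used above compares two clusterings of the same pages and corrects for chance agreement; a self-contained sketch of the standard formula (not code from the paper):

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two clusterings of the same items.

    1.0 = identical clusterings; ~0.0 = agreement expected by chance.
    """
    n = len(labels_a)
    # Contingency counts: pairs of items sharing clusters in both labelings.
    sum_ij = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:   # degenerate case: define as perfect agreement
        return 1.0
    return (sum_ij - expected) / (max_index - expected)
```

Note that the index is invariant to cluster relabeling: swapping the cluster names in one labeling leaves the score unchanged.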

  21. Experiments • Precision of IndepPaths: comparison against the paths actually used by manually-designed wrappers • Key paths correspond to the paths used in wrappers

  22. Experiments Examples of top-ranked paths

  23. Conclusions • Clusters affect both • wrapper quality, and • degree of editorial effort • We use search logs to automatically find good clusters • Current efforts: • Combining search features with content features to pick key paths

  24. Mapping pages to clusters • Given a ranked list of key paths • Given a shingle size k • For any page P • Find KP = all key paths in P • If |KP| &lt; k • Shingle = KP plus randomly chosen paths from the page • Else • Shingle = the top-ranked k paths in KP
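The steps above can be sketched directly (function and variable names are illustrative; sorting the random padding is an assumption made here just to keep the example deterministic):

```python
import random

def page_shingle(page_paths, key_path_rank, k, rng=random):
    """Map a page to its shingle: prefer key paths in global rank order,
    pad with randomly chosen non-key paths when fewer than k are present.

    key_path_rank: dict mapping each key path to its global rank (0 = best).
    """
    kp = sorted(set(page_paths) & set(key_path_rank),
                key=key_path_rank.__getitem__)
    if len(kp) >= k:
        return tuple(kp[:k])           # top-ranked k key paths on the page
    rest = [p for p in page_paths if p not in key_path_rank]
    pad = rng.sample(rest, min(k - len(kp), len(rest)))
    return tuple(kp + sorted(pad))     # key paths first, then the padding

rank = {"html h1": 0, "html h1 b": 1, "body span": 2}
shingle = page_shingle(["body span", "html h1", "div p", "html h1 b"], rank, k=2)
```

Pages sharing a shingle then fall into the same cluster, per slide 8.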

  25. Experiments

  26. Experiments

  27. Experiments • Compared clusterings using the Adjusted RAND index • higher is better, 1.0 is perfect • Our algorithm vs. shingles without key paths [Buttler/04], shingles of DOM subtrees [Augsten+/05], and a shingle of 8 paths where only 6 need to match • Search logs give significant lift, with very low variance
