Incorporating Site-Level Knowledge for Incremental Crawling of Web Forums: A List-wise Strategy

Jiang-Ming Yang, Rui Cai, Lei Zhang, and Wei-Ying Ma (Microsoft Research Asia); Chun-song Wang (University of Wisconsin-Madison); Hua Huang (Beijing University of Posts and Telecommunications).


Presentation Transcript


  1. Incorporating Site-Level Knowledge for Incremental Crawling of Web Forums: A List-wise Strategy • Jiang-Ming Yang, Rui Cai, Lei Zhang, and Wei-Ying Ma (Microsoft Research Asia) • Chun-song Wang (University of Wisconsin-Madison) • Hua Huang (Beijing University of Posts and Telecommunications)

  2. Web Forums • Web Search • Q & A • Social Network • Forums are a huge resource of human knowledge!

  3. Forum Data Crawl and Mining • Stages: Crawling, Data Parsing, Content Mining • Incremental Crawling (KDD 2009) • Exploring Traversal Strategy (SIGIR 2008) • iRobot: Sitemap Reconstruction (WWW 2008) • Expert Finding & Junk Detection (SIGIR 2009) • Automatic Data Parsing (WWW 2009) • User Behavior in Forums (KDD 2009)

  4. Characteristics of Forums • Pages with different functions • Index page, with pagination • Post page, with pagination

  5. Incremental Crawling • General Web Pages • Each page is treated independently, i.e., page-wise • Forum Pages • Pagination is taken into account, i.e., list-wise

  6. Our Solution • Incorporating Site-level Knowledge • How many kinds of pages a website has • How the various pages link to each other • Purposes • Distinguish index and post pages • Concatenate pages into lists by following pagination links • Components: Sitemap Construction, List Construction & Classification, Timestamp Extraction, Prediction Models, Bandwidth Control

  7. Sitemap Construction

  8. Forum Sitemap • A sitemap is a directed graph consisting of a set of vertices and links • Example: http://forums.asp.net
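
The directed-graph view of a sitemap can be sketched in a few lines of Python. This is only an illustration: the class name `Sitemap`, its methods, and the vertex labels (`home`, `board_index`, `thread_post`) are hypothetical and do not come from the slides. Each vertex stands for a group of pages sharing a layout, and each edge for a kind of link between two groups.

```python
from collections import defaultdict


class Sitemap:
    """Directed graph: vertices are page groups, edges are links between groups."""

    def __init__(self):
        self.edges = defaultdict(set)

    def add_link(self, src, dst):
        self.edges[src].add(dst)
        self.edges.setdefault(dst, set())  # register dst as a vertex too

    def vertices(self):
        return set(self.edges)

    def successors(self, vertex):
        return sorted(self.edges.get(vertex, set()))


# Hypothetical forum: a home page, paginated board index pages, paginated threads.
sitemap = Sitemap()
sitemap.add_link("home", "board_index")
sitemap.add_link("board_index", "board_index")  # pagination between index pages
sitemap.add_link("board_index", "thread_post")
sitemap.add_link("thread_post", "thread_post")  # pagination between post pages
```

A self-loop such as `board_index` to `board_index` is one way to represent pagination links inside a single vertex.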

  9. Page Layout Clustering • Forum pages are generated from a database and templates • Layout is a robust way to describe a template • Layout can be characterized by the HTML elements in different DOM paths (e.g., repetitive patterns) • Reference: Rui Cai, Jiang-Ming Yang, Wei Lai, Yida Wang, and Lei Zhang. iRobot: An Intelligent Crawler for Web Forums. In Proceedings of WWW 2008.
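
As a rough illustration of characterizing layout by DOM paths, the sketch below collects the set of tag paths in a page and compares two pages with Jaccard similarity. This is a simplified stand-in, not the iRobot method cited on the slide (which relies on repetitive patterns); the names `PathCollector`, `dom_paths`, and `layout_similarity` are hypothetical.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class PathCollector(HTMLParser):
    """Collects every root-to-element tag path seen in a page."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to the matching open tag (tolerates unclosed children).
            while self.stack and self.stack.pop() != tag:
                pass


def dom_paths(html):
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def layout_similarity(html_a, html_b):
    """Jaccard similarity between the DOM path sets of two pages."""
    paths_a, paths_b = dom_paths(html_a), dom_paths(html_b)
    union = paths_a | paths_b
    return len(paths_a & paths_b) / len(union) if union else 1.0
```

Two pages rendered from the same template share most paths even when the number of posts differs, so their similarity stays high; pages from different templates score low.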

  10. Link Analysis • References: Rui Cai, Jiang-Ming Yang, Wei Lai, Yida Wang, and Lei Zhang. iRobot: An Intelligent Crawler for Web Forums. In Proceedings of WWW 2008. • Yida Wang, Jiang-Ming Yang, Wei Lai, Rui Cai, and Lei Zhang. Exploring Traversal Strategy for Web Forum Crawling. In Proceedings of SIGIR 2008.

  11. List Construction & Classification

  12. Identify Index & Post Nodes • An SVM-based classifier • Site-independent • Features • Node size • Link structure • Keywords • Node-level classification is more robust than page-level classification • Robust to noise on individual pages

  13. List Reconstruction • Given a new page • Classify it into a sitemap node • Detect its pagination links • Determine the order of the links
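
The last two steps can be sketched as follows, assuming the pagination links have already been detected and reduced to a "next page" mapping. The function name `reconstruct_list`, the mapping `next_link`, and the URLs are hypothetical.

```python
def reconstruct_list(first_url, next_link):
    """Follow 'next page' links from first_url and return pages in list order."""
    ordered, seen = [], set()
    url = first_url
    while url is not None and url not in seen:  # 'seen' guards against link cycles
        ordered.append(url)
        seen.add(url)
        url = next_link.get(url)
    return ordered


# Hypothetical thread split over three pages.
next_link = {
    "thread?id=7&page=1": "thread?id=7&page=2",
    "thread?id=7&page=2": "thread?id=7&page=3",
}
pages = reconstruct_list("thread?id=7&page=1", next_link)
```

The resulting `pages` sequence is the unit a list-wise crawler would schedule, rather than each page on its own.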

  14. Timestamp Extraction (e.g., YYYY/MM/DD)

  15. Timestamp Extraction • Distinguish real timestamps from noise • The temporal order of records can help!
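
One way the temporal order might help is sketched below, under the assumption that records in a list appear oldest first: the field whose values are most consistently non-decreasing is taken as the real timestamp, while date-like noise (for example, user join dates) scores lower. The field names, the `%Y/%m/%d` format, and both function names are hypothetical.

```python
from datetime import datetime


def order_score(values, fmt="%Y/%m/%d"):
    """Fraction of adjacent value pairs that are in non-decreasing time order."""
    try:
        times = [datetime.strptime(v, fmt) for v in values]
    except ValueError:
        return 0.0  # not parseable as dates at all
    if len(times) < 2:
        return 0.0
    in_order = sum(a <= b for a, b in zip(times, times[1:]))
    return in_order / (len(times) - 1)


def pick_timestamp_field(candidates):
    """Return the candidate field whose values best follow the temporal order."""
    return max(candidates, key=lambda name: order_score(candidates[name]))


# Hypothetical date-like fields extracted from four consecutive posts.
candidates = {
    "post_date": ["2008/06/01", "2008/06/01", "2008/06/03", "2008/06/07"],
    "join_date": ["2005/02/11", "2007/09/30", "2003/01/05", "2006/12/24"],
    "signature": ["n/a", "n/a", "n/a", "n/a"],
}
best = pick_timestamp_field(candidates)
```

For lists shown newest first, as index pages often are, the comparison would be reversed.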

  16. Prediction Models

  17. Feature Extraction • Features that describe update frequency • List-dependent and list-independent (site-level statistics) • Absolute and relative

  18. Regression Model • Linear regression • Advantages • Low computational cost • Efficient for online processing • Predicts when the next new record will arrive • CT: current time • LT: last (re-)visit time by the crawler
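
A minimal sketch of the idea, assuming a single feature (the recent average gap between records, in minutes) and ordinary least squares. The slide does not say how CT and LT enter the model, so the `revisit_priority` ratio below is purely an assumption, and all names and numbers are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept with one feature."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_interval(model, avg_recent_gap):
    """Predicted minutes until the next new record arrives."""
    slope, intercept = model
    return slope * avg_recent_gap + intercept


def revisit_priority(ct, lt, predicted_interval):
    """Assumed scoring: elapsed time since last visit over the predicted interval."""
    return (ct - lt) / predicted_interval


# Hypothetical training data: recent average gap vs. observed gap to next record.
model = fit_line([10, 20, 30, 40], [12, 22, 32, 42])
interval = predict_interval(model, 25)
priority = revisit_priority(ct=100, lt=46, predicted_interval=interval)
```

A priority above 1 would mean the list is overdue for a revisit under this assumed scoring. Fitting and prediction are both a handful of arithmetic operations, which matches the low-cost, online-friendly advantages listed on the slide.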

  19. Bandwidth Control

  20. Bandwidth Control • Index and post pages are quite different • Post pages block the bandwidth • As a result, new threads cannot be discovered in time • A simple but practical solution
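
The slide does not spell out the solution, so the following is only one plausible reading: reserve a fixed share of the crawl budget for index pages so that post pages cannot starve thread discovery. The function `allocate_budget`, the parameter `index_percent`, and the example numbers are all hypothetical.

```python
def allocate_budget(total_pages, index_percent, index_queue, post_queue):
    """Split a crawl budget between index and post pages.

    index_percent is the share (0-100) reserved for index pages; budget that
    the post queue cannot use is handed back to index pages.
    """
    index_quota = min(len(index_queue), total_pages * index_percent // 100)
    post_quota = min(len(post_queue), total_pages - index_quota)
    index_quota = min(len(index_queue), total_pages - post_quota)
    return index_queue[:index_quota], post_queue[:post_quota]


# Hypothetical queues, already sorted by predicted priority.
index_queue = [f"index_{i}" for i in range(5)]
post_queue = [f"post_{i}" for i in range(20)]
chosen_index, chosen_post = allocate_budget(10, 30, index_queue, post_queue)
```

With a budget of 10 pages and a 30 percent reservation, three index pages are always crawled even though the post queue alone could consume the whole budget.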

  21. Experiment Setup • 18 web forums in diverse categories • Data from March 1999 to June 2008 • 990,476 pages and 5,407,854 posts • Simulation • Repeatable and controllable • Comparison • List-wise strategy (LWS) • LWS with bandwidth control (LWS + BC) • Curve-fitting policy (CF) • Bound-based policy (BB, WWW 2008) • Oracle (the ideal case)

  22. Measurements • Bandwidth Utilization • Inew: number of crawled pages with new information • IB: number of pages crawled • Coverage • Icrawl: number of new posts crawled • Iall: number of new posts published on the forums • Timeliness • ∆ti: number of minutes between the publication and the download of post i
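
The formulas themselves did not survive in the transcript. The sketch below assumes the natural readings: bandwidth utilization as Inew / IB, coverage as Icrawl / Iall, and timeliness as the mean of the ∆ti values. The function names and sample numbers are hypothetical.

```python
def bandwidth_utilization(i_new, i_b):
    """Assumed: share of crawled pages that carried new information."""
    return i_new / i_b


def coverage(i_crawl, i_all):
    """Assumed: share of newly published posts that the crawler downloaded."""
    return i_crawl / i_all


def timeliness(delays_minutes):
    """Assumed: mean delay in minutes between publication and download."""
    return sum(delays_minutes) / len(delays_minutes)


# Hypothetical day of crawling with a budget of 3000 pages.
utilization = bandwidth_utilization(i_new=1500, i_b=3000)
covered = coverage(i_crawl=80, i_all=100)
avg_delay = timeliness([10, 20, 60])
```

Higher utilization and coverage are better, while a lower average delay is better.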

  23. Performance Comparison • Warm-up stage • Bandwidth: 3000 pages / day

  24. Performance Comparison (Cont.) • Comparison under various bandwidth settings

  25. Performance Comparison (Cont.) • Detailed performance on index and post pages • Bandwidth: 3000 pages / day

  26. Conclusions and Future Work • Targets web forums, a specific but interesting field • Develops an effective solution for incremental forum crawling • Integrates site-level knowledge • Includes some practical engineering implementation • Future work • Improve the timestamp extraction algorithm • Use a stronger prediction model than linear regression
