Incorporating Site-Level Knowledge for Incremental Crawling of Web Forums: A List-wise Strategy. Jiang-Ming Yang, Rui Cai, Lei Zhang, and Wei-Ying Ma Microsoft Research, Asia Chun-song Wang University of Wisconsin-Madison Hua Huang Beijing University of Posts and Telecommunications.
Web Forums • Web Search • Q & A • Social Networks • Forums are a huge resource of human knowledge!
Forum Data Crawling and Mining • Crawling: Incremental Crawling (KDD 2009); Exploring Traversal Strategy (SIGIR 2008); iRobot: Sitemap Reconstruction (WWW 2008) • Data Parsing: Automatic Data Parsing (WWW 2009) • Content Mining: Expert Finding & Junk Detection (SIGIR 2009); User Behavior in Forums (KDD 2009)
Characteristics of Forums • Pages with different functions: index pages and post pages • Both kinds of pages are paginated
Incremental Crawling • General Web Pages • Treating pages independently, i.e., page-wise • Forum Pages • Considering pagination, i.e., list-wise
Our Solution • Incorporating Site-level Knowledge • How many kinds of pages are in a website • How the various pages link to each other • Purposes • Distinguish index and post pages • Concatenate pages into lists by following pagination • Pipeline: Sitemap Construction → List Construction & Classification → Timestamp Extraction → Prediction Models → Bandwidth Control
Step 1: Sitemap Construction
Forum Sitemap • A sitemap is a directed graph consisting of a set of vertices and links • Example: http://forums.asp.net
Page Layout Clustering • Forum pages are generated from databases and templates • Layout is a robust way to describe a template • Layout can be characterized by the HTML elements on different DOM paths (e.g., repetitive patterns) • Rui Cai, Jiang-Ming Yang, Wei Lai, Yida Wang and Lei Zhang. iRobot: An Intelligent Crawler for Web Forums. In Proceedings of the WWW 2008 Conference
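The layout-clustering idea above can be sketched as follows: describe each page by the multiset of its root-to-element DOM tag paths (which captures repetitive patterns), then group pages whose path profiles are similar. This is an illustrative sketch, not iRobot's actual algorithm; the greedy single-pass clustering and the 0.9 similarity threshold are assumptions.

```python
# Sketch: pages rendered from the same template share DOM tag paths.
from collections import Counter
from html.parser import HTMLParser
import math

class PathProfiler(HTMLParser):
    """Collect root-to-element tag paths, e.g. 'html/body/ul/li/a'."""
    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = Counter()
    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        self.paths["/".join(self.stack)] += 1
    def handle_endtag(self, tag):
        if tag in self.stack:
            # pop back to the matching open tag (tolerates sloppy HTML)
            while self.stack and self.stack.pop() != tag:
                pass

def profile(html):
    p = PathProfiler()
    p.feed(html)
    return p.paths

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cluster(pages, threshold=0.9):
    """Greedy clustering of (page_id, html) pairs by layout similarity."""
    clusters = []  # list of (representative_profile, [page_ids])
    for pid, html in pages:
        prof = profile(html)
        for rep, members in clusters:
            if cosine(prof, rep) >= threshold:
                members.append(pid)
                break
        else:
            clusters.append((prof, [pid]))
    return [members for _, members in clusters]
```

Two index-like pages that differ only in the number of repeated `<li><a>` rows end up in one cluster, while a text-heavy post page lands in its own.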
Link Analysis • Rui Cai, Jiang-Ming Yang, Wei Lai, Yida Wang and Lei Zhang. iRobot: An Intelligent Crawler for Web Forums. In Proceedings of the WWW 2008 Conference • Yida Wang, Jiang-Ming Yang, Wei Lai, Rui Cai and Lei Zhang. Exploring Traversal Strategy for Web Forum Crawling. In Proceedings of the SIGIR 2008 Conference
Step 2: List Construction & Classification
Identify Index & Post Nodes • An SVM-based classifier • Site-independent • Features: node size, link structure, keywords • Node-level classification is more robust than page-level classification • Robust to noise on individual pages
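A rough sketch of how the three listed features might feed the node classifier. The paper trains an SVM; here a hand-weighted linear score stands in for the trained model, and all weights and keyword lists are illustrative assumptions, not values from the paper.

```python
# Sketch: site-independent features aggregated per sitemap node.
# Keyword lists and weights below are illustrative assumptions.
INDEX_KEYWORDS = {"forum", "board", "topics", "threads"}
POST_KEYWORDS = {"reply", "quote", "posted"}

def node_features(pages):
    """Aggregate features over all pages assigned to one sitemap node.
    pages: list of dicts with 'out_links', 'text_len', 'tokens' (a set)."""
    n = len(pages)
    avg_links = sum(p["out_links"] for p in pages) / n
    avg_text = sum(p["text_len"] for p in pages) / n
    idx_kw = sum(len(INDEX_KEYWORDS & p["tokens"]) for p in pages) / n
    post_kw = sum(len(POST_KEYWORDS & p["tokens"]) for p in pages) / n
    return avg_links, avg_text, idx_kw, post_kw

def classify_node(pages):
    """Index pages have many out-links and little text; post pages the
    reverse. Averaging over the node damps noise on individual pages."""
    links, text, idx_kw, post_kw = node_features(pages)
    score = 0.02 * links - 0.001 * text + idx_kw - post_kw
    return "index" if score > 0 else "post"
```

Because the score is computed over a whole node rather than a single page, one noisy page (say, a post with an unusually long link list) cannot flip the decision, which is the robustness point made above.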
List Reconstruction • Given a new page: • Classify it into a sitemap node • Detect its pagination links • Determine the link order
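The three steps above can be sketched as a grouping pass: once each crawled page has been assigned the URL of the first page of its thread or board and a page number read from its pagination bar, each list falls out of a sort. The record fields below are illustrative, not the paper's data model.

```python
# Sketch: stitch independently crawled pages into ordered lists
# by their pagination information.
def build_lists(pages):
    """pages: iterable of (url, head_url, page_no) where head_url is the
    first page of the thread/board and page_no comes from the pagination
    bar. Returns {head_url: [urls sorted by page number]}."""
    lists = {}
    for url, head, no in pages:
        lists.setdefault(head, []).append((no, url))
    return {head: [u for _, u in sorted(members)]
            for head, members in lists.items()}
```

The crawler can then treat each value as one logical list and reason about updates list-wise rather than page-wise.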
Step 3: Timestamp Extraction
Timestamp Extraction • Distinguish real timestamps from noise • The temporal order can help!
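One way to use the temporal-order cue, as a sketch: a listing often shows several date-like fields (member join dates, ads) besides the real post timestamps, but only the real timestamps tend to be monotonically ordered down the page. The field names, the `YYYY/MM/DD` format, and the strict monotonicity test are assumptions for illustration.

```python
# Sketch: pick the date field whose values respect the page's
# temporal order; other date-like fields are treated as noise.
from datetime import datetime

def is_monotonic(dates):
    return all(a >= b for a, b in zip(dates, dates[1:])) or \
           all(a <= b for a, b in zip(dates, dates[1:]))

def pick_timestamp_field(candidates):
    """candidates: {field_name: ['YYYY/MM/DD', ...]}, one value per
    record in page order. Returns the first field whose values parse
    as dates and are monotonically ordered, else None."""
    for name, values in candidates.items():
        try:
            dates = [datetime.strptime(v, "%Y/%m/%d") for v in values]
        except ValueError:
            continue  # not a date column at all
        if is_monotonic(dates):
            return name
    return None
```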
Step 4: Prediction Models
Feature Extraction • Features describing update frequency • List-dependent & list-independent (site-level statistics) • Absolute & relative
Regression Model • Linear regression • Advantages: lightweight computational cost; efficient for online processing • Predicts when the next new record will arrive • CT: current time • LT: last (re-)visit time by the crawler
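A minimal sketch of this step: fit a linear model by closed-form ordinary least squares, then revisit a list once the predicted arrival time of the next record has passed. The single feature used here (the most recent inter-arrival gap) is an illustrative stand-in for the paper's richer feature set; CT and LT follow the slide's notation.

```python
# Sketch: one-feature linear regression for next-arrival prediction.
def fit_line(x, y):
    """Closed-form least-squares fit y ~ w*x + b."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    w = sxy / sxx
    return w, my - w * mx

def should_revisit(ct, lt, last_record_time, w, b, recent_gap):
    """ct: current time, lt: last (re-)visit time by the crawler.
    Revisit once the predicted next-arrival time has passed."""
    next_arrival = last_record_time + (w * recent_gap + b)
    return ct > lt and ct >= next_arrival
```

The closed-form fit and the constant-time prediction are what make this cheap enough for online use, matching the "lightweight" advantage claimed above.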
Step 5: Bandwidth Control
Bandwidth Control • Index and post pages are quite different • Post pages can block the bandwidth, so new threads are not discovered in time • A simple but practical solution
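One plausible reading of the "simple but practical" solution, as a sketch: reserve a fixed share of the per-round page budget for index pages so that new threads keep being discovered, and spend the remainder on the most urgent post lists. The 30% share and the queue layout are assumed parameters, not values from the paper.

```python
# Sketch: split the crawl budget between index and post pages so that
# post-page crawling cannot starve new-thread discovery.
def allocate(budget, index_queue, post_queue, index_share=0.3):
    """Queues: lists of (urgency, url), sorted most-urgent first.
    Returns the URLs to crawl this round."""
    n_index = min(len(index_queue), int(budget * index_share))
    picked = [url for _, url in index_queue[:n_index]]
    remaining = budget - len(picked)
    picked += [url for _, url in post_queue[:remaining]]
    return picked
```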
Experiment Setup • 18 web forums in diverse categories • March 1999 to June 2008 • 990,476 pages and 5,407,854 posts • Simulation: repeatable and controllable • Comparison • List-wise strategy (LWS) • LWS with bandwidth control (LWS + BC) • Curve-fitting policy (CF) • Bound-based policy (BB, WWW 2008) • Oracle (ideal case)
Measurements • Bandwidth Utilization = Inew / IB (Inew: #pages with new information; IB: #pages crawled) • Coverage = Icrawl / Iall (Icrawl: #new posts crawled; Iall: #new posts published on the forums) • Timeliness: average Δti (Δti: #minutes between a post's publication and its download)
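The three measurements transcribe directly into code from the definitions above; nothing here goes beyond the slide.

```python
# Direct transcription of the three evaluation measurements.
def bandwidth_utilization(i_new, i_b):
    """Fraction of crawled pages that contained new information."""
    return i_new / i_b

def coverage(i_crawl, i_all):
    """Fraction of newly published posts that were crawled."""
    return i_crawl / i_all

def timeliness(delays):
    """Average minutes between a post's publication and its download."""
    return sum(delays) / len(delays)
```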
Performance Comparison • Warm-up stage • Bandwidth: 3,000 pages/day
Performance Comparison (Cont.) • Comparison under various bandwidths
Performance Comparison (Cont.) • Detailed performance on index and post pages • Bandwidth: 3,000 pages/day
Conclusions and Future Work • Targeted web forums, a specific but interesting field • Developed an effective solution for incremental forum crawling • Integrated site-level knowledge • Included practical engineering implementations • Future work • Improve the timestamp extraction algorithm • Use a stronger prediction model than linear regression