Incorporating Site-Level Knowledge to Extract Structured Data from Web Forums. Jiang-Ming Yang, Rui Cai, Yida Wang, Jun Zhu, Lei Zhang, and Wei-Ying Ma. Web Search & Mining Group, Microsoft Research Asia. 2009-04.
Web Forum Data • An important information resource containing a large amount of human knowledge. • Topics include recreation, sports, games, computers, art, society, science, home, and health. • 20% of the pages in search results come from forums.
Understanding Forum: Crawling, Data Extraction, Quality Assessment • WWW’08: iRobot: An Intelligent Crawler for Web Forums • SIGIR’08: Exploring Traversal Strategy • KDD’09: Incremental Crawling • WWW’09: Automatic Data Extraction • SIGIR’09: Quality Assessment
Challenge • Leverage more site-level knowledge
Forum Sitemap • A sitemap is a directed graph consisting of a set of vertices and the links between them. • Rui Cai, Jiangming Yang, Wei Lai, Yida Wang, and Lei Zhang. iRobot: An Intelligent Crawler for Web Forums. In Proceedings of the WWW 2008 Conference.
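A minimal sketch of a sitemap as a directed graph, assuming each vertex stands for a group of pages that share a layout. The `Sitemap` class and the vertex names (`index`, `list`, `post`, `profile`) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


class Sitemap:
    """Directed graph: vertices are page groups, edges are links between them."""

    def __init__(self):
        self.vertices = set()
        self.edges = defaultdict(set)  # source vertex -> set of target vertices

    def add_link(self, source, target):
        self.vertices.update((source, target))
        self.edges[source].add(target)

    def successors(self, vertex):
        # .get() avoids creating an empty entry for vertices without out-links.
        return sorted(self.edges.get(vertex, ()))


# Hypothetical vertices for a typical forum site.
sitemap = Sitemap()
sitemap.add_link("index", "list")
sitemap.add_link("list", "list")     # pagination between list pages
sitemap.add_link("list", "post")
sitemap.add_link("post", "post")     # pagination inside a long thread
sitemap.add_link("post", "profile")

print(sitemap.successors("list"))
```

Self-loops such as `list -> list` are kept on purpose, since links between pages of the same kind are part of the graph too.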
Page Clustering • Forum pages are generated from a database and templates. • Layout is a robust descriptor of a template. • Layout can be characterized by the HTML elements found at different DOM paths.
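One way to make "layout characterized by DOM paths" concrete is to count how often each root-to-element tag path occurs in a page and compare the counts between pages. The sketch below does that with the standard-library parser; the overlap measure in `layout_similarity` is an assumption for illustration, not the measure used in the talk.

```python
from collections import Counter
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class DomPathCounter(HTMLParser):
    """Counts occurrences of each tag path, e.g. 'html/body/table/tr'."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = Counter()

    def handle_starttag(self, tag, attrs):
        self.paths["/".join(self.stack + [tag])] += 1
        if tag not in VOID_TAGS:  # void elements never get a closing tag
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed elements.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def dom_path_features(html):
    parser = DomPathCounter()
    parser.feed(html)
    parser.close()
    return parser.paths


def layout_similarity(a, b):
    """Multiset overlap of two DOM-path counters (an assumed measure)."""
    union = sum((a | b).values())
    return sum((a & b).values()) / union if union else 1.0


row = "<tr><td><a href='t'>topic</a></td></tr>"
list_a = dom_path_features(f"<html><body><table>{row * 2}</table></body></html>")
list_b = dom_path_features(f"<html><body><table>{row * 3}</table></body></html>")
post = dom_path_features("<html><body><div><p>hi</p></div><div><p>yo</p></div></body></html>")

print(layout_similarity(list_a, list_b), layout_similarity(list_a, post))
```

Two list pages with different numbers of rows still score far closer to each other than a list page does to a post page, which is the property clustering relies on.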
Page Clustering DOM Path Feature Discovery Clustering by Virtual Tables
Link Analysis • A link is characterized by its URL pattern plus its location in the page: Link = URL Pattern + Location
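A small sketch of the "URL pattern + location" idea, under two assumptions that are not from the slides: the URL pattern is obtained by replacing digit runs with a wildcard, and the location is the DOM path of the anchor element. The `Link` type and the example URLs are hypothetical.

```python
import re
from typing import NamedTuple


class Link(NamedTuple):
    url_pattern: str
    location: str  # DOM path of the anchor element (assumed representation)


def url_pattern(url):
    """Generalize a URL by replacing each run of digits with a wildcard."""
    return re.sub(r"\d+", "*", url)


def make_link(url, dom_path):
    return Link(url_pattern(url), dom_path)


a = make_link("/forum/thread.php?id=1042&page=2", "html/body/table/tr/td/a")
b = make_link("/forum/thread.php?id=77&page=1", "html/body/table/tr/td/a")
c = make_link("/forum/thread.php?id=77&page=1", "html/body/div/a")

print(a == b)  # same pattern, same location
print(a == c)  # same pattern, different location
```

Keeping the location in the key means two links with the same URL shape but different positions on the page are treated as distinct.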
Inner-Page Features • Inclusion relation: data records usually contain one another; for example, a post record includes its author, time, and content. • Alignment relation: since data is generated from a database and rendered through templates, data records with the same label may appear repeatedly in a page. • Time order: since post records are generated sequentially along a timeline, post times should be sorted in ascending or descending order.
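The time-order feature can be checked mechanically. The sketch below assumes candidate post times have already been extracted as strings in a single known format (`TIME_FORMAT` is an assumption; real forums vary widely).

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M"  # assumed format, for illustration only


def parse_times(strings):
    return [datetime.strptime(s, TIME_FORMAT) for s in strings]


def time_order(timestamps):
    """Return 'ascending', 'descending', or None if the sequence is unordered.

    Sequences with fewer than two items are trivially reported as ascending.
    """
    pairs = list(zip(timestamps, timestamps[1:]))
    if all(a <= b for a, b in pairs):
        return "ascending"
    if all(a >= b for a, b in pairs):
        return "descending"
    return None


posts = parse_times(["2009-04-20 09:15", "2009-04-20 10:02", "2009-04-21 08:30"])
shuffled = [posts[1], posts[0], posts[2]]

print(time_order(posts), time_order(posts[::-1]), time_order(shuffled))
```

A candidate set of nodes whose values fail this check is unlikely to be the post-time field.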
Problem Setting Author Title Content
Formulas for list pages • Formulas for identifying list records • Formulas for identifying list titles
Formulas for post pages • Formulas for identifying post records • Formulas for identifying post authors
Formulas for post pages (continued) • Formulas for identifying post times • Formulas for identifying post content
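Markov Logic Networks • An MLN can be viewed as a template for constructing Markov random fields. • Given a set of formulas and constants, an MLN defines a Markov network with one node per ground atom and one feature per ground formula. The probability of a state x in such a network is given by $P(X = x) = \frac{1}{Z} \exp\left(\sum_i w_i\, n_i(x)\right)$, where $w_i$ is the weight of formula $i$, $n_i(x)$ is the number of its true groundings in $x$, and $Z$ is the normalizing constant.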
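A toy illustration of grounding and inference, assuming the standard log-linear MLN form in which a world's probability is proportional to exp(weight × number of true groundings). The predicates `HasLink` and `IsTitle`, the single formula `HasLink(n) => IsTitle(n)`, its weight, and the two nodes are all hypothetical and are not the formulas from the talk.

```python
import itertools
import math

NODES = ["n1", "n2"]                   # hypothetical DOM nodes (constants)
HAS_LINK = {"n1": True, "n2": False}   # evidence, assumed observed
WEIGHT = 2.0                           # assumed weight of the single formula


def true_groundings(is_title):
    """n(x): groundings of HasLink(n) => IsTitle(n) that hold in world x."""
    return sum((not HAS_LINK[n]) or is_title[n] for n in NODES)


def world_probabilities():
    """Enumerate every assignment to the query atoms and normalize."""
    worlds = [dict(zip(NODES, values))
              for values in itertools.product([False, True], repeat=len(NODES))]
    scores = [math.exp(WEIGHT * true_groundings(w)) for w in worlds]
    z = sum(scores)                    # partition function Z
    return [(w, s / z) for w, s in zip(worlds, scores)]


def marginal(node):
    """P(IsTitle(node) = True | evidence)."""
    return sum(p for w, p in world_probabilities() if w[node])


print(round(marginal("n1"), 4), round(marginal("n2"), 4))
```

The node with a link is pushed toward the title label, while the node without one stays at 0.5 because the formula says nothing about it. Exhaustive enumeration is only feasible for toy cases like this, which is why reducing the number of groundings matters in practice.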
Markov Logic Networks • Divide DOM tree elements into three categories: text elements, hyperlink elements, and inner elements. • Benefits: fewer possible groundings during inference; less ambiguity and better performance.
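The three-way split can be sketched as follows. The slides do not define the categories precisely, so the rules here are assumptions: an `<a>` element is a hyperlink element, a childless element with non-empty text is a text element, and everything else is an inner element.

```python
import xml.etree.ElementTree as ET


def categorize(element):
    """Assign an element to one of three assumed categories."""
    if element.tag == "a":
        return "hyperlink"
    has_children = len(element) > 0
    has_text = bool((element.text or "").strip())
    if not has_children and has_text:
        return "text"
    return "inner"


# A well-formed, hypothetical fragment of a post record.
snippet = "<div><span>Re: hello</span><a href='/user/7'>alice</a><p>Thanks!</p></div>"
root = ET.fromstring(snippet)
categories = [(el.tag, categorize(el)) for el in root.iter()]

print(categories)
```

Once each element carries a category, a formula that only applies to, say, hyperlink elements needs to be grounded on that subset alone rather than on every node in the tree.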
Experiments List Pages Post Pages
Future Work http://discussions.apple.com/
Conclusion • A template-independent approach to extracting structured data from web forum sites. • We can leverage the power of site-level information, such as the mutual information among pages within a vertex or across vertices of the sitemap. • http://research.microsoft.com/people/jmyang/