
Title Extraction from Bodies of HTML Documents and its Application to Web Page Retrieval

Title Extraction from Bodies of HTML Documents and its Application to Web Page Retrieval. Yunhua Hu (1), Guomao Xin (2), Ruihua Song, Guoping Hu (3), Shuming Shi, Yunbo Cao, and Hang Li. Microsoft Research Asia. (1) Xi’an Jiaotong University, (2) Peking University




Presentation Transcript


  1. Title Extraction from Bodies of HTML Documents and its Application to Web Page Retrieval • Yunhua Hu (1), Guomao Xin (2), Ruihua Song, Guoping Hu (3), Shuming Shi, Yunbo Cao, and Hang Li • Microsoft Research Asia • (1) Xi’an Jiaotong University • (2) Peking University • (3) University of Science and Technology of China

  2. Outline • Motivation • Related work • Problem description • Our approach • Experimental results • Conclusions

  3. Outline • Motivation • Related work • Problem description • Our approach • Experimental results • Conclusions

  4. Motivation • Title of HTML document should be defined in title field • Title fields of HTML documents are not reliable

  5. Can We Extract Title from Body of HTML?

  6. Outline • Motivation • Related work • Problem description • Our approach • Experimental results • Conclusions

  7. Related Work: Web Information Extraction • Information type: data record, news article, summary • Data structure: DOM tree, block • Approach: rule-based approach vs machine learning based approach • Domain specific vs domain independent • Not clear how to extract title from body

  8. Related Work: Web Information Retrieval • Title field, anchor text, and URL are useful for web page retrieval • Not clear whether extracted title is useful

  9. Outline • Motivation • Related work • Problem description • Our approach • Experimental results • Conclusions

  10. Title Extraction Task • Input: HTML document (web page) • Output: title(s) from body of HTML document • Condition: domain independent • [Figure: an HTML document and its extracted titles; example text: National Weather Service Oxnard Los Angeles Marine Weather Statement]

  11. Spec on HTML Title • Intuitively, title is ‘most conspicuous’ part • Can have 0-2 titles • Must be in top region • Font size, font weight, etc. are noticeable • Can cross several lines, but usually in same format • Cannot be in bullets or lists • Cannot be expressions like “under construction”, … • Image is not considered

  12. Examples

  13. Outline • Motivation • Related work • Problem description • Our approach • Experimental results • Conclusions

  14. Title Extraction Processing • Title extraction as information extraction • Using DOM tree • Leaf node containing ‘text’ as unit (instance) • Mainly using format information • [Figure label: Title]
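The idea of treating each text-bearing leaf of the DOM tree as a unit can be sketched with Python's standard `html.parser`. This is only an illustration under stated assumptions: the class `TextUnitCollector`, the function `text_units`, and the choice to represent a unit as (tag path, text) are hypothetical and not taken from the slides, and a real system would likely use a full DOM parser.

```python
from html.parser import HTMLParser


class TextUnitCollector(HTMLParser):
    """Collects non-empty text nodes together with their chain of open tags."""

    # Void elements never get an end tag, so they are not pushed on the stack.
    VOID = {"br", "hr", "img", "meta", "link", "input"}
    # Text inside these elements is not part of the visible body.
    SKIP = {"script", "style", "title"}

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.units = []   # list of (tag_path, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if text and not self.SKIP & set(self.stack):
            self.units.append((tuple(self.stack), text))


def text_units(html):
    """Return every text leaf of the document as a (tag_path, text) unit."""
    collector = TextUnitCollector()
    collector.feed(html)
    collector.close()
    return collector.units
```

Each returned unit is one candidate instance for the classifier; the tag path gives a crude view of the format and tag information that the later slides list as features.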

  15. DOM tree • [Figure: an HTML document and its DOM tree]

  16. General framework for Information Extraction • [Figure: Model, Learning Tool, Extraction Tool]

  17. HTML Title Extraction • [Figure: Perceptron Classifier, Learning Tool, Extraction Tool; x: unit, y: title?]
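A minimal sketch of the learning and extraction steps, assuming the textbook perceptron update rule; the slides name a perceptron classifier but do not give its exact variant or hyperparameters. The function names `train_perceptron` and `is_title`, the toy feature layout, and the epoch count are hypothetical.

```python
def train_perceptron(samples, epochs=10):
    """Learning tool: fit weights on (feature_vector, label) pairs.

    Labels are +1 (unit is a title) or -1 (unit is not a title).
    """
    dim = len(samples[0][0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in samples:
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # misclassified or on the boundary: update
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
    return w, b


def is_title(w, b, x):
    """Extraction tool: classify one unit from its feature vector."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b > 0


# Toy features (an assumption for illustration): [is_bold, is_h1, in_list]
training = [
    ([1, 1, 0], +1),
    ([0, 0, 1], -1),
    ([1, 0, 0], +1),
    ([0, 0, 0], -1),
]
weights, bias = train_perceptron(training)
```

With this toy data the model separates the four training units, so `is_title(weights, bias, [1, 1, 0])` is true and `is_title(weights, bias, [0, 0, 1])` is false.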

  18. Information Used in Features (1) • Rich format information • Font size: levels 1 to 7 • Font weight: bold face or not • Font family: Times New Roman, Arial, etc. • Font style: normal or italic • Font color: #000000, #FF0000, etc. • Background color: #FFFFFF, #FF0000, etc. • Alignment: center, left, right, and justify • Tag information • H1, H2, …, H6: levels as header • LI: a listed item • DIR: a directory list • A: a link or anchor • U: an underline • BR: a line break • HR: a horizontal rule • IMG: an image • Class name: ‘sectionheader’, ‘title’, ‘titling’, ‘header’, etc.

  19. Information Used in Features (2) • Position information • Position from beginning of body • Width of unit in page • DOM tree information • Number of sibling nodes in the DOM tree. • Relations with root node, parent node and sibling nodes in terms of font size change, etc. • Relations with previous leaf node and next leaf node, in terms of font size change, etc. • Linguistic information • Length of text: number of characters • Length of real text: number of alphabetic letters • Negative words: ‘by’, ‘date’, ‘phone’, ‘fax’, ‘email’, ‘author’, etc. • Positive words: ‘abstract’, ‘introduction’, ‘summary’, ‘overview’, ‘subject’, ‘title’, etc.
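A few of the listed features can be computed from a unit's tag path and text, as in the sketch below. It rests on several assumptions: the function name `unit_features` and the feature names are hypothetical, the word sets contain only the examples named on the slide (which ends both lists with "etc."), and treating word presence as a binary feature is a guess at how the word lists are used.

```python
NEGATIVE_WORDS = {"by", "date", "phone", "fax", "email", "author"}
POSITIVE_WORDS = {"abstract", "introduction", "summary", "overview", "subject", "title"}
HEADER_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


def unit_features(tag_path, text, position):
    """Build a feature dictionary for one text unit.

    tag_path: tuple of enclosing tag names, outermost first
    text:     the unit's text content
    position: index of the unit counted from the beginning of the body
    """
    # Lowercased words with surrounding punctuation stripped.
    words = {w.strip(".,:;!?()").lower() for w in text.split()}
    return {
        "length": len(text),                               # number of characters
        "real_length": sum(ch.isalpha() for ch in text),   # alphabetic letters only
        "has_negative_word": bool(words & NEGATIVE_WORDS),
        "has_positive_word": bool(words & POSITIVE_WORDS),
        "in_header": bool(HEADER_TAGS & set(tag_path)),
        "in_list_item": "li" in tag_path,
        "in_link": "a" in tag_path,
        "position": position,
    }
```

Font, color, and alignment features are omitted here because they require resolved style information that a tag path alone does not provide.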

  20. Use of Extracted Title in Web Page Retrieval • Employing BM25 framework • BasicField: texts in body and title are used • BasicField+Title • BasicField+ExtTitle • BasicField+CombTitle
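The slide does not say how the extra title field enters the BM25 score. The sketch below assumes the simplest option, a weighted sum of per-field Okapi BM25 scores, to illustrate a BasicField+ExtTitle style ranking. The function names, the `title_weight` parameter, and the default `k1` and `b` values are assumptions, not the authors' settings.

```python
import math
from collections import Counter


def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Okapi BM25 score of one tokenized field against a tokenized query.

    corpus holds the same field for every document; it supplies the
    document frequencies and the average field length.
    """
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs or 1.0  # guard against empty fields
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        norm = tf[term] + k1 * (1 - b + b * len(doc_terms) / avg_len)
        score += idf * tf[term] * (k1 + 1) / norm
    return score


def combined_score(query_terms, doc_index, basic_fields, ext_titles, title_weight=1.0):
    """Assumed combination: BasicField score plus a weighted extracted-title score."""
    basic = bm25_score(query_terms, basic_fields[doc_index], basic_fields)
    title = bm25_score(query_terms, ext_titles[doc_index], ext_titles)
    return basic + title_weight * title


# Toy collection of two documents (illustrative data only).
basic_fields = [
    ["weather", "report", "for", "los", "angeles"],
    ["stock", "market", "news"],
]
ext_titles = [
    ["marine", "weather", "statement"],
    ["market", "update"],
]
query = ["marine", "weather"]
```

In this toy collection the first document matches the query in both fields, so its combined score exceeds its BasicField-only score, while the second document scores zero.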

  21. Outline • Motivation • Related work • Problem description • Our approach • Experimental results • Conclusions

  22. Data for Title Extraction Experiments

  23. Title Extraction Results (TREC, Cross-Validation)

  24. Title Extraction Results (MS, Cross-Validation)

  25. Title Extraction: Feature Contribution (MS)

  26. Title Extraction: Domain Adaptation

  27. Query Data for Retrieval Experiments

  28. Web Page Retrieval Results (TREC) TREC-2003 NP

  29. Web Page Retrieval Results (TREC) TREC-2003 HP

  30. Web Page Retrieval Results (TREC) 2003 TD

  31. Average Precision for Each Method

  32. Outline • Motivation • Related work • Problem description • Our approach • Experimental results • Conclusions

  33. Conclusions • Title fields of HTML documents are not reliable • We propose conducting title extraction from bodies of HTML documents • Construct domain-independent model using machine learning and format features • Use of extracted titles can help improve precision of web page retrieval, particularly TREC name page finding

  34. Thanks!
