A densitometric approach to web page segmentation

Toyama Lab, Master's reading seminar, Brice Pesci. A densitometric approach to web page segmentation. About the paper: « A densitometric approach to web page segmentation », Leibniz Universität Hannover, Germany, CIKM 2008 (Conference on Information and Knowledge Management).


Presentation Transcript


  1. Toyama Lab, Master's reading seminar • Brice Pesci • A densitometric approach to web page segmentation

  2. About the paper • « A densitometric approach to web page segmentation » • Leibniz Universität Hannover • Germany • CIKM 2008 • Conference on Information and Knowledge Management

  3. Introduction • It is increasingly difficult to retrieve distinct information elements on the Web • Menus • Text ads • Snippets... • Need to identify the informative sections • Remove the noise

  4. Web Page Segmentation • Goals • De-duplication • Abstract content from layout • Content extraction • Remove noise, increase classifier performance... • Keyword-based web search • Accuracy of the results

  5. Related work • DOM tree analysis • Mine block-specific patterns • Determine template blocks • Shingles, element frequencies, isotonic regression, ... • Entropies, common DOM subtrees, ... • Vision-based • Still renders the DOM • Graph-theoretic perspective

  6. Segmentation as... (1) • A visual problem • Heterogeneity • Various kinds of layouts • Various ways to generate the same layout • DOM-level rule-based algorithms are bound to fail • High complexity • Too slow? • Relationship to image segmentation • Image recognition

  7. Segmentation as... (2) • A linguistic problem • Statistical measures to identify structure patterns in plain text documents • Subtopics, ... • The probability of class x depends only on the probability of the neighboring lower class • Examine the statistical properties of subsequent blocks with respect to their quantitative properties

  8. Segmentation as... (2) • A linguistic problem • Distribution of document lengths • Zipf's law: y = C · x^(-b), where y is the frequency of objects of a class, x is the rank of the class, and C and b are constants • Reasonable for segmenting the text within a single document? • Sentence length • Stochastic process • « The sentence lengths change along with the text flow » • The occurrence probability of a sentence length x follows a hyperpascal distribution
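
To make the Zipf relation on this slide concrete, here is a minimal sketch that estimates the two constants of a power law y = C · x^(-b) from a list of class frequencies, using an ordinary least-squares fit in log-log space. The function name `fit_zipf` and the fitting method are assumptions chosen for illustration; the slides do not say how the fit was done in the paper.

```python
import math


def fit_zipf(frequencies):
    """Fit y = C * x**(-b) to frequencies ranked from most to least frequent.

    Illustrative helper (not from the slides): a least-squares line through
    the points (log rank, log frequency). Returns the pair (C, b).
    """
    ranked = sorted((f for f in frequencies if f > 0), reverse=True)
    if len(ranked) < 2:
        raise ValueError("need at least two positive frequencies")
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(f) for f in ranked]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                      # slope of the log-log line is -b
    intercept = mean_y - slope * mean_x    # intercept is log C
    return math.exp(intercept), -slope


if __name__ == "__main__":
    # Synthetic frequencies that follow y = 1000 / x exactly.
    c, b = fit_zipf([1000 / rank for rank in range(1, 51)])
    print(round(c, 3), round(b, 3))
```

A fitted exponent b close to 1 is the classic Zipf case; real data usually only approximates a straight line in log-log space.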

  9. Segmentation as... (3) • A densitometric problem • Atomic text portion: text without any element tag • Gap: the opening/closing element tag(s) that interleave a sequence of text portions • Which gaps separate? Which do not? • A separation is most likely caused by a change in text flow • Short sentences: navigational menu... • We cannot use « sentences » because of templates
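
The notion of atomic text portions separated by tag gaps can be sketched with Python's standard `html.parser` module. This is only an illustration under simple assumptions: every tag counts as a gap, whitespace is collapsed, and script or style content is not filtered out. The names `TextPortionParser` and `text_portions` are hypothetical.

```python
from html.parser import HTMLParser


class TextPortionParser(HTMLParser):
    """Collects atomic text portions: runs of text containing no element tag."""

    def __init__(self):
        super().__init__()
        self.portions = []

    def handle_data(self, data):
        # The parser hands over the text found between two consecutive tags.
        text = " ".join(data.split())  # collapse whitespace
        if text:
            self.portions.append(text)


def text_portions(html):
    """Return the atomic text portions of an HTML string, in document order."""
    parser = TextPortionParser()
    parser.feed(html)
    parser.close()
    return parser.portions


if __name__ == "__main__":
    page = "<ul><li>Home</li><li>About</li></ul><p>Some <b>bold</b> text.</p>"
    print(text_portions(page))
```

Note how the inline `<b>` tag splits one sentence into three portions, which is exactly the kind of gap that should not separate blocks.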

  10. Segmentation as... (3) • A densitometric problem • Text density • Number of words within a 2D area • Similar to the intensity of a region in computer vision • Word-wrapped text: wmax = 80 • English: 5.1 chars / word, which caps the number of words per line • French: 5.13 chars / word • German: 6.26 chars / word

  11. Segmentation as... (3) • A densitometric problem • Need to remove the last line (it might not be complete) • Text density then counts only the tokens on complete lines • Not influenced by the number of additional tokens • Does not measure lexical/grammatical properties • Studies on language show that this may be sufficient • Notation: T = set of tokens in a line, L = set of wrapped lines, b_x = block
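
The text density described on the last two slides can be sketched as follows: wrap the block at wmax characters, drop the possibly incomplete last line, and divide the number of tokens on the remaining lines by the number of those lines. The exact formula is not preserved in this transcript, so the details here are assumptions: tokens are whitespace-separated words, and a block that fits on a single line falls back to its raw token count.

```python
import textwrap


def text_density(text, w_max=80):
    """Tokens per wrapped line, ignoring the last (possibly incomplete) line.

    Assumptions (not from the slides): tokens are whitespace-separated,
    and a block of at most one line uses its token count as density.
    """
    lines = textwrap.wrap(
        text, width=w_max, break_long_words=False, break_on_hyphens=False
    )
    if len(lines) <= 1:
        return float(len(text.split()))
    complete = lines[:-1]  # drop the last line, it might not be full
    tokens = sum(len(line.split()) for line in complete)
    return tokens / len(complete)


if __name__ == "__main__":
    menu = "Home About Contact"
    paragraph = " ".join(["word"] * 40)
    print(text_density(menu), text_density(paragraph))
```

Short navigational blocks get a low density while running text approaches the per-line word limit, which is the contrast the segmentation relies on.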

  12. Segmentation as... (4) • A 1-dimensional problem • Detecting block-separating gaps on a web page • Finding neighboring text portions with a significant change in text density
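
Viewed as a 1-dimensional problem, a page is just a sequence of density values, and candidate separators are the positions where neighboring values differ strongly. The sketch below flags such positions using a relative change normalized by the larger of the two densities. Both the normalization and the name `candidate_gaps` are assumptions made for illustration.

```python
def candidate_gaps(densities, threshold):
    """Return indices i where the density changes strongly between i and i+1.

    The relative change |a - b| / max(a, b) is an assumed measure; the
    transcript does not preserve the one shown on the slide.
    """
    gaps = []
    for i in range(len(densities) - 1):
        left, right = densities[i], densities[i + 1]
        top = max(left, right)
        change = abs(left - right) / top if top else 0.0
        if change > threshold:
            gaps.append(i)
    return gaps


if __name__ == "__main__":
    # Menu-like portions, then two paragraph-like portions, then a footer.
    print(candidate_gaps([1, 1, 2, 16, 15.5, 3], threshold=0.5))
```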

  13. Onto the block fusion algorithm • A greedy approach is plausible • Thanks to the relation between text flow, sentence length and text density • Based on the Block Growing algorithm • From computer vision • Slope delta between 2 blocks • Surrounding blocks dominate enclosed ones • If the densities of the previous and next blocks are identical and higher, we fuse the 3 of them

  14. The Block Fusion algorithm • Plain • We fuse if the slope delta is below a threshold • Smoothed • Surrounding blocks dominate • If the densities of the previous and next blocks are identical and higher, we fuse the 3 of them
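
The plain and smoothed variants can be sketched as one greedy loop over adjacent blocks, repeated until nothing changes. This is a sketch under stated assumptions, not the paper's reference implementation: the slope delta is taken to be the density difference divided by the larger density, fused blocks are joined with a space, the density function is the simplified one assumed earlier, and `threshold` has no default because the slides give no value.

```python
import textwrap


def text_density(text, w_max=80):
    """Tokens per complete wrapped line (assumed simplified definition)."""
    lines = textwrap.wrap(
        text, width=w_max, break_long_words=False, break_on_hyphens=False
    )
    if len(lines) <= 1:
        return float(len(text.split()))
    complete = lines[:-1]
    return sum(len(line.split()) for line in complete) / len(complete)


def slope_delta(d1, d2):
    """Relative density change between two blocks (assumed normalization)."""
    top = max(d1, d2)
    return 0.0 if top == 0 else abs(d1 - d2) / top


def block_fusion(blocks, threshold, smoothed=False, w_max=80):
    """Greedily fuse adjacent text blocks until no more fusions happen."""
    blocks = list(blocks)
    changed = True
    while changed:
        changed = False
        out = []
        i = 0
        while i < len(blocks):
            cur = blocks[i]
            if out:
                d_prev = text_density(out[-1], w_max)
                d_cur = text_density(cur, w_max)
                if smoothed and i + 1 < len(blocks):
                    d_next = text_density(blocks[i + 1], w_max)
                    # Surrounding blocks dominate an enclosed, sparser block.
                    if d_prev == d_next and d_cur < d_prev:
                        out[-1] = " ".join([out[-1], cur, blocks[i + 1]])
                        i += 2
                        changed = True
                        continue
                if slope_delta(d_prev, d_cur) <= threshold:
                    out[-1] = out[-1] + " " + cur
                    i += 1
                    changed = True
                    continue
            out.append(cur)
            i += 1
        blocks = out
    return blocks


if __name__ == "__main__":
    para = " ".join(["word"] * 40)
    print(len(block_fusion(["Home", "About", para, para], threshold=0.3)))
    print(len(block_fusion([para, "Note", para], threshold=0.3, smoothed=True)))
```

Every fusion shortens the block list, so the outer loop always terminates. Recomputing densities from the text keeps the sketch short; a real implementation would cache token and line counts per block.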

  15. A few notes • 2 parameters • Not document-specific • Input blocks B • Complexity: O(n) • About the gaps: • <h1> produces the same gap as <b>! • Version with rules: • T_ForceGap: block-level elements (H1-H6, UL, DL, OL, HR, TABLE, ADDRESS, IMG, SCRIPT) • T_NoGap: inline elements (A, B, BR, EM, FONT, I, S, SPAN, STRONG, SUB, SUP, U, TT)
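
The two tag sets of the rule-based version can be sketched as a small splitter: inline tags never open a gap, block-level tags from the force list open a gap that must not be fused later, and every other tag opens an ordinary gap that the fusion step may still remove. The tag sets come from the slide; the class `RuleBasedSplitter`, the function `gap_kind`, and the `(text, forced)` output format are hypothetical choices for this illustration, and script content is not filtered out.

```python
from html.parser import HTMLParser

FORCE_GAP = {"h1", "h2", "h3", "h4", "h5", "h6", "ul", "dl", "ol", "hr",
             "table", "address", "img", "script"}
NO_GAP = {"a", "b", "br", "em", "font", "i", "s", "span", "strong",
          "sub", "sup", "u", "tt"}


def gap_kind(tag):
    """Classify a tag as a 'force' gap, 'none' (no gap), or a 'candidate' gap."""
    name = tag.lower()
    if name in FORCE_GAP:
        return "force"
    if name in NO_GAP:
        return "none"
    return "candidate"


class RuleBasedSplitter(HTMLParser):
    """Splits a page into (text, forced) pairs; forced marks a forced gap before."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buffer = []
        self._forced = False

    def _flush(self):
        text = " ".join("".join(self._buffer).split())
        self._buffer = []
        if text:
            self.blocks.append((text, self._forced))
            self._forced = False

    def _gap(self, tag):
        kind = gap_kind(tag)
        if kind == "none":
            return              # inline tag: text keeps flowing
        self._flush()
        if kind == "force":
            self._forced = True  # remember it for the next block

    def handle_starttag(self, tag, attrs):
        self._gap(tag)

    def handle_endtag(self, tag):
        self._gap(tag)

    def handle_data(self, data):
        self._buffer.append(data)

    def close(self):
        super().close()
        self._flush()


if __name__ == "__main__":
    splitter = RuleBasedSplitter()
    splitter.feed("<h1>Title</h1><p>Some <b>bold</b> text.</p><div>Footer</div>")
    splitter.close()
    print(splitter.blocks)
```

With these rules `<h1>` and `<b>` no longer produce the same gap: the heading is fenced off while the bold run stays inside its paragraph.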

  16. Experiments (0) • WebSpam UK-2007 • 106 million pages from 115,000 hosts • 111 non-spam pages from 102 different sites • Manual results compared to • Word wrap: every line is a segment • Tag gap: text portions between tags (except A) • BF-plain / smoothed / rulebased • JustRules • GCuts

  17. Experiments (1) • Statistical properties of web page text • Text density / sentence length? • Adjacent blocks with the same density: one block • Not what we expected, but still holds • Also, the number of tokens in a segment follows Zipf's law

  18. Experiments (2) • Segmentation accuracy • 2 cluster correlation metrics between 0 and 1 • Adjusted Rand Index • Normalized Mutual Information • Charts: BF-plain/smoothed, BF-rulebased
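
Both metrics compare two labelings of the same items, here the segment assigned to each token by the manual segmentation and by an algorithm. The sketch below computes them from scratch with the standard library. The Adjusted Rand Index follows the usual pair-counting definition; for Normalized Mutual Information several normalizations exist, and dividing by the geometric mean of the two entropies is an assumption, since the slides do not say which one the paper used.

```python
from collections import Counter
from math import comb, log, sqrt


def adjusted_rand_index(a, b):
    """Adjusted Rand Index between two labelings of the same items."""
    n = len(a)
    if n < 2:
        return 1.0
    sum_ij = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(a).values())
    sum_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster on both sides
    return (sum_ij - expected) / (max_index - expected)


def normalized_mutual_information(a, b):
    """Mutual information divided by sqrt(H(a) * H(b)) (assumed normalization)."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    h_a = -sum(c / n * log(c / n) for c in count_a.values())
    h_b = -sum(c / n * log(c / n) for c in count_b.values())
    if h_a == 0 or h_b == 0:
        return 1.0 if h_a == h_b else 0.0
    mi = sum(
        c / n * log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )
    return mi / sqrt(h_a * h_b)


if __name__ == "__main__":
    manual = [0, 0, 1, 1, 2, 2]      # segment id per token, manual labels
    predicted = [5, 5, 7, 7, 9, 9]   # same grouping under different ids
    print(adjusted_rand_index(manual, predicted))
    print(normalized_mutual_information(manual, predicted))
```

Label names do not matter, only the grouping. Note that the Adjusted Rand Index can drop below 0 when two labelings agree less than chance would predict.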

  19. Experiments (2) • Segmentation: • BF-plain

  20. Experiments (2) • Segmentation: • BF-smoothed

  21. Experiments (2) • Segmentation: • BF-rulebased

  22. Experiments (3) • Average accuracy • Performance • Most of the error gets removed after the first iteration • On a standard laptop: 15 ms per page

  23. Experiments (4) • Effect of wmax • Confirms the relation between the language-specific average and line width • Stable between 80 and 100

  24. Experiments (5) • On near-duplicate detection • Using the LYRICS dataset • 2359 web pages of song lyrics by 6 artists • Very effective on near-duplicate detection • Narrow winner: JustRules

  25. Conclusion • Web page segmentation • Token-level text density is an effective property • The new method is inspired by quantitative linguistics and computer vision

  26. End • Thank you for listening
