Data-rich Section Extraction from HTML pages

Data-rich Section Extraction from HTML pages: introducing the DSE algorithm. Original paper by Jiying Wang and Fred H. Lochovsky, Department of Computer Science, University of Science & Technology, Hong Kong. Presentation by Max Arends.


Presentation Transcript


  1. Data-rich Section Extraction from HTML pages: Introducing the DSE algorithm. Original paper by Jiying Wang and Fred H. Lochovsky, Department of Computer Science, University of Science & Technology, Hong Kong. Presentation by Max Arends.

  2. The problem: Given a web page, find the data-rich section of the page without any additional input. What makes this difficult? • Decoration and advertisements • “Human-oriented” HTML pages are difficult for computer programs to parse

  3. Topic distillation: • Tries to distill a small number of high-quality pages that are most representative of a topic. • The basic idea is that the number of links pointing to a page offers an assessment of its popularity and quality. Web information extraction: • Tries to extract data items from web pages, which are usually semi-structured, and return them in a structured form. The DSE algorithm improves both!

  4. Overview: HITS algorithm • One of the best-known topic distillation algorithms. • Given a set of web pages about one specific topic, the HITS algorithm calculates an authority score for each page (an indication of relevance based on links). • Essentially, it looks at how many links point to a page (compare Google).
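The authority computation on this slide can be sketched as a short power iteration. This is a minimal illustration of standard HITS, not code from the paper or the presentation. The slide mentions only the authority score; the hub score is included because standard HITS updates the two together. The function names `hits` and `_normalize`, the dictionary-based graph format, and the fixed iteration count are assumptions made for this example.

```python
def _normalize(scores):
    """Scale scores to unit Euclidean length (all zeros stay zero)."""
    norm = sum(v * v for v in scores.values()) ** 0.5
    return {p: (v / norm if norm else 0.0) for p, v in scores.items()}


def hits(links, iterations=50):
    """Run HITS on a link graph.

    links: dict mapping each page to the pages it links to (assumed format).
    Returns (authority, hub) dicts keyed by page.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    # Outgoing and incoming link sets for every page.
    out = {p: set(links.get(p, ())) for p in pages}
    inc = {p: set() for p in pages}
    for src, targets in out.items():
        for dst in targets:
            inc[dst].add(src)

    hub = dict.fromkeys(pages, 1.0)
    auth = dict.fromkeys(pages, 1.0)
    for _ in range(iterations):
        # Authority of p: sum of the hub scores of pages linking to p.
        auth = _normalize({p: sum(hub[q] for q in inc[p]) for p in pages})
        # Hub of p: sum of the authority scores of pages p links to.
        hub = _normalize({p: sum(auth[q] for q in out[p]) for p in pages})
    return auth, hub


if __name__ == "__main__":
    graph = {"a": ["c"], "b": ["c"], "c": ["d"]}
    authority, hub_score = hits(graph)
    print(max(authority, key=authority.get))  # page with the highest authority
```

In the toy graph, page `c` receives links from two pages and therefore ends up with the highest authority, which matches the "count the incoming links" intuition on the slide.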

  5. The DSE algorithm (Data-rich Section Extraction) • Basic idea: • Pages on the same site are similar or identical in layout (same CMS, same style). • Basic method: • Use structural information to identify the basic layout. • Find “neighboring” pages on the same site and compare them.

  6. What is the data-rich section of an HTML page? • Both sites share a similar layout • The key content is in the lower-right section

  7. 3 phases: • 1. Discover a set of sample pages that are similar to the target page. • 2. Parse these HTML pages and convert them into tag trees. • 3. Compare the target page tree with the sample page tree to identify their common parts. The difference is the data-rich section.

  8. Phase 1: Discovering sample URLs • US(i,j) [URL similarity] estimates the similarity of two pages i and j
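The formula for US(i,j) did not survive in this transcript, so the following is only a stand-in to make the idea concrete. It is not the paper's definition. As an assumption for illustration, it scores two URLs as 0 if they are on different hosts and otherwise as the fraction of leading path segments they share. The names `url_similarity` and `pick_sample` are hypothetical.

```python
from urllib.parse import urlsplit


def url_similarity(url_a, url_b):
    """Hypothetical URL similarity in [0, 1]; NOT the paper's US(i,j)."""
    a, b = urlsplit(url_a), urlsplit(url_b)
    # Assumption: pages on different hosts are never sample candidates.
    if a.netloc.lower() != b.netloc.lower():
        return 0.0
    seg_a = [s for s in a.path.split("/") if s]
    seg_b = [s for s in b.path.split("/") if s]
    longest = max(len(seg_a), len(seg_b))
    if longest == 0:
        return 1.0  # both URLs point at the site root
    shared = 0
    for x, y in zip(seg_a, seg_b):
        if x != y:
            break
        shared += 1
    return shared / longest


def pick_sample(target_url, candidate_urls):
    """Return the candidate most similar to the target, or None if there is none."""
    others = [u for u in candidate_urls if u != target_url]
    if not others:
        return None
    return max(others, key=lambda u: url_similarity(target_url, u))


if __name__ == "__main__":
    target = "http://example.com/news/2003/a.html"
    links = ["http://example.com/about.html",
             "http://example.com/news/2003/b.html"]
    print(pick_sample(target, links))
```

Under this stand-in measure, a page in the same directory as the target is preferred as the sample page, which fits the "neighboring pages on the same site" idea from the earlier slide.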

  9. Phase 2: Tree creation • The target page and the sample page are parsed. • Each HTML page's layout is converted into a tree-like structure (DOM). • Unimportant tags are ignored: FONT, SMALL, H1, H6 • Unimportant attributes (like BACKGROUND) are ignored to avoid unnecessary computations and comparisons.
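Tree creation can be sketched with Python's standard `html.parser` module. This is a minimal sketch under stated assumptions, not the authors' parser. The ignored tags and the ignored attribute are the ones named on the slide. The `Node` class, the `#root` and `#text` labels, the small `VOID_TAGS` set, and the policy of keeping the children of an ignored tag are choices made for this example.

```python
from html.parser import HTMLParser

IGNORED_TAGS = {"font", "small", "h1", "h6"}   # tags named on the slide
IGNORED_ATTRS = {"background"}                 # attribute named on the slide
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}  # assumed subset


class Node:
    """One node of the tag tree (hypothetical helper class)."""

    def __init__(self, tag, attrs=None, text=""):
        self.tag = tag
        self.attrs = attrs or {}
        self.text = text
        self.children = []


class TagTreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        if tag in IGNORED_TAGS:
            return  # skip the tag itself; its content attaches to the parent
        kept = {k: v for k, v in attrs if k not in IGNORED_ATTRS}
        node = Node(tag, kept)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        if tag in IGNORED_TAGS:
            return
        # Close the nearest matching open tag; tolerate unbalanced HTML.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.stack[-1].children.append(Node("#text", text=text))


def build_tag_tree(html):
    """Parse an HTML string and return the root of its tag tree."""
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root
```

For example, `build_tag_tree('<body background="x.gif"><h1>Title</h1><table></table></body>')` yields a `body` node with no attributes whose children are a text node and a `table` node.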

  10. Phase 3: Tree matching • Given two DOM trees (one representing the target page and one the sample page), the similar structures have to be matched. • The two trees are traversed in depth-first order and compared node by node. • The parts of the tree that don't match are the data-rich sections.
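The matching step can be illustrated with a simplified depth-first comparison. This sketch is an assumption-laden simplification, not the algorithm from the paper. Trees are plain `(label, children)` tuples, children are matched strictly by position, and a text node carries its text in its label so that differing text counts as a mismatch. The function name `unmatched_subtrees` is hypothetical.

```python
def unmatched_subtrees(target, sample):
    """Return the subtrees of `target` that have no match in `sample`.

    Each tree is a (label, children) tuple, where children is a list of
    such tuples. Children are compared by position (a simplification).
    """
    t_label, t_children = target
    s_label, s_children = sample
    if t_label != s_label:
        # The whole subtree differs, so report it and stop descending.
        return [target]
    diffs = []
    for i, child in enumerate(t_children):
        if i < len(s_children):
            # Depth-first: fully compare this child before its siblings.
            diffs.extend(unmatched_subtrees(child, s_children[i]))
        else:
            # The target has an extra child the sample lacks.
            diffs.append(child)
    return diffs


def page(article_text):
    """Build a toy page tree: a shared menu cell plus an article cell."""
    return ("html", [("body", [("table", [("tr", [
        ("td", [("#text:Menu", [])]),
        ("td", [("#text:" + article_text, [])]),
    ])])])])


if __name__ == "__main__":
    target_tree = page("Article about DSE")
    sample_tree = page("Article about HITS")
    print(unmatched_subtrees(target_tree, sample_tree))
```

In the toy example, the shared menu cell matches in both trees and is dropped, while the article text differs and is reported as the data-rich part of the target page.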

  12. Applying DSE to HITS • 28 queries are used. • Each query is sent to the Google search engine, requesting the first 200 results. • The result pages are added to the root set. • Each of the 200 results is sent to Google again to retrieve at most 100 inlinks pointing to that result page; these are also added to the root set. • The root set ranges from 975 to 6,776 nodes.
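The root-set construction described on this slide can be sketched as follows. The slide does not say how the search engine was queried, so the two callables `search` and `inlinks` are hypothetical placeholders that the caller must supply; no real search API is assumed. The limits of 200 results and 100 inlinks come from the slide.

```python
def build_root_set(query, search, inlinks, max_results=200, max_inlinks=100):
    """Collect the root set for one query.

    search(query)  -> ordered list of result URLs (hypothetical callable)
    inlinks(url)   -> list of URLs linking to `url` (hypothetical callable)
    Returns the URLs in insertion order, without duplicates.
    """
    root, seen = [], set()

    def add(url):
        if url not in seen:
            seen.add(url)
            root.append(url)

    # Step 1: the first `max_results` result pages go into the root set.
    results = list(search(query))[:max_results]
    for url in results:
        add(url)

    # Step 2: for each result, add at most `max_inlinks` pages linking to it.
    for url in results:
        for source in list(inlinks(url))[:max_inlinks]:
            add(source)
    return root


if __name__ == "__main__":
    # Stand-in data so the sketch runs without network access.
    fake_results = {"dse": ["r1", "r2"]}
    fake_inlinks = {"r1": ["x", "r2"], "r2": ["y"]}
    print(build_root_set("dse", fake_results.get, lambda u: fake_inlinks.get(u, [])))
```

Deduplicating while adding explains why the root-set size can vary widely between queries: pages that appear both as results and as inlinks are counted once.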
