Data-rich Section Extraction from HTML Pages
Introducing the DSE Algorithm
Original paper by Jiying Wang and Fred H. Lochovsky, Department of Computer Science, University of Science & Technology, Hong Kong
Presentation by Max Arends
The problem:
• Given a web page, find its data-rich section without any user input.
What makes this difficult?
• Decoration and advertisements
• “Human-oriented” HTML pages are difficult for computer programs to parse
• Topic distillation:
  • Tries to distill a small number of high-quality pages that are most representative of the topic.
  • The basic idea is that the number of links pointing to a page offers an assessment of its popularity and quality.
• Web information extraction:
  • Tries to extract data items from web pages, which are usually semi-structured, and return them in a structured form.
The DSE algorithm improves both!
Overview: HITS algorithm
• One of the best-known topic distillation algorithms.
• Given a set of web pages about one specific topic, the HITS algorithm calculates an authority score for each page (an indication of relevance derived from links).
• Basically, it looks at how many links point to a page (as Google does).
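The link-counting idea behind HITS can be sketched as a short power iteration. This is a minimal illustration of the standard hub/authority update, not the setup used in the paper; the function name `hits`, the `links` dictionary, and the tiny example graph are assumptions made for this sketch.

```python
import math

def hits(links, iterations=50):
    """Compute authority and hub scores.

    links: dict mapping a page to the list of pages it links to
    (hypothetical input format chosen for this sketch).
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority of a page: sum of the hub scores of pages linking to it.
        new_auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        auth = {p: v / norm for p, v in new_auth.items()}

        # Hub score of a page: sum of the authority scores of pages it links to.
        new_hub = {p: sum(auth[d] for d in links.get(p, [])) for p in pages}
        norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        hub = {p: v / norm for p, v in new_hub.items()}

    return auth, hub

# Tiny example graph: pages "a" and "b" both link to "c", and "c" links to "d".
links = {"a": ["c"], "b": ["c"], "c": ["d"], "d": []}
auth, hub = hits(links)
best_authority = max(auth, key=auth.get)
```

In this example graph, the page with the most inlinks ends up with the highest authority score, which is the intuition stated on the slide.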
The DSE algorithm (Data-rich Section Extraction)
• Basic idea:
  • Pages on the same site are similar or identical in layout (same CMS, same style).
• Basic method:
  • Use structural information to identify the basic layout.
  • Find “neighboring” pages on the same site and compare them.
What is the data-rich section of an HTML page?
• Both pages share a similar layout.
• The key content is in the lower right section.
3 phases:
• 1. Discover a set of sample pages that are similar to the target page.
• 2. Parse these HTML pages and convert them into tag trees.
• 3. Compare the target page tree with the sample page tree to identify their common parts. The difference is the data-rich section.
Phase 1: Discovering sample URLs
• US(i,j) [URL similarity] estimates the similarity of two pages.
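The slide does not give the formula for US(i,j), so the following is only a hypothetical stand-in to show what a URL-based similarity could look like. It assumes two URLs are more similar the more leading path segments they share on the same host; the function name `url_similarity` and the scoring rule are assumptions, not the paper's definition.

```python
from urllib.parse import urlparse

def url_similarity(url_i, url_j):
    """Hypothetical URL similarity in [0, 1]; NOT the paper's US(i,j) formula."""
    a, b = urlparse(url_i), urlparse(url_j)

    # Assumption: pages on different hosts are treated as unrelated.
    if a.netloc.lower() != b.netloc.lower():
        return 0.0

    seg_a = [s for s in a.path.split("/") if s]
    seg_b = [s for s in b.path.split("/") if s]

    # Count how many leading path segments the two URLs have in common.
    shared = 0
    for x, y in zip(seg_a, seg_b):
        if x != y:
            break
        shared += 1

    # The host match counts as one shared unit, so identical URLs score 1.0.
    longest = max(len(seg_a), len(seg_b))
    return (1 + shared) / (1 + longest)

score_same_dir = url_similarity(
    "http://example.com/news/2003/a.html",
    "http://example.com/news/2003/b.html",
)
score_other_host = url_similarity(
    "http://example.com/news/a.html",
    "http://example.org/news/a.html",
)
```

Under this assumed rule, candidate sample pages could be ranked by their score against the target URL, keeping the highest-scoring ones as samples.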
Phase 2: Tree creation
• The target page and the sample page are parsed.
• Each HTML page's layout is converted into a tree-like structure (DOM).
• Unimportant tags are ignored: FONT, SMALL, H1, H6.
• Unimportant attributes (such as BACKGROUND) are ignored to avoid unnecessary computations and comparisons.
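A minimal sketch of the tree-creation phase, using Python's standard `html.parser` module. The class names `Node` and `TagTreeBuilder`, the helper `build_tree`, and the list of void tags are assumptions for illustration. As a simplification, this sketch drops all attributes rather than only the unimportant ones.

```python
from html.parser import HTMLParser

IGNORED_TAGS = {"font", "small", "h1", "h6"}  # the tags named on the slide
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}  # no closing tag

class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""

class TagTreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        if tag in IGNORED_TAGS:
            return  # skip the tag; its content attaches to the enclosing node
        node = Node(tag, self.current)  # attributes are dropped (simplification)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        if tag in IGNORED_TAGS or tag in VOID_TAGS:
            return
        # Climb to the nearest open node with this tag to tolerate sloppy HTML.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data.strip()

def build_tree(html):
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root

page = ('<html><body background="x.gif"><h1><font>Title</font></h1>'
        '<table><tr><td>Price</td></tr></table></body></html>')
tree = build_tree(page)
body = tree.children[0].children[0]
```

Here the `h1` and `font` tags leave no nodes in the tree, so `body` has a single `table` child and the heading text is attached directly to `body`.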
Phase 3: Tree matching
• Given two DOM trees (one representing the target page and one the sample page), the similar structures have to be matched.
• The two trees are traversed in depth-first order and compared node by node.
• The parts of the tree that do not match are the data-rich sections.
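The matching step can be sketched as a recursive depth-first comparison. This is a simplified illustration and likely cruder than the matching rules in the original paper: it assumes trees are given as `(tag, text, children)` tuples and aligns children purely by position. The function name `unmatched_sections` is hypothetical.

```python
def unmatched_sections(target, sample):
    """Return the subtrees of `target` that have no counterpart in `sample`.

    Both trees are (tag, text, children) tuples; children are compared
    by position, which is a simplifying assumption of this sketch.
    """
    t_tag, t_text, t_children = target
    s_tag, s_text, s_children = sample

    # Different tags: the whole target subtree counts as unmatched.
    if t_tag != s_tag:
        return [target]

    # Leaf in the target: it matches only an identical leaf in the sample.
    if not t_children:
        return [] if (not s_children and t_text == s_text) else [target]

    # Inner node: descend depth-first and compare children pairwise.
    differing = []
    for i, child in enumerate(t_children):
        if i < len(s_children):
            differing.extend(unmatched_sections(child, s_children[i]))
        else:
            differing.append(child)  # no counterpart in the sample page
    return differing

target_tree = ("body", "", [
    ("div", "Site menu", []),
    ("table", "", [("td", "Camera X, $199", [])]),
])
sample_tree = ("body", "", [
    ("div", "Site menu", []),
    ("table", "", [("td", "Laptop Y, $899", [])]),
])
data_rich = unmatched_sections(target_tree, sample_tree)
```

In this example the shared menu matches in both trees and is discarded, while the table cell whose content differs is reported as the data-rich part.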
Applying DSE to HITS
• 28 queries are used.
• Each query is sent to the Google search engine, requesting the first 200 results.
• The result pages are added to the root set.
• Each of the 200 results is sent to Google again to retrieve at most 100 inlinks pointing to that result page; these are also added to the root set.
• The root set ranges from 975 to 6,776 nodes.