Data-rich Section Extraction from HTML pages

Introducing the DSE Algorithm
Original paper by Jiying Wang and Fred H. Lochovsky, Department of Computer Science, University of Science & Technology, Hong Kong
Presentation by Max Arends

The problem: given a web page, find the data-rich section of the page without any input.
What makes this difficult?
• Decoration and advertisements
• "Human-oriented" HTML pages are difficult for computer programs to parse

• Topic distillation tries to distill a small number of high-quality pages that are most representative of a topic. The basic idea is that the number of links pointing to a page offers an assessment of its popularity and quality.
• Web information extraction tries to extract data items from web pages, which are usually semi-structured, and return them in a structured form.
The DSE algorithm improves both.

Overview: HITS algorithm
• One of the best-known topic distillation algorithms.
• Given a set of web pages about one specific topic, the HITS algorithm calculates an authority score for each page (an indication of relevant links).
• Basically, it looks at how many links point to a page (as Google does).
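To make the authority score concrete, here is a minimal sketch of the standard HITS iteration. It is not taken from the slides: the function name `hits`, the dictionary-based link graph, and the fixed iteration count are assumptions chosen for illustration.

```python
import math

def hits(links, iterations=50):
    """Run a basic HITS iteration.

    links: dict mapping a page to the list of pages it links to
           (hypothetical input format, chosen for this sketch).
    Returns (authority, hub) score dicts.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority of a page: sum of the hub scores of pages linking to it.
        new_auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        auth = {p: v / norm for p, v in new_auth.items()}

        # Hub score of a page: sum of the authority scores it links to.
        new_hub = {p: sum(auth[d] for d in links.get(p, [])) for p in pages}
        norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        hub = {p: v / norm for p, v in new_hub.items()}

    return auth, hub

# Toy graph: pages "a" and "b" both link to "c", which links to "d".
authority, hub_score = hits({"a": ["c"], "b": ["c"], "c": ["d"]})
best = max(authority, key=authority.get)
```

In the toy graph, `c` receives the most inlinks and therefore ends up with the highest authority score.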

The DSE (Data-rich Section Extraction) algorithm
• Basic idea: pages are similar or the same (same CMS, same style).
• Basic method:
  • Use structural information to identify the basic layout.
  • Find "neighboring" pages on the same site and compare them.

What is the data-rich section of an HTML page?
• Both sites share a similar layout
• The key content is in the lower right section

3 phases:
1. Discover a set of sample pages that are similar to the target page.
2. Parse these HTML pages and convert them into tag trees.
3. Compare the target page tree with the sample page tree to identify their common parts. The difference is the data-rich section.

Phase 1: Discovering sample URLs
US(i,j) [URL similarity] estimates the similarity of two pages i and j.
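The slides do not give the formula for US(i,j). As a stand-in only, the sketch below scores two URLs by the fraction of leading path segments they share on the same host. The function name `url_similarity` and this scoring rule are assumptions for illustration, not the measure defined in the paper.

```python
from urllib.parse import urlparse

def url_similarity(url_a, url_b):
    """Hypothetical URL similarity in [0, 1].

    Assumption for this sketch: pages on different hosts score 0;
    otherwise the score is the number of shared leading path segments
    divided by the length of the longer path.
    """
    a, b = urlparse(url_a), urlparse(url_b)
    if a.netloc != b.netloc:
        return 0.0

    seg_a = [s for s in a.path.split("/") if s]
    seg_b = [s for s in b.path.split("/") if s]
    longest = max(len(seg_a), len(seg_b))
    if longest == 0:
        return 1.0  # both URLs point at the site root

    shared = 0
    for x, y in zip(seg_a, seg_b):
        if x != y:
            break
        shared += 1
    return shared / longest

score = url_similarity(
    "http://example.com/news/2004/a.html",
    "http://example.com/news/2004/b.html",
)
```

Candidate pages with the highest score against the target URL would then be taken as sample pages.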

Phase 2: Tree creation
• The target page and the sample page are parsed.
• Each HTML page's layout is converted into a tree-like structure (DOM).
• Unimportant tags are ignored: FONT, SMALL, H1, H6.
• Unimportant attributes (such as BACKGROUND) are ignored to avoid unnecessary computations and comparisons.
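A rough sketch of this phase using Python's standard `html.parser` module. The class names `Node` and `TagTreeBuilder`, the helper `build_tree`, and the `VOID_TAGS` set are assumptions made for the example; the ignored tag set simply mirrors the tags named on the slide and can be extended.

```python
from html.parser import HTMLParser

# Tags skipped during tree creation (taken from the slide; extend as needed).
IGNORED_TAGS = {"font", "small", "h1", "h6"}
# Tags without a closing tag (assumed subset, enough for this sketch).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}

class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""

class TagTreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        if tag in IGNORED_TAGS:
            return  # skip the tag; its text stays with the enclosing node
        node = Node(tag, self.current)  # attributes are dropped on purpose
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        if tag in IGNORED_TAGS or tag in VOID_TAGS:
            return
        # Climb to the nearest open node with this tag, then step out of it.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data.strip()

def build_tree(html):
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root

tree = build_tree(
    '<html><body background="bg.gif"><font size="2">'
    "<table><tr><td>Hi</td></tr></table></font></body></html>"
)
```

Here the FONT element and the BACKGROUND attribute leave no trace in the tree, so the TABLE node becomes a direct child of BODY.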

Phase 3: Tree matching
• Given two DOM trees (one representing the target page and one the sample page), the similar structures have to be matched.
• The two trees are traversed in depth-first order and compared node by node.
• The parts of the tree that don't match are the data-rich sections.
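The matching step can be sketched as a parallel depth-first walk over both trees. This is a simplified illustration, not the paper's exact procedure: the `Node` dataclass, the function name `find_data_rich`, and the rule "same tag and same text means a match" are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    tag: str
    text: str = ""
    children: list = field(default_factory=list)

def find_data_rich(target, sample):
    """Return the target subtrees that do not match the sample tree.

    Assumed matching rule: two nodes match when tag and own text are equal.
    Children are compared pairwise by position, depth-first.
    """
    if target.tag != sample.tag or target.text != sample.text:
        return [target]  # the whole subtree differs

    unmatched = []
    for i, child in enumerate(target.children):
        if i >= len(sample.children):
            unmatched.append(child)  # no counterpart in the sample page
        else:
            unmatched.extend(find_data_rich(child, sample.children[i]))
    return unmatched

def page(article_text):
    # Two-cell layout: shared navigation on the left, article on the right.
    row = Node("tr", children=[Node("td", "Home | News"), Node("td", article_text)])
    return Node("table", children=[row])

data_rich = find_data_rich(page("Article A"), page("Article B"))
```

The shared navigation cell matches in both pages, so only the cell containing "Article A" is reported as data-rich.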

Applying DSE to HITS
• 28 queries are used.
• Each query is sent to the Google search engine, requesting the first 200 results.
• The result pages are added to the root set.
• Each of the 200 results is sent to Google again to retrieve at most 100 inlinks pointing to that result page; these are also added to the root set.
• The root set ranges from 975 to 6,776 nodes.
