Paper 37 Mining Web Pages for Data Records (MDR)

Paper 37 Mining Web Pages for Data Records (MDR)
Bing Liu, Robert Grossman, Yanhong Zhai (University of Illinois at Chicago)
IEEE Intelligent Systems, Volume 19, Issue 6, Pages 49-55, 2004/11/1
Professors: 陳彥良, 許秉瑜
Presented by: 狄宇昌
2006 Data Mining

Outline • Introduction • Related work (MDR, Omini, IEPAD, Wrapper) • Mining data regions • Comparing generalized nodes (CombComp) • Determining data regions (FindDRs) • Identifying data records (FindRecords) • Experiment results

Introduction (1/5) • Extracting information from Web pages helps provide value-added services, such as: • Customizable Web information gathering • Comparative shopping • Metasearching • MDR (mining data records) exploits: • Web page structure • A string-matching algorithm • MDR mines both contiguous and noncontiguous data records

Introduction (2/5) • Current approach 1: supervised learning • Requires substantial human effort • Current approach 2: automatic techniques • Perform poorly • Assume that the relevant items are contiguous in the Web page • Few researchers have exploited the nested structure of HTML

Introduction (3/5) • MDR (mining data records) • An automatic technique that finds all data records formed by table- and form-related HTML tags • Such as table, form, tr, td, and so on • MDR outperformed the existing systems it was compared with (Omini and IEPAD)

Introduction (4/5) • MDR is based on two observations about Web page layout • Observation one: • Similar objects appear in a contiguous region of a page • Data regions are formatted with similar HTML tags

Introduction (5/5) • Observation two: • A tag tree represents the nested structure of HTML tags in a Web page • The data records in a specific region lie under one parent node • As in figure 1b, each notebook is wrapped in 5 tr nodes under the same parent node, tbody

Related work (1/2) • Researchers have developed several approaches for mining data records from Web pages • Omini (Object Mining and Extraction system) • Uses a set of heuristics and a manually constructed domain ontology

Related work (2/2) • IEPAD (Information Extraction based on Pattern Discovery) • An automatic method that uses sequence alignment to find patterns representing a set of data records • Wrapper induction • A wrapper is a program that extracts data from a Web site and puts it in a database • Learns extraction rules from manually labeled training examples

MDR Technique • Three steps: • Build an HTML tag tree of the page • Mine all data regions in the page using the two observations and an edit distance string-comparison algorithm • Identify data records from each data region
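The first step can be sketched in Python. This is a minimal illustration, not the authors' implementation: it builds a tag tree with the standard library `html.parser` module and ignores text and attributes. The names `TagNode`, `TagTreeBuilder`, `build_tag_tree`, and `tag_string` are hypothetical helpers introduced here.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not stay "open".
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TagNode:
    """One node of the tag tree (hypothetical helper class)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TagTreeBuilder(HTMLParser):
    """Builds a tag tree from start/end tag events; text is ignored."""

    def __init__(self):
        super().__init__()
        self.root = TagNode("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = TagNode(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray closers.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent


def build_tag_tree(html):
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def tag_string(node):
    """Flatten a subtree into a preorder list of tag names."""
    tags = [node.tag]
    for child in node.children:
        tags.extend(tag_string(child))
    return tags
```

The flattened tag string of a subtree is what a later step could compare with edit distance; real pages with badly nested HTML would need more careful recovery than this sketch provides.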

Mining data regions (1/2) • First, mine generalized nodes • A sequence of adjacent generalized nodes forms a data region (p. 16) • Node pairs (14, 15), (16, 17), and (18, 19) are generalized nodes of length 2 • Nodes 5 and 6 are generalized nodes of length 1 • Nodes 8, 9, and 10 are generalized nodes of length 1

Mining data regions (2/2) • A data region contains two or more generalized nodes with these properties: • They have the same parent • They have the same length (the same number of tag nodes in the tag tree) • They are adjacent • The normalized edit distance between them is less than a fixed threshold
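The similarity test in the last property can be sketched as follows. The slide does not say how the edit distance is normalized; dividing by the mean length of the two tag strings is an assumption made here for illustration. The default threshold of 0.3 is the value given in the experiment slides, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (here, lists of tag names)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_edit_distance(a, b):
    """Edit distance divided by the mean length (normalization is an assumption)."""
    if not a and not b:
        return 0.0
    return edit_distance(a, b) / ((len(a) + len(b)) / 2)


def is_similar(a, b, threshold=0.3):
    """True when two tag strings are close enough to belong to one data region."""
    return normalized_edit_distance(a, b) < threshold
```

For example, `["tr", "td", "td", "td"]` and `["tr", "td", "td"]` differ by one insertion over a mean length of 3.5, which is below 0.3, so the two rows would count as similar under these assumptions.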

Comparing generalized nodes (1/7) • The mining algorithm must answer two questions: • Q1. Where does the first generalized node of a data region start? • Q2. How many tag nodes (components) are in the generalized nodes of each data region?

Comparing generalized nodes (2/7) • K: the maximum number of tag nodes in a generalized node. K is small (less than 10) • Answer 1: try to find a data region starting from each tag node in turn • Answer 2: try 1-node, 2-node, …, K-node combinations

Comparing generalized nodes (3/7) • The number of comparisons is not large, for two reasons: • Only the child nodes of the same parent node are compared. E.g., in figure 2 there is no need to compare node 8 and node 13 • Some comparisons performed for earlier nodes are the same as those for later nodes, so there is no need to do them twice

Comparing generalized nodes (4/7) • Figure 3 shows 10 nodes below a parent node p • A generalized node can have at most three components, K=3

Comparing generalized nodes (5/7) • Starting from Node 1, we compute these comparisons: • (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (7, 8), (8, 9), (9, 10) • (1-2, 3-4), (3-4, 5-6), (5-6, 7-8), (7-8, 9-10) • (1-2-3, 4-5-6), (4-5-6, 7-8-9) • Starting from Node 2, we compute only: • (2-3, 4-5), (4-5, 6-7), (6-7, 8-9) • (2-3-4, 5-6-7), (5-6-7, 8-9-10) • Starting from Node 3, we need only one string comparison: (3-4-5, 6-7-8) • There is no need to start from any node after Node 3, because K=3
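The enumeration above can be reproduced with a short sketch. It only lists which node ranges would be compared and performs no string comparison; `comb_comp_pairs` is a hypothetical name, and the loop bounds were chosen to match the listing for 10 nodes and K=3 rather than taken from the paper's pseudocode.

```python
def comb_comp_pairs(n, k_max):
    """List the comparisons made over n sibling nodes (numbered from 1).

    Each comparison is ((left_first, left_last), (right_first, right_last)).
    Start node i only tries combination sizes i..k_max, because smaller
    sizes were already covered when starting from an earlier node.
    """
    pairs = []
    for start in range(1, k_max + 1):
        for size in range(start, k_max + 1):
            left = start
            right = start + size
            # Slide along the siblings while a full right-hand block exists.
            while right + size - 1 <= n:
                pairs.append(((left, left + size - 1),
                              (right, right + size - 1)))
                left = right
                right += size
    return pairs


if __name__ == "__main__":
    for (a, b), (c, d) in comb_comp_pairs(10, 3):
        print(f"({a}-{b}, {c}-{d})")
```

For 10 nodes and K=3 this yields 21 comparisons in total: 15 starting from Node 1, 5 from Node 2, and 1 from Node 3.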

Comparing generalized nodes (6/7) • The algorithm does not search for data regions if the depth of the subtree rooted at Node is 1 or 2
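A small sketch of this depth check follows. It assumes a tree written as nested `(tag, children)` tuples and counts a leaf as depth 1; both conventions, and the names `tree_depth` and `searchable_nodes`, are illustrative assumptions rather than the authors' code.

```python
def tree_depth(node):
    """Depth of the subtree rooted at node; a leaf counts as depth 1 (assumption)."""
    _tag, children = node
    return 1 + max((tree_depth(child) for child in children), default=0)


def searchable_nodes(node, min_depth=3):
    """Tags of the nodes under which data regions would be searched.

    Subtrees of depth 1 or 2 are skipped entirely, as the slide states.
    """
    tag, children = node
    found = []
    if tree_depth(node) >= min_depth:
        found.append(tag)
        for child in children:
            found.extend(searchable_nodes(child, min_depth))
    return found


# Hypothetical example tree: a table with two rows.
example = ("table", [("tr", [("td", []), ("td", [])]),
                     ("tr", [("td", [])])])
```

In the example, only the `table` node is deep enough to be searched; each `tr` subtree has depth 2 and is skipped.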

Comparing generalized nodes (7/7) • The total number of nodes in the tag tree is N • Without considering string comparison, the complexity of CombComp is O(NK) • Because K is relatively small, the CombComp algorithm is effectively linear in N

Determining data regions (1/2) • Procedure FindDRs reports: • The entire area as a data region • Each row as a generalized node • The region contains eight data records

Determining data regions (2/2) • Two main issues affect the final decisions • If a lower-level data region lies within a higher-level data region, we report the higher-level data region • Within a data region, we report only the smallest generalized nodes

Identifying data records (1/5) • Data region → Generalized node → Data record (object) • A generalized node might contain one or more data records

Identifying data records (2/5) • Noncontiguous object description • HTML code: Name 1, Name 2, Description 1, Description 2, Name 3, Name 4, Description 3, Description 4

Identifying data records (3/5) • Finding noncontiguous data records • Group the corresponding children of Node 1 and Node 2 • Join Node 5 and Node 7 to form one data record • Join Node 6 and Node 8 to form another
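The joining step can be sketched as pairing the i-th child of every generalized node in a data region. The function name and the example labels (taken from the Name/Description layout on the previous slide) are illustrative assumptions, and the sketch assumes every generalized node has the same number of children.

```python
def join_noncontiguous(generalized_nodes):
    """Form data records by grouping corresponding children.

    generalized_nodes: a list of child lists, one per generalized node.
    Record i collects the i-th child of every generalized node.
    """
    lengths = {len(children) for children in generalized_nodes}
    if len(lengths) > 1:
        raise ValueError("generalized nodes must have the same number of children")
    return [list(parts) for parts in zip(*generalized_nodes)]


# Hypothetical example: Node 1 holds the names, Node 2 the descriptions.
node_1 = ["Name 1", "Name 2"]
node_2 = ["Description 1", "Description 2"]
records = join_noncontiguous([node_1, node_2])
# records == [["Name 1", "Description 1"], ["Name 2", "Description 2"]]
```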

Identifying data records (4/5) • A data record may not be in any data region • Rows 1, 2, and 3 are at the same level; rows 1 and 2 (two generalized nodes) form a data region • Object 5 is not covered by a data region

Identifying data records (5/5) • Finding Object 5 when a table holds an odd number of objects (table and HTML tag tree) • Use Object 4 (or any of the four objects) to match the tag string of each child of the sibling nodes of r1 and r2

Experiment results (1/2) • Evaluate MDR and compare it with Omini and IEPAD • MDR was implemented and debugged using pages from the Amazon, Yahoo, and Hewlett-Packard Web sites • The default edit distance threshold is 0.3, with no tuning for new pages or Web sites

Experiment results (2/2) • Use standard precision and recall measures • Omini and IEPAD only work well on simple pages • That is, pages with many similar data records and little noise
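For reference, the standard precision and recall measures can be computed as below. The record identifiers in the example are made up for illustration and are not results from the paper.

```python
def precision_recall(extracted, actual):
    """Standard precision and recall over sets of data records.

    precision = correct extracted records / all extracted records
    recall    = correct extracted records / all actual records
    """
    extracted, actual = set(extracted), set(actual)
    correct = len(extracted & actual)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical page: 5 true records, 4 extracted, 3 of them correct.
p, r = precision_recall({"r2", "r3", "r4", "x1"},
                        {"r2", "r3", "r4", "r5", "r6"})
# p is 0.75 and r is 0.6
```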

Current & future work • Currently, two practical applications: • Extracting consumer product reviews from online merchant sites • A more effective technique for extracting individual data fields from data records • Future work • Study the problem of extracting information from text documents, which are much less structured than HTML
