
Paper 37: Mining Web Pages for Data Records (MDR)

Paper 37: Mining Web Pages for Data Records (MDR). Liu, Bing; Grossman, Robert; Zhai, Yanhong. University of Illinois at Chicago. IEEE Intelligent Systems, Volume: 19, Issue: 6, Pages: 49-55, 2004/11/1. Professors: 陳彥良, 許秉瑜. Presented by: 狄宇昌. 2006 Data Mining.



  1. Paper 37: Mining Web Pages for Data Records (MDR) • Liu, Bing; Grossman, Robert; Zhai, Yanhong • University of Illinois at Chicago • IEEE Intelligent Systems, Volume: 19, Issue: 6, Pages: 49-55, 2004/11/1 • Professors: 陳彥良, 許秉瑜 • Presented by: 狄宇昌 • 2006 Data Mining

  2. Outline • Introduction • Related work (MDR, Omini, IEPAD, Wrapper) • Mining data regions • Comparing generalized nodes (CombComp) • Determining data regions (FindDRs) • Identifying data records (FindRecords) • Experiment results

  3. Introduction (1/5) • Extracting information from Web pages helps provide value-added services, such as: • Customizable Web information gathering • Comparative shopping • Metasearching • MDR (mining data records) exploits: • Web page structure • A string-matching algorithm • MDR mines both contiguous and noncontiguous data records

  4. Introduction (2/5) • Current approach 1: supervised learning • Requires substantial human effort • Current approach 2: automatic techniques, which perform poorly • They assume that the relevant items lie in a contiguous segment of the Web page • Few researchers have exploited the nested structure of HTML tags

  5. Introduction (3/5) • MDR (mining data records) • An automatic technique that finds all data records formed by table- and form-related HTML tags • Such as table, form, tr, td, and so on • MDR outperformed the existing systems it was compared with (Omini and IEPAD)

  6. Introduction (4/5) • MDR is based on two observations about Web page layout • Observation one: • Similar objects appear in a contiguous region of a page • Data regions are formatted with similar HTML tags

  7. Introduction (5/5) • Observation two: • A tag tree represents the nested structure of the HTML tags in a Web page • The data records in a specific region sit under one parent node • As in figure 1b, each notebook is wrapped in 5 tr nodes under the same parent node, tbody

  8. Related work (1/2) • Researchers have developed several approaches for mining data records from Web pages • Omini (Object Mining and Extraction system) • Uses a set of heuristics and a manually constructed domain ontology

  9. Related work (2/2) • IEPAD (Information Extraction based on Pattern Discovery) • An automatic method that uses sequence alignment to find patterns representing a set of data records • Wrapper induction • A wrapper is a program that extracts data from a Web site and puts it into a database • Learns extraction rules from manually labeled training examples

  10. MDR Technique • Three steps: • Build an HTML tag tree of the page • Mine all data regions in the page, using the two observations and an edit-distance string comparison algorithm • Identify the data records in each data region
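
The first step above (building the tag tree) can be sketched with Python's standard `html.parser` module. This is only an illustrative sketch, not the authors' implementation: the names `TagNode`, `TagTreeBuilder`, `build_tag_tree`, and `tag_string` are hypothetical, the list of void tags is a small assumed subset, and real pages with badly nested HTML would need more careful handling.

```python
from html.parser import HTMLParser


class TagNode:
    """One node of the tag tree (hypothetical helper class)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TagTreeBuilder(HTMLParser):
    # Assumed subset of tags that never have a closing tag.
    VOID = {"br", "hr", "img", "input", "meta", "link"}

    def __init__(self):
        super().__init__()
        self.root = TagNode("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = TagNode(tag, self.current)
        self.current.children.append(node)
        if tag not in self.VOID:
            self.current = node  # descend into the new element

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag, then close it.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent


def build_tag_tree(html):
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def tag_string(node):
    """Flatten a subtree into its sequence of tag names (pre-order)."""
    return [node.tag] + [t for child in node.children for t in tag_string(child)]
```

The flattened tag strings are what a later step would compare with edit distance; text content is ignored on purpose, since the slides describe the comparison as operating on HTML tags.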

  11. Mining data regions (1/2) • First, mine generalized nodes • A sequence of adjacent generalized nodes forms a data region • Example (p.16): • The node pairs (14, 15), (16, 17), and (18, 19) are generalized nodes of length 2 • Nodes 5 and 6 are generalized nodes of length 1 • Nodes 8, 9, and 10 are generalized nodes of length 1

  12. Mining data regions (2/2) • A data region contains two or more generalized nodes with these properties: • They have the same parent • They have the same length (the same number of tag nodes, all children of that parent in the tag tree) • They are adjacent • The normalized edit distance between them is less than a fixed threshold
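
The last property can be illustrated with a small sketch of a normalized edit distance over tag strings. The slide does not say how the distance is normalized; dividing by the mean length of the two strings is an assumption here, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences of tag names."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_edit_distance(a, b):
    """Edit distance divided by the mean length (assumed normalization)."""
    if not a and not b:
        return 0.0
    return edit_distance(a, b) / ((len(a) + len(b)) / 2)


def is_similar(a, b, threshold):
    """True when two tag strings are within the given threshold."""
    return normalized_edit_distance(a, b) < threshold
```

For example, `["tr", "td", "td"]` and `["tr", "td", "td", "td"]` differ by one insertion, which gives a normalized distance of 1 / 3.5 under this assumed normalization.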

  13. Comparing generalized nodes (1/7) • The mining algorithm must answer two questions: • Q1. Where does the first generalized node of a data region start? • Q2. How many tag nodes (components) does each generalized node in a data region contain?

  14. Comparing generalized nodes (2/7) • K: the maximum number of tag nodes in a generalized node; K is small (less than 10) • Answer 1: try to find a data region starting from each tag node in turn • Answer 2: try 1-node, 2-node, …, K-node combinations

  15. Comparing generalized nodes (3/7) • The number of comparisons is not large, for two reasons: • Only the child nodes of the same parent node are compared. E.g., in figure 2 there is no need to compare node 8 and node 13 • Some comparisons made when starting from earlier nodes are the same as those needed for later nodes, so they need not be done twice

  16. Comparing generalized nodes (4/7) • Figure 3 has 10 nodes below a parent node p • A generalized node can have at most three components (K = 3)

  17. Comparing generalized nodes (5/7) • Starting from Node 1, we compute these comparisons: • (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (7, 8), (8, 9), (9, 10) • (1-2, 3-4), (3-4, 5-6), (5-6, 7-8), (7-8, 9-10) • (1-2-3, 4-5-6), (4-5-6, 7-8-9) • Starting from Node 2, we compute only: • (2-3, 4-5), (4-5, 6-7), (6-7, 8-9) • (2-3-4, 5-6-7), (5-6-7, 8-9-10) • Starting from Node 3, we need only one string comparison: (3-4-5, 6-7-8) • There is no need to start from any node after Node 3, because K = 3
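
The enumeration above can be reproduced with a short sketch. It only lists which node combinations would be compared, in the order shown on the slide; the actual string comparison is left out. The function name `comb_comp_pairs` and its parameters are hypothetical, and the loop bounds are inferred from the example rather than copied from the paper's pseudocode.

```python
def comb_comp_pairs(n, k_max):
    """List the (left, right) node combinations to compare.

    n      -- number of child nodes under one parent (numbered from 1)
    k_max  -- maximum number of components in a generalized node (K)
    """
    pairs = []
    for i in range(1, k_max + 1):        # start node
        for j in range(i, k_max + 1):    # combination size; sizes < i were
                                         # already covered by earlier starts
            if i + 2 * j - 1 > n:        # need room for at least two groups
                continue
            start = i
            for k in range(i + j, n + 1, j):
                if k + j - 1 <= n:       # right-hand group must fit
                    left = tuple(range(start, k))
                    right = tuple(range(k, k + j))
                    pairs.append((left, right))
                    start = k
    return pairs


if __name__ == "__main__":
    for left, right in comb_comp_pairs(10, 3):
        print("-".join(map(str, left)), "vs", "-".join(map(str, right)))
```

With 10 nodes and K = 3 this yields the 21 comparisons listed on the slide: 15 starting from Node 1, 5 from Node 2, and 1 from Node 3.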

  18. Comparing generalized nodes (6/7) • The algorithm does not search for data regions under a node if the depth of the subtree rooted at that node is 1 or 2

  19. Comparing generalized nodes (7/7) • N is the total number of nodes in the tag tree • Without counting the cost of string comparison, the complexity of CombComp is O(NK) • Because K is relatively small, the CombComp algorithm is linear in N

  20. Determining data regions (1/2) • Procedure FindDRs reports: • The entire area as a data region • Each row as a generalized node • The region contains eight data records

  21. Determining data regions (2/2) • Two main issues affect the final decision • If a lower-level data region lies within a higher-level data region, we report the higher-level data region • Within a data region, we report only the smallest generalized nodes

  22. Identifying data records (1/5) • Data region → Generalized node → Data record (object) • A generalized node might contain one or more data records

  23. Identifying data records (2/5) • Noncontiguous object descriptions • Order in the HTML code: Name 1, Name 2, Description 1, Description 2, Name 3, Name 4, Description 3, Description 4

  24. Identifying data records (3/5) • Finding noncontiguous data records • Group the corresponding children of Node 1 and Node 2 • Join Node 5 and Node 7 to form one data record • Join Node 6 and Node 8 to form another
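
The grouping step above can be sketched as a simple positional join. This is an illustration under stated assumptions, not the paper's FindRecords procedure: it assumes each component of the generalized node has the same number of children, that Node 1 holds children 5 and 6 and Node 2 holds children 7 and 8, and the function name is hypothetical.

```python
def join_noncontiguous(components):
    """Pair up the i-th child of every component into one data record.

    components -- list of components of one generalized node, each given
                  as the list of its children (assumed equal in length)
    """
    return [list(group) for group in zip(*components)]


# Children of Node 1 (names) and Node 2 (descriptions), as assumed above.
node_1 = ["Node 5", "Node 6"]
node_2 = ["Node 7", "Node 8"]
records = join_noncontiguous([node_1, node_2])
# records == [["Node 5", "Node 7"], ["Node 6", "Node 8"]]
```

Note that `zip` silently drops unmatched children when the components differ in length, so a real implementation would need to check that case explicitly.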

  25. Identifying data records (4/5) • A data record that is not in any data region • Rows 1, 2, and 3 are at the same level; rows 1 and 2 (two generalized nodes) form a data region • Object 5 is not covered by a data region

  26. Identifying data records (5/5) • Finding Object 5: an odd number of objects in a table and its HTML tag tree • Use Object 4 (or any of the four objects) to match the tag string of each child of the sibling nodes of r1 and r2

  27. Experiment result (1/2) • Evaluate MDR and compare it with Omini and IEPAD • MDR was implemented and debugged using pages from the Amazon, Yahoo, and Hewlett-Packard Web sites • The default edit distance threshold is 0.3, with no tuning for new pages or Web sites

  28. Experiment result (2/2) • Uses the standard precision and recall measures • Omini and IEPAD only work well on simple pages • That is, pages with many similar data records and little noise
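
For reference, the standard precision and recall measures can be sketched as follows. The record identifiers are made up for illustration and are not results from the paper; the function name is hypothetical.

```python
def precision_recall(extracted, correct):
    """Return (precision, recall) for a set of extracted data records.

    precision = correctly extracted / all extracted
    recall    = correctly extracted / all records actually on the page
    """
    extracted, correct = set(extracted), set(correct)
    true_positives = len(extracted & correct)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(correct) if correct else 0.0
    return precision, recall


# Illustrative data only: 4 records extracted, 3 of them correct,
# out of 5 records actually present on the page.
p, r = precision_recall({"r1", "r2", "r3", "x"}, {"r1", "r2", "r3", "r4", "r5"})
# p == 0.75, r == 0.6
```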

  29. Current & future work • Currently, two practical applications • Extracting consumer product reviews from online merchant sites • A more effective technique for extracting individual data fields from data records • Future work • Study the problem of extracting information from text documents, which are much less structured than HTML
