Web Data Extraction



  1. Web Data Extraction Aki Hecht Seminar in Databases (236826) January 2009

  2. Agenda • Introduction • Building Tag Trees • Mining Data Regions • Partial Tree Alignment • Extraction Given Multiple Pages

  3. Introduction • An enormous amount of data is stored in open databases. • Most of these databases return web pages containing structured data objects. • Usually “Deep Web” pages • Crawling those pages is a nontrivial task • The data is important and useful for many applications: • Price comparison engines • Collecting information about individuals

  4. The goal Given an HTML page containing multiple data records, insert the data into a table. • No assumptions are allowed about the number of data records in the page or about their structure/content. • The extraction should be done automatically. • Human intervention can yield more accurate results, but its cost is too high.

  5. Example 1

  6. Example 2 More than one data region!

  7. General idea • Given a Web page: • Build the HTML tag tree • Mine data regions • Mining data records directly is hard • Identify data records from each data region • Learn the structure of a general data record • A data record can contain optional fields • Extract the data

  8. Agenda • Introduction • Building Tag Trees • Mining Data Regions • Partial Tree Alignment • Extraction Given Multiple Pages

  9. Building a tag tree • Most HTML tags work in pairs. Within each corresponding tag pair there can be other pairs of tags, resulting in a nested structure. • Some tags do not require closing tags (e.g., <li> and <p>) although closing tags exist for them; void elements such as <hr> never take a closing tag. • Additional closing tags need to be inserted to ensure all tags are balanced. • Building a tag tree from a page using its HTML code is thus natural.
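
The balancing step can be sketched with Python's standard `html.parser`. This is a minimal sketch, not the presenter's implementation: the `VOID_TAGS` and `IMPLICIT_CLOSE` tables are small illustrative assumptions covering only a few tags, and the names `Node`, `TagTreeBuilder` and `build_tag_tree` are hypothetical.

```python
from html.parser import HTMLParser

# Assumption: a small illustrative subset of HTML rules, not the full specification.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}
# A new <li> implicitly closes an open <li>; a new <p> closes an open <p>.
IMPLICIT_CLOSE = {"li": {"li"}, "p": {"p"}}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []

    def to_tuple(self):
        """Return the subtree as nested (tag, children) tuples."""
        return (self.tag, [c.to_tuple() for c in self.children])


class TagTreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("root")  # artificial root above all page elements
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        # Act as if the missing closing tag had been inserted.
        if self.current.tag in IMPLICIT_CLOSE.get(tag, set()):
            self.current = self.current.parent
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node  # void elements never become the open element

    def handle_endtag(self, tag):
        # Close everything up to the nearest open element with this tag;
        # stray closing tags that match nothing are ignored.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent


def build_tag_tree(html):
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root
```

For example, `build_tag_tree("<ul><li>a<li>b</ul>")` places both `li` nodes as siblings under `ul`, even though neither `<li>` was closed in the source.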

  10. An example

  11. The tag tree

  12. Building trees using visual cues • The HTML code can contain errors. • Browsers are sophisticated enough to display pages with HTML errors. • We can build the tag tree using the browser’s rendering mechanism. • Each HTML element is rendered as a rectangle. • Containment of one rectangle in another represents nesting.
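
The containment idea can be sketched as follows. The sketch assumes the rendering engine reports each element as a `(tag, rect)` pair in document order, with `rect = (left, top, right, bottom)`; that input format and the function names are assumptions made for illustration, not a real browser API.

```python
def contains(outer, inner):
    """True if rectangle `outer` fully contains `inner`.
    Rectangles are (left, top, right, bottom) tuples."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def build_tree_from_boxes(elements):
    """Build nested (tag, children) tuples from rendered boxes.

    elements: list of (tag, rect) in document order (hypothetical input format).
    """
    root = ("root", [])
    stack = [(root, None)]  # (node, rect); rect None means "contains everything"
    for tag, rect in elements:
        # Pop open elements whose rectangle does not contain the new one.
        while stack[-1][1] is not None and not contains(stack[-1][1], rect):
            stack.pop()
        node = (tag, [])
        stack[-1][0][1].append(node)  # attach to the innermost containing box
        stack.append((node, rect))
    return root
```

One limitation of this simplification: two siblings with identical rectangles would be nested rather than placed side by side, so a real system would need an extra tie-breaking rule.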

  13. An example

  14. Agenda • Introduction • Building Tag Trees • Mining Data Regions • Partial Tree Alignment • Extraction Given Multiple Pages

  15. Tree Edit Distance • The tree edit distance between two trees A and B is the cost of the cheapest set of operations needed to transform A into B. • Three operations are used to define tree edit distance: • node removal • node insertion • node replacement • A cost is assigned to each of the operations.

  16. Finding Tree Edit Distance • Tree edit distance is closely related to string edit distance. • It can be computed in a similar way. • This is done by finding the minimum-cost mapping between the two trees.
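
For reference, the string case can be sketched with the classic dynamic-programming table. This is a minimal sketch assuming unit cost for insertion, removal and replacement; the function name `edit_distance` is illustrative and does not come from the slides.

```python
def edit_distance(a, b):
    """Minimum number of insertions, removals and replacements turning a into b."""
    # dist[i][j] = edit distance between the prefixes a[:i] and b[:j]
    dist = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dist[i][0] = i  # remove all i characters
    for j in range(len(b) + 1):
        dist[0][j] = j  # insert all j characters
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            replace_cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,                 # removal
                dist[i][j - 1] + 1,                 # insertion
                dist[i - 1][j - 1] + replace_cost,  # replacement or match
            )
    return dist[len(a)][len(b)]
```

The same function works on any pair of sequences, so it can also compare two lists of tag names.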

  17. Finding Tree Edit Distance cont. • The algorithm for finding the minimum-cost mapping follows the same scheme for trees as for strings. • It is based on dynamic programming.
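
One restricted form of tree matching that fits this dynamic-programming scheme is Simple Tree Matching, which maps nodes top-down and counts matched node pairs instead of summing edit costs. The sketch below is a stand-in under that assumption and is not necessarily the exact distance used in the slides; trees are assumed to be `(tag, children)` tuples, and the `similarity` normalisation is one common choice among several.

```python
def simple_tree_matching(a, b):
    """Size of the largest top-down matching between trees a and b.
    Trees are (tag, children) tuples."""
    if a[0] != b[0]:
        return 0  # different roots: nothing below them can be matched
    ca, cb = a[1], b[1]
    # m[i][j] = best matching between the first i children of a
    # and the first j children of b, keeping sibling order
    m = [[0] * (len(cb) + 1) for _ in range(len(ca) + 1)]
    for i in range(1, len(ca) + 1):
        for j in range(1, len(cb) + 1):
            m[i][j] = max(
                m[i - 1][j],
                m[i][j - 1],
                m[i - 1][j - 1] + simple_tree_matching(ca[i - 1], cb[j - 1]),
            )
    return m[len(ca)][len(cb)] + 1  # +1 for the matched roots


def size(tree):
    """Number of nodes in the tree."""
    return 1 + sum(size(child) for child in tree[1])


def similarity(a, b):
    """Matched pairs divided by the average tree size (assumed normalisation)."""
    return simple_tree_matching(a, b) / ((size(a) + size(b)) / 2)
```

A similarity of 1.0 means the two trees are identical in structure; values near 0 mean almost nothing could be mapped.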

  18. Mining Data Regions • Definition: A generalized node of length r consists of r (r ≥ 1) nodes in the tag tree with the following two properties: • the nodes all have the same parent. • the nodes are adjacent. • Definition: A data region is a collection of two or more generalized nodes with the following properties: • the generalized nodes all have the same parent. • the generalized nodes all have the same length. • the generalized nodes are all adjacent. • the similarity between adjacent generalized nodes is greater than a fixed threshold.
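
The two definitions can be turned into a simple scan over the children of one parent. This is a simplified sketch, not the full mining procedure: it assumes each child subtree has already been flattened to a list of tag names, it tries a single generalized-node length `r` per call, and it swaps in `difflib.SequenceMatcher.ratio()` as the similarity measure in place of tree edit distance. The function names and the default threshold of 0.7 are assumptions.

```python
from difflib import SequenceMatcher


def generalized_node(children, start, r):
    """Concatenate the tag sequences of r adjacent children into one sequence."""
    tags = []
    for child in children[start:start + r]:
        tags.extend(child)
    return tags


def find_data_regions(children, r=1, threshold=0.7):
    """Return (start_index, number_of_generalized_nodes) for each data region.

    children: the children of one parent, each given as a list of tag names.
    """
    regions = []
    n = len(children)
    i = 0
    while i + 2 * r <= n:
        count, j = 1, i
        # Extend the region while the next adjacent generalized node is similar.
        while j + 2 * r <= n:
            left = generalized_node(children, j, r)
            right = generalized_node(children, j + r, r)
            if SequenceMatcher(None, left, right).ratio() <= threshold:
                break
            count += 1
            j += r
        if count >= 2:  # a region needs at least two generalized nodes
            regions.append((i, count))
            i = j + r
        else:
            i += 1
    return regions
```

A fuller version would repeat the scan for several values of `r` and for every parent node in the tag tree, then keep the regions that cover the most nodes.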

  19. An Example The regions were found using tree edit distance. For example, nodes 5 and 6 are similar (low-cost mapping) and also satisfy the definition above, so they form a data region. [Figure: tag tree with nodes numbered 1 to 19 and three marked data regions: Region 1, Region 2 and Region 3]

  20. Agenda • Introduction • Building Tag Trees • Mining Data Regions • Partial Tree Alignment • Extraction Given Multiple Pages

  21. Partial Tree Alignment • For each data region found, we need to understand the structure of the data records in the region. • Not all data records contain the same fields (optional fields are possible). • We will use (partial) tree alignment to recover the structure.

  22. The algorithm • Choose a seed tree: • A seed tree, denoted by Ts, is the tree with the maximum number of data items. • Tree matching: • For each unmatched tree Ti (i ≠ s), • match Ts and Ti. • Each pair of matched nodes is linked (aligned). • For each unmatched node nj in Ti do • expand Ts by inserting nj into Ts if a position for insertion can be uniquely determined in Ts. The expanded seed tree Ts is then used in subsequent matching.
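
The steps above can be sketched on a flattened version of the problem. This is a deliberate simplification and not the full algorithm: each data record is assumed to be a flat list of node labels (one tree level), labels are matched with `difflib.SequenceMatcher` rather than a tree matcher, and an unmatched run is inserted only when its matched neighbours are adjacent in the seed. All function names are hypothetical.

```python
from difflib import SequenceMatcher


def match(seed, tree):
    """Return {index in tree: index in seed} for the aligned labels."""
    blocks = SequenceMatcher(None, seed, tree, autojunk=False).get_matching_blocks()
    return {b + k: a + k for a, b, size in blocks for k in range(size)}


def try_expand(seed, tree):
    """Insert unmatched labels of `tree` into `seed` where the position is unique.
    Returns (new_seed, fully_aligned)."""
    matched = match(seed, tree)
    new_seed, offset, fully_aligned = list(seed), 0, True
    j = 0
    while j < len(tree):
        if j in matched:
            j += 1
            continue
        start = j
        while j < len(tree) and j not in matched:
            j += 1  # collect a maximal run of unmatched labels
        left = matched[start - 1] if start > 0 else -1
        right = matched[j] if j < len(tree) else len(seed)
        if right == left + 1:  # neighbours adjacent in the seed: unique position
            pos = left + 1 + offset
            new_seed[pos:pos] = tree[start:j]
            offset += j - start
        else:
            fully_aligned = False  # ambiguous position: retry after the seed grows
    return new_seed, fully_aligned


def partial_tree_alignment(trees):
    """Grow a seed until no pending tree can contribute new labels."""
    trees = sorted(trees, key=len, reverse=True)  # seed = record with most items
    seed, pending = list(trees[0]), trees[1:]
    progress = True
    while pending and progress:
        progress, still_pending = False, []
        for tree in pending:
            new_seed, done = try_expand(seed, tree)
            if new_seed != seed:
                seed, progress = new_seed, True
            if not done:
                still_pending.append(tree)
        pending = still_pending
    return seed
```

Trees whose unmatched nodes cannot be placed yet stay in the pending list and are matched again once the seed has been expanded by other trees, which mirrors the re-matching step of the algorithm.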

  23. Partial Tree Alignment of two trees [Figure: two cases of aligning Ti with Ts. In the first, insertion is possible and the unmatched nodes of Ti become a new part of Ts. In the second, insertion is not possible.]

  24. Full algorithm

  25. A complete example [Figure: alignment of three trees T1, T2 and T3 with Ts = T1. The steps are labelled, in order: no node inserted, new Ts, T2 is matched again.]

  26. Output data table Different data records contain different fields!

  27. Agenda • Introduction • Building Tag Trees • Mining Data Regions • Partial Tree Alignment • Extraction Given Multiple Pages

  28. Extraction given multiple pages • The described technique is good for a single list page. • It can clearly be used for multiple list pages. • Templates from all input pages may be found separately and merged to produce a single refined pattern. • Extraction results will get more accurate. • In many applications, one needs to extract the data from the detail pages as they contain more information on the object.

  29. Detail pages – an example [Figure: a list page and its detail pages; the detail pages contain more data]

  30. Extraction from detail pages • For extraction, we can treat each detail page as a data record, then extract using partial tree alignment. • For instance, to apply the algorithm, we simply create a rooted tree as follows: • create an artificial root node, and • make the tag tree of each page as a child sub-tree of the artificial root node.
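
The construction is small enough to show directly. This is a minimal sketch assuming each page's tag tree is a `(tag, children)` tuple; the function name and the `"root"` label are illustrative.

```python
def combine_under_artificial_root(page_trees):
    """Make the tag tree of each detail page a child sub-tree of one artificial root."""
    return ("root", list(page_trees))


# Two toy detail pages (assumed structure, for illustration only).
page1 = ("html", [("body", [("h1", []), ("p", [])])])
page2 = ("html", [("body", [("h1", []), ("p", []), ("p", [])])])
combined = combine_under_artificial_root([page1, page2])
```

In the combined tree, the children of the artificial root play the role that data records play inside a data region of a list page.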

  31. An example [Figure: an artificial root node r with the tag tree of each detail page as a child sub-tree] We already know how to extract data from a data region.

  32. Difficulty with detail pages • Although a detail page focuses on a single object, the page may contain a large amount of “noise” at the top, on the left and right, and at the bottom. • This happens mostly on commercial websites. • Since we treat each page as a data record, the algorithm will also extract the “noise”.

  33. An example (a lot of noise)

  34. The solution • To start, a sample page is taken as the wrapper. • The wrapper is then refined by solving mismatches between the wrapper and each sample page, which generalizes the wrapper. • A mismatch occurs when some token in the sample does not match the grammar of the wrapper.

  35. Wrapper generalization • Different types of mismatches: • Text string mismatches: indicate data fields (or items). • Tag mismatches: indicate list of repeated patterns or optional elements. • Find the last token of the mismatch position and identify some candidate repeated patterns from the wrapper and sample by searching forward.
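
The text-string case can be sketched as a token-by-token comparison. This is a partial sketch under stated assumptions: it resolves only string mismatches by turning them into a `#PCDATA` field, and it merely reports the position of the first tag mismatch instead of searching for repeated or optional patterns. The tokenizer regex, the `FIELD` marker and the function names are all assumptions.

```python
import re

FIELD = "#PCDATA"  # marker for a data field in the wrapper (assumed notation)


def tokenize(html):
    """Split HTML into tag tokens and text tokens, dropping whitespace-only text."""
    return [t.strip() for t in re.findall(r"<[^>]+>|[^<]+", html) if t.strip()]


def is_tag(token):
    return token.startswith("<") and token.endswith(">")


def generalize(wrapper, sample):
    """Refine `wrapper` against `sample`.

    Returns (new_wrapper, mismatch_pos). mismatch_pos is None when only string
    mismatches occurred; otherwise it is the index of the first tag mismatch,
    which this sketch leaves unresolved.
    """
    new_wrapper = []
    for pos, (w, s) in enumerate(zip(wrapper, sample)):
        if w == s:
            new_wrapper.append(w)
        elif not is_tag(w) and not is_tag(s):
            new_wrapper.append(FIELD)  # string mismatch: this position is a data field
        else:
            return new_wrapper + list(wrapper[pos:]), pos  # tag mismatch
    if len(wrapper) != len(sample):
        pos = len(new_wrapper)  # one page has extra tokens at the end
        return new_wrapper + list(wrapper[pos:]), pos
    return new_wrapper, None
```

For instance, comparing `<b>John</b>` with `<b>Mary</b>` yields the wrapper `<b>`, `#PCDATA`, `</b>`, while a sample that introduces `<ul>` where the wrapper has `<i>` is reported as a tag mismatch at that position.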

  36. An example

  37. Summary • Automatic extraction of data from a web page requires understanding the structure of the data records. • The first step is finding the data records in the page. • The second step is merging the different structures and building a generic template for a data record. • Partial tree alignment is one method for building the template.

  38. Summary cont. Automatic extraction • Advantages: • It scales to a huge number of sites because the process is automatic. • Disadvantages: • It may extract a large amount of unwanted data because the system does not know what is interesting to the user. Domain heuristics or manual filtering may be needed to remove unwanted data. • Data extracted from multiple sites needs integration, i.e., the schemas need to be matched.

  39. Thank you! Questions?

  40. Bibliography • Y. Zhai, B. Liu, “Web data extraction based on partial tree alignment,” International World Wide Web Conference (2005) • Y. Zhai, B. Liu, “Structured data extraction from the web based on partial tree alignment,” IEEE Transactions on Knowledge and Data Engineering (2006) • D. C. Reis, P. B. Golgher, A. S. Silva, A. F. Laender, “Automatic web news extraction using tree edit distance,” Proceedings of the 13th International Conference on World Wide Web (2004)
