
Annotation Free Information Extraction

Annotation Free Information Extraction. Chia-Hui Chang Department of Computer Science & Information Engineering National Central University chia@csie.ncu.edu.tw 10/4/2002. IEPAD: Information Extraction based on Pattern Discovery. C.H. Chang. National Central University WWW10.


Presentation Transcript


  1. Annotation Free Information Extraction Chia-Hui Chang Department of Computer Science & Information Engineering National Central University chia@csie.ncu.edu.tw 10/4/2002

  2. IEPAD: Information Extraction based on Pattern Discovery C.H. Chang. National Central University WWW10

  3. Semi-structured Information Extraction • Information Extraction (IE) • Input: HTML pages • Output: A set of records

  4. Pattern Discovery based IE • Motivation • Display of multiple records often forms a repeated pattern • The occurrences of the pattern are spaced regularly and adjacently • Now the problem becomes ... • Find regular and adjacent repeats in a string

  5. IEPAD Architecture • Components: Pattern Generator, Pattern Viewer, Extractor • Items shown flowing between the components (diagram): HTML page, patterns, users, extraction rule, HTML pages, extraction results

  6. The Pattern Generator • Translator • PAT tree construction • Pattern validator • Rule Composer • Pipeline (diagram): HTML page → Token Translator → token string → PAT Tree Constructor → PAT trees and maximal repeats → Validator → advanced patterns → Rule Composer → extraction rules

  7. 1. Web Page Translation • Encoding of HTML source • Rule 1: Each tag is encoded as a token • Rule 2: Any text between two tags is translated to a special token called TEXT (denoted by an underscore) • HTML Example: <B>Congo</B><I>242</I><BR> <B>Egypt</B><I>20</I><BR> • Encoded token string: T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>) T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)
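As a rough illustration of the two encoding rules, the following Python sketch turns the slide's HTML example into a token list. The regular expression, the function name `encode`, and the choices to drop tag attributes and ignore whitespace-only text are assumptions made for this example; the slides do not specify them.

```python
import re

# Assumed tag syntax: "<", optional "/", a tag name, then anything up to ">".
TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")

def encode(html: str) -> list[str]:
    """Translate HTML into tokens: one token per tag, '_' per text run."""
    tokens = []
    pos = 0
    for m in TAG_RE.finditer(html):
        # Rule 2: non-blank text before this tag becomes one TEXT token.
        if html[pos:m.start()].strip():
            tokens.append("_")
        # Rule 1: the tag becomes a token (name upper-cased, attributes dropped).
        tokens.append(f"<{m.group(1)}{m.group(2).upper()}>")
        pos = m.end()
    # Trailing text after the last tag, if any.
    if html[pos:].strip():
        tokens.append("_")
    return tokens

if __name__ == "__main__":
    print(encode("<B>Congo</B><I>242</I><BR> <B>Egypt</B><I>20</I><BR>"))
```

Both records in the example produce the same seven-token sequence, which is what makes the repeat detectable in the later steps.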

  8. Various Encoding Schemes

  9. 2. PAT Tree Construction • PAT tree: binary suffix tree • A Patricia tree constructed over all possible suffix strings of a text • Example encoding: T(<B>) 000, T(</B>) 001, T(<I>) 010, T(</I>) 011, T(<BR>) 100, T(_) 110 • The token string T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>) T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>) is encoded as 000110001010110011100 000110001010110011100
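The binary encoding on this slide can be reproduced with a small lookup table. The sketch below is only an illustration: it uses the 3-bit codes listed on the slide, and the helper names `to_bits` and `sorted_suffixes` are hypothetical. It lists the suffixes that a PAT tree would index but does not build the tree itself.

```python
# 3-bit codes copied from the slide.
CODES = {
    "<B>": "000", "</B>": "001", "<I>": "010",
    "</I>": "011", "<BR>": "100", "_": "110",
}

def to_bits(tokens):
    """Concatenate the fixed-width binary code of every token."""
    return "".join(CODES[t] for t in tokens)

def sorted_suffixes(tokens):
    """Return (bit string, token offset) for each suffix starting on a
    token boundary, sorted lexicographically by bit string."""
    return sorted((to_bits(tokens[i:]), i) for i in range(len(tokens)))

if __name__ == "__main__":
    record = ["<B>", "_", "</B>", "<I>", "_", "</I>", "<BR>"]
    print(to_bits(record))
    for bits, offset in sorted_suffixes(record + record):
        print(offset, bits)
```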

  10. The Constructed PAT Tree

  11. Definition of Maximal Repeats • Let a occur in S at positions p1, p2, p3, …, pk • a is left maximal if there exists at least one (i, j) pair such that S[pi-1] ≠ S[pj-1] • a is right maximal if there exists at least one (i, j) pair such that S[pi+|a|] ≠ S[pj+|a|] • a is a maximal repeat if it is both left maximal and right maximal

  12. Finding Maximal Repeats • Definition: • Let’s call character S[pi-1] the left character of suffix pi • A node v is left diverse if at least two leaves in v’s subtree have different left characters • Lemma: • The path label of an internal node v in a PAT tree is a maximal repeat if and only if v is left diverse
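To make the definition concrete, here is a brute-force Python sketch that checks left and right maximality directly for every repeated substring. It is not the PAT-tree method from the slides, which avoids enumerating all substrings. The function name and the use of `None` as the neighbour at the two ends of the string are assumptions of this example.

```python
from collections import defaultdict

def maximal_repeats(s, min_len=1):
    """Return {substring (as a tuple): positions} for each maximal repeat of s."""
    n = len(s)
    result = {}
    for length in range(min_len, n):
        # Group the start positions of every substring of this length.
        occurrences = defaultdict(list)
        for i in range(n - length + 1):
            occurrences[tuple(s[i:i + length])].append(i)
        for sub, positions in occurrences.items():
            if len(positions) < 2:
                continue  # not a repeat
            # Neighbouring symbols; None stands for "outside the string".
            lefts = {s[p - 1] if p > 0 else None for p in positions}
            rights = {s[p + length] if p + length < n else None
                      for p in positions}
            # Maximal: some pair differs on the left and some pair on the right.
            if len(lefts) > 1 and len(rights) > 1:
                result[sub] = positions
    return result

if __name__ == "__main__":
    for sub, positions in maximal_repeats("adcwbdadcxbadcxbdadcb").items():
        print("".join(sub), positions)
```

The input works for any sequence, so a list of tokens can be passed instead of a character string.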

  13. 3. Pattern Validator • Suppose the occurrences of a maximal repeat a are ordered by position such that p1 < p2 < p3 < … < pk, where pi denotes the position of each suffix in the encoded token sequence. • Characteristics of a Pattern • Regularity: Variance coefficient • Adjacency: Density

  14. Pattern Validator (Cont.) • Basic screening: for each maximal repeat a, compute V(a) and D(a) • a) Check the pattern’s variance: V(a) < 0.5, otherwise discard • b) Check the pattern’s density: 0.25 < D(a) < 1.5, otherwise discard • A pattern that passes both checks is kept
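The slides name the two measures but do not define them. The sketch below assumes definitions commonly given for IEPAD: V is the standard deviation of the gaps between adjacent occurrences divided by their mean, and D = (k - 1) * |a| / (pk - p1). Both formulas and all function names are assumptions of this example; only the thresholds 0.5, 0.25 and 1.5 come from the slide.

```python
from statistics import mean, pstdev

def variance_coefficient(positions):
    """Assumed V: std. deviation of adjacent gaps divided by their mean."""
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    return pstdev(gaps) / mean(gaps)

def density(positions, pattern_len):
    """Assumed D: fraction of the span p1..pk covered by the occurrences."""
    k = len(positions)
    return (k - 1) * pattern_len / (positions[-1] - positions[0])

def passes_basic_screening(positions, pattern_len):
    """Apply the slide's thresholds: V < 0.5 and 0.25 < D < 1.5."""
    if len(positions) < 2:
        return False  # a single occurrence is not a repeat
    v = variance_coefficient(positions)
    d = density(positions, pattern_len)
    return v < 0.5 and 0.25 < d < 1.5

if __name__ == "__main__":
    # Occurrences of "adc" in "adcwbdadcxbadcxbdadcb".
    print(passes_basic_screening([0, 6, 11, 17], 3))
```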

  15. 4. Rule Composer • Occurrence partition • Flexible variance threshold control • Multiple string alignment • Increase density of a pattern

  16. Occurrence Partition • Problem • Some patterns are divided into several blocks • Ex: Lycos, Excite, where the variance coefficient is large • Solution • Clustering of the occurrences of such a pattern • Flow (diagram): pattern P → clustering → V(P) < 0.1? No: discard; Yes: check density

  17. Multiple String Alignment • Problem • Patterns with density less than 1 can extract only part of the information • Solution • Align k-1 substrings among the k occurrences • A natural generalization of alignment for two strings which can be solved in O(n*m) by dynamic programming where n and m are string lengths.
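The two-string case mentioned above can be sketched with the standard dynamic program over an (n+1) x (m+1) table. This example uses unit costs for substitution, insertion and deletion, which is an assumption; the slides do not state a scoring scheme, nor how the pairwise step is extended to k strings.

```python
def align(a: str, b: str, gap: str = "-") -> tuple[str, str]:
    """Globally align a and b with unit edit costs in O(n*m) time."""
    n, m = len(a), len(b)
    # cost[i][j] = minimum cost of aligning a[:i] with b[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j - 1] + (a[i - 1] != b[j - 1]),  # match/substitute
                cost[i - 1][j] + 1,                            # gap in b
                cost[i][j - 1] + 1,                            # gap in a
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and cost[i][j] == cost[i - 1][j - 1] + (a[i - 1] != b[j - 1])):
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            out_a.append(a[i - 1]); out_b.append(gap)
            i -= 1
        else:
            out_a.append(gap); out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))

if __name__ == "__main__":
    print(align("adcwbd", "adcxb"))
```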

  18. Multiple String Alignment (Cont.) • Suppose “adc” is the discovered pattern for the token string “adcwbdadcxbadcxbdadcb” • A multiple alignment of the strings “adcwbd”, “adcxb” and “adcxbd” gives the rows “a d c w b d”, “a d c x b -” and “a d c x b d” • The extraction pattern can then be generalized as “adc[w|x]b[d|-]”
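Once the rows are aligned, the generalized pattern can be read off column by column. The following sketch assumes the rows are already aligned and of equal length, and uses a hypothetical helper name `generalize`; it reproduces the notation used on the slide.

```python
def generalize(rows: list[str], gap: str = "-") -> str:
    """Merge aligned rows into one pattern, e.g. 'adc[w|x]b[d|-]'."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all aligned rows must have the same length")
    parts = []
    for column in zip(*rows):
        # Distinct symbols in first-seen order.
        symbols = list(dict.fromkeys(column))
        if len(symbols) == 1:
            parts.append(symbols[0])
        else:
            # Stable sort that moves the gap symbol to the end.
            symbols.sort(key=lambda c: c == gap)
            parts.append("[" + "|".join(symbols) + "]")
    return "".join(parts)

if __name__ == "__main__":
    print(generalize(["adcwbd", "adcxb-", "adcxbd"]))
```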

  19. Pattern Viewer • Java-application based GUI • Web based GUI • http://www.csie.ncu.edu.tw/~chia/WebIEPAD/

  20. The Extractor • Matching the pattern against the encoded token string • Knuth-Morris-Pratt’s algorithm • Boyer-Moore’s algorithm • Alternatives in a rule • Match the longest pattern • What is extracted? • The whole record
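As an illustration of the first matching option, here is a minimal Knuth-Morris-Pratt search in Python. It works on any sequence, so the text and pattern can be token lists as well as strings. It handles only a fixed pattern; alternatives such as “[w|x]” would need extra handling that the slides do not describe.

```python
def failure_table(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail

def kmp_find_all(text, pattern):
    """Return the start index of every occurrence of pattern in text."""
    if not pattern:
        return []
    fail = failure_table(pattern)
    hits = []
    k = 0  # number of pattern symbols currently matched
    for i, symbol in enumerate(text):
        while k > 0 and symbol != pattern[k]:
            k = fail[k - 1]
        if symbol == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = fail[k - 1]  # keep going to find overlapping matches
    return hits

if __name__ == "__main__":
    print(kmp_find_all("adcwbdadcxbadcxbdadcb", "adc"))
```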

  21. Experiment Setup • Fourteen sources: search engines • Performance measures • Number of patterns • Retrieval rate and Accuracy rate • Parameters • Encoding scheme • Thresholds control

  22. Translation • Average page length is 22.7KB

  23. Accuracy and Retrieval Rate

  24. Problems • A high retrieval rate is guaranteed rather than a high accuracy rate • A generalized rule can extract more than the desired data • Currently only applicable when a Web page contains several records

  25. ROADRUNNER: Towards Automatic Data Extraction from Large Web Sites Valter Crescenzi, Giansalvatore Mecca, Paolo Merialdo VLDB2001

  26. Observations 1. Wrapper generators work by using additional information (labeled samples). 2. Wrapper induction systems have some a priori knowledge about the page organization. 3. Finally, systems generate a wrapper by examining one HTML page at a time.

  27. ROADRUNNER: A New Perspective 1. Does not rely on any interaction with the user (completely automatic). 2. No a priori knowledge: the HTML schema is inferred along with the wrapper, and any nested structure can be handled. 3. Works with two HTML pages at a time (based on the study of similarities and dissimilarities between the pages).

  28. Theoretical Background • Site generation = Encoding of database content • Data extraction = Decoding • The problem is based on a close correspondence between nested types and union-free regular expressions (UFREs).

  29. Delimiter • #PCDATA: maps to strings • +: maps to (possibly nested) lists, acting as an iterator • ?: maps to nullable fields (optional patterns) • Finding the schema and extracting the data = finding the minimal UFRE

  30. Matching Technique • It is based on a matching technique called ACME (Align, Collapse under Mismatch, and Extract) • HTML → XHTML → tokens • The matching algorithm works on two objects: • A list of tokens, called the sample • A wrapper (one UFRE) • The wrapper is generalized by solving mismatches between the wrapper and the sample.

  31. Mismatches 1. String mismatches: • Can only be due to different values of a database field. • These mismatches are used to discover fields (#PCDATA). • Ex: ‘John Smith’ and ‘Paul Jones’ at token 4 2. Tag mismatches: • Optional patterns • Iterative patterns
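The distinction between the two mismatch kinds can be sketched as a parallel scan over the wrapper and the sample. This is a simplified illustration, not RoadRunner's actual algorithm: string mismatches are generalized to #PCDATA, and the scan simply stops at the first tag mismatch, which the real system would then try to solve as an iterator or an optional. The function names and the example token lists are hypothetical, apart from the values named on the slides.

```python
PCDATA = "#PCDATA"

def is_tag(token: str) -> bool:
    """Assumed convention: tags are tokens written in angle brackets."""
    return token.startswith("<") and token.endswith(">")

def scan(wrapper, sample):
    """Return (generalized wrapper, index of first tag mismatch or None)."""
    out = list(wrapper)
    for i, (w, s) in enumerate(zip(wrapper, sample)):
        if w == s or (w == PCDATA and not is_tag(s)):
            continue  # tokens agree, or a known field matches any string
        if not is_tag(w) and not is_tag(s):
            out[i] = PCDATA  # string mismatch: a database field was found
        else:
            return out, i    # tag mismatch: needs an iterator or an optional
    if len(wrapper) != len(sample):
        return out, min(len(wrapper), len(sample))
    return out, None

if __name__ == "__main__":
    wrapper = ["<HTML>", "Books of:", "<B>", "John Smith", "</B>", "<UL>"]
    sample = ["<HTML>", "Books of:", "<B>", "Paul Jones", "</B>", "<IMG>"]
    print(scan(wrapper, sample))
```

Indices here are 0-based, so the string mismatch at index 3 corresponds to token 4 on the slide.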

  32. Discovering Optionals • Strategy: first look for repeated patterns and, if this attempt fails, try to identify an optional pattern. • Two steps: • 1. Optional Pattern Location by Cross-Search • Mismatch at token 6: <UL> and <IMG…/> • Assume the optional pattern is located either on the wrapper or on the sample. • 2. Wrapper Generalization • ( <IMG src=…/> ) ?

  33. Discovering Iterators 1. Square Location by Terminal-Tag Search: • Both the wrapper and the sample contain at least one occurrence of the square. • Terminal tag = the token at the position just before the mismatch • In this example it is </LI> • Which is the initial tag of the square? • </UL> ~ </LI> vs. <LI> ~ </LI> • Finally, we can infer that the sample contains one candidate occurrence of the square at tokens 20-25.

  34. Discovering Iterators (cont.) 2. Square Matching: • Try to match the candidate square occurrence (tokens 20-25). • Backwards: match tokens 25 and 19, then move to 24 and 18, and so on. 3. Wrapper Generalization: • If we denote the newly found square by s, we replace the repeated pattern by (s)+
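The final generalization step, replacing repeated occurrences of a square s by (s)+, can be sketched as follows. This covers only the replacement itself; locating and backward-matching the square is not shown. The function name, the textual form of the "(…)+" token and the example tokens are assumptions of this example.

```python
def collapse_square(tokens, square):
    """Replace each run of consecutive copies of square with one '(s)+' token."""
    tokens, square = list(tokens), list(square)
    if not square:
        raise ValueError("square must not be empty")
    n = len(square)
    label = "(" + " ".join(square) + ")+"
    out = []
    i = 0
    while i < len(tokens):
        if tokens[i:i + n] == square:
            # Skip every adjacent repetition of the square.
            while tokens[i:i + n] == square:
                i += n
            out.append(label)
        else:
            out.append(tokens[i])
            i += 1
    return out

if __name__ == "__main__":
    page = ["<UL>", "<LI>", "_", "</LI>", "<LI>", "_", "</LI>", "</UL>"]
    print(collapse_square(page, ["<LI>", "_", "</LI>"]))
```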

  35. More Complex Example • First mismatch at token 15 (external mismatch) • Find iterators: • Terminal tag = </LI> • A candidate square is found: <LI> ~ </LI> at tokens 15-28 • Backward match: second mismatch at tokens 23 and 9 (internal mismatch) → solve the mismatch recursively

  36. Recursively Solving Mismatches • Internal mismatch at tokens 23 and 9 • It is solved in the same way as an external mismatch • However, instead of comparing one wrapper with one sample, the algorithm compares two different portions of the same object • Terminal tag = <B> • Candidate square is </B> ~ <B> at tokens 23-18 • Backward match: mismatch at tokens 20 and 26 • Tokens 20-22 are found to be an optional pattern

  37. Matching as an AND-OR Tree • Finding one solution to match(w,s) corresponds to finding one visit of the AND-OR tree. • (i) match(w,s) requires solving all external mismatches encountered during parsing (AND node) • (ii) A mismatch is solved by introducing either one field, one iterator, or one optional (OR) • (iii) The search may be done on either the wrapper or the sample (OR) • (iv) There are various candidate iterators and optionals (OR) • (v) Discovering an iterator may require recursively solving several internal mismatches (AND)

  38. AND-OR tree

  39. Experimental Results

  40. Experimental Results (cont.)

  41. Extracting Structured Data from Web Pages Arvind Arasu, Hector Garcia-Molina ACM SIGMOD 2003

  42. Cue • Keywords: schema, template • Web pages belonging to the same site are generated by encoding data of the same schema with a common template, i.e. by plugging values into the template

  43. Figuration

  44. Goal and Challenge • Previous IE techniques rely on human-supplied heuristics, e.g. wrappers • Time consuming and error-prone • Optional attributes are ignored • Goal: to deduce the template without human input • Challenge: • No obvious way of differentiating which text is template and which is data • The schema of the data in pages is not flat but more complex and semi-structured

  45. Model, Problem Formulation • Structured Data • Model of Page Creation • Optionals and Disjunctions • Problem Statement • Miscellaneous Terminology, Definition

  46. Structured Data • Token: a token is some basic unit of text • Structured data: any set of data values conforming to a common schema or type • Definition of “type”: 1. Basic type (β): a string of tokens, e.g. <html>, text 2. Ordered list (tuple) type: tuple constructor of order n, e.g. <T1, T2, …, Tn>, where T1, T2, …, Tn are types 3. Set type: set constructor, e.g. {T}, where T is a type

  47. Instances, Values and Example • Definition of “instance”: 1. an instance of the basic type β is a string of tokens 2. an instance of type <T1, T2, …, Tn> is a tuple of the form <i1, i2, …, in>, where the attributes i1, i2, …, in are instances of types T1, T2, …, Tn 3. an instance of type {T} is any set of elements {e1, e2, …, em} such that each ei is an instance of type T • Instance → Value; String → token • Example: • Schema S1= • Value =
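The recursive definitions of type and instance map naturally onto a small recursive check. In the sketch below, the class names, the representation of set instances as Python lists, and the book/review schema are all hypothetical; the schema is not the S1 from the slide, whose content is missing from the transcript.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Basic:
    """Basic type: an instance is a string of tokens."""

@dataclass(frozen=True)
class TupleType:
    """Tuple constructor <T1, ..., Tn>."""
    parts: tuple

@dataclass(frozen=True)
class SetType:
    """Set constructor {T}."""
    elem: object

def is_instance(value, typ) -> bool:
    """Check value against typ following the three instance rules."""
    if isinstance(typ, Basic):
        return isinstance(value, str)
    if isinstance(typ, TupleType):
        return (isinstance(value, tuple)
                and len(value) == len(typ.parts)
                and all(is_instance(v, t) for v, t in zip(value, typ.parts)))
    if isinstance(typ, SetType):
        # Set instances are modelled as lists so they may hold tuples of lists.
        return (isinstance(value, list)
                and all(is_instance(v, typ.elem) for v in value))
    raise TypeError(f"unknown type constructor: {typ!r}")

if __name__ == "__main__":
    # Hypothetical schema: <title, {<reviewer, rating>}>
    review = TupleType((Basic(), Basic()))
    book = TupleType((Basic(), SetType(review)))
    print(is_instance(("Some Title", [("Ann", "5"), ("Bob", "3")]), book))
```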
