Annotation Free Information Extraction

Annotation Free Information Extraction Chia-Hui Chang Department of Computer Science & Information Engineering National Central University chia@csie.ncu.edu.tw 10/4/2002

IEPAD: Information Extraction based on Pattern Discovery C.H. Chang. National Central University WWW10

Semi-structured Information Extraction • Information Extraction (IE) • Input: Html pages • Output: A set of records

Pattern Discovery based IE • Motivation • Display of multiple records often forms a repeated pattern • The occurrences of the pattern are spaced regularly and adjacently • Now the problem becomes ... • Find regular and adjacent repeats in a string

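IEPAD Architecture • Pattern Generator: takes an Html page and produces patterns • Pattern Viewer: lets users pick among the patterns and yields an extraction rule • Extractor: applies the extraction rule to Html pages and outputs the extraction results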

The Pattern Generator • Pipeline: HTML Page → Token Translator → Token String → PAT Tree Constructor → PAT trees and Maximal Repeats → Validator → Advanced Patterns → Rule Composer → Extraction Rules • Translator • PAT tree construction • Pattern validator • Rule Composer

1. Web Page Translation • Encoding of HTML source • Rule 1: Each tag is encoded as a token • Rule 2: Any text between two tags is translated to a special token called TEXT (denoted by an underscore) • HTML Example: <B>Congo</B><I>242</I><BR><B>Egypt</B><I>20</I><BR> • Encoded token string: T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>) T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)
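
The two encoding rules above can be sketched in a few lines of Python. This is only an illustration: the function name `encode`, the regular expressions, and the sample markup are assumptions, and the slides do not show how the actual translator is implemented.

```python
import re

# Split the source into tags ("<...>") and runs of text between tags.
PIECE = re.compile(r"<[^>]*>|[^<]+")
# Capture an optional "/" and the tag name, ignoring attributes.
TAG_NAME = re.compile(r"<\s*(/?)\s*([A-Za-z][A-Za-z0-9]*)")


def encode(html):
    """Translate HTML into a list of tokens (hypothetical helper).

    Rule 1: each tag becomes one token, e.g. T(<B>).
    Rule 2: any non-blank text between two tags becomes T(_).
    """
    tokens = []
    for piece in PIECE.findall(html):
        match = TAG_NAME.match(piece)
        if match:
            tokens.append("T(<%s%s>)" % (match.group(1), match.group(2).upper()))
        elif piece.strip() and not piece.startswith("<"):
            tokens.append("T(_)")
        # Anything else (comments, whitespace-only text) is skipped.
    return tokens


if __name__ == "__main__":
    # Sample markup assumed for illustration.
    print("".join(encode("<B>Congo</B><I>242</I><BR>")))
```

Dropping attributes makes tags that differ only in attributes map to the same token, which is one possible encoding choice among several.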

Various Encoding Schemes

2. PAT Tree Construction • PAT tree: binary suffix tree • A Patricia tree constructed over all possible suffix strings of a text • Example token codes: T(<B>) 000, T(</B>) 001, T(<I>) 010, T(</I>) 011, T(<BR>) 100, T(_) 110 • Token string: T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>) T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>) • Binary string: 000110001010110011100 000110001010110011100
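
A small Python sketch of the step that precedes tree construction: turning the token string into a binary string and listing the suffixes that would be inserted into the tree. The tag names in `CODES` are assumed for illustration, and `to_bits` and `token_suffixes` are hypothetical helpers. The Patricia tree itself is not built here.

```python
# Fixed-width binary codes per token; tag names are assumed for illustration.
CODES = {
    "T(<B>)": "000",
    "T(</B>)": "001",
    "T(<I>)": "010",
    "T(</I>)": "011",
    "T(<BR>)": "100",
    "T(_)": "110",
}


def to_bits(tokens):
    """Concatenate the binary codes of a token sequence."""
    return "".join(CODES[token] for token in tokens)


def token_suffixes(bits, width=3):
    """Return the suffixes that start on a token boundary.

    Only these suffixes correspond to positions in the token string,
    so only they would be indexed.
    """
    return [bits[start:] for start in range(0, len(bits), width)]


if __name__ == "__main__":
    record = ["T(<B>)", "T(_)", "T(</B>)", "T(<I>)", "T(_)", "T(</I>)", "T(<BR>)"]
    bits = to_bits(record * 2)
    print(bits)
    print(len(token_suffixes(bits)), "suffixes")
```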

The Constructed PAT Tree

Definition of Maximal Repeats • Let a occur in S at positions p1, p2, p3, …, pk • a is left maximal if there exists at least one (i, j) pair such that S[pi-1] ≠ S[pj-1] • a is right maximal if there exists at least one (i, j) pair such that S[pi+|a|] ≠ S[pj+|a|] • a is a maximal repeat if it is both left maximal and right maximal
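
The definition can be checked directly with a brute-force Python sketch. This deliberately does not use a PAT tree: it enumerates every substring, so it is only suitable for short strings. The function name `maximal_repeats` is hypothetical, and treating the string boundary as a distinct "character" (`None`) is an assumption.

```python
from collections import defaultdict


def maximal_repeats(s, min_count=2):
    """Return {substring (as a tuple): positions} for every maximal repeat.

    Brute force over all substring lengths; works on strings or token lists.
    """
    n = len(s)
    result = {}
    for length in range(1, n):
        positions = defaultdict(list)
        for start in range(n - length + 1):
            positions[tuple(s[start:start + length])].append(start)
        for sub, occ in positions.items():
            if len(occ) < min_count:
                continue
            # Left/right neighbours of each occurrence; None marks the boundary.
            left = {s[p - 1] if p > 0 else None for p in occ}
            right = {s[p + length] if p + length < n else None for p in occ}
            # Two different neighbours on each side give the required (i, j) pair.
            if len(left) > 1 and len(right) > 1:
                result[sub] = occ
    return result


if __name__ == "__main__":
    repeats = maximal_repeats("adcwbdadcxbadcxbdadcb")
    print(repeats[tuple("adc")])
```

On the token string used later in these slides, "adc" is reported as a maximal repeat, while "dc" is not, because every occurrence of "dc" is preceded by "a".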

Finding Maximal Repeats • Definition: • Let’s call character S[pi-1] the left character of suffix pi • A node v is left diverse if at least two leaves in v’s subtree have different left characters • Lemma: • The path label of an internal node v in a PAT tree is a maximal repeat if and only if v is left diverse

3. Pattern Validator • Suppose the occurrences of a maximal repeat a are ordered by position such that p1 < p2 < p3 … < pk, where pi denotes the position of the i-th occurrence in the encoded token sequence. • Characteristics of a Pattern • Regularity: Variance coefficient • Adjacency: Density

Pattern Validator (Cont.) • Basic Screening: for each maximal repeat a, compute V(a) and D(a) • a) Check the pattern’s variance: V(a) < 0.5; otherwise discard the pattern • b) Check the pattern’s density: 0.25 < D(a) < 1.5; otherwise discard the pattern
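
A hedged Python sketch of the basic screening. The slides name the two measures but do not give their formulas, so the definitions below are assumptions: V is taken as the standard deviation of the gaps between adjacent occurrences divided by their mean, and D as (k-1)·|a| / (pk - p1). The function names are hypothetical.

```python
from statistics import mean, pstdev


def variance_coefficient(positions):
    """Assumed V(a): std. deviation of adjacent gaps divided by their mean."""
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    return pstdev(gaps) / mean(gaps)


def density(positions, length):
    """Assumed D(a): tokens covered by the pattern over the span it occupies."""
    return (len(positions) - 1) * length / (positions[-1] - positions[0])


def passes_basic_screening(positions, length, max_v=0.5, min_d=0.25, max_d=1.5):
    """Apply checks a) and b) with the thresholds given on the slide.

    `positions` must be sorted and strictly increasing.
    """
    if len(positions) < 2:
        return False
    if variance_coefficient(positions) >= max_v:
        return False
    return min_d < density(positions, length) < max_d


if __name__ == "__main__":
    print(passes_basic_screening([0, 7, 14, 21], 7))   # evenly spaced records
    print(passes_basic_screening([0, 1, 50, 51], 1))   # irregular spacing
```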

4. Rule Composer • Occurrence partition • Flexible variance threshold control • Multiple string alignment • Increase density of a pattern

Occurrence Partition • Problem • Some patterns are divided into several blocks • Ex: Lycos, Excite with large regularity • Solution • Clustering of the occurrences of such a pattern • Flow: Clustering → V(P) < 0.1? → No: discard P; Yes: check density
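
One way to picture the partition step, as a Python sketch. The slide does not say which clustering method is used, so the gap-based split below (start a new block when a gap exceeds `gap_factor` times the median gap) is purely an assumption, as are the function names. Only the V(P) < 0.1 threshold comes from the slide.

```python
from statistics import mean, median, pstdev


def split_into_blocks(positions, gap_factor=3.0):
    """Split sorted positions wherever a gap is unusually large (assumed rule)."""
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    if not gaps:
        return [list(positions)]
    limit = gap_factor * median(gaps)
    blocks, current = [], [positions[0]]
    for pos, gap in zip(positions[1:], gaps):
        if gap > limit:
            blocks.append(current)
            current = []
        current.append(pos)
    blocks.append(current)
    return blocks


def regular_blocks(positions, max_v=0.1):
    """Keep the blocks whose gap variance coefficient is below max_v."""
    kept = []
    for block in split_into_blocks(positions):
        gaps = [b - a for a, b in zip(block, block[1:])]
        if gaps and pstdev(gaps) / mean(gaps) < max_v:
            kept.append(block)  # the density check would follow here
    return kept


if __name__ == "__main__":
    print(regular_blocks([0, 5, 10, 15, 100, 105, 110, 115]))
```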

Multiple String Alignment • Problem • Patterns with density less than 1 can extract only part of the information • Solution • Align k-1 substrings among the k occurrences • A natural generalization of alignment for two strings which can be solved in O(n*m) by dynamic programming where n and m are string lengths.
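
For the two-string case mentioned above, the O(n*m) dynamic program can be sketched as follows. This is a generic unit-cost global alignment (substitution, insertion, and deletion all cost 1), which is an assumption; the slides do not state the scoring scheme. The function name `align` is hypothetical.

```python
def align(x, y, gap="-"):
    """Globally align two sequences of single-character tokens.

    Returns (aligned_x, aligned_y, cost) using O(len(x) * len(y)) time.
    """
    n, m = len(x), len(y)
    # cost[i][j] = minimum edit cost of aligning x[:i] with y[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitute = cost[i - 1][j - 1] + (x[i - 1] != y[j - 1])
            cost[i][j] = min(substitute, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (x[i - 1] != y[j - 1]):
            ax.append(x[i - 1])
            ay.append(y[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            ax.append(x[i - 1])
            ay.append(gap)
            i -= 1
        else:
            ax.append(gap)
            ay.append(y[j - 1])
            j -= 1
    return "".join(reversed(ax)), "".join(reversed(ay)), cost[n][m]


if __name__ == "__main__":
    print(align("adcwbd", "adcxb"))
```

Aligning more than two strings exactly is much more expensive, so practical systems usually combine pairwise alignments; the sketch covers only the pairwise building block.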

Multiple String Alignment (Cont.) • Suppose “adc” is the discovered pattern for token string “adcwbdadcxbadcxbdadcb” • If we have the following multiple alignment for strings “adcwbd”, “adcxb” and “adcxbd”:
a d c w b d
a d c x b -
a d c x b d
• The extraction pattern can be generalized as “adc[w|x]b[d|-]”
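
Given rows that are already aligned, the generalization step can be sketched column by column in Python. The function name `generalize` is hypothetical, and the sketch assumes single-character tokens with "-" marking a gap.

```python
def generalize(aligned_rows):
    """Merge aligned rows into one pattern with alternatives per column."""
    pattern = []
    for column in zip(*aligned_rows):
        # Collect the distinct symbols in this column, keeping first-seen order.
        symbols = []
        for symbol in column:
            if symbol not in symbols:
                symbols.append(symbol)
        if len(symbols) == 1:
            pattern.append(symbols[0])
        else:
            pattern.append("[" + "|".join(symbols) + "]")
    return "".join(pattern)


if __name__ == "__main__":
    rows = ["adcwbd", "adcxb-", "adcxbd"]
    print(generalize(rows))  # adc[w|x]b[d|-]
```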

Pattern Viewer • Java-application based GUI • Web based GUI • http://www.csie.ncu.edu.tw/~chia/WebIEPAD/

The Extractor • Matching the pattern against the encoding token string • Knuth-Morris-Pratt’s algorithm • Boyer-Moore’s algorithm • Alternatives in a rule • matching the longest pattern • What are extracted? • The whole record
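
A minimal Python sketch of Knuth-Morris-Pratt matching over a token sequence, the first of the two algorithms named above. It matches a fixed pattern only; handling alternatives such as [w|x] and preferring the longest match are not covered. The function names are hypothetical.

```python
def failure_table(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_find_all(text, pattern):
    """Return the start index of every occurrence of pattern in text."""
    if not pattern:
        return []
    fail = failure_table(pattern)
    hits, k = [], 0
    for i, token in enumerate(text):
        while k > 0 and token != pattern[k]:
            k = fail[k - 1]
        if token == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = fail[k - 1]  # continue, allowing overlapping matches
    return hits


if __name__ == "__main__":
    print(kmp_find_all(list("adcwbdadcxbadcxbdadcb"), list("adc")))
```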

Experiment Setup • Fourteen sources: search engines • Performance measures • Number of patterns • Retrieval rate and Accuracy rate • Parameters • Encoding scheme • Thresholds control

Translation • Average page length is 22.7KB

Accuracy and Retrieval Rate

Problems • Guarantee high retrieval rate instead of accuracy rate • Generalized rule can extract more than the desired data • Only applicable when there are several records in a Web page, currently

ROADRUNNER: Towards Automatic Data Extraction from Large Web Sites Valter Crescenzi, Giansalvatore Mecca, Paolo Merialdo VLDB 2001

Observations 1. Wrapper generators rely on additional information (labeled samples). 2. Wrapper induction systems assume some a priori knowledge about the page organization. 3. Systems generate a wrapper by examining one HTML page at a time.

ROADRUNNER new perspective 1. Does not rely on any interaction with the user. (Completely automatic) 2. No a priori knowledge of the page organization: the schema is inferred along with the wrapper. Can handle arbitrarily nested structures. 3. Works with two HTML pages at a time. (Based on the study of similarities and dissimilarities between the pages)

Theoretical Background • Site generation = Encoding of database content • Data extraction = Decoding • The problem is based on a close correspondence between nested types and union-free regular expressions (UFREs).

Delimiter • #PCDATA: maps to a string field • +: maps to a (possibly nested) list; acts as an iterator • ?: maps to a nullable field (optional pattern) • Finding the schema and extracting the data = finding a minimal UFRE.

Matching Technique • Based on a matching technique called ACME (Align, Collapse under Mismatch, and Extract) • HTML → XHTML → tokens • The matching algorithm works on two objects: • A list of tokens, called the sample • A wrapper (one UFRE) • The wrapper is generalized by solving mismatches between the wrapper and the sample.

Mismatches 1. String mismatches: • May be due only to different values of a database field. • These mismatches are used to discover fields. (#PCDATA) • Ex: ‘John Smith’ and ‘Paul Jones’ at token 4 2. Tag mismatches: • Optional patterns • Iterative patterns
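
A Python sketch of how string mismatches could turn into fields. It handles only the simple case where wrapper and sample have the same length and differ in text tokens; tag mismatches raise an error instead of being solved. The function names and the tokens around ‘John Smith’ are assumptions for illustration.

```python
def is_tag(token):
    """A token is a tag if it looks like <...>; everything else is a string."""
    return token.startswith("<")


def generalize_fields(wrapper, sample):
    """Replace every string mismatch with a #PCDATA field (hypothetical helper)."""
    if len(wrapper) != len(sample):
        raise ValueError("this sketch only handles equal-length token lists")
    result = []
    for w, s in zip(wrapper, sample):
        if w == s:
            result.append(w)
        elif w == "#PCDATA" and not is_tag(s):
            result.append(w)            # field discovered earlier
        elif not is_tag(w) and not is_tag(s):
            result.append("#PCDATA")    # string mismatch: new field
        else:
            raise ValueError("tag mismatch: %r vs %r" % (w, s))
    return result


if __name__ == "__main__":
    wrapper = ["<HTML>", "Books of:", "<B>", "John Smith", "</B>"]
    sample = ["<HTML>", "Books of:", "<B>", "Paul Jones", "</B>"]
    print(generalize_fields(wrapper, sample))
```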

Discovering Optionals • Strategy: first look for repeated patterns; if that attempt fails, try to identify an optional pattern. • Two steps: • 1. Optional Pattern Location by Cross-Search • Mismatch at token 6: <UL> and <IMG…/> • Assume the optional pattern is located on either the wrapper or the sample. • 2. Wrapper Generalization • ( <IMG src=…/> ) ?
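
The cross-search idea can be sketched in Python as follows: look for the wrapper's mismatching token further along the sample, and for the sample's mismatching token further along the wrapper; whatever is skipped is a candidate optional. The function name `locate_optional`, the simplified `<IMG/>` token, and the surrounding tokens are assumptions, and indices are 0-based here whereas the slide counts tokens from 1.

```python
def locate_optional(wrapper, i, sample, j):
    """Return candidate optionals for a tag mismatch at wrapper[i] / sample[j].

    Each candidate is (side, tokens), where side says which object
    contains the optional pattern.
    """
    candidates = []
    # Optional on the sample: the wrapper's token reappears later in the sample.
    if wrapper[i] in sample[j + 1:]:
        k = sample.index(wrapper[i], j + 1)
        candidates.append(("sample", sample[j:k]))
    # Optional on the wrapper: the sample's token reappears later in the wrapper.
    if sample[j] in wrapper[i + 1:]:
        k = wrapper.index(sample[j], i + 1)
        candidates.append(("wrapper", wrapper[i:k]))
    return candidates


if __name__ == "__main__":
    wrapper = ["<HTML>", "Books of:", "<B>", "#PCDATA", "</B>", "<UL>", "<LI>"]
    sample = ["<HTML>", "Books of:", "<B>", "#PCDATA", "</B>", "<IMG/>", "<UL>", "<LI>"]
    print(locate_optional(wrapper, 5, sample, 5))
```

In this example the only candidate is the `<IMG/>` token on the sample, which the generalization step would then wrap as an optional in the wrapper.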

Discovering Iterators 1. Square Location by Terminal-Tag Search: • Both the wrapper and the sample contain at least one occurrence of the square. • Terminal tag = the token at the position just before the mismatch • In this example it is </LI> • Test which tag is the initial tag of the square: • </UL> ~ </LI> vs. <LI> ~ </LI> • Finally, we can infer that the sample contains one candidate occurrence of the square at tokens 20-25.

Discovering Iterators (Cont.) 2. Square Matching: • Try to match the candidate square occurrence (tokens 20-25). • Backwards: match tokens 25 and 19, then move to 24 and 18, and so on. 3. Wrapper Generalization: • If we denote the newly found square by s, we replace the repeated pattern by (s)+

More Complex Example • First mismatch at token 15 (external mismatch) • Find iterators: • Terminal tag = </LI> • Candidate square is found: <LI> ~ </LI> at tokens 15-28 • Backward match: second mismatch at tokens 23 and 9 (internal mismatch) → solve the mismatch recursively

Recursively solve mismatch • Internal mismatch at tokens 23 and 9 • Solve it the same way as an external mismatch. • But instead of comparing one wrapper and one sample, compare two different portions of the same object. • Terminal tag = • Candidate square is ~ token 23-18 • Backward match: mismatch at tokens 20 and 26 • Tokens 20-22 are found to be an optional pattern.

Matching as an AND-OR tree • Finding one solution to match(w,s) corresponds to finding one visit of the AND-OR tree. • (i) match(w,s) must solve all external mismatches encountered during the parsing (AND node) • (ii) Each mismatch is solved by introducing either one field, one iterator, or one optional (OR) • (iii) The search may be on either the wrapper or the sample (OR) • (iv) There may be various candidates for iterators and optionals (OR) • (v) Discovering an iterator may require recursively solving several internal mismatches. (AND)

AND-OR tree

Experimental Results

Experimental Results (Cont.)

Extracting Structured Data from Web Page Arvind Arasu, Hector Garcia-Molina ACM SIGMOD 2003

Cue • Keywords: schema, template • Web pages belonging to the same site are generated by encoding data of the same schema with a common template => each page is produced by plugging values into the common template

Figuration

Goal and Challenge • Previous IE techniques rely on human-supplied heuristics, e.g. wrappers • Time-consuming and error-prone • Optional attributes are ignored • Goal: to deduce the template without human input • Challenge: • No obvious way of telling which text belongs to the template and which is data • The schema of the data in pages is not flat; it is more complex, with semi-structured attributes

Model, Problem Formulation • Structured Data • Model of Page Creation • Optionals and Disjunctions • Problem Statement • Miscellaneous Terminology, Definition

Structured Data • Token: a basic unit of text • Structured Data: any set of data values conforming to a common schema or type • Define “Type”: 1. Basic Type (β): a string of tokens, e.g. <html>, text 2. Ordered List Type: tuple constructor of order n, e.g. <T1, T2, …, Tn>, where T1, T2, …, Tn are types 3. Set Type: set constructor, e.g. {T}, where T is a type

Define term value and example • Define “instance”: 1. an instance of the basic type β is a string of tokens 2. an instance of type <T1, T2, …, Tn> is a tuple of the form <i1, i2, …, in>, where the attributes i1, i2, …, in are instances of types T1, T2, …, Tn 3. an instance of type {T} is any set of elements {e1, e2, …, em} such that each ei is an instance of type T • Instance → Value; String → token • Example: • Schema S1= • Value =
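
The three type constructors and the instance rules can be mirrored in a short Python sketch. The class names, the `is_instance` helper, and the example schema are hypothetical; in particular, the schema below is not the slide's S1, and representing a basic-type instance as a Python string is a simplification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Basic:
    """Basic type: a string of tokens."""


@dataclass(frozen=True)
class TupleType:
    """Tuple constructor <T1, ..., Tn>."""
    parts: tuple


@dataclass(frozen=True)
class SetType:
    """Set constructor {T}."""
    element: object


def is_instance(value, typ):
    """Check a value against a type, following the three rules above."""
    if isinstance(typ, Basic):
        return isinstance(value, str)
    if isinstance(typ, TupleType):
        return (isinstance(value, tuple)
                and len(value) == len(typ.parts)
                and all(is_instance(v, t) for v, t in zip(value, typ.parts)))
    if isinstance(typ, SetType):
        return (isinstance(value, (list, set, frozenset))
                and all(is_instance(v, typ.element) for v in value))
    raise TypeError("unknown type: %r" % (typ,))


if __name__ == "__main__":
    # Hypothetical schema: <basic, {<basic, basic>}>
    schema = TupleType((Basic(), SetType(TupleType((Basic(), Basic())))))
    value = ("Databases", [("Primer", "1998"), ("Systems", "2001")])
    print(is_instance(value, schema))
```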
