Annotation Free Information Extraction Chia-Hui Chang Department of Computer Science & Information Engineering National Central University chia@csie.ncu.edu.tw 10/4/2002
IEPAD: Information Extraction based on Pattern Discovery C.H. Chang. National Central University WWW10
Semi-structured Information Extraction • Information Extraction (IE) • Input: HTML pages • Output: A set of records
Pattern Discovery based IE • Motivation • Display of multiple records often forms a repeated pattern • The occurrences of the pattern are spaced regularly and adjacently • Now the problem becomes ... • Find regular and adjacent repeats in a string
IEPAD Architecture • Components: Pattern Generator, Pattern Viewer, Extractor • [Architecture diagram; labels: HTML Page, Patterns, Users, Extraction Rule, HTML Pages, Extraction Results]
The Pattern Generator • Token Translator: HTML page → token string • PAT Tree Constructor: token string → PAT trees and maximal repeats • Validator: maximal repeats → advanced patterns • Rule Composer: advanced patterns → extraction rules
1. Web Page Translation • Encoding of HTML source • Rule 1: Each tag is encoded as a token • Rule 2: Any text between two tags is translated to a special token called TEXT (denoted by an underscore) • HTML Example: <B>Congo</B><I>242</I><BR> <B>Egypt</B><I>20</I><BR> • Encoded token string: T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>) T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)
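The two encoding rules can be sketched as follows. This is a minimal illustration, not the IEPAD translator itself: the function name `encode` is hypothetical, and skipping whitespace-only text between tags is an assumption made so that the output matches the token string on the slide.

```python
import re

def encode(html):
    """Encode an HTML string as a list of tokens (hypothetical helper).

    Rule 1: each tag becomes one token.
    Rule 2: any text between two tags becomes the TEXT token "_".
    Assumption: whitespace-only text between tags is ignored.
    """
    tokens = []
    # Each match is either a tag (group 1) or a run of non-tag text (group 2).
    for tag, text in re.findall(r"(<[^>]*>)|([^<]+)", html):
        if tag:
            tokens.append(tag)
        elif text.strip():
            tokens.append("_")
    return tokens

page = "<B>Congo</B><I>242</I><BR> <B>Egypt</B><I>20</I><BR>"
print(encode(page))
```

Both records map to the same seven-token sequence, which is what makes the repeat detectable in the later steps.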
2. PAT Tree Construction • PAT tree: binary suffix tree • A Patricia tree constructed over all possible suffix strings of a text • Example token codes: T(<B>) = 000, T(</B>) = 001, T(<I>) = 010, T(</I>) = 011, T(<BR>) = 100, T(_) = 110 • T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>) → 000110001010110011100 • T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>) → 000110001010110011100
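The binary encoding in the example can be reproduced with a few lines. This sketch uses the code table from the slide; the helper names `to_bits` and `sorted_suffixes` are hypothetical, and the sorted suffix list is only a simple stand-in for a real PAT tree, which stores the same suffixes in a compressed binary trie.

```python
# Code table taken from the slide.
CODES = {"<B>": "000", "</B>": "001", "<I>": "010",
         "</I>": "011", "<BR>": "100", "_": "110"}

def to_bits(tokens):
    """Concatenate the fixed-length binary code of each token."""
    return "".join(CODES[t] for t in tokens)

def sorted_suffixes(tokens):
    """List (bit string, start index) for every token-aligned suffix, sorted.

    Assumption: suffixes start only at token boundaries. A PAT tree would
    index these same suffixes instead of sorting them.
    """
    return sorted((to_bits(tokens[i:]), i) for i in range(len(tokens)))

record = ["<B>", "_", "</B>", "<I>", "_", "</I>", "<BR>"]
print(to_bits(record))
for bits, start in sorted_suffixes(record * 2)[:3]:
    print(start, bits)
```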
Definition of Maximal Repeats • Let a occur in S at positions p_1, p_2, p_3, …, p_k • a is left maximal if there exists at least one pair (i, j) such that S[p_i − 1] ≠ S[p_j − 1] • a is right maximal if there exists at least one pair (i, j) such that S[p_i + |a|] ≠ S[p_j + |a|] • a is a maximal repeat if it is both left maximal and right maximal
Finding Maximal Repeats • Definition: • Call the character S[p_i − 1] the left character of suffix p_i • A node v is left diverse if at least two leaves in v’s subtree have different left characters • Lemma: • The path label of an internal node v in a PAT tree is a maximal repeat if and only if v is left diverse
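The definition of a maximal repeat can be checked directly with a brute-force sketch. This is deliberately not the PAT-tree method from the lemma: it enumerates every substring, which is far slower but makes the left/right maximality test explicit. The function name and the use of `None` as a boundary marker are assumptions for illustration.

```python
from collections import defaultdict

def maximal_repeats(s, min_len=1):
    """Return {substring: positions} for every maximal repeat in s.

    Brute force over all substrings; only suitable for short strings.
    A missing neighbour at either end of s is treated as a distinct
    character (None).
    """
    n = len(s)
    positions = defaultdict(list)
    for i in range(n):
        for j in range(i + min_len, n + 1):
            positions[s[i:j]].append(i)

    result = {}
    for sub, pos in positions.items():
        if len(pos) < 2:
            continue  # not a repeat
        left = {s[p - 1] if p > 0 else None for p in pos}
        right = {s[p + len(sub)] if p + len(sub) < n else None for p in pos}
        # Left (right) maximal: at least two occurrences differ in their
        # left (right) neighbouring character.
        if len(left) > 1 and len(right) > 1:
            result[sub] = pos
    return result

repeats = maximal_repeats("adcwbdadcxbadcxbdadcb", min_len=2)
print(repeats["adc"])
```

In this string "dc" repeats four times but is always preceded by "a", so it is not left maximal and is absorbed by the longer repeat "adc".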
3. Pattern Validator • Suppose the occurrences of a maximal repeat are ordered by position such that p_1 < p_2 < p_3 < … < p_k, where p_i denotes the position of each suffix in the encoded token sequence. • Characteristics of a Pattern • Regularity: Variance coefficient • Adjacency: Density
Pattern Validator (Cont.) • Basic Screening: for each maximal repeat a, compute V(a) and D(a) • a) Check the pattern’s variance: keep it if V(a) < 0.5, otherwise discard • b) Check the pattern’s density: keep it if 0.25 < D(a) < 1.5, otherwise discard
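The basic screening can be sketched as below. The slides name the two measures but do not give their formulas, so the definitions here are assumptions: V is taken as the standard deviation of the gaps between adjacent occurrences divided by their mean, and D as (k − 1)·|a| / (p_k − p_1). The thresholds are the ones on the slide; all function names are hypothetical.

```python
from statistics import mean, pstdev

def variance_coefficient(positions):
    """Assumed regularity measure: std. deviation of gaps / mean gap."""
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    return pstdev(gaps) / mean(gaps)

def density(positions, pattern_len):
    """Assumed adjacency measure: (k - 1) * |a| / (p_k - p_1)."""
    k = len(positions)
    return (k - 1) * pattern_len / (positions[-1] - positions[0])

def passes_screening(positions, pattern_len):
    """Apply the slide's thresholds: V < 0.5 and 0.25 < D < 1.5."""
    if len(positions) < 2:
        return False
    v = variance_coefficient(positions)
    d = density(positions, pattern_len)
    return v < 0.5 and 0.25 < d < 1.5

# Occurrences of "adc" in "adcwbdadcxbadcxbdadcb".
print(passes_screening([0, 6, 11, 17], pattern_len=3))
```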
4. Rule Composer • Occurrence partition • Flexible variance threshold control • Multiple string alignment • Increase density of a pattern
Occurrence Partition • Problem • Some patterns are divided into several blocks • Ex: Lycos, Excite, with a large regularity value (variance coefficient) • Solution • Cluster the occurrences of such a pattern • For each cluster P: if V(P) < 0.1, check its density; otherwise discard P
Multiple String Alignment • Problem • Patterns with density less than 1 can extract only part of the information • Solution • Align the k-1 substrings between the k occurrences • This is a natural generalization of the alignment of two strings, which can be solved in O(n*m) time by dynamic programming, where n and m are the string lengths.
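The two-string case mentioned above can be sketched with the standard dynamic program. This is only the pairwise building block, not the multiple-alignment procedure used by IEPAD; the unit costs for a mismatch and a gap, and the function name `align`, are assumptions.

```python
def align(x, y, gap="-"):
    """Globally align two strings in O(n*m) time.

    Returns (aligned_x, aligned_y, cost), where each mismatch and each
    gap costs 1 (an assumed scoring scheme).
    """
    n, m = len(x), len(y)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j - 1] + (x[i - 1] != y[j - 1]),  # match/mismatch
                cost[i - 1][j] + 1,                           # gap in y
                cost[i][j - 1] + 1,                           # gap in x
            )

    # Trace back from the bottom-right corner to recover one alignment.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (x[i - 1] != y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            ax.append(x[i - 1]); ay.append(gap); i -= 1
        else:
            ax.append(gap); ay.append(y[j - 1]); j -= 1
    return "".join(reversed(ax)), "".join(reversed(ay)), cost[n][m]

print(align("adcwbd", "adcxb"))
```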
Multiple String Alignment (Cont.) • Suppose “adc” is the discovered pattern for token string “adcwbdadcxbadcxbdadcb” • Suppose we have the following multiple alignment for the strings “adcwbd”, “adcxb” and “adcxbd”:
a d c w b d
a d c x b -
a d c x b d
• The extraction pattern can then be generalized as “adc[w|x]b[d|-]”
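Given an alignment like the one above, the generalized pattern can be read off column by column. The sketch below assumes the rows are already aligned to equal length with "-" as the gap symbol; the function name `generalize` is hypothetical.

```python
def generalize(rows):
    """Collapse aligned rows into a pattern such as 'adc[w|x]b[d|-]'.

    A column whose symbols all agree is emitted as-is; otherwise the
    distinct symbols are listed in order of first appearance.
    """
    if len({len(r) for r in rows}) != 1:
        raise ValueError("rows must be aligned to the same length")
    parts = []
    for column in zip(*rows):
        distinct = list(dict.fromkeys(column))  # de-duplicate, keep order
        if len(distinct) == 1:
            parts.append(distinct[0])
        else:
            parts.append("[" + "|".join(distinct) + "]")
    return "".join(parts)

print(generalize(["adcwbd", "adcxb-", "adcxbd"]))
```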
Pattern Viewer • Java-application based GUI • Web based GUI • http://www.csie.ncu.edu.tw/~chia/WebIEPAD/
The Extractor • Matching the pattern against the encoded token string • Knuth-Morris-Pratt algorithm • Boyer-Moore algorithm • Alternatives in a rule • Match the longest pattern • What is extracted? • The whole record
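Matching a rule with alternatives can be illustrated by translating it into a regular expression. This swaps in Python's `re` engine for the KMP and Boyer-Moore matchers named on the slide, so it is only a sketch of the behaviour, not of the extractor's implementation. It assumes single-character tokens and the `[x|y]` / `-` rule syntax from the alignment example; `rule_to_regex` is a hypothetical helper.

```python
import re

def rule_to_regex(rule, gap="-"):
    """Translate a rule such as 'adc[w|x]b[d|-]' into a compiled regex.

    A gap symbol inside a bracket makes that position optional. Greedy
    matching then prefers the longest alternative.
    """
    parts = []
    for literal, group in re.findall(r"([^\[\]]+)|\[([^\]]*)\]", rule):
        if literal:
            parts.append(re.escape(literal))
        else:
            options = group.split("|")
            optional = gap in options
            alts = sorted((re.escape(o) for o in options if o != gap),
                          key=len, reverse=True)
            parts.append("(?:" + "|".join(alts) + ")" + ("?" if optional else ""))
    return re.compile("".join(parts))

pattern = rule_to_regex("adc[w|x]b[d|-]")
records = [m.group() for m in pattern.finditer("adcwbdadcxbadcxbdadcb")]
print(records)
```

The trailing "adcb" is not extracted because it lacks the "w" or "x" token that the rule requires.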
Experiment Setup • Fourteen sources: search engines • Performance measures • Number of patterns • Retrieval rate and Accuracy rate • Parameters • Encoding scheme • Thresholds control
Translation • Average page length is 22.7KB
Problems • Guarantees a high retrieval rate rather than a high accuracy rate • A generalized rule can extract more than the desired data • Currently applicable only when a Web page contains several records
ROADRUNNER: Towards Automatic Data Extraction from Large Web Sites Valter Crescenzi, Giansalvatore Mecca, Paolo Merialdo VLDB 2001
Observations 1. Wrapper generators work by using additional information (labeled samples). 2. Wrapper induction systems have some a priori knowledge about the page organization. 3. Systems generate a wrapper by examining one HTML page at a time.
ROADRUNNER: a new perspective 1. Does not rely on any interaction with the user (completely automatic). 2. No a priori knowledge: the HTML schema is inferred along with the wrapper, and any nested structure can be handled. 3. Works with two HTML pages at a time (based on the study of similarities and dissimilarities between the pages).
Theoretical Background • Site generation = Encoding of database content • Data extraction = Decoding • The problem is based on a close correspondence between nested types and union-free regular expressions (UFREs).
Delimiter • #PCDATA : maps to strings • + : maps to (possibly nested) lists; the iterator • ? : maps to nullable fields, i.e. optional patterns • Finding the schema and extracting the data = finding the minimal UFRE
Matching Technique • It is based on a matching technique called ACME (Align, Collapse under Mismatch, and Extract) • HTML → XHTML → tokens • The matching algorithm works on two objects: • A list of tokens, called the sample • A wrapper (one UFRE) • Matching is done by solving mismatches between the wrapper and the sample.
Mismatches 1. String mismatches: • May be due only to different values of a database field. • These mismatches are used to discover fields (#PCDATA). • Ex: ‘John Smith’ and ‘Paul Jones’ at token 4 2. Tag mismatches: • Optional patterns • Iterative patterns
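The two kinds of mismatch can be illustrated with a parallel scan of wrapper and sample. This is a simplified sketch, not the ACME algorithm: it generalizes string mismatches to #PCDATA and stops at the first tag mismatch instead of resolving it. The token lists are hypothetical, chosen to mirror the token numbers quoted on the slides, and `scan` is an assumed helper name.

```python
PCDATA = "#PCDATA"

def is_tag(token):
    return token.startswith("<")

def scan(wrapper, sample):
    """Walk wrapper and sample in parallel.

    Returns (generalized_prefix, index_of_first_tag_mismatch or None).
    Two differing text tokens are a string mismatch and become #PCDATA;
    any other difference is treated as a tag mismatch.
    """
    generalized = []
    for pos, (w, s) in enumerate(zip(wrapper, sample)):
        if w == s:
            generalized.append(w)
        elif not is_tag(w) and not is_tag(s):
            generalized.append(PCDATA)
        else:
            return generalized, pos
    return generalized, None

wrapper = ["<HTML>", "Books of:", "<B>", "John Smith", "</B>", "<UL>"]
sample = ["<HTML>", "Books of:", "<B>", "Paul Jones", "</B>", "<IMG/>", "<UL>"]
prefix, mismatch = scan(wrapper, sample)
print(prefix, mismatch)
```

With 0-based indexing, the string mismatch sits at index 3 (token 4) and the tag mismatch at index 5 (token 6).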
Discovering Optionals • Strategy: first look for repeated patterns; if that attempt fails, try to identify an optional pattern. • Two steps: • 1. Optional Pattern Location by Cross-Search • Mismatch at token 6: <UL> and <IMG…/> • Assume the optional pattern is located on either the wrapper or the sample. • 2. Wrapper Generalization • ( <IMG src=…/> )?
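The cross-search step can be sketched as follows. The idea, as assumed here, is to look for the wrapper's mismatching tag further along the sample and vice versa; the tokens skipped on the side where the search succeeds form the candidate optional. The function name and the token lists are hypothetical.

```python
def locate_optional(wrapper, sample, i, j):
    """Cross-search around a tag mismatch at wrapper[i] / sample[j].

    Returns ("sample", tokens) if the optional lies on the sample,
    ("wrapper", tokens) if it lies on the wrapper, or None.
    """
    if wrapper[i] in sample[j + 1:]:
        k = sample.index(wrapper[i], j + 1)
        return "sample", sample[j:k]
    if sample[j] in wrapper[i + 1:]:
        k = wrapper.index(sample[j], i + 1)
        return "wrapper", wrapper[i:k]
    return None

wrapper = ["<HTML>", "Books of:", "<B>", "#PCDATA", "</B>", "<UL>"]
sample = ["<HTML>", "Books of:", "<B>", "Paul Jones", "</B>", "<IMG/>", "<UL>"]
side, optional = locate_optional(wrapper, sample, 5, 5)
print(side, "(" + " ".join(optional) + ")?")
```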
Discovering Iterators 1. Square Location by Terminal-Tag Search: • Both the wrapper and the sample contain at least one occurrence of the square. • Terminal tag = the token immediately before the mismatch • In this example it is </LI> • Which tag starts the square? • </UL> ~ </LI> vs. <LI> ~ </LI> • Finally, we can infer that the sample contains one candidate occurrence of the square at tokens 20-25.
Discovering Iterators (Cont.) 2. Square Matching: • Try to match the candidate square occurrence (tokens 20-25) against the tokens that precede it. • Backwards: match tokens 25 and 19, then move to 24 and 18, and so on. 3. Wrapper Generalization: • If we denote the newly found square by s, we replace the repeated pattern by (s)+
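Terminal-tag search and backward square matching can be sketched on a single token list. This simplified version assumes the square sits entirely on one side and contains no internal mismatches other than differing text, so it does not cover the recursive case. The function names and the token list are hypothetical.

```python
def is_tag(token):
    return token.startswith("<")

def same(a, b):
    """Tokens agree if identical, or if both are text (a #PCDATA field)."""
    return a == b or (not is_tag(a) and not is_tag(b))

def find_square(tokens, mismatch):
    """Locate and verify a candidate square starting at tokens[mismatch].

    The terminal tag is the token just before the mismatch. The candidate
    runs from the mismatch to the next terminal tag and is matched
    backwards against the tokens that precede it.
    """
    terminal = tokens[mismatch - 1]
    try:
        end = tokens.index(terminal, mismatch)
    except ValueError:
        return None
    candidate = tokens[mismatch:end + 1]
    if mismatch - len(candidate) < 0:
        return None
    previous = tokens[mismatch - len(candidate):mismatch]
    for a, b in zip(reversed(candidate), reversed(previous)):
        if not same(a, b):
            return None
    return candidate

sample = ["<UL>",
          "<LI>", "<I>", "Title:", "</I>", "XML at Work", "</LI>",
          "<LI>", "<I>", "Title:", "</I>", "HTML Scripts", "</LI>",
          "</UL>"]
square = find_square(sample, 7)
print("(" + " ".join(square) + ")+")
```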
More Complex Example • First mismatch at token 15 (external mismatch) • Find iterators: • Terminal tag = </LI> • Candidate square is found: <LI> ~ </LI> at tokens 15-28 • Backward match: second mismatch at tokens 23 and 9 (internal mismatch); solve this mismatch recursively
Recursively Solving a Mismatch • Internal mismatch at tokens 23 and 9 • Solve it the same way as an external mismatch. • However, the comparison is not between one wrapper and one sample, but between two different portions of the same object. • Terminal tag = <B> • Candidate square is </B> ~ <B>, token 23-18 • Backward match: mismatch at tokens 20 and 26 • Tokens 20-22 are found to be an optional pattern.
Matching as an AND-OR tree • Finding one solution to match(w,s) corresponds to finding one visit of the AND-OR tree. • (i) match(w,s) must solve all external mismatches encountered during the parsing (AND node) • (ii) Each mismatch is solved by introducing either one field, one iterator, or one optional (OR) • (iii) The search may be done on either the wrapper or the sample (OR) • (iv) There may be various candidate iterators and optionals (OR) • (v) Discovering an iterator may require recursively solving several internal mismatches (AND)
Extracting Structured Data from Web Pages Arvind Arasu, Hector Garcia-Molina ACM SIGMOD 2003
Cue • Keywords: schema, template • Web pages belonging to the same site are generated by encoding data of the same schema with a common template => pages are produced by plugging values into a common template
Goal and Challenge • Previous IE techniques rely on human-provided heuristics, e.g. wrappers • Time-consuming and error-prone • Optional attributes are ignored • Goal: deduce the template without human input • Challenge: • There is no obvious way to tell which text is template and which is data • The schema of the data in the pages is not flat but a more complex, semi-structured set of attributes
Model, Problem Formulation • Structured Data • Model of Page Creation • Optionals and Disjunctions • Problem Statement • Miscellaneous Terminology, Definition
Structured Data • Token: a basic unit of text • Structured data: any set of data values conforming to a common schema or type • Definition of “type”: 1. Basic type (β): a string of tokens, e.g. <html>, text 2. Ordered list (tuple) type: tuple constructor of order n, e.g. <T1, T2, …, Tn>, where T1, T2, …, Tn are types 3. Set type: set constructor, e.g. {T}, where T is a type
Definition of value, and example • Definition of “instance”: 1. an instance of the basic type β is a string of tokens 2. an instance of type <T1, T2, …, Tn> is a tuple of the form <i1, i2, …, in>, where the attributes i1, i2, …, in are instances of types T1, T2, …, Tn 3. an instance of type {T} is any set of elements {e1, e2, …, em} such that each ei is an instance of type T • Instance → Value; String → token • Example: • Schema S1= • Value =
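The type and instance definitions can be modelled with a small recursive check. This is an illustrative sketch only: the class names, the use of Python strings for the basic type, and the example schema (which is not the slide's S1) are all assumptions.

```python
from dataclasses import dataclass
from typing import Tuple, Union

@dataclass(frozen=True)
class Basic:
    """Basic type β: a string of tokens (modelled as a Python str)."""

@dataclass(frozen=True)
class TupleType:
    """Tuple constructor <T1, ..., Tn>."""
    parts: Tuple["Type", ...]

@dataclass(frozen=True)
class SetType:
    """Set constructor {T}."""
    element: "Type"

Type = Union[Basic, TupleType, SetType]

def is_instance(value, t):
    """Check recursively whether value is an instance of type t."""
    if isinstance(t, Basic):
        return isinstance(value, str)
    if isinstance(t, TupleType):
        return (isinstance(value, tuple)
                and len(value) == len(t.parts)
                and all(is_instance(v, p) for v, p in zip(value, t.parts)))
    if isinstance(t, SetType):
        return (isinstance(value, (set, frozenset, list))
                and all(is_instance(v, t.element) for v in value))
    raise TypeError(f"unknown type: {t!r}")

# Hypothetical schema <β, {<β, β>}>: a name plus a set of (title, price) pairs.
schema = TupleType((Basic(), SetType(TupleType((Basic(), Basic())))))
value = ("Jane Doe", [("Title A", "$10"), ("Title B", "$20")])
print(is_instance(value, schema))
```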