Information Extraction for Semi-structured Documents: From Supervised learning to Unsupervised learning Chia-Hui Chang Dept. of Computer Science and Information Engineering, National Central University, Taiwan chia@csie.ncu.edu.tw
Outline • Problem Definition of Information Extraction • Semi-structured IE • Plain Text Information Extraction • Methods • Specially designed programming language • W4F, XWRAP, Lixto • Supervised learning approach • WIEN, SoftMealy, STALKER • Unsupervised learning approach • IEPAD • Semi-supervised learning approach • OLERA • Summary and Future Work
Introduction • Information Extraction (IE) identifies relevant information in documents, pulling it from a variety of sources and aggregating it into a homogeneous form. • The output template of the IE task • Several fields (slots) • Several instances of a field
Problem Definition • Plain Text Information Extraction • The task of locating specific pieces of data in a natural language document • To obtain useful structured information from unstructured text • DARPA’s MUC program • Semi-structured IE • Different from traditional IE • The necessity of extracting and integrating data from multiple Web-based sources • e.g. generating 1000 wrappers/extractors
Types of IE from MUC • Named Entity recognition (NE) • Finds and classifies names, places, etc. • Coreference Resolution (CO) • Identifies identity relations between entities in texts. • Template Element construction (TE) • Adds descriptive information to NE results. • Scenario Template production (ST) • Fits TE results into specified event scenarios.
IE from Semi-structured Documents • Output Template: k-tuple • Multiple instances of a field • Missing data • Several permutations of attributes
Specially Designed Programming Language • Programming by users • General programming language • Specially designed programming language • W4F, XWRAP, Lixto • How? • Observing common delimiters as landmarks • Writing extraction rules
Supervised Learning Approach • Wrapper induction • WIEN, IJCAI-97 • Kushmerick, Weld, Doorenbos • SoftMealy, IJCAI-99 • Hsu • STALKER, AA-99 • Muslea, Minton, Knoblock • Key component of IE systems • Interface for labeling • Learning algorithm • Extraction rules: Rule format • Extractor
Example Labels: {(Congo, 242), (Egypt, 20), (Belize, 501), (Spain, 34)}
Labeling • Start and end positions for • Scope • Record • Attribute • Example
Learning Algorithm • Token hierarchy for generalization • Background knowledge • Learning Algorithms • Rule expression • Delimiter-based • Consecutive landmark • Sequential landmark • Context rule
Extractor Architecture • WIEN • Single-pass • Single-loop, no branch • STALKER • Multi-pass • Bi-directional scanning • SoftMealy • Single-pass or multi-pass • Finite-state transducer
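To make the "single-pass, single-loop, no branch" idea concrete, here is a minimal sketch of a delimiter-based extractor in the spirit of WIEN. It is an illustration under assumptions, not WIEN's actual implementation: the function name `lr_extract`, the sample page, and the hand-written delimiter pairs are all hypothetical, and a real wrapper-induction system would learn the delimiters from labeled examples.

```python
def lr_extract(page, delimiters):
    """Scan the page once, left to right.

    `delimiters` holds one (left, right) string pair per attribute.
    For each record, every attribute is located by finding its left
    delimiter and reading up to its right delimiter.
    """
    records = []
    pos = 0
    while True:
        record = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:
                return records  # no further record on the page
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return records  # unterminated attribute: stop
            record.append(page[start:end])
            pos = end + len(right)
        records.append(tuple(record))


# Hypothetical page and delimiters, chosen to mirror the country/code example.
page = "<B>Congo</B><I>242</I><BR><B>Egypt</B><I>20</I><BR>"
rules = [("<B>", "</B>"), ("<I>", "</I>")]
print(lr_extract(page, rules))
```

The loop never backtracks and has no alternative paths, which is why missing attributes or permuted attribute orders need the more expressive rule formats listed for STALKER and SoftMealy.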
Pattern-discovery based IE (Unsupervised Learning Approach) • Motivation • Display of multiple records often forms a repeated pattern • The occurrences of the pattern are spaced regularly and adjacently • Now the problem becomes ... • Find regular and adjacent repeats in a string
IEPAD Architecture • [Figure: architecture diagram. Components: Pattern Discoverer, Pattern Viewer, Extractor. Other labels: HTML page, patterns, users, extraction rule, HTML pages, extraction results.]
The Pattern Generator • Translator • PAT tree construction • Pattern validator • Rule Composer • [Figure: HTML page → Token Translator → a token string → PAT Tree Constructor → PAT trees and maximal repeats → Validator → advanced patterns → Rule Composer → extraction rules]
1. Web Page Translation • Encoding of HTML source • Rule 1: Each tag is encoded as a token • Rule 2: Any text between two tags is translated to a special token called TEXT (denoted by an underscore) • HTML Example: <B>Congo</B><I>242</I><BR> <B>Egypt</B><I>20</I><BR> • Encoded token string T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>) T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)
2. PAT Tree Construction • PAT tree: binary suffix tree • A Patricia tree constructed over all possible suffix strings of a text • Example token codes: T(<B>) 000, T(</B>) 001, T(<I>) 010, T(</I>) 011, T(<BR>) 100, T(_) 110 • Token string: T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>) T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>) • Binary encoding: 000110001010110011100 000110001010110011100
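The two encoding rules can be sketched in a few lines. This is a minimal illustration, not IEPAD's translator: the function name `encode_html` is hypothetical, tags are kept verbatim as their own tokens, and whitespace-only text between tags is dropped (an assumption made so that the output matches the token string on the slide).

```python
import re


def encode_html(html):
    """Translate HTML into a token list.

    Rule 1: each tag becomes a token (here, the tag text itself).
    Rule 2: any text between two tags becomes the TEXT token "_".
    Assumption: whitespace-only text is ignored.
    """
    tokens = []
    for tag, text in re.findall(r"(<[^>]+>)|([^<]+)", html):
        if tag:
            tokens.append(tag)
        elif text.strip():
            tokens.append("_")
    return tokens


html = "<B>Congo</B><I>242</I><BR> <B>Egypt</B><I>20</I><BR>"
print(encode_html(html))
```

Because both records collapse to the same seven tokens, the repetition that the later steps search for becomes visible in the token string even though the text content differs.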
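The binary encoding step can be reproduced directly from the code table on the slide. The sketch below is only a stand-in for the PAT tree: it sorts token-aligned suffixes by their bit strings instead of building a Patricia tree (a swapped-in simplification), and the helper names `to_bits` and `sorted_suffixes` are hypothetical. Suffixes that share a long common prefix end up adjacent in the sorted order, which is the property the tree exploits to find repeats.

```python
# Token codes taken from the slide; "_" is the TEXT token.
CODES = {
    "<B>": "000", "</B>": "001", "<I>": "010",
    "</I>": "011", "<BR>": "100", "_": "110",
}


def to_bits(tokens):
    """Concatenate the fixed-width binary code of every token."""
    return "".join(CODES[t] for t in tokens)


def sorted_suffixes(tokens):
    """Start positions of all token-aligned suffixes, ordered by bit string."""
    return sorted(range(len(tokens)), key=lambda i: to_bits(tokens[i:]))


record = ["<B>", "_", "</B>", "<I>", "_", "</I>", "<BR>"]
tokens = record * 2
print(to_bits(record))
print(sorted_suffixes(tokens))
```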
Definition of Maximal Repeats • Let a occur in S at positions $p_1, p_2, p_3, \ldots, p_k$ • a is left maximal if there exists at least one (i, j) pair such that $S[p_i - 1] \neq S[p_j - 1]$ • a is right maximal if there exists at least one (i, j) pair such that $S[p_i + |a|] \neq S[p_j + |a|]$ • a is a maximal repeat if it is both left maximal and right maximal
3. Pattern Validator • Suppose the occurrences of a maximal repeat are ordered by position such that $p_1 < p_2 < p_3 < \ldots < p_k$, where $p_i$ denotes the position of the i-th occurrence (suffix) in the encoded token sequence. • Characteristics of a Pattern • Regularity: Variance coefficient • Adjacency: Density
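The definition can be checked with a brute-force sketch. This is not the PAT-tree method from the slides: it enumerates every substring, which is far slower but makes the two conditions explicit. The function name `maximal_repeats` is hypothetical, and treating the string boundary as a distinct neighbor (`None`) is an assumption the slide does not spell out.

```python
from collections import defaultdict


def maximal_repeats(s):
    """Return {substring (as a tuple): positions} for all maximal repeats.

    A substring qualifies if it occurs at least twice, its left
    neighbors are not all identical (left maximal), and its right
    neighbors are not all identical (right maximal).
    """
    n = len(s)
    found = {}
    for length in range(1, n):
        positions = defaultdict(list)
        for i in range(n - length + 1):
            positions[tuple(s[i:i + length])].append(i)
        for sub, occ in positions.items():
            if len(occ) < 2:
                continue
            # None stands for "no neighbor" at either end of the string.
            left = {s[p - 1] if p > 0 else None for p in occ}
            right = {s[p + length] if p + length < n else None for p in occ}
            if len(left) > 1 and len(right) > 1:
                found[sub] = occ
    return found


repeats = maximal_repeats("adcwbdadcxbadcxbdadcb")
print(repeats[tuple("adc")])
```

In this string "ad" is a repeat but not a maximal one, because every occurrence is followed by the same token "c"; extending it to "adc" loses no occurrences.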
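The slide names the two measures without giving formulas, so the sketch below rests on assumptions: regularity is taken as the coefficient of variation (standard deviation divided by mean) of the gaps between adjacent occurrences, and density as (k - 1) * |a| / (p_k - p_1). The function names are hypothetical.

```python
from statistics import mean, pstdev


def regularity(positions):
    """Coefficient of variation of the gaps between adjacent occurrences.

    0.0 means perfectly regular spacing. Assumed formula.
    """
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    return pstdev(gaps) / mean(gaps)


def density(positions, pattern_len):
    """Fraction of the span from first to last occurrence covered by the pattern.

    1.0 means the occurrences are directly adjacent. Assumed formula.
    """
    k = len(positions)
    return (k - 1) * pattern_len / (positions[-1] - positions[0])


# "adc" occurs at 0, 6, 11, 17 in "adcwbdadcxbadcxbdadcb".
print(regularity([0, 6, 11, 17]), density([0, 6, 11, 17], 3))
```

Under these assumed formulas, a seven-token record repeated back to back at positions 0 and 7 has regularity 0.0 and density 1.0, while "adc" above is nearly regular but covers only about half of its span.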
4. Rule Composer • Problem • Patterns with density less than 1 can extract only part of the information • Solution • Align k-1 substrings among the k occurrences • Multiple string alignment is a natural generalization of the alignment of two strings, which can be solved in O(n*m) by dynamic programming, where n and m are the string lengths.
Multiple String Alignment • Suppose “adc” is the discovered pattern for token string “adcwbdadcxbadcxbdadcb” • If we have the following multiple alignment for strings “adcwbd”, “adcxb” and “adcxbd”:
a d c w b d
a d c x b -
a d c x b d
• The extraction pattern can be generalized as “adc[w|x]b[d|-]”
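The O(n*m) two-string case can be sketched as a standard dynamic program. This is a generic global alignment with unit costs for mismatch and gap (an assumption; the slides do not state a scoring scheme), and the function name `align` is hypothetical. It assumes single-character tokens so that rows can be returned as strings.

```python
def align(a, b, gap="-"):
    """Globally align two strings with unit mismatch and gap costs."""
    n, m = len(a), len(b)
    # cost[i][j] = minimum cost of aligning a[:i] with b[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (a[i - 1] != b[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    i, j = n, m
    row_a, row_b = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            row_a.append(a[i - 1])
            row_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            row_a.append(a[i - 1])
            row_b.append(gap)
            i -= 1
        else:
            row_a.append(gap)
            row_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(row_a)), "".join(reversed(row_b))


print(align("adcwbd", "adcxb"))
```

Filling the table touches each of the (n+1)*(m+1) cells once, which is where the O(n*m) bound comes from. Aligning k-1 substrings exactly is much more expensive, so practical systems usually fall back on heuristics.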
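Once the rows are aligned, the generalization step is a column-by-column merge. A minimal sketch, assuming single-character tokens and rows of equal length with "-" marking a gap; the function name `generalize` is hypothetical.

```python
def generalize(rows):
    """Merge aligned rows into one pattern.

    A column with a single symbol is emitted as is; a column with
    several symbols becomes an alternative such as [w|x]. A "-" inside
    an alternative means the token may be missing.
    """
    parts = []
    for column in zip(*rows):
        symbols = []
        for c in column:
            if c not in symbols:  # keep first-seen order, drop duplicates
                symbols.append(c)
        if len(symbols) == 1:
            parts.append(symbols[0])
        else:
            parts.append("[" + "|".join(symbols) + "]")
    return "".join(parts)


print(generalize(["adcwbd", "adcxb-", "adcxbd"]))
```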
Pattern Viewer / User Interface • Java-application based GUI • Web based GUI • http://www.csie.ncu.edu.tw/~chia/WebIEPAD/
The Extractor • Matching the pattern against the encoding token string • Knuth-Morris-Pratt’s algorithm • Boyer-Moore’s algorithm • Alternatives in a rule • matching the longest pattern • What are extracted? • The whole record
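For the matching step, here is a minimal Knuth-Morris-Pratt sketch that works on any sequence of tokens (a string of single-character tokens or a list of tag tokens). It handles only a fixed pattern without alternatives, and the function names are hypothetical; it is not the extractor described in the slides.

```python
def failure_table(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_find_all(text, pattern):
    """Return the start index of every occurrence of a non-empty pattern."""
    fail = failure_table(pattern)
    hits = []
    k = 0  # number of pattern tokens currently matched
    for i, tok in enumerate(text):
        while k > 0 and tok != pattern[k]:
            k = fail[k - 1]
        if tok == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = fail[k - 1]  # continue, allowing overlapping matches
    return hits


print(kmp_find_all("adcwbdadcxbadcxbdadcb", "adc"))
```

The text is scanned once and never re-read, so matching costs time linear in the length of the token string plus the pattern.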
Problem • Deals only with multi-record pages • Many patterns are composed due to • Multiple string alignment • Unknown start position • Alignment error due to ignored text strings
Semi-supervised approach: OLERA • A universal method for wrapping both • single-record pages and • multi-record pages • OnLine Extraction Rule Analysis • Drill-down/Roll-up operations • Encoding hierarchy (What would you do?)
OLERA’s Framework • Three simple operations • Block enclosing • Drill-down/Roll-up • Attribute designation • [Figure: framework diagram with the labels Page Encoder, Approximate Matching, Block Enclosing, Multiple String Alignment, Drill-down/Roll-up, Attribute Designation, Extraction Patterns]
Block Enclosing • Multiple single-record pages
Enclosing (Cont.) • Different from labeling • The number of enclosing operations is far smaller than the number of training pages • Encoding • Approximate Matching • Extension of global string alignment • String Alignment • Enhanced matching function
Drill-down/Roll-up • Drill-down • Encoding • Multiple String Alignment • Each column is given an identifier: • 8_0, 8_1, 8_2 for drill-down operation on column 8 • Roll-up • Several columns can be concatenated together • The corresponding identifiers are recorded
Extractors • Grammar • Signature representation for alignment result • Each drill-down and roll-up operation • The columns to be extracted for each attribute • Matching signature pattern in testing pages • Variation of approximate matching • Insertion and mismatch are not allowed • Deletion is allowed only if indicated in the signature pattern
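The matching constraints (no insertion, no mismatch, deletion only where the signature allows it) can be illustrated by compiling a signature into a regular expression. This is a sketch under assumptions, not OLERA's matcher: the signature is represented as a list of columns of single-character tokens, "-" in a column marks an allowed deletion, and the function name `signature_to_regex` is hypothetical. The example signature reuses the aligned pattern adc[w|x]b[d|-] from the earlier multiple-alignment slide.

```python
import re


def signature_to_regex(columns):
    """Compile a signature into a regex.

    Each column lists the tokens accepted at that position. A "-"
    among them makes the column optional (deletion allowed); every
    other column must match exactly one of its tokens.
    """
    parts = []
    for alternatives in columns:
        optional = "-" in alternatives
        choices = "|".join(re.escape(a) for a in alternatives if a != "-")
        parts.append("(?:" + choices + ")" + ("?" if optional else ""))
    return re.compile("".join(parts))


signature = [["a"], ["d"], ["c"], ["w", "x"], ["b"], ["d", "-"]]
rx = signature_to_regex(signature)
print([m.group() for m in rx.finditer("adcwbdadcxbadcxbdadcb")])
```

The trailing "adcb" in the test string is not matched, because skipping the [w|x] column would be a deletion the signature does not permit.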
Conclusion • The input training pages • Annotated or unlabeled • The format of extraction rules • Delimiter-based, content-based, contextual rule • The background knowledge • Implicit or explicit
Problems • For different problems, a different encoding scheme is needed • Designing an unsupervised approach for both single-record and multi-record documents
References • Semi-structured IE • C.H. Chang and S.C. Kuo, OLERA: On-Line Extraction Rule Analysis for Semi-structured Documents, submitted for publication. • C.H. Chang and S.C. Lui, IEPAD: Information Extraction based on Pattern Discovery, WWW10, pp. 681-688, May 2-6, 2001, Hong Kong.