Chia-Hui Chang

Information Extraction for Semi-structured Documents: From Supervised learning to Unsupervised learning Chia-Hui Chang Dept. of Computer Science and Information Engineering, National Central University, Taiwan chia@csie.ncu.edu.tw

Outline • Problem Definition of Information Extraction • Semi-structured IE • Plain Text Information Extraction • Methods • Specially designed programming language • W4F, Xwrap, Lixto • Supervised learning approach • WIEN, Softmealy, Stalker • Unsupervised learning approach • IEPAD • Semi-supervised learning approach • OLERA • Summary and Future Work

Introduction • Information Extraction (IE) identifies relevant information in documents, pulling it from a variety of sources and aggregating it into a homogeneous form. • The output template of the IE task • Several fields (slots) • Several instances of a field

Problem Definition • Plain Text Information Extraction • The task of locating specific pieces of data in a natural language document • To obtain useful structured information from unstructured text • DARPA’s MUC program • Semi-structured IE • Different from traditional IE • The necessity of extracting and integrating data from multiple Web-based sources • e.g. generating 1000 wrappers/extractors

Types of IE from MUC • Named Entity recognition (NE) • Finds and classifies names, places, etc. • Coreference Resolution (CO) • Identifies identity relations between entities in texts. • Template Element construction (TE) • Adds descriptive information to NE results. • Scenario Template production (ST) • Fits TE results into specified event scenarios.

IE from Semi-structured Documents • Output Template: k-tuple • Multiple instances of a field • Missing data • Several permutations of attributes

Specially Designed Programming Language • Programming by users • General programming language • Specially designed programming language • W4F, Xwrap, Lixto • How? • Observing common delimiters as landmarks • Writing extraction rules

Supervised Learning Approach • Wrapper induction • WIEN, IJCAI-97 • Kushmerick, Weld, Doorenbos • SoftMealy, IJCAI-99 • Hsu • STALKER, AA-99 • Muslea, Minton, Knoblock • Key components of IE systems • Interface for labeling • Learning algorithm • Extraction rules: Rule format • Extractor

Example Labels: {(Congo, 242), (Egypt, 20), (Belize, 501), (Spain, 34)}

Labeling • Start and end positions for • Scope • Record • Attribute • Example

Learning Algorithm • Token hierarchy for generalization • Background knowledge • Learning Algorithms • Rule expression • Delimiter-based • Consecutive landmark • Sequential landmark • Context rule

Extractor Architecture • WIEN • Single-pass • Single-loop, no branch • STALKER • Multi-pass • Bi-directional scanning • Softmealy • Single-pass or multi-pass • Finite-state transducer
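A single-pass, single-loop, delimiter-based extractor in the spirit of WIEN can be sketched as below. This is only an illustration: the function name `lr_extract`, the sample page, and the delimiter pairs are assumptions made for the example, not WIEN's actual code or learned rules.

```python
def lr_extract(page, delimiters):
    """Scan the page once, left to right, with no branching.

    delimiters is a list of (left, right) string pairs, one per attribute.
    For each attribute the scanner skips to the left delimiter and reads
    up to the right delimiter, then repeats for the next record.
    """
    tuples = []
    pos = 0
    while True:
        record = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:
                return tuples  # no further attribute: stop scanning
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return tuples
            record.append(page[start:end])
            pos = end + len(right)
        tuples.append(tuple(record))


# Hypothetical page and delimiters, chosen to mirror the country/code example.
page = "<B>Congo</B><I>242</I><BR><B>Egypt</B><I>20</I><BR>"
rules = [("<B>", "</B>"), ("<I>", "</I>")]
print(lr_extract(page, rules))
```

A record that is cut off mid-way is discarded rather than returned partially, which is one possible design choice among several.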

Pattern-discovery based IE (Unsupervised Learning Approach) • Motivation • Display of multiple records often forms a repeated pattern • The occurrences of the pattern are spaced regularly and adjacently • Now the problem becomes ... • Find regular and adjacent repeats in a string

IEPAD Architecture • [Figure: architecture diagram. Components: Pattern Discoverer, Pattern Viewer, Extractor. Data: HTML pages, patterns, extraction rule, extraction results. Actor: users.]

The Pattern Generator • Translator • PAT tree construction • Pattern validator • Rule Composer • [Figure: HTML page → Token Translator → token string → PAT Tree Constructor → PAT trees and maximal repeats → Validator → advanced patterns → Rule Composer → extraction rules]

1. Web Page Translation • Encoding of HTML source • Rule 1: Each tag is encoded as a token • Rule 2: Any text between two tags is translated to a special token called TEXT (denoted by an underscore) • HTML example: <B>Congo</B><I>242</I><BR> <B>Egypt</B><I>20</I><BR> • Encoded token string: T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>) T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)
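The two encoding rules can be sketched in a few lines. This is a minimal illustration, not IEPAD's translator: the function name `encode`, the regular expressions, the sample markup, and the decision to skip whitespace-only text and comments are all assumptions.

```python
import re

# A piece is either a whole tag or a run of text between tags.
TAG_OR_TEXT = re.compile(r"<[^>]+>|[^<]+")


def encode(html):
    """Translate HTML source into a list of tokens.

    Rule 1: each tag becomes a token named after the tag (attributes dropped).
    Rule 2: any text between two tags becomes the TEXT token, written T(_).
    """
    tokens = []
    for piece in TAG_OR_TEXT.findall(html):
        if piece.startswith("<"):
            name = re.match(r"</?\w+", piece)
            if name:  # comments and declarations are skipped (assumption)
                tokens.append("T(" + name.group(0).upper() + ">)")
        elif piece.strip():  # whitespace-only text is skipped (assumption)
            tokens.append("T(_)")
    return tokens


# Hypothetical sample markup.
print("".join(encode("<B>Congo</B><I>242</I><BR>")))
```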

2. PAT Tree Construction • PAT tree: binary suffix tree • A Patricia tree constructed over all possible suffix strings of a text • Example token codes: T(<B>) 000, T(</B>) 001, T(<I>) 010, T(</I>) 011, T(<BR>) 100, T(_) 110 • Token string: T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>) T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>) • Binary string: 000110001010110011100 000110001010110011100

The Constructed PAT Tree

Definition of Maximal Repeats • Let $\alpha$ occur in $S$ at positions $p_1, p_2, p_3, \ldots, p_k$ • $\alpha$ is left maximal if there exists at least one $(i, j)$ pair such that $S[p_i - 1] \neq S[p_j - 1]$ • $\alpha$ is right maximal if there exists at least one $(i, j)$ pair such that $S[p_i + |\alpha|] \neq S[p_j + |\alpha|]$ • $\alpha$ is a maximal repeat if it is both left maximal and right maximal
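The definition can be checked directly with a brute-force sketch. It enumerates every substring instead of using a PAT tree, so it is only suitable for small inputs. The function name `maximal_repeats` and the treatment of the string boundary as a distinct symbol are assumptions.

```python
from collections import defaultdict


def maximal_repeats(s, min_count=2):
    """Return {substring: [positions]} for every maximal repeat in s."""
    occurrences = defaultdict(list)
    n = len(s)
    for i in range(n):
        for j in range(i + 1, n + 1):
            occurrences[s[i:j]].append(i)

    result = {}
    for sub, positions in occurrences.items():
        if len(positions) < min_count:
            continue
        # Symbols just before and just after each occurrence.
        # None stands for the string boundary (assumption).
        left = {s[p - 1] if p > 0 else None for p in positions}
        right = {s[p + len(sub)] if p + len(sub) < n else None
                 for p in positions}
        # Left/right maximal: at least two occurrences differ in context.
        if len(left) > 1 and len(right) > 1:
            result[sub] = positions
    return result


repeats = maximal_repeats("adcwbdadcxbadcxbdadcb")
print(repeats["adc"])
```

In this string "adc" is a maximal repeat, while "ad" is not, because every occurrence of "ad" is followed by the same symbol.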

3. Pattern Validator • Suppose the occurrences of a maximal repeat $\alpha$ are ordered by position such that $p_1 < p_2 < p_3 < \ldots < p_k$, where $p_i$ denotes the position of each suffix in the encoded token sequence. • Characteristics of a pattern • Regularity: variance coefficient • Adjacency: density
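The slide names the two measures without giving formulas. The sketch below assumes the usual reading: regularity as the standard deviation of the gaps between adjacent occurrences divided by their mean, and density as (k-1)*|α| / (p_k - p_1). Both formulas, the use of the population standard deviation, and the function names are assumptions.

```python
from statistics import mean, pstdev


def regularity(positions):
    """Coefficient of variation of the gaps between adjacent occurrences.

    0 means perfectly regular spacing. Needs at least two positions.
    """
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    return pstdev(gaps) / mean(gaps)


def density(positions, pattern_len):
    """Fraction of the span from the first to the last occurrence
    that is covered by the pattern itself. 1 means back-to-back repeats."""
    k = len(positions)
    return (k - 1) * pattern_len / (positions[-1] - positions[0])


# Occurrences of "adc" in "adcwbdadcxbadcxbdadcb".
positions = [0, 6, 11, 17]
print(regularity(positions), density(positions, 3))
```

A validator would then keep only patterns whose regularity is below, and whose density is above, thresholds chosen by the user.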

4. Rule Composer • Problem • Patterns with density less than 1 can extract only part of the information • Solution • Align k-1 substrings among the k occurrences • Multiple string alignment is a natural generalization of the alignment of two strings, which can be solved in O(n*m) by dynamic programming, where n and m are the string lengths.
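The two-string case can be sketched with the standard dynamic-programming table. Unit costs for mismatch and gap are an assumption, as is the function name `align`; IEPAD's own scoring may differ.

```python
def align(a, b, gap="-"):
    """Globally align two token strings in O(n*m) time and space."""
    n, m = len(a), len(b)
    # cost[i][j] = minimum edit cost of aligning a[:i] with b[:j].
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitute = cost[i - 1][j - 1] + (a[i - 1] != b[j - 1])
            cost[i][j] = min(substitute,
                             cost[i - 1][j] + 1,   # gap in b
                             cost[i][j - 1] + 1)   # gap in a

    # Trace back from the bottom-right corner to recover one alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and cost[i][j] == cost[i - 1][j - 1] + (a[i - 1] != b[j - 1])):
            row_a.append(a[i - 1])
            row_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            row_a.append(a[i - 1])
            row_b.append(gap)
            i -= 1
        else:
            row_a.append(gap)
            row_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(row_a)), "".join(reversed(row_b))


print(align("adcwbd", "adcxb"))
```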

Multiple String Alignment • Suppose “adc” is the discovered pattern for token string “adcwbdadcxbadcxbdadcb” • If we have the following multiple alignment for strings “adcwbd”, “adcxb” and “adcxbd”:
a d c w b d
a d c x b -
a d c x b d
• The extraction pattern can be generalized as “adc[w|x]b[d|-]”
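Given the occurrence positions and an alignment, the generalization step can be sketched as follows. The helper names `split_at` and `generalize` are hypothetical, and the alignment rows are taken as given rather than computed.

```python
def split_at(s, positions):
    """Cut s into the k-1 substrings that start at consecutive occurrences."""
    return [s[a:b] for a, b in zip(positions, positions[1:])]


def generalize(rows, gap="-"):
    """Collapse equal-length aligned rows into one pattern.

    A column with a single symbol is copied as is; a column with several
    symbols becomes an alternative such as [w|x]. The gap marker, when
    present, is listed last.
    """
    parts = []
    for column in zip(*rows):
        symbols = list(dict.fromkeys(column))  # distinct, first-seen order
        if len(symbols) == 1:
            parts.append(symbols[0])
        else:
            symbols.sort(key=lambda t: t == gap)  # stable: gap moves to the end
            parts.append("[" + "|".join(symbols) + "]")
    return "".join(parts)


print(split_at("adcwbdadcxbadcxbdadcb", [0, 6, 11, 17]))
print(generalize(["adcwbd", "adcxb-", "adcxbd"]))
```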

Pattern Viewer / User Interface • Java-application based GUI • Web based GUI • http://www.csie.ncu.edu.tw/~chia/WebIEPAD/

The Extractor • Matching the pattern against the encoded token string • Knuth-Morris-Pratt algorithm • Boyer-Moore algorithm • Alternatives in a rule • Matching the longest pattern • What is extracted? • The whole record
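For a pattern without alternatives, Knuth-Morris-Pratt matching over a token sequence can be sketched as below. The function names are hypothetical, and handling of alternatives such as [w|x] is left out of this sketch.

```python
def failure_table(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_find_all(text, pattern):
    """Return the start index of every occurrence of pattern in text.

    text and pattern may be strings or lists of tokens.
    """
    if not pattern:
        return []
    fail = failure_table(pattern)
    hits = []
    k = 0  # number of pattern tokens matched so far
    for i, token in enumerate(text):
        while k > 0 and token != pattern[k]:
            k = fail[k - 1]
        if token == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = fail[k - 1]  # continue, allowing overlapping matches
    return hits


print(kmp_find_all("adcwbdadcxbadcxbdadcb", "adc"))
```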

Problem • Deals only with multi-record pages • Many patterns are composed due to • Multiple string alignment • Unknown start position • Alignment error due to ignored text strings

Semi-supervised approach: OLERA • A universal method for wrapping both • single-record pages and • multi-record pages • OnLine Extraction Rule Analysis • Drill-down/Roll-up operations • Encoding hierarchy (What would you do?)

OLERA’s Framework • Three simple operations • Block enclosing • Drill-down/Roll-up • Attribute Designation • [Figure: framework diagram. Components: Page Encoder, Approximate Matching, Block Enclosing, Multiple String Alignment, Drill-down/Roll-up, Attribute Designation, Extraction Patterns.]

Block Enclosing Multiple single-record pages

Enclosing (Cont.) • Different from labeling • The number of enclosing operations is far smaller than the number of training pages • Encoding • Approximate Matching • Extension of global string alignment • String Alignment • Enhanced matching function

Attribute Designation

Drill-down/Roll-up • Drill-down • Encoding • Multiple String Alignment • Each column is given an identifier: • 8_0, 8_1, 8_2 for a drill-down operation on column 8 • Roll-up • Several columns can be concatenated together • The corresponding identifiers are recorded

Extractors • Grammar • Signature representation for alignment result • Each drill-down and roll-up operation • The columns to be extracted for each attribute • Matching signature pattern in testing pages • Variation of approximate matching • Insertion and mismatch are not allowed • Deletion is allowed only if indicated in the signature pattern
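The restricted matching rule (no insertion, no mismatch, deletion only where the signature allows it) can be sketched as below. The signature representation, a list of (allowed tokens, optional flag) columns, and the function name `matches` are assumptions for illustration, not OLERA's actual grammar.

```python
def matches(signature, tokens):
    """Return True if tokens fit the signature exactly.

    signature is a list of (allowed, optional) columns, where allowed is a
    set of tokens and optional says whether the column may be deleted.
    Every token must be consumed by some column, so insertions and
    mismatches are rejected.
    """
    def go(i, j):
        if i == len(signature):
            return j == len(tokens)  # leftover tokens would be insertions
        allowed, optional = signature[i]
        # Try to consume one token with this column.
        if j < len(tokens) and tokens[j] in allowed and go(i + 1, j + 1):
            return True
        # Otherwise skip the column, but only if it is marked optional.
        return optional and go(i + 1, j)

    return go(0, 0)


# Hypothetical signature corresponding to the pattern adc[w|x]b[d|-].
signature = [({"a"}, False), ({"d"}, False), ({"c"}, False),
             ({"w", "x"}, False), ({"b"}, False), ({"d"}, True)]
print(matches(signature, "adcxb"), matches(signature, "adcyb"))
```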

Conclusion • The input of training page • Annotated or unlabeled • The format of extraction rule • Delimiter-based, content-based, contextual rule • The background knowledge • Implicitly or explicitly

Problems • Different problems need different encoding schemes • Designing an unsupervised approach for both single-record and multi-record documents

References • Semi-structured IE • C.H. Chang and S.C. Kuo. OLERA: On-Line Extraction Rule Analysis for Semi-structured Documents. Submitted for publication. • C.H. Chang and S.C. Lui. IEPAD: Information Extraction based on Pattern Discovery. WWW10, pp. 681-688, May 2-6, 2001, Hong Kong.
