Chia-Hui Chang



  1. Information Extraction for Semi-structured Documents: From Supervised learning to Unsupervised learning Chia-Hui Chang Dept. of Computer Science and Information Engineering, National Central University, Taiwan chia@csie.ncu.edu.tw

  2. Outline • Problem Definition of Information Extraction • Semi-structured IE • Plain Text Information Extraction • Methods • Specially designed programming languages • W4F, Xwrap, Lixto • Supervised learning approach • WIEN, SoftMealy, STALKER • Unsupervised learning approach • IEPAD • Semi-supervised learning approach • OLERA • Summary and Future Work

  3. Introduction • Information Extraction (IE) identifies relevant information in documents, pulling it from a variety of sources and aggregating it into a homogeneous form. • The output template of the IE task • Several fields (slots) • Several instances of a field

  4. Problem Definition • Plain Text Information Extraction • The task of locating specific pieces of data in a natural language document • To obtain useful structured information from unstructured text • DARPA’s MUC program • Semi-structured IE • Different from traditional IE • The necessity of extracting and integrating data from multiple Web-based sources • e.g. generating 1000 wrappers/extractors

  5. Types of IE from MUC • Named Entity recognition (NE) • Finds and classifies names, places, etc. • Coreference Resolution (CO) • Identifies identity relations between entities in texts. • Template Element construction (TE) • Adds descriptive information to NE results. • Scenario Template production (ST) • Fits TE results into specified event scenarios.

  6. IE from Semi-structured Documents • Output Template: k-tuple • Multiple instances of a field • Missing data • Several permutations of attributes

  7. Specially Designed Programming Language • Programming by users • General programming language • Specially designed programming language • W4F, Xwrap, Lixto • How? • Observing common delimiters as landmarks • Writing extraction rules

  8. Supervised Learning Approach • Wrapper induction • WIEN, IJCAI-97 • Kushmerick, Weld, Doorenbos • SoftMealy, IJCAI-99 • Hsu • STALKER, AA-99 • Muslea, Minton, Knoblock • Key components of IE systems • Interface for labeling • Learning algorithm • Extraction rules: Rule format • Extractor

  9. Example Labels: {(Congo, 242), (Egypt, 20), (Belize, 501), (Spain, 34)}
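The labels above can be reproduced by a delimiter-based wrapper in the spirit of wrapper induction. The sketch below is only an illustration, not WIEN's actual implementation: the function name `lr_extract`, the four-record page, and the hand-written delimiter pairs are assumptions (a real system would learn the delimiters from the labeled examples).

```python
def lr_extract(page, delimiters):
    """Extract tuples by scanning for (left, right) delimiter pairs in order.

    `delimiters` holds one (left, right) pair per attribute. This mimics a
    single-pass, single-loop wrapper with no branching.
    """
    records, pos = [], 0
    while True:
        row = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:          # no further record on the page
                return records
            start += len(left)
            end = page.find(right, start)
            if end == -1:            # unterminated attribute: stop
                return records
            row.append(page[start:end])
            pos = end + len(right)
        records.append(tuple(row))


# Hypothetical page, modeled on the country/code example used in this deck.
page = ("<B>Congo</B><I>242</I><BR>"
        "<B>Egypt</B><I>20</I><BR>"
        "<B>Belize</B><I>501</I><BR>"
        "<B>Spain</B><I>34</I><BR>")
wrapper = [("<B>", "</B>"), ("<I>", "</I>")]   # assumed, not learned
labels = lr_extract(page, wrapper)
print(labels)
# [('Congo', '242'), ('Egypt', '20'), ('Belize', '501'), ('Spain', '34')]
```

The extracted values are strings; converting the codes to integers would be a separate post-processing step.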

  10. Labeling • Start and end positions for • Scope • Record • Attribute • Example

  11. Learning Algorithm • Token hierarchy for generalization • Background knowledge • Learning Algorithms • Rule expression • Delimiter-based • Consecutive landmark • Sequential landmark • Context rule

  12. Extractor Architecture • WIEN • Single-pass • Single-loop, no branch • STALKER • Multi-pass • Bi-directional scanning • SoftMealy • Single-pass or multi-pass • Finite-state transducer

  13. Pattern-discovery based IE (Unsupervised Learning Approach) • Motivation • Display of multiple records often forms a repeated pattern • The occurrences of the pattern are spaced regularly and adjacently • Now the problem becomes ... • Find regular and adjacent repeats in a string

  14. IEPAD Architecture • [Figure: architecture diagram with components Pattern Discoverer, Pattern Viewer, and Extractor; labels: HTML page, patterns, users, extraction rule, HTML pages, extraction results]

  15. The Pattern Generator • Translator • PAT tree construction • Pattern validator • Rule composer • [Figure: HTML page, Token Translator, token string, PAT Tree Constructor, PAT trees and maximal repeats, Validator, advanced patterns, Rule Composer, extraction rules]

  16. 1. Web Page Translation • Encoding of HTML source • Rule 1: Each tag is encoded as a token • Rule 2: Any text between two tags is translated to a special token called TEXT (denoted by an underscore) • HTML Example: <B>Congo</B><I>242</I><BR> <B>Egypt</B><I>20</I><BR> • Encoded token string: T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>) T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)
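The two encoding rules can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function name `encode` is hypothetical, tags are recognized with a simple regular expression rather than a full HTML parser, and whitespace-only text between tags is assumed to produce no TEXT token.

```python
import re

TAG = re.compile(r"<[^>]+>")   # simplistic tag matcher (assumption)


def encode(html):
    """Translate HTML into a token list: one token per tag, '_' for text."""
    tokens, pos = [], 0
    for match in TAG.finditer(html):
        if html[pos:match.start()].strip():   # Rule 2: text between tags
            tokens.append("_")
        tokens.append(match.group())          # Rule 1: each tag is a token
        pos = match.end()
    if html[pos:].strip():                    # trailing text after last tag
        tokens.append("_")
    return tokens


example = "<B>Congo</B><I>242</I><BR> <B>Egypt</B><I>20</I><BR>"
tokens = encode(example)
print("".join(f"T({t})" for t in tokens))
```

For the slide's example this yields the same seven-token sequence twice, once per record.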

  17. 2. PAT Tree Construction • PAT tree: binary suffix tree • A Patricia tree constructed over all possible suffix strings of a text • Example token codes: T(<B>) = 000, T(</B>) = 001, T(<I>) = 010, T(</I>) = 011, T(<BR>) = 100, T(_) = 110 • Each record T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>) is encoded as 000110001010110011100, so the two-record example becomes 000110001010110011100 000110001010110011100
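The binary encoding that feeds the PAT tree can be checked directly. The sketch below uses the code table from the slide; the helper names `to_bits` and `token_suffixes` are hypothetical, and listing only the suffixes that start at token boundaries is an assumption about which suffixes are indexed. It does not build the Patricia tree itself.

```python
# Fixed-width codes taken from the slide's example.
CODES = {"<B>": "000", "</B>": "001", "<I>": "010",
         "</I>": "011", "<BR>": "100", "_": "110"}
WIDTH = 3   # bits per token


def to_bits(tokens):
    """Concatenate the fixed-width binary code of every token."""
    return "".join(CODES[t] for t in tokens)


def token_suffixes(tokens):
    """Binary suffixes that begin at a token boundary (assumed indexing unit)."""
    bits = to_bits(tokens)
    return [bits[i:] for i in range(0, len(bits), WIDTH)]


record = ["<B>", "_", "</B>", "<I>", "_", "</I>", "<BR>"]
bits = to_bits(record)
print(bits)                              # 000110001010110011100
print(len(token_suffixes(record * 2)))   # 14
```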

  18. The Constructed PAT Tree

  19. Definition of Maximal Repeats • Let α occur in S at positions p_1, p_2, p_3, …, p_k • α is left maximal if there exists at least one (i, j) pair such that S[p_i − 1] ≠ S[p_j − 1] • α is right maximal if there exists at least one (i, j) pair such that S[p_i + |α|] ≠ S[p_j + |α|] • α is a maximal repeat if it is both left maximal and right maximal
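The definition can be tested with a brute-force search. This sketch is not the PAT-tree method the deck describes; it enumerates every substring, which is far slower but easy to verify. The function name `maximal_repeats` is hypothetical, and treating the string boundary as a distinct neighbor symbol (`None`) is an assumption.

```python
from collections import defaultdict


def maximal_repeats(s, min_len=1):
    """Return {repeat (as a tuple of symbols): sorted start positions}."""
    s = tuple(s)
    n = len(s)
    result = {}
    for length in range(min_len, n):
        positions = defaultdict(list)
        for i in range(n - length + 1):
            positions[s[i:i + length]].append(i)
        for sub, pos in positions.items():
            if len(pos) < 2:
                continue                      # not a repeat
            # Neighbors to the left and right of each occurrence;
            # None stands for "outside the string" (assumption).
            lefts = {s[p - 1] if p > 0 else None for p in pos}
            rights = {s[p + length] if p + length < n else None for p in pos}
            # Left/right maximal: some pair of occurrences differs.
            if len(lefts) > 1 and len(rights) > 1:
                result[sub] = pos
    return result


repeats = maximal_repeats("adcwbdadcxbadcxbdadcb")
print(repeats[("a", "d", "c")])   # [0, 6, 11, 17]
```

Here "ad" is not reported, because every occurrence is followed by "c" and so it is not right maximal.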

  20. 3. Pattern Validator • Suppose the occurrences of a maximal repeat α are ordered by position such that p_1 < p_2 < p_3 < … < p_k, where p_i denotes the position of each suffix in the encoded token sequence. • Characteristics of a Pattern • Regularity: Variance coefficient • Adjacency: Density
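The slide names the two measures but does not give their formulas. The sketch below assumes one plausible formulation: regularity as the coefficient of variation (standard deviation divided by mean) of the gaps between adjacent occurrences, and density as (k − 1)·|α| / (p_k − p_1). Both formulas and the function names are assumptions for illustration and should be checked against the IEPAD paper.

```python
from statistics import mean, pstdev


def regularity(positions):
    """Variance coefficient of the gaps p_{i+1} - p_i (0 = perfectly regular)."""
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    return pstdev(gaps) / mean(gaps)


def density(positions, pattern_len):
    """Fraction of the span p_1..p_k covered by the pattern (assumed formula)."""
    k = len(positions)
    return (k - 1) * pattern_len / (positions[-1] - positions[0])


# A 7-token record repeated back to back: regular and fully adjacent.
print(regularity([0, 7, 14, 21]), density([0, 7, 14, 21], 7))   # 0.0 1.0

# The pattern "adc" at positions 0, 6, 11, 17: density below 1,
# so it covers only part of each record.
print(density([0, 6, 11, 17], 3))
```

Both functions need at least two occurrences, which always holds for a repeat.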

  21. 4. Rule Composer • Problem • Patterns with density less than 1 can extract only part of the information • Solution • Align k-1 substrings among the k occurrences • A natural generalization of alignment for two strings which can be solved in O(n*m) by dynamic programming where n and m are string lengths.
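The two-string case mentioned above can be sketched with the standard dynamic-programming table, which takes O(n*m) time. This is an illustration only: unit costs for mismatch and gap are an assumption, the function name `align` is hypothetical, and the multiple-string generalization used by the rule composer is not shown.

```python
def align(x, y, gap="-"):
    """Globally align two sequences; return both rows padded with `gap`."""
    n, m = len(x), len(y)
    # cost[i][j]: minimum cost of aligning x[:i] with y[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(cost[i - 1][j - 1] + (x[i - 1] != y[j - 1]),
                             cost[i - 1][j] + 1,      # gap in y
                             cost[i][j - 1] + 1)      # gap in x
    # Trace back from the bottom-right corner to recover one optimal alignment.
    row_x, row_y = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (x[i - 1] != y[j - 1]):
            row_x.append(x[i - 1]); row_y.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            row_x.append(x[i - 1]); row_y.append(gap); i -= 1
        else:
            row_x.append(gap); row_y.append(y[j - 1]); j -= 1
    return row_x[::-1], row_y[::-1]


a, b = align("adcwbd", "adcxb")
print("".join(a))   # adcwbd
print("".join(b))   # adcxb-
```

The function works on any sequences with comparable items, so token lists can be aligned the same way as character strings.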

  22. Multiple String Alignment • Suppose “adc” is the discovered pattern for token string “adcwbdadcxbadcxbdadcb” • If we have the following multiple alignment for strings “adcwbd”, “adcxb” and “adcxbd”:
      a d c w b d
      a d c x b -
      a d c x b d
      • The extraction pattern can be generalized as “adc[w|x]b[d|-]”
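The generalization step can be reproduced by reading the alignment column by column. The sketch below assumes the rows are already aligned to equal length; the function name `generalize` is hypothetical, and it covers only this final step, not the multiple alignment itself.

```python
def generalize(rows):
    """Collapse aligned rows into a pattern with [x|y] alternatives."""
    parts = []
    for column in zip(*rows):
        symbols = list(dict.fromkeys(column))   # distinct, first-seen order
        if len(symbols) == 1:
            parts.append(symbols[0])            # column agrees everywhere
        else:
            parts.append("[" + "|".join(symbols) + "]")
    return "".join(parts)


rows = ["adcwbd", "adcxb-", "adcxbd"]
pattern = generalize(rows)
print(pattern)   # adc[w|x]b[d|-]
```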

  23. Pattern Viewer / User Interface • Java-application based GUI • Web based GUI • http://www.csie.ncu.edu.tw/~chia/WebIEPAD/

  24. The Extractor • Matching the pattern against the encoded token string • Knuth-Morris-Pratt’s algorithm • Boyer-Moore’s algorithm • Alternatives in a rule • Matching the longest pattern • What is extracted? • The whole record
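Matching a fixed pattern against the token string with Knuth-Morris-Pratt can be sketched as follows. This is a textbook version that works on any sequence (characters or tokens); it does not handle the alternatives in a rule such as [w|x], and the function names are hypothetical.

```python
def failure_table(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_find_all(text, pattern):
    """Return the start index of every occurrence of pattern in text."""
    if not pattern:
        return []
    fail = failure_table(pattern)
    matches, k = [], 0
    for i, symbol in enumerate(text):
        while k > 0 and symbol != pattern[k]:
            k = fail[k - 1]               # fall back without rescanning text
        if symbol == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = fail[k - 1]               # continue searching for more matches
    return matches


record = ["<B>", "_", "</B>", "<I>", "_", "</I>", "<BR>"]
print(kmp_find_all(record * 2, record))                  # [0, 7]
print(kmp_find_all("adcwbdadcxbadcxbdadcb", "adc"))      # [0, 6, 11, 17]
```

Each match position marks the start of a whole record in the encoded page, which can then be mapped back to the original HTML.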

  25. Problem • Deals only with multi-record pages • Many patterns are composed due to • Multiple string alignment • Unknown start position • Alignment error due to ignored text strings

  26. Semi-supervised approach: OLERA • A universal method for wrapping both • single-record pages and • multi-record pages • OnLine Extraction Rule Analysis • Drill-down/Roll-up operations • Encoding hierarchy (What would you do?)

  27. OLERA’s Framework • Three simple operations • Block enclosing • Drill-down/Roll-up • Attribute designation • [Figure: framework diagram with input doc and components Page Encoder, Approximate Matching, Block Enclosing, Multiple String Alignment, Drill-down/Roll-up, Attribute Designation; output: Extraction Patterns]

  28. Block Enclosing • Multiple single-record pages

  29. Enclosing (Cont.) • Different from labeling • The number of enclosing operations is far less than the number of training pages • Encoding • Approximate Matching • Extension of global string alignment • String Alignment • Enhanced matching function

  30. Attribute Designation

  31. Drill-down/Roll-up • Drill-down • Encoding • Multiple String Alignment • Each column is given an identifier: • 8_0, 8_1, 8_2 for a drill-down operation on column 8 • Roll-up • Several columns can be concatenated together • The corresponding identifiers are recorded

  32. Extractors • Grammar • Signature representation for alignment result • Each drill-down and roll-up operation • The columns to be extracted for each attribute • Matching signature pattern in testing pages • Variation of approximate matching • Insertion and mismatch are not allowed • Deletion is allowed only if indicated in the signature pattern

  33. Conclusion • The input of training page • Annotated or unlabeled • The format of extraction rule • Delimiter-based, content-based, contextual rule • The background knowledge • Implicitly or explicitly

  34. Problems • Different problems require different encoding schemes • Designing an unsupervised approach for both single-record and multi-record documents

  35. References • Semi-structured IE • C.H. Chang and S.C. Kuo, OLERA: On-Line Extraction Rule Analysis for Semi-structured Documents, submitted for publication. • C.H. Chang and S.C. Lui, IEPAD: Information Extraction based on Pattern Discovery, WWW10, pp. 681-688, May 2-6, 2001, Hong Kong.
