A Survey of WEB Information Extraction Systems

A Survey of WEB Information Extraction Systems • Chia-Hui Chang • National Central University • Sep. 22, 2005

Introduction • Abundant information on the Web • Static Web pages • Searchable databases: Deep Web • Information Integration • Information for life • e.g. shopping agents, travel agents • Data for research purposes • e.g. bioinformatics, auction economy

Introduction (Cont.) • Information Extraction (IE) • identifies relevant information in documents, pulling it from a variety of sources and aggregating it into a homogeneous form • An IE task is defined by its input and output

An IE Task

Web Data Extraction • [Figure with two regions labeled "Data Record"]

IE Systems • Wrappers • Programs that perform the task of IE are referred to as extractors or wrappers. • Wrapper Induction • IE systems are software tools that are designed to generate wrappers.

Various IE Surveys • Muslea • Hsu and Dung • Chang • Kushmerick • Laender • Sarawagi • Kuhlins and Tredwell

Related Work: Time • MUC Approaches • AutoSlog [Riloff, 1993], LIEP [Huffman, 1996], PALKA [Kim, 1995], HASTEN [Krupka, 1995], and CRYSTAL [Soderland, 1995] • Post-MUC Approaches • WHISK [Soderland, 1999], RAPIER [Califf, 1998], SRV [Freitag, 1998], WIEN [Kushmerick, 1997], SoftMealy [Hsu, 1998] and STALKER [Muslea, 1999]

Related Work: Automation Degree • Hsu and Dung [1998] • hand-crafted wrappers using general programming languages • specially designed programming languages or tools • heuristic-based wrappers, and • WI approaches

Related Work: Automation Degree • Chang and Kuo [2003] • systems that need programmers, • systems that need annotation examples, • annotation-free systems and • semi-supervised systems

Related Work: Input and Extraction Rules • Muslea [1999] • The first class performs IE from free text using extraction patterns that are mainly based on syntactic/semantic constraints. • The second class is wrapper induction systems, which rely on delimiter-based rules. • The third class also performs IE from online documents; however, the patterns of these tools are based on both delimiters and syntactic/semantic constraints.

Related Work: Extraction Rules • Kushmerick [2003] • Finite-state tools (regular expressions) • Relational learning tools (logic rules)

Related Work: Techniques • Laender [2002] • languages for wrapper development • HTML-aware tools • NLP-based tools • Wrapper induction tools (e.g., WIEN, SoftMealy and STALKER), • Modeling-based tools • Ontology-based tools • New Criteria: • degree of automation, support for complex objects, page contents, availability of a GUI, XML output, support for non-HTML sources, resilience and adaptiveness.

Related Work: Output Targets • Sarawagi [VLDB 2002] • Record-level • Page-level • Site-level

Related Work: Usability • Kuhlins and Tredwell [2002] • Commercial • Noncommercial

Three Dimensions • Task Domain • Input (Unstructured, semi-structured) • Output Targets (record-level, page-level, site-level) • Automation Degree • Programmer-involved, learning-based or annotation-free approaches • Techniques • Regular expression rules vs Prolog-like logic rules • Deterministic finite-state transducer vs probabilistic hidden Markov models

Task Domain: Input

Task Domain: Output • Missing Attributes • Multi-valued Attributes • Multiple Permutations • Nested Data Objects • Various Templates for an attribute • Common Templates for various attributes • Untokenized Attributes

Classification by Automation Degree • Manually • TSIMMIS, Minerva, WebOQL, W4F, XWrap • Supervised • WIEN, Stalker, Softmealy • Semi-supervised • IEPAD, OLERA • Unsupervised • DeLa, RoadRunner, EXALG

Automation Degree • Page-fetching Support • Annotation Requirement • Output Support • API Support

Technologies • Scan passes • Extraction rule types • Learning algorithms • Tokenization schemes • Features used

A Survey of Contemporary IE Systems • Manually-constructed IE tools • Programmer-aided • Supervised IE systems • Label-based • Semi-supervised IE systems • Unsupervised IE systems • Annotation-free

Manually-constructed IE Systems • TSIMMIS [Hammer et al., 1997] • Minerva [Crescenzi, 1998] • WebOQL [Arocena and Mendelzon, 1998] • W4F [Sahuguet and Azavant, 2001] • XWrap [Liu et al., 2000]

A Running Example

TSIMMIS • Each command is of the form: [variables, source, pattern] where • source specifies the input text to be considered • pattern specifies how to find the text of interest within the source, and • variables is a list of variables that hold the extracted results. • Note: • # means “save in the variable” • * means “discard”
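The command form above can be sketched in Python. This is a simplified illustration rather than TSIMMIS's actual specification language: the function names, the `page` variable, the sample markup, and the two commands are all assumptions, and each pattern is simply translated into a regular expression in which `#` captures text and `*` skips it.

```python
import re


def pattern_to_regex(pattern):
    """Translate a simplified TSIMMIS-style pattern into a compiled regex."""
    parts = []
    for piece in re.split(r"([#*])", pattern):
        if piece == "#":
            parts.append("(.*?)")           # '#': save this stretch of text
        elif piece == "*":
            parts.append(".*?")             # '*': discard this stretch of text
        else:
            parts.append(re.escape(piece))  # literal text must match as written
    return re.compile("".join(parts), re.DOTALL)


def run_commands(commands, page):
    """Run [variables, source, pattern] commands in order over named texts."""
    env = {"page": page}  # 'page' is a hypothetical name for the fetched document
    for variables, source, pattern in commands:
        match = pattern_to_regex(pattern).fullmatch(env[source])
        if match is None:
            raise ValueError(f"pattern {pattern!r} does not match {source!r}")
        for name, value in zip(variables, match.groups()):
            env[name] = value.strip()
    return env


# Hypothetical markup for the running example; the real page is not shown here.
PAGE = ("<html><body><b>Book Name</b> Databases <b>Reviews</b>"
        "<ol><li>...</li></ol></body></html>")

COMMANDS = [
    (["body"], "page", "*<body>#</body>*"),
    (["book_name", "reviews"], "body", "*<b>Book Name</b>#<b>Reviews</b>#"),
]

result = run_commands(COMMANDS, PAGE)
print(result["book_name"])  # prints: Databases
```

Each command reads from a variable bound by an earlier command, which is how a chain of commands narrows the page down to the text of interest.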

Minerva • The grammar used by Minerva is defined in an EBNF style

WebOQL • Tree nodes shown in the figure: • Tag: Body, Source: <Body>…</Body>, Text: Book Name … • Tag: OL, Source: <ol>…</ol>, Text: Reviewer Name … • Tag: [missing], Source: Book Name, Text: Book Name • Tag: NOTAG, Source: Databases, Text: Databases • Tag: [missing], Source: Reviews, Text: Reviews • Tag: LI, Source: <li>…</li>, Text: Reviewer Name … • Tag: [missing], Source: Reviewer Name, Text: Reviewer Name • Tag: NOTAG, Source: …, Text: … • Tag: NOTAG, Source: John, Text: John • Tag: [missing], Source: Rating, Text: Rating • Tag: [missing], Source: Text, Text: Text • Tag: NOTAG, Source: 7, Text: 7 • Query: Select [ Z!’.Text ] From x in browse(“pe2.html”)’, y in x’, Z in y’ Where x.Tag = “ol” and Z.Text = “Reviewer Name”

W4F • WYSIWYG support • Java toolkit • Extraction rule • HTML parse tree (DOM object) • e.g. html.body.ol[0].li[*].pcdata[0].txt • Regular expressions to address finer pieces of information
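A path such as html.body.ol[0].li[*].pcdata[0].txt walks the parse tree from the root down to a text node. The sketch below mimics that walk in Python with the standard library's ElementTree instead of W4F's Java toolkit. It assumes the page is well-formed XML, and the markup, the helper `pcdata`, and the review texts are hypothetical stand-ins for the running example.

```python
import xml.etree.ElementTree as ET


def pcdata(element):
    """Return the non-empty text chunks that are direct children of element."""
    chunks = [element.text] + [child.tail for child in element]
    return [chunk.strip() for chunk in chunks if chunk and chunk.strip()]


def reviewer_names(page):
    """Rough equivalent of html.body.ol[0].li[*].pcdata[0].txt."""
    html = ET.fromstring(page)                 # html
    first_ol = html.find("body").findall("ol")[0]  # .body.ol[0]
    # .li[*].pcdata[0].txt: first direct text chunk of every list item
    return [pcdata(li)[0] for li in first_ol.findall("li")]


# Hypothetical, well-formed markup; names and ratings follow the sample page.
PAGE = (
    "<html><body><b>Book Name</b> Data mining <b>Reviews</b><ol>"
    "<li><b>Reviewer Name</b> Jeff <b>Rating</b> 2 <b>Text</b> first review</li>"
    "<li><b>Reviewer Name</b> Jane <b>Rating</b> 6 <b>Text</b> second review</li>"
    "</ol></body></html>"
)

names = reviewer_names(PAGE)
print(names)  # prints: ['Jeff', 'Jane']
```

Real-world HTML is often not well-formed, so a tolerant HTML parser would be needed in practice; the point here is only how an index like [0] or a wildcard like [*] selects nodes along the path.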

Supervised IE systems • SRV [Freitag, 1998] • Rapier [Califf and Mooney, 1998] • WIEN [Kushmerick, 1997] • WHISK [Soderland, 1999] • NoDoSE [Adelberg, 1998] • Softmealy [Hsu and Dung, 1998] • Stalker [Muslea, 1999] • DEByE [Laender, 2002b]

SRV • Single-slot information extraction • Top-down (general to specific) relational learning algorithm • Positive examples • Negative examples • The learning algorithm works like FOIL • Token-oriented features • Logic rules, e.g. Rating extraction rule :- Length(=1), Every(numeric true), Every(in_list true).
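To make the Rating rule concrete, the following sketch applies it to candidate token fragments. It only illustrates how a learned rule is evaluated, not SRV's FOIL-like learning step. The `Token` class, the feature values, and the token sequence (including the numeric token "2005" placed outside the list) are assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    text: str
    in_list: bool  # token-oriented feature: does the token sit inside a list?


def numeric(token):
    """Token-oriented feature: the token consists of digits only."""
    return token.text.isdigit()


def rating_rule(fragment):
    """Length(=1), Every(numeric true), Every(in_list true)."""
    return (len(fragment) == 1
            and all(numeric(token) for token in fragment)
            and all(token.in_list for token in fragment))


def extract(tokens, rule, max_length=3):
    """Test every fragment of up to max_length tokens against the rule."""
    hits = []
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + max_length, len(tokens)) + 1):
            fragment = tokens[start:end]
            if rule(fragment):
                hits.append(" ".join(token.text for token in fragment))
    return hits


def tokenize(words, in_list):
    return [Token(word, in_list) for word in words]


# Hypothetical token stream: "2005" is numeric but lies outside the list.
TOKENS = (tokenize(["Book", "Name", "Databases", "2005", "Reviews"], False)
          + tokenize(["Reviewer", "Name", "John", "Rating", "7", "Text"], True))

ratings = extract(TOKENS, rating_rule)
print(ratings)  # prints: ['7']
```

The numeric token outside the list is rejected by Every(in_list true), which shows why the rule needs more than one predicate.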

Rapier • Field-level (single-slot) data extraction • Bottom-up (specific to general) • The extraction rules consist of 3 parts: • Pre-filler • Slot-filler • Post-filler • Book Title extraction rule: • Pre-filler: word: Book; word: Name; word: [missing] • Slot-filler: Length=2; Tag: [nn, nns] • Post-filler: word: [missing]

WIEN • LR Wrapper • (‘Reviewer name ’, ‘’, ‘Rating ’, ‘’, ‘Text ’, ‘</li>’) • HLRT Wrapper (Head LR Tail) • OCLR Wrapper (Open-Close LR) • HOCLRT Wrapper • N-LR Wrapper (Nested LR) • N-HLRT Wrapper (Nested HLRT)
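An LR wrapper is just a list of left/right delimiter pairs, one pair per attribute, applied repeatedly until the page is exhausted. The sketch below shows that execution loop only, not WIEN's induction of the delimiters. The markup, the delimiter strings, and the review texts are hypothetical; the delimiters in the slide lost their tags in extraction and are not reproduced here.

```python
def lr_extract(page, delimiters):
    """Apply an LR wrapper: a list of (left, right) delimiter pairs."""
    records, position = [], 0
    while True:
        record = []
        for left, right in delimiters:
            start = page.find(left, position)
            if start == -1:
                return records            # no further record on the page
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return records            # incomplete record: stop
            record.append(page[start:end].strip())
            # Do not consume the right delimiter: it may also begin the
            # next attribute's left delimiter.
            position = end
        records.append(tuple(record))


# Hypothetical markup for the running example.
PAGE = (
    "<ol>"
    "<li><b>Reviewer Name</b> Jeff <b>Rating</b> 2 <b>Text</b> first review</li>"
    "<li><b>Reviewer Name</b> Jane <b>Rating</b> 6 <b>Text</b> second review</li>"
    "</ol>"
)

# Hypothetical delimiters (l1, r1, l2, r2, l3, r3) grouped in pairs.
LR_WRAPPER = [
    ("<b>Reviewer Name</b>", "<b>"),
    ("<b>Rating</b>", "<b>"),
    ("<b>Text</b>", "</li>"),
]

records = lr_extract(PAGE, LR_WRAPPER)
print(records)
```

The HLRT and OCLR variants add head/tail or open/close delimiters around this loop so that text outside the records cannot be mistaken for a left delimiter.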

WHISK • Top-down (general to specific) learning • Example • To generate 3-slot book reviews, it starts with the empty rule “*(*)*(*)*(*)*” • Each pair of parentheses indicates a phrase to be extracted • The phrase in the first pair of parentheses is bound to variable $1, the second to $2, and so on • The extraction logic is similar to that of the LR wrapper in WIEN • Pattern:: * ‘Reviewer Name ’ (Person) ‘’ * (Digit) ‘Text’ (*) ‘</li>’ • Output:: BookReview {Name $1} {Rating $2} {Comment $3}

NoDoSE • Assumes the order of attributes within a record is fixed • The user interacts with the system to decompose the input. • For the running example, the page is decomposed into • a book title (an attribute of type string) and • a list of reviewers, each with • RName (string), Rate (integer), and Text (string).
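A WHISK-style pattern can be read as a regular expression with numbered slots. The sketch below compiles a simplified rule into a Python regex and fills the output template. The rule syntax accepted by `compile_rule`, the regex chosen for the semantic classes Person and Digit, and the sample markup are assumptions for illustration; WHISK's rule learning is not shown.

```python
import re

# Assumed stand-ins for WHISK semantic classes.
SEMANTIC_CLASSES = {"Person": r"[A-Z][a-z]+", "Digit": r"\d+"}


def compile_rule(parts):
    """Compile a list of rule parts into a regex with one group per slot."""
    regex = []
    for part in parts:
        if part == "*":
            regex.append(r".*?")                  # skip anything
        elif part == "(*)":
            regex.append(r"(.*?)")                # slot: any phrase
        elif part.startswith("(") and part.endswith(")"):
            regex.append("(" + SEMANTIC_CLASSES[part[1:-1]] + ")")  # typed slot
        else:
            regex.append(r"\s*" + re.escape(part) + r"\s*")         # literal
    return re.compile("".join(regex), re.DOTALL)


def apply_rule(rule, template, page):
    """Return one filled template per match; slot i fills the i-th placeholder."""
    return [template.format(*match.groups()) for match in rule.finditer(page)]


# Hypothetical markup for the running example.
PAGE = (
    "<ol>"
    "<li><b>Reviewer Name</b> Jeff <b>Rating</b> 2 <b>Text</b> first review</li>"
    "<li><b>Reviewer Name</b> Jane <b>Rating</b> 6 <b>Text</b> second review</li>"
    "</ol>"
)

RULE = compile_rule(["*", "Reviewer Name</b>", "(Person)", "*", "(Digit)",
                     "*", "Text</b>", "(*)", "</li>"])
TEMPLATE = "BookReview {{Name {0}}} {{Rating {1}}} {{Comment {2}}}"

reviews = apply_rule(RULE, TEMPLATE, PAGE)
for review in reviews:
    print(review)
```

Because the rule is applied repeatedly over the page, one multi-slot rule yields one output record per review, which is the multi-slot behaviour that distinguishes WHISK from single-slot systems such as SRV.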

Softmealy • Finite-state transducer • States shown in the figure: b, N, R, T, e • Edge labels shown in the figure: ?/next_token, ?/ε, s<b,N>/ “N=” + next_token, s<,R>/ “R=” + next_token, s<,T>/ “T=” + next_token, s<N, >/ε, s<R,e>/ε, s<T,e>/ε • Contextual rules • s<,R>L ::= HTML() C1Alph(Rating) HTML() • s<,R>R ::= Spc(-) Num(-) • s<R,>L ::= Num(-) • s<R,>R ::= NL(-) HTML()

Stalker • Embedded Category Tree • Multipass Softmealy

DEByE • Bottom-up extraction strategy • Comparison • DEByE: the user marks only atomic (attribute) values to assemble nested tables • NoDoSE: the user decomposes the whole document in a top-down fashion

Semi-supervised Approaches • IEPAD [Chang and Lui, 2001] • OLERA [Chang and Kuo, 2003] • Thresher [Hogue, 2005]

IEPAD • Encoding of the input page • Multiple-record pages • Pattern Mining by PAT Tree • Multiple string alignment • For the running example • <li>TTTTTT</li>
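The first two IEPAD steps, encoding and repeat discovery, can be illustrated as follows. This is a sketch under stated assumptions: tags are kept as tokens and every text chunk becomes `T`, and repeats are found by brute-force counting of token substrings rather than with a PAT tree. The markup and function names are hypothetical, and the multiple string alignment step is omitted.

```python
import re
from collections import Counter


def encode(page):
    """Encode a page as tokens: each tag stays, each text chunk becomes 'T'."""
    tokens = []
    for tag, text in re.findall(r"(<[^>]+>)|([^<]+)", page):
        if tag:
            tokens.append(tag)
        elif text.strip():
            tokens.append("T")
    return tokens


def repeated_patterns(tokens, min_count=2):
    """Count every token substring; keep those seen at least min_count times.

    Brute force stands in for the PAT tree, which finds the same repeats
    far more efficiently.
    """
    counts = Counter(
        tuple(tokens[start:start + length])
        for length in range(1, len(tokens) + 1)
        for start in range(len(tokens) - length + 1)
    )
    return {pattern: n for pattern, n in counts.items() if n >= min_count}


def longest_repeat(tokens):
    """Return the longest repeated pattern, or None if nothing repeats."""
    return max(repeated_patterns(tokens), key=len, default=None)


# Hypothetical markup for a multiple-record page.
PAGE = (
    "<ol>"
    "<li><b>Reviewer Name</b> Jeff <b>Rating</b> 2 <b>Text</b> first review</li>"
    "<li><b>Reviewer Name</b> Jane <b>Rating</b> 6 <b>Text</b> second review</li>"
    "</ol>"
)

pattern = longest_repeat(encode(PAGE))
print("".join(pattern))  # prints: <li><b>T</b>T<b>T</b>T<b>T</b>T</li>
```

The longest repeat spans exactly one record, which is why pages with multiple records are the natural input: a single-record page offers nothing to repeat.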

OLERA • Online extraction rule analysis • Enclosing • Drill-down / Roll-up • Attribute Assignment

Thresher • Works similarly to OLERA • Applies tree alignment instead of string alignment

Unsupervised Approaches • Roadrunner [Crescenzi, 2001] • DeLa [Wang, 2002; 2003] • EXALG [Arasu and Garcia-Molina, 2003] • DEPTA [Zhai, et al., 2005]

Roadrunner • Input: multiple pages with the same template • Matches two input pages at a time • Wrapper (initially): <html><body> Book Name Databases Reviews <OL> <LI> Reviewer Name John Rating 7 Text … </LI> </OL> </body></html> • Sample page: <html><body> Book Name Data mining Reviews <OL> <LI> Reviewer Name Jeff Rating 2 Text … </LI> <LI> Reviewer Name Jane Rating 6 Text … </LI> </OL> </body></html> • Figure labels: parsing, string mismatch (four times), tag mismatch, terminal search match • Wrapper after solving mismatches: <html><body> Book Name #PCDATA Reviews <OL> ( <LI> Reviewer Name #PCDATA Rating #PCDATA Text #PCDATA </LI> )+ </OL></body></html>

DeLa • Similar to IEPAD • Works for one input page • Handles nested data structures • Example • <A>T</A><A>T</A>T<A>T</A>T • <A>T</A>T<A>T</A>T • ((<A>T</A>)*T)*
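The three example strings suggest folding adjacent repeats from the inside out. The sketch below reproduces that folding on the slide's example. It is a simplified reading of the idea rather than DeLa's published algorithm: the function names are made up, the shortest leftmost adjacent repeat is always folded first, and a folded group is assumed to also cover single occurrences of the same unit.

```python
import re


def tokenize(text):
    """Split an encoded string into tag tokens and text tokens."""
    return re.findall(r"<[^>]+>|[^<]+", text)


def find_tandem_repeat(tokens):
    """Return (start, size) of the shortest, leftmost adjacent repeat."""
    for size in range(1, len(tokens) // 2 + 1):
        for start in range(len(tokens) - 2 * size + 1):
            if tokens[start:start + size] == tokens[start + size:start + 2 * size]:
                return start, size
    return None


def fold(tokens):
    """Fold repeats into ('*', unit) groups until no adjacent repeat remains."""
    tokens = list(tokens)
    while (found := find_tandem_repeat(tokens)) is not None:
        start, size = found
        unit = tokens[start:start + size]
        group = ("*", tuple(unit))
        folded, i = [], 0
        while i < len(tokens):
            if tokens[i:i + size] == unit:
                if not folded or folded[-1] != group:
                    folded.append(group)   # merge adjacent copies into one group
                i += size
            else:
                folded.append(tokens[i])
                i += 1
        tokens = folded
    return tokens


def render(tokens):
    """Write the folded token list as a regular expression."""
    return "".join(
        f"({render(token[1])})*" if isinstance(token, tuple) else token
        for token in tokens
    )


pattern = render(fold(tokenize("<A>T</A><A>T</A>T<A>T</A>T")))
print(pattern)  # prints: ((<A>T</A>)*T)*
```

Folding the inner repeat first is what exposes the outer one: only after `<A>T</A><A>T</A>` and the single `<A>T</A>` are treated as the same group do the two halves of the string become identical.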

EXALG • Input: multiple pages with the same template • Techniques: • Differentiating token roles • Equivalence classes (ECs) form a template • An EC groups tokens with the same occurrence vector
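An occurrence vector records how many times a token appears on each input page, and tokens with identical vectors fall into the same equivalence class. The sketch below computes both for two hypothetical token sequences modelled on the running example. It omits EXALG's differentiation of token roles, so every occurrence of a token string is treated as the same token, and the function names are assumptions.

```python
from collections import Counter, defaultdict


def occurrence_vectors(pages):
    """Map each token to its tuple of per-page occurrence counts."""
    counts = [Counter(page) for page in pages]
    vocabulary = []
    for page in pages:
        for token in page:
            if token not in vocabulary:
                vocabulary.append(token)   # keep first-seen order
    return {token: tuple(c[token] for c in counts) for token in vocabulary}


def equivalence_classes(pages):
    """Group tokens that share the same occurrence vector."""
    classes = defaultdict(list)
    for token, vector in occurrence_vectors(pages).items():
        classes[vector].append(token)
    return dict(classes)


def record(name, rating):
    return ["<LI>", "Reviewer Name", name, "Rating", rating, "Text", "</LI>"]


# Hypothetical token sequences for two pages that share one template.
PAGE_1 = (["<html>", "<body>", "Book Name", "Databases", "Reviews", "<OL>"]
          + record("John", "7")
          + ["</OL>", "</body>", "</html>"])
PAGE_2 = (["<html>", "<body>", "Book Name", "Data mining", "Reviews", "<OL>"]
          + record("Jeff", "2") + record("Jane", "6")
          + ["</OL>", "</body>", "</html>"])

classes = equivalence_classes([PAGE_1, PAGE_2])
for vector, tokens in classes.items():
    print(vector, tokens)
```

In this example the vector (1, 1) collects the page-level template tokens and (1, 2) collects the per-review template tokens, while data values such as names and ratings land in classes with a zero count and so cannot be template tokens.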

DEPTA • Identify data region • Allow mismatch between data records • Identify data record • Data records may not be continuous • Identify data items • By partial tree alignment

Comparison • How do we differentiate template tokens from data tokens? • DeLa and DEPTA assume HTML tags are template tokens while all others are data tokens • IEPAD and OLERA leave the problem to users • How do we apply the information from multiple pages? • DeLa and DEPTA conduct the mining from a single page • Roadrunner and EXALG do the analysis from multiple pages

Comparison (Cont.) • Technique improvements • From string alignment (IEPAD, RoadRunner) to tree alignment (DEPTA, Thresher) • From full alignment (IEPAD) to partial alignment (DEPTA)

Task domain comparison • Page type • structured, semi-structured or free-text Web pages • Non-HTML support • Extraction level • Field level, record-level, page-level
