Information Extraction from Web Documents CS 652 Information Extraction and Integration Li Xu Yihong Ding
IR and IE • IR (Information Retrieval) • Retrieves relevant documents from collections • Information theory, probabilistic theory, and statistics • IE (Information Extraction) • Extracts relevant information from documents • Machine learning, computational linguistics, and natural language processing
History of IE • Large amount of both online and offline textual data. • Message Understanding Conference (MUC) • Quantitative evaluation of IE systems • Tasks • Latin American terrorism • Joint ventures • Microelectronics • Company management changes
Evaluation Metrics • Precision • Recall • F-measure
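As a minimal sketch of the three metrics, the following scores a set of extracted slot fills against a gold-standard set. It assumes exact-match scoring over (slot, value) pairs; the function name and the sample data are hypothetical, and actual MUC scoring also handled partial matches, which this sketch ignores.

```python
def precision_recall_f(extracted, gold, beta=1.0):
    """Score extracted items against a gold-standard set (exact match only)."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    # Precision: fraction of extracted items that are correct.
    precision = correct / len(extracted) if extracted else 0.0
    # Recall: fraction of gold items that were extracted.
    recall = correct / len(gold) if gold else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    # F-measure: weighted harmonic mean; beta=1 weights both equally.
    b2 = beta * beta
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f


# Hypothetical example: one of two extracted fills matches the gold standard.
gold = {("perpetrator", "Carlos"), ("target", "Parliament building")}
extracted = {("perpetrator", "Carlos"), ("target", "Parliament")}
p, r, f = precision_recall_f(extracted, gold)
```

With `beta` above 1 the F-measure favors recall; below 1 it favors precision.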
Web Documents • Unstructured (Free) Text • Regular sentences and paragraphs • Linguistic techniques, e.g., NLP • Structured Text • Itemized information • Uniform syntactic clues, e.g., table understanding • Semistructured Text • Ungrammatical, telegraphic (e.g., missing attributes, multi-value attributes, …) • Specialized programs, e.g., wrappers
Approaches to IE • Knowledge Engineering • Grammars are constructed by hand • Domain patterns are discovered by human experts through introspection and inspection of a corpus • Much laborious tuning and “hill climbing” • Machine Learning • Use statistical methods when possible • Learn rules from annotated corpora • Learn rules from interaction with user
Knowledge Engineering • Advantages • With skill and experience, well-performing systems are not conceptually hard to develop. • The best-performing systems have been handcrafted. • Disadvantages • Very laborious development process • Some changes to specifications can be hard to accommodate • Required expertise may not be available
Machine Learning • Advantages • Domain portability is relatively straightforward • System expertise is not required for customization • “Data driven” rule acquisition ensures full coverage of examples • Disadvantages • Training data may not exist, and may be very expensive to acquire • Large volume of training data may be required • Changes to specifications may require reannotation of large quantities of training data
Wrapper • A specialized program that identifies data of interest and maps them to some suitable format (e.g., XML or relational tables) • Challenge: recognizing the data of interest among many other uninteresting pieces of text • Tasks • Source understanding • Data processing
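To make the wrapper idea concrete, here is a small hand-written sketch that picks out data of interest from a semistructured listing and maps it to XML. The sample ad, the field names, and the regular expressions are all assumptions made for illustration; a real wrapper would be tuned to its source.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical semistructured listing (telegraphic, not full sentences).
AD = "1998 Honda Civic, 42K miles, $7,500. Call 555-0134."

# Hypothetical recognizers for the fields of interest.
PATTERNS = {
    "year": r"\b(?:19|20)\d{2}\b",
    "price": r"\$\d[\d,]*",
    "phone": r"\b\d{3}-\d{4}\b",
}


def wrap(text):
    """Identify each field in the text and map it to an XML element."""
    record = ET.Element("record")
    for field, pattern in PATTERNS.items():
        match = re.search(pattern, text)
        if match:  # a missing field is simply left out of the record
            ET.SubElement(record, field).text = match.group(0)
    return record


xml_out = ET.tostring(wrap(AD), encoding="unicode")
```

The two halves of `wrap` correspond loosely to the two tasks on the slide: the patterns encode source understanding, and building the record is the data processing step.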
Free Text • AutoSlog • LIEP • PALKA • HASTEN • CRYSTAL • WebFoot • WHISK
AutoSlog [1993] The Parliament building was bombed by Carlos.
LIEP [1995] The Parliament building was bombed by Carlos.
PALKA [1995] The Parliament building was bombed by Carlos.
HASTEN [1995] The Parliament building was bombed by Carlos. Egraphs (SemanticLabel, StructuralElement)
CRYSTAL [1995] The Parliament building was bombed by Carlos.
WHISK [1999] The Parliament building was bombed by Carlos. • WHISK Rule: *(PhyObj)*@passive *F ‘bombed’ * {PP ‘by’ *F (Person)} • Context-based patterns
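A rough analogue of a context-based pattern can be written as an ordinary regular expression. This is not WHISK's rule language: plain regexes have no syntactic tags or semantic classes, so the sketch below approximates the passive construction with "was/were bombed by" and approximates the Person class with a capitalized word. The names `RULE` and `extract_bombing` are hypothetical.

```python
import re

# Approximation of the slide's rule: a physical object, the passive verb
# 'bombed', then a 'by' phrase naming a person.
RULE = re.compile(
    r"^(?:The |A )?(?P<target>.+?) (?:was|were) bombed by "
    r"(?P<perpetrator>[A-Z]\w+)"
)


def extract_bombing(sentence):
    """Return the target and perpetrator slots, or None if the pattern fails."""
    match = RULE.search(sentence)
    return match.groupdict() if match else None


slots = extract_bombing("The Parliament building was bombed by Carlos.")
active = extract_bombing("Carlos bombed the Parliament building.")
```

The pattern fills two slots from one sentence, but it fails on the active-voice variant, which is why systems of this kind need a separate pattern per construction.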
Web Documents • Semistructured and Unstructured • RAPIER (E. Califf, 1997) • SRV (D. Freitag, 1998) • WHISK (S. Soderland, 1998) • Semistructured and Structured • WIEN (N. Kushmerick, 1997) • SoftMealy (C-H. Hsu, 1998) • STALKER (I. Muslea, S. Minton, C. Knoblock, 1998)
Inductive Learning • Task • Inductive Inference • Learning Systems • Zero-order • First-order, e.g., Inductive Logic Programming (ILP)
RAPIER [1997] • Inductive Logic Programming • Extraction Rules • Syntactic information • Semantic information • Advantage • Efficient learning (bottom-up) • Drawback • Single-slot extraction
SRV [1998] • Relational Algorithm (top-down) • Features • Simple features (e.g., length, character type, …) • Relational features (e.g., next-token, …) • Advantages • Expressive rule representation • Drawbacks • Single-slot rule generation • Requires a large volume of training data
WHISK [1998] • Covering Algorithm (top-down) • Advantages • Learns multi-slot extraction rules • Handles items to be extracted in varying order • Handles document types from free text to structured text • Drawbacks • Must see all the permutations of items • Less expressive feature set • Needs a large volume of training data
WIEN [1997] • Assumes • Items are always in fixed, known order • Introduces several types of wrappers • Advantages • Fast to learn and extract • Drawbacks • Cannot handle permutations and missing items • Must label entire pages • Does not use semantic classes
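The fixed-order assumption can be illustrated with a simplified left-right delimiter wrapper, the simplest of the wrapper types. This is only a sketch of executing such a wrapper, not of how WIEN learns one; the function name, sample page, and delimiters are assumptions.

```python
def lr_extract(page, delimiters):
    """Apply (left, right) delimiter pairs in fixed order, row after row."""
    tuples, pos = [], 0
    while True:
        row = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:
                return tuples  # no further rows
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return tuples
            row.append(page[start:end])
            pos = end + len(right)
        tuples.append(tuple(row))


# Hypothetical page listing country names and dialing codes.
PAGE = "<b>Congo</b> <i>242</i><br><b>Egypt</b> <i>20</i><br>"
DELIMS = [("<b>", "</b>"), ("<i>", "</i>")]
rows = lr_extract(PAGE, DELIMS)

# Same page with Congo's code missing: the wrapper borrows Egypt's code.
BROKEN = "<b>Congo</b><br><b>Egypt</b> <i>20</i><br>"
bad_rows = lr_extract(BROKEN, DELIMS)
```

Extraction is a single left-to-right scan, which is why it is fast. On the second page, though, the wrapper pairs Congo with Egypt's code, showing why missing items and permutations break this kind of wrapper.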
SoftMealy [1998] • Learns a transducer • Advantages • Learns order of items • Allows item permutations and missing items • Allows both the use of semantic classes and disjunctions • Drawbacks • Must see all possible permutations • Cannot use delimiters that do not immediately precede and follow the relevant items
STALKER [1998,1999,2001] • Hierarchical Information Extraction • Embedded Catalog Tree (ECT) Formalism • Advantages • Extracts nested data • Allows item permutations and missing items • Need not see all of the permutations • One hard-to-extract item does not affect others • Drawbacks • Does not exploit item order
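One way to see why permutations and missing items are harmless here is to give every item its own independent rule. The sketch below assumes rules made of literal landmarks that are skipped in sequence. It leaves out wildcards, disjunctions, rule learning, and the nesting described by the ECT, and all names and sample documents are hypothetical.

```python
def skip_to(text, landmarks, start=0):
    """Consume each landmark in turn; return the index after the last one."""
    pos = start
    for landmark in landmarks:
        found = text.find(landmark, pos)
        if found == -1:
            return None
        pos = found + len(landmark)
    return pos


def extract(text, start_rule, end_landmark):
    """Extract the text between a start rule and the next end landmark."""
    begin = skip_to(text, start_rule)
    if begin is None:
        return None
    end = text.find(end_landmark, begin)
    return text[begin:end] if end != -1 else None


# One independent rule per item: (start landmarks, end landmark).
RULES = {
    "name": (["Name:", "<b>"], "</b>"),
    "cuisine": (["Cuisine: "], "<p>"),
    "phone": (["Phone: "], "<p>"),
}


def extract_all(doc):
    return {item: extract(doc, s, e) for item, (s, e) in RULES.items()}


DOC = "Name: <b>Yala</b><p>Cuisine: Thai<p>Phone: (800) 111-1717<p>"
# Items permuted and the cuisine missing.
DOC2 = "Phone: (800) 111-1717<p>Name: <b>Yala</b><p>"
full = extract_all(DOC)
partial = extract_all(DOC2)
```

Because no rule refers to another item's position, a missing cuisine yields `None` for that item only. The same independence means item order carries no information the rules can use, which is the drawback listed on the slide.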
Web IE Tools (main technique used) • Wrapper languages (TSIMMIS, WebOQL) • HTML-aware (W4F, XWRAP, RoadRunner, Lixto) • NLP-based (RAPIER, SRV, WHISK) • Inductive learning (WIEN, SoftMealy, STALKER) • Modeling-based (NoDoSE, DEByE) • Ontology-based (BYU ontology)
Degree of Automation • Trade-off: page-layout dependent • RoadRunner • Assumes target pages were automatically generated from some data source • The only fully automatic wrapper generator • BYU ontology • Manually created with graphical editing tool • Extraction process fully automatic
Support of Complex Objects • Complex objects: nested objects, graphs, trees, complex tables, … • Earlier tools such as RAPIER, SRV, WHISK, and WIEN do not support extracting complex objects. • BYU ontology • Supported
Page Contents • Semistructured data (table type, richly tagged) • Semistructured text (text type, rarely tagged) • NLP-based tools: text type only • Other tools (except ontology-based): table type only • BYU ontology: both types
Ease of Use • HTML-aware tools, easiest to use • Wrapper languages, hardest to use • Other tools, in the middle
Output • XML is the best output format for data sharing on the Web.
Support for Non-HTML Sources • NLP-based and ontology-based tools support them automatically • Other tools may support them but need additional helpers such as syntactic and semantic analyzers • BYU ontology • Supported
Resilience and Adaptiveness • Resilience: continuing to work properly when the target pages change • Adaptiveness: working properly with pages from other sources in the same application domain • Only the BYU ontology has both features.
X means the information extraction system has the capability; X* means the system has the capability as long as the training corpus can accommodate the required training data; ? means the system has the capability to some degree; * means that the extraction pattern itself does not show the capability, but the overall system has it.
Problem of IE (unstructured documents) • [Diagram labels: Information Extraction; Source, Target; Data, Information, Knowledge, Meaning]
Problem of IE (structured documents) • [Diagram labels: Information Extraction; Source, Target; Data, Information, Knowledge, Meaning]
Problem of IE (semistructured documents) • [Diagram labels: Information Extraction; Source, Target; Data, Information, Knowledge, Meaning]
Solution of IE (the Semantic Web) • [Diagram labels: Information Extraction; Source, Target; Data, Information, Knowledge, Meaning]