Plain Text Information Extraction (based on Machine Learning) Chia-Hui Chang Department of Computer Science & Information Engineering National Central University chia@csie.ncu.edu.tw 9/24/2002
Introduction • Plain Text Information Extraction • The task of locating specific pieces of data in a natural-language document • The goal is to obtain useful structured information from unstructured text • DARPA's MUC program • The extraction rules are based on • a syntactic analyzer • a semantic tagger
Related Work • On-line documents • SRV, AAAI-1998, D. Freitag • Rapier, ACL-1997, AAAI-1999, M. E. Califf • WHISK, ML-1999, Soderland • Free-text documents • PALKA, MUC-5, 1993 • AutoSlog, AAAI-1993, E. Riloff • LIEP, IJCAI-1995, Huffman • Crystal, IJCAI-1995, KDD-1997, Soderland
SRV: Information Extraction from HTML: Application of a General Machine Learning Approach Dayne Freitag Dayne@cs.cmu.edu AAAI-98
Introduction • SRV • A general-purpose relational learner • A top-down relational algorithm for IE • Reliance on a set of token-oriented features • Extraction pattern • First-order logic extraction pattern with predicates based on attribute-value tests
Extraction as Text Classification • Identify the boundaries of field instances • Treat each fragment as a bag-of-words • Find relations from the surrounding context
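A minimal sketch (not from Freitag's paper) of the extraction-as-classification view: enumerate every short token fragment of a document and score each one as a bag of words. The function name score_fragment and the toy scoring rule are purely illustrative, assuming Python.

```python
# Minimal sketch: treat every short token fragment as a classification instance.
# The classifier here is a stand-in; SRV itself learns relational rules instead.

def enumerate_fragments(tokens, max_len=5):
    """Yield (start, end) spans of all fragments up to max_len tokens."""
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + max_len, len(tokens)) + 1):
            yield start, end

def bag_of_words(tokens, start, end):
    """Represent a fragment as a bag (set) of lower-cased tokens."""
    return set(t.lower() for t in tokens[start:end])

def extract(tokens, score_fragment, threshold=0.5):
    """Return fragments whose classifier score exceeds the threshold."""
    hits = []
    for start, end in enumerate_fragments(tokens):
        features = bag_of_words(tokens, start, end)
        if score_fragment(features) > threshold:
            hits.append(" ".join(tokens[start:end]))
    return hits

# Toy usage: "score" a fragment by whether it contains a course-like keyword.
tokens = "CS 310 Introduction to Databases meets at 3 pm".split()
print(extract(tokens, lambda bag: 1.0 if "databases" in bag else 0.0))
```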
Relational Learning • Inductive Logic Programming (ILP) • Input: class-labeled instances • Output: classifier for unlabeled instances • Typical covering algorithm • Attribute-value tests are added greedily to a rule • The number of positive examples covered is heuristically maximized while the number of negative examples covered is heuristically minimized
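A hedged sketch of the generic covering loop described above (SRV's actual predicate language and scoring differ): rules are learned one at a time, and positives covered by a new rule are removed before the next rule is grown. Here grow_rule is a placeholder for the greedy per-rule search sketched under Search below.

```python
# Illustrative covering algorithm: learn rules until (almost) all positives are covered.
# grow_rule() is a placeholder for the greedy per-rule search.

def learn_rule_set(positives, negatives, grow_rule):
    rules = []
    remaining = list(positives)
    while remaining:
        rule = grow_rule(remaining, negatives)   # greedily add attribute-value tests
        covered = [p for p in remaining if rule(p)]
        if not covered:                          # no progress: stop to avoid looping
            break
        rules.append(rule)
        remaining = [p for p in remaining if not rule(p)]  # drop covered positives
    return rules
```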
Simple Features • Features of individual tokens • Length (e.g. single letter or multiple letters) • Character type (e.g. numeric or alphabetic) • Orthography (e.g. capitalized) • Part of speech (e.g. verb) • Lexical meaning (e.g. geographical_place)
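As a rough illustration, feature types of this kind can be computed per token as follows; the feature names and the tiny GEO_TERMS lexicon are stand-ins, not SRV's actual feature set.

```python
# Illustrative token features; SRV's real feature set is richer and configurable.
GEO_TERMS = {"paris", "london", "taipei"}   # stand-in for a lexical/semantic lexicon

def token_features(token, pos_tag=None):
    return {
        "length": "single" if len(token) == 1 else "multiple",
        "char_type": "numeric" if token.isdigit()
                     else "alphabetic" if token.isalpha() else "mixed",
        "capitalized": token[:1].isupper(),
        "part_of_speech": pos_tag,                 # supplied by an external tagger
        "geographical_place": token.lower() in GEO_TERMS,
    }

print(token_features("Taipei", pos_tag="NNP"))
```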
Individual Predicates • Individual predicate: • Length (=3): accepts only fragments containing three tokens • Some(?A [] capitalizedp true): the fragment contains some token that is capitalized • Every(numericp false): every token in the fragment is non-numeric • Position(?A fromfirst <2): the token bound to ?A is either first or second in the fragment • Relpos(?A ?B =1): the token bound to ?A immediately precedes the token bound to ?B
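Read as tests over a fragment (a list of per-token feature dictionaries such as those built above), the predicates can be sketched as below; variable binding (?A, ?B) is simplified to explicit token indices, so this is only an approximation of SRV's predicate semantics.

```python
# Simplified predicate tests over a fragment (a list of token-feature dicts).
def length_eq(fragment, n):
    return len(fragment) == n

def some(fragment, feature, value):
    # some-predicate without a relational path: some token has feature == value
    return any(tok.get(feature) == value for tok in fragment)

def every(fragment, feature, value):
    return all(tok.get(feature) == value for tok in fragment)

def position_fromfirst_lt(index, k):
    # the token bound at `index` is among the first k tokens of the fragment
    return index < k

def relpos_eq(index_a, index_b, d):
    # token A immediately precedes token B when d == 1
    return index_b - index_a == d
```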
Relational Features • Relational Feature types • Adjacency (next_token) • Linguistic syntax (subject_verb)
Search • Predicates are added greedily, attempting to cover as many positive and as few negative examples as possible. • At every step in rule construction, all documents in the training set are scanned and every text fragment of appropriate size is examined. • Every legal predicate is assessed in terms of the number of positive and negative examples it covers. • A position-predicate is not legal unless a some-predicate is already part of the rule
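One greedy step of this search might look like the following sketch: every candidate predicate is scored by how many positive versus negative fragments the extended rule still covers. The simple pos - neg score is a stand-in for SRV's actual gain metric, and the legality checks are omitted.

```python
# One greedy step of rule growth: pick the candidate test with the best coverage trade-off.
def best_predicate(candidates, positives, negatives, rule_tests):
    def covers(example, tests):
        return all(t(example) for t in tests)

    best, best_score = None, float("-inf")
    for cand in candidates:
        tests = rule_tests + [cand]
        pos = sum(covers(e, tests) for e in positives)
        neg = sum(covers(e, tests) for e in negatives)
        score = pos - neg            # stand-in for an information-gain style metric
        if pos > 0 and score > best_score:
            best, best_score = cand, score
    return best
```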
Relational Paths • Relational features are used only in the Path argument to the some-predicate • Some(?A [prev_token prev_token] capitalized true): The fragment contains some token preceded by a capitalized token two tokens back.
Validation • Training Phase • 2/3 of the training data: learning • 1/3: validation • Testing • Bayesian m-estimates score each learned rule on the validation split • All rules matching a given fragment are used to assign a confidence score • Combined confidence (see the formulas below)
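The two elided formulas can be written out in a common form; the exact constants and combination rule in the paper may differ, so treat this as a plausible reconstruction. With c correct and n total validation-set predictions for a rule r, a prior p, and a weight m:

\text{conf}(r) = \frac{c + m\,p}{n + m}, \qquad \text{conf}(\text{fragment}) = 1 - \prod_i \bigl(1 - \text{conf}(r_i)\bigr)

That is, a fragment's combined confidence treats the matching rules as if they err independently.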
Experiments • Data Source: • Four university computer science departments: Cornell, U. of Texas, U. of Washington, U. of Wisconsin • Data Set: • Course: title, number, instructor • Project: title, member • 105 course pages • 96 project pages • Two Experiments • Random: 5-fold cross-validation • LOUO (leave one university out): 4-fold experiments
OPD Coverage: Each rule has its own confidence
Baseline Strategies (OPD, MPD) • A baseline that simply memorizes field instances • A random guesser
Conclusions • Increased modularity and flexibility • Domain-specific information is separate from the underlying learning algorithm • Top-down induction • From general to specific • Accuracy-coverage trade-off • Associates a confidence score with each prediction • Critique: learns only single-slot extraction rules
RAPIER: Relational Learning of Pattern-Match Rules for Information Extraction M.E. Califf and R.J. Mooney ACL-97, AAAI-1999
Rule Representation • Single-slot extraction patterns • Syntactic information (part-of-speech tagger) • Semantic class information (WordNet)
The Learning Algorithm • A specific-to-general search • The pre-filler pattern contains an item for each word preceding the filler • The filler pattern has one item for each word in the filler • The post-filler pattern has one item for each word following the filler • Compress the rules for each slot • Generate the least general generalization (LGG) of each pair of rules • When the LGG of two constraints is a disjunction, two alternatives are created: (1) the disjunction, (2) removal of the constraint
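A minimal sketch of the LGG step for a single word constraint, assuming Python; RAPIER also generalizes part-of-speech and semantic constraints, which this toy omits. Identical values stay, and differing values produce the two alternatives named above: a disjunction or a dropped constraint.

```python
# Illustrative LGG of two word constraints; returns the candidate generalizations.
def lgg_word_constraint(w1, w2):
    if w1 == w2:
        return [{"word": w1}]                 # identical constraints stay as-is
    # differing values: (1) disjunction of the two words, (2) drop the constraint
    return [{"word": {w1, w2}}, {}]

print(lgg_word_constraint("atlanta", "kansas"))   # [{'word': {'atlanta', 'kansas'}}, {}]
print(lgg_word_constraint("in", "in"))            # [{'word': 'in'}]
```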
Example • Located in Atlanta, Georgia. • Offices in Kansas City, Missouri.
Example: Located in Atlanta, Georgia. Offices in Kansas City, Missouri. • Assume there is a semantic class for states, but not one for cities.
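Roughly, the generalization of the two seed rules above can be pictured as follows; this is a hedged reconstruction in an ad-hoc dictionary notation, not RAPIER's actual rule syntax. The pre-filler keeps the word "in", the filler generalizes to a short list of proper nouns (the city names differ, so the word constraint is dropped), and the post-filler keeps the semantic class state.

```python
# Hedged reconstruction of the generalized rule for the two seeds above (ad-hoc notation).
generalized_rule = {
    "pre_filler":  [{"word": "in", "tag": "IN"}],                # the literal word "in"
    "filler":      [{"list_max_length": 2, "tag": "NNP"}],       # city name: word constraint dropped
    "post_filler": [{"tag": "NNP", "semantic_class": "state"}],  # state name via its semantic class
}
```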
Experimental Evaluation • 300 computer-related job postings • 17 slots, including employer, location, salary, job requirements, language, and platform
Experimental Evaluation • 485 seminar announcements • 4 slots: start time, end time, speaker, and location
WHISK: S. Soderland, University of Washington, Machine Learning journal, 1999
Free Text rule example (pattern elements: person name, position, verb stems)
WHISK Rule Representation • For Semi-structured IE
WHISK Rule Representation • For Free Text IE • Skip only within the same syntactic field • Example rule elements: person name, position, verb stems
Creating a Rule from a Seed Instance • Top-down rule induction • Start from an empty rule • Add terms within the extraction boundary (Base_1) • Add terms just outside the extraction (Base_2) • Until the seed is covered
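A hedged sketch of that growth loop, assuming Python; WHISK's real rules are regular-expression-like patterns and its term choice is guided by training-set error, so only the Base_1/Base_2 ordering and the stopping condition are shown here.

```python
# Illustrative top-down growth of a WHISK-style rule from one hand-tagged seed.
def grow_rule_from_seed(seed_tokens, boundary, covers):
    """boundary = (start, end) indices of the tagged extraction in seed_tokens;
    covers(rule, seed_tokens, boundary) -> True when the rule extracts the seed."""
    inside  = seed_tokens[boundary[0]:boundary[1]]                    # Base_1 candidates
    context = seed_tokens[:boundary[0]] + seed_tokens[boundary[1]:]   # Base_2 candidates
    rule = []                                                         # start from an empty rule
    for term in inside + context:
        if covers(rule, seed_tokens, boundary):                       # stop once the seed is covered
            break
        rule.append(term)                                             # add the next candidate term
    return rule
```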
AutoSlog: Automatically Constructing a Dictionary for Information Extraction Tasks Ellen Riloff Dept. of Computer Science, University of Massachusetts, AAAI-93
AutoSlog • Purpose: • Automatically constructs a domain-specific dictionary for IE • Extraction patterns (concept nodes): • Conceptual anchor: a trigger word • Enabling conditions: constraints
Concept Node Example: physical target slot of a bombing template
Construction of Concept Nodes • Given a target piece of information, • AutoSlog finds the first sentence in the text that contains the string. • The sentence is handed over to CIRCUS, which generates a conceptual analysis of the sentence. • The first clause in the sentence is used. • A set of heuristics is applied to suggest a good conceptual anchor point for a concept node. • If none of the heuristics is satisfied, AutoSlog searches for the next sentence containing the string and repeats from the CIRCUS analysis step.
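A minimal sketch of this loop, assuming Python; the heuristic patterns, the CIRCUS parser, and the answer-key handling are all stood in by simple callables.

```python
# Illustrative AutoSlog-style loop: find a sentence containing the target string,
# parse it, and let the first matching heuristic propose a conceptual anchor.
def propose_concept_node(sentences, target, parse, heuristics):
    for sentence in sentences:
        if target not in sentence:
            continue
        analysis = parse(sentence)            # stand-in for CIRCUS; first clause only
        for heuristic in heuristics:
            anchor = heuristic(analysis, target)
            if anchor is not None:            # heuristic fired: build a concept node
                return {"trigger": anchor, "slot_filler": target}
    return None                               # no heuristic fired in any sentence
```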
Background Knowledge • Concept Node Construction • Slot • The slot of the answer key • Hard and soft constraints • Type: Use template types such as bombing, kidnapping • Enabling condition: heuristic pattern • Domain Specification • The type of a template • The constraints for each template slot
Another good concept node definition: perpetrator slot from a perpetrator template
A bad concept node definition: victim slot from a kidnapping template
Empirical Results • Input: • Annotated corpus of texts in which the targeted information is marked and annotated with semantic tags denoting the type of information (e.g., victim) and the type of event (e.g., kidnapping) • 1500 texts with 1258 answer keys containing 4780 string fillers • Output: • 1237 concept node definitions • Human intervention: 5 user-hours to sift through all generated concept nodes • 450 definitions are kept • Performance:
Conclusion • In 5 person-hours, AutoSlog creates a dictionary that achieves 98% of the performance of a hand-crafted dictionary • Each concept node is a single-slot extraction pattern • Reasons for bad definitions • When a sentence contains the targeted string but does not describe the event • When a heuristic proposes the wrong conceptual anchor point • When CIRCUS incorrectly analyzes the sentence