Sources of Success for Information Extraction Methods
Seminar for Computational Learning and Adaptation, Stanford University, October 25, 2001
Joseph Smarr (jsmarr@stanford.edu)
Based on research conducted at UC San Diego in Summer 2001 with Charles Elkan and David Kauchak
Overview and Themes: Identifying “Sources of Success”
• Brief overview of the Information Extraction (IE) paradigm and current methods
• Getting “under the hood” of current systems to understand the source of their performance and limitations
• Identifying new sources of information to exploit for increased performance and usefulness
Motivation for Information Extraction
• Abundance of freely available text in digital form (WWW, MEDLINE, etc.)
• Information contained in un-annotated text is largely inaccessible to computers
• Much of this information appears “ripe for the plucking” without having to do full text understanding
Highly Structured Example: Amazon.com Book Info Pages
Desired Info: title, author(s), price, availability, etc.
Partially Structured Example: SCLA Speaker Announcement Emails
Desired Info: title, speaker, date, abstract, etc.
Natural Text Example: MEDLINE Journal Abstracts
BACKGROUND: The most challenging aspect of revision hip surgery is the management of bone loss. A reliable and valid measure of bone loss is important since it will aid in future studies of hip revisions and in preoperative planning. We developed a measure of femoral and acetabular bone loss associated with failed total hip arthroplasty. The purpose of the present study was to measure the reliability and the intraoperative validity of this measure and to determine how it may be useful in preoperative planning.
METHODS: From July 1997 to December 1998, forty-five consecutive patients with a failed hip prosthesis in need of revision surgery were prospectively followed. Three general orthopaedic surgeons were taught the radiographic classification system, and two of them classified standardized preoperative anteroposterior and lateral hip radiographs with use of the system. Interobserver testing was carried out in a blinded fashion. These results were then compared with the intraoperative findings of the third surgeon, who was blinded to the preoperative ratings. Kappa statistics (unweighted and weighted) were used to assess correlation. Interobserver reliability was assessed by examining the agreement between the two preoperative raters. Prognostic validity was assessed by examining the agreement between the assessment by either Rater 1 or Rater 2 and the intraoperative assessment (reference standard).
RESULTS: With regard to the assessments of both the femur and the acetabulum, there was significant agreement (p < 0.0001) between the preoperative raters (reliability), with weighted kappa values of >0.75. There was also significant agreement (p < 0.0001) between each rater's assessment and the intraoperative assessment (validity) of both the femur and the acetabulum, with weighted kappa values of >0.75.
CONCLUSIONS: With use of the newly developed classification system, preoperative radiographs are reliable and valid for assessment of the severity of bone loss that will be found intraoperatively.
Desired Info: subject size, study type, condition studied, etc.
Current Types of IE Systems
• Hand-built systems
  • Often effective, but slow and expensive to build and adapt
• Stochastic generative models
  • HMMs, N-Grams, PCFGs, etc.
  • Keep separate distributions for “content” and “filler” states
• Induced rule-based systems
  • Learn to identify local “landmarks” for the beginning and end of target information
Formalization of Information Extraction
• Performance task:
  • Extract specific tokens from a set of documents that contain the desired information
• Performance measure:
  • Precision: # correct returned / total # returned
  • Recall: # correct returned / total # correct
  • F1: harmonic mean of precision and recall
• Learning paradigm:
  • Supervised learning on a set of documents with target fields manually labeled
  • Usually train/test on one field at a time
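For reference, the three measures above can be restated compactly as formulas (this block is a restatement added for clarity, not part of the original slides):

```latex
\[
P = \frac{\#\,\text{correct returned}}{\#\,\text{returned}}, \qquad
R = \frac{\#\,\text{correct returned}}{\#\,\text{correct}}, \qquad
F_1 = \frac{2PR}{P + R}
\]
```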
IE as a Classification Task: Token Extraction as Boundary Detection
• Input: a linear sequence of tokens
  Date : Thursday , October 25 Time : 4 : 15 - 5 : 30 PM
• Method: binary classification of inter-token boundaries as start / end of content; all other boundaries are unimportant
• Output: the tokens between an identified start / end boundary pair (here, “Thursday , October 25”)
Representation of Boundary Classifiers
• “Boundary Detectors” are pairs of token sequences <p, s>
  • A detector matches a boundary iff p matches the text before the boundary and s matches the text after the boundary
• Detectors can contain wildcards, e.g. “capitalized word”, “number”, etc.
• Example: <Date:, [CapitalizedWord]> matches the beginning of “Date: Thursday, October 25”
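To make the matching semantics concrete, here is a minimal sketch (not the authors' code) of applying a <prefix, suffix> detector to a token sequence; the wildcard names ("[Cap]", "[Num]") and the helper functions are invented for this illustration.

```python
# Illustrative sketch of a <prefix, suffix> boundary detector.
# Wildcard names such as "[Cap]" and "[Num]" are assumptions for this example.

def token_matches(pattern, token):
    """True if a single pattern element matches a token."""
    if pattern == "[Cap]":
        return token[:1].isupper()
    if pattern == "[Num]":
        return token.isdigit()
    return pattern == token                      # literal token match


def detector_matches(prefix, suffix, tokens, boundary):
    """Check the detector at the boundary just before tokens[boundary]."""
    before = tokens[boundary - len(prefix):boundary]
    after = tokens[boundary:boundary + len(suffix)]
    return (len(before) == len(prefix) and len(after) == len(suffix)
            and all(token_matches(p, t) for p, t in zip(prefix, before))
            and all(token_matches(s, t) for s, t in zip(suffix, after)))


tokens = ["Date", ":", "Thursday", ",", "October", "25"]
fore = (["Date", ":"], ["[Cap]"])                # <Date:, [CapitalizedWord]>
print([b for b in range(1, len(tokens)) if detector_matches(*fore, tokens, b)])
# -> [2]: the boundary just before "Thursday"
```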
Boosted Wrapper Induction (BWI): Exemplar of Current Rule-Based Systems
• Wrapper Induction is a high-precision, low-recall learner that performs well for highly structured tasks
• Boosting is a technique for combining multiple weak learners into a strong learner by re-weighting examples
• Boosted Wrapper Induction (BWI) was proposed by Freitag and Kushmerick in 2000 as the marriage of these two techniques
BWI Algorithm
• Given a set of documents with labeled fore and aft boundaries, induce <F, A, H>:
  • F: set of “fore detectors”
  • A: set of “aft detectors”
  • H: histogram of field lengths (for pairing fore and aft detectors)
• To learn each boundary detector:
  • Start with an empty rule
  • Exhaustively enumerate all extensions up to lookahead length L
  • Add the best-scoring token extension
  • Repeat until no extension improves the score
• After learning a new detector:
  • Re-weight documents according to AdaBoost (down-weight correctly covered docs, up-weight incorrectly covered docs, normalize all weights)
  • Repeat the process, learning a new rule and re-weighting each time
  • Stop after a predetermined number of iterations
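A schematic sketch of this boosting loop is shown below. It is illustrative only: learn_detector stands in for the greedy lookahead rule search described above, the detector's predicts() interface is an assumption, and the confidence/re-weighting update follows the standard AdaBoost form rather than the published BWI implementation.

```python
import math

# Schematic sketch of a BWI-style boosting loop (illustrative, not the
# original implementation). learn_detector() is a placeholder for the
# greedy lookahead search over token extensions.

def boost_detectors(examples, weights, num_rounds, learn_detector):
    """examples: list of (boundary, is_positive); weights: parallel list."""
    detectors = []
    for _ in range(num_rounds):                      # fixed number of iterations
        d = learn_detector(examples, weights)        # best rule on current weights
        err = sum(w for (x, y), w in zip(examples, weights)
                  if d.predicts(x) != y) / sum(weights)
        err = min(max(err, 1e-10), 1 - 1e-10)        # guard against 0 or 1
        alpha = 0.5 * math.log((1 - err) / err)      # detector confidence
        detectors.append((d, alpha))
        # AdaBoost re-weighting: down-weight correct, up-weight incorrect.
        weights = [w * math.exp(-alpha if d.predicts(x) == y else alpha)
                   for (x, y), w in zip(examples, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]       # normalize all weights
    return detectors
```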
Summary of Original BWI Results
• BWI gives state-of-the-art performance on highly structured and partially structured tasks
• No systematic analysis of why BWI performs well
• BWI proposed as a solution for natural text IE, but no tests conducted
Goals of Our Research
• Understand specifically how boosting contributes to BWI’s performance
• Investigate the relationship between performance and task regularity
• Identify new sources of information to improve performance, particularly for natural language tasks
Comparison Algorithm: Sequential Wrapper Induction (SWI)
• Same formulation as BWI, but uses set covering instead of boosting to learn multiple rules:
  • Find the highest-scoring rule
  • Remove all positive examples covered by the new rule
  • Stop when all positive examples have been removed
• Scoring function, two choices:
  • Greedy-SWI: most positive examples covered without covering any negative examples
  • Root-SWI: sqrt(W+) - sqrt(W-), where W+ and W- are the total weight of positive and negative examples covered
• BWI uses root scoring, but many set covering methods use greedy scoring
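The sketch below illustrates the set-covering accumulation loop with the two scoring choices above. It is a simplified illustration: candidate_rules() and the rule objects' covers() method are placeholders (assumptions) for the same rule search BWI uses; only the accumulation and scoring logic shown here differ between the variants.

```python
import math

# Illustrative SWI-style set covering with the two scoring functions above.
# candidate_rules() and rule.covers() are placeholders for the rule search.

def root_score(rule, positives, negatives):
    w_pos = sum(1 for x in positives if rule.covers(x))
    w_neg = sum(1 for x in negatives if rule.covers(x))
    return math.sqrt(w_pos) - math.sqrt(w_neg)

def greedy_score(rule, positives, negatives):
    # Most positives covered; covering any negative disqualifies the rule.
    if any(rule.covers(x) for x in negatives):
        return float("-inf")
    return sum(1 for x in positives if rule.covers(x))

def swi(positives, negatives, candidate_rules, score=greedy_score):
    rules, remaining = [], list(positives)
    while remaining:                                    # stop when all covered
        scored = [(score(r, remaining, negatives), r)
                  for r in candidate_rules(remaining, negatives)]
        best_score, best = max(scored, key=lambda pair: pair[0])
        covered = [x for x in remaining if best.covers(x)]
        if best_score == float("-inf") or not covered:  # no usable rule left
            break
        rules.append(best)                              # keep the rule, then
        remaining = [x for x in remaining if not best.covers(x)]  # remove covered
    return rules
```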
Component Matrix of Algorithms

Scoring \ Accumulation      Boosting      Set Covering
Root                        BWI           Root-SWI
Greedy                                    Greedy-SWI

Rows: method for scoring individual detectors. Columns: method for accumulating multiple detectors.
Question #1: Does BWI Outperform the Greedy Approach of SWI?
(Results averaged over 8 partially structured IE tasks)
• BWI has higher F1 than Greedy-SWI
• Greedy-SWI tends to have slightly higher precision, but BWI has considerably higher recall
• Does this difference come from the scoring function or the accumulation method?
Question #2: How Does Performance Differ by Choice of Scoring Function?
(Results averaged over 8 partially structured IE tasks)
• Greedy-SWI and Root-SWI differ only in their scoring function
• Greedy-SWI has higher precision, Root-SWI has higher recall, and the two have similar F1
• BWI still outperforms Root-SWI, even though they use identical scoring functions
• Remaining differences: boosting vs. set covering, and the total number of rules learned
Question #3: How Does the Number of Rules Learned Affect Performance?
(Results averaged over 8 partially structured IE tasks)
• BWI learns a predetermined number of rules, but SWI stops when all examples are covered
• Usually BWI learns many more rules than Root-SWI
• We also stopped BWI after it had learned as many rules as Root-SWI (“Fixed-BWI”)
  • This produces a precision-recall tradeoff relative to Root-SWI
  • Full BWI outperforms Fixed-BWI
Analysis of Experimental Results: Why Does BWI Outperform SWI?
• Key insight: the source of BWI’s success is the interaction of two complementary effects, both due to boosting:
  • Re-weighting examples causes increasingly specific rules to be learned to cover exceptional cases (high precision)
  • Re-weighting examples instead of removing them means rules can be learned even after all examples have been covered (high recall)
Performance vs. Task Regularity Reveals an Important Interaction
(Results grouped into highly structured, partially structured, and natural text tasks)
• All methods perform better on tasks with more structure
• The relative power of different algorithmic components varies with task regularity
How Do We Quantify Task Regularity?
• Goal: measure the relationship between task regularity and performance
• Proposed solution: “SWI-Ratio”
  SWI-Ratio = (# of iterations Greedy-SWI takes to cover all positive examples) / (total number of positive examples)
• Most regular case: 1 rule covers all N examples, giving 1/N ≈ 0
• Least regular case: a separate rule for each example, giving N/N = 1
• Since each new rule must cover at least one example, SWI will learn at most N rules for N examples (and usually much fewer), so the SWI-Ratio is always between 0 and 1 (smaller = more regular)
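Written out as a formula (a restatement of the definition above, with N the total number of positive examples):

```latex
\[
\text{SWI-Ratio} \;=\; \frac{\#\,\text{rules Greedy-SWI needs to cover all positive examples}}{N},
\qquad
\frac{1}{N} \;\le\; \text{SWI-Ratio} \;\le\; \frac{N}{N} = 1 .
\]
```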
Desirable Properties of SWI-Ratio
• Relative to the size of the document collection, so it is suitable for comparison across different collection sizes
• General and objective
  • SWI is very simple and doesn’t allow any negative examples, giving an unbiased account of how many non-overlapping rules are needed to perfectly cover all examples
• Quick and easy to run
• No free parameters to set (except lookahead, which we kept fixed in all tests)
Performance of BWI and Greedy-SWI (F1) vs. Task Regularity (SWI-Ratio)
(Figure: dotted lines separate highly structured, partially structured, and natural text domains)
Improving IE Performance on Natural Text Documents
• Goal: compensate for weak IE performance on natural language tasks
  • Need to look elsewhere for regularities to exploit
• Idea: consider grammatical structure
  • Run a shallow parser on each sentence
  • Flatten the output into a sequence of “typed phrase segments” (using XML tags to mark the text)
• Example of a tagged sentence: “Uba2p is located largely in the nucleus.” segmented as NP_SEG, VP_SEG, PP_SEG, NP_SEG
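As a sketch of what this flattening step produces, the snippet below writes shallow-parse chunks out as XML-style typed phrase segments. The segment tags come from the slide's example, but the exact grouping of tokens into chunks and the XML formatting shown are assumptions; a real pipeline would get the chunks from a shallow parser rather than hard-coding them.

```python
# Sketch of flattening shallow-parse chunks into XML-style typed phrase
# segments. The chunker is assumed; chunks for the example sentence are
# written out by hand, and their boundaries are an illustrative guess.

def flatten_segments(chunks):
    """chunks: list of (segment_type, token_list) pairs from a shallow parser."""
    return " ".join(
        "<{t}> {words} </{t}>".format(t=tag, words=" ".join(tokens))
        for tag, tokens in chunks)

chunks = [("NP_SEG", ["Uba2p"]),
          ("VP_SEG", ["is", "located", "largely"]),
          ("PP_SEG", ["in"]),
          ("NP_SEG", ["the", "nucleus", "."])]
print(flatten_segments(chunks))
# <NP_SEG> Uba2p </NP_SEG> <VP_SEG> is located largely </VP_SEG> ...
```

Boundary detectors can then treat the segment tags as ordinary tokens, which is how the grammatical regularity becomes visible to the same simple IE framework.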
Typed Phrase Segments Improve BWI’s Performance on Natural Text IE Tasks
(Chart annotations: 21% increase, 65% increase, 45% increase)
Typed Phrase Segments Increase the Regularity of Natural Text IE Tasks
(Chart annotation: average decrease of 21%)
Encouraging Results Suggest Exploiting Other Sources of Regularity
• Key insight: we can improve performance on natural text while maintaining the simple IE framework if we expose the right regularities
• Suggests other linguistic abstractions may be useful
  • More grammatical info, semantic categories, lexical features, etc.
Conclusions and Summary
• Boosting is the key source of BWI’s success
  • Learns specific rules, but learns many of them
• IE performance is sensitive to task regularity
  • SWI-Ratio is a quantitative, objective measure of regularity (vs. subjective document classes)
• Exploiting more regularities in text is key to IE’s future, particularly in natural text
  • Canonical formatting and keywords are often sufficient in structured text documents
  • Exposing grammatical information boosts performance for natural text IE tasks
Acknowledgements
• Dayne Freitag, for making the BWI code available
• Mark Craven, for giving us natural text MEDLINE documents with annotated phrase segments
• MedExpert International, Inc., for financial support of this research
• Charles Elkan and David Kauchak, for hosting me at UCSD this summer
This work was conducted as part of the California Institute for Telecommunications and Information Technology, Cal-(IT)2.