Relational Learning of Pattern-Match Rules for Information Extraction. Presentation by Tim Chartrand of a paper by Mary Elaine Califf and Raymond J. Mooney.
Introduction
• Information Extraction (IE) is the task of locating specific pieces of information in natural language (NL) text
• IE is an important subpart of text understanding
• IE systems are difficult and time-consuming to build, and they don't port well to different domains
• Researchers are combining learning methods with NLP methods to automate the construction of IE systems
Overview of RAPIER
• RAPIER – Robust Automated Production of Information Extraction Rules
• Learns IE rules automatically
• Uses a corpus of documents paired with filled templates
• Resulting rules do not require prior parsing or subsequent processing
• Uses limited syntactic information from a part-of-speech (POS) tagger
• Induced patterns incorporate semantic classes
• Rules characterize slot fillers and their context
RAPIER Rules
• Consist of three parts:
   • Pre-filler pattern – matches the text immediately preceding the extracted information
   • Filler pattern – matches the exact text to be extracted
   • Post-filler pattern – matches the text immediately following the extracted information
• Each pattern is a sequence of pattern items or pattern lists
   • A pattern item specifies constraints for exactly one word or symbol
   • A pattern list specifies constraints for 0..n words or symbols
• Constraints include:
   • A list of words, one of which must match the item
   • A POS tag
   • A semantic class
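The three-part rule structure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not RAPIER's actual implementation: the names `Element`, `Rule`, `match_ends` and `extract` are hypothetical, semantic-class constraints are left out, and tokens are assumed to arrive as (word, POS tag) pairs from some tagger.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Sequence, Set, Tuple

Token = Tuple[str, str]  # (word, POS tag); the tag set is an assumption


@dataclass(frozen=True)
class Element:
    """One pattern element. max_len=None means a pattern item (exactly one
    token); max_len=n means a pattern list (0..n tokens)."""
    words: Optional[FrozenSet[str]] = None  # None = no word constraint
    tags: Optional[FrozenSet[str]] = None   # None = no POS constraint
    max_len: Optional[int] = None

    def accepts(self, token: Token) -> bool:
        word, tag = token
        return ((self.words is None or word.lower() in self.words)
                and (self.tags is None or tag in self.tags))


def match_ends(pattern: Sequence[Element], tokens: Sequence[Token],
               start: int) -> Set[int]:
    """Return every position where `pattern` can end if it begins at `start`."""
    positions = {start}
    for elem in pattern:
        nxt: Set[int] = set()
        for pos in positions:
            if elem.max_len is None:  # pattern item: consume exactly one token
                if pos < len(tokens) and elem.accepts(tokens[pos]):
                    nxt.add(pos + 1)
            else:  # pattern list: consume between 0 and max_len tokens
                nxt.add(pos)
                end = pos
                while (end < len(tokens) and end - pos < elem.max_len
                       and elem.accepts(tokens[end])):
                    end += 1
                    nxt.add(end)
        positions = nxt
    return positions


@dataclass(frozen=True)
class Rule:
    pre: Tuple[Element, ...]
    filler: Tuple[Element, ...]
    post: Tuple[Element, ...]


def extract(rule: Rule, tokens: Sequence[Token]) -> List[str]:
    """Return non-empty fillers whose surrounding context matches the rule."""
    spans = set()
    for start in range(len(tokens) + 1):
        for pre_end in match_ends(rule.pre, tokens, start):
            for fill_end in match_ends(rule.filler, tokens, pre_end):
                if fill_end > pre_end and match_ends(rule.post, tokens, fill_end):
                    spans.add((pre_end, fill_end))
    return [" ".join(w for w, _ in tokens[a:b]) for a, b in sorted(spans)]


# Hypothetical rule: the word "in", then up to two proper nouns (the filler),
# then a comma followed by another proper noun.
rule = Rule(
    pre=(Element(words=frozenset({"in"})),),
    filler=(Element(tags=frozenset({"nnp"}), max_len=2),),
    post=(Element(words=frozenset({","})), Element(tags=frozenset({"nnp"}))),
)
tokens = [("located", "vbn"), ("in", "in"), ("Atlanta", "nnp"),
          (",", ","), ("Georgia", "nnp"), (".", ".")]
fillers = extract(rule, tokens)
print(fillers)
```

Tracking a set of possible end positions keeps the matcher simple even though a pattern list may consume a variable number of tokens; a production system would likely need something more efficient.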
Learning Algorithm
Example phrases: "located in Atlanta, Georgia." and "offices in Kansas City, Missouri."

For each slot S in the template being learned
   SlotRules = most specific rules from the documents for S
   while compression has failed fewer than lim times
      randomly select r pairs of rules from SlotRules
      find the set L of generalizations of the fillers of the rule pairs
      create rules from L, evaluate, and initialize RuleList
      let n = 0
      while best rule in RuleList produces spurious fillers and the weighted information value of the best rule is improving
         increment n
         specialize each rule in RuleList with generalizations of the last n items of the pre-filler patterns of the rule pair and add specializations to RuleList
         specialize each rule in RuleList with generalizations of the first n items of the post-filler patterns of the rule pair and add specializations to RuleList
      if best rule in RuleList produces only valid fillers
         add it to SlotRules
         remove empirically subsumed rules
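The "find the set L of generalizations" step can be illustrated on a single pair of pattern items. The sketch below rests on an assumption the slides do not spell out: when two constraints differ, the candidate generalizations are their disjunction (union) and the dropped constraint. The function names and the (words, tags) tuple layout are hypothetical.

```python
from itertools import product
from typing import FrozenSet, List, Optional, Tuple

Constraint = Optional[FrozenSet[str]]  # None means "no constraint"
Item = Tuple[Constraint, Constraint]   # (word constraint, POS-tag constraint)


def generalize_constraint(a: Constraint, b: Constraint) -> List[Constraint]:
    """Candidate generalizations of two constraints (assumed scheme)."""
    if a == b:
        return [a]           # identical constraints are kept unchanged
    if a is None or b is None:
        return [None]        # one side is already unconstrained
    return [a | b, None]     # either the disjunction or no constraint


def generalize_items(x: Item, y: Item) -> List[Item]:
    """All combinations of word and tag generalizations for two items."""
    words = generalize_constraint(x[0], y[0])
    tags = generalize_constraint(x[1], y[1])
    return list(product(words, tags))


# Items taken from the two example phrases: "Atlanta" and "Kansas".
atlanta = (frozenset({"atlanta"}), frozenset({"nnp"}))
kansas = (frozenset({"kansas"}), frozenset({"nnp"}))
candidates = generalize_items(atlanta, kansas)
for cand in candidates:
    print(cand)
```

Here the tags agree and the words differ, so two candidate items result: one that accepts either word, and one that only requires a proper noun. Each candidate rule would then be scored against the training data as in the loop above.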
Experimental Results
• The task: extract information from computer-related job postings
• 17 slots used, including employer, salary, etc.
• The reported results do not employ semantic categories
• 100-document dataset with filled templates, evaluated with 10-fold cross-validation
• Measured precision, recall, and F-measure
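For reference, the three measures can be computed from simple counts. This is a minimal sketch that assumes the balanced F-measure (equal weight on precision and recall), since the slides do not say which weighting was used; the function name and the counts in the example are made up for illustration and are not results from the paper.

```python
from typing import Tuple


def precision_recall_f(correct: int, extracted: int,
                       expected: int) -> Tuple[float, float, float]:
    """correct:   extracted fillers that match the filled template
    extracted: all fillers the rules produced
    expected:  all fillers present in the filled templates"""
    precision = correct / extracted if extracted else 0.0
    recall = correct / expected if expected else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Made-up counts: 10 fillers extracted, 8 of them correct, 16 expected.
p, r, f = precision_recall_f(correct=8, extracted=10, expected=16)
print(f"precision={p:.2f} recall={r:.2f} F={f:.3f}")
```

With 10-fold cross-validation, these values would be computed on each held-out fold and then averaged.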
Experimental Results – continued
• Performance:
   • Is comparable to Crystal on a medical domain
   • Is better than AutoSlog and AutoSlog-TS on the MUC-4 terrorism task
   • Is hard to compare directly because the systems were tested on different domains
   • Is good for this task, because precision is the most important measure
Related Work
• Resolve
   • Uses decision trees
   • Uses annotated coreference examples
• Crystal
   • Uses a clustering algorithm to build a dictionary of extraction patterns
   • Requires patterns identified by an expert
   • Requires prior syntax analysis to identify syntactic elements and their relationships
• AutoSlog
   • Specializes a set of general syntactic patterns
   • An expert must examine the patterns it produces
   • Requires prior syntax analysis
• Liep
   • Requires prior syntax analysis
   • Makes no real use of semantic information
   • Has not been applied to complex domains
Related Work – BYU DEG
• RAPIER rules correspond closely to DEG data frames
   • Data frames are finer-grained: they are based on character patterns, whereas RAPIER rules are based on word patterns
   • Pre-filler and post-filler patterns correspond closely to data frame contexts and keywords
   • Semantic categories correspond closely to lexicons
• The paper does not mention how RAPIER handles multiple-record documents
• The RAPIER data structure is given by the template (slots) defined in the input data
• RAPIER is very similar in purpose to what Joe is trying to do: learn extraction rules based on a filled-in form
Conclusions
• Extracting desired pieces of information from NL text is important
• Manually constructing IE systems is too hard
• RAPIER uses relational learning to build a set of pattern-match rules, given a database of texts and filled templates
• Learned patterns employ syntactic and semantic information to match slot fillers and their context
• Fairly accurate results can be obtained for a real-world problem with relatively small datasets
• RAPIER compares favorably with other IE learning systems