Declarative Information Extraction The Avatar Group, IBM Almaden Research Center Rajasekar Krishnamurthy, Yunyao Li, Sriram Raghavan, Frederick Reiss, Shivakumar Vaithyanathan, and Huaiyu Zhu Sonoma State University Computer Science Colloquium
Motivation • John and Jane are going to a salsa party tonight! But … • “Where is the party?” • “Hmmm… I don’t know. Let me check my email.”
Where is the party? • Search for “salsa address”: 0 emails found • Search for “salsa”: 100 emails found • Search for “address”: 0 emails found • The email with the answer: Hi guys, We are planning a salsa party tonight starting at 10:00pm for our class at Miami Beach Club, 175 San Pedro Square San Jose, CA 95109 Whoever who is interested, please let me know so we can organize some car-pooling. -Juan PS: you can call me at 408.123.4567 if needed. • The address of the party! But the email itself does not contain the word “address”!
Information Extraction • Distill structured data from unstructured and semi-structured text • E.g. extracting phone numbers from emails, extracting person names from the web • Exploit the extracted data in your applications • E.g. for search, for advertisement • Example email: Hi guys, We are planning a salsa party tonight starting at 10:00pm for our salsa class at Miami Beach Club, 175 San Pedro Square San Jose, CA 95109 Whoever who is interested, please let me know so we can organize some car-pooling. -Juan PS: you can call me at 408.123.4567 if needed. • Extracted EVENTS row: Event = salsa party, Address = 175 San Pedro Square … • Query over the extracted data: Select Address From EVENTS Where event = 'salsa party' • Result: 175 San Pedro Square …
Revisit: Where is the Party? salsa address Lotus Notes 8.01 Live Text San Jose, CA 95109
And many others • Literature Citations/ Research Communities • DBLife • Google Scholar • Terminology Extraction • Document Summarization • Life Science • E.g. Gene Sequence Extraction, Protein Interaction Extraction … … As the amount of data in text explodes, information extraction is becoming increasingly important!
Basic Terminology • Annotator: program used to extract structured data • Annotations: structured data extracted by annotators • (Figure: documents flow into several Annotators; the annotations they produce go into a Data Repository that serves Higher Level Applications)
Background: Avatar • Working on information extraction (IE) since 2003 • Main goals: • Extract structured information from text • Build a system that can scale IE to real enterprise apps • Build new enterprise applications that leverage IE
Evolution of the Avatar IE System (timeline, 2004 to 2008) • Custom Code • Evolutionary trigger: Large number of annotators • RAP (CPSL-style cascading grammar system) • Evolutionary trigger: Diverse data sets, Complex extraction tasks • RAP++ (RAP + Extensions outside the scope of grammars) • Evolutionary trigger: Performance, Expressivity • System T (algebraic information extraction system)
The Custom Code Era
Extracting Information with Custom Code • “It’s just pattern matching” • Use scripts and regular expressions • Then reality sets in… • Dozens of rules, even for simple concepts • Many special cases • Convoluted logic • Painfully slow code
The Age of Cascading Grammars
Historical Perspective • MUC (Message Understanding Conference) – 1987 to 1997 • Competition-style conferences organized by DARPA • Shared data sets and performance metrics • News articles, Radio transcripts, Military telegraphic messages • Classical IE Tasks • Entity and Relationship/Link extraction • Event detection, sentiment mining etc. • Entity resolution/matching • Several IE systems were built • FRUMP [DeJong82], CIRCUS /AutoSlog [Riloff93], FASTUS [Appelt96], LaSIE/GATE, TextPro, PROTEUS, OSMX [Embley05]
Cascading Finite-state Grammars • Most IE systems share a common formalism • Input text viewed as a sequence of tokens • Rules expressed as regular expression patterns over the lexical features of these tokens • Several levels of processing Cascading Grammars • CPSL • A standard language for specifying cascading grammars • Created in 1998 • Several known implementations • TextPro: reference implementation of CPSL by Doug Appelt • JAPE (Java Annotation Pattern Engine) • Part of the GATE NLP framework • Under active consideration for commercial use by several companies
Cascading Grammars By Example • Level 0: Tokenize • Level 1: Token[~ “John | Smith| …”]+ → Name; Token[~ “[1-9]\d{2}-\d{4}”] → Phone • Level 2: Name Token[~ “at”] Phone → PersonPhone • Example text at each level: …John Smith at 555-1212… → …<Name> at <Phone>… → …<PersonPhone>…
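The three levels on this slide can be sketched in Python. This is only an illustration of the cascading idea, not CPSL or the RAP implementation: the whitespace tokenizer, the `NAME_WORDS` set (standing in for “John | Smith| …”), and all function names are assumptions made for the example.

```python
import re

NAME_WORDS = {"John", "Smith"}               # hypothetical stand-in for "John | Smith | ..."
PHONE_RE = re.compile(r"[1-9]\d{2}-\d{4}")   # pattern from the Level 1 Phone rule

def level0(text):
    """Level 0: tokenize (splitting on whitespace is an assumption)."""
    return [("Token", t) for t in text.split()]

def level1(stream):
    """Level 1: a run of name tokens becomes Name, a phone-like token becomes Phone."""
    out, i = [], 0
    while i < len(stream):
        j = i
        while j < len(stream) and stream[j][1] in NAME_WORDS:
            j += 1
        if j > i:
            out.append(("Name", " ".join(t for _, t in stream[i:j])))
            i = j
        elif PHONE_RE.fullmatch(stream[i][1]):
            out.append(("Phone", stream[i][1]))
            i += 1
        else:
            out.append(stream[i])
            i += 1
    return out

def level2(stream):
    """Level 2: the sequence Name, Token "at", Phone becomes PersonPhone."""
    out, i = [], 0
    while i < len(stream):
        window = stream[i:i + 3]
        if (len(window) == 3 and window[0][0] == "Name"
                and window[1] == ("Token", "at") and window[2][0] == "Phone"):
            out.append(("PersonPhone", " ".join(t for _, t in window)))
            i += 3
        else:
            out.append(stream[i])
            i += 1
    return out

def cascade(text):
    """Run the levels in order; each level rescans the whole stream."""
    return level2(level1(level0(text)))
```

Calling `cascade("sed John Smith at 555-1212 hendrerit")` leaves the surrounding tokens untouched and replaces the middle four with a single PersonPhone annotation. Note that every level walks the full stream again.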
Experiences with Cascading Grammars • Benefits • Big step forward from custom code • Can express many simple concepts • Drawbacks • Expressiveness • Dealing with overlap • Building complex structures • Performance
Sequencing Overlapping Input Annotations • Example rule from the Band Review: <ProperNoun> <within 30 characters> <Instrument> • ProperNoun: Regular Expression Match <[A-Z]\w+(\s[A-Z]\w+)?> • Instrument: Dictionary Match <d1|d2|…dn> • Example text with overlapping ProperNoun and Instrument annotations: “Marco Doe on the Hammond organ”, “John Pipe plays the guitar”
Sequencing Overlapping Input Annotations • In “John Pipe plays the guitar”, ProperNoun and Instrument annotations overlap. Which of the two should we pick? • Possible options • Pre-specified disambiguation rules (e.g., pick earlier annotation) • Supply tie-breaking rules for every possible overlap scenario • Let implementation make an internal non-deterministic choice (as in JAPE, RAP, ..) • Possible sequencings of “John Pipe plays the guitar”: ProperNoun Token Token Instrument (prefer ProperNoun over Instrument), or Token Instrument Token Token Instrument • Possible sequencings of “Marco Doe on the Hammond organ”: ProperNoun Token Token Instrument, or ProperNoun Token Token ProperNoun Token • Over 4.5M blog entries, a choice one way or another on a single rule would change the number of annotations by +/- 25%. • There is no magic!
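A small Python sketch of why the tie-breaking choice matters. It applies one pre-specified disambiguation policy (keep the preferred type on overlap) to the annotations of “John Pipe plays the guitar” and then counts matches of the Band Review rule. The tuple layout `(begin, end, type)` with character offsets and the function names are assumptions for illustration only.

```python
def overlaps(a, b):
    """True when two (begin, end, type) annotations share at least one character."""
    return a[0] < b[1] and b[0] < a[1]

def resolve(annotations, preferred):
    """Keep a non-overlapping subset; on overlap, the preferred type wins."""
    ranked = sorted(annotations, key=lambda a: (a[2] != preferred, a[0]))
    kept = []
    for ann in ranked:
        if not any(overlaps(ann, k) for k in kept):
            kept.append(ann)
    return sorted(kept)

def count_rule_matches(annotations, max_gap=30):
    """Count <ProperNoun> <within max_gap characters> <Instrument> pairs."""
    return sum(
        1
        for p in annotations if p[2] == "ProperNoun"
        for i in annotations if i[2] == "Instrument" and 0 <= i[0] - p[1] <= max_gap
    )

# "John Pipe plays the guitar": "John Pipe" is a ProperNoun, "Pipe" and "guitar" are Instruments.
ANNOTATIONS = [(0, 9, "ProperNoun"), (5, 9, "Instrument"), (20, 26, "Instrument")]
```

With `resolve(ANNOTATIONS, "ProperNoun")` the rule matches once; with `resolve(ANNOTATIONS, "Instrument")` the ProperNoun is discarded and the rule does not match at all, which is the kind of swing the slide describes.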
Complex Structures Example: Signature Annotator • Built from Person, Organization, Phone, and URL annotations • Start with Person • Within 50 tokens • At least 1 Phone • At least 2 of {Phone, Organization, URL} • End with one of these • Example: Laura Haas, PhD Distinguished Engineer and Director, Computer Science Almaden Research Center 408-927-1700 http://www.almaden.ibm.com/cs
Complex Structures: Existing Solutions • Approximate using regular expressions • Example: Signature • Rule: (Person Token{,25} Phone (Token{,25} Contact)+)| (Person (Token{,25} Contact)+ Token{,25} Phone (Token{,25} Contact)*) • Problems: • Need to enumerate all possible orders of sub-annotations • What if you want at least one phone and one email? • Does not restrict total token count
Performance • Level 0: Tokenize • Level 1: Token[~ “John | Smith| …”]+ → Name; Token[~ “[1-9]\d{2}-\d{4}”] → Phone • Level 2: Name Token[~ “at”] Phone → PersonPhone • Each level in a cascading grammar looks at each character in each document
Dawn of Declarative Information Extraction
System-T Architecture • AQL Language: Specify annotator semantics declaratively • Optimizer: Choose an efficient execution plan that implements semantics • Annotation Algebra • Operator Runtime
Declarative Information Extraction: AQL • SQL-like language for defining annotators • Declarative • Define basic patterns and the relationships between them • Let the system worry about the order of operations
AQL Example • Rule: <ProperNoun> <within 30 characters> <Instrument>, where ProperNoun is a Regular Expression Match and Instrument is a Dictionary Match • select CombineSpans(name.match, instrument.match) as annot from Regex(/[A-Z]\w+(\s[A-Z]\w+)?/, DocScan.text) name, Dictionary("instr.dict", DocScan.text) instrument where Follows(0, 30, name.match, instrument.match);
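To make the semantics of the query concrete, here is a rough Python rendering of what it computes for one document. It is a sketch and not System T code: the `INSTRUMENTS` list is a hypothetical stand-in for the contents of instr.dict, dictionary matching is assumed to be case-insensitive on word boundaries, and spans are plain `(begin, end)` character offsets.

```python
import re
from itertools import product

PROPER_NOUN = re.compile(r"[A-Z]\w+(\s[A-Z]\w+)?")   # regex from the AQL example
INSTRUMENTS = ["pipe", "guitar", "hammond organ"]     # hypothetical contents of instr.dict

def regex_spans(text):
    """Regex(...): every match of the ProperNoun pattern."""
    return [m.span() for m in PROPER_NOUN.finditer(text)]

def dictionary_spans(text):
    """Dictionary(...): every occurrence of a dictionary entry (case-insensitive, assumed)."""
    lowered = text.lower()
    spans = []
    for entry in INSTRUMENTS:
        for m in re.finditer(r"\b" + re.escape(entry) + r"\b", lowered):
            spans.append(m.span())
    return sorted(spans)

def follows(min_gap, max_gap, first, second):
    """Follows(min, max, a, b): b begins between min and max characters after a ends."""
    return min_gap <= second[0] - first[1] <= max_gap

def combine_spans(first, second):
    """CombineSpans(a, b): the span from the start of a to the end of b."""
    return (first[0], second[1])

def band_review(text):
    """Join the two match sets on the Follows predicate and return the combined text."""
    results = []
    for name, instrument in product(regex_spans(text), dictionary_spans(text)):
        if follows(0, 30, name, instrument):
            begin, end = combine_spans(name, instrument)
            results.append(text[begin:end])
    return results
```

Nothing in the query says which match set to compute first or how to pair them up; the nested loop here is just one possible execution order, which is the freedom a declarative specification leaves to the system.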
Annotation Algebra • Each Operator in the algebra… • …operates on one or more tuples of annotations • …produces tuples of annotations • “Document at a time” execution model • Algebra expression is defined over • the current document d • annotations defined over d • Algebra expression is evaluated over each document in the corpus individually
Basic Single-Argument Operator • Input: an Input Tuple (Document) and the Operator Parameters • Output: Output Tuple 1 (Document, Annotation 1), Output Tuple 2 (Document, Annotation 2)
Comparison with Cascading Grammars • Grammar: Apply Name Rule and Apply Phone Rule (…John Smith at 555-1212… → …<Name> at <Phone>…), then Apply PersonPhone (→ …<PersonPhone>…) • Algebra: Dictionary (John Smith) and Regex (555-1212) feed Block and Join operators that produce John Smith at 555-1212 • Fewer passes over the documents
Revisit Problem of Sequencing Annotations • John Pipe plays the guitar • Marco Benevento on the Hammond organ • (Figure: overlapping ProperNoun and Instrument annotations on both sentences)
Algebra expression for the Rule from Band Review (Reiss, Raghavan, Krishnamurthy, Zhu and Vaithyanathan, ICDE 2008) • Rule: <ProperNoun> <within 30 characters> <Instrument> • ProperNoun: Regular Expression Match • Instrument: Dictionary Match • Algebra expression: Join (followed within 30 characters) over Regular expression (ProperNoun) and Dictionary (Instrument)
Input text: John Pipe plays the guitar; Marco Benevento on the Hammond organ • Regex output tuples (doc, ProperNoun): John Pipe; Marco Benevento; Hammond • Dictionary output tuples (doc, Instrument): Pipe; guitar; Hammond organ • Join (ProperNoun <0-30 chars> Instrument) output tuples (doc, ProperNoun, Instrument): (John Pipe, guitar); (Marco Benevento, Hammond organ)
How is aggregation handled? • Signature built from Person, Organization, Phone, and URL annotations • Start with Person • Within 50 tokens • At least 1 Phone • At least 2 of {Phone, Organization, URL} • End with one of these • Example: Laura Haas, PhD Distinguished Engineer and Director, Computer Science Almaden Research Center 408-927-1700 http://www.almaden.ibm.com/cs
Back to signature • (Figure: a Union of the Organization, Phone, and URL annotations feeds a Block operator; a Join combines Person with the resulting block to produce Signature) • Cleaner and potentially faster
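The signature constraints can be checked directly over sets of annotations rather than by enumerating orderings in a regular expression. The Python sketch below is one reading of those constraints and is not the System T plan: annotations are assumed to be `(type, begin_token, end_token)` tuples with an exclusive end, “at least 2 of {Phone, Organization, URL}” is interpreted as at least two contact annotations, and `find_signatures` is a hypothetical name.

```python
CONTACT_TYPES = {"Phone", "Organization", "URL"}

def find_signatures(annotations, max_tokens=50):
    """Return (begin, end) token spans of signature blocks.

    annotations: list of (type, begin_token, end_token), end exclusive.
    """
    signatures = []
    for kind, begin, end in annotations:
        if kind != "Person":               # a signature starts with a Person
            continue
        limit = begin + max_tokens         # the whole block must fit in max_tokens
        contacts = [a for a in annotations
                    if a[0] in CONTACT_TYPES and a[1] >= end and a[2] <= limit]
        has_phone = any(a[0] == "Phone" for a in contacts)
        if has_phone and len(contacts) >= 2:
            # The block ends with the last contact annotation.
            signatures.append((begin, max(a[2] for a in contacts)))
    return signatures
```

The order in which Organization, Phone, and URL appear no longer matters, and the total token count is restricted, which addresses two of the problems listed for the regular expression approximation.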
Performance • Performance issues with grammars • Complete pass through tokens for each rule • Many of these passes are wasted work • Dominant approach: Make each pass go faster • Doesn’t solve root problem! • Algebraic approach: Build a query optimizer!
Optimizations • Query optimization is a familiar topic in databases • What’s different in text? • Operations over sequences and texts • Document boundaries • Costs concentrated in extraction operators (dictionary, regular expression) • Can leverage these characteristics • Text-specific optimizations • Significant performance improvements
Optimization Example • Rule: <ProperNoun> <within 30 characters> <Instrument> • Example text: … John Pipe played the guitar. … • Regex match: John Pipe • Dictionary match: guitar • 0-30 characters between the two matches
Classic Query Optimization • Plan A: Join (Followed within 30 characters) of <ProperNoun> and <Instrument> • Plan B: Consider text to the right of each <ProperNoun>; Find <Instrument> within 30 characters • Plan C: Consider text to the left of each <Instrument>; Find <ProperNoun> within 30 characters
Example of Text-Specific Optimization: Conditional Evaluation (CE) • Leverage document-at-a-time processing • Don’t evaluate the inner operand of a join if the outer has no results • Costing plans is challenging • Example: for …John Smith at 555-1212…, a CEJoin takes Dictionary matches (John Smith) as the outer and Regex matches (555-1212) as the inner. Don’t evaluate this Regex when there are no dictionary matches.
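Conditional evaluation can be sketched in a few lines of Python. This is an illustration of the idea only: the `NAMES` list, the 30-character join predicate, and the `stats` counter (used to show that the regex is skipped) are assumptions, not part of System T.

```python
import re

NAMES = ["John Smith"]                        # hypothetical name dictionary
PHONE_RE = re.compile(r"[1-9]\d{2}-\d{4}")   # phone pattern from the earlier slides

def dictionary_matches(text):
    """Outer operand: spans of dictionary entries found in the document."""
    return [m.span() for name in NAMES for m in re.finditer(re.escape(name), text)]

def ce_join(text, stats):
    """Join names with phones, evaluating the regex only when names were found."""
    names = dictionary_matches(text)
    if not names:
        return []                             # outer is empty: skip the inner operand
    stats["regex_runs"] += 1
    phones = [m.span() for m in PHONE_RE.finditer(text)]
    # Assumed predicate: the phone starts within 30 characters after the name ends.
    return [(n, p) for n in names for p in phones if 0 <= p[0] - n[1] <= 30]

def run_corpus(documents):
    """Document-at-a-time execution: each document is processed on its own."""
    stats = {"regex_runs": 0}
    results = [ce_join(doc, stats) for doc in documents]
    return results, stats
```

Because each document is processed independently, a document without any dictionary match costs one dictionary scan and nothing more. How much this saves depends on how often the outer operand is empty, which is part of why costing such plans is hard.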
Experimental Results (Band Review Annotator) • (Chart comparing Classical query optimization with Text-specific optimizations)
IOPES: Extracting Relationships and Composite Entities • IOPES = IBM OmniFind Personal Email Search • Extract entities such as email addresses and URLs • Associations such as name ↔ phone number • Complex entities like conference schedules, directions, signature blocks
Thank you! • For more information… • Try out IOPES • http://www.alphaworks.ibm.com/tech/emailsearch • Avatar Project home page • http://almaden.ibm.com/cs/projects/avatar/ • Contact me • yunyaoli@us.ibm.com
Backup Slides
Block Operator (β) • (Figure: several Input annotations in a passage of text are grouped into one Block) • Constraint on distance between inputs • Constraint on number of inputs
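A minimal Python sketch of a block-style grouping with the two constraints named on the slide. It assumes non-overlapping input spans given as `(begin, end)` offsets and returns only maximal runs; the exact semantics of the Block operator in System T may differ, and the parameter names are hypothetical.

```python
def block(spans, max_distance, min_count):
    """Group spans into maximal runs of neighbors at most max_distance apart.

    Returns one (begin, end) span per run that contains at least min_count inputs.
    """
    blocks, run = [], []
    for span in sorted(spans):
        # Constraint on distance between inputs: a large gap closes the current run.
        if run and span[0] - run[-1][1] > max_distance:
            if len(run) >= min_count:
                blocks.append((run[0][0], run[-1][1]))
            run = []
        run.append(span)
    # Constraint on number of inputs: short runs are dropped.
    if run and len(run) >= min_count:
        blocks.append((run[0][0], run[-1][1]))
    return blocks
```

For example, with inputs at `(0, 5)`, `(8, 12)`, `(15, 20)`, and `(100, 105)`, a maximum distance of 5 and a minimum count of 3 produce a single block covering the first three inputs.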
Conditional Evaluation (CE) • Leverage document-at-a-time processing • Don’t evaluate the inner operand of a join if the outer has no results • Costing plans is challenging • Example: for …John Smith at 555-1212…, a CEJoin takes Dictionary matches (John Smith) as the outer and Regex matches (555-1212) as the inner. Don’t evaluate this Regex when there are no dictionary matches.
Restricted Span Evaluation (RSE) • Leverage the sequential nature of text • Only evaluate the inner on the relevant portions of the document • Limited applicability (compared with CE) • Only certain operands and predicates • Example: for …John Smith at 555-1212…, an RSEJoin takes Regex matches (555-1212) as the outer and Dictionary matches (John Smith) as the inner. Only look for dictionary matches in the vicinity of a phone number.
Implementing Restricted Span Evaluation (RSE) • RSE join operator • RSE extraction operator • Pass join bindings down to the inner of a join • Requires special physical operators at edges of plan • (Figure: for the join of R1 with Dict(D, s2) on predicate p(s1, s2), each s1 binding is passed to an RSEDict operator over dictionary D, the RSE Dictionary Operator, which returns the s2’s that satisfy p(binding, s2))
RSE Dictionary Operator • To find dictionary matches that end in this range… • …need to examine this range, which extends further back by the length of the longest dictionary entry • RSE version of an operator must produce the exact same answer • Ongoing work: RSE Regular Expression operator
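The window computation on this slide can be sketched as follows. This is an assumed, simplified version: the `DICTIONARY` entries are hypothetical, matching is exact substring search, and “ends in this range” is taken to mean that the match end offset e satisfies lo < e <= hi.

```python
DICTIONARY = ["John Smith", "Jane Doe"]       # hypothetical entries
LONGEST = max(len(entry) for entry in DICTIONARY)

def full_dictionary_scan(text):
    """Ordinary dictionary operator: scan the entire text."""
    spans = []
    for entry in DICTIONARY:
        start = text.find(entry)
        while start != -1:
            spans.append((start, start + len(entry)))
            start = text.find(entry, start + 1)
    return sorted(spans)

def rse_dictionary(text, lo, hi):
    """Matches that end in (lo, hi], found by scanning only a window of the text."""
    # A match ending just after lo can begin up to LONGEST characters earlier.
    window_start = max(0, lo - LONGEST)
    window = text[window_start:hi]
    shifted = [(s + window_start, e + window_start)
               for s, e in full_dictionary_scan(window)]
    return [(s, e) for s, e in shifted if lo < e <= hi]
```

The requirement that the RSE version produce the exact same answer corresponds to `rse_dictionary(text, lo, hi)` returning the same spans as filtering the output of `full_dictionary_scan(text)` by the same range, while touching far less text when the range is small.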
Closely related work • Regular Expressions and Custom Code • Cascading Grammars: CPSL, AFST • Workflows: UIMA, GATE • System T • DBLife (Shen, Doan, Naughton, Ramakrishnan, VLDB 2007), in the context of Project Cimple. Search for “cimple wisc”
Delving deeper into System T versus DBLife • System T: Conditional Evaluation, Restricted Span Evaluation, Shared Dictionary Matching • DBLife: Pushing Down Text Properties, Scoping Extractions, Pattern Matching