530 likes | 691 Views
Towards Declarative Information Extraction The Almaden Story Shivakumar Vaithyanathan IBM Almaden Acknowledgements to: Frederick Reiss, Rajasekar Krishnamurthy, Sriram Raghavan and HuaiyuZhu. Lots of Text, Many Applications!. Free-text, semi-structured, streaming …
E N D
Towards Declarative Information ExtractionThe Almaden StoryShivakumar VaithyanathanIBM AlmadenAcknowledgements to:Frederick Reiss, Rajasekar Krishnamurthy, Sriram Raghavan and HuaiyuZhu
Lots of Text, Many Applications! • Free-text, semi-structured, streaming … • Web pages, emails, news articles, call-center records, business reports, spreadsheets, research papers, blogs, wikis, tags, instant messages, … • High-impact applications • Business intelligence, personal information management, enterprise search, Web communities, Web search and advertising, scientific data management, e-government, medical records management, … • Growing rapidly • Just look at your inbox! (Adapted from SIGMOD ’06 tutorialby Ramakrishnan, Doan, and Vaithyanathan)
Select Name From PEOPLE Where Organization = ‘Microsoft’ Name Title Organization Bill GatesCEOMicrosoft Bill VeghteVPMicrosoft Richard StallmanFounderFree Soft.. Bill Gates Bill Veghte Information Extraction • Exploit the extracted data in your applications • Distill structured data from unstructured and semi-structured text For years, Microsoft CorporationCEOBill Gates was against open source. But today he appears to have changed his mind. "We can be open source. We love the concept of shared source," said Bill Veghte, a MicrosoftVP. "That's a super-important shift for us in terms of code access.“ Richard Stallman, founder of the Free Software Foundation, countered saying… (from Cohen’s IE tutorial, 2003)
Historical Perspective • MUC (Message Understanding Conference) – 1987 to 1997 • Competition-style conferences organized by DARPA • Shared data sets and performance metrics • News articles, Radio transcripts, Military telegraphic messages • Classical IE Tasks • Entity and Relationship/Link extraction • Event detection, sentiment mining etc. • Entity resolution/matching • Several IE systems were built during this period • FRUMP [DeJong82], CIRCUS /AutoSlog [Riloff93], FASTUS [Appelt96], LaSIE/GATE, TextPro, PROTEUS, OSMX [Embley05] Not Focus of this talk
Project Avatar -Evolution of IE: The Almaden Story Evolutionary triggers 2004 Custom Code Large number of annotators RAP(CPSL-style cascading grammar system) 2005 Diverse data sets, Complex extraction tasks RAP++(RAP + Extensions outside the scope of grammars) 2006 Performance, Expressivity System T(algebraic information extraction system) 2007
From: Michael D. Baselice <mikeb@baselice.com> …. Enron is in search of support with the electric deregulation fight out there. Dick Woodward or Dave Fogartycan be reached at650-340-0470. If you have any questions, please call. .… From: Tom Briggs <tom.briggs@enron.com> call Jeff Dasovichat415-782-7822. He is an excellent source and has a long, sordid histroy in california's Energy market. Circa 2004: Custom Code Extract Person’s Phone Number <Person>“(\s+at\s+)|((\w+\s+){1,3}reached\s+at\s+)”<PhoneNumber> Example emails from the Enron collection
Circa 2005: Moving to RAP Custom Code Large number of annotators RAP(CPSL-style cascading grammar system) RAP(CPSL-style cascading grammar system) As the number of annotators increased, custom code was not an option. We needed a cleaner abstraction RAP++(RAP + Extensions outside the scope of grammars) CPSL: Nice abstraction for specifying rules over annotations and text. RAP: Almaden CPSL execution engine System T(algebraic information extraction system)
Cascading Grammars • Multiple levels of grammar rules to perform extraction • At each level, rules are written in the formalism of finite-state grammars • Input text viewed as a sequence of tokens • Rules expressed as regular expression patterns over the lexical features of these tokens • Common Pattern Specification Language (Appelt & Onyshkevych, 1998) • A language specification to abstract rules and matching semantics from specific implementations • Several known implementations • TextPro: reference implementation of CPSL by Doug Appelt • JAPE (Java Annotation Pattern Engine) • Part of the GATE NLP framework • Under active consideration for commercial use by several companies
Cascading Grammar Reality Pre-processing step: Tokenization of the document text • Set of simple grammar rules for person name recognition Tokenize(Document Text) Sequence of <Token> Level 1: Rules that look for patterns in each token to produce corresponding annotations Token[~ “Mr. | Mrs. | Dr. | …”] Salutation Token[~ “Ph.D | MBA | …”] Qualification Token[~ “[A-Z][a-z]*”] CapsWord Token[~ “Michael | Richard | Smith| …”] PersonDict Level 2: Rules that look for patterns involving Level-1 annotations to identify Persons PersonDict PersonDict Person Salutation CapsWord CapsWord Person CapsWord CapsWord Token[~“,”]? Qualification Person Richard Smith Dr. Laura Haas Laura Haas, Ph.D
Applying RAP to more complex tasks Custom Code Complex tasks • Extracting signature from emails RAP(CPSL-style cascading grammar system) RAP++(RAP + Extensions outside the scope of grammars) System T(algebraic information extraction system)
RAP : Problems encountered • Lack of Support for complex operations • Aggregation, Dictionary evaluation, Combining annotations with character-level regular expressions • Handling overlapping output annotations • Multiple rules for a concept may generate overlapping annotations • Sometimes, final results may overlap • Overlapping input annotations : • Input to a grammar is a linear sequence of annotations • Performance • Annotators for complex tasks are very expensive to execute
Person Organization Phone URL Within 250 characters At least 1 Phone Start with Person Person Organization Phone URL At least 2 of {Phone, Organization, URL, Email, Address} End with one of these. Complex operations Laura Haas, PhD Distinguished Engineer and Director, Computer Science Almaden Research Center 408-927-1700 http://www.almaden.ibm.com/cs
Approximate Signature Extraction using a Grammar Macro: Contact = Phone|Organization|URL| Email|Address Rule: Person (.{,125} Contact){2,?} Signature Problems: • Cannot guarantee at least 1 Phone. • Leads to false positives • Cannot restrict limit on total character count of 250. • Leads to both false positives and false negatives. Expressing aggregations is hard in grammars
Circa 2006: Moved to RAP++ Custom Code RAP(CPSL-style cascading grammar system) Modules for aggregation (e.g., Count) RAP++(RAP + Extensions outside the scope of grammars) System T(algebraic information extraction system)
Grammar-based systems : Problems encountered • Lack of Support for complex operations • Aggregation, Dictionary evaluation, Combining annotations with character-level regular expressions • Handling overlapping output annotations • All overlapping annotations need to be retained • Overlapping annotations need to be consolidated • Overlapping input annotations : • Input to a CPSL grammar is a linear sequence of annotations • Performance • Annotators for complex tasks are expensive to execute • Takes 9 hours to execute Band Review annotator over 4.5 million blogs
Issue 2 – All overlapping annotations need to be retained • Valid results may overlap Dick Woodward or Dave Fogartycan be reached at650-340-0470. • Work-around: Multiple rules The number of rules can explode !
Circa 2006: Moved to RAP++ Custom Code RAP(CPSL-style cascading grammar system) Modules for aggregation (e.g., Count) RAP++(RAP + Extensions outside the scope of grammars) Workarounds to retain all overlapping annotations System T(algebraic information extraction system)
Overlapping annotations may need to be consolidated Salutation PersonDict … Dr. John Smith … PersonDict PersonDict Two spans need to be merged Consolidation cannot be expressed in Grammar
Circa 2006: Moved to RAP++ Custom Code RAP(CPSL-style cascading grammar system) Modules for aggregation (e.g., Count) RAP++(RAP + Extensions outside the scope of grammars) Consolidation modules Workarounds to generate overlapping output System T(algebraic information extraction system)
Apply RAP++ to diverse data-sets • Two severe problems surfaced • Due to complex workflows results largely dictated by low-level decisions • Performance
Weblogs: Identify Bands and Reviews …….I went to see the OTIS concert last night. T’ was SO MUCH FUN I really had a blast … ….there were a bunch of other bands …. I loved STAB (….). they were a really weird ska band and people were running around and …
review pattern review pattern review pattern review pattern review pattern Concert Concert instance pattern Block Review
Outline of the BandReview Extraction Task went to the Switchfoot concert at the Roxy. It was pretty fun,… The lead singer/guitarist was really good, and even though there was another guitarist (an Asian guy), he ended up playing most of the guitar parts, which was really impressive. The biggest surprise though is that I actually liked the opening bands. …I especially liked the first band BandReview “Lead singer/guitarist was really good, and even … I actually liked the opening bands. … Well they were none of those. I especially liked the first band” ReviewGroup ConcertInstance “went to the Switchfoot concert at the Roxy” “went to AJCO n Band concert” “performance by local funk band Saaraba” “lead singer/guitarist was really good” “Liked the opening bands” “Liked the first band” “Kurt Ralske played guitar” ReviewInstance
Example low-level decision Instrument ProperNoun Marco Benevento on the Hammond organ John Pipe plays the guitar Instrument ProperNoun Instrument ProperNoun <ProperNoun> <within 30 characters> <Instrument> Regular Expression Dictionary Match Match <d1|d2|…dn> <[A-Z]\w+(\s[A-Z]\w+)?> Example rule from the Band Review
Sequencing Overlapping Annotations John Pipe plays the guitar ProperNounToken Token Instrument Instrument • Possible options • Pre-specified disambiguation rules (e.g., pick earlier annotation) • Supply tie-breaking rules for every possible overlap scenario • Let implementation make an internal non-deterministic choice (as in JAPE, RAP, ..) John Pipe plays the guitar John Pipe plays the guitar TokenInstrument TokenToken Instrument Over 4.5M blog entries a choice one way or another on a single rule would change the number of annotations by +/- 25%. ProperNoun Instrument Which of the two should we pick?
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Proin elementum neque at justo. Aliquam erat volutpat. Curabitur a massa. Vivamus luctus, risus in sagittis facilisis, <Name> at <Phone> arcu augue rutrum velit, hendrerit faucibus pede mi sed ipsum. Curabitur cursus tincidunt orci. Pellentesque justo tellus , scelerisque quis, facilisis quis, interdum non, ante. Suspendisse feugiat, erat in feugiat tincidunt, est nunc volutpat enim, quis viverra lacus nulla sit amet lectus. Nulla odio lorem, feugiat et, volutpat dapibus, ultrices sit amet, sem. Vestibulum quis dui vitae massa euismod faucibus. Pellentesque id neque id tellus hendrerit tincidunt. Etiam augue. Class aptent taciti Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Proin elementum neque at justo. Aliquam erat volutpat. Curabitur a massa. Vivamus luctus, risus in sagittis facilisis, <Name> at 555-1212 arcu augue rutrum velit, hendrerit faucibus pede mi sed ipsum. Curabitur cursus tincidunt orci. Pellentesque justo tellus , scelerisque quis, facilisis quis, interdum non, ante. Suspendisse feugiat, erat in feugiat tincidunt, est nunc volutpat enim, quis viverra lacus nulla sit amet lectus. Nulla odio lorem, feugiat et, volutpat dapibus, ultrices sit amet, sem. Vestibulum quis dui vitae massa euismod faucibus. Pellentesque id neque id tellus hendrerit tincidunt. Etiam augue. Class aptent taciti Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Proin elementum neque at justo. Aliquam erat volutpat. Curabitur a massa. Vivamus luctus, risus in sagittis facilisis, <PersonPhone> arcu augue rutrum velit, hendrerit faucibus pede mi sed ipsum. Curabitur cursus tincidunt orci. Pellentesque justo tellus , scelerisque quis, facilisis quis, interdum non, ante. Suspendisse feugiat, erat in feugiat tincidunt, est nunc volutpat enim, quis viverra lacus nulla sit amet lectus. Nulla odio lorem, feugiat et, volutpat dapibus, ultrices sit amet, sem. Vestibulum quis dui vitae massa euismod faucibus. Pellentesque id neque id tellus hendrerit tincidunt. Etiam augue. Class aptent taciti Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Proin elementum neque at justo. Aliquam erat volutpat. Curabitur a massa. Vivamus luctus, risus in sagittis facilisis, John Smith at 555-1212 arcu augue rutrum velit, hendrerit faucibus pede mi sed ipsum. Curabitur cursus tincidunt orci. Pellentesque justo tellus , scelerisque quis, facilisis quis, interdum non, ante. Suspendisse feugiat, erat in feugiat tincidunt, est nunc volutpat enim, quis viverra lacus nulla sit amet lectus. Nulla odio lorem, feugiat et, volutpat dapibus, ultrices sit amet, sem. Vestibulum quis dui vitae massa euismod faucibus. Pellentesque id neque id tellus hendrerit tincidunt. Etiam augue. Class aptent taciti sociosqu ad litora Performance Name Token[~ “at”] Phone PersonPhone Token[~ “[1-9]\d{2}-\d{4}”] Phone Token[~ “John | Smith| …”]+ Name Each level in a cascading grammar looks at each character in each document
2007: System T Custom Code RAP(CPSL-style cascading grammar system) RAP++(RAP + Extensions outside the scope of grammars) Scalable algebraic information extraction system System T(algebraic information extraction system)
Brief introduction to algebra for Intra-document IE • Each Operator in the algebra… • …operates on one or more tuples of annotations • …produces tuples of annotations • “Document at a time” execution model • Algebra expression is defined over • the current document d • annotations defined over d • Algebra expression is evaluated over each document in the corpus individually
Basic Single-Argument Operator Document Annotation 1 Output Tuple 1 Document Annotation 2 Output Tuple 2 Operator Parameters Document Input Tuple
Algebra expression for the Rule from Band Review(Reiss, Raghavan, Krishnamurthy, Zhu and Vaithyanathan, ICDE 2008) <ProperNoun> <within 30 characters> <Instrument> Regular Expression Dictionary Match Match Join (followed within 30 characters) Regular expression Dictionary ProperNoun Instrument
Instrument ProperNoun John Pipe plays the guitar Marco Benevento on the Hammond organ ProperNoun Instrument Instrument ProperNoun Revisit Problem of Sequencing Annotations
doc John Pipe guitar doc Marco Benevento Hammond organ doc John Pipe doc Marco Benevento doc Hammond Regex Dictionary Instrument ProperNoun John Pipe plays the guitar Marco Benevento on the Hammond organ ProperNoun Instrument Instrument ProperNoun ProperNoun <0-30 chars> Instrument Join ProperNoun Instrument doc Pipe doc guitar doc Hammond organ
Custom code of RAP++ are also addressed in the Algebra • Interaction between custom modules and grammar hard to maintain • In the algebra world they are addressed more cleanly • Aggregation • Consolidation
Person Organization Phone URL Within 250 characters At least 1 Phone Start with Person Person Organization Phone URL At least 2 of {Phone, Organization, URL, Email, Address} End with one of these. How is aggregation handled Laura Haas, PhD Distinguished Engineer and Director, Computer Science Almaden Research Center 408-927-1700 http://www.almaden.ibm.com/cs
Block Operator (b) Lorem ipsum dolor sit amet, consectetuer adipiscing elit. In augue mi, scelerisque non, dictum non, vestibulum congue, erat. Donec non felis. Maecenas urna nunc, pulvinar et, fringilla a, porta at, diam. In iaculis dignissim erat. Quisque pharetra. Suspendisse cursus viverra urna. Aliquam erat volutpat. Donec quis sapien et metus molestie eleifend. Maecenas sit amet metus eleifend nibh semper fringilla. Pellentesque habitant morbi tristique senectus et netus et malesuada Input Input Input Input
Block Operator (b) Lorem ipsum dolor sit amet, consectetuer adipiscing elit. In augue mi, scelerisque non, dictum non, vestibulum congue, erat. Donec non felis. Maecenas urna nunc, pulvinar et, fringilla a, porta at, diam. In iaculis dignissim erat. Quisque pharetra. Suspendisse cursus viverra urna. Aliquam erat volutpat. Donec quis sapien et metus molestie eleifend. Maecenas sit amet metus eleifend nibh semper fringilla. Pellentesque habitant morbi tristique senectus et netus et malesuada Constraint on distance between inputs Input Input Block Input Input Constraint on number of inputs
Back to signature Signature Person Organization Phone URL Join Organization Phone URL Phone Person URL Block Organization Union Person Org Phone URL Cleaner and potentially faster
Cost Modeling • Our experience: Extraction operators dominate • 90+ percent of execution time for most plans • Most of the costs that matter in databases don’t matter in information extraction • I/O costs • Join costs • Sorting costs • Aggregation costs
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Proin elementum neque at justo. Aliquam erat volutpat. Curabitur a massa. Vivamus luctus, risus in sagittis facilisis, <Name> at <Phone> arcu augue rutrum velit, hendrerit faucibus pede mi sed ipsum. Curabitur cursus tincidunt orci. Pellentesque justo tellus , scelerisque quis, facilisis quis, interdum non, ante. Suspendisse feugiat, erat in feugiat tincidunt, est nunc volutpat enim, quis viverra lacus nulla sit amet lectus. Nulla odio lorem, feugiat et, volutpat dapibus, ultrices sit amet, sem. Vestibulum quis dui vitae massa euismod faucibus. Pellentesque id neque id tellus hendrerit tincidunt. Etiam augue. Class aptent taciti Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Proin elementum neque at justo. Aliquam erat volutpat. Curabitur a massa. Vivamus luctus, risus in sagittis facilisis, <Name> at 555-1212 arcu augue rutrum velit, hendrerit faucibus pede mi sed ipsum. Curabitur cursus tincidunt orci. Pellentesque justo tellus , scelerisque quis, facilisis quis, interdum non, ante. Suspendisse feugiat, erat in feugiat tincidunt, est nunc volutpat enim, quis viverra lacus nulla sit amet lectus. Nulla odio lorem, feugiat et, volutpat dapibus, ultrices sit amet, sem. Vestibulum quis dui vitae massa euismod faucibus. Pellentesque id neque id tellus hendrerit tincidunt. Etiam augue. Class aptent taciti Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Proin elementum neque at justo. Aliquam erat volutpat. Curabitur a massa. Vivamus luctus, risus in sagittis facilisis, <PersonPhone> arcu augue rutrum velit, hendrerit faucibus pede mi sed ipsum. Curabitur cursus tincidunt orci. Pellentesque justo tellus , scelerisque quis, facilisis quis, interdum non, ante. Suspendisse feugiat, erat in feugiat tincidunt, est nunc volutpat enim, quis viverra lacus nulla sit amet lectus. Nulla odio lorem, feugiat et, volutpat dapibus, ultrices sit amet, sem. Vestibulum quis dui vitae massa euismod faucibus. Pellentesque id neque id tellus hendrerit tincidunt. Etiam augue. Class aptent taciti Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Proin elementum neque at justo. Aliquam erat volutpat. Curabitur a massa. Vivamus luctus, risus in sagittis facilisis, John Smith at 555-1212 arcu augue rutrum velit, hendrerit faucibus pede mi sed ipsum. Curabitur cursus tincidunt orci. Pellentesque justo tellus , scelerisque quis, facilisis quis, interdum non, ante. Suspendisse feugiat, erat in feugiat tincidunt, est nunc volutpat enim, quis viverra lacus nulla sit amet lectus. Nulla odio lorem, feugiat et, volutpat dapibus, ultrices sit amet, sem. Vestibulum quis dui vitae massa euismod faucibus. Pellentesque id neque id tellus hendrerit tincidunt. Etiam augue. Class aptent taciti sociosqu ad litora And finally performance Name Token[~ “at”] Phone PersonPhone Token[~ “[1-9]\d{2}-\d{4}”]+ Phone Token[~ “John | Smith| …”]+ Name
…John Smith at 555-1212… …<Name> at 555-1212… …<Name> <Name> at <Phone>… …<PersonPhone>… …John Smith at 555-1212… Why performance should improve John Smith at 555-1212 Apply PersonPhone Join 555-1212 Apply Phone Rule John Smith Dictionary Regex Apply Name Rule Smaller number of passes over the data RAP Algebra
More Optimization • The algebraic approach allows further performance improvements via query optimization • Common technique in relational databases • We extend the formalism to the text domain • Text-specific algebraic transformations, cost model, and selectivity estimation
Optimization Example <ProperNoun> <within 30 characters> <Instrument> Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Proin elentum non ante. John Pipeplayed theguitar. Aliquam erat volutpat. Curabitur a massa. Vivamus luctus, risus in sagittis facilisis, arcu augue rutrum ve Regex match Dictionary match 0-30 characters
Plan A (Followed within 30 characters) Join <ProperNoun> <Instrument> Find <ProperNoun> within 30 characters Find <Instrument> within 30 characters Consider text to the right Consider text to the left <ProperNoun> <Instrument> Plan B Plan C
Closely related work (Shen, Doan, Naughton, Ramakrishnan, VLDB 2007) Regular Expressions and Custom Code Cascading Grammars Workflows CPSL, AFST UIMA, GATE In the context of Project Cimple.Search for “cimple wisc” System T DBLife
Delving deeper into System T versus DBLife Conditional Evaluation Pushing Down Text Properties DBLife Restricted Span Evaluation Scoping Extractions System T Shared Dictionary Matching Pattern Matching
We have started: AQL • SQL-like language for defining annotators • Declarative • Define basic patterns and the relationships between them • Let the system worry about the order of operations
AQL Example <ProperNoun> <within 30 characters> <Instrument> Regular Expression Dictionary Match Match select CombineSpans(name.match, instrument.match) as annot, name.match as name, instrument.match as instr from Regex(/[A-Z]\w+(\s[A-Z]\w+)?/, DocScan.text) name, Dictionary(“instr.dict”, DocScan.text) instrument where Follows(0, 30, name.match, instrument.match);