
Presentation Transcript


  1. InstaRead: An IDE for IE — Short Presentation in CSE454 — Raphael Hoffmann, January 17th, 2013

  2. From Text To Knowledge
  "Citigroup has taken over EMI, the British music label of the Beatles and Radiohead, under a restructuring of its debt, EMI announced on Tuesday."
  [Diagram: knowledge graph extracted from the sentence — isA: band, company, label ⊑ organization ⊑ thing; instOf: Beatles → band, Citigroup → company, EMI → label; relation: acquired(Citigroup, EMI).]

  3. Human Effort Evaluation

     Corpus         #rels   #ann. words (train)   dev time
     ACE 2004       51      300K                  12 months
     MR IC 2010     15      115K                  12 months
     MR KBP 2011    16      118K                  12 months

  Each took a team of annotators working in parallel plus teams of researchers working in parallel. Motivates research on learning with less supervision, but the best-performing systems use more direct human input (e.g., rules).

  4. A Typical IE System
  Components: Sentence Splitter, Tokenizer, POS Tagger, Constituent Parser, Entity Extractor, Entity Linker, Coreference Resolver, Pattern Matcher, Relation Classifier, Relation Extractor, Document Classifier, Inference
  Resources: Domain Ontology, WordNet, Freebase, Domain Rules
  Time-intensive to build: Annotated Training Data, Parameters, Extraction Patterns, Domain Entity Databases, Feature Set
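
  To make the pipeline shape concrete, here is a minimal sketch in Java of how such stages might compose; the Annotator interface and the Document class are illustrative assumptions, not InstaRead's actual API.

    import java.util.List;

    // Hypothetical stage interface: each component reads the document's
    // existing annotations and adds its own (tokens, POS tags, entities, ...).
    interface Annotator {
        void annotate(Document doc);
    }

    // Minimal shared document: the text plus accumulated annotations (elided).
    class Document {
        final String text;
        Document(String text) { this.text = text; }
    }

    class Pipeline {
        private final List<Annotator> stages;
        Pipeline(List<Annotator> stages) { this.stages = stages; }

        // Run stages in order, e.g. sentence splitter -> tokenizer -> POS
        // tagger -> parser -> entity extractor -> coreference -> relations.
        Document run(String text) {
            Document doc = new Document(text);
            for (Annotator stage : stages) {
                stage.annotate(doc);
            }
            return doc;
        }
    }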

  5. Development Cycle — each iteration interleaves analysis with a change: analyze → integrate a resource (e.g., WordNet) → analyze → adjust the ML algorithm → analyze → annotate data → analyze → refine rules → analyze → ...

  6. Need Many Iterations — the same analyze-and-adjust cycle (integrate resources, adjust the ML algorithm, annotate data, refine rules) must be repeated many times.

  7. Iteration is Slow — barriers at every step of the cycle: no appropriate visualization; ML expertise required; different programming languages and data structures; no standardized formats; no advanced queries; no interactive speeds; no expressive rule language connecting components; no direct manipulation. Remove these barriers!

  8. InstaRead: An IDE for IE — create relations; provide DBs of instances; load text datasets; see distributionally similar words; visualizations; write extraction rules in logic; collect & organize rules; see rule recommendations; statistics.

  9. Key Ideas
  • User writes rules in a simple, expressive language → use first-order logic
  • User instantly sees extractions on lots of text → use database indexing
  • User gets automatic rule suggestions → use bootstrapping and learning

  10. Demo
  • Browser-based tool
  • API
  • cd /projects/pardosa/s1/raphaelh/github/readr/exp
  • source ../../init.sh
  • mvn exec:java -Dexec.mainClass=newexp.Materialize

  11. Project Ideas
  In general: create a new component & use InstaRead to explore, tie components together, and analyze
  • Integrate an existing component (e.g., Entity Linking, OpenIE)
  • Mine rules from WordNet, FrameNet, VerbNet, Wiktionary
  • Generate rule suggestions through clustering
  • Generate rule suggestions through (better) bootstrapping
  • Set rule (and thus extraction) confidence based on overlap with a knowledge base (see the sketch below)
  • Develop rules/code to handle specific linguistic phenomena (e.g., time, modality, negation)
  • Experiment with joint inference of parsing + rules
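
  A minimal sketch of the knowledge-base-overlap idea, assuming both a rule's extractions and the KB are available as sets of (subject, object) pairs; the Pair type and method names are hypothetical, not part of InstaRead.

    import java.util.Set;

    record Pair(String subject, String object) {}

    class RuleConfidence {
        // Estimate a rule's confidence as the fraction of its extractions
        // that are confirmed by an existing knowledge base (e.g., Freebase).
        static double kbOverlap(Set<Pair> extractions, Set<Pair> kbFacts) {
            if (extractions.isEmpty()) return 0.0;
            long confirmed = extractions.stream().filter(kbFacts::contains).count();
            return (double) confirmed / extractions.size();
        }
    }

  Since knowledge bases are incomplete, a missing fact is not necessarily an error, so in practice one would smooth this estimate or restrict it to entity pairs the KB covers.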

  12. More detailed slides (optional)

  13. Rules are Crucial
  • Example
  • Rules as patterns
  • Rules as features
  • Rules (or a space of rules) typically supplied
  Holds for both hand-engineered and learned systems. Goal: enable experts to write quality rules extremely quickly. [Plot: extraction quality vs. development time, with the first hour marked.]

  14. Writing Rules in Logic
  • FOL¹ is simple, expressive, extensible, widely used
  • Introduce predicates for NER, dependencies, ...
  • Rules are deterministic and execute in a defined order
  ¹ We use the subset referred to as 'safe domain-relational calculus'.
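
  For illustration, a rule in this style might look as follows; the predicate names (ner, dep, word) and the exact syntax are assumptions for exposition, not InstaRead's actual rule language.

    % Hypothetical extraction rule: x acquired y.
    % ner(x, t): span x has NER type t; dep(l, h, m): dependency edge with
    % label l from head h to modifier m; word(v, w): span v is the word w.
    \mathit{acquired}(x, y) \leftarrow
        \mathit{ner}(x, \mathrm{ORG}) \wedge
        \mathit{ner}(y, \mathrm{ORG}) \wedge
        \mathit{word}(v, \text{``acquired''}) \wedge
        \mathit{dep}(\mathrm{nsubj}, v, x) \wedge
        \mathit{dep}(\mathrm{dobj}, v, y)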

  15. Rule Composition
  • Define new predicates for similar substructure
  • More compact set of rules, better generalization — 9 instead of 24 rules!
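
  As a hedged illustration of composition (the predicate names are invented here): factoring a shared "verb with ORG subject and ORG object" pattern into a helper predicate lets many relation rules reuse it, each reduced to a single lexical test.

    % Helper predicate shared by several relation rules (illustrative).
    \mathit{orgVerbOrg}(v, x, y) \leftarrow
        \mathit{ner}(x, \mathrm{ORG}) \wedge \mathit{ner}(y, \mathrm{ORG}) \wedge
        \mathit{dep}(\mathrm{nsubj}, v, x) \wedge \mathit{dep}(\mathrm{dobj}, v, y)

    % Relation rules now reduce to one conjunct each.
    \mathit{acquired}(x, y) \leftarrow \mathit{orgVerbOrg}(v, x, y) \wedge \mathit{word}(v, \text{``acquired''})
    \mathit{merged}(x, y) \leftarrow \mathit{orgVerbOrg}(v, x, y) \wedge \mathit{word}(v, \text{``merged''})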

  16. Enabling Instant Execution
  • Translation to SQL allows dynamic optimization
  • Each predicate maps to a fragment of SQL
  • Intensional and extensional predicates
  • Indices, caching, and SSDs are important
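
  A minimal sketch of the predicate-to-SQL idea, assuming extensional predicates are stored tables and a conjunctive rule body becomes a join; the Atom type, column naming (c0, c1, ...), and quoting convention are invented for illustration.

    import java.util.*;

    // One atom of a rule body, e.g. dep('nsubj', V, X): an extensional
    // predicate (a stored table) with one term per column.
    record Atom(String table, List<String> terms) {}

    class RuleToSql {
        // Each atom becomes a table in FROM, a repeated variable becomes an
        // equi-join, and a quoted term becomes a constant filter. Head maps
        // output column names to body variables (which must occur in the body).
        static String toSql(List<Atom> body, Map<String, String> head) {
            List<String> from = new ArrayList<>();
            List<String> where = new ArrayList<>();
            Map<String, String> firstUse = new HashMap<>(); // variable -> column ref

            for (int i = 0; i < body.size(); i++) {
                Atom a = body.get(i);
                String alias = "t" + i;
                from.add(a.table() + " " + alias);
                for (int c = 0; c < a.terms().size(); c++) {
                    String term = a.terms().get(c);
                    String ref = alias + ".c" + c; // columns assumed named c0, c1, ...
                    if (term.startsWith("'")) {
                        where.add(ref + " = " + term);               // constant filter
                    } else if (firstUse.containsKey(term)) {
                        where.add(ref + " = " + firstUse.get(term)); // equi-join
                    } else {
                        firstUse.put(term, ref);
                    }
                }
            }
            String select = head.entrySet().stream()
                .map(e -> firstUse.get(e.getValue()) + " AS " + e.getKey())
                .reduce((x, y) -> x + ", " + y).orElse("*");
            return "SELECT " + select + " FROM " + String.join(", ", from)
                 + (where.isEmpty() ? "" : " WHERE " + String.join(" AND ", where));
        }
    }

  For example, a body of dep('nsubj', V, X) and dep('dobj', V, Y) would join the dep table with itself on the shared variable V; the database optimizer then chooses indices and join order.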

  17. Experimental Setup
  Datasets:
  • ½ NYTimes¹ — 22M news sentences (development)
  • ½ NYTimes¹ — 22M news sentences (held-out half)
  • NYTimes07¹ — 1M news sentences
  • CoNLL04² — 5K selected sentences (some gold annotations)
  Procedure:
  • Develop 4 relational extractors, in 55 min each
  • Compare to a weakly supervised system (Bootstrap, Keyword, Morphology, and Decomposition features)
  • Compare to gold annotations for error analysis
  ¹ LDC2008T19   ² Roth & Yih, 2004

  18. Comparison to Weakly Supervised Extraction
  • Precision consistently at least 90%
  • Works well, even when weak supervision fails
  [Chart: Rules vs. weakly supervised system; one relation had 37 examples in train, another 52.]

  19. Development Phases
  • Bootstrap Tool (0:15)
  • Keywords Tool (0:15)
  • Morphology Feature (0:05) — mined morphology from Wiktionary (tense etc.)
  • Decomposition Feature (0:20) — enables chaining of rules

  20. Comparison of Development Phases
  • Bootstrap initially effective, but recall limited
  • Decomposition effective for some relations

  21. Error Analysis
  • On CoNLL04 'killed', w.r.t. gold annotations: recall 0.34, precision 0.98
  • 36% of all errors are due to incorrect pre-processing

  22. Joint Parsing & Relation Extraction
  Pipeline: Text → CJ Phrase-Structure Parser (k-best, statistical) → PSP → Stanford Dependency Extractor (deterministic) → Deps → Relation Extraction Rules (deterministic) → Tuples
  • If the top parse is wrong, the correct parse is among the top 2 in 35% of cases, among the top 5 in 50% of cases, and among the top 50 in 90% of cases
  • Heuristic: select the first parse among the top k with the most extractions — 18/19 flipped to the correct parse, with almost no drop in precision (see the sketch below)
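
  A minimal sketch of that heuristic, assuming a parser that returns k-best parses in rank order and an extractor that counts rule matches per parse; both interfaces are assumptions for illustration.

    import java.util.List;
    import java.util.function.ToIntFunction;

    class ParseReranker {
        // Among the k-best parses, pick the first (highest-ranked) parse
        // that yields the maximum number of extractions; ties therefore
        // go to the parser's preferred reading.
        static <P> P selectParse(List<P> kBest, ToIntFunction<P> countExtractions) {
            P best = kBest.get(0);
            int bestCount = countExtractions.applyAsInt(best);
            for (P parse : kBest.subList(1, kBest.size())) {
                int count = countExtractions.applyAsInt(parse);
                if (count > bestCount) { best = parse; bestCount = count; }
            }
            return best;
        }
    }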

  23. Runtime Performance
  • On 22M sentences: 3.7B rows in 75 tables, 140GB
  • Most queries execute in 74ms or less
  • Outliers due to the Bootstrap tool
  • Aggregation over millions of rows motivates streaming or sampling (see the sketch below)
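
  As a hedged illustration of the sampling option (a standard technique, not necessarily what InstaRead implements): reservoir sampling maintains a fixed-size uniform sample of an arbitrarily long row stream, so aggregates can be estimated at interactive speed and scaled back up.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    class Reservoir<T> {
        private final List<T> sample = new ArrayList<>();
        private final int capacity;
        private final Random rng = new Random();
        private long seen = 0;

        Reservoir(int capacity) { this.capacity = capacity; }

        // Maintain a uniform sample of all rows seen so far (Algorithm R):
        // the i-th row replaces a random slot with probability capacity/i.
        void offer(T row) {
            seen++;
            if (sample.size() < capacity) {
                sample.add(row);
            } else {
                long j = (long) (rng.nextDouble() * seen);
                if (j < capacity) sample.set((int) j, row);
            }
        }

        // Scale a count observed in the sample back up to the full stream.
        double estimateTotal(long countInSample) {
            if (sample.isEmpty()) return 0;
            return countInSample * (double) seen / sample.size();
        }
    }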

  24. Logic
  • Often Horn clauses are sufficient, but sometimes we need ¬, ∀, ∃
  • Example 1: the founded relation. "Michael Dell built his first company in a dorm-room." (founded) vs. "Mr. Harris built Dell into a formidable competitor to IBM." (not founded) — the desired rule needs an extra conjunct to tell the two uses of "built" apart (see the sketch below)
  • Example 2: integration of co-reference — "nearest noun-phrase which satisfies ..."
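
  One plausible form of that extra conjunct, written here as an assumption since the slide's actual formula is not in the transcript: a negated existential that rules out the "built ... into ..." construction, so the second sentence does not trigger founded.

    % Illustrative guess at the desired conjunct for founded(x, y):
    % require a "built" verb linking x and y, and exclude any
    % "into" argument of that verb via a negated existential.
    \mathit{founded}(x, y) \leftarrow
        \mathit{word}(v, \text{``built''}) \wedge
        \mathit{dep}(\mathrm{nsubj}, v, x) \wedge
        \mathit{dep}(\mathrm{dobj}, v, y) \wedge
        \neg \exists z\, \mathit{dep}(\mathrm{prep\_into}, v, z)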
