
IE approaches





  1. IE approaches
     • Traditional IE (from NLP and CL)
       • Using syntactic and semantic constraints
     • Wrapper (independently developed for WWW)
       • Using delimiter-based extraction patterns
     • This paper
       • Soft Pattern + IR (PRF) + summarization (sentence retrieval/ranking, MMR) techniques

  2. Unsupervised Learning of Soft Patterns for Generating Definitions from Online News
     • IE from a QA perspective
     • Research question: finding definition sentences for terms or person names
     • Previous approaches:
       • Hand-crafted rules (previous paper), or
       • Supervised learning
     • Research method:
       • Unsupervised soft patterns + IR + summarization
     • External tools needed: commercial POS tagger and syntactic chunker (NP, VP)

  3. Soft Patterns
     • A virtual vector representation (window size 3)
       • <Slot-w, …, Slot-2, Slot-1, SCH_TERM, Slot1, Slot2, …, Slotw : Pa>
     • Slot: a vector of tokens with their probabilities of occurrence
       • <(tokeni1, weighti1), (tokeni2, weighti2), …, (tokenim, weightim) : Sloti>
     • Token: word, punctuation or syntactic tag (substituted?)
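
The slot representation above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the paper's implementation: the function name `build_soft_pattern`, the toy instances, and the use of plain relative frequencies as slot weights are all hypothetical choices made for this example.

```python
from collections import Counter

SCH_TERM = "SCH_TERM"  # placeholder token that replaces the search term


def build_soft_pattern(instances, w=3):
    """Estimate per-slot token probabilities from tokenized instances.

    Each instance is a list of tokens that contains SCH_TERM once.
    Returns a dict mapping slot offset (-w..-1 and 1..w) to
    {token: relative frequency in that slot}.
    """
    counts = {k: Counter() for k in range(-w, w + 1) if k != 0}
    for tokens in instances:
        center = tokens.index(SCH_TERM)
        for k in counts:
            pos = center + k
            if 0 <= pos < len(tokens):
                counts[k][tokens[pos]] += 1
    pattern = {}
    for k, counter in counts.items():
        total = sum(counter.values())
        # An empty slot (no instance reaches this offset) stays empty.
        pattern[k] = {tok: n / total for tok, n in counter.items()} if total else {}
    return pattern


# Toy instances after tagging, chunking and substitution (made up for illustration).
instances = [
    ["NP", ",", SCH_TERM, "is", "a", "NP"],
    ["the", SCH_TERM, "is", "the", "NP"],
    [SCH_TERM, ",", "a", "NP"],
]
pattern = build_soft_pattern(instances)
```

Here `pattern[1]` holds the distribution of tokens that directly follow the search term, so a frequent cue such as "is" receives a higher weight than a rare one.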

  4. Soft Patterns Emerged from Text

  5. Soft Patterns Matching Process
     • Sentences → tagging, chunking, substitution → Pa instances → probability estimate → soft patterns Pa
     • Test sentence → tagging, chunking, substitution → S instance
       • <token-w, …, token-2, token-1, SCH_TERM, token1, token2, …, tokenw : S>
     • Matching:
       1) Bag-of-words similarity using Naive Bayes
       2) Sequence fidelity using a bigram model
       3) Weighting patterns by their overall weight

  6. Soft Patterns Matching
     • Bag-of-words similarity using Naive Bayes
     • Sequence fidelity using a bigram model
     • Where is Pa?
     • Manual tuning of alpha?
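
A rough sketch of how the two matching components might be combined follows. The slides do not give the formula, so everything here is an assumption: that alpha linearly interpolates the two log scores, the floor probability for unseen tokens, the add-one smoothing, and all function names are hypothetical.

```python
import math
from collections import Counter


def slot_log_prob(pattern, context, floor=1e-3):
    """Naive Bayes style score: slots are treated as independent, so the
    log-probabilities of the observed token in each slot are summed.
    Tokens unseen in a slot get a small floor probability (an assumption)."""
    total = 0.0
    for offset, token in context.items():
        total += math.log(pattern.get(offset, {}).get(token, floor))
    return total


def train_bigrams(sequences):
    """Count bigrams and their left contexts over token sequences."""
    context_counts, bigram_counts = Counter(), Counter()
    vocab = set()
    for seq in sequences:
        vocab.update(seq)
        for prev, cur in zip(seq, seq[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return context_counts, bigram_counts, len(vocab)


def bigram_log_prob(seq, model):
    """Add-one smoothed log-probability of the token sequence."""
    context_counts, bigram_counts, vocab_size = model
    return sum(
        math.log((bigram_counts[(p, c)] + 1) / (context_counts[p] + vocab_size))
        for p, c in zip(seq, seq[1:])
    )


def match_score(pattern, context, model, alpha=0.5):
    """Interpolate slot similarity and sequence fidelity with weight alpha."""
    seq = [context[k] for k in sorted(context)]  # tokens in left-to-right order
    return alpha * slot_log_prob(pattern, context) + (1 - alpha) * bigram_log_prob(seq, model)


# Toy pattern with two right-hand slots, and bigrams from made-up instances.
pattern = {1: {"is": 0.75, ",": 0.25}, 2: {"a": 0.5, "the": 0.5}}
model = train_bigrams([["is", "a"], ["is", "the"], [",", "a"]])
good = match_score(pattern, {1: "is", 2: "a"}, model)
bad = match_score(pattern, {1: "was", 2: "of"}, model)
```

With these toy values, a context that matches the pattern ("is a") scores higher than one that does not ("was of"). The slide's question about manual tuning corresponds to the `alpha` argument here.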

  7. System Architecture
     • Diagram labels: search term; IR, anaphora resolution; final sentence selection; input relevant sentences; redundancy removal: MMR; centroid-based ranking; matched candidate sentences as definition; reranking by pattern matching; ranked sentences; top n by PRF; SP generation
     • Pseudo-relevance feedback or assumption?
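
The redundancy removal step names MMR (maximal marginal relevance). A minimal sketch of greedy MMR selection is given below, assuming Jaccard word overlap as the similarity function and a trade-off weight `lam`. Both are illustrative choices, as are the function names and the toy sentences; the slides do not say which similarity or weight the system uses.

```python
def jaccard(a, b):
    """Word-set overlap between two sentences (assumed similarity measure)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def mmr_select(candidates, n, lam=0.7):
    """Greedy MMR: repeatedly pick the candidate that maximizes
    lam * relevance - (1 - lam) * (max similarity to already selected).

    candidates: list of (sentence, relevance_score) pairs.
    """
    remaining = list(candidates)
    selected = []
    while remaining and len(selected) < n:
        def mmr(item):
            sentence, relevance = item
            redundancy = max((jaccard(sentence, s) for s in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy

        best = max(remaining, key=mmr)
        selected.append(best[0])
        remaining.remove(best)
    return selected


# Toy ranked candidates (sentences and scores are made up).
candidates = [
    ("tuberculosis is an infectious disease caused by bacteria", 0.90),
    ("tuberculosis is an infectious bacterial disease", 0.85),
    ("it mainly affects the lungs", 0.60),
]
picked = mmr_select(candidates, n=2, lam=0.5)
```

With `lam=0.5`, the second pick is the lower-ranked but non-overlapping sentence about the lungs, since the near-duplicate second candidate is penalized for its overlap with the first.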

  8. Centroid Word Selection
     • Which sentences are most likely to contain a definition?
     • Local centroid words (summarization techniques)
     • For each word, compute its mutual information with the search term
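
One way to read "mutual information with the search term" is pointwise mutual information over sentence-level co-occurrence. The sketch below takes that reading as an assumption; the slides do not specify the exact statistic, and the function name and toy sentences are hypothetical.

```python
import math
from collections import Counter


def mutual_info_with_term(sentences, term):
    """Pointwise mutual information between each word and the search term,
    counting co-occurrence at the sentence level:
        PMI(w, term) = log( P(w, term) / (P(w) * P(term)) )
    Only words that co-occur with the term at least once get a score.
    """
    n = len(sentences)
    word_count, co_count = Counter(), Counter()
    term_count = 0
    for sentence in sentences:
        words = set(sentence.lower().split())
        has_term = term in words
        term_count += has_term
        for w in words - {term}:
            word_count[w] += 1
            if has_term:
                co_count[w] += 1
    return {
        w: math.log(co * n / (word_count[w] * term_count))
        for w, co in co_count.items()
    }


# Toy retrieved sentences (made up for illustration).
sentences = [
    "sars is a respiratory disease",
    "sars is caused by a virus",
    "the market is open",
    "a virus can spread",
]
scores = mutual_info_with_term(sentences, "sars")
centroid = sorted(scores, key=lambda w: (-scores[w], w))[:3]
```

Words that appear only alongside the term, such as "respiratory", score highest, while words that occur just as often without it, such as "virus" in this toy data, score zero. Sentences could then be ranked by how many high-scoring centroid words they contain.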

  9. Summary of the techniques employed
     • Core: soft pattern generalization and matching
     • Others:
       • Heavy use of summarization techniques
         • MMR for redundancy removal
         • Sentence ranking/retrieval
       • Shallow NLP
         • POS tagging and syntactic chunker

  10. Evaluation for Information Extraction

  11. Evaluation for Definition Extraction
      • Test data:
        • TREC QA corpus
        • Online news (heuristics leaning to news text)
      • Experiment:
        • Comparison to hand-crafted rules (HCR) and a centroid-based statistical method (baseline)
        • F5-measure
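
Assuming "F5-measure" denotes the standard F-beta measure with beta = 5, which weights recall far more heavily than precision, it can be computed as below. The function name and the example precision and recall values are illustrative only and are not results from the paper.

```python
def f_beta(precision, recall, beta=5.0):
    """F-beta measure: (1 + beta^2) * P * R / (beta^2 * P + R).

    beta > 1 favors recall; beta = 1 gives the harmonic mean (F1).
    """
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Same two numbers, swapped: high recall is rewarded far more than high precision.
high_recall = f_beta(precision=0.2, recall=0.8)
high_precision = f_beta(precision=0.8, recall=0.2)
```

Under this measure a system that retrieves most of the definition content with some noise scores much better than one that is precise but misses most of it.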

  12. Evaluation for TREC collection

  13. Evaluation for Web Corpus

  14. Questions for this paper
      • Chunker-variate performance? (NP, VP)
      • Manually tuned parameters (alpha, delta)?
      • Void PRF?
      • Question selection: seed for pattern generation
      • Is it “patterns” or just one pattern at all?
      • Arbitrary window size?
      • Is it really “unsupervised learning”?
        • Part of the data is used for rule induction
      • Can SP+PRF really beat HCR?

