
IE with Dictionaries

This project focuses on the problem of finding names in email text given a dictionary of names. The task is complicated by factors such as nicknames, abbreviations, misspellings, and polysemous words. The problem is similar to record linkage and requires combining state-of-the-art similarity metrics and Named Entity Recognition (NER) systems. The project explores using Sequential Word Classification and Semi-Markov models for Information Extraction.


Presentation Transcript


  1. IE with Dictionaries Cohen & Sarawagi

  2. Announcements
  • Current statistics:
    • days with unscheduled student talks: 2
    • students with unscheduled student talks: 0
  • Projects are due 4/28 (last day of class)
  • Additional requirement: a draft (for comments) no later than 4/21

  3. Finding names you know about
  • Problem: given a dictionary of names, find them in email text
  • Important task beyond email (biology, link analysis, ...)
  • Exact match is unlikely to work perfectly, due to nicknames (Will Cohen), abbreviations (William C), misspellings (Willaim Chen), polysemous words (June, Bill), etc.
  • In informal text it sometimes works very poorly
  • The problem is similar to the record linkage (aka data cleaning, de-duping, merge-purge, ...) problem of finding duplicate database records in heterogeneous databases.

  4. Finding names you know about
  • Technical problem: it is hard to combine state-of-the-art similarity metrics (as used in record linkage) with a state-of-the-art NER system, due to a representational mismatch:
  • Opening up the box, modern NER systems don't really know anything about names...

  5. IE as Sequential Word Classification
  A trained IE system models the relative probability of labeled sequences of words, using states such as "person name", "location name", and "background". To classify, find the most likely state sequence for the given words: Yesterday Pedro Domingos spoke this example sentence. Any words generated by the designated "person name" state are extracted as a person name: Person name: Pedro Domingos
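  To make the "extract the words generated by the person-name state" step concrete, here is a minimal Python sketch (not from the slides): it assumes a label has already been predicted for every word and simply collects the contiguous runs labeled "person name".

    # Minimal sketch: collect contiguous runs of words labeled "person name".
    def extract_person_names(words, labels):
        names, current = [], []
        for word, label in zip(words, labels):
            if label == "person name":
                current.append(word)
            elif current:
                names.append(" ".join(current))
                current = []
        if current:
            names.append(" ".join(current))
        return names

    # extract_person_names(
    #     ["Yesterday", "Pedro", "Domingos", "spoke", "this", "example", "sentence"],
    #     ["background", "person name", "person name",
    #      "background", "background", "background", "background"])
    # -> ["Pedro Domingos"]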

  6. IE as Sequential Word Classification
  Modern IE systems use a rich representation for words, and clever probabilistic models of how labels interact in a sequence, but do not explicitly represent the names extracted.
  [Figure: example features of the word at position t (e.g. "Wisniewski"): identity of word; ends in "-ski"; is capitalized; is part of a noun phrase; is in a list of city names; is under node X in WordNet; is in bold font; is indented; is in hyperlink anchor; last person name was female; next two words are "and Associates".]
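  As an illustration of such a rich word representation, here is a hedged sketch of a feature function for the word at position t; the specific feature names and the small CITY_NAMES placeholder are illustrative assumptions, not the actual feature set of any of these systems.

    # Illustrative word-level features, loosely following the list above.
    CITY_NAMES = {"Pittsburgh", "Mumbai", "Boston"}   # placeholder list (assumption)

    def word_features(words, t):
        w = words[t]
        next_two = " ".join(words[t + 1:t + 3])
        return {
            "identity=" + w.lower(): True,
            "ends_in_-ski": w.endswith("ski"),
            "is_capitalized": w[:1].isupper(),
            "in_city_list": w in CITY_NAMES,
            "next_two_are_and_Associates": next_two == "and Associates",
        }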

  7. Semi-Markov models for IE (with Sunita Sarawagi, IIT Bombay)
  • Train on sequences of labeled segments, not labeled words: S = (start, end, label)
  • Build a probability model of segment sequences, not word sequences
  • Define features f of segments, e.g. f(S) = words xt...xu, length, previous words, case information, ..., distance to a known name (a sketch follows below)
  • (Approximately) optimize feature weights on the training data to maximize the score of the correct segmentations
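  A hedged sketch of what a segment feature function f(S) might look like, assuming a segment is a (start, end, label) triple over the word sequence; the particular features (length, capitalization, exact dictionary membership) are illustrative, and a real system would use a soft similarity to the closest dictionary entry rather than exact lookup.

    # Illustrative features of a whole segment S = (start, end, label).
    def segment_features(words, start, end, label, dictionary):
        segment_text = " ".join(words[start:end])
        return {
            "label=%s&length=%d" % (label, end - start): True,
            "label=%s&all_capitalized" % label: all(w[:1].isupper() for w in words[start:end]),
            "label=%s&exact_dictionary_match" % label: segment_text in dictionary,
        }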

  8. Details: Semi-Markov model

  9. Segments vs tagging
  Tagging models define features f(xt, yt) over individual word positions t; segment models define features f(x(t..u), yj) over segments spanning positions t through u.

  10. Details: Semi-Markov model

  11. Conditional Semi-Markov models: CMM vs. CSMM
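  The equations on this slide did not survive transcription; a plausible reconstruction, following the usual definitions of a conditional Markov model (per-word labels) versus a conditional semi-Markov model (per-segment labels), is:

    % CMM: each word's label is conditioned on the previous label and the input
    P(\mathbf{y} \mid \mathbf{x}) \;=\; \prod_{t} P(y_t \mid y_{t-1}, \mathbf{x})
    % CSMM: each segment s_j = (start_j, end_j, label_j)
    % is conditioned on the previous segment and the input
    P(S \mid \mathbf{x}) \;=\; \prod_{j} P(s_j \mid s_{j-1}, \mathbf{x})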

  12. A training algorithm for CSMM's (1) Review: Collins' perceptron training algorithm (compare the correct tags against the Viterbi tags)

  13. A training algorithm for CSMM's (2) Variant of Collins' perceptron training algorithm: a voted perceptron learner for TTRANS (Viterbi-like decoding)

  14. A training algorithm for CSMM's (3) Variant of Collins' perceptron training algorithm: a voted perceptron learner for TTRANS (Viterbi-like decoding)

  15. A training algorithm for CSMM's (3) Variant of Collins' perceptron training algorithm: a voted perceptron learner for TSEGTRANS (Viterbi-like decoding)
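  Since the slides only name the algorithm, here is a hedged sketch of Collins-style (averaged) perceptron training adapted to segmentations; viterbi_segment and feature_vector are stand-in helpers for the decoder and the segment feature extractor, and the update shown is the standard structured-perceptron rule, not necessarily the paper's exact variant.

    # Structured (averaged) perceptron training over segmentations.
    # viterbi_segment(words, weights) and feature_vector(words, segments) are assumed helpers.
    def train_perceptron(examples, viterbi_segment, feature_vector, num_epochs=5):
        weights, totals = {}, {}
        for _ in range(num_epochs):
            for words, gold_segments in examples:
                predicted = viterbi_segment(words, weights)   # best segmentation under current weights
                if predicted != gold_segments:
                    # reward features of the correct segmentation ...
                    for feat, value in feature_vector(words, gold_segments).items():
                        weights[feat] = weights.get(feat, 0.0) + value
                    # ... and penalize features of the predicted one
                    for feat, value in feature_vector(words, predicted).items():
                        weights[feat] = weights.get(feat, 0.0) - value
                # accumulate weights so the returned model is an average over updates
                for feat, value in weights.items():
                    totals[feat] = totals.get(feat, 0.0) + value
        return totals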

  16. Viterbi for HMMs

  17. Viterbi for SMM
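  A hedged sketch of Viterbi decoding for a semi-Markov model: unlike the word-level HMM recursion, each step also chooses a segment length (bounded here by max_len). score_fn is an assumed callable that scores one candidate segment given the previous label, e.g. as a dot product of weights with segment features.

    # Semi-Markov Viterbi: best segmentation, considering segments of up to max_len words.
    def viterbi_segment(words, labels, score_fn, max_len=4):
        n = len(words)
        best = [dict() for _ in range(n + 1)]   # best[end][label] = best score of a
        back = [dict() for _ in range(n + 1)]   # segmentation of words[:end] ending in label
        best[0] = {None: 0.0}
        for end in range(1, n + 1):
            for start in range(max(0, end - max_len), end):
                for label in labels:
                    for prev_label, prev_score in best[start].items():
                        s = prev_score + score_fn(words, start, end, label, prev_label)
                        if label not in best[end] or s > best[end][label]:
                            best[end][label] = s
                            back[end][label] = (start, prev_label)
        # trace back the highest-scoring segmentation
        segments, end = [], n
        label = max(best[n], key=best[n].get)
        while end > 0:
            start, prev_label = back[end][label]
            segments.append((start, end, label))
            end, label = start, prev_label
        return list(reversed(segments))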

  18. Sample CSMM features

  19. Experimental results
  • Baseline algorithms:
    • HMM-VP/1: tags are "in entity", "other"
    • HMM-VP/4: tags are "begin entity", "end entity", "continue entity", "unique", "other"
    • SMM-VP: all features f(w) have versions for "f(w) true for some w in segment that is the first (last, any) word of the segment"
  • Dictionaries: used as features, like Borthwick (see the sketch after this list):
    • HMM-VP/1: fD(w) = "word w is in D"
    • HMM-VP/4: fD,begin(w) = "word w begins an entity in D", etc.
  • Dictionary lookup
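  A hedged sketch of the Borthwick-style dictionary features described above; the dictionary is assumed to be a set of full names, and the feature names are illustrative.

    # Dictionary-derived word features in the spirit of fD(w) and fD,begin(w).
    def dictionary_features(word, dictionary):
        # dictionary: set of full names, e.g. {"William Cohen", "Sunita Sarawagi"}
        all_tokens = {w for name in dictionary for w in name.split()}
        first_tokens = {name.split()[0] for name in dictionary if name.split()}
        return {
            "fD: word is in D": word in all_tokens,                        # HMM-VP/1 style
            "fD,begin: word begins an entity in D": word in first_tokens,  # HMM-VP/4 style
        }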

  20. Datasets used Small training sets (10% of the available data) were used in the experiments.

  21. Results

  22. Results: varying history

  23. Results: changing the dictionary

  24. Results: vs CRF

  25. Results: vs CRF
