This project focuses on the problem of finding names in email text given a dictionary of names. The task is complicated by factors such as nicknames, abbreviations, misspellings, and polysemous words. The problem is similar to record linkage and requires combining state-of-the-art similarity metrics and Named Entity Recognition (NER) systems. The project explores using Sequential Word Classification and Semi-Markov models for Information Extraction.
IE with Dictionaries Cohen & Sarawagi
Announcements • Current statistics: • days with unscheduled student talks: 2 • students with unscheduled student talks: 0 • Projects are due: 4/28 (last day of class) • Additional requirement: draft (for comments) no later than 4/21
Finding names you know about • Problem: given a dictionary of names, find them in email text • Important task beyond email (biology, link analysis, ...) • Exact match is unlikely to work perfectly, due to nicknames (Will Cohen), abbreviations (William C), misspellings (Willaim Chen), polysemous words (June, Bill), etc. • In informal text it sometimes works very poorly • Problem is similar to record linkage (aka data cleaning, de-duping, merge-purge, ...): the problem of finding duplicate database records in heterogeneous databases.
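A minimal sketch of why approximate matching helps with the variants listed on this slide. It uses Python's standard `difflib.SequenceMatcher` as a stand-in similarity metric; the helper names `similarity` and `best_match` and the 0.7 threshold are assumptions made for illustration, not the metrics used in the talk.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1], case-insensitive."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def best_match(candidate, dictionary, threshold=0.7):
    """Return (name, score) for the most similar dictionary entry,
    or (None, score) if nothing reaches the (assumed) threshold."""
    score, name = max((similarity(candidate, n), n) for n in dictionary)
    return (name, score) if score >= threshold else (None, score)

names = ["William Cohen", "Sunita Sarawagi"]
for variant in ["Will Cohen", "William C", "Willaim Chen", "Pedro Domingos"]:
    match, score = best_match(variant, names)
    print(f"{variant!r} -> {match!r} (similarity {score:.2f})")
```

Exact lookup would miss all three variants of "William Cohen", while a thresholded similarity recovers them. A threshold this loose would also raise false positives on polysemous words, which is part of what makes the task hard.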
Finding names you know about • Technical problem: • Hard to combine state-of-the-art similarity metrics (as used in record linkage) with state-of-the-art NER systems due to representational mismatch: • Opening up the box, modern NER systems don’t really know anything about names....
IE as Sequential Word Classification A trained IE system models the relative probability of labeled sequences of words. person name location name background To classify, find the most likely state sequence for the given words: Yesterday Pedro Domingos spoke this example sentence. Any words said to be generated by the designated “person name” state extract as a person name: Person name: Pedro Domingos
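The "find the most likely state sequence" step can be sketched with a small Viterbi decoder. This is an illustrative toy, not the trained system from the slide: the scoring function `toy_score` and its hand-set weights are assumptions standing in for a learned model, and the decoder assumes a non-empty sentence.

```python
def viterbi(words, labels, score):
    """Return the highest-scoring label sequence.
    score(prev_label, label, words, t) is an additive (log-space) score."""
    # best[t][y] = (score of best path ending in label y at position t, backpointer)
    best = [{y: (score(None, y, words, 0), None) for y in labels}]
    for t in range(1, len(words)):
        best.append({
            y: max((best[t - 1][p][0] + score(p, y, words, t), p) for p in labels)
            for y in labels
        })
    y = max(labels, key=lambda lab: best[-1][lab][0])
    path = [y]
    for t in range(len(words) - 1, 0, -1):
        y = best[t][y][1]          # label at position t-1 on the best path
        path.append(y)
    return path[::-1]

def toy_score(prev, label, words, t):
    """Hand-set toy weights (assumption): capitalized non-initial words look like names."""
    s = 0.0
    if words[t][0].isupper() and t > 0:
        s += 2.0 if label == "person" else 0.0
    else:
        s += 1.0 if label == "other" else 0.0
    if prev == label:
        s += 0.5                   # mild preference for staying in the same state
    return s

def extract(words, tags, target="person"):
    """Group consecutive words tagged `target` into extracted names."""
    names, current = [], []
    for w, t in zip(words, tags):
        if t == target:
            current.append(w)
        elif current:
            names.append(" ".join(current))
            current = []
    if current:
        names.append(" ".join(current))
    return names

words = "Yesterday Pedro Domingos spoke this example sentence.".split()
tags = viterbi(words, ["other", "person"], toy_score)
print(extract(words, tags))
```

Note that nothing in this decoder looks at the extracted name as a whole: every score is attached to a single word and its neighbouring label.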
IE as Sequential Word Classification Modern IE systems use a rich representation for words, and clever probabilistic models of how labels interact in a sequence, but do not explicitly represent the names extracted. Example word features: • identity of word • ends in “-ski” • is capitalized • is part of a noun phrase • is in a list of city names • is under node X in WordNet • is in bold font • is indented • is in hyperlink anchor • last person name was female • next two words are “and Associates” [Figure: positions t-1, t, t+1, each with a node w and a node O; the word at position t is “Wisniewski”, with features “part of noun phrase” and “ends in ‘-ski’”]
Semi-Markov models for IE (with Sunita Sarawagi, IIT Bombay) • Train on sequences of labeled segments, not labeled words: S = (start, end, label) • Build probability model of segment sequences, not word sequences • Define features f of segments: f(S) = words x_t...x_u, length, previous words, case information, ..., distance to known name • (Approximately) optimize feature weights on training data maximize:
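A minimal sketch of decoding over segments S = (start, end, label) rather than single words. It is a simplification under stated assumptions: the score of a segment ignores the previous segment's label, the weights in `segment_score` are hand-set rather than trained, and `difflib.SequenceMatcher` stands in for the "distance to known name" feature. All function names here are hypothetical.

```python
from difflib import SequenceMatcher

DICTIONARY = ["William Cohen", "Sunita Sarawagi"]  # assumed known names

def segment_score(words, start, end, label):
    """Toy segment score (assumption): a 'name' segment is rewarded by its
    similarity to the closest dictionary entry; 'other' segments are one word."""
    if label == "other":
        return 0.5 if start == end else -1.0
    text = " ".join(words[start:end + 1]).lower()
    sim = max(SequenceMatcher(None, text, n.lower()).ratio() for n in DICTIONARY)
    return 4.0 * sim - 2.0

def semi_markov_decode(words, labels, score, max_len=3):
    """Segment-level Viterbi: best[u] is the best score of segmenting words[:u]."""
    n = len(words)
    best = [0.0] + [float("-inf")] * n
    back = [None] * (n + 1)
    for u in range(1, n + 1):
        for d in range(1, min(max_len, u) + 1):   # candidate segment length
            for y in labels:
                s = best[u - d] + score(words, u - d, u - 1, y)
                if s > best[u]:
                    best[u], back[u] = s, (u - d, u - 1, y)
    segments, u = [], n
    while u > 0:                                  # follow backpointers
        segments.append(back[u])
        u = back[u][0]
    return segments[::-1]

words = "I met Willaim Chen today".split()
print(semi_markov_decode(words, ["other", "name"], segment_score))
```

Because the feature is computed on the whole candidate segment, the misspelled two-word span can be compared against "William Cohen" directly, which a per-word tagger cannot do.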
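Segments vs tagging [Figure: tagging assigns a label y_t to each word x_t at position t, with features f(x_t, y_t); segmentation assigns a label y_j to each segment x_j spanning positions t..u, with features f(x_j, y_j)]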
Conditional Semi-Markov models CMM: CSMM:
A training algorithm for CSMM’s (1) Review: Collins’ perceptron training algorithm Correct tags Viterbi tags
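As a reminder of the update the slide refers to, here is a minimal sketch of one step of Collins-style structured perceptron training: add the features of the correct tag sequence and subtract those of the predicted (Viterbi) sequence. The feature templates in `features` are illustrative assumptions, and the decoder is left out: `predicted_tags` is whatever the current weights decode to.

```python
from collections import Counter, defaultdict

def features(words, tags):
    """Count simple word, capitalization, and transition features (assumed templates)."""
    f = Counter()
    prev = "START"
    for w, t in zip(words, tags):
        f[("word", w.lower(), t)] += 1
        f[("cap", w[0].isupper(), t)] += 1
        f[("trans", prev, t)] += 1
        prev = t
    return f

def perceptron_update(weights, words, gold_tags, predicted_tags):
    """W <- W + f(x, correct tags) - f(x, Viterbi tags); no change if they agree."""
    if gold_tags == predicted_tags:
        return
    for k, v in features(words, gold_tags).items():
        weights[k] += v
    for k, v in features(words, predicted_tags).items():
        weights[k] -= v

weights = defaultdict(float)
words = ["Yesterday", "Pedro", "Domingos", "spoke"]
gold = ["other", "person", "person", "other"]
viterbi_tags = ["other", "other", "other", "other"]   # pretend decoder output
perceptron_update(weights, words, gold, viterbi_tags)
print({k: v for k, v in weights.items() if v != 0})
```

Features shared by both sequences cancel, so only the positions where the prediction was wrong move the weights. The voted (or averaged) variant additionally keeps track of the intermediate weight vectors, which this sketch omits.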
A training algorithm for CSMM’s (2) Variant of Collins’ perceptron training algorithm: voted perceptron learner for TTRANS like Viterbi
A training algorithm for CSMM’s (3) Variant of Collins’ perceptron training algorithm: voted perceptron learner for TTRANS like Viterbi
A training algorithm for CSMM’s (3) Variant of Collins’ perceptron training algorithm: voted perceptron learner for TSEGTRANS like Viterbi
Experimental results • Baseline algorithms: • HMM-VP/1: tags are “in entity”, “other” • HMM-VP/4: tags are “begin entity”, “end entity”, “continue entity”, “unique”, “other” • SMM-VP: all features f(w) have versions for “f(w) true for some w in segment that is first (last, any) word of segment” • dictionaries: like Borthwick • HMM-VP/1: f_D(w) = “word w is in D” • HMM-VP/4: f_{D,begin}(w) = “word w begins entity in D”, etc. • Dictionary lookup
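The HMM-VP/4 tag scheme and the word-level dictionary features can be sketched as follows. This is an illustration under assumptions: entity spans are given as inclusive (start, end) word indices, dictionary entries are split on whitespace, and the function and feature names are hypothetical.

```python
def encode_tags(n_words, entities):
    """Map entity spans to the HMM-VP/4 tags: begin, continue, end, unique, other."""
    tags = ["other"] * n_words
    for start, end in entities:
        if start == end:
            tags[start] = "unique"            # one-word entity
        else:
            tags[start] = "begin"
            tags[end] = "end"
            for i in range(start + 1, end):
                tags[i] = "continue"
    return tags

def dictionary_features(word, dictionary):
    """Word-level dictionary features in the spirit of f_D(w) and f_{D,begin}(w)."""
    w = word.lower()
    entries = [name.lower().split() for name in dictionary]
    return {
        "in_D": any(w in e for e in entries),          # word w is in D
        "begins_D": any(e[0] == w for e in entries),   # word w begins an entry in D
        "ends_D": any(e[-1] == w for e in entries),    # word w ends an entry in D
    }

print(encode_tags(5, [(1, 3)]))
print(dictionary_features("William", ["William Cohen"]))
```

These features only say whether a single word appears somewhere in the dictionary, so they cannot express how close a whole candidate span is to a dictionary entry.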
Datasets used Used small training sets (10% of available) in experiments.