COLING/ACL 2006
Segment-based Hidden Markov Models for Information Extraction
Zhenmei Gu, David R. Cheriton School of Computer Science, University of Waterloo
Nick Cercone, Faculty of Computer Science, Dalhousie University
JSYU, 2006.09.14
Outline
• Introduction
  • Problem description
  • Previous work
  • Main contributions
• Algorithms
  • Document-based HMM IE
  • Segment-based HMM IE
Introduction - Problem Description
• Template filling IE problem
  • MUC tasks: NE (named entity), CO (coreference), TE (template element)
• Template: seminar announcement (SA)
  • Slots: location, speaker, stime, etime
• Algorithm
  • A new aspect in evaluating HMM IE models
  • A new approach to solving the TE problem
A minimal example of a filled template follows below.
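To make the template-filling task concrete, here is a minimal sketch (Python) of a filled seminar-announcement template. The slot names follow the slide; the announcement text and filler values are invented for illustration and are not taken from the paper's corpus.

```python
# Minimal sketch of template filling for the seminar announcement (SA) domain.
# Slot names follow the slide; the example announcement and values are invented.
from dataclasses import dataclass, field
from typing import List

@dataclass
class SeminarTemplate:
    location: List[str] = field(default_factory=list)
    speaker: List[str] = field(default_factory=list)
    stime: List[str] = field(default_factory=list)  # seminar start time
    etime: List[str] = field(default_factory=list)  # seminar end time

# The IE task: given a raw announcement, fill each slot with text fragments (fillers).
announcement = "Dr. Jane Doe will speak in Wean Hall 5409 from 3:30 PM until 5:00 PM."
filled = SeminarTemplate(
    location=["Wean Hall 5409"],
    speaker=["Dr. Jane Doe"],
    stime=["3:30 PM"],
    etime=["5:00 PM"],
)
```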
Introduction - Previous Work
• Using HMMs for IE
  • Leek 1997: extracting gene name-location facts
  • Bikel et al. 1997: finding names (named entities) in text
  • Freitag and McCallum 1999: extracting fillers for template slots
• Other Markovian sequence models for IE
  • MEMM (maximum entropy Markov model)
  • CRF (conditional random field)
Introduction - Main Contributions
• From document-based HMM IE (Doc → HMM → Fillers) to segment-based HMM IE (Doc → Retrieval HMM → Extractor HMM → Filler Selection → Filler)
• Reduce noise and alleviate data sparseness by removing irrelevant words
• Eliminate redundancies of slot fillers: from multiple slot fillers per document to a single slot filler
Document-based HMM IE (1/3) • HMM structure (used to extract fillers)
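The figure of the HMM structure is not reproduced here, but the decoding it supports can be sketched: the document's token sequence is Viterbi-decoded under one per-slot HMM, and the token runs aligned with filler (target) states become the extracted fillers. This is a generic sketch, not the authors' exact state topology; pi, A and B are assumed log-probability parameters, and filler_states is an assumed set of target-state indices.

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most probable state path for a token sequence, in log space.
    obs: token ids; pi: initial log-probs (N,); A: transition log-probs (N, N);
    B: emission log-probs (N, V)."""
    T, N = len(obs), len(pi)
    delta = np.full((T, N), -np.inf)    # best log-prob of any path ending in each state
    back = np.zeros((T, N), dtype=int)  # backpointers
    delta[0] = pi + B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + A          # scores[i, j]: state i -> state j
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + B[:, obs[t]]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(delta[-1].max())

def extract_fillers(tokens, path, filler_states):
    """Collect maximal runs of tokens whose Viterbi state is a filler (target) state."""
    fillers, run = [], []
    for tok, state in zip(tokens, path):
        if state in filler_states:
            run.append(tok)
        elif run:
            fillers.append(" ".join(run))
            run = []
    if run:
        fillers.append(" ".join(run))
    return fillers
```

In document-based HMM IE the whole document is decoded at once, so a single pass may return several candidate fillers per slot.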
Document-based HMM IE (2/3)
• Evaluation: SA (seminar announcement) domain, 485 documents, ten-fold cross validation
• Doc_HMM: the authors' HMM IE system with Simple Good-Turing smoothing
• HMM_None: HMM IE system without shrinkage (Freitag and McCallum, 1999)
• HMM_Global: the same system with shrinkage
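The ten-fold cross validation protocol can be sketched as follows; the shuffling seed and document ids are placeholders, not details from the paper.

```python
import random

def ten_fold_splits(doc_ids, seed=0):
    """Yield (train, test) splits: each fold is held out once for testing while the
    HMMs are trained on the remaining nine folds."""
    ids = list(doc_ids)
    random.Random(seed).shuffle(ids)
    folds = [ids[i::10] for i in range(10)]
    for k in range(10):
        test = folds[k]
        train = [d for j, fold in enumerate(folds) if j != k for d in fold]
        yield train, test
```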
Document-based HMM IE (3/3)
• Redundancy (within a document): Rdocument = (# incorrectly extracted fillers) / (# all returned fillers)
• Overall redundancy: R = average of Rdocument over the test documents
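Read literally, the redundancy measure can be computed as below; treating documents with no returned fillers as zero redundancy is an assumption made here for completeness, not a detail from the slide.

```python
def document_redundancy(returned_fillers, correct_fillers):
    """R_document = (# incorrectly extracted fillers) / (# all returned fillers)."""
    if not returned_fillers:
        return 0.0  # assumption: no returned fillers means no redundancy
    incorrect = sum(1 for f in returned_fillers if f not in correct_fillers)
    return incorrect / len(returned_fillers)

def average_redundancy(per_document_results):
    """R = average of R_document; per_document_results is a list of
    (returned_fillers, correct_fillers) pairs, one per document."""
    scores = [document_redundancy(ret, cor) for ret, cor in per_document_results]
    return sum(scores) / len(scores) if scores else 0.0
```

A document-based HMM tends to return several fillers per document, so higher redundancy here reflects exactly the extra incorrect answers the segment-based approach is designed to eliminate.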
Segment-based HMM IE (1/5)
Pipeline: Doc → Retrieval HMM → Extractor HMM → Filler
• Step 1: Retrieval HMM
  • Filter the text segments that might contain a filler
• Step 2: Extractor HMM
  • Label each segment (sentence) with its most probable state sequence
  • Sort segments by the normalized likelihoods of their best state sequences
  • Return the filler(s) from the segment with the largest likelihood
A sketch of this pipeline is given below.
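The two steps compose into the pipeline sketched below. This is not the authors' implementation: it reuses the viterbi and extract_fillers helpers from the document-based sketch above, and the Step 1 criterion select_segment is sketched after the retrieval slides further below.

```python
def segment_based_extract(segments, select_segment, extractor_hmm, filler_states):
    """Two-step segment-based HMM IE (pipeline sketch).

    segments:       list of (tokens, token_ids) pairs, one per sentence/segment
    select_segment: Step 1 criterion from the retrieval HMM
    extractor_hmm:  (pi, A, B) log-probability parameters of the extractor HMM
    """
    # Step 1: segment retrieval -- keep only segments that might contain a filler.
    kept = [seg for seg in segments if select_segment(seg)]

    # Step 2: extraction -- Viterbi-label each kept segment, rank the segments by the
    # normalized likelihood of their best state sequence, and take the top one.
    scored = []
    for tokens, ids in kept:
        path, best_log_p = viterbi(ids, *extractor_hmm)   # helper from the earlier sketch
        scored.append((best_log_p / len(ids), tokens, path))
    if not scored:
        return []
    _, tokens, path = max(scored, key=lambda x: x[0])
    return extract_fillers(tokens, path, filler_states)
```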
Segment-based HMM IE (2/5)
• Step 2: Extraction
• For each segment s with token length n, its normalized best state sequence likelihood is defined as
  l(s) = (1/n) · log maxQ P(s, Q | λ)
  where λ is the HMM and Q ranges over the possible state sequences associated with s
• The segment with the highest l(s) is selected
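With the viterbi helper from the earlier sketch, l(s) can be computed directly; dividing by the token length n keeps segments of different lengths comparable, so a long segment is not penalized merely for containing more tokens.

```python
def normalized_best_likelihood(token_ids, pi, A, B):
    """l(s) = (1/n) * log max_Q P(s, Q | lambda): the Viterbi log-likelihood of the
    best state sequence, normalized by the segment's token length n."""
    path, best_log_p = viterbi(token_ids, pi, A, B)   # viterbi() as sketched earlier
    return best_log_p / len(token_ids), path
```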
Segment-based HMM IE (3/5)
• Step 1: Retrieval
• Qfiller = the set of state sequences that pass through at least one filler (target) state; Qbg = those that do not, so {all Q} = Qbg ∪ Qfiller
• Select a segment s if P(Q ∈ Qfiller | s, λ) > P(Q ∈ Qbg | s, λ), i.e. if the state paths that visit a filler state carry more probability mass than the purely background paths
Segment-based HMM IE (4/5)
• Step 1: Retrieval (continued)
• Let s = O1 O2 · · · OT, where T is the length of s in tokens
• P(Q ∈ Qbg, s | λ) is the probability of s following a background state path, i.e. a state sequence not passing through any target filler state
• Since Qbg and Qfiller partition all state sequences, P(Q ∈ Qfiller, s | λ) = P(s | λ) − P(Q ∈ Qbg, s | λ)
A sketch of this computation follows below.
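A sketch of the retrieval criterion under these definitions: a forward pass restricted to background states gives P(Q ∈ Qbg, s | λ), and subtracting it from the full forward probability gives the mass of filler paths. Ordinary (non-log) probabilities are used for readability; a real implementation would rescale or work in log space to avoid underflow, and the state indexing is an assumption.

```python
import numpy as np

def forward_prob(token_ids, pi, A, B, allowed=None):
    """Forward probability of the token sequence, optionally restricted to paths that
    stay within the `allowed` set of states. pi, A, B are ordinary probabilities here."""
    N = len(pi)
    mask = np.ones(N)
    if allowed is not None:
        mask = np.zeros(N)
        mask[list(allowed)] = 1.0
    alpha = pi * B[:, token_ids[0]] * mask
    for o in token_ids[1:]:
        alpha = (alpha @ A) * B[:, o] * mask
    return float(alpha.sum())

def select_segment(token_ids, pi, A, B, filler_states):
    """Keep segment s iff P(Q in Qfiller | s) > P(Q in Qbg | s): the paths that visit
    a filler state carry more probability mass than the purely background paths."""
    bg_states = [i for i in range(len(pi)) if i not in filler_states]
    p_all = forward_prob(token_ids, pi, A, B)                    # sum over all Q
    p_bg = forward_prob(token_ids, pi, A, B, allowed=bg_states)  # sum over Q in Qbg
    return (p_all - p_bg) > p_bg                                 # P(Qfiller, s) > P(Qbg, s)
```

Comparing the joint probabilities is equivalent to comparing the posteriors, since both sides are divided by the same P(s | λ). In the pipeline sketch above this would be wrapped, for example, as `lambda seg: select_segment(seg[1], pi_r, A_r, B_r, filler_states)` with the retrieval HMM's (hypothetical) parameters pi_r, A_r, B_r.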