
Conditional Markov Models: MaxEnt Tagging and MEMMs

William W. Cohen, CALD



  1. Conditional Markov Models: MaxEnt Tagging and MEMMs William W. Cohen CALD

  2. Announcements • Confused about what to write up? • Mon 2/9: Ratnaparkhi & Freitag et al. • Wed 2/11: Borthwick et al. & Mikheev • Mon 2/16: no class (Presidents’ Day) • Wed 2/18: Sha & Pereira, Lafferty et al. • Mon 2/23: Klein & Manning, Toutanova et al. • Wed 2/25: no writeup due • Mon 3/1: no writeup due • Wed 3/3: project proposal due: personnel + 1-2 pages • Spring break week, no class

  3. Review of review • Multinomial HMMs are a sequential version of naïve Bayes. • One way to drop the independence assumption: use a maxent model instead of NB, and a conditional model instead of a generative one.

  4. From NB to Maxent

  5. From NB to Maxent

  6. From NB to Maxent Learning: set the alpha parameters to maximize the conditional likelihood of the training data, i.e. fit the ML model given that we are using the same functional form as NB. It turns out this is the same as maximizing the entropy of p(y|x) over all distributions that match the observed feature expectations.
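The equations on the three "From NB to Maxent" slides are not in the transcript. One common way to write the step is sketched below; the symbols $f_i$, $\alpha_i$, $\lambda_i$, $Z$ and $D$ are notation assumed here, not necessarily the notation used on the slides.

```latex
\begin{align*}
% Naive Bayes: a product of per-feature terms
\Pr(y \mid x) &\propto \Pr(y) \prod_j \Pr(x_j \mid y) \\
% Same functional form, but with one free parameter alpha_i per indicator feature f_i(x, y)
\Pr_\alpha(y \mid x) &= \frac{1}{Z(x)} \prod_i \alpha_i^{f_i(x,y)}
  = \frac{\exp\big(\sum_i \lambda_i f_i(x,y)\big)}{\sum_{y'} \exp\big(\sum_i \lambda_i f_i(x,y')\big)},
  \qquad \lambda_i = \log \alpha_i \\
% Learning: maximize the conditional log-likelihood of the training set D
\hat{\alpha} &= \arg\max_\alpha \sum_{(x,y) \in D} \log \Pr_\alpha(y \mid x)
\end{align*}
```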

  7. MaxEnt Comments • Functional form same as naïve Bayes (log-linear model) • Numerical issues & smoothing are important • All training methods are iterative • Classification performance can be competitive with the state of the art • Optimizes Pr(y|x), not error rate
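The comments above (log-linear form, iterative training, smoothing, optimizing Pr(y|x)) can be illustrated with a minimal sketch. It uses plain batch gradient ascent with an L2 penalty as the smoother, which is a swapped-in choice for brevity and not the GIS procedure mentioned later in the deck. The function names, toy data, and hyperparameters are all made up for illustration.

```python
import math
from collections import defaultdict

def predict_proba(weights, feats, labels):
    """Pr(y|x) under a log-linear model: softmax of summed feature weights."""
    scores = {y: sum(weights[(f, y)] for f in feats) for y in labels}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {y: math.exp(s - top) for y, s in scores.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}

def train_maxent(data, labels, epochs=200, lr=0.5, l2=0.01):
    """Batch gradient ascent on the L2-penalized conditional log-likelihood."""
    weights = defaultdict(float)
    for _ in range(epochs):
        grad = defaultdict(float)
        for feats, gold in data:
            probs = predict_proba(weights, feats, labels)
            for f in feats:
                for y in labels:
                    # observed count minus expected count of feature (f, y)
                    grad[(f, y)] += (1.0 if y == gold else 0.0) - probs[y]
        for key, g in grad.items():
            weights[key] += lr * (g / len(data) - l2 * weights[key])
    return weights

# Hypothetical toy data: (set of active features, label)
toy = [
    ({"cap", "suffix=ski"}, "NAME"),
    ({"cap"}, "NAME"),
    ({"lower", "suffix=ing"}, "OTHER"),
    ({"lower"}, "OTHER"),
]
LABELS = ["NAME", "OTHER"]
w = train_maxent(toy, LABELS)
p = predict_proba(w, {"cap", "suffix=ski"}, LABELS)
```

The gradient for each weight is the observed feature count minus the count expected under the current model, so training stops moving when the two match, which is the constraint view of maxent.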

  8. What is a symbol? Ideally we would like to use many, arbitrary, overlapping features of words: • identity of word • ends in “-ski” • is capitalized • is part of a noun phrase • is in a list of city names • is under node X in WordNet • is in bold font • is indented • is in hyperlink anchor • … [Figure: chain of states S_{t-1}, S_t, S_{t+1} over observations O_{t-1}, O_t, O_{t+1}; an observation is annotated with the features is “Wisniewski”, part of noun phrase, ends in “-ski”.]

  9. What is a symbol? [Same feature list and figure as slide 8.] Idea: replace the generative model in the HMM with a maxent model, where the state depends on the observations

  10. What is a symbol? [Same feature list and figure as slide 8.] Idea: replace the generative model in the HMM with a maxent model, where the state depends on the observations and the previous state

  11. What is a symbol? [Same feature list and figure as slide 8.] Idea: replace the generative model in the HMM with a maxent model, where the state depends on the observations and the previous state history
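As a rough illustration of "many, arbitrary, overlapping features" plus the previous state, here is a minimal sketch of a feature extractor for one token. The function name, feature names, and the `city_names` argument are hypothetical; a real tagger would use a much richer set.

```python
def token_features(tokens, i, prev_tag, city_names=frozenset()):
    """Return the set of active binary features for tokens[i].

    Features overlap freely: the same token can fire its identity,
    a suffix feature, a capitalization feature, and a list-membership feature.
    """
    word = tokens[i]
    lower = word.lower()
    feats = {"word=" + lower, "prev_tag=" + prev_tag}
    if word[:1].isupper():
        feats.add("is_capitalized")
    if lower.endswith("ski"):
        feats.add("ends_in_ski")
    if lower in city_names:
        feats.add("in_city_list")
    # Neighbouring observations are fair game too in a conditional model.
    if i > 0:
        feats.add("prev_word=" + tokens[i - 1].lower())
    if i + 1 < len(tokens):
        feats.add("next_word=" + tokens[i + 1].lower())
    return feats

sentence = ["Mr.", "Wisniewski", "visited", "Pittsburgh"]
feats = token_features(sentence, 1, "TITLE", city_names={"pittsburgh"})
```

Because the model is conditional, nothing requires these features to be independent of one another, which is the point of moving away from the generative HMM.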

  12. Ratnaparkhi’s MXPOST • Sequential learning problem: predict POS tags of words. • Uses MaxEnt model described above. • Rich feature set. • To smooth, discard features occurring < 10 times.
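The count cutoff in the last bullet can be sketched as follows. The function name and data layout (a list of feature-set and tag pairs) are assumptions made for illustration; only the threshold of 10 comes from the slide.

```python
from collections import Counter

def prune_rare_features(data, min_count=10):
    """Drop features seen in fewer than min_count training examples.

    data is a list of (feature_set, tag) pairs. Returns the pruned data
    and the set of features that survived the cutoff.
    """
    counts = Counter(f for feats, _ in data for f in feats)
    keep = {f for f, c in counts.items() if c >= min_count}
    pruned = [({f for f in feats if f in keep}, tag) for feats, tag in data]
    return pruned, keep

# Hypothetical toy data: "a" occurs 10 times, "b" only 9 times.
data = [({"a", "b"}, "X")] * 9 + [({"a"}, "Y")]
pruned, keep = prune_rare_features(data)
```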

  13. MXPOST

  14. MXPOST: learning & inference • Feature selection • GIS (generalized iterative scaling)

  15. MXPOST inference • Adwait Ratnaparkhi: consider only extensions suggested by a dictionary
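The slide content for MXPOST inference is mostly missing from the transcript. The sketch below assumes a beam search over tag sequences in which a known word may only be extended with the tags a tag dictionary lists for it. The function names, the `"<s>"` start symbol, the beam size, and the toy probability function are all hypothetical stand-ins, not the tagger's actual code.

```python
import math

def beam_search(tokens, all_tags, tag_dict, local_prob, beam_size=5):
    """Keep the beam_size best partial tag sequences after each token.

    tag_dict maps a known word to the tags it may take; unknown words
    may take any tag. local_prob(tokens, i, prev_tag, tag) returns
    Pr(tag | context) from some trained model.
    """
    beam = [(0.0, [])]  # (log probability, tags so far)
    for i, word in enumerate(tokens):
        candidates = tag_dict.get(word, all_tags)
        extended = []
        for logp, tags in beam:
            prev = tags[-1] if tags else "<s>"
            for tag in candidates:
                p = local_prob(tokens, i, prev, tag)
                if p > 0.0:
                    extended.append((logp + math.log(p), tags + [tag]))
        beam = sorted(extended, key=lambda item: item[0], reverse=True)[:beam_size]
    return beam[0][1] if beam else []

def toy_prob(tokens, i, prev_tag, tag):
    """Hypothetical stand-in for a trained maxent model."""
    if prev_tag == "DT":
        return {"NN": 0.8, "VB": 0.1, "DT": 0.1}[tag]
    return {"DT": 0.6, "NN": 0.2, "VB": 0.2}[tag]

tags = beam_search(
    ["the", "can"],
    all_tags=["DT", "NN", "VB"],
    tag_dict={"the": ["DT"], "can": ["NN", "VB"]},
    local_prob=toy_prob,
)
```

The dictionary shrinks the set of extensions per token, so fewer candidates compete for the beam and fewer calls to the model are needed.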

  16. MXPOST results • State-of-the-art accuracy (for 1996) • Same approach used successfully for several other sequential classification steps of a stochastic parser (also state of the art). • Same approach used for NER by Borthwick, Malouf, Collins, Manning, and others.

  17. Alternative inference

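  18. Finding the most probable path: the Viterbi algorithm (for HMMs) • define $v_k(i)$ to be the probability of the most probable path accounting for the first i characters of x and ending in state k (i.e. ending with tag k) • we want to compute $v_N(L)$, the probability of the most probable path accounting for all of the sequence and ending in the end state $N$ • can define $v_k(i)$ recursively • can use dynamic programming to find $v_N(L)$ efficiently

  19. Finding the most probable path: the Viterbi algorithm for HMMs • initialization: $v_0(0) = 1$ for the begin state $0$, and $v_k(0) = 0$ for every other state $k$

  20. The Viterbi algorithm for HMMs • recursion for emitting states (i = 1…L): $v_l(i) = e_l(x_i) \max_k \left[ v_k(i-1)\, a_{kl} \right]$, with $e_l(x_i)$ the probability that state $l$ emits $x_i$ and $a_{kl}$ the transition probability from $k$ to $l$

  21. The Viterbi algorithm for HMMs and Maxent Taggers • recursion for emitting states (i = 1…L): $v_l(i) = \max_k \left[ v_k(i-1)\, \Pr(l \mid k, x_i) \right]$, where $x_i$ is the i-th token and $k$ is the previous tag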
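The recursion for a maxent tagger can be sketched in a few lines, assuming the local model conditions only on the previous tag and the token sequence. `local_prob`, `safe_log`, the `"<s>"` start symbol, and the toy probabilities are hypothetical names and values; log probabilities are used to avoid underflow.

```python
import math

def safe_log(p):
    """log(p), with log(0) mapped to negative infinity."""
    return math.log(p) if p > 0.0 else float("-inf")

def viterbi(tokens, tags, local_prob, start="<s>"):
    """Most probable tag sequence when Pr(tags | tokens) factors as
    a product over i of local_prob(tokens, i, previous tag, tag)."""
    if not tokens:
        return []
    # v[i][l]: log probability of the best path for tokens[0..i] ending in tag l
    v = [{l: safe_log(local_prob(tokens, 0, start, l)) for l in tags}]
    back = [{}]
    for i in range(1, len(tokens)):
        scores, pointers = {}, {}
        for l in tags:
            best_score, best_k = float("-inf"), tags[0]
            for k in tags:  # k is the previous tag
                s = v[i - 1][k] + safe_log(local_prob(tokens, i, k, l))
                if s > best_score:
                    best_score, best_k = s, k
            scores[l], pointers[l] = best_score, best_k
        v.append(scores)
        back.append(pointers)
    # Follow back-pointers from the best final tag.
    last = max(tags, key=lambda l: v[-1][l])
    path = [last]
    for i in range(len(tokens) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]

def toy_prob(tokens, i, prev_tag, tag):
    """Hypothetical local model over two tags."""
    if i == 0:
        return {"A": 0.6, "B": 0.4}[tag]
    table = {"A": {"A": 0.5, "B": 0.5}, "B": {"A": 0.1, "B": 0.9}}
    return table[prev_tag][tag]

best = viterbi(["x", "y"], ["A", "B"], toy_prob)
```

In this toy example a greedy left-to-right tagger would commit to A first (0.6) and finish with path probability 0.3, while the best full path is B, B with probability 0.36.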

  22. MEMMs (Freitag & McCallum) • Basic difference from ME tagging: • ME tagging: previous state is a feature of the MaxEnt classifier • MEMM: build a separate MaxEnt classifier for each state. • Can build any HMM architecture you want, e.g. parallel nested HMMs, etc. • Data is fragmented: examples where the previous tag is “proper noun” give no information about learning tags when the previous tag is “noun” • Mostly a difference in viewpoint – easier to see parallels to HMMs
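The "separate classifier per state" view, and the data fragmentation it causes, can be seen in a small sketch that splits the training examples by previous tag. The function name, the `"<s>"` start symbol, and the toy sentences are hypothetical; each resulting list would be the training set for one per-state MaxEnt classifier.

```python
from collections import defaultdict

def split_by_previous_tag(tagged_sentences, start="<s>"):
    """Group (word, tag) training examples by the tag that precedes them.

    Returns a dict mapping each source state to the examples used to train
    that state's classifier Pr(next tag | observation).
    """
    per_state = defaultdict(list)
    for sentence in tagged_sentences:
        prev = start
        for word, tag in sentence:
            per_state[prev].append((word, tag))
            prev = tag
    return dict(per_state)

sentences = [
    [("Pittsburgh", "NNP"), ("wins", "VBZ")],
    [("dog", "NN"), ("barks", "VBZ")],
]
per_state = split_by_previous_tag(sentences)
```

Here the classifier for state NNP sees only ("wins", "VBZ") and the classifier for NN sees only ("barks", "VBZ"), so neither benefits from the other's example. Making the previous tag a feature of a single classifier, as in ME tagging, lets both examples share the same weights.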

  23. MEMM task: FAQ parsing

  24. MEMM features

  25. MEMMs

  26. Borthwick et al: MENE system • Much like MXPOST, with some tricks for NER: • 4 tags/field: x_start, x_continue, x_end, x_unique • Features: • Section features • Tokens in window • Lexical features of tokens in window • Dictionary features of tokens (is token a firstName?) • External-system features of tokens (is this a NetOwl_company_start? proteus_person_unique?) • Smooth by discarding low-count features • No history: Viterbi search used to find the best consistent tag sequence (e.g. no continue w/o start)
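The "consistent tag sequence" constraint can be sketched as a transition check that a Viterbi search would apply between adjacent tags. The function names, the lowercase tag spellings, and the extra `other` tag for tokens outside any name are assumptions made for illustration.

```python
def split_tag(tag):
    """'person_start' -> ('person', 'start'); 'other' -> (None, 'other')."""
    if tag == "other":
        return None, "other"
    field, _, position = tag.rpartition("_")
    return field, position

def allowed(prev_tag, tag):
    """True if tag may follow prev_tag in a consistent sequence."""
    prev_field, prev_pos = split_tag(prev_tag)
    field, pos = split_tag(tag)
    if prev_pos in ("start", "continue"):
        # A name is open: it must be continued or ended, in the same field.
        return field == prev_field and pos in ("continue", "end")
    # No name is open: only a new name or a non-name token may come next.
    return pos in ("start", "unique", "other")

checks = [
    allowed("person_start", "person_continue"),  # continue after start
    allowed("other", "person_continue"),         # continue without start
    allowed("person_start", "org_end"),          # field changes mid-name
    allowed("person_end", "org_unique"),         # new name after a closed one
]
```

In a Viterbi search, disallowed transitions would simply be given probability zero, so the best path is always a well-formed sequence even though the classifier itself uses no tag history.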

  27. Dictionaries in MENE

  28. MENE results (dry run)

  29. MENE learning curves [Figure: learning curves; values shown on the plot: 92.2, 93.3, 96.3]

  30. Longer names Short names • Largest U.S. Cable Operator Makes Bid for Walt Disney • By ANDREW ROSS SORKIN • The Comcast Corporation, the largest cable television operator in the United States, made a $54.1 billion unsolicited takeover bid today for The Walt Disney Company, the storied family entertainment colossus. • If successful, Comcast's audacious bid would once again reshape the entertainment landscape, creating a new media behemoth that would combine the power of Comcast's powerful distribution channels to some 21 million subscribers in the nation with Disney's vast library of content and production assets. Those include its ABC television network, ESPN and other cable networks, and the Disney and Miramax movie studios.

  31. LTG system • Another MUC-7 competitor • Hand-coded rules for “easy” cases (amounts, etc.) • Process of repeated tagging and “matching” for hard cases • Sure-fire (high-precision) rules for names where the type is clear (“Philip Morris, Inc”, “The Walt Disney Company”) • Partial matches to sure-fire rules are filtered with a maxent classifier (candidate filtering) using contextual information, etc. • Higher-recall rules, avoiding conflicts with partial-match output (“Philip Morris announced today…”, “Disney’s …”) • Final partial-match & filter step on titles, with a different learned filter. • Exploits discourse/context information

  32. LTG Results

  33. [Figure: results comparison; labels shown: LTG, NetOwl, Commercial RBS, Identifinder, MENE+Proteus, Manitoba (NB filtered names)]
