Pair HMMs and edit distance Ristad & Yianilos
Special meeting Wed 4/14 • What: Evolving and Self-Managing Data Integration Systems • Who: AnHai Doan, Univ. of Illinois at Urbana-Champaign • When: Wednesday, April 14, 2004 @ 11am (food at 10:30am) • Where: Sennott Square Building, room 5317
Special meeting 4/28 (last class) • First International Joint Conference on Information Extraction, Information Integration, and Sequential Learning • 10:30-11:50 am, Wean Hall 4601 • All project proposals have been accepted as paper abstracts, and you’re all invited to present for 10 min (including questions)
Pair HMMs – Ristad & Yianilos • HMM review • notation • inference (forward algorithm) • learning (forward-backward & EM) • Pair HMMs • notation • generating edit strings • distance metrics (stochastic, Viterbi) • inference (forward) • learning (forward-backward & EM) • Results from R&Y paper • K-NN with trained distance, hidden prototypes • problem: phoneme strings => words • Advanced Pair HMMs • adding state (e.g., for affine gap models) • Smith-Waterman? • CRF training?
HMM Example • Sample output: xT = heehahaha, sT = 122121212 • [Figure: two-state HMM with transition probabilities Pr(1->1), Pr(1->2), Pr(2->1), Pr(2->2) and emission probabilities Pr(1->x), Pr(2->x)]
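A minimal sketch in Python of how such a two-state HMM generates an output string together with its state sequence; the probability tables below are invented for illustration, not the lecture's numbers.

```python
import random

# Assumed (made-up) parameters: state 1 emits 'h', state 2 emits 'e'/'a',
# matching the flavor of the sample output above.
trans = {1: {1: 0.6, 2: 0.4}, 2: {1: 0.5, 2: 0.5}}   # Pr(l -> l')
emit = {1: {'h': 1.0}, 2: {'e': 0.5, 'a': 0.5}}      # Pr(l -> x)

def sample(T, start=1):
    """Generate an output string xT and its hidden state sequence sT."""
    states, chars = [start], []
    for _ in range(T):
        l = states[-1]
        chars.append(random.choices(list(emit[l]), list(emit[l].values()))[0])
        states.append(random.choices(list(trans[l]), list(trans[l].values()))[0])
    return ''.join(chars), states[:-1]
```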
HMM Inference • Key point: Pr(si = l) depends only on the transition probabilities Pr(l' -> l) and on si-1, so you can propagate probabilities forward... • [Figure: trellis of states over positions x1 x2 x3 ... xT]
HMM Inference – Forward Algorithm • [Figure: trellis over x1 x2 x3 ... xT, filled in left to right]
HMM Learning - EM Expectation maximization: • Find expectations, i.e. Pr(si=l) for i=1,...,T • forward algorithm + epsilon • hidden variables are states s at times t=1,...,t=T • Maximize probability of parameters given expectations: • replace #(l’->l)/#(l’) with weighted version of counts • replace #(l’->x)/#(l’) with weighted version
HMM Inference • Forward algorithm: computes probabilities α(l,t) based on information in the first t letters of the string, ignores “downstream” information • [Figure: trellis over x1 x2 x3 ... xT]
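A sketch of the forward recursion in Python, reusing the `trans`/`emit` dictionaries assumed in the sampling sketch above; `init`, the initial state distribution, is also an assumption.

```python
def forward(x, states, init, trans, emit):
    """alpha[t][l] = Pr(x[0..t], s_t = l), i.e. alpha(l, t) with 0-indexing."""
    alpha = [{l: init[l] * emit[l].get(x[0], 0.0) for l in states}]
    for t in range(1, len(x)):
        alpha.append({l: emit[l].get(x[t], 0.0) *
                         sum(alpha[t - 1][lp] * trans[lp][l] for lp in states)
                      for l in states})
    return alpha  # Pr(x) = sum(alpha[-1].values())
```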
HMM Learning - EM Expectation maximization: • Find expectations, i.e. Pr(si=l) for i=1,...,T • forward-backward algorithm • hidden variables are states s at times t=1,...,t=T • Maximize probability of parameters given expectations: • replace #(l’->l)/#(l’) with weighted version of counts • replace #(l’->x)/#(l’) with weighted version
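A matching sketch of the backward pass and the per-position state posteriors; these posteriors are exactly the "weighted counts" the M-step ratios are built from (`forward` as sketched above).

```python
def backward(x, states, trans, emit):
    """beta[t][l] = Pr(x[t+1..T-1] | s_t = l), built right to left."""
    T = len(x)
    beta = [{l: 1.0 for l in states} for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for l in states:
            beta[t][l] = sum(trans[l][lp] * emit[lp].get(x[t + 1], 0.0)
                             * beta[t + 1][lp] for lp in states)
    return beta

def posteriors(x, states, init, trans, emit):
    """Pr(s_t = l | x): the soft counts used in place of #(l'->l)/#(l')."""
    alpha = forward(x, states, init, trans, emit)
    beta = backward(x, states, trans, emit)
    z = sum(alpha[-1].values())  # Pr(x)
    return [{l: alpha[t][l] * beta[t][l] / z for l in states}
            for t in range(len(x))]
```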
Pair HMM Example 1 • Sample run: zT = <h,t>,<e,e>,<e,e>,<h,h>,<e,->,<e,e> • Strings x,y produced by zT: x = heehee, y = teehe • Notice that x,y is also produced by z4 (the first four operations) followed by <e,e>,<e,->, and by many other edit strings
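A tiny Python sketch making the bookkeeping concrete, with '-' standing for the empty symbol:

```python
def apply_edit_string(z):
    """Project an edit string onto its two output strings x and y."""
    x = ''.join(a for a, b in z if a != '-')   # <a,b> and <a,-> emit into x
    y = ''.join(b for a, b in z if b != '-')   # <a,b> and <-,b> emit into y
    return x, y

z = [('h','t'), ('e','e'), ('e','e'), ('h','h'), ('e','-'), ('e','e')]
assert apply_edit_string(z) == ('heehee', 'teehe')
```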
Pair HMM Inference • Dynamic programming is possible: fill out the matrix left-to-right, top-down
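A sketch of that DP for a single-state pair HMM. The emission tables `p_sub`, `p_del`, `p_ins` (for <a,b>, <a,->, <-,b> operations) are assumptions, and stopping probabilities are omitted for brevity.

```python
def pair_forward(x, y, p_sub, p_del, p_ins):
    """F[i][j] = total probability of generating prefixes x[:i] and y[:j]."""
    F = [[0.0] * (len(y) + 1) for _ in range(len(x) + 1)]
    F[0][0] = 1.0
    for i in range(len(x) + 1):
        for j in range(len(y) + 1):
            if i > 0 and j > 0:   # substitution <x_i, y_j>
                F[i][j] += F[i - 1][j - 1] * p_sub.get((x[i - 1], y[j - 1]), 0.0)
            if i > 0:             # deletion <x_i, ->
                F[i][j] += F[i - 1][j] * p_del.get(x[i - 1], 0.0)
            if j > 0:             # insertion <-, y_j>
                F[i][j] += F[i][j - 1] * p_ins.get(y[j - 1], 0.0)
    return F[len(x)][len(y)]
```

Summing over all paths gives the stochastic edit distance (-log of the returned value); replacing the sums with max gives the Viterbi variant.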
Pair HMM Inference • [Figure: DP matrix over the positions of x and y]
Pair HMM Inference • One difference: after i emissions of the pair HMM, we do not know the column position • [Figure: cells of the DP matrix reachable after i = 1, 2, 3 emissions]
Multiple states • [Figure: three-state HMM with states 1, 2, 3]
An extension: multiple states • conceptually, add a “state” dimension to the model • EM methods generalize easily to this setting • [Figure: trellis with axes for time t = 1,...,T, label l = 1, 2, and state v = 1,...,K]
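A sketch of what the extra dimension looks like in the pair-HMM DP. For simplicity every state may emit every operation type here; an affine-gap model would restrict, say, a gap state to deletions only. All parameter names are assumptions.

```python
def pair_forward_states(x, y, states, init, trans, p_sub, p_del, p_ins):
    """F[v][i][j] = Pr(generating x[:i] and y[:j], ending in state v)."""
    F = {v: [[0.0] * (len(y) + 1) for _ in range(len(x) + 1)] for v in states}
    for v in states:
        F[v][0][0] = init[v]
    for i in range(len(x) + 1):
        for j in range(len(y) + 1):
            for v in states:
                for u in states:  # incoming transition u -> v
                    if i > 0 and j > 0:
                        F[v][i][j] += (F[u][i - 1][j - 1] * trans[u][v]
                                       * p_sub[v].get((x[i - 1], y[j - 1]), 0.0))
                    if i > 0:
                        F[v][i][j] += (F[u][i - 1][j] * trans[u][v]
                                       * p_del[v].get(x[i - 1], 0.0))
                    if j > 0:
                        F[v][i][j] += (F[u][i][j - 1] * trans[u][v]
                                       * p_ins[v].get(y[j - 1], 0.0))
    return sum(F[v][len(x)][len(y)] for v in states)
```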
Back to R&Y paper... • They consider “coarse” and “detailed” models, as well as mixtures of both. • Coarse model is like a back-off model – merge edit operations into equivalence classes (e.g. based on equivalence classes for chars). • Test by learning a distance for K-NN with an additional latent variable
K-NN with latent prototypes • [Figure: a test example y (a string of phonemes) is scored by the learned phonetic distance against possible prototypes x1, x2, x3, ..., xm (known word pronunciations), which correspond to words w1, w2, ..., wK from the dictionary]
K-NN with latent prototypes • Method needs (x,y) pairs to train a distance – to handle this, an additional level of EM is used to pick the “latent prototype” x to pair with each y • [Figure: same diagram, with y linked to one chosen prototype among x1,...,xm]
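A sketch of that added E-step: score each candidate pronunciation of y's word under the current model and convert the scores into soft pairing weights. `pair_forward` is the single-state sketch above, and `model` is an assumed (p_sub, p_del, p_ins) triple; this illustrates the idea, not R&Y's exact procedure.

```python
def prototype_posteriors(y, prototypes, model):
    """Soft assignment of y to the candidate prototypes x_1..x_m of its word."""
    p_sub, p_del, p_ins = model
    scores = [pair_forward(x, y, p_sub, p_del, p_ins) for x in prototypes]
    z = sum(scores)
    return [s / z for s in scores]  # weights on (x, y) pairs for training
```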
Experiments • E1: on-line pronunciation dictionary • E2: subset of E1 with corpus words • E3: dictionary from training corpus • E4: dictionary from training + test corpus (!) • E5: E1 + E3