Weighting Finite-State Transductions With Neural Context
Pushpendre Rastogi, Ryan Cotterell, Jason Eisner
The Setting: string-to-string transduction
• Pronunciation!  bathe → beð
• Morphology!  break → broken
• Transliteration!  Washington → واشنطون
• Segmentation!  日文章魚怎麼說 → 日文 章魚 怎麼 說  ("How do you say 'octopus' in Japanese?")
• Tagging!  Time flies like an arrow → N V P D N
• Supertagging!
The Setting: string-to-string transduction • The Cowboys: Finite-state transducers
The Setting: string-to-string transduction
• The Cowboys: Finite-state transducers
• The Aliens: seq2seq models (recurrent neural nets)
Review: Weighted FST
x = break   •   y = broken
[Figure: an edit FST whose arcs include b:b, r:r, ea:o, k:k, ε:e, ε:n (and alternatives such as e:r, a:ok, r:ε, k:e); each alignment path π in { …, … } through it rewrites b r e a k as b r o k e n]
• Latent monotonic alignment π
• Represented by a path in a finite graph
• p(y | x) = (sum of weights of all x-to-y paths) / Z(x)
Review: Weighted FST
[Figure: the same edit FST; p(y | x) = (sum of weights of all x-to-y paths) / Z(x)]
• Enforces hard, monotonic alignments (latent path variable)
• Globally normalized ⇒ no label bias
• Exact computations by dynamic programming
• Don't need beam search! Can sum over all paths, and thus …
  • compute Z, p(y | x), expected loss; sample random strings
  • compute gradients (training), Viterbi or MBR string (testing)
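Since this slide leans on exact dynamic programming, here is a minimal sketch of the forward algorithm it refers to, assuming the relevant paths have been collected into an acyclic lattice of positively weighted arcs; the function name, the arc format, and the toy lattice are illustrative, not taken from the talk.

```python
from collections import defaultdict

def sum_over_paths(arcs, start, end, topo_order):
    """Forward algorithm: sum of products of arc weights over all start-to-end paths."""
    alpha = defaultdict(float)              # forward score of each node
    alpha[start] = 1.0
    for u in topo_order:                    # visit nodes in topological order
        if alpha[u] == 0.0:
            continue
        for (src, dst, w) in arcs:
            if src == u:
                alpha[dst] += alpha[u] * w  # extend every path ending at u by this arc
    return alpha[end]

# Toy lattice: two ways to turn "a" into "b" (a direct a:b edit, or a:ε then ε:b).
arcs = [(0, 1, 0.6),
        (0, 2, 0.3), (2, 1, 0.5)]
print(sum_over_paths(arcs, start=0, end=1, topo_order=[0, 2, 1]))  # 0.6 + 0.3*0.5 = 0.75
```

Running the same routine over the lattice of paths from x to any output gives Z(x), and the ratio of the two sums is p(y | x); replacing sum/product with max recovers the Viterbi path.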
Review: seq2seq model (see Faruqui et al. 2016 – next talk!)
x = break   •   y = broken
[Figure: an LSTM reads b r e a k #, then emits b r o k e n # one character at a time]
• p(y | x): reads x, then stochastically emits the chars of y, 1 by 1, like a language model
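For contrast, a minimal sketch of the seq2seq idea described here, assuming PyTorch and character-level vocabularies; the class and parameter names are made up for illustration and are not the system from Faruqui et al.

```python
import torch
import torch.nn as nn

class CharSeq2Seq(nn.Module):
    def __init__(self, vocab_size, dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.enc = nn.LSTM(dim, dim, batch_first=True)   # reads x
        self.dec = nn.LSTM(dim, dim, batch_first=True)   # emits y
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, x_ids, y_prefix_ids):
        _, state = self.enc(self.emb(x_ids))             # read all of "b r e a k #"
        h, _ = self.dec(self.emb(y_prefix_ids), state)   # then emit y one char at a time
        return self.out(h)                               # logits for the next character
```

Note that nothing here aligns characters of y to characters of x; any such correspondence is only implicit in the learned hidden state.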
Exact-match accuracy on 4 morphology tasks
[Figure: results chart comparing the two models, overlaid with speech bubbles between the cowboy (FST) and the alien (seq2seq):]
• "Heck, I can look at context too. I just need more states."
• "Now using my weird soft alignment … I call it 'attention.'"
• "Your attention's unsteady, friend. You're shooting all over the input."
• "Ha! You ain't got any alignment!"
• "I sluuuurp features right out of the context."
• "But you cannot learn what context to look at :-P"
• "Not from 500 examples you can't …"
• "I can learn anything … !"
You can guess the ending …
• You're rooting for the cowboys, ain't ya?
• Dreyer, Smith, & Eisner (2008), "Latent-variable modeling of string transductions with finite-state methods"
  • Beats everything in this paper by a couple of points
  • More local context (trigrams of edits)
  • Latent word classes
  • Latent regions within a word (e.g., can find the stem vowel)
  • But might have to redesign for other tasks (G2P?)
  • And it's quite slow – this FST has lots of states
How do we give a cowboy alien genes?
• First, we'll need to upgrade our cowboy.
• The new weapon comes from CRFs.
  • Discriminative training? Already doing it.
  • Global normalization? Already doing it.
  • Conditioning on entire input (like seq2seq)? Aha!
CRF: p(y | x) conditions on entire input
[Figure: a linear-chain CRF tagging x = "Time flies like an arrow" with y = N V P D N, showing emission weights (tag–word) and transition weights (tag–tag)]
• But CRF weights can depend freely on all of x!
  • Hand-designed feature templates typically don't
  • But recurrent neural nets do exploit this freedom
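As a reminder of the structure being described, here is one standard way to write the linear-chain CRF, with emission potentials allowed to condition on the entire input through some representation h_t(x) (e.g. a recurrent state); the notation is assumed here, not copied from the slides.

```latex
p(y \mid x) \;=\; \frac{1}{Z(x)}\,
\exp\!\Big(\sum_{t=1}^{T}\phi_{\mathrm{emit}}\big(y_t,\; h_t(x)\big)
\;+\;\sum_{t=2}^{T}\phi_{\mathrm{trans}}\big(y_{t-1},\, y_t\big)\Big),
\qquad
Z(x) \;=\; \sum_{y'}\exp\!\big(\text{the same score for } y'\big)
```

Because the transition structure is still a chain over y, the forward algorithm computes Z(x) exactly no matter how rich h_t(x) is.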
What do FST weights depend on?
[Figure: the same edit FST, with arcs such as b:b, r:r, ea:o, k:k, ε:e, ε:n]
• A path is a sequence of edits applied to the input
• Path weight = product of individual arc weights
• Arc weight depends only on the arc's input:output strings and the states
• That's why we can do dynamic programming
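In symbols (notation assumed, matching the bullets above):

```latex
w(\pi) \;=\; \prod_{a \in \pi} w(a),
\qquad
p(y \mid x) \;=\; \frac{\sum_{\pi:\, x \to y} w(\pi)}{Z(x)},
\qquad
Z(x) \;=\; \sum_{y'} \sum_{\pi':\, x \to y'} w(\pi')
```

Because each w(a) looks only at the arc's own labels and endpoint states, the sums factor along the path and dynamic programming applies.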
What's wrong with FSTs?
[Figure: the same edit FST]
• All dependence on context must be captured in the state (Markov property)
• Need lots of states to get the linguistics right
• Our choice of states limits the context we can see
• But does it have to be this way?
Find all paths turning given x into any y
• input x = a a a a   (positions 0 1 2 3 4)
• hand-built FST F: states E, D with arcs a:b, a:c; weights are defined on F, so F specifies the full model p(y | x)
• paths G: a lattice whose nodes pair an input position with a state of F (0E, 1D, 2E, 3D, 4E, …) and whose arcs are copies of F's a:b and a:c arcs
• G simply inherits F's weights – note tied parameters
• now run our dynamic programming algorithms on G
Find all paths turning given x into any y
• input x = a a a a   (positions 0 1 2 3 4)
• hand-built FST F: states E, D with arcs a:b, a:c
• paths G: the same lattice of nodes (0E, 1D, 2E, 3D, 4E, …) and arc copies
• new generalization: define weights on G directly! (a construction sketch follows below)
• now run our dynamic programming algorithms on G
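A minimal sketch of one way to build such a lattice G, assuming each arc of F consumes at most one input character; the function, its arguments, and the toy F below are hypothetical, not the authors' code or their exact machine.

```python
def build_lattice(x, F_states, F_arcs, F_start, F_final):
    """F_arcs: list of (q_from, q_to, inp, out), where inp is '' (insertion) or one character.
    Nodes of G pair an input position with a state of F; arcs of G are copies
    ("tokens") of F's arcs, so each copy knows where in x it applies."""
    nodes, arcs = set(), []
    for i in range(len(x) + 1):
        for q in F_states:
            nodes.add((i, q))
    for i in range(len(x) + 1):
        for (q1, q2, inp, out) in F_arcs:
            if inp == '':                        # insertion: stay at position i
                arcs.append(((i, q1), (i, q2), inp, out))
            elif i < len(x) and x[i] == inp:     # consume one input character
                arcs.append(((i, q1), (i + 1, q2), inp, out))
    return nodes, arcs, (0, F_start), (len(x), F_final)

# Toy F loosely in the spirit of the slide: states E, D and arcs a:b, a:c.
nodes, arcs, start, final = build_lattice(
    "aaaa", ["E", "D"],
    [("E", "D", "a", "b"), ("D", "E", "a", "c")],
    F_start="E", F_final="E")
```

Under the previous slide's view, every copy of an a:b arc shares the weight of F's a:b arc; under the new view, each arc token of G gets its own weight, which is where the neural context comes in.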
So that's how to make an FST like a CRF
• Now an edit's weight can depend freely on input context.
• (Dynamic programming is still just as efficient!)
• So now we can use LSTMs to help score the edits in context – learn to extract context features.
[Figure: an a:c arc of G between nodes 2E and 3D]
BiLSTM to extract features from input
[Figure: a left-to-right LSTM and a right-to-left LSTM each read the characters b r e a k, producing a hidden state at every input position 0–5]
BiLSTM to extract features from input
[Figure: the same BiLSTM; at a given input position, the left-to-right state and the right-to-left state together summarize the surrounding context]
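A minimal sketch of this feature extractor in PyTorch, with assumed class names and dimensions:

```python
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    def __init__(self, n_chars, dim=32):
        super().__init__()
        self.emb = nn.Embedding(n_chars, dim)
        self.bilstm = nn.LSTM(dim, dim, batch_first=True, bidirectional=True)

    def forward(self, x_ids):                 # x_ids: (batch, |x|) character ids
        h, _ = self.bilstm(self.emb(x_ids))   # h: (batch, |x|, 2*dim)
        return h                              # h[:, i] concatenates the left-to-right and
                                              # right-to-left states at input position i
```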
Scoring an arc using neural context
[Figure: an ea:o arc token of G from node 2E to node 4D, the corresponding ea:o arc of the hand-built FST F (states E, D), and the BiLSTM states over b r e a k]
• To score this edit token: first encode the edit type, then combine it with the BiLSTM context features to get the arc's weight.
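A minimal sketch of one plausible way to realize this scoring step; the paper's exact architecture may differ, and the ArcScorer class and its layers are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ArcScorer(nn.Module):
    def __init__(self, n_edit_types, ctx_dim, dim=32):
        super().__init__()
        self.edit_emb = nn.Embedding(n_edit_types, dim)   # one vector per edit type, e.g. "ea:o"
        self.mlp = nn.Sequential(nn.Linear(dim + ctx_dim, dim),
                                 nn.Tanh(),
                                 nn.Linear(dim, 1))

    def forward(self, edit_id, context_vec):
        # context_vec: BiLSTM features at the arc token's input position in G
        z = torch.cat([self.edit_emb(edit_id), context_vec], dim=-1)
        return self.mlp(z).squeeze(-1)                    # log-weight of this arc token
```

Exponentiating these log-weights and running the dynamic programs from the earlier slides on G gives Z(x), p(y | x), and gradients as before.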
So that's how we define weights of G's arcs
• input x = a a a a   (positions 0 1 2 3 4)
• hand-built FST F: states E, D with arcs a:b, a:c
• paths G: the lattice of nodes (0E, 1D, 2E, 3D, 4E, …) whose arc copies now carry their own neurally scored weights
• now run our dynamic programming algorithms on G
Conclusions
• Cowboys are good
  • Monotonic hard alignments, exact computation
• Aliens are good
  • Learn to extract arbitrary features from context
• They're compatible: "FSTs w/ neural context"
  • We can inject LSTMs into classical probabilistic models for structured prediction [not just FSTs]
• This is the limit of efficient exact computation (?)
  • More powerful models could use this model as a proposal distribution for importance sampling
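To make the last bullet concrete, a minimal sketch of self-normalized importance sampling with this model as the proposal distribution q; all names here are illustrative assumptions.

```python
def importance_estimate(sample_from_q, q_prob, p_tilde, f, n=1000):
    """Estimate E_p[f(y)] where p is only known up to a constant (p_tilde),
    using the tractable FST model q as the proposal distribution."""
    total_w, total_wf = 0.0, 0.0
    for _ in range(n):
        y = sample_from_q()            # exact sampling via the FST's dynamic program
        w = p_tilde(y) / q_prob(y)     # importance weight
        total_w += w
        total_wf += w * f(y)
    return total_wf / total_w          # self-normalized estimate
```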
Questions?
Weighting Finite-State Transductions With Neural Context
Pushpendre Rastogi, Ryan Cotterell, Jason Eisner
Exact-match accuracy on 4 morphology tasks
[Figure: the results chart again, with the speech bubbles:]
• "Heck, I can look at context too. I just need more states."
• "Now using my weird soft alignment … I call it 'attention.'"
• "Your attention's unsteady, friend. You're shooting all over the input."
• "Ha! You ain't got any alignment!"
• "But you cannot learn what context to look at :-P"
• "Not from 500 examples you can't …"
• "I can learn anything … !"