Joint Models with Missing Data for Semi-Supervised Learning
Jason Eisner
NAACL Workshop Keynote – June 2009
Outline
• Why use joint models?
• Making big joint models tractable: Approximate inference and training by loopy belief propagation
• Open questions: Semi-supervised training of joint models
The standard story
Task: a p(y|x) model mapping input x to output y.
Semi-sup. learning: Train on many (x,?) and a few (x,y).
Some running examples
Task: a p(y|x) model. Semi-sup. learning: Train on many (x,?) and a few (x,y). E.g., in low-resource languages:
• sentence → parse (with David A. Smith)
• lemma → morph. paradigm (with Markus Dreyer)
Semi-supervised learning
Semi-sup. learning: Train on many (x,?) and a few (x,y). Why would knowing p(x) help you learn p(y|x)?
• Shared parameters via joint model
• e.g., noisy channel: p(x,y) = p(y) * p(x|y)
• Estimate p(x,y) to have appropriate marginal p(x)
• This affects the conditional distribution p(y|x)
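A minimal sketch of the noisy-channel bullet above (all numbers invented, not from the talk): with the channel p(x|y) held fixed, choosing the prior p(y) so that the model's marginal p(x) matches the unlabeled data changes the conditional p(y|x) we decode with.

```python
# Binary y, binary x; channel flips x away from y 20% of the time.
p_x_given_y = {(0, 0): 0.8, (1, 0): 0.2, (0, 1): 0.2, (1, 1): 0.8}

def p_x(py1, x):
    """Marginal p(x) under prior p(y=1) = py1."""
    return (1 - py1) * p_x_given_y[(x, 0)] + py1 * p_x_given_y[(x, 1)]

def p_y1_given_x(py1, x):
    """Bayes rule: p(y=1 | x)."""
    return py1 * p_x_given_y[(x, 1)] / p_x(py1, x)

# Suppose unlabeled data says p(x=1) ≈ 0.7; solving
# 0.2*(1-q) + 0.8*q = 0.7 gives q = 5/6 for the prior p(y=1).
for prior in (0.5, 5 / 6):
    print(prior, p_y1_given_x(prior, 1))   # the conditional shifts, 0.80 -> ~0.95
```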
E.g., suppose p(x,y) = ∑_c p(x,y,c) = ∑_c p(c) p(y|c) p(x|c) (a joint model with few params!). From a sample of p(x), for any x we can now recover the cluster c that probably generated it, and a few supervised examples may let us predict y from c.
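A minimal EM sketch of that mixture (toy discrete data; all counts and names are my own, not the talk's): unlabeled examples contribute only through the marginal p(x), yet they shape the clusters that the few labeled pairs then label.

```python
import numpy as np

rng = np.random.default_rng(0)
X_VALS, Y_VALS, C = 5, 3, 2                  # discrete x, y; 2 latent clusters

pc = rng.dirichlet(np.ones(C))               # p(c)
px = rng.dirichlet(np.ones(X_VALS), C)       # px[c, x] = p(x|c)
py = rng.dirichlet(np.ones(Y_VALS), C)       # py[c, y] = p(y|c)

labeled   = [(0, 1), (1, 1), (4, 2)]         # a few (x, y) pairs
unlabeled = [0, 0, 1, 3, 4, 4, 4]            # many x's with y missing

for _ in range(100):                         # EM
    ec  = np.zeros(C)                        # expected counts
    ecx = np.zeros((C, X_VALS))
    ecy = np.zeros((C, Y_VALS))
    for x, y in labeled:                     # E-step: q(c) ∝ p(c) p(x|c) p(y|c)
        q = pc * px[:, x] * py[:, y]
        q /= q.sum()
        ec += q; ecx[:, x] += q; ecy[:, y] += q
    for x in unlabeled:                      # E-step: y missing, so q(c) ∝ p(c) p(x|c)
        q = pc * px[:, x]
        q /= q.sum()
        ec += q; ecx[:, x] += q
        ecy += q[:, None] * py               # expected y-counts via current p(y|c)
    pc = ec / ec.sum()                       # M-step: renormalize expected counts
    px = ecx / ecx.sum(1, keepdims=True)
    py = ecy / ecy.sum(1, keepdims=True)

def predict(x):                              # p(y|x) ∝ ∑_c p(c) p(x|c) p(y|c)
    p = (pc[:, None] * px[:, x, None] * py).sum(0)
    return p / p.sum()

print(predict(4))                            # typically now favors y = 2
```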
Semi-supervised learning
Semi-sup. learning: Train on many (x,?) and a few (x,y). Why would knowing p(x) help you learn p(y|x)?
• Shared parameters via joint model
• e.g., noisy channel: p(x,y) = p(y) * p(x|y)
• Estimate p(x,y) to have appropriate marginal p(x)
• This affects the conditional distribution p(y|x)
• The picture is misleading: no need to assume a distance metric (as in TSVM, label propagation, etc.)
• But we do need to choose a model family for p(x,y)
NLP + ML = ???
Task: a p(y|x) model with structured input x (may be only partly observed, so infer x, too) and structured output y (so we already need joint inference for decoding, e.g., dynamic programming). The model depends on features of <x,y> (sparse features?) or features of <x,z,y> where z are latent (so infer z, too).
Each task in a vacuum?
[Diagram: four independent tasks, Task1…Task4, each mapping its own input (x1…x4) to its own output (y1…y4).]
Solved tasks help later ones? (e.g., pipeline)
[Diagram: a pipeline x → Task1 → z1 → Task2 → z2 → Task3 → z3 → Task4 → y.]
Feedback?
What if Task3 isn’t solved yet and we have little <z2,z3> training data?
[Same pipeline diagram.]
Feedback?
What if Task3 isn’t solved yet and we have little <z2,z3> training data? Impute <z2,z3> given x1 and y4!
[Same pipeline diagram.]
A later step benefits from many earlier ones?
[Pipeline diagram over x, z1, z2, z3, y.]
A later step benefits from many earlier ones? And conversely?
[Pipeline diagram over x, z1, z2, z3, y.]
We end up with a Markov Random Field (MRF)
[Diagram: variables x, z1, z2, z3, y connected by factors Φ1, Φ2, Φ3, Φ4.]
Variable-centric, not task-centric
p(x, z1, z2, z3, y) = (1/Z) Φ1(x, z1) Φ2(z1, z2) Φ3(x, z1, z2, z3) Φ4(z3, y) Φ5(y)
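To make the factorization concrete, here is a minimal brute-force sketch assuming hypothetical binary variables and made-up factor tables (the talk's actual factors are over structured values): the joint score is the product of the five factors, and p(y|x) comes from clamping x and summing out z1, z2, z3. Loopy BP, discussed next, approximates exactly these sums when enumeration is intractable.

```python
from itertools import product

# Hypothetical nonnegative factor tables over binary variables
# (illustrative values only).
phi1 = lambda x, z1:  (2.0, 0.5)[x != z1]                 # Φ1(x, z1)
phi2 = lambda z1, z2: (1.0, 0.2)[z1 != z2]                # Φ2(z1, z2)
phi3 = lambda x, z1, z2, z3: 1.5 if (x + z1 + z2 + z3) % 2 == 0 else 0.7
phi4 = lambda z3, y:  (1.0, 0.3)[z3 != y]                 # Φ4(z3, y)
phi5 = lambda y:      (1.0, 0.8)[y]                       # Φ5(y)

def score(x, z1, z2, z3, y):
    """Unnormalized joint: the product of all five factors."""
    return (phi1(x, z1) * phi2(z1, z2) * phi3(x, z1, z2, z3)
            * phi4(z3, y) * phi5(y))

Z = sum(score(*v) for v in product((0, 1), repeat=5))     # partition function
print(score(1, 0, 0, 1, 1) / Z)                           # one full assignment's probability

def p_y_given_x(x):
    """Clamp x, sum out z1..z3, normalize over y."""
    py = [sum(score(x, z1, z2, z3, y)
              for z1, z2, z3 in product((0, 1), repeat=3))
          for y in (0, 1)]
    s = sum(py)
    return [p / s for p in py]

print(p_y_given_x(1))
```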
Familiar MRF example
First, a familiar example: a Conditional Random Field (CRF) for POS tagging. [Diagram: observed input sentence “… find preferred tags …” (shaded); one possible tagging (i.e., an assignment to the remaining variables): v v v.]
Familiar MRF example
[Same diagram.] Another possible tagging: v a n.
Familiar MRF example: CRF
“Binary” factor measures the compatibility of 2 adjacent tags; the model reuses the same parameters at this position.
Familiar MRF example: CRF
“Unary” factor evaluates this tag; its values depend on the corresponding word (e.g., “can’t be adj”).
Familiar MRF example: CRF
“Unary” factor evaluates this tag; its values depend on the corresponding word (and could be made to depend on the entire observed sentence).
Familiar MRF example: CRF
“Unary” factor evaluates this tag; there is a different unary factor at each position.
Familiar MRF example: CRF
p(v a n) is proportional to the product of all factors’ values on the tagging v a n.
Familiar MRF example: CRF
p(v a n) is proportional to the product of all factors’ values on v a n = … 1 * 3 * 0.3 * 0.1 * 0.2 …
NOTE: This is not just a pipeline of single-tag prediction tasks (which might work OK in the well-trained supervised case …).
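A minimal sketch of that product-of-factors computation (the factor values below are illustrative stand-ins, not the slide's actual numbers): score each tagging of “find preferred tags” by multiplying its unary and binary factor values, then normalize by the sum over all 27 taggings.

```python
from itertools import product

TAGS  = ("v", "a", "n")
WORDS = ("find", "preferred", "tags")

# Unary factors: compatibility of each tag with the word at its position.
unary = {
    "find":      {"v": 3.0, "a": 0.1, "n": 1.0},
    "preferred": {"v": 1.0, "a": 2.0, "n": 0.5},
    "tags":      {"v": 0.2, "a": 0.1, "n": 3.0},
}
# Binary factor: compatibility of two adjacent tags; the same table is
# reused at every position, as the slide notes.
binary = {(t, u): 1.0 for t in TAGS for u in TAGS}
binary[("v", "a")] = 2.0
binary[("a", "n")] = 2.5

def score(tags):
    """Unnormalized p(tagging): product of all factors' values."""
    s = 1.0
    for w, t in zip(WORDS, tags):
        s *= unary[w][t]
    for t, u in zip(tags, tags[1:]):
        s *= binary[(t, u)]
    return s

Z = sum(score(t) for t in product(TAGS, repeat=len(WORDS)))
print(score(("v", "a", "n")) / Z)   # p(v a n | "find preferred tags")
```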
Task-centered view of the world
[Pipeline diagram: x → Task1 → z1 → Task2 → z2 → Task3 → z3 → Task4 → y.]
Variable-centered view of the world
p(x, z1, z2, z3, y) = (1/Z) Φ1(x, z1) Φ2(z1, z2) Φ3(x, z1, z2, z3) Φ4(z3, y) Φ5(y)
Variable-centric, not task-centric
Throw in any variables that might help! Model and exploit correlations.
[Diagram: a web of variables to throw in: lexicon (word types), semantics, sentences, discourse context, resources, inflection, cognates, transliteration, abbreviation, neologism, language evolution, entailment, correlation, tokens (×N), translation, alignment, editing, quotation, speech, misspellings/typos, formatting, entanglement, annotation.]
Back to our (simpler!) running examples
• sentence → parse (with David A. Smith)
• lemma → morph. paradigm (with Markus Dreyer)
Parser projection (with David A. Smith)
[Diagram: sentence → parse (little direct training data); translation → parse of translation (much more training data).]
Parser projection
Auf diese Frage habe ich leider keine Antwort bekommen
I did not unfortunately receive an answer to this question
Parser projection
[Diagram: as before, now with a word-to-word alignment connecting the sentence and its translation.]
Parser projection
Auf diese Frage habe ich leider keine Antwort bekommen
I did not unfortunately receive an answer to this question
[Diagram: word-to-word alignment between the two sentences; unaligned words link to NULL.]
Parser projection
[Diagram as before: the factor tying parse, word-to-word alignment, and parse of translation together needs an interesting model.]
Parses are not entirely isomorphic
[Diagram: aligned dependency parses of “Auf diese Frage habe ich leider keine Antwort bekommen” and “I did not unfortunately receive an answer to this question” (with NULL); edge correspondences labeled monotonic, head-swapping, siblings, null.]
Dependency Relations + “none of the above”
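A minimal sketch of a factor in this spirit (my own simplification, with invented weights; the actual Smith & Eisner model is richer): score a source dependency edge against the target parse, through the alignment, by classifying its configuration as monotonic, head-swapping, siblings, null, or “none of the above”.

```python
# Invented weights for each configuration class (illustrative only).
WEIGHTS = {"monotonic": 2.0, "head_swap": 1.2, "siblings": 1.1,
           "null": 0.8, "other": 0.5}

def classify(h, c, align, target_head):
    """h, c: source head and child indices. align: dict source index ->
    target index (missing if aligned to NULL). target_head: dict target
    child index -> its head index."""
    th, tc = align.get(h), align.get(c)
    if th is None or tc is None:
        return "null"                          # an endpoint aligns to NULL
    if target_head.get(tc) == th:
        return "monotonic"                     # parent-child preserved
    if target_head.get(th) == tc:
        return "head_swap"                     # parent-child reversed
    if target_head.get(tc) == target_head.get(th):
        return "siblings"                      # both attach to the same head
    return "other"                             # "none of the above"

def edge_factor(h, c, align, target_head):
    """Nonnegative factor value for one source edge."""
    return WEIGHTS[classify(h, c, align, target_head)]
```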
Parser projection
[Diagram throughout: variables sentence, parse, word-to-word alignment, translation, parse of translation; each scenario below shades a different subset as observed.]
• Typical test data (no translation observed)
• Small supervised training set (treebank)
• Moderate treebank in the other language
• Maybe a few gold alignments
• Lots of raw bitext
Given bitext, try to impute the other variables: now we have more constraints on the parse, which should help us train the parser. We’ll see how belief propagation naturally handles this.
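As a taste of how, here is a minimal loopy sum-product sketch on a toy pairwise loop, with binary variables standing in for the parse, the alignment, and the parse of the translation (an illustrative assumption: the real variables are structured objects, handled with structured factors rather than tiny tables). Soft evidence on one variable sends messages around the loop that reshape the beliefs at the others, which is how an imputed alignment constrains the parse.

```python
import numpy as np

K = 2                                          # toy binary variables
V = ("parse", "alignment", "parse_of_transl")  # hypothetical stand-ins
edges = [("parse", "alignment"),
         ("alignment", "parse_of_transl"),
         ("parse_of_transl", "parse")]         # a cycle, so BP is "loopy"
pot = {e: np.array([[2.0, 0.5],                # pairwise factor tables:
                    [0.5, 2.0]]) for e in edges}   # neighbors like to agree
unary = {v: np.ones(K) for v in V}
unary["alignment"] = np.array([0.9, 0.1])      # soft evidence: imputed alignment

nbrs = {v: [] for v in V}
for i, j in edges:
    nbrs[i].append(j); nbrs[j].append(i)

def edge_pot(i, j):                            # factor table oriented as (i, j)
    return pot[(i, j)] if (i, j) in pot else pot[(j, i)].T

msg = {(i, j): np.ones(K) for i in V for j in nbrs[i]}
for _ in range(30):                            # iterate to (near) convergence
    new = {}
    for i, j in msg:
        inc = unary[i].copy()                  # product of incoming messages,
        for k in nbrs[i]:                      # excluding the one from j
            if k != j:
                inc = inc * msg[(k, i)]
        m = edge_pot(i, j).T @ inc             # sum out variable i
        new[(i, j)] = m / m.sum()
    msg = new

def belief(v):                                 # approximate marginal at v
    b = unary[v].copy()
    for k in nbrs[v]:
        b = b * msg[(k, v)]
    return b / b.sum()

for v in V:
    print(v, belief(v))      # the evidence pulls all three toward state 0
```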
English does help us impute the Chinese parse
中国 在 基本 建设 方面 , 开始 利用 国际 金融 组织 的 贷款 进行 国际性 竞争性 招标 采购
Gloss: China : in : infrastructure : construction : area : , : has begun : to utilize : international : financial : organizations : ’s : loans : to implement : international : competitive : bidding : procurement
English: In the area of infrastructure construction, China has begun to utilize loans from international financial organizations to implement international competitive bidding procurement
Seeing the noisy output of an English WSJ parser fixes these Chinese links (vs. the corresponding bad versions found without seeing the English parse): complement verbs swap objects; the subject attaches to an intervening noun.
[The figure also shows POS tags for the Chinese words.]
(Could add a 3rd language …)
[Diagram: sentence → parse, plus translation → parse of translation and translation’ → parse of translation’, with alignments connecting each pair.]
(Could add world knowledge …)
[Diagram: the parser-projection variables (sentence, parse, word-to-word alignment, translation, parse of translation) plus a world variable connected to them.]