Machine Translation Domain Adaptation
Day 19
Project #2
MEMM tools • Online description of project #2 has been updated with more information
Quick walk through
training.txt:
    I/PRP left/VBD ./.
    John/NNP arrived/VBD ./.
Quick walk through
• You write code to convert this to features!
  “featurize.pl training.txt training.feats”
training.feats:
    PRP w0=I:1 w-1=<s>:1
    VBD w0=left:1 w-1=I:1
    . w0=.:1 w-1=left:1
    NNP w0=John:1 w-1=<s>:1
    VBD w0=arrived:1 w-1=John:1
    . w0=.:1 w-1=arrived:1
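The featurization step above can be sketched as follows. This is a Python stand-in for the featurize.pl script that students write, not the course's own code. It assumes one sentence per line in training.txt and tokens of the form word/TAG. The function name `featurize` is illustrative.

```python
def featurize(lines):
    """Turn 'word/TAG' sentences into 'TAG w0=word:1 w-1=prev:1' rows."""
    rows = []
    for line in lines:
        prev = "<s>"  # previous word; <s> marks the sentence start
        for token in line.split():
            # Split on the last slash so that './.' yields word '.' and tag '.'
            word, tag = token.rsplit("/", 1)
            rows.append(f"{tag} w0={word}:1 w-1={prev}:1")
            prev = word
    return rows


training = ["I/PRP left/VBD ./.", "John/NNP arrived/VBD ./."]
rows = featurize(training)
print("\n".join(rows))
```

On the two training sentences this reproduces the six training.feats rows shown on the slide. Whether the real tools expect a separator between sentences is not stated here, so the sketch emits none.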
Quick walk through
• Run memm_train to train this model
  “memm_train --input training.feats --classifier trigram.model --markovOrder 2”
trigram.model:
    <binary gobbledegoo>
Quick walk through
• Get some unseen test data…
test.txt:
    he/PRP arrived/VBD ./.
    John/NNP left/VBD ./.
Quick walk through
• Use the same featurization code on test data
  “featurize.pl test.txt test.feats”
test.feats:
    PRP w0=he:1 w-1=<s>:1
    VBD w0=arrived:1 w-1=he:1
    . w0=.:1 w-1=arrived:1
    NNP w0=John:1 w-1=<s>:1
    VBD w0=left:1 w-1=John:1
    . w0=.:1 w-1=left:1
Quick walk through
• memm_test predicts tags (memm_test ignores first column; can include true tags)
  “memm_test --input test.feats --classifier trigram.model --markovOrder 2 --output test.tags”
test.tags:
    PRP VBD . NNP VBD .
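Because test.feats may carry the true tags in its first column, the predictions in test.tags can be scored against them. The sketch below is one possible way to do that and is not part of the MEMM tools. It assumes test.tags can be read as a flat, whitespace-separated sequence of tags in token order. The function name `tagging_accuracy` is illustrative.

```python
def tagging_accuracy(feats_text, tags_text):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    # Gold tag = first whitespace-separated field of each non-empty feature row.
    gold = [line.split()[0] for line in feats_text.splitlines() if line.strip()]
    # Assumption: predicted tags appear in token order, separated by whitespace.
    predicted = tags_text.split()
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted tag counts differ")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


test_feats = """\
PRP w0=he:1 w-1=<s>:1
VBD w0=arrived:1 w-1=he:1
. w0=.:1 w-1=arrived:1
NNP w0=John:1 w-1=<s>:1
VBD w0=left:1 w-1=John:1
. w0=.:1 w-1=left:1
"""
test_tags = "PRP VBD . NNP VBD ."
accuracy = tagging_accuracy(test_feats, test_tags)
print(accuracy)  # 1.0 for the slide's example, where every tag matches
```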
MEMM features
• You provide these features… (training.feats)
    PRP w0=I:1 w-1=<s>:1
    VBD w0=left:1 w-1=I:1
    . w0=.:1 w-1=left:1
    NNP w0=John:1 w-1=<s>:1
    VBD w0=arrived:1 w-1=John:1
    . w0=.:1 w-1=arrived:1
• …and add the argument “--markovOrder 2”
• Actual features used by MEMM:
    PRP w0=I:1 w-1=<s>:1 t[-1]=<s>:1 t[-1]=<s>,t[-2]=<s>:1
    VBD w0=left:1 w-1=I:1 t[-1]=PRP:1 t[-1]=PRP,t[-2]=<s>:1
    . w0=.:1 w-1=left:1 t[-1]=VBD:1 t[-1]=VBD,t[-2]=PRP:1
    <s> t[-1]=.:1 t[-1]=.,t[-2]=VBD:1
    NNP w0=John:1 w-1=<s>:1 t[-1]=<s>:1 t[-1]=<s>,t[-2]=<s>:1
    VBD w0=arrived:1 w-1=John:1 t[-1]=NNP:1 t[-1]=NNP,t[-2]=<s>:1
    . w0=.:1 w-1=arrived:1 t[-1]=VBD:1 t[-1]=VBD,t[-2]=NNP:1
    <s> t[-1]=.:1 t[-1]=.,t[-2]=VBD:1
• The MEMM adds in features about tag context at training and test time
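The tag-context expansion above can be reproduced with a short sketch. It only mirrors the listing on the slide and makes no claim about how memm_train is implemented. It uses the gold tags as history, which is the training-time view; at test time the history would come from the tags the decoder hypothesizes. The names `context_features` and `add_tag_context` are illustrative.

```python
def context_features(history, markov_order):
    """Build t[-1]=..., then t[-1]=...,t[-2]=..., up to the Markov order."""
    feats = []
    for n in range(1, markov_order + 1):
        parts = [f"t[-{k}]={history[-k]}" for k in range(1, n + 1)]
        feats.append(",".join(parts) + ":1")
    return " ".join(feats)


def add_tag_context(sentences, markov_order=2):
    """Append tag-history features to each row; emit a <s> row per sentence end."""
    out = []
    for sentence in sentences:
        history = ["<s>"] * markov_order  # history resets at each sentence start
        for row in sentence:
            out.append(f"{row} {context_features(history, markov_order)}")
            history.append(row.split()[0])  # first column is the tag
        out.append(f"<s> {context_features(history, markov_order)}")
    return out


sentences = [
    ["PRP w0=I:1 w-1=<s>:1", "VBD w0=left:1 w-1=I:1", ". w0=.:1 w-1=left:1"],
]
expanded = add_tag_context(sentences)
print("\n".join(expanded))
```

For the first training sentence this yields the same four rows as the slide, including the closing `<s>` row.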
Acknowledgments • Many thanks to (for helpful content and input on content): • Chris Callison-Burch, Matt Post, & Adam Lopez (JHU) • Philipp Koehn & Barry Haddow (U Edinburgh) • Kevin Knight (ISI)
Translation: global problem and interesting research problem • Non-English Internet content and user communities are increasing explosively • Human translation costs are excessive: major languages range from 10-50 cents per word • Result: the vast majority of published material remains untranslated!
Prevalence of MT on the Web From Rarrick et al, 2010
The Goal: (sentence) translation • 滴水之恩當以涌泉相報 • A drop of water shall be returned with a burst of spring. • Translate source sentences into target sentences • For now, ignore discourse structure, co-reference, and phenomena across sentence boundaries
Types of MT systems Modified Vauquois pyramid • Source of information • Rule based: People write rules to specify translations of words, phrases • Data-driven: Use learning techniques to derive translation “rules” from data sources (e.g., parallel corpora) • Level of representation
Advantages of data-driven translation • We can model the genres of documents that we would like to model • Learn contextually appropriate translations for technical data, chat data, etc. • Very flexible system • Given corpus C= ({x1,y1}, {x2,y2}, …) of sentence pairs • Translate(C, x) = y is a function of the training data and the input sentence • To build a new system (or optimize our old one) we just change the data • But…we need oodles of data to get “good” models
Statistical MT • Learn word and phrase alignments from “parallel” data
Statistical MT • Learn word and phrase alignments from “parallel” data • Parallel data? • Parallel documents?
Statistical MT • Learn word and phrase alignments from “parallel” data • Start with parallel documents • Need parallel sentences • Sentence break and sentence align • Word align and produce word and phrase translation tables (our translation models)
Statistical MT • Learn word and phrase alignments from “parallel” data • Start with parallel documents • Need parallel sentences • Sentence break and sentence align • Word align and produce word and phrase translation tables (our translation models) • Use monolingual data to • Build language models • Inform ordering • Choose best translation from n-best list
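To make "use monolingual data to build language models" concrete, here is a minimal sketch of an unsmoothed bigram model estimated by relative frequency. It is only an illustration: the four sentences are the toy sentences from the tagging walk-through, reused as stand-in monolingual data, and a real system would need far more text plus smoothing. The name `train_bigram_lm` is illustrative.

```python
from collections import Counter


def train_bigram_lm(sentences):
    """Return prob(word, prev) = count(prev word) / count(prev), unsmoothed."""
    contexts, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        contexts.update(tokens[:-1])           # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))

    def prob(word, prev):
        if contexts[prev] == 0:
            return 0.0                         # unseen context: no estimate
        return bigrams[(prev, word)] / contexts[prev]

    return prob


monolingual = ["I left .", "John arrived .", "he arrived .", "John left ."]
prob = train_bigram_lm(monolingual)
print(prob("left", "John"))   # "John" is followed by "left" in 1 of its 2 uses
print(prob(".", "left"))      # "left" is always followed by "."
```

A sentence score is the product of these bigram probabilities, which is what lets the model prefer one word order or one candidate translation over another.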
Statistical MT Recipe
Start With → Build These Components
• Parallel sentences; align words & phrases, & generate counts → Translation Model: probs associated with aligned words & phrases – P(E|F)
• Monolingual data → Language Model – P(E)
• Decoding algorithm → Decoder: maximizes P(F|E)*P(E)
Statistical Machine Translation
• Given foreign f, find best English translation e*
    e* = argmax_e P(e | f)
• Use Bayes’ rule to get “noisy channel” model
    P(e | f) = P(f | e) ∙ P(e) / P(f)
    argmax_e P(e | f) = argmax_e P(f | e) ∙ P(e)
  (P(f) does not depend on e, so it drops out of the argmax)
• P(f | e) is the channel or translation model
• P(e) is the language model
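The argmax above can be illustrated with a toy sketch. The candidate names and all probabilities below are hypothetical values chosen only to show the interaction of the two models; they are not taken from any trained system. The function name `noisy_channel_best` is illustrative.

```python
import math


def noisy_channel_best(candidates):
    """candidates maps e -> (P(f|e), P(e)); return the e maximizing P(f|e) * P(e)."""
    def score(e):
        tm, lm = candidates[e]
        # Sum of logs is equivalent to the product and avoids underflow.
        return math.log(tm) + math.log(lm)

    return max(candidates, key=score)


# Hypothetical scores for three candidate translations of one foreign sentence.
candidates = {
    "candidate A": (0.20, 0.010),   # product 0.0020
    "candidate B": (0.30, 0.002),   # best channel score, weak language model score
    "candidate C": (0.05, 0.050),   # product 0.0025
}
best = noisy_channel_best(candidates)
print(best)
```

Candidate B has the highest translation model score, yet candidate C wins because the language model rates it as much more fluent. That trade-off is the point of combining P(f | e) with P(e).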
Centauri/Arcturan [Knight, 1997] Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp Slides 38-74 adapted from Kevin Knight and CCB’s JHU crew
Centauri/Arcturan [Knight, 1997]
1a. ok-voon ororok sprok .
1b. at-voon bichat dat .
2a. ok-drubel ok-voon anok plok sprok .
2b. at-drubel at-voon pippat rrat dat .
3a. erok sprok izok hihok ghirok .
3b. totat dat arrat vat hilat .
4a. ok-voon anok drok brok jok .
4b. at-voon krat pippat sat lat .
5a. wiwok farok izok stok .
5b. totat jjat quat cat .
6a. lalok sprok izok jok stok .
6b. wat dat krat quat cat .
7a. lalok farok ororok lalok sprok izok enemok .
7b. wat jjat bichat wat dat vat eneat .
8a. lalok brok anok plok nok .
8b. iat lat pippat rrat nnat .
9a. wiwok nok izok kantok ok-yurp .
9b. totat nnat quat oloat at-yurp .
10a. lalok mok nok yorok ghirok clok .
10b. wat nnat gat mat bat hilat .
11a. lalok nok crrrok hihok yorok zanzanok .
11b. wat nnat arrat mat zanzanat .
12a. lalok rarok nok izok hihok mok .
12b. wat nnat forat arrat vat gat .
Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp
Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp • ???
Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp • process of elimination
Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp • cognate?
Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp • zero fertility
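One way to mechanize the first pass over this puzzle is to intersect the Arcturan sides of every sentence pair that contains a given Centauri word. The sketch below is a simple co-occurrence heuristic, not Knight's procedure and not a full alignment model. It assumes the a-sentences are Centauri and the b-sentences Arcturan, and it omits sentence-final periods. The names `CORPUS` and `candidate_translations` are illustrative.

```python
# The 12 sentence pairs from the slide, sentence-final periods omitted.
CORPUS = [
    ("ok-voon ororok sprok", "at-voon bichat dat"),
    ("ok-drubel ok-voon anok plok sprok", "at-drubel at-voon pippat rrat dat"),
    ("erok sprok izok hihok ghirok", "totat dat arrat vat hilat"),
    ("ok-voon anok drok brok jok", "at-voon krat pippat sat lat"),
    ("wiwok farok izok stok", "totat jjat quat cat"),
    ("lalok sprok izok jok stok", "wat dat krat quat cat"),
    ("lalok farok ororok lalok sprok izok enemok", "wat jjat bichat wat dat vat eneat"),
    ("lalok brok anok plok nok", "iat lat pippat rrat nnat"),
    ("wiwok nok izok kantok ok-yurp", "totat nnat quat oloat at-yurp"),
    ("lalok mok nok yorok ghirok clok", "wat nnat gat mat bat hilat"),
    ("lalok nok crrrok hihok yorok zanzanok", "wat nnat arrat mat zanzanat"),
    ("lalok rarok nok izok hihok mok", "wat nnat forat arrat vat gat"),
]


def candidate_translations(word, corpus):
    """Arcturan words present in every pair whose Centauri side contains `word`."""
    candidates = None
    for centauri, arcturan in corpus:
        if word in centauri.split():
            targets = set(arcturan.split())
            candidates = targets if candidates is None else candidates & targets
    return candidates or set()


for w in ["farok", "hihok", "yorok"]:
    print(w, sorted(candidate_translations(w, CORPUS)))
```

For farok and hihok the intersection leaves a single candidate (jjat and arrat). For yorok three candidates survive (mat, nnat, wat), so the co-occurrence evidence alone is not enough and the other pairs have to rule out wat and nnat, which is where process of elimination comes in. Pair 11 also has six Centauri words against five Arcturan words, so at least one source word there has no counterpart, which is the situation the zero fertility note refers to.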