Introduction to Natural Language Processing (600.465) • Statistical Translation: Alignment and Parameter Estimation • Dr. Jan Hajič, CS Dept., Johns Hopkins Univ. • hajic@cs.jhu.edu • www.cs.jhu.edu/~hajic
Alignment • Available corpus assumed: • parallel text (translation E ↔ F) • No alignment present (day marks only)! • Sentence alignment • sentence boundary detection • sentence alignment • Word alignment • tokenization • word alignment (with restrictions)
Sentence Boundary Detection • Rules, lists: • Sentence breaks: • paragraphs (if marked) • certain characters: ?, !, ; (...almost surely a break) • The Problem: the period “.” • could be the end of a sentence (... left yesterday. He was heading to...) • decimal point: 3.6 (three point six) • thousands separator: 3.200 (three thousand two hundred) • abbreviations never occurring at the end of a sentence: cf., e.g., Calif., Mt., Mr. • ellipsis: ... • other languages: ordinal number indication (2nd ~ 2.) • initials: A. B. Smith • Statistical methods: e.g., Maximum Entropy
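The rule-based side of this can be sketched roughly as follows; a minimal illustration only, where the abbreviation list, the regular expressions, and the function name split_sentences are my own choices, not part of the lecture:

```python
import re

# A minimal rule-based sentence splitter illustrating the problems listed
# above (abbreviations, decimal points, initials).  The abbreviation list
# and the heuristics are illustrative, not from the lecture.
ABBREVIATIONS = {"cf.", "e.g.", "calif.", "mt.", "mr."}

def split_sentences(text):
    tokens = text.split()
    sentences, current = [], []
    for i, tok in enumerate(tokens):
        current.append(tok)
        if tok.endswith(("?", "!", ";")):            # almost surely a break
            sentences.append(" ".join(current)); current = []
        elif tok.endswith("."):
            next_tok = tokens[i + 1] if i + 1 < len(tokens) else ""
            if tok.lower() in ABBREVIATIONS:          # known abbreviation
                continue
            if re.fullmatch(r"\d+[.,\d]*\.", tok):    # number like 3.200.
                continue
            if re.fullmatch(r"[A-Z]\.", tok):         # initial: A. B. Smith
                continue
            if next_tok and next_tok[0].islower():    # next word lowercase
                continue
            sentences.append(" ".join(current)); current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("He left yesterday. He was heading to Calif. to see Mr. A. B. Smith."))
```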
Sentence Alignment • The Problem: only sentence boundaries are detected (diagram: E and F texts shown as sequences of sentence segments). • Desired output: a segmentation with an equal number of segments, spanning the whole text continuously. • Original sentence boundaries kept (diagram: the same E and F texts re-grouped into aligned segments). • Alignments obtained: 2-1, 1-1, 1-1, 2-2, 2-1, 0-1 • New segments are called “sentences” from now on.
Alignment Methods • Several methods (probabilistic and non-probabilistic): • character-length based • word-length based • “cognates” (word identity used) • using an existing dictionary (F: prendre ~ E: make, take) • using word “distance” (similarity): names, numbers, borrowed words, words of Latin origin, ... • Best performing: • statistical, word- or character-length based (perhaps with some word clues)
Length-based Alignment • First, define the problem probabilistically: argmax_A P(A|E,F) = argmax_A P(A,E,F) (E,F fixed) • Define a “bead”: a group of consecutive sentences from E paired with a group of consecutive sentences from F that are mutual translations (diagram: a 2:2 bead covering two E sentences and two F sentences). • Approximate: P(A,E,F) ≈ ∏_{i=1..n} P(B_i), where B_i is a bead; P(B_i) does not depend on the rest of E,F.
The Alignment Task • Given the model definition, P(A,E,F) ≈ ∏_{i=1..n} P(B_i), find the partitioning of (E,F) into n beads B_{i=1..n} that maximizes P(A,E,F) over the training data. • Define the type of a bead B_k as p:q, where p:q ∈ {0:1, 1:0, 1:1, 1:2, 2:1, 2:2} • describes the type of alignment (how many E and F sentences the bead spans) • Want to use some sort of dynamic programming: • Define Pref(i,j)... the probability of the best alignment from the start of the (E,F) data (1,1) up to (i,j)
Recursive Definition • Initialize: Pref(0,0) = 1. • Pref(i,j) = max ( Pref(i,j-1) P(0:1_k), Pref(i-1,j) P(1:0_k), Pref(i-1,j-1) P(1:1_k), Pref(i-1,j-2) P(1:2_k), Pref(i-2,j-1) P(2:1_k), Pref(i-2,j-2) P(2:2_k) ), where P(p:q_k) is the probability of the next (k-th) bead having type p:q (defined on the next slide). • This is enough for a Viterbi-like search. • (Diagram: the six predecessor cells of (i,j) in the (E,F) grid, one per bead type.)
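A rough sketch of this Viterbi-like search in Python (log-probabilities are used to avoid underflow; the names align_lengths and bead_log_prob, and the use of sentence lengths as input, are my own choices; the bead score itself is left as a parameter, to be supplied by the length model of the next slide):

```python
# Bead types p:q considered in the Pref(i,j) recursion above.
BEAD_TYPES = [(0, 1), (1, 0), (1, 1), (1, 2), (2, 1), (2, 2)]

def align_lengths(e_lens, f_lens, bead_log_prob):
    """Viterbi-like search over beads.

    e_lens, f_lens: sentence lengths (e.g., in characters) of the two texts.
    bead_log_prob(p, q, le, lf): log-probability of a p:q bead whose E side
        has total length le and F side total length lf.
    Returns the best bead sequence as a list of (p, q) types.
    """
    n, m = len(e_lens), len(f_lens)
    NEG_INF = float("-inf")
    pref = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    pref[0][0] = 0.0                         # log 1 = 0: the empty prefix
    for i in range(n + 1):
        for j in range(m + 1):
            for p, q in BEAD_TYPES:
                if i - p < 0 or j - q < 0 or pref[i - p][j - q] == NEG_INF:
                    continue
                le = sum(e_lens[i - p:i])    # total E length of this bead
                lf = sum(f_lens[j - q:j])    # total F length of this bead
                score = pref[i - p][j - q] + bead_log_prob(p, q, le, lf)
                if score > pref[i][j]:
                    pref[i][j] = score
                    back[i][j] = (p, q)
    # Recover the best bead sequence from the back-pointers.
    beads, i, j = [], n, m
    while (i, j) != (0, 0):
        p, q = back[i][j]
        beads.append((p, q))
        i, j = i - p, j - q
    return list(reversed(beads))
```

The search visits every cell of the (n+1) × (m+1) grid, which is exactly what makes very long texts expensive (see “Saving time” below).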
Probability of a Bead • Remains to define P(p:q_k) (the bead factor in the recursion above): • k refers to the “next” bead, with segments of p and q sentences, of lengths l_{k,e} and l_{k,f}. • Use a normal distribution for the length variation: • P(p:q_k) = P(δ(l_{k,e}, l_{k,f}, μ, σ²), p:q) ≈ P(δ(l_{k,e}, l_{k,f}, μ, σ²)) P(p:q) • δ(l_{k,e}, l_{k,f}, μ, σ²) = (l_{k,f} - μ l_{k,e}) / √(l_{k,e} σ²) • Estimate P(p:q) from a small amount of data, or even guess and re-estimate after aligning some data. • Words etc. might be used as better clues in the definition of P(p:q_k).
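A minimal sketch of such a length-based bead score, usable as the bead_log_prob argument of the align_lengths sketch above. The prior table and the values of μ and σ² are illustrative guesses, and the Gaussian density of δ is used directly as the length term; these are my assumptions, not values from the lecture:

```python
import math

# Illustrative prior over bead types P(p:q) and length-model parameters.
BEAD_PRIOR = {(1, 1): 0.89, (1, 0): 0.01, (0, 1): 0.01,
              (2, 1): 0.045, (1, 2): 0.045, (2, 2): 0.01}
MU, SIGMA2 = 1.0, 6.8      # expected F/E length ratio and its variance (guesses)

def bead_log_prob(p, q, le, lf):
    prior = BEAD_PRIOR.get((p, q), 1e-6)
    if le == 0 or lf == 0:                   # 0:1 / 1:0 beads: prior only
        return math.log(prior)
    delta = (lf - MU * le) / math.sqrt(le * SIGMA2)
    # Log of the standard normal density of the length deviation delta,
    # combined with the prior of the bead type.
    log_density = -0.5 * delta * delta - 0.5 * math.log(2 * math.pi)
    return math.log(prior) + log_density
```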
Saving time • For long texts (> 10⁴ sentences), even Viterbi (in the version needed) is not effective (O(S²) time) • Go paragraph by paragraph if they are aligned 1:1 • What if not? • Apply the same method first to paragraphs! • identify paragraphs roughly in both languages • run the algorithm to get aligned paragraph-like segments • then run on sentences within paragraphs. • Performs well if there are not many consecutive 1:0 or 0:1 beads.
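Under the same assumptions as the sketches above, the paragraph-first strategy might look roughly like this (align paragraph totals first, then sentences inside each aligned paragraph group; all names are mine):

```python
def align_hierarchically(e_pars, f_pars, bead_log_prob):
    """e_pars / f_pars: lists of paragraphs, each a list of sentence lengths.
    Assumes the align_lengths() and bead_log_prob() sketches above."""
    # Step 1: align paragraph-like segments by their total lengths.
    par_beads = align_lengths([sum(p) for p in e_pars],
                              [sum(p) for p in f_pars], bead_log_prob)
    # Step 2: run the same search on the sentences inside each aligned group.
    result, i, j = [], 0, 0
    for p, q in par_beads:
        e_sents = [l for par in e_pars[i:i + p] for l in par]
        f_sents = [l for par in f_pars[j:j + q] for l in par]
        result.append(align_lengths(e_sents, f_sents, bead_log_prob))
        i, j = i + p, j + q
    return result
```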
Word Alignment • Length alone does not help anymore, • mainly because words can be swapped, and mutual translations often have vastly different lengths. • ...but at least we have “sentences” (sentence-like segments) aligned; that will be exploited heavily. • Idea: • Assume some (simple) translation model (such as Model 1). • Find its parameters by considering virtually all alignments. • After we have the parameters, find the best alignment given those parameters.
Word Alignment Algorithm • Start with the sentence-aligned corpus. • Let (E,F) be a pair of sentences (actually, a bead). • Initialize p(f|e) randomly (e.g., uniformly), for f ∈ F, e ∈ E. • Compute expected counts over the corpus: c(f,e) = Σ_{(E,F); e∈E, f∈F} p(f|e) • i.e., for every aligned pair (E,F), check whether e is in E and f is in F; if yes, add p(f|e). • Reestimate: p(f|e) = c(f,e) / c(e), where c(e) = Σ_f c(f,e) • Iterate until the change in p(f|e) is small.
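A compact EM sketch along these lines (a standard IBM Model 1 style E-step: it normalizes p(f|e) over the words e of each sentence pair when collecting the expected counts, a detail the compact c(f,e) formula above glosses over; the function name, the absence of a NULL word, and the data layout are my own choices):

```python
from collections import defaultdict

def train_model1(corpus, iterations=10):
    """EM estimation of the translation probabilities p(f|e).

    corpus: list of sentence pairs (E, F), each a list of tokens.
    Returns a dict-like table keyed by (f, e).
    """
    # Initialize p(f|e) uniformly over the F-side vocabulary.
    f_vocab = {f for _, F in corpus for f in F}
    p = defaultdict(lambda: 1.0 / len(f_vocab))
    for _ in range(iterations):
        count_fe = defaultdict(float)   # expected counts c(f,e)
        count_e = defaultdict(float)    # c(e) = sum_f c(f,e)
        for E, F in corpus:
            for f in F:
                # How much of this occurrence of f each e in E accounts for.
                total = sum(p[(f, e)] for e in E)
                for e in E:
                    frac = p[(f, e)] / total
                    count_fe[(f, e)] += frac
                    count_e[e] += frac
        # Re-estimate: p(f|e) = c(f,e) / c(e).
        p = defaultdict(lambda: 1e-12,
                        {(f, e): count_fe[(f, e)] / count_e[e]
                         for (f, e) in count_fe})
    return p
```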
Best Alignment • Select, for each (E,F), A = argmax_A P(A|F,E) = argmax_A P(F,A|E)/P(F) = argmax_A P(F,A|E) = argmax_A ( ε / (l+1)^m ∏_{j=1..m} p(f_j|e_{a_j}) ) = argmax_A ∏_{j=1..m} p(f_j|e_{a_j}) • Again, use dynamic programming, a Viterbi-like algorithm. • Recompute p(f|e) based on the best alignment • (only if you are inclined to do so; the “original” summed-over-all distribution might perform better). • Note: we have also obtained all the Model 1 parameters.
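Because Model 1 scores each position of F independently, the argmax over alignments decomposes into an independent choice per word, so the Viterbi-like search becomes trivial. A sketch, assuming the p(f|e) table produced by the hypothetical train_model1 sketch above:

```python
def best_alignment(E, F, p):
    """Best Model 1 alignment of F to E given the table p[(f, e)].

    Returns a list a with a[j] = index into E chosen for the j-th word of F;
    each choice is independent, since the product above factorizes over j.
    """
    return [max(range(len(E)), key=lambda i: p[(f, E[i])]) for f in F]
```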