Machine Translation: Discriminative Word Alignment
Stephan Vogel, Spring Semester 2011
Generative Alignment Models
• Generative word alignment models: p(f, a | e)
  • Alignment a is a hidden variable: the actual word alignment is not observed
  • Sum over all alignments: p(f | e) = Σa p(f, a | e)
• Well-known models: IBM 1 … 5, HMM, ITG
  • Model lexical association, distortion, fertility
• It is difficult to incorporate additional information:
  • POS of words (used in the distortion model, not as direct link features)
  • A manual dictionary
  • Syntax information
  • …
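To make the "sum over all alignments" concrete, here is a minimal sketch of IBM Model 1 EM training in Python; the corpus format and variable names are illustrative, not taken from the slides:

```python
from collections import defaultdict

def train_ibm1(corpus, iterations=5):
    """Minimal IBM Model 1 EM trainer (a sketch, not the lecture's code).
    corpus: list of (f_sentence, e_sentence) pairs, each a list of tokens.
    Returns t with t[f][e] = p(f | e); a NULL token models unaligned words."""
    t = defaultdict(lambda: defaultdict(lambda: 1.0))  # flat initialization
    for _ in range(iterations):
        count = defaultdict(lambda: defaultdict(float))
        total = defaultdict(float)
        for f_sent, e_sent in corpus:               # E-step
            e_sent = ["NULL"] + e_sent
            for f in f_sent:
                # the hidden alignment is summed out over all positions
                norm = sum(t[f][e] for e in e_sent)
                for e in e_sent:
                    frac = t[f][e] / norm           # expected count of link (f, e)
                    count[e][f] += frac
                    total[e] += frac
        for e in count:                             # M-step: renormalize
            for f in count[e]:
                t[f][e] = count[e][f] / total[e]
    return t
```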
Discriminative Word Alignment
• Model the alignment directly: p(a | f, e)
• Find the alignment that maximizes p(a | f, e)
• A well-suited framework: maximum entropy
  • Set of feature functions hm(a, f, e), m = 1, …, M
  • Set of model parameters (feature weights) cm, m = 1, …, M
  • Decision rule: â = argmaxa Σm cm hm(a, f, e)
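A minimal sketch of this decision rule; the feature functions and the candidate set are placeholders, since how candidate alignments are generated is not fixed on this slide:

```python
def score(a, f_sent, e_sent, features, weights):
    """Log-linear score: sum_m c_m * h_m(a, f, e)."""
    return sum(c * h(a, f_sent, e_sent) for h, c in zip(features, weights))

def best_alignment(candidates, f_sent, e_sent, features, weights):
    """Decision rule: argmax over candidate alignments.
    The normalizer Z(f, e) is constant across candidates, so it drops out."""
    return max(candidates,
               key=lambda a: score(a, f_sent, e_sent, features, weights))
```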
Tasks
• Modeling: design feature functions which capture cross-lingual divergences
• Search: find the alignment with the highest probability
• Training: find optimal feature weights
  • Minimize alignment errors given some gold-standard alignments (notice: the alignments are no longer hidden!)
  • Supervised training, i.e. we evaluate against the gold standard
• Notice: feature functions may themselves result from some training procedure
  • E.g. use a statistical dictionary resulting from IBM-model alignment, trained on a large corpus
  • Here we have an additional training step on a small (hand-aligned) corpus (similar to MERT for the decoder)
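The standard metric for evaluating against a gold standard (and for the error-driven training on later slides) is the alignment error rate (AER) of Och & Ney, which distinguishes sure (S) and possible (P) gold links; a minimal sketch:

```python
def aer(hypothesis, sure, possible):
    """Alignment error rate (Och & Ney); lower is better.
    All arguments are sets of (j, i) link pairs; sure is a subset of possible."""
    a, s, p = set(hypothesis), set(sure), set(possible)
    return 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))
```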
2005 – Year of DWA
• Yang Liu, Qun Liu, and Shouxun Lin. 2005. Log-linear Models for Word Alignment.
• Abraham Ittycheriah and Salim Roukos. 2005. A Maximum Entropy Word Aligner for Arabic-English Machine Translation.
• Ben Taskar, Simon Lacoste-Julien, and Dan Klein. 2005. A Discriminative Matching Approach to Word Alignment.
• Robert C. Moore. 2005. A Discriminative Framework for Bilingual Word Alignment.
• Necip Fazil Ayan, Bonnie J. Dorr, and Christof Monz. 2005. NeurAlign: Combining Word Alignments Using Neural Networks.
Yang Liu et al. 2005
• Start out with features used in generative alignment
• Lexicon features, e.g. IBM1
  • Use both directions, p(fj | ei) and p(ei | fj) => symmetrical alignment model
  • And/or a symmetrized model
• Fertility model: p(φi | ei), the probability that word ei is linked to φi words
More Features
• Cross count: number of crossings in the alignment
• Neighbor count: number of links in the immediate neighborhood of a link
• Exact match: number of src/tgt pairs where src = tgt
• Linked word count: total number of links (to influence alignment density)
• Link types: counts of 1-1, 1-m, m-1, and n-m alignments
• Sibling distance: if a word is aligned to multiple words, add the distance between these aligned words
• Link co-occurrence count: given multiple alignments (e.g. Viterbi alignments from the IBM models), count how often links co-occur
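Illustrative implementations of two of these counts; the exact definitions in Liu et al. 2005 may differ in detail:

```python
from collections import Counter

def cross_count(links):
    """Cross count: pairs of links (j, i), (j', i') with j < j' but i > i'."""
    links = sorted(links)
    return sum(1 for n, (j, i) in enumerate(links)
               for (j2, i2) in links[n + 1:]
               if j < j2 and i > i2)

def link_type_counts(links):
    """Link types: classify each link by the fan-out of its source word
    and the fan-in of its target word."""
    fan_out = Counter(j for j, _ in links)    # targets per source word j
    fan_in = Counter(i for _, i in links)     # sources per target word i
    types = Counter()
    for j, i in links:
        if fan_out[j] == 1 and fan_in[i] == 1:
            types["1-1"] += 1
        elif fan_in[i] == 1:
            types["1-m"] += 1   # one source word linked to many targets
        elif fan_out[j] == 1:
            types["m-1"] += 1   # many source words linked to one target
        else:
            types["n-m"] += 1
    return types
```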
Search
• Greedy search based on the gain from adding a link
• For each of the features the gain can be calculated, e.g. for IBM1
• Algorithm:
  Start with the empty alignment
  Loop until no additional gain:
    Loop over all (j, i) not in the set:
      if gain(j, i) > best_gain then store as (j', i')
    Set link (j', i')
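A runnable version of this greedy loop; the gain function is left abstract (e.g. the change in a weighted feature sum based on the IBM1 lexicon), as the slide does not fix it:

```python
def greedy_align(J, I, gain):
    """Greedy link selection following the slide's algorithm.
    gain(links, j, i): score improvement from adding link (j, i) to links."""
    links = set()
    grid = {(j, i) for j in range(J) for i in range(I)}
    while True:
        best, best_gain = None, 0.0
        for j, i in grid - links:
            g = gain(links, j, i)
            if g > best_gain:
                best, best_gain = (j, i), g
        if best is None:        # no remaining link has positive gain
            return links
        links.add(best)
```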
Moore 2005
• Log-likelihood-ratio-based model
  • Measures word association strength
  • Values can get large for strongly associated word pairs
• Conditional-link-probability-based model
  • Estimated probability of two words being linked
  • Uses a simpler alignment model to establish the links
  • Adds simple smoothing
• Additional features: one-to-one, one-to-many, non-monotonicity
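A minimal two-binomial log-likelihood-ratio sketch for the association score; the exact statistic and count definitions in Moore's paper may be parameterized differently:

```python
import math

def llr(c_fe, c_f, c_e, n):
    """Log-likelihood-ratio association score for a word pair (f, e).
    c_fe: co-occurrence count, c_f / c_e: marginal counts, n: total events;
    assumes 0 < c_f < n. Strongly associated pairs get large scores."""
    def ll(k, m, p):                 # log-likelihood of k successes in m trials
        if p <= 0.0 or p >= 1.0:     # degenerate cases contribute 0
            return 0.0
        return k * math.log(p) + (m - k) * math.log(1.0 - p)
    p0 = c_e / n                     # null hypothesis: e independent of f
    p1 = c_fe / c_f                  # p(e | f)
    p2 = (c_e - c_fe) / (n - c_f)    # p(e | not f)
    return 2.0 * (ll(c_fe, c_f, p1) + ll(c_e - c_fe, n - c_f, p2)
                  - ll(c_fe, c_f, p0) - ll(c_e - c_fe, n - c_f, p0))
```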
Training
• Finding the optimal alignment is non-trivial
  • Adding a link can affect the non-monotonicity and one-to-many features
  • Dynamic programming does not work
  • Beam search could be used, but requires pruning
• Parameter optimization
  • Modified version of averaged perceptron learning
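For reference, a plain averaged perceptron in the style of Collins (2002); Moore's modified version differs in details the slide does not give:

```python
def averaged_perceptron(data, features, decode, epochs=10):
    """Generic averaged-perceptron training of the feature weights.
    data: (f_sent, e_sent, gold_alignment) triples;
    decode(f_sent, e_sent, w): best alignment under current weights w."""
    M = len(features)
    w, w_sum, steps = [0.0] * M, [0.0] * M, 0
    for _ in range(epochs):
        for f_sent, e_sent, gold in data:
            guess = decode(f_sent, e_sent, w)
            if guess != gold:   # update towards gold, away from the guess
                for m, h in enumerate(features):
                    w[m] += h(gold, f_sent, e_sent) - h(guess, f_sent, e_sent)
            for m in range(M):
                w_sum[m] += w[m]
            steps += 1
    return [s / steps for s in w_sum]   # averaging reduces overfitting
```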
Modeling Alignment with CRF
• A CRF is an undirected graphical model
  • Each vertex (node) represents a random variable whose distribution is to be inferred
  • Each edge represents a dependency between two random variables
  • The distribution of each discrete random variable Y in the graph is conditioned on an input sequence X
  • Cliques: sets of nodes in the graph that are fully connected
• In our case:
  • Features derived from the source and target words are the input sequence X
  • Alignment links are the random variables Y
• Different ways to model alignment:
  • Blunsom & Cohn (2006): many-to-one word alignment, where each source word is aligned to zero or one target words (-> asymmetric)
  • Niehues & Vogel (2008): model not a sequence, but the entire alignment matrix (-> symmetric)
Modeling Alignment Matrix
• Random variables yji for all possible alignment links
• Two values, 0/1: the word in position j is not linked / linked to the word in position i
• Represented as nodes in a graph
Modeling Alignment Matrix
• Factored nodes x represent the features (observables)
• They are linked to the random variables
• Define a potential for each yji, e.g. a log-linear potential φc(yji) = exp(Σm cm hm(x, yji))
Probability of Alignment
• The probability of an alignment is the normalized product of all potentials:
  p(y | x) = (1/Z(x)) Πc φc(x, yc)
• The partition function Z(x) sums this product over all possible alignment matrices y
Features
• Local features, e.g. lexical, POS, …
• Fertility features
• First-order features: capture the relation between links
• Phrase features: interaction between word and phrase alignment
Local Features
• Local information about the link probability
• Features derived from positions j and i only
• Factored node connected to only one random variable
• Features:
  • Lexical probabilities, also normalized to (f, e)
  • Word identity (e.g. for numbers, names)
  • Word similarity (e.g. cognates)
  • Relative position distance
  • Link indicator feature: is (j, i) linked in the Viterbi alignment from a generative alignment model
  • POS: indicator feature for every src/tgt POS pair
  • High-frequency word indicator feature for every src/tgt word pair among the most frequent words
Fertility Features
• Model word fertility, on the src and tgt side
• Link to all nodes in a row/column
• Constraint: model fertility only up to a maximum fertility N
• Indicator features: one for each fertility n <= N, one for all fertilities n > N
• Alternative: use fertility probabilities from IBM4 training
  • These differ from word to word
First-Order Features
• Links depend on the links of neighboring words
• Each feature always connects 2 link nodes
• Different features for different directions: (1,1), (1,2), (2,1), (1,0), …
• Capture distortion, similar to the HMM and IBM4 alignment models
• Indicator features that fire if both links are set
• Also a POS first-order feature: indicator feature for link(j,i), (POSj, POSi), and link(j+k, i+l)
• (A toy scoring sketch combining these feature groups follows below)
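A toy sketch of how the local, fertility, and first-order feature groups combine into an unnormalized log-score for a full 0/1 alignment matrix; the concrete features and weights are placeholders, not the lecture's:

```python
def log_score(y, local, fert_weight, fo_weight, max_fert=3):
    """Unnormalized log-score of a binary alignment matrix y[j][i].
    local[j][i]: precomputed weighted sum of local features for link (j, i)."""
    J, I = len(y), len(y[0])
    s = sum(local[j][i] for j in range(J) for i in range(I) if y[j][i])
    for j in range(J):                   # fertility indicator per source word
        if sum(y[j]) > max_fert:         # one feature for all fertilities n > N
            s += fert_weight
    for j in range(J - 1):               # first-order (1,1) "diagonal" feature
        for i in range(I - 1):
            if y[j][i] and y[j + 1][i + 1]:
                s += fo_weight
    return s                             # p(y | x) is proportional to exp(s)
```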
Inference – Finding the Best Alignment
• A word alignment corresponds to an assignment of the random variables
• => Find the most probable variable assignment
• Problem:
  • Complex model structure: many loops
  • No exact inference possible
• Solution:
  • Belief propagation algorithm
  • Inference by message passing
  • Runtime exponential in the number of connected nodes
Belief Propagation
• Messages are sent from random-variable nodes to factored nodes, and also in the opposite direction
• Start with some initial values, e.g. uniform
• In each iteration:
  • Calculate the messages sent from hidden node (j, i) to factored node c
  • Calculate the messages sent from factored node c to hidden node (j, i)
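A minimal sum-product sketch for binary link variables, assuming pairwise factors for simplicity; real fertility factors connect whole rows/columns and need the more general sum over neighbor assignments:

```python
def normalize(v):
    z = sum(v)
    return [x / z for x in v]

def var_to_factor(incoming, exclude):
    """Message from a binary link variable to factor `exclude`: the product
    of all other incoming factor-to-variable messages."""
    out = [1.0, 1.0]
    for c, msg in incoming.items():
        if c != exclude:
            out = [out[0] * msg[0], out[1] * msg[1]]
    return normalize(out)

def pairwise_factor_to_var(phi, msg_other):
    """Message from a pairwise factor phi[y1][y2] to its first variable,
    summing out the second variable weighted by that variable's message."""
    return normalize([sum(phi[y1][y2] * msg_other[y2] for y2 in (0, 1))
                      for y1 in (0, 1)])

def belief(incoming):
    """Belief of a link variable: normalized product of all incoming messages;
    belief(...)[1] is read as the posterior link probability p(y_ji = 1 | x)."""
    b = [1.0, 1.0]
    for msg in incoming.values():
        b = [b[0] * msg[0], b[1] * msg[1]]
    return normalize(b)
```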
Getting the Probability
• After several iterations, a belief value is calculated from the messages sent to the hidden nodes
• The belief value can be interpreted as a posterior probability
Training
• Maximum log-likelihood of the correct alignment
  • Use gradient descent to find the optimum
• Train towards minimum alignment error
  • Needs a smoothed version of the AER
  • Express the AER in terms of link indicator functions
  • Use a sigmoid of the link probability (see the sketch below)
• A 2-step approach can be used:
  1. Optimize towards ML
  2. Optimize towards AER
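A sketch of such a smoothed AER, under the assumption that the hard link indicators in the AER formula are simply replaced by sigmoids of the posterior link probabilities (the sharpness constant is a toy choice):

```python
import math

def sigmoid(p, k=10.0):
    """Soft link indicator: squashes a posterior link probability towards 0/1;
    k controls the sharpness."""
    return 1.0 / (1.0 + math.exp(-k * (p - 0.5)))

def smoothed_aer(link_probs, sure, possible):
    """Differentiable AER surrogate.
    link_probs: {(j, i): p(y_ji = 1 | x)}; sure/possible: gold link sets."""
    soft = {link: sigmoid(p) for link, p in link_probs.items()}
    n_a = sum(soft.values())
    n_as = sum(v for link, v in soft.items() if link in sure)
    n_ap = sum(v for link, v in soft.items() if link in possible)
    return 1.0 - (n_as + n_ap) / (n_a + len(sure))
```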
Some Results: Spanish-English
• Features:
  • IBM1 and IBM4 lexicons
  • Fertilities
  • Link indicator feature
  • POS features
  • Phrase features
• Impact on translation quality (BLEU scores)
Summary
• In the last 5 years, new efforts in word alignment
• Discriminative word alignment
  • Integrates many features
  • Needs a small amount of hand-aligned data to tune (train) the feature weights
• Different variants:
  • Log-linear modeling
  • Conditional random fields: over sequences and over the entire alignment matrix
• Significant improvements in word alignment error rate
• Not always improvements in translation quality
  • Different alignment density -> different phrase table size
  • Need to adjust the phrase extraction algorithms?