Machine Translation: Discriminative Word Alignment
Stephan Vogel, Spring Semester 2011
Generative Alignment Models
• Generative word alignment models: p(f, a | e)
  • Alignment a is a hidden variable: the actual word alignment is not observed
  • Sum over all alignments: p(f | e) = Σa p(f, a | e)
• Well-known models: IBM 1 … 5, HMM, ITG
  • Model lexical association, distortion, fertility
• It is difficult to incorporate additional information:
  • POS of words (used in the distortion model, not as direct link features)
  • A manual dictionary
  • Syntax information
  • …
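To make the "sum over all alignments" concrete, here is a minimal sketch of IBM Model 1 EM training in Python; the corpus format and variable names are illustrative, not taken from the slides:

```python
from collections import defaultdict

def train_ibm1(corpus, iterations=5):
    """Minimal IBM Model 1 EM trainer (a sketch, not the lecture's code).
    corpus: list of (f_sentence, e_sentence) pairs, each a list of tokens.
    Returns t with t[f][e] = p(f | e); a NULL token models unaligned words."""
    t = defaultdict(lambda: defaultdict(lambda: 1.0))  # flat initialization
    for _ in range(iterations):
        count = defaultdict(lambda: defaultdict(float))
        total = defaultdict(float)
        for f_sent, e_sent in corpus:               # E-step
            e_sent = ["NULL"] + e_sent
            for f in f_sent:
                # the hidden alignment is summed out over all positions
                norm = sum(t[f][e] for e in e_sent)
                for e in e_sent:
                    frac = t[f][e] / norm           # expected count of link (f, e)
                    count[e][f] += frac
                    total[e] += frac
        for e in count:                             # M-step: renormalize
            for f in count[e]:
                t[f][e] = count[e][f] / total[e]
    return t
```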
Discriminative Word Alignment
• Model the alignment directly: p(a | f, e)
• Find the alignment that maximizes p(a | f, e)
• A well-suited framework: maximum entropy
  • Set of feature functions hm(a, f, e), m = 1, …, M
  • Set of model parameters (feature weights) cm, m = 1, …, M
  • Decision rule: â = argmaxa Σm cm hm(a, f, e)
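A minimal sketch of this decision rule; the feature functions and the candidate set are placeholders, since how candidate alignments are generated is not fixed on this slide:

```python
def score(a, f_sent, e_sent, features, weights):
    """Log-linear score: sum_m c_m * h_m(a, f, e)."""
    return sum(c * h(a, f_sent, e_sent) for h, c in zip(features, weights))

def best_alignment(candidates, f_sent, e_sent, features, weights):
    """Decision rule: argmax over candidate alignments.
    The normalizer Z(f, e) is constant across candidates, so it drops out."""
    return max(candidates,
               key=lambda a: score(a, f_sent, e_sent, features, weights))
```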
Tasks
• Modeling: design feature functions which capture cross-lingual divergences
• Search: find the alignment with the highest probability
• Training: find optimal feature weights
  • Minimize alignment errors given some gold-standard alignments (notice: the alignments are no longer hidden!)
  • Supervised training, i.e. we evaluate against the gold standard
• Notice: feature functions may themselves result from some training procedure
  • E.g. use a statistical dictionary resulting from IBM-model alignment, trained on a large corpus
  • Here we have an additional training step on a small (hand-aligned) corpus (similar to MERT for the decoder)
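The standard metric for evaluating against a gold standard (and for the error-driven training on later slides) is the alignment error rate (AER) of Och & Ney, which distinguishes sure (S) and possible (P) gold links; a minimal sketch:

```python
def aer(hypothesis, sure, possible):
    """Alignment error rate (Och & Ney); lower is better.
    All arguments are sets of (j, i) link pairs; sure is a subset of possible."""
    a, s, p = set(hypothesis), set(sure), set(possible)
    return 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))
```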
2005 – Year of DWA
• Yang Liu, Qun Liu, and Shouxun Lin. 2005. Log-linear Models for Word Alignment.
• Abraham Ittycheriah and Salim Roukos. 2005. A Maximum Entropy Word Aligner for Arabic-English Machine Translation.
• Ben Taskar, Simon Lacoste-Julien, and Dan Klein. 2005. A Discriminative Matching Approach to Word Alignment.
• Robert C. Moore. 2005. A Discriminative Framework for Bilingual Word Alignment.
• Necip Fazil Ayan, Bonnie J. Dorr, and Christof Monz. 2005. NeurAlign: Combining Word Alignments Using Neural Networks.
Yang Liu et al. 2005
• Start out with features used in generative alignment
• Lexicon features, e.g. IBM1
  • Use both directions, p(fj | ei) and p(ei | fj) => symmetrical alignment model
  • And/or a symmetrized model
• Fertility model: p(φi | ei), the probability that word ei is linked to φi words
More Features
• Cross count: number of crossings in the alignment
• Neighbor count: number of links in the immediate neighborhood of a link
• Exact match: number of src/tgt pairs where src = tgt
• Linked word count: total number of links (to influence alignment density)
• Link types: counts of 1-1, 1-m, m-1, and n-m alignments
• Sibling distance: if a word is aligned to multiple words, add the distance between these aligned words
• Link co-occurrence count: given multiple alignments (e.g. Viterbi alignments from the IBM models), count how often links co-occur
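Illustrative implementations of two of these counts; the exact definitions in Liu et al. 2005 may differ in detail:

```python
from collections import Counter

def cross_count(links):
    """Cross count: pairs of links (j, i), (j', i') with j < j' but i > i'."""
    links = sorted(links)
    return sum(1 for n, (j, i) in enumerate(links)
               for (j2, i2) in links[n + 1:]
               if j < j2 and i > i2)

def link_type_counts(links):
    """Link types: classify each link by the fan-out of its source word
    and the fan-in of its target word."""
    fan_out = Counter(j for j, _ in links)    # targets per source word j
    fan_in = Counter(i for _, i in links)     # sources per target word i
    types = Counter()
    for j, i in links:
        if fan_out[j] == 1 and fan_in[i] == 1:
            types["1-1"] += 1
        elif fan_in[i] == 1:
            types["1-m"] += 1   # one source word linked to many targets
        elif fan_out[j] == 1:
            types["m-1"] += 1   # many source words linked to one target
        else:
            types["n-m"] += 1
    return types
```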
Search
• Greedy search based on the gain from adding a link
• For each of the features the gain can be calculated, e.g. for IBM1
• Algorithm:
  Start with the empty alignment
  Loop until no additional gain:
    Loop over all (j, i) not in the set:
      if gain(j, i) > best_gain then store as (j', i')
    Set link (j', i')
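A runnable version of this greedy loop; the gain function is left abstract (e.g. the change in a weighted feature sum based on the IBM1 lexicon), as the slide does not fix it:

```python
def greedy_align(J, I, gain):
    """Greedy link selection following the slide's algorithm.
    gain(links, j, i): score improvement from adding link (j, i) to links."""
    links = set()
    grid = {(j, i) for j in range(J) for i in range(I)}
    while True:
        best, best_gain = None, 0.0
        for j, i in grid - links:
            g = gain(links, j, i)
            if g > best_gain:
                best, best_gain = (j, i), g
        if best is None:        # no remaining link has positive gain
            return links
        links.add(best)
```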
Moore 2005
• Log-likelihood-ratio-based model
  • Measures word association strength
  • Values can get large for strongly associated word pairs
• Conditional-link-probability-based model
  • Estimated probability of two words being linked
  • Uses a simpler alignment model to establish the links
  • Adds simple smoothing
• Additional features: one-to-one, one-to-many, non-monotonicity
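A minimal two-binomial log-likelihood-ratio sketch for the association score; the exact statistic and count definitions in Moore's paper may be parameterized differently:

```python
import math

def llr(c_fe, c_f, c_e, n):
    """Log-likelihood-ratio association score for a word pair (f, e).
    c_fe: co-occurrence count, c_f / c_e: marginal counts, n: total events;
    assumes 0 < c_f < n. Strongly associated pairs get large scores."""
    def ll(k, m, p):                 # log-likelihood of k successes in m trials
        if p <= 0.0 or p >= 1.0:     # degenerate cases contribute 0
            return 0.0
        return k * math.log(p) + (m - k) * math.log(1.0 - p)
    p0 = c_e / n                     # null hypothesis: e independent of f
    p1 = c_fe / c_f                  # p(e | f)
    p2 = (c_e - c_fe) / (n - c_f)    # p(e | not f)
    return 2.0 * (ll(c_fe, c_f, p1) + ll(c_e - c_fe, n - c_f, p2)
                  - ll(c_fe, c_f, p0) - ll(c_e - c_fe, n - c_f, p0))
```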
Training
• Finding the optimal alignment is non-trivial
  • Adding a link can affect the non-monotonicity and one-to-many features
  • Dynamic programming does not work
  • Beam search could be used, but requires pruning
• Parameter optimization
  • Modified version of averaged perceptron learning
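For reference, a plain averaged perceptron in the style of Collins (2002); Moore's modified version differs in details the slide does not give:

```python
def averaged_perceptron(data, features, decode, epochs=10):
    """Generic averaged-perceptron training of the feature weights.
    data: (f_sent, e_sent, gold_alignment) triples;
    decode(f_sent, e_sent, w): best alignment under current weights w."""
    M = len(features)
    w, w_sum, steps = [0.0] * M, [0.0] * M, 0
    for _ in range(epochs):
        for f_sent, e_sent, gold in data:
            guess = decode(f_sent, e_sent, w)
            if guess != gold:   # update towards gold, away from the guess
                for m, h in enumerate(features):
                    w[m] += h(gold, f_sent, e_sent) - h(guess, f_sent, e_sent)
            for m in range(M):
                w_sum[m] += w[m]
            steps += 1
    return [s / steps for s in w_sum]   # averaging reduces overfitting
```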
Modeling Alignment with CRF
• A CRF is an undirected graphical model
  • Each vertex (node) represents a random variable whose distribution is to be inferred
  • Each edge represents a dependency between two random variables
  • The distribution of each discrete random variable Y in the graph is conditioned on an input sequence X
  • Cliques: sets of nodes in the graph that are fully connected
• In our case:
  • Features derived from the source and target words are the input sequence X
  • Alignment links are the random variables Y
• Different ways to model alignment:
  • Blunsom & Cohn (2006): many-to-one word alignment, where each source word is aligned to zero or one target words (-> asymmetric)
  • Niehues & Vogel (2008): model not a sequence, but the entire alignment matrix (-> symmetric)
Modeling Alignment Matrix
• Random variables yji for all possible alignment links
• Two values, 0/1: the word in position j is not linked / linked to the word in position i
• Represented as nodes in a graph
Modeling Alignment Matrix
• Factored nodes x represent the features (observables)
• They are linked to the random variables
• Define a potential for each yji, e.g. a log-linear potential φc(yji) = exp(Σm cm hm(x, yji))
Probability of Alignment
• The probability of an alignment is the normalized product of all potentials:
  p(y | x) = (1/Z(x)) Πc φc(x, yc)
• The partition function Z(x) sums this product over all possible alignment matrices y
Features
• Local features, e.g. lexical, POS, …
• Fertility features
• First-order features: capture the relation between links
• Phrase features: interaction between word and phrase alignment
Local Features
• Local information about the link probability
• Features derived from positions j and i only
• Factored node connected to only one random variable
• Features:
  • Lexical probabilities, also normalized to (f, e)
  • Word identity (e.g. for numbers, names)
  • Word similarity (e.g. cognates)
  • Relative position distance
  • Link indicator feature: is (j, i) linked in the Viterbi alignment from a generative alignment model
  • POS: indicator feature for every src/tgt POS pair
  • High-frequency word indicator feature for every src/tgt word pair among the most frequent words
Fertility Features
• Model word fertility, on the src and tgt side
• Link to all nodes in a row/column
• Constraint: model fertility only up to a maximum fertility N
• Indicator features: one for each fertility n <= N, one for all fertilities n > N
• Alternative: use fertility probabilities from IBM4 training
  • These differ from word to word
First-Order Features
• Links depend on the links of neighboring words
• Each feature always connects 2 link nodes
• Different features for different directions: (1,1), (1,2), (2,1), (1,0), …
• Capture distortion, similar to the HMM and IBM4 alignment models
• Indicator features that fire if both links are set
• Also a POS first-order feature: indicator feature for link(j,i), (POSj, POSi), and link(j+k, i+l)
• (A toy scoring sketch combining these feature groups follows below)
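A toy sketch of how the local, fertility, and first-order feature groups combine into an unnormalized log-score for a full 0/1 alignment matrix; the concrete features and weights are placeholders, not the lecture's:

```python
def log_score(y, local, fert_weight, fo_weight, max_fert=3):
    """Unnormalized log-score of a binary alignment matrix y[j][i].
    local[j][i]: precomputed weighted sum of local features for link (j, i)."""
    J, I = len(y), len(y[0])
    s = sum(local[j][i] for j in range(J) for i in range(I) if y[j][i])
    for j in range(J):                   # fertility indicator per source word
        if sum(y[j]) > max_fert:         # one feature for all fertilities n > N
            s += fert_weight
    for j in range(J - 1):               # first-order (1,1) "diagonal" feature
        for i in range(I - 1):
            if y[j][i] and y[j + 1][i + 1]:
                s += fo_weight
    return s                             # p(y | x) is proportional to exp(s)
```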
Inference – Finding the Best Alignment
• A word alignment corresponds to an assignment of the random variables
• => Find the most probable variable assignment
• Problem:
  • Complex model structure: many loops
  • No exact inference possible
• Solution:
  • Belief propagation algorithm
  • Inference by message passing
  • Runtime exponential in the number of connected nodes
Belief Propagation
• Messages are sent from random-variable nodes to factored nodes, and also in the opposite direction
• Start with some initial values, e.g. uniform
• In each iteration:
  • Calculate the messages sent from hidden node (j, i) to factored node c
  • Calculate the messages sent from factored node c to hidden node (j, i)
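A minimal sum-product sketch for binary link variables, assuming pairwise factors for simplicity; real fertility factors connect whole rows/columns and need the more general sum over neighbor assignments:

```python
def normalize(v):
    z = sum(v)
    return [x / z for x in v]

def var_to_factor(incoming, exclude):
    """Message from a binary link variable to factor `exclude`: the product
    of all other incoming factor-to-variable messages."""
    out = [1.0, 1.0]
    for c, msg in incoming.items():
        if c != exclude:
            out = [out[0] * msg[0], out[1] * msg[1]]
    return normalize(out)

def pairwise_factor_to_var(phi, msg_other):
    """Message from a pairwise factor phi[y1][y2] to its first variable,
    summing out the second variable weighted by that variable's message."""
    return normalize([sum(phi[y1][y2] * msg_other[y2] for y2 in (0, 1))
                      for y1 in (0, 1)])

def belief(incoming):
    """Belief of a link variable: normalized product of all incoming messages;
    belief(...)[1] is read as the posterior link probability p(y_ji = 1 | x)."""
    b = [1.0, 1.0]
    for msg in incoming.values():
        b = [b[0] * msg[0], b[1] * msg[1]]
    return normalize(b)
```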
Getting the Probability
• After several iterations, a belief value is calculated from the messages sent to the hidden nodes
• The belief value can be interpreted as a posterior probability
Training
• Maximum log-likelihood of the correct alignment
  • Use gradient descent to find the optimum
• Train towards minimum alignment error
  • Needs a smoothed version of the AER
  • Express the AER in terms of link indicator functions
  • Use a sigmoid of the link probability (see the sketch below)
• A 2-step approach can be used:
  1. Optimize towards ML
  2. Optimize towards AER
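A sketch of such a smoothed AER, under the assumption that the hard link indicators in the AER formula are simply replaced by sigmoids of the posterior link probabilities (the sharpness constant is a toy choice):

```python
import math

def sigmoid(p, k=10.0):
    """Soft link indicator: squashes a posterior link probability towards 0/1;
    k controls the sharpness."""
    return 1.0 / (1.0 + math.exp(-k * (p - 0.5)))

def smoothed_aer(link_probs, sure, possible):
    """Differentiable AER surrogate.
    link_probs: {(j, i): p(y_ji = 1 | x)}; sure/possible: gold link sets."""
    soft = {link: sigmoid(p) for link, p in link_probs.items()}
    n_a = sum(soft.values())
    n_as = sum(v for link, v in soft.items() if link in sure)
    n_ap = sum(v for link, v in soft.items() if link in possible)
    return 1.0 - (n_as + n_ap) / (n_a + len(sure))
```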
Some Results: Spanish-English
• Features:
  • IBM1 and IBM4 lexicons
  • Fertilities
  • Link indicator feature
  • POS features
  • Phrase features
• Impact on translation quality (BLEU scores)
Summary
• In the last 5 years, new efforts in word alignment
• Discriminative word alignment
  • Integrates many features
  • Needs a small amount of hand-aligned data to tune (train) the feature weights
• Different variants:
  • Log-linear modeling
  • Conditional random fields: over sequences and over the entire alignment matrix
• Significant improvements in word alignment error rate
• Not always improvements in translation quality
  • Different alignment density -> different phrase table size
  • Need to adjust the phrase extraction algorithms?