Towards Syntactically Constrained Statistical Word Alignment Greg Hanneman 11-734: Advanced Machine Translation Seminar April 30, 2008
Outline • The word alignment problem • Base approaches • Syntax-based approaches • Distortion models • Tree-to-string models • Tree-to-tree models • Discussion
Word Alignment • Parallel sentence pair: F and E • Most general: map a subset of F to a subset of E
Word Alignment • Very large alignment spaces! • An n-word parallel sentence pair has n² possible links and 2^(n²) possible alignments • Restrict to one-to-one alignments: n! possible alignments • Alignment models try to restrict or learn a probability distribution over this space to get the “best” alignment of a sentence
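A quick back-of-the-envelope illustration of these counts (a sketch with an arbitrary sentence length, ignoring NULL alignments):

```python
from math import factorial

n = 10  # hypothetical sentence length on each side

links = n * n                  # n^2 possible links (f_j, e_i)
all_alignments = 2 ** links    # any subset of the links is an alignment: 2^(n^2)
one_to_one = factorial(n)      # one-to-one with reordering: n!

print(links, all_alignments, one_to_one)
# 100 1267650600228229401496703205376 3628800
```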
Outline • The word alignment problem • Base approaches • Syntax-based approaches • Distortion models • Tree-to-string models • Tree-to-tree models • Discussion
A Generative Story [Brown et al. 1990] • English sentence: The proposal will not be implemented • Fertility and lexical generation: Les propositions seront ne pas mises en application • Distortion: Les propositions ne seront pas mises en application
The Framework • F: words f1 … fj … fn • E: words e1 … ei … em • Compute P(F, A | E) for hidden alignment variable A: a1 … aj … an • aj = i: word fj is aligned to word ei • The major step: decomposition, model parameters, EM algorithm, etc.
The IBM Models[Brown et al. 1993; Och and Ney 2003] • Model 1: “Bag of words” — word order doesn’t affect alignment • Model 2: Position of words being aligned does matter
The IBM Models [Brown et al. 1993; Och and Ney 2003] • Later models use more structural and linguistic information, but only implicitly, and never overt syntax • Fertility: P(φ | ei), the probability of ei producing φ words in F • Distortion: P(τ, π | E) for a set of F words τ placed in a permutation π • Previous alignments: probabilities for the F positions of the several words produced by a fertile ei
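To make the lexical-translation core concrete, here is a minimal sketch of IBM Model 1 EM in Python. The toy corpus, the uniform initialization, and the omission of NULL words, fertility, and distortion are all simplifications for illustration, not the papers' setup:

```python
from collections import defaultdict

# Toy parallel corpus of (E, F) pairs; a real system trains on millions of words.
corpus = [
    ("the proposal".split(), "les propositions".split()),
    ("the application".split(), "l' application".split()),
]

t = defaultdict(lambda: 0.25)  # t(f | e), initialized uniformly

for _ in range(10):            # EM iterations
    count, total = defaultdict(float), defaultdict(float)
    for e_sent, f_sent in corpus:
        for f in f_sent:
            z = sum(t[(f, e)] for e in e_sent)       # E-step: normalize over e
            for e in e_sent:
                c = t[(f, e)] / z
                count[(f, e)] += c
                total[e] += c
    for (f, e), c in count.items():                  # M-step: re-estimate t(f | e)
        t[(f, e)] = c / total[e]

print(round(t[("les", "the")], 3), round(t[("propositions", "proposal")], 3))
```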
The HMM Model [Vogel et al. 1996; Och and Ney 2003] • Linguistic intuition: words, and their alignments, tend to clump together • aj depends on the absolute size of the “jump” between it and aj–1
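A minimal sketch of the jump-based distortion idea. The alignment sequences below are hypothetical; a real HMM aligner estimates the jump distribution inside EM rather than from fixed alignments:

```python
from collections import Counter

# a_j sequences for two hypothetical sentence pairs: the E position each F word aligns to
alignments = [[1, 2, 2, 3, 5], [1, 1, 2, 4, 5]]

jumps = Counter()
for a in alignments:
    for prev, cur in zip(a, a[1:]):
        jumps[cur - prev] += 1          # HMM distortion depends only on the jump size

total = sum(jumps.values())
print({d: c / total for d, c in sorted(jumps.items())})
# {0: 0.25, 1: 0.5, 2: 0.25} -- small forward jumps dominate: alignments clump together
```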
Discriminative Training • Consider all possible alignments, score them, and pick the best under some set of constraints • Can incorporate arbitrary features, where generative models are largely fixed • EM for generative models requires lots of unlabeled training data; discriminative training requires some labeled data
Discriminative Alignment [Taskar et al. 2005] • Example: score candidate links between “The proposal will not be implemented” and “Les propositions ne seront pas mises en application” • Features: • Co-occurrence • Position difference • Co-occurrence of following words • Word-frequency rank • Model 4 prediction • …
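A sketch of the discriminative idea: score every candidate link with a weighted feature vector and take the best one-to-one matching. The two features and the weights below are invented for illustration; Taskar et al. learn the weights from labeled data and use a much richer feature set:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

e_words = "the proposal will not be implemented".split()
f_words = "les propositions ne seront pas mises".split()

def features(i, j, e, f):
    """Hypothetical link features; the real model adds co-occurrence counts,
    frequency rank, Model 4 predictions, features of neighboring words, etc."""
    return np.array([
        1.0 if (e, f) in {("the", "les"), ("proposal", "propositions")} else 0.0,
        -abs(i / len(e_words) - j / len(f_words)),   # relative position difference
    ])

w = np.array([2.0, 1.0])   # feature weights, learned discriminatively in practice

score = np.array([[w @ features(i, j, e, f) for j, f in enumerate(f_words)]
                  for i, e in enumerate(e_words)])

rows, cols = linear_sum_assignment(-score)   # maximum-weight one-to-one matching
print([(e_words[i], f_words[j]) for i, j in zip(rows, cols)])
```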
Outline • The word alignment problem • Base approaches • Syntax-based approaches • Distortion models • Tree-to-string models • Tree-to-tree models • Discussion
Syntax-Based Approaches • Constrain the alignment space by looking beyond the flat text stream: take higher-level sentence structure into account • Representations • Constituency structure • Inversion Transduction Grammar • Dependency structure
Syntax-Based Distortion [DeNero and Klein 2007] • Syntax-based MT should start from syntax-aware word alignments • HMM model + target-language parse trees: prefer alignments that respect the tree • Handled in the distortion model: jumps should reflect the tree structure
Syntax-Based Distortion[DeNero and Klein 2007] • HMM distortion: size of jump between aj–1 and aj • Syntactic distortion: tree path between aj–1 and aj
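A toy illustration of the difference (not the paper's actual parameterization): two surface jumps of the same size can have very different paths through the target parse tree. The tree below is hypothetical and hard-coded as a parent map:

```python
# Parent map for a toy parse of "the proposal will not be implemented" (hypothetical).
parent = {"the": "NP", "proposal": "NP", "will": "VP", "not": "VP",
          "be": "VP2", "implemented": "VP2", "NP": "S", "VP": "S", "VP2": "VP"}

def path_to_root(node):
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def tree_path(u, v):
    """Nodes crossed going from word u to word v via their lowest common ancestor."""
    up, vp = path_to_root(u), path_to_root(v)
    lca = next(n for n in up if n in set(vp))
    return up[:up.index(lca)] + [lca] + list(reversed(vp[:vp.index(lca)]))

# Both jumps move one word to the right, but the tree paths differ:
print(tree_path("will", "not"))        # ['will', 'VP', 'not']
print(tree_path("proposal", "will"))   # ['proposal', 'NP', 'S', 'VP', 'will']
```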
Syntax-Based Distortion [DeNero and Klein 2007] • Training: 100,000 parallel French–English and Chinese–English sentences with English parse trees • Both E→F and F→E; combined with different unions and intersections, plus thresholds • Test: hand-aligned Hansards and NIST MT 2002 data
Syntax-Based Distortion [DeNero and Klein 2007] • The two HMM variants score roughly equally, and both beat GIZA++ • Best combinations: soft union for French, hard union for Chinese, plus competitive thresholding
Tree-to-String Models • New generative story • Word-level fertility and distortion replaced with node insertion and sibling reordering • Lexical translation still the same • Word alignment produced as a side effect from lexical translations
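A tiny sketch of the sibling-reordering step. The node, the probabilities, and the back-off scheme are invented for illustration; the real models estimate a reorder table with EM:

```python
from math import factorial

# Reorder table for the children of one hypothetical English node.
children = ("PRP", "VB", "NP")
reorder_prob = {("PRP", "NP", "VB"): 0.6,   # verb-final order, e.g. for Japanese
                ("PRP", "VB", "NP"): 0.3}   # original order kept

def p_reorder(order):
    """Probability of emitting the children in a given order; unseen orders share
    the leftover mass uniformly (an invented back-off, just for the sketch)."""
    leftover = 1.0 - sum(reorder_prob.values())
    unseen = factorial(len(children)) - len(reorder_prob)
    return reorder_prob.get(order, leftover / unseen)

print(p_reorder(("PRP", "NP", "VB")))   # 0.6
print(p_reorder(("NP", "VB", "PRP")))   # 0.025: leftover mass for an unseen order
```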
Tree-to-String Alignment[Yamada and Knight 2001] • Discussed in other sessions this semester • Training: 2121 short Japanese–English sentences, modified Collins parser output for English • Test: First 50 sentences of training corpus • Beat IBM Model 5 on human judgements; perplexity between Model 1 and Model 5
Subtree Cloning[Gildea 2003] • Original tree-to-string model is too strict • Syntactic divergences, reordering • Soft constraint: allow alignments that violate tree structure, but at a cost • Tweak the tree side of the alignment to contain things needed for the string side • Ex.: SVO to OSV
Subtree Cloning [Gildea 2003] • [Figures: parse trees over the English words “I do understand your language entirely” before and after the cloning operation, and the resulting tree-to-string alignment to the foreign sentence, with some foreign words aligned to NULL]
Subtree Cloning [Gildea 2003] • For a node np: • Probability of inserting a clone as a new child of np: a single EM-learned constant shared by all np • Probability that the clone is a copy of node nc: uniform over all nc • Surprising that this works…
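Written out as code, the cloning distribution described above is just a constant times a uniform choice (the numbers and node names are hypothetical):

```python
p_insert = 0.1                                # single EM-learned constant, same for every n_p
candidates = ["NP-subj", "VP", "ADVP", "NN"]  # hypothetical nodes available for cloning

def p_clone(n_c):
    """P(insert a clone under n_p) * P(the clone is a copy of n_c), uniform over n_c."""
    return p_insert / len(candidates)

print(p_clone("NP-subj"))   # 0.025, identical for every candidate node
```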
Subtree Cloning[Gildea 2003] • Compared with IBM 1–3, basic tree-to-string, basic tree-to-tree models • Training: 4982 Korean–English sentence pairs, with manual Korean parse trees • Test: 101 hand-aligned held-out sentences
Subtree Cloning[Gildea 2003] • Cloning helps: as good or better than IBM • Tree-to-tree model runs faster
Tree-to-Tree Models • Alignment must conform to tree structure on both sides — space is more constrained • Requires more transformation operations to handle divergent structures [Gildea 2003] • Or we could be more permissive…
Inversion Transduction Grammar [Wu 1997] • For bilingual parsing; get one-to-one word alignment as a side effect • Parallel binary-branching trees with reordering
ITG Operations • A → [A A] • Produce “A1 A2” in source and target streams • A → <A A> • Produce “A1 A2” in source stream, “A2 A1” in target stream • A → e / f • Produce “e” in source stream, “f” in target stream
ITG Operations • “Canonical form” ITG produces only one derivation for a given alignment • S → A | B | C • A → [A B] | [B B] | [C B] | [A C] | [B C] | [C C] • B → <A A> | <B A> | <C A> | <A C> | <B C> | <C C> • C → e / f
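To make the ITG space concrete, here is a minimal bracketing-ITG aligner (a plain dynamic program, not the canonical-form grammar above): it chooses between straight [A A] and inverted <A A> combinations over nested source and target spans. The sentences and link scores are toy values, and NULL links are omitted, so both sides must have equal length:

```python
from functools import lru_cache

e = ["the", "proposal", "will"]          # hypothetical sentence pair
f = ["seront", "les", "propositions"]

link = {("the", "les"): 1.0, ("proposal", "propositions"): 1.0,
        ("will", "seront"): 1.0}

def score(i, j):
    return link.get((e[i], f[j]), -0.5)  # toy link scores

@lru_cache(maxsize=None)
def best(i1, i2, j1, j2):
    """Best total link score for aligning e[i1:i2] with f[j1:j2] under ITG."""
    if i2 - i1 == 1 and j2 - j1 == 1:
        return score(i1, j1)                                         # terminal: A -> e / f
    result = float("-inf")
    for si in range(i1 + 1, i2):
        for sj in range(j1 + 1, j2):
            straight = best(i1, si, j1, sj) + best(si, i2, sj, j2)   # A -> [A A]
            inverted = best(i1, si, sj, j2) + best(si, i2, j1, sj)   # A -> <A A>
            result = max(result, straight, inverted)
    return result

print(best(0, len(e), 0, len(f)))   # 3.0: the inverted rule handles the reordering
```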
Alignment with ITG[Zhang and Gildea 2004] • Compared IBM 1, IBM 4, ITG, and tree-to-string (with and without cloning) • Training: Chinese–English (18,773) and French–English (20,000) sentences less than 25 words long • Test: Hand-aligned Chinese–English (48) and French–English (447)
Alignment with ITG[Zhang and Gildea 2004] • ITG best, or at least as good as IBM or tree-to-string plus cloning • ITG has no linguistic syntax…
Dependency Parsing • Discussed in other sessions this semester • Notion of violating “phrasal cohesion” • Usually bad, but not always
Dependencies + ITG[Cherry and Lin 2006] • Find invalid dependency spans; assign score of –∞ if used by the ITG parser • Simple model: maximize co-occurrence score with penalty for distant words • ITG reduces AER by 13% relative; dependencies + ITG reduce by 34%
Dependencies + ITG[Cherry and Lin 2006] • Discriminative training with an SVM • Feature vector for each ITG rule instance • Features from Taskar et al. [2005] • Feature marking ITG inversion rules • Feature (penalty) marking invalid spans based on dependency tree
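A sketch of how the constraint plugs into scoring, in both its hard and soft forms. The set of invalid spans is only a placeholder here; in the real model it is derived from the English dependency tree:

```python
import math

# Placeholder: in Cherry and Lin's model these spans come from the dependency tree.
invalid_spans = {(1, 3), (2, 5)}

def span_score(base_score, span, hard=True, penalty_weight=-2.0):
    """Hard constraint: -inf for invalid spans. Soft constraint: a penalty feature."""
    if span in invalid_spans:
        return -math.inf if hard else base_score + penalty_weight
    return base_score

print(span_score(4.0, (1, 3), hard=True))    # -inf: the ITG parser can never use this span
print(span_score(4.0, (1, 3), hard=False))   # 2.0: discouraged but still available
print(span_score(4.0, (0, 2)))               # 4.0: valid span, unchanged
```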
Dependencies + ITG[Cherry and Lin 2006] • Compared Taskar et al. to D-ITG with hard and soft constraints • Training: 50,000 French–English sentence pairs for counts and probabilities; 100 hand-annotated pairs with derived ITG trees for discriminative training • Test: 347 hand-annotated sentences from 2003 parallel text workshop
Dependencies + ITG [Cherry and Lin 2006] • The relative improvement from the dependency constraint is smaller in the discriminative training scenario, where the objective function is already stronger • The hard constraint starts to hurt recall
Outline • The word alignment problem • Base approaches • Syntax-based approaches • Distortion models • Tree-to-string models • Tree-to-tree models • Discussion
All These Tradeoffs… • Mathematical and statistical correctness vs. computability • Simple model vs. capturing linguistic phenomena • Not enough syntactic information vs. too much syntactic information • Ruling out bad alignments vs. keeping good alignments around
Alignment Spaces • Completely unconstrained: every alignment link (ei, fj) either “on” or “off” • Permutation space: one-to-one alignment with reordering [Taskar et al. 2005] • ITG space: permutation space satisfying binary tree constraint [Wu 1997] • Dependency space: permutation space maintaining phrasal cohesion
Alignment Spaces • D-ITG space: Dependency ∩ ITG space [Cherry and Lin 2006] • HD-ITG space: D-ITG space where each span must contain a head [Cherry and Lin 2006a]
Examining Alignment Spaces [Cherry and Lin 2006a] • Two ways to score the alignments in each space: • Learned co-occurrence score • Gold-standard oracle score
Examining Alignment Spaces[Cherry and Lin 2006a] • Learned co-occurrence score • More restricted spaces give better results
Examining Alignment Spaces[Cherry and Lin 2006a] • Oracle score: subsets of permutation space • ITG rules out almost nothing correct • Beam search in dependency space does worst
Conclusions • Base alignment models are mathematically grounded, but have limited notions of sentence structure • Syntax-aware alignment is helpful for syntax-aware MT [DeNero and Klein 2007] • Using structure as a hard constraint is harmful for divergent sentences; tweaking trees [Gildea 2003] or using soft constraints [Cherry and Lin 2006] helps fix this
Conclusions • Surprise winner: ITG • Computationally straightforward • Permissive, simple grammar that mostly only rules out bad alignments [Cherry and Lin 2006a] • Does a lot, even when it’s not the best • Discriminative framework looks promising and flexible — can incorporate generative models as features [Taskar et al. 2005]
Towards the Future • Easy-to-run GIZA++ made complicated IBM models the norm — promising discriminative or syntax-based models currently lack such a toolkit • Syntax-based discriminative techniques — morphology, POS, semantic information… • Any other ideas?