Competitive Grouping in Integrated Segmentation and Alignment Model Ying Zhang Stephan Vogel Language Technologies Institute School of Computer Science Carnegie Mellon University
Integrated Segmentation and Alignment Model
• Phrase alignment models (Och et al., 1999; Marcu and Wong, 2002; Koehn et al., 2003)
  • Many of these models rely on a pre-calculated word alignment
  • They use different heuristics to extract phrase pairs from the Viterbi word alignment path
• Integrated Segmentation and Alignment (ISA) model (Zhang, 2003)
  • No word alignment needed
  • Segments source and target sentences into phrases and aligns them simultaneously
  • Uses the chi-square statistic χ²(f, e) instead of the conditional probability P(f|e) for word pair associations
  • Greedy search for phrase pairs
• Key idea: the competitive grouping algorithm
  • Inspired by the competitive linking algorithm (Melamed, 1997) for word alignment
Competitive Linking Algorithm
• A greedy word alignment algorithm
• The word pair with the highest likelihood L(f, e) "wins" the competition
• One-to-one assumption: once pair {f, e} is linked, neither f nor e can be aligned with any other word
• Example:
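The greedy, one-to-one linking procedure described above can be sketched as follows. This is a minimal illustration, not the original implementation; the likelihood table is assumed to be given as a dict, and words are treated as distinct types for simplicity.

```python
def competitive_linking(src_words, tgt_words, likelihood):
    """Greedily link word pairs in order of decreasing likelihood L(f, e).

    Under the one-to-one assumption, once a pair {f, e} is linked,
    neither word may participate in any further link.
    """
    # Rank all candidate pairs by likelihood, best first.
    candidates = sorted(
        ((likelihood.get((f, e), 0.0), f, e)
         for f in src_words for e in tgt_words),
        reverse=True,
    )
    linked_src, linked_tgt, links = set(), set(), []
    for score, f, e in candidates:
        if score <= 0.0:
            break  # no evidence of association left
        if f in linked_src or e in linked_tgt:
            continue  # one-to-one: both words must still be free
        links.append((f, e))
        linked_src.add(f)
        linked_tgt.add(e)
    return links
```

For example, with `likelihood = {("je", "I"): 0.9, ("déclare", "declare"): 0.8, ("je", "declare"): 0.5}`, the pair ("je", "I") wins first, which blocks the weaker ("je", "declare") link.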
Competitive Grouping Algorithm
• Discards the one-to-one assumption of competitive linking, making it less greedy
• When a pair {f, e} wins the competition, it invites neighboring pairs to join the "winner's club"
• Introduces a locality assumption: a source phrase of adjacent words can only be aligned to a target phrase of adjacent words
• Words inside an aligned phrase pair cannot be aligned to any other words
Expanding the Aligned Phrase Pair
• Two criteria must be satisfied to expand the seed word pair into a phrase pair
  • 1. If a new source word f is to be grouped, the best target word e that f is associated with must not be "blocked" by the expansion; the same holds for grouping a new target word
  • 2. The highest word pair likelihood value in the expanded area must be "similar" to the seed value
• By the locality assumption, words inside an aligned phrase pair cannot be aligned with other words again
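The two expansion criteria above can be sketched for the source side as a single check. This is a hypothetical illustration, not the paper's implementation: likelihoods are assumed to be a matrix `L[f][e]`, and the "similar to the seed value" threshold is modeled as an assumed ratio parameter.

```python
def can_expand(f_idx, tgt_span, seed_score, L, ratio=0.5):
    """Check both expansion criteria for grouping source word f.

    Criterion 1: f's best-scoring target word must lie inside the
    proposed target span, so the expansion does not "block" f from
    its preferred partner.
    Criterion 2: f's best likelihood must be "similar" to the seed
    pair's likelihood (here: at least ratio * seed_score, an assumed
    threshold).
    """
    lo, hi = tgt_span
    best_e = max(range(len(L[f_idx])), key=lambda e: L[f_idx][e])
    best_score = L[f_idx][best_e]
    return lo <= best_e <= hi and best_score >= ratio * seed_score
```

A symmetric check would apply when grouping a new target word into the source span.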
Exploring All Possible Phrase Pairs
• Criterion 2 controls the granularity of the aligned phrase pairs
  • Two short phrase pairs
  • Or one long phrase pair
• Short phrases give better coverage of unseen test data
• Long phrases encapsulate more context, e.g. local reordering and word sense
• It is hard to decide on the optimal granularity without knowing the test data
• Solution: for each grouping, try all possible granularities
Exploring All Possible Phrase Pairs
• French: Je déclare reprise la session
• English: I declare resumed the session
The Likelihood of Word Associations
• The chi-square statistic is used to measure the likelihood of word association for a pair {e, f}
• For each word pair {e, f}, the null hypothesis is: e and f are independent of each other
• Compute χ² to measure how strongly the data contradict this hypothesis
• Construct the contingency table from corpus counts given the current alignment, e.g. a uniform alignment
  • O11: number of times e and f are aligned to each other
  • O12: number of times e is aligned with some other f
  • O21: number of times f is aligned with some other e
  • O22: number of times some other f is aligned with some other e
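Given the four counts above, the Pearson chi-square statistic for the 2×2 contingency table can be computed as in this minimal sketch (a standard textbook formulation, assuming the counts have already been collected from the corpus):

```python
def chi_square(o11, o12, o21, o22):
    """Pearson chi-square statistic for a 2x2 contingency table.

    A large value means the null hypothesis (e and f are independent)
    is unlikely, i.e. the word pair is strongly associated.
    """
    n = o11 + o12 + o21 + o22
    # Row and column marginals of the table.
    row1, row2 = o11 + o12, o21 + o22
    col1, col2 = o11 + o21, o12 + o22
    chi2 = 0.0
    for obs, row, col in ((o11, row1, col1), (o12, row1, col2),
                          (o21, row2, col1), (o22, row2, col2)):
        # Expected count for each cell under independence: row * col / n
        exp = row * col / n
        if exp > 0:
            chi2 += (obs - exp) ** 2 / exp
    return chi2
```

For a 2×2 table this is equivalent to the closed form n·(O11·O22 − O12·O21)² / (R1·R2·C1·C2), where R and C are the row and column totals.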
In WPT-05
• Submitted results for all four languages
• Training data as provided
• Language model as provided
• Decoder (Pharaoh) as provided
Conclusion
• The competitive grouping algorithm is at the core of the ISA model
• A simple and efficient model
• Results comparable to other phrase alignment models