Improving SMT Performance by Training Data Selection and Optimization Yajuan Lü, Jin Huang and Qun Liu EMNLP, 2007 Presented by Mei Yang, May 12th, 2008
Goal and Approach • Translation model adaptation/optimization • Relevant data is better than more data • select test-related training data with IR • optimize the distribution of training data • Offline model optimization • Online model optimization
Select Relevant Data • Query: a test sentence • Document: a source-language training sentence • Information retrieval uses TF-IDF term weighting • both the query and the document are represented as vectors • the similarity score is the cosine of the two vectors (see the sketch below)
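A minimal sketch of this retrieval step, not the authors' implementation: function names (`build_idf`, `retrieve_top_n`) are illustrative, each training sentence is treated as a document when estimating IDF, and similarity is the cosine of TF-IDF vectors.

```python
import math
from collections import Counter

def build_idf(training_src):
    """Estimate IDF from the source side of the training corpus,
    treating each sentence as a document (an assumption of this sketch)."""
    df = Counter()
    for sent in training_src:
        df.update(set(sent.split()))
    total = len(training_src)
    return {w: math.log(total / df[w]) for w in df}

def tfidf_vector(tokens, idf):
    tf = Counter(tokens)
    return {w: tf[w] * idf.get(w, 0.0) for w in tf}

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u if w in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve_top_n(query, training_src, idf, n=100):
    """Return the indices of the N training sentences most similar to the query."""
    qv = tfidf_vector(query.split(), idf)
    scored = [(cosine(qv, tfidf_vector(s.split(), idf)), i)
              for i, s in enumerate(training_src)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:n]]
```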
Offline Model Optimization • For each sentence in the devset and the testset, retrieve the top-N similar sentences from the original training set T • All retrieved sentence pairs form the adaptive training set D • with or without duplicated instances • Train the adaptive model on D • Train the optimized model by adding D into T • addresses the data sparseness issue of the adaptive model • in practice, adding D into T amounts to adjusting the count of each training sentence pair in T (Fig 1)
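A hedged sketch of this offline procedure (function names are illustrative, not from the paper); it reuses `retrieve_top_n` and `build_idf` from the retrieval sketch above and shows how adding D into T reduces to count adjustment.

```python
from collections import Counter

def build_adaptive_indices(dev_and_test, bitext, idf, n=100, keep_duplicates=True):
    """Collect, for every dev/test sentence, the indices of its top-N
    most similar training pairs; D is then [bitext[i] for i in the result]."""
    src = [s for s, _ in bitext]
    retrieved = []
    for sent in dev_and_test:
        retrieved.extend(retrieve_top_n(sent, src, idf, n))   # from sketch above
    if not keep_duplicates:
        retrieved = list(dict.fromkeys(retrieved))            # distinct instances
    return retrieved

def optimized_counts(bitext, adaptive_indices):
    """Adding D into T: every pair in T keeps its original count of 1,
    plus one extra count per time it was retrieved into D (cf. Fig 1)."""
    counts = Counter({i: 1 for i in range(len(bitext))})
    counts.update(adaptive_indices)
    return counts
```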
Online Model Optimization • Train a general model on the entire data and several candidate sub-models on prepared subsets, which can be obtained by • dividing the data by its origin (the approach taken in this paper) • a clustering method • using IR with a small amount of domain-specific data • Translate with a log-linear model whose weights are optimized online • given a test sentence, retrieve the top-N similar sentences from the original training set T • set the sub-model weights according to the proportions of retrieved sentences used to train each sub-model • four different weighting schemes are proposed (page 346); the simplest is sketched below
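A sketch of the simplest proportion-based weighting scheme only (the paper proposes four; the others differ in how the proportions are turned into weights). The function names and the log-linear combination below are illustrative assumptions, and `retrieve_top_n` / `build_idf` come from the retrieval sketch above.

```python
import math
from collections import Counter

def submodel_weights(test_sentence, training_src, origin, idf, n=500):
    """origin[i] names the sub-corpus training sentence i came from.
    Each sub-model's weight is the fraction of the top-N retrieved
    sentences drawn from its sub-corpus."""
    top = retrieve_top_n(test_sentence, training_src, idf, n)  # from sketch above
    hits = Counter(origin[i] for i in top)
    return {corpus: hits[corpus] / n for corpus in set(origin)}

def combined_score(phrase_scores, weights):
    """Log-linearly combine per-sub-model scores for one phrase pair:
    phrase_scores maps sub-corpus name -> that sub-model's probability."""
    return sum(w * math.log(phrase_scores.get(c, 1e-9))
               for c, w in weights.items())
```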
Experiment: Data • Chinese-to-English translation task • 200K sentence pairs randomly selected from each of three corpora: FBIS, HK_Hansards, and HK_News, giving a training set of 600K sentence pairs in total • Devset and testset • Offline: NIST02 as devset and NIST05 as testset (both with 4 references) • Online: 500 randomly selected sentence pairs from each of the three corpora, plus 500 sentence pairs from NIST05, giving a testset of 2K sentence pairs in total (1 reference)
Experiment #1: Offline optimization • N = 100, 200, 500, 1000, 2000 • Adaptive models (Table 3) • comparable BLEU scores with much smaller model sizes • keeping duplicated instances achieved better results than distinct instances • when N is large, performance starts to drop • Optimized models (Table 4) • significant improvement over the baseline and adaptive models
Experiment #2: Online optimization • Preliminary results • N = 500 • No significant difference observed by using different weighting schemes • Small improvement over the baseline model
Future Work • More sophisticated similarity measure for information retrieval • Optimization algorithm for online weighting of sub-models • Introducing language model optimization into the system