Improving SMT Performance by Training Data Selection and Optimization Yajuan Lü, Jin Huang and Qun Liu EMNLP, 2007 Presented by Mei Yang, May 12th, 2008
Goal and Approach • Translation model adaptation/optimization • Relevant data is better than more data • select test-related training data with IR • optimize the distribution of training data • Offline model optimization • Online model optimization
Select Relevant Data • Query: a test sentence • Document: a source-language training sentence • Information retrieval uses TF-IDF term weighting • both the query and the document are represented as vectors • the similarity score is the cosine of the two vectors (see the sketch below)
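A minimal sketch of this retrieval step, not the authors' implementation: function names (`build_idf`, `retrieve_top_n`) are illustrative, each training sentence is treated as a document when estimating IDF, and similarity is the cosine of TF-IDF vectors.

```python
import math
from collections import Counter

def build_idf(training_src):
    """Estimate IDF from the source side of the training corpus,
    treating each sentence as a document (an assumption of this sketch)."""
    df = Counter()
    for sent in training_src:
        df.update(set(sent.split()))
    total = len(training_src)
    return {w: math.log(total / df[w]) for w in df}

def tfidf_vector(tokens, idf):
    tf = Counter(tokens)
    return {w: tf[w] * idf.get(w, 0.0) for w in tf}

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u if w in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve_top_n(query, training_src, idf, n=100):
    """Return the indices of the N training sentences most similar to the query."""
    qv = tfidf_vector(query.split(), idf)
    scored = [(cosine(qv, tfidf_vector(s.split(), idf)), i)
              for i, s in enumerate(training_src)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:n]]
```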
Offline Model Optimization • For each sentence in the devset and the testset, retrieve the top-N similar sentences from the original training set T • All retrieved sentence pairs form the adaptive training set D • with or without duplicated instances • Train the adaptive model on D • Train the optimized model by adding D into T • addresses the data sparseness issue of the adaptive model • in practice, adding D into T amounts to adjusting the count of each training sentence pair in T (Fig 1)
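A hedged sketch of this offline procedure (function names are illustrative, not from the paper); it reuses `retrieve_top_n` and `build_idf` from the retrieval sketch above and shows how adding D into T reduces to count adjustment.

```python
from collections import Counter

def build_adaptive_indices(dev_and_test, bitext, idf, n=100, keep_duplicates=True):
    """Collect, for every dev/test sentence, the indices of its top-N
    most similar training pairs; D is then [bitext[i] for i in the result]."""
    src = [s for s, _ in bitext]
    retrieved = []
    for sent in dev_and_test:
        retrieved.extend(retrieve_top_n(sent, src, idf, n))   # from sketch above
    if not keep_duplicates:
        retrieved = list(dict.fromkeys(retrieved))            # distinct instances
    return retrieved

def optimized_counts(bitext, adaptive_indices):
    """Adding D into T: every pair in T keeps its original count of 1,
    plus one extra count per time it was retrieved into D (cf. Fig 1)."""
    counts = Counter({i: 1 for i in range(len(bitext))})
    counts.update(adaptive_indices)
    return counts
```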
Online Model Optimization • Train a general model on the entire data and several candidate sub-models on prepared subsets, which can be obtained by • dividing the data by its origin (the approach taken in this paper) • a clustering method • using IR with a small amount of domain-specific data • Translate with a log-linear model whose weights are optimized online • given a test sentence, retrieve the top-N similar sentences from the original training set T • set the sub-model weights according to the proportions of retrieved sentences used to train each sub-model • four different weighting schemes are proposed (page 346); the simplest is sketched below
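A sketch of the simplest proportion-based weighting scheme only (the paper proposes four; the others differ in how the proportions are turned into weights). The function names and the log-linear combination below are illustrative assumptions, and `retrieve_top_n` / `build_idf` come from the retrieval sketch above.

```python
import math
from collections import Counter

def submodel_weights(test_sentence, training_src, origin, idf, n=500):
    """origin[i] names the sub-corpus training sentence i came from.
    Each sub-model's weight is the fraction of the top-N retrieved
    sentences drawn from its sub-corpus."""
    top = retrieve_top_n(test_sentence, training_src, idf, n)  # from sketch above
    hits = Counter(origin[i] for i in top)
    return {corpus: hits[corpus] / n for corpus in set(origin)}

def combined_score(phrase_scores, weights):
    """Log-linearly combine per-sub-model scores for one phrase pair:
    phrase_scores maps sub-corpus name -> that sub-model's probability."""
    return sum(w * math.log(phrase_scores.get(c, 1e-9))
               for c, w in weights.items())
```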
Experiment: Data • Chinese-to-English translation task • 200K sentence pairs randomly selected from each of three corpora: FBIS, HK_Hansards, and HK_News, giving a training set of 600K sentence pairs in total • Devset and testset • Offline: NIST02 as devset and NIST05 as testset (both with 4 references) • Online: 500 randomly selected sentence pairs from each of the three corpora, plus 500 sentence pairs from NIST05, giving a testset of 2K sentence pairs in total (1 reference)
Experiment #1: Offline optimization • N = 100, 200, 500, 1000, 2000 • Adaptive models (Table 3) • comparable BLEU scores with much smaller model sizes • keeping duplicated instances achieved better results than distinct instances • when N is large, performance starts to drop • Optimized models (Table 4) • significant improvement over the baseline and adaptive models
Experiment #2: Online optimization • Preliminary results • N = 500 • No significant difference observed by using different weighting schemes • Small improvement over the baseline model
Future Work • More sophisticated similarity measure for information retrieval • Optimization algorithm for online weighting of sub-models • Introducing language model optimization into the system