Learn about semi-supervised boosting for word alignment in Natural Language Processing (NLP), which combines labeled and unlabeled data to improve alignment accuracy and translation quality. Evaluation metrics and results are discussed.
Semi-Supervised Boosting for Statistical Word Alignment Wu Hua 2006/10/18
Outline • Introduction to semi-supervised learning • Introduction to boosting • Semi-supervised boosting for word alignment • Evaluation results • Conclusion
Machine Learning Methods • Supervised learning • Labeled data • Unsupervised learning • Unlabeled data • Semi-supervised learning • Combines both labeled data and unlabeled data
Semi-Supervised Learning in NLP • Word sense disambiguation • (Yarowsky, 1995; Pham et al., 2005) • Classification • (Blum and Mitchell, 1998; Joachims, 1999) • Clustering • (Basu et al., 2004) • Named entity classification • (Collins and Singer, 1999) • Parsing • (Sarkar, 2001)
Boosting – Supervised Learning [Flowchart: Initialization → Call Learner → Calculate Error Rate (against the Reference Set) → Re-weight Training Data → loop until End → Build Ensemble]
Boosting in NLP • Tagging and PP attachment • (Abney et al., 1999) • Word sense disambiguation • (Escudero et al., 2000) • Parser construction • (Haruno et al., 1999; Henderson and Brill, 2000) • Sentence generation • (Walker et al., 2001)
Semi-Supervised Boosting • Three main problems • Semi-supervised learner • Combine labeled data and unlabeled data • Reference set • Automatically construct a reference set for unlabeled data • Error rate calculation • How to calculate the error rate with both labeled data and unlabeled data
Semi-Supervised Boosting Applied to Word Alignment [Flowchart: Labeled Data → Supervised Training; Unlabeled Data → Unsupervised Training; both feed Model Interpolation → Error Rate Calculation (against the Real Reference Set, with a Pseudo Reference Set for the unlabeled data) → Re-weight Training Data → loop until End → Build Ensemble]
Semi-Supervised Boosting Applied to Word Alignment • Five main components • Word alignment model interpolation • Pseudo reference set construction for unlabeled data • Error rate calculation • Weight update • Final Ensemble
Word Alignment Model • Supervised alignment model • Calculate the probabilities for IBM Model 4 based on the labeled data • Unsupervised alignment model • Use GIZA++ to train IBM Model 4 • Perform model interpolation
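The slide does not give the interpolation formula; a common form, consistent with the description, is a linear combination of the two models' translation probabilities. A minimal Python sketch, where the coefficient alpha and the toy dictionaries are illustrative assumptions rather than the authors' values:

```python
# Minimal sketch of linear model interpolation for word-translation
# probabilities. In the method, p_supervised would come from the model
# trained on labeled data and p_unsupervised from the GIZA++ model;
# alpha would be tuned on held-out data.

def interpolate(p_supervised, p_unsupervised, alpha=0.5):
    """Return p(t|s) = alpha * p_sup(t|s) + (1 - alpha) * p_unsup(t|s)."""
    keys = set(p_supervised) | set(p_unsupervised)
    return {k: alpha * p_supervised.get(k, 0.0)
               + (1 - alpha) * p_unsupervised.get(k, 0.0)
            for k in keys}

# Toy example: Chinese translation probabilities for English "bank".
p_sup   = {("bank", "银行"): 0.9, ("bank", "河岸"): 0.1}
p_unsup = {("bank", "银行"): 0.6, ("bank", "河岸"): 0.4}
print(interpolate(p_sup, p_unsup, alpha=0.7))
```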
Pseudo Reference Set Construction • Obtain bi-directional word alignment sets S1 and S2 on the training data • Obtain the intersection set $S_I = S_1 \cap S_2$ • Filter the union set $S_1 \cup S_2$ to obtain $S_F$ • Build the pseudo reference set $R_p = S_I \cup S_F$
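A minimal sketch of this construction. The filter below (admit a union link only if it aligns a word left uncovered by the intersection, in the spirit of the standard grow/final symmetrization heuristics) is an assumed stand-in, since the slide does not specify the actual filtering rule:

```python
# Sketch of pseudo reference construction: keep the intersection of the
# two directional alignments, then add filtered links from the union.
# The filter here is an illustrative stand-in for the paper's heuristic.

def pseudo_reference(s1, s2):
    inter = s1 & s2
    aligned_src = {i for (i, j) in inter}
    aligned_tgt = {j for (i, j) in inter}
    # keep a union link only if it covers a so-far-unaligned word
    filtered = {(i, j) for (i, j) in (s1 | s2) - inter
                if i not in aligned_src or j not in aligned_tgt}
    return inter | filtered

# Alignment links as (source_pos, target_pos) pairs, one set per direction.
s1 = {(0, 0), (1, 1), (2, 3)}
s2 = {(0, 0), (1, 1), (2, 2)}
print(sorted(pseudo_reference(s1, s2)))  # -> [(0, 0), (1, 1), (2, 2), (2, 3)]
```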
Error Rate Calculation • For each sentence pair, calculate the error rate of an aligner • Based on the labeled data instead of the whole data: $\epsilon_l = \sum_i \widetilde{w}_l(i)\,\mathrm{ER}(i)$, where $\widetilde{w}_l(i)$ is the normalized weight of the i-th sentence pair at the l-th round and $\mathrm{ER}(i)$ is the aligner's error on the i-th labeled pair
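A minimal sketch of this step, assuming the weighted sum over labeled pairs reconstructed above; the per-pair errors and the labeled mask are toy values:

```python
# Weighted error rate of the current aligner, computed on labeled pairs
# only ("based on the labeled data instead of the whole data").
# errors[i] is the per-sentence alignment error against the real reference.

def weighted_error(weights, errors, labeled_mask):
    num = sum(w * e for w, e, lab in zip(weights, errors, labeled_mask) if lab)
    den = sum(w for w, lab in zip(weights, labeled_mask) if lab)
    return num / den  # normalize so the labeled weights sum to 1

weights      = [0.25, 0.25, 0.25, 0.25]
errors       = [0.10, 0.30, 0.20, 0.50]    # per-pair AER-style errors
labeled_mask = [True, True, False, False]  # only the first two are labeled
print(weighted_error(weights, errors, labeled_mask))  # ~0.2
```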
Re-Weight the Training Data • Reweight each sentence pair in the training set • For each sentence pair, there may exist correct links and incorrect links as compared with the pseudo reference set • Calculate the weight of each sentence pair according to its link error fraction $e_i = K_i / n_i$, where $K_i$ is the number of error links and $n_i$ is the total number of links in the reference for the i-th pair
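A minimal sketch of the update. The error fraction $K_i/n_i$ comes from the slide; the multiplicative AdaBoost.M1-style rule with beta = eps / (1 - eps) is an assumed form, since the slide omits the exact update:

```python
# Re-weighting sketch: pairs the aligner got mostly right are shrunk,
# so pairs with more error links carry more weight in the next round.
# The beta-based rule is an assumption modeled on AdaBoost.M1.

def reweight(weights, k_errors, n_links, eps):
    beta = eps / (1.0 - eps)
    new = [w * beta ** (1.0 - k / n)      # low-error pairs shrink more
           for w, k, n in zip(weights, k_errors, n_links)]
    total = sum(new)
    return [w / total for w in new]       # renormalize to sum to 1

weights  = [0.5, 0.5]
k_errors = [0, 2]   # error links per sentence pair
n_links  = [4, 4]   # reference links per sentence pair
print(reweight(weights, k_errors, n_links, eps=0.25))
```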
Final Ensemble • Obtain the final ensemble from the word aligners trained in each round: $h_f(s,t) = \sum_{l=1}^{L} \alpha_l\, h_l(s,t)$, where $h_f$ is the final ensemble for word alignment, $h_l(s,t)$ is the weight of the alignment pair (s, t) produced by the l-th word aligner, and $\alpha_l$ is the weight of the l-th aligner
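A minimal sketch of the weighted link voting. The acceptance threshold of half the total aligner weight is an assumption; the slide states only that aligner weights and link weights are combined:

```python
# Ensemble sketch: each round's aligner casts a weighted vote for every
# link it produces; links whose total vote passes the threshold are kept.

def ensemble_alignment(alignments, aligner_weights, link_weight=None):
    votes = {}
    for links, a in zip(alignments, aligner_weights):
        for link in links:
            w = link_weight(link) if link_weight else 1.0  # per-link weight
            votes[link] = votes.get(link, 0.0) + a * w
    threshold = 0.5 * sum(aligner_weights)                 # assumed cutoff
    return {link for link, v in votes.items() if v >= threshold}

# Three rounds' aligners with weights alpha_l; links are (src, tgt) pairs.
alignments = [{(0, 0), (1, 1)}, {(0, 0), (1, 2)}, {(0, 0), (1, 1)}]
alpha      = [1.0, 0.6, 0.8]
print(sorted(ensemble_alignment(alignments, alpha)))  # -> [(0, 0), (1, 1)]
```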
Evaluation • Training set • Unlabeled data: 320,000 English-Chinese sentence pairs • Labeled data: 30,000 English-Chinese sentence pairs • Held-out set • 1,500 sentence pairs • Testing set • 1,000 English-Chinese sentence pairs • 8,651 alignment links in total
Evaluation Metric • Word alignment • Precision and Recall • Alignment Error Rate (AER) • Phrase-based machine translation • System: Pharaoh • Metrics: NIST and BLEU
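For reference, AER is the standard alignment metric of Och and Ney (2003), computed from the hypothesis alignment A and the reference's sure links S and possible links P:

```latex
% Standard AER (Och and Ney, 2003): A = hypothesis links,
% S = sure reference links, P = possible reference links (S \subseteq P).
% With sure links only (P = S) it reduces to 1 - 2|A \cap S| / (|A| + |S|),
% i.e. one minus the F-measure of A against S.
\mathrm{AER}(A; S, P) = 1 - \frac{|A \cap S| + |A \cap P|}{|A| + |S|}
```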
Weights in Ensembles • Two kinds of weights • Weights for the individual aligners • Weights for the individual alignment links • Baseline: uses only the first kind of weights • Our method: uses both kinds of weights

Method      Precision  Recall   AER
Baseline    0.7946     0.7775   0.2140
Our method  0.8175     0.7858   0.1987
Conclusion • Features of our semi-supervised boosting method • Perform model interpolation • Automatically build the pseudo reference set • Calculate the error rate of the training set with the labeled data • Use two kinds of weights in the ensemble • One for aligners • The other for alignment links • Boosting does improve word alignment and translation quality • Semi-supervised boosting performs the best