
Confidence Estimation for Machine Translation

Confidence Estimation for Machine Translation. J. Blatz et al., COLING 2004. SSLI MTRG, 11/17/2004, Takahiro Shinozaki. Abstract: a detailed study of CE for machine translation, covering various machine learning methods, CE for sentences and for words, different definitions of correctness, and experiments.


Presentation Transcript


  1. Confidence Estimation for Machine Translation • J. Blatz et al., COLING 2004 • SSLI MTRG, 11/17/2004 • Takahiro Shinozaki

  2. Abstract • Detailed study of CE for machine translation • Various machine learning methods • CE for sentences and for words • Different definitions of correctness • Experiments • NIST 2003 Chinese-to-English MT evaluation

  3. 1 Introduction • CE can improve the usability of NLP-based systems • CE techniques are not well studied in machine translation • Investigate sentence-level and word-level CE

  4. 2 Background: Strong vs. weak CE • Strong CE: requires correctness probabilities • Weak CE: requires only binary classification, not necessarily a probability • [Diagram: CE score compared with a threshold to give a binary output]

  5. 2 Background: With or without a CE layer • No distinct CE layer: NLP system only • Distinct CE layer: NLP system followed by a CE module (naïve Bayes, NN, SVM, etc.) • Requires a training corpus • Powerful and modular

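  6. 3 Experimental Setting • Translation system: ISI Alignment Template MT system • [Diagram: input (source) sentences are translated into N-best hypotheses; each hypothesis is compared with the reference sentences and labeled correct or not (C); data are split into train, validation, and test sets]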

  7. 3.1 Corpora • Chinese-to-English • Evaluation sets from NIST MT competitions • Multi-reference corpus from LDC

  8. 3.2 CE Techniques • Data: a collection of pairs (x, c) • x: feature vector, c: correctness • Weak CE • x → score • x → MLP → score (regressing the MT evaluation score) • Strong CE • x → naïve Bayes → P(c=1|x) • x → MLP → P(c=1|x)

  9. 3.2 Naïve Bayes (NB) • Assume features x1, x2, …, xD are statistically independent given the class C • Apply absolute discounting
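
The naïve Bayes slide above can be illustrated with a minimal sketch. It assumes every feature has already been discretized into integer bins `0 .. num_values-1`, and it uses one common form of absolute discounting: a constant `discount` is subtracted from each non-zero count and the freed mass is shared evenly among unseen values. The function names, the binning, and the discount value are illustrative assumptions, not details taken from the slides.

```python
import math
from collections import Counter, defaultdict


def train_nb(samples, num_values, discount=0.5):
    """Collect counts from (features, label) pairs; features are tuples of ints."""
    class_counts = Counter(label for _, label in samples)
    # counts[label][d][value] = how often feature d took `value` in class `label`
    counts = defaultdict(lambda: defaultdict(Counter))
    for features, label in samples:
        for d, value in enumerate(features):
            counts[label][d][value] += 1
    return class_counts, counts, num_values, discount


def feature_prob(model, label, d, value):
    """P(x_d = value | c = label) with absolute discounting (assumed variant)."""
    class_counts, counts, num_values, discount = model
    seen = counts[label][d]
    total = class_counts[label]
    unseen = num_values - len(seen)
    if unseen == 0:
        # Nothing to redistribute to, so fall back to relative frequencies.
        return seen[value] / total
    if value in seen:
        return (seen[value] - discount) / total
    # Freed mass (discount per seen value) is shared evenly by unseen values.
    return discount * len(seen) / (total * unseen)


def posterior_correct(model, features):
    """P(c = 1 | x) under the feature-independence assumption."""
    class_counts = model[0]
    n = sum(class_counts.values())
    log_joint = {}
    for label in class_counts:
        lp = math.log(class_counts[label] / n)
        for d, value in enumerate(features):
            lp += math.log(feature_prob(model, label, d, value))
        log_joint[label] = lp
    m = max(log_joint.values())
    z = sum(math.exp(v - m) for v in log_joint.values())
    return math.exp(log_joint.get(1, float("-inf")) - m) / z
```

Working in log space keeps the product over many features from underflowing, and the discounting keeps a single unseen feature value from driving the posterior to zero.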

  10. 3.2 Multi Layer Perceptron • Non-linear mapping of input features • Linear transformation layers • Non-linear transfer functions • Parameter estimation • Weak CE (Regression) • Target: MT evaluation score • Minimizing a squared error loss • Strong CE (Classification) • Target: Binary correct/incorrect class • Minimizing negative log likelihood
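
The two training objectives on the MLP slide can be sketched as follows. The single hidden layer, the tanh transfer function, and the sigmoid output are assumptions made for illustration; the slide only says that linear layers are combined with non-linear transfer functions.

```python
import math


def mlp_forward(x, w_hidden, b_hidden, w_out, b_out):
    """One hidden layer: linear map, tanh transfer (assumed), linear output."""
    hidden = [
        math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w_hidden, b_hidden)
    ]
    return sum(w * h for w, h in zip(w_out, hidden)) + b_out


def sigmoid(a):
    """Squash a raw output into (0, 1) so it can be read as P(c=1|x)."""
    return 1.0 / (1.0 + math.exp(-a))


def squared_error(outputs, targets):
    """Weak CE (regression): targets are MT evaluation scores."""
    return sum((o - t) ** 2 for o, t in zip(outputs, targets))


def negative_log_likelihood(probs, labels):
    """Strong CE (classification): labels are 1 (correct) or 0 (incorrect)."""
    return -sum(math.log(p if c == 1 else 1.0 - p) for p, c in zip(probs, labels))
```

For regression the raw output of `mlp_forward` is compared with the evaluation score; for classification it is passed through `sigmoid` first and scored with `negative_log_likelihood`.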

  11. 3.3 Metrics for Evaluation • Strong CE metric: evaluates the probability distribution • Normalized cross entropy (NCE) • Weak CE metrics: evaluate discriminability • Classification error rate (CER) • Receiver operating characteristic (ROC)

  12. 3.3 Normalized Cross Entropy • Cross entropy (negative log-likelihood): computed from the estimated probability from the CE module • Normalized cross entropy (NCE): normalized using the empirical probability obtained from the test set
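
The formulas on this slide did not survive in the transcript. The sketch below uses the commonly used definition of NCE, which matches the two annotations that remain: the negative log-likelihood (NLL) of the CE module's probabilities is compared with the NLL of a baseline that always outputs the empirical class prior of the test set, giving NCE = (NLL_base - NLL) / NLL_base. Treat the exact form as an assumption.

```python
import math


def normalized_cross_entropy(probs, labels):
    """NCE of estimated P(c=1|x) values against 0/1 correctness labels.

    Assumes both classes occur in `labels`; otherwise the baseline NLL is
    zero and the ratio is undefined.
    """
    n = len(labels)
    n1 = sum(labels)
    n0 = n - n1
    # NLL of the probabilities estimated by the CE module.
    nll = -sum(math.log(p if c == 1 else 1.0 - p) for p, c in zip(probs, labels))
    # NLL of a baseline that always outputs the empirical class prior.
    nll_base = -sum(k * math.log(k / n) for k in (n0, n1) if k > 0)
    return (nll_base - nll) / nll_base
```

Under this definition a value of 0 means the estimates are no more informative than the class prior, values approaching 1 indicate sharp and accurate probabilities, and negative values indicate estimates that are worse than the prior.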

  13. 3.3 Classification Error Rate • CER: Ratio of samples with wrong binary (Correct/Incorrect) prediction • Threshold optimization • Sentence-level experiments: test set • Word-level experiments: validation set • Baseline
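
A minimal sketch of CER with threshold optimization follows. The baseline shown here, always predicting the more frequent class, is an assumption: the baseline formula on the slide was lost, and this is a common choice. All function names are illustrative.

```python
def classification_error_rate(scores, labels, threshold):
    """Fraction of samples whose thresholded prediction disagrees with the label."""
    errors = sum((s >= threshold) != (c == 1) for s, c in zip(scores, labels))
    return errors / len(labels)


def best_threshold(scores, labels):
    """Exhaustive search over observed scores, plus 'reject everything'."""
    candidates = sorted(set(scores)) + [float("inf")]
    return min(candidates, key=lambda t: classification_error_rate(scores, labels, t))


def baseline_cer(labels):
    """Error rate of always predicting the more frequent class (assumed baseline)."""
    n1 = sum(labels)
    return min(n1, len(labels) - n1) / len(labels)
```

Following the slide, the threshold would be picked with `best_threshold` on the test set for the sentence-level experiments and on the validation set for the word-level experiments, and then applied with `classification_error_rate`.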

  14. 3.3 Receiver Operating Characteristic • [Figure: ROC curve plotting correct-accept ratio against correct-reject ratio, both ranging from 0 to 1; curves farther from the random diagonal are better; IROC is the area under the ROC curve]
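
The ROC curve and its integral (IROC) can be sketched as below. The sketch assumes that an item is accepted when its score is at least the threshold, that the correct-accept ratio is the fraction of correct items accepted, and that the correct-reject ratio is the fraction of incorrect items rejected. The area is computed with the trapezoid rule. Both classes must be present in `labels`.

```python
def roc_points(scores, labels):
    """(correct-reject ratio, correct-accept ratio) for every threshold."""
    n_correct = sum(labels)
    n_incorrect = len(labels) - n_correct
    points = []
    for t in sorted(set(scores)) + [float("inf")]:
        accepted_correct = sum(
            1 for s, c in zip(scores, labels) if c == 1 and s >= t
        )
        rejected_incorrect = sum(
            1 for s, c in zip(scores, labels) if c == 0 and s < t
        )
        points.append((rejected_incorrect / n_incorrect, accepted_correct / n_correct))
    return points  # ordered by non-decreasing correct-reject ratio


def iroc(points):
    """Area under the ROC curve by the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

A perfectly separating score gives an area of 1.0, while an uninformative score gives 0.5, which corresponds to the random diagonal in the figure.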

  15. 4 Sentence Level Experiments • MT evaluation measures • WERg: normalized word error rate • NIST: sentence-level NIST score • “Correctness” definition • Thresholding WERg • Thresholding NIST • Threshold value • 5% “correct” examples • 30% “correct” examples
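
One way to read the threshold settings above is that the threshold on WERg or NIST is chosen so that a fixed fraction (5% or 30%) of examples ends up labeled "correct". The sketch below implements that reading by ranking the examples; the function name and the tie handling are assumptions.

```python
def label_by_fraction(values, fraction, higher_is_better):
    """Label the best `fraction` of examples as correct (1), the rest as 0.

    Use higher_is_better=False for an error rate such as WERg and
    higher_is_better=True for a score such as NIST. Ties at the cut-off
    are broken by original order, which is an arbitrary choice.
    """
    n_correct = round(fraction * len(values))
    order = sorted(
        range(len(values)), key=lambda i: values[i], reverse=higher_is_better
    )
    labels = [0] * len(values)
    for i in order[:n_correct]:
        labels[i] = 1
    return labels
```

With ten sentences and `fraction=0.3`, the three sentences with the lowest WERg (or the highest NIST score) receive label 1.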

  16. 4.1 Features • Total of 91 sentence-level features • Base-Model-Intrinsic • Outputs of the 12 feature functions of the maximum-entropy-based base system • Pruning statistics • N-best List • Rank, score ratio to the best, etc. • Source Sentence • Length, n-gram frequency statistics, etc. • Target Sentence • LM scores, parenthesis matching, etc. • Source/Target Correspondence • IBM Model 1 probabilities, semantic similarity, etc.

  17. 4.2 MLP Experiments • MLPs are trained on all features for the four problem settings • Classification models are better than regression models • Performance is better than the baseline • Table 2 (N: NIST, W: WERg; columns include BASE CER; rows: Strong CE (Classification), Weak CE (Regression); surviving values: 3.21, 32.5, 5.65, 32.5, N/A)

  18. 4.3 Feature Comparison • Compare contributions of features • Individual feature • Group of features • All: All features • Base: base model scores • BD: base-model dependent • BI: base model independent • S: apply to source sentence • T: apply to target sentence • ST: apply to source and target sentence

  19. 4.3 Feature Comparison (results) • Experimental condition: NIST 30% • Base All • BD > BI • T > ST > S • CE layer > no CE layer • Table 3, Figure 1 (feature groups: ALL, Base, BD, BI, S, T, ST)

  20. 5 Word Level Experiments • Definition of word correctness: a word is correct if: • Pos: occurs exactly at the same position as in the reference • WER: aligned to the reference • PER: occurs in the reference • Select a “best” transcript from multiple references • Ratio of “correct” words • Pos (15%) < WER (43%) < PER (64%)
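
The three correctness definitions can be compared on a toy example with the sketch below. It assumes that "aligned to the reference" means matched to an identical word in a minimum edit distance (Levenshtein) alignment, and it treats PER as plain membership in the reference, ignoring word counts. Both are simplifications chosen for illustration.

```python
def pos_tags(hyp, ref):
    """Pos: the word sits at exactly the same position in the reference."""
    return [int(i < len(ref) and w == ref[i]) for i, w in enumerate(hyp)]


def per_tags(hyp, ref):
    """PER: the word occurs anywhere in the reference (counts ignored)."""
    ref_words = set(ref)
    return [int(w in ref_words) for w in hyp]


def wer_tags(hyp, ref):
    """WER: the word is matched to an identical word in a Levenshtein alignment."""
    n, m = len(hyp), len(ref)
    # dist[i][j] = edit distance between hyp[:i] and ref[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dist[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1])
            dist[i][j] = min(sub, dist[i - 1][j] + 1, dist[i][j - 1] + 1)
    # Trace back one optimal alignment and mark the exact matches.
    tags = [0] * n
    i, j = n, m
    while i > 0 and j > 0:
        if dist[i][j] == dist[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1]):
            tags[i - 1] = int(hyp[i - 1] == ref[j - 1])
            i, j = i - 1, j - 1
        elif dist[i][j] == dist[i - 1][j] + 1:
            i -= 1  # hypothesis word has no counterpart in the reference
        else:
            j -= 1  # reference word has no counterpart in the hypothesis
    return tags
```

For the hypothesis "the cat sat on mat" and the reference "a cat sat on the mat", Pos marks three words correct, WER four, and PER five, which mirrors the ordering Pos < WER < PER reported on the slide. When several alignments have the same cost, the WER tags depend on which one the traceback picks.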

  21. 5.1 Features • Total of 17 features • SMT model based features (2) • Identity of alignment template, whether or not translated by a rule • IBM Model 1 (1) • Averaged word translation probability • Word posterior and related measures (3x3): WPP-any, WPP-source, WPP-target • Target language based features (3+2) • Semantic features by WordNet • Syntax check, number of occurrences in the sentence
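
The slide names the word posterior probability (WPP) features without defining them. The sketch below shows one plausible reading, computed over an N-best list: hypothesis scores are normalized into posteriors, WPP-any sums the posteriors of all hypotheses containing the word anywhere, and WPP-target sums those containing it at the same target position. These definitions, and the use of log-domain scores, are assumptions. WPP-source would additionally need word-to-source alignments and is omitted.

```python
import math


def hypothesis_posteriors(log_scores):
    """Normalize N-best log scores into posterior probabilities."""
    m = max(log_scores)
    weights = [math.exp(s - m) for s in log_scores]
    z = sum(weights)
    return [w / z for w in weights]


def wpp_any(word, nbest, posteriors):
    """Posterior mass of hypotheses containing `word` at any position."""
    return sum(p for hyp, p in zip(nbest, posteriors) if word in hyp)


def wpp_target(word, position, nbest, posteriors):
    """Posterior mass of hypotheses with `word` at the same target position."""
    return sum(
        p
        for hyp, p in zip(nbest, posteriors)
        if position < len(hyp) and hyp[position] == word
    )
```

Each hypothesis in `nbest` is a list of target words, and `log_scores` holds the corresponding model scores in the log domain.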

  22. 5.2 Performance of Single Features • Experimental setting • Naïve Bayes classifier • PER-based correctness • WPP-any gives the best results • WPP-any > Model 1 > WPP-source • Top 3 > any of the single features • No gain for ALL • Table 4

  23. 5.3 Comparison of Different Models • Naïve Bayes, MLPs with different numbers of hidden units • All features, PER-based correctness • Naïve Bayes MLP0 • Naïve Bayes < MLP5 • MLP5 MLP10 MLP20 • Figure 2

  24. 5.4 Comparison of Word Error Measures • Experimental settings • MLP20 • All features • PER is the easiest to learn • Table 5

  25. 6 Conclusion • A separate CE layer is useful • Features derived from the base model are better than external ones • N-best based features are valuable • Target-based features are more valuable than features that are not target-based • MLPs with hidden units are better than naïve Bayes
