Bagging-based System Combination for Domain Adaptation. Linfeng Song, Haitao Mi, Yajuan Lü and Qun Liu. Institute of Computing Technology, Chinese Academy of Sciences
An Example • Initial MT system
An Example • Initial MT system • Tuned MT system that fits domain A • Development set: A 90%, B 10% • The translation styles of A and B are quite different
An Example • Initial MT system • Tuned MT system that fits domain A • Development set: A 90%, B 10% • Test set: A 10%, B 90%
An Example • Initial MT system • Tuned MT system that fits domain A • Development set: A 90%, B 10% • Test set: A 10%, B 90% • The translation style fits A, but we mainly want to translate B
Traditional Methods Monolingual data with domain annotation
Traditional Methods • Monolingual data with domain annotation → Domain recognizer
Traditional Methods Bilingual training data
Traditional Methods • Bilingual training data → Domain recognizer → training data (domain A) + training data (domain B)
Traditional Methods • training data (domain A) → MT system (domain A); training data (domain B) → MT system (domain B)
Traditional Methods Test set
Traditional Methods • Test set → Domain recognizer → test set (domain A) + test set (domain B)
Traditional Methods • test set (domain A) → MT system (domain A) → translation result (domain A); test set (domain B) → MT system (domain B) → translation result (domain B) → merged translation result
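As a rough illustration (not the authors' code), the traditional pipeline can be sketched in a few lines of Python; the domain recognizer and the two MT systems below are toy stand-ins.

```python
# Minimal sketch of the traditional pipeline: a domain recognizer routes
# each test sentence to the MT system trained for its predicted domain.
# Both components are toy stand-ins, purely for illustration.

def domain_recognizer(sentence):
    """Toy stand-in for a domain classifier; may misclassify (the CE problem)."""
    return "A" if "patent" in sentence else "B"

# One domain-specific MT system per domain (stubs for illustration).
mt_systems = {
    "A": lambda s: f"[domain-A translation of: {s}]",
    "B": lambda s: f"[domain-B translation of: {s}]",
}

def translate_test_set(test_set):
    """Split the test set by predicted domain and translate each part."""
    return [mt_systems[domain_recognizer(s)](s) for s in test_set]

print(translate_test_set(["a patent claim sentence", "an everyday sentence"]))
```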
The merits • Simple and effective • Fits human intuition
The drawbacks • Classification Error (CE) • Especially severe for unsupervised methods • Supervised methods can keep CE low, yet the need for annotated data limits their usage
Our motivation • Move away from performing adaptation directly • Statistical methods (such as Bagging) can help
Preliminary The general framework of Bagging
General framework of Bagging Training set D
General framework of Bagging • Training set D → bootstrapped training sets D1, D2, D3, … → classifiers C1, C2, C3, …
General framework of Bagging • Test sample → classifiers C1, C2, C3, …
General framework of Bagging • Test sample → C1, C2, C3, … → result of C1, result of C2, result of C3, … → voting result
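A minimal Python sketch of this general framework, using a toy majority-label classifier purely for illustration: bootstrap D into replicas, train one classifier per replica, and vote at test time.

```python
import random
from collections import Counter

# General Bagging framework: bootstrap D into D1, D2, ..., train one
# classifier Ck per resample, then combine their predictions by voting.

def bootstrap(data, rng):
    """One bootstrap replica: sample |D| items from D with replacement."""
    return [rng.choice(data) for _ in data]

def train(sample):
    """Toy classifier Ck: always predicts the majority label of its sample."""
    majority = Counter(label for _, label in sample).most_common(1)[0][0]
    return lambda x: majority

def bagging_predict(classifiers, x):
    """Voting: each Ck casts one vote, the most frequent prediction wins."""
    return Counter(c(x) for c in classifiers).most_common(1)[0][0]

rng = random.Random(0)
D = [(1, "A"), (2, "A"), (3, "A"), (4, "B"), (5, "B")]   # (sample, label)
classifiers = [train(bootstrap(D, rng)) for _ in range(10)]
print(bagging_predict(classifiers, x=3))
```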
Training • Suppose there is a development set: A, A, A, B, B • For simplicity, there are only 5 sentences: 3 belong to domain A, 2 belong to domain B
Training • We bootstrap N new development sets: A,B,B,B,B | A,A,B,B,B | A,A,A,B,B | A,A,B,B,B | A,A,A,B,B | A,A,A,A,B | …
Training • For each bootstrapped set, a subsystem is tuned: A,B,B,B,B → MT system-1 | A,A,B,B,B → MT system-2 | A,A,A,B,B → MT system-3 | A,A,B,B,B → MT system-4 | A,A,A,B,B → MT system-5 | …
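A minimal sketch of the bootstrap step on the 5-sentence example development set; the actual tuning of each subsystem (e.g. with MERT) is outside the sketch, so only the resampled sets are produced.

```python
import random

# Bootstrap N new development sets from the example dev set
# (A1..A3 from domain A, B1..B2 from domain B); each replica would
# then be used to tune one MT subsystem.

dev_set = ["A1", "A2", "A3", "B1", "B2"]
N = 6                                    # number of subsystems to tune
rng = random.Random(42)

bootstrapped_sets = [[rng.choice(dev_set) for _ in dev_set] for _ in range(N)]

for i, replica in enumerate(bootstrapped_sets, start=1):
    print(f"dev set {i}: {sorted(replica)}  ->  tune MT subsystem-{i}")
```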
Decoding • For simplicity, suppose only 2 subsystems have been tuned: Subsystem-1 W = <-0.8, 0.2> and Subsystem-2 W = <-0.6, 0.4>
Decoding • Now a sentence "A B" needs a translation • Subsystem-1 W = <-0.8, 0.2>, Subsystem-2 W = <-0.6, 0.4>
Decoding • After translation, each subsystem generates its N-best candidates • Subsystem-1 W = <-0.8, 0.2>: a b; <0.2, 0.2> | a c; <0.2, 0.3> • Subsystem-2 W = <-0.6, 0.4>: a b; <0.2, 0.2> | a b; <0.1, 0.3> | a d; <0.3, 0.4>
Decoding • Fuse these N-best lists and eliminate duplicates • Subsystem-1: a b; <0.2, 0.2> | a c; <0.2, 0.3> • Subsystem-2: a b; <0.2, 0.2> | a b; <0.1, 0.3> | a d; <0.3, 0.4> • Fused list: a b; <0.2, 0.2> | a b; <0.1, 0.3> | a c; <0.2, 0.3> | a d; <0.3, 0.4>
Decoding • Candidates are identical only if their target strings and feature values are entirely equal • Fused list: a b; <0.2, 0.2> | a b; <0.1, 0.3> | a c; <0.2, 0.3> | a d; <0.3, 0.4>
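A minimal sketch of the fusion step, reusing the example values: pool both N-best lists and drop a candidate only when both its target string and its feature vector already appear.

```python
# Fuse the subsystems' N-best lists and remove duplicates; two candidates
# count as identical only if target string AND feature vector are equal.

nbest_sub1 = [("a b", (0.2, 0.2)), ("a c", (0.2, 0.3))]
nbest_sub2 = [("a b", (0.2, 0.2)), ("a b", (0.1, 0.3)), ("a d", (0.3, 0.4))]

def fuse(nbest_lists):
    seen, fused = set(), []
    for nbest in nbest_lists:
        for target, features in nbest:
            key = (target, features)          # target string + feature values
            if key not in seen:
                seen.add(key)
                fused.append((target, features))
    return fused

print(fuse([nbest_sub1, nbest_sub2]))
# [('a b', (0.2, 0.2)), ('a c', (0.2, 0.3)), ('a b', (0.1, 0.3)), ('a d', (0.3, 0.4))]
```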
Decoding • Calculate the voting score (S represents the number of subsystems) • Subsystem-1 W = <-0.8, 0.2>, Subsystem-2 W = <-0.6, 0.4> • Fused list with voting scores: a b; <0.2, 0.2>; -0.16 | a b; <0.1, 0.3>; +0.04 | a c; <0.2, 0.3>; -0.1 | a d; <0.3, 0.4>; -0.18
Decoding • The one with the highest score wins: a b; <0.1, 0.3>; +0.04 • Scored fused list: a b; <0.2, 0.2>; -0.16 | a b; <0.1, 0.3>; +0.04 | a c; <0.2, 0.3>; -0.1 | a d; <0.3, 0.4>; -0.18
Decoding • The one with the highest score wins: a b; <0.1, 0.3>; +0.04 • Since the subsystems are copies of the same model and share the same training data, score calibration across subsystems is unnecessary
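A worked check of the voting scores above. The slide's values match the sum over all subsystems of the dot product between the subsystem's weight vector and the candidate's feature vector; normalizing by S, the number of subsystems, would not change the ranking, so only the sum is shown here.

```python
# Voting-score computation, reproducing the example's numbers.

weights = [(-0.8, 0.2), (-0.6, 0.4)]      # Subsystem-1 and Subsystem-2

fused = [("a b", (0.2, 0.2)),
         ("a b", (0.1, 0.3)),
         ("a c", (0.2, 0.3)),
         ("a d", (0.3, 0.4))]

def voting_score(features, weights):
    """Sum over subsystems of (weight vector) . (candidate feature vector)."""
    return sum(sum(w_i * f_i for w_i, f_i in zip(w, features)) for w in weights)

for target, features in fused:
    print(target, features, round(voting_score(features, weights), 2))
# a b (0.2, 0.2) -0.16
# a b (0.1, 0.3) 0.04
# a c (0.2, 0.3) -0.1
# a d (0.3, 0.4) -0.18

winner = max(fused, key=lambda cand: voting_score(cand[1], weights))
print("winner:", winner)                  # ('a b', (0.1, 0.3))
```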
Basic Setups • Data: NTCIR-9 Chinese-English patent corpus • 1k sentence pairs as the development set • Another 1k pairs as the test set • The remainder is used for training • System: hierarchical phrase-based model • Alignment: GIZA++, grow-diag-final
Effectiveness: Show and Prove • Tune 30 subsystems using Bagging • Tune 30 subsystems with random initial weights • Evaluate the fused results of the first N (N = 5, 10, 15, 20, 30) subsystems of both and compare
Results: 1-best [figure: improvement of +0.82; x-axis: number of subsystems]
Results: 1-best [figure: improvement of +0.70; x-axis: number of subsystems]
Results: Oracle [figure: improvement of +6.22; x-axis: number of subsystems]
Results: Oracle [figure: improvement of +3.71; x-axis: number of subsystems]
Compare with traditional methods • Evaluate a supervised method • To tackle data sparsity, it operates only on the development set and test set • Evaluate an unsupervised method • Similar to Yamada (2007) • To avoid data sparsity, only the language model is domain-specific
Conclusions • We propose a bagging-based method to address the multi-domain translation problem • Experiments show that: • Bagging is effective for the domain adaptation problem • Our method clearly surpasses the baseline, and is even better than some traditional methods
Thank you for listening. Any questions?