A Scalable Decoder for Parsing-based Machine Translation with Equivalent Language Model State Maintenance
Zhifei Li and Sanjeev Khudanpur
Johns Hopkins University
JOSHUA: a scalable open-source parsing-based MT decoder
• Written in Java
• Chart parsing (following Chiang, 2007)
• Beam and cube pruning
• k-best extraction over a hypergraph
• m-gram LM integration
• Parallel decoding
• Distributed LM (Zhang et al., 2006; Brants et al., 2007)
• Equivalent LM state maintenance (new!)
• We plan to add more functions soon
Chart-parsing
• Grammar formalism: synchronous context-free grammar (SCFG)
• Chart parsing
  • Bottom-up parsing
  • The parser maintains a chart, which contains an array of cells or bins
  • A cell maintains a list of items
  • Parsing starts from axioms and proceeds by applying inference rules to prove more and more items, until a goal item is proved
  • The hypotheses are stored in a hypergraph (see the sketch below)
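To make the chart/cell/item bookkeeping concrete, here is a minimal CKY-style sketch in Java, run on the 垫子 上 的 猫 example from the next slide. The class names and the toy two-rule grammar are illustrative assumptions, not Joshua's actual API.

import java.util.*;

// A minimal sketch of the bottom-up chart loop described above.
public class ChartSketch {
    record Item(String lhs, int i, int j, String english) {}

    public static void main(String[] args) {
        String[] src = {"垫子", "上", "的", "猫"};
        Map<String, String> lexicalRules = Map.of("垫子 上", "the mat", "猫", "a cat");
        int n = src.length;
        Map<String, List<Item>> chart = new HashMap<>();   // cell key: "i,j"
        for (int i = 0; i <= n; i++)
            for (int j = i; j <= n; j++) chart.put(i + "," + j, new ArrayList<>());

        for (int w = 1; w <= n; w++)                       // bottom-up: small spans first
            for (int i = 0; i + w <= n; i++) {
                int j = i + w;
                List<Item> cell = chart.get(i + "," + j);
                String span = String.join(" ", Arrays.copyOfRange(src, i, j));
                if (lexicalRules.containsKey(span))        // axioms: lexical rules fire
                    cell.add(new Item("X", i, j, lexicalRules.get(span)));
                for (int k = i + 1; k < j - 1; k++)        // inference: combine sub-items
                    if (src[k].equals("的"))               // rule X -> (X0 的 X1, X1 of X0)
                        for (Item a : chart.get(i + "," + k))
                            for (Item b : chart.get((k + 1) + "," + j))
                                cell.add(new Item("X", i, j, b.english() + " of " + a.english()));
                // a real decoder would apply beam and cube pruning to each cell here
            }
        System.out.println(chart.get("0," + n));           // goal items spanning the sentence
    }
}

Running this prints the single proved goal-span item, "a cat of the mat".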
Hypergraph
[Figure: the hypergraph for the source sentence 垫子0 上1 的2 猫3 ("the cat on the mat"). The lexical rules X → (垫子 上, the mat) and X → (猫, a cat) prove the items X | 0, 2 | the mat | NA and X | 3, 4 | a cat | NA; hyperedges for the rules X → (X0 的 X1, X0 X1), X → (X0 的 X1, X0 's X1), X → (X0 的 X1, X1 of X0), and X → (X0 的 X1, X1 on X0) combine them into items such as X | 0, 4 | the mat | a cat and X | 0, 4 | a cat | the mat, which the rule S → (X0, X0) connects to the goal item S.]
Hypergraph and Trees
[Figure: the same hypergraph unpacked into its derivation trees. Each hyperedge choice for the rule X → (X0 的 X1, …) yields a different translation of 垫子0 上1 的2 猫3: "the mat a cat", "the mat 's a cat", "a cat of the mat", and "a cat on the mat".]
How to Integrate an m-gram LM?
[Figure: decoding 奥运会0 将1 在2 中国3 的4 北京5 举行。6 into "the olympic game will be held in beijing of china ." Applying the rule X → (将 在 X0 举行。, will be held in X0 .) to the item X | 3, 6 | beijing of | of china yields X | 1, 7 | will be | china ., creating the new 3-grams "will be held", "be held in", "held in beijing", and "in beijing of"; the rules S → (S0 X1, S0 X1) and S → (<s> S0 </s>, <s> S0 </s>) later create the new 3-gram "beijing of china" and the goal item S | 0, 7 | <s> the | . </s>.]
• Three functions (see the sketch below)
  • Accumulate probability: multiply in the probabilities of the new 3-grams, e.g. 0.04 = 0.4 × 0.2 × 0.5
  • Estimate future cost: score the left boundary words with lower-order probabilities, e.g. P(beijing of) = 0.01, for an estimated total probability of 0.01 × 0.04 = 0.004
  • State extraction: keep the left and right boundary words (m−1 on each side) as the item's LM state
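A minimal sketch of the three functions for a 3-gram LM. The LanguageModel interface and every name below are assumptions for illustration, and the combination step is simplified to joining two full word strings rather than filling a rule template.

import java.util.*;

public class LmIntegrationSketch {
    interface LanguageModel { double prob(List<String> ngram); } // P(last word | preceding words)

    // 1) Accumulate probability: only the 3-grams that cross the join point are
    //    new; 3-grams entirely inside either side were scored earlier.
    static double accumulate(List<String> left, List<String> right, LanguageModel lm) {
        List<String> joined = new ArrayList<>(left);
        joined.addAll(right);
        double logProb = 0;
        for (int end = left.size(); end < Math.min(joined.size(), left.size() + 2); end++)
            if (end >= 2)  // need a full 3-gram
                logProb += Math.log(lm.prob(joined.subList(end - 2, end + 1)));
        return logProb;
    }

    // 2) Estimate future cost: left-boundary words lack their full context, so
    //    score them with lower-order probabilities, e.g. P(beijing) * P(of | beijing).
    static double estimateFuture(List<String> leftState, LanguageModel lm) {
        double logProb = 0;
        for (int i = 0; i < leftState.size(); i++)
            logProb += Math.log(lm.prob(leftState.subList(0, i + 1)));
        return logProb;
    }

    // 3) State extraction: future 3-grams can only see m-1 = 2 boundary words
    //    on each side, so the item keeps just those.
    static List<String> leftState(List<String> words)  { return words.subList(0, Math.min(2, words.size())); }
    static List<String> rightState(List<String> words) { return words.subList(Math.max(0, words.size() - 2), words.size()); }

    public static void main(String[] args) {
        LanguageModel lm = ngram -> 0.1;  // stand-in LM with uniform probabilities
        List<String> left = List.of("will", "be"), right = List.of("held", "in", "beijing");
        System.out.println(accumulate(left, right, lm));   // scores "will be held", "be held in"
        System.out.println(leftState(left) + " | " + rightState(right));
    }
}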
Equivalent State Maintenance: overview
[Figure: items such as X | 0, 3 | below cat | some rat, X | 0, 3 | below cats | many rat, X | 0, 3 | below cat | many rat, and X | 0, 3 | under cat | some rat, produced by variants of the rule X → (在 X0 的 X1 下, below/under (the) X1 of X0), are merged into the single item X | 0, 3 | below * | * rat.]
• In a straightforward implementation, different LM state words lead to different items
• We merge multiple items into a single item by replacing some LM state words with an asterisk wildcard (see the sketch below)
• By merging items, we can explore a larger hypothesis space in less time
• We only merge items when the length of the English span l ≥ m−1
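A minimal sketch of the merging itself, assuming the wildcarded states have already been computed (the IS-A-PREFIX / IS-A-SUFFIX tests on the following slides). All names are illustrative assumptions; the point is that items whose signatures agree collapse into one chart entry.

import java.util.*;

// Sketch of merging items with equivalent LM states. "*" marks a state word
// that provably cannot affect any future LM query.
public class StateMergingSketch {
    record Signature(int i, int j, String lhs, List<String> left, List<String> right) {}

    public static void main(String[] args) {
        Map<Signature, String> cell = new HashMap<>();  // one entry per equivalent state
        // "below cat | some rat" and "below cats | many rat" both reduce to
        // "below * | * rat" once the starred words are shown to be irrelevant:
        Signature a = new Signature(0, 3, "X", List.of("below", "*"), List.of("*", "rat"));
        Signature b = new Signature(0, 3, "X", List.of("below", "*"), List.of("*", "rat"));
        cell.putIfAbsent(a, "derivation 1");
        cell.putIfAbsent(b, "derivation 2");            // merged: same key as a
        System.out.println(cell.size());                // prints 1, not 2
    }
}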
Back-off Parameterization of m-gram LMs
• LM probability computation: if the m-gram e_{i−m+1} … e_i is listed, use its probability directly; otherwise back off,
  P(e_i | e_{i−m+1} … e_{i−1}) = β(e_{i−m+1} … e_{i−1}) · P(e_i | e_{i−m+2} … e_{i−1})
• Observations
  • A larger m leads to more backoff
  • The default backoff weight is 1: for an m-gram not listed, β(·) = 1
• Excerpt from an ARPA-format LM (log10 probability, bigram, optional log10 backoff weight):
  -4.250922 party files
  -4.741889 party filled
  -4.250922 party finance -0.1434139
  -4.741889 party financed
  -4.741889 party finances -0.2361806
  -4.741889 party financially
  -3.33127 party financing -0.1119054
  -3.277455 party finished -0.4362795
  -4.012205 party fired
  -4.741889 party fires
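A minimal sketch of this back-off computation, with the default weight of 1 (log 0) applied for unlisted m-grams. The map-based storage and the unigram floor value are assumptions for illustration; a real LM would load the maps from an ARPA file like the excerpt above.

import java.util.*;

public class BackoffLmSketch {
    static Map<String, Double> logProb = new HashMap<>();     // listed m-grams
    static Map<String, Double> logBackoff = new HashMap<>();  // listed backoff weights

    // P(w | h) = P_listed(h w)               if "h w" is listed
    //          = beta(h) * P(w | shorter h)  otherwise, with beta(h) = 1 if h is unlisted
    static double logP(List<String> history, String w) {
        String full = String.join(" ", history) + (history.isEmpty() ? "" : " ") + w;
        if (logProb.containsKey(full)) return logProb.get(full);
        if (history.isEmpty()) return -99;  // unseen unigram: floor value
        double beta = logBackoff.getOrDefault(String.join(" ", history), 0.0); // log10(1) = 0
        return beta + logP(history.subList(1, history.size()), w);
    }

    public static void main(String[] args) {
        logProb.put("party", -3.0);
        logProb.put("party finance", -4.250922);
        logBackoff.put("party finance", -0.1434139);
        // "party finance committee" is unlisted: back off through beta(party finance)
        System.out.println(logP(List.of("party", "finance"), "committee"));
    }
}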
Equivalent State Maintenance: Right-side
• For the case of a 4-gram LM, the right-side state is (e_{l−2}, e_{l−1}, e_l) and the future words e_{l+1} e_{l+2} e_{l+3} … follow it
• If IS-A-PREFIX(e_{l−2} e_{l−1} e_l) = no, no listed 4-gram begins with the state, so every future query backs off:
  P(e_{l+1} | e_{l−2} e_{l−1} e_l) = P(e_{l+1} | e_{l−1} e_l) β(e_{l−2} e_{l−1} e_l) = P(e_{l+1} | e_{l−1} e_l)
  The result is independent of e_{l−2}, and the backoff weight is one (the weight of an unlisted m-gram is 1), so e_{l−2} can be replaced by a wildcard
• Note that IS-A-PREFIX(e_{l−1} e_l) = no implies IS-A-PREFIX(e_{l−1} e_l e_{l+1}) = no, so the test repeats on shorter suffixes:
  state | IS-A-PREFIX | equivalent state
  e_{l−2} e_{l−1} e_l | no | * e_{l−1} e_l
  e_{l−1} e_l | no | * * e_l
  e_l | no | * * *
• Why not right to left? Whether a word can be ignored depends on both its left and right sides, which complicates the procedure (a code sketch follows)
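A sketch of this right-side truncation. The isAPrefix predicate (e.g. a trie over the listed m-grams) and the method names are assumptions; only the wildcarding loop is shown.

import java.util.*;
import java.util.function.Predicate;

public class RightStateSketch {
    // state = (e_{l-2}, e_{l-1}, e_l); wildcard words left to right while no
    // listed m-gram starts with the surviving suffix, since every future query
    // then backs off past those words with backoff weight 1.
    static List<String> equivalentRightState(List<String> state, Predicate<List<String>> isAPrefix) {
        List<String> result = new ArrayList<>(state);
        for (int drop = 0; drop < state.size(); drop++) {
            if (isAPrefix.test(state.subList(drop, state.size()))) break;
            result.set(drop, "*");
        }
        return result;
    }

    public static void main(String[] args) {
        // toy LM: only "rat" begins a listed m-gram
        Predicate<List<String>> isAPrefix = s -> String.join(" ", s).equals("rat");
        System.out.println(equivalentRightState(List.of("some", "new", "rat"), isAPrefix));
        // prints [*, *, rat]
    }
}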
Equivalent State Maintenance: Left-side
• For the case of a 4-gram LM, the left-side state is (e_1, e_2, e_3) and the future words … e_{−2} e_{−1} e_0 precede it
• If IS-A-SUFFIX(e_1 e_2 e_3) = no, no listed 4-gram ends with the state, so the query for e_3 backs off:
  P(e_3 | e_0 e_1 e_2) = P(e_3 | e_1 e_2) β(e_0 e_1 e_2)
  P(e_3 | e_1 e_2) is a finalized probability that can be accumulated now; the remaining dependence on the future context is only through backoff weights and is independent of e_3, so e_3 can be replaced by a wildcard. Remember to factor in the backoff weights later
• Repeating the test on shorter prefixes:
  state | IS-A-SUFFIX | equivalent state
  e_1 e_2 e_3 | no | e_1 e_2 *
  e_1 e_2 | no | e_1 * *
  e_1 | no | * * *
• Once the left context becomes known, the deferred backoff weights are applied, e.g.
  P(e_1 | e_{−2} e_{−1} e_0) = P(e_1) β(e_0) β(e_{−1} e_0) β(e_{−2} e_{−1} e_0)
  P(e_2 | e_{−1} e_0 e_1) = P(e_2 | e_1) β(e_0 e_1) β(e_{−1} e_0 e_1)
• Why not left to right? Whether a word can be ignored depends on both its left and right sides, which complicates the procedure (a code sketch follows)
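The mirror-image sketch for the left side; again the isASuffix predicate and the names are assumptions. The key asymmetry, noted in the comment, is that the left side finalizes backed-off probabilities now but must remember backoff weights for later.

import java.util.*;
import java.util.function.Predicate;

public class LeftStateSketch {
    // state = (e_1, e_2, e_3); wildcard words right to left while no listed
    // m-gram ends with the surviving prefix. Unlike the right side, the
    // backed-off probabilities are finalized here, but the backoff weights
    // beta(... e_0 e_1 ...) still depend on the unseen left context and must
    // be factored in when this item is combined later.
    static List<String> equivalentLeftState(List<String> state, Predicate<List<String>> isASuffix) {
        List<String> result = new ArrayList<>(state);
        for (int keep = state.size(); keep > 0; keep--) {
            if (isASuffix.test(state.subList(0, keep))) break;
            result.set(keep - 1, "*");
        }
        return result;
    }

    public static void main(String[] args) {
        // toy LM: only "below" ends a listed m-gram
        Predicate<List<String>> isASuffix = s -> String.join(" ", s).equals("below");
        System.out.println(equivalentLeftState(List.of("below", "cat", "sat"), isASuffix));
        // prints [below, *, *]
    }
}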
Equivalent State Maintenance: summary
[Equations: the original cost function versus the modified cost function, each split into a finalized probability, an estimated probability, and a state-extraction function; the modified version finalizes the probabilities of wildcarded state words and defers only their backoff weights.]
Experimental Results: Decoding Speed
• System training
  • Task: Chinese-to-English translation
  • Sub-sampling a bitext of about 3M sentence pairs yields 570k sentence pairs
  • LM training data: Gigaword and the English side of the bitext
• Decoding setup
  • Number of rules: 3M
  • Number of m-grams: 49M
• 38 times faster than the baseline!
Experimental Results: Distributed LM
• Distributed language model: eight 7-gram LMs
• Decoding speed: 12.2 sec/sent
Experimental Results: Equivalent LM States
[Plot: search effort (number of items) versus search quality for the regular and equivalent LM state maintenance methods.]
• Equivalent LM state maintenance
  • Sparse LM: a 7-gram LM built on about 19M words
  • Dense LM: a 3-gram LM built on about 130M words
  • With the dense LM, equivalent LM state maintenance is slower than the regular method: backoff happens less frequently, and the suffix/prefix information lookup is inefficient
Summary
• We describe a scalable parsing-based MT decoder
• The decoder has been successfully used for decoding millions of sentences in a large-scale discriminative training task
• We propose a method to maintain equivalent LM states
• The decoder is available at http://www.cs.jhu.edu/~zfli/
Acknowledgements • Thanks to Philip Resnik for letting me use the UMD Python decoder • Thanks to UMD MT group members for very helpful discussions • Thanks to David Chiang for Hiero and his original implementation in Python
Grammar Formalism
• Synchronous context-free grammar (SCFG)
  • Ts: a set of source-language terminal symbols
  • Tt: a set of target-language terminal symbols
  • N: a shared set of nonterminal symbols
  • A set of rules of the form X → ⟨γ, α, ~⟩, where X ∈ N, γ is a string over N ∪ Ts, α is a string over N ∪ Tt, and ~ is a one-to-one alignment between the nonterminals of γ and α
  • A typical rule looks like: X → (X0 的 X1, X1 of X0)
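A minimal Java representation of such a rule, with illustrative field names (not Joshua's classes); nonterminal co-indexing is written into the symbols themselves.

public class ScfgRuleSketch {
    record Rule(String lhs, String[] sourceSide, String[] targetSide) {}

    public static void main(String[] args) {
        // the typical rule from above: X -> (X0 的 X1, X1 of X0)
        Rule r = new Rule("X",
                new String[] {"X0", "的", "X1"},
                new String[] {"X1", "of", "X0"});
        System.out.println(r.lhs() + " -> (" + String.join(" ", r.sourceSide())
                + ", " + String.join(" ", r.targetSide()) + ")");
    }
}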
Parallel and Distributed Decoding
• Parallel decoding
  • Divide the test set into multiple parts
  • Each part is decoded by a separate thread
  • The threads share the language and translation models in memory
• Distributed language model (DLM)
  • Training
    • Divide the corpora into multiple parts
    • Train a LM on each part
    • Find the optimal weights among the LMs by maximizing the likelihood of a dev set
  • Decoding (see the sketch below)
    • Load the LMs onto different servers
    • The decoder remotely calls the servers to obtain the probabilities
    • The decoder then interpolates the probabilities on the fly
    • To save communication overhead, a cache is maintained
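A minimal sketch of the DLM lookup with on-the-fly interpolation and a cache. The remote server call is abstracted as a plain function, and all names and the stand-in probabilities are assumptions for illustration.

import java.util.*;
import java.util.function.Function;

public class DistributedLmSketch {
    private final List<Function<String, Double>> servers; // one lookup per sub-LM server
    private final double[] weights;                       // interpolation weights tuned on a dev set
    private final Map<String, Double> cache = new HashMap<>();

    DistributedLmSketch(List<Function<String, Double>> servers, double[] weights) {
        this.servers = servers;
        this.weights = weights;
    }

    double prob(String ngram) {
        return cache.computeIfAbsent(ngram, k -> {         // cache saves round trips
            double p = 0;
            for (int i = 0; i < servers.size(); i++)       // remote calls to the LM servers
                p += weights[i] * servers.get(i).apply(k); // interpolate on the fly
            return p;
        });
    }

    public static void main(String[] args) {
        DistributedLmSketch lm = new DistributedLmSketch(
                List.of(s -> 0.01, s -> 0.03),             // stand-ins for two remote servers
                new double[] {0.6, 0.4});
        System.out.println(lm.prob("will be held"));       // 0.018; a repeat call hits the cache
    }
}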
Chart-parsing (details)
• The decoding task is to find the best derivation of the input sentence under the grammar and the LM
• State of an item: source span, left-side nonterminal symbol, and left/right LM state
• Decoding complexity: roughly O(N³ |T|^{4(m−1)}) for a source sentence of N words, target vocabulary T, and an m-gram LM, since each item stores m−1 boundary words on each side
Hypergraph (details)
[Figure: the hypergraph for 垫子0 上1 的2 猫3 shown earlier, with the goal item S, items such as X | 0, 2 | the mat | NA, X | 3, 4 | a cat | NA, X | 0, 4 | the mat | a cat, and X | 0, 4 | a cat | the mat, and hyperedges for the rules X → (X0 的 X1, …) and S → (X0, X0); one derivation yields "a cat on the mat".]
• A hypergraph consists of a set of nodes and hyperedges
  • In parsing, they correspond to an item and a deductive step, respectively
• Roughly, a hyperedge can be thought of as a rule with pointers (see the sketch below)
• State of an item: source span, left-side nonterminal symbol, and left/right LM state
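A minimal sketch of this structure in Java; the class and field names are illustrative, not Joshua's.

import java.util.*;

public class HypergraphSketch {
    static class Node {                      // an item: span + nonterminal + LM state
        int i, j;
        String lhs, leftLmState, rightLmState;
        List<HyperEdge> incomingEdges = new ArrayList<>(); // alternative derivations of this item
    }
    static class HyperEdge {                 // "a rule with pointers" to its antecedent items
        String rule;
        List<Node> tailNodes = new ArrayList<>();
    }

    public static void main(String[] args) {
        Node mat = new Node(); mat.i = 0; mat.j = 2; mat.lhs = "X";
        Node cat = new Node(); cat.i = 3; cat.j = 4; cat.lhs = "X";
        Node goal = new Node(); goal.i = 0; goal.j = 4; goal.lhs = "X";
        HyperEdge e = new HyperEdge();
        e.rule = "X -> (X0 的 X1, X1 of X0)";
        e.tailNodes.addAll(List.of(mat, cat)); // one hyperedge, two tail nodes
        goal.incomingEdges.add(e);             // multiple incoming edges = packed forest
        System.out.println(goal.incomingEdges.size());
    }
}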