
Effective Approaches to Attention-based Neural Machine Translation


Presentation Transcript


  1. Effective Approaches to Attention-based Neural Machine Translation
  {F2019313020, F2019313001}@UMT.EDU.PK

  2. Introduction
  • The attentional mechanism has been developed to improve neural machine translation (NMT) by selectively focusing on parts of the source sentence during translation.
  • The paper examines two classes of attention-based mechanisms:
  • Global approach -> attends to all source words.
  • Local approach -> attends to a subset of source words.
  • Local attention achieves a gain of 5.0 BLEU points over non-attentional systems.
  • The paper builds on 15 other research papers.
  • More research has since been done based on these papers; the best BLEU achieved since is 59, compared to the 29 BLEU points achieved in this paper.

  3. Neural Machine Translation
  • The ultimate goal of any NMT model is to take a sentence in one language as input and return that sentence translated into a different language as output.
  • NMT is a large neural network trained in an end-to-end fashion, with the ability to generalize well to very long sequences.
  • It does not have to store gigantic phrase tables and language models.
  • NMT has a small memory footprint.
  • Explanation of NMT: https://towardsdatascience.com/neural-machine-translation-15ecf6b0b

  4. Neural Machine Translation
  • The NMT model directly models the conditional probability p(y | x) of translating a source sentence x_1, ..., x_n into a target sentence y_1, ..., y_m.
  • It consists of an encoder, which computes a representation s for each source sentence,
  • and a decoder, which generates one target word at a time (see the decomposition below).
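The decoder's word-by-word generation corresponds to the log-likelihood decomposition used in the paper, with s the source representation produced by the encoder:

\log p(y \mid x) = \sum_{j=1}^{m} \log p\left(y_j \mid y_{<j}, s\right)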

  5. Neural Machine Translation
  • The natural choice to model such a decomposition on the decoder side is an RNN architecture.
  • Papers presented in 2013, 2014, and 2015 differ in which RNN architecture is used for the decoder and in how the encoder computes the source sentence representation.
  • Luong (co-author of the paper) used stacked RNN layers with LSTM units for both the encoder and the decoder (a code sketch follows).
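As a rough illustration of that architecture (not the authors' code), the following PyTorch sketch stacks LSTM layers for both encoder and decoder. The 1000-dimensional, 4-layer defaults mirror the configuration the paper reports, but the class and its details are my own simplification:

import torch
import torch.nn as nn

class StackedLSTMSeq2Seq(nn.Module):
    def __init__(self, src_vocab=50000, tgt_vocab=50000, dim=1000, layers=4):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, dim)
        # Encoder and decoder are stacks of LSTM layers.
        self.encoder = nn.LSTM(dim, dim, num_layers=layers, batch_first=True)
        self.decoder = nn.LSTM(dim, dim, num_layers=layers, batch_first=True)
        self.out = nn.Linear(dim, tgt_vocab)  # plays the role of the transformation g

    def forward(self, src_ids, tgt_ids):
        # Encode the source; keep all top-layer hidden states for attention later.
        src_states, (h, c) = self.encoder(self.src_emb(src_ids))
        # Initialize the decoder with the encoder's final states.
        tgt_states, _ = self.decoder(self.tgt_emb(tgt_ids), (h, c))
        return self.out(tgt_states), src_states  # vocabulary logits and source states

Given integer tensors src_ids and tgt_ids of shape (batch, length), the forward pass returns per-step vocabulary logits plus the source hidden states that the attention mechanism consults later.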

  6. Probability of decoding
  • The probability of decoding each word y_j is p(y_j | y_<j, s) = softmax(g(h_j)).
  • g is the transformation function that outputs a vocabulary-sized vector; one can also provide g with other inputs, such as the currently predicted word y_j, as proposed by Bahdanau et al. (2015).
  • h_j is the RNN hidden unit, abstractly computed as h_j = f(h_{j-1}, s).
  • The function f computes the current hidden state and can be a vanilla RNN, a GRU, or an LSTM unit.
  • The source representation s is used only once, to initialize the decoder hidden state (a decoding-step sketch follows).
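A single decoding step for the sketch above might look like this, where the decoder LSTM plays the role of f and the output projection followed by a softmax plays the role of g; again an illustrative sketch, not the paper's implementation:

import torch.nn.functional as F

def decode_step(model, prev_word_id, state):
    # prev_word_id: (batch,) indices of the previously emitted words
    # state: the decoder's (h, c) tuple carried over from the last step
    emb = model.tgt_emb(prev_word_id).unsqueeze(1)        # (batch, 1, dim)
    h_j, state = model.decoder(emb, state)                # h_j = f(h_{j-1}, ...)
    probs = F.softmax(model.out(h_j.squeeze(1)), dim=-1)  # softmax(g(h_j)) over the vocabulary
    return probs, state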

  7. Proposed NMT model and attention
  • A stacking LSTM architecture is used for the proposed NMT systems.
  • The LSTM unit used is the one defined in Zaremba et al. (2015).
  • Here, the source representation s is a set of source hidden states.
  • The set s is consulted throughout the entire course of the translation process.
  • This approach is referred to as the attention mechanism.

  8. Attention-based models
  • Classified into two broad categories: global and local.
  • The classes differ in whether attention is placed on all source positions or only a few.
  • Common to both types:
  • at each time step t in the decoding phase,
  • both approaches take as input the hidden state h_t at the top layer of the stacking LSTM;
  • the goal is to derive a context vector c_t that helps predict the current target word y_t;
  • information from both vectors is then combined to produce an attentional hidden state (see the equations below).
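For reference, the two shared steps correspond to the following equations from the paper: the attentional hidden state combines c_t with h_t, and the predictive distribution is a softmax over it:

\tilde{h}_t = \tanh\left(W_c \left[ c_t ; h_t \right]\right)
p\left(y_t \mid y_{<t}, x\right) = \mathrm{softmax}\left(W_s \tilde{h}_t\right)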

  9. Global attention
  • At each time step t, the model infers a variable-length alignment weight vector a_t.
  • a_t is based on the current target state h_t and all of the source states h̄_s.
  • The global context vector c_t is computed as a weighted average, according to a_t, over all the source states.
  • Drawback: global attention has to attend to all words on the source side for each target word (a sketch of one step follows).
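A minimal sketch of one global-attention step, assuming the dot scoring function (the paper also defines "general" and "concat" scores); tensor shapes and names are my own:

import torch
import torch.nn.functional as F

def global_attention(h_t, src_states):
    # h_t: (batch, dim) current target state; src_states: (batch, src_len, dim)
    scores = torch.bmm(src_states, h_t.unsqueeze(2)).squeeze(2)  # dot score, (batch, src_len)
    a_t = F.softmax(scores, dim=-1)                              # variable-length alignment weights
    c_t = torch.bmm(a_t.unsqueeze(1), src_states).squeeze(1)     # weighted average of source states
    return c_t, a_t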

  10. Local attention
  • The local attention model focuses only on a small subset of source positions per target word.
  • The model first predicts a single aligned position p_t for the current target word.
  • A window centered around the source position p_t is used to compute c_t, the weighted average of the source hidden states within the window.
  • The weights a_t are inferred from the current target state h_t and those source states h̄_s (see the sketch below).
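A sketch of the local-p variant under simplifying assumptions: the Gaussian centred at the predicted position p_t is applied over all source positions rather than truncating to the window [p_t - D, p_t + D] as the paper does, and W_p, v_p, and the half-window D are illustrative parameters:

import torch
import torch.nn.functional as F

def local_p_attention(h_t, src_states, W_p, v_p, D=10):
    # h_t: (batch, dim); src_states: (batch, src_len, dim)
    # W_p: (dim, dim); v_p: (dim,)
    batch, src_len, _ = src_states.shape
    # Predicted aligned position p_t in [0, src_len).
    p_t = src_len * torch.sigmoid(torch.tanh(h_t @ W_p) @ v_p)       # (batch,)
    # Content-based alignment weights (dot score), as in the global model.
    scores = torch.bmm(src_states, h_t.unsqueeze(2)).squeeze(2)      # (batch, src_len)
    a_t = F.softmax(scores, dim=-1)
    # Favour source positions near p_t with a Gaussian of std sigma = D / 2.
    positions = torch.arange(src_len, dtype=h_t.dtype).unsqueeze(0)  # (1, src_len)
    sigma = D / 2.0
    a_t = a_t * torch.exp(-(positions - p_t.unsqueeze(1)) ** 2 / (2 * sigma ** 2))
    c_t = torch.bmm(a_t.unsqueeze(1), src_states).squeeze(1)         # (batch, dim)
    return c_t, a_t, p_t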

  11. Training details
  • Models are trained on the WMT'14 training data.
  • 4.5M sentence pairs (116M English words, 110M German words).
  • The vocabulary is limited to the top 50K most frequent words.
  • Words not in this shortlisted vocabulary are converted to a universal token.
  • Sentence pairs exceeding 50 words in length are filtered out, and mini-batches are shuffled (a preprocessing sketch follows).
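A hypothetical preprocessing sketch for the steps above (vocabulary truncation, universal-token replacement, length filtering); the helper names and the "<unk>" token string are my assumptions:

from collections import Counter

def build_vocab(sentences, size=50000):
    # Keep only the 50K most frequent words.
    counts = Counter(word for sent in sentences for word in sent.split())
    return {w for w, _ in counts.most_common(size)}

def preprocess(pairs, src_vocab, tgt_vocab, max_len=50):
    keep = []
    for src, tgt in pairs:
        src_toks, tgt_toks = src.split(), tgt.split()
        if len(src_toks) > max_len or len(tgt_toks) > max_len:
            continue  # filter out overly long sentence pairs
        # Map out-of-vocabulary words to a universal token.
        src_toks = [w if w in src_vocab else "<unk>" for w in src_toks]
        tgt_toks = [w if w in tgt_vocab else "<unk>" for w in tgt_toks]
        keep.append((src_toks, tgt_toks))
    return keep  # mini-batches drawn from this list are shuffled during training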

  12. WMT'14 English-German Results
