
CSE 291G : Deep Learning for Sequences

Learn about Named Entity Recognition (NER): its applications, importance, and approaches using BLSTMs, CNNs, and LSTM RNNs with character-level feature extraction. Understand how NER aids content recommendation, customer support, and more.

Presentation Transcript


  1. CSE 291G: Deep Learning for Sequences. Paper presentation. Topic: Named Entity Recognition. Presenter: Rithesh

  2. Outline • Named Entity Recognition and its applications • Existing methods • Character-level feature extraction • RNN: BLSTM-CNNs

  3. Named Entity Recognition (NER)

  4. WHAT? Named Entity Recognition (NER)

  5. Named Entity Recognition (entity identification, entity chunking & entity extraction) • Locate and classify named entity mentions in unstructured text into predefined categories: person names, organizations, locations, time expressions, etc. • Ex: Kim bought 500 shares of IBM in 2010.

  6. Named Entity Recognition (entity identification, entity chunking & entity extraction) • Locate and classify named entity mentions in unstructured text into predefined categories: person names, organizations, locations, time expressions, etc. • Ex: Kim bought 500 shares of IBM in 2010.

  7. Named Entity Recognition (entity identification, entity chunking & entity extraction) • Locate and classify named entity mentions in unstructured text into predefined categories: person names, organizations, locations, time expressions, etc. • Ex: Kim bought 500 shares of IBM in 2010. (Kim: person name; IBM: organization; 2010: time)

  8. Named Entity Recognition (entity identification, entity chunking & entity extraction) • Locate and classify named entity mentions in unstructured text into predefined categories: person names, organizations, locations, time expressions, etc. • Ex: Kim bought 500 shares of IBM in 2010. • Importance of locating the full named entity span in a sentence. Ex: Kim bought 500 shares of Bank of America in 2010. (Kim: person name; Bank of America: organization; 2010: time) Here the organization mention spans three tokens, so the system must mark the whole span, not just "Bank".

  9. WHAT? Named Entity Recognition (NER) WHY?

  10. Applications of NER • Content recommendation • Customer support • Classifying content for news providers • Efficient search algorithms • Question answering (QA) • Machine translation systems • Automatic summarization systems

  11. WHAT? Named Entity Recognition (NER) WHY? HOW?

  12. Approaches: • ML classification techniques (e.g., SVM, perceptron model, CRF (Conditional Random Fields)). Drawback: requires hand-crafted features. • Neural network model (Collobert et al., "Natural Language Processing (Almost) from Scratch"). Drawbacks: (i) simple feedforward NN with a fixed window size; (ii) depends solely on word embeddings and fails to exploit character-level features such as prefixes and suffixes. • RNN: LSTM • Handles variable-length input and long-term memory • First applied to NER by Hammerton in 2003

  13. RNN: LSTM • Overcomes drawbacks of the existing systems • Accounts for variable-length input and long-term memory • Fails to handle cases in which the i-th word of a sentence S depends on words at positions greater than i in S. Ex: "Teddy bears are on sale." vs. "Teddy Roosevelt was a great president."

  14. RNN: LSTM • Overcomes drawbacks of the existing systems • Accounts for variable-length input and long-term memory • Fails to handle cases in which the i-th word of a sentence S depends on words at positions greater than i in S. Ex: "Teddy bears are on sale." vs. "Teddy Roosevelt was a great president." SOLUTION: Bi-directional LSTM (BLSTM) - captures information from the past and from the future.

  15. RNN: LSTM • Overcomes drawbacks of the existing systems • Accounts for variable-length input and long-term memory • Fails to handle cases in which the i-th word of a sentence S depends on words at positions greater than i in S. Ex: "Teddy bears are on sale." vs. "Teddy Roosevelt was a great president." SOLUTION: Bi-directional LSTM (BLSTM) - captures information from the past and from the future. Still fails to exploit character-level features.
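To make the word-level BLSTM concrete, here is a minimal sketch of a bidirectional LSTM tagger over word embeddings. It assumes PyTorch; the vocabulary size, embedding and hidden dimensions, and tag count are illustrative values, not taken from the slides or the papers.

```python
import torch
import torch.nn as nn

class BLSTMTagger(nn.Module):
    """Minimal bidirectional LSTM tagger over word embeddings."""
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_tags):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # bidirectional=True lets each position see both left and right context
        self.blstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                             bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, num_tags)

    def forward(self, word_ids):              # word_ids: (batch, seq_len)
        x = self.embed(word_ids)              # (batch, seq_len, embed_dim)
        h, _ = self.blstm(x)                  # (batch, seq_len, 2*hidden_dim)
        return self.out(h)                    # per-token tag scores

# toy usage: a batch of one 5-token sentence
tagger = BLSTMTagger(vocab_size=10000, embed_dim=50, hidden_dim=100, num_tags=9)
scores = tagger(torch.randint(0, 10000, (1, 5)))
print(scores.shape)  # torch.Size([1, 5, 9])
```

Because each token's output mixes a left-to-right and a right-to-left pass, the "Teddy" ambiguity above can be resolved by words that appear later in the sentence.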

  16. Techniques to capture character-level features • Santos and Labeau (2015) proposed a model for character-level feature extraction using a CNN for NER and POS tagging. • Ling (2015) proposed a model for character-level feature extraction using a BLSTM for POS tagging.

  17. Techniques to capture character-level features • Santos and Labeau (2015) proposed a model for character-level feature extraction using a CNN for NER and POS tagging. • Ling (2015) proposed a model for character-level feature extraction using a BLSTM for POS tagging. • CNN or BLSTM?

  18. Techniques to capture character-level features • Santos and Labeau (2015) proposed a model for character-level feature extraction using a CNN for NER and POS tagging. • Ling (2015) proposed a model for character-level feature extraction using a BLSTM for POS tagging. • CNN or BLSTM? • BLSTM did not perform significantly better than the CNN, and BLSTM is computationally more expensive to train.

  19. Techniques to capture character-level features • Santos and Labeau (2015) proposed a model for character-level feature extraction using a CNN for NER and POS tagging. • Ling (2015) proposed a model for character-level feature extraction using a BLSTM for POS tagging. • CNN or BLSTM? • BLSTM did not perform significantly better than the CNN, and BLSTM is computationally more expensive to train. BLSTM: word-level feature extraction. CNN: character-level feature extraction.

  20. Named Entity Recognition with Bidirectional LSTM-CNNs. Jason P.C. Chiu and Eric Nichols (2016). Named entity recognition with bidirectional LSTM-CNNs. Transactions of the Association for Computational Linguistics, 4, 357-370. • Inspired by: • Ronan Collobert, Jason Weston, Leon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011b. Natural language processing (almost) from scratch. The Journal of Machine Learning Research, 12:2493-2537. • Cicero Santos and Victor Guimaraes. 2015. Boosting named entity recognition with neural character embeddings. Proceedings of the Fifth Named Entities Workshop, pages 25-33.

  21. Reference paper: Boosting NER with Neural Character Embeddings • CharWNN deep neural network - uses word-level and character-level representations (embeddings) to perform sequential classification. • HAREM I: Portuguese; SPA CoNLL-2002: Spanish. • CharWNN extends Collobert et al.'s (2011) neural network architecture for sequential classification by adding a convolutional layer to extract character-level representations.

  22. CharWNN • Input: a sentence • Output: for each word in the sentence, a score for each class

  23. CharWNN • Input: a sentence • Output: for each word in the sentence, a score for each class. S = <w1, w2, ..., wN>

  24. CharWNN • Input: a sentence • Output: for each word in the sentence, a score for each class. S = <w1, w2, ..., wN>. Each word wn is mapped to a joint representation un = [r^wrd; r^wch], the word-level embedding concatenated with the character-level embedding.

  25. CharWNN • Input: a sentence • Output: for each word in the sentence, a score for each class. S = <w1, w2, ..., wN>. Each word wn is mapped to a joint representation un = [r^wrd; r^wch], the word-level embedding concatenated with the character-level embedding.

  26. CNN for character embedding

  27. CNN for character embedding W = <c1, c2, ..., cM> (the word as a sequence of characters)

  28. CNN for character embedding W = <c1, c2, ..., cM> (the word as a sequence of characters)

  29. CNN for character embedding W = <c1, c2, ..., cM>. Matrix-vector operation (convolution) with window size k over the character embeddings.

  30. CNN for character embedding W = <c1, c2, ..., cM>. Matrix-vector operation (convolution) with window size k over the character embeddings.

  31. CNN for character embedding W = <c1, c2, ..., cM>. Matrix-vector operation (convolution) with window size k over the character embeddings; a max operation over the resulting vectors yields the fixed-size character-level embedding r^wch.
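The character-level convolution on these slides can be sketched as follows. This is an illustrative PyTorch version, not the authors' code; the alphabet size, character-embedding dimension, output dimension, and window size k = 3 are assumptions.

```python
import torch
import torch.nn as nn

class CharCNN(nn.Module):
    """Character-level convolution + max pooling, in the spirit of CharWNN.
    Hyperparameters here are illustrative, not the paper's."""
    def __init__(self, num_chars, char_dim=10, out_dim=50, window=3):
        super().__init__()
        self.char_embed = nn.Embedding(num_chars, char_dim)
        # convolution over the character sequence with window size k
        self.conv = nn.Conv1d(char_dim, out_dim, kernel_size=window,
                              padding=window // 2)

    def forward(self, char_ids):                  # (batch, word_len)
        e = self.char_embed(char_ids)             # (batch, word_len, char_dim)
        e = e.transpose(1, 2)                     # (batch, char_dim, word_len)
        conv = torch.relu(self.conv(e))           # (batch, out_dim, word_len)
        # max over character positions -> fixed-size vector r^wch per word
        r_wch, _ = conv.max(dim=2)                # (batch, out_dim)
        return r_wch

char_cnn = CharCNN(num_chars=100)
r_wch = char_cnn(torch.randint(0, 100, (1, 7)))   # one 7-character word
print(r_wch.shape)  # torch.Size([1, 50])
```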

  32. CharWNN • Input: a sentence • Output: for each word in the sentence, a score for each class. S = <w1, w2, ..., wN>; each word wn yields un = [r^wrd; r^wch], where r^wch comes from the character-level CNN, giving the sequence <u1, u2, ..., uN>.
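As a tiny illustration of un = [r^wrd; r^wch], the two vectors are simply concatenated per token; the 50/50 dimension split below is arbitrary, not the paper's setting.

```python
import torch

r_wrd = torch.randn(1, 50)   # word embedding lookup for w_n
r_wch = torch.randn(1, 50)   # character-level vector from the CNN above
u_n = torch.cat([r_wrd, r_wch], dim=1)
print(u_n.shape)             # torch.Size([1, 100])
```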

  33. CharWNN • Input to the convolution layer: <u1, u2, ..., uN>

  34. CharWNN • Input to the convolution layer: <u1, u2, ..., uN> • Followed by two neural network layers

  35. CharWNN • Input to the convolution layer: <u1, u2, ..., uN> • Two neural network layers produce, for each position, a score for every tag • For a transition score matrix A, where A[t, u] is the score of moving from tag t to tag u, the score of a sentence along a tag path is the sum of the per-position tag scores and the transition scores along that path.
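A hedged sketch of that path score, assuming per-position tag scores from the network and a learned transition matrix A (PyTorch tensors; the variable names are mine, not the paper's):

```python
import torch

def sentence_score(tag_scores, transitions, tags):
    """Score of one tag path: sum of per-token tag scores plus
    transition scores between consecutive tags (CharWNN / Collobert-style).
    tag_scores:  (seq_len, num_tags) network outputs
    transitions: (num_tags, num_tags) learned matrix A
    tags:        (seq_len,) a candidate tag path
    """
    score = tag_scores[0, tags[0]]
    for n in range(1, len(tags)):
        score = score + transitions[tags[n - 1], tags[n]] + tag_scores[n, tags[n]]
    return score
```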

  36. Network training for CharWNN • CharWNN is trained by minimizing the negative log-likelihood over the training set D. • The sentence score is interpreted as a conditional probability over a path: the score is exponentiated and normalized with respect to all possible paths. • Stochastic gradient descent (SGD) is used to minimize the negative log-likelihood with respect to the network parameters.
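The normalization over all possible paths can be computed with the forward algorithm (dynamic programming), which keeps the negative log-likelihood tractable. A sketch under the same assumptions as the path-score snippet above:

```python
import torch

def log_partition(tag_scores, transitions):
    """log of the sum over all tag paths of exp(path score),
    computed with the forward algorithm."""
    alpha = tag_scores[0]                            # (num_tags,)
    for n in range(1, tag_scores.size(0)):
        # alpha[j] = logsumexp_i(alpha[i] + A[i, j]) + score(n, j)
        alpha = torch.logsumexp(alpha.unsqueeze(1) + transitions, dim=0) + tag_scores[n]
    return torch.logsumexp(alpha, dim=0)

# negative log-likelihood of the gold path, minimized by SGD:
# nll = log_partition(tag_scores, transitions) - sentence_score(tag_scores, transitions, gold_tags)
```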

  37. Embeddings • Word-level embeddings: for Portuguese NER, the word-level embeddings previously trained by Santos (2014) were used; for Spanish, embeddings were trained on the Spanish Wikipedia. • Character-level embeddings: unsupervised pre-training of character-level embeddings was NOT performed. The character-level embeddings are initialized by randomly sampling each value from a uniform distribution.

  38. Corpora: Portuguese & Spanish

  39. Hyperparameters

  40. Comparison of different NNs for the SPA CoNLL-2002 corpus

  41. Comparison of different NNs for the SPA CoNLL-2002 corpus; comparison with the state-of-the-art for the SPA CoNLL-2002 corpus

  42. Comparison of different NNs for the HAREM I corpus; comparison with the state-of-the-art for the HAREM I corpus

  43. Chiu, J. P., & Nichols, E. (2016). Named entity recognition with bidirectional LSTM-CNNs. Transactions of the Association for Computational Linguistics, 4, 357-370. BLSTM: word-level feature extraction. CNN: character-level feature extraction.

  44. Character-level feature extraction

  45. Word-level feature extraction

  46. Word-level feature extraction
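Putting the two levels together: a minimal sketch in which character-CNN features are concatenated with word embeddings and fed to a word-level BLSTM. It reuses the CharCNN sketch from the character-embedding slides; all sizes are illustrative, not the paper's.

```python
import torch
import torch.nn as nn

class BLSTMCNNTagger(nn.Module):
    """Word embeddings concatenated with character-CNN features,
    fed to a word-level BLSTM. CharCNN is the character-level sketch
    given earlier in this transcript; all sizes are illustrative."""
    def __init__(self, vocab_size, num_chars, num_tags,
                 word_dim=50, char_out=25, hidden=100):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, word_dim)
        self.char_cnn = CharCNN(num_chars, out_dim=char_out)
        self.blstm = nn.LSTM(word_dim + char_out, hidden,
                             batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, num_tags)

    def forward(self, word_ids, char_ids):
        # word_ids: (batch, seq_len); char_ids: (batch, seq_len, word_len)
        b, s, w = char_ids.shape
        r_wch = self.char_cnn(char_ids.view(b * s, w)).view(b, s, -1)
        x = torch.cat([self.word_embed(word_ids), r_wch], dim=2)
        h, _ = self.blstm(x)          # bidirectional word-level context
        return self.out(h)            # per-token tag scores
```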

  47. Embeddings • Word embeddings: the 50-dimensional word embeddings released by Collobert (2011b), trained on Wikipedia and the Reuters RCV1 corpus; Stanford's GloVe and Google's word2vec embeddings were also tried. • Character embeddings: a randomly initialized lookup table with values drawn from a uniform distribution with range [-0.5, 0.5], producing character embeddings of 25 dimensions.
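A small sketch of the character lookup-table initialization described on this slide, assuming PyTorch and an arbitrary alphabet size of 100:

```python
import torch.nn as nn

# 25-dimensional character embeddings, initialized uniformly in [-0.5, 0.5]
num_chars, char_dim = 100, 25
char_table = nn.Embedding(num_chars, char_dim)
nn.init.uniform_(char_table.weight, -0.5, 0.5)
```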

  48. Additional features • Additional word-level features: • Capitalization feature: allCaps, upperInitial, lowercase, mixedCaps, noinfo. • Lexicons: SENNA and DBpedia.
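One plausible reading of the capitalization classes listed above, written as a plain-Python feature function; the exact rules in the paper may differ, so treat this as an assumption.

```python
def capitalization_feature(word):
    """Map a token to one of the capitalization classes listed above."""
    if not any(c.isalpha() for c in word):
        return "noinfo"
    if word.isupper():
        return "allCaps"
    if word.islower():
        return "lowercase"
    if word[0].isupper() and word[1:].islower():
        return "upperInitial"
    return "mixedCaps"

print(capitalization_feature("IBM"))    # allCaps
print(capitalization_feature("Kim"))    # upperInitial
```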

  49. Training and inference • Implementation: the torch7 library; the initial states of the LSTM are set to zero vectors. • Objective: maximize the sentence-level log-likelihood. • The objective function and its gradient can be computed efficiently by dynamic programming. • The Viterbi algorithm is used to find the tag sequence that maximizes the sentence-level score. • Learning: training was done by mini-batch stochastic gradient descent (SGD) with a fixed learning rate; each mini-batch consists of multiple sentences with the same number of tokens.
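A hedged sketch of the Viterbi decoding step, reusing the per-token tag scores and transition matrix from the CharWNN sketches above (written in PyTorch for consistency with the other snippets, not the authors' torch7 code):

```python
import torch

def viterbi_decode(tag_scores, transitions):
    """Find the highest-scoring tag path under per-token scores and
    the transition matrix A (the inference step described above)."""
    seq_len, num_tags = tag_scores.shape
    backptr = []
    best = tag_scores[0]                              # (num_tags,)
    for n in range(1, seq_len):
        # candidate[i, j]: best score ending in tag i, then moving to tag j
        candidate = best.unsqueeze(1) + transitions + tag_scores[n].unsqueeze(0)
        best, idx = candidate.max(dim=0)
        backptr.append(idx)
    # follow back-pointers to recover the optimal path
    path = [int(best.argmax())]
    for idx in reversed(backptr):
        path.append(int(idx[path[-1]]))
    return list(reversed(path)), float(best.max())

# toy usage: random scores for a 5-token sentence with 9 tags
path, score = viterbi_decode(torch.randn(5, 9), torch.randn(9, 9))
print(path)
```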

  50. Results
