
Deep Learning for Domain-Specific Entity Extraction from Unstructured Text

Learn about custom entity extraction, word embeddings, and training neural network models for domain-specific tasks like biomedical named entity recognition using deep learning techniques.




Presentation Transcript


  1. Deep Learning for Domain-Specific Entity Extraction from Unstructured Text • Mohamed AbdelHady, Sr. Data Scientist • Zoran Dzunic, Data Scientist • Cloud AI

  2. Goals • What is entity extraction? • When do I need to train a custom entity extraction model? • Why should I use Deep Neural Networks and which one? • What are word embeddings and why should I use them as features? • How do I train a custom word embedding model? • How do I train a custom Deep Neural Network model for entity extraction? • What is the best architecture for these two tasks?

  3. Agenda • Motivation / Entity Extraction • Recurrent Deep Neural Networks • Word Embeddings • Architecture • Results • Conclusion

  4. Entity Extraction • Subtask of information extraction • Also known as Named-entity recognition (NER), entity chunking and entity identification • Find phrases in text that refer to a real-world entity of specific types, e.g., Mr. Zoran is at Strata in San Jose on March 7.

  5. Entity Extraction • Subtask of information extraction • Also known as Named-entity recognition (NER), entity chunking and entity identification • Find phrases in text that refer to a real-world entity of specific types, e.g., Mr. Zoran is at Strata in San Jose on March 7. → Zoran : PERSON, Strata : ORG, San Jose : LOC, March 7 : DATE
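A tagger's per-token output is usually expressed with BIO labels (B- begins an entity, I- continues it, O is outside) and then collapsed into typed phrases. Below is a minimal illustrative sketch of that post-processing step using the slide's example; the helper function and tag set are for illustration only, not taken from the deck.

```python
# Illustrative only: collapse BIO-style per-token tags into (phrase, type) spans.
def bio_to_spans(tokens, tags):
    spans, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append((" ".join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(token)
        else:
            if current:
                spans.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        spans.append((" ".join(current), current_type))
    return spans

tokens = ["Mr.", "Zoran", "is", "at", "Strata", "in", "San", "Jose", "on", "March", "7", "."]
tags = ["O", "B-PERSON", "O", "O", "B-ORG", "O", "B-LOC", "I-LOC", "O", "B-DATE", "I-DATE", "O"]
print(bio_to_spans(tokens, tags))
# [('Zoran', 'PERSON'), ('Strata', 'ORG'), ('San Jose', 'LOC'), ('March 7', 'DATE')]
```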

  6. Why is it useful? • Indexing • e.g., find all people in a document collection • Discovery • e.g., learn about new drugs from recent biomedical articles • Relation extraction • WORKS (PERSON, ORG), e.g., WORKS (Zoran, Microsoft) • Question answering • Where is Zoran? Zoran is in San Jose.

  7. Custom Entity Extraction • Pre-trained models for common entity types (person, org, loc, date, …) • Custom models • new entity types (e.g., drug, disease) • different language properties • foreign names • non-proper sentences (e.g., tweets) • specific domain (e.g., biomedical domain)

  8. Domain-Specific Entity Extraction Biomedical named entity recognition • Entity types: drug/chemical, disease, protein, DNA, etc. • Critical step for complex biomedical NLP tasks: • Extraction of diseases and symptoms from electronic medical or health records • Understanding the interactions between different entity types, such as drug-drug interaction, drug-disease relationship and gene-protein relationship, e.g., • Drug A cures Disease B. • Drug A causes Disease B. • Similar for other domains (e.g., legal, finance)

  9. Biomedical Entity Extraction • Vast Amount of Medical Articles

  10. Approach • Feature Extraction Phase – Domain-Specific Features: use a large unlabeled domain-specific corpus, such as Medline/PubMed abstracts, to train a neural word embedding model. • Model Training Phase – Domain-Specific Model: the output embeddings serve as automatically generated features to train a neural entity extractor on a small/reasonable amount of labeled data.

  11. Agenda • Motivation / Entity Extraction • Recurrent Deep Neural Networks • Word Embeddings • Architecture • Results • Conclusion

  12. Why Machine Learning? • Dictionary approach • What if a phrase is not in the dictionary? Zoran is at Strata in San Jose on March 7. Mr. Zoran is at Strata in San Jose on March 7. Zoran Smith is at Strata in San Jose on March 7.

  13. Why Machine Learning? • Statistical (ML) approach • Takes features from surrounding words Zoran is at Strata in San Jose on March 7. Mr. Zoran is at Strata in San Jose on March 7. Zoran Smith is at Strata in San Jose on March 7. Clues: after “Mr.”, before a known last name (“Smith”) • Structured prediction (sequence tagging) – label for a word depends on other labels • Rather than classifying each word independently
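To make the "clues from surrounding words" concrete, here is a minimal sketch of the kind of hand-crafted features a statistical tagger (e.g., a CRF) consumes; the feature names and helper are made up for this sketch, not taken from the deck.

```python
# Illustrative hand-crafted features of the kind a CRF-style tagger would use.
def word_features(tokens, i):
    word = tokens[i]
    return {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),   # capitalization is a weak entity clue
        "word.isdigit": word.isdigit(),
        "prev.word": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next.word": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

tokens = "Mr. Zoran is at Strata in San Jose on March 7 .".split()
print(word_features(tokens, 1))  # features for "Zoran": the previous word "mr." is a strong PERSON clue
```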

  14. Why Deep Learning? • Can find complex relationships between input and output using • Non-linear processing units • Multiple hidden layers • Special-purpose layers/architectures have been developed for different problems • Recurrent Neural Networks (RNNs) are commonly used for sequence-labeling tasks • Long Short-Term Memory layer (LSTM) • Comparison to Conditional Random Fields (CRFs) • CRFs are linear models w/o hidden layers • CRFs have short-term memory

  15. Recurrent Neural Networks • The state h_t denotes the memory of the network and is responsible for capturing information about previous time steps. • All time steps of an RNN share the same set of parameters, because we are performing the same task at each time step with different inputs.
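A minimal NumPy sketch of that recurrence (dimensions and variable names are illustrative, not the presentation's settings): the same weight matrices are reused at every step, and h_t summarizes everything seen so far.

```python
import numpy as np

# Illustrative sizes, not the presentation's settings.
input_dim, hidden_dim, seq_len = 50, 100, 8
rng = np.random.default_rng(0)

W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))   # input -> hidden
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))  # hidden -> hidden, shared across steps
b_h = np.zeros(hidden_dim)

x = rng.normal(size=(seq_len, input_dim))  # one sequence of feature vectors
h = np.zeros(hidden_dim)                   # initial state

states = []
for t in range(seq_len):
    # Same parameters at every time step; h carries the memory forward.
    h = np.tanh(W_xh @ x[t] + W_hh @ h + b_h)
    states.append(h)
```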

  16. Each rectangle represents a feature vector

  17. Long Short-Term Memory Networks (LSTMs) • Vanilla RNNs suffer from the drawback that they are not able to keep track of long-term dependencies (the vanishing gradient problem). • LSTMs are a special type of RNN designed to solve this problem by using four interacting layers in the repeating module.

  18. Agenda • Motivation / Entity Extraction • Recurrent Deep Neural Networks • Word Embeddings • Architecture • Results • Conclusion

  19. Features: Words • Words: Naloxone | reverses | the • Labels: B-Chemical | O | O

  20. Features: One-Hot Encoding • Labels: B-Chemical | O | O • One-hot vectors, dim = |V| (e.g., 10^5): [1, 0, 0, …] | [0, 0, 1, …] | [0, 1, 0, …]

  21. Features: Word Embeddings • Labels: B-Chemical | O | O • Embedding vectors, small dim (e.g., 50 or 200): [0.3, 0.2, 0.9, …] | [0.8, 0.8, 0.1, …] | [0.5, 0.1, 0.5, …]
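A small sketch contrasting the two representations (toy vocabulary and dimensions for illustration, not the deck's numbers): a one-hot vector has one position per vocabulary word, whereas an embedding lookup returns a short dense vector.

```python
import numpy as np

vocab = {"naloxone": 0, "reverses": 1, "the": 2}  # toy vocabulary; a real |V| is on the order of 10^5
V, dim = len(vocab), 4                            # real embedding dims are e.g. 50 or 200

def one_hot(word):
    v = np.zeros(V)
    v[vocab[word]] = 1.0
    return v

# Embedding matrix: one dense row per vocabulary word (in practice, learned by word2vec).
embeddings = np.random.default_rng(0).normal(size=(V, dim))

print(one_hot("naloxone"))            # sparse, length |V|
print(embeddings[vocab["naloxone"]])  # dense, length dim
```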

  22. Word2Vec • A simple neural network with a single hidden layer and a linear activation function (Skip-Gram, CBOW) • Unsupervised learning from large corpora; the word vectors (embeddings) are learned by stochastic gradient descent • Publicly available pre-trained models, such as Google News • Can we do better on a specific domain? • CBOW Neural Network Model
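For reference, the skip-gram variant maximizes the average log-probability of context words within a window of size c (standard word2vec formulation, not copied from the slides):

```latex
\frac{1}{T}\sum_{t=1}^{T}\sum_{\substack{-c \le j \le c \\ j \ne 0}} \log p\left(w_{t+j} \mid w_t\right),
\qquad
p\left(w_O \mid w_I\right) = \frac{\exp\left(v'_{w_O}{}^{\top} v_{w_I}\right)}{\sum_{w=1}^{|V|}\exp\left(v'_{w}{}^{\top} v_{w_I}\right)}
```

In practice the softmax over the full vocabulary is approximated with hierarchical softmax or negative sampling.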

  23. Skip-Gram Neural Network Model

  24. Agenda • Motivation / Entity Extraction • Recurrent Deep Neural Networks • Word Embeddings • Architecture • Results • Conclusion

  25. Spark on HDInsight • Parallel in-memory computing for Big Data workloads • Scales to thousands of machines • Offers multiple libraries, including machine learning, streaming, and graph analysis • APIs in Scala, Python, R • Architecture: a Spark application (a.k.a. driver) creates a SparkContext, e.g., val sc = new SparkContext(master = "mesos://.."); the master (a.k.a. cluster manager) assigns work to slaves (a.k.a. workers), each running executors that run tasks

  26. GPU DSVMs • Physical card = 2 x NVIDIA Tesla GPUs • Use NC-class Azure VMs with GPUs • Great for Deep Learning workloads

  27. AML Workbench • Sample, understand, and prep data rapidly • Support for Spark + Python + R (roadmap) • Execute jobs locally, on remote VMs, on Spark clusters, or on SQL on-premises • Git-backed tracking of code, config, parameters, data, run history

  28. Tutorial: http://github.com/Azure/MachineLearningSamples-BiomedicalEntityExtraction

  29. Agenda • Motivation / Entity Extraction • Recurrent Deep Neural Networks • Word Embeddings • Architecture • Results • Conclusion

  30. Datasets • Proteins, Cell Line, Cell Type, DNA and RNA detection: Bio-Entity Recognition Task at BioNLP/NLPBA 2004 • Chemicals and diseases detection: BioCreative V CDR task corpus • Drugs detection: SemEval 2013, Task 9.1 (Drug Recognition)

  31. Experimental Setup • Provision an Azure HDInsight Spark cluster • Train the Word2Vec model: window_size = 5, vector_size = 50, min_count = 1000 • Provision an Azure GPU DSVM, NC6 Standard (56 GB RAM, NVIDIA Tesla K80) • Keras with TensorFlow as backend: initialize the embedding layer with the trained embeddings, define the LSTM RNN architecture, train the LSTM RNN model • CRFSuite baseline: extract traditional features, train a CRF model
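A minimal PySpark sketch of the word-embedding step with the parameters listed above; the input path and column names are assumptions for illustration (the actual code is in the GitHub tutorial linked earlier).

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import Word2Vec

spark = SparkSession.builder.appName("pubmed-word2vec").getOrCreate()

# Assumed input: one row per tokenized PubMed/Medline abstract sentence,
# with a "tokens" column holding a list of words. The path is hypothetical.
abstracts = spark.read.parquet("wasb:///pubmed/tokenized")

w2v = Word2Vec(
    inputCol="tokens",
    outputCol="vectors",
    vectorSize=50,   # vector_size = 50
    windowSize=5,    # window_size = 5
    minCount=1000,   # min_count = 1000
)
model = w2v.fit(abstracts)

# Word -> 50-dimensional vector table, later used to seed the Keras embedding layer.
word_vectors = model.getVectors()
```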

  32. RNN Architecture

  33. RNN Graph Definition
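The graph-definition slide is an image; below is a hedged Keras sketch of a comparable network (layer sizes, the bidirectional choice, and variable names are assumptions, not a copy of the deck's code): an embedding layer seeded with the trained word vectors, an LSTM producing one output per token, and a per-token softmax over the BIO tags.

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, LSTM, TimeDistributed, Dense

# Assumed shapes for illustration; replace embedding_matrix with the trained word2vec vectors.
vocab_size, embed_dim, max_len, num_tags = 50000, 50, 100, 7
embedding_matrix = np.random.rand(vocab_size, embed_dim)

model = Sequential([
    Embedding(vocab_size, embed_dim,
              weights=[embedding_matrix],  # initialize with the pre-trained embeddings
              input_length=max_len,
              mask_zero=True),
    Bidirectional(LSTM(150, return_sequences=True)),         # one output per time step
    TimeDistributed(Dense(num_tags, activation="softmax")),  # BIO tag distribution per token
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```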

  34. Dataset Description • Chemicals and Diseases Detection • BioCreative V CDR task corpus, 2015: http://www.biocreative.org/tasks/biocreative-v/track-3-cdr/ • Training Data Stats: # words = 214,198; # disease entities = 7,699; # chemical entities = 9,575 • Test Data Stats: # words = 112,285; # disease entities = 4,080; # chemical entities = 4,941

  35. Results (exact match)

  36. Embedding Comparison

  37. Agenda • Motivation / Entity Extraction • Recurrent Deep Neural Networks • Word Embeddings • Architecture • Results • Conclusion

  38. Takeaways • Recipe for building a custom entity extraction pipeline: • Get a large amount of in-domain unlabeled data • Train a word2vec model on the unlabeled data on Spark • Get as much labeled data as possible • Train an LSTM-based neural network on a GPU-enabled machine • Word embeddings are powerful features • Convey word semantics • Perform better than traditional features • No feature engineering • An LSTM NN is a more powerful model than a traditional CRF

  39. Reference • GitHub – http://github.com/Azure/MachineLearningSamples-BiomedicalEntityExtraction • Documentation – http://docs.microsoft.com/en-us/azure/machine-learning/preview/scenario-tdsp-biomedical-recognition • Azure ML – https://docs.microsoft.com/en-us/azure/machine-learning/preview/overview-what-is-azure-ml • Azure ML Workbench – https://docs.microsoft.com/en-us/azure/machine-learning/preview/quickstart-installation • DSVM – http://aka.ms/dsvm

  40. Extras

  41. Input Gate Layer • A sigmoid layer that decides which values need to be updated; a tanh layer completes the step by proposing new candidate values for the state. • Forget Gate Layer • A sigmoid layer that looks at the previous hidden state (h_{t-1}) and the current input (x_t) and decides which information to forget from the previous cell state. • Update Previous States • Update the cell state based on the previous decisions. • Output Layer • Use the sigmoid and tanh layers to decide what to output.
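The corresponding standard LSTM equations, matching the gates listed above (standard notation rather than anything copied from the slides):

```latex
\begin{aligned}
f_t &= \sigma\!\left(W_f [h_{t-1}, x_t] + b_f\right) && \text{forget gate} \\
i_t &= \sigma\!\left(W_i [h_{t-1}, x_t] + b_i\right) && \text{input gate} \\
\tilde{C}_t &= \tanh\!\left(W_C [h_{t-1}, x_t] + b_C\right) && \text{candidate values} \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t && \text{cell state update} \\
o_t &= \sigma\!\left(W_o [h_{t-1}, x_t] + b_o\right) && \text{output gate} \\
h_t &= o_t \odot \tanh(C_t) && \text{output / new hidden state}
\end{aligned}
```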
