Learn about custom entity extraction, word embeddings, and training neural network models for domain-specific tasks like biomedical named entity recognition using deep learning techniques.
Deep Learning for Domain-Specific Entity Extraction from Unstructured Text • Mohamed AbdelHady, Sr. Data Scientist • Zoran Dzunic, Data Scientist • Cloud AI
Goals • What is entity extraction? • When do I need to train a custom entity extraction model? • Why should I use Deep Neural Networks and which one? • What are word embeddings and why should I use them as features? • How do I train a custom word embedding model? • How do I train a custom Deep Neural Network model for entity extraction? • What is the best architecture for these two tasks?
Agenda • Motivation / Entity Extraction • Recurrent Deep Neural Networks • Word Embeddings • Architecture • Results • Conclusion
Entity Extraction • Subtask of information extraction • Also known as Named-entity recognition (NER), entity chunking and entity identification • Find phrases in text that refer to a real-world entity of specific types Mr. Zoran is at Strata in San Jose on March 7.
Entity Extraction • Subtask of information extraction • Also known as Named-entity recognition (NER), entity chunking and entity identification • Find phrases in text that refer to a real-world entity of specific types Mr. Zoran is at Strata in San Jose on March 7. • Zoran : PERSON • Strata : ORG • San Jose : LOC • March 7 : DATE
Why is it useful? • Indexing • e.g., find all people in a document collection • Discovery • e.g., learn about new drugs from recent biomedical articles • Relation extraction • WORKS (PERSON, ORG), e.g., WORKS (Zoran, Microsoft) • Question answering • Where is Zoran? Zoran is in San Jose.
Custom Entity Extraction • Pre-trained models for common entity types (person, org, loc, date, …) • Custom models • new entity types (e.g., drug, disease) • different language properties • foreign names • non-proper sentences (e.g., tweets) • specific domain (e.g., biomedical domain)
Domain-Specific Entity Extraction Biomedical named entity recognition • Entity types drug/chemical, disease, protein, DNA, etc. • Critical step for complex biomedical NLP tasks: • Extraction of diseases, symptoms from electronic medical or health records • Understanding the interactions between different entity types such as drug-drug interaction, drug-disease relationship and gene-protein relationship, e.g., • Drug A cures Disease B. • Drug A causes Disease B. Similar for other domains (e.g., legal, finance)
Biomedical Entity Extraction • Vast Amount of Medical Articles
Approach • Feature Extraction Phase – Domain-Specific Features: Use a large unlabeled domain-specific corpus, such as MEDLINE/PubMed abstracts, to train a neural word embedding model. • Model Training Phase – Domain-Specific Model: The output embeddings serve as automatically generated features for training a neural entity extractor on a small/reasonable amount of labeled data.
Agenda • Motivation / Entity Extraction • Recurrent Deep Neural Networks • Word Embeddings • Architecture • Results • Conclusion
Why Machine Learning? • Dictionary approach • What if a phrase is not in the dictionary? Zoran is at Strata in San Jose on March 7. Mr. Zoran is at Strata in San Jose on March 7. Zoran Smith is at Strata in San Jose on March 7.
Why Machine Learning? • Statistical (ML) approach • Takes features from surrounding words Zoran is at Strata in San Jose on March 7. Mr. Zoran is at Strata in San Jose on March 7. Zoran Smith is at Strata in San Jose on March 7. Clues: after “Mr.”, before a known last name (“Smith”) • Structured prediction (sequence tagging) – label for a word depends on other labels • Rather than classifying each word independently
Why Deep Learning? • Can find complex relationships between input and output using • Non-linear processing units • Multiple hidden layers • Special-purpose layers/architectures have been developed for different problems • Recurrent Neural Networks (RNNs) are commonly used for sequence-labeling tasks • Long Short-Term Memory layer (LSTM) • Comparison to Conditional Random Fields (CRFs) • CRFs are linear models w/o hidden layers • CRFs have short-term memory
Recurrent Neural Networks • The state h_t denotes the memory of the network and is responsible for capturing information about previous time steps. • All layers in an RNN share the same set of parameters because the same task is performed at each time step, only with different inputs.
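For reference, a standard formulation of the vanilla RNN update at time step t (notation varies across texts):

\begin{aligned}
h_t &= \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h) \\
y_t &= \mathrm{softmax}(W_{hy} h_t + b_y)
\end{aligned}

The same weight matrices W_{xh}, W_{hh}, W_{hy} are reused at every time step — this is the parameter sharing mentioned above.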
Long Short-Term Memory Networks (LSTMs) • Vanilla RNNs suffer from a drawback: they are unable to keep track of long-term dependencies (the vanishing gradient problem). • LSTMs are a special type of RNN designed to solve this problem by using four interacting layers in the repeating module.
Agenda • Motivation / Entity Extraction • Recurrent Deep Neural Networks • Word Embeddings • Architecture • Results • Conclusion
Features: Words • Words: Naloxone reverses the … • Tags: B-Chemical O O
Features: One-Hot Encoding • One-hot vectors, dim = |V| (e.g., 10^5): [1, 0, 0, …] [0, 0, 1, …] [0, 1, 0, …] • Tags: B-Chemical O O
Features: Word Embeddings • Embedding vectors, dim is small (e.g., 50, 200): [0.3, 0.2, 0.9, …] [0.8, 0.8, 0.1, …] [0.5, 0.1, 0.5, …] • Tags: B-Chemical O O
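A minimal sketch of the difference between the two representations, assuming illustrative sizes (vocabulary 10^5, embedding dimension 50); the embedding matrix is random here as a stand-in for one learned by word2vec:

import numpy as np

vocab_size = 100000   # |V|, illustrative
embed_dim = 50        # embedding dimension, illustrative
word_index = 0        # index of "Naloxone" in the vocabulary (hypothetical)

# One-hot: a sparse vector of length |V| with a single 1
one_hot = np.zeros(vocab_size)
one_hot[word_index] = 1.0

# Embedding: a dense row lookup in a |V| x d matrix
embedding_matrix = np.random.randn(vocab_size, embed_dim)  # learned in practice
embedding = embedding_matrix[word_index]                   # dense vector of length 50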
Word2Vec • A simple neural network with a single hidden layer and a linear activation function (Skip-Gram, CBOW) • Unsupervised learning from large corpora; the word vectors (embeddings) are learned by stochastic gradient descent • Publicly available pre-trained models, such as Google News • Can we do better on a specific domain? • [Figure: CBOW neural network model]
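A minimal PySpark sketch of training such a model on tokenized abstracts; the tiny inline DataFrame is a stand-in for the full PubMed corpus, and minCount is lowered from the talk's 1000 so the toy example produces output:

from pyspark.sql import SparkSession
from pyspark.ml.feature import Word2Vec

spark = SparkSession.builder.appName("pubmed-word2vec").getOrCreate()

# Stand-in for a DataFrame of tokenized PubMed abstracts (one array<string> column)
abstracts = spark.createDataFrame(
    [(["naloxone", "reverses", "the", "antihypertensive", "effect"],)],
    ["tokens"],
)

# vectorSize and windowSize as in the experimental setup; minCount lowered for toy data
w2v = Word2Vec(vectorSize=50, windowSize=5, minCount=1,
               inputCol="tokens", outputCol="vector")
model = w2v.fit(abstracts)
embeddings = model.getVectors()   # DataFrame of (word, vector) pairs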
Agenda • Motivation / Entity Extraction • Recurrent Deep Neural Networks • Word Embeddings • Architecture • Results • Conclusion
Spark HDInsight • Spark application a.k.a. Driver • Parallel in-memory computing for Big Data workloads • Scales to thousands of machines • Offers multiple libraries, including machine learning, streaming, and graph analysis • APIs in Scala, Python, R • val sc = new SparkContext(master = "mesos://..") • [Diagram: Spark architecture — the driver holds the SparkContext, the master (a.k.a. cluster manager) schedules work, and slaves (a.k.a. workers) run executors that execute tasks]
GPU DSVMs • Use NC-class Azure VMs with GPUs (physical card = 2 x NVIDIA Tesla GPUs) • Great for deep learning workloads
AML Workbench Sample, understand, and prep data rapidly Support for Spark + Python + R (roadmap) Execute jobs locally, on remote VMs, Spark clusters, SQL on-premises Git-backed tracking of code, config, parameters, data, run history
Tutorial: http://github.com/Azure/MachineLearningSamples-BiomedicalEntityExtraction
Agenda • Motivation / Entity Extraction • Recurrent Deep Neural Networks • Word Embeddings • Architecture • Results • Conclusion
Datasets • Proteins, cell line, cell type, DNA, and RNA detection: Bio-Entity Recognition Task at BioNLP/NLPBA 2004 • Chemicals and diseases detection: BioCreative V CDR task corpus • Drugs detection: SemEval 2013, Task 9.1 (Drug Recognition)
Experimental Setup • Provision an Azure HDInsight Spark cluster • Train the Word2Vec model: window_size = 5, vector_size = 50, min_count = 1000 • Provision an Azure GPU DSVM (Standard NC6: 56 GB RAM, NVIDIA Tesla K80) • Keras with TensorFlow as backend: initialize the embedding layer with the trained embeddings, define the LSTM RNN architecture, train the LSTM RNN model • CRFSuite (baseline): extract traditional features, train a CRF model
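A minimal Keras sketch of the model-training phase (Keras 2-era API); layer sizes and tag count are illustrative, `pretrained` stands in for the Word2Vec matrix, and the exact architecture is in the GitHub tutorial linked earlier:

import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, LSTM, TimeDistributed, Dense

vocab_size, embed_dim, max_len, num_tags = 100000, 50, 100, 11  # illustrative

# Stand-in for the Word2Vec embedding matrix trained on PubMed abstracts
pretrained = np.random.randn(vocab_size, embed_dim)

model = Sequential([
    # Initialize the embedding layer with the pretrained vectors
    Embedding(vocab_size, embed_dim, weights=[pretrained], input_length=max_len),
    # return_sequences=True yields one BIO-tag prediction per time step
    Bidirectional(LSTM(150, return_sequences=True)),
    TimeDistributed(Dense(num_tags, activation="softmax")),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(X_train, y_train, batch_size=32, epochs=10)
# X_train: padded word-index sequences; y_train: one-hot BIO tags per token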
Dataset Description • Chemicals and Diseases Detection • BioCreative V CDR task corpus, 2015: http://www.biocreative.org/tasks/biocreative-v/track-3-cdr/ • Training data stats: # words = 214,198; # disease entities = 7,699; # chemical entities = 9,575 • Test data stats: # words = 112,285; # disease entities = 4,080; # chemical entities = 4,941
Agenda • Motivation / Entity Extraction • Recurrent Deep Neural Networks • Word Embeddings • Architecture • Results • Conclusion
Takeaways • Recipe for building a custom entity extraction pipeline: • Get a large amount of in-domain unlabeled data • Train a word2vec model on the unlabeled data on Spark • Get as much labeled data as possible • Train an LSTM-based neural network on a GPU-enabled machine • Word embeddings are powerful features • Convey word semantics • Perform better than traditional features • No feature engineering • An LSTM NN is a more powerful model than a traditional CRF
Reference • GitHub – http://github.com/Azure/MachineLearningSamples-BiomedicalEntityExtraction • Documentation – http://docs.microsoft.com/en-us/azure/machine-learning/preview/scenario-tdsp-biomedical-recognition • Azure ML – https://docs.microsoft.com/en-us/azure/machine-learning/preview/overview-what-is-azure-ml • Azure ML Workbench – https://docs.microsoft.com/en-us/azure/machine-learning/preview/quickstart-installation • DSVM – http://aka.ms/dsvm
Input Gate Layer • A sigmoid layer that decides which values need to be updated; a tanh layer completes the step by proposing new candidate values for the state. • Forget Gate Layer • A sigmoid layer that looks at the previous output (h_{t-1}) and the current input (x_t) and decides which information to forget from the previous cell state. • Update Previous States • Update the cell state based on the previous decisions. • Output Layer • Use a sigmoid layer and a tanh layer to decide what to output.
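For reference, a standard formulation of these steps in the order listed above (σ is the sigmoid function, ⊙ is elementwise multiplication):

\begin{aligned}
i_t &= \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) && \text{(input gate)} \\
\tilde{C}_t &= \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) && \text{(candidate values)} \\
f_t &= \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) && \text{(forget gate)} \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t && \text{(cell state update)} \\
o_t &= \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) && \text{(output gate)} \\
h_t &= o_t \odot \tanh(C_t) && \text{(output)}
\end{aligned}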