Learn more about how NLP (Natural Language Processing) can help extract important entities from unstructured data!
NLP and ML By Example Let's talk about how NLP can help extract important entities from unstructured data! By Parth Barot #TuesdayTalk #BoTreeTechnologies BoTree Technologies
What is NLP and ML? NLP stands for Natural Language Processing; its main focus is to understand what a natural-language text means. It covers tasks such as part-of-speech (PoS) tagging and named entity recognition (NER). ML is used inside NLP: it is how the algorithms get trained for NLP tasks. ML - where a programme tries to learn/identify patterns from a lot of data, patterns that a human or a conventional programme can't practically find, even using regex, when the data is quite unstructured.
Use case - Unstructured text parsing The client receives many emails from vendors, but the email content is quite unstructured - no tables, no fixed positions for data, no fixed set of labels. The email format also differs from vendor to vendor, and so does the pattern of the data points. Right now they have to map everything manually via Excel sheets. They want to automate this so that each email is automatically read and parsed, and the relevant data points (entities) are extracted and sent to a database. The ML behind this is nothing but a set of different mathematical algorithms.
Possible Approaches 1. Implement regex - not feasible, as there is no finite set of patterns. 2. Define templates - not feasible, as positions are not fixed. 3. NLP - maybe, but would it work? How to do it? What tools are available?
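To make the first point concrete, here is a minimal sketch of why a regex written for one vendor breaks on the next. The two email snippets, the field names and the patterns are all invented for illustration; they are not the client's data.

```python
import re

# Hypothetical vendor emails (invented for illustration, not from the talk).
EMAIL_VENDOR_A = "Invoice No: INV-1042\nAmount: 1,250.00 USD"
EMAIL_VENDOR_B = "Hi team, please find attached our bill inv 77/2019, total due is USD 980."

# Patterns written against vendor A's layout only.
INVOICE_RE = re.compile(r"Invoice No:\s*(?P<invoice>[A-Z]+-\d+)")
AMOUNT_RE = re.compile(r"Amount:\s*(?P<amount>[\d,]+\.\d{2})\s*(?P<currency>[A-Z]{3})")


def extract_with_regex(text):
    """Return the fields the vendor-A patterns can find; missing ones are None."""
    invoice = INVOICE_RE.search(text)
    amount = AMOUNT_RE.search(text)
    return {
        "invoice": invoice.group("invoice") if invoice else None,
        "amount": amount.group("amount") if amount else None,
        "currency": amount.group("currency") if amount else None,
    }


result_a = extract_with_regex(EMAIL_VENDOR_A)  # every field is found
result_b = extract_with_regex(EMAIL_VENDOR_B)  # every field comes back as None
print(result_a)
print(result_b)
```

Each new vendor layout would need its own patterns, which is why the regex route does not scale when the set of layouts is open-ended.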
NLP Approach - Using NER Validate on a huge dataset
Prepare Training Data Once you have the data, you need to mark up all of the training data with annotations/tags (mark “what” is “what” in each example) to train your model. Typically 80% of the data is used as training data and 20% is kept as evaluation/dev data, so the ML engine can use it to verify accuracy. You also need another, bigger data set to run the model on after training, to perform the real data extraction.
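A minimal sketch of what annotated data and the 80/20 split could look like. It assumes annotations are stored as character-offset spans (one common convention for NER data). The labels INVOICE_NO and AMOUNT, the sample sentences and the helper names are hypothetical.

```python
import random


def annotate(text, labelled_values):
    """Build (start, end, label) character spans by locating each value in the text."""
    spans = []
    for value, label in labelled_values:
        start = text.find(value)
        if start == -1:
            raise ValueError(f"{value!r} not found in {text!r}")
        spans.append((start, start + len(value), label))
    return (text, sorted(spans))


def train_dev_split(examples, dev_fraction=0.2, seed=0):
    """Shuffle a copy of the examples and hold out dev_fraction of them for evaluation."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_dev = max(1, round(len(shuffled) * dev_fraction))
    return shuffled[n_dev:], shuffled[:n_dev]


# Ten invented examples; real training data would be hand-annotated emails.
examples = [
    annotate(
        f"Please pay invoice INV-{i} for {i * 10}.00 USD by Friday.",
        [(f"INV-{i}", "INVOICE_NO"), (f"{i * 10}.00", "AMOUNT")],
    )
    for i in range(101, 111)
]

train_data, dev_data = train_dev_split(examples)
print(len(train_data), "training examples,", len(dev_data), "dev examples")
print(train_data[0])
```

Fixing the shuffle seed keeps the split reproducible, so accuracy numbers from different training runs are measured against the same dev set.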
Train Your Model We used spaCy to implement NLP and trained a custom model for this problem because we wanted non-standard entities. spaCy comes with pretrained English models that can analyze general English text and identify standard entities such as person, organization, etc. It also supports many other languages. It uses ML-based models that have been trained on huge English-language datasets. It provides support for preparing the required dataset in specific formats and training a new model using CLI commands. We took a week to study everything and prepare a basic working model that can perform NER with at least 95% accuracy.
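The slide does not say how the accuracy figure was measured. One common way to score NER on the held-out dev data is exact-match precision, recall and F1 over entity spans. The sketch below assumes that approach; the gold and predicted spans are invented, and `span_scores` is a hypothetical helper, not part of spaCy.

```python
def span_scores(gold, predicted):
    """Exact-match precision, recall and F1 over (doc_id, start, end, label) tuples."""
    gold_set, pred_set = set(gold), set(predicted)
    true_pos = len(gold_set & pred_set)
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Invented dev-set annotations: (doc_id, start, end, label).
gold = [
    (0, 19, 26, "INVOICE_NO"),
    (0, 31, 38, "AMOUNT"),
    (1, 4, 11, "INVOICE_NO"),
    (1, 20, 26, "AMOUNT"),
]
# Invented model output: two exact matches, one wrong label, one missed span.
predicted = [
    (0, 19, 26, "INVOICE_NO"),
    (0, 31, 38, "AMOUNT"),
    (1, 4, 11, "AMOUNT"),
]

precision, recall, f1 = span_scores(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

A span only counts as correct when its boundaries and label both match, so a right span with the wrong label hurts precision and recall alike.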