USING NLP TO MAKE UNSTRUCTURED DATA HIGHLY ACCESSIBLE
Machine learning at Aker BP
William Naylor, Masa Nekic, Peder Aursand, Vidar Hjemmeland Brekke
20th September 2019
PrettyPoly: a document search engine customised for oil and gas documents
• Polygon search
• Geotagging
• Advanced query builder
• Collaboration/sharing
• Admin panel
• Document engine
• Sensitive content filtering
• Document tagging
PrettyPoly's document engine
• Input: PDF, Word, Excel, .....
• All docs converted to JSON
• Enriched with doc type, language and keywords
• ML tags used for filtering
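As a rough illustration of the "all docs to JSON" step, each ingested file could become one enriched record. The field names below are illustrative assumptions, not PrettyPoly's actual schema:

```python
# Hypothetical per-document JSON record after enrichment.
# Field names and values are illustrative, not PrettyPoly's real schema.
import json

record = {
    "source_format": "pdf",            # original file type (PDF/Word/Excel/...)
    "text": "Daily drilling report ...",
    "doc_type": "Drilling report",     # predicted by the ML classifier
    "language": "en",                  # detected document language
    "keywords": ["drilling", "mud weight"],
    "ml_tags": ["sensitive:false"],    # tags used for filtering in the UI
}

print(json.dumps(record, indent=2))
```

Storing one flat JSON record per document makes the downstream filtering and search steps format-agnostic.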
The data
• Extremely varied text documents
• Around 3 million documents
• Currently 20 classes (with labelled data)
• Between 60 and 2,000 examples per class
• Large 'Undefined' class: have some 'known unknowns'
User feedback: Additional data
Demands on ML classifier
Needs to handle:
• undefined class (open set)
• growing number of classes
• growing number of examples per class (great)
• multiple languages
• extremely varied texts
• long texts
• illogical sentence structure (imagine the text from an Excel spreadsheet)
• varied class importance
ML classification: 20-class open set classification
• Load JSON content
• Apply preprocessing (regex, stemming)
• TF-IDF encoding
• Train a simple 'yes/no' classifier for EACH CLASS (peer review, contracts, mud report, ...)
• Loop through the classifiers per category, predicting a probability
• Pick the highest, or 'Unknown' if less than 0.6
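The per-class 'yes/no' scheme above can be sketched with scikit-learn. This is a minimal illustration, not Aker BP's actual code; the 0.6 threshold is from the slide, everything else (toy data, logistic regression as the binary model) is an assumption:

```python
# Open-set classification via one binary "yes/no" classifier per class.
# Toy data and model choice are illustrative; only the 0.6 threshold
# comes from the presentation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["mud weight and viscosity log", "contract terms and payment",
        "daily mud report pressures", "signed agreement between parties"]
labels = ["Mud report", "Contracts", "Mud report", "Contracts"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# One binary classifier per class: "this class" vs "everything else".
classifiers = {}
for cls in set(labels):
    y = [1 if label == cls else 0 for label in labels]
    classifiers[cls] = LogisticRegression().fit(X, y)

def predict_open_set(text, threshold=0.6):
    x = vectorizer.transform([text])
    # Probability of the positive ("yes") side of each binary model.
    scores = {cls: clf.predict_proba(x)[0, 1] for cls, clf in classifiers.items()}
    best_cls, best_p = max(scores.items(), key=lambda kv: kv[1])
    # Below the threshold, refuse to commit to any known class.
    return best_cls if best_p >= threshold else "Unknown"
```

The key open-set property is the final threshold check: a document that no binary classifier claims with confidence falls through to 'Unknown' instead of being forced into one of the 20 classes.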
Adding additional fields
• Random forest / decision tree / XGBoost all handle any type of feature
• For text, keep inputs sparse
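One way to append an extra field to sparse text features without densifying them is `scipy.sparse.hstack`. A minimal sketch, assuming a hypothetical categorical "file type" field (the field name and data are illustrative):

```python
# Combine sparse TF-IDF text features with an extra categorical field
# (a hypothetical "file type") while keeping everything sparse.
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder

texts = ["quarterly drilling summary", "invoice and payment terms"]
file_types = np.array([["pdf"], ["xlsx"]])   # one extra field per document

X_text = TfidfVectorizer().fit_transform(texts)   # sparse TF-IDF matrix
X_type = OneHotEncoder().fit_transform(file_types)  # sparse one-hot matrix

# hstack keeps the result sparse; scikit-learn tree ensembles accept CSR input.
X = hstack([X_text, X_type]).tocsr()
```

Keeping the matrix sparse matters because a TF-IDF vocabulary of tens of thousands of words would not fit in memory as a dense array at 3 million documents.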
Using unlabelled data
A lot of information lies in the unlabelled data, and the labelled data won't be a representative sample.
Idea 1:
• Take (some) random unlabelled data and label it as 'Undefined' in training
Results:
• No additional data: acc 0.90, acc with ROS 0.91
• 1k added: acc with ROS 0.90
• 5k added: acc 0.87, acc with ROS 0.89
Idea 2:
• Train model initially
• Predict on unlabelled data
• Add data with probability over 0.8 to target class
• Add data with probability under 0.2 to 'Undefined'
• Retrain
(Haven't tested Idea 2; it does work in many other cases, but we don't believe it will help with the sampling problem.)
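Idea 2 is a self-training (pseudo-labelling) loop. A minimal sketch follows; the 0.8 and 0.2 thresholds are the ones quoted on the slide, while the toy data and choice of logistic regression are illustrative assumptions:

```python
# Self-training sketch (slide's "Idea 2"): pseudo-label confident
# unlabelled docs, then retrain. Only the 0.8/0.2 thresholds come
# from the presentation; the rest is illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

labelled = ["mud weight log", "payment schedule", "drilling mud data"]
y = [1, 0, 1]                    # 1 = target class, 0 = "Undefined"
unlabelled = ["mud viscosity readings", "meeting agenda notes"]

vec = TfidfVectorizer().fit(labelled + unlabelled)
model = LogisticRegression().fit(vec.transform(labelled), y)

# Score the unlabelled pool with the initial model.
probs = model.predict_proba(vec.transform(unlabelled))[:, 1]

new_X, new_y = list(labelled), list(y)
for doc, p in zip(unlabelled, probs):
    if p > 0.8:
        new_X.append(doc); new_y.append(1)   # confident: target class
    elif p < 0.2:
        new_X.append(doc); new_y.append(0)   # confident: "Undefined"
    # Anything in between stays unlabelled.

# Retrain on the enlarged training set.
model = LogisticRegression().fit(vec.transform(new_X), new_y)
```

The risk, as the slide notes, is that self-training reinforces whatever bias the initial labelled sample has, which is exactly the sampling problem it cannot fix.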
Suppressing overfitting
• Overfitting can be a major problem: ~4k training examples, ~20k features (words)
• Loop over models in training and pick out the best against a dev set
• Frequently a logistic regression or decision tree overfits
• Forcing a random forest (lower dev accuracy) gives better test results
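The model-selection loop above can be sketched as follows. The candidate list (logistic regression, decision tree, random forest) matches the families named on the slide; the synthetic data and everything else is illustrative:

```python
# Loop over candidate model families and keep the one that scores
# best on a held-out dev set. Data here is synthetic for illustration.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_train, X_dev, y_train, y_dev = train_test_split(X, y, random_state=0)

candidates = [
    LogisticRegression(max_iter=1000),
    DecisionTreeClassifier(random_state=0),
    RandomForestClassifier(random_state=0),
]

best_model, best_acc = None, -1.0
for model in candidates:
    # Fit on the training split, score on the dev split.
    acc = model.fit(X_train, y_train).score(X_dev, y_dev)
    if acc > best_acc:
        best_model, best_acc = model, acc
```

As the slide warns, the dev-set winner is not always the best choice on truly unseen data: with ~20k features and ~4k examples, a flexible model can fit the dev set's quirks too, which is why forcing a random forest despite a lower dev accuracy paid off.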
Summary and future plans
Topics covered
• Document enrichment as part of PrettyPoly
• Built an open set classifier for long documents
• Has user feedback as part of the training loop
Not covered (feel free to ask me)
• Preprocessing
• Encoding schemes
• Handling of sensitive classes (contracts)
Future ideas / problems
• Model evaluation
• Numbers and Excel spreadsheets
• Clustering
• Explicit filtering for some classes (regex rules)