USING NLP TO MAKE UNSTRUCTURED DATA HIGHLY ACCESSIBLE
Machine learning at Aker BP
William Naylor, Masa Nekic, Peder Aursand, Vidar Hjemmeland Brekke
20th September, 2019
PrettyPoly
Document search engine customised for oil and gas documents
• Polygon search
• Geotagging
• Advanced query builder
• Collaboration/sharing
• Admin panel
• Document engine
• Sensitive content filtering
• Document tagging
PrettyPoly’s document engine
• Input formats: PDF, Word, Excel, ...
• All docs to JSON
• ML tags used for filtering: doc type, language, keywords
The data
• Extremely varied text documents
• Around 3 million documents
• Currently 20 classes (with labelled data)
• Between 60 and 2,000 examples per class
• Large ‘Undefined’ class
  • Have some ‘known unknowns’
User feedback: Additional data
Demands on ML classifier
Needs to handle:
• undefined class (open set)
• growing numbers of classes
• growing numbers of examples per class (great)
• multiple languages
• extremely varied texts
  • long texts
• illogical sentence structure
  • imagine the text from an Excel spreadsheet
• varied class importance
ML classification
20-class open set classification
• Train a simple ‘yes/no’ classifier for EACH CLASS
• Load JSON content
• Apply preprocessing
  • regex
  • stemming
• TF-IDF encoding
• Loop through the classifiers per category, predicting a probability
• Pick the highest, or ‘Unknown’ if less than 0.6
Classes: Peer review, Contracts, Mud report, ...
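The per-class scheme on this slide can be sketched roughly as below. This is a minimal illustration, not the PrettyPoly implementation: the choice of scikit-learn, `LogisticRegression` as the ‘yes/no’ model, the regex in `preprocess`, and the toy documents are all assumptions, and stemming is left out to keep the example dependency-free. Only the TF-IDF encoding, the one-classifier-per-class loop and the 0.6 cut-off come from the slide.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

UNKNOWN_THRESHOLD = 0.6  # cut-off quoted on the slide


def preprocess(text):
    # Assumed regex cleanup: lowercase, replace punctuation and digits with spaces.
    return re.sub(r"[^\w\s]|\d", " ", text.lower())


def train_per_class(texts, labels, classes):
    """Fit one binary 'yes/no' classifier per class on shared TF-IDF features."""
    vectorizer = TfidfVectorizer(preprocessor=preprocess)
    X = vectorizer.fit_transform(texts)
    models = {}
    for cls in classes:
        y = [1 if label == cls else 0 for label in labels]
        models[cls] = LogisticRegression().fit(X, y)
    return vectorizer, models


def pick_label(probs, threshold=UNKNOWN_THRESHOLD):
    """Return the most probable class, or 'Unknown' if it scores below the threshold."""
    best = max(probs, key=probs.get)
    return best if probs[best] >= threshold else "Unknown"


def classify(text, vectorizer, models, threshold=UNKNOWN_THRESHOLD):
    x = vectorizer.transform([text])
    # Column 1 of predict_proba is the probability of the 'yes' label.
    probs = {cls: model.predict_proba(x)[0, 1] for cls, model in models.items()}
    return pick_label(probs, threshold), probs


# Toy data, invented for illustration only.
texts = [
    "daily mud report density viscosity",
    "mud report for the well section",
    "contract terms between the parties",
    "contract amendment and signatures",
    "peer review comments on the study",
    "peer review of the drilling plan",
]
labels = ["Mud report", "Mud report", "Contracts", "Contracts",
          "Peer review", "Peer review"]
classes = sorted(set(labels))

vectorizer, models = train_per_class(texts, labels, classes)
label, probs = classify("mud report viscosity", vectorizer, models)
print(label, probs)
```

Because every class has its own independent classifier, a new class can be added by training one more model, and a document that no classifier is confident about falls through to ‘Unknown’.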
Adding additional fields
• Random forest / Decision tree / XGBoost all handle any type of feature
• For text, keep inputs sparse
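One way to add extra fields while keeping the text inputs sparse is to stack the TF-IDF matrix next to a sparse encoding of the other fields. The sketch below assumes scikit-learn and SciPy; the field names `file_type` and `n_pages` and the toy documents are hypothetical and only stand in for whatever metadata is available.

```python
from scipy.sparse import hstack, issparse
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents; "file_type" and "n_pages" are hypothetical extra fields.
docs = [
    {"text": "daily mud report density viscosity", "file_type": "xlsx", "n_pages": 2},
    {"text": "mud report for the well section", "file_type": "pdf", "n_pages": 3},
    {"text": "contract terms between the parties", "file_type": "pdf", "n_pages": 40},
    {"text": "contract amendment and signatures", "file_type": "docx", "n_pages": 12},
]
labels = ["Mud report", "Mud report", "Contracts", "Contracts"]

# Sparse TF-IDF matrix for the text.
X_text = TfidfVectorizer().fit_transform([d["text"] for d in docs])

# DictVectorizer one-hot encodes string fields and passes numbers through,
# returning a sparse matrix by default.
X_extra = DictVectorizer().fit_transform(
    [{"file_type": d["file_type"], "n_pages": d["n_pages"]} for d in docs]
)

# Stack column-wise without densifying; tree ensembles accept sparse CSR input.
X = hstack([X_text, X_extra]).tocsr()

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, labels)
predictions = model.predict(X)
print(X.shape, list(predictions))
```

Tree-based models do not need the extra columns to be scaled to match the TF-IDF values, which is part of why mixing feature types is straightforward with them.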
Using unlabelled data
• A lot of information lies in the unlabelled data
• Labelled data won’t be a representative sample
Idea 1:
• Take (some) random unlabelled data and label it as “Undefined” in training
Idea 2:
• Train model initially
• Predict on unlabelled data
• Add data with probability over 0.8 to target class
• Add data with probability under 0.2 to “Undefined”
• Retrain
Results:
• No additional data: Acc 0.90, Acc (with ROS) 0.91
• 1 K added: Acc (with ROS) 0.90
• 5 K added: Acc 0.87, Acc (with ROS) 0.89
Note: haven’t tested; does work in many other cases. Don’t believe it will help with the sampling problem.
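Idea 2 is a form of pseudo-labelling. A single round for one ‘yes/no’ classifier could look roughly like the sketch below. The function names are invented for illustration, `model` is assumed to be any estimator with scikit-learn style `fit` and `predict_proba` methods, and only the 0.8 and 0.2 thresholds come from the slide. The slide itself notes that this idea had not been tested, so treat this as a description of the procedure rather than of measured behaviour.

```python
def select_pseudo_labels(probs, high=0.8, low=0.2):
    """Return indices of unlabelled docs confidently inside / outside the target class."""
    positives = [i for i, p in enumerate(probs) if p > high]
    negatives = [i for i, p in enumerate(probs) if p < low]
    return positives, negatives


def self_training_round(model, X_lab, y_lab, X_unlab, high=0.8, low=0.2):
    """One train / predict / augment / retrain cycle for a binary classifier.

    Labels are 1 for the target class and 0 for "Undefined".
    X_lab and X_unlab are lists of feature rows.
    """
    # Train the model initially on the labelled data only.
    model.fit(X_lab, y_lab)

    # Predict on unlabelled data; column 1 is the probability of the target class.
    probs = [row[1] for row in model.predict_proba(X_unlab)]
    positives, negatives = select_pseudo_labels(probs, high, low)

    # Confident documents join the training set; the uncertain middle is left out.
    X_aug = list(X_lab) + [X_unlab[i] for i in positives + negatives]
    y_aug = list(y_lab) + [1] * len(positives) + [0] * len(negatives)

    # Retrain on the augmented set.
    model.fit(X_aug, y_aug)
    return model, X_aug, y_aug
```

Documents scoring between the two thresholds are deliberately ignored, so the labels that get added are the ones the initial model is most sure about.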
Suppressing overfitting
• Overfitting can be a major problem
  • ~4 k training examples
  • ~20 k features (words)
• Loop over models in training and pick out the best against a dev set
• Frequently a Log Reg or DT overfits
• Forcing a RF (lower dev accuracy) gives better test results
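The model-selection loop described here can be sketched as follows, reporting train and dev accuracy side by side so that a large gap is visible before a model is chosen. Everything concrete in it is an assumption: scikit-learn, the synthetic data from `make_classification` (a scaled-down stand-in for the few-examples, many-features regime), and the particular hyperparameters. It does not reproduce the talk's results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: far more features than training examples.
X, y = make_classification(
    n_samples=400, n_features=2000, n_informative=20, random_state=0
)
X_train, X_dev, y_train, y_dev = train_test_split(
    X, y, test_size=0.25, random_state=0
)

candidates = {
    "log_reg": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Loop over models, recording accuracy on both the training and the dev set.
report = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    report[name] = {
        "train": model.score(X_train, y_train),
        "dev": model.score(X_dev, y_dev),
    }

# The naive rule from the slide: pick whichever model scores best on dev.
best_on_dev = max(report, key=lambda name: report[name]["dev"])

for name, acc in report.items():
    gap = acc["train"] - acc["dev"]
    print(f"{name:14s} train={acc['train']:.2f} dev={acc['dev']:.2f} gap={gap:.2f}")
print("best on dev:", best_on_dev)
```

With a small dev set, the winner of this loop can be a model that happens to fit the dev data well, which is consistent with the slide's observation that forcing a random forest gave better test results despite a lower dev score.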
Summary and future plans
Topics covered
• Document enrichment a part of PrettyPoly
• Built an open set classifier for long documents
• Has user feedback as part of training loop
Not covered (feel free to ask me)
• Preprocessing
• Encoding schemes
• Handling of sensitive classes (contracts)
Future ideas / problems
• Model evaluation
• Numbers and Excel spreadsheets
• Clustering
• Explicit filtering for some classes (regex rules)