Learn about Word Embedding, GloVe, StarSpace, and predicting categories from text data. Explore recent AI developments and implementation considerations. Dive into text analysis tools with Amanda Beedham, Data Scientist.
Harnessing AI to Create Insight from Text
Amanda Beedham, Data Scientist, RSA
Contents • Background to Word Embedding • GloVe – grouping claims using text • StarSpace – predicting category using text • Implementation considerations • Recent developments • Q & A
All About Context “You shall know a word by the company it keeps” (John Rupert Firth, 1957)
Why All the Investment in Word Embedding? • To use text data in predictive models, we must represent it numerically • What about one-hot encoding? • Problems: • High dimensionality means reduced efficiency • Words lose their meaning
Benefits of Word Embedding • Embedding layer maps each word to dense vector of numbers • Captures relationships between words • Finds different ways of saying the same thing • Understands words that are opposite in meaning • No need to build dictionaries
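To make the contrast with one-hot encoding concrete, here is a minimal R sketch (illustrative only; the words and the random embedding values are made up, since real embedding values are learned from data):

```r
words <- c("burst", "pipe", "leak", "fire", "smoke")

# One-hot: an identity matrix with one column per vocabulary word,
# so dimensionality grows with the vocabulary
one_hot <- diag(length(words))
dimnames(one_hot) <- list(words, words)

# Dense embedding: each word mapped to a short numeric vector
# (3 dimensions here, random values for illustration only)
embedding <- matrix(runif(length(words) * 3, -1, 1),
                    nrow = length(words),
                    dimnames = list(words, paste0("dim", 1:3)))

one_hot["pipe", ]    # sparse and orthogonal: no notion of similarity
embedding["pipe", ]  # dense: related words can end up with similar vectors
```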
GloVe Word Embedding Business problem - given a set of claims descriptions - Can we group claims? - For each group, can we understand the type of claim? Steps • Data cleaning - stringr, textTinyR • Word embedding - text2vec, textTinyR • Clustering - ClusterR, wordcloud
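A minimal sketch of the word-embedding step with text2vec (not the author's production code; `claims$description` is a hypothetical column name):

```r
library(text2vec)

# Tokenise the cleaned claims descriptions
tokens <- word_tokenizer(tolower(claims$description))
it     <- itoken(tokens, progressbar = FALSE)

# Build and prune the vocabulary, then a term co-occurrence matrix (TCM)
vocab      <- prune_vocabulary(create_vocabulary(it), term_count_min = 5)
vectorizer <- vocab_vectorizer(vocab)
tcm        <- create_tcm(it, vectorizer, skip_grams_window = 5)

# Fit GloVe and combine the main and context vectors
glove        <- GlobalVectors$new(rank = 50, x_max = 10)
word_vectors <- glove$fit_transform(tcm, n_iter = 20) + t(glove$components)
```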
Word Associations • Which words are most likely to occur near a target word?
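One way to answer this with the fitted vectors is text2vec's cosine similarity helper; a sketch, where the target word is just an example:

```r
# Cosine similarity between one target word and every word in the vocabulary
target <- word_vectors["burst", , drop = FALSE]  # example target word
sims   <- sim2(word_vectors, target, method = "cosine", norm = "l2")

# Ten most associated words (the target itself will rank first)
head(sort(sims[, 1], decreasing = TRUE), 10)
```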
Investigating Clusters of Claims [Word clouds for Cluster 1, Cluster 2 and Cluster 3]
GloVe - The Story so Far • Successes: • Determined word similarity • Clusters made sense and represented different events • Results are powerful, but: • Requires a few hundred lines of code • Does not build supervised embeddings • text2vec: http://text2vec.org/index.html • GloVe: http://nlp.stanford.edu/projects/glove/
StarSpace / ruimtehol • StarSpace: developed by Facebook AI • The R package ruimtehol allows: • Multi-label text classification • Word, sentence and document embeddings • Document and sentence similarity • Ranking web documents • Content-based / collaborative-filtering-based recommendation
Can we Predict Category using Text? • Pet invoice data • 44k invoice lines • Any categories incorrectly assigned? • Predict unassigned categories?
Tag Embedding • Build Tag Embedding model to predict category • Input - text, with words separated by spaces • Response - list of all categories (labels must contain no spaces) • Dataset split into train (35k) and test (9k)
Model Build • Tag embedding runs in just a few lines of code - see the sketch below
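A minimal sketch of what those few lines might look like with ruimtehol's embed_tagspace (the hyperparameters and the train$invoice_text / train$category column names are assumptions, not the author's exact code):

```r
library(ruimtehol)

# x: invoice-line text (words separated by spaces)
# y: category label per line (labels must contain no spaces)
set.seed(123456789)
model <- embed_tagspace(x = train$invoice_text, y = train$category,
                        dim = 50, epoch = 20, loss = "softmax",
                        similarity = "cosine", adagrad = TRUE,
                        ngrams = 2, minCount = 2)
```

Here loss = "softmax" treats the task as multi-class classification; StarSpace also offers ranking-style losses.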
Creating Word Embeddings • Simple code • How do we interpret embeddings? • “inj” and “injection” have similar values, so are closely related
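A sketch of how those embeddings can be pulled out for inspection with ruimtehol (the two terms are the ones quoted on the slide):

```r
# Extract the learned embedding vectors for two related terms
emb <- starspace_embedding(model, c("inj", "injection"))

# Rows with similar values suggest the model treats the words as related
round(emb[, 1:10], 3)
```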
Finding Associated Words: “Anaesthesia” • Model found “anaesthetic”, “gen”, “anaesthesia”, “anaes”, “ga” relate to “anaesthesia” • No dictionaries provided
Finding Associated Words: “Imaging” • Model found “xray”, “radiography”, “radiograph”, “radiographic”, “exposures”, “radiology”, “plate” relate to “Imaging”
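Neighbour lists like these can be reproduced with ruimtehol's similarity helper; a sketch (the candidate list is hand-picked for brevity - in practice you would score the whole dictionary):

```r
# Embed a target term and some candidate words, then rank candidates
# by cosine similarity to the target
target     <- starspace_embedding(model, "anaesthesia")
candidates <- starspace_embedding(model, c("anaesthetic", "gen", "anaes", "ga",
                                           "xray", "radiography", "plate"))
embedding_similarity(candidates, target, type = "cosine", top_n = 5)
```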
Predicting Category • “Metacam” predicted as “Drugs”:
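A sketch of the corresponding prediction call (ruimtehol's predict method for StarSpace models; the input line and k are illustrative):

```r
# Rank the k most likely categories for a new invoice line
predict(model, newdata = "metacam oral suspension ml", k = 3)
# Per the slide, "Drugs" should come back as the top-ranked label
```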
How Well Does our Model Predict? • t-SNE plot on unseen data • Embeddings reduced to 2D • Overlay with category • If embeddings are good, categories should form clusters
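The plot can be produced along these lines, assuming the Rtsne package (the slides do not name the plotting code; test$invoice_text and test$category are hypothetical column names):

```r
library(Rtsne)

# Embed the unseen invoice lines, then reduce the embeddings to 2D
emb  <- starspace_embedding(model, test$invoice_text)
tsne <- Rtsne(emb, dims = 2, perplexity = 30, check_duplicates = FALSE)

# Colour points by their assigned category; good embeddings should cluster
plot(tsne$Y, col = as.factor(test$category), pch = 19, cex = 0.5,
     xlab = "t-SNE 1", ylab = "t-SNE 2")
```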
How Well Does our Model Predict? • Unseen data, misclassification rate = 4% [Confusion matrix of predicted vs actual categories]
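The headline rate and matrix can be computed along these lines (a sketch; the exact shape of ruimtehol's predict output is an assumption here, so adapt the extraction step to what it returns):

```r
# Top predicted category per unseen invoice line
preds <- predict(model, newdata = test$invoice_text, k = 1)
pred  <- vapply(preds, function(p) p$prediction$label[1], character(1))

table(Predicted = pred, Actual = test$category)  # confusion matrix
mean(pred != test$category)                      # misclassification rate
```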
Finding Errors in Classification • Assigned = “Conditions”, model predicted = “Drugs”: • “metacam oral suspension ml give kg dose once daily with food stop if vomiting” • “prontosan wound gel ml” • Assigned = “Anaesthesia”, model predicted = “Imaging”: • “xray per extra plate without sedation”
Predicting Missing Category: “Anaesthesia” - Typos or Abbreviations • “general anaesthic extended” • “anaes isoflurane kg kg” • “gen anaestheticp kg”
Predicting Missing Category: “Drugs” - Typos • “metacarninj dogs cats mgml ml per ml” • “oomfortan lnj ml per ml q” • “comforlanlnj ml per ml”
Implementation Considerations • Word Embedding - powerful technique • The more data the better • Can be computationally intensive • Overcome with cloud servers and multi-core algorithms • Pre-trained embeddings available • Trained on large datasets • Embedding matrices very large - may need big machine to apply
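As a sketch of the pre-trained route, the Stanford GloVe vectors ship as space-separated plain text and can be loaded directly (the file name assumes the 6B/100d download from the GloVe site):

```r
library(data.table)

# Each row: a word followed by its vector, space-separated
glove_raw  <- fread("glove.6B.100d.txt", header = FALSE, quote = "")
pretrained <- as.matrix(glove_raw[, -1])
rownames(pretrained) <- glove_raw[[1]]

pretrained["claim", 1:5]  # first few dimensions of one word's vector
```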
Recent Developments • OpenAI's GPT-2 - “The AI That's Too Dangerous to Release” Input text: “In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains…” Model completion: “The scientist named the population, after their distinctive horn, Ovid’s Unicorn…”
Summary • Background to Word Embedding • GloVe – grouping claims using text • StarSpace – predicting category using text • Implementation considerations • Recent developments