
Harnessing AI to Create Insight from Text

Learn about Word Embedding, GloVe, StarSpace, and predicting categories from text data. Explore recent AI developments and implementation considerations. Dive into text analysis tools with Amanda Beedham, Data Scientist.



Presentation Transcript


  1. Harnessing AI to Create Insight from Text Amanda Beedham, Data Scientist, RSA

  2. Contents • Background to Word Embedding • GloVe – grouping claims using text • StarSpace – predicting category using text • Implementation considerations • Recent developments • Q & A

  3. All About Context “You shall know a word by the company it keeps” (John Rupert Firth, 1957)

  4. Word Embedding – Rapid Pace of Change

  5. Why All the Investment in Word Embedding? • To use text data in predictive models, we must represent it numerically • What about one-hot encoding? • Problems: • High dimensionality means reduced efficiency • Words lose their meaning (see the sketch below)
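
A minimal sketch of the one-hot problem (the vocabulary and words here are invented for illustration; they are not from the slides):

```r
# One-hot encoding: each word becomes a vector as long as the vocabulary.
vocab_toy <- c("fire", "water", "leak", "storm", "theft")  # real vocabularies run to tens of thousands of terms
one_hot <- function(word, vocab) as.integer(vocab == word)
one_hot("leak", vocab_toy)
#> [1] 0 0 1 0 0
# Every pair of distinct words is orthogonal, so "leak" is no closer
# to "water" than to "theft": the representation carries no meaning.
```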

  6. Benefits of Word Embedding • Embedding layer maps each word to dense vector of numbers • Captures relationships between words • Finds different ways of saying the same thing • Understands words that are opposite in meaning • No need to build dictionaries
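
To see what "captures relationships" means in practice, here is a toy comparison of dense vectors (the values are invented for illustration; real models learn them from data):

```r
# Toy 3-dimensional embeddings; real models typically use 50-300 dimensions.
wv_toy <- rbind(leak  = c(0.8, 0.1, 0.3),
                water = c(0.7, 0.2, 0.4),
                theft = c(-0.5, 0.9, 0.0))
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
cosine(wv_toy["leak", ], wv_toy["water", ])  # high: related words
cosine(wv_toy["leak", ], wv_toy["theft", ])  # low: unrelated words
```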

  7. GloVe Word Embedding Business problem: given a set of claims descriptions - Can we group claims? - For each group, can we understand the type of claim? Steps • Data cleaning - stringr, textTinyR • Word Embedding - text2vec, textTinyR • Clustering - ClusterR, wordcloud (a sketch of the embedding step follows)
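
The slides show no code, but a typical text2vec GloVe fit looks roughly like this; the column name claims$description and all hyperparameter values are assumptions, not the presenter's settings:

```r
library(text2vec)

# Tokenise the cleaned claim descriptions
tokens <- word_tokenizer(tolower(claims$description))
it     <- itoken(tokens, progressbar = FALSE)

# Vocabulary, pruned of rare terms, then the term co-occurrence matrix
vocab      <- prune_vocabulary(create_vocabulary(it), term_count_min = 5)
vectorizer <- vocab_vectorizer(vocab)
tcm        <- create_tcm(it, vectorizer, skip_grams_window = 5)

# Fit GloVe (recent text2vec versions call the embedding size `rank`)
glove   <- GlobalVectors$new(rank = 50, x_max = 10)
wv_main <- glove$fit_transform(tcm, n_iter = 20)
wv      <- wv_main + t(glove$components)  # combine main and context vectors
```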

  8. Word Associations: Which words are most likely to occur near a target word?
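
With the fitted vectors from the previous sketch, word associations fall out of a cosine-similarity lookup ("burst" is a hypothetical target word, not one from the slides):

```r
# Rank the whole vocabulary by cosine similarity to one target word
target  <- wv["burst", , drop = FALSE]
cos_sim <- sim2(x = wv, y = target, method = "cosine", norm = "l2")
head(sort(cos_sim[, 1], decreasing = TRUE), 10)  # ten most associated words
```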

  9. Investigating Clusters of Claims [Slide shows word clouds for Cluster 1, Cluster 2, and Cluster 3]
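
A sketch of how clusters and word clouds like these could be produced with the packages named on slide 7 (k = 3 simply mirrors the three clusters shown; this is not the presenter's code):

```r
library(ClusterR)
library(wordcloud)

# Cluster the GloVe word vectors
km <- KMeans_rcpp(wv, clusters = 3, num_init = 5, seed = 42)

# Word cloud for one cluster, sized by corpus frequency from `vocab`
in_cluster <- rownames(wv)[km$clusters == 1]
freqs      <- vocab$term_count[match(in_cluster, vocab$term)]
wordcloud(words = in_cluster, freq = freqs, max.words = 100)
```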

  10. GloVe - The Story so Far • Successes: • Determined word similarity • Clusters made sense and represented different events • Results powerful, but: • Requires a few hundred lines of code • Does not build supervised embeddings • text2vec: http://text2vec.org/index.html • GloVe: http://nlp.stanford.edu/projects/glove/

  11. StarSpace / ruimtehol • StarSpace: developed by Facebook AI • The R package ruimtehol allows: • Multi-label text classification • Word, sentence, document embeddings • Document and sentence similarity • Ranking web documents • Content-based/collaborative-filtering-based recommendation

  12. Pet Invoice Data: Predicting Category Using Tag Embedding

  13. Can we Predict Category using Text? • Pet invoice data • 44k invoice lines • Any categories incorrectly assigned? • Predict unassigned categories?

  14. Tag Embedding • Build Tag Embedding model to predict category • Input: text, separated by spaces • Response: list of all categories, no spaces • Dataset split into train (35k) and test (9k)

  15. Model Build • Tag embedding runs in just a few lines of code (a sketch follows)
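
A ruimtehol tag-embedding fit is indeed only a few lines. This sketch assumes columns train$text and train$category; the hyperparameter values are illustrative, not the presenter's settings:

```r
library(ruimtehol)
set.seed(123)

# Train a TagSpace model: invoice text in, category labels out.
# Labels must contain no spaces, matching the data preparation on slide 14.
model <- embed_tagspace(x = tolower(train$text),
                        y = train$category,
                        dim = 50, epoch = 20, loss = "softmax",
                        adagrad = TRUE, similarity = "cosine",
                        negSearchLimit = 50, ngrams = 2, minCount = 2)
```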

  16. Creating Word Embeddings • Simple code • How do we interpret embeddings? • "inj" and "injection" have similar values, so the model treats them as closely related (see the sketch below)
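
One way to check the "inj" / "injection" relationship with ruimtehol (a sketch under the assumptions above, not the slide's code):

```r
# Retrieve the learned vectors for two spellings and compare them
emb <- starspace_embedding(model, c("inj", "injection"))
embedding_similarity(emb["inj", , drop = FALSE],
                     emb["injection", , drop = FALSE],
                     type = "cosine")
# A value near 1 indicates the model treats them as closely related.
```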

  17. Finding Associated Words: “Anaesthesia” • Model found “anaesthetic”, “gen”, “anaesthesia”, “anaes”, “ga” relate to “anaesthesia” • No dictionaries provided
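
A plausible way to reproduce this lookup with ruimtehol's nearest-neighbour helper (a sketch, not the presenter's code):

```r
# Ten nearest neighbours of "anaesthesia" in the learned embedding space
starspace_knn(model, newdata = "anaesthesia", k = 10)
```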

  18. Finding Associated Words: “Imaging” • Model found “xray”, “radiography”, “radiograph”, “radiographic”, “exposures”, “radiology”, “plate” relate to “Imaging”

  19. Predicting Category • “Metacam” predicted as “Drugs”:
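
A hedged sketch of the prediction call with ruimtehol; the invoice text is shortened for illustration, and argument names may differ by package version:

```r
# Most likely category tags for a new invoice line
predict(model, newdata = "metacam oral suspension ml", type = "labels")
```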

  20. How Well Does our Model Predict? • t-SNE plot on unseen data • Embeddings reduced to 2D • Overlay with category • If embeddings are good, categories should form clusters (see the sketch below)
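
A sketch of the t-SNE check; the Rtsne package and the test$text / test$category columns are assumptions, as the slides do not name them:

```r
library(Rtsne)

# Embed unseen test text, project to 2D, and colour points by actual category
emb_test <- starspace_embedding(model, tolower(test$text))
tsne     <- Rtsne(emb_test, dims = 2, perplexity = 30, check_duplicates = FALSE)
plot(tsne$Y, col = as.factor(test$category), pch = 19, cex = 0.5,
     xlab = "t-SNE 1", ylab = "t-SNE 2")
```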

  21. How Well Does our Model Predict? • On unseen data, misclassification rate = 4% [Slide shows a confusion matrix of actual vs. predicted categories]

  22. Finding Errors in Classification • Assigned = "Conditions", model predicted = "Drugs": • "metacam oral suspension ml give kg dose once daily with food stop if vomiting" • "prontosan wound gel ml" • Assigned = "Anaesthesia", model predicted = "Imaging": • "xray per extra plate without sedation"

  23. Predicting Missing Category: "Anaesthesia" – Typos or Abbreviations • "general anaesthic extended" • "anaes isoflurane kg kg" • "gen anaestheticp kg"

  24. Predicting Missing Category: "Drugs" – Typos • "metacarninj dogs cats mgml ml per ml" • "oomfortan lnj ml per ml q" • "comforlanlnj ml per ml"

  25. Implementation Considerations • Word Embedding is a powerful technique • The more data the better • Can be computationally intensive • Overcome with cloud servers and multi-core algorithms • Pre-trained embeddings available, trained on large datasets (loading sketch below) • Embedding matrices are very large - may need a big machine to apply them
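
A sketch of loading pre-trained GloVe vectors in R (the file comes from the Stanford GloVe page linked on slide 10; the 840B-token matrices run to several GB, hence the "big machine" caveat):

```r
# Read the space-delimited GloVe text file: one word per row,
# followed by its embedding values. quote = "" avoids parsing issues.
glove_raw <- data.table::fread("glove.6B.100d.txt", header = FALSE, quote = "")
wv_pre    <- as.matrix(glove_raw[, -1, with = FALSE])  # 400k words x 100 dims
rownames(wv_pre) <- glove_raw[[1]]
```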

  26. Recent Developments: OpenAI's GPT-2 "The AI That's Too Dangerous to Release" Input text: "In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains…" Model completion: "The scientist named the population, after their distinctive horn, Ovid's Unicorn…"

  27. Recent Developments

  28. Summary • Background to Word Embedding • GloVe – grouping claims using text • StarSpace – predicting category using text • Implementation considerations • Recent developments

  29. Any Questions?
