
Introduction to JMP Text Explorer Platform

Learn how to explore and quantify text comments using JMP tools. This tutorial provides examples of data curation steps and demonstrates ways to model ratings data with quantified text comments.


Presentation Transcript


  1. Introduction to JMP Text Explorer Platform. Jeff Swartzel and Tracy Desch, with credit to Scott Reese and Jeremy Christman.

  2. Objectives • To introduce text exploration via JMP tools • To provide examples of data curation steps • To illustrate ways to quantify text comments • To explore modeling ratings data with quantified text comments

  3. Data requirements • Stacked data file • One row per comment • Matched ratings • There can be exceptions to this approach. • Make sure that you don’t have duplicate documents in your corpus! (A quick check is sketched below.)
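
As a quick duplicate check outside JMP, here is a minimal pandas sketch; the file name and the `comment`/`rating` column names are hypothetical placeholders for your own stacked data file.

```python
import pandas as pd

# One row per comment, with its matched rating (column names are placeholders).
df = pd.read_csv("reviews.csv")

# Flag exact duplicate comments and keep only the first occurrence,
# so no document is double counted in the corpus.
dupes = df.duplicated(subset="comment", keep="first")
print(f"{dupes.sum()} duplicate documents found")
df = df[~dupes].reset_index(drop=True)
```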

  4. Overall process • Data preparation, a.k.a. curation • Tokenizing • Recoding • Phrasing • Stemming • Stop words • Analysis • Which terms are most common? • In what context are terms used? • Which terms appear together? Are there recurring themes? • Modeling • Save vectors to table • Model building

  5. Our Data Source • Amazon product reviews and metadata from 1996 to 2014 • Found here: http://jmcauley.ucsd.edu/data/amazon/ • We focused on gourmet food review summaries • Example summaries: “TASTES GOOD AND IS GOOD FOR YOU” (5 stars) • “I wish I could find these in a store instead of online!” (5 stars) • “Not natural/organic at all” (1 star) • “Mixed thoughts” (3 stars) • “bugs all over it” (2 stars) • “Flavorful, great price, and surprisingly not that hot” (5 stars)
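
These Amazon files are distributed as gzipped JSON-lines. A minimal loading sketch, assuming strict JSON per line and the dataset’s published field names (`summary` for the review summary, `overall` for the star rating); some files on that page use Python-literal lines instead, in which case `json.loads` would need replacing.

```python
import gzip
import json
import pandas as pd

rows = []
# File name is illustrative; download from the page linked above.
with gzip.open("reviews_Grocery_and_Gourmet_Food_5.json.gz", "rt") as f:
    for line in f:
        r = json.loads(line)
        rows.append({"summary": r["summary"], "rating": r["overall"]})

df = pd.DataFrame(rows)  # one row per review: summary text + star rating
print(df.head())
```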

  6. Key Definitions • Term – the smallest piece of text, similar to a word in a sentence • Phrase – a collection of terms that occur together • Document – all of the terms in a specific row/column intersection • This is often the panelist’s response to a single question. • Corpus – a collection of documents; a single column • Stemming – the process of removing word endings from words with the same beginning. • “Dogs”, “Doggies”, “Dog” can all be stemmed to “Dog-” so they are counted the same
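
JMP has its own stemmer, but the classic Porter stemmer from NLTK illustrates the idea on simple cases (its stems will not always match JMP’s):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["dog", "dogs", "jump", "jumped", "jumping"]:
    print(word, "->", stemmer.stem(word))
# "dog"/"dogs" collapse to "dog"; "jump"/"jumped"/"jumping" collapse
# to "jump", so the variants are counted as the same term.
```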

  7. Overall process – bag of words approach • A document term matrix is built • A table that is used to COUNT all the terms • Maintained in the background due to its size • Very sparse matrix (mostly zeros, i.e., few nonzero entries) • Basis for most of the Text Analysis options • The manner in which words are counted can be modified • For example: yes or no (binary) • For example: no occurrences, one occurrence, or many occurrences (ternary) • Singular Value Decomposition (discussed later…) • A method of dimensionality reduction • Preserves as much information as possible while reducing the number of columns

  8. How does it work? • The Document Term Matrix (DTM) is a table of all terms. • Every term is given its own column • The simplest weighting creates a binary response for each term: • 1 = the term IS present in the comment • 0 = the term is NOT present in the comment
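
This is not JMP’s internal code, but a minimal scikit-learn sketch of the same binary document term matrix makes the structure concrete:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "tastes good and is good for you",
    "not natural at all",
    "bugs all over it",
]

# binary=True yields the 1/0 "is the term present?" matrix; it is stored
# sparsely because most cells in a real corpus are zero.
vec = CountVectorizer(binary=True)
dtm = vec.fit_transform(docs)

print(vec.get_feature_names_out())  # one column per term
print(dtm.toarray())                # one row per document
```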

  9. Counting • There are several ways to ‘count’ the terms; JMP refers to this as “weighting”. • The weighting determines the values that go into the cells of the document term matrix. • Binary: • 1 if a term occurs in a document and 0 if not. • Ternary: • 2 if a term occurs more than once in a document, 1 if it occurs exactly once, and 0 otherwise. • Frequency: • The count of a term’s occurrences in each document. • Log Freq: • Log10(1 + x), where x is the count of a term’s occurrences in each document. • TF-IDF: • Term Frequency–Inverse Document Frequency. This is the default approach. • A weighting that balances how often a term occurs in a document against how many documents contain it. • TF * log( nDoc / nDocTerm ), where: • TF = frequency of the term in the document • nDoc = number of documents in the corpus • nDocTerm = number of documents that contain the term
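
A small numpy sketch of these weightings, taking TF to be the raw count as in the slide’s formula (JMP’s exact scaling and log base may differ):

```python
import numpy as np

# Toy raw counts: rows = documents, columns = terms.
counts = np.array([
    [2, 0, 1],
    [0, 1, 0],
    [3, 1, 0],
])

binary   = (counts > 0).astype(int)   # 1 if present, 0 if not
ternary  = np.minimum(counts, 2)      # 0 / 1 / 2 (two or more capped at 2)
freq     = counts                     # raw frequency
log_freq = np.log10(1 + counts)       # Log10(1 + x)

# TF-IDF per the slide: TF * log(nDoc / nDocTerm)
n_doc      = counts.shape[0]           # documents in the corpus
n_doc_term = (counts > 0).sum(axis=0)  # documents containing each term
tf_idf     = counts * np.log(n_doc / n_doc_term)

print(tf_idf)
```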

  10. CURATION Why curation? • JMP’s bag of words approach uses a frequency count for analysis. • Therefore, HOW you count the words matters. • Curation is the process of wrangling the data into a form that is useful for you. • For example, would you count the following as separate terms or as the same: • Perfume, Parfume, Pefume, Perfumes, perfumed? • Clean, Kleen, Cleaner, Cleaning lady, Cleaners, cleaned, cleaning? • Perfume, scent, aroma, odor, smell, fragrance?

  11. CURATION Curation tools • Recode • Enables you to change the values of one or more terms. Select the terms in the list before selecting this option. • Always recode before stemming. • Add phrase • Adds the selected phrases to the Term List and updates the Term Counts accordingly. • Only added phrases will be included in the analysis and Document Term Matrix. • Stemming • Combines words with identical beginnings (stems) by removing the endings that differ. • This results in “jump”, “jumped”, and “jumping” all being treated as the term “jump·”. • Add stop word • Excludes a word that is not providing benefit to the analysis. • For example, if every review contains “diaper”, that term provides no additional benefit to the analysis. • There is a default list of stop words (such as “the”, “of”, “or”, etc.) that can be modified and saved. • Although stop words are not eligible to be terms, they can be used in phrases.
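
As an outside-of-JMP illustration of recoding and stop words (the recode map and the extra stop word are invented for the example):

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Recode first, as the slide advises: fold typos and synonyms into one term.
recodes = {"parfume": "perfume", "pefume": "perfume", "scent": "perfume"}

def recode(text):
    return " ".join(recodes.get(w, w) for w in text.lower().split())

docs = ["Great parfume", "Lovely scent", "The pefume smelled wonderful"]
cleaned = [recode(d) for d in docs]

# Built-in English stop word list plus a domain-specific addition,
# mirroring JMP's editable default stop word list.
stops = list(ENGLISH_STOP_WORDS) + ["wonderful"]

vec = CountVectorizer(stop_words=stops)
dtm = vec.fit_transform(cleaned)
print(vec.get_feature_names_out())  # all three spelling variants count as "perfume"
```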

  12. CURATION Recode example • Used for the following: • Correcting typos or misspellings • Combining synonyms • Grouping terms together based on category expertise about known themes or topics

  13. ANALYSIS Common analysis questions: Which terms are most common? In what context are terms used? Which terms appear together? Are there recurring themes? How can I use this in a predictive model?

  14. ANALYSIS Topic Analysis, Rotated SVD “Performs a varimax rotated singular value decomposition of the document term matrix to produce groups of terms called topics.” In other words… it takes the Document Term Matrix, which is mostly 0’s, and converts it into a more compact data set in which each topic is oriented toward a set of words. Topic analysis is similar to factor analysis. You need to set the number of vectors, which is how many ‘topics’ you will end up with. Negative loadings indicate terms that occur less frequently in a topic than terms with positive loadings.
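
For intuition only, here is a rough scikit-learn/numpy sketch of the same idea: a truncated SVD of a weighted DTM followed by a hand-rolled varimax rotation. JMP’s exact algorithm, weighting, and scaling may differ.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

def varimax(loadings, gamma=1.0, max_iter=100, tol=1e-6):
    """Return the orthogonal rotation matrix R for a varimax rotation."""
    p, k = loadings.shape
    R = np.eye(k)
    var = 0.0
    for _ in range(max_iter):
        L = loadings @ R
        u, s, vt = np.linalg.svd(
            loadings.T @ (L**3 - (gamma / p) * L @ np.diag((L**2).sum(axis=0)))
        )
        R = u @ vt
        new_var = s.sum()
        if new_var < var * (1 + tol):
            break
        var = new_var
    return R

docs = ["tastes good", "good for you", "bugs all over it", "not natural at all"]
dtm = TfidfVectorizer().fit_transform(docs)

svd = TruncatedSVD(n_components=2, random_state=0)  # 2 = number of topics
doc_scores = svd.fit_transform(dtm)

R = varimax(svd.components_.T)
term_loadings = svd.components_.T @ R  # terms x topics: defines each topic
doc_topics = doc_scores @ R            # documents x topics: topic scores
```

Each column of `term_loadings` is one topic; the sign and size of a term’s loading show how strongly that term is associated with the topic.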

  15. ANALYSIS Which terms appear together? • Let’s start with 20 topics • For a real analysis, you would modify this several times to generate a meaningful set • Rotated SVD • Red triangle hotspot > Topic Analysis, Rotated SVD • Keep all defaults for now • Select OK • In future runs, you could modify the weighting

  16. ANALYSIS A note on iterations Data cleaning will continue after you conduct your SVD. It most often takes place as: clean the data, conduct the SVD, clean the data, conduct the SVD… repeat until the results are meaningful to you.

  17. ANALYSIS Which terms appear together? • Iterate to find the optimal topics by modifying the following: • Number of topics • Stop words • Consider treating low-frequency words as stop words • Use various approaches on your newly quantified text to improve your understanding of it: • Partition – shows the biggest breaks in the data • Generalized Regression – shows model effects • Tabulate, then Graph Builder of topics – shows the biggest differences between products

  18. ANALYSIS How can I use this in a predictive model? • Use SVD to understand themes (like Principal Components or Factor Analysis) • This helps: • Group comments by theme • Discover the common themes much faster • Turn comments into a series of continuous factors • Implemented directly in JMP 14

  19. ANALYSIS How can I use this in a predictive model? • We will start with the Topic Analysis, Rotated SVD results • Approach: • Curate data to useful topics of interest • Save vectors for each topic to the data table • Use various tools to further the analysis and drive understanding of impact
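
Continuing the sketch from the Rotated SVD slide: the “save vectors” step amounts to treating each document’s topic scores as continuous predictors, which can then feed any model. A minimal illustration with made-up ratings:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# doc_topics comes from the earlier sketch (one row per document, one
# column per topic); ratings are the matched star ratings (made up here).
ratings = np.array([5, 5, 2, 1])

model = LinearRegression().fit(doc_topics, ratings)
print(model.coef_)  # effect of each topic score on the star rating
```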

  20. Key points to remember: • Text Explorer is intended to combine similar terms, recode terms, and provide understanding of underlying patterns to enable efficient exploration of the comments. • JMP uses a ‘bag of words’ approach: frequency matters, not order. • The iterative steps will take time and effort. • It is still necessary to explore and read actual verbatims. • JMP’s “Show Text” tool can help with this. • Category expertise will result in the most robust learnings and insight creation.

  21. Overall process (one more time) • Data preparation, a.k.a. curation • Tokenizing • Recoding • Phrasing • Stemming • Stop words • Analysis • Which terms are most common? • In what context are terms used? • Which terms appear together? Are there recurring themes? • Modeling • Save vectors to table • Model building

  22. Remember: JMP text tools are intended to enable more insight generation by making you more efficient! Context is critical. Use SHOW TEXT.

  23. Set display options to be the default: File > Preferences > Platform > Text Explorer
