Text Analytics for Unlocking the Potential of Big Data

Text Analytics for Unlocking the Potential of Big Data. 1. Text analytics & big data. 2. New opportunities with text analytics. 3. Challenges when mining text. 4. Solutions to overcome challenges. 5. Wrap-up. Bhavani Raskutti @ Pacific Brands.

Presentation Transcript


  1. Text Analytics for Unlocking the Potential of Big Data
     1 Text analytics & big data
     2 New opportunities with text analytics
     3 Challenges when mining text
     4 Solutions to overcome challenges
     5 Wrap-up
     Bhavani Raskutti @ Pacific Brands

  3. Text Analytics & Big Data

  9. New Opportunities with Text Analytics
     Mine freely available social media data for:
     • Understanding customer sentiment
     • Identifying major customer concerns
     • Tracking sentiment/issues over time
     Business implications:
     • Ability to act on negative sentiments quickly
     • Respond to customer concerns in a timely manner
     • Target initiatives appropriately by continuous tracking
     Superior market research & focus group outcomes

  10. New Opportunities: Sentiment Analysis
      Methodology:
      • Score based on positive & negative sentiment words
      • OR use supervised learning with labelled examples
      No sarcasm detection
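The word-based scoring option can be sketched as follows. This is a minimal illustration, not the presenter's implementation: the two word lists are hypothetical stand-ins for a real sentiment lexicon, and the score is simply the count of positive words minus the count of negative words.

```python
import re

# Hypothetical mini-lexicons; a real system would use a much larger curated list.
POSITIVE = {"good", "great", "love", "happy", "excellent"}
NEGATIVE = {"bad", "poor", "hate", "slow", "terrible"}

def sentiment_score(text):
    """Return (number of positive words) - (number of negative words)."""
    words = re.findall(r"[a-z]+", text.lower())
    positives = sum(w in POSITIVE for w in words)
    negatives = sum(w in NEGATIVE for w in words)
    return positives - negatives

print(sentiment_score("Great phone, love the camera"))        # above zero
print(sentiment_score("Terrible service and slow delivery"))  # below zero
print(sentiment_score("Oh great, another outage"))            # sarcasm still scores above zero
```

The last call shows the limitation the slide notes: a sarcastic complaint is scored by its surface words only.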

  11. New Opportunities: Topic Detection
      Methodology:
      • Create term frequency matrix from text sequences
      • Use unsupervised learning to create clusters
      • Create cluster descriptions
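The three steps above can be sketched in plain Python. The slide does not name a clustering algorithm, so k-means is used here as an assumption; the four sample comments, the choice of seed rows, and the "top terms of the centroid" description are likewise illustrative.

```python
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def tf_matrix(docs):
    """Step 1: one row per document, one column per term."""
    vocab = sorted({w for d in docs for w in tokenize(d)})
    rows = []
    for d in docs:
        counts = Counter(tokenize(d))
        rows.append([counts[t] for t in vocab])
    return vocab, rows

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(rows, seeds, iterations=10):
    """Step 2: plain k-means; `seeds` are row indices used as initial centroids."""
    centroids = [list(rows[i]) for i in seeds]
    labels = [0] * len(rows)
    for _ in range(iterations):
        labels = [min(range(len(centroids)), key=lambda c: sq_dist(r, centroids[c]))
                  for r in rows]
        for c in range(len(centroids)):
            members = [r for r, l in zip(rows, labels) if l == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

def describe(vocab, centroid, top=2):
    """Step 3: describe a cluster by its highest-weight terms."""
    ranked = sorted(zip(centroid, vocab), key=lambda p: (-p[0], p[1]))
    return [term for weight, term in ranked[:top] if weight > 0]

docs = ["battery drains fast", "delivery was late",
        "battery dies fast", "late delivery again"]
vocab, rows = tf_matrix(docs)
labels, centroids = kmeans(rows, seeds=[0, 1])
for c, centroid in enumerate(centroids):
    print(c, describe(vocab, centroid))
```

On real data the number of clusters and the initial seeds would have to be chosen with care; they are fixed here only to keep the sketch deterministic.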

  13. Challenges in Text Analytics
      • Creating term frequency matrix for machine learning
        • One row for each entry
        • One column for each term/feature describing the entries
      Treat non-alpha as white space
      Case-insensitive
      Term = word
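A small sketch of the matrix under the three rules listed on this slide (non-alphabetic characters become white space, case is ignored, each word is a term). The function names and the two sample entries are illustrative assumptions.

```python
import re
from collections import Counter

def tokenize(text):
    # Case-insensitive; every non-alphabetic character is treated as white space.
    return re.sub(r"[^a-z]", " ", text.lower()).split()

def term_frequency_matrix(entries):
    """Return (vocab, matrix): one row per entry, one column per term."""
    vocab = sorted({t for e in entries for t in tokenize(e)})
    matrix = []
    for e in entries:
        counts = Counter(tokenize(e))
        matrix.append([counts[t] for t in vocab])
    return vocab, matrix

entries = ["Bill was wrong!", "Wrong bill, wrong amount"]
vocab, matrix = term_frequency_matrix(entries)
print(vocab)
for row in matrix:
    print(row)
```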

  14. Challenges 1: Term Frequency Matrix
      • Presence of non-informative words
      • Different forms of the same words
      • Spelling errors & typos
      • Synonyms
      • Homonyms

  15. Challenges 2: Very Large Feature Space
      • Many different terms within a single entry
        • 10^4 features with just 50 to 100 entries
        • Sparse entries: many zeros in the matrix
      • Unsupervised learning
        • Hard to form cohesive clusters with sparse entries
      • Supervised learning
        • Traditional statistical learning techniques need at least 10 labelled examples for each uncorrelated feature
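To make the mismatch concrete, assume the slide's figure is 10^4 features (about ten thousand terms) and, as a worst case, that all of them are uncorrelated. The rule of thumb of 10 labelled examples per feature then gives:

```latex
\[
  n_{\text{required}} \;\ge\; 10 \times 10^{4} \;=\; 10^{5}
  \qquad\text{versus}\qquad
  n_{\text{available}} \;\approx\; 50\text{--}100 .
\]
```

Under these assumptions the labelled data falls short by roughly three orders of magnitude.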

  17. Solutions 1: Term Frequency Matrix
      • Presence of non-informative words
        • Create a list of stopwords
        • Remove them from consideration
      • Different forms of the same words
        • Use rule-based stemming to remove suffixes
      • Spelling errors & typos
        • Use a spell-checker, OR
        • Use n-grams (character sequences) as features
        • 5-grams for 'single bill': 'singl', 'ingle', 'ngle ', 'gle b', 'le bi', 'e bil', ' bill'
      • Synonyms
        • Use a thesaurus (manual or statistical)
      • Homonyms
        • Provide context by using word pairs or triplets as features
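Several of these fixes are short enough to sketch directly. The stopword list and suffix rules below are toy assumptions (a production system would use a full stopword list and an established stemmer); the character n-gram function reproduces the slide's 'single bill' example.

```python
import re

STOPWORDS = {"the", "a", "is", "my", "was", "and", "of"}  # hypothetical short list
SUFFIXES = ("ing", "ed", "es", "s")                        # toy rules, not a full stemmer

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def remove_stopwords(words):
    return [w for w in words if w not in STOPWORDS]

def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def char_ngrams(text, n=5):
    """Overlapping character sequences of length n, spaces included."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def word_pairs(words):
    """Adjacent word pairs, which give ambiguous words some context."""
    return [" ".join(pair) for pair in zip(words, words[1:])]

print(remove_stopwords(tokenize("My bill was wrong")))
print([stem(w) for w in ["billing", "billed", "bills"]])
print(char_ngrams("single bill"))
print(word_pairs(["bank", "account", "closed"]))
```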

  18. Solutions 2: Very Large Feature Space
      Unsupervised learning
      Hard to form cohesive clusters with sparse entries
      • Use feature selection to identify significant features
      • Features are of 3 types:
        • Very frequent, low information content (e.g., stopwords)
        • Infrequent, low information content (occurs once or twice in the set)
        • Significant middle-frequency features
      • Many statistical techniques
        • Inverse document frequency weight
        • Signal-noise ratio
        • Average discrimination value
        • …
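As one example of these techniques, the sketch below computes the common inverse document frequency weight, idf(t) = log(N / df(t)), and keeps only middle-frequency terms. The slide does not give a formula or cut-offs, so the formula variant, the thresholds `min_df` and `max_df_ratio`, and the sample documents are all assumptions.

```python
import math
import re

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def document_frequency(docs):
    """Number of documents in which each term appears at least once."""
    df = {}
    for d in docs:
        for term in set(tokenize(d)):
            df[term] = df.get(term, 0) + 1
    return df

def idf_weights(docs):
    """idf(t) = log(N / df(t)); a term found in every document gets weight 0."""
    n = len(docs)
    return {t: math.log(n / count) for t, count in document_frequency(docs).items()}

def select_features(docs, min_df=2, max_df_ratio=0.6):
    """Drop very rare and very frequent terms; thresholds are illustrative."""
    n = len(docs)
    df = document_frequency(docs)
    return sorted(t for t, count in df.items()
                  if min_df <= count <= max_df_ratio * n)

docs = ["the bill was wrong", "the bill was late",
        "the network was slow", "the network dropped"]
idf = idf_weights(docs)
print(idf["the"], idf["bill"])
print(select_features(docs))
```

In this sample, "the" and "was" fall in the very frequent group, "wrong" and "dropped" in the infrequent group, and "bill" and "network" are the middle-frequency terms that survive.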

  19. Solutions 2: Very Large Feature Space (Cont’d)
      Supervised learning
      Traditional statistical learning techniques need at least 10 labelled examples for each uncorrelated feature
      • Use new techniques based on maximal margin separators that can handle large feature space
        • Support Vector Machines

  20. Solutions: Support Vector Machines
      Objective: To learn a separator to identify people likely to churn before they do
      Customers who are loyal
      Customers who churned to other providers

  21. Solutions: Support Vector Machines
      What is a good separator?
      Maximises margin between two parallel supporting hyperplanes
      Separator depends on support vectors
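For linearly separable data, this criterion is usually written as the standard hard-margin problem below. The notation is an assumption, since the slide gives none: x_i are the feature vectors, y_i in {+1, -1} the class labels, and w, b define the separator.

```latex
\[
  \min_{w,\,b}\ \tfrac{1}{2}\lVert w\rVert^{2}
  \quad\text{subject to}\quad
  y_i\,(w \cdot x_i + b) \;\ge\; 1 \quad \text{for all } i .
\]
% Supporting hyperplanes:  w \cdot x + b = +1  and  w \cdot x + b = -1
% Margin between them:     2 / \lVert w \rVert
```

Minimising the norm of w widens the gap between the two supporting hyperplanes. The support vectors are the points that lie exactly on those hyperplanes, which is why the separator depends on them alone.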

  22. Solutions: Support Vector Machines
      Why does maximising margins work?
      Small margin means more choice & overfits data
      Large margin means less choice & no overfitting

  23. Solutions 2: Very Large Feature Space (Cont’d)
      Supervised learning
      Traditional statistical learning techniques need at least 10 labelled examples for each uncorrelated feature
      • Use new techniques based on maximal margin separators that can handle large feature space
        • Support Vector Machines
          • Maximises margin between two classes
          • Separator depends only on support vectors
          • Separator obtained using quadratic programming
          • Available in some statistical packages
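To show the idea without a statistical package, here is a small linear SVM sketch. Note the substitution: it uses sub-gradient descent on the regularised hinge loss rather than the quadratic programming the slide mentions, so it only approximates the maximal-margin separator. The six two-feature points, the labels (+1 for churned, -1 for loyal), and the learning-rate settings are all made up for illustration.

```python
def train_linear_svm(points, labels, lr=0.01, lam=0.01, epochs=1000):
    """Approximate a max-margin separator by sub-gradient descent (not QP)."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # L2 shrinkage: a smaller ||w|| corresponds to a wider margin.
            w = [wi * (1 - lr * lam) for wi in w]
            if margin < 1:
                # Point is inside the margin or misclassified: push the separator.
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def predict(w, b, x):
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

# Hypothetical data: +1 = churned, -1 = loyal.
points = [(2, 2), (3, 3), (3, 2), (0, 0), (1, 0), (0, 1)]
labels = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(points, labels)
print([predict(w, b, x) for x in points])
```

Points that already sit outside the margin trigger no update apart from the shrinkage step, which mirrors the slide's point that the separator depends only on the support vectors.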

  24. Wrap-up
      • Text analytics creates new opportunities for businesses to understand their customers
        • Understanding customer sentiment
        • Identifying major customer concerns
        • Tracking sentiment/issues over time
      • A few challenges in implementing text analytics
        • Creating term frequency matrix from text sequence
        • Large number of features in matrix
      • Many techniques to overcome these challenges
      Now is the time to use text analytics to unlock the potential of big data in your business!
