Text Analytics for Unlocking the Potential of Big Data. 1. Text analytics & big data. 2. New opportunities with text analytics. 3. Challenges when mining text. 4. Solutions to overcome challenges. 5. Wrap-up. Bhavani Raskutti @ Pacific Brands.
Text Analytics for Unlocking the Potential of Big Data 1 Text analytics & big data 2 New opportunities with text analytics 3 Challenges when mining text 4 Solutions to overcome challenges 5 Wrap-up Bhavani Raskutti @ Pacific Brands
New Opportunities with Text Analytics Mine freely available social media data for: • Understanding customer sentiment • Identifying major customer concerns • Tracking sentiment/issues over time Business implications: • Ability to act on negative sentiments quickly • Respond to customer concerns in a timely manner • Target initiatives appropriately by continuous tracking Superior market research & focus group outcomes
New Opportunities Sentiment Analysis Methodology: • Score based on counts of positive & negative sentiment words • OR use supervised learning with labelled examples Limitation: no sarcasm detection
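The word-scoring approach on this slide can be sketched in a few lines of Python. This is only an illustration: the two word lists and the function name `sentiment_score` are hypothetical, and a real sentiment lexicon would be far larger.

```python
import re

# Hypothetical mini-lexicons for illustration only.
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "poor", "hate", "terrible", "slow"}

def sentiment_score(text):
    """Return (number of positive words) - (number of negative words)."""
    words = re.findall(r"[a-z]+", text.lower())
    positive = sum(1 for w in words if w in POSITIVE)
    negative = sum(1 for w in words if w in NEGATIVE)
    return positive - negative
```

A sarcastic comment such as "Oh great, another delay" scores as positive under this scheme, which is the limitation the slide notes.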
New Opportunities Topic Detection Methodology: • Create term frequency matrix from text sequences • Use unsupervised learning to create clusters • Create cluster descriptions
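The three steps above can be sketched as follows. The slide does not name a clustering algorithm, so plain k-means is an assumption here, as are the function names and the choice to seed the centroids with the first k rows (done only to keep the example reproducible).

```python
import re
from collections import Counter

def term_matrix(texts):
    """Step 1: term frequency matrix, one row per text, one column per word."""
    counts = [Counter(re.findall(r"[a-z]+", t.lower())) for t in texts]
    vocabulary = sorted(set().union(*counts))
    return vocabulary, [[c[t] for t in vocabulary] for c in counts]

def kmeans(rows, k, iterations=10):
    """Step 2: plain k-means; initial centroids are the first k rows."""
    centroids = [list(r) for r in rows[:k]]
    labels = [0] * len(rows)
    for _ in range(iterations):
        # Assign each row to its nearest centroid (squared Euclidean distance).
        for i, row in enumerate(rows):
            dists = [sum((a - b) ** 2 for a, b in zip(row, c)) for c in centroids]
            labels[i] = dists.index(min(dists))
        # Move each centroid to the mean of the rows assigned to it.
        for j in range(k):
            members = [rows[i] for i in range(len(rows)) if labels[i] == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

def describe(centroid, vocabulary, top=2):
    """Step 3: describe a cluster by its highest-weight terms."""
    ranked = sorted(zip(centroid, vocabulary), key=lambda p: (-p[0], p[1]))
    return [term for weight, term in ranked[:top] if weight > 0]
```

For example, clustering the four made-up comments "bill too high", "delivery late", "high bill again" and "late delivery again" with `k=2` groups the billing comments together and the delivery comments together.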
Challenges in Text Analytics • Creating term frequency matrix for machine learning • One row for each entry • One column for each term/feature describing the entries Simplest tokenisation: • Treat non-alphabetic characters as white space • Case-insensitive • Term = word
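A minimal sketch of that matrix, using exactly the tokenisation rules on the slide (non-alphabetic characters become white space, case is ignored, each word is a term). The function names are illustrative, not from the talk.

```python
import re
from collections import Counter

def tokenize(text):
    # Treat every non-alphabetic character as white space; ignore case.
    return re.sub(r"[^a-z]+", " ", text.lower()).split()

def term_frequency_matrix(entries):
    counts = [Counter(tokenize(entry)) for entry in entries]
    vocabulary = sorted(set().union(*counts))               # one column per term
    matrix = [[c[term] for term in vocabulary] for c in counts]  # one row per entry
    return vocabulary, matrix
```

Even for two short entries most cells are zero, which hints at the sparsity problem discussed on the following slides.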
Challenges 1. Term Frequency Matrix • Presence of non-informative words • Different forms of the same word • Spelling errors & typos • Synonyms • Homonyms
Challenges 2. Very Large Feature Space • Many different terms within a single entry • 10^4 features with just 50 to 100 entries • Sparse entries: many zeros in the matrix • Unsupervised learning: hard to form cohesive clusters with sparse entries • Supervised learning: traditional statistical learning techniques need at least 10 labelled examples for each uncorrelated feature (on the order of 10^5 labelled examples if all 10^4 features were uncorrelated)
Solutions 1. Term Frequency Matrix • Presence of non-informative words: create a list of stopwords and remove them from consideration • Different forms of the same word: use rule-based stemming to remove suffixes • Spelling errors & typos: use a spell-checker OR use character n-grams (character sequences) as features, e.g. the 5-grams for 'single bill' are 'singl', 'ingle', 'ngle ', 'gle b', 'le bi', 'e bil', ' bill' • Synonyms: use a thesaurus (manual or statistical) • Homonyms: provide context by using word pairs or triplets as features
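Several of these fixes are simple enough to sketch directly. The stopword list and suffix rules below are deliberately tiny and hypothetical (this is not Porter's algorithm or any particular library's stemmer); the n-gram helper reproduces the 'single bill' example from the slide.

```python
STOPWORDS = {"the", "a", "is", "my", "and", "to"}  # hypothetical short list
SUFFIXES = ("ing", "ed", "es", "s")                # crude illustrative rules

def remove_stopwords(words):
    return [w for w in words if w not in STOPWORDS]

def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def char_ngrams(text, n=5):
    """Overlapping character sequences; a typo only disturbs nearby n-grams."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def word_ngrams(words, n=2):
    """Word pairs (n=2) or triplets (n=3) give homonyms some context."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
```

With character 5-grams, a misspelling such as 'singel bill' still shares ' bill' and 'l bil' style fragments with the correct form, so the two entries keep some features in common.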
Solutions 2. Very Large Feature Space Unsupervised learning: hard to form cohesive clusters with sparse entries • Use feature selection to identify significant features • Features are of 3 types: • Very frequent, low information content (e.g., stopwords) • Infrequent, low information content (occurs once or twice in the set) • Significant middle-frequency features • Many statistical techniques: • Inverse document frequency weight • Signal-to-noise ratio • Average discrimination value • …
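As a rough sketch of the first technique, the code below computes document frequencies and the common `log(N / df)` form of the inverse document frequency weight, then keeps only middle-frequency terms. The thresholds `min_df` and `max_fraction` are hypothetical values chosen for illustration and would need tuning on real data.

```python
import math

def document_frequency(matrix):
    """Number of entries in which each term (column) appears at least once."""
    return [sum(1 for row in matrix if row[j] > 0) for j in range(len(matrix[0]))]

def idf_weights(matrix):
    """Inverse document frequency: terms found in every entry get weight 0."""
    n = len(matrix)
    return [math.log(n / df) if df else 0.0 for df in document_frequency(matrix)]

def select_middle_frequency(vocabulary, matrix, min_df=2, max_fraction=0.8):
    """Drop very rare terms (df < min_df) and very common ones (df > max_fraction * N)."""
    n = len(matrix)
    frequencies = document_frequency(matrix)
    return [t for t, df in zip(vocabulary, frequencies) if min_df <= df <= max_fraction * n]
```

A term that appears in every entry carries no power to separate entries, which is why its weight is zero and why it is filtered out alongside one-off terms.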
Solutions 2. Very Large Feature Space (Cont’d) Supervised learning: traditional statistical learning techniques need at least 10 labelled examples for each uncorrelated feature • Use newer techniques based on maximal margin separators that can handle a large feature space • Support Vector Machines
Solutions Support Vector Machines Objective: To learn a separator that identifies people likely to churn before they do Two classes: • Customers who are loyal • Customers who churned to other providers
Solutions Support Vector Machines What is a good separator? • Maximises margin between two parallel supporting hyperplanes • Separator depends on support vectors
Solutions Support Vector Machines Why does maximising the margin work? • Small margin: many separators fit the data (more choice), so the model tends to overfit • Large margin: fewer candidate separators (less choice), so less risk of overfitting
Solutions 2. Very Large Feature Space (Cont’d) Supervised learning: traditional statistical learning techniques need at least 10 labelled examples for each uncorrelated feature • Use newer techniques based on maximal margin separators that can handle a large feature space • Support Vector Machines • Maximise margin between two classes • Separator depends only on support vectors • Separator obtained using quadratic programming • Available in some statistical packages
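For reference, the quadratic program behind a maximal margin separator can be written out. This is the standard textbook hard-margin form for linearly separable data, not something shown on the slides; the symbols are assumptions: $x_i$ is the feature vector of entry $i$, $y_i \in \{-1, +1\}$ its class label (e.g. churned vs. loyal), and $w$, $b$ define the separator $w \cdot x + b = 0$.

```latex
\begin{aligned}
\min_{w,\,b} \quad & \tfrac{1}{2}\,\lVert w \rVert^{2} \\
\text{subject to} \quad & y_i \,(w \cdot x_i + b) \ge 1, \qquad i = 1, \dots, n .
\end{aligned}
```

The two supporting hyperplanes are $w \cdot x + b = \pm 1$, and the distance between them is $2 / \lVert w \rVert$, so minimising $\lVert w \rVert^{2}$ maximises the margin. The support vectors are the entries for which the constraint holds with equality; removing any other entry leaves the solution unchanged.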
Wrap-up • Text analytics creates new opportunities for businesses to understand their customers • Understanding customer sentiment • Identifying major customer concerns • Tracking sentiment/issues over time • A few challenges in implementing text analytics • Creating term frequency matrix from text sequence • Large number of features in matrix • Many techniques to overcome these challenges Now is the time to use text analytics to unlock the potential of big data in your business!!