TEXT ANALYTICS - LABS Maha Althobaiti Udo Kruschwitz Massimo Poesio
LABS • Basic text analytics: text classification using bags-of-words • Sentiment analysis of tweets using Python’s scikit-learn library • More advanced text analytics: information extraction using NLP pipelines • Named Entity Recognition
Sentiment analysis using scikit-learn • Materials for this part of the tutorial: • http://csee.essex.ac.uk/staff/poesio/Teach/TextAnalyticsTutorial/SentimentLab • Based on: chap. 6 of Building Machine Learning Systems with Python (Richert & Coelho)
TEXT ANALYTICS IN PYTHON • Text manipulation is not quite as easy in Python as in Perl, but there are a number of useful packages • SCIKIT-LEARN for machine learning, including basic text classification • NLTK for NLP processing, including libraries for tokenization, POS tagging, chunking, parsing, and NE recognition; also support for ML-based methods, e.g. for text classification
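As a quick illustration of the NLTK facilities just mentioned, tokenization and POS tagging each take a single call (a minimal sketch; the two nltk.download calls fetch the required resources):

import nltk
nltk.download("punkt")                        # tokenizer models
nltk.download("averaged_perceptron_tagger")  # POS tagger model

tokens = nltk.word_tokenize("I was mesmerized by film Y")
print(nltk.pos_tag(tokens))  # [('I', 'PRP'), ('was', 'VBD'), ...]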
SCIKIT-LEARN • An open-source library supporting machine learning work • Based on numpy, scipy, and matplotlib • Provides implementations of • Several supervised ML algorithms, including e.g. regression, Naïve Bayes, SVMs • Clustering • Dimensionality reduction • It includes several facilities to support text classification, including e.g. ways to create NLP pipelines out of components • Website: • http://scikit-learn.org/stable/
REMINDER: SENTIMENT ANALYSIS • (or opinion mining) • Develop algorithms that can identify the ‘sentiment’ expressed by a text • Product X sucks • I was mesmerized by film Y
SENTIMENT ANALYSIS AS TEXT CATEGORIZATION • Sentiment analysis can be viewed as just another type of text categorization, like spam detection or topic classification • Most successful approaches use SUPERVISED LEARNING: • Use corpora annotated for subjectivity and/or sentiment • To train models using supervised machine learning algorithms: • Naïve Bayes • Decision trees • SVM • Good results can already be obtained using only WORDS as features
TEXT CATEGORIZATION USING A NAÏVE BAYES, WORD-BASED APPROACH • Attributes are text positions, values are words.
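The decision rule behind this approach is the standard Naïve Bayes one (not shown on the original slide; w_i is the word at position i):

$$c_{NB} = \operatorname*{argmax}_{c \in C} \; P(c) \prod_{i} P(w_i \mid c)$$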
SENTIMENT ANALYSIS OF TWEETS • A very popular application of sentiment analysis is trying to extract sentiment towards products or organizations from people’s comments about them on Twitter • Several datasets exist for this task • E.g., SEMEVAL-2014 • In this lab: Nick Sanders’s dataset • 5000 tweets • Annotated as positive / negative / neutral / irrelevant • A list of ID / sentiment pairs, plus a script to download the tweets on the basis of their IDs
First Script • Start an IDLE window • Open the file: 01_start.py (but do not run it yet!!)
A word-based, Naïve Bayes sentiment analyzer using SciKit-Learn • The library sklearn.naive_bayes includes implementations of three Naïve Bayes classifiers • GaussianNB (for features that have a Gaussian distribution, e.g., physical traits – height, etc.) • MultinomialNB (when features are frequencies of words) • BernoulliNB (for boolean features) • For sentiment analysis: MultinomialNB
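In code, that choice is a one-line import and instantiation (a minimal sketch):

# The three Naïve Bayes variants provided by sklearn.naive_bayes
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

# Word-frequency features make the multinomial model the right fit here
clf = MultinomialNB()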
Creating the model • The words contained in the tweets are used as features. They are extracted and weighted using the function create_ngram_model • create_ngram_model uses the TfidfVectorizer class from the feature_extraction package in scikit-learn to extract terms from tweets • http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html • create_ngram_model uses MultinomialNB to learn a classifier • http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html • The Pipeline class of scikit-learn is used to combine the feature extractor and the classifier in a single object (an estimator) that can be used to extract features from data, create (‘fit’) a model, and use the model to classify • http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
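A minimal sketch of what create_ngram_model might look like (the exact parameter values in the tutorial’s script may differ; here unigrams through trigrams are assumed, other parameters are left at their defaults):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

def create_ngram_model():
    # Extract word n-grams from the tweets and weight them with TF-IDF
    tfidf = TfidfVectorizer(ngram_range=(1, 3), analyzer="word")
    # Multinomial Naïve Bayes over the weighted term frequencies
    clf = MultinomialNB()
    # Chain vectorizer and classifier into a single estimator
    return Pipeline([("vect", tfidf), ("clf", clf)])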
[Code slide: create_ngram_model — TfidfVectorizer extracts features and weights them, MultinomialNB is the Naïve Bayes classifier, Pipeline ties the two together]
Training and evaluation • The function train_model • Uses ShuffleSplit, from the cross_validation module of scikit-learn, to compute the folds to use in cross-validation • At each iteration, the function creates a model using fit, then evaluates the results using score
[Code slide: train_model — ShuffleSplit identifies the indices in each fold; fit trains the model]
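A sketch of train_model under those assumptions (written against the modern sklearn.model_selection API; the tutorial’s code uses the older sklearn.cross_validation module):

import numpy as np
from sklearn.model_selection import ShuffleSplit

def train_model(clf_factory, X, Y):
    X, Y = np.asarray(X), np.asarray(Y)
    # 10 random train/test splits, 30% of the data held out each time
    cv = ShuffleSplit(n_splits=10, test_size=0.3, random_state=0)
    scores = []
    for train_idx, test_idx in cv.split(X):   # indices in each fold
        clf = clf_factory()                   # e.g. create_ngram_model
        clf.fit(X[train_idx], Y[train_idx])   # train the model
        scores.append(clf.score(X[test_idx], Y[test_idx]))  # evaluate
    return np.mean(scores)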
Optimization • The program above uses the default values of the parameters for TfidfVectorizer and MultinomialNB • In text analytics it’s usually easy to build a first prototype, but lots of experimentation is needed to achieve good results • Alternative choices for TfidfVectorizer: • Using unigrams, bigrams, trigrams (the ngram_range parameter) • Removing stopwords (the stop_words parameter) • Using binary counts instead of frequencies (the binary parameter) • Alternative choices for MultinomialNB: • Which type of SMOOTHING to use (the alpha parameter)
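For instance, a variant with non-default settings might look like this (the parameter names are real TfidfVectorizer / MultinomialNB arguments; the values are illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

vect = TfidfVectorizer(ngram_range=(1, 2),    # unigrams and bigrams
                       stop_words="english",  # remove English stopwords
                       binary=True)           # binary counts
clf = MultinomialNB(alpha=0.1)                # smoothing strength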
Smoothing • Even a very large corpus remains a limited sample of language use, so many words, even common ones, will be missing from it • The problem is particularly acute with tweets, where a lot of ‘creative’ word use is found • Solution: SMOOTHING – redistribute the probability mass so that every word gets some • Most used: ADD ONE or LAPLACE smoothing
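With add-one (Laplace) smoothing the estimate for word w in class c becomes (a standard formulation; |V| is the vocabulary size):

$$P(w \mid c) = \frac{count(w, c) + 1}{\sum_{w'} count(w', c) + |V|}$$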
Optimization • Looking for the best values for the parameters is a standard operation in machine learning • Scikit-learn, like Weka and similar packages, provides a class (GridSearchCV) to explore the results that can be achieved with different parameter configurations
[Code slide: optimizing with GridSearchCV — note the double-underscore syntax for specifying the values of pipeline parameters, the F metric used for evaluation, and the grid over smoothing values]
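A sketch of such a grid search, reusing the "vect" and "clf" step names from the pipeline sketch above (the grid values are illustrative; Y is assumed to be binary 0/1 labels so that the F measure applies directly):

from sklearn.model_selection import GridSearchCV, ShuffleSplit

param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2), (1, 3)],
    "vect__stop_words": [None, "english"],
    "vect__binary": [True, False],
    "clf__alpha": [0.01, 0.05, 0.1, 0.5, 1.0],  # smoothing values
}
grid = GridSearchCV(create_ngram_model(), param_grid,
                    cv=ShuffleSplit(n_splits=10, test_size=0.3,
                                    random_state=0),
                    scoring="f1")  # evaluate with the F measure
grid.fit(X, Y)  # X = tweets, Y = 0/1 sentiment labels
print(grid.best_params_, grid.best_score_)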
Second Script • Start an IDLE window • Open the file: 02_tuning.py (but do not run it yet!!)
Additional improvements: normalization, preprocessing • Further improvements may be possible by doing some form of NORMALIZATION
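One way to plug normalization in is through TfidfVectorizer’s preprocessor hook. The sketch below is illustrative only: the emoticon table and the URL/mention rules are assumptions, not the tutorial’s actual rules:

import re
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical replacement table mapping emoticons to sentiment words
EMO_REPL = {":)": " good ", ":-)": " good ", ":(": " bad ", ":-(": " bad "}

def preprocessor(tweet):
    tweet = tweet.lower()
    for emo, repl in EMO_REPL.items():
        tweet = tweet.replace(emo, repl)
    tweet = re.sub(r"https?://\S+", " url ", tweet)  # collapse URLs
    tweet = re.sub(r"@\w+", " mention ", tweet)      # collapse @-mentions
    return tweet

vect = TfidfVectorizer(preprocessor=preprocessor, ngram_range=(1, 3))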
Other possible improvements • Using NLTK’s POS tagger • Using a sentiment lexicon such as SentiWordNet • http://sentiwordnet.isti.cnr.it/download.php • (in the data/ directory)
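SentiWordNet scores can be consulted through NLTK’s corpus reader, for example (a sketch; the two downloads fetch the required corpora):

import nltk
nltk.download("sentiwordnet")
nltk.download("wordnet")
from nltk.corpus import sentiwordnet as swn

# Positive, negative, and objectivity scores for one sense of ‘good’
good = swn.senti_synset("good.a.01")
print(good.pos_score(), good.neg_score(), good.obj_score())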
Third Script • (Start an IDLE window) • Open and run the file: 03_clean.py
NLTK http://www.nltk.org/book