Blogvox2: A Modular Domain Independent Sentiment Analysis System Sandeep Balijepalli Masters Thesis, 2007
Overview • Introduction / Motivation • Problem Statement & Contribution • Related Work • Framework • Sentiment Filters • Search and Trend Analysis • Experiments & Results • Conclusion & Future Work
Social Media & Blogs • “Social media defines the socialization of information as well as the tools to facilitate conversations.” [1] (Examples: MySpace, YouTube, Wikipedia…) • Blogs are popular because they let authors express opinions and critique topics. • We focus on political blogs since they are rich in sentiment-bearing words. • Examples: Hillary Clinton, Barack Obama, and Howard Dean are among the well-known politicians who use blogs. [1] http://en.wikipedia.org/wiki/Social_media
Motivation • Lack of a domain-independent framework for sentiment analysis • Upcoming elections: a better tool for politicians and for the average American • Need for sentence-level analysis in sentiment classification • Opinmind was proprietary
Overview • Introduction / Motivation • Problem Statement & Contribution • Related Work • Framework • Sentiment Filters • Search and Trend Analysis • Experiments & Results • Conclusion & Future Work
Problem Statement • Analyze sentiment detection at the sentence level. • Examine the performance of various techniques employed for classification. • Develop a sentiment analysis framework that is domain independent.
Contribution • Sentence level sentiment analysis framework. • Prototype applications to use the framework. • Performance analysis of different filter techniques. • Worked with Justin Martineau to develop trend analysis. • Akshay Java provided the political URL dataset.
Overview • Introduction / Motivation • Problem Statement & Contribution • Related Work • Framework • Sentiment Filters • Search and Trend Analysis • Experiments & Results • Conclusion & Future Work
Related Work • Blogvox1 [1] • Scoring is done at the document level; sentence-level analysis is needed • Classification is based on a bag-of-words approach; other machine learning techniques would improve the results • Turney (2002) [2] • Unsupervised review classification. • Works at the paragraph level; it is difficult to classify blog sentences with this method. [1] Akshay Java, Pranam Kolari, Tim Finin, James Mayfield, Anupam Joshi, and Justin Martineau, BlogVox: Separating Blog Wheat from Blog Chaff, January 2007 [2] Peter D. Turney, Thumbs up or thumbs down?: semantic orientation applied to unsupervised classification of reviews, Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, July 07-12, 2002, Philadelphia, Pennsylvania
Related Work cont… • Pang, Lee & Vaithyanathan (2002) [1] • Several techniques are analyzed, showing that unigrams perform well in the movie domain. • But according to Engstrom [2], these techniques are domain dependent. • Soo-Min Kim and Eduard Hovy [3] • They use a seed wordlist and a unigram approach to identify sentence sentiment. • This is not sufficient, as the seed wordlist drawn from the WordNet dataset introduces a lot of noise [4]. [1] Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. Thumbs up? sentiment classification using machine learning techniques. 2002. [2] Charlotte Engstrom. Topic dependence in sentiment classification. Master’s thesis, University of Cambridge, July 2004. [3] Soo-Min Kim and Eduard Hovy. Determining the Sentiment of Opinions. Proceedings of the 20th International Conference on Computational Linguistics (COLING), August 23-27, Geneva, Switzerland. 2004. [4] Brian Eriksson, Sentiment Classification of Movie Reviews using Linguistic Parsing, 2005
Overview • Introduction / Motivation • Problem Statement & Contribution • Related Work • Framework • Sentiment Filters • Search and Trend Analysis • Experiments & Results • Conclusion & Future Work
Framework (architecture diagram: posts from political blogs such as www.dailykos.com and www.mediamatters.com are fetched via their permalinks, sentences such as “Obama is good.”, “I like Edwards.”, “President Bush is good.”, and “Edwards is hasty.” are extracted and classified, and indexed results can be retrieved for entities like Bush, Clinton, and Obama or queries such as “Hillary Clinton”.)
Datasets (Political URLs) Datasets employed: • Lada A. Adamic political dataset – 3028 political URLs • Lada A. Adamic labeled dataset – 1490 blogs • Twitter dataset [1] • Spinn3r dataset – live feeds [2] Experimental analysis: 109 feeds were used. [1] www.twitter.com [2] www.tailrank.com
Overview • Introduction / Motivation • Problem Statement • Related Work • Framework • Sentiment Filters • Search and Trend Analysis • Experiments & Results • Conclusion & Future Work
Overview of Filters in Sentiment Analysis (pipeline diagram: sentences pass through the Pattern Recognizer, Naïve Bayes (unigram and bigram), and Parts of Speech filters; sentences judged subjective (Yes) are sent to the multiple indexer, while objective sentences (No) are discarded.)
Datasets (filter) Pattern matching dataset (classified manually): • 92 positive patterns • 163 negative patterns Training datasets (Naïve Bayes): • Political dataset (classified manually): 273 negative, 320 neutral, and 178 positive sentences • Movie dataset: 5331 negative, 5000 neutral, and 5331 positive sentences The political wordlist contains [1]: • 2712 negative words • 915 positive words [1] Akshay Java, Pranam Kolari, Tim Finin, James Mayfield, Anupam Joshi, and Justin Martineau, BlogVox: Separating Blog Wheat from Blog Chaff, January 2007
Pattern Recognition Filter – Overview “The pattern recognizer filter is a custom-developed, domain-based filter for identifying sentiment patterns.” (Pipeline diagram as above, with the Pattern Recognizer stage highlighted.)
Pattern Recognition Filter – Working Model Chunked sentences are passed to the pattern recognizer; sentences that match a pattern (Yes) are sent to the multiple indexer, while non-matching sentences (No) are discarded. Example sentences: “She is well respected and won many admirers for her staunch support for women.”, “I hate George Bush.”, “John Edwards is my least favorite.”, “I want to be like Hillary.”, “I admire Hillary.”, “Obama is annoying.” Sample index entries: • Sentence: “they hate Bush” | Date: Thu Apr 19, 2007 at 08:14:12 PM PDT | Url: www.mediamatters.com | Permalink: http://mediamatters.org/items/200508290005 | Polarity: negative | Strength: 1 • Sentence: “I like Clinton.” | Date: Thu Apr 19, 2007 at 08:14:12 PM PDT | Url: www.dailykos.com | Permalink: http://www.dailykos.com/story/2007/4/13/114310/235 | Polarity: positive | Strength: 1
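A minimal sketch of the pattern recognizer's matching step. The regular expressions below are illustrative stand-ins inspired by the example sentences; the thesis's actual 92 positive and 163 negative manually collected patterns are not shown on the slide.

```python
import re

# Hypothetical patterns standing in for the manually collected pattern lists.
POSITIVE_PATTERNS = [r"\bI (?:like|love|admire)\b", r"\bwant to be like\b",
                     r"\bwell respected\b"]
NEGATIVE_PATTERNS = [r"\bhate\b", r"\bleast favorite\b", r"\bis annoying\b"]

def recognize(sentence):
    """Return (polarity, strength) if any pattern fires, else None.
    Strength counts how many patterns matched; unmatched sentences
    are treated as objective and are not indexed."""
    for patterns, polarity in ((POSITIVE_PATTERNS, "positive"),
                               (NEGATIVE_PATTERNS, "negative")):
        hits = sum(1 for p in patterns if re.search(p, sentence, re.IGNORECASE))
        if hits:
            return polarity, hits
    return None

print(recognize("I like Clinton."))  # ('positive', 1)
print(recognize("they hate Bush"))   # ('negative', 1)
```

A matched sentence would then be stored in the index with its date, URL, permalink, polarity, and strength, as in the sample entries above.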
Naïve Bayes Filter – Overview “The Naïve Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem with strong (naïve) independence assumptions.” (Pipeline diagram as above, with the Naïve Bayes unigram/bigram stage highlighted.)
Naïve Bayes Analysis – Outline Each document d is represented by the document vector d = (n1(d), …, nm(d)), where {f1, f2, …, fm} is a set of predefined features and ni(d) is the number of times feature fi occurs in d. If a sentence S consists of words Wi, i = 1…n, then: • Probability of the sentence being positive: P(Spos) = Σi Wi,pos / (Wi,neg + Wi,pos + Wi,neut) • Probability of the sentence being negative: P(Sneg) = Σi Wi,neg / (Wi,neg + Wi,pos + Wi,neut) “This is a slight modification of the Naïve Bayes method.”
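The modified scoring above can be sketched as follows. The per-word counts are hypothetical placeholders; in the real system they come from the political and movie training datasets, and the add-one smoothing for unseen words is an assumption of this sketch, not something stated on the slide.

```python
# Hypothetical word -> (negative, neutral, positive) occurrence counts.
COUNTS = {
    "hillary":  (10, 30, 40),
    "exciting": (2, 5, 33),
    "leader":   (5, 10, 25),
}

def polarity_scores(sentence, counts=COUNTS, smoothing=1):
    """Average per-word class probabilities, following the modified
    Naive Bayes above: each word contributes
    W_i,pos / (W_i,neg + W_i,pos + W_i,neut), and the sum is averaged
    over the sentence length."""
    words = [w.strip(".,!?").lower() for w in sentence.split()]
    pos = neg = 0.0
    for w in words:
        n, u, p = counts.get(w, (0, 0, 0))
        total = n + u + p + 3 * smoothing  # add-one smoothing (assumption)
        pos += (p + smoothing) / total
        neg += (n + smoothing) / total
    return pos / len(words), neg / len(words)

p_pos, p_neg = polarity_scores("Hillary is an exciting leader.")
# The sentence is indexed as subjective when a score clears the threshold.
```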
Naïve Bayes Analysis – Working Model Example – unigram analysis of “Hillary is an exciting leader.”: • The per-word probabilities from the negative training dataset sum to 1.28; averaged over the 5 words, this is below the .6 threshold. • The per-word probabilities from the positive training dataset average to approximately .6, meeting the threshold. • Hence, the sentence is classified as positive. For bigrams, the same procedure is applied using pairs of adjacent words instead of single words.
Threshold Analysis for the Naïve Bayes Filter • A threshold of .7 misses many subjective sentences, so it does not capture the expected number of them. • A threshold of .5 indexes many sentences, both objective and subjective; since indexing unwanted objective sentences must be avoided, we do not choose .5. • Based on our experimental analysis, the optimal threshold value is .6.
Parts of Speech Filter – Overview (Pipeline diagram as above, with the Parts of Speech stage highlighted.)
Parts of Speech Analysis – Outline “Part-of-speech tagging, also called grammatical tagging, is the process of marking up the words in a text as corresponding to a particular part of speech, based on both its definition and its context.” – Wikipedia For example: Mr. Bill Clinton, the former president of the United States, will become personal advisor of Hillary, Clinton announced yesterday in New York. Tagged: Mr.$NNP Bill$NNP Clinton$NNP ,$, the$DT former$JJ president$NN of$IN the$DT United$NNP States$NNP ,$, will$MD become$VB personal$NN advisor$NN of$IN Hillary$NN ,$, Clinton$NN announced$VBD yesterday$RB in$IN New$NNP York$NNP .$. Tag key: NN singular or mass noun, NNP proper noun, DT singular determiner, JJ adjective, IN preposition, MD modal, VB verb (base form), VBD verb (past tense), RB adverb. Working model: • The unigrams and bigrams are tagged with parts of speech [1]. • Each sentence is passed through and evaluated against the tagged Naïve Bayes model. • The operation is otherwise the same as the Naïve Bayes filters. [1] www.lingpipe.com
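A sketch of turning a sentence into the word$TAG features fed to the tagged Naïve Bayes filter. The thesis used the LingPipe tagger; the dictionary lookup below is a toy stand-in so the feature construction is visible, with unknown words defaulting to NN as an assumption.

```python
# Toy tag lookup standing in for a real POS tagger (the thesis used LingPipe).
TAGS = {"the": "DT", "former": "JJ", "president": "NN", "of": "IN",
        "announced": "VBD", "yesterday": "RB", "will": "MD"}

def tagged_unigrams(sentence):
    """Convert a sentence into word$TAG unigram features; unknown words
    default to NN in this sketch."""
    words = [w.strip(".,").lower() for w in sentence.split()]
    return [f"{w}${TAGS.get(w, 'NN')}" for w in words]

print(tagged_unigrams("The former president announced"))
# ['the$DT', 'former$JJ', 'president$NN', 'announced$VBD']
```

Tagged bigrams would pair two adjacent word$TAG features, mirroring the untagged bigram filter.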
Named Entity – Overview (Pipeline diagram as above, with a Named Entity component added.)
Named Entity – Overview cont… Problem: “I hate Bush, but I like Obama” • Current approaches discard such sentences. • Our solution: the named entity component returns the number of entities in the sentence (here, 2). • Whenever more than one entity is returned, the system reduces the sentence's score rather than removing it from the search results.
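A minimal sketch of the down-weighting step. The entity list is hypothetical (a real system would use a named-entity recognizer), and dividing the score by the entity count is this sketch's choice of "reduce", since the slide does not give the exact formula.

```python
# Hypothetical entity list; a real system would use an NER component.
ENTITIES = {"bush", "obama", "clinton", "hillary", "edwards"}

def adjusted_score(sentence, base_score):
    """Count named entities in the sentence; with more than one entity the
    sentiment may refer to either, so reduce the score (here: divide by the
    entity count) instead of dropping the sentence entirely."""
    words = {w.strip(".,!?").lower() for w in sentence.split()}
    n = len(words & ENTITIES)
    return base_score / n if n > 1 else base_score

print(adjusted_score("I hate Bush, but I like Obama", 1.0))  # 0.5
print(adjusted_score("I like Obama", 1.0))                   # 1.0
```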
Overview • Introduction / Motivation • Problem Statement • Related Work • Framework • Sentiment Filters • Search and Trend Analysis • Experiments & Results • Conclusion & Future Work
Search and Trend Analysis Search analysis: • Queries are “boosted” for performance enhancement. • Query: “George Bush” • Results: “George Bush is a great guy” (terms together – high score) • “George’s last name Bush is …” (terms spaced up to 10 words apart – medium score) • “I dislike Bush”, “I love George” (either one of the terms – low score)
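The three boosting tiers can be sketched as a small scoring function. The tier weights (3, 2, 1) are illustrative assumptions, not the thesis's actual boost values, and this sketch only handles two-term queries.

```python
def boost_score(query_terms, sentence):
    """Three-tier query boosting for a two-term query: adjacent terms score
    highest, terms within a 10-word window score medium, and a single
    matching term scores lowest. Weights 3/2/1 are illustrative."""
    words = [w.strip(".,!?").lower() for w in sentence.split()]
    positions = [[i for i, w in enumerate(words) if w == t.lower()]
                 for t in query_terms]
    if all(positions):  # both terms present
        gap = min(abs(i - j) for i in positions[0] for j in positions[1])
        if gap == 1:
            return 3  # terms together - high score
        if gap <= 10:
            return 2  # terms within a 10-word window - medium score
    return 1 if any(positions) else 0  # one term - low score; none - no match

print(boost_score(["George", "Bush"], "George Bush is a great guy"))  # 3
print(boost_score(["George", "Bush"], "I dislike Bush"))              # 1
```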
Search & Trend Analysis – Search screen shots Two Panel View
Search & Trend Analysis – Search screen shots cont… Four Panel View
Search & Trend Analysis – Search screen shots Cont… Polarity Distribution
Search and Trend Analysis cont… “Top topics are terms that have consistently been a point of discussion in the blogosphere.” (e.g. Bush, Iraq, Bomb) • Terms are computed by analyzing word frequencies in the index. • The top 100 English words, dates, and numbers are screened out. “Hot topics are terms that are currently a point of discussion in the blogosphere.” (e.g. Virginia, Immigration) • Computed by employing K-L divergence: Dkl(P||T) = Σi P(i) log(P(i)/T(i)) where Dkl is the Kullback–Leibler divergence, P is the true (current) distribution, and T is the target word distribution.
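The K-L divergence above can be computed directly. The two term distributions below are hypothetical; the idea is that a term whose current frequency spikes relative to its long-run background rate contributes a large divergence and surfaces as a hot topic.

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_i P(i) * log(P(i) / Q(i)); terms with P(i) = 0
    contribute nothing by convention."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical term distributions: current week vs. long-run background.
current = [0.5, 0.3, 0.2]
background = [0.2, 0.4, 0.4]
print(round(kl_divergence(current, background), 3))  # 0.233
```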
Search & Trend Analysis – Search screen shots Cont… Top Term Analysis
Overview • Introduction / Motivation • Problem Statement • Related Work • Framework • Sentiment Filters • Search and Trend Analysis • Experiments & Results • Conclusion & Future Work
Effect of Pattern Matching Analysis Pattern Matching analysis Pattern matching does not capture most of the subjective sentences
Effect of Pattern Matching Analysis Cont… Problem: • Bloggers do not write in a formal manner. • Bloggers generally do not care about grammar, spelling, and punctuation in their blogs. • Only a small pattern dataset was collected [95 positive and 162 negative patterns]. • Example that caused problems: “Bush suckz” (slang terms) Possible solutions: • A spelling checker is one way to improve the results. • A larger pattern dataset is required to improve the analysis.
Effect of Pattern Matching Analysis Cont… Confusion matrix: • Accuracy = 58% • True Positive Rate (Recall) = 18% • False Positive Rate (FP) = 2% • True Negative Rate = 98% • False Negative Rate (FN) = 82% • Precision = 92% (Positive = subjective, Negative = objective sentence)
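The rates on these result slides all derive from a confusion matrix in the standard way. The raw counts below are hypothetical (the slides report only the rates, not the counts), chosen to roughly reproduce the pattern-matching figures.

```python
def metrics(tp, fp, fn, tn):
    """Derive the reported rates from raw confusion-matrix counts
    (positive = subjective sentence, negative = objective sentence)."""
    return {
        "accuracy":  (tp + tn) / (tp + fp + fn + tn),
        "recall":    tp / (tp + fn),   # true positive rate
        "fp_rate":   fp / (fp + tn),
        "tn_rate":   tn / (fp + tn),
        "fn_rate":   fn / (tp + fn),
        "precision": tp / (tp + fp),
    }

# Hypothetical counts approximating the pattern-matching slide's rates:
m = metrics(tp=18, fp=2, fn=82, tn=98)
print(m["accuracy"], m["recall"], m["fp_rate"])  # 0.58 0.18 0.02
```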
Effect of Naive Bayes Analysis Unigram analysis • Unigram captures most of the subjective sentences.
Unigram vs Patterns • The graph shows that unigrams perform better than pattern matching techniques.
Effect of Naive Bayes Analysis cont… Confusion matrix (Unigrams) • Accuracy = 77% • True Positive Rate (Recall) = 63% • False Positive Rate (FP) = 10% • True Negative Rate = 90% • False Negative Rate (FN) = 37% • Precision (Positive) = 86% (Positive = subjective, Negative = objective sentence)
Effect of Naive Bayes Analysis Bigram analysis • Bigrams perform better than Pattern matching. • Bigrams do not perform as well as unigrams. • Lack of domain independent dataset affects the results.
Effect of Naive Bayes Analysis cont… Confusion matrix (Bigram) • Accuracy = 70% • True Positive Rate (Recall) = 50% • False Positive Rate (FP) = 10% • True Negative Rate = 91% • False Negative Rate (FN) = 50% • Precision (Positive) = 83% (Positive = subjective, Negative = objective sentence)
Effect of Naive Bayes Analysis Unigram + Bigram analysis • Results are similar to the unigram analysis, which implies that adding bigrams does not make a significant difference.
Effect of Naive Bayes Analysis cont… Confusion matrix (Unigram + Bigram) • Accuracy = 77% • True Positive Rate (Recall) = 64% • False Positive Rate (FP) = 10% • True Negative Rate = 90% • False Negative Rate (FN) = 36% • Precision (Positive) = 86% (Positive = subjective, Negative = objective sentence)
Effect of Naive Bayes Analysis Cont… Problem: • We used the movie training dataset [1] along with the custom-developed political dataset, so the training data is not fully domain specific. Possible solutions: • A more domain-specific dataset should be collected to improve this technique. • An analysis of trigrams would be useful for comparison. [1] http://www.cs.cornell.edu/People/pabo/movie-review-data/
Effect of Parts of Speech Analysis Parts of Speech analysis • Parts of speech does not perform as well as unigrams.
Effect of Parts of Speech Analysis Cont… Problem: • Currently, the training data for this analysis is not blog specific; it is collected from news articles, which follow a standard format and style. Possible solutions: • Develop or obtain a blog-specific training dataset. • Combining this filter with the others could improve the results.
Effect of Parts of Speech Analysis Cont… Confusion matrix (Parts of Speech) • Accuracy = 73% • True Positive Rate (Recall) = 60% • False Positive Rate (FP) = 13% • True Negative Rate = 87% • False Negative Rate (FN) = 40% • Precision (Positive) = 82% (Positive – Subjective, Negative – Objective Sentence)
Results • “Unigram” and “Unigram + Bigram” outperform all other filters. • Although parts of speech tagging performs well, its precision is lower than that of the other filters. • The pattern matching technique could be improved by obtaining a larger dataset, which is a non-trivial task.