Sentiment and Textual Analysis of Create-Debate Data
EECS 595: End Term Project
Poorva Potdar
EUREKA!! Getting the Idea
• Why sentiment analysis?
  • Huge amount of opinionated text on the web.
  • Sentiment analysis on the web: popularity of a product, a movie or a person.
• Idea:
  • Create Debate: an online debating forum where people argue for or against a topic.
  • Mine for the salient text features of agreement/disagreement posts.
Creating the Haystack
• 14308 debates
• 983800 sentences
• 178290 posts, 9194 users
• Labeled dataset: Neutral, Agreement, Disagreement
• Structural Analysis: features of the language in a post that make it a high-scoring agreement/disagreement post.
• Behavioral Analysis: aspects of a user’s behavior that give the user a high rank on the forum.
What's the gain?
• Influence detection in a community
• Sub-group detection
• Stance identification: are there any visible groups with a particular stance?
• Predicting the crowd trend for a particular topic of interest
• Text summarization
Experiment 1: Polarity Measure
• Intuition: Is the number of +ve/-ve words indicative of how popular a post is?
• Tool: OpinionFinder / WordNet.
• Output of data processed by OpinionFinder:
  <MPQASRC>It</MPQASRC> <MPQASD>think</MPQASD> it's <MPQAPOL autoclass="negative">wrong</MPQAPOL> to <MPQASD>assume</MPQASD> that in order to be a revolutionary thinker you have to be <MPQAPOL autoclass="negative">crazy</MPQAPOL>
  • MPQAPOL: indicates the polarity of a word, like “bad”.
  • MPQASRC: indicates the opinion source in the sentence, like “It”.
  • MPQASD: direct subjective expression in the sentence, like “said”.
• Result:
  • No evident correlation between the number of polar words and the rank of the post.
  • Authors use a roughly equal distribution of positive and negative words when expressing agreement or disagreement.
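A minimal sketch of how the polar words in output like the sample above could be counted per post. The regular expression and the function name `count_polar_words` are illustrative assumptions, not part of OpinionFinder or of the project code; only the tag format is taken from the slide.

```python
import re
from collections import Counter

# Matches polarity tags in the format shown in the sample output.
# (Assumption: the polarity class always sits in an `autoclass` attribute.)
POLARITY_TAG = re.compile(r'<MPQAPOL autoclass="(\w+)">(.*?)</MPQAPOL>')


def count_polar_words(annotated: str) -> Counter:
    """Count tagged words per polarity class in one annotated post."""
    return Counter(label for label, _word in POLARITY_TAG.findall(annotated))


sample = (
    "<MPQASRC>It</MPQASRC> <MPQASD>think</MPQASD> it's "
    '<MPQAPOL autoclass="negative">wrong</MPQAPOL> to '
    "<MPQASD>assume</MPQASD> that in order to be a revolutionary thinker "
    'you have to be <MPQAPOL autoclass="negative">crazy</MPQAPOL>'
)

counts = count_polar_words(sample)
print(counts["negative"], counts["positive"])  # prints: 2 0
```

The per-post counts can then be compared against the labeled score of each post.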
Experiment 2: Readability Measure
• Intuition: Do posts that are more readable/formal gain higher scores?
• Tool: Flesch Toolkit, used to compute the Flesch readability measure for each post.
• Calculated Pearson’s coefficient between the labeled score and the Flesch score of each post.
• Result: High correlation. The more formal the language of a post, the more points it receives.
• Eg 1: “good times . . .bring it back ! -------------=-=-=-=-=-=-=-=-=-=-=-=-==-=-=- ))))))))))))” [Flesch: 0, labeled points: 1]
• Eg 2: “Vegetables is often seen as more healthy than eating meat.” [Flesch: 93.12, labeled points: 29 (max)]
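A rough sketch of the two computations named in this experiment: the standard Flesch Reading Ease formula and Pearson's coefficient. The syllable counter is a crude vowel-group heuristic chosen for illustration, so its scores will not match the Flesch Toolkit exactly; all function names are hypothetical.

```python
import math
import re


def count_syllables(word: str) -> int:
    """Heuristic (assumption): count vowel groups, drop a trailing silent 'e'."""
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and n > 1:
        n -= 1
    return max(n, 1)


def flesch_reading_ease(text: str) -> float:
    """206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        return 0.0  # nothing to score
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))


def pearson(xs, ys) -> float:
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sx * sy)
```

Usage would be along the lines of `pearson([flesch_reading_ease(p) for p in posts], labeled_scores)`, where `posts` and `labeled_scores` stand in for the dataset.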
Experiment 3: Emoticon Analysis
• Intuition: Do emoticons in agreement/disagreement posts correlate with the labeled scores of those posts?
• Tool: CMU ARK Tagger (the Stanford Parser doesn’t scale well).
• Calculated Pearson’s coefficient between the labeled score and the number of +ve/-ve emoticons for agreement/disagreement posts.
• Result: High correlation between the number of emoticons and the rank of disagreement posts.
• Analysis: Authors tend to use expressive emoticons such as smileys to give a sarcastic opinion about a particular argument.
  • “Hey! What’s that supposed to mean?;)”
  • “Sure If you say so :P”
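A minimal sketch of counting +ve/-ve emoticons per post. It uses a small regular expression instead of the CMU ARK Tagger, covers only a few common Western emoticons, and the positive/negative split is an assumption made for illustration; it will also miscount text such as "(see note:)".

```python
import re

# Eyes, optional nose, mouth. Deliberately small; extend as needed.
EMOTICON = re.compile(r"[:;=]-?[)(DPp]")
POSITIVE_MOUTHS = set(")DPp")  # assumption: everything else counts as negative


def count_emoticons(post: str) -> tuple[int, int]:
    """Return (positive, negative) emoticon counts for one post."""
    positive = negative = 0
    for match in EMOTICON.finditer(post):
        if match.group()[-1] in POSITIVE_MOUTHS:
            positive += 1
        else:
            negative += 1
    return positive, negative


print(count_emoticons("Sure If you say so :P"))  # prints: (1, 0)
```

The resulting counts per post are what would be correlated with the labeled scores.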
Experiment 4: Dependency Parse
• Intuition: Do highly ranked agreement/disagreement posts show a common dependency pattern?
  • Agreement posts tend to express agreement early in the post, while disagreement tends to be mild.
• Tool: Stanford Parser, for the syntactic and dependency parse of the posts.
• Result: Many highly ranked agreement posts begin with the following dependency pattern:
  • I->nsubj->+ve [I agree to, I like your point, I up-voted your argument]
• Pipeline (from the slide diagram): Stanford Parser + ExtractDependencies, code to traverse PRP to PRP$, SentiWordNet.
Which Authors Get the Highest Rank? (1)
• Intuition: Does the average number of times an author participates in a thread correlate with the author’s ranking?
• Result:
  • There is an evident positive correlation between an author’s points and the number of posts the author contributes per thread.
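A small sketch of the participation feature described on this slide: the average number of posts per thread for each author. The input layout, one `(author, thread_id)` pair per post, is a hypothetical schema, not the project's actual database format.

```python
from collections import defaultdict


def avg_posts_per_thread(posts):
    """posts: iterable of (author, thread_id) pairs, one pair per post.

    Returns {author: mean number of posts in the threads they took part in}.
    """
    per_thread = defaultdict(lambda: defaultdict(int))
    for author, thread_id in posts:
        per_thread[author][thread_id] += 1
    return {
        author: sum(threads.values()) / len(threads)
        for author, threads in per_thread.items()
    }


# Made-up toy data, for illustration only.
toy_posts = [("a", 1), ("a", 1), ("a", 2), ("b", 1)]
print(avg_posts_per_thread(toy_posts))  # prints: {'a': 1.5, 'b': 1.0}
```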
Which Authors Get the Highest Rank? (2)
• Intuition: Do authors who join an existing discussion, or authors who start a new thread, get a higher rank?
• Result:
  • Rating of authors who agree > rating of authors who disagree more > rating of authors who start a new debate.
  • Authors who participate more in discussions are more popular.
Which Authors Get the Highest Rank? (3)
• Intuition: Do authors who participate early or late in a discussion get a higher ranking?
• Result:
  • Authors who participate late in a discussion are likely to have a higher ranking.
  • Intuitively, authors who arrive late already know which way opinion in the thread is leaning.
  • Participating early doesn’t help in ranking.
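One possible way to quantify "early" versus "late" participation, sketched under assumptions: each thread is an ordered list of author names (oldest post first), and an author's entry point is the normalized position of their first post, from 0.0 (first post) to 1.0 (last post). The slides do not say how timing was actually measured.

```python
from collections import defaultdict


def mean_entry_position(threads):
    """threads: list of threads, each an ordered list of author names.

    Returns {author: mean normalized position of their first post per thread}.
    Threads with fewer than two posts are skipped (position is undefined).
    """
    positions = defaultdict(list)
    for authors in threads:
        if len(authors) < 2:
            continue
        seen = set()
        for index, author in enumerate(authors):
            if author not in seen:
                seen.add(author)
                positions[author].append(index / (len(authors) - 1))
    return {author: sum(p) / len(p) for author, p in positions.items()}


# Made-up toy data, for illustration only.
toy_threads = [["a", "b", "c"], ["b", "a", "a", "c", "d"]]
entry = mean_entry_position(toy_threads)
```

Values near 1.0 mark authors who tend to join late; these can be compared against author rankings.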
Get the Ranking of Authors w.r.t. Features
• Trained a linear regression model using Weka’s LibSVM and obtained a predicted ranking of all authors based on the features.
• Computed a correlation coefficient between the predicted rankings and the gold-standard rankings.
• Result:
  • The feature vector set shows a decent correlation with the actual rankings.
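A sketch of the evaluation step only: comparing predicted scores with gold-standard scores through a rank correlation. The slide does not say which coefficient was used; Spearman's rho is an assumption here, chosen because it compares rankings directly. Function names are hypothetical.

```python
import math


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = shared
        i = j + 1
    return result


def spearman(predicted, gold) -> float:
    """Spearman's rho: Pearson's coefficient computed on the ranks."""
    rx, ry = ranks(predicted), ranks(gold)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)
```

Here `predicted` would hold the model's output per author and `gold` the forum's actual points, in the same author order.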
Future Work
• In this project, I looked at some of the structural and behavioral features.
• The OpinionFinder tool also tells whether a sentence is subjective or objective.
• One future experiment: is there a correlation between subjective/objective sentences and the score of a post?
• Does the length of the post matter?
• Going forward: consolidate all these features and results in the database and make it available as an open-source dataset.