SemEval 2013 Task 2
AVAYA: Sentiment Analysis in Twitter with Self-Training and Polarity Lexicon Expansion
Lee Becker, George Erhart, David Skiba, and Valentine Matula
Avaya Labs
June 16, 2013
Guiding Intuitions
• Boost recall of positive/negative instances (A, B)
• Don't worry about neutral instances (A, B)
• Encode polarity cues into features (A, B)
• Exploit the context (A)
System Overview: Task B Constrained
Sentiment-labeled tweets + polarity lexicon → feature extraction → constrained model
System Overview: Task B Unconstrained
Unlabeled tweets → constrained model → auto-labeled tweets; auto-labeled tweets + expanded polarity lexicon → feature extraction → unconstrained model
System Overview: Task A Models
• Constrained: sentiment-labeled contexts + polarity lexicon → feature extraction → constrained model
• Unconstrained: sentiment-labeled contexts + expanded polarity lexicon → feature extraction → unconstrained model
Preprocessing
• Normalization (see the sketch below):
  • URLs
  • @Mentions
• NLP Pipeline
  • Written in the ClearTK framework
  • ClearNLP wrappers:
    • Tokenization (preserves emoticons and URLs)
    • POS tagging
    • Lemmatization
    • Dependency parsing
  • PTB POS → ArkTweet POS (Gimpel et al., 2011)
  • Dependencies → collapsed dependencies
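A minimal sketch of the normalization step; the regex patterns and placeholder tokens are illustrative assumptions, not the system's actual rules:

```python
import re

# Illustrative patterns for URLs and @mentions.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")

def normalize(tweet: str) -> str:
    """Replace URLs and @mentions with placeholder tokens."""
    tweet = URL_RE.sub("<URL>", tweet)
    return MENTION_RE.sub("<MENTION>", tweet)

print(normalize("@lee check http://t.co/abc123 :)"))
# -> <MENTION> check <URL> :)
```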
Resources
• MPQA Subjectivity Lexicon (Wilson, Wiebe, and Hoffmann, 2005); a loading sketch follows below
• Hand-crafted negation word dictionary
• Hand-crafted emoticon polarity dictionary
http://leebecker.com/resources/semeval-2013/
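A sketch of loading the MPQA lexicon into a word-to-polarity map, assuming the standard clue-file format (one key=value record per line):

```python
def load_mpqa(path: str) -> dict:
    """Parse MPQA clue lines such as:
    type=weaksubj len=1 word1=abandoned pos1=adj stemmed1=n priorpolarity=negative
    """
    lexicon = {}
    with open(path) as f:
        for line in f:
            fields = dict(kv.split("=", 1) for kv in line.split() if "=" in kv)
            if "word1" in fields and "priorpolarity" in fields:
                lexicon[fields["word1"]] = fields["priorpolarity"]
    return lexicon
```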
Task B Features
• Polarized Bag-of-Words
  • An easy way to double the feature space (e.g., happy vs. NOT_happy)
  • Negation window: "I am not too happy about this, but I'm still pumped and thrilled for tomorrow."
  • Tokens inside the window (happy) receive a NOT_ prefix; tokens past the clause boundary (pumped, thrilled) do not (see the sketch below)
• Feature forms:
  • Token
  • Token + PTB POS
  • Token + simplified POS
  • Lemma
  • Lemma + PTB POS
  • Lemma + simplified POS
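A minimal sketch of the polarized bag-of-words transform; the trigger words and the rule that punctuation or "but" closes a window are illustrative assumptions:

```python
# Assumed negation triggers and window-ending tokens.
NEGATION_WORDS = {"not", "no", "never", "n't"}
WINDOW_END = {",", ".", "!", "?", "but"}

def polarized_bow(tokens):
    """Prefix tokens inside a negation window with NOT_."""
    features, in_scope = [], False
    for tok in tokens:
        low = tok.lower()
        if low in NEGATION_WORDS:
            in_scope = True
            features.append(low)
        elif low in WINDOW_END:
            in_scope = False
            features.append(low)
        else:
            features.append("NOT_" + low if in_scope else low)
    return features

tokens = "I am not too happy about this , but I 'm still pumped".split()
print(polarized_bow(tokens))
# ... 'not', 'NOT_too', 'NOT_happy', ... ',', 'but', ... 'pumped'
```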
Task B Features (continued)
• Message Polarity Features
  • Word sentiment counts (pos|neg)
  • Emoticon sentiment counts (pos|neg)
  • Net word polarity
  • Net emoticon polarity
• Microblogging Features (sketched below)
  • ALL CAPS word counts
  • Counts of words with repeated characters (yaaaaay, booooo)
  • Emphasis (*yes*)
  • Winning sports score (Nuggets 15-0)
• PTB POS tag counts
• Collapsed Dependency Relations
  • Incorporate negation
  • Text - Text
  • Lemma + simplified POS - Lemma + simplified POS
  • POS - Lemma
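A sketch of a few of the microblogging surface features; the regexes are illustrative assumptions:

```python
import re

def microblog_features(tokens):
    """Count ALL CAPS words, character-elongated words, and *emphasis*."""
    return {
        "all_caps": sum(1 for t in tokens if t.isupper() and len(t) > 1),
        "elongated": sum(1 for t in tokens
                         if re.search(r"(\w)\1{2,}", t)),
        "emphasis": sum(1 for t in tokens
                        if re.fullmatch(r"\*\w+\*", t)),
    }

print(microblog_features("YAAAAY we won *yes* GO team".split()))
# {'all_caps': 2, 'elongated': 1, 'emphasis': 1}
```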
Task B: Constrained Model
• LIBLINEAR with logistic regression loss function
• Heavily boosted negative-polarity instances:
  • w_positive = 1
  • w_negative = 25
  • w_neutral = 1
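A sketch of the class-weighted setup, using scikit-learn's LIBLINEAR solver as a stand-in for the system's actual LIBLINEAR wrapper; its class_weight dict plays the role of LIBLINEAR's per-class weighting:

```python
from sklearn.linear_model import LogisticRegression

# Class weights from the slide; everything else is an assumed default.
clf = LogisticRegression(
    solver="liblinear",
    class_weight={"positive": 1, "negative": 25, "neutral": 1},
)
# clf.fit(X_train, y_train)  # X_train: feature matrix, y_train: labels
```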
Polarity Lexicon Expansion: Pointwise Mutual Information
• Based on Semantic Orientation for sentiment (Turney, 2002)
• Intuition: use co-occurrence statistics to measure how strongly a word is associated with each polarity.

PMI(word, sentiment) = log2( p(word, sentiment) / ( p(word) p(sentiment) ) )

polarity(word) = sgn( PMI(word, positive) - PMI(word, negative) )
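A minimal sketch of computing these quantities from an auto-labeled corpus, assuming tweet-level co-occurrence counts and add-0.5 smoothing (both illustrative choices):

```python
import math
from collections import Counter

def pmi_polarity(labeled_tweets):
    """labeled_tweets: iterable of (tokens, label), label in
    {'positive', 'negative'}. Returns word -> +1/-1 polarity."""
    word_label, word, label_count = Counter(), Counter(), Counter()
    total = 0
    for tokens, label in labeled_tweets:
        for w in set(tokens):          # tweet-level co-occurrence
            word_label[(w, label)] += 1
            word[w] += 1
        label_count[label] += 1
        total += 1

    def pmi(w, s):
        p_joint = (word_label[(w, s)] + 0.5) / total   # smoothed joint
        return math.log2(p_joint / ((word[w] / total) *
                                    (label_count[s] / total)))

    # sgn of the PMI difference, simplified to a binary decision.
    return {w: 1 if pmi(w, "positive") > pmi(w, "negative") else -1
            for w in word}
```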
Polarity Lexicon Expansion: From Tweets to Lexicon
• Differences from Turney (2002):
  • Classifier output instead of seed words
  • Words instead of word phrases
• Procedure (filtering sketched below):
  • Applied the constrained classifier to ~475k unlabeled tweets
  • Filtered and balanced the corpus via classifier confidence-score thresholds:
    • 50,789 positive instances (> 0.9)
    • 59,029 negative instances (> 0.7)
    • 70,601 neutral instances (> 0.8)
  • Removed:
    • words with f(word) < 10
    • neutral-polarity words
    • single-character words ('a', 'j', 'I', etc.)
    • numbers (1, 20, 1000)
    • punctuation
  • Merged with the MPQA subjectivity lexicon
• Final lexicon size: 11,740 entries
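A sketch of the confidence-threshold and word-removal steps; the thresholds are from the slide, while the helper names and exact rules are illustrative assumptions:

```python
import string

THRESHOLDS = {"positive": 0.9, "negative": 0.7, "neutral": 0.8}

def confident(predictions):
    """predictions: iterable of (tokens, label, confidence) triples.
    Keep only auto-labeled tweets above the per-class threshold."""
    return [(toks, lab) for toks, lab, conf in predictions
            if conf > THRESHOLDS[lab]]

def keep_word(word, freq):
    """Drop rare, single-character, numeric, and punctuation tokens."""
    return (freq >= 10
            and len(word) > 1
            and not word.isdigit()
            and word not in string.punctuation)
```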
Task B: Unconstrained Model
• Self-trained model (see the sketch below):
  • ~470k instances auto-labeled by the constrained model
  • ~10k original instances
• Expanded polarity lexicon
• Heavily discounted neutral instances:
  • w_positive = 2
  • w_negative = 5
  • w_neutral = 0.1
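A sketch of the single self-training round: the constrained model labels unlabeled tweets, confident predictions are kept, and a reweighted model is retrained on the union. Thresholds and class weights come from the slides; everything else is an illustrative assumption:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

THRESHOLDS = {"positive": 0.9, "negative": 0.7, "neutral": 0.8}

def self_train(constrained_clf, X_lab, y_lab, X_unlab):
    # Auto-label the unlabeled pool with the constrained model.
    probs = constrained_clf.predict_proba(X_unlab)
    preds = constrained_clf.classes_[probs.argmax(axis=1)]
    conf = probs.max(axis=1)
    keep = np.array([c > THRESHOLDS[p] for p, c in zip(preds, conf)])

    # Retrain on original + confidently auto-labeled instances,
    # heavily discounting neutral.
    X_all = np.vstack([X_lab, X_unlab[keep]])
    y_all = np.concatenate([y_lab, preds[keep]])
    clf = LogisticRegression(
        solver="liblinear",
        class_weight={"positive": 2, "negative": 5, "neutral": 0.1})
    return clf.fit(X_all, y_all)
```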
Task A: Features
• Same as Task B:
  • Polarized bag-of-words
  • Contextual polarity features
    • Word sentiment counts (pos|neg)
    • Emoticon sentiment counts (pos|neg)
    • Net word polarity
    • Net emoticon polarity
  • Microblogging features
  • PTB POS tags
• Additional features:
  • Scoped dependencies
  • Dependency paths
Task A Features: Scoped Dependencies
Example: "You do not want to miss this tomorrow night." (dependency arcs: root, nsubj, xcomp, tmod, neg, aux)
• OUT_neg_nsubj(want, you)
• OUT_neg(want, not)
• IN_xcomp(want, miss)
• IN_aux(miss, to)
• OUT_tmod(miss, tomorrow)
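A minimal sketch of deriving scoped dependency features. The rule that arcs are IN_ when both tokens fall inside the target span and OUT_ otherwise is an assumption inferred from the slide's example, and the folding of negation into relation names (neg_nsubj) is omitted:

```python
def scoped_dep_features(arcs, span):
    """arcs: (relation, head_index, head, dep_index, dep) tuples;
    span: set of token indices covering the target phrase."""
    feats = []
    for rel, hi, head, di, dep in arcs:
        prefix = "IN" if hi in span and di in span else "OUT"
        feats.append(f"{prefix}_{rel}({head},{dep})")
    return feats

# "You(0) do(1) not(2) want(3) to(4) miss(5) this(6) tomorrow(7) night(8)"
arcs = [("nsubj", 3, "want", 0, "You"),
        ("neg", 3, "want", 2, "not"),
        ("xcomp", 3, "want", 5, "miss"),
        ("aux", 5, "miss", 4, "to"),
        ("tmod", 5, "miss", 7, "tomorrow")]
print(scoped_dep_features(arcs, span={3, 4, 5}))
# ['OUT_nsubj(want,You)', 'OUT_neg(want,not)', 'IN_xcomp(want,miss)',
#  'IN_aux(miss,to)', 'OUT_tmod(miss,tomorrow)']
```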
Task A Features: Dependency Paths
Example: "Criminals killed Sadat and in the process they killed Egypt."
• POS path: {NNP} dobj < {VBD} < conj {VBD} < root
• Sentiment POS path: {^/neutral} < {V/negative} < {V/negative} < {root}
• In subject: False
• In object: True
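A sketch of building a POS path by walking head links from the target token up to the root; the token layout, head indices, and output formatting are illustrative assumptions that only approximate the slide's notation:

```python
def pos_path(tokens, target):
    """tokens: index -> dict(head=index or None, rel=str, pos=str).
    Walk head links from the target to the root, recording POS tags
    joined by relation labels."""
    steps, i = [f"{{{tokens[target]['pos']}}}"], target
    while tokens[i]["head"] is not None:
        rel = tokens[i]["rel"]
        i = tokens[i]["head"]
        steps.append(f"{rel} < {{{tokens[i]['pos']}}}")
    steps.append("< root")
    return " ".join(steps)

toks = {
    2: {"pos": "VBD", "rel": "root", "head": None},  # killed (first)
    8: {"pos": "VBD", "rel": "conj", "head": 2},     # killed (second)
    9: {"pos": "NNP", "rel": "dobj", "head": 8},     # Egypt
}
print(pos_path(toks, 9))
# {NNP} dobj < {VBD} conj < {VBD} < root
```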
Task A Models
• Constrained: MPQA Subjectivity Lexicon
• Unconstrained: expanded polarity lexicon
• LIBLINEAR class weights:
  • w_positive = 11
  • w_negative = 2
  • w_neutral = 1
Discussion
• Dictionary expansion via supervised sentiment models is a relatively simple way to enlarge the feature space and improve lexical coverage.
• Dependency-based features provide additional context and richer information.
• Future work:
  • Ablation studies
  • Better tuning of self-training
Thank You!
• Task 2 organizers and participants
• SemEval 2013 organizers
• Anonymous reviewers