Authors: Fotis Aisopos, George Papadakis, Theodora Varvarigou Presenter: Konstantinos Tserpes National Technical University of Athens, Greece Sentiment Analysis of Social Media Content using N-Gram Graphs
Social Media and Sentiment Analysis • Social Networks enable users to: • Chat about everyday issues • Exchange political views • Evaluate services and products • Useful for estimating the average sentiment for a topic (e.g., for social analysts) • Sentiments are expressed • Implicitly (e.g., through emoticons or specific words) • Explicitly (e.g., the “Like” button on Facebook) In this work we focus on content-based patterns for detecting sentiments. International ACM Workshop on Social Media (WSM11)
Intricacies of Social Media Content Inherent characteristics that render established, language-specific methods inapplicable: • Sparsity: each Twitter message comprises at most 140 characters • Multilinguality: many different languages and dialects • Non-standard Vocabulary: informal textual content (i.e., slang) and neologisms (e.g., “gr8” instead of “great”) • Noise: misspelled words and incorrect use of phrases Solution: a language-neutral method that is robust to noise
Focus on Twitter We selected the Twitter micro-blogging service due to: • Popularity (200 million users, 1 billion posts per week) • Strict rules of social interaction (i.e., sentiments are expressed through short, self-contained text messages) • Data publicly available through a handy API
Polarity Classification problem • Polarity: expression of a non-neutral sentiment • Polarized tweets: tweets that express either a positive or a negative sentiment (polarity is explicitly denoted by the respective emoticons) • Neutral tweets: tweets lacking any polarity indicator • Binary Polarity Classification: decide on the polarity of a tweet with respect to a binary scale (i.e., negative or positive) • General Polarity Classification: decide on the polarity of a tweet with respect to three scales (i.e., negative, positive, or neutral)
Representation Model 1: Term Vector Model Aggregates the set of distinct words (i.e., tokens) contained in a set of documents. Each tweet ti is then represented as a vector: vti = (v1, v2, ..., vj), where vj is the TF-IDF value of the j-th term. The same model applies to polarity classes. Drawbacks: • It requires language-specific techniques to correctly identify semantically equivalent tokens (e.g., stemming, lemmatization, P-o-S tagging) • High dimensionality
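A minimal sketch of the term vector model, using plain whitespace tokenization and no stemming or P-o-S tagging (the `tfidf_vectors` helper is ours, for illustration only):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Represent each document as a TF-IDF vector over the corpus vocabulary."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    n = len(docs)
    # document frequency: in how many documents each term occurs
    df = Counter(t for doc in tokenized for t in set(doc))
    idf = {t: math.log(n / df[t]) for t in vocab}
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)
        vectors.append([tf[t] / len(doc) * idf[t] for t in vocab])
    return vocab, vectors
```

Even on two toy documents the vocabulary dimension equals the number of distinct tokens, which illustrates the high-dimensionality drawback at Twitter scale.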
Representation Model 2: Character n-grams Each document and polarity class is represented as the set of substrings of length n of the original text. • for n = 2: bigrams, n = 3: trigrams, n = 4: four-grams • example: “home phone” consists of the following trigrams (with “_” marking the space): {hom, ome, me_, e_p, _ph, pho, hon, one} Advantages: • language-independent method Disadvantages: • high dimensionality
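The substring extraction above is straightforward to sketch (the function name is ours):

```python
def char_ngrams(text, n):
    """Return the overlapping character n-grams of `text` (substrings of length n)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# Trigrams of "home phone" (the space character is part of the n-grams)
trigrams = char_ngrams("home phone", 3)
# ['hom', 'ome', 'me ', 'e p', ' ph', 'pho', 'hon', 'one']
```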
Representation Model 3: Character n-gram graphs Each document and polarity class are represented as graphs, where • the nodes correspond to character n-grams, • the undirected edges connect neighboring n-grams (i.e., n-grams that co-occur in at least one window of n characters), and • the weight of an edge denotes the co-occurrence rate of the adjacent n-grams. Typical value space for n: n=2 (i.e., bigram graphs), n=3 (i.e., trigram graphs), and n=4 (i.e., four-gram graphs).
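A sketch of the graph construction, representing the graph as a map from undirected edges to co-occurrence counts. The window test used here (starting positions at most n apart) is our reading of "co-occur in a window of n characters"; the exact definition is in the n-gram graph literature:

```python
from collections import Counter

def char_ngrams(text, n):
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def ngram_graph(text, n):
    """Undirected weighted n-gram graph: keys are frozenset edges {g1, g2},
    values count how often the two n-grams co-occur within a window of n
    characters (here: starting positions at most n apart)."""
    grams = char_ngrams(text, n)
    edges = Counter()
    for i, g in enumerate(grams):
        for j in range(i + 1, min(i + n + 1, len(grams))):
            if g != grams[j]:
                edges[frozenset((g, grams[j]))] += 1
    return edges
```

Class graphs (e.g., Gpos) can then be built by aggregating the graphs of all tweets of that polarity.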
Example of n-gram graphs. The phrase “home_phone” is represented as follows: [Figure: trigram graph of “home_phone”, with n-gram nodes connected by weighted co-occurrence edges]
Features of the n-gram graphs model To capture textual patterns, n-gram graphs rely on the following graph similarity metrics (computed between the polarity class graphs and the tweet graphs): • Containment Similarity (CS): portion of common edges, regardless of their weights • Size Similarity (SS): ratio of sizes of two graphs • Value Similarity (VS): portion of common edges, taking into account their weights • Normalized Value Similarity (NVS): value similarity without the effect of the relative graph size (i.e., NVS = VS/SS)
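A sketch of the four measures over two non-empty weighted edge maps. The normalizations below are a plausible reading of the definitions above, not the authors' exact formulas:

```python
def graph_similarities(g1, g2):
    """CS, SS, VS and NVS between two graphs given as dicts: edge -> weight."""
    common = set(g1) & set(g2)
    size1, size2 = len(g1), len(g2)
    cs = len(common) / min(size1, size2)        # common edges, weights ignored
    ss = min(size1, size2) / max(size1, size2)  # ratio of graph sizes
    # common edges scored by the agreement of their weights
    vs = sum(min(g1[e], g2[e]) / max(g1[e], g2[e]) for e in common) / max(size1, size2)
    nvs = vs / ss                               # remove the relative-size effect
    return {"CS": cs, "SS": ss, "VS": vs, "NVS": nvs}
```

By construction, comparing a graph with itself yields 1.0 for all four measures.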
Feature Extraction • Create Gpos, Gneg (and Gneu) by aggregating half of the training tweets of the respective polarity. • For each tweet ti of the remaining training set: • create the tweet n-gram graph Gti • derive a feature “vector” from the graph comparisons • The same procedure applies to the testing tweets.
Discretized Graph Similarities Discretized similarity values offer higher classification efficiency. Each nominal feature discretizes the comparison between a pair of class similarity values: • Binary classification has three nominal features: • dsim(CSneg, CSpos) • dsim(NVSneg, NVSpos) • dsim(VSneg, VSpos) • General classification has six more nominal features: • dsim(CSneg, CSneu) • dsim(NVSneg, NVSneu) • dsim(VSneg, VSneu) • dsim(CSneu, CSpos) • dsim(NVSneu, NVSpos) • dsim(VSneu, VSpos)
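The discretization function itself is not reproduced on this slide. One plausible reading (our assumption, not the paper's exact formula) maps each pair of class similarities to a nominal value indicating which polarity class the tweet graph resembles more:

```python
def dsim(sim_a, sim_b):
    """Nominal feature for a pair of class similarities (hypothetical sketch)."""
    if sim_a > sim_b:
        return "first"
    if sim_a < sim_b:
        return "second"
    return "equal"
```

Nominal features of this kind suit decision-tree learners such as C4.5 directly.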
Data set • Initial dataset: • 475 million real tweets, posted by 17 million users • polarized tweets: • 6.12 million negative • 14.12 million positive • Data set for Binary Polarity Classification: random selection of 1 million tweets from each polarity category. • Data set for General Polarity Classification: the above + random selection of 1 million neutral tweets.
Experimental Setup • 10-fold cross-validation. • Classification algorithms (default configuration of Weka): • Naive Bayes Multinomial (NBM) • C4.5 decision tree classifier • Effectiveness metric: classification accuracy (correctly_classified_documents/all_documents). • Frequency threshold for the term vector and n-grams models: only features that appear in at least 1% of all documents were considered.
Evaluation results • n-grams outperform the Vector Model for n = 3 and n = 4 in all cases (language-neutral, noise-tolerant) • n-gram graphs: • low accuracy with NBM, higher values overall with C4.5 • incrementing n by 1 increases performance by 3%-4%
Efficiency Performance Analysis • n-grams involve by far the largest feature set -> high computational load • four-grams: fewer features than trigrams (their numerous substrings are rather rare) • n-gram graphs: significantly fewer features in all cases (<10) -> much higher classification efficiency!
Improvements (work under submission) • We lowered the frequency threshold to 0.1% for tokens and n-grams, to increase the performance of the term vector and n-grams models (at the cost of even lower efficiency). • We included in the training stage the tweets that were used for building the polarity classes. • Outcomes: • Higher performance for all methods. • N-gram graphs again outperform all other models. • Accuracy reaches significantly higher values (>95%).
Thank you! • SocIoS project: www.sociosproject.eu