Automatic Sentiment Analysis in On-line Text
Erik Boiy, Pieter Hens, Koen Deschacht, Marie-Francine Moens
CS & ICRI, Katholieke Universiteit Leuven
Introduction
• Goal: determine the sentiment of a person towards a topic
• Practical use
  • Customer feedback
  • Marketing research
  • Monitoring newsgroups and forums (flame detection)
  • Augmentation of search engines (e.g. Opinmind.com)
• Opportunity (noisy texts)
  • Blogs
  • Forums
  • Review sites
Overview
• Introduction
• Emotions
• Machine learning (ML) techniques
• Challenges
• Experiments, results & discussion
• Conclusion & future work
Concepts of emotions
• “Sentiments are either emotions, or they are judgements or ideas prompted or coloured by emotions”
• An emotion
  • Is usually caused by a person consciously or unconsciously evaluating an event, which is called appraisal in psychology
  • Gives priority to one or a few kinds of action, to which it lends a sense of urgency
Emotions in written text
• Appraisal: evaluation
  • e.g. It was an amazing show.
• Direct expressions
  • e.g. I am delighted of the final results.
• Elements of actions
  • e.g. I was grinning the whole way through it and laughing out loud more than once.
ML: Document representation (1)
• Feature extraction
  • Features are used to represent a document as a vector
  • Values in the vector indicate frequency or presence of the feature at the corresponding index in a dictionary
  • The dictionary consists of all features encountered in the training documents
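The vector representation described on this slide can be sketched as follows. This is a minimal illustration, not the authors' code: the whitespace tokenizer, the function names and the two toy training sentences are assumptions made for the example.

```python
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer (assumption): lowercase and split on whitespace.
    return text.lower().split()


def build_dictionary(training_docs):
    # The dictionary holds every feature seen in the training documents,
    # each mapped to a fixed index in the vector.
    dictionary = {}
    for doc in training_docs:
        for token in tokenize(doc):
            if token not in dictionary:
                dictionary[token] = len(dictionary)
    return dictionary


def vectorize(doc, dictionary, binary=False):
    # Frequency vector by default; presence (0/1) vector when binary=True.
    counts = Counter(tokenize(doc))
    vector = [0] * len(dictionary)
    for token, count in counts.items():
        index = dictionary.get(token)
        if index is not None:  # features unseen in training are ignored
            vector[index] = 1 if binary else count
    return vector


# Toy training documents, invented for illustration.
training_docs = ["an amazing show", "not an amazing movie"]
dictionary = build_dictionary(training_docs)
freq_vector = vectorize("an amazing amazing film", dictionary)
presence_vector = vectorize("an amazing amazing film", dictionary, binary=True)
```

Here "film" does not occur in the training documents, so it has no index in the dictionary and is dropped from both vectors.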
ML: Document representation (2)
• Unigrams: all words
• N-grams: all sequences of N successive words
  • N = 1: unigrams, N = 2: bigrams, N = 3: trigrams
  • e.g. I love, not worth, returned it
• Lemmas: basic dictionary form of all words
  • e.g. cars -> car, was -> be, better -> good
• Opinion words: use only words from a pre-defined list as features
• Adjectives: use only adjectives (about 7.5% of the text)
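N-gram extraction and the opinion-word filter can be sketched in a few lines. The function names and the small opinion lexicon below are hypothetical; the slides do not say which word list was used.

```python
def ngrams(tokens, n):
    # All sequences of n successive tokens, joined into one feature string.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def keep_opinion_words(tokens, lexicon):
    # Keep only tokens that appear in a pre-defined opinion word list.
    return [token for token in tokens if token in lexicon]


tokens = "not worth the money".split()
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

# Hypothetical opinion lexicon, invented for illustration.
OPINION_LEXICON = {"worth", "amazing", "love", "superior"}
opinion_features = keep_opinion_words(tokens, OPINION_LEXICON)
```

A document with T tokens yields T - N + 1 N-grams, which is one reason the feature vector grows quickly as N increases.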
ML: Document representation (3)
• Stopword removal
  • from list with determiners, prepositions, possessive pronouns, ...
• Negation tagging
  • of each word following a negation until the first punctuation
  • e.g. I don't like this movie. -> I don't NOT_like NOT_this NOT_movie.
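Both preprocessing steps can be sketched as below. The negation list, the stopword list and the tokenizing regular expression are assumptions for illustration; the slides only describe the rule (tag every word after a negation until the first punctuation mark).

```python
import re

# Hypothetical word lists (assumptions, not the lists used by the authors).
NEGATIONS = {"not", "no", "never", "don't", "doesn't", "didn't", "isn't", "wasn't"}
STOPWORDS = {"the", "a", "an", "this", "of", "in", "my"}
PUNCTUATION = set(".,;:!?")

# Words (optionally with an apostrophe part, e.g. "don't") or single punctuation marks.
TOKEN_PATTERN = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?|[.,;:!?]")


def tag_negation(text):
    # Prefix NOT_ to every word after a negation, up to the next punctuation mark.
    tagged = []
    negated = False
    for token in TOKEN_PATTERN.findall(text):
        if token in PUNCTUATION:
            negated = False
            tagged.append(token)
        elif negated:
            tagged.append("NOT_" + token)
        else:
            tagged.append(token)
            if token.lower() in NEGATIONS:
                negated = True
    return tagged


def remove_stopwords(tokens):
    # Drop tokens that appear in the stopword list (case-insensitive).
    return [token for token in tokens if token.lower() not in STOPWORDS]


tagged = tag_negation("I don't like this movie.")
filtered = remove_stopwords(["I", "like", "this", "movie"])
```

With negation tagging, "like" and "NOT_like" become separate features, so a unigram model can distinguish the two contexts.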
ML: Techniques
• Classifiers successful for text classification
  • Support Vector Machines (SVM)
  • Naive Bayes Multinomial (NBM)
  • Maximum Entropy (Maxent)
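To make one of these classifiers concrete, here is a minimal sketch of multinomial Naive Bayes over token counts. It is not the implementation used in the experiments: the class name, the add-one (Laplace) smoothing and the four toy training documents are assumptions chosen for the example.

```python
import math
from collections import Counter, defaultdict


class MultinomialNaiveBayes:
    def fit(self, documents, labels):
        # documents: list of token lists; labels: one class label per document.
        self.class_doc_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        self.vocabulary = set()
        for tokens, label in zip(documents, labels):
            self.feature_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        self.total_docs = len(labels)
        return self

    def log_scores(self, tokens):
        # log P(class) + sum of log P(token | class), with add-one smoothing.
        scores = {}
        vocab_size = len(self.vocabulary)
        for label, doc_count in self.class_doc_counts.items():
            score = math.log(doc_count / self.total_docs)
            total = sum(self.feature_counts[label].values())
            for token in tokens:
                if token not in self.vocabulary:
                    continue  # tokens unseen in training carry no evidence
                count = self.feature_counts[label][token]
                score += math.log((count + 1) / (total + vocab_size))
            scores[label] = score
        return scores

    def predict(self, tokens):
        scores = self.log_scores(tokens)
        return max(scores, key=scores.get)


# Toy training data, invented for illustration.
train_docs = [
    "an amazing show".split(),
    "i love this movie".split(),
    "not worth the money".split(),
    "i returned it".split(),
]
train_labels = ["positive", "positive", "negative", "negative"]

classifier = MultinomialNaiveBayes().fit(train_docs, train_labels)
prediction = classifier.predict("i love this amazing show".split())
```

Training only counts tokens per class and prediction only sums log probabilities, so both steps are linear in the number of tokens.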
Challenges (1)
• Topic-sentiment relation
  • e.g. Competing with the vastly superior Casino Royale for the same action-movie audience, Deja Vu will likely be brushed aside and quickly forgotten.
  • e.g. A Good Year is a well-acted well-written well-directed movie but it just wasnt my cup of tea.
• Topic-neutral text
  • e.g. In the movie Bond can start to untangle a terror network if he wins this big poker game at Casino Royale in Montenegro.
Challenges (2)
• Cross-domain classification
  • Training (and testing) was done on a mixture of movie and car reviews
• Text quality
  • e.g. Nothing but a French kiss-off Search Recent Archives Web for (rm) else • • • • • • • • • • • • • • • • ONLINE EXTRAS SITE SERVICES Movie Listings Friday Nov 10 2006 Posted on Fri Nov. 10 2006 MOVIE REVIEW A Good Year a flat bouquet Nothing but a French kiss-off Gladiator collaborators seem defeated by light-weight love story.By ROBERT W.
Corpora
• Pang and Lee's movie review corpus
  • 1000 positive and 1000 negative reviews
  • Reviews mix objective and subjective information
  • Often used in the literature
• Our blog corpus
  • 759 positive, 205 negative and 3527 neutral sentences
  • Gathered from blogs, discussion boards and other websites
  • Extended with reviews from the Customer Review Datasets corpus by Hu and Liu to balance the positive and negative classes
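Evaluation measures
• Accuracy
• Precision: of the examples assigned to a class, the fraction that truly belong to it
• Recall: of the examples that truly belong to a class, the fraction assigned to it
• Other
  • Speed
  • Available resources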
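Accuracy and the per-class measures can be computed from counts of true positives, false positives and false negatives. The sketch below uses the standard definitions; the function name and the five toy labels are invented for illustration and are not results from the experiments.

```python
def evaluate(gold, predicted, target):
    # Counts for the target class.
    pairs = list(zip(gold, predicted))
    true_pos = sum(1 for g, p in pairs if g == target and p == target)
    false_pos = sum(1 for g, p in pairs if g != target and p == target)
    false_neg = sum(1 for g, p in pairs if g == target and p != target)

    # Accuracy is computed over all classes, precision and recall for the target class.
    accuracy = sum(1 for g, p in pairs if g == p) / len(pairs)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return accuracy, precision, recall


# Toy labels, invented for illustration.
gold = ["positive", "negative", "neutral", "positive", "neutral"]
predicted = ["positive", "neutral", "neutral", "negative", "neutral"]
accuracy, precision, recall = evaluate(gold, predicted, "positive")
```

On a corpus dominated by neutral sentences, accuracy alone can look high even when few positive or negative sentences are found, which is why precision and recall are reported per class.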
Results (1)
• Pang and Lee's movie review corpus
• N-grams
  + easy to extract
  + require no special tools
  − large feature vector size
• NBM
  + fast
Results (2)
• Our blog corpus
  • The baseline approach: uses basic ML techniques as described earlier
  • Our latest approach: achieves considerable improvements over the baseline
Conclusion & future work
• Detection of the topic-sentiment relation is far from perfect
• Dirty texts make the task even more difficult
• Lack of training examples