Mining the peanut gallery: Opinion extraction and semantic classification of product reviews
Kushal Dave (IBM), Steve Lawrence (Google), David M. Pennock (Overture)
Work done at NEC Laboratories
Problem • Many reviews spread out across the Web • Product-specific sites • Editorial reviews at C|net, magazines, etc. • User reviews at C|net, Amazon, Epinions, etc. • Reviews in blogs, author sites • Reviews in Google Groups (Usenet) • No easy aggregation, but would like to gather numerous diverse opinions • Even at one site, synthesis is difficult
Solution! • Find reviews on the web and mine them… • Filtering (find the reviews) • Classification (positive or negative) • Separation (identify and rate specific attributes)
Existing work Classification varies in granularity/purpose: • Objectivity • Does this text express fact or opinion? • Words • What connotations does this word have? • Sentiments • Is this text expressing a certain type of opinion?
Objectivity classification • Best features: relative frequencies of parts of speech (Finn 02) • Subjectivity is…subjective (Wiebe 01)
Word classification • Polarity vs. intensity • Conjunctions • (Hatzivassiloglou and McKeown, 97) • Collocations • (Turney and Littman 02) • Linguistic collocations • (Lin 98), (Pereira 93)
Sentiment classification • Manual lexicons • Fuzzy logic affect sets (Subasic and Huettner 01) • Directionality – block/enable (Hearst 92) • Common Sense and emotions (Liu et al 03) • Recommendations • Stocks in Yahoo (Das and Chen 01) • URLs in Usenet (Terveen 97) • Movie reviews in IMDb (Pang et al 02)
Applied tools • AvaQuest’s GoogleMovies • http://24.60.188.10:8080/demos/GoogleMovies/GoogleMovies.cgi • Clipping services • PlanetFeedback, Webclipping, eWatch, TracerLock, etc. • Commonly use templates or simple searches • Feedback analysis: • NEC’s Survey Analyzer • IBM’s Eclassifier • Targeted aggregation • BlogStreet, onfocus, AllConsuming, etc. • Rotten Tomatoes
Train/test • Existing tagged corpora! • C|net – two cases • even, mixed: 10-fold test • 4,480 reviews • 4 categories • messy, by category: 7-fold test • 32,970 reviews
Other corpora • Preliminary work with Amazon • (Potentially) easier problem • Pang et al IMDb movie reviews corpus • Our simple algorithm is insignificantly worse: 80.6% vs. 82.9% (unigram SVM) • Different sort of problem
Domain obstacles • Typos, user error, inconsistencies, repeated reviews • Skewed distributions • 5x as many positive reviews • 13,000 reviews of MP3 players, 350 of networking kits • fewer than 10 reviews for ½ of products • Variety of language, sparse data • 1/5 of reviews have fewer than 10 tokens • More than 2/3 of terms occur in fewer than 3 documents • Misleading passages (ambivalence/comparison) • Battery life is good, but... Good idea, but...
Base Features • Unigrams • great, camera, support, easy, poor, back, excellent • Bigrams • can't beat, simply the, very disappointed, quit working • Trigrams, substrings • highly recommend this, i am very happy, not worth it, i sent it back, it stopped working • Near • greater would, find negative, refund this, automatic poor, company totally
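A minimal sketch of this kind of n-gram feature extraction (the tokenization and function name are illustrative, not the paper's implementation):

def ngram_features(tokens, n_max=3):
    """Collect unigram through trigram features from a token list."""
    feats = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            feats.append(" ".join(tokens[i:i + n]))
    return feats

print(ngram_features("i am very happy".split()))
# ['i', 'am', 'very', 'happy', 'i am', 'am very', 'very happy',
#  'i am very', 'am very happy']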
Generalization • Replacing product names (metadata) • the iPod is an excellent ==> the _productname is an excellent • Replacing domain-specific words (statistical) • excellent [picture|sound] quality ==> excellent _producttypeword quality • Replacing rare words (statistical) • overshadowed by ==> _unique by
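A sketch of the substitution step, assuming a known product-name set and a precomputed document-frequency table (both hypothetical inputs):

def generalize(tokens, product_names, doc_freq, rare_cutoff=3):
    """Replace product names and rare words with placeholder tokens."""
    out = []
    for t in tokens:
        if t in product_names:
            out.append("_productname")
        elif doc_freq.get(t, 0) < rare_cutoff:
            out.append("_unique")
        else:
            out.append(t)
    return out

print(generalize("the ipod is an excellent player".split(),
                 {"ipod"},
                 {"the": 900, "is": 900, "an": 900, "excellent": 40, "player": 25}))
# ['the', '_productname', 'is', 'an', 'excellent', 'player']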
Generalization (2) • Finding synsets in WordNet • anxious ==> 2377770 • nervous ==> 2377770 • Stemming • was ==> be • compromised ==> compromis
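Both mappings are easy to sketch with NLTK (an assumption on my part; the slide names no toolkit, and taking the first synset is a crude stand-in for sense disambiguation):

from nltk.corpus import wordnet as wn   # requires the NLTK WordNet data
from nltk.stem import PorterStemmer

def synset_id(word):
    """Map a word to the offset of its first WordNet synset, if any."""
    synsets = wn.synsets(word)
    return synsets[0].offset() if synsets else word

stemmer = PorterStemmer()
print(synset_id("anxious"), synset_id("nervous"))  # synonyms can collapse to one id
print(stemmer.stem("compromised"))                 # 'compromis'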
Qualification • Parsing for collocation • this stupid piece of garbage ==> (piece(N):mod:stupid(A))... • this piece is stupid and ugly ==> (piece(N):mod:stupid(A))... • Negation • not good or useful ==> NOTgood NOTor NOTuseful
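A minimal negation-marking sketch reproducing the example above (where negation scope ends is my guess; the slide leaves it unspecified):

def tag_negation(tokens, negators=("not", "no", "never"),
                 boundaries=(".", ",", ";", "!", "?")):
    """Prefix tokens following a negator with NOT, up to a clause boundary."""
    out, negating = [], False
    for t in tokens:
        if t in boundaries:
            negating = False
            out.append(t)
        elif t in negators:
            negating = True  # the negator itself is folded into the marked tokens
        elif negating:
            out.append("NOT" + t)
        else:
            out.append(t)
    return out

print(tag_negation("not good or useful".split()))
# ['NOTgood', 'NOTor', 'NOTuseful']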
Optimal features • Confidence/precision tradeoff • Try to find best-length substrings • How to find features? • Marginal improvement when traversing suffix tree • Information gain worked best • How to score? • [p(C|w) – p(C’|w)] * df during construction • Dynamic programming or simple during use • Still usually short ( n < 5 ) • Although… “i have had no problems with”
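The construction-time score reduces to a plain document-count difference, which a short sketch makes explicit (names are mine):

def construction_score(pos_df, neg_df):
    """[p(C|w) - p(C'|w)] * df, estimated from document counts.
    With p(C|w) ~ pos_df/df and df = pos_df + neg_df,
    this simplifies to pos_df - neg_df."""
    df = pos_df + neg_df
    if df == 0:
        return 0.0
    return (pos_df / df - neg_df / df) * df

print(construction_score(12, 3))  # 9.0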
Clean up • Thresholding • Reduce space with minimal impact • But not too much! (sparse data) • > 3 docs • Smoothing: allocate weight to unseen features, regularize other values • Laplace, Simple Good-Turing, Witten-Bell tried • Laplace helps Naïve Bayes
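A sketch of the Laplace (add-alpha) estimate that helped naïve Bayes; the other smoothers tried (Simple Good-Turing, Witten-Bell) are more involved:

def laplace(count, class_total, vocab_size, alpha=1.0):
    """Add-alpha smoothed estimate of p(feature | class)."""
    return (count + alpha) / (class_total + alpha * vocab_size)

print(laplace(0, 10000, 50000))  # an unseen feature still gets nonzero mass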
Scoring • SVMlight, naïve Bayes • Neither does better in both tests • Our scoring is simpler, less reliant on confidence, document length, corpus regularity • Give each word a score ranging from –1 to 1 • Our metric: score(fi) = (p(fi|C) – p(fi|C')) / (p(fi|C) + p(fi|C')) • Tried Fisher discriminant, information gain, odds ratio • If sum > 0, it's a positive review • Other probability models – presence • Bootstrapping barely helps
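The metric and decision rule translate directly (a sketch; the probabilities would come from smoothed counts as on the previous slide):

def feature_score(p_pos, p_neg):
    """score(f) = (p(f|C) - p(f|C')) / (p(f|C) + p(f|C')), in [-1, 1]."""
    denom = p_pos + p_neg
    return (p_pos - p_neg) / denom if denom else 0.0

def classify(features, scores):
    """Positive review iff the summed per-feature scores exceed zero."""
    return sum(scores.get(f, 0.0) for f in features) > 0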
Reweighting • Document frequency • df, idf, normalized df, ridf • Some improvement from log df and Gaussian weighting (arbitrary) • Analogues for term frequency, product frequency, product type frequency • Tried bootstrapping, different ways of assigning scores
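Two of the reweighting variants, sketched; the Gaussian parameters are illustrative guesses, since the slide itself calls the choice arbitrary:

import math

def log_df_weight(score, df):
    """Scale a feature score by log document frequency."""
    return score * math.log(1 + df)

def gaussian_tf_weight(score, tf, mu=2.0, sigma=1.0):
    """Gaussian term-frequency weighting (mu and sigma are hypothetical)."""
    return score * math.exp(-((tf - mu) ** 2) / (2 * sigma ** 2))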
Test 1 (clean) – Best Results
Unigrams:
• Baseline: 85.0%
• Naïve Bayes + Laplace: 87.0%
• Weighting (log df): 85.7%
• Weighting (Gaussian tf): 85.7%
Trigrams:
• Baseline: 88.7%
• + _productname: 88.9%
Test 2 (messy) – Best Results
Unigrams:
• Baseline: 82.2%
• Prob. after threshold: 83.3%
• Presence prob. model: 83.1%
• + Odds ratio: 83.3%
Bigrams:
• Baseline: 84.6%
• + Odds ratio: 85.4%
• + SVM: 85.8%
Variable-length substrings:
• Baseline: 85.1%
• + _productname: 85.3%
Extraction Can we use the techniques from classification to mine reviews?
Obstacles • Words have different implications in general usage • “main characters” is negatively biased in reviews, but has innocuous uses elsewhere • Much non-review subjective content • previews, sales sites, technical support, passing mentions and comparisons, lists
Results • Manual test set + web interface • Substrings better…bigrams too general? • Initial filtering is bad • Make “review” part of the actual search • Threshold sentence length, score • Still get non-opinions and ambiguous opinions • Test corpus, with red herrings removed, 76% accuracy in top confidence tercile
Attribute heuristics • Look for _producttypewords • Or, just look for things starting with “the” • Of course, some thresholding • And stopwords
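A sketch of the "the X" heuristic with frequency thresholding and stopword filtering (cutoff and stopword list are illustrative):

from collections import Counter

def candidate_attributes(tokens, stopwords, min_count=2):
    """Collect 'the X' bigrams as candidate product attributes."""
    counts = Counter()
    for t, nxt in zip(tokens, tokens[1:]):
        if t == "the" and nxt not in stopwords:
            counts["the " + nxt] += 1
    return [phrase for phrase, c in counts.items() if c >= min_count]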
Results • No way of judging accuracy, arbitrary • For example, Amstel Light • Ideally: taste, price, image... • Statistical method: beer, bud, taste, adjunct, other, brew, lager, golden, imported... • Dumb heuristic: the taste, the flavor, the calories, the best…
Future work • Methodology • More precise corpus • More tests, more precise tests • Algorithm • New combinations of techniques • Decrease overfitting to improve web search • Computational complexity • Ambiguity detection? • New pieces • Separate review/non-review classifier • Review context (source, time) • Segment pages