Mining the peanut gallery: Opinion extraction and semantic classification of product reviews
Kushal Dave (IBM), Steve Lawrence (Google), David M. Pennock (Overture)
Work done at NEC Laboratories
Problem • Many reviews spread out across the Web • Product-specific sites • Editorial reviews at C|net, magazines, etc. • User reviews at C|net, Amazon, Epinions, etc. • Reviews in blogs, author sites • Reviews in Google Groups (Usenet) • No easy aggregation, but would like to gather numerous diverse opinions • Even at one site, synthesis is difficult
Solution! • Find reviews on the web and mine them… • Filtering (find the reviews) • Classification (positive or negative) • Separation (identify and rate specific attributes)
Existing work Classification varies in granularity/purpose: • Objectivity • Does this text express fact or opinion? • Words • What connotations does this word have? • Sentiments • Is this text expressing a certain type of opinion?
Objectivity classification • Best features: relative frequencies of parts of speech (Finn 02) • Subjectivity is…subjective (Wiebe 01)
Word classification • Polarity vs. intensity • Conjunctions • (Hatzivassiloglou and McKeown, 97) • Collocations • (Turney and Littman 02) • Linguistic collocations • (Lin 98), (Pereira 93)
Sentiment classification • Manual lexicons • Fuzzy logic affect sets (Subasic and Huettner 01) • Directionality – block/enable (Hearst 92) • Common Sense and emotions (Liu et al 03) • Recommendations • Stocks in Yahoo (Das and Chen 01) • URLs in Usenet (Terveen 97) • Movie reviews in IMDb (Pang et al 02)
Applied tools • AvaQuest’s GoogleMovies • http://24.60.188.10:8080/demos/GoogleMovies/GoogleMovies.cgi • Clipping services • PlanetFeedback, Webclipping, eWatch, TracerLock, etc. • Commonly use templates or simple searches • Feedback analysis: • NEC’s Survey Analyzer • IBM’s Eclassifier • Targeted aggregation • BlogStreet, onfocus, AllConsuming, etc. • Rotten Tomatoes
Train/test • Existing tagged corpora! • C|net – two cases • even, mixed: 10-fold test • 4,480 reviews • 4 categories • messy, by category: 7-fold test • 32,970 reviews
Other corpora • Preliminary work with Amazon • (Potentially) easier problem • Pang et al IMDb movie reviews corpus • Our simple algorithm is insignificantly worse: 80.6% vs. 82.9% (unigram SVM) • Different sort of problem
Domain obstacles • Typos, user error, inconsistencies, repeated reviews • Skewed distributions • 5x as many positive reviews • 13,000 reviews of MP3 players, 350 of networking kits • fewer than 10 reviews for ½ of products • Variety of language, sparse data • 1/5 of reviews have fewer than 10 tokens • More than 2/3 of terms occur in fewer than 3 documents • Misleading passages (ambivalence/comparison) • Battery life is good, but... Good idea, but...
Base Features • Unigrams • great, camera, support, easy, poor, back, excellent • Bigrams • can't beat, simply the, very disappointed, quit working • Trigrams, substrings • highly recommend this, i am very happy, not worth it, i sent it back, it stopped working • Near • greater would, find negative, refund this, automatic poor, company totally
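A minimal sketch of this kind of n-gram feature extraction (the tokenization and function name are illustrative, not the paper's implementation):

def ngram_features(tokens, n_max=3):
    """Collect unigram through trigram features from a token list."""
    feats = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            feats.append(" ".join(tokens[i:i + n]))
    return feats

print(ngram_features("i am very happy".split()))
# ['i', 'am', 'very', 'happy', 'i am', 'am very', 'very happy',
#  'i am very', 'am very happy']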
Generalization • Replacing product names (metadata) • the iPod is an excellent ==> the _productname is an excellent • Replacing domain-specific words (statistical) • excellent [picture|sound] quality ==> excellent _producttypeword quality • Replacing rare words (statistical) • overshadowed by ==> _unique by
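A sketch of the substitution step, assuming a known product-name set and a precomputed document-frequency table (both hypothetical inputs):

def generalize(tokens, product_names, doc_freq, rare_cutoff=3):
    """Replace product names and rare words with placeholder tokens."""
    out = []
    for t in tokens:
        if t in product_names:
            out.append("_productname")
        elif doc_freq.get(t, 0) < rare_cutoff:
            out.append("_unique")
        else:
            out.append(t)
    return out

print(generalize("the ipod is an excellent player".split(),
                 {"ipod"},
                 {"the": 900, "is": 900, "an": 900, "excellent": 40, "player": 25}))
# ['the', '_productname', 'is', 'an', 'excellent', 'player']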
Generalization (2) • Finding synsets in WordNet • anxious ==> 2377770 • nervous ==> 2377770 • Stemming • was ==> be • compromised ==> compromis
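Both mappings are easy to sketch with NLTK (an assumption on my part; the slide names no toolkit, and taking the first synset is a crude stand-in for sense disambiguation):

from nltk.corpus import wordnet as wn   # requires the NLTK WordNet data
from nltk.stem import PorterStemmer

def synset_id(word):
    """Map a word to the offset of its first WordNet synset, if any."""
    synsets = wn.synsets(word)
    return synsets[0].offset() if synsets else word

stemmer = PorterStemmer()
print(synset_id("anxious"), synset_id("nervous"))  # synonyms can collapse to one id
print(stemmer.stem("compromised"))                 # 'compromis'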
Qualification • Parsing for collocation • this stupid piece of garbage ==> (piece(N):mod:stupid(A))... • this piece is stupid and ugly ==> (piece(N):mod:stupid(A))... • Negation • not good or useful ==> NOTgood NOTor NOTuseful
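A minimal negation-marking sketch reproducing the example above (where negation scope ends is my guess; the slide leaves it unspecified):

def tag_negation(tokens, negators=("not", "no", "never"),
                 boundaries=(".", ",", ";", "!", "?")):
    """Prefix tokens following a negator with NOT, up to a clause boundary."""
    out, negating = [], False
    for t in tokens:
        if t in boundaries:
            negating = False
            out.append(t)
        elif t in negators:
            negating = True  # the negator itself is folded into the marked tokens
        elif negating:
            out.append("NOT" + t)
        else:
            out.append(t)
    return out

print(tag_negation("not good or useful".split()))
# ['NOTgood', 'NOTor', 'NOTuseful']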
Optimal features • Confidence/precision tradeoff • Try to find best-length substrings • How to find features? • Marginal improvement when traversing suffix tree • Information gain worked best • How to score? • [p(C|w) – p(C’|w)] * df during construction • Dynamic programming or simple during use • Still usually short ( n < 5 ) • Although… “i have had no problems with”
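The construction-time score reduces to a plain document-count difference, which a short sketch makes explicit (names are mine):

def construction_score(pos_df, neg_df):
    """[p(C|w) - p(C'|w)] * df, estimated from document counts.
    With p(C|w) ~ pos_df/df and df = pos_df + neg_df,
    this simplifies to pos_df - neg_df."""
    df = pos_df + neg_df
    if df == 0:
        return 0.0
    return (pos_df / df - neg_df / df) * df

print(construction_score(12, 3))  # 9.0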
Clean up • Thresholding • Reduce space with minimal impact • But not too much! (sparse data) • > 3 docs • Smoothing: allocate weight to unseen features, regularize other values • Laplace, Simple Good-Turing, Witten-Bell tried • Laplace helps Naïve Bayes
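A sketch of the Laplace (add-alpha) estimate that helped naïve Bayes; the other smoothers tried (Simple Good-Turing, Witten-Bell) are more involved:

def laplace(count, class_total, vocab_size, alpha=1.0):
    """Add-alpha smoothed estimate of p(feature | class)."""
    return (count + alpha) / (class_total + alpha * vocab_size)

print(laplace(0, 10000, 50000))  # an unseen feature still gets nonzero mass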
Scoring • SVMlight, naïve Bayes • Neither does better in both tests • Our scoring is simpler, less reliant on confidence, document length, corpus regularity • Give each word a score ranging from –1 to 1 • Our metric: score(fi) = (p(fi|C) – p(fi|C')) / (p(fi|C) + p(fi|C')) • Tried Fisher discriminant, information gain, odds ratio • If sum > 0, it's a positive review • Other probability models – presence • Bootstrapping barely helps
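The metric and decision rule translate directly (a sketch; the probabilities would come from smoothed counts as on the previous slide):

def feature_score(p_pos, p_neg):
    """score(f) = (p(f|C) - p(f|C')) / (p(f|C) + p(f|C')), in [-1, 1]."""
    denom = p_pos + p_neg
    return (p_pos - p_neg) / denom if denom else 0.0

def classify(features, scores):
    """Positive review iff the summed per-feature scores exceed zero."""
    return sum(scores.get(f, 0.0) for f in features) > 0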
Reweighting • Document frequency • df, idf, normalized df, ridf • Some improvement from log df and Gaussian weighting (arbitrary) • Analogues for term frequency, product frequency, product type frequency • Tried bootstrapping, different ways of assigning scores
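Two of the reweighting variants, sketched; the Gaussian parameters are illustrative guesses, since the slide itself calls the choice arbitrary:

import math

def log_df_weight(score, df):
    """Scale a feature score by log document frequency."""
    return score * math.log(1 + df)

def gaussian_tf_weight(score, tf, mu=2.0, sigma=1.0):
    """Gaussian term-frequency weighting (mu and sigma are hypothetical)."""
    return score * math.exp(-((tf - mu) ** 2) / (2 * sigma ** 2))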
Test 1 (clean) – Best Results
Unigrams:
• Baseline: 85.0%
• Naïve Bayes + Laplace: 87.0%
• Weighting (log df): 85.7%
• Weighting (Gaussian tf): 85.7%
Trigrams:
• Baseline: 88.7%
• + _productname: 88.9%
Test 2 (messy) – Best Results
Unigrams:
• Baseline: 82.2%
• Prob. after threshold: 83.3%
• Presence prob. model: 83.1%
• + Odds ratio: 83.3%
Bigrams:
• Baseline: 84.6%
• + Odds ratio: 85.4%
• + SVM: 85.8%
Variable-length substrings:
• Baseline: 85.1%
• + _productname: 85.3%
Extraction Can we use the techniques from classification to mine reviews?
Obstacles • Words have different implications in general usage • “main characters” is negatively biased in reviews, but has innocuous uses elsewhere • Much non-review subjective content • previews, sales sites, technical support, passing mentions and comparisons, lists
Results • Manual test set + web interface • Substrings better…bigrams too general? • Initial filtering is bad • Make “review” part of the actual search • Threshold sentence length, score • Still get non-opinions and ambiguous opinions • Test corpus, with red herrings removed, 76% accuracy in top confidence tercile
Attribute heuristics • Look for _producttypewords • Or, just look for things starting with “the” • Of course, some thresholding • And stopwords
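A sketch of the "the X" heuristic with frequency thresholding and stopword filtering (cutoff and stopword list are illustrative):

from collections import Counter

def candidate_attributes(tokens, stopwords, min_count=2):
    """Collect 'the X' bigrams as candidate product attributes."""
    counts = Counter()
    for t, nxt in zip(tokens, tokens[1:]):
        if t == "the" and nxt not in stopwords:
            counts["the " + nxt] += 1
    return [phrase for phrase, c in counts.items() if c >= min_count]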
Results • No way of judging accuracy, arbitrary • For example, Amstel Light • Ideally: taste, price, image... • Statistical method: beer, bud, taste, adjunct, other, brew, lager, golden, imported... • Dumb heuristic: the taste, the flavor, the calories, the best…
Future work • Methodology • More precise corpus • More tests, more precise tests • Algorithm • New combinations of techniques • Decrease overfitting to improve web search • Computational complexity • Ambiguity detection? • New pieces • Separate review/non-review classifier • Review context (source, time) • Segment pages