Problem Statement

CS 277 Data Mining Project Presentation • Instructor: Prof. Dave Newman • Team: Hitesh Sajnani, Vaibhav Saini, Kusum Kumar • Donald Bren School of Information and Computer Science • University of California, Irvine

Problem Statement • Classify a given Yelp review text into one or more relevant categories

Dataset • Reviews • Reviews from the Food and Restaurant category • # Useful votes > 1 • Total 10,000 reviews • Classification categories • Identified categories using a sample set of 400 random reviews • Refined categories using 200 more reviews • Final categories: 5 • Food, Ambience, Service, Deals/Discounts, Worthiness

Data Annotation • 10,000 reviews divided into 5 bins (w/ repetition) • 6 researchers manually annotated reviews • 225 man-hours of work! • Annotators disagreed on 981 ambiguous reviews; these were removed from the analysis • Total 9,019 reviews: split into 80% train and 20% test
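The 80/20 split can be sketched as follows. The slides do not say how the split was drawn, so a seeded random shuffle is an assumption here, and `train_test_split` and `review_ids` are hypothetical names used only for illustration.

```python
import random

def train_test_split(items, train_fraction=0.8, seed=0):
    """Shuffle a copy of items and split it into (train, test)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # assumption: purely random split
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

# Stand-in IDs for the 9,019 reviews that survived annotation.
review_ids = list(range(9019))
train, test = train_test_split(review_ids)
print(len(train), len(test))  # 7215 1804
```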

Features – unigrams/bigrams/trigrams • Total 703 textual features • 375 unigrams, 208 bigrams, 120 trigrams • [Figure: frequency of unigrams/bigrams/trigrams]
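A minimal sketch of how n-gram features of this kind might be computed. It assumes whitespace tokenization and raw occurrence counts, neither of which the slides confirm. The three-entry `vocabulary` is a hypothetical stand-in for the 703 selected n-grams.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all n-grams in a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_features(text, vocabulary):
    """Count how often each vocabulary n-gram (n = 1..3) occurs in text."""
    tokens = text.lower().split()  # assumption: simple whitespace tokenizer
    counts = Counter()
    for n in (1, 2, 3):
        counts.update(ngrams(tokens, n))
    return [counts[term] for term in vocabulary]

# Hypothetical vocabulary: one unigram, one bigram, one trigram.
vocabulary = ["food", "great service", "happy hour deal"]
features = ngram_features("Great service and great food", vocabulary)
print(features)  # [1, 1, 0]
```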

Features – User ratings • 3 nominal features: Good, Moderate, Bad

Approach • A review can belong to more than one category • Not a binary classification problem. It is a multi-label classification problem!

Binary classifiers for each category • Learns one binary classifier for each category • Output is the union of predictions of all binary classifiers • [Figure: original dataset and transformed datasets]
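The transformation behind this approach can be sketched in a few lines. The example reviews and their labels are made up for illustration, and the function names are hypothetical. No actual classifier is trained here; the sketch only shows how the data is reshaped and how predictions are merged.

```python
CATEGORIES = ["Food", "Service", "Ambience", "Deals"]

def binary_relevance_datasets(dataset):
    """Turn {review: set_of_labels} into one yes/no dataset per category."""
    return {
        category: {review: category in labels for review, labels in dataset.items()}
        for category in CATEGORIES
    }

def union_of_predictions(binary_predictions):
    """Merge per-category yes/no predictions into a single label set."""
    return {category for category, yes in binary_predictions.items() if yes}

# Hypothetical original dataset.
original = {"Review 1": {"Food", "Deals"}, "Review 2": {"Ambience"}}
transformed = binary_relevance_datasets(original)

# Hypothetical outputs of the four binary classifiers for one review.
predicted = union_of_predictions(
    {"Food": True, "Service": False, "Ambience": True, "Deals": False}
)
```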

Classifier for each subset of categories • Categories = {Food, Service, Ambience, Deals} • We treat each distinct subset of categories as a single class and learn a multi-class classifier • Transformed dataset:
Review      Categories
Review 1    “1001”
Review 2    “0011”
Review 3    “1000”
Review 4    “0111”
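A small sketch of the encoding used in the transformed dataset. The bit order (Food, Service, Ambience, Deals) is an assumption taken from the order in which the categories are listed; the slides do not state it explicitly. `encode` and `decode` are hypothetical helper names.

```python
# Assumed bit order; not confirmed by the slides.
CATEGORIES = ("Food", "Service", "Ambience", "Deals")

def encode(labels):
    """Map a set of categories to a bit string that serves as one class."""
    return "".join("1" if c in labels else "0" for c in CATEGORIES)

def decode(code):
    """Map a predicted bit string back to the set of categories."""
    return {c for c, bit in zip(CATEGORIES, code) if bit == "1"}

print(encode({"Food", "Deals"}))  # 1001
print(sorted(decode("0111")))     # ['Ambience', 'Deals', 'Service']
```

With 4 categories there are at most 2^4 = 16 such classes, so a single multi-class classifier can cover every label combination.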

Ensemble of subset classifiers • Train one classifier per subset of categories; each predicts only the categories in its subset • Classifier 1 for (Food, Service) • Classifier 2 for (Food, Ambience) • Classifier 3 for (Food, Deals) • Classifier 4 for (Service, Ambience) • Classifier 5 for (Service, Deals) • Classifier 6 for (Ambience, Deals) • Total: 6 classifiers for subsets of size 2 (4C2 = 6)
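The six subsets can be enumerated directly, which also confirms the 4C2 count. This is only an illustration of the enumeration; the variable names are hypothetical.

```python
from itertools import combinations
from collections import Counter

CATEGORIES = ["Food", "Service", "Ambience", "Deals"]

# All subsets of size 2, in the same order as the classifier list.
pairs = list(combinations(CATEGORIES, 2))
for number, pair in enumerate(pairs, start=1):
    print(f"Classifier {number} for {pair}")

# Each category is covered by the same number of classifiers.
coverage = Counter(category for pair in pairs for category in pair)
```

Every category appears in exactly 3 of the 6 pairs, so each category can receive at most 3 votes at prediction time.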

Ensemble of classifiers: Prediction • Ask each classifier to vote!

Ensemble of classifiers: Prediction • Final prediction: Majority vote (>= 2 classifiers)
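A minimal sketch of the voting step, assuming each pair classifier returns the subset of its own two categories that it predicts for the review. The predictions below are made up for illustration and `majority_vote` is a hypothetical name.

```python
from collections import Counter

def majority_vote(pair_predictions, threshold=2):
    """Keep every category predicted by at least `threshold` classifiers."""
    votes = Counter()
    for predicted_labels in pair_predictions.values():
        votes.update(predicted_labels)
    return {label for label, count in votes.items() if count >= threshold}

# Hypothetical outputs of the six pair classifiers for one review.
pair_predictions = {
    ("Food", "Service"): {"Food"},
    ("Food", "Ambience"): {"Food", "Ambience"},
    ("Food", "Deals"): set(),
    ("Service", "Ambience"): {"Ambience"},
    ("Service", "Deals"): {"Service"},
    ("Ambience", "Deals"): set(),
}
final = majority_vote(pair_predictions)
print(sorted(final))  # ['Ambience', 'Food']
```

Here Food and Ambience each collect 2 votes and are kept, while Service collects only 1 and is dropped.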

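Evaluation measures • Notation: Let $(x, Y)$ be a multi-label example, $Y \subseteq L$, where $L$ is the set of all labels • Let $h$ be a multi-label classifier • Let $Z = h(x)$ be the set of labels predicted by $h$ for $(x, Y)$ • Precision: $|Y \cap Z| / |Z|$ • Recall: $|Y \cap Z| / |Y|$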
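A worked example, assuming the usual set-overlap definitions for multi-label evaluation: precision is |Y ∩ Z| / |Z| and recall is |Y ∩ Z| / |Y|, averaged over all examples. Returning 0 when a set is empty is a convention chosen for this sketch, not something the slides specify.

```python
def precision(Y, Z):
    """Fraction of predicted labels Z that are in the true label set Y."""
    return len(Y & Z) / len(Z) if Z else 0.0  # assumption: 0 when nothing predicted

def recall(Y, Z):
    """Fraction of true labels Y that were predicted in Z."""
    return len(Y & Z) / len(Y) if Y else 0.0  # assumption: 0 when no true labels

def average(metric, examples):
    """Average a per-example metric over (Y, Z) pairs."""
    return sum(metric(Y, Z) for Y, Z in examples) / len(examples)

# Hypothetical (true labels, predicted labels) pairs.
examples = [
    ({"Food", "Service"}, {"Food", "Ambience", "Deals"}),  # P = 1/3, R = 1/2
    ({"Deals"}, {"Deals"}),                                # P = 1,   R = 1
]
avg_precision = average(precision, examples)  # (1/3 + 1) / 2 = 2/3
avg_recall = average(recall, examples)        # (1/2 + 1) / 2 = 3/4
```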

Precision & Recall (Train)

Precision & Recall (Test)

Observation 1: Ensemble gave the best results

Observation 2: Data Skew • Normalized skew in training data by adding selective data

Precision & Recall (w & w/o category normalization)

Thanks! • Check out our Yelp submission: http://www.ics.uci.edu/~vpsaini/ • Feedback welcome!
