CS 277 Data Mining Project Presentation • Instructor: Prof. Dave Newman • Team: Hitesh Sajnani, Vaibhav Saini, Kusum Kumar • Donald Bren School of Information and Computer Science, University of California, Irvine
Problem Statement • Classify the text of a given Yelp review into one or more relevant categories
Dataset • Reviews: taken from the Food and Restaurant categories; more than 1 useful vote each; 10,000 reviews in total • Classification categories: identified using a sample set of 400 random reviews; refined using 200 more reviews • Final categories: 5 (Food, Ambience, Service, Deals/Discounts, Worthiness)
Data Annotation • 10,000 reviews divided into 5 bins (with repetition) • 6 researchers manually annotated the reviews • 225 man-hours of work! • Annotators disagreed on 981 ambiguous reviews, which were removed from the analysis • Remaining 9,019 reviews: split into 80% train and 20% test
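The arithmetic behind the 9,019 reviews and the 80/20 split can be checked with a small sketch. The slides do not say how the split was drawn, so the shuffled split, the `seed` value, and the helper name `split_train_test` are assumptions made for illustration only.

```python
import random


def split_train_test(items, train_fraction=0.8, seed=0):
    """Shuffle a copy of items and cut it into train and test parts.

    The random shuffle and the seed are assumptions; the slides only
    state the 80% / 20% proportions.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


total_reviews = 10_000
ambiguous_reviews = 981
kept = total_reviews - ambiguous_reviews  # reviews left after removing disagreements

review_ids = list(range(kept))  # stand-in identifiers, not real Yelp ids
train_ids, test_ids = split_train_test(review_ids)
print(kept, len(train_ids), len(test_ids))
```

Under these assumptions the split yields 7,215 training and 1,804 test reviews.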
Features – unigrams/bigrams/trigrams • Total 703 textual features: 375 unigrams, 208 bigrams, 120 trigrams • [Figure: frequency of unigrams/bigrams/trigrams]
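A minimal sketch of how unigram, bigram, and trigram count features could be computed for one review. The whitespace tokenizer, the tiny example vocabulary, and the function names are hypothetical; the slides do not describe how the 703 features were actually selected or extracted.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(text, max_n=3):
    """Count unigrams, bigrams, and trigrams in a lower-cased review."""
    tokens = text.lower().split()  # assumption: plain whitespace tokenization
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return counts


def featurize(text, vocabulary):
    """Map a review to a count vector over a fixed n-gram vocabulary."""
    counts = ngram_counts(text)
    return [counts[term] for term in vocabulary]


# Hypothetical 4-term vocabulary standing in for the 703 real features.
vocabulary = ["food", "service", "happy hour", "worth the price"]
vector = featurize("The food was great and the happy hour is worth the price", vocabulary)
print(vector)  # [1, 0, 1, 1]
```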
Features – User ratings • 3 nominal features: Good, Moderate, Bad
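One way the three nominal rating features might be encoded is as a one-hot vector. The star cut-offs below are purely hypothetical: the slides name the three levels but do not say how a star rating maps onto them.

```python
RATING_LEVELS = ("Good", "Moderate", "Bad")


def rating_level(stars):
    """Map a star rating to a nominal level.

    Assumed cut-offs (not from the slides): 4-5 stars -> Good,
    3 stars -> Moderate, 1-2 stars -> Bad.
    """
    if stars >= 4:
        return "Good"
    if stars == 3:
        return "Moderate"
    return "Bad"


def rating_features(stars):
    """One-hot encode the nominal level in the order of RATING_LEVELS."""
    level = rating_level(stars)
    return [1 if level == name else 0 for name in RATING_LEVELS]


print(rating_features(5), rating_features(3), rating_features(1))
```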
Approach • A review can be classified into more than one category • This is not a binary classification problem; it is a multi-label classification problem!
Binary classifiers for each category • Learn one binary classifier for each category • Output is the union of the predictions of all binary classifiers • [Figure: original dataset and transformed datasets]
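The per-category transformation described above can be sketched as follows. The function names and the toy reviews are illustrative assumptions, and no particular learning algorithm is implied; the sketch only shows how one multi-label dataset becomes one binary dataset per category and how binary outputs are merged back.

```python
CATEGORIES = ["Food", "Service", "Ambience", "Deals"]


def binary_datasets(reviews, label_sets, categories=CATEGORIES):
    """Build one binary dataset per category: (review, is_in_category)."""
    return {
        category: [(review, category in labels)
                   for review, labels in zip(reviews, label_sets)]
        for category in categories
    }


def union_of_predictions(binary_predictions):
    """Combine per-category yes/no predictions into one label set."""
    return {category for category, positive in binary_predictions.items() if positive}


# Toy data, not taken from the annotated corpus.
reviews = ["great tacos, slow waiter", "cozy patio"]
label_sets = [{"Food", "Service"}, {"Ambience"}]

datasets = binary_datasets(reviews, label_sets)
print(datasets["Food"])

# Hypothetical outputs of the four binary classifiers for one review.
predicted = union_of_predictions(
    {"Food": True, "Service": False, "Ambience": True, "Deals": False}
)
print(sorted(predicted))
```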
Classifier for each subset of categories • Categories = {Food, Service, Ambience, Deals} • We treat each distinct subset of categories as a single class and learn a multi-class classifier
Transformed dataset:
Review      Categories
Review 1    “1001”
Review 2    “0011”
Review 3    “1000”
Review 4    “0111”
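The transformation into subset classes can be sketched as a bit-string encoding. The bit order is an assumption: the sketch takes the bits to follow the order Food, Service, Ambience, Deals from the category list, which the slides do not state explicitly.

```python
CATEGORIES = ["Food", "Service", "Ambience", "Deals"]  # assumed bit order


def encode(labels, categories=CATEGORIES):
    """Turn a set of categories into a single class name such as '1001'."""
    return "".join("1" if category in labels else "0" for category in categories)


def decode(code, categories=CATEGORIES):
    """Recover the set of categories from a class name."""
    return {category for category, bit in zip(categories, code) if bit == "1"}


print(encode({"Food", "Deals"}))  # 1001
print(sorted(decode("0111")))     # ['Ambience', 'Deals', 'Service']
```

With 4 categories there are at most 2^4 = 16 such classes, and a multi-class classifier can only predict subsets that appeared in its training data.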
Ensemble of subset classifiers • Train one classifier per subset of categories; each classifier predicts only the categories in its subset • Classifier 1 for (Food, Service) • Classifier 2 for (Food, Ambience) • Classifier 3 for (Food, Deals) • Classifier 4 for (Service, Ambience) • Classifier 5 for (Service, Deals) • Classifier 6 for (Ambience, Deals) • Total 6 classifiers for subsets of size 2: C(4, 2) = 6
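The six category pairs listed above are exactly the 2-element combinations of the four categories, which a short sketch can enumerate. The helper `restrict` is a hypothetical illustration of how a review's labels might be projected onto one classifier's pair; the slides do not show this step.

```python
from itertools import combinations
from math import comb

CATEGORIES = ["Food", "Service", "Ambience", "Deals"]

# All subsets of size 2, in the same order as the slide's classifier list.
pairs = list(combinations(CATEGORIES, 2))
for index, pair in enumerate(pairs, start=1):
    print(f"Classifier {index} for {pair}")


def restrict(labels, pair):
    """Keep only the labels that a given pair classifier is responsible for."""
    return set(labels) & set(pair)


print(len(pairs), comb(4, 2))
print(sorted(restrict({"Food", "Ambience", "Deals"}, ("Food", "Service"))))
```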
Ensemble of classifiers: Prediction • Ask each classifier to vote!
Ensemble of classifiers: Prediction • Final prediction: Majority vote (>= 2 classifiers)
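A minimal sketch of the voting step, assuming each pair classifier returns the subset of its two categories that it believes apply. With pairs drawn from four categories, every category is covered by three classifiers, so a threshold of 2 is a majority. The vote values below are invented for illustration.

```python
from collections import Counter


def majority_vote(votes, threshold=2):
    """Return the categories predicted by at least `threshold` classifiers.

    `votes` is a list with one set of predicted categories per classifier.
    """
    tally = Counter()
    for predicted in votes:
        tally.update(predicted)
    return {category for category, count in tally.items() if count >= threshold}


# Hypothetical outputs of the six pair classifiers for one review.
votes = [
    {"Food"},              # (Food, Service)
    {"Food", "Ambience"},  # (Food, Ambience)
    {"Food"},              # (Food, Deals)
    {"Ambience"},          # (Service, Ambience)
    set(),                 # (Service, Deals)
    {"Deals"},             # (Ambience, Deals)
]

final = majority_vote(votes)
print(sorted(final))  # ['Ambience', 'Food']
```

Here Food receives 3 votes and Ambience 2, so both are kept, while Deals (1 vote) and Service (0 votes) are dropped.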
Evaluation measures • Notation: let (x, Y) be a multi-label example, with Y ⊆ L, where L is the set of all labels • Let h be a multi-label classifier • Let Z = h(x) be the set of labels predicted by h for (x, Y) • Precision: |Y ∩ Z| / |Z| • Recall: |Y ∩ Z| / |Y|
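Using the notation above, set-based precision and recall for a single example can be sketched as the overlap between the true label set Y and the predicted set Z, divided by the size of Z or Y respectively. This is the usual example-based definition and is assumed here rather than taken from the slides; returning 0.0 for an empty denominator is also an assumed convention.

```python
def precision(true_labels, predicted_labels):
    """Fraction of predicted labels that are correct: |Y & Z| / |Z|."""
    if not predicted_labels:
        return 0.0  # assumed convention when nothing is predicted
    return len(true_labels & predicted_labels) / len(predicted_labels)


def recall(true_labels, predicted_labels):
    """Fraction of true labels that were predicted: |Y & Z| / |Y|."""
    if not true_labels:
        return 0.0  # assumed convention when the example has no labels
    return len(true_labels & predicted_labels) / len(true_labels)


Y = {"Food", "Service"}           # true labels of one example
Z = {"Food", "Ambience", "Deals"}  # labels predicted by h(x)

print(precision(Y, Z), recall(Y, Z))
```

In this example one of three predicted labels is correct (precision 1/3) and one of two true labels is recovered (recall 1/2). Scores for a whole test set would typically be averaged over all examples.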
Observation 2: Data Skew • Normalized the skew in the training data by selectively adding data
Thanks! • Check out our Yelp submission: http://www.ics.uci.edu/~vpsaini/ • Feedback welcome!