Problem Statement

CS 277 Data Mining Project Presentation. Instructor: Prof. Dave Newman. Team: Hitesh Sajnani, Vaibhav Saini, Kusum Kumar. Donald Bren School of Information and Computer Science, University of California, Irvine.

Presentation Transcript


  1. CS 277 Data Mining Project Presentation • Instructor: Prof. Dave Newman • Team: Hitesh Sajnani, Vaibhav Saini, Kusum Kumar • Donald Bren School of Information and Computer Science, University of California, Irvine

  2. Problem Statement • Classify a given Yelp review text into one or more relevant categories

  3. Dataset • Reviews • Reviews from the Food and Restaurant category • Useful votes > 1 • Total 10,000 reviews • Classification categories • Identified categories using a sample set of 400 random reviews • Refined categories using 200 more reviews • Final categories: 5 • Food, Ambience, Service, Deals/Discounts, Worthiness

  4. Data Annotation • 10,000 reviews divided into 5 bins (with repetition) • 6 researchers manually annotated reviews • 225 man-hours of work! • Discrepancy in 981 ambiguous reviews, which were removed from analysis • Total 9,019 reviews: split into 80% train and 20% test
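
A minimal sketch of the 80/20 split described on this slide. The slides do not say how the split was drawn, so the random shuffle, the fixed seed, and the function name `train_test_split` are assumptions for illustration.

```python
import random

def train_test_split(items, train_fraction=0.8, seed=0):
    """Shuffle the items and cut them into a train part and a test part."""
    shuffled = list(items)
    # A seeded generator keeps the split reproducible (the seed is arbitrary).
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Stand-in review ids for the 9,019 annotated reviews.
train_ids, test_ids = train_test_split(range(9019))
```

With this rounding rule the 9,019 reviews divide into 7,215 training and 1,804 test reviews; the exact counts used in the project are not given in the slides.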

  5. Features – unigrams/bigrams/trigrams • Total 703 textual features • 375 unigrams, 208 bigrams, 120 trigrams • (Chart: frequency of unigrams/bigrams/trigrams)
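
The n-gram features can be sketched as follows. This assumes whitespace tokenization, lowercasing, and raw counts as feature values; the slides do not specify any of these, and the three-term vocabulary is hypothetical (the project used 703 selected n-grams).

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_counts(text, max_n=3):
    """Count unigrams, bigrams and trigrams in a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return counts

def featurize(text, vocabulary):
    """Map a review to a frequency vector over a fixed n-gram vocabulary."""
    counts = ngram_counts(text)
    return [counts[term] for term in vocabulary]

# Hypothetical vocabulary with one unigram, one bigram and one trigram.
vocabulary = ["food", "great service", "happy hour deal"]
vector = featurize("Great service and a happy hour deal on food", vocabulary)
```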

  6. Features – User ratings 3 nominal features – Good, Moderate, Bad
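
One possible encoding of the rating features, as a sketch only. The slides name the three nominal values but not how star ratings map to them, so the thresholds below (4 or more stars is Good, 3 is Moderate, fewer is Bad) and the function name `rating_features` are assumptions.

```python
def rating_features(stars):
    """Encode a star rating as three 0/1 indicator features."""
    # Hypothetical thresholds; the slides do not state the actual cut-offs.
    if stars >= 4:
        level = "Good"
    elif stars >= 3:
        level = "Moderate"
    else:
        level = "Bad"
    return {name: int(name == level) for name in ("Good", "Moderate", "Bad")}
```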

  7. Approach • Reviews can be classified into more than one category • Not a binary classification problem: it is multi-label classification!

  8. Binary classifiers for each category • Learns one binary classifier for each category • Output is the union of predictions of all binary classifiers • (Figure: original dataset and transformed datasets)
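
The dataset transformation behind the per-category binary classifiers might look like this. It is a sketch assuming each review is stored as a (text, set of categories) pair; the label list, the sample reviews, and the function names are illustrative, and the choice of base classifier is left open because the slides do not name one.

```python
LABELS = ["Food", "Service", "Ambience", "Deals"]

def binary_relevance_datasets(dataset, labels=LABELS):
    """Build one binary dataset per label from (text, set_of_labels) pairs."""
    return {
        label: [(text, int(label in categories)) for text, categories in dataset]
        for label in labels
    }

def union_of_predictions(binary_predictions):
    """Combine per-label 0/1 predictions into the set of predicted labels."""
    return {label for label, positive in binary_predictions.items() if positive}

# Hypothetical annotated reviews.
dataset = [
    ("tasty tacos, slow waiter", {"Food", "Service"}),
    ("great happy hour prices", {"Deals"}),
]
per_label = binary_relevance_datasets(dataset)
predicted = union_of_predictions({"Food": 1, "Service": 0, "Ambience": 0, "Deals": 1})
```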

  9. Classifier for each subset of categories • Categories = {Food, Service, Ambience, Deals} • We consider each different “subset of categories” as a single category and learn a multi-class classifier • Transformed dataset (review: categories) • Review 1: “1001” • Review 2: “0011” • Review 3: “1000” • Review 4: “0111”
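
The bit-string class labels on this slide can be produced with a small encoder. This sketch assumes the bits follow the order Food, Service, Ambience, Deals; the slide does not state the bit order, and the function names are illustrative.

```python
LABELS = ["Food", "Service", "Ambience", "Deals"]

def encode(categories, labels=LABELS):
    """Turn a set of categories into a single class label such as "1001"."""
    return "".join("1" if label in categories else "0" for label in labels)

def decode(code, labels=LABELS):
    """Recover the set of categories from a predicted class label."""
    return {label for label, bit in zip(labels, code) if bit == "1"}

# Under the assumed bit order, "1001" stands for {Food, Deals}.
code = encode({"Food", "Deals"})
```

A multi-class classifier trained on these codes can only predict subsets that occur in the training data, which is one motivation for the ensemble on the following slides.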

  10. Ensemble of subset classifiers • Train one classifier per subset of categories, each predicting only the categories in its subset • Classifier 1 for (Food, Service) • Classifier 2 for (Food, Ambience) • Classifier 3 for (Food, Deals) • Classifier 4 for (Service, Ambience) • Classifier 5 for (Service, Deals) • Classifier 6 for (Ambience, Deals) • Total 6 classifiers for subsets of 2 categories (4C2 = 6)
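
The six category pairs can be enumerated programmatically; this is a small sketch using the four categories listed on the slide, with variable names chosen for illustration.

```python
from itertools import combinations
from math import comb

LABELS = ["Food", "Service", "Ambience", "Deals"]

# Every subset of size 2; one classifier would be trained per pair.
pairs = list(combinations(LABELS, 2))

# 4C2, the number of classifiers in the ensemble.
n_classifiers = comb(len(LABELS), 2)
```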

  11. Ensemble of classifiers: Prediction • Ask each classifier to vote!

  17. Ensemble of classifiers: Prediction • Final prediction: Majority vote (>= 2 classifiers)
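
The voting step can be sketched as follows, assuming each subset classifier returns the categories it predicts from its own pair. The predictions shown are hypothetical, and `majority_vote` is an illustrative name. With four categories and pairwise subsets, each category is seen by 3 classifiers, so a threshold of 2 votes is a majority.

```python
from collections import Counter

def majority_vote(subset_predictions, min_votes=2):
    """Keep every label predicted by at least min_votes subset classifiers."""
    votes = Counter()
    for predicted in subset_predictions:
        votes.update(predicted)
    return {label for label, n in votes.items() if n >= min_votes}

# Hypothetical outputs of the six pairwise classifiers for one review.
predictions = [
    {"Food"},              # classifier for (Food, Service)
    {"Food", "Ambience"},  # classifier for (Food, Ambience)
    {"Food"},              # classifier for (Food, Deals)
    {"Ambience"},          # classifier for (Service, Ambience)
    set(),                 # classifier for (Service, Deals)
    {"Deals"},             # classifier for (Ambience, Deals)
]
final = majority_vote(predictions)
```

Here Food receives 3 votes, Ambience 2 and Deals 1, so the final prediction is Food and Ambience.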

  18. Evaluation measures • Notation: Let (x, Y) be a multi-label example, Y ⊆ L, where L is the set of labels • Let h be a multi-label classifier • Let Z = h(x) be the set of labels predicted by h for (x, Y) • Measures: precision and recall, computed from Y and Z
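
The formulas did not survive in this transcript. A common example-based definition in the multi-label literature, which fits the notation above, takes precision as |Y ∩ Z| / |Z| and recall as |Y ∩ Z| / |Y|, averaged over all examples. The sketch below assumes that definition, and it scores an empty Y or Z as 0, which is a convention chosen here rather than something the slides state.

```python
def precision_recall(examples):
    """Average per-example precision and recall over (Y, Z) pairs of label sets."""
    precisions, recalls = [], []
    for true_labels, predicted in examples:
        overlap = len(true_labels & predicted)
        # An empty predicted or true set is scored as 0 (assumed convention).
        precisions.append(overlap / len(predicted) if predicted else 0.0)
        recalls.append(overlap / len(true_labels) if true_labels else 0.0)
    n = len(examples)
    return sum(precisions) / n, sum(recalls) / n

# Hypothetical (Y, Z) pairs: true labels and predicted labels.
examples = [
    ({"Food", "Service"}, {"Food"}),
    ({"Deals"}, {"Deals", "Ambience"}),
]
precision, recall = precision_recall(examples)
```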

  19. Precision & Recall (Train)

  20. Precision & Recall (Test)

  21. Observation 1: Ensemble gave the best results

  22. Observation 2: Data skew • Normalized skew in the training data by selectively adding data

  23. Precision & Recall (w & w/o category normalization)

  24. Precision & Recall (w & w/o category normalization)

  25. Thanks! • Check out our Yelp submission: http://www.ics.uci.edu/~vpsaini/ • Feedback welcome!
