A Semantic, Supervised Classification Approach to Restaurant Reviews
Pavani Vantimitta
Problem definition • Reviews are an important source of information on new businesses • Use semantic information in a review to predict the rating assigned to it • Use machine learning classifiers, including a MaxEnt classifier
Data Collection • Restaurant reviews from yelp.com for places around Palo Alto • Use Web-Harvest, a web extraction tool, to convert the reviews into text files • Training data comprises 61 restaurants and 1971 reviews; validation data comprises 12 restaurants and 361 reviews; test data comprises 10 restaurants and 260 reviews
Preprocessing • Collapsing repeated spaces between words and sentences, and repeated punctuation marks • Inserting a space between a punctuation mark and the preceding word • The final data collected contains
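The two cleanup steps on this slide can be sketched with regular expressions. This is only an illustration: the function name `preprocess` and the set of punctuation marks are assumptions, and "multiple punctuation marks" is read here as the same mark repeated (for example "!!!").

```python
import re

# Punctuation marks handled by this sketch (an assumption, not from the slides).
PUNCT = r"[.,!?;:]"

def preprocess(text: str) -> str:
    """Normalize spacing and punctuation in a raw review string."""
    # Collapse runs of the same punctuation mark ("!!!" -> "!").
    text = re.sub(rf"({PUNCT})\1+", r"\1", text)
    # Insert a space between a punctuation mark and the preceding word.
    text = re.sub(rf"(\w)({PUNCT})", r"\1 \2", text)
    # Collapse repeated whitespace into a single space.
    text = re.sub(r"\s+", " ", text)
    return text.strip()

print(preprocess("Great   food!!!  Loved it..."))
```

Separating punctuation from words this way lets a tagger treat each mark as its own token. Note that this simple rule also splits decimals such as "3.5", which a fuller tokenizer would need to handle.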
Part-of-Speech Tagging • Stanford POS tagger
Semantic information • Part-of-speech tags give some indication of the tone of a review • Aim to use only adjectives (words tagged as ‘JJ’) for classification
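Selecting the adjectives from tagged text might look like the sketch below. It assumes the tagger output is plain text in `word_TAG` form, one token per whitespace-separated item, which is the Stanford tagger's default separator. The function name `extract_adjectives` and the lowercasing step are illustrative choices, not taken from the slides.

```python
def extract_adjectives(tagged: str, sep: str = "_") -> list[str]:
    """Return the words tagged exactly 'JJ' from a word_TAG formatted string."""
    adjectives = []
    for token in tagged.split():
        # Split on the last separator so words containing "_" stay intact.
        word, found, tag = token.rpartition(sep)
        if found and tag == "JJ":
            adjectives.append(word.lower())  # lowercasing is an assumption
    return adjectives

sample = "The_DT food_NN was_VBD great_JJ but_CC the_DT service_NN was_VBD slow_JJ ._."
print(extract_adjectives(sample))
```

Only the exact tag `JJ` is kept, matching the slide. Comparative and superlative adjectives (`JJR`, `JJS`) are dropped in this sketch.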
Vocabulary • Full vocabulary (all words tagged as ‘JJ’) • Vocabulary pruned by word count, with cutoffs {4, 10, 50, 100, 500} • Vocabulary pruned by comparing the words that appear in reviews with different ratings • Stemming: Lovins stemmer and iterated Lovins stemmer
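One way to read the count-based pruning is as a minimum-frequency filter: keep an adjective only if it occurs at least a given number of times across the training reviews. The slides do not say whether the cutoffs are frequency thresholds or vocabulary sizes, so the sketch below is an assumption, and `build_vocabulary` is a hypothetical helper name.

```python
from collections import Counter

def build_vocabulary(reviews: list[list[str]], min_count: int) -> set[str]:
    """Keep adjectives that occur at least `min_count` times across all reviews.

    `reviews` is a list of adjective lists, one per review.
    """
    counts = Counter(adj for review in reviews for adj in review)
    return {word for word, n in counts.items() if n >= min_count}

toy_reviews = [["great", "slow"], ["great", "friendly"], ["great", "slow", "bland"]]
for cutoff in (1, 2, 3):
    print(cutoff, sorted(build_vocabulary(toy_reviews, cutoff)))
```

Under this reading, raising the cutoff shrinks the feature set, trading coverage of rare adjectives for fewer, better-estimated features.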
Variations in classification • V1: each rating is its own class • V2: rating 1 as one class and rating 5 as the other • V3: ratings 1, 2, 3 as one class and ratings 4, 5 as the other
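The three labeling schemes can be written as a single mapping from star rating to class label. In this sketch the label names `"low"` and `"high"` are hypothetical, and it assumes that V2 leaves out reviews rated 2 to 4, which the slides do not state.

```python
from typing import Optional, Union

def map_label(rating: int, variation: str) -> Optional[Union[int, str]]:
    """Map a 1-5 star rating to a class label under V1, V2, or V3.

    Returns None when the review is not used under that variation.
    """
    if rating not in (1, 2, 3, 4, 5):
        raise ValueError(f"rating must be 1-5, got {rating}")
    if variation == "V1":
        return rating  # five classes, one per rating
    if variation == "V2":
        # Assumption: only the extreme ratings are kept.
        return rating if rating in (1, 5) else None
    if variation == "V3":
        return "low" if rating <= 3 else "high"
    raise ValueError(f"unknown variation: {variation}")

print([map_label(r, "V3") for r in range(1, 6)])
```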
MaxEnt Classifier • Variation 3: best feature set has 33 features
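For a two-class setup such as Variation 3, a MaxEnt classifier is equivalent to logistic regression. The sketch below is a toy pure-Python version using binary adjective-presence features and plain stochastic gradient ascent. The slides do not describe the feature encoding, the training procedure, or the toolkit used, so everything here (function names, learning rate, epoch count, toy data) is an assumption for illustration only.

```python
import math

def train_maxent(samples, labels, vocab, epochs=200, lr=0.5):
    """Fit a two-class MaxEnt (logistic regression) model.

    samples: list of adjective lists; labels: 1 (high rating) or 0 (low rating).
    """
    index = {word: i for i, word in enumerate(sorted(vocab))}
    weights = [0.0] * len(index)
    bias = 0.0
    for _ in range(epochs):
        for adjectives, y in zip(samples, labels):
            active = {index[a] for a in adjectives if a in index}
            score = bias + sum(weights[i] for i in active)
            p = 1.0 / (1.0 + math.exp(-score))  # P(class = 1 | features)
            error = y - p  # gradient of the log-likelihood for this sample
            bias += lr * error
            for i in active:
                weights[i] += lr * error
    return index, weights, bias

def predict(model, adjectives):
    """Return 1 if the model scores the review as high-rated, else 0."""
    index, weights, bias = model
    score = bias + sum(weights[index[a]] for a in set(adjectives) if a in index)
    return 1 if score > 0 else 0

# Toy data, invented for this example.
samples = [["great", "friendly"], ["delicious"], ["slow", "rude"], ["bland"]]
labels = [1, 1, 0, 0]
vocab = {w for s in samples for w in s}
model = train_maxent(samples, labels, vocab)
print(predict(model, ["great"]), predict(model, ["rude"]))
```

A practical version would add regularization and tune the feature set on the validation data. This sketch omits both.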
Future Work • Sentence boundary detection • Incorporate n-gram models • Predict a rating for each sentence in a review and average the results over the full review, which takes conflicting tones within a review into account
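The sentence-level idea could be prototyped as follows. The regex splitter is a naive stand-in for real sentence boundary detection, and `rate_sentence` is a hypothetical callable standing in for whatever per-sentence classifier would be used. Neither comes from the slides.

```python
import re
from statistics import mean
from typing import Callable

def predict_review_rating(review: str, rate_sentence: Callable[[str], float]) -> float:
    """Average per-sentence rating predictions over a whole review."""
    # Naive split: break after ., ! or ? followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", review)
    sentences = [s.strip() for s in parts if s.strip()]
    if not sentences:
        raise ValueError("review contains no sentences")
    return mean(rate_sentence(s) for s in sentences)

# Toy per-sentence rater, invented for this example.
def toy_rater(sentence: str) -> int:
    return 5 if "great" in sentence.lower() else 1

print(predict_review_rating("The food was great. The service was slow.", toy_rater))
```

Averaging lets a review with one positive and one negative sentence land on a middle rating, instead of being dominated by whichever adjectives happen to be most frequent.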