Extracting Local Understandings from User-Generated Reviews on City Guide Websites Andrea Moed IS256 Applied Natural Language Processing Professor Marti Hearst December 6, 2006
Overview • Motivations • Corpus • Processing • Nickname discovery • Ongoing experiments • Attraction extraction • Review classification • Future work Andrea Moed | IS256 ANLP
Motivations • Local knowledge of well-known places… for locals • “Nobody goes there anymore, it’s too crowded” • Major draws (views, dishes, people…) • Best times/seasons/modes of transport? • Places to combine in one excursion • “A good place for X” vs. a Great Good Place* • *Ray Oldenburg, The Great Good Place: Cafes, Coffee Shops, Bookstores, Bars, Hair Salons, and Other Hangouts at the Heart of a Community, 1999
Corpus • Yelp San Francisco • Social site organized around cities, launched 2004 • Thousands of SF places, reviews and reviewers • Largely local interest (Mass Media, Pets) • Some areas useful for visitors (Night Life, Shopping) • Writerly culture: high structural and stylistic variation in the text • Categories: Restaurants, Night Life, Shopping, Active Life, Local Flavor • Destinations • Frequently reviewed places: 20+ reviews
Processing • Used Dappit to build page scrapers • Generated XML; parsed in Python • Place objects consisting of location info + reviews • Corpus collects place objects from various categories • Challenges of screen scraping • Tradeoff between more places and places with most reviews (optimization requires exhaustive search) • TripAdvisor proved too difficult • Analysis with Python and NLTK Lite
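The parsing step above can be illustrated with a minimal sketch. The slides do not show the XML that the scrapers produced, so the element and attribute names here (`place`, `name`, `address`, `review`, `category`) and the `Place` class are hypothetical stand-ins, not the project's actual schema.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class Place:
    """A place object: location info plus the text of its reviews."""
    name: str
    category: str
    address: str
    reviews: list = field(default_factory=list)


def parse_places(xml_text):
    """Build Place objects from scraper output (hypothetical schema)."""
    root = ET.fromstring(xml_text)
    places = []
    for node in root.iter("place"):
        places.append(Place(
            name=node.findtext("name", default="").strip(),
            category=node.get("category", ""),
            address=node.findtext("address", default="").strip(),
            reviews=[(r.text or "").strip() for r in node.iter("review")],
        ))
    return places


# Invented sample input, for illustration only.
SAMPLE = """
<places>
  <place category="Local Flavor">
    <name>Ferry Building</name>
    <address>1 Ferry Building, San Francisco</address>
    <review>Great farmers market on Saturdays.</review>
    <review>Go early to beat the crowds.</review>
  </place>
</places>
"""

corpus = parse_places(SAMPLE)
```

A corpus built this way is simply a list of place objects, which can then be filtered by category or by number of reviews before analysis.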
Place Nickname Discovery • Goal: Discover alternate search terms to surface more diverse local results in web search • Method: Regular expression matching
Place Nickname Discovery • Steps • Counted frequency of Yelp-given place name in reviews of that place • Tokenized name on whitespace • Rule-based generation of candidate nicknames: acronym, subsets of tokens • Compared frequencies of given name and each nickname • Potentially useful nicknames are those that occur at least half as often as the given name
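The steps above can be sketched roughly as follows. This is one possible reading, not the original code: the function names are hypothetical, matching is assumed to be case-insensitive, and occurrences of the full name are assumed to be removed before counting nicknames so that "Park" inside "Golden Gate Park" is not counted twice. The slides do not say how either detail was handled.

```python
import re
from itertools import combinations


def candidate_nicknames(name):
    """Rule-based candidates: the acronym plus ordered proper subsets of tokens."""
    tokens = name.split()  # tokenize the given name on whitespace
    candidates = set()
    if len(tokens) > 1:
        candidates.add("".join(t[0] for t in tokens).upper())
    for size in range(1, len(tokens)):
        for combo in combinations(tokens, size):
            candidates.add(" ".join(combo))
    candidates.discard(name)
    return candidates


def phrase_pattern(phrase):
    """Regular expression matching the phrase as a whole word or phrase."""
    return re.compile(r"(?<!\w)" + re.escape(phrase) + r"(?!\w)", re.IGNORECASE)


def useful_nicknames(name, reviews, ratio=0.5):
    """Return candidates occurring at least `ratio` times as often as the name."""
    text = " ".join(reviews)
    full = phrase_pattern(name)
    name_count = len(full.findall(text))
    if name_count == 0:
        return {}
    # Assumption: drop full-name occurrences before counting nicknames.
    remainder = full.sub(" ", text)
    result = {}
    for cand in candidate_nicknames(name):
        count = len(phrase_pattern(cand).findall(remainder))
        if count >= ratio * name_count:
            result[cand] = count
    return result


# Invented reviews, for illustration only.
reviews = [
    "Golden Gate Park is huge.",
    "We biked through the Park.",
    "GGP on a Sunday is great.",
    "The park has bison!",
]
nicknames = useful_nicknames("Golden Gate Park", reviews)
```

Note that a candidate made of a common word ("Park", "House", "The") can pass the threshold simply because the word is frequent in ordinary text, so a stop list or a corpus-wide frequency check would be a natural extra filter.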
Place Nickname Discovery • Results • From 61 places (Restaurants, Active Life, Local Flavor), 38 reviews each • 23 of 61 places appeared to have frequently used nicknames • BUT in 9 cases this was due to common words in names • First word most commonly used nickname in remaining cases • Hypothesis: Long tail of less predictable nicknames
Ongoing Work • Attraction extraction • TF/IDF calculation to find the concepts most widely associated with a place • Further text analysis to collect understandings around key concepts • Specificity • Sentiment • Temporality
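As a rough illustration of the TF/IDF step, the sketch below treats all reviews of one place as a single document and scores each term by relative frequency times log(N / document frequency). The slides do not specify the weighting variant or the tokenizer, so both are assumptions, and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens (a simple regex tokenizer, assumed here)."""
    return re.findall(r"[a-z']+", text.lower())


def tfidf_by_place(reviews_by_place):
    """Score each term per place; one 'document' = all reviews of a place."""
    counts = {
        place: Counter(tokenize(" ".join(reviews)))
        for place, reviews in reviews_by_place.items()
    }
    n_places = len(counts)
    doc_freq = Counter()
    for term_counts in counts.values():
        doc_freq.update(term_counts.keys())
    scores = {}
    for place, term_counts in counts.items():
        total = sum(term_counts.values())
        scores[place] = {
            term: (count / total) * math.log(n_places / doc_freq[term])
            for term, count in term_counts.items()
        }
    return scores


def top_terms(scores, place, k=3):
    """Highest-scoring terms for a place; ties broken alphabetically."""
    ranked = sorted(scores[place], key=lambda t: (-scores[place][t], t))
    return ranked[:k]


# Invented mini-corpus, for illustration only.
corpus = {
    "Ferry Building": ["Great oysters and a farmers market",
                       "The market has oysters"],
    "Dolores Park": ["Great views and sunny picnics",
                     "Sunny park with views"],
}
scores = tfidf_by_place(corpus)
ferry_top = top_terms(scores, "Ferry Building", k=2)
```

Terms that appear in the reviews of every place, such as "great" in this toy corpus, score zero, which is what lets place-specific attractions rise to the top.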
Ongoing Work • Classification of reviews: recommendation vs. narrative • Recommendations help people “use” a city • Narrative is associated with memorable and unique locations • Features for classification • Verb tense distribution • Paragraph breaks • Opinion words at beginning and end (recommendation) • Memory and relationship words (narrative)
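Three of the proposed features can be sketched with the standard library alone. The word lists below are tiny hypothetical placeholders rather than lexicons from the project, and the `edge` window size is an arbitrary choice. Verb tense distribution is left out because it would need a part-of-speech tagger.

```python
import re

# Hypothetical placeholder lexicons; a real system would need larger lists.
OPINION_WORDS = {"great", "best", "love", "amazing", "recommend",
                 "avoid", "awful", "worst"}
MEMORY_WORDS = {"remember", "remembered", "years", "ago", "childhood",
                "friend", "boyfriend", "girlfriend", "mom", "dad"}


def review_features(text, edge=10):
    """Extract simple surface features from one review."""
    tokens = re.findall(r"[a-z']+", text.lower())
    # Look only at the first and last `edge` tokens for opinion words;
    # short reviews are used whole.
    if len(tokens) > 2 * edge:
        edges = tokens[:edge] + tokens[-edge:]
    else:
        edges = tokens
    return {
        # A paragraph break is a blank line between blocks of text.
        "paragraph_breaks": len(re.findall(r"\n\s*\n", text.strip())),
        "edge_opinion_words": sum(t in OPINION_WORDS for t in edges),
        "memory_words": sum(t in MEMORY_WORDS for t in tokens),
    }


# Invented review, for illustration only.
review = ("Best burritos in the Mission.\n\n"
          "I remember coming here years ago with my dad.\n\n"
          "Go early and avoid the line.")
features = review_features(review, edge=3)
```

Feature dictionaries of this shape could be passed to any standard classifier once a labeled training set of recommendation and narrative reviews exists.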
Future Work • Relating understanding about location features to external data (geocoding, weather) • Visualization of extracted concepts • Development of a training set for classification