THE ANDREW W. MELLON FOUNDATION Criticism Mining: Text Mining Experiments on Book, Movie and Music Reviews Xiao Hu, J. Stephen Downie, M. Cameron Jones The International Music Information Retrieval Systems Evaluation Lab (IMIRSEL) University of Illinois at Urbana-Champaign
Agenda • Motivation • Customer reviews in epinions.com • Experimental Setup • Data set • Results • Conclusions & Future Work
Motivation • Critical consumer-generated reviews of humanities materials • a rich resource of reviewers’ opinions, and background / contextual information • self-organized: pave ways to automatic processing • Text mining: mature and ready to use • Criticism mining: provides a tool to assist humanities scholars • Locating • Organizing • Analyzing critical review content
Customer Reviews • Published on www.epinions.com • Focused on the book, movie and music • Each review associated with: • a genre label • a numerical quality rating
The numerical rating associated with each review was used in our experiments
Music Genres • 28 major genre categories • Jazz, Rock, Country, Classical, Blues, Gospel, Punk, … • Renaissance, Medieval, Baroque, Romantic, …
Experimental Setup • Goal: to build and evaluate a prototype criticism mining system that could automatically: • predict the genre of the work being reviewed • predict the quality rating assigned to the reviewed item • differentiate book reviews from movie reviews, especially for items in the same genre • differentiate fiction from non-fiction book reviews
Genre Taxonomy: Music • The genre labels and the rating information provided the ground truth for the experiments
Data Preprocessing • HTML tags were stripped out • Stop words were NOT stripped out • Punctuation was NOT stripped out • Both may carry stylistic information • Tokens were stemmed
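The preprocessing steps above can be sketched as follows. This is a minimal illustration, not the actual pipeline code: the HTML-stripping regex, tokenizer, and the crude suffix stemmer (a stand-in for a real stemmer such as Porter's) are our own assumptions.

```python
import re

def preprocess(review_html):
    """Sketch of the slide's preprocessing: strip HTML, keep stop words
    and punctuation (possible stylistic cues), then stem the tokens."""
    # 1. Strip HTML tags
    text = re.sub(r"<[^>]+>", " ", review_html)
    # 2. Tokenize, keeping punctuation marks as separate tokens
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    # 3. Crude suffix-stripping stemmer (illustrative only)
    def stem(tok):
        for suffix in ("ing", "ed", "s"):
            if tok.endswith(suffix) and len(tok) > len(suffix) + 2:
                return tok[: -len(suffix)]
        return tok
    return [stem(t) for t in tokens]

print(preprocess("<p>I loved reading these books!</p>"))
```

Note that stop words ("i", "these") and the exclamation mark survive the pipeline, as the slide specifies.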
Categorization Model & Implementation • Naïve Bayesian (NB) Classifier • Computationally efficient • Empirically effective • Text-to-Knowledge (T2K) Toolkit • A text mining framework • Ready-to-use modules and itineraries • Natural Language Processing tools integrated • Supporting fast prototyping of text mining
[Diagram: NB itinerary in T2K — Data Preprocessing → NB Classifier]
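A minimal multinomial Naïve Bayes classifier with add-one smoothing can be sketched as below. This is an illustration of the model, not T2K's actual implementation, whose internals the slides do not show.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Minimal multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        # docs: list of token lists; labels: parallel list of class labels
        self.classes = set(labels)
        self.prior = Counter(labels)                  # class document counts
        self.word_counts = defaultdict(Counter)       # per-class token counts
        for toks, lab in zip(docs, labels):
            self.word_counts[lab].update(toks)
        self.vocab = {w for c in self.word_counts.values() for w in c}
        self.total = {c: sum(self.word_counts[c].values()) for c in self.classes}
        return self

    def predict(self, toks):
        V = len(self.vocab)
        def log_score(c):
            s = math.log(self.prior[c])
            for t in toks:
                # add-one smoothing avoids zero probabilities for unseen words
                s += math.log((self.word_counts[c][t] + 1) / (self.total[c] + V))
            return s
        return max(self.classes, key=log_score)
```

For example, a classifier trained on toy music and book reviews would assign `["guitar", "melody"]` to the music class.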
Genre Classification • 5-fold random cross-validation for book and movie reviews • 3-fold random cross-validation for music reviews
Rating Classification • Five-class classification • 1 star vs. 2 stars vs. 3 stars vs. 4 stars vs. 5 stars • Binary-group classification • 1 star + 2 stars vs. 4 stars + 5 stars • Ad extremis classification • 1 star vs. 5 stars • 5-fold random cross-validation for all experiments
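The three rating-classification schemes and the random k-fold splits can be sketched as below. The function names and the "low"/"high" labels are our own; the slides only describe the groupings.

```python
import random

def collapse_rating(stars, scheme):
    """Map a 1-5 star rating to the label used by each scheme (None = excluded)."""
    if scheme == "five-class":
        return stars
    if scheme == "binary-group":      # 1+2 vs. 4+5; 3-star reviews dropped
        return "low" if stars <= 2 else "high" if stars >= 4 else None
    if scheme == "ad-extremis":       # only the 1-star and 5-star extremes
        return stars if stars in (1, 5) else None
    raise ValueError(scheme)

def k_fold_splits(n_items, k, seed=42):
    """Random k-fold cross-validation: shuffle indices, deal into k folds,
    and return (train_indices, test_indices) pairs."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    return [([j for j in idx if j not in fold], fold) for fold in folds]
```

Each item appears in exactly one test fold, so every review is used for both training and evaluation across the k runs.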
Classification of Book and Movie Reviews 1 • Reviews on all available genres • Books: 9 genres; Movies: 11 genres • Reviews on individual, comparable genres
Classification of Book and Movie Reviews 2 • Eliminated words that can directly suggest the categories: • "book", "movie", "fiction", "film", "novel", "actor", "actress", "read", "watch", "scene" • Words that occurred frequently in one category but not the other • To make the task harder / avoid oversimplifying it • Results suggest a stylistic difference in users’ criticisms of books and movies • 5-fold random cross-validation for all experiments
Classification of Fiction and Non-fiction Book Reviews 2 • Eliminated words that can directly suggest the categories: • "fiction", "non", "novel", "character", "plot", and "story" • Words that occurred frequently in one category but not the other • To make the task harder / avoid oversimplifying it • Results suggest a stylistic difference in users’ criticisms of fiction books and non-fiction ones • 5-fold random cross-validation for all experiments
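The word-elimination step used in both experiments amounts to a simple vocabulary filter. The banned word lists are taken from the slides; the function and set names are our own.

```python
# Giveaway words listed on the slides for each experiment
BOOK_VS_MOVIE = {"book", "movie", "fiction", "film", "novel",
                 "actor", "actress", "read", "watch", "scene"}
FICTION_VS_NONFICTION = {"fiction", "non", "novel", "character", "plot", "story"}

def remove_giveaways(tokens, banned):
    """Drop tokens that directly reveal the target category, forcing the
    classifier to rely on stylistic rather than topical cues."""
    return [t for t in tokens if t.lower() not in banned]
```

For instance, `remove_giveaways(["this", "novel", "was", "great"], FICTION_VS_NONFICTION)` keeps only the category-neutral words.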
Conclusions • Customer reviews are an excellent resource for studying humanities materials • Successful experiments: • High classification precision: genres; ratings; book vs. movie reviews; fiction vs. non-fiction book reviews • Reasonable confusions • Text mining techniques can help uncover important information about the materials being reviewed • Criticism mining: make the ever-growing consumer-generated review resources useful to humanities scholars
Future Work • More text mining techniques • decision trees, frequent pattern mining • Other critical text • blogs, wikis, etc. • Other facets of reviews • “usage” in music reviews • Feature studies • answer the “why” questions
References • Argamon, S., and Levitan, S. (2005). Measuring the Usefulness of Function Words for Authorship Attribution. Proceedings of the 17th Joint International Conference of ACH/ALLC. • Downie, J. S., Unsworth, J., Yu, B., Tcheng, D., Rockwell, G., and Ramsay, S. J. (2005). A Revolutionary Approach to Humanities Computing?: Tools Development and the D2K Data-Mining Framework. Proceedings of the 17th Joint International Conference of ACH/ALLC. • Hu, X., Downie, J. S., West, K., and Ehmann, A. (2005). Mining Music Reviews: Promising Preliminary Results. Proceedings of the Sixth International Conference on Music Information Retrieval (ISMIR). • Sebastiani, F. (2002). Machine Learning in Automated Text Categorization. ACM Computing Surveys, 34, 1. • Stamatatos, E., Fakotakis, N., and Kokkinakis, G. (2000). Text Genre Detection Using Common Word Frequencies. Proceedings of the 18th International Conference on Computational Linguistics.
THE ANDREW W. MELLON FOUNDATION Questions? IMIRSEL Thank you!