CONTENT-BASED BOOK RECOMMENDING USING LEARNING FOR TEXT CATEGORIZATION

TRIVIKRAM BHAT • UNIVERSITY OF TEXAS AT ARLINGTON • DATA MINING CSE6362 • BASED ON PAPER BY RAYMOND J. MOONEY AND LORIENE ROY, UNIVERSITY OF TEXAS, AUSTIN

OVERVIEW • Introduction • Techniques • Drawbacks of Existing Systems • Advantages of Content Based Systems • LIBRA • System Description • Experimental Results • Future Work • Conclusions

INTRODUCTION General goal of a Recommender System: • Make personalized suggestions based on previous examples of a user's likes and dislikes Types: • Existing systems that use Social Filtering methods (base recommendations on other users' preferences) • Content-Based systems (use information about an item itself to make suggestions)

INTRODUCTION • Companies • Firefly • Net Perceptions • LikeMinds • Amazon (book recommending) • Barnes and Noble (book recommending)

TECHNIQUES • Social / Collaborative Filtering • Maintain a Database of user preferences • Find other users whose known preferences correlate significantly with a given user • Content Based Filtering • Allows a system to uniquely characterize each user without having to match their interests to someone else’s • Items are recommended based on the information of the item itself
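The collaborative step of finding users whose known preferences correlate with a given user can be sketched as follows. This is a minimal illustration only: the slides do not say which correlation measure the existing systems use, so Pearson correlation over co-rated items is an assumption, and the user names and ratings are hypothetical.

```python
from math import sqrt

def pearson(ratings_a, ratings_b):
    """Pearson correlation over items both users rated (assumed measure).

    Returns 0.0 when fewer than two items are shared or when either
    user's shared ratings have no variance.
    """
    shared = sorted(set(ratings_a) & set(ratings_b))
    if len(shared) < 2:
        return 0.0
    xs = [ratings_a[item] for item in shared]
    ys = [ratings_b[item] for item in shared]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)

# Hypothetical users and ratings on a 1-10 scale.
alice = {"book_a": 9, "book_b": 3, "book_c": 7}
bob = {"book_a": 8, "book_b": 2, "book_c": 6, "book_d": 10}
carol = {"book_a": 2, "book_b": 9, "book_c": 4}

similarity_to_bob = pearson(alice, bob)
similarity_to_carol = pearson(alice, carol)
```

Under this sketch, a collaborative system would draw recommendations for alice from the users with the highest similarity. Note that the comparison only works for items both users rated, which is why such systems depend on a sufficient number of ratings.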

DRAWBACKS OF EXISTING SYSTEMS • Assume that a given user's tastes are generally the same as another user's • Assume that there is a sufficient number of ratings • Tend to recommend popular titles • Need sufficient information about other users, which raises concerns about privacy and access to customer data

ADVANTAGES OF CONTENT BASED SYSTEMS • Items are recommended based on the content of the item rather than on other users' preferences • Provides a way to list the content features that caused an item to be recommended • Allows users to provide initial subject information to aid the system

LIBRA (Learning Intelligent Book Recommending Agent) • A database of book information extracted from web pages at Amazon.com • Users select a set of training books and rate them on a scale of 1-10 • System learns a profile of the user using a Bayesian learning algorithm • Produces a ranked list of the most recommended additional titles from the system catalog

SYSTEM DESCRIPTION Extracting information and building a database • Perform an Amazon subject search • Download the book description URLs • Information Extraction using slots to get valuable information about each book • Current slots used are title, authors, published reviews and many more • A simple extraction system is sufficient as the layout of Amazon's automatically generated pages is regular • Some preprocessing is done (author names are converted into unique tokens of the form first_initial_last-name)
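The author-name preprocessing mentioned above might look like the sketch below. The slide gives only the pattern first_initial_last-name, so the underscore separator, the upper-casing, and the handling of middle initials are assumptions; `author_token` is a hypothetical helper name and the example name is made up.

```python
def author_token(full_name):
    """Collapse an author name into one token: first initial + last name.

    Assumptions (not stated in the slides): periods are ignored, middle
    names are dropped, and the token is upper-cased with an underscore.
    """
    parts = full_name.replace(".", " ").split()
    if not parts:
        return ""
    if len(parts) == 1:
        # Single-word names are kept whole.
        return parts[0].upper()
    return f"{parts[0][0]}_{parts[-1]}".upper()

# Hypothetical author name used purely for illustration.
token = author_token("Jane Q. Example")
```

Turning each author into a single token keeps a name from being split into unrelated words, so the classifier can treat an author as one feature of the author slot.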

SYSTEM DESCRIPTION • Learning a Profile • User selects titles to rate (for example, by searching for a particular author) • Users need not scan the entire database at random • Users rate the selected titles on a scale of 1-10 • A Naïve Bayesian text classifier is used to classify a book as either positive (6-10) or negative (1-5) • N training books $B_e$ ($1 \le e \le N$) • Each has 2 real weights • Positive weight $\alpha_{e1} = (r-1)/9$ • Negative weight $\alpha_{e0} = 1 - \alpha_{e1}$ • r = user rating ($1 \le r \le 10$)
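The mapping from a 1-10 rating to the two weights can be written directly from the formulas on this slide. The function names below are hypothetical; the arithmetic and the 6-10 / 1-5 split are taken from the slide.

```python
def rating_to_weights(r):
    """Return (positive_weight, negative_weight) for a rating r in 1..10."""
    if not 1 <= r <= 10:
        raise ValueError("rating must be between 1 and 10")
    positive = (r - 1) / 9      # 1 -> 0.0, 10 -> 1.0
    negative = 1 - positive     # the two weights always sum to 1
    return positive, negative

def is_positive(r):
    """Ratings 6-10 count as positive, 1-5 as negative."""
    return r >= 6

weights_for_top_rating = rating_to_weights(10)
weights_for_low_rating = rating_to_weights(1)
```

The effect is that a book rated 10 counts entirely as a positive example and a book rated 1 entirely as a negative one, while a middling rating contributes partially to both categories.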

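SYSTEM DESCRIPTION Parameters • $P(c_j) = \sum_{e=1}^{N} \alpha_{ej} / N$ • $P(w_k \mid c_j, s_m) = \sum_{e=1}^{N} \alpha_{ej} n_{kem} / L(c_j, s_m)$ • Where $\alpha_{ej}$ = weight of training book $B_e$ for category $c_j$ ($j = 1$ positive, $j = 0$ negative) and $n_{kem}$ = number of times word $w_k$ appears in example $B_e$ in slot $s_m$ • $L(c_j, s_m) = \sum_{e=1}^{N} \alpha_{ej} |d_m|$ denotes the total weighted length of the documents in category $c_j$ and slot $s_m$ • $d_m$ = the words in slot $s_m$ of a document, so $|d_m|$ is that slot's length • Strength: measures how much more likely a word in a slot is to appear in a positively rated book than in a negatively rated book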
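A rough sketch of how these parameters could be estimated from rated books follows. Two things in it are assumptions rather than content from the slide: Laplace (add-one) smoothing is added so that unseen words do not produce zero probabilities, and strength is taken to be the log of the ratio between the positive and negative word probabilities. The function names, slot name, and training data are hypothetical.

```python
from collections import defaultdict
from math import log

def train(books):
    """Estimate priors and word probabilities from (rating, slots) pairs.

    books: list of (rating 1-10, {slot_name: [word, ...]}).
    Returns (priors, cond), where priors[j] = P(c_j) and
    cond(word, j, slot) = smoothed P(word | c_j, slot).
    """
    n = len(books)
    class_weight = [0.0, 0.0]
    counts = defaultdict(float)    # (j, slot, word) -> sum of alpha_ej * n_kem
    lengths = defaultdict(float)   # (j, slot) -> L(c_j, s_m)
    vocab = defaultdict(set)       # slot -> distinct words seen in training
    for rating, slots in books:
        positive = (rating - 1) / 9
        alpha = (1 - positive, positive)   # alpha[0] negative, alpha[1] positive
        for j in (0, 1):
            class_weight[j] += alpha[j]
            for slot, words in slots.items():
                lengths[(j, slot)] += alpha[j] * len(words)
                for word in words:
                    counts[(j, slot, word)] += alpha[j]
                    vocab[slot].add(word)
    priors = [weight / n for weight in class_weight]

    def cond(word, j, slot):
        # Assumption: Laplace smoothing; the slot must have been seen in training.
        return (counts[(j, slot, word)] + 1) / (lengths[(j, slot)] + len(vocab[slot]))

    return priors, cond

def strength(cond, word, slot):
    """Assumed definition: log ratio of positive to negative probability."""
    return log(cond(word, 1, slot) / cond(word, 0, slot))

# Hypothetical training set: one book rated 10, one rated 1.
priors, cond = train([
    (10, {"words": ["mars", "rocket"]}),
    (1, {"words": ["romance", "rocket"]}),
])
mars_strength = strength(cond, "mars", "words")
```

With this toy data, a word that appears only in the highly rated book gets a positive strength, a word shared by both books gets a strength of zero, and a word that appears only in the low-rated book gets a negative strength.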

Sample Positive Profile Features
Slot            Word           Strength
WORDS           ZUBRIN         9.85
WORDS           SMOLIN         9.39
WORDS           TREFIL         8.77
WORDS           DOT            8.67
SUBJECTS        COMPARATIVE    8.39
AUTHOR          D GOLDSMITH    8.04
WORDS           ALH            7.97
WORDS           MANNED         7.97
RELATEDTITLES   SETTLE         7.91

SYSTEM DESCRIPTION Producing, Explaining and Revising Recommendations • Once a profile is learnt, it is used to predict the preferred ranking of the remaining books • Recommendations are reviewed by the user and the user may assign their own rating to the examples they believe to be incorrectly ranked • Retrain the system by repeating the above several times in order to produce the best results
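One way to turn a learned profile into a ranked list is sketched below. It assumes the profile is available as a table of per-slot word strengths expressed as log ratios, and that books are ordered by the resulting naive Bayes log-odds of being positive; both points are assumptions, as is ignoring words missing from the table. All names, strengths, and catalog entries are hypothetical.

```python
def log_odds(book_slots, strengths, prior_log_odds=0.0):
    """Sum the strengths of every word occurrence in every slot.

    Under the naive Bayes independence assumption, the log-odds of the
    positive category is the prior log-odds plus one strength term per
    word occurrence. Words absent from the table contribute 0 here.
    """
    score = prior_log_odds
    for slot, words in book_slots.items():
        for word in words:
            score += strengths.get((slot, word), 0.0)
    return score

def rank(catalog, strengths, prior_log_odds=0.0):
    """Return titles ordered from most to least recommended."""
    return sorted(
        catalog,
        key=lambda title: log_odds(catalog[title], strengths, prior_log_odds),
        reverse=True,
    )

# Hypothetical profile and unrated catalog.
strengths = {("WORDS", "mars"): 2.0, ("WORDS", "romance"): -1.5}
catalog = {
    "Book C": {"WORDS": ["romance"]},
    "Book A": {"WORDS": ["mars", "mission"]},
    "Book B": {"WORDS": ["romance", "mars"]},
}
ranking = rank(catalog, strengths)
```

Because each word contributes its own term to the score, the same table can also be used to explain a recommendation by listing the words that added the most to a book's score. After the user corrects some ratings, retraining produces a new strength table and the ranking is recomputed.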

EXPERIMENTAL RESULTS Data Collection • Several data sets were assembled (LIT1, LIT2, MYST, SCI, SF) • In order to present a quantitative picture of performance on a realistic sample, books were selected at random • If the user was not familiar with a book, the user was asked to give a rating based on the information provided by the Amazon page describing the book

EXPERIMENTAL RESULTS Performance Evaluation • Performed 10-fold cross validation on the examples • Various metrics were used to measure the performance • Classification accuracy (Acc): The percentage of examples correctly classified as positive or negative • Precision (Pr): The percentage of examples classified as positive that are actually positive
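The two metrics defined on this slide, and a simple way to form the ten folds, can be sketched as follows. The interleaved fold assignment is an assumption (the slides do not say how examples were split), and the labels in the example are hypothetical.

```python
def accuracy(actual, predicted):
    """Percentage of examples whose predicted class matches the actual class."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return 100.0 * correct / len(actual)

def precision(actual, predicted):
    """Percentage of examples predicted positive that are actually positive."""
    flagged = [a for a, p in zip(actual, predicted) if p]
    if not flagged:
        return 0.0   # nothing was predicted positive
    return 100.0 * sum(flagged) / len(flagged)

def k_fold_indices(n, k=10):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation.

    Assumption: example i is assigned to fold i mod k.
    """
    folds = [list(range(i, n, k)) for i in range(k)]
    for held_out in folds:
        held_out_set = set(held_out)
        train = [i for i in range(n) if i not in held_out_set]
        yield train, held_out

# Hypothetical labels: True = positive (rated 6-10), False = negative.
actual = [True, True, False, False, True]
predicted = [True, False, True, False, True]
acc = accuracy(actual, predicted)
pr = precision(actual, predicted)
splits = list(k_fold_indices(20, k=10))
```

In each fold the profile would be trained on the `train` indices and the metrics computed on the held-out indices, then averaged over the ten folds.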

EXPERIMENTAL RESULTS Discussion • User-selected examples vs. randomly selected examples • User-selected examples are better in that the user can rate them accurately • Randomly selected examples tend to cover the complete dataset • Conclusion: Avoid prematurely committing to a specific methodology

EXPERIMENTAL RESULTS • Can Collaborative and Content-Based approaches be combined to produce better results? • Slots: related authors, related titles (information that Amazon derives from other customers' behavior, so it is collaborative in origin) • When the above slots were removed, performance degraded • Use of both approaches together produces better results

FUTURE WORK • Web-Based interface (with a larger body of users) • Compare LIBRA’s Content-Based Approach to a standard Collaborative Approach • Maximize the utility of the small training set by using various Machine Learning techniques • Unsupervised learning • Active learning (incremental approach) • One effective approach – provide highly rated examples, generate initial recommendations, review the results, provide low rating for bad items and retrain the system to get new recommendations

CONCLUSIONS • Content-Based Approach holds the promise of being able to effectively recommend items that have not been rated • Provides accurate recommendations without any knowledge of other users' preferences • Incorporating collaborative information does provide better results • www.cs.utexas.edu/users/ml/recommender.html • Partially supported by NSF

QUESTIONS?
