Movie Info Web Search & Classification
Frankie Wu
CS224N Final Project, Spring 2008
Movie Info Search & Classification - Motivation
• Monetary Reward! Netflix Prize Contest
• $50,000 Incremental Prizes (annual)
• $1,000,000 Grand Prize
• Goal: predict how users will rate movies based on how they have rated other movies and how other users have rated all movies
• Only movie info given: title and year
• Assumption: users will rate similar movies similarly
• What is similar? One possibility: cast and crew
• Why not just use IMDB or Amazon's DVD database?
• The whole system must be commercially usable by Netflix.
• Entrants are even barred from using the Netflix movie database (oddly).
Movie Info Search & Classification - General Approach
• Data Collection
• Spider the web and collect web pages based on the movie title and year.
• Hand annotate data to create training and test sets.
• All new code.
• Classification
• Maximum Entropy Markov Model (MEMM) classifier to learn relative weights of hand-designed features on the training set.
• Viterbi decoder to find optimal label sequences on the test set (and eventually "real" unannotated data); see the sketch below.
• Code starting point: CS224N PA3.
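A minimal sketch of the decoding step above, assuming the MEMM has already produced a log-probability score[t][prev][cur] for assigning label cur to token t given previous label prev; the class name, array layout, and START-label convention are illustrative, not the PA3 or project code.

// Illustrative Viterbi decoder over per-token MEMM log-probabilities
// (a sketch, not the actual PA3/project code).
public class ViterbiSketch {

    // score[t][prev][cur] = log P(label = cur | previous label = prev, features of token t).
    // Label index 0 is assumed to be a designated START label used before token 0.
    public static int[] decode(double[][][] score) {
        int numTokens = score.length;
        int numLabels = score[0][0].length;
        double[][] best = new double[numTokens][numLabels]; // best log-prob of any path ending here
        int[][] back = new int[numTokens][numLabels];       // backpointers for recovering that path

        for (int cur = 0; cur < numLabels; cur++) {
            best[0][cur] = score[0][0][cur];                // token 0 conditioned on START
        }
        for (int t = 1; t < numTokens; t++) {
            for (int cur = 0; cur < numLabels; cur++) {
                best[t][cur] = Double.NEGATIVE_INFINITY;
                for (int prev = 0; prev < numLabels; prev++) {
                    double cand = best[t - 1][prev] + score[t][prev][cur];
                    if (cand > best[t][cur]) {
                        best[t][cur] = cand;
                        back[t][cur] = prev;
                    }
                }
            }
        }

        // Pick the best final label, then walk the backpointers to recover the label sequence.
        int[] labels = new int[numTokens];
        int bestLast = 0;
        for (int cur = 1; cur < numLabels; cur++) {
            if (best[numTokens - 1][cur] > best[numTokens - 1][bestLast]) bestLast = cur;
        }
        labels[numTokens - 1] = bestLast;
        for (int t = numTokens - 1; t > 0; t--) {
            labels[t - 1] = back[t][labels[t]];
        }
        return labels;
    }
}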
Movie Info Search & Classification - Data Collection
• Yahoo! Web Search API to search the web
• Java program harness (sketched below)
• 100 movies (first 100 of the 17,700-movie Netflix list)
• 50 web pages per movie (or fewer if unavailable)
• Save HTML files locally
• Replace with own web crawler in a production system
• Data Annotation
• Hand build information files for the 100 movies
• ACTOR, DIRECTOR, SCREENWRITER, PRODUCER, COMPOSER
• Programmatically annotate the 5,000 movie web pages (imperfect)
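A minimal sketch of the collection harness described above. The Yahoo! Web Search call is abstracted behind a hypothetical SearchClient interface (the real API and its wrapper are not shown), and the query format and file naming are illustrative.

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;
import java.util.List;

// Hypothetical wrapper around the search service; the actual Yahoo! Web Search API calls differ.
interface SearchClient {
    List<String> search(String query, int maxResults) throws IOException; // returns result URLs
}

public class MovieSpider {

    // Fetch up to 50 result pages for one movie and save the raw HTML locally for later annotation.
    public static void collect(SearchClient client, String title, int year, File outDir) throws IOException {
        outDir.mkdirs();
        List<String> urls = client.search(title + " " + year + " movie", 50);
        int i = 0;
        for (String url : urls) {
            try (InputStream in = new URL(url).openStream();
                 OutputStream out = new FileOutputStream(new File(outDir, "page" + (i++) + ".html"))) {
                byte[] buf = new byte[8192];
                int n;
                while ((n = in.read(buf)) != -1) out.write(buf, 0, n);
            } catch (IOException e) {
                System.err.println("Skipping " + url + ": " + e.getMessage()); // dead or slow links are common
            }
        }
    }
}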
Movie Info Search & Classification - Classification: Breadth vs. Depth
• Initially wanted to use 80 movies x 50 files for the training set and 20 movies x 50 files for the test set.
• Too much training data: computationally impractical.
• Which is the better compromise?
• Breadth: 80 movies x 10 files = 800 files
• Depth: 10 movies x 50 files = 500 files
• Speed: Depth faster than Breadth, 5m vs. 8m (expected)
• Accuracy: Depth F-measure ~3x better than Breadth (surprising?)
Movie Info Search & Classification - Classification: Features
• Features Hand Built (see the sketch below)
• Word and Previous Label (a la PA3)
• Bigrams and Trigrams
• Name-Shaped Words (initial caps)
• Name-Shaped Bigrams and Trigrams
• Nearby strings: star, act, direct, produc, compos
• Individual Feature Contribution
• Determined by turning off features one at a time
• Best and worst features? Still being determined at the time of this writing.
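A minimal sketch of the feature families listed above, written against a simple tokens/previous-label interface; the feature names, the +/-3 token cue window, and the exact set are illustrative rather than the project's actual code.

import java.util.ArrayList;
import java.util.List;

// Illustrative MEMM feature extractor covering the hand-built feature types listed above
// (a sketch; the project's actual features and naming may differ).
public class MovieFeatureExtractor {

    private static final String[] CUES = {"star", "act", "direct", "produc", "compos"};

    public static List<String> extract(String[] words, int i, String prevLabel) {
        List<String> feats = new ArrayList<>();
        String w = words[i];

        feats.add("WORD=" + w);                              // word and previous label (a la PA3)
        feats.add("PREVLABEL=" + prevLabel);
        if (i >= 1) feats.add("BIGRAM=" + words[i - 1] + "_" + w);
        if (i >= 2) feats.add("TRIGRAM=" + words[i - 2] + "_" + words[i - 1] + "_" + w);

        if (isNameShaped(w)) {
            feats.add("NAMESHAPE");                          // initial-caps word
            if (i >= 1 && isNameShaped(words[i - 1])) feats.add("NAMESHAPE_BIGRAM");
            if (i >= 2 && isNameShaped(words[i - 1]) && isNameShaped(words[i - 2]))
                feats.add("NAMESHAPE_TRIGRAM");
        }

        // Cue strings such as "direct" or "produc" appearing within a few tokens of this word.
        for (int j = Math.max(0, i - 3); j < Math.min(words.length, i + 4); j++) {
            String lower = words[j].toLowerCase();
            for (String cue : CUES) {
                if (lower.contains(cue)) feats.add("NEARBY=" + cue);
            }
        }
        return feats;
    }

    private static boolean isNameShaped(String w) {
        return !w.isEmpty() && Character.isUpperCase(w.charAt(0));
    }
}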
Movie Info Search & Classification - Results
• Best results at the time of this writing:
• ACTOR:
• precision: 60.0% (161/268)
• recall: 2.5% (161/6476)
• f-measure: 4.8%
• In general, a disappointing result.
• Highly skewed toward better precision than recall.
• Likely due to extreme variance in data format: the pages are virtually free-form.
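For reference, the reported f-measure follows from the precision and recall above as the balanced F1 score; the very low recall dominates the harmonic mean, which is why F1 stays near 4.8% despite reasonable precision:

F_1 = \frac{2PR}{P + R} = \frac{2 \times 0.600 \times 0.025}{0.600 + 0.025} \approx 0.048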