150 likes | 164 Views
Explore modeling examples to predict fraud in advance based on customer behavior and demographic data. Learn techniques for heart attack prediction, cancer model testing, and more. Dive into customizing spam detection and image classification systems for practical applications.
E N D
Our goal is to build model to predict fraud in advance We can see associations between customer type and fraudulent behavior. Are they legitimate? Data leakage?
Predict whether someone will have a heart attack on the basis of demographic, diet and clinical measurements
Model test results for prostate cancer (lpsa), based on actual cancer volume (lcavol) and other clinical and demographic variables. ESL Chap1 - Introduction
Classify a recorded phoneme, based on a log-periodogram. A restricted model (red) does much better than an unrestricted one (jumpy black)
Customize an email spam detection system. X = which words appear and how much Y = Spam or not?
Identify the numbers in a handwritten zip code, from a digitized image X = color of each pixelY = which digit is it?
Classify a tissue sample into one of several cancer classes, based on a gene expression profile. X = expression levels of genesY = which cancer?
Classify the pixels in a LANDSAT image, according to usage:Y = {red soil, cotton, vegetation stubble, mixture, gray soil, damp gray soil, very damp gray soil}X = values of pixels in several wavelength bands
October 2006 Announcement of the NETFLIX Competition USAToday headline: “Netflix offers $1 million prize for better movie recommendations” Details: • Beat NETFLIX current recommender model ‘Cinematch’ by 10% based on absolute rating error prior to 2011 • $50K for the annual progress price (relative to baseline) • Data contains a subset of 100 million movie ratings from NETFLIX including 480,189 users and 17,770 movies • Performance is evaluated on holdout movies-users pairs • NETFLIX competition has attracted 45878 contestants on 37660 teams from 180 different countries • Tens of thousands of valid submissions from thousands of teams • Conclusion: in 2009, an international team attained the goal and won the prize! More later…
Data Overview: NETFLIX Internet Movie Data Base All movies (80K) 17K Selection unclear All users (6.8 M) 480 K At least 20 Ratings by end 2005 NETFLIX Competition Data 100 M ratings
NETFLIX data generation process User Arrival Movie Arrival 17K movies Training Data 1998 Time 2005 Qualifier Dataset 3M
Netflix in the class • We will demonstrate many of the methods we discuss on a simplified version of the Netflix dataset • The $1M was won in 2009 by a collaboration of several leading teams • The strongest team, which won both yearly $50K prizes, was founded at AT&T, with an Israeli participant (Yehuda Koren) • I have Yehuda’s presentation on their work, and if time allows it we will discuss it in class briefly • While I was at IBM Research, our team won a related competition in KDD-Cup 2007 (same data, more “standard” modeling tasks) • We may have a “case study” lecture on that as well
Project evolution and relevance to our course Business problem definition Modeling problem definition Statistical problem definition Modeling methodology design Targeting, Sales force mgmt. Wallet / opportunity estimation Quantile est.,Latent variable est. Quantile est.,Graphical model Outside scope Model generation & validation Implementation & application development Keep in mind Programming,Simulation,IBM Wallets OnTarget,MAP This is our domain!