Project 1: KDD 2009 Orange Challenge COMP 4332 Tutorial 3 Feb 2 Yin Zhu yinz@cse.ust.hk
All information on this website • http://www.kddcup-orange.com/
The story behind the challenge
• French telecom company Orange.
• Task: predict the propensity of customers to
  • switch provider (churn),
  • buy new products or services (appetency), or
  • buy upgrades or add-ons proposed to them to make the sale more profitable (up-selling).
• Estimate the churn, appetency and up-selling probability of customers.
Data, constraints and requirements
• Train and deploy requirements
  • About one hundred models per month
  • Fast data preparation and modeling
  • Fast deployment
• Model requirements
  • Robust
  • Accurate
  • Understandable
• Business requirement
  • Return on investment for the whole process
• Input data
  • Relational databases
  • Numerical or categorical
  • Noisy
  • Missing values
  • Heavily unbalanced distribution
• Train data
  • Hundreds of thousands of instances
  • Tens of thousands of variables
• Deployment
  • Tens of millions of instances
In-house system: from raw data to scoring models
• Data warehouse
  • Relational database
• Data mart
  • Star schema
• Feature construction
  • PAC technology
  • Generates tens of thousands of variables
• Data preparation and modeling
  • Khiops technology
[Figure: pipeline diagram with the labels "Data feeding", "PAC", "Khiops" and "scoring model"]
Design of the challenge
• Orange business objective
  • Benchmark the in-house system against state-of-the-art techniques
• Data
  • Data store
    • Not an option
  • Data warehouse
    • Confidentiality and scalability issues
    • Relational data requires domain knowledge and specialized skills
  • Tabular format
    • Standard format for the data mining community
    • Domain knowledge incorporated using feature construction (PAC)
    • Easy anonymization
• Tasks
  • Three representative marketing tasks
• Requirements
  • Fast data preparation and modeling (fully automatic)
  • Accurate
  • Fast deployment
  • Robust
  • Understandable
Data set extraction and preparation
• Input data
  • 10 relational tables
  • A few hundred fields
  • One million customers
• Instance selection
  • Resampling given the three marketing tasks
  • Keep 100 000 instances, with less unbalanced target distributions
• Variable construction
  • Using PAC technology
  • 20 000 constructed variables to get a tabular representation
  • Keep 15 000 variables (discard constant variables)
  • Small track: subset of 230 variables related to classical domain knowledge
• Anonymization
  • Discard variable names, discard identifiers
  • Randomize order of variables
  • Rescale each numerical variable by a random factor
  • Recode each categorical variable using random category names
• Data samples
  • 50 000 train and test instances sampled randomly
  • 5000 validation instances sampled randomly from the test set
Scientific and technical challenge
• Scientific objective
  • Fast data preparation and modeling: within five days
  • Large scale: 50 000 train and test data, 15 000 variables
  • Heterogeneous data
    • Numerical with missing values
    • Categorical with hundreds of values
    • Heavily unbalanced distribution
• KDD social meeting objective
  • Attract as many participants as possible
    • Additional small track and slow track
    • Online feedback on validation dataset
    • Toy problem (only one informative input variable)
  • Leverage challenge protocol overhead
    • One month to explore descriptive data and test submission protocol
  • Attractive conditions
    • No intellectual property conditions
    • Money prizes
Data
• Each customer is a data instance with three labels: churn, appetency and up-selling (each -1 or 1).
• The feature vector for each customer comes in two versions: small (230 variables) and large (15,000 variables, sparse!).
• For the large dataset, the first 14,740 variables are numerical and the last 260 are categorical.
• For the small dataset, the first 190 variables are numerical and the last 40 are categorical.
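The column layout of the small dataset (190 numerical variables followed by 40 categorical ones) can be sketched in Python as below. This is only an illustration: the helper name `split_row`, the tab separator and the convention that an empty field means a missing value are assumptions, not something stated on the slides, so check them against the downloaded files.

```python
# Minimal sketch, assuming tab-separated rows where an empty field is a missing value.
# split_row is a hypothetical helper, not part of any challenge toolkit.

def split_row(line, n_numeric=190, n_categorical=40, sep="\t"):
    """Split one data row into numerical and categorical parts.

    Missing fields are returned as None in both parts.
    """
    fields = line.rstrip("\r\n").split(sep)
    if len(fields) != n_numeric + n_categorical:
        raise ValueError(
            f"expected {n_numeric + n_categorical} fields, got {len(fields)}"
        )
    # First block: numerical variables, parsed as floats.
    numeric = [float(f) if f != "" else None for f in fields[:n_numeric]]
    # Second block: categorical variables, kept as strings.
    categorical = [f if f != "" else None for f in fields[n_numeric:]]
    return numeric, categorical
```

For the large dataset the same helper would be called with `n_numeric=14740` and `n_categorical=260`.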
Training and Testing
• Training: 50,000 samples with labels for churn, appetency and up-selling
• Testing: 50,000 samples without labels
• TASK: predict a score for each customer in each task
• Play with the data: DEMO in R
Binary classification. But predicting a Score??? http://www.kddcup-orange.com/evaluation.php
How is AUC calculated?
• Sort the samples by predicted score in increasing order, $s_1 \le s_2 \le \dots \le s_n$, where
  • $s_i$ is the score of the i-th sample, and
  • $y_i$ is the true label (-1 or 1) of the i-th sample.
• For each $s_i$, use it as a threshold:
  • Classify samples 1 to i as negative (-1).
  • Classify samples i+1 to n as positive (+1).
  • Calculate Sensitivity = tp/pos and Specificity = tn/neg; this gives one point on the curve.
• Calculate the area under the curve.
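The threshold sweep described above can be sketched in plain Python. This is an illustrative implementation, not the official evaluation code of the challenge: the function name `auc_by_threshold_sweep` is hypothetical, tied scores are grouped into a single threshold (an assumption), and the area is computed with the trapezoid rule over the (specificity, sensitivity) points.

```python
# Minimal sketch of AUC via a threshold sweep; labels are assumed to be -1 or +1.

def auc_by_threshold_sweep(scores, labels):
    """Area under the sensitivity-vs-specificity curve."""
    pairs = sorted(zip(scores, labels))  # increasing score
    n = len(pairs)
    pos = sum(1 for _, y in pairs if y == 1)
    neg = n - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative sample")

    # Start with everything classified positive: specificity 0, sensitivity 1.
    points = [(0.0, 1.0)]
    tn = 0  # negatives correctly classified as negative so far
    fn = 0  # positives wrongly classified as negative so far
    for i, (s, y) in enumerate(pairs):
        # Samples 1..i+1 are now classified as negative.
        if y == 1:
            fn += 1
        else:
            tn += 1
        # Tied scores share one threshold: wait until the last of the group.
        if i + 1 < n and pairs[i + 1][0] == s:
            continue
        sensitivity = (pos - fn) / pos  # tp / pos
        specificity = tn / neg          # tn / neg
        points.append((specificity, sensitivity))

    # Trapezoid rule over consecutive curve points.
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

Because only the ordering of the scores matters, any monotone rescaling of a classifier's output gives the same AUC, which is why the task asks for a score rather than a hard -1/+1 decision.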
How to deal with categorical values
• Binarization:
  • { A, B, C } -> create 3 binary variables
• Ordinalization:
  • { A, B, C } -> { 1, 2, 3 }
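Both encodings can be sketched in a few lines of Python. The function names `binarize` and `ordinalize` are hypothetical, and sorting the categories alphabetically to fix the column and code order is an assumption made only so that the output is reproducible.

```python
# Minimal sketch of the two encodings for one categorical column.

def binarize(values):
    """One binary variable per category (a.k.a. one-hot encoding)."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


def ordinalize(values):
    """Map each category to an integer code 1, 2, 3, ..."""
    categories = sorted(set(values))
    code = {c: i + 1 for i, c in enumerate(categories)}
    return [code[v] for v in values]


column = ["A", "B", "C", "A"]
print(binarize(column))    # (['A', 'B', 'C'], [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]])
print(ordinalize(column))  # [1, 2, 3, 1]
```

Note the trade-off: binarization adds one column per category, which grows quickly for variables with hundreds of values, while ordinalization keeps one column but imposes an arbitrary order on the categories.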
Project 1 Requirement
• Deadline: 15 March 2012
• Team: 1 or 2 students
• Do the competition
  • Register at http://www.kddcup-orange.com/register.php
  • Download the data
  • Try classifiers and ensemble methods, and submit your result
• 50% of the score is for your ranking on the website
• 50% of the score is for the report: what you have tried and, more importantly, what you have found
  • Preprocessing steps
  • Classifiers
  • Ensemble methods
Assignment 1
• Deadline: 25 Feb 2012
• Data exploration and experiment plan
  • What you have found in the data, e.g. data imbalance, various statistics over the data, data preprocessing methods you want to apply or have applied, etc.
  • A plan of the classification methods (SVM, kNN, naive Bayes, etc.) and ensemble methods you want to try. You should be familiar with the tools and their I/O formats.
• A report of at least three pages
• Basically, Assignment 1 is a mid-term/progress report for the project.
Winning methods
• Fast track:
  • IBM Research, USA +: Ensemble of a wide variety of classifiers. Effort put into coding (most frequent values coded with binary features, missing values replaced by mean, extra features constructed, etc.).
  • ID Analytics, Inc., USA +: Filter + wrapper FS. TreeNet by Salford Systems, an additive boosting decision tree technology; bagging also used.
  • David Slate & Peter Frey, USA: Grouping of modalities/discretization, filter FS, ensemble of decision trees.
• Slow track:
  • University of Melbourne: CV-based FS targeting AUC. Boosting with classification trees and shrinkage, using Bernoulli loss.
  • Financial Engineering Group, Inc., Japan: Grouping of modalities, filter FS using AIC, gradient tree-classifier boosting.
  • National Taiwan University +: Average of 3 classifiers: (1) joint multiclass problem solved with an l1-regularized maximum entropy model; (2) AdaBoost with tree-based weak learner; (3) Selective Naïve Bayes.
• (+: small dataset unscrambling)
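Two of the coding steps mentioned for the fast-track winner (missing values replaced by the mean, most frequent values coded with binary features) can be sketched as follows. This is not the winning team's code: the function names, the `top_k` parameter and the fallback value 0.0 for a fully missing column are all assumptions made for illustration.

```python
# Minimal sketch of mean imputation and frequent-value binary coding.
# None stands for a missing value in both functions.
from collections import Counter


def impute_mean(column):
    """Replace missing numerical values by the mean of the observed ones."""
    observed = [v for v in column if v is not None]
    # Assumption: fall back to 0.0 when the whole column is missing.
    mean = sum(observed) / len(observed) if observed else 0.0
    return [mean if v is None else v for v in column]


def code_frequent_values(column, top_k=2):
    """Create one binary feature for each of the top_k most frequent values."""
    counts = Counter(v for v in column if v is not None)
    frequent = [value for value, _ in counts.most_common(top_k)]
    rows = [[1 if v == f else 0 for f in frequent] for v in column]
    return frequent, rows
```

Rare categories and missing values both end up as all-zero rows, which keeps the number of new columns bounded even for categorical variables with hundreds of values.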
Fact Sheets: Preprocessing & Feature Selection
[Figure: two bar charts, horizontal axis "Percent of participants"]
• PREPROCESSING (overall usage = 95%): replacement of the missing values, discretization, normalizations, grouping modalities, other preprocessing, Principal Component Analysis
• FEATURE SELECTION (overall usage = 85%): feature ranking, filter method, other FS, forward/backward wrapper, embedded method, wrapper with search
Fact Sheets: Classifier
[Figure: bar chart, horizontal axis "Percent of participants"]
• CLASSIFIER (overall usage = 93%): Decision tree..., linear classifier, non-linear kernel, other classifier, neural network, Naïve Bayes, nearest neighbors, Bayesian network, Bayesian neural network
• About 30% logistic loss, >15% exp loss, >15% sq loss, ~10% hinge loss.
• Less than 50% regularization (20% 2-norm, 10% 1-norm).
• Only 13% used unlabeled data.
Fact Sheets: Model Selection
[Figure: bar chart, horizontal axis "Percent of participants"]
• MODEL SELECTION (overall usage = 90%): 10% test, K-fold or leave-one-out, out-of-bag estimate, bootstrap estimate, other MS, other cross-validation, virtual leave-one-out, penalty-based, bi-level, Bayesian
• About 75% ensemble methods (1/3 boosting, 1/3 bagging, 1/3 other).
• About 10% used unscrambling.
Fact Sheets: Implementation
[Figure: four charts]
• Memory: <= 2 GB, <= 8 GB, > 8 GB, >= 32 GB
• Parallelism: none, multi-processor, run in parallel
• Software platform: C, C++, Java, Matlab, other (R, SAS)
• Operating system: Windows, Linux, Unix, Mac OS