KDD Cup 2009 – Fast Scoring on a Large Database
Presentation of the Results at the KDD Cup Workshop, June 28, 2009
The Organizing Team

KDD Cup 2009 Organizing Team
• Project team at Orange Labs R&D: Vincent Lemaire, Marc Boullé, Fabrice Clérot, Raphaël Féraud, Aurélie Le Cam, Pascal Gouzien
• Beta testing and proceedings editor: Gideon Dror
• Web site design: Olivier Guyon (MisterP.net, France)
• Coordination (KDD Cup co-chairs): Isabelle Guyon, David Vogel
Thanks to our sponsors… • Orange • ACM SIGKDD • Pascal • Unipen • Google • Health Discovery Corp • Clopinet • Data Mining Solutions • MPS
Participation Statistics
• 1299 registered teams
• 7865 entries
• 46 countries represented
A worldwide operator
• One of the main telecommunication operators in the world
• Providing services to more than 170 million customers over five continents
• Including 120 million under the Orange brand
KDD Cup 2009 organized by Orange: Customer Relationship Management (CRM)
• Three marketing tasks: predict the propensity of customers
  • to switch provider: Churn
  • to buy new products or services: Appetency
  • to buy upgrades or new options proposed to them: Up-selling
• Objective: improve the return on investment (ROI) of marketing campaigns
  • Increase the efficiency of the campaign for a given campaign cost
  • Decrease the campaign cost for a given marketing objective
  • Better prediction leads to better ROI
Data, constraints and requirements
• Train and deploy requirements
  • About one hundred models per month
  • Fast data preparation and modeling
  • Fast deployment
• Model requirements
  • Robust, accurate, understandable
• Business requirement
  • Return on investment for the whole process
• Input data
  • Relational databases
  • Numerical or categorical variables
  • Noisy, with missing values
  • Heavily unbalanced class distribution
• Train data
  • Hundreds of thousands of instances
  • Tens of thousands of variables
• Deployment
  • Tens of millions of instances
In-house system: from raw data to scoring models
• Data warehouse: relational database
• Data mart: star schema
• Feature construction: PAC technology, generates tens of thousands of variables
• Data preparation and modeling: Khiops technology
• Pipeline: data feeding → PAC → Khiops → scoring model
Design of the challenge
• Orange business objective
  • Benchmark the in-house system against state-of-the-art techniques
• Data
  • Data store: not an option
  • Data warehouse: confidentiality and scalability issues; relational data requires domain knowledge and specialized skills
  • Tabular format: standard format for the data mining community; domain knowledge incorporated using feature construction (PAC); easy anonymization
• Tasks
  • Three representative marketing tasks
• Requirements
  • Fast data preparation and modeling (fully automatic)
  • Accurate
  • Fast deployment
  • Robust
  • Understandable
Data sets extraction and preparation
• Input data
  • 10 relational tables
  • A few hundred fields
  • One million customers
• Instance selection
  • Resampling according to the three marketing tasks
  • Keep 100 000 instances, with less unbalanced target distributions
• Variable construction
  • Using PAC technology
  • 20 000 constructed variables to get a tabular representation
  • Keep 15 000 variables (discard constant variables)
  • Small track: subset of 230 variables related to classical domain knowledge
• Anonymization
  • Discard variable names, discard identifiers
  • Randomize order of variables
  • Rescale each numerical variable by a random factor
  • Recode each categorical variable using random category names
• Data samples
  • 50 000 train and 50 000 test instances sampled randomly
  • 5 000 validation instances sampled randomly from the test set
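The anonymization and sampling steps above are easy to reproduce on any tabular extract. Below is a minimal sketch assuming the data are already in a pandas DataFrame; the function names, random seed and rescaling range are illustrative assumptions, not the organizers' actual code.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def anonymize(df: pd.DataFrame) -> pd.DataFrame:
    """Anonymize a tabular extract: drop names, shuffle columns, rescale
    numerical variables, recode categorical values with random names."""
    df = df[rng.permutation(df.columns)]                  # randomize variable order
    out = {}
    for i, col in enumerate(df.columns):
        s = df[col]
        if pd.api.types.is_numeric_dtype(s):
            out[f"Var{i}"] = s * rng.uniform(0.5, 2.0)    # random rescaling factor
        else:
            names = {v: f"cat_{rng.integers(10**9)}" for v in s.dropna().unique()}
            out[f"Var{i}"] = s.map(names)                 # random category names
    return pd.DataFrame(out)

def make_samples(df, n_train=50_000, n_test=50_000, n_valid=5_000):
    """Random train/test samples; validation is a random subset of the test set."""
    idx = rng.permutation(len(df))
    train = df.iloc[idx[:n_train]]
    test = df.iloc[idx[n_train:n_train + n_test]]
    valid = test.sample(n=n_valid, random_state=0)
    return train, test, valid
```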
Scientific and technical challenge
• Scientific objective
  • Fast data preparation and modeling: within five days
  • Large scale: 50 000 train and test instances, 15 000 variables
  • Heterogeneous data: numerical with missing values, categorical with hundreds of values
  • Heavily unbalanced class distribution
• KDD social meeting objective: attract as many participants as possible
  • Additional small track and slow track
  • Online feedback on the validation dataset
  • Toy problem (only one informative input variable)
  • Leverage challenge protocol overhead: one month to explore descriptive data and test the submission protocol
  • Attractive conditions: no intellectual property conditions, money prizes
Business impact of the challenge
• Bring Orange datasets to the data mining community
  • Benefit for the community: access to challenging data
  • Benefit for Orange: benchmark of numerous competing techniques; drive research efforts towards Orange needs; evaluate the Orange in-house system
• High number of participants and high quality of results
• Orange in-house results:
  • Improved by a significant margin when leveraging all business requirements
  • Almost Pareto optimal when other criteria are considered (automation, very fast train and deploy, robustness and understandability)
  • Need to study the best challenge methods to get more insights
KDD Cup 2009: Result Analysis
[Figure: test AUC over time. Legend: Best Result (over the period considered in the figure), In-House System (downloadable at www.khiops.com), Baseline (Naïve Bayes).]
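The baseline in the result figures is a Naïve Bayes classifier scored with the test AUC. A minimal sketch of such a baseline, assuming the numerical features and binary labels are already loaded as arrays; the scikit-learn pipeline below is an illustrative stand-in, not the organizers' implementation.

```python
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import roc_auc_score

# X_train, y_train, X_test, y_test are assumed to be loaded already
# (numerical features with missing values, heavily unbalanced binary labels).
baseline = make_pipeline(
    SimpleImputer(strategy="mean"),   # handle the many missing values
    GaussianNB(),
)
baseline.fit(X_train, y_train)
scores = baseline.predict_proba(X_test)[:, 1]     # score = P(positive class)
print("Baseline test AUC:", roc_auc_score(y_test, scores))
```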
Overall – Test AUC – Fast track
[Figure: best results on each dataset vs. all submissions; good results were obtained very quickly.]
• In-House (Orange) system:
  • No parameters
  • Runs on one standard laptop (single processor)
  • Results obtained when the three tasks are treated as three separate problems
Overall – Test AUC – Fast track
[Figure: test AUC over the fast-track period.] Very fast good results; small improvement after the first day (83.85 → 84.93).
Overall – Test AUC – Slow track
[Figure: test AUC over the slow-track period.] Very small improvement after the 5th day (84.93 → 85.2). Improvement due to unscrambling?
Overall – Test AUC – Submissions
Among submissions with AUC > 0.5:
• 23.24% scored below the Baseline
• 15.25% scored above the In-House system
• 84.75% scored below the In-House system
Overall – Test AUC: 'correlation' between test and train AUC?
[Scatter plot of train vs. test AUC; annotated regions: random values submitted, boosting method or train target submitted, overfitting.]
Overall – Test AUC
[Figure: distributions of test AUC after 12 hours, 24 hours, 5 days, and 36 days.]
Overall – Test AUC
• Difference between the best result at the end of the first day and the best result at the end of the 36 days: Δ = 1.35%
• What was the additional time used for?
  • time to adjust model parameters?
  • time to train ensemble methods?
  • time to find more processors?
  • time to test more methods?
  • time to unscramble?
  • …
[Figure: test AUC after 12 hours vs. after 36 days.]
Test AUC = f(time)
[Figure: Churn, Appetency and Up-selling test AUC over days 0–36. Annotations: harder? easier?]
Test AUC = f(time)
[Figure: Churn, Appetency and Up-selling test AUC over days 0–36.]
• Difference between the best result at the end of the first day and the best result at the end of the 36 days: Churn Δ = 1.84%, Appetency Δ = 1.38%, Up-selling Δ = 0.11%
• Harder? Easier?
Correlation Test AUC / Valid AUC (5 days)
[Figure: Churn, Appetency and Up-selling test vs. validation AUC over days 0–5. Annotations: harder? easier?]
Correlation Train AUC / Valid AUC (36 days)
[Figure: Churn, Appetency and Up-selling test vs. train AUC over days 0–36.]
Hard to draw any firm conclusion…
Histogram: Test AUC / Valid AUC (days [0:5] vs. ]5:36])
[Figure: Churn, Appetency and Up-selling test AUC histograms for days [0:36], then for days ]5:36].]
Does the knowledge (parameters?) found during the first 5 days help afterwards? YES!
Fact Sheets: Preprocessing & Feature Selection
• PREPROCESSING (overall usage = 95%) [bar chart, percent of participants]: replacement of the missing values, discretization, normalizations, grouping modalities, other preprocessing, principal component analysis
• FEATURE SELECTION (overall usage = 85%) [bar chart, percent of participants]: feature ranking, filter method, other FS, forward/backward wrapper, embedded method, wrapper with search
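To make the most frequent preprocessing steps concrete (missing-value replacement, discretization, grouping of modalities), here is a small sketch; the thresholds and helper name are hypothetical, not taken from any participant's entry.

```python
import pandas as pd

def preprocess(df: pd.DataFrame, n_bins: int = 10, min_count: int = 100) -> pd.DataFrame:
    """Common KDD Cup 2009 preprocessing: impute, discretize, group rare modalities."""
    out = df.copy()
    for col in out.columns:
        s = out[col]
        if pd.api.types.is_numeric_dtype(s):
            s = s.fillna(s.mean())                              # replace missing values
            out[col] = pd.qcut(s, q=n_bins, duplicates="drop")  # equal-frequency discretization
        else:
            s = s.fillna("MISSING")                             # missing value as its own modality
            counts = s.value_counts()
            rare = counts[counts < min_count].index
            out[col] = s.where(~s.isin(rare), "OTHER")          # group rare modalities
    return out
```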
Fact Sheets: Classifier
• CLASSIFIER (overall usage = 93%) [bar chart, percent of participants]: decision tree, linear classifier, non-linear kernel, other classifier, neural network, Naïve Bayes, nearest neighbors, Bayesian network, Bayesian neural network
• About 30% logistic loss, >15% exponential loss, >15% squared loss, ~10% hinge loss
• Less than 50% used regularization (20% 2-norm, 10% 1-norm)
• Only 13% used unlabeled data
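As an illustration of the most common combination above, logistic loss with a 2-norm penalty corresponds to L2-regularized logistic regression as a linear classifier. A short sketch using scikit-learn; the data arrays and hyperparameters are placeholders carried over from the earlier sketches.

```python
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import roc_auc_score

# Logistic loss + 2-norm penalty = L2-regularized logistic regression.
clf = make_pipeline(
    SimpleImputer(strategy="median"),            # missing values
    StandardScaler(),                            # put numerical variables on comparable scales
    LogisticRegression(penalty="l2", C=1.0, max_iter=1000),
)
clf.fit(X_train, y_train)                        # arrays assumed loaded as before
print("Linear classifier test AUC:",
      roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))
```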
Fact Sheets: Model Selection
• MODEL SELECTION (overall usage = 90%) [bar chart, percent of participants]: 10% test set, K-fold or leave-one-out, out-of-bag estimation, bootstrap estimation, other MS, other cross-validation, virtual leave-one-out, penalty-based, bi-level, Bayesian
• About 75% used ensemble methods (1/3 boosting, 1/3 bagging, 1/3 other)
• About 10% used unscrambling
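K-fold cross-validation was the most reported model-selection method; the sketch below estimates the AUC of any of the classifiers above with stratified folds (the classifier `clf` and the arrays are placeholders from the previous sketches).

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Stratified folds preserve the heavily unbalanced class distribution in each split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
auc_per_fold = cross_val_score(clf, X_train, y_train, cv=cv, scoring="roc_auc")
print("CV AUC: %.4f +/- %.4f" % (auc_per_fold.mean(), auc_per_fold.std()))
```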
Fact Sheets: Implementation
[Bar charts, percent of participants.]
• Memory: <= 2 GB, <= 8 GB, > 8 GB, >= 32 GB
• Parallelism: none, multi-processor, run in parallel
• Software platform: Matlab, Java, C, C++, other (R, SAS)
• Operating system: Windows, Linux, Unix, Mac OS
Winning methods
• Fast track:
  • IBM Research, USA (+): Ensemble of a wide variety of classifiers. Effort put into coding (most frequent values coded with binary features, missing values replaced by the mean, extra features constructed, etc.)
  • ID Analytics, Inc., USA (+): Filter + wrapper feature selection. TreeNet by Salford Systems, an additive boosted decision tree technology; bagging also used.
  • David Slate & Peter Frey, USA: Grouping of modalities / discretization, filter feature selection, ensemble of decision trees.
• Slow track:
  • University of Melbourne: CV-based feature selection targeting AUC. Boosting with classification trees and shrinkage, using the Bernoulli loss.
  • Financial Engineering Group, Inc., Japan: Grouping of modalities, filter feature selection using AIC, gradient tree-classifier boosting.
  • National Taiwan University (+): Average of 3 classifiers: (1) joint multiclass problem solved with an l1-regularized maximum entropy model; (2) AdaBoost with tree-based weak learners; (3) selective Naïve Bayes.
• (+): small dataset unscrambling
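Several winning entries relied on gradient-boosted decision trees with the Bernoulli (logistic) loss and shrinkage. The sketch below shows the generic technique with scikit-learn rather than any team's actual code (TreeNet is proprietary); all hyperparameters are illustrative.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

# Gradient tree boosting; the default loss is the logistic (Bernoulli) deviance,
# and learning_rate implements shrinkage. Hyperparameters are illustrative only.
gbt = GradientBoostingClassifier(
    learning_rate=0.05,     # shrinkage
    n_estimators=500,
    max_depth=5,
    subsample=0.7,          # stochastic boosting, a bagging-like flavour
)
# X_train / y_train assumed already imputed and encoded (see the preprocessing sketch).
gbt.fit(X_train, y_train)
print("Boosted trees test AUC:",
      roc_auc_score(y_test, gbt.predict_proba(X_test)[:, 1]))
```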
Conclusion
• Participation exceeded our expectations. We thank the participants for their hard work, our sponsors, and Orange, who offered:
  • A problem of real industrial interest with challenging scientific and technical aspects
  • Prizes
• Lessons learned:
  • Do not under-estimate the participants: five days were given for the fast challenge, but a few hours sufficed for some participants
  • Ensemble methods are effective
  • Ensembles of decision trees offer off-the-shelf solutions to problems with large numbers of samples and attributes, mixed variable types, and lots of missing values