RESULTS OF THE WCCI 2006 PERFORMANCE PREDICTION CHALLENGE
Isabelle Guyon, Amir Reza Saffari Azar Alamdari, Gideon Dror
Part I INTRODUCTION
Model selection • Selecting models (neural net, decision tree, SVM, …) • Selecting hyperparameters (number of hidden units, weight decay/ridge, kernel parameters, …) • Selecting variables or features (space dimensionality reduction). • Selecting patterns (data cleaning, data reduction, e.g. by clustering).
Performance prediction How good are you at predicting how good you are? • Practically important in pilot studies. • Good performance predictions render model selection trivial.
Why a challenge? • Stimulate research and push the state of the art. • Move towards fair comparisons and give a voice to methods that work but may not be backed up by theory (yet). • Find practical solutions to true problems. • Have fun…
History (timeline 1980–2005) • USPS/NIST. • Unipen (with Lambert Schomaker): 40 institutions share 5 million handwritten characters. • KDD cup, TREC, CASP, CAMDA, ICDAR, etc. • NIPS challenge on unlabeled data. • Feature selection challenge (with Steve Gunn): success! ~75 entrants, thousands of entries. • Pascal challenges. • Performance prediction challenge …
Challenge • Date started: Friday September 30, 2005. • Date ended: Wednesday March 1, 2006. • Duration: 21 weeks. • Estimated number of entrants: 145. • Number of development entries: 4228. • Number of ranked participants: 28. • Number of ranked submissions: 117.
Datasets (http://www.modelselect.inf.ethz.ch/)
• ADA (dense; marketing): 48 features; 4147 training / 415 validation / 41471 test examples.
• GINA (dense; digits): 970 features; 3153 training / 315 validation / 31532 test examples.
• HIVA (dense; drug discovery): 1617 features; 3845 training / 384 validation / 38449 test examples.
• NOVA (sparse binary; text classification): 16969 features; 1754 training / 175 validation / 17537 test examples.
• SYLVA (dense; ecology): 216 features; 13086 training / 1308 validation / 130858 test examples.
BER distribution (figure: distribution of the test BER across submissions).
Results. Overall winners for ranked entries:
• Average rank: Roman Lutz, with LB tree mix cut adapted.
• Average score: Gavin Cawley, with Final #2.
• ADA: Marc Boullé, with SNB(CMA) + 10k F(2D) tv or SNB(CMA) + 100k F(2D) tv.
• GINA: Kari Torkkola & Eugene Tuv, with ACE+RLSC.
• HIVA: Gavin Cawley, with Final #3 (corrected).
• NOVA: Gavin Cawley, with Final #1.
• SYLVA: Marc Boullé, with SNB(CMA) + 10k F(3D) tv.
• Best AUC: Radford Neal, with Bayesian Neural Networks.
Part II PROTOCOL and SCORING
Protocol • Data split: training/validation/test. • Data proportions: 10/1/100. • Online feedback on validation data. • Validation label release one month before the end of the challenge. • Final ranking on test data using the last five complete submissions of each entrant.
Performance metrics • Balanced Error Rate (BER): average of the error rates on the positive class and the negative class. • Guess error: dBER = abs(testBER – guessedBER). • Area Under the ROC Curve (AUC). (A sketch of these quantities follows below.)
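For concreteness, here is a minimal Matlab sketch of these quantities (an illustration, not the organizers' scoring code). It assumes labels in {-1,+1}, predicted labels Ypred, real-valued classifier outputs scores, and the participant's guessedBER; the AUC is computed from the rank-sum statistic and ties are ignored.

function [ber, dber, auc] = challenge_metrics(Ytrue, Ypred, scores, guessedBER)
% Illustrative helper (hypothetical name), not the official challenge code.
  Ytrue = Ytrue(:); Ypred = Ypred(:); scores = scores(:);
  pos = (Ytrue > 0);  neg = ~pos;
  err_pos = mean(Ypred(pos) <= 0);           % error rate on the positive class
  err_neg = mean(Ypred(neg) >  0);           % error rate on the negative class
  ber  = 0.5 * (err_pos + err_neg);          % Balanced Error Rate
  dber = abs(ber - guessedBER);              % guess error dBER
  % AUC from the Mann-Whitney rank-sum statistic (ties ignored).
  [dummy, order] = sort(scores);
  ranks = zeros(numel(scores), 1);
  ranks(order) = (1:numel(scores))';
  auc = (sum(ranks(pos)) - sum(pos) * (sum(pos) + 1) / 2) / (sum(pos) * sum(neg));
end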
(Figure: guessed BER versus test BER for ADA, GINA, HIVA, NOVA and SYLVA; guesses below the diagonal, with guessed BER lower than test BER, are optimistic.)
Scoring method. E = testBER + dBER × (1 − exp(−γ dBER/σ)), where dBER = abs(testBER – guessedBER), γ = 1, and σ is the error bar on the test BER. (Figure: challenge score as a function of the guessed BER, compared with the test BER.)
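A minimal Matlab sketch of this score, assuming that s on the slide denotes the error bar σ on the test BER and g the constant γ = 1:

function e = challenge_score(testBER, guessedBER, sigma)
% Sketch of the challenge score E (sigma assumed to be the test BER error bar).
  gamma = 1;                                     % g = 1 on the slide
  dber = abs(testBER - guessedBER);              % guess error
  e = testBER + dber * (1 - exp(-gamma * dber / sigma));
end

A perfect guess gives E = testBER; as the guess error grows well beyond σ, the penalty saturates and E approaches testBER + dBER.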
Score (figures): E plotted against dBER/σ and against the test BER for each dataset (ADA, GINA, HIVA, NOVA, SYLVA); E increases from testBER for a perfect guess towards testBER + dBER for a large guess error.
Part III RESULT ANALYSIS
What did we expect? • Learn about new competitive machine learning techniques. • Identify competitive methods of performance prediction, model selection, and ensemble learning (theory put into practice.) • Drive research in the direction of refining such methods (on-going benchmark.)
Method comparison (figure: guess error dBER versus test BER for the compared methods).
Danger of overfitting (figure: test BER, full line, and validation BER, dashed line, as a function of time in days, for ADA, GINA, HIVA, NOVA and SYLVA).
How to estimate the BER? • Statistical tests (Stats): compute the BER on training data and compare it with a “null hypothesis”, e.g. the result obtained with a random permutation of the labels. • Cross-validation (CV): split the training data many times into training and validation sets; average the validation results (see the sketch after this list). • Guaranteed risk minimization (GRM): use theoretical performance bounds.
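As an illustration of the CV estimator, a minimal Matlab sketch (generic code, not any entrant's); train_fn and predict_fn are placeholder function handles standing for an arbitrary classifier.

function ber = cv_ber(X, Y, train_fn, predict_fn, k)
% k-fold cross-validation estimate of the Balanced Error Rate.
%   X: n-by-d data matrix, Y: n-by-1 labels in {-1,+1}.
%   train_fn(Xtr, Ytr) returns a model; predict_fn(model, Xva) returns labels.
  n = size(X, 1);
  perm = randperm(n);                        % shuffle before splitting
  fold = mod(0:n-1, k) + 1;                  % fold assignment 1..k
  bers = zeros(k, 1);
  for f = 1:k
    va = perm(fold == f);                    % validation indices of this fold
    tr = perm(fold ~= f);                    % training indices
    model   = train_fn(X(tr,:), Y(tr));
    Yhat    = predict_fn(model, X(va,:));
    bers(f) = balanced_error(Y(va), Yhat);
  end
  ber = mean(bers);                          % average over the k folds
end

function ber = balanced_error(Ytrue, Ypred)
% Average of the error rates on the positive and the negative class.
  pos = (Ytrue > 0);
  ber = 0.5 * (mean(Ypred(pos) <= 0) + mean(Ypred(~pos) > 0));
end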
Top ranking methods • Performance prediction: • CV with many splits (90% train / 10% validation). • Nested CV loops. • Model selection: • Use of a single model family. • Regularized risk / Bayesian priors. • Ensemble methods. • Nested CV loops, made computationally efficient with virtual leave-one-out (VLOO). (A sketch of nested CV follows below.)
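A sketch of the nested CV idea in Matlab (a generic illustration, not any particular entrant's method; it reuses the hypothetical cv_ber and balanced_error helpers from the sketch above): the inner loop selects a hyperparameter using the outer training fold only, and the outer loop estimates the BER of the complete selection procedure.

function ber = nested_cv_ber(X, Y, candidates, train_fn, predict_fn, k)
% Outer loop: performance prediction.  Inner loop: hyperparameter selection.
%   candidates: cell array of hyperparameter settings;
%   train_fn(Xtr, Ytr, hyper) returns a model; predict_fn(model, Xva) returns labels.
  n = size(X, 1);
  perm = randperm(n);
  fold = mod(0:n-1, k) + 1;
  outer = zeros(k, 1);
  for f = 1:k
    te = perm(fold == f);                    % outer validation fold
    tr = perm(fold ~= f);                    % outer training fold
    inner = zeros(numel(candidates), 1);
    for c = 1:numel(candidates)              % inner CV on the outer training fold only
      inner(c) = cv_ber(X(tr,:), Y(tr), ...
                        @(Xt, Yt) train_fn(Xt, Yt, candidates{c}), ...
                        predict_fn, k);
    end
    [dummy, best] = min(inner);              % candidate with the lowest inner CV BER
    model = train_fn(X(tr,:), Y(tr), candidates{best});
    outer(f) = balanced_error(Y(te), predict_fn(model, X(te,:)));
  end
  ber = mean(outer);                         % estimate of the BER of the whole procedure
end

The model actually submitted would be obtained by rerunning the inner selection on all the training data; closed-form virtual leave-one-out estimates, available for kernel ridge / LS-SVM type models, are what make such nested loops affordable.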
Other methods • Use of training data only: • Training BER. • Statistical tests. • Bayesian evidence. • Performance bounds. • Bilevel optimization.
Part IV CONCLUSIONS AND FURTHER WORK
Open problems Bridge the gap between theory and practice… • What are the best estimators of the variance of CV? • What should k be in k-fold? • Are other cross-validation methods better than k-fold (e.g. bootstrap, 5x2CV)? • Are there better “hybrid” methods? • What search strategies are best? • More than 2 levels of inference?
Future work • Game of model selection. • JMLR special topic on model selection. • IJCNN 2007 challenge!
Benchmarking model selection? • Performance prediction: Participants just need to provide a guess of their test performance. If they can solve that problem, they can perform model selection efficiently. Easy and motivating. • Selection of a model from a finite toolbox: In principle a more controlled benchmark, but less attractive to participants.
CLOP • CLOP=Challenge Learning Object Package. • Based on the Spider developed at the Max Planck Institute. • Two basic abstractions: • Data object • Model object http://clopinet.com/isabelle/Projects/modelselect/MFAQ.html
CLOP tutorial. At the Matlab prompt:
D = data(X, Y);                                % wrap the data matrix X and targets Y into a data object
hyper = {'degree=3', 'shrinkage=0.1'};         % hyperparameters of the learning object
model = kridge(hyper);                         % kernel ridge regression learning object
[resu, model] = train(model, D);               % train; resu holds the training outputs
tresu = test(model, testD);                    % apply the trained model to a test data object testD
model = chain({standardize, kridge(hyper)});   % chain a preprocessing and a learning object
Conclusions • Twice as much volume of participation as in the feature selection challenge • Top methods as before (different order): • Ensembles of trees • Kernel methods (RLSC/LS-SVM, SVM) • Bayesian neural networks • Naïve Bayes. • Danger of overfitting. • Triumph of cross-validation?