Validation Is (Not) Easy Dmytro Panchenko Machine learning engineer, Altexsoft
What is validation? Validation is a way to select and evaluate our models. Two most common strategies: • train-validation-test split (holdout validation) • k-fold cross-validation + test holdout Source: https://towardsdatascience.com/time-series-nested-cross-validation-76adba623eb9
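A minimal scikit-learn sketch of both strategies (the synthetic dataset and the random forest below are placeholders, not part of the talk):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score

X, y = make_classification(n_samples=1000, random_state=42)  # placeholder data
model = RandomForestClassifier(random_state=42)              # placeholder model

# 1) Holdout validation: train / validation / test split.
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=42)
model.fit(X_train, y_train)
print("holdout validation accuracy:", model.score(X_val, y_val))

# 2) k-fold cross-validation on the train+validation part; the test holdout stays untouched.
cv_scores = cross_val_score(model, X_trainval, y_trainval, cv=5)
print("5-fold CV accuracy: %.3f +/- %.3f" % (cv_scores.mean(), cv_scores.std()))
```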
What do we expect? Validation: • To compare and select models Test: • To evaluate model’s performance
What’s wrong? • Non-representative splits • Unstable validation • Data leakages Source: https://www.kaggle.com/alexisbcook/data-leakage
Representative sampling The Good
Representative sampling The Bad
Representative sampling The Ugly
Adversarial validation • Merge train and test into a single dataset • Label train samples as 0 and test samples as 1 • Train classifier • Train samples with the highest error are the most similar to test distribution
Adversarial validation: usage • To detect discrepancy in distributions (ROC-AUC > 0.5) • To make train or validation close to test (by removing features or sampling most similar items) • To make test close to production (if we have an unlabeled set of real-world data) Examples: https://www.linkedin.com/pulse/winning-13th-place-kaggles-magic-competition-corey-levinson/ https://www.kaggle.com/c/home-credit-default-risk/discussion/64722 https://www.kaggle.com/c/human-protein-atlas-image-classification/discussion/77251
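A rough sketch of the procedure, assuming two tabular DataFrames train_df / test_df with identical, numeric feature columns (the classifier choice and helper name are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

def adversarial_validation(train_df: pd.DataFrame, test_df: pd.DataFrame):
    """Return a test-likeness score for each train sample and the overall ROC-AUC."""
    # 1) Merge train and test; label train samples as 0 and test samples as 1.
    data = pd.concat([train_df, test_df], ignore_index=True)
    labels = np.r_[np.zeros(len(train_df)), np.ones(len(test_df))]

    # 2) Train a classifier to tell train from test (out-of-fold probabilities).
    clf = GradientBoostingClassifier(random_state=42)
    proba = cross_val_predict(clf, data, labels, cv=5, method="predict_proba")[:, 1]

    # ROC-AUC noticeably above 0.5 signals a train/test distribution discrepancy.
    auc = roc_auc_score(labels, proba)

    # 3) Train samples with the highest "test" probability are the most test-like.
    train_scores = proba[: len(train_df)]
    return train_scores, auc
```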
Reasons for instability • Not enough data • Bad stratification • Noisy labels • Outliers in data
Case #1. Mercedes-Benz competition Competitors worked with a dataset of Mercedes-Benz car features, predicting the time it takes a car to pass testing. Metric: R². Only 4k rows. Extreme outliers in the target. Source: https://habr.com/ru/company/ods/blog/336168/
Case #1. Mercedes-Benz competition Gold medal solution: • Multiple k-folds (10x5 folds) to collect more fold statistics • Dependent Student’s t-test for paired samples to compare two models: t = d̄ / (s_d / √n), where n is the number of folds, xᵢ and yᵢ are the metrics on fold i for models #1 and #2, dᵢ = xᵢ − yᵢ are the elementwise differences, d̄ is their mean and s_d is their standard deviation. Source: https://habr.com/ru/company/ods/blog/336168/ Author: https://www.kaggle.com/daniel89
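A sketch of the paired test, assuming scores_a and scores_b already hold the per-fold metric values of the two models on the same repeated folds (the numbers below are synthetic placeholders):

```python
import numpy as np
from scipy import stats

# Per-fold metrics of models #1 and #2 on the same folds,
# e.g. 10 repeats x 5 folds = 50 paired values (placeholder data).
rng = np.random.default_rng(42)
scores_a = rng.normal(0.55, 0.02, size=50)
scores_b = scores_a + rng.normal(0.005, 0.01, size=50)

# Dependent (paired) Student's t-test; pairing removes fold-to-fold variance.
t_stat, p_value = stats.ttest_rel(scores_b, scores_a)
print("t = %.3f, p = %.4f" % (t_stat, p_value))

# Equivalent manual computation: t = mean(d) / (std(d) / sqrt(n)).
d = scores_b - scores_a
t_manual = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))
print("manual t = %.3f" % t_manual)
```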
Case #2. ML BootCamp VI • 19M rows of logs • Adversarial validation gives 0.9+ ROC-AUC • Extremely unstable CV: unclear how to stratify Author: https://www.kaggle.com/sergeifironov/
Case #2. ML BootCamp VI First place solution: • Train model on stratified k-folds • Compute out-of-fold error for each sample • Stratify dataset by error • Optionally: go to step #1 again Author: https://www.kaggle.com/sergeifironov/
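A sketch of this scheme under assumptions (binary target, a logistic regression as the model, ten error bins); the actual first-place solution may differ in details:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

def error_stratified_folds(X, y, n_splits=5, n_bins=10, random_state=42):
    """Build folds stratified by out-of-fold error instead of the raw target."""
    y = np.asarray(y)
    model = LogisticRegression(max_iter=1000)  # placeholder model

    # 1) Ordinary stratified k-fold, out-of-fold predicted probabilities.
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    oof_proba = cross_val_predict(model, X, y, cv=cv, method="predict_proba")[:, 1]

    # 2) Per-sample out-of-fold error (absolute error of the predicted probability).
    error = np.abs(y - oof_proba)

    # 3) Stratify the dataset by binned error; optionally repeat from step 1.
    error_bin = pd.qcut(error, q=n_bins, labels=False, duplicates="drop")
    cv_by_error = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    return list(cv_by_error.split(X, error_bin))
```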
Data leakage Data leakage is the contamination of the training data by additional information that will not be available at the actual prediction time. Source: https://www.kaggle.com/alexisbcook/data-leakage
Case #1. HPA Classification Challenge Multiple shots from a single experiment are available. If one shot is placed into train and another into validation, you have a leakage.
Case #1. HPA Classification Challenge Solution: if you have data from several groups that share the target, always place the whole group into a single set!
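With scikit-learn this group-aware split can be done with GroupKFold; the toy data below (four experiments, three shots each) is purely illustrative:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Placeholder data: 12 image shots coming from 4 experiments (3 shots each).
X = np.random.rand(12, 8)                    # image features
y = np.random.randint(0, 2, size=12)         # labels shared within an experiment
experiment_id = np.repeat(np.arange(4), 3)   # group id = experiment the shot came from

# GroupKFold keeps all shots of one experiment inside a single fold,
# so shots from the same experiment never span train and validation.
gkf = GroupKFold(n_splits=4)
for fold, (train_idx, val_idx) in enumerate(gkf.split(X, y, groups=experiment_id)):
    assert set(experiment_id[train_idx]).isdisjoint(experiment_id[val_idx])
    # ... fit on X[train_idx], validate on X[val_idx]
```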
Case #2. Telecom Data Cup Where is the leakage? Pipeline: client uses mobile provider services → client answers an engagement survey → survey result is written into the DB → all previous history is aggregated into a row in the dataset.
Case #2. Telecom Data Cup The engagement survey call itself is recorded in the call history: a short call means everything was fine, a long conversation means the client was complaining. That is the leakage.
Case #2. Telecom Data Cup Solution: only use data that was available at the moment when the prediction should have been made.
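A hypothetical pandas sketch of the fix: aggregate only events that happened strictly before the survey call. The tables and column names (client_id, call_ts, survey_ts, duration) are invented for illustration:

```python
import pandas as pd

# Hypothetical call history and the moment each client was surveyed.
calls = pd.DataFrame({
    "client_id": [1, 1, 1, 2, 2],
    "call_ts": pd.to_datetime(["2019-01-05", "2019-02-01", "2019-02-10",
                               "2019-01-20", "2019-03-01"]),
    "duration": [60, 120, 900, 30, 45],
})
surveys = pd.DataFrame({
    "client_id": [1, 2],
    "survey_ts": pd.to_datetime(["2019-02-05", "2019-02-15"]),
})

# Keep only calls that happened strictly before the survey, so the survey call
# itself (and anything after it) cannot leak into the aggregated features.
merged = calls.merge(surveys, on="client_id")
history = merged[merged["call_ts"] < merged["survey_ts"]]
features = history.groupby("client_id")["duration"].agg(["count", "mean", "sum"])
print(features)
```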
Case #3. APTOS Blindness Detection Different classes were probably collected separately and artificially mixed into a single dataset, so aspect ratio, image size and crop type vary for different classes.
Case #3. APTOS Blindness Detection This leads the network to learn arbitrary meta-features of the image instead of the actual symptoms. Source: https://www.kaggle.com/dimitreoliveira/diabetic-retinopathy-shap-model-explainability
Case #3. APTOS Blindness Detection Solution: remove metafeatures that are not related to the task and thoroughly investigate all suspicious “data properties”.
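One possible way to neutralize such meta-features, sketched with Pillow: center-crop every image to a square and resize to a single resolution so image size, aspect ratio and crop type no longer differ between classes (the target size of 512 is an arbitrary choice):

```python
from PIL import Image

def normalize_image(path: str, size: int = 512) -> Image.Image:
    """Center-crop to a square and resize, removing size / aspect-ratio / crop differences."""
    img = Image.open(path).convert("RGB")
    w, h = img.size
    side = min(w, h)
    left, top = (w - side) // 2, (h - side) // 2
    img = img.crop((left, top, left + side, top + side))
    return img.resize((size, size), Image.BILINEAR)
```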
Case #4. Airbus Ship Detection Challenge The original dataset consists of high-resolution images. They were cropped, augmented and only after that divided into train and test, so crops of the same source image could end up on both sides of the split.
Case #4. Airbus Ship Detection Challenge Solution: first split the data into train and test, and only after that apply all preprocessing. If preprocessing is data-driven (e.g. target encoding), fit it on the train data only.
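A minimal sketch of this order of operations, with a hand-rolled target encoding fitted on the train part only (the toy DataFrame and column names are illustrative):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "city": ["A", "A", "B", "B", "C", "C", "A", "B"],
    "target": [1, 0, 1, 1, 0, 0, 1, 1],
})

# 1) Split first...
train_df, test_df = train_test_split(df, test_size=0.25, random_state=42)
train_df, test_df = train_df.copy(), test_df.copy()

# 2) ...then fit any data-driven preprocessing (here: target encoding) on train only.
global_mean = train_df["target"].mean()
encoding = train_df.groupby("city")["target"].mean()

train_df["city_te"] = train_df["city"].map(encoding)
test_df["city_te"] = test_df["city"].map(encoding).fillna(global_mean)  # unseen categories
```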
Summary • Always ensure that your validation is representative • Check that your validation scenario corresponds to the real-world prediction scenario • Good luck!
Thank you for your attention Questions are welcome