Validation Is (Not) Easy. Dmytro Panchenko, machine learning engineer, Altexsoft. What is validation? Validation is a way to select and evaluate our models. The two most common strategies: train-validation-test split (holdout validation) and k-fold cross-validation + test holdout.
Validation Is (Not) Easy Dmytro Panchenko Machine learning engineer, Altexsoft
What is validation? Validation is a way to select and evaluate our models. The two most common strategies: • train-validation-test split (holdout validation) • k-fold cross-validation + test holdout Source: https://towardsdatascience.com/time-series-nested-cross-validation-76adba623eb9
What do we expect? Validation: • To compare and select models Test: • To evaluate the selected model’s performance
What’s wrong? • Non-representative splits • Unstable validation • Data leakage Source: https://www.kaggle.com/alexisbcook/data-leakage
Representative sampling The Good
Representative sampling The Bad
Representative sampling The Ugly
Adversarial validation • Merge train and test into a single dataset • Label train samples as 0 and test samples as 1 • Train a classifier to tell the two apart • Train samples with the highest error (those the classifier mistakes for test samples) are the most similar to the test distribution
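The steps above can be sketched roughly as follows. This is a minimal illustration, assuming scikit-learn and NumPy are available; the synthetic data, the logistic-regression classifier and the helper name `adversarial_validation` are choices made for this example, not something taken from the talk.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict


def adversarial_validation(train_X, test_X, n_splits=5, seed=0):
    """Return (ROC-AUC of the train-vs-test classifier, P(test) for each train row)."""
    # Merge train and test; label train rows 0 and test rows 1.
    X = np.vstack([train_X, test_X])
    y = np.concatenate([np.zeros(len(train_X), dtype=int),
                        np.ones(len(test_X), dtype=int)])
    # Out-of-fold probabilities, so no row is scored by a model that saw it.
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    proba = cross_val_predict(LogisticRegression(max_iter=1000), X, y,
                              cv=cv, method="predict_proba")[:, 1]
    return roc_auc_score(y, proba), proba[:len(train_X)]


# Synthetic data (assumption): the test set is shifted along feature 0.
rng = np.random.default_rng(0)
train_X = rng.normal(0.0, 1.0, size=(500, 3))
test_X = rng.normal(0.0, 1.0, size=(500, 3))
test_X[:, 0] += 2.0

auc, p_test = adversarial_validation(train_X, test_X)
# Train rows the classifier most strongly mistakes for test rows.
most_test_like = np.argsort(-p_test)[:100]
```

An AUC close to 0.5 would suggest the two sets are hard to tell apart; `most_test_like` is one possible candidate pool for a validation set that resembles test.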
Adversarial validation: usage • To detect a discrepancy between distributions (ROC-AUC noticeably above 0.5 means the classifier can tell train from test) • To make train or validation closer to test (by removing features or sampling the most similar items) • To make test closer to production (if we have an unlabeled set of real-world data) Examples: https://www.linkedin.com/pulse/winning-13th-place-kaggles-magic-competition-corey-levinson/ https://www.kaggle.com/c/home-credit-default-risk/discussion/64722 https://www.kaggle.com/c/human-protein-atlas-image-classification/discussion/77251
Reasons for instability • Not enough data • Bad stratification • Noisy labels • Outliers in data
Case #1. Mercedes-Benz competition Competitors worked with a dataset describing different features of Mercedes-Benz cars to predict how long a car takes to pass testing. Metric: R². Only 4k rows. Extreme outliers in target. Source: https://habr.com/ru/company/ods/blog/336168/
Case #1. Mercedes-Benz competition Gold medal solution: • Multiple k-folds (10x5 folds) to collect more fold statistics • Dependent Student’s t-test for paired samples to compare two models: $t = \frac{\bar{d}}{\sqrt{s_d^2 / n}}$, with $d_i = x_{1,i} - x_{2,i}$ and $\bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i$, where $n$ is the number of folds, $x_{1,i}$ and $x_{2,i}$ are the metrics on fold $i$ for models #1 and #2, and $s_d^2$ is the dispersion (variance) of the elementwise differences. Source: https://habr.com/ru/company/ods/blog/336168/ Author: https://www.kaggle.com/daniel89
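A small sketch of the paired t-statistic, written with the standard library only. The per-fold scores are made-up numbers for illustration, and the function name `paired_t_statistic` is hypothetical; the gold-medal solution would feed in 50 paired scores (10 repeats of 5 folds), evaluated on the same splits for both models.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """t = mean(d) / sqrt(var(d) / n) for per-fold differences d = a - b."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_d = statistics.fmean(diffs)
    var_d = statistics.variance(diffs)  # sample variance (n - 1 in the denominator)
    return mean_d / math.sqrt(var_d / n)


# Illustrative per-fold R2 scores of two models on identical folds (not real results).
model_1 = [0.60, 0.62, 0.58, 0.61, 0.59]
model_2 = [0.58, 0.59, 0.57, 0.60, 0.56]

t_stat = paired_t_statistic(model_1, model_2)
```

The statistic is compared against a Student distribution with n - 1 degrees of freedom. Scores from overlapping training folds are not fully independent, so the resulting p-value is best treated as a rough guide.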
Case #2. ML BootCamp VI • 19M rows of logs • Adversarial validation gives 0.9+ ROC-AUC • Extremely unstable CV: unclear how to stratify Author: https://www.kaggle.com/sergeifironov/
Case #2. ML BootCamp VI First place solution: • Train model on stratified k-folds • Compute out-of-fold error for each sample • Stratify dataset by error • Optionally: go to step #1 again Author: https://www.kaggle.com/sergeifironov/
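One way these steps might look in code, assuming scikit-learn and NumPy. The Ridge model, the absolute-error measure, the four quantile bins and the synthetic data are all assumptions made for this sketch; the talk does not say which choices the first-place solution used.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_predict

# Synthetic regression data (assumption, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=300)

# Steps 1-2: train on k folds and collect an out-of-fold error for every sample.
initial_cv = KFold(n_splits=5, shuffle=True, random_state=0)
oof_pred = cross_val_predict(Ridge(), X, y, cv=initial_cv)
oof_error = np.abs(y - oof_pred)

# Step 3: bin samples by error quartile and stratify new folds on the bin,
# so every fold gets a similar mix of easy and hard samples.
edges = np.quantile(oof_error, [0.25, 0.5, 0.75])
error_bin = np.digitize(oof_error, edges)  # values 0..3

stratified_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
folds = list(stratified_cv.split(X, error_bin))
# Optional step 4: retrain on `folds`, recompute the errors, and repeat.
```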
Data leakage Data leakage is the contamination of the training data by additional information that will not be available at the actual prediction time. Source: https://www.kaggle.com/alexisbcook/data-leakage
Case #1. HPA Classification Challenge Multiple shots from a single experiment are available. If one shot is placed into train and another into validation, you have leakage.
Case #1. HPA Classification Challenge Solution: if you have data from several groups that share the target, always place the whole group in a single set!
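A minimal sketch of a group-aware split using only the standard library. The experiment IDs and the helper name `group_train_valid_split` are hypothetical; for cross-validation, scikit-learn's `GroupKFold` provides the same guarantee.

```python
import random


def group_train_valid_split(groups, valid_fraction=0.2, seed=0):
    """Split row indices so that every group lands entirely in one set."""
    unique = sorted(set(groups))
    random.Random(seed).shuffle(unique)
    n_valid = max(1, round(len(unique) * valid_fraction))
    valid_groups = set(unique[:n_valid])
    train_idx = [i for i, g in enumerate(groups) if g not in valid_groups]
    valid_idx = [i for i, g in enumerate(groups) if g in valid_groups]
    return train_idx, valid_idx


# Hypothetical experiment ID for each shot.
experiment_ids = ["exp1", "exp1", "exp2", "exp3", "exp3",
                  "exp3", "exp4", "exp5", "exp5", "exp5"]
train_idx, valid_idx = group_train_valid_split(experiment_ids)
```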
Case #2. Telecom Data Cup Where is the leakage? Data flow: 1) the client uses mobile provider services; 2) the client answers an engagement survey; 3) the survey result is written into the DB; 4) all previous history is aggregated into a row in the dataset.
Case #2. Telecom Data Cup The engagement survey call itself is counted in the call history: a short call means everything was fine, a long conversation means the client was complaining. The leakage therefore enters at the step where the client answers the engagement survey, before the history is aggregated into a dataset row.
Case #2. Telecom Data Cup Solution: use only data that was available at the moment when the prediction should have been made.
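The idea can be illustrated with a point-in-time cutoff applied before aggregation. The event records, field names (`started_at`, `duration_sec`) and helper functions below are hypothetical and only mimic the situation described in this case.

```python
from datetime import datetime


def history_before(events, cutoff):
    """Keep only events that started strictly before the prediction time."""
    return [e for e in events if e["started_at"] < cutoff]


def aggregate_calls(events):
    """Collapse a call history into one feature row."""
    durations = [e["duration_sec"] for e in events]
    return {"n_calls": len(durations),
            "max_duration_sec": max(durations, default=0)}


# Hypothetical call log; the last record is the survey call itself.
events = [
    {"started_at": datetime(2019, 3, 1, 10, 0), "duration_sec": 60},
    {"started_at": datetime(2019, 3, 5, 18, 30), "duration_sec": 45},
    {"started_at": datetime(2019, 3, 10, 12, 0), "duration_sec": 900},
]
survey_time = datetime(2019, 3, 10, 12, 0)

leaky_row = aggregate_calls(events)  # includes the survey call
clean_row = aggregate_calls(history_before(events, survey_time))
```

Here `leaky_row` carries the length of the survey call, which encodes the answer, while `clean_row` only reflects what was known before the survey.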
Case #3. APTOS Blindness Detection Different classes were probably collected separately and artificially mixed into a single dataset, so aspect ratio, image size and crop type vary for different classes.
Case #3. APTOS Blindness Detection This leads the network to learn arbitrary metafeatures of the image instead of actual symptoms. Source: https://www.kaggle.com/dimitreoliveira/diabetic-retinopathy-shap-model-explainability
Case #3. APTOS Blindness Detection Solution: remove metafeatures that are not related to the task and thoroughly investigate all suspicious “data properties”.
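One simple way to investigate a suspicious data property is to check how well the label can be guessed from that property alone. The sketch below does this for image size with a majority-vote lookup; the rows are toy values, not APTOS data, and the function name `metadata_only_accuracy` is hypothetical.

```python
from collections import Counter, defaultdict


def metadata_only_accuracy(train_rows, valid_rows):
    """Accuracy of predicting the label from image size alone (majority vote per size)."""
    votes = defaultdict(Counter)
    for size, label in train_rows:
        votes[size][label] += 1
    # Fallback for sizes never seen in train: the most frequent train label.
    overall = Counter(label for _, label in train_rows).most_common(1)[0][0]
    hits = 0
    for size, label in valid_rows:
        guess = votes[size].most_common(1)[0][0] if size in votes else overall
        hits += guess == label
    return hits / len(valid_rows)


# Toy rows of ((width, height), label), for illustration only.
train_rows = [((640, 480), 0)] * 4 + [((1024, 1024), 1)] * 3
valid_rows = [((640, 480), 0), ((1024, 1024), 1),
              ((640, 480), 0), ((800, 600), 1)]

accuracy = metadata_only_accuracy(train_rows, valid_rows)
```

If such a metadata-only baseline scores well above the majority-class rate, the property is probably a shortcut that should be normalized away, for example by cropping and resizing all images in the same way.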
Case #4. Airbus Ship Detection Challenge The original dataset consists of high-resolution images. They were cropped, augmented and only after that divided into train and test, so crops of the same original image could land in both sets.
Case #4. Airbus Ship Detection Challenge Solution: split the data into train and test first, then apply all preprocessing. If preprocessing is data-driven (e.g. target encoding), fit it on train data only.
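As an illustration of data-driven preprocessing fitted on train only, here is a bare-bones target encoding in plain Python. The category values, targets and function names are invented for the example, and unseen categories fall back to the train-wide mean, which is one common convention rather than the only option.

```python
from collections import defaultdict


def fit_target_encoding(categories, targets):
    """Learn the mean target per category from TRAIN rows only."""
    sums, counts = defaultdict(float), defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    global_mean = sum(targets) / len(targets)
    mapping = {category: sums[category] / counts[category] for category in counts}
    return mapping, global_mean


def apply_target_encoding(categories, mapping, global_mean):
    """Encode any split with the mapping learned on train."""
    return [mapping.get(category, global_mean) for category in categories]


# The split happens first; the encoder never sees test targets.
train_cat = ["a", "a", "b", "b", "b"]
train_y = [1.0, 0.0, 1.0, 1.0, 1.0]
test_cat = ["a", "b", "c"]  # "c" does not occur in train

mapping, global_mean = fit_target_encoding(train_cat, train_y)
test_encoded = apply_target_encoding(test_cat, mapping, global_mean)
```

In practice the train rows themselves are often encoded out-of-fold as well, so that a row's own target does not leak into its feature.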
Summary • Always ensure that your validation is representative • Check that your validation scenario corresponds to the real-world prediction scenario • Good luck!
Thank you for your attention. Questions are welcome.