
Validation Is (Not) Easy



  1. Validation Is (Not) Easy Dmytro Panchenko Machine learning engineer, Altexsoft

  2. What is validation? Validation is a way to select and evaluate our models. Two most common strategies: • train-validation-test split (holdout validation) • k-fold cross-validation + test holdout Source: https://towardsdatascience.com/time-series-nested-cross-validation-76adba623eb9
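
A minimal sketch of both strategies with scikit-learn; the data, model and metric below are placeholders, not anything from the talk:

```python
import numpy as np
from sklearn.model_selection import train_test_split, KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X = np.random.rand(1000, 10)        # placeholder features
y = np.random.randint(0, 2, 1000)   # placeholder binary target

# Strategy 1: holdout validation (train / validation / test)
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)

# Strategy 2: k-fold cross-validation on everything except the test holdout
kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = []
for train_idx, val_idx in kf.split(X_rest):
    model = LogisticRegression(max_iter=1000).fit(X_rest[train_idx], y_rest[train_idx])
    fold_scores.append(accuracy_score(y_rest[val_idx], model.predict(X_rest[val_idx])))
print("CV accuracy: %.3f +/- %.3f" % (np.mean(fold_scores), np.std(fold_scores)))
```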

  3. What do we expect? Validation: • To compare and select models

  4. What do we expect? Validation: • To compare and select models Test: • To evaluate model’s performance

  5. Is it easy?

  6. What’s wrong? • Non-representative splits

  7. What’s wrong? • Non-representative splits • Unstable validation

  8. What’s wrong? • Non-representative splits • Unstable validation • Data leakages Source: https://www.kaggle.com/alexisbcook/data-leakage

  9. Representative sampling The Good

  10. Representative sampling The Bad

  11. Representative sampling The Ugly

  12. Adversarial validation • Merge train and test into a single dataset • Label train samples as 0 and test samples as 1 • Train classifier • Train samples with the highest error are the most similar to test distribution
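
A minimal sketch of this procedure, assuming tabular train_df / test_df that share the same numeric feature columns (target column dropped); the classifier and fold count are arbitrary choices, not the author's:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_auc_score

def adversarial_validation(train_df, test_df):
    data = pd.concat([train_df, test_df], ignore_index=True)
    # label train samples as 0 and test samples as 1
    is_test = np.r_[np.zeros(len(train_df)), np.ones(len(test_df))]

    clf = RandomForestClassifier(n_estimators=200, random_state=42)
    # out-of-fold probability of "being a test sample" for every row
    proba = cross_val_predict(clf, data, is_test, cv=5, method="predict_proba")[:, 1]
    print("Adversarial ROC-AUC:", roc_auc_score(is_test, proba))

    # train samples with the highest probability (i.e. the highest error for
    # their true label 0) are the most similar to the test distribution
    return proba[:len(train_df)]
```

The returned probabilities can then be used for the tricks on the next slide, e.g. sampling the most test-like train rows into validation.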

  13. Adversarial validation: usage • To detect discrepancy in distributions (ROC-AUC > 0.5) • To make train or validation close to test (by removing features or sampling most similar items) • To make test close to production (if we have an unlabeled set of real-world data) Examples: https://www.linkedin.com/pulse/winning-13th-place-kaggles-magic-competition-corey-levinson/ https://www.kaggle.com/c/home-credit-default-risk/discussion/64722 https://www.kaggle.com/c/human-protein-atlas-image-classification/discussion/77251

  14. Representative sampling The Ugly

  15. Kaggle example

  16. Unstable validation

  17. Reasons for instability • Not enough data

  18. Reasons for instability • Not enough data • Bad stratification

  19. Reasons for instability • Not enough data • Bad stratification • Noisy labels

  20. Reasons for instability • Not enough data • Bad stratification • Noisy labels • Outliers in data

  21. Case #1. Mercedes-Benz competition Competitors were working with a dataset of different features of Mercedes-Benz cars to predict the time it takes for a car to pass testing. Metric: R2. Only 4k rows. Extreme outliers in the target. Source: https://habr.com/ru/company/ods/blog/336168/

  22. Case #1. Mercedes-Benz competition Gold medal solution: • Multiple k-folds (10x5 folds) to collect more fold statistics • Dependent Student’s t-test for paired samples to compare two models: $t = \bar{d} / (s_d / \sqrt{n})$, where $n$ is the number of folds, $x_{1,i}$ and $x_{2,i}$ are the per-fold metrics of models #1 and #2, $d_i = x_{1,i} - x_{2,i}$, $\bar{d}$ is their mean, and $s_d^2$ is the dispersion of the elementwise differences. Source: https://habr.com/ru/company/ods/blog/336168/ Author: https://www.kaggle.com/daniel89
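
A small numeric sketch of this comparison; the fold scores below are dummy values, and SciPy's ttest_rel is shown next to the manual formula:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# per-fold metrics of two models computed on the SAME folds,
# e.g. 10 repeats x 5 folds = 50 paired observations (dummy values here)
scores_1 = rng.normal(0.55, 0.02, size=50)
scores_2 = scores_1 + rng.normal(0.003, 0.01, size=50)

# dependent (paired) Student's t-test: t = mean(d) / (std(d) / sqrt(n))
d = scores_2 - scores_1
t_manual = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))

t_scipy, p_value = stats.ttest_rel(scores_2, scores_1)
print(t_manual, t_scipy, p_value)  # the two t statistics coincide
```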

  23. Case #2. ML BootCamp VI • 19M rows of logs • Adversarial validation gives 0.9+ ROC-AUC • Extremely unstable CV: unclear how to stratify Author: https://www.kaggle.com/sergeifironov/

  24. Case #2. ML BootCamp VI First place solution: • Train model on stratified k-folds • Compute out-of-fold error for each sample • Stratify dataset by error • Optionally: go to step #1 again Author: https://www.kaggle.com/sergeifironov/
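
A rough sketch of the loop described above; the model, per-sample error definition and bin count are assumptions for illustration, not the author's actual code (binary target and NumPy arrays assumed):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

def error_stratified_folds(X, y, n_splits=5, n_error_bins=10, seed=42):
    # steps 1-2: train on target-stratified folds, collect out-of-fold errors
    oof_error = np.zeros(len(y), dtype=float)
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, val_idx in skf.split(X, y):
        model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
        proba = model.predict_proba(X[val_idx])[:, 1]
        oof_error[val_idx] = np.abs(y[val_idx] - proba)   # per-sample error

    # step 3: bin samples by error magnitude and stratify new folds by the bins
    error_bins = pd.qcut(oof_error, q=n_error_bins, labels=False, duplicates="drop")
    skf_err = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    return list(skf_err.split(X, error_bins))
```

Step 4 ("optionally go to step #1 again") would simply repeat the loop using the new folds.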

  25. Data leakage Data leakage is the contamination of the training data by additional information that will not be available at the actual prediction time. Source: https://www.kaggle.com/alexisbcook/data-leakage

  26. Case #1. HPA Classification Challenge Multiple shots from a single experiment are available. If one shot is placed into train and another into validation, you have a leak.

  27. Case #1. HPA Classification Challenge Solution: if you have data from several groups that share the target, always place the whole group into a single set!
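
scikit-learn's GroupKFold is one ready-made way to enforce this; a minimal sketch with placeholder data, where "groups" stands for the experiment id:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.random.rand(100, 16)            # placeholder features
y = np.random.randint(0, 2, 100)       # placeholder labels
groups = np.repeat(np.arange(20), 5)   # experiment id of each sample (placeholder)

gkf = GroupKFold(n_splits=5)
for train_idx, val_idx in gkf.split(X, y, groups=groups):
    # shots from one experiment never end up on both sides of the split
    assert set(groups[train_idx]).isdisjoint(groups[val_idx])
```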

  28. Case #2. Telecom Data Cup Where is the leakage? Client uses mobile provider services → Client answers an engagement survey → Survey result is written into the DB → All previous history is aggregated into a row in the dataset

  29. Case #2. Telecom Data Cup The engagement survey call itself is accounted for in the call history: a short call means everything was fine, a long conversation means the client was complaining. Client uses mobile provider services → Client answers an engagement survey (LEAKAGE!) → Survey result is written into the DB → All previous history is aggregated into a row in the dataset

  30. Case #2. Telecom Data Cup Solution: use only the data that was available at the point when the prediction should have been made. Client uses mobile provider services → Client answers an engagement survey (LEAKAGE!) → Survey result is written into the DB → All previous history is aggregated into a row in the dataset

  31. Case #3. APTOS Blindness Detection Different classes were probably collected separately and artificially mixed into a single dataset, so aspect ratio, image size and crop type vary for different classes.

  32. Case #3. APTOS Blindness Detection This leads the network to learn arbitrary metafeatures of the image instead of the actual symptoms. Source: https://www.kaggle.com/dimitreoliveira/diabetic-retinopathy-shap-model-explainability

  33. Case #3. APTOS Blindness Detection Solution: remove metafeatures that are not related to the task and thoroughly investigate all suspicious “data properties”.
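
One cheap way to run such an investigation is to compare basic image metafeatures per class; a hypothetical helper assuming a dataframe with "path" and "label" columns (these names are illustrative, not from the competition data):

```python
import pandas as pd
from PIL import Image

def image_metafeatures_by_class(df):
    sizes = [Image.open(p).size for p in df["path"]]   # (width, height) of each image
    meta = pd.DataFrame(sizes, columns=["width", "height"], index=df.index)
    meta["aspect_ratio"] = meta["width"] / meta["height"]
    meta["label"] = df["label"].values
    # if these statistics differ sharply between classes, the network can
    # learn the class from the metafeatures instead of the actual symptoms
    return meta.groupby("label")[["width", "height", "aspect_ratio"]].describe()
```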

  34. Case #4. Airbus Ship Detection Challenge The original dataset consists of high-resolution images. They were cropped and augmented, and only after that divided into train and test.

  35. Case #4. Airbus Ship Detection Challenge Solution: first split the data into train and test, and only then apply all preprocessing. If the preprocessing is data-driven (e.g. target encoding), fit it on the train data only.
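
A toy sketch of the split-first rule with a data-driven step (target encoding) fitted on train only; the column names and data are made up:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "category": ["a", "b", "a", "c", "b", "a", "c", "b"],
    "target":   [1, 0, 1, 0, 1, 1, 0, 0],
})

# 1) split first...
train, test = train_test_split(df, test_size=0.25, random_state=42)
train, test = train.copy(), test.copy()

# 2) ...then fit the data-driven preprocessing on train only
encoding = train.groupby("category")["target"].mean()
global_mean = train["target"].mean()

train["category_te"] = train["category"].map(encoding)
# unseen categories fall back to the train mean
test["category_te"] = test["category"].map(encoding).fillna(global_mean)
```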

  36. Summary • Always ensure that your validation is representative • Check that your validation scenario corresponds to the real-world prediction scenario • Good luck!

  37. Thank you for your attention Questions are welcome
