Characterising Learning Transfer: When Training and Test Distributions are Different

This presentation characterises learning transfer when the training and test distributions differ. It surveys common cases of dataset shift and argues for an inclusive framework that characterises the general problem, can be formalised, and is practical.

Presentation Transcript


  1. characterising learning transfer: when training and test distributions are different • Amos Storkey, School of Informatics.

  2. acknowledgements • Joint work with Masashi Sugiyama, Jon Clayden and Mark Bastin

  3. characterising learning transfer • Learning transfer: • Covers many current cases of dataset shift • Will benefit from an inclusive framework that characterises the general problem • Can be formalised • Is practical

  4. dataset shift • [Figure: training and test datasets linked by dataset shift; labels: Predictive, Generative, ?]

  5. real life • Characterising the change: • Simple covariate shift • Prior probability shift • Sample selection bias • Imbalanced data • Domain shift • Source component shift • Focus on the prediction problem: given X, predict Y

  6. simple covariate shift • Learnt conditional predictive model • Change: • Distribution of X changes • P(Y|X) does not • Modelling implication: • None (given suitable modelling class) • [Figure: graphical model over X and Y; plot of y against x]

  7. no modelling implication? • [Figure: plot of y against x]
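The caveat "given suitable modelling class" can be illustrated with a small simulation. The sketch below is not from the talk: the quadratic target function, the noise level, and the two input distributions are all hypothetical choices. It keeps P(Y|X) fixed, shifts only the distribution of X, and compares a model class that contains the truth (quadratic) with one that does not (linear).

```python
import numpy as np

rng = np.random.default_rng(0)

def true_function(x):
    # Hypothetical regression function; P(Y|X) is the same for train and test.
    return x ** 2

def sample(mean, n):
    # Only the input distribution changes between datasets (covariate shift).
    x = rng.normal(mean, 0.5, n)
    y = true_function(x) + rng.normal(0.0, 0.1, n)
    return x, y

def mse(coef, x, y):
    return float(np.mean((np.polyval(coef, x) - y) ** 2))

x_train, y_train = sample(0.0, 2000)   # training inputs centred at 0
x_test, y_test = sample(2.0, 2000)     # test inputs centred at 2

linear = np.polyfit(x_train, y_train, 1)      # misspecified model class
quadratic = np.polyfit(x_train, y_train, 2)   # class containing the truth

print("linear    train/test MSE:", mse(linear, x_train, y_train), mse(linear, x_test, y_test))
print("quadratic train/test MSE:", mse(quadratic, x_train, y_train), mse(quadratic, x_test, y_test))
```

Under these assumptions the quadratic fit transfers to the shifted inputs, while the linear fit, which looked acceptable on the training inputs, degrades sharply. That is one reading of the question mark on the slide.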

  8. prior probability shift • Learnt generative model • Change: • Distribution of Y changes • P(X|Y) does not • Modelling implication: • Use different P(Y) in Bayes Rule • [Figure: graphical model over Y and X; plots with axis labels x, y, x1, x2]
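Because P(X|Y) is unchanged, swapping the prior in Bayes' rule amounts to rescaling the training posterior by the ratio of test to training priors and renormalising. A minimal sketch, assuming both priors are known; the function name `reweight_posterior` is illustrative and not from the talk.

```python
import numpy as np

def reweight_posterior(post_train, prior_train, prior_test):
    """Convert P_train(y|x) into P_test(y|x) when only P(Y) has changed.

    P_test(y|x) is proportional to P_train(y|x) * P_test(y) / P_train(y).
    The last axis of post_train indexes the classes.
    """
    post_train = np.asarray(post_train, dtype=float)
    ratio = np.asarray(prior_test, dtype=float) / np.asarray(prior_train, dtype=float)
    unnormalised = post_train * ratio
    return unnormalised / unnormalised.sum(axis=-1, keepdims=True)

# A point that is ambiguous under balanced training priors...
adjusted = reweight_posterior([0.5, 0.5], prior_train=[0.5, 0.5], prior_test=[0.9, 0.1])
print(adjusted)  # ...follows the new prior at test time
```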

  9. sample selection bias • Learnt conditional predictive model • Change: • Sample selection rule V determines what samples occur in data. • Modelling implication: • Sample selection estimation • [Figure: graphical models over X, Y and V, one variant labelled "= covariate shift"; plot of y against x]

  10. imbalanced data • Learn conditional classification model on balanced data • Change: • Training data: V rejects many samples for common class • Test on full imbalanced data (special case of sample selection bias) • Modelling implication: • Adapt classification probability thresholds to account for change. • [Figure: graphical model over X, Y and V; plot with axes X1 and X2]
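One way to adapt the threshold, sketched here for the binary case: if the classifier was trained with positive-class proportion `pi_train` and the test data has proportion `pi_test`, the test-time posterior odds are the training odds multiplied by the ratio of prior odds. Solving for the training-posterior value that corresponds to a chosen test-time threshold gives the function below. The function and parameter names are hypothetical, and the derivation assumes the class-conditional densities are unchanged by the rejection step.

```python
def train_threshold(pi_test, pi_train=0.5, test_threshold=0.5):
    """Threshold on the training posterior P_train(y=1|x) that is equivalent
    to thresholding the test posterior P_test(y=1|x) at test_threshold."""
    # Required test odds, mapped back through the change in prior odds.
    required_odds = (test_threshold / (1.0 - test_threshold)
                     * (1.0 - pi_test) / pi_test
                     * pi_train / (1.0 - pi_train))
    return required_odds / (1.0 + required_odds)

# Trained on balanced data, deployed where the positive class is 10% of cases:
print(train_threshold(pi_test=0.1))
```

With balanced training data the rule simplifies to predicting the positive class only when the balanced posterior exceeds 1 - pi_test.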

  11. domain shift • Learn conditional classification model on balanced data • Change: • Dynamic X: Xnew = f(Xold) • Y(Xnew) = Y(f(Xold)) • Modelling implication: • Need to learn functional map f • [Figure: graphical model over F, Xo, X and Y]

  12. source component shift • Various sources for data • Change: • Proportions of different source components vary between datasets • Within-source conditional models are the same • Modelling implication: • Estimate sources and proportion changes • Learn mixture of experts model • [Figure: graphical model over X, Y and R; plot of y against x]

  13. sample selection v source component • Sample selection bias as source component shift: • Let R index rejection-equiprobable regions. • P(X,Y|R) gives distributions for those regions: consistent for both training and test. • P(R) varies to account for rejection in training. • [Figure: graphical model over X, Y and V beside graphical model over X, Y and R]
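The equivalence on this slide can be written out explicitly. The notation below is a sketch: the acceptance probability a_r for region r is a symbol introduced here, and the test data is assumed to be unselected.

```latex
\begin{align}
P_{\text{train}}(x, y) &= \sum_{r} P(x, y \mid r)\, P_{\text{train}}(r), \\
P_{\text{test}}(x, y)  &= \sum_{r} P(x, y \mid r)\, P_{\text{test}}(r), \\
P_{\text{train}}(r)    &= \frac{a_r\, P_{\text{test}}(r)}{\sum_{r'} a_{r'}\, P_{\text{test}}(r')}.
\end{align}
```

Because the acceptance probability is constant within a region, conditioning on acceptance leaves P(x, y | r) untouched, so only the mixing proportions P(R) differ between training and test.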

  14. modelling source component shift • [Figure: model diagram with labels P11(x), P12(x), P13(x), P21(x), P22(x), P23(x), D1i, T1i, D2i, P1(y|x), P2(y|x), i]

  15. EM for source component shift • Effectively a Gaussian mixture model with shared components, and different priors. • Can use EM algorithm: • Compute responsibilities for components • Learn parameters of Gaussians • Learn parameters for regressors. • All subject to constraints on what data point can be generated from what model.
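The core idea, shared Gaussian components with dataset-specific mixing proportions, can be sketched in a few lines. This is a deliberately reduced illustration rather than the model from the talk: it handles one-dimensional inputs only, omits the per-component regressors, and imposes no constraints on which data points may come from which component. The function name and defaults are hypothetical.

```python
import numpy as np

def em_shared_mixture(datasets, n_components, n_iter=200, seed=0):
    """EM for 1D Gaussian mixtures whose components are shared across
    datasets while each dataset keeps its own mixing proportions."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate(datasets)
    means = rng.choice(pooled, size=n_components, replace=False)
    variances = np.full(n_components, pooled.var())
    priors = [np.full(n_components, 1.0 / n_components) for _ in datasets]

    for _ in range(n_iter):
        # E-step: responsibilities, computed with each dataset's own priors.
        resp = []
        for x, pi in zip(datasets, priors):
            log_p = (np.log(pi + 1e-12)
                     - 0.5 * np.log(2.0 * np.pi * variances)
                     - 0.5 * (x[:, None] - means) ** 2 / variances)
            log_p -= log_p.max(axis=1, keepdims=True)  # numerical stability
            r = np.exp(log_p)
            resp.append(r / r.sum(axis=1, keepdims=True))

        # M-step: mixing proportions per dataset...
        priors = [r.mean(axis=0) for r in resp]
        # ...and Gaussian parameters pooled over all datasets.
        r_all = np.vstack(resp)  # same row order as `pooled`
        nk = r_all.sum(axis=0) + 1e-12
        means = (r_all * pooled[:, None]).sum(axis=0) / nk
        variances = (r_all * (pooled[:, None] - means) ** 2).sum(axis=0) / nk
        variances = np.maximum(variances, 1e-6)
    return means, variances, priors
```

Passing the training inputs and the (unlabelled) test inputs as two datasets yields one set of components and two sets of proportions; the difference between those proportions is the estimated shift.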

  16. (no text on slide)

  17. tests • 1D linear, sample from prior form, BIC model selection, 100 tests.

  18. tests • 4D nonlinear, auto-mpg data, Gaussian process regressors, BIC. • Trained on one origin of car • Tested on 2 other origins

  19. issues • Single training dataset • No targets for new domain • Semi-supervised: a few target values might help to distinguish between different potential shift models. • Dataset shift  Transfer Learning

  20. from here... • Transference • Dealing with the more general problem of multiple datasets, multiple domains • Topic modelling and multilevel topic modelling • What is a domain or dataset anyway? Structured data. • More general than regression. Varying fields. Missing data. Semi-supervised learning. • Characterising the general case. • Mixtures and mixing • Dataset production • Non-parametric methods and local minima reduction

  21. interim • Transference is really structure modelling • Dataset shift implies unsupervised learning! • Using conditional models implies a particular full generative model under dataset shift scenarios. • But in unsupervised learning people have been dealing with dataset shift for a long time… by modelling for it. • e.g. intra- versus inter-subject variability. • In real life, modelling for the variability is the most common approach. Never simple.

  22. Diffusion Tensor Imaging • Brain MRI technique that looks at the anisotropy of water diffusion in the brain.

  23. the white matter

  24. diffusion tensor • The diffusion of water at each voxel is commonly modelled as a three-dimensional second-order tensor, D. • Think of it as an ellipsoid with some principal direction.
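The ellipsoid picture corresponds to the eigendecomposition of D: the eigenvectors give the axes and the eigenvalues their squared lengths, up to scale. The sketch below extracts the principal direction and also computes fractional anisotropy (the "FA" referred to later in the talk) using its standard definition. The function name and the example tensor are illustrative assumptions.

```python
import numpy as np

def principal_direction_and_fa(D):
    """Principal diffusion direction and fractional anisotropy of a
    symmetric 3x3 diffusion tensor D."""
    evals, evecs = np.linalg.eigh(D)   # eigenvalues in ascending order
    principal = evecs[:, -1]           # axis of fastest diffusion (sign is arbitrary)
    mean_diffusivity = evals.mean()
    fa = np.sqrt(1.5 * np.sum((evals - mean_diffusivity) ** 2) / np.sum(evals ** 2))
    return principal, float(fa)

# Hypothetical tensor: diffusion three times faster along the first axis.
D = np.diag([3.0, 1.0, 1.0]) * 1e-3
direction, fa = principal_direction_and_fa(D)
print(direction, fa)
```

FA is 0 for an isotropic tensor (a sphere) and approaches 1 as diffusion concentrates along a single axis.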

  25. The problem • “White matter integrity” matters in studies of ageing. • But to study white matter integrity, we have to compare across subjects, and within subjects. • But subjects' brains are different anyway. • Need to account for shifts between brains in mapping results. • Use diffusion tensor imaging. Currently: Use FA.

  26. Tractography • Would like to combine local direction components into consistent “tracts”. • But the measurements are noisy… • Set up a Markov Random Field

  27. Behrens et al. • And then sample streamlines from the random field. • Can either work with streamline samples, or compute marginals: P(tract goes through X | same tract goes through SEED).

  28. Seed points as hypotheses • Single seed point is more specific than a seeding region • But tract reconstruction is highly sensitive to seed placement • Neighbourhood tractography (NT) treats a group of “candidate” seed points as hypotheses • Uses tract shape and length to find best resulting match to a reference tract • Clayden et al., NeuroImage, 2006

  29. Bayesian model comparison • Given some reference tract from one brain. • Is this tract in a second brain the same tract as the reference tract? • Compare P(tract) with P(tract | reference tract) • But want consistency! The reference tract is just any other tract. Need a model with P(tract)=

  30. Model choice • Model Comparison or Model choice? • In fact we have a number of candidate matches. • Presume at most one is right. Could be that none match. • Compute P(this is right match).
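One way to turn "at most one is right, possibly none" into a posterior is to enumerate the hypotheses: either no candidate matches, or exactly candidate k does. Under hypothesis k, candidate k is scored by the matched model and every other candidate by the unmatched model, so after dividing out the common factor each hypothesis is weighted by a single likelihood ratio. The sketch below assumes a uniform prior over candidates and a free parameter `prior_none` for the no-match case; both are illustrative choices, not values from the talk.

```python
import numpy as np

def match_posteriors(loglik_matched, loglik_unmatched, prior_none=0.5):
    """Posterior over 'no candidate matches' and 'candidate k is the match'.

    loglik_matched[k] and loglik_unmatched[k] are the log-likelihoods of
    candidate k under the matched and unmatched models respectively.
    """
    lm = np.asarray(loglik_matched, dtype=float)
    lu = np.asarray(loglik_unmatched, dtype=float)
    n = lm.size
    # Relative to the all-unmatched hypothesis, hypothesis k gains the
    # log likelihood ratio of candidate k only.
    log_post = np.concatenate((
        [np.log(prior_none)],
        np.log((1.0 - prior_none) / n) + (lm - lu),
    ))
    log_post -= log_post.max()          # numerical stability
    post = np.exp(log_post)
    post /= post.sum()
    return float(post[0]), post[1:]

p_none, p_match = match_posteriors([-1.0, -5.0], [-3.0, -3.0])
print(p_none, p_match)
```

The returned probabilities sum to one across all hypotheses, so a low value for every candidate signals that none of them is a convincing match.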

  31. median tract spline fit • Work with streamlines. Reduce to Median Tract. • Fit a B-spline to the 3D median tract. • Adjust knot point positions to constrain error on reference tract. • [Figure label: Seed point]

  32. Two models: P(cos[]) and P(cos[], cos[r] | cos[]) = P(cos[]) P(cos[r] | cos[], cos[]) • Derive second from assumption v1* symmetric about v1. • [Figure labels: v0, v1]

  33. model • The cosine of the angle is uniform if the direction is uniform on the unit sphere. • Use a Beta distribution + uniform component to model probabilities. • Compute using hand-labelled training data. • Model whole tract as product of individual step probabilities. • 2 cases: unmatched, matched.
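The first bullet is easy to check numerically, and the remaining bullets can be sketched as a density over step cosines. In the code below the parametrisation is an assumption made for illustration: the cosine c in [-1, 1] is mapped to u = (c + 1) / 2, the Beta component acts on u, and `weight`, `a`, `b` are hypothetical parameter names. The talk does not state how the mixture was actually parametrised.

```python
import math
import numpy as np

rng = np.random.default_rng(0)

# Directions uniform on the unit sphere: normalise isotropic Gaussian samples.
v = rng.normal(size=(100_000, 3))
v /= np.linalg.norm(v, axis=1, keepdims=True)
cos_angle = v[:, 2]  # cosine of the angle to a fixed axis
# Uniform on [-1, 1] would give mean 0 and variance 1/3.
print(cos_angle.mean(), cos_angle.var())

def step_density(c, weight, a, b):
    """Density of one step cosine c in [-1, 1] under a Beta + uniform mixture."""
    u = np.clip((np.asarray(c, dtype=float) + 1.0) / 2.0, 1e-12, 1.0 - 1e-12)
    log_norm = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    beta_pdf = np.exp((a - 1.0) * np.log(u) + (b - 1.0) * np.log(1.0 - u) - log_norm)
    # Factor 0.5: Jacobian of u = (c + 1) / 2, and the uniform density on [-1, 1].
    return weight * 0.5 * beta_pdf + (1.0 - weight) * 0.5

def tract_log_likelihood(cosines, weight, a, b):
    """Whole tract as a product of step probabilities (sum of logs)."""
    return float(np.sum(np.log(step_density(cosines, weight, a, b))))
```

Fitting one parameter set to matched tracts and another to unmatched tracts would give the two cases named on the slide; their log-likelihoods are the quantities a model-choice step needs.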

  34. results

  35. match quality • Posterior probabilities for the second and third subjects: • 1: 0.332, 2: 0.344, 3: 0.822, 4: 0.588, 5: 0.877 • For the first subject, the best match (top): 0.464, next best (middle): 0.116. • Three tracts >0.1, five >0.05 (all plausible matches). • This is out of 220 candidate seeds. • The posterior for the “central seed” (bottom) was 5.28 × 10^-6.

  36. Now we can compare like with like across brains: compute tract integrity measures. • Major improvement in comparative results. • Clayden J.D., A.J. Storkey, S. Muñoz Maniega & M.E. Bastin (2009). Reproducibility of tract segmentation between sessions using an unsupervised modelling-based approach. NeuroImage 45:377-385. • Bastin M.E., J.P. Piatkowski, A.J. Storkey, L.J. Brown, A.M. MacLullich & J.D. Clayden (2008). Tract shape modelling provides evidence of topological change in corpus callosum genu during normal ageing. NeuroImage 43:20-28. • Bastin M.E., S. Muñoz Maniega, K.J. Ferguson, L.J. Brown, J.M. Wardlaw, A.M. MacLullich & J.D. Clayden (2010). Quantifying the effects of normal ageing on white matter structure using unsupervised tract shape modelling. NeuroImage 51(1):1-10. • Penke L., S. Muñoz Maniega, L.M. Houlihan, C. Murray, A.J. Gow, J.D. Clayden, M.E. Bastin, J.M. Wardlaw & I.J. Deary (2010). White matter integrity in the splenium of the corpus callosum is related to successful cognitive aging and partly mediates the protective effect of an ancestral polymorphism in ADRB2. Behavior Genetics 40(2):146-156. • Use match

  37. Conclusions • Dataset shift happens all the time • There are some common generic causes • Modelling involves a full generative understanding. • In many realistic scenarios accommodating shifts is non-trivial. • Model for likely changes.
