
Presentation Transcript


  1. Prototype-based classifiers and their applications in the life-sciences. Michael Biehl, Mathematics and Computing Science, University of Groningen / NL, www.cs.rug.nl/~biehl

  2. LVQ and Relevance Learning: frequently asked questions and rarely given answers. Michael Biehl, Mathematics and Computing Science, University of Groningen / NL, www.cs.rug.nl/~biehl

  3. frequently asked questions: So, why do you still do this LVQ stuff?

  4. frequently asked questions. basics: distance-based classifiers, relevance learning. What about the curse of dimensionality? How do you find a good distance measure? example: Generalized Matrix LVQ. What about over-fitting? Is the relevance matrix unique? Is it useful in practice? application: bio-medical data, adrenal tumors. outlook: What's next?

  5. K-NN classifier, a simple distance-based classifier: • store a set of labeled examples • classify a query according to the label of the Nearest Neighbor (or the majority of the K nearest neighbors) • piece-wise linear class borders, parameterized by all examples. + conceptually simple, no training required, one parameter (K) - expensive storage and computation, sensitivity to “outliers”, can result in overly complex decision boundaries
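
A minimal sketch of the K-NN rule described on this slide, assuming the stored examples are given as numpy arrays X_train (features) and y_train (labels); the names and the choice of K are illustrative, not taken from the slides:

```python
import numpy as np

def knn_classify(X_train, y_train, query, k=3):
    """Assign the majority label among the k nearest stored examples."""
    d = np.sum((X_train - query) ** 2, axis=1)   # squared Euclidean distance to every stored example
    nearest = np.argsort(d)[:k]                  # indices of the k nearest neighbors
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]             # majority vote among the neighbors
```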

  6. prototype based classification, a prototype based classifier: • represent the data by one or several prototypes per class • classify a query according to the label of the nearest prototype (or alternative schemes) • piece-wise linear class borders, parameterized by prototypes. + less sensitive to outliers, lower storage needs, little computational effort in the working phase - training phase required in order to place prototypes, model selection problem: number of prototypes per class, etc.

  7. Nearest Prototype Classifier: set of prototypes w_1, ..., w_M carrying class labels c(w_i), based on a dissimilarity / distance measure d(w, x). Given a query x, determine the winner w_J with d(w_J, x) ≤ d(w_i, x) for all i and assign x to class c(w_J). Minimal requirements: d(w, x) ≥ 0 and d(w, w) = 0. Standard example: squared Euclidean distance d(w, x) = (x − w)² = Σ_j (x_j − w_j)².
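
The nearest prototype classifier itself is only a few lines; a sketch with the squared Euclidean distance (W holds one prototype per row, c the corresponding class labels; both names are assumptions for illustration):

```python
import numpy as np

def npc_classify(W, c, x):
    """Assign x to the class of the closest prototype (squared Euclidean distance)."""
    d = np.sum((W - x) ** 2, axis=1)   # d(w_i, x) for all prototypes
    J = np.argmin(d)                   # winner: minimal distance
    return c[J]                        # class label carried by the winner
```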

  8. Learning Vector Quantization: N-dimensional data, feature vectors ∙ identification of prototype vectors from labeled example data ∙ distance based classification (e.g. Euclidean). competitive learning: LVQ1 [Kohonen, 1990] • initialize prototype vectors for different classes • present a single example • identify the winner (closest prototype) • move the winner closer towards the data (same class) or away from the data (different class)
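
A compact sketch of the LVQ1 loop outlined above; the learning rate and the number of sweeps are illustrative choices, not values from the slides:

```python
import numpy as np

def lvq1_train(X, y, W, c, eta=0.05, epochs=30, seed=0):
    """LVQ1: attract the winner for same-class examples, repel it otherwise."""
    rng = np.random.default_rng(seed)
    W = W.copy()
    for _ in range(epochs):
        for i in rng.permutation(len(X)):          # present single examples in random order
            d = np.sum((W - X[i]) ** 2, axis=1)    # distances to all prototypes
            J = np.argmin(d)                       # identify the winner (closest prototype)
            sign = 1.0 if c[J] == y[i] else -1.0   # same class: move closer, different class: move away
            W[J] += sign * eta * (X[i] - W[J])
    return W
```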

  9. Learning Vector Quantization: N-dimensional data, feature vectors ∙ identification of prototype vectors from labeled example data ∙ distance-based classification [here: Euclidean distances] ∙ tessellation of feature space [piece-wise linear] ∙ aim: discrimination of classes (≠ vector quantization or density estimation) ∙ generalization ability: correct classification of new data

  10. What about the curse of dimensionality? concentration of norms/distances for large N: “distance based methods are bound to fail in high dimensions”? LVQ: - prototypes are not just random data points - carefully selected representatives of the data - distances of a given data point to prototypes are compared: projection to a non-trivial low-dimensional subspace! see also: [Ghosh et al., 2007, Witoelar et al., 2010], models of LVQ training with analytical treatment in the limit of large N; successful training needs a number of training examples on the order of N.

  11. cost function based LVQ, one example: Generalized LVQ [Sato & Yamada, 1995]. Minimize E = Σ_μ Φ(e_μ) with e_μ = (d_J − d_K)/(d_J + d_K), based on the two winning prototypes: d_J is the distance to the closest prototype with the correct class, d_K the distance to the closest prototype with a wrong class. Linear Φ: E favors large margin separation of classes. Sigmoidal Φ (linear for small arguments): depending on its steepness, E approximates the number of misclassifications or favors class-typical prototypes.
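
A sketch of the cost contribution of a single example as reconstructed above; the sigmoid steepness gamma is an assumed hyper-parameter:

```python
import numpy as np

def glvq_cost_term(x, y, W, c, gamma=2.0):
    """Phi(e) with e = (dJ - dK)/(dJ + dK) in [-1, 1]; e < 0 means x is classified correctly."""
    d = np.sum((W - x) ** 2, axis=1)
    dJ = d[c == y].min()                      # closest prototype with the correct class
    dK = d[c != y].min()                      # closest prototype with a wrong class
    e = (dJ - dK) / (dJ + dK)
    return 1.0 / (1.0 + np.exp(-gamma * e))   # sigmoidal Phi
```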

  12. cost function based LVQ: “There is nothing objective about objective functions” (J. McClelland)

  13. GLVQ training = optimization of E with respect to the prototype positions, e.g. single example presentation: stochastic sequence of examples, update of the two winning prototypes per step, based on a non-negative, differentiable distance. Requirement: the update decreases d_J and increases d_K.

  14. GLVQ training = optimization of E with respect to the prototype positions, e.g. single example presentation: stochastic sequence of examples, update of the two winning prototypes per step, based on a non-negative, differentiable distance.

  15. GLVQ training = optimization of E with respect to the prototype positions, e.g. single example presentation: stochastic sequence of examples, update of the two winning prototypes per step. Based on the (squared) Euclidean distance, the gradient update moves the prototypes towards / away from the sample, Δw_J ∝ + (x − w_J) and Δw_K ∝ − (x − w_K), with prefactors derived from Φ′(e_μ) and the derivatives of e_μ with respect to d_J and d_K; see the sketch below.
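
A sketch of one stochastic GLVQ step for the squared Euclidean distance, with Phi taken as the identity for simplicity; the prefactors 4·dK/(dJ+dK)² and 4·dJ/(dJ+dK)² follow from differentiating e_μ, and the function name is illustrative:

```python
import numpy as np

def glvq_step(x, y, W, c, eta=0.05):
    """Update the two winning prototypes by gradient descent on e = (dJ - dK)/(dJ + dK)."""
    d = np.sum((W - x) ** 2, axis=1)
    J = np.where(c == y, d, np.inf).argmin()     # closest correct prototype
    K = np.where(c != y, d, np.inf).argmin()     # closest incorrect prototype
    dJ, dK = d[J], d[K]
    denom = (dJ + dK) ** 2
    W[J] += eta * 4.0 * dK / denom * (x - W[J])  # move correct winner towards the sample
    W[K] -= eta * 4.0 * dJ / denom * (x - W[K])  # move incorrect winner away from the sample
    return W
```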

  16. What is a good distance measure? fixed distance measures: - select distance measures according to prior knowledge - data driven choice in a preprocessing step - compare performance of various measures; example: divergence based LVQ (DLVQ), Mwebaze et al., Neurocomputing (2011). relevance learning: - employ a parameterized distance measure - update its parameters in the training process together with prototype training - adaptive, data driven dissimilarity; example: Matrix Relevance LVQ

  17. Relevance Matrix LVQ [Schneider et al., 2009]: generalized quadratic distance in LVQ, d_Λ(w, x) = (x − w)ᵀ Λ (x − w) with Λ = Ωᵀ Ω positive semi-definite. variants: one global, several local, class-wise relevance matrices Λ(j) → piecewise quadratic decision boundaries; diagonal matrices: single feature weights [Bojer et al., 2001] [Hammer et al., 2002]; rectangular Ω: discriminative low-dim. representation, e.g. for visualization [Bunte et al., 2012]. possible constraints: rank-control, sparsity, …
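
The generalized quadratic distance is conveniently parameterized via Λ = ΩᵀΩ, which keeps Λ positive semi-definite by construction. A minimal sketch (not the reference implementation from the toolbox linked at the end of the talk):

```python
import numpy as np

def gmlvq_distance(x, w, Omega):
    """d_Lambda(w, x) = (x - w)^T Omega^T Omega (x - w) = ||Omega (x - w)||^2."""
    z = Omega @ (x - w)   # linearly transformed difference vector
    return z @ z          # squared Euclidean norm in the transformed space
```

A rectangular Omega with only a few rows yields the discriminative low-dimensional representation mentioned above, since all distances are computed in the mapped space.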

  18. Relevance Matrix LVQ: joint optimization of prototypes and distance measure in the same training process → Generalized Matrix LVQ (GMLVQ)

  19. heuristic interpretation: the diagonal element Λ_ii = Σ_j Ω_ji² summarizes the contribution of original dimension i, i.e. the relevance of the original feature for the classification. Since d_Λ(w, x) = [Ω (x − w)]², this is the standard Euclidean distance for linearly transformed features. The interpretation implicitly assumes that features have equal order of magnitude, e.g. after z-score transformation to zero mean and unit variance (averages over the data set).
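
A sketch of the implicit preprocessing and the relevance read-out described above: z-score transform the features (zero mean, unit variance over the data set) and read feature relevances from the diagonal of Λ:

```python
import numpy as np

def zscore(X):
    """Rescale each feature to zero mean and unit variance (averages over the data set)."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def feature_relevances(Omega):
    """Diagonal elements Lambda_ii = sum_j Omega_ji^2: contribution of each original feature."""
    return np.diag(Omega.T @ Omega)
```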

  20. Classification of adrenal tumors. Petra Schneider, Han Stiekema, Michael Biehl: Johann Bernoulli Institute for Mathematics and Computer Science, University of Groningen. Wiebke Arlt, Angela Taylor, Dave J. Smith, Peter Nightingale, P.M. Stewart, C.H.L. Shackleton et al.: School of Medicine, Queen Elizabeth Hospital, University of Birmingham/UK (+ several centers in Europe). • [Arlt et al., J. Clin. Endocrinology & Metabolism, 2011] • [Biehl et al., Europ. Symp. Artificial Neural Networks (ESANN), 2012]

  21. adrenocortical tumors ∙ adrenocortical tumors, difficult differential diagnosis: ACC: adrenocortical carcinomas vs. ACA: adrenocortical adenomas ∙ idea: steroid metabolomics, tumor classification based on urinary steroid excretion, 32 candidate steroid markers

  22. adrenocortical tumors: Generalized Matrix LVQ, ACC vs. ACA classification • data set: 24 hrs. urinary steroid excretion, 102 patients with benign ACA, 45 patients with malignant ACC ∙ determine prototypes, typical profiles (1 per class) ∙ data divided into 90% training and 10% test set ∙ adaptive generalized quadratic distance measure parameterized by the relevance matrix Λ = Ωᵀ Ω ∙ apply classifier to test data, evaluate performance (error rates, ROC) ∙ repeat and average over many random splits, as sketched below
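
A sketch of the random-split protocol described above (90% training, 10% test, averaged over many splits). The functions train_gmlvq and test_error are hypothetical placeholders for an actual GMLVQ implementation, e.g. the toolbox linked at the end:

```python
import numpy as np

def evaluate_random_splits(X, y, train_gmlvq, test_error, n_splits=100, test_frac=0.1, seed=0):
    """Average test error over repeated random training/test splits."""
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(n_splits):
        idx = rng.permutation(len(X))
        n_test = int(test_frac * len(X))
        test, train = idx[:n_test], idx[n_test:]
        model = train_gmlvq(X[train], y[train])             # prototypes and Omega (hypothetical API)
        errors.append(test_error(model, X[test], y[test]))  # error rate on the held-out split
    return np.mean(errors)
```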

  23. prototypes: log-transformed steroid excretion in ACA/ACC, rescaled using healthy control group values [figure: prototype steroid profiles for ACA and ACC]

  24. adrenocortical tumors: relevance matrix [figure: diagonal and off-diagonal elements]; subset of 9 selected steroids ↔ technical realization (patented, University of Birmingham/UK)

  25. adrenocortical tumors [figure: scatter plot of two individually weakly discriminative markers, 5a-THA (8) vs. TH-Doc (12)]: highly discriminative combination of markers!

  26. adrenocortical tumors: ROC characteristics, clear improvement due to adaptive distances [figure: ROC curves, sensitivity vs. 1-specificity; AUC 0.87 for the Euclidean distance, 0.93 for diagonal relevances (GRLVQ), 0.97 for the full matrix (GMLVQ)]
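
For reference, ROC characteristics like these can be computed from any real-valued classifier score, e.g. the GLVQ-type difference of distances to the two class prototypes. A threshold-sweep sketch (not the exact evaluation code used in the study):

```python
import numpy as np

def roc_auc(scores, labels):
    """ROC curve and AUC from scores (higher = more likely positive) and 0/1 labels."""
    order = np.argsort(-np.asarray(scores))                  # decreasing score = decreasing threshold
    labels = np.asarray(labels)[order]
    tpr = np.concatenate(([0.0], np.cumsum(labels) / labels.sum()))            # sensitivity
    fpr = np.concatenate(([0.0], np.cumsum(1 - labels) / (1 - labels).sum()))  # 1 - specificity
    auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0)  # trapezoidal rule
    return fpr, tpr, auc
```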

  27. frequently asked questions: What about over-fitting? matrices introduce O(N²) additional adaptive parameters! Is the relevance matrix unique? - uniqueness of the parameterization (Ω for given Λ)? - uniqueness of the relevance matrix Λ? How relevant are the relevances? - interpretation of the relevance matrix (uniqueness)

  28. What about over-fitting? observation: low rank of the resulting relevance matrix, effective # of degrees of freedom ~ N [figure: eigenvalues of Λ in the ACA/ACC classification]. mathematics: stationarity conditions; columns of the stationary Ωᵀ are vectors in the eigenspace associated with the smallest eigenvalue of the pseudo-covariance Γ, which • is not necessarily positive • depends on Ω itself • cannot be determined prior to training. Biehl et al., Machine Learning Reports (2009); in preparation (forever)

  29. by-product: low-dim. representation [figure: data (control, benign, malignant) projected onto the first and second eigenvector of Λ]
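
A sketch of how such a two-dimensional visualization is obtained: project the (z-scored) data onto the leading eigenvectors of Λ:

```python
import numpy as np

def project_on_relevance_eigenvectors(X, Omega, n_dims=2):
    """Coordinates of the data along the leading eigenvectors of Lambda = Omega^T Omega."""
    eigvals, eigvecs = np.linalg.eigh(Omega.T @ Omega)   # eigenvalues in ascending order
    leading = eigvecs[:, ::-1][:, :n_dims]               # columns: leading eigenvectors of Lambda
    return X @ leading                                   # e.g. for a 2D scatter plot per class
```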

  30. Is the relevance matrix unique? (I) uniqueness of Ω, given Λ: irrelevant rotations, reflections, symmetries…. Λ = Ωᵀ Ω is positive semi-definite and symmetric, but its matrix square root Ω is not unique; a canonical representation is obtained from the eigen-decomposition of Λ.

  31. Is the relevance matrix unique? (II) uniqueness of Λ, given the transformation: a modified matrix Ω̃ = Ω + Δ is possible if the rows of Δ are in the null-space of the data → identical mapping of all examples, but a different Λ; such a null-space exists if the data matrix is singular, i.e. if features are correlated / linearly dependent; the argument is possible to extend by including the prototypes.

  32. regularization: the training process yields Λ = Ωᵀ Ω, whose eigenvectors and eigenvalues determine the regularization: project Ω onto the space spanned by the K leading eigenvectors of Λ. (K = J, the rank of Λ): removes null-space contributions, unique solution with minimal Euclidean norm of the row vectors. (K < J): retains the eigenspace corresponding to the largest eigenvalues and also removes the span of small non-zero eigenvalues.
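
A sketch of this eigenspace-based regularization, following my reading of the slide (keep only the K leading eigendirections of Λ when applying Ω; K equal to the rank removes just the null-space contributions, smaller K also discards small non-zero eigenvalues):

```python
import numpy as np

def regularize_omega(Omega, K):
    """Restrict the rows of Omega to the space spanned by the K leading eigenvectors of Lambda."""
    eigvals, eigvecs = np.linalg.eigh(Omega.T @ Omega)   # ascending eigenvalues of Lambda
    U = eigvecs[:, ::-1][:, :K]                          # K leading eigenvectors
    return Omega @ U @ U.T                               # regularized mapping; Lambda_reg = U U^T Lambda U U^T
```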

  33. regularization: pre-processing of the data (PCA-like) operates in the mapped feature space, with fixed K, while the prototypes are yet unknown; regularizing the mapping after/during training retains the original features, allows a flexible K and may include the prototypes. Strickert, Hammer, Villmann, Biehl: Regularization and improved interpretation of linear data mappings and adaptive distance measures, IEEE SSCI 2013.

  34. illustrative example: infra-red spectral data, 124 wine samples, 256 wavelengths; 30 training data, 94 test spectra; GMLVQ classification of alcohol content (high / medium / low) [figure: example spectra]

  35. GMLVQ [figure: performance with P = 30 dimensions vs. after null-space correction]: over-fitting effect; null-space correction; best performance with 7 dimensions remaining

  36. [figure: raw relevance matrix, original vs. regularized] posterior regularization: - enhances generalization - smoothens the relevance profile/matrix - removes ‘false relevances’ - improves interpretability of Λ

  37. What next? just two (selected) on-going projects (MIWOCI poster session) • Improved interpretation of linear mappings, with B. Frenay, D. Hofmann, A. Schulz, B. Hammer: minimal / maximal feature relevances by null-space contributions at constant (minimal) L1-norm of the Ω rows • Optimization of Receiver Operating Characteristics, with M. Kaden, P. Stürmer, T. Villmann: the statistical interpretation of the AUC (ROC) allows for direct optimization based on pairs of examples (one from each class)

  38. links Matlab collection: Relevance and Matrix adaptation in Learning Vector Quantization (GRLVQ, GMLVQ and LiRaM LVQ) http://matlabserver.cs.rug.nl/gmlvqweb/web/ Pre/re-prints etc.: http://www.cs.rug.nl/~biehl/
