Prototype-based classifiers and their applications in the life-sciences
Michael Biehl, Mathematics and Computing Science, University of Groningen / NL
www.cs.rug.nl/~biehl

LVQ and Relevance Learning: frequently asked questions and rarely given answers
Michael Biehl, Mathematics and Computing Science, University of Groningen / NL
www.cs.rug.nl/~biehl
frequently asked questions: So, why do you still do this LVQ stuff?
frequently asked questions
basics: distance-based classifiers, relevance learning
• What about the curse of dimensionality?
• How do you find a good distance measure?
example: Generalized Matrix LVQ
• What about over-fitting?
• Is the relevance matrix unique?
• Is it useful in practice?
application: bio-medical data, adrenal tumors
outlook: What's next?
K-NN classifier: a simple distance-based classifier
• store a set of labeled examples
• classify a query according to the label of the Nearest Neighbor (or the majority of its K nearest neighbors)
• piece-wise linear class borders, parameterized by all examples
+ conceptually simple, no training required, only one parameter (K)
- expensive storage and computation, sensitivity to "outliers", can result in overly complex decision boundaries
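A minimal sketch of the K-NN rule described above, in plain numpy (function and variable names are illustrative):

import numpy as np

def knn_classify(X_train, y_train, x_query, k=3):
    # distances of the query to all stored examples (squared Euclidean)
    d = np.sum((X_train - x_query) ** 2, axis=1)
    # indices of the k nearest neighbors
    nearest = np.argsort(d)[:k]
    # majority vote among their labels
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]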
prototype-based classification
• represent the data by one or several prototypes per class
• classify a query according to the label of the nearest prototype (or alternative schemes)
• piece-wise linear class borders, parameterized by the prototypes
+ less sensitive to outliers, lower storage needs, little computational effort in the working phase
- training phase required in order to place the prototypes; model selection problem: number of prototypes per class, etc.
Nearest Prototype Classifier (NPC)
set of prototypes w carrying class labels c(w)
based on a dissimilarity / distance measure d(w, ξ)
given a query ξ:
- determine the winner w* with d(w*, ξ) ≤ d(w, ξ) for all prototypes w
- assign ξ to the class c(w*)
minimal requirements: d(w, ξ) ≥ 0 and d(w, w) = 0
standard example: squared Euclidean distance d(w, ξ) = (ξ - w)²
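A corresponding sketch of the nearest prototype classifier, assuming the squared Euclidean distance given above (names are illustrative):

import numpy as np

def npc_classify(prototypes, proto_labels, x):
    # d(w, x) = (x - w)^2 for every prototype w
    d = np.sum((prototypes - x) ** 2, axis=1)
    # winner = prototype with minimal distance; assign its class label
    return proto_labels[np.argmin(d)]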
Learning Vector Quantization (LVQ)
N-dimensional data, feature vectors ξ ∈ ℝᴺ
∙ identification of prototype vectors from labeled example data
∙ distance-based classification (e.g. Euclidean)
competitive learning: LVQ1 [Kohonen, 1990]
• initialize prototype vectors for the different classes
• present a single example
• identify the winner (closest prototype)
• move the winner closer towards the data point (same class) or away from the data point (different class)
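A possible implementation of a single LVQ1 step as described above (learning rate and names are illustrative):

import numpy as np

def lvq1_step(prototypes, proto_labels, x, y, eta=0.05):
    # identify the winner, i.e. the prototype closest to the example x
    d = np.sum((prototypes - x) ** 2, axis=1)
    j = np.argmin(d)
    # move the winner towards x if the labels agree, away from x otherwise
    sign = 1.0 if proto_labels[j] == y else -1.0
    prototypes[j] += sign * eta * (x - prototypes[j])
    return j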
Learning Vector Quantization (LVQ)
N-dimensional data, feature vectors ξ ∈ ℝᴺ
∙ identification of prototype vectors from labeled example data
∙ distance-based classification [here: Euclidean distances]
∙ tessellation of feature space [piece-wise linear]
∙ aim: discrimination of classes (≠ vector quantization or density estimation)
∙ generalization ability: correct classification of new data
What about the curse of dimensionality?
concentration of norms/distances for large N: "distance-based methods are bound to fail in high dimensions"?
LVQ:
- prototypes are not just random data points, but carefully selected representatives of the data
- distances of a given data point to the prototypes are compared
→ projection to a non-trivial low-dimensional subspace!
see also: [Ghosh et al., 2007; Witoelar et al., 2010] models of LVQ training, analytical treatment in the limit of large N; successful training needs a number of training examples that grows only like N
cost function based LVQ
one example: Generalized LVQ [Sato & Yamada, 1995]
minimize the cost function E = Σ_μ Φ(e_μ) with e_μ = [d(w_J, ξ_μ) - d(w_K, ξ_μ)] / [d(w_J, ξ_μ) + d(w_K, ξ_μ)],
based on the two winning prototypes: w_J, the closest prototype with the correct class label, and w_K, the closest prototype with a different class label
- linear Φ, e.g. Φ(e) = e: E favors large-margin separation of the classes
- sigmoidal Φ (linear only for small arguments): E approximates the number of misclassifications
- small d(w_J, ξ) and large d(w_K, ξ) decrease E: the cost also favors class-typical prototypes
cost function based LVQ
"There is nothing objective about objective functions." (J. McClelland)
GLVQ training = optimization of E with respect to the prototype positions, e.g. by single example presentation: stochastic sequence of examples, update of the two winning prototypes per step, based on a non-negative, differentiable distance
requirement: the update decreases d(w_J, ξ) and increases d(w_K, ξ)
GLVQ training with the squared Euclidean distance: the update moves w_J towards the sample and w_K away from it, with prefactors obtained from the gradient of e_μ (a sketch of one update step follows below).
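A sketch of one stochastic GLVQ update with the identity Φ and squared Euclidean distance; the prefactors follow from differentiating e_μ = (d_J - d_K)/(d_J + d_K) (all names are illustrative):

import numpy as np

def glvq_step(prototypes, proto_labels, x, y, eta=0.01):
    d = np.sum((prototypes - x) ** 2, axis=1)
    same = proto_labels == y
    # closest correct prototype w_J and closest incorrect prototype w_K
    J = np.where(same)[0][np.argmin(d[same])]
    K = np.where(~same)[0][np.argmin(d[~same])]
    dJ, dK = d[J], d[K]
    # prefactors: de/d(dJ) = 2 dK / (dJ + dK)^2, de/d(dK) = -2 dJ / (dJ + dK)^2
    gJ = 2.0 * dK / (dJ + dK) ** 2
    gK = 2.0 * dJ / (dJ + dK) ** 2
    # gradient descent on e: attract w_J, repel w_K (factor 2 from the distance gradient)
    prototypes[J] += eta * gJ * 2.0 * (x - prototypes[J])
    prototypes[K] -= eta * gK * 2.0 * (x - prototypes[K])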
What is a good distance measure?
fixed distance measures:
- select a distance measure according to prior knowledge
- data-driven choice in a preprocessing step
- compare the performance of various measures
example: divergence-based LVQ (DLVQ), Mwebaze et al., Neurocomputing (2011)
relevance learning:
- employ a parameterized distance measure
- update its parameters in the training process, together with the prototypes
- adaptive, data-driven dissimilarity
example: Matrix Relevance LVQ
Relevance Matrix LVQ [Schneider et al., 2009]
generalized quadratic distance in LVQ: d_Λ(w, ξ) = (ξ - w)ᵀ Λ (ξ - w) with Λ = Ωᵀ Ω (positive semi-definite by construction)
variants:
- one global, several local, or class-wise relevance matrices Λ⁽ʲ⁾ → piecewise quadratic decision boundaries
- diagonal matrices: single feature weights [Bojer et al., 2001; Hammer et al., 2002]
- rectangular Ω: discriminative low-dimensional representation, e.g. for visualization [Bunte et al., 2012]
possible constraints: rank control, sparsity, …
Relevance Matrix LVQ: Generalized Matrix LVQ (GMLVQ) = joint optimization of the prototypes and of the distance measure, i.e. of the matrix Ω (a sketch of the adaptive distance follows below).
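A minimal sketch of the adaptive GMLVQ distance, assuming the parameterization Λ = Ωᵀ Ω given above (illustrative code, not the actual toolbox implementation):

import numpy as np

def gmlvq_distance(x, w, Omega):
    # d_Lambda(w, x) = (x - w)^T Omega^T Omega (x - w)
    # = squared Euclidean distance of the linearly transformed difference
    diff = Omega @ (x - w)
    return float(diff @ diff)

In GMLVQ, the same stochastic gradient scheme as in GLVQ is applied to this distance, updating the elements of Ω together with the prototype positions.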
heuristic interpretation: the diagonal element Λ_jj = Σ_i Ω_ij² summarizes the contribution of the original dimension j, i.e. the relevance of original feature j for the classification; d_Λ is the standard Euclidean distance for the linearly transformed features Ω ξ.
The interpretation implicitly assumes that all features have the same order of magnitude, e.g. after a z-score transformation to zero mean and unit variance per feature (averages taken over the data set); see the sketch below.
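Two small helper sketches for the points above: the z-score transformation of the features and the diagonal relevances obtained from Ω (names are illustrative):

import numpy as np

def z_score(X):
    # zero mean, unit variance per feature; averages taken over the data set
    return (X - X.mean(axis=0)) / X.std(axis=0)

def diagonal_relevances(Omega):
    # Lambda_jj = sum_i Omega_ij^2: relevance of original feature j,
    # here normalized so that the relevances sum to one
    lam = np.sum(Omega ** 2, axis=0)
    return lam / lam.sum()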
Classification of adrenal tumors
Petra Schneider, Han Stiekema, Michael Biehl
Johann Bernoulli Institute for Mathematics and Computer Science, University of Groningen
Wiebke Arlt, Angela Taylor, Dave J. Smith, Peter Nightingale, P.M. Stewart, C.H.L. Shackleton et al.
School of Medicine, Queen Elizabeth Hospital, University of Birmingham/UK (+ several centers in Europe)
• [Arlt et al., J. Clin. Endocrinology & Metabolism, 2011]
• [Biehl et al., Europ. Symp. Artificial Neural Networks (ESANN), 2012]
adrenocortical tumors
∙ adrenocortical tumors, difficult differential diagnosis:
  ACC: adrenocortical carcinomas
  ACA: adrenocortical adenomas
∙ idea: steroid metabolomics, tumor classification based on urinary steroid excretion; 32 candidate steroid markers
adrenocortical tumors: Generalized Matrix LVQ, ACC vs. ACA classification
• data set: 24 hrs. urinary steroid excretion, 102 patients with benign ACA, 45 patients with malignant ACC
∙ determine prototypes: typical profiles (1 per class)
∙ data divided into 90% training and 10% test set
∙ adaptive generalized quadratic distance measure parameterized by Ω
∙ apply the classifier to the test data, evaluate performance (error rates, ROC)
∙ repeat and average over many random splits (see the sketch below)
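A sketch of the validation scheme (repeated random 90%/10% splits with averaged test-set AUC), using scikit-learn utilities; train_and_score is a placeholder for fitting GMLVQ and returning a decision score per test sample:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

def repeated_split_auc(X, y, train_and_score, n_splits=100, test_size=0.1, seed=0):
    rng = np.random.RandomState(seed)
    aucs = []
    for _ in range(n_splits):
        # stratified random split: 90% training, 10% test
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, stratify=y,
            random_state=rng.randint(10 ** 6))
        scores = train_and_score(X_tr, y_tr, X_te)   # hypothetical GMLVQ wrapper
        aucs.append(roc_auc_score(y_te, scores))
    return np.mean(aucs), np.std(aucs)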
[Figure: ACA and ACC prototypes; log-transformed steroid excretion values, rescaled using the healthy control group]
adrenocortical tumors
[Figure: relevance matrix, diagonal and off-diagonal elements; a subset of 9 selected steroids ↔ technical realization (patented, University of Birmingham/UK)]
adrenocortical tumors
[Figure: scatter plot of two weakly discriminative markers, 5a-THA (8) vs. TH-Doc (12): a highly discriminative combination of markers!]
adrenocortical tumors: ROC characteristics (sensitivity vs. 1-specificity)
clear improvement due to adaptive distances:
AUC 0.87 (Euclidean), 0.93 (diagonal relevances, GRLVQ), 0.97 (full relevance matrix, GMLVQ)
frequently asked questions
What about over-fitting? relevance matrices introduce O(N²) additional adaptive parameters!
Is the relevance matrix unique?
- uniqueness of the parameterization (Ω for given Λ)?
- uniqueness of the relevance matrix Λ?
How relevant are the relevances?
- interpretation of the relevance matrix (→ uniqueness)
What about over-fitting?
observation: low rank of the resulting relevance matrix, effective # of degrees of freedom ~ N
[Figure: eigenvalues of Λ in the ACA/ACC classification]
mathematics: stationarity conditions show that the columns of a stationary Ωᵀ are vectors in the eigenspace associated with the smallest eigenvalue of a pseudo-covariance matrix Γ, which
• is not necessarily positive (semi-)definite
• depends on Ω itself
• cannot be determined prior to training
Biehl et al., Machine Learning Reports (2009); in preparation (forever)
by-product: low-dimensional representation of the data
[Figure: projection of control, benign (ACA) and malignant (ACC) samples onto the first and second eigenvector of Λ]
Is the relevance matrix unique? (I) uniqueness of Ω for a given Λ:
the factorization Λ = Ωᵀ Ω is invariant under irrelevant rotations, reflections and symmetries of Ω, i.e. the matrix square root is not unique;
canonical representation in terms of the eigen-decomposition of Λ: the symmetric, positive semi-definite square root (see the sketch below).
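A sketch of the canonical choice of Ω via the symmetric, positive semi-definite square root of Λ, computed from the eigen-decomposition mentioned above (illustrative code):

import numpy as np

def canonical_omega(Lambda):
    # eigen-decomposition of the symmetric matrix Lambda
    eigvals, V = np.linalg.eigh(Lambda)
    # guard against tiny negative eigenvalues due to round-off
    eigvals = np.clip(eigvals, 0.0, None)
    # Omega_hat = V diag(sqrt(eigvals)) V^T, removing the rotation ambiguity
    return (V * np.sqrt(eigvals)) @ V.T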
Is the relevance matrix unique? (II) uniqueness of the transformation itself:
a modified Ω is possible if the added rows lie in the null-space of the data → identical mapping of all examples, although the transformation differs;
the data matrix is singular if features are correlated or linearly dependent; the argument can be extended to include the prototypes.
regularization: the training process yields Ω with Λ = Ωᵀ Ω; its eigenvectors and eigenvalues determine the regularization, which projects Ω onto the K leading eigenvectors of Λ (see the sketch below):
(K = J, the rank of Λ): removes the null-space contributions, unique solution with minimal Euclidean norm of the row vectors of Ω
(K < J): retains only the eigenspace corresponding to the K largest eigenvalues, i.e. also removes the span of small non-zero eigenvalues
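A sketch of the regularization step described above: project Ω onto the eigenspace spanned by the K leading eigenvectors of Λ (illustrative code):

import numpy as np

def regularize_omega(Omega, K):
    Lambda = Omega.T @ Omega
    # eigh returns eigenvalues in ascending order
    eigvals, V = np.linalg.eigh(Lambda)
    V_K = V[:, -K:]                      # K leading eigenvectors
    # discard null-space (and, for smaller K, weakly relevant) directions
    return Omega @ V_K @ V_K.T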
regularization: two possible realizations
• pre-processing of the data (PCA-like): operates in the mapped feature space, K fixed beforehand, prototypes yet unknown
• regularized mapping after/during training: retains the original features, flexible K, may include the prototypes
Strickert, Hammer, Villmann, Biehl: Regularization and improved interpretation of linear data mappings and adaptive distance measures, IEEE SSCI 2013
illustrative example: GMLVQ classification of infra-red spectral data
124 wine samples, 256 wavelengths each; classes: low / medium / high alcohol content
30 training spectra, 94 test spectra
[Figure: GMLVQ over-fitting effect and null-space correction; P = 30 dimensions, best performance with 7 dimensions remaining]
[Figure: raw relevance matrix, original vs. after posterior regularization]
regularization:
- enhances generalization
- smoothens the relevance profile/matrix
- removes 'false relevances'
- improves interpretability of Λ
What next? just two (selected) on-going projects (MIWOCI poster session)
• Improved interpretation of linear mappings (with B. Frenay, D. Hofmann, A. Schulz, B. Hammer): minimal / maximal feature relevances from null-space contributions at constant (minimal) L1-norm of the rows of Ω
• Optimization of Receiver Operating Characteristics (with M. Kaden, P. Stürmer, T. Villmann): the statistical interpretation of the AUC (ROC) allows for direct optimization based on pairs of examples, one from each class (see the sketch below)
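A short sketch of the pairwise (Wilcoxon-Mann-Whitney) interpretation of the AUC that underlies the direct optimization mentioned above (names are illustrative):

import numpy as np

def pairwise_auc(scores_pos, scores_neg):
    # AUC = fraction of (positive, negative) pairs in which the positive
    # example receives the higher score; ties count as 1/2
    diff = scores_pos[:, None] - scores_neg[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))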
links Matlab collection: Relevance and Matrix adaptation in Learning Vector Quantization (GRLVQ, GMLVQ and LiRaM LVQ) http://matlabserver.cs.rug.nl/gmlvqweb/web/ Pre/re-prints etc.: http://www.cs.rug.nl/~biehl/