Introduction to Bioinformatics: Lecture VIII Classification and Supervised Learning

Jarek Meller
Division of Biomedical Informatics, Children’s Hospital Research Foundation & Department of Biomedical Engineering, UC
JM - http://folding.chmcc.org

Outline of the lecture
• Motivating story: correlating inputs and outputs
• Learning with a teacher
• Regression and classification problems
• Model selection, feature selection and generalization
• k-nearest neighbors and some other classification algorithms
• Phenotype fingerprints and their applications in medicine

Web watch: an on-line biology textbook by J. W. Kimball
Dr. J. W. Kimball's Biology Pages: http://users.rcn.com/jkimball.ma.ultranet/BiologyPages/
Story #1: B-cells and DNA editing; Apolipoprotein B and RNA editing: http://users.rcn.com/jkimball.ma.ultranet/BiologyPages/R/RNA_Editing.html#apoB_gene
Story #2: ApoB, cholesterol uptake, LDL and its endocytosis: http://users.rcn.com/jkimball.ma.ultranet/BiologyPages/E/Endocytosis.html#ldl
Complex patterns of mutations in genes related to cholesterol transport and uptake (e.g. LDLR, ApoB) may lead to an elevated level of LDL in the blood.

Correlations and fingerprints
The underlying molecular model is often difficult to decipher, so instead one may simply try to find correlations between inputs and outputs. If measurements on certain attributes correlate with molecular processes, underlying genomic structures, phenotypes, disease states, etc., one can use such attributes as indicators of these “hidden” states and make predictions for new cases. Consider, for example, elevated levels of low density lipoprotein (LDL) particles in the blood as an indicator (fingerprint) of atherosclerosis.

Correlations and fingerprints: LDL example
Healthy cases: blue; heart attack or stroke within 5 years of the exam: red (simulated data); x: LDL; y: HDL; z: age (see the study by Westendorp et al., Arch Intern Med. 2003, 163(13):1549).

LDL example: 2D projection

LDL example: regression with binary output and 1D projection for classification

Unsupervised vs. supervised learning
In the case of unsupervised learning, the goal is to “discover” the structure in the data and group (cluster) similar objects, given a similarity measure.
In the case of supervised learning (or learning with a teacher), a set of examples with class assignments (e.g. healthy vs. diseased) is given, and the goal is to find a representation of the problem in some feature (attribute) space that provides a proper separation of the imposed classes. Such representations, with the resulting decision boundaries, may subsequently be used to make predictions for new cases.
Figure labels: Class 1, Class 2, Class 3

Choice of the model, problem representation and feature selection: another simple example
Figure labels: adults, children (weight, height); F, M (estrogen, testosterone)

Gene expression example again: JRA clinical classes Picture: courtesy of B. Aronow

Advantages of prior knowledge; on the other hand, problems with class assignment (e.g. in clinical practice)
Figure labels: globins, FixL, PYP; no sequence similarity (??)
Prior knowledge: the same class despite low sequence similarity. This suggests that a distance based on sequence similarity alone is not sufficient; adding structure-derived features might help (the “good model” question again).

Three phases in supervised learning protocols
• Training data: examples with class assignments are given
• Learning: i) an appropriate model (or representation) of the problem needs to be selected in terms of attributes, distance measure and classifier type; ii) adaptive parameters in the model need to be optimized to provide correct classification of the training examples (e.g. by minimizing the number of misclassified training vectors)
• Validation: cross-validation, independent control sets and other measures of “real” accuracy and generalization should be used to assess the success of the model and the training phase (finding a trade-off between accuracy and generalization is not trivial)
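The validation phase can be illustrated with a minimal k-fold cross-validation sketch. All names here (k_fold_indices, cross_validate, train_majority, predict_majority) are illustrative and not part of the lecture, and the majority-class "model" is only a placeholder for whatever classifier is actually being trained.

```python
import random


def k_fold_indices(n_samples, n_folds, seed=0):
    """Shuffle sample indices and deal them into n_folds disjoint folds."""
    if not 0 < n_folds <= n_samples:
        raise ValueError("n_folds must be between 1 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def cross_validate(features, labels, train_fn, predict_fn, n_folds=5, seed=0):
    """Return the mean accuracy over folds, each fold held out once."""
    accuracies = []
    for held_out in k_fold_indices(len(features), n_folds, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(features)) if i not in held]
        # Train only on examples outside the held-out fold.
        model = train_fn([features[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        # Score only on the held-out examples.
        correct = sum(predict_fn(model, features[i]) == labels[i]
                      for i in held_out)
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / len(accuracies)


# Placeholder classifier (an assumption for illustration): always predicts
# the most frequent class seen during training.
def train_majority(train_x, train_y):
    return max(set(train_y), key=train_y.count)


def predict_majority(model, x):
    return model
```

The held-out accuracy, rather than the training accuracy, is what gives an estimate of generalization; any train/predict pair with the same call shape can be plugged in instead of the placeholder.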

Training set: LDL example again
• A set of objects (here patients) x_i, i = 1, …, N, is given. For each patient a set of features (attributes and the corresponding measurements on these attributes) is given too. Finally, for each patient we are given the class C_k, k = 1, …, K, he/she belongs to.

Age  LDL  HDL  Sex  Class
41   230  60   F    healthy (0)
32   120  50   M    stroke within 5 years (1)
45   90   70   M    heart attack within 5 years (1)

{ x_i , C_k }, i = 1, …, N

Optimizing adaptable parameters in the model
• Find a model y(x;w) that describes the objects of each class as a function of the features x and adaptive parameters (weights) w.
• Prediction: given x (e.g. LDL=240, age=52, sex=male), assign the class C=? (e.g. if y(x;w) > 0.5 then C=1, i.e. likely to suffer from a stroke or heart attack in the next 5 years)
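As a minimal sketch of the thresholding rule above, one possible choice of y(x;w) is a logistic function of a weighted sum of the features. The logistic form, the encoding sex=male as 1, and the weights below are assumptions made for illustration only; the weights are hand-picked, not fitted to any data.

```python
import math


def predict_risk(x, w, b):
    """Assumed logistic model y(x; w): maps features to a value in (0, 1)."""
    z = b + sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))


def assign_class(x, w, b, threshold=0.5):
    """Return class 1 if y(x; w) exceeds the threshold, otherwise class 0."""
    return 1 if predict_risk(x, w, b) > threshold else 0


# Hypothetical weights for (LDL, age, sex) and a bias term; NOT fitted values.
W = [0.02, 0.03, 0.5]
B = -6.0

patient = [240.0, 52.0, 1.0]  # LDL=240, age=52, sex=male encoded as 1
print(assign_class(patient, W, B))
```

In the learning phase the weights would be optimized on the training set rather than chosen by hand.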

Examples of machine learning algorithms for classification and regression problems
• Linear perceptron, Least Squares
• LDA/FDA (Linear/Fisher Discriminant Analysis) (simple linear cuts, kernel non-linear generalizations)
• SVM (Support Vector Machines) (optimal, wide margin linear cuts, kernel non-linear generalizations)
• Decision trees (logical rules)
• k-NN (k-Nearest Neighbors) (simple non-parametric)
• Neural networks (general non-linear models, adaptivity, “artificial brain”)

Training accuracy vs. generalization

Model complexity, training set size and generalization

Similarity measures

k-nearest neighbors as a simple algorithm for classification
• Given a training set of N objects with known class assignments and k < N, assign each new object (not included in the training set) to one of the classes based on the class assignments of its k nearest neighbors
• A simple, non-parametric method that works surprisingly well, especially in the case of low-dimensional problems
• Note, however, that the choice of the distance measure may again have a profound effect on the results
• The optimal k is found by trial and error, e.g. by comparing validation accuracy for several values of k

k-nearest neighbor algorithm
Step 1: Compute the distances from the new point to all training points and take the k closest neighbors
Step 2: Assign the class by simple majority voting: the new point belongs to the class that has the most representatives among these k neighbors
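The two steps can be sketched in a few lines of Python. This is a minimal illustration, assuming Euclidean distance on unscaled features; the function names and the toy (LDL, HDL) data are invented for the example and do not come from the lecture. A leave-one-out helper is included to show one way of choosing k by trial and error.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def knn_classify(train_x, train_y, query, k=3, distance=euclidean):
    """Classify query by majority vote among its k nearest training points."""
    if not 0 < k <= len(train_x):
        raise ValueError("k must be between 1 and the training set size")
    # Step 1: rank training points by distance to the query, keep k closest.
    ranked = sorted(zip(train_x, train_y),
                    key=lambda pair: distance(pair[0], query))
    # Step 2: majority vote over the labels of the k nearest neighbors.
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def leave_one_out_accuracy(train_x, train_y, k):
    """Fraction of points classified correctly when each is held out in turn."""
    correct = 0
    for i in range(len(train_x)):
        rest_x = train_x[:i] + train_x[i + 1:]
        rest_y = train_y[:i] + train_y[i + 1:]
        correct += knn_classify(rest_x, rest_y, train_x[i], k) == train_y[i]
    return correct / len(train_x)


# Invented toy data: (LDL, HDL) pairs with labels 1 = event, 0 = healthy.
TRAIN_X = [(230, 40), (220, 45), (240, 35), (100, 65), (110, 70), (90, 60)]
TRAIN_Y = [1, 1, 1, 0, 0, 0]

print(knn_classify(TRAIN_X, TRAIN_Y, (225, 42), k=3))
for k in (1, 3, 5):
    print(k, leave_one_out_accuracy(TRAIN_X, TRAIN_Y, k))
```

With two classes, an odd k avoids tied votes; this sketch does not define a tie-breaking rule. Because the features are not rescaled, the attribute with the largest numeric range dominates the distance, which is one concrete way the choice of distance measure affects the result.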
