
Introduction to Bioinformatics: Lecture VIII Classification and Supervised Learning






Presentation Transcript


  1. Introduction to Bioinformatics: Lecture VIII: Classification and Supervised Learning. Jarek Meller, Division of Biomedical Informatics, Children’s Hospital Research Foundation & Department of Biomedical Engineering, UC. JM - http://folding.chmcc.org

  2. Outline of the lecture
  • Motivating story: correlating inputs and outputs
  • Learning with a teacher
  • Regression and classification problems
  • Model selection, feature selection and generalization
  • k-nearest neighbors and some other classification algorithms
  • Phenotype fingerprints and their applications in medicine

  3. Web watch: an on-line biology textbook by JW Kimball
  Dr. J. W. Kimball's Biology Pages: http://users.rcn.com/jkimball.ma.ultranet/BiologyPages/
  Story #1: B-cells and DNA editing; Apolipoprotein B and RNA editing: http://users.rcn.com/jkimball.ma.ultranet/BiologyPages/R/RNA_Editing.html#apoB_gene
  Story #2: ApoB, cholesterol uptake, LDL and its endocytosis: http://users.rcn.com/jkimball.ma.ultranet/BiologyPages/E/Endocytosis.html#ldl
  Complex patterns of mutations in genes related to cholesterol transport and uptake (e.g. LDLR, ApoB) may lead to an elevated level of LDL in the blood.

  4. Correlations and fingerprints Instead of an underlying molecular model that is often difficult to decipher, one may simply try to find correlations between inputs and outputs. If measurements of certain attributes correlate with molecular processes, underlying genomic structures, phenotypes, disease states etc., one can use such attributes as indicators of these “hidden” states and make predictions for new cases. Consider, for example, an elevated level of low density lipoprotein (LDL) particles in the blood as an indicator (fingerprint) of atherosclerosis.

  5. Correlations and fingerprints: LDL example Healthy cases: blue; heart attack or stroke within 5 years of the exam: red (simulated data). x – LDL; y – HDL; z – age (see the study by Westendorp et al., Arch Intern Med. 2003;163(13):1549).

  6. LDL example: 2D projection

  7. LDL example: regression with binary output and 1D projection for classification

  8. Unsupervised vs. supervised learning In the case of unsupervised learning, the goal is to “discover” the structure in the data and group (cluster) similar objects, given a similarity measure. In the case of supervised learning (or learning with a teacher), a set of examples with class assignments (e.g. healthy vs. diseased) is given and the goal is to find a representation of the problem in some feature (attribute) space that provides a proper separation of the imposed classes. Such representations, with the resulting decision boundaries, may subsequently be used to make predictions for new cases. (Figure: decision boundaries separating Class 1, Class 2 and Class 3.)
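  The contrast can be made concrete in code. A minimal sketch, assuming scikit-learn and simulated data (none of this appears in the lecture): the clustering step sees only the feature vectors, while the classifier is also given the class labels supplied by the “teacher”.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                 # feature vectors (simulated)
y = (X[:, 0] + X[:, 1] > 0).astype(int)       # class labels, used only in the supervised case

# Unsupervised: only X is given; the algorithm groups similar points.
clusters = KMeans(n_clusters=2, n_init=10).fit_predict(X)

# Supervised: (X, y) pairs are given; the classifier learns a decision boundary.
clf = KNeighborsClassifier(n_neighbors=5).fit(X, y)
new_case = np.array([[0.3, -0.1]])
print(clf.predict(new_case))                  # predicted class for a new case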

  9. Choice of the model, problem representation and feature selection: another simple example (figure: a height/weight feature space separates adults from children, while an estrogen/testosterone feature space separates F from M)

  10. Gene expression example again: JRA clinical classes Picture: courtesy of B. Aronow

  11. Advantages of prior knowledge, and problems with class assignment (e.g. in clinical practice) on the other hand. (Figure: GLOBINS, FixL and PYP, with no sequence similarity between them.) Prior knowledge places them in the same class despite low sequence similarity, suggesting that a distance based on sequence similarity alone is not sufficient; adding structure-derived features might help (the “good model” question again).

  12. Three phases in supervised learning protocols
  • Training data: examples with class assignments are given
  • Learning: i) an appropriate model (or representation) of the problem needs to be selected in terms of attributes, distance measure and classifier type; ii) the adaptive parameters in the model need to be optimized to provide correct classification of the training examples (e.g. minimizing the number of misclassified training vectors)
  • Validation: cross-validation, independent control sets and other measures of “real” accuracy and generalization should be used to assess the success of the model and the training phase (finding the trade-off between accuracy and generalization is not trivial); a sketch of the three phases follows below
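  A hedged sketch of the three phases, assuming scikit-learn and simulated data (the data, the choice of k-NN as the model, and the 70/30 split are illustrative assumptions, not part of the lecture):

import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))                       # attributes (e.g. age, LDL, HDL), simulated
y = (X[:, 1] > 0.5).astype(int)                     # class labels supplied by the "teacher"

# Phase 1: training data -- keep an independent control set aside.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Phase 2: learning -- choose a model and optimize its adaptive parameters on the training set.
model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# Phase 3: validation -- cross-validation plus accuracy on the independent control set.
cv_acc = cross_val_score(model, X_train, y_train, cv=5).mean()
test_acc = model.score(X_test, y_test)
print(f"cross-validated accuracy: {cv_acc:.2f}, control-set accuracy: {test_acc:.2f}")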

  13. Training set: LDL example again
  • A set of objects (here patients) xi, i = 1, …, N is given. For each patient, a set of features (attributes and the corresponding measurements on these attributes) is given too. Finally, for each patient we are given the class Ck, k = 1, …, K, he/she belongs to: { xi, Ck }, i = 1, …, N

  Age  LDL  HDL  Sex  Class
  41   230  60   F    healthy (0)
  32   120  50   M    stroke within 5 years (1)
  45   90   70   M    heart attack within 5 years (1)
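  As a minimal sketch of how this training set could be encoded (the values are those on the slide; coding sex as 0/1 is an assumption made here for illustration):

import numpy as np

#             age  LDL  HDL  sex (F=0, M=1)
X = np.array([[41, 230, 60, 0],
              [32, 120, 50, 1],
              [45,  90, 70, 1]], dtype=float)

# Class labels: 0 = healthy, 1 = stroke or heart attack within 5 years
y = np.array([0, 1, 1])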

  14. Optimizing adaptable parameters in the model
  • Find a model y(x;w) that describes the objects of each class as a function of the features x and the adaptive parameters (weights) w.
  • Prediction: given x (e.g. LDL=240, age=52, sex=male), assign the class C=? (e.g. if y(x;w) > 0.5 then C=1, i.e. the patient is likely to suffer a stroke or heart attack in the next 5 years)
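  One possible instance of such a model y(x;w) is logistic regression; the lecture does not prescribe a particular model, so the sketch below is illustrative only. The three training rows and the new case (LDL=240, age=52, sex=male) come from the slides; the HDL value for the new case and the sex coding are assumptions.

import numpy as np
from sklearn.linear_model import LogisticRegression

X_train = np.array([[41, 230, 60, 0],
                    [32, 120, 50, 1],
                    [45,  90, 70, 1]], dtype=float)   # age, LDL, HDL, sex (F=0, M=1)
y_train = np.array([0, 1, 1])

model = LogisticRegression().fit(X_train, y_train)    # optimizes the weights w on the training set

x_new = np.array([[52, 240, 60, 1]])                  # age=52, LDL=240, HDL assumed, sex=male
p = model.predict_proba(x_new)[0, 1]                  # y(x;w), here a class-1 probability
C = 1 if p > 0.5 else 0                               # the 0.5 decision rule from the slide
print(p, C)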

  15. Examples of machine learning algorithms for classification and regression problems
  • Linear perceptron, Least Squares
  • LDA/FDA (Linear/Fisher Discriminant Analysis) (simple linear cuts, kernel non-linear generalizations)
  • SVM (Support Vector Machines) (optimal, wide-margin linear cuts, kernel non-linear generalizations)
  • Decision trees (logical rules)
  • k-NN (k-Nearest Neighbors) (simple, non-parametric)
  • Neural networks (general non-linear models, adaptivity, “artificial brain”)
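  Several of these families are available off the shelf; a sketch, assuming scikit-learn and simulated data (the specific hyperparameters are arbitrary choices made for the example):

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(150, 4))                  # simulated feature vectors
y = (X[:, 0] - X[:, 2] > 0).astype(int)        # simulated class labels

models = {
    "LDA": LinearDiscriminantAnalysis(),
    "SVM (linear kernel)": SVC(kernel="linear"),
    "Decision tree": DecisionTreeClassifier(max_depth=3),
    "k-NN (k=5)": KNeighborsClassifier(n_neighbors=5),
    "Neural network": MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000),
}
for name, m in models.items():
    print(name, cross_val_score(m, X, y, cv=5).mean())   # cross-validated accuracy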

  16. Training accuracy vs. generalization

  17. Model complexity, training set size and generalization

  18. Similarity measures
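  The slide itself shows only the title; as an illustrative sketch (assuming SciPy, with made-up vectors), a few distance measures commonly used to compare feature vectors such as patient profiles:

import numpy as np
from scipy.spatial.distance import euclidean, cityblock, correlation

a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([2.0, 2.5, 2.0, 5.0])

print("Euclidean:  ", euclidean(a, b))
print("Manhattan:  ", cityblock(a, b))
print("Correlation:", correlation(a, b))   # 1 - Pearson correlation coefficient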

  19. k-nearest neighbors as a simple algorithm for classification
  • Given a training set of N objects with known class assignments and k < N, assign new objects (not included in the training set) to one of the classes based on the assignments of their k nearest neighbors
  • A simple, non-parametric method that works surprisingly well, especially for low-dimensional problems
  • Note, however, that the choice of the distance measure may again have a profound effect on the results
  • The optimal k is found by trial and error (see the sketch below)
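  A sketch of the trial-and-error search for k, assuming scikit-learn and simulated data (the candidate values of k and the use of cross-validation as the scoring protocol are choices made for the example):

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(3)
X = rng.normal(size=(120, 2))                              # simulated 2D feature vectors
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1.0).astype(int)        # simulated class labels

for k in (1, 3, 5, 7, 11):
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    print(f"k={k}: cross-validated accuracy {acc:.2f}")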

  20. k-nearest neighbor algorithm
  Step 1: Compute pairwise distances and take the k closest neighbors
  Step 2: Assign the class by simple majority voting: the new point is assigned to the class to which most of its k neighbors belong
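  A minimal from-scratch sketch of the two steps, assuming a Euclidean distance measure and a toy training set; the helper name knn_predict is introduced here for illustration only.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Step 1: compute distances to all training points and take the k closest neighbors.
    dists = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(dists)[:k]
    # Step 2: simple majority vote among the k neighbors' class labels.
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [4.0, 4.2], [3.8, 4.0]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))   # -> 0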
