1 / 29

Classification of Microarray Data

Classification of Microarray Data. The DNA Array Analysis Pipeline. Question Experimental Design. Array design Probe design. Sample Preparation Hybridization. Buy Chip/Array. Image analysis. Normalization. Expression Index Calculation. Comparable Gene Expression Data.

zmcneill
Download Presentation

Classification of Microarray Data

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Classification of Microarray Data

  2. The DNA Array Analysis Pipeline Question Experimental Design Array design Probe design Sample Preparation Hybridization Buy Chip/Array Image analysis Normalization Expression Index Calculation Comparable Gene Expression Data Statistical Analysis Fit to Model (time series) Advanced Data Analysis Clustering PCA Classification Promoter Analysis Meta analysis Survival analysis Regulatory Network

  3. Outline • Example: Childhood leukemia • Classification • Method selection • Feature selection • Cross-validation • Training and testing • Example continued: Results from a leukemia study

  4. Childhood Leukemia • Cancer in the cells of the immune system • Approx. 35 new cases in Denmark every year • 50 years ago – all patients died • Today – approx. 78% are cured • Risk groups: • Standard, Intermediate, High, Very high, Extra high • Treatment • Chemotherapy • Bone marrow transplantation • Radiation

  5. Prognostic Factors

  6. Risk Classification Today Patient: Clinical data Immuno-phenotype Morphology Genetic measurements Prognostic factors: Immunophenotype Age Leukocyte count Number of chromosomes Translocations Treatment response Risk group: Standard Intermediate High Very high Extra High Microarray technology

  7. Study of Childhood Leukemia • Diagnostic bone marrow samples from leukemia patients • Platform: Affymetrix Focus Array • 8793 human genes • Immunophenotype • 18 patients with precursor B immunophenotype • 17 patients with T immunophenotype • Outcome 5 years from diagnosis • 11 patients with relapse • 18 patients in complete remission

  8. Reduction of data • Dimension reduction • PCA • Feature selection (gene selection) • Significant genes: t-test • Selection of a limited number of genes

  9. Microarray Data

  10. Outcome: PCA on all Genes

  11. Outcome Prediction: CC against the Number of Genes

  12. Linear discriminant analysis • Assumptions: • Data e.g.. Gaussian distributed • Variance and covariance the same for all classes • Calculation of class probability

  13. Nearest Centroid • Calculation of a centroid for each class • Calculation of the distance between a test sample and each class centroid • Class prediction by the nearest centroid method

  14. K-Nearest Neighbor • Based on distance measure • For example Euclidian distance • Parameter k = number of nearest neighbors • k=1 • k=3 • k=... • Always an odd number • Prediction by majority vote

  15. Support Vector Machines • Machine learning • Relatively new and highly theoretic • Works on non-linearly separable data • Finding a hyperplane between the two classes by minimizing of the distance between the hyperplane and closest points

  16. Neural Networks Gene1 Gene2 Gene3 ... B T

  17. Comparison of Methods

  18. Cross-validation Given data with 100 samples Cross-5-validation: Training: 4/5 of data (80 samples) Testing: 1/5 of data (20 samples) - 5 different models and performance results Leave-one-out cross-validation (LOOCV) Training: 99/100 of data (99 samples) Testing: 1/100 of data (1 sample) - 100 different models

  19. Validation • Definition of • True and False positives • True and False negatives Actual class B B T T Predicted class B T T B TP FN TN FP

  20. Sensitivity and Specificity Sensitivity: the ability to detect "true positives" TP TP + FN Specificity: the ability to avoid "false positives" TN TN + FP Positive Predictive Value (PPV) TP TP + FP

  21. Accuracy • Definition: TP + TN TP + TN + FP + FN • Range: [0 … 1]

  22. Matthews correlation coefficient • Definition: TP·TN - FN·FP √(TN+FN)(TN+FP)(TP+FN)(TP+FP) • Range: [-1 … 1]

  23. Testing of classifier Independent test set Overview of Classification Expression data Subdivision of data for cross-validation into training sets and test sets Feature selection (t-test) Dimension reduction (PCA) • Training of classifier: • using cross-validation • choice of method • - choice of optimal parameters

  24. Important Points • Avoid over fitting • Validate performance • Test on an independent test set • Use cross-validation • Include feature selection in cross-validation Why? • To avoid overestimation of performance! • To make a general classifier

  25. Multiclass classification • Same prediction methods • Multiclass prediction often implemented • Often bad performance • Solution • Division in small binary (2-class) problems

  26. Study of Childhood Leukemia: Results • Classification of immunophenotype (precursorB and T) • 100% accuracy • During the training • When testing on an independent test set • Simple classification methods applied • K-nearest neighbor • Nearest centroid • Classification of outcome (relapse or remission) • 78% accuracy (CC = 0.59) • Simple and advanced classification methods applied

  27. Summary • When do we use classification? • Reduction of dimension often necessary • T-test, PCA • Methods • K nearest neighbor • Nearest centroid • Support vector machines • Neural networks • Validation • Cross-validation • Accuracy and correlation coefficient

  28. Coffee break Next: Exercise

  29. Risk classification in the future ? Patient: Clincal data Immunopheno-typing Morphology Genetic measurements Prognostic factors: Immunophenotype Age Leukocyte count Number of chromosomes Translocations Treatment response Risk group: Standard Intermediate High Very high Ekstra high Microarray technology Customized treatment

More Related