Diagnosis of multiple cancer types by shrunken centroids of gene expression. By Robert Tibshirani, Trevor Hastie, Balasubramanian Narasimhan, and Gilbert Chu. Course: 550.635 Topics in Bioinformatics. Presenter: Ting Yang. Teacher: Professor Geman.
Nearest Centroid Classification Example: small round blue cell tumors of childhood • 63 training samples, 25 testing samples • 4 classes: BL, EWS, NB, RMS • Figure 1 • Nearest centroid classification • Disadvantage
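The nearest centroid rule named on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's code: the function names `class_centroids` and `nearest_centroid`, the toy two-gene data, and the use of plain squared Euclidean distance are all assumptions made for the example.

```python
def class_centroids(samples, labels):
    """Return {label: centroid}, the per-gene mean of each class."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
            counts[y] = 0
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def nearest_centroid(x, centroids):
    """Assign x to the class whose centroid is closest (squared Euclidean distance)."""
    def sqdist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return min(centroids, key=lambda k: sqdist(x, centroids[k]))


# Toy data (made up for illustration): 4 samples, 2 genes, 2 classes.
train = [[1.0, 0.0], [1.2, 0.2], [-1.0, 0.1], [-0.8, -0.1]]
labels = ["EWS", "EWS", "BL", "BL"]

centroids = class_centroids(train, labels)
prediction = nearest_centroid([0.9, 0.3], centroids)
```

Every gene contributes to the distance in this rule, whether or not it helps separate the classes.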
Nearest Shrunken Centroids • A modification of the nearest centroid method • Idea: standardize each class centroid by the within-class standard deviation for each gene, then shrink it toward the overall centroid.
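Details: $d_{ik} = \dfrac{\bar{x}_{ik} - \bar{x}_i}{m_k \, (s_i + s_0)}$ • $\bar{x}_{ik}$: mean expression value in class k for gene i • $\bar{x}_i$: ith component of the overall centroid • $s_i$: pooled within-class standard deviation for gene i • $m_k$: a factor depending on the class size $n_k$, chosen so that $m_k \cdot s_i$ is the standard error of the numerator • $s_0$: a positive constant, the same for all genes, that guards against large $d_{ik}$ values from genes with low expression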
This statistic, $d_{ik}$, measures the standardized difference between the mean expression of gene i in class k and its mean over all classes combined. • Idea: a gene that discriminates one class from the rest will have a statistic of large absolute value.
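Shrink each $d_{ik}$ toward zero by soft thresholding: reduce its absolute value by a threshold $\Delta$ and set it to zero if the result would be negative. Genes whose statistic becomes zero in every class provide no class information and are eliminated. • ‘De-noising’ step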
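The standardize-then-shrink steps above can be sketched as follows. This is a hedged illustration under stated assumptions: the function names and toy data are invented for the example, and the specific choices $m_k = \sqrt{1/n_k + 1/n}$ and $s_0$ = median of the $s_i$ are assumptions of this sketch, since the slides do not state them.

```python
import math
from statistics import median


def soft_threshold(d, delta):
    """Move d toward zero by delta; anything within delta of zero becomes 0."""
    return math.copysign(max(abs(d) - delta, 0.0), d)


def shrunken_centroids(samples, labels, delta):
    """Return ({class: shrunken centroid}, overall centroid)."""
    n, p = len(samples), len(samples[0])
    classes = sorted(set(labels))
    overall = [sum(x[i] for x in samples) / n for i in range(p)]
    members = {k: [x for x, y in zip(samples, labels) if y == k] for k in classes}
    centroid = {k: [sum(x[i] for x in xs) / len(xs) for i in range(p)]
                for k, xs in members.items()}

    # Pooled within-class standard deviation s_i for each gene.
    s = []
    for i in range(p):
        ss = sum((x[i] - centroid[k][i]) ** 2 for k in classes for x in members[k])
        s.append(math.sqrt(ss / (n - len(classes))))
    s0 = median(s)  # assumption: s_0 taken as the median of the s_i

    shrunk = {}
    for k in classes:
        m_k = math.sqrt(1.0 / len(members[k]) + 1.0 / n)  # assumed form of m_k
        row = []
        for i in range(p):
            scale = m_k * (s[i] + s0)
            d = (centroid[k][i] - overall[i]) / scale        # standardized difference
            row.append(overall[i] + scale * soft_threshold(d, delta))
        shrunk[k] = row
    return shrunk, overall


# Toy data (made up): gene 0 separates the classes, gene 1 is noise.
samples = [[2.0, 0.1], [2.2, -0.1], [-2.0, 0.0], [-2.2, 0.2]]
labels = ["A", "A", "B", "B"]
shrunk, overall = shrunken_centroids(samples, labels, delta=1.0)
```

In this toy run the noise gene's shrunken centroid collapses onto the overall centroid in both classes, so it no longer distinguishes them, while the informative gene keeps a reduced but nonzero offset.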
Choosing the amount of shrinkage • The shrinkage amount is allowed to vary over a wide range. • 10-fold cross-validation (choose the amount with the smallest error rate) • Divide the samples at random into 10 parts of equal size, with the classes distributed proportionally among the parts. • Fit the model on 90% of the samples, then predict the class labels of the remaining 10% (the test samples). • Repeat for each of the 10 parts and add the errors together (overall error). • Figure 2 • Figure 1
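The cross-validation procedure above can be sketched generically. The names `stratified_folds`, `cv_errors`, and `toy_fit_predict` are hypothetical, and `toy_fit_predict` is only a one-feature stand-in classifier used to make the example runnable; it is not the shrunken centroid model from the paper.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=10, seed=0):
    """Split sample indices into n_folds parts, spreading each class evenly."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(n_folds)]
    pos = 0
    for y in sorted(by_class):
        idxs = by_class[y]
        rng.shuffle(idxs)
        for idx in idxs:
            folds[pos % n_folds].append(idx)
            pos += 1
    return folds


def cv_errors(samples, labels, deltas, fit_predict, n_folds=10, seed=0):
    """Return {delta: total number of misclassified held-out samples}."""
    folds = stratified_folds(labels, n_folds, seed)
    errors = {}
    for delta in deltas:
        wrong = 0
        for held_out in folds:
            held = set(held_out)
            train_idx = [j for j in range(len(samples)) if j not in held]
            train_x = [samples[j] for j in train_idx]
            train_y = [labels[j] for j in train_idx]
            test_x = [samples[j] for j in held_out]
            preds = fit_predict(train_x, train_y, test_x, delta)
            wrong += sum(p != labels[j] for p, j in zip(preds, held_out))
        errors[delta] = wrong
    return errors


def toy_fit_predict(train_x, train_y, test_x, delta):
    """Stand-in classifier: one feature, class means shrunk toward the overall mean."""
    overall = sum(train_x) / len(train_x)
    means = {}
    for k in sorted(set(train_y)):
        vals = [x for x, y in zip(train_x, train_y) if y == k]
        diff = sum(vals) / len(vals) - overall
        shrunk = max(abs(diff) - delta, 0.0) * (1 if diff > 0 else -1)
        means[k] = overall + shrunk
    return [min(means, key=lambda k: abs(x - means[k])) for x in test_x]


# Toy data (made up): 10 samples per class on a single feature.
samples = [1.0 + 0.01 * j for j in range(10)] + [-1.0 - 0.01 * j for j in range(10)]
labels = ["A"] * 10 + ["B"] * 10

errors = cv_errors(samples, labels, [0.0, 0.5, 5.0], toy_fit_predict)
best = min(errors, key=errors.get)  # smallest overall error; ties go to the first delta listed
```

With too much shrinkage the two class means coincide and the stand-in classifier can no longer separate the classes, which is the kind of error increase the cross-validation curve is meant to reveal.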
More Figures • Figure 3 • Figure 4
Classification • A new sample is classified by comparing its expression profile with each shrunken centroid over the 43 active genes. • Distance function: squared distance to each shrunken centroid, standardized gene by gene, with a correction term for the class prior probability.
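A minimal sketch of such a distance function, assuming the discriminant score takes the form $\delta_k(x^*) = \sum_i (x^*_i - \bar{x}'_{ik})^2 / (s_i + s_0)^2 - 2 \log \pi_k$, where $\bar{x}'_{ik}$ is the shrunken centroid and $\pi_k$ the class prior. The function names, the toy centroids, and the numeric values of `s`, `s0`, and the priors are made up for illustration.

```python
import math


def discriminant_scores(x, shrunken, s, s0, priors):
    """Score each class: standardized squared distance minus 2*log(prior)."""
    scores = {}
    for k, centroid in shrunken.items():
        dist = sum(((xi - ci) / (si + s0)) ** 2
                   for xi, ci, si in zip(x, centroid, s))
        scores[k] = dist - 2.0 * math.log(priors[k])
    return scores


def classify(x, shrunken, s, s0, priors):
    """Pick the class with the smallest discriminant score."""
    scores = discriminant_scores(x, shrunken, s, s0, priors)
    return min(scores, key=scores.get)


# Toy inputs (made up): two classes, two genes.
shrunken = {"EWS": [1.0, 0.0], "RMS": [-1.0, 0.0]}
s = [0.5, 0.5]   # pooled within-class standard deviations
s0 = 0.5         # positive constant added to every s_i

equal_priors = {"EWS": 0.5, "RMS": 0.5}
label = classify([0.8, 0.3], shrunken, s, s0, equal_priors)

# A sample equidistant from both centroids is decided by the prior alone.
skewed_priors = {"EWS": 0.2, "RMS": 0.8}
tie_label = classify([0.0, 0.0], shrunken, s, s0, skewed_priors)
```

A gene whose shrunken centroid is the same in every class adds the same amount to every score, so only the active genes affect which class wins.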
Statistical details: • t-statistic • Estimates of the class probabilities (Figure 5)
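One way to turn per-class discriminant scores into class probability estimates is sketched below, assuming the estimate has the form $\hat{p}_k = e^{-\delta_k/2} / \sum_l e^{-\delta_l/2}$. The function name and the example scores are hypothetical; subtracting the smallest score before exponentiating is a numerical-stability choice of this sketch and does not change the result.

```python
import math


def class_probabilities(scores):
    """Map {class: discriminant score} to {class: estimated probability}."""
    best = min(scores.values())
    # Shift by the smallest score so the largest weight is exp(0) = 1.
    weights = {k: math.exp(-0.5 * (d - best)) for k, d in scores.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


# Made-up scores: equal scores give equal probabilities.
even = class_probabilities({"BL": 2.0, "EWS": 2.0})

# A score gap of 2*log(3) corresponds to odds of 3 to 1.
uneven = class_probabilities({"BL": 0.0, "NB": 2.0 * math.log(3.0)})
```

A sample whose largest estimated probability is only slightly above the others is one the classifier is unsure about, even if it is assigned a label.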