This article explores the use of gene selection for Support Vector Machines (SVMs) in microarray technology and cancer genetics research.
Gene (Feature) Selection for Support Vector Machines
Sayan Mukherjee
Whitehead Institute Center for Genome Research; Center for Biological and Computational Learning, Massachusetts Institute of Technology
Outline • Microarray technology • Learning from examples paradigm • Regularization and Support Vector Machines • Gene selection for SVMs • How many genes?
The point of microarray technology • Proteins: state of cell • Gene: codes for a protein • mRNA: helps assemble a protein • #mRNA ~ #proteins • #mRNA: gene expression • Microarray technology: measure expression of thousands of genes at once. • Typical experiment: measure expression of genes under different conditions and ask what is different at a molecular level and why.
Microarray technology [Figure: sample preparation for oligonucleotide and cDNA arrays. Labels: biological sample, RNA, test sample, reference, PE, Cy3, Cy5, oligonucleotide synthesis, cDNA clone (library), PCR product, array] Ramaswamy and Golub, JCO
Microarray technology [Figure: two panels, Oligonucleotide and cDNA] Lockhart and Winzler 2000
Challenge • When the science is not well understood, resort to statistics: infer cancer genetics by analyzing microarray data from tumors. • Ultimate goal: discover the genetic pathways of cancers. • Immediate goal: models that discriminate tumor types or treatment outcomes, and determine the genes used in the model. • Basic difficulty: few examples (20-100), high dimensionality (7,000-16,000 genes measured for each sample), ill-posed problem. • Damnation of dimensionality: far too few examples for so many dimensions to predict accurately.
Learning from examples paradigm [Figure: flow diagram. Labels: Examples, Statistical Algorithm, Prediction Fcn., Relevant genes, New sample, Prediction]
Statistical learning theory • Nonasymptotic theory: a theory based on finite samples. • Gives the tradeoff between complexity of the model and amount of data. • Need to control model complexity; a standard approach is regularization. • Algorithms that control complexity perform well with few examples and high dimensional data. Support Vector Machine (SVM) classification is one regularization algorithm.
(Tikhonov) Regularization • Find the function that minimizes: empirical error + tradeoff × complexity • Solution exists • Solution is unique • Stable (strongly) to perturbation of training data • Generalizes well
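The functional on this slide did not survive extraction; only its three labels (empirical error, tradeoff, complexity) remain. A standard way to write Tikhonov regularization that matches those labels is sketched below. The notation is an assumption, not taken from the slide: V is a loss function, n the number of examples, λ the tradeoff parameter, and ‖f‖_K the norm of f in the hypothesis space H.

```latex
\min_{f \in \mathcal{H}} \;
\underbrace{\frac{1}{n} \sum_{i=1}^{n} V\bigl(f(x_i), y_i\bigr)}_{\text{empirical error}}
\; + \; \underbrace{\lambda}_{\text{tradeoff}} \;
\underbrace{\|f\|_{K}^{2}}_{\text{complexity}}
```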
Binary classification [Figure: scatter plot of Class 1 and Class 2 on axes x1 and x2]
SVMs are a form of regularization: empirical error + tradeoff × complexity
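In the regularization view, SVM classification is usually written with the hinge loss as the empirical error term. The form below is a sketch in assumed notation (labels y_i in {-1, +1}, tradeoff parameter λ, norm ‖f‖_K), since the slide's own formula was not preserved.

```latex
\min_{f \in \mathcal{H}} \;
\frac{1}{n} \sum_{i=1}^{n} \bigl(1 - y_i f(x_i)\bigr)_{+} + \lambda \|f\|_{K}^{2},
\qquad (a)_{+} = \max(a, 0), \quad y_i \in \{-1, +1\}
```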
Nonlinear decision boundaries • Map data to a higher dimensional space, the feature space • Construct a linear classifier in this space, which can be written in terms of a kernel function without computing the feature space explicitly
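The expressions on this slide were lost in extraction. The standard kernel form of a linear classifier in feature space is sketched here; the symbols Φ (feature map), α_i (dual coefficients), b (offset) and K (kernel) are assumed notation.

```latex
f(x) = w \cdot \Phi(x) + b
     = \sum_{i=1}^{n} \alpha_i y_i \, K(x_i, x) + b,
\qquad w = \sum_{i=1}^{n} \alpha_i y_i \, \Phi(x_i),
\qquad K(x, z) = \Phi(x) \cdot \Phi(z)
```

The decision boundary f(x) = 0 is linear in Φ(x) but generally nonlinear in x.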
Gene expression and coregulation • Two gene example: two genes measuring Sonic Hedgehog and TrkC. • Coregulation: the expression of two genes must be correlated for a protein to be made, so we need to look at pairwise correlations as well as individual expression. • Size of feature space: if there are 7,000 genes, the feature space has about 24 million features, so the fact that the feature space is never computed is important.
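The "about 24 million" figure can be checked by counting the distinct pairwise products of 7,000 genes. The sketch below also illustrates why the feature space never has to be computed: for a homogeneous degree-2 polynomial kernel (an assumption here; the slide does not name the kernel), the kernel value equals the inner product of the explicit pairwise feature maps. All function names are illustrative.

```python
from itertools import combinations_with_replacement
from math import comb


def n_pairwise_features(n_genes: int) -> int:
    """Number of distinct products x_i * x_j with i <= j (squares included)."""
    return comb(n_genes + 1, 2)


def explicit_map(x):
    """Explicit degree-2 feature map. Cross terms are scaled by sqrt(2)
    so that dot(explicit_map(x), explicit_map(z)) == dot(x, z) ** 2."""
    feats = []
    for i, j in combinations_with_replacement(range(len(x)), 2):
        scale = 1.0 if i == j else 2 ** 0.5
        feats.append(scale * x[i] * x[j])
    return feats


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


# Feature count for 7,000 genes: 7000 * 7001 / 2
print(n_pairwise_features(7000))  # 24503500

# Kernel trick on a 3-gene toy example: same value, no explicit feature space.
x = [1.0, 2.0, 3.0]
z = [0.5, -1.0, 2.0]
kernel_value = dot(x, z) ** 2
explicit_value = dot(explicit_map(x), explicit_map(z))
print(kernel_value, explicit_value)
```

The kernel evaluation costs one dot product over the genes, while the explicit map grows quadratically with the number of genes.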
Cancer classification • 38 examples of Myeloid and Lymphoblastic leukemias (Golub et al, 1999) • Affymetrix human 6800 (7128 genes including control genes) • 34 examples to test classifier [Figure: test data plotted by d, the distance from the hyperplane]
Gene coregulation? Nonlinear SVM helps when the most informative genes are removed, with informativeness ranked using signal to noise (Golub et al).

Errors by polynomial order:

Genes removed   1st order   2nd order   3rd order
0               1           1           1
10              2           1           1
20              3           2           1
30              3           3           2
40              3           3           2
50              3           2           2
100             3           3           2
200             3           3           3
1500            7           7           8
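The signal-to-noise ranking of Golub et al scores each gene by the difference of the class means divided by the sum of the class standard deviations. A minimal sketch follows; the toy data, the function names, and the use of the sample standard deviation are assumptions made for illustration.

```python
from statistics import mean, stdev


def signal_to_noise(class1, class2):
    """(mu1 - mu2) / (sigma1 + sigma2) for one gene.
    Assumes each class has at least two samples and nonzero spread."""
    return (mean(class1) - mean(class2)) / (stdev(class1) + stdev(class2))


def rank_genes(expr, labels):
    """expr: list of samples, each a list of gene values; labels: +1 or -1.
    Returns gene indices sorted by |signal to noise| (largest first) and the scores."""
    n_genes = len(expr[0])
    scores = []
    for g in range(n_genes):
        pos = [row[g] for row, y in zip(expr, labels) if y == 1]
        neg = [row[g] for row, y in zip(expr, labels) if y == -1]
        scores.append(signal_to_noise(pos, neg))
    order = sorted(range(n_genes), key=lambda g: abs(scores[g]), reverse=True)
    return order, scores


# Toy data: 6 samples, 3 genes; only gene 0 separates the classes well.
expr = [
    [5.0, 1.0, 2.0],
    [6.0, 1.2, 1.0],
    [5.5, 0.9, 3.0],
    [1.0, 1.1, 2.5],
    [1.5, 1.0, 1.5],
    [0.5, 1.3, 2.0],
]
labels = [1, 1, 1, -1, -1, -1]
order, scores = rank_genes(expr, labels)
print(order, scores)
```

Removing the top-ranked genes, as in the table above, amounts to dropping the first entries of `order` before training.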
Gene selection? • SVMs as stated use all genes/features. • Molecular biologists and oncologists seem to be convinced that only a small subset of genes is responsible for particular biological properties, so they want the genes most important in discriminating. • Practical reasons: a clinical device with thousands of genes is not financially practical. • Possible performance improvement. • Wrapper method for gene/feature selection.
Results with gene selection • AML vs ALL: 40 genes, 34/34 correct, 0 rejects. • 5 genes, 31/31 correct, 3 rejects of which 1 is an error. [Figure: test data plotted by d, the distance from the hyperplane]
Two feature selection algorithms • Recursive feature elimination (RFE): based upon perturbation analysis, eliminate the genes that perturb the margin the least. • Optimize leave-one-out (LOO): based upon optimization of the leave-one-out error of an SVM; the leave-one-out error is an almost unbiased estimate of the expected test error.
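Recursive feature elimination can be sketched in a few lines for a linear classifier: train, drop the feature with the smallest squared weight (a common criterion for a linear SVM, since removing it changes the margin the least), and repeat. The trainer below is a plain batch subgradient method on the regularized hinge loss, used only as a stand-in for a real SVM solver; the function names, hyperparameters and toy data are all assumptions.

```python
import random


def train_linear_svm(X, y, lam=0.01, epochs=200, lr=0.1):
    """Batch subgradient descent on lam/2 * ||w||^2 + mean hinge loss.
    A stand-in for a proper SVM solver, adequate for small toy data."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [lam * wj for wj in w], 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1.0:  # only margin violators contribute to the subgradient
                for j in range(d):
                    gw[j] -= yi * xi[j] / n
                gb -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, gw)]
        b -= lr * gb
    return w, b


def rfe(X, y, n_keep):
    """Remove one feature per round: the one with the smallest w_j ** 2."""
    active = list(range(len(X[0])))
    while len(active) > n_keep:
        Xa = [[row[j] for j in active] for row in X]
        w, _ = train_linear_svm(Xa, y)
        worst = min(range(len(active)), key=lambda k: w[k] ** 2)
        del active[worst]
    return active


# Toy data: features 0 and 1 carry the class signal, features 2..5 are noise.
rng = random.Random(0)
y = [1 if i % 2 == 0 else -1 for i in range(40)]
X = [
    [2.0 * yi + rng.gauss(0, 0.5), 1.0 * yi + rng.gauss(0, 0.5)]
    + [rng.gauss(0, 1) for _ in range(4)]
    for yi in y
]
selected = rfe(X, y, n_keep=2)
print(selected)
```

With thousands of genes it is usual to remove a fraction of the features per round rather than one at a time, which this sketch does not do.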
Basic idea • Use the leave-one-out (LOO) criterion, or an upper bound on LOO, to select features by searching over all possible subsets of n features for the one that minimizes the criterion. • When such a search is impossible because of too many possibilities, scale each feature by a real valued variable and compute this scaling via gradient descent on the leave-one-out bound. One can then keep the features corresponding to the largest scaling variables.
Pictorial illustration: rescale features to minimize the LOO bound R^2/M^2. [Figure: two panels on axes x1 and x2 showing radius R and margin M, one with R^2/M^2 > 1 and one with M = R, so R^2/M^2 = 1]
Three upper bounds on LOO • Radius margin bound: simple to compute, continuous, very loose but often tracks LOO well. • Jaakkola Haussler bound: somewhat tighter, simple to compute, discontinuous so need to smooth, valid only for SVMs with no b term. • Span bound: tight as a Britney Spears outfit, complicated to compute, discontinuous so need to smooth.
Classification function with gene selection • We add a scaling parameter s to the SVM, which scales the genes; genes corresponding to small s_j are removed. • The SVM function is then evaluated on the scaled genes.
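The formula for the scaled classifier was not preserved on this slide. One common way to write an SVM with a per-gene scaling vector s is sketched below; the notation (elementwise product *, dual coefficients α_i, offset b, kernel K) is assumed.

```latex
f(x) = \sum_{i=1}^{n} \alpha_i y_i \, K(s * x_i, \; s * x) + b,
\qquad (s * x)_j = s_j x_j
```

Setting s_j = 0 removes gene j from every kernel evaluation, which is why genes with small s_j can be dropped.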
Toy data • Linear problem with 6 relevant dimensions of 202 • Nonlinear problem with 2 relevant dimensions of 52 [Figure: two panels plotting error rate against number of samples]
Molecular classification of cancer • Hierarchy of difficulty: • Histological differences: normal vs. malignant, skin vs. brain • Morphologies: different leukemia types, ALL vs. AML • Lineage: B-cell vs. T-cell, follicular vs. large B-cell lymphoma • Outcome: treatment outcome, relapse, or drug sensitivity.
Outcome classification • Error rates ignore temporal information such as when a patient dies. • Survival analysis takes temporal information into account. • The Kaplan-Meier survival plots and statistics for the predictions show significance. [Figure: Kaplan-Meier survival plots, panels labeled Lymphoma and Medulloblastoma, annotated p-val = 0.00039 and p-val = 0.0015]
Number of genes selected • A standard question is how many genes should be used in the classification function. • Another way to phrase this question: are my results with n1, n2, n3, n4 genes significant? • This can be addressed by a statistical test based upon permutations: • Run the algorithm with ni genes and note the error rate Ei • Construct many datasets by permuting the labels of the dataset and measure the error rates for these datasets, Ei1,…,Eil • Construct an empirical cumulative distribution function (cdf) from the error rates Ei1,…,Eil • Compute the cdf value for Ei; this gives us a significance value • So we have an error rate for a model with ni genes and a significance value for this model.
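The permutation procedure above can be sketched as follows. The classifier is a leave-one-out nearest-centroid rule used only as a lightweight stand-in for the SVM with ni genes; the function names, the number of permutations and the toy data are assumptions.

```python
import random


def centroid(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]


def sq_dist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))


def loo_error(X, y):
    """Leave-one-out error of a nearest-centroid classifier (stand-in for the SVM)."""
    errors = 0
    for i in range(len(X)):
        cents = {}
        for lab in sorted(set(y)):
            rows = [x for k, (x, l) in enumerate(zip(X, y)) if k != i and l == lab]
            cents[lab] = centroid(rows)
        pred = min(cents, key=lambda lab: sq_dist(X[i], cents[lab]))
        errors += pred != y[i]
    return errors / len(X)


def permutation_p_value(X, y, error_fn, n_perm=200, seed=0):
    """Empirical cdf of the permuted-label error rates, evaluated at the observed error."""
    rng = random.Random(seed)
    observed = error_fn(X, y)
    null_errors = []
    for _ in range(n_perm):
        y_perm = y[:]
        rng.shuffle(y_perm)  # break the link between samples and labels
        null_errors.append(error_fn(X, y_perm))
    p = sum(e <= observed for e in null_errors) / n_perm
    return observed, p


# Toy data: two well separated groups of 10 samples with 2 "genes" each.
pos = [[2.0 + 0.1 * i, 2.0 - 0.05 * i] for i in range(10)]
neg = [[-2.0 - 0.1 * i, -2.0 + 0.05 * i] for i in range(10)]
X = pos + neg
y = [1] * 10 + [-1] * 10
observed, p = permutation_p_value(X, y, loo_error)
print(observed, p)
```

Repeating this for each candidate ni gives one (error rate, significance value) pair per gene count, which is the comparison the slide describes.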
Distribution function and p-value P(.1) = 0
Work in progress • Put survival analysis in a classification framework • Look at Pnorm algorithms for sparser classifiers • Extract subtaxonomies with independent gene sets using item sets and combinations of classifiers • Estimate error rate and statistical significance as a function of sample size: err(n), p(n)
Collaborators • Center for Biological and Computational Learning: R. Rifkin, G. Yeo, A. Rakhlin and T. Poggio • Whitehead Institute: S. Ramaswamy, P. Tamayo, J. Mesirov, and T. Golub • AT&T/Biowulf: O. Chapelle, J. Weston, O. Bousquet and V. Vapnik