Supervised gene expression data analysis using SVMs and MLPs. e-mail: valenti@disi.unige.it. Giorgio Valentini. Outline. A real problem: Lymphoma gene expression data analysis by machine learning methods: Diagnosis of tumors using a supervised approach
Supervised gene expression data analysis using SVMs and MLPs e-mail: valenti@disi.unige.it Giorgio Valentini
Outline
• A real problem: lymphoma gene expression data analysis by machine learning methods:
  - Diagnosis of tumors using a supervised approach
  - Discovering groups of genes related to carcinogenic processes
  - Discovering subgroups of diseases using gene expression data
DNA microarray
• DNA hybridization microarrays supply information about gene expression by measuring the mRNA levels of a large number of genes in a cell.
• They offer a snapshot of the overall functional status of a cell: virtually all differences in cell type or state are related to changes in the mRNA levels of many genes.
• DNA microarrays have been used in mutational analyses, genetic mapping studies, genome-wide monitoring of gene expression, pharmacogenomics and metabolic pathway analysis.
A DNA microarray image (E. coli)
• Each spot corresponds to the expression level of a particular gene
• Red spots correspond to over-expressed genes
• Green spots correspond to under-expressed genes
• Yellow spots correspond to intermediate levels of gene expression
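The colour coding above can be sketched as a threshold on a log-ratio. This is only an illustration under assumptions the slide does not state: a two-channel array where the red channel measures the sample and the green channel a reference, and a hypothetical cut-off of one log2 unit (a two-fold change).

```python
import math

def spot_colour(red_intensity, green_intensity, threshold=1.0):
    """Label a spot from its log2(red/green) ratio.

    `threshold` is a hypothetical cut-off (1.0 = two-fold change);
    the slides do not specify how colours map to numeric levels.
    """
    log_ratio = math.log2(red_intensity / green_intensity)
    if log_ratio >= threshold:
        return "red"      # over-expressed relative to the reference
    if log_ratio <= -threshold:
        return "green"    # under-expressed relative to the reference
    return "yellow"       # intermediate expression level

print(spot_colour(4000, 500))   # log2(8) = 3, prints "red"
```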
Analyzing microarray data by machine learning methods
The large amount of gene expression data requires machine learning methods to analyze DNA microarray data and extract significant knowledge from it.
Unsupervised approach
• No or limited a priori knowledge.
• Clustering algorithms are used to group together similar expression patterns:
  - grouping sets of genes
  - grouping different cells or different functional states of the cell
• Examples: hierarchical clustering, fuzzy or possibilistic clustering, self-organizing maps.
Supervised approach
• “A priori” biological and medical knowledge on the problem domain.
• Learning algorithms with labeled examples are used to associate gene expression data with classes:
  - separating normal from cancerous tissues
  - classifying different classes of cells on a functional basis
  - predicting the functional class of unknown genes
• Examples: multi-layer perceptrons, support vector machines, decision trees, ensembles of classifiers.
A real problem: a gene expression analysis of lymphoma
Biological problems
1. Separating cancerous and normal tissues using the overall information available.
2. Identifying groups of genes specifically related to the expression of two different tumour phenotypes through expression signatures.
Machine learning methods
1. Classifiers trained on all the genes:
  - Support Vector Machines (SVM): linear, RBF and polynomial kernels
  - Multi-Layer Perceptron (MLP)
  - Linear Perceptron (LP)
2. Two-step method:
  - A priori knowledge and unsupervised methods select “candidate” subgroups
  - SVM or MLP identify the most correlated subgroups
The data
• Data from a specialized DNA microarray, named "Lymphochip", developed at the Stanford University School of Medicine.
• 4026 different genes preferentially expressed in lymphoid cells or with known roles in processes important in immunology or cancer: high-dimensional data.
• 96 tissue samples from normal and cancerous populations of human lymphocytes: small sample size.
• A challenging machine learning problem.
Types of lymphoma
• Three main classes of lymphoma:
  - Diffuse Large B-Cell Lymphoma (DLBCL)
  - Follicular Lymphoma (FL)
  - Chronic Lymphocytic Leukemia (CLL)
• The data set also includes Transformed Cell Lines (TCL) and normal lymphoid tissues.

| Type of tissue | Number of samples |
|---|---|
| Normal lymphoid cells | 24 |
| DLBCL | 46 |
| FL | 9 |
| CLL | 11 |
| TCL | 6 |
The first problem: separating normal from cancerous tissues
Our first task consists in distinguishing cancerous from normal tissues using the overall information available, i.e. all the gene expression data. From a machine learning standpoint it is a dichotomic (two-class) problem.
• Data characteristics:
  - Small sample size
  - High dimension
  - Missing values
  - Noise
• Main applicative goal:
  - Supporting functional-molecular diagnosis of tumors and polygenic diseases
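As a small worked example, the five tissue types can be collapsed into the two-class target of this first problem. The sample counts are the ones listed on the "Types of lymphoma" slide; the +1/-1 label encoding is an assumption, since the slides do not fix one.

```python
# Sample counts per tissue type, as listed on the "Types of lymphoma" slide.
SAMPLE_COUNTS = {"Normal": 24, "DLBCL": 46, "FL": 9, "CLL": 11, "TCL": 6}

def to_binary_label(tissue_type):
    """Map a tissue type to the dichotomic target: +1 cancerous, -1 normal.

    The +1/-1 encoding is an assumption made for this sketch.
    """
    return -1 if tissue_type == "Normal" else 1

# One label per sample, expanded from the per-type counts.
labels = [to_binary_label(t) for t, n in SAMPLE_COUNTS.items() for _ in range(n)]

n_cancerous = labels.count(1)
n_normal = labels.count(-1)

# Error (%) of a trivial classifier that always answers "cancerous":
# a useful baseline when reading error rates on an unbalanced data set.
baseline_error = 100.0 * n_normal / len(labels)

print(n_cancerous, n_normal, baseline_error)  # 72 24 25.0
```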
Supervised approaches to molecular classification of diseases
Several supervised methods have been applied to the analysis of cDNA microarrays and high-density oligonucleotide chips:
• Decision trees
• Fisher linear discriminant
• Multi-Layer Perceptrons
• Nearest-Neighbours classifiers
• Linear discriminant analysis
• Parzen windows
• Support Vector Machines
Proposed by different authors: Golub et al. (1999), Pavlidis et al. (2001), Khan et al. (2001), Furey et al. (2000), Ramaswamy et al. (2001), Yeang et al. (2001), Dudoit et al. (2002).
Why use Support Vector Machines?
• “General” motivations
  - SVMs are two-class classifiers theoretically founded on Vapnik's Statistical Learning Theory.
  - They act as linear classifiers in a high-dimensional feature space obtained by a projection of the original input space.
  - The resulting classifier is in general non-linear in the input space.
  - SVMs achieve good generalization performance by maximizing the margin between the classes.
  - The SVM learning algorithm has no local minima.
• “Specific” motivations
  - Kernels are well suited to working with high-dimensional data.
  - Small sample sizes require algorithms with good generalization capabilities.
  - Automatic diagnosis of tumors requires high sensitivity and very effective classifiers.
  - SVMs can identify mislabeled data (i.e. incorrect diagnoses).
  - We could design specific kernels to incorporate “a priori” knowledge about the problem.
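To make the "linear in feature space, non-linear in input space" point concrete, here is a minimal sketch of the three standard kernels and of the SVM decision function f(x) = sum_i alpha_i y_i K(s_i, x) + b. The kernel parameterizations (`degree`, `coef0`, `sigma`) are common textbook forms and are assumptions; the slides do not give formulas. The support vectors, multipliers and bias in the usage lines are toy values, not a trained model.

```python
import math

def dot_kernel(x, z):
    """Dot-product (linear) kernel."""
    return sum(a * b for a, b in zip(x, z))

def polynomial_kernel(x, z, degree=2, coef0=1.0):
    """Polynomial kernel (x.z + coef0)^degree; parameter names are assumptions."""
    return (dot_kernel(x, z) + coef0) ** degree

def gaussian_kernel(x, z, sigma=1.0):
    """Gaussian (RBF) kernel exp(-||x - z||^2 / (2 sigma^2))."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))

def svm_decision(x, support_vectors, alphas, labels, bias, kernel):
    """Evaluate f(x); the predicted class is the sign of the returned value."""
    return bias + sum(
        alpha * y * kernel(sv, x)
        for sv, alpha, y in zip(support_vectors, alphas, labels)
    )

# Toy usage with hand-picked (not trained) values.
toy_svs = [[1.0, 0.0], [-1.0, 0.0]]
toy_alphas = [0.5, 0.5]
toy_labels = [1, -1]
print(svm_decision([2.0, 0.0], toy_svs, toy_alphas, toy_labels, 0.0, dot_kernel))  # 2.0
```

Swapping `dot_kernel` for `gaussian_kernel` or `polynomial_kernel` changes the shape of the decision boundary in the input space without changing the decision function itself.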
SVM to classify cancerous and normal cells
• We consider 3 standard SVM kernels:
  - Gaussian
  - Polynomial
  - Dot-product
• Varying:
  - Values of the kernel parameters
  - The regularization factor C
• Estimation of the generalization error through:
  - 10-fold cross-validation
  - leave-one-out
• Comparing them with:
  - MLP
  - LP
• Varying:
  - Number of hidden units
  - Backpropagation parameters
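The two error-estimation schemes differ only in how the 96 samples are split, which the following sketch makes explicit. It is a simplified illustration: folds are contiguous and unshuffled, whereas a real experiment would normally shuffle or stratify; `count_test_errors` is a hypothetical callback standing in for "train a classifier on the training indices and count its mistakes on the test indices".

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds.

    With k == n_samples this is exactly leave-one-out.
    """
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

def cross_validated_error(n_samples, k, count_test_errors):
    """Pooled error (%) over all folds.

    `count_test_errors(train, test)` is a hypothetical callback that trains
    on `train` and returns the number of misclassified samples in `test`.
    """
    total_errors = sum(
        count_test_errors(train, test) for train, test in k_fold_indices(n_samples, k)
    )
    return 100.0 * total_errors / n_samples

# 96 samples: 10-fold CV gives six folds of 10 and four folds of 9.
print([len(test) for _, test in k_fold_indices(96, 10)])
```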
Results

| Learning machine model | Gen. error (%) | St. dev. | Prec. (%) | Sens. (%) |
|---|---|---|---|---|
| SVM-linear | 1.04 | 3.16 | 98.63 | 100.0 |
| SVM-poly | 4.17 | 5.46 | 94.74 | 100.0 |
| SVM-RBF | 25.00 | 4.48 | 75.00 | 100.0 |
| MLP | 2.08 | 4.45 | 98.61 | 98.61 |
| LP | 9.38 | 10.24 | 95.65 | 91.66 |

• 10-fold cross-validation and leave-one-out give approximately the same error estimates.
• SVM-linear achieves the best results.
• SVMs achieve high sensitivity, no matter what type of kernel function is used.
• The radial basis SVM shows a high misclassification rate and a high estimated VC dimension.
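The error, precision and sensitivity columns are all functions of a confusion matrix, which the following sketch recomputes. The confusion counts are not given in the slides: they are inferred here as the integer counts that reproduce the reported percentages, assuming 72 cancerous (positive) and 24 normal (negative) samples and errors pooled over all 96 samples. Under that assumption, the SVM-RBF row corresponds to a classifier that labels every sample as cancerous.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return error, precision and sensitivity as percentages."""
    total = tp + fp + fn + tn
    return {
        "error": 100.0 * (fp + fn) / total,
        "precision": 100.0 * tp / (tp + fp),
        "sensitivity": 100.0 * tp / (tp + fn),
    }

# (TP, FP, FN, TN) per model. These counts are INFERRED from the reported
# percentages, not taken from the slides; positive class = cancerous.
INFERRED_CONFUSION = {
    "SVM-linear": (72, 1, 0, 23),
    "SVM-poly": (72, 4, 0, 20),
    "SVM-RBF": (72, 24, 0, 0),   # every sample predicted as cancerous
    "MLP": (71, 1, 1, 23),
    "LP": (66, 3, 6, 21),
}

for model, counts in INFERRED_CONFUSION.items():
    m = classification_metrics(*counts)
    print(f"{model:10s} error={m['error']:.2f} "
          f"prec={m['precision']:.2f} sens={m['sensitivity']:.2f}")
```

The recomputed values agree with the table to within 0.01; the only visible difference is the LP sensitivity (66/72 = 91.67 when rounded), which suggests the reported 91.66 was truncated rather than rounded.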
ROC analysis
• The ROC curve of the SVM-linear is ideal.
• The polynomial SVM also achieves a reasonably good ROC curve.
• The SVM-RBF shows a diagonal ROC curve: the highest sensitivity is achieved only when it completely fails to detect normal cells.
• The ROC curve of the MLP is also nearly optimal.
• The linear perceptron shows a worse ROC curve, but with reasonable values lying in the highest and leftmost part of the ROC plane.
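For readers unfamiliar with ROC curves, the sketch below shows how the points of a curve are obtained by sweeping a decision threshold over classifier scores, and how the "ideal" and "diagonal" cases differ. The scores are toy values chosen for illustration, not outputs of the classifiers discussed here; labels use 1 for cancerous and 0 for normal, which is an assumed encoding.

```python
def roc_points(scores, labels):
    """Return the (false positive rate, true positive rate) points of a ROC curve.

    Samples are ranked by decreasing score; a point is emitted after each
    group of tied scores, so ties are treated as a single threshold.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        is_last = i == len(ranked) - 1
        if is_last or ranked[i + 1][0] != score:
            points.append((fp / n_neg, tp / n_pos))
    return points

def area_under_curve(points):
    """Trapezoidal area under a list of ROC points."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )

toy_labels = [1, 1, 0, 0]
ideal = roc_points([0.9, 0.8, 0.3, 0.1], toy_labels)      # perfectly separated scores
diagonal = roc_points([0.5, 0.5, 0.5, 0.5], toy_labels)   # uninformative constant score
print(area_under_curve(ideal), area_under_curve(diagonal))  # 1.0 0.5
```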
Summary of the results on the first problem
• Using hierarchical clustering, 14.6% of the examples are misclassified (Alizadeh, 2000), against 1.04% for the linear SVM, 2.08% for the MLP and 9.38% for the LP.
• Supervised methods exploit a priori biological knowledge (i.e. labeled data), while clustering methods use only gene expression data to group together different tissues, without any labeled data.
• The linear SVM achieves the best results, but the MLP and the 2nd-degree polynomial SVM also show a relatively low generalization error.
• Linear SVM and MLP can be used to build classifiers with high sensitivity and a low rate of false positives.
• These results must be considered with caution, because the available data set is too small to infer general statements about the performance of the proposed learning machines.
The second problem: identifying DLBCL subgroups
It starts from a hypothesis of Alizadeh et al. about the existence of two distinct functional types of lymphoma inside DLBCL. Actually, we consider two problems:
1. Validation of Alizadeh’s hypothesis
• They identified two molecularly distinct subgroups of DLBCL: germinal centre B-like (GCB-like) and activated B-like (AB-like).
• These two classes correspond to patients with very different prognoses.
2. Finding the groups of genes most related to this separation
• Different subsets of genes could be responsible for the distinction between these two DLBCL subgroups: the expression signatures Proliferation, T-cell, Lymphnode and GCB (Lossos, 2000).
A feature selection approach based on “a priori” knowledge
Finding the most correlated genes requires searching an exponential number of gene combinations (2^n - 1 non-empty subsets), where n is usually of the order of thousands. We need greedy algorithms and heuristic methods. Can we exploit “a priori” biological knowledge about the problem?
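A quick calculation shows why exhaustive search is out of the question. The sketch checks the 2^n - 1 formula by brute force on a tiny example and then evaluates it for n = 4026, the number of genes on the Lymphochip; the four-gene list is a placeholder.

```python
from itertools import combinations

def count_nonempty_subsets(items):
    """Count non-empty subsets by explicit enumeration (tiny inputs only)."""
    return sum(1 for size in range(1, len(items) + 1)
               for _ in combinations(items, size))

# Brute force agrees with the closed form 2^n - 1 on a small example.
tiny_gene_list = ["g1", "g2", "g3", "g4"]   # placeholder gene names
assert count_nonempty_subsets(tiny_gene_list) == 2 ** 4 - 1

# For the Lymphochip, the count is only tractable via the closed form.
n_genes = 4026
n_subsets = 2 ** n_genes - 1
print(len(str(n_subsets)))   # number of decimal digits: 1212
```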
A heuristic method (1)
• A two-stage approach:
  I. Select groups of coordinately expressed genes.
  II. Identify among them the ones most correlated to the disease.
• We do not consider single genes.
• We consider only groups of coordinately expressed genes.
A heuristic method (2)
• I. Selecting groups of coordinately expressed genes:
  - Use “a priori” biological and medical knowledge about groups of genes with known or suspected roles in carcinogenic processes
  - And/or
  - Use unsupervised methods such as clustering algorithms to identify coordinately expressed sets of genes
• II. Identifying the subgroups of genes most related to the disease:
  - Train a set of classifiers using only the subgroups of genes selected in the first stage.
  - Evaluate and rank the performance of the trained classifiers.
  - Select the subgroups whose classifiers achieve the best ranking.
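Stage II of the method can be sketched as follows: train one classifier per candidate subgroup, estimate its error, and rank the subgroups. This is a minimal illustration under stated assumptions, not the implementation used in this work: a nearest-centroid rule stands in for the SVM/MLP/LP classifiers, the error is estimated by leave-one-out, and the expression values, gene indices and subgroup names are invented toy data.

```python
def nearest_centroid_predict(train_rows, train_labels, row):
    """Predict the label whose class centroid is closest to `row`.

    Stand-in classifier for this sketch; the slides use SVM, MLP and LP.
    """
    centroids = {}
    for label in set(train_labels):
        members = [r for r, y in zip(train_rows, train_labels) if y == label]
        centroids[label] = [sum(col) / len(members) for col in zip(*members)]

    def squared_distance(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    return min(centroids, key=lambda lab: squared_distance(centroids[lab], row))

def leave_one_out_error(samples, labels, gene_indices):
    """Leave-one-out error (%) using only the genes in `gene_indices`."""
    restricted = [[s[g] for g in gene_indices] for s in samples]
    errors = 0
    for i, held_out in enumerate(restricted):
        train_rows = restricted[:i] + restricted[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        if nearest_centroid_predict(train_rows, train_labels, held_out) != labels[i]:
            errors += 1
    return 100.0 * errors / len(samples)

def rank_subgroups(samples, labels, subgroups):
    """Return (subgroup name, error %) pairs sorted from best to worst."""
    errors = {name: leave_one_out_error(samples, labels, idx)
              for name, idx in subgroups.items()}
    return sorted(errors.items(), key=lambda item: item[1])

# Toy data: 6 samples x 4 genes. Genes 0-1 separate the classes, genes 2-3 do not.
TOY_SAMPLES = [
    [5.0, 4.8, 0.1, 0.9], [5.2, 5.1, 0.8, 0.2], [4.9, 5.0, 0.5, 0.5],
    [1.0, 1.2, 0.2, 0.8], [0.9, 1.1, 0.9, 0.1], [1.1, 0.8, 0.4, 0.6],
]
TOY_LABELS = ["A", "A", "A", "B", "B", "B"]
TOY_SUBGROUPS = {"informative": [0, 1], "noise": [2, 3]}   # hypothetical subgroups

ranking = rank_subgroups(TOY_SAMPLES, TOY_LABELS, TOY_SUBGROUPS)
print(ranking[0])   # the best-ranked subgroup and its error
```

Because each subgroup is scored independently, the cost grows linearly with the number of candidate subgroups rather than exponentially with the number of genes.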
Applying the heuristic method
• 1. Selecting “candidate” subgroups of genes:
  - We used biological knowledge and hierarchical clustering algorithms to select four subgroups:
  - Proliferation: sets of genes involved in the biological process of proliferation
  - T-cell: genes preferentially expressed in T-cells
  - Lymphnode: sets of genes normally expressed in lymph nodes
  - GCB: genes that distinguish germinal centre B-cells from other stages in B-cell ontogeny
• 2. Identifying the subgroups of genes most related to the separation GCB-like / AB-like:
  - Training of SVM, MLP and LP as classifiers using each subgroup of genes and all the subgroups together (All): 5 classification tasks
  - Leave-one-out used with Gaussian, polynomial and linear SVM
  - 10-fold cross-validation used with Gaussian, polynomial and linear SVM, MLP and LP
GCB signature

| Learn. machine model | Gen. error | St. dev. | Prec. | Sens. |
|---|---|---|---|---|
| SVM-linear | 10.50 | 11.16 | 90.00 | 90.00 |
| SVM-poly | 8.70 | 14.54 | 96.67 | 88.33 |
| SVM-RBF | 4.50 | 9.55 | 100.0 | 90.00 |
| MLP | 8.70 | 10.50 | 90.90 | 90.90 |
| LP | 8.70 | 10.50 | 90.90 | 90.90 |

All signatures

| Learn. machine model | Gen. error | St. dev. | Prec. | Sens. |
|---|---|---|---|---|
| SVM-linear | 15.00 | 11.16 | 85.00 | 85.00 |
| SVM-poly | 14.00 | 18.97 | 93.33 | 76.67 |
| SVM-RBF | 10.00 | 10.54 | 100.00 | 76.67 |
| MLP | 8.70 | 13.28 | 95.00 | 86.36 |
| LP | 10.87 | 14.28 | 86.96 | 90.90 |
The second problem: summary • The results support the hypothesis of Alizadeh about the existence of two distinct subgroups in DLBCL. • The heuristic method identifies the GCB signature as a cluster of coordinately expressed genes related to the separation between the GCB-like and AB-like DLBCL subgroups.
Developments
• I. Methods to discover subclasses of tumors on a molecular basis:
  - Integrating “a priori” biological knowledge, supervised machine learning methods and unsupervised clustering methods
• II. Methods to identify small subsets of genes correlated to tumors:
  - Refinements of the proposed heuristic method using clustering algorithms with semi-automatic selection of the number of significant subgroups of genes
  - Greedy algorithms based on mutual information measures
• Stratifying patients into molecularly relevant categories, enhancing the discrimination power and precision of clinical trials
• Automatic diagnosis of tumors using DNA microchips
• Discovery of new subclasses of tumors
• Enhancing biological knowledge about tumoral processes
• New perspectives on the development of new cancer therapeutics based on a molecular understanding of the cancer phenotype
SVM for gene expression data analysis: a bibliography
• M. Brown et al. Knowledge-based analysis of microarray gene expression data by using support vector machines. PNAS, 97(1):262-267, National Academy of Sciences, Washington DC, 2000.
• T.S. Furey, N. Cristianini, N. Duffy, D. Bednarski, M. Schummer, and D. Haussler. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics, 16(10):906-914, 2000.
• P. Pavlidis, J. Weston, J. Cai, and W.N. Grundy. Gene functional classification from heterogeneous data. In Fifth International Conference on Computational Molecular Biology, ACM, Montreal, Canada, 2001.
• C. Yeang, S. Ramaswamy, P. Tamayo, S. Mukherjee, R. Rifkin, M. Angelo, M. Reich, E. Lander, J. Mesirov, and T. Golub. Molecular classification of multiple tumor types. In ISMB 2001, Proceedings of the 9th International Conference on Intelligent Systems for Molecular Biology, pages 316-322, Copenhagen, Denmark. Oxford University Press, 2001.
• S. Ramaswamy et al. Multiclass cancer diagnosis using tumor gene expression signatures. PNAS, 98(26):15149-15154, 2001.
• I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene selection for cancer classification using support vector machines. Machine Learning, 46(1/3):389-422, 2002.
• J. Weston, F. Perez-Cruz, O. Bousquet, O. Chapelle, A. Elisseeff, and B. Scholkopf. Feature selection and transduction for prediction of molecular bioactivity for drug design. Bioinformatics, 1(1), 2002.
• G. Valentini. Gene expression data analysis of human lymphoma using support vector machines and output coding ensembles. Artificial Intelligence in Medicine, 26(3):283-306, 2002.
• G. Valentini, M. Muselli, and F. Ruffino. Bagged ensembles of SVMs for gene expression data analysis. In The IEEE-INNS-ENNS International Joint Conference on Neural Networks, Portland, USA, 2003.