An Interactive Self-Supervised Learning Framework for Classifying Microarray Gene Expression Data Presenter: Yijuan Lu
Outline • Background • Discriminant-EM (DEM) • Kernel Discriminant-EM (KDEM) • Experiments • Relevance Feedback in Microarray Analysis • Interactive self-supervised learning system for gene classification and retrieval • Conclusion and Discussion
Background • Microarray technology enables us to simultaneously observe the expression levels of many thousands of genes at the transcription level. These expression data show distinct patterns. • Microarray: • a chip with a matrix of thousands of spots printed onto it • Each spot binds to a specific gene [Figure: microarray chips and images scanned by laser, with example probe expression values such as D26528_at = 193 and D26561_cds1_at = -70]
Background • Two major characteristics: (1) High dimensionality (usually tens to hundreds of dimensions) Features: expression measurements of different genes (2) Small sample size (insufficient labeled training data) In general, the functions of most genes are not known. For example, a simple leukemia microarray dataset: Training set: 38 acute leukemia samples of two types, ALL and AML Features: expression measurements of 50 informative genes
Background • Based on the premise that genes with correlated functions tend to exhibit similar expression patterns, various machine learning methods have been applied to capture these patterns in microarray data. • The two major characteristics become two challenges: High dimensionality: machine learning suffers from the "curse of dimensionality," as the search space grows exponentially with the dimension. Small sample size: purely supervised methods such as SVM cannot give stable or meaningful results with few labeled samples. • Therefore, an approach that is relatively unaffected by these problems allows us to get more from less. • Discriminant-EM (DEM) [1] is a self-supervised learning algorithm proposed for exactly this purpose.
Discriminant-EM (DEM) • The DEM algorithm [1] is a self-supervised learning algorithm that has been successfully used in content-based image retrieval (CBIR). • DEM alleviates the small sample size problem by compensating a small set of labeled data $L = \{(x_i, y_i)\}_{i=1}^{N_l}$ with a large set of unlabeled data $U = \{x_j\}_{j=1}^{N_u}$, giving the hybrid data set $D = L \cup U$. Assume the hybrid data set is drawn from a mixture density of $C$ components (each component corresponds to one class), parameterized by $\Theta = \{\theta_1, \ldots, \theta_C\}$. • The mixture model can be represented as $p(x \mid \Theta) = \sum_{c=1}^{C} p(x \mid c, \theta_c)\, P(c)$. • The parameters $\Theta$ can be estimated by maximizing the a posteriori probability $p(\Theta \mid D)$, which can be computed by the EM algorithm. • The label of an unlabeled sample is predicted by $\hat{y}_j = \arg\max_c p(c \mid x_j, \hat{\Theta})$.
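To make the hybrid labeled/unlabeled estimation concrete, here is a minimal sketch, assuming NumPy/SciPy, of EM on a Gaussian mixture over the hybrid set; `semi_supervised_em` and all variable names are our own illustration, not the DEM implementation of [1]:

```python
import numpy as np
from scipy.stats import multivariate_normal

def semi_supervised_em(X_lab, y_lab, X_unl, n_classes, n_iter=50):
    """EM over the hybrid set D = L ∪ U: responsibilities of labeled points
    are clamped to their known class; unlabeled points get soft assignments."""
    X = np.vstack([X_lab, X_unl])
    n, d = X.shape
    n_lab = len(X_lab)
    # r[i, c] approximates p(c | x_i); labels fixed, unlabeled start uniform.
    r = np.full((n, n_classes), 1.0 / n_classes)
    r[:n_lab] = np.eye(n_classes)[y_lab]
    for _ in range(n_iter):
        # M-step: class priors, means, covariances from soft counts.
        nk = r.sum(axis=0)
        pi = nk / n
        mu = (r.T @ X) / nk[:, None]
        cov = []
        for c in range(n_classes):
            diff = X - mu[c]
            cov.append((r[:, c, None] * diff).T @ diff / nk[c]
                       + 1e-6 * np.eye(d))   # regularize for stability
        # E-step: p(c | x) ∝ P(c) p(x | c), updated for unlabeled points only.
        lik = np.column_stack([pi[c] * multivariate_normal.pdf(X, mu[c], cov[c])
                               for c in range(n_classes)])
        r[n_lab:] = lik[n_lab:] / lik[n_lab:].sum(axis=1, keepdims=True)
    return r.argmax(axis=1)  # predicted labels for all samples
```

Clamping the labeled responsibilities is what distinguishes this from plain unsupervised EM: the few labeled genes anchor the mixture components to the intended classes.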
Discriminant-EM (DEM) • Find a mapping such that the data are well clustered in the mapped space, where the probabilistic structure can be simplified and captured by simpler Gaussian mixtures. • Multiple discriminant analysis (MDA) is a traditional multi-class discriminant analysis that finds such a mapping, so that the data are better clustered in the reduced feature space. The goal is to maximize the ratio of between-class variance to within-class variance: $J(W) = \frac{|W^T S_B W|}{|W^T S_W W|}$ (1) • Here, $W$ denotes the weight matrix of a linear feature extractor (i.e., for a sample $x$, the feature is given by the projection $W^T x$), and $S_B$ and $S_W$ are the between-class and within-class scatter matrices. Between-class variance measures the separation of class centers, and within-class variance measures the scatter of samples around their own class centers. • However, when the available labeled data are insufficient, MDA can hardly be expected to produce good results.
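For reference, maximizing Eq. (1) reduces to the generalized eigenproblem $S_B w = \lambda S_W w$; the sketch below (our own minimal version, not the authors' code) computes the projection with SciPy:

```python
import numpy as np
from scipy.linalg import eigh

def mda_projection(X, y, n_components):
    """Solve max_W |W^T S_B W| / |W^T S_W W| via the generalized
    eigenproblem S_B w = λ S_W w; columns of W are the top eigenvectors."""
    classes = np.unique(y)
    d = X.shape[1]
    mean_all = X.mean(axis=0)
    S_B = np.zeros((d, d))
    S_W = np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        S_B += len(Xc) * np.outer(mc - mean_all, mc - mean_all)
        S_W += (Xc - mc).T @ (Xc - mc)
    S_W += 1e-6 * np.eye(d)        # regularize: small samples, high dimension
    vals, vecs = eigh(S_B, S_W)    # generalized symmetric eigenproblem
    W = vecs[:, np.argsort(vals)[::-1][:n_components]]
    return X @ W, W                # projected data and projection matrix
```

The small ridge added to $S_W$ matters precisely in the microarray setting, where the dimension can exceed the number of labeled samples and $S_W$ is otherwise singular.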
Discriminant-EM (DEM) • By combining MDA with the EM framework, DEM supplies MDA with enough labeled data by identifying "similar" samples in the unlabeled data set to enlarge the labeled set. • In turn, DEM provides EM with a projected space in which it is easier to select the structure of the Gaussian mixture model.
Kernel Discriminant-EM (KDEM) • Since the discriminating step is linear, it is difficult for DEM to handle nonlinearly separable data. • We extend the linear algorithm in DEM to a nonlinear kernel algorithm, Kernel Discriminant-EM (KDEM) [2], in which the data are first mapped nonlinearly into a high-dimensional feature space $F$ by $\phi: \mathcal{X} \rightarrow F$, where the data are better linearly separated; the original MDA algorithm is then applied in the kernel feature space $F$. • Using the superscript $\phi$ to denote quantities in the new space, the objective function has the form $J(W^{\phi}) = \frac{|(W^{\phi})^T S_B^{\phi} W^{\phi}|}{|(W^{\phi})^T S_W^{\phi} W^{\phi}|}$, with $S_B^{\phi} = \sum_{c=1}^{C} N_c (m_c^{\phi} - m^{\phi})(m_c^{\phi} - m^{\phi})^T$ and $S_W^{\phi} = \sum_{c=1}^{C} \sum_{x \in X_c} (\phi(x) - m_c^{\phi})(\phi(x) - m_c^{\phi})^T$ as the between-class and within-class scatter matrices respectively, where $m_c^{\phi} = \frac{1}{N_c} \sum_{x \in X_c} \phi(x)$, $m^{\phi} = \frac{1}{N} \sum_{i=1}^{N} \phi(x_i)$, and $N$ is the total number of samples.
Kernel Discriminant-EM (KDEM) • However, there is no direct way to compute the solution $W^{\phi}$, because $F$ has very high or even infinite dimension. • But we know [2] that any column $w$ of the solution must lie in the span of all training samples mapped into $F$, i.e., $w \in \mathrm{span}\{\phi(x_1), \ldots, \phi(x_N)\}$. Thus $w = \sum_{i=1}^{N} \alpha_i \phi(x_i)$ for some expansion coefficients $\alpha = (\alpha_1, \ldots, \alpha_N)^T \in \mathbb{R}^N$. • We can therefore project a data point onto one coordinate of the linear subspace of $F$ as follows: $w^T \phi(x) = \sum_{i=1}^{N} \alpha_i \, \phi(x_i)^T \phi(x) = \alpha^T k_x$. Here, we use the kernel notation $k(x_i, x) = \phi(x_i)^T \phi(x)$ and $k_x = (k(x_1, x), \ldots, k(x_N, x))^T$.
Kernel Discriminant-EM (KDEM) • Similarly, we can project each class mean onto an axis of the subspace of feature space $F$ using only inner products: $w^T m_c^{\phi} = \frac{1}{N_c} \sum_{x \in X_c} \sum_{i=1}^{N} \alpha_i k(x_i, x) = \alpha^T \mu_c$, where $(\mu_c)_i = \frac{1}{N_c} \sum_{x \in X_c} k(x_i, x)$. • The goal has thus changed to finding $\alpha^{*} = \arg\max_{\alpha} \frac{|\alpha^T K_B \alpha|}{|\alpha^T K_W \alpha|}$, where $K_B$ and $K_W$ are matrices that require only kernel computations on the training samples.
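A compact sketch of this kernelized discriminant step, assuming an RBF kernel and our own helper names; $K_B$ and $K_W$ are built exactly as above from kernel evaluations only:

```python
import numpy as np
from scipy.linalg import eigh

def rbf_kernel(A, B, sigma=1.0):
    """Gaussian (RBF) kernel matrix k(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def kernel_mda(X, y, n_components, sigma=1.0):
    """Kernel MDA: maximize |α^T K_B α| / |α^T K_W α|, where every quantity
    comes from the Gram matrix only (no explicit map into F is needed)."""
    n = len(X)
    K = rbf_kernel(X, X, sigma)              # N x N Gram matrix
    mu_all = K.mean(axis=1, keepdims=True)   # mean kernel column, all samples
    K_B = np.zeros((n, n))
    K_W = np.zeros((n, n))
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        Kc = K[:, idx]                       # kernel columns for class c
        mu_c = Kc.mean(axis=1, keepdims=True)
        K_B += len(idx) * (mu_c - mu_all) @ (mu_c - mu_all).T
        K_W += (Kc - mu_c) @ (Kc - mu_c).T   # within-class scatter in F
    K_W += 1e-6 * np.eye(n)                  # regularize
    vals, vecs = eigh(K_B, K_W)
    alpha = vecs[:, np.argsort(vals)[::-1][:n_components]]
    return K @ alpha, alpha                  # projections α^T k_x, coefficients
```

Projecting a new point $x$ then only requires its kernel vector $k_x$ against the training samples, i.e., `rbf_kernel(X, x[None], sigma).T @ alpha`.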
Kernel Discriminant-EM (KDEM) • KDEM can be initialized by selecting all labeled data as kernel vectors and training a weak classifier on the labeled samples only. Then, the three steps of KDEM are iterated until an appropriate stopping criterion is satisfied: • E-step: set $\hat{Z}^{(k+1)} = E[Z \mid D; \hat{\Theta}^{(k)}]$, i.e., estimate probabilistic labels for the unlabeled data. • D-step: set $\alpha^{(k+1)} = \arg\max_{\alpha} \frac{|\alpha^T K_B \alpha|}{|\alpha^T K_W \alpha|}$, and project each data point $x$ into the linear subspace of feature space $F$. • M-step: set $\hat{\Theta}^{(k+1)} = \arg\max_{\Theta} p(\Theta \mid D; \hat{Z}^{(k+1)})$. Here, $Z = \{\text{label, weight}\}$.
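Putting the pieces together, a skeleton of the KDEM iteration might look as follows; it reuses the hypothetical `kernel_mda` and `semi_supervised_em` sketches above and a nearest-class-mean initializer as the weak classifier (our simplification, not the authors' implementation):

```python
import numpy as np

def kdem(X_lab, y_lab, X_unl, n_classes, n_iter=10, sigma=1.0):
    """Skeleton of the KDEM loop: initialize from labeled data, then
    alternate the D-step (kernel discriminant projection) with the
    E/M-steps (semi-supervised EM in the projected subspace)."""
    X = np.vstack([X_lab, X_unl])
    n_lab = len(X_lab)
    # Weak initial classifier on labeled data only: nearest class mean.
    means = np.array([X_lab[y_lab == c].mean(axis=0) for c in range(n_classes)])
    y_hat = np.argmin(((X[:, None] - means[None]) ** 2).sum(-1), axis=1)
    y_hat[:n_lab] = y_lab
    for _ in range(n_iter):
        # D-step: discriminant projection under the current label estimates.
        Z, _ = kernel_mda(X, y_hat, n_components=max(n_classes - 1, 1),
                          sigma=sigma)
        # E-step + M-step: EM on a Gaussian mixture in the projected subspace.
        y_hat = semi_supervised_em(Z[:n_lab], y_lab, Z[n_lab:], n_classes)
        y_hat[:n_lab] = y_lab   # labeled samples stay clamped to their labels
    return y_hat[n_lab:]        # predicted labels for the unlabeled genes
```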
KDEM on yeast cell cycle regulation dataset • The sequencing and functional annotation of the whole S. cerevisiae genome have been completed, so it serves as an ideal test bed for estimating the accuracy of the proposed methods. • We focused on five representative functional classes (Table I) that have been previously analyzed and demonstrated to be learnable in [4]. TABLE I: Functional classes and distribution of member genes used in our evaluation.
KDEM on yeast cell cycle regulation dataset • In a well-cited study [4], SVM, two decision tree learners (C4.5 and MOC1), Parzen windows, and other methods were investigated for gene classification on the same dataset. • Results: SVM with kernel functions significantly outperformed the other algorithms. • We therefore focused on comparing KDEM with SVM using the same polynomial and radial basis function (RBF) kernels. • Polynomial kernels: $k(x, y) = (x \cdot y + 1)^d$; RBF kernels: $k(x, y) = \exp(-\|x - y\|^2 / 2\sigma^2)$. $\sigma$ was set to a widely used value, the median of the distances from each positive example to the nearest negative example [4].
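As a sketch of these kernel choices, the following hypothetical helpers (names our own) implement the polynomial kernel and the $\sigma$ heuristic as we read it from [4]:

```python
import numpy as np

def poly_kernel(A, B, degree=3):
    """Polynomial kernel k(x, y) = (x · y + 1)^d."""
    return (A @ B.T + 1.0) ** degree

def rbf_sigma(X_pos, X_neg):
    """σ heuristic attributed to [4]: the median, over positive examples,
    of the Euclidean distance to the nearest negative example."""
    d = np.sqrt(((X_pos[:, None] - X_neg[None]) ** 2).sum(-1))  # pairwise dists
    return np.median(d.min(axis=1))
```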
KDEM on yeast cell cycle regulation dataset • We performed two-class classification with positive genes from one functional class and negative genes from the remaining classes. • The yeast gene dataset is imbalanced: the number of negative genes is much larger than the number of positive genes. • In this case, accuracy or precision alone is not a good evaluation metric, because false negatives matter more than false positives [4]. Thus, we chose the f_measure, the harmonic mean of precision and recall. • For each class, we randomly selected 2/3 of the positive genes and 2/3 of the negative genes as the training set and used the remaining genes as the test set. This procedure was repeated 100 times.
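A minimal sketch of this evaluation protocol, with hypothetical helper names and any classifier plugged in as `fit_predict`:

```python
import numpy as np

def f_measure(y_true, y_pred):
    """F-measure = 2PR / (P + R), from true/false positives and negatives."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def repeated_split_eval(X, y, fit_predict, n_runs=100, frac=2/3, seed=0):
    """Average F-measure over random 2/3 train / 1/3 test splits, sampled
    separately from positives and negatives to mirror the protocol above.
    `fit_predict(X_tr, y_tr, X_te)` is a placeholder for any classifier."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_runs):
        tr = np.zeros(len(y), dtype=bool)
        for label in (0, 1):                       # stratified 2/3 sample
            idx = np.where(y == label)[0]
            tr[rng.choice(idx, int(frac * len(idx)), replace=False)] = True
        scores.append(f_measure(y[~tr], fit_predict(X[tr], y[tr], X[~tr])))
    return float(np.mean(scores))
```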
Experiment • KDEM has one parameter: the number of kernel vectors used. • There is as yet no principled way to choose it. In our experiments, we tested KDEM with different numbers of kernel vectors and selected the one that yielded the highest f_measure for most classes. • Fig. 1 shows the average f_measure (in percent) of KDEM with the RBF kernel on the yeast data, under a varying number of kernel vectors.
KDEM on yeast cell cycle regulation dataset • Table II: Comparison of precision, recall, and f_measure for various classification methods on the yeast cell cycle regulation dataset.
KDEM on yeast cell cycle regulation dataset • Given a small sample size, SVM could hardly find sufficient labeled data to train classifiers well. By contrast, DEM and KDEM eased the small sample size problem by incorporating a large amount of unlabeled data. • Figure 2 confirms this expectation by showing the performance of KDEM, DEM, and SVM on the Histone/Chromosome class as the training set size drops from 2/3 to 1/7 of the total samples.
KDEM on yeast cell cycle regulation dataset • Compared to DEM, KDEM has a better capacity to separate linearly non-separable data. • Fig. 3. Data distribution in the projected subspace: DEM on the left, KDEM on the right. Samples from different classes are more separated and better clustered in the nonlinear subspace produced by KDEM. (*: Cytoplasmic ribosome class; o: Proteasome class)
KDEM on Malaria Plasmodium falciparum microarray dataset • Malaria is one of the most devastating infectious diseases: approximately 500 million cases occur annually, and about 2 million people die. The causative agent of the most burdensome form of human malaria is the protozoan parasite Plasmodium falciparum. • Whole-genome sequencing of P. falciparum predicted over 5,400 genes [5], about 60% of which have unknown functions. Table III: Functional classes and number of member genes.
KDEM on Malaria Plasmodium falciparum microarray dataset • For each class, we randomly selected 2/3 of the positive genes and 2/3 of the negative genes as the training set and used the remaining genes as the test set. This procedure was repeated 100 times. Table IV: Comparison of f_measure for various classification methods on the P. falciparum dataset.
Putative genes of specific functional classes identified by KDEM
Relevance Feedback (RF) • RF was adapted and introduced into content-based image retrieval (CBIR) during the early and mid 1990s [5]. • A challenge in CBIR is the semantic gap between the high-level semantics in the human mind and the low-level features (such as color, texture, and shape) computed by the machine. • Users seek semantic similarity, but the machine can only measure low-level similarity (e.g., an airplane and a bird may look very similar in terms of low-level features such as shape, yet differ semantically). • To bridge the gap, RF, with the human in the loop, was introduced; we bring the same idea to microarray analysis.
Relevance Feedback • There is a gap between low-level expression data and high-level function. In the analysis of microarray data, people are interested in finding genes with a particular function, yet machines can only search for genes with similar expression patterns. • Two main problems (gaps) arise in this challenge: (1) how can we derive gene expression patterns from expression data? (2) how can we go from expression patterns to function? • To bridge these gaps, we introduce RF into microarray analysis and propose an interactive self-supervised learning framework for gene classification.
Relevance Feedback in microarray analysis • Step 1. The machine provides initial classification results for the query class, using the initial training set. • Step 2. Users give feedback on the classification results, judging whether, and to what degree, genes belong to that class, based on knowledge such as Gene Ontology classification and functional annotation. • Step 3. The machine updates the training set based on the feedback and produces new classification results with the updated training set. Return to Step 2.
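The three steps map naturally onto a small loop; the sketch below uses hypothetical callables (`classify`, `get_user_feedback`) to stand in for the classifier and the human in the loop:

```python
def relevance_feedback_loop(classify, get_user_feedback, train_set, n_rounds=5):
    """Sketch of the three-step RF loop above. `classify(train_set)` returns
    ranked (gene, score) results for the query class; `get_user_feedback`
    returns user-confirmed labels (e.g., guided by Gene Ontology annotation)."""
    results = classify(train_set)                 # Step 1: initial results
    for _ in range(n_rounds):
        feedback = get_user_feedback(results)     # Step 2: human judgments
        train_set = {**train_set, **feedback}     # Step 3: enlarge/correct set
        results = classify(train_set)             # reclassify and repeat
    return results
```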
Interactive self-supervised learning system for gene classification and retrieval
Improved Learning by Relevance Feedback • This system achieves improved performance through RF. In our experiments, after a single round of correcting four ambiguous training examples (PF14_0601, PF14_0104, PF13_0178, and PFI1020c) based on Gene Ontology predictions, classification accuracy increased from 84.5% to 87.2%. • It also offers a powerful means of annotation feedback. For instance, two oligonucleotide probes were both predicted to correspond to the same gene; however, they display apparently different developmental profiles: one is positively classified into Group 1, whereas the other is classified as negative. • This discrepancy is probably due to an error in the gene model; in other words, these two probes may represent two different genes rather than one. Our system clearly pinpointed such errors.
Conclusion and Discussion • KDEM extends the linear DEM to a nonlinear algorithm, so that nonlinear discriminating features can be identified and training data can be better classified in a nonlinear feature subspace. • KDEM is applied to gene classification on the yeast and P. falciparum datasets and compared with the state-of-the-art SVM using polynomial and RBF kernel functions; KDEM outperforms SVM in extensive tests. • Some previously unannotated genes in the P. falciparum dataset are identified, with agreement between Gene Ontology and KDEM.
Conclusion and Discussion • To bridge the gap between gene expression and the associated functions, an interactive learning framework, Relevance Feedback, is also introduced for microarray analysis, and a real-time demo system is constructed for gene classification and retrieval. • The system can pinpoint annotation errors and improves learning significantly after a few RF iterations, demonstrating the advantage of keeping the human in the loop.
References • Y. Wu, Q. Tian, and T. S. Huang, "Discriminant EM algorithm with application to image retrieval," Proc. IEEE Conf. on Computer Vision and Pattern Recognition, 2000. • Q. Tian, Y. Wu, J. Yu, and T. S. Huang, "Self-supervised learning based on discriminative nonlinear features for image classification," Pattern Recognition, vol. 38, no. 6, pp. 903-917, 2005. • B. Schölkopf and A. J. Smola, Learning with Kernels, Cambridge, MA: MIT Press, 2002. • M. P. Brown, W. N. Grundy, D. Lin, et al., "Knowledge-based analysis of microarray gene expression data by using support vector machines," Proc. Natl. Acad. Sci. USA, vol. 97, no. 1, pp. 262-267, 2000. • X. Zhou and T. S. Huang, "Relevance feedback in image retrieval: a comprehensive review," ACM Multimedia Systems Journal, special issue on CBIR, vol. 8, no. 6, pp. 536-544, 2003.