Recursive Partitioning for Tumor Classification with Gene Expression Microarray Data
Heping Zhang, Chang-Yung Yu, Burton Singer, Momiao Xiong
Presented by Weihua Huang
Data Used in the Article
• Expression profiles of 2,000 genes, measured with an Affymetrix oligonucleotide array in 22 normal and 40 colon cancer tissues
• The response is binary (normal or cancer tissue), and the predictor variables are the 2,000 genes
Classification Tree Using Recursive Partitioning
Goal: To partition the feature space into disjoint regions by growing a tree, so that the tissues in the same region are homogeneous in terms of response.
Algorithm: Start with a root node containing the study sample and split it into smaller and smaller nodes according to whether a selected predictor is above a chosen cutoff value. At each splitting step, the predictor and its cutoff are chosen to maximize the reduction in node impurity:
ΔI = P(A) I(A) - P(A_L) I(A_L) - P(A_R) I(A_R)
where A is the node being split and A_L and A_R are its left and right daughter nodes.
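The splitting step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `entropy` and `best_split`, the use of midpoints between observed values as candidate cutoffs, and the toy data are all assumptions. Probabilities are taken relative to the node being split, so P(A) = 1, which does not change which split is selected.

```python
import math
from typing import List, Optional, Tuple


def entropy(labels: List[int]) -> float:
    """Entropy impurity of a node; labels are 1 (normal) or 0 (cancer)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    if p in (0.0, 1.0):
        return 0.0  # pure node: 0 * log(0) is taken as 0
    return -p * math.log(p) - (1 - p) * math.log(1 - p)


def best_split(X: List[List[float]], y: List[int]) -> Optional[Tuple[int, float, float]]:
    """Return (gene index, cutoff, impurity reduction) for the best single split.

    Candidate cutoffs are midpoints between consecutive distinct values of
    each gene (an assumption made for this sketch). Returns None when no
    gene has two distinct values.
    """
    n = len(y)
    parent = entropy(y)
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            cutoff = (lo + hi) / 2
            left = [y[i] for i in range(n) if X[i][j] <= cutoff]
            right = [y[i] for i in range(n) if X[i][j] > cutoff]
            # Impurity reduction with P(A) = 1 at the node being split
            gain = parent - len(left) / n * entropy(left) - len(right) / n * entropy(right)
            if best is None or gain > best[2]:
                best = (j, cutoff, gain)
    return best


# Toy data (hypothetical): 4 tissues, 2 genes; gene 0 separates the classes.
X_toy = [[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [4.0, 2.0]]
y_toy = [1, 1, 0, 0]
gene, cutoff, gain = best_split(X_toy, y_toy)
print(gene, cutoff, gain)
```

Growing a full tree would apply `best_split` recursively to the two daughter nodes until a stopping rule is met; that part is omitted here.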
Classification Tree Using Recursive Partitioning
Node impurity: One measure of node impurity is the entropy function I = -P log(P) - (1-P) log(1-P), where P is the probability of a tissue being normal within the node.
• Minimum impurity (= 0): all tissues within the node are of the same type (P = 0 or 1)
• Maximum impurity (= log 2): half of the tissues within the node are normal and half are cancer (P = 0.5)
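The two extremes follow directly from the entropy function. The short derivation below assumes the natural logarithm and the usual convention that 0 log 0 = 0; the last line evaluates the impurity at the root node of this data set (22 normal tissues out of 62) as a worked example.

```latex
\begin{align*}
I(P) &= -P\log P - (1-P)\log(1-P), \\
\frac{dI}{dP} &= -\log P - 1 + \log(1-P) + 1 = \log\frac{1-P}{P}, \\
\frac{dI}{dP} &= 0 \iff P = \tfrac{1}{2}, \qquad I\!\left(\tfrac{1}{2}\right) = \log 2 \approx 0.693, \\
\frac{d^2 I}{dP^2} &= -\frac{1}{P(1-P)} < 0 \quad \text{(so } P = \tfrac{1}{2} \text{ is the maximum)}, \\
\lim_{P \to 0^+} I(P) &= \lim_{P \to 1^-} I(P) = 0, \\
I\!\left(\tfrac{22}{62}\right) &= -\tfrac{22}{62}\log\tfrac{22}{62} - \tfrac{40}{62}\log\tfrac{40}{62} \approx 0.650 .
\end{align*}
```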
Results from the Classification Tree on the Data
Fig. 1. Classification tree for tissue types using expression data from three genes (M26383, R15447, M28214).
Another Way to Visualize the Recursive Partitioning
Fig. 3. A scatterplot of expression data from R15447 and M28214 for a subset of tissues (node 3 in Fig. 1).
Results from Recursive Partitioning
Quality of the tree-based classification, assessed with a localized 5-fold cross-validation error rate:
• The same genes are kept at the same nodes; only the cutoff values are re-chosen.
• The 40 cancer tissues are randomly divided into 5 subsamples of 8, and the 22 normal tissues into 5 subsamples of 4, 4, 4, 5, and 5. Four subsamples each from the cancer and normal tissues are used to choose the cutoff values for the three splits. The remaining samples are used to count the tissues misclassified under the new cutoff values.
• The error rate is between 6% and 8% from two runs of cross-validation, which is much better than that obtained by existing analyses.
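The subsampling scheme can be sketched as follows. This is only an illustration of the fold construction under stated assumptions: the function names, the fixed random seed, and the tissue numbering (0 to 39 for cancer, 40 to 61 for normal) are hypothetical, and the step that re-chooses the three cutoffs on each training set is left out.

```python
import random
from typing import List, Tuple


def split_into_folds(indices: List[int], k: int, rng: random.Random) -> List[List[int]]:
    """Shuffle indices and cut them into k folds whose sizes differ by at most 1."""
    shuffled = indices[:]
    rng.shuffle(shuffled)
    base, extra = divmod(len(shuffled), k)
    folds, start = [], 0
    for f in range(k):
        # The last `extra` folds each take one additional sample.
        size = base + (1 if f >= k - extra else 0)
        folds.append(shuffled[start:start + size])
        start += size
    return folds


def localized_cv_splits(cancer: List[int], normal: List[int],
                        k: int = 5, seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Build k (train, test) pairs, pairing one cancer fold with one normal fold."""
    rng = random.Random(seed)
    cancer_folds = split_into_folds(cancer, k, rng)
    normal_folds = split_into_folds(normal, k, rng)
    splits = []
    for i in range(k):
        test = cancer_folds[i] + normal_folds[i]
        held_out = set(test)
        train = [s for s in cancer + normal if s not in held_out]
        splits.append((train, test))
    return splits


# Hypothetical tissue numbering: 0-39 cancer, 40-61 normal.
cancer_ids = list(range(40))
normal_ids = list(range(40, 62))
splits = localized_cv_splits(cancer_ids, normal_ids)
for train, test in splits:
    print(len(train), len(test))
```

In each round, the cutoffs for the three fixed genes would be re-estimated on `train` and the misclassified tissues counted on `test`; summing the misclassifications over the 5 rounds and dividing by 62 gives the cross-validation error rate.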
Correlation Analysis on Genes
• Functional expressions from various genes are correlated.
• Examine the correlation patterns of the three selected genes in Fig. 1.
Correlation Between the Three Selected Genes and the Remaining Expression Data
Another Tree Based on a Different Set of Three Genes
Fig. 6. Classification tree for tissue types using expression data from three genes (R87126, T62947, X15183).
Advantages of the Classification Tree
1. Efficient with a large number of genes
2. Automatically selects valuable and user-friendly genes as predictors
3. More precise than some other classification methods, such as support vector machines and linear discriminant analysis
Conclusions
1. It is likely that the information contained in a large number of genes can be captured by a small optimal set of genes without significant loss of information.
2. The precision of classification by recursive partitioning is important for clinical application.