This research explores Rough Sets for effective gene selection in cancer classification using gene expression data. Introduction discusses the significance of gene expression data, challenges, and feature selection methods. Rough Sets Based Feature Selection explains the fundamental concepts, such as decision tables and reducts. The Rough Sets Based Gene Selection Method is detailed for cancer classification, including formalizing the gene expression data set and the selection procedure. Experimental results of applying the method to acute leukemia data are presented. Key gene X95735 is highlighted through various selection methods for its classification significance.
Efficient Gene Selection with Rough Sets From Gene Expression Data • Lijun Sun, Duoqian Miao, Hongyun Zhang
Introduction • Rough Sets Based Feature Selection • Rough Sets Based Gene Selection Method • Experimental Results • Conclusion
Introduction-1 • Different cells, or the same cell under different conditions, yield different microarray results • Comparing gene expression data derived from microarray results between normal and tumor cells can provide important information for tumor classification
Introduction-2 • Gene expression data sets have characteristics that differ markedly from data previously used for classification. • Most publicly available gene expression data has the following properties: • high dimensionality: up to tens of thousands of genes, • very small data set size: fewer than 100 samples, • most genes are not related to cancer classification. • Problems: • Noise • Dimensionality • Cost of classification algorithms
Introduction-3 • only a very small fraction of genes are informative for a certain task • Base the classification on only a subset of the genes • Reduce dimensionality – for convenience, decrease running time • Drop noisy/irrelevant genes – for accuracy • For biological insight
Introduction-4 • The feature ranking approach is the most commonly used for feature selection • Filter approaches remove irrelevant features according to general characteristics of the data. • (1) Use a filter to rank all the genes in the data. • (2) Choose the first n − 1 genes as the best feature subset • Simple, easy, better generalization • Problem: feature sets obtained this way contain redundancy, because genes in similar pathways probably all have very similar scores
Rough Sets Based Feature Selection-1 • Basic concepts of rough sets • Decision table: denoted by T = (U, C ∪ {d}), • where U is the universe of discourse, • C is the set of condition attributes, • {d} is the decision attribute. • Rows of the decision table correspond to objects, and columns correspond to attributes
Rough Sets Based Feature Selection-2 • Indiscernibility Relation. Let a ∈ A, P ⊆ A. A binary relation IND(P), called the indiscernibility relation, is defined as follows: $IND(P) = \{(x, y) \in U \times U : \forall a \in P,\ a(x) = a(y)\}$
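The indiscernibility relation groups objects that agree on every attribute in P, so it can be computed as a partition U/IND(P). The following is a minimal sketch, assuming attribute values are already discretized; the function name `partition` and the toy table are hypothetical and not taken from the slides.

```python
from collections import defaultdict

def partition(rows, attrs):
    """Return U/IND(attrs): blocks of object indices that agree on every attribute in attrs."""
    blocks = defaultdict(list)
    for idx, row in enumerate(rows):
        # Objects with the same value tuple are indiscernible under attrs.
        key = tuple(row[a] for a in attrs)
        blocks[key].append(idx)
    return sorted(blocks.values())

# Hypothetical toy table: two discretized genes per sample.
toy_rows = [
    {"g1": "high", "g2": "low"},
    {"g1": "high", "g2": "low"},
    {"g1": "low", "g2": "low"},
    {"g1": "low", "g2": "high"},
]

print(partition(toy_rows, ["g1"]))        # [[0, 1], [2, 3]]
print(partition(toy_rows, ["g1", "g2"]))  # [[0, 1], [2], [3]]
```

Adding attributes to P can only split blocks further, which is why a larger attribute set never discerns fewer objects.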
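Rough Sets Based Feature Selection-3 • Indispensable and Dispensable Attributes. An attribute c ∈ C is an indispensable attribute if $POS_{C - \{c\}}(D) \neq POS_C(D)$. An attribute c ∈ C is a dispensable attribute if $POS_{C - \{c\}}(D) = POS_C(D)$, where $POS_P(D)$ is the positive region of D with respect to P.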
Rough Sets Based Feature Selection-4 • Reduct. A subset of attributes R ⊆ C is a reduct of C if $POS_R(D) = POS_C(D)$ and $\forall c \in R,\ POS_{R - \{c\}}(D) \neq POS_R(D)$
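A reduct can be checked mechanically on a small table. The sketch below assumes the usual positive-region formulation (a subset is a reduct if it preserves the positive region of the full attribute set and no attribute in it can be dropped); the helper names and the toy table are hypothetical.

```python
from collections import defaultdict

def partition(rows, attrs):
    """Blocks of object indices that agree on every attribute in attrs."""
    blocks = defaultdict(set)
    for idx, row in enumerate(rows):
        blocks[tuple(row[a] for a in attrs)].add(idx)
    return list(blocks.values())

def positive_region(rows, attrs, decision):
    """Objects whose block under attrs falls entirely inside one decision class."""
    pos = set()
    for block in partition(rows, attrs):
        if len({rows[i][decision] for i in block}) == 1:
            pos |= block
    return pos

def is_reduct(rows, subset, all_attrs, decision):
    """True if subset preserves the positive region and is minimal."""
    full = positive_region(rows, all_attrs, decision)
    if positive_region(rows, subset, decision) != full:
        return False
    # Minimality: dropping any single attribute must change the positive region.
    return all(
        positive_region(rows, [a for a in subset if a != c], decision) != full
        for c in subset
    )

# Hypothetical toy decision table: g1 alone determines the class, g2 is irrelevant.
toy_table = [
    {"g1": "high", "g2": "low", "d": "AML"},
    {"g1": "high", "g2": "high", "d": "AML"},
    {"g1": "low", "g2": "low", "d": "ALL"},
    {"g1": "low", "g2": "high", "d": "ALL"},
]

print(is_reduct(toy_table, ["g1"], ["g1", "g2"], "d"))        # True
print(is_reduct(toy_table, ["g1", "g2"], ["g1", "g2"], "d"))  # False: g2 is dispensable
```

In this toy table g1 is also the only indispensable attribute, so it forms the core as well as the single reduct.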
Rough Sets Based Feature Selection-5 • Core. The set of all indispensable features in C is $CORE(C) = \bigcap RED(C)$, where $RED(C)$ is the set of all reducts of C with respect to D.
Rough Sets Based Feature Selection-6 • Rough sets based feature selection • Optimal feature subset selection based on rough set theory can be viewed as finding a reduct R with the best classifying properties. R is then used instead of C in a rule discovery algorithm.
Rough Sets Based Gene Selection Method-1 • Our learning problem is to select highly discriminative genes for cancer classification from gene expression data • A gene expression data set can be formalized as a decision system T = (U, C ∪ {d}), • where the universe U = {x1, x2, ..., xm} is a set of tumors, • the condition attribute set C = {g1, g2, ..., gn} contains each gene, • the decision attribute d corresponds to the class label of each sample. • Each attribute gi ∈ C is represented by a vector gi = {x1,i, x2,i, ..., xm,i}, i = 1, 2, ..., n, where xk,i is the expression level of gene i in sample k, k = 1, 2, ..., m.
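Rough Sets Based Gene Selection Method-2 • Two steps of our method • Step 1: Use a filter method to obtain a feature subset • The t-test is used as the filter • Assuming that there are two classes of samples in a gene expression data set, the t-value for gene g is given by: $t = \dfrac{\mu_1 - \mu_2}{\sqrt{\sigma_1^2 / n_1 + \sigma_2^2 / n_2}}$, where $\mu_i$ and $\sigma_i$ are the mean and standard deviation of the expression levels of gene g in class i, and $n_i$ is the number of samples in class i, i = 1, 2.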
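The filter step can be sketched as follows, assuming the unequal-variance form of the t statistic (difference of class means divided by the square root of the summed per-class variance over class size) and ranking genes by absolute t-value. The function names, the `top_n` parameter, and the expression values are hypothetical illustrations, not the authors' code or data.

```python
import math
from statistics import mean, variance

def t_value(class1, class2):
    """t statistic for one gene from two lists of expression levels (sample variances)."""
    n1, n2 = len(class1), len(class2)
    # Raises ZeroDivisionError if both classes have zero variance.
    return (mean(class1) - mean(class2)) / math.sqrt(
        variance(class1) / n1 + variance(class2) / n2
    )

def rank_genes(expr, labels, top_n):
    """expr maps gene name -> expression levels (one per sample); labels holds two classes."""
    classes = sorted(set(labels))
    scores = {}
    for gene, levels in expr.items():
        a = [v for v, y in zip(levels, labels) if y == classes[0]]
        b = [v for v, y in zip(levels, labels) if y == classes[1]]
        scores[gene] = abs(t_value(a, b))
    # Keep the top_n genes with the largest absolute t-value.
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Hypothetical toy data: gene_a separates the classes, gene_b does not.
labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
expr = {
    "gene_a": [100, 120, 110, 900, 950, 1000],
    "gene_b": [500, 480, 520, 510, 490, 505],
}

print(rank_genes(expr, labels, top_n=1))  # ['gene_a']
```

Ranking by absolute value treats genes that are over-expressed in either class as equally useful; the subset kept here would then be passed to the rough set reduction in Step 2.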
Rough Sets Based Gene Selection Method-3 • Step 2: Use rough set attribute reduction to find a minimal reduct • Information entropy is used as the heuristic information • Given the partition by D, U/IND(D), of U, the entropy based on the partition by a ∈ C, U/IND(a), of U, is given by $E(a) = -\dfrac{1}{|U|} \sum_{X \in U/IND(D)} \sum_{Y \in U/IND(a)} |X \cap Y| \log \dfrac{|X \cap Y|}{|Y|}$
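One way to use entropy as a heuristic is a greedy forward search: repeatedly add the attribute that most lowers the entropy of the decision given the chosen attributes, and stop once it matches the entropy obtained with all candidates. This is a sketch under that assumption, with hypothetical function names and toy data; the result preserves the information of the full set but is not guaranteed to be a minimal reduct.

```python
import math
from collections import Counter, defaultdict

def conditional_entropy(rows, attrs, decision):
    """Entropy of the decision within each indiscernibility block, weighted by block size."""
    blocks = defaultdict(list)
    for row in rows:
        blocks[tuple(row[a] for a in attrs)].append(row[decision])
    total = len(rows)
    h = 0.0
    for block_labels in blocks.values():
        for count in Counter(block_labels).values():
            h -= (count / total) * math.log2(count / len(block_labels))
    return h

def greedy_reduct(rows, candidates, decision):
    """Add the entropy-minimizing attribute until the full-set entropy is reached."""
    target = conditional_entropy(rows, candidates, decision)
    chosen = []
    while conditional_entropy(rows, chosen, decision) > target + 1e-12:
        best = min(
            (a for a in candidates if a not in chosen),
            key=lambda a: conditional_entropy(rows, chosen + [a], decision),
        )
        chosen.append(best)
    return chosen

# Hypothetical toy decision table with discretized expression levels.
toy_table = [
    {"g1": "high", "g2": "low", "d": "AML"},
    {"g1": "high", "g2": "high", "d": "AML"},
    {"g1": "low", "g2": "low", "d": "ALL"},
    {"g1": "low", "g2": "high", "d": "ALL"},
]

print(greedy_reduct(toy_table, ["g1", "g2"], "d"))  # ['g1']
```

In the toy table, g1 alone drives the entropy to zero, so the search stops after one step, mirroring a single-gene reduct.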
Experiment-1 • Data Set: The acute leukemia data of Golub et al. (1999) http://www.genome.wi.mit.edu/MPR • Consists of samples from two different types of acute leukemia, acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML). • The training data set has 38 bone marrow samples (27 ALL and 11 AML). Each sample has expression patterns of 7129 genes measured by the Affymetrix oligonucleotide microarray. • The test data set consists of 24 bone marrow and 10 peripheral blood samples (20 ALL and 14 AML).
Experiment-2 • Only one gene, X95735, is selected in the reduct • X95735 is also selected by many other methods • X95735 is the only gene identified by the J48 pruned tree and the emerging patterns algorithm • X95735 is also selected by the voting machine, SVM, Deb's NSGA algorithm, and Cho's work
Experiment-3 • Two rules are derived: • If the expression level of X95735 > 938, then the sample is classified as AML; • If the expression level of X95735 < 938, then the sample is classified as ALL • 31 of 34 samples in the test data set are correctly classified
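The two rules amount to a single-threshold classifier, and 31 of 34 correct corresponds to a test accuracy of about 91.2%. The sketch below illustrates this; the function name and the example expression levels are hypothetical, and since the rules do not cover a level of exactly 938, assigning that case to ALL is an assumption.

```python
def classify_leukemia(x95735_level, threshold=938):
    """Apply the two derived rules for gene X95735.

    A level exactly equal to the threshold is not covered by the rules;
    it is assigned to ALL here as an assumption.
    """
    return "AML" if x95735_level > threshold else "ALL"

# Hypothetical expression levels, not values from the leukemia data set.
print(classify_leukemia(1500))  # AML
print(classify_leukemia(300))   # ALL

# Reported test result: 31 of 34 samples classified correctly.
accuracy = 31 / 34
print(f"{accuracy:.1%}")  # 91.2%
```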
Comparison of Feature Selection and Classification Results
Conclusion • The expression level of X95735 plays an important role in distinguishing the two types of acute leukemia. The role of X95735 in discerning between the two types of acute leukemia samples is also verified by biological researchers • The rough set based method can find informative genes for classification • The method still needs to be verified on more data sets