Efficient Gene Selection with Rough Sets From Gene Expression Data

Lijun Sun, Duoqian Miao, Hongyun Zhang

Introduction • Rough Sets Based Feature Selection • Rough Sets Based Gene Selection Method • Experimental Results • Conclusion

Introduction-1 • Different cells, or the same cell under different conditions, yield different microarray results • Comparing gene expression data from microarrays of normal and tumor cells can provide important information for tumor classification

Introduction-2 • Gene expression data sets have characteristics that differ markedly from the data previously used for classification. • Most publicly available gene expression data has the following properties: • High dimensionality: up to tens of thousands of genes • Very small data set size: fewer than 100 samples • Most genes are not related to cancer classification • Problems: • Noise • Dimensionality • Cost of classification algorithms

Introduction-3 • Only a very small fraction of genes are informative for a given task • Base the classification on only a subset of the genes • Reduce dimensionality: for convenience and shorter running time • Drop noisy or irrelevant genes: for accuracy • Gain biological insight

Introduction-4 • The feature ranking approach is the most commonly used for feature selection • Filter approaches remove irrelevant features according to general characteristics of the data. • (1) Use a filter to rank all the genes in the data. • (2) Choose the first n − 1 genes as the best feature subset • Advantages: simple, easy, better generalization • Problem: feature sets obtained this way are redundant, because genes in similar pathways probably all have very similar scores

Rough Sets Based Feature Selection-1 • Basic concepts of rough sets • Decision table: denoted by T = (U, C ∪ {d}), • where U is the universe of discourse, • C is the set of condition attributes, • {d} is the decision attribute. • Rows of the decision table correspond to objects, and columns correspond to attributes

Rough Sets Based Feature Selection-2 • Indiscernibility relation. Let a ∈ A and P ⊆ A. The binary relation IND(P), called the indiscernibility relation, is defined as: IND(P) = {(x, y) ∈ U × U : a(x) = a(y) for all a ∈ P}
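The indiscernibility relation groups objects that agree on every attribute in P, so it splits U into equivalence classes (written U/IND(P)). A minimal sketch of that partition follows; the `partition` helper and the toy table of discretised genes are hypothetical and used only for illustration.

```python
from collections import defaultdict

def partition(table, attrs):
    """Return U/IND(attrs): sets of objects that agree on every attribute in attrs."""
    classes = defaultdict(set)
    for obj, row in table.items():
        key = tuple(row[a] for a in attrs)  # identical keys mean indiscernible objects
        classes[key].add(obj)
    return list(classes.values())

# Hypothetical toy table: three discretised genes measured on four samples.
toy = {
    "x1": {"g1": 0, "g2": 1, "g3": 0},
    "x2": {"g1": 0, "g2": 1, "g3": 1},
    "x3": {"g1": 1, "g2": 0, "g3": 0},
    "x4": {"g1": 1, "g2": 0, "g3": 1},
}

print(partition(toy, ["g1"]))        # two classes: x1 with x2, x3 with x4
print(partition(toy, ["g1", "g3"]))  # every sample becomes its own class
```

Adding attributes to P can only split classes further, which is why a larger attribute set never discerns fewer objects.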

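Rough Sets Based Feature Selection-3 • Indispensable and dispensable attributes. An attribute c ∈ C is indispensable if POS_{C−{c}}(D) ≠ POS_C(D). An attribute c ∈ C is dispensable if POS_{C−{c}}(D) = POS_C(D). Here POS_P(D) is the positive region: the set of objects that IND(P) assigns unambiguously to a single decision class.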

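Rough Sets Based Feature Selection-4 • Reduct. A subset of attributes R ⊆ C is a reduct of C if it preserves the positive region, POS_R(D) = POS_C(D), and is minimal, POS_{R−{c}}(D) ≠ POS_R(D) for every c ∈ R.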

Rough Sets Based Feature Selection-5 • Core. The set of all indispensable features in C is CORE(C) = ∩ RED(C), where RED(C) is the set of all reducts of C with respect to D.
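The core can be found without enumerating reducts: an attribute belongs to the core exactly when removing it changes the positive region. The sketch below assumes the standard positive-region formulation of dispensability; the helper names and the toy decision table are hypothetical.

```python
from collections import defaultdict

def partition(table, attrs):
    """Return U/IND(attrs) as a list of sets of object names."""
    classes = defaultdict(set)
    for obj, row in table.items():
        classes[tuple(row[a] for a in attrs)].add(obj)
    return list(classes.values())

def positive_region(table, cond, dec):
    """Objects whose cond-class lies entirely inside one decision class."""
    dec_classes = partition(table, [dec])
    pos = set()
    for block in partition(table, cond):
        if any(block <= d for d in dec_classes):
            pos |= block
    return pos

def core(table, cond, dec):
    """Attributes whose removal shrinks the positive region (indispensable ones)."""
    full = positive_region(table, cond, dec)
    return [c for c in cond
            if positive_region(table, [a for a in cond if a != c], dec) != full]

# Hypothetical toy decision table; g1 alone separates the two classes.
toy = {
    "x1": {"g1": 0, "g2": 0, "g3": 1, "d": "ALL"},
    "x2": {"g1": 0, "g2": 1, "g3": 1, "d": "ALL"},
    "x3": {"g1": 1, "g2": 0, "g3": 0, "d": "AML"},
    "x4": {"g1": 1, "g2": 1, "g3": 1, "d": "AML"},
}

print(core(toy, ["g1", "g2", "g3"], "d"))
```

In this toy table, dropping g1 makes x2 and x4 indiscernible although they carry different labels, so g1 is indispensable, while g2 and g3 can each be dropped.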

Rough Sets Based Feature Selection-6 • Rough sets based feature selection • Optimal feature subset selection based on rough set theory can be viewed as finding a reduct R with the best classifying properties. R is then used instead of C in a rule discovery algorithm.

Rough Sets Based Gene Selection Method-1 • Our learning problem is to select highly discriminative genes for cancer classification from gene expression data • A gene expression data set can be formalized as a decision system T = (U, C ∪ {d}), • where the universe U = {x1, x2, …, xm} is a set of tumors, • the condition attribute set C = {g1, g2, …, gn} contains each gene, • the decision attribute d corresponds to the class label of each sample. • Each attribute gi ∈ C is represented by a vector gi = (x1,i, x2,i, …, xm,i), i = 1, 2, …, n, where xk,i is the expression level of gene i in sample k, k = 1, 2, …, m.

Rough Sets Based Gene Selection Method-2 • Two steps of our method • Step 1: Use a filter method to obtain a feature subset • The t-test is used as the filter • Assuming there are two classes of samples in the gene expression data set, a t-value is computed for each gene g from its expression levels in the two classes
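Step 1 can be sketched as a per-gene t-value followed by a ranking. The slide's own formula is not preserved, so the unequal-variance (Welch) form t = (μ1 − μ2) / sqrt(σ1²/n1 + σ2²/n2) used here is an assumption, as are the function names and the made-up expression values.

```python
from math import sqrt
from statistics import mean, variance

def t_value(class1, class2):
    """Welch-style t-value for one gene (assumed form, not confirmed by the slides)."""
    n1, n2 = len(class1), len(class2)
    spread = sqrt(variance(class1) / n1 + variance(class2) / n2)
    return (mean(class1) - mean(class2)) / spread

def rank_genes(expr, labels, top_n):
    """expr maps gene name -> expression levels, one per sample, aligned with labels."""
    first, second = sorted(set(labels))  # exactly two classes are assumed
    scores = {}
    for gene, levels in expr.items():
        a = [v for v, lab in zip(levels, labels) if lab == first]
        b = [v for v, lab in zip(levels, labels) if lab == second]
        scores[gene] = abs(t_value(a, b))  # rank by magnitude, ignoring the sign
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Made-up expression levels for three hypothetical genes on six samples.
labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
expr = {
    "geneA": [1, 2, 3, 10, 11, 12],
    "geneB": [5, 6, 7, 5.5, 6.5, 7.5],
    "geneC": [1, 5, 9, 2, 4, 9],
}

print(rank_genes(expr, labels, 2))
```

A gene with zero variance in both classes would cause a division by zero in this sketch; real data would need that case handled before ranking.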

Rough Sets Based Gene Selection Method-3 • Step 2: Use rough set attribute reduction to find a minimal reduct • Information entropy is used as the heuristic information • Given the partition by D, U/IND(D), of U, the entropy based on the partition by a ∈ C, U/IND(a), of U, is given by E(a) = −(1/|U|) Σ_{X ∈ U/IND(D)} Σ_{Y ∈ U/IND(a)} |X ∩ Y| log2(|X ∩ Y| / |Y|)
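One common way to use entropy as a heuristic is a greedy forward search: repeatedly add the attribute that lowers the conditional entropy of the decision most, and stop once the subset is as informative as the full attribute set. The sketch below assumes that strategy and the entropy E(P) = −(1/|U|) Σ_X Σ_Y |X ∩ Y| log2(|X ∩ Y| / |Y|), with X ranging over U/IND(D) and Y over U/IND(P). The slides do not spell out the search procedure, so this is an illustration rather than the authors' algorithm, and the toy table is hypothetical.

```python
from collections import defaultdict
from math import log2

def partition(table, attrs):
    """Return U/IND(attrs) as a list of sets of object names."""
    classes = defaultdict(set)
    for obj, row in table.items():
        classes[tuple(row[a] for a in attrs)].add(obj)
    return list(classes.values())

def entropy(table, attrs, dec):
    """Entropy of the decision partition given the partition induced by attrs."""
    total = 0.0
    dec_classes = partition(table, [dec])
    for block in partition(table, attrs):
        for d_class in dec_classes:
            k = len(block & d_class)
            if k:  # empty intersections contribute nothing
                total -= k * log2(k / len(block))
    return total / len(table)

def greedy_reduct(table, cond, dec):
    """Add the attribute giving the lowest entropy until the full-set entropy is reached."""
    target = entropy(table, cond, dec)
    chosen, remaining = [], list(cond)
    while remaining and entropy(table, chosen, dec) > target + 1e-12:
        best = min(remaining, key=lambda a: entropy(table, chosen + [a], dec))
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Hypothetical toy decision table; g1 alone separates the two classes.
toy = {
    "x1": {"g1": 0, "g2": 0, "g3": 1, "d": "ALL"},
    "x2": {"g1": 0, "g2": 1, "g3": 1, "d": "ALL"},
    "x3": {"g1": 1, "g2": 0, "g3": 0, "d": "AML"},
    "x4": {"g1": 1, "g2": 1, "g3": 1, "d": "AML"},
}

print(greedy_reduct(toy, ["g1", "g2", "g3"], "d"))
```

A greedy search of this kind returns a small attribute subset quickly but does not by itself guarantee a minimal reduct; a pruning pass over the chosen attributes would be needed for that.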

Experiment-1 • Data Set: The acute leukemia data of Golub et al. (1999) http://www.genome.wi.mit.edu/MPR • Consists of samples from two different types of acute leukemia, acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML). • The training data set has 38 bone marrow samples (27 ALL and 11 AML). Each sample has expression patterns of 7129 genes measured by the Affymetrix oligonucleotide microarray. • The test data set consists of 24 bone marrow and 10 peripheral blood samples (20 ALL and 14 AML).

Experiment-2 • Only one gene is selected in the reduct: X95735 • X95735 is also selected by many other methods • X95735 is the only gene identified by the J48 pruned tree and the emerging patterns algorithm • X95735 is also selected by the voting machine, SVM, Deb's NSGA-algorithm and Cho's work

Experiment-3 • Two rules are derived: • If the expression level of X95735 > 938, then the sample is classified as AML • If the expression level of X95735 < 938, then the sample is classified as ALL • 31 of 34 samples in the test data set are correctly classified
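The two rules amount to a single threshold test on one gene. A minimal sketch follows; the function name is hypothetical, the expression levels passed in are made up, and only the cut point 938 and the 31-of-34 figure come from the slides.

```python
THRESHOLD = 938  # cut point on the X95735 expression level, as stated in the rules

def classify(x95735_level):
    """Apply the two derived rules to one sample's X95735 expression level."""
    if x95735_level > THRESHOLD:
        return "AML"
    if x95735_level < THRESHOLD:
        return "ALL"
    return None  # the stated rules do not cover a level of exactly 938

# Made-up expression levels, for illustration only.
print(classify(1200))
print(classify(100))

# Test-set accuracy implied by the reported 31 correct out of 34 samples.
accuracy = 31 / 34
print(round(accuracy, 3))
```

The reported result corresponds to an accuracy of roughly 91% on the test set, obtained with a single gene.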

Comparison of Feature Selection and Classification Results

Conclusion • The expression level of X95735 plays an important role in distinguishing the two types of acute leukemia. The role of X95735 in discerning between the two types of acute leukemia samples has also been verified by biological researchers • The rough set based method can find informative genes for classification • The method still needs to be verified on more data sets

Thanks!
