A Gene Selection Method for Microarray Data based on Sampling

Yungho Leu*, Chien-Pang Lee, Hui-Yi Tsai • National Taiwan University of Science and Technology

Agenda • Introduction • Problem definition • The proposed method • Experimental results • Conclusion Mobile Computing & Data Mining Lab.

An Introduction to Microarray • [Figure: reference sample and test sample]

Microarray data is represented as a matrix • Each sample is associated with a class: normal or abnormal (disease)

Problem Definition • We want to find a set of genes that can differentiate between normal and abnormal samples • The number of genes in the set should be minimal • The classification accuracy should be high • This problem is called gene selection

The proposed method • A three-stage approach • The first stage: initial gene reduction • The second stage: generate gene subsets by probability sampling • The third stage: select important genes

Stage 1: Initial gene reduction • Use a t-test to drop genes whose expression levels do not differ significantly between the two classes (normal C1 and abnormal C2) • [Figure: expression levels in class C1 and in class C2]

Forming gene groups • For two-class sample data: • μ1 denotes the average expression level of a gene in C1 samples • μ2 denotes the average expression level of a gene in C2 samples • Group 1 contains genes with μ1 significantly greater than μ2 • Group 2 contains genes with μ1 significantly less than μ2 • Group 3 contains genes whose μ1 is not significantly different from μ2

Two-population t-test • H0: μ1 = μ2 • H1: μ1 ≠ μ2

Two-population t-test • For example, μ1 is the average of samples 1 to 5, while μ2 is the average of samples 6 to 10 • At significance level α = 0.05, the p-value for the test of gene 1 is 0.012 • Since 0.012 < 0.05, reject H0: μ1 = μ2 and accept H1: μ1 ≠ μ2 • Furthermore, since μ1 > μ2, gene 1 is classified into Group 1

Drop the useless group • Perform the same test on gene 2 through gene 10 • Group 3 is dropped because it is not useful for differentiating samples between C1 and C2
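The grouping rule of Stage 1 can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes a pooled-variance two-sample t-statistic (the slides do not say which variant is used), compares |t| against a caller-supplied critical value instead of computing a p-value, and uses made-up expression levels and gene names.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(c1, c2):
    """Two-sample t-statistic with pooled variance (an assumed variant)."""
    n1, n2 = len(c1), len(c2)
    sp2 = ((n1 - 1) * variance(c1) + (n2 - 1) * variance(c2)) / (n1 + n2 - 2)
    return (mean(c1) - mean(c2)) / sqrt(sp2 * (1 / n1 + 1 / n2))


def assign_group(c1, c2, t_crit):
    """Return 1 (mu1 > mu2), 2 (mu1 < mu2), or 3 (no significant difference)."""
    t = pooled_t(c1, c2)
    if abs(t) <= t_crit:
        return 3
    return 1 if t > 0 else 2


# Two-sided critical value for alpha = 0.05 with 5 + 5 - 2 = 8 degrees of freedom.
T_CRIT = 2.306

# Hypothetical data: the first list holds C1 samples, the second C2 samples.
genes = {
    "gene_a": ([9.0, 10.0, 11.0, 10.0, 10.0], [5.0, 6.0, 5.0, 4.0, 5.0]),
    "gene_b": ([2.0, 3.0, 2.0, 3.0, 2.0], [8.0, 9.0, 8.0, 7.0, 8.0]),
    "gene_c": ([5.0, 6.0, 4.0, 5.0, 5.0], [5.0, 4.0, 6.0, 5.0, 5.0]),
}
groups = {name: assign_group(c1, c2, T_CRIT) for name, (c1, c2) in genes.items()}
kept = [name for name, g in groups.items() if g != 3]  # Group 3 is dropped
```

Here `gene_c` has equal class means, so it lands in Group 3 and is removed, while the other two genes are kept for Stage 2.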

Results of the gene reduction • Gene reduction by t-test: [results table not preserved]

Stage 2: Generate gene subsets by probability sampling • Sample genes from Group 1 and Group 2 according to the sampling probabilities of the two groups • A group's probability represents its ability to differentiate samples between classes C1 and C2 • Three steps in this stage: • Step 1: calculate sampling probabilities • Step 2: generate gene subsets • Step 3: filter gene subsets

Step 1: Calculate sampling probabilities • Calculate the aggregate t-value of each group • Define sampling probabilities based on the aggregate t-values • An example

t-statistic of gene 1: [equation not preserved] • t-statistics of Group 1 and Group 2: [equation not preserved]

Sampling probabilities • Define sampling probabilities based on aggregate t-statistics
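A possible reading of Step 1 is sketched below. The exact formulas are not preserved in the slides, so this assumes that a group's aggregate t-value is the sum of the absolute t-statistics of its genes and that each group's sampling probability is its share of the combined total. The function name and the input values are hypothetical.

```python
def sampling_probabilities(t_group1, t_group2):
    """Return (p1, p2), each group's share of the total absolute t-value.

    Assumption: the aggregate t-value of a group is the sum of |t| over its genes.
    """
    agg1 = sum(abs(t) for t in t_group1)
    agg2 = sum(abs(t) for t in t_group2)
    total = agg1 + agg2
    return agg1 / total, agg2 / total


# Hypothetical t-statistics: positive in Group 1 (mu1 > mu2), negative in Group 2.
p1, p2 = sampling_probabilities([4.0, 3.0, 2.0], [-3.5, -2.5])
```

With these made-up values the aggregates are 9 and 6, so Group 1 is sampled with probability 0.6 and Group 2 with probability 0.4.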

Step 2: Generate gene subsets by sampling • [Figure: genes are drawn without replacement from Group 1 (P1: 54.87%) and Group 2 (P2: 45.13%) to form a gene subset; the draw is repeated K times to produce gene subsets sg1, sg2, sg3, ..., sgk]
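The sampling step might look like the sketch below. It assumes that each draw first picks a group (Group 1 with probability P1, otherwise Group 2) and then removes one gene uniformly at random from that group. The subset size, the group memberships, and the function names are hypothetical; only P1 = 54.87% and the 1000 repetitions come from the slides.

```python
import random


def draw_subset(group1, group2, p1, size, rng):
    """Draw `size` distinct genes; each draw picks Group 1 with probability p1."""
    pool1, pool2 = list(group1), list(group2)
    subset = []
    while len(subset) < size and (pool1 or pool2):
        # Fall back to the other group once one pool is exhausted.
        use_first = bool(pool1) and (not pool2 or rng.random() < p1)
        pool = pool1 if use_first else pool2
        subset.append(pool.pop(rng.randrange(len(pool))))  # without replacement
    return subset


def generate_subsets(group1, group2, p1, size, k, seed=0):
    """Repeat the draw k times with a seeded generator for reproducibility."""
    rng = random.Random(seed)
    return [draw_subset(group1, group2, p1, size, rng) for _ in range(k)]


# Hypothetical group memberships and subset size.
group1 = ["gene1", "gene5", "gene6"]
group2 = ["gene2", "gene7", "gene8", "gene9"]
subsets = generate_subsets(group1, group2, p1=0.5487, size=3, k=1000)
```

Each subset contains distinct genes because a drawn gene is popped from its pool before the next draw.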

Step 3: Filter useless gene subsets • Drop the gene subsets with low classification accuracy • Use a KNN classifier to measure the accuracy of each subset • Keep only the subsets with accuracy ≥ threshold: sg1, sg2, sg3, ..., sgk are reduced to SS1, SS2, ..., SSn, where SSi denotes a selected subset and n ≤ k • Example: 1000 gene subsets are generated and only 378 are kept
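The filtering step can be illustrated with a small KNN implementation. The slides do not state how accuracy is estimated, so this sketch assumes leave-one-out validation with Euclidean distance; the data, the threshold, the value of k, and all function names are hypothetical.

```python
from collections import Counter
from math import dist


def knn_predict(train_x, train_y, x, k):
    """Majority label among the k training points nearest to x."""
    nearest = sorted(range(len(train_x)), key=lambda i: dist(train_x[i], x))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]


def loo_accuracy(data, labels, genes, k=3):
    """Leave-one-out accuracy of KNN using only the genes in `genes`."""
    rows = [[sample[g] for g in genes] for sample in data]
    hits = 0
    for i, x in enumerate(rows):
        train_x = rows[:i] + rows[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        hits += knn_predict(train_x, train_y, x, k) == labels[i]
    return hits / len(rows)


def filter_subsets(data, labels, subsets, threshold, k=3):
    """Keep the subsets whose accuracy reaches the threshold."""
    return [s for s in subsets if loo_accuracy(data, labels, s, k) >= threshold]


# Hypothetical data: each sample maps a gene name to its expression level.
data = [
    {"g1": 9.0, "g2": 1.0, "g3": 5.0},
    {"g1": 8.5, "g2": 1.5, "g3": 2.0},
    {"g1": 9.5, "g2": 0.5, "g3": 8.0},
    {"g1": 2.0, "g2": 7.0, "g3": 5.5},
    {"g1": 1.5, "g2": 7.5, "g3": 2.5},
    {"g1": 2.5, "g2": 6.5, "g3": 7.5},
]
labels = ["C1", "C1", "C1", "C2", "C2", "C2"]
selected = filter_subsets(data, labels, [["g1", "g2"], ["g3"]], threshold=0.9, k=1)
```

In this toy data `g1` and `g2` separate the classes cleanly, while `g3` does not, so only the first subset survives.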

Stage 3: Gene selection based on the χ²-test • Important genes will appear in most of the gene subsets • Sort the genes according to their occurrence frequencies in the gene subsets; select the top-ranked genes as the final gene subset • Use the χ²-test to determine the cut-off threshold

Final gene selection based on the χ²-test for homogeneity • An example: [Figure: genes ordered by occurrence frequency across SS1, SS2, ..., SSn, the 378 remaining gene subsets]
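Counting and ordering genes by occurrence frequency could be sketched as follows. The subsets shown are hypothetical, and breaking ties by gene name is an assumption made only so that the ordering is deterministic.

```python
from collections import Counter


def rank_genes(selected_subsets):
    """Return (gene, count) pairs sorted by decreasing occurrence frequency."""
    # set(subset) ensures that a gene is counted once per subset.
    counts = Counter(g for subset in selected_subsets for g in set(subset))
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical remaining subsets SS1..SS4.
ranking = rank_genes([
    ["gene7", "gene8"],
    ["gene7", "gene2"],
    ["gene7", "gene8", "gene2"],
    ["gene8", "gene7"],
])
```

The resulting ordered list is the input to the cut-off test that decides how many top-ranked genes to keep.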

χ²-test for homogeneity • 2×2 contingency table • An example for gene 7 and gene 8

Step 2 (continued) • Significance level α = 0.05, critical χ² value = 3.84 • The last pair of genes with significantly different frequencies is gene 8 and gene 2, so the cut-off falls between them • Gene 7 and gene 8 are selected as the final subset of genes
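The cut-off rule can be sketched as below. The contingency table from the slides is not preserved, so this assumes a 2×2 table whose rows are two adjacent genes in the ranking and whose columns count the subsets that contain or do not contain each gene, with no continuity correction. The occurrence counts are hypothetical; the critical value 3.84 and the total of 378 subsets come from the slides.

```python
def chi_square_2x2(count_a, count_b, n):
    """Chi-square statistic comparing two occurrence counts out of n subsets each."""
    a, b = count_a, n - count_a  # gene A: present, absent
    c, d = count_b, n - count_b  # gene B: present, absent
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # both genes always present or always absent
    return (a + b + c + d) * (a * d - b * c) ** 2 / denom


def select_top_genes(ranking, n, critical=3.84):
    """Keep the genes ranked above the last significantly different adjacent pair.

    Returns an empty list when no adjacent pair differs significantly.
    """
    cut = 0
    for i in range(len(ranking) - 1):
        if chi_square_2x2(ranking[i][1], ranking[i + 1][1], n) > critical:
            cut = i + 1
    return [gene for gene, _ in ranking[:cut]]


# Hypothetical occurrence counts out of 378 remaining subsets, already sorted.
ranking = [("gene7", 350), ("gene8", 340), ("gene2", 150), ("gene5", 140), ("gene9", 130)]
final = select_top_genes(ranking, n=378)
```

With these made-up counts only the gene 8 and gene 2 pair exceeds the critical value, so the sketch keeps gene 7 and gene 8, mirroring the example on the slide.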

Results and comparisons • For the leukemia, colon cancer, and lymphoma data sets: [results table not preserved]

Comparisons • Legend: * classification accuracy • [ ] number of selected genes • ( ) year of publication

Conclusion • Proposed a sampling-based gene selection method • Uses simple t-statistics, the χ²-test, and probability sampling • Achieves high classification accuracy with fewer selected genes
