A Gene Selection Method for Microarray Data based on Sampling Yungho Leu*, Chien-Pang Lee, Hui-Yi Tsai National Taiwan University of Science and Technology
Agenda • Introduction • Problem Def. • The proposed method • Experimental results • Conclusion Mobile Computing & Data Mining Lab.
Introduction to Microarray [Figure: reference sample and test sample]
Microarray data is represented as a matrix • Each sample is associated with a class: normal or abnormal (disease)
Problem Definition • We want to find a set of genes that can differentiate between normal and abnormal samples • The number of genes in the set should be minimal • The classification accuracy should be high • This problem is called gene selection
The proposed method • A three-stage approach • The first stage: initial gene reduction • The second stage: generate gene subsets by probability sampling • The third stage: select important genes
Stage 1: Initial gene reduction • Use a t-test to drop genes whose expression levels do not differ significantly between the two classes (normal C1 and abnormal C2) [Table: expression levels in class C1 and expression levels in class C2]
Forming gene groups • For two-class sample data • μ1 denotes the average expression level of a gene in C1 samples • μ2 denotes the average expression level of a gene in C2 samples • Group 1 contains genes with μ1 significantly greater than μ2 • Group 2 contains genes with μ1 significantly less than μ2 • Group 3 contains genes whose μ1 is not significantly different from μ2
Two-population t-test • H0: μ1 = μ2 • H1: μ1 ≠ μ2
Two-population t-test • For example • μ1 is the average of sample 1 to sample 5, while μ2 is the average of sample 6 to sample 10 • At significance level α = 0.05, the p-value for the test of gene 1 is 0.012 • Since 0.012 < 0.05, reject H0: μ1 = μ2 and accept H1: μ1 ≠ μ2 • Furthermore, since μ1 > μ2, gene 1 is classified into Group 1
Drop the useless group • Perform the same test on gene 2 through gene 10 • Group 3 is not useful for differentiating samples between C1 and C2, so it is dropped
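The grouping rule of Stage 1 can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: it assumes a pooled-variance (equal-variance) two-sample t-test, and it compares |t| against the two-sided critical value 2.306 (α = 0.05, 8 degrees of freedom for 5 + 5 samples) instead of computing a p-value, which gives the same accept/reject decision. The function names and the expression values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(c1, c2):
    """Two-sample t-statistic with pooled variance (equal variances assumed)."""
    n1, n2 = len(c1), len(c2)
    sp2 = ((n1 - 1) * variance(c1) + (n2 - 1) * variance(c2)) / (n1 + n2 - 2)
    return (mean(c1) - mean(c2)) / sqrt(sp2 * (1 / n1 + 1 / n2))


def assign_group(c1, c2, t_crit):
    """Return 1 if mu1 > mu2, 2 if mu1 < mu2, 3 if not significantly different."""
    t = pooled_t(c1, c2)
    if abs(t) <= t_crit:
        return 3  # fail to reject H0: mu1 = mu2
    return 1 if t > 0 else 2


# Two-sided critical value for alpha = 0.05 and df = 5 + 5 - 2 = 8.
T_CRIT = 2.306

# Hypothetical expression levels: (samples 1-5 in C1, samples 6-10 in C2).
genes = {
    "gene_a": ([9.1, 8.7, 9.4, 9.0, 8.8], [5.2, 5.9, 5.5, 6.1, 5.4]),
    "gene_b": ([3.1, 2.8, 3.3, 2.9, 3.0], [6.8, 7.2, 6.5, 7.0, 6.9]),
    "gene_c": ([4.0, 5.1, 4.6, 5.3, 4.4], [4.8, 4.2, 5.0, 4.5, 4.9]),
}

groups = {name: assign_group(c1, c2, T_CRIT) for name, (c1, c2) in genes.items()}
kept = [name for name, g in groups.items() if g != 3]  # Group 3 is dropped
```

Here `gene_c` has equal class means, so it falls into Group 3 and is removed, while the other two genes survive the reduction.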
Results of the gene reduction • Gene reduction by T-test:
Stage 2: Generate gene subsets by probability sampling • Sample genes from Group 1 and Group 2 according to each group's sampling probability • A group's probability represents its ability to differentiate samples between class C1 and C2 • Three steps in this stage • Step 1: calculate sampling probabilities • Step 2: generate gene subsets • Step 3: filter gene subsets
Step 1: Calculate sampling probabilities • Calculate the aggregate t-value of each group • Define sampling probabilities based on the aggregate t-values • An example
t-statistic of gene 1: • t-statistics of Group 1 and Group 2:
Sampling probabilities • Define sampling probabilities based on aggregate t-statistics
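The slides do not preserve the exact formulas, so the following is only one plausible reading of Step 1. It assumes that a group's aggregate t-value is the sum of the absolute t-statistics of its genes and that each sampling probability is that aggregate divided by the total over both groups. The function name and the t-statistics are hypothetical.

```python
def sampling_probabilities(t_group1, t_group2):
    """Return (P1, P2) from the per-gene t-statistics of Group 1 and Group 2.

    Assumption: aggregate t-value = sum of |t| over the genes in the group,
    and each probability is the group's share of the combined aggregate.
    """
    agg1 = sum(abs(t) for t in t_group1)
    agg2 = sum(abs(t) for t in t_group2)
    total = agg1 + agg2
    return agg1 / total, agg2 / total


# Hypothetical t-statistics; Group 2 genes have negative t because mu1 < mu2.
t_g1 = [3.0, 4.5, 2.5]
t_g2 = [-2.5, -3.5]

p1, p2 = sampling_probabilities(t_g1, t_g2)
```

Under this reading, a group whose genes separate the classes more strongly receives a larger share of the draws in the next step.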
Step 2: Generate gene subsets by sampling • Each gene subset is built by drawing genes without replacement from Group 1 (sampling probability P1 = 54.87%) and Group 2 (sampling probability P2 = 45.13%) • The draw is repeated K times, yielding gene subsets sg1, sg2, sg3, ..., sgk [Figure: genes drawn from Group 1 and Group 2 into gene subsets sg1 to sgk]
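A minimal sketch of this sampling step follows. The slides do not state the subset size or the exact drawing procedure, so the sketch assumes that each draw first picks a group (Group 1 with probability P1, otherwise Group 2) and then removes one gene uniformly at random from that group. The gene names, the subset size of 3, and the function names are hypothetical.

```python
import random


def draw_subset(group1, group2, p1, size, rng):
    """Draw `size` distinct genes; each draw picks Group 1 with probability p1."""
    pool1, pool2 = list(group1), list(group2)
    subset = []
    while len(subset) < size and (pool1 or pool2):
        use_g1 = rng.random() < p1
        # Fall back to the other group if the chosen one is exhausted.
        if use_g1 and not pool1:
            use_g1 = False
        elif not use_g1 and not pool2:
            use_g1 = True
        pool = pool1 if use_g1 else pool2
        # pop() removes the gene, so it cannot be drawn twice (no replacement).
        subset.append(pool.pop(rng.randrange(len(pool))))
    return subset


def generate_subsets(group1, group2, p1, size, k, seed=0):
    """Repeat the draw k times to obtain gene subsets sg1 ... sgk."""
    rng = random.Random(seed)
    return [draw_subset(group1, group2, p1, size, rng) for _ in range(k)]


# Hypothetical group membership; p1 is the value shown on the slide.
group1 = ["a1", "a2", "a3", "a4", "a5"]
group2 = ["b1", "b2"]
subsets = generate_subsets(group1, group2, p1=0.5487, size=3, k=1000)
```

Each subset starts from fresh copies of the two groups, so "without replacement" applies within a subset, not across the K repetitions.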
Filter useless gene subsets • Drop the gene subsets with low classification accuracy • Use a KNN classifier and keep a gene subset only if its accuracy ≥ threshold • Example: of 1000 generated gene subsets, only 378 are kept • SSi denotes a selected subset (SS1, ..., SSn), with n ≤ k
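The filtering step can be sketched with a small hand-written KNN classifier. The slides do not say how accuracy is estimated, so this sketch assumes leave-one-out evaluation with Euclidean distance; the value of k, the threshold, the data layout (rows are samples, columns are genes), and all names are hypothetical.

```python
from collections import Counter
from math import dist


def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k training points nearest to `query`."""
    nearest = sorted(range(len(train_x)), key=lambda i: dist(train_x[i], query))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]


def loo_accuracy(samples, labels, gene_idx, k=3):
    """Leave-one-out KNN accuracy using only the genes (columns) in gene_idx."""
    proj = [[row[g] for g in gene_idx] for row in samples]
    hits = 0
    for i in range(len(proj)):
        train_x = proj[:i] + proj[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        hits += knn_predict(train_x, train_y, proj[i], k) == labels[i]
    return hits / len(proj)


def filter_subsets(samples, labels, subsets, threshold, k=3):
    """Keep only the gene subsets whose accuracy reaches the threshold."""
    return [s for s in subsets if loo_accuracy(samples, labels, s, k) >= threshold]


# Hypothetical data: genes 0 and 1 separate the classes, gene 2 does not.
samples = [
    [1.0, 5.0, 3.0],
    [1.2, 5.1, 7.0],
    [0.9, 4.8, 5.0],
    [4.0, 1.0, 3.1],
    [4.2, 1.3, 6.9],
    [3.9, 0.9, 5.1],
]
labels = ["C1", "C1", "C1", "C2", "C2", "C2"]
candidate_subsets = [[0], [0, 1], [2]]

selected = filter_subsets(samples, labels, candidate_subsets, threshold=0.9)
```

In this toy data the subset containing only gene 2 is discarded, which mirrors how uninformative subsets are filtered out before the frequency count in Stage 3.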
Stage 3: Gene selection based on the χ²-test • Important genes will appear in most of the gene subsets • Sort the genes according to their occurrence frequencies in the gene subsets; select the top-ranked genes as the final gene subset • Use the χ²-test to determine the cut-off threshold
Final gene selection based on the χ²-test for homogeneity • An example: the genes in the 378 remaining gene subsets SS1, SS2, ..., SSn are ordered by occurrence frequency
χ²-test for homogeneity • 2×2 contingency table • An example for gene 7 and gene 8
Step 2 (continued) • Significance level α = 0.05, critical χ² value = 3.84 • The last pair of adjacent genes with significantly different frequencies is gene 8 and gene 2 • Gene 7 and gene 8, which rank above this cut-off, are selected as the final subset of genes
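The cut-off rule can be sketched as follows. The contingency table itself is not preserved in the slides, so this sketch assumes a 2×2 layout with one row per gene and the columns "appears in a subset" and "does not appear", and it uses the standard χ² statistic for a 2×2 table without continuity correction. Gene names, counts, and function names are hypothetical.

```python
from collections import Counter

CHI2_CRIT = 3.84  # critical value for alpha = 0.05 with 1 degree of freedom


def chi2_2x2(count_a, count_b, n):
    """Chi-square statistic comparing how often two genes occur in n subsets."""
    a, b = count_a, n - count_a  # gene A: present / absent
    c, d = count_b, n - count_b  # gene B: present / absent
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return (a + b + c + d) * (a * d - b * c) ** 2 / denom


def select_genes(subsets):
    """Rank genes by frequency and cut after the last significant adjacent gap."""
    n = len(subsets)
    freq = Counter(g for s in subsets for g in set(s))
    ranked = sorted(freq, key=lambda g: (-freq[g], g))
    cut = 0  # assumption: nothing is selected if no adjacent pair differs
    for i in range(len(ranked) - 1):
        if chi2_2x2(freq[ranked[i]], freq[ranked[i + 1]], n) > CHI2_CRIT:
            cut = i + 1  # keep every gene ranked above this gap
    return ranked[:cut]


# Hypothetical occurrence counts in 100 retained gene subsets.
counts = {"g1": 90, "g2": 85, "g3": 40, "g4": 38, "g5": 35}
subsets = [[g for g, c in counts.items() if i < c] for i in range(100)]

final_genes = select_genes(subsets)
```

With these counts only the gap between `g2` and `g3` is significant, so the two most frequent genes form the final set, analogous to gene 7 and gene 8 in the slide's example.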
Results and comparisons • For the leukemia, colon cancer, and lymphoma data sets
Comparisons • Legend: * classification accuracy; [ ] number of selected genes; ( ) year of publication
Conclusion • Proposed a sampling-based gene selection method • Uses simple t-statistics, the χ²-test, and probability sampling • Achieves high classification accuracy with fewer selected genes