Genetic algorithms applied to multi-class prediction for the analysis of gene expression data
C.H. Ooi & Patrick Tan
Presentation by Tim Hamilton
“Genechips”
• DNA microarrays: a collection of microscopic DNA spots, each representing a single gene.
• Commonly used to monitor the expression levels of thousands of genes at once.
Classification
• Gene expression data is commonly used to classify biological samples, for example by:
  - tumor subtype
  - response to certain types of treatment (e.g. chemotherapy)
• Most approaches focus on classifying two, or at most three, classes and have high error rates (19%) when run on sets containing multiple classes.
• The authors propose using a genetic algorithm (GA) to analyze multi-class expression data.
Previous rank-based approaches show reduced performance because:
  1) they miss correlations between genes;
  2) the predictor set size must be specified.
Data sets used for the GA:
• NCI60: expression profiles of 64 cancer cell lines containing 9703 cDNA sequences.
• GCM: expression profiles for 198 tumor samples, 90 normal samples, and 20 unknowns containing 16063 genes.
• Both data sets were pre-processed to generate a truncated 1000-gene data set. Spot values were normalized as (color ratio of a single spot - color ratio of all spots) / standard deviation, and the genes with the highest standard deviation were kept.
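
The truncation step above (keeping the genes with the highest standard deviation) can be sketched as follows. This is only an illustration, not the authors' code: the function name `top_std_genes`, the rows-are-samples layout, and the toy numbers are assumptions, and the data is assumed to be normalized already.

```python
from statistics import pstdev

def top_std_genes(expr, n_keep=1000):
    """Keep the n_keep genes (columns) with the highest standard deviation.

    expr is a list of samples, each a list of per-gene values
    (assumed layout: rows = samples, columns = genes).
    """
    n_genes = len(expr[0])
    # Standard deviation of each gene across all samples.
    gene_std = [pstdev([row[g] for row in expr]) for g in range(n_genes)]
    # Gene indices ordered from most to least variable.
    ranked = sorted(range(n_genes), key=lambda g: gene_std[g], reverse=True)
    keep = ranked[:n_keep]
    truncated = [[row[g] for g in keep] for row in expr]
    return keep, truncated

# Toy data: 3 samples x 4 genes (made-up values).
toy = [
    [0.1, 2.0, -0.3, 0.0],
    [0.2, -1.5, 0.4, 0.0],
    [0.1, 0.5, -0.2, 0.1],
]
keep, truncated = top_std_genes(toy, n_keep=2)
print(keep)  # indices of the two most variable genes
```

With the real data sets, `n_keep=1000` would give the truncated 1000-gene set described on the slide.
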
Choosing a GA chromosome
• Determine a minimum and maximum number of genes for selection: [Rmin, Rmax].
• Chromosome string: [R g1 g2 … gRmax]
  - R is the size of the predictive set
  - any genes past length R are ignored
  - genes are chosen from the list of 1000
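
A minimal sketch of the chromosome layout described above, assuming genes are referred to by their index in the 1000-gene list. The helper names `random_chromosome` and `predictor_set` are hypothetical, and drawing the genes without repetition is an assumption; the slides do not say whether duplicates are allowed.

```python
import random

def random_chromosome(r_min, r_max, n_genes=1000, rng=random):
    """Build [R, g1, ..., gRmax] with R drawn from [r_min, r_max]."""
    r = rng.randint(r_min, r_max)               # size of the predictive set
    genes = rng.sample(range(n_genes), r_max)   # assumption: distinct genes
    return [r] + genes

def predictor_set(chromosome):
    """Return the first R genes; anything past length R is ignored."""
    r = chromosome[0]
    return chromosome[1:1 + r]

chrom = random_chromosome(5, 10, rng=random.Random(1))
print(len(chrom))                               # 1 + Rmax = 11 slots
print(predictor_set([3, 7, 42, 99, 5, 8]))      # [7, 42, 99]
```

Keeping the string at a fixed length of Rmax + 1 means crossover can treat all chromosomes alike, while R alone decides how many genes actually take part in classification.
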
Parameters
• Population size: 100
• Generations: 100
Other parameters were varied:
• Crossover method: one-point or uniform
• Selection method: stochastic universal sampling (SUS) or roulette wheel selection (RWS)
• Probability of crossover: 0.7 – 1.0
• Probability of mutation: 0.0005 – 0.01
• Predictor set size range [Rmin, Rmax]: [5, 10], [11, 15], [16, 20], [21, 25], [26, 30]
• For each predictor set size range, this produced 96 different runs.
• Runs were made on both the truncated set and the full data set for comparison.
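
The two selection methods named above differ in how they spin the fitness-proportional wheel. The sketch below shows the textbook versions of both, not necessarily the exact variants used in the paper; the function names and the toy fitness values are assumptions.

```python
import random
from bisect import bisect_left
from itertools import accumulate

def _locate(cumulative, point):
    """Index of the individual whose wheel segment contains point."""
    return min(bisect_left(cumulative, point), len(cumulative) - 1)

def roulette_wheel_selection(fitness, n, rng=random):
    """RWS: n independent spins, each proportional to fitness."""
    cumulative = list(accumulate(fitness))
    total = cumulative[-1]
    return [_locate(cumulative, rng.uniform(0, total)) for _ in range(n)]

def stochastic_universal_sampling(fitness, n, rng=random):
    """SUS: a single spin with n equally spaced pointers."""
    cumulative = list(accumulate(fitness))
    step = cumulative[-1] / n
    start = rng.uniform(0, step)
    return [_locate(cumulative, start + i * step) for i in range(n)]

fitness = [3.0, 1.0]   # toy values: individual 0 is three times as fit
sus_picks = stochastic_universal_sampling(fitness, 4, rng=random.Random(0))
rws_picks = roulette_wheel_selection(fitness, 4, rng=random.Random(0))
print(sus_picks, rws_picks)
```

Because SUS places all pointers from one spin, each individual is picked close to its expected number of times, whereas RWS can deviate from that by chance. With the population size from the slide, `n` would be 100.
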
Each generation of chromosomes is used to classify the data sets with a maximum likelihood (MLHD) method.
• Fitness = 200 - (E1 + E2)
• E1 = cross-validation error rate
• E2 = independent test error rate
• The MLHD classifier involves a lot of math, but it is based on Bayes' rule.
• Two previous rank-based methods were run on the same truncated data set for comparison.
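
As a worked example of the fitness formula above, the sketch below assumes E1 and E2 are expressed in percent, so that a perfect predictor scores 200. The helper names and the toy labels are made up for illustration, and the classifier itself is left out.

```python
def error_rate_pct(predicted, actual):
    """Percentage of samples whose predicted class differs from the true class."""
    wrong = sum(1 for p, a in zip(predicted, actual) if p != a)
    return 100 * wrong / len(actual)

def fitness(e1_pct, e2_pct):
    """Fitness = 200 - (E1 + E2), with both error rates in percent (assumption)."""
    return 200 - (e1_pct + e2_pct)

# Toy labels: 2 of 10 cross-validation samples and 1 of 5 test samples are wrong.
cv_true = ["a", "a", "b", "b", "c", "c", "a", "b", "c", "a"]
cv_pred = ["a", "b", "b", "b", "c", "a", "a", "b", "c", "a"]
test_true = ["a", "b", "c", "a", "b"]
test_pred = ["a", "b", "c", "a", "c"]

e1 = error_rate_pct(cv_pred, cv_true)      # 20.0
e2 = error_rate_pct(test_pred, test_true)  # 20.0
score = fitness(e1, e2)
print(score)                               # 160.0
```
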
Results
• Uniform crossover produced the best predictors in size ranges [11, 15] and [16, 20].
• One-point crossover was best in ranges [5, 10], [21, 25] and [26, 30].
• Predictive accuracy was higher when run against the truncated data set.
Finally, the GA was compared to another method using SVM classification.
• The SVM performed best when all 16063 genes of the data set were used: 22% error.
• The GA used only 32 elements: 18% error.