Selecting Informative Genes with Parallel Genetic Algorithms • Deodatta Bhoite • Prashant Jain
Terminology • Genes • DNA, mRNA • Gene expression • Microarrays
Gene Selection • A large number of irrelevant genes introduces “biological noise” • Analysis of results can be simplified by selecting only the relevant genes for study • Two categories of gene selection • Filter approach selection • Wrapper approach selection
Classifier • What is a classifier used for? • Mapping of label pairs <xi, li> to {0,1,?} • Golub-Slonim classifier • Positive value = class 1, negative value = class 2
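A minimal sketch of a Golub-Slonim style weighted-voting classifier may help make the sign rule concrete. It assumes the commonly cited form in which each gene votes with weight (mu1 - mu2) / (sigma1 + sigma2) around the midpoint of the two class means. The function names, the use of the sample standard deviation, and returning None on a tie (one possible reading of the "?" output) are assumptions for illustration, not details taken from the slides.

```python
from statistics import mean, stdev

def train(samples, labels):
    """Compute a (weight, boundary) pair per gene from labeled samples.

    samples: list of expression vectors; labels: 1 or 2 for each sample.
    Assumes each class has at least two samples and nonzero spread.
    """
    params = []
    for g in range(len(samples[0])):
        c1 = [s[g] for s, l in zip(samples, labels) if l == 1]
        c2 = [s[g] for s, l in zip(samples, labels) if l == 2]
        weight = (mean(c1) - mean(c2)) / (stdev(c1) + stdev(c2))
        boundary = (mean(c1) + mean(c2)) / 2
        params.append((weight, boundary))
    return params

def classify(params, x):
    """Sum the per-gene votes; positive total = class 1, negative = class 2."""
    total = sum(w * (xg - b) for (w, b), xg in zip(params, x))
    if total > 0:
        return 1
    if total < 0:
        return 2
    return None  # tie: no decision (assumed meaning of "?")

# Toy data: gene 0 is high in class 1, gene 1 is high in class 2.
samples = [[5.0, 1.0], [6.0, 2.0], [7.0, 1.5],
           [1.0, 5.0], [2.0, 6.0], [1.5, 7.0]]
labels = [1, 1, 1, 2, 2, 2]
params = train(samples, labels)
```

With this toy data, `classify(params, [6.5, 1.0])` yields class 1 and `classify(params, [1.0, 6.5])` yields class 2.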
Ranking based gene selection methods • GS-correlation • Genes with most positive and negative correlation values are selected. • Tends to not select genes for which class values have large standard deviations with respect to training data (some of them may be most relevant and informative).
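The ranking step can be sketched as follows, assuming the GS-correlation of a gene is (mu1 - mu2) / (sigma1 + sigma2) computed over the training samples. The gene names and data are made up for illustration, and `select_genes` is a hypothetical helper. The toy data also shows the drawback noted above: a gene with a large mean difference but a large standard deviation scores lower than a gene with a small but consistent difference.

```python
from statistics import mean, stdev

def gs_correlation(class1_values, class2_values):
    """(mu1 - mu2) / (sigma1 + sigma2) for one gene (sample std. dev. assumed)."""
    return (mean(class1_values) - mean(class2_values)) / (
        stdev(class1_values) + stdev(class2_values))

def select_genes(expr, labels, k):
    """Return the k most positively and k most negatively correlated genes.

    expr: dict mapping gene name -> expression value per sample.
    """
    scores = {}
    for gene, values in expr.items():
        c1 = [v for v, l in zip(values, labels) if l == 1]
        c2 = [v for v, l in zip(values, labels) if l == 2]
        scores[gene] = gs_correlation(c1, c2)
    ranked = sorted(scores, key=scores.get, reverse=True)
    positive = ranked[:k]
    negative = ranked[-k:][::-1]  # most negative first
    return positive, negative, scores

labels = [1, 1, 1, 2, 2, 2]
expr = {
    "tight_up":   [5.0, 5.1, 4.9, 4.0, 4.1, 3.9],    # small, consistent difference
    "noisy_up":   [10.0, 2.0, 12.0, 1.0, 0.0, 2.0],  # large difference, large spread
    "tight_down": [1.0, 1.1, 0.9, 2.0, 2.1, 1.9],
    "flat":       [3.0, 3.2, 2.8, 3.1, 2.9, 3.0],
}
positive, negative, scores = select_genes(expr, labels, k=1)
```

Here `tight_up` is selected ahead of `noisy_up`, even though `noisy_up` has the larger difference between class means.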
Ranking with disorder • This method doesn’t use the actual expression levels. • Ng_I represents the set of indices that belong to class I and h(x) is the indicator function.
Need for subset ranking • Individual ranking methods may not always select the informative genes. • They ignore the relationships between genes by relying solely on individual scores. • Thus we need to explore subsets of genes to find the optimal subset for classification.
Genetic Algorithm • What is a genetic algorithm? • “Genetic Algorithms are defined as global optimization procedures that use an analogy of genetic evolution of biological organisms.” • In practice, a genetic algorithm searches for a good solution by evolving a population of candidate solutions through selection, crossover, and mutation.
Parallel Genetic Algorithm • For large population sizes, running a G.A. on a single processor can become computationally infeasible. • Hence the use of Parallel Genetic Algorithms, which split the work across processors.
Model and Encoding • Island Model: Each processor runs a G.A. on a subset of the population, with periodic migration of individuals between processors. • Fixed-Length Binary String Encoding: Each bit corresponds to one gene; the value is 1 if the gene is included in the subset, else 0.
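The encoding and one migration step can be sketched as below. The slides only say that migration is periodic, so the ring topology and the "best individual replaces the worst on the neighboring island" policy are assumptions chosen for illustration, as are all function names.

```python
def encode(selected, n_genes):
    """Fixed-length bit string: bit g is 1 if gene index g is in the subset."""
    return [1 if g in selected else 0 for g in range(n_genes)]

def decode(bits):
    """Recover the list of selected gene indices from a bit string."""
    return [g for g, bit in enumerate(bits) if bit == 1]

def migrate(islands, fitness):
    """One migration step over a ring of islands (assumed topology).

    Each island sends a copy of its best individual to the next island,
    where it overwrites that island's worst individual (assumed policy).
    """
    bests = [max(pop, key=fitness) for pop in islands]
    for i, best in enumerate(bests):
        target = islands[(i + 1) % len(islands)]
        worst = min(range(len(target)), key=lambda j: fitness(target[j]))
        target[worst] = list(best)

bits = encode({0, 3}, n_genes=5)   # genes 0 and 3 selected out of 5
islands = [
    [[1, 1, 1], [0, 0, 0]],        # subpopulation on processor 0
    [[1, 0, 0], [0, 1, 0]],        # subpopulation on processor 1
]
migrate(islands, fitness=sum)      # toy fitness: number of 1 bits
```

In a real run each island would evolve independently for some number of generations between migration steps.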
Fitness Evaluation • Two Different Criteria • Classification Accuracy • Size of the subset • fitness(x) = w1 * accuracy(x) + w2 * (1 - dimensionality(x)) • Here, • accuracy(x) = test accuracy of the classifier built with the gene subset represented by x • dimensionality(x) ∈ [0,1] = the dimension of the subset, scaled to [0,1]
Fitness Evaluation • w1 = weight assigned to accuracy • w2 = weight assigned to dimensionality • A subset with high classification accuracy and low dimension has high fitness.
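The fitness formula translates directly into code. In this sketch, dimensionality(x) is taken to be the fraction of genes selected, which is one way to keep it in [0,1]. The weights 0.75 and 0.25 are placeholders, since the slides do not state the values used, and `accuracy` is a hypothetical callable standing in for training and testing a classifier on the subset.

```python
def fitness(bits, accuracy, w1=0.75, w2=0.25):
    """fitness(x) = w1 * accuracy(x) + w2 * (1 - dimensionality(x)).

    bits: binary string encoding a gene subset.
    accuracy: callable returning the classifier accuracy for the subset (stub).
    w1, w2: placeholder weights, not values from the slides.
    """
    dimensionality = sum(bits) / len(bits)  # assumed: fraction of genes selected
    return w1 * accuracy(bits) + w2 * (1 - dimensionality)

# Two subsets with the same (stubbed) accuracy: the smaller one scores higher.
same_accuracy = lambda bits: 0.9
small = fitness([1, 0, 0, 0], same_accuracy)
large = fitness([1, 1, 1, 1], same_accuracy)
```

Raising w2 relative to w1 pushes the search toward smaller subsets at the cost of accuracy.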
Test Parameters • The tests were run on two processors. • The parameters of the G.A. in each processor were set as follows: • Population Size: 1000 • Trials: 400000 • Crossover probability: 0.6 • Mutation probability: 0.001
Test Parameters • Selection Strategy: Elitist • Migration Probability: 0.002 • The crossover probability is set at a moderate level to produce varied subpopulations that keep the good traits of the parents. • The mutation probability is kept low to avoid randomness in selection. • The elitist selection strategy ensures that the best individuals are kept, and hence leads to more accurate subsets of genes.
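One generation with these settings might look like the sketch below. The crossover and mutation probabilities come from the slides. One-point crossover, per-bit mutation, binary tournament selection of parents, and keeping exactly one elite individual are assumptions, since the slides only name the strategy as elitist. The population is shrunk to 20 individuals of 10 bits so the example runs quickly.

```python
import random

CROSSOVER_PROB = 0.6    # from the slides
MUTATION_PROB = 0.001   # from the slides; applied per bit here (assumption)

def crossover(a, b, rng):
    """One-point crossover with probability CROSSOVER_PROB, else copy parents."""
    if len(a) > 1 and rng.random() < CROSSOVER_PROB:
        point = rng.randrange(1, len(a))
        return a[:point] + b[point:], b[:point] + a[point:]
    return a[:], b[:]

def mutate(bits, rng):
    """Flip each bit independently with probability MUTATION_PROB."""
    return [1 - bit if rng.random() < MUTATION_PROB else bit for bit in bits]

def tournament(population, fitness, rng):
    """Pick the fitter of two random individuals (assumed parent selection)."""
    a, b = rng.sample(population, 2)
    return a if fitness(a) >= fitness(b) else b

def next_generation(population, fitness, rng):
    """Elitist step: the best individual survives unchanged."""
    children = [max(population, key=fitness)[:]]
    while len(children) < len(population):
        c1, c2 = crossover(tournament(population, fitness, rng),
                           tournament(population, fitness, rng), rng)
        children.append(mutate(c1, rng))
        if len(children) < len(population):
            children.append(mutate(c2, rng))
    return children

rng = random.Random(0)
population = [[rng.randint(0, 1) for _ in range(10)] for _ in range(20)]
best_before = max(map(sum, population))       # toy fitness: number of 1 bits
for _ in range(30):
    population = next_generation(population, sum, rng)
best_after = max(map(sum, population))
```

Because the elite individual is copied without crossover or mutation, the best fitness in the population can never decrease from one generation to the next.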
Results • Leukemia Data Set • Subset with 29 Genes found • Classifies 36/38 training instances correctly • Classifies 30/34 test instances correctly • Colon Data Set • Subset with 30 genes found • 92% accuracy on the training data set
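For comparison with the percentage figures quoted elsewhere in these slides, the leukemia counts above can be converted to accuracies. This is plain arithmetic on the numbers stated in the slide; the dictionary keys are just labels.

```python
# (correct, total) counts taken from the leukemia results above
results = {"leukemia training": (36, 38), "leukemia test": (30, 34)}
accuracy = {name: correct / total for name, (correct, total) in results.items()}

for name, value in accuracy.items():
    print(f"{name}: {value:.1%}")
# leukemia training: 94.7%
# leukemia test: 88.2%
```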
Results Comparison • The results are better than those of other algorithms, such as the G-S and NB algorithms, which have accuracies below 90% with gene numbers varying from 10 to 500.
Conclusion • The method does well at finding smaller gene subsets with better accuracies. • The fitness function needs to be more sophisticated than the simple one currently used, to ensure a compact final subset every time.
Questions Thank You.