Presentation by Tim Hamilton

Genetic algorithms applied to multi-class prediction for the analysis of gene expression data
C.H. Ooi & Patrick Tan

“Genechips”
• DNA microarrays: a collection of microscopic DNA spots, each representing a single gene.
• Commonly used to monitor expression levels of thousands of genes at once.

Classification
• Gene expression data is commonly used to classify biological samples, for example by:
  - Tumor subtype
  - Response to certain types of treatment (e.g. chemotherapy)
• Most approaches focus on classifying two, or at most three, classes and have high error rates when run on data sets containing multiple classes (19%).
• The authors propose using a genetic algorithm (GA) to analyze multi-class expression data.

Previous rank-based approaches show reduced performance because:
1) They miss correlations between genes.
2) The predictor set size must be specified in advance.
• Data sets used for the GA:
  - NCI60: expression profiles of 64 cancer cell lines containing 9703 cDNA sequences.
  - GCM: expression profiles for 198 tumor samples, 90 normal samples, and 20 unknowns containing 16063 genes.
• Both data sets were pre-processed to generate a truncated 1000-gene data set. Each value was computed as (color ratio of a single spot - color ratio of all spots) / standard deviation, and the genes with the highest standard deviation were kept.
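The truncation step (keeping the genes with the highest standard deviation) can be sketched roughly as follows. This is a minimal illustration only: the function name `top_variance_genes`, the row-per-gene layout, and the toy values are assumptions for the example, not the authors' pre-processing code.

```python
import statistics

def top_variance_genes(expression, k):
    """Return the indices of the k genes with the highest standard deviation.

    `expression` holds one row per gene, each row listing that gene's values
    across samples (a hypothetical layout chosen for this sketch).
    """
    spreads = [statistics.pstdev(row) for row in expression]
    ranked = sorted(range(len(expression)), key=lambda i: spreads[i], reverse=True)
    return sorted(ranked[:k])

# Toy data: 4 genes measured on 4 samples (made-up numbers).
expression = [
    [1.0, 1.1, 0.9, 1.0],     # nearly flat gene
    [0.2, 2.5, -1.8, 3.0],    # highly variable gene
    [0.5, 0.4, 0.6, 0.5],     # nearly flat gene
    [-2.0, 2.0, -2.0, 2.0],   # highly variable gene
]
kept = top_variance_genes(expression, k=2)
```

In the presentation's setting, `k` would be 1000 and the rows would be the 9703 (NCI60) or 16063 (GCM) measured genes.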

Choosing a GA chromosome
• Determine a minimum and maximum size range for the predictor set: [Rmin, Rmax].
• Chromosome string: [R g1 g2 … gRmax]
  - R is the size of the predictor set.
  - Any genes past position R are ignored.
  - Genes are chosen from the list of 1000.
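As a rough sketch of this encoding, a chromosome can be held as a list whose first entry is R and whose remaining Rmax entries are gene indices. The helper names `random_chromosome` and `decode` are hypothetical, and drawing genes without repetition is an assumption of this example rather than something the slides state.

```python
import random

def random_chromosome(r_min, r_max, n_genes, rng):
    """Build [R, g1, ..., gRmax] with R drawn from [r_min, r_max]."""
    size = rng.randint(r_min, r_max)           # R, the active predictor set size
    genes = rng.sample(range(n_genes), r_max)  # assumption: no repeated genes
    return [size] + genes

def decode(chromosome):
    """Return the active predictor set: the first R genes after the size field."""
    size = chromosome[0]
    return chromosome[1:1 + size]              # genes past position R are ignored

rng = random.Random(0)
chromosome = random_chromosome(5, 10, 1000, rng)
predictor_set = decode(chromosome)
```

Keeping all Rmax gene slots in every chromosome gives each individual the same length, so standard crossover operators can be applied even when the predictor set sizes differ.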

Parameters
• Population size: 100
• Generations: 100
Other parameters were varied:
• Crossover method: one-point or uniform
• Selection method: stochastic universal sampling (SUS) or roulette wheel selection (RWS)
• Probability of crossover: 0.7 – 1.0
• Probability of mutation: 0.0005 – 0.01
• Predictor set size range [Rmin, Rmax]: [5, 10], [11, 15], [16, 20], [21, 25], [26, 30]
• For each predictor set size range this produced 96 different runs.
• The GA was run on both the truncated data set and the full data set for comparison.
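The two crossover methods and two selection methods listed above can be sketched in their textbook forms. This is an illustrative sketch only: the function names are hypothetical, and nothing here is taken from the authors' implementation.

```python
import random

def one_point_crossover(a, b, rng):
    """Swap the tails of two equal-length parents at one random cut point."""
    point = rng.randint(1, len(a) - 1)
    return a[:point] + b[point:], b[:point] + a[point:]

def uniform_crossover(a, b, rng):
    """Swap each position between the parents independently with probability 0.5."""
    child1, child2 = [], []
    for x, y in zip(a, b):
        if rng.random() < 0.5:
            x, y = y, x
        child1.append(x)
        child2.append(y)
    return child1, child2

def roulette_wheel_selection(fitnesses, n, rng):
    """RWS: n independent spins, each landing on index i with chance fitness[i] / total."""
    total = sum(fitnesses)
    picks = []
    for _ in range(n):
        target = rng.random() * total
        running = 0.0
        chosen = len(fitnesses) - 1  # fallback guards against rounding at the end
        for index, fitness in enumerate(fitnesses):
            running += fitness
            if target < running:
                chosen = index
                break
        picks.append(chosen)
    return picks

def stochastic_universal_sampling(fitnesses, n, rng):
    """SUS: one spin, then n equally spaced pointers over the same wheel."""
    step = sum(fitnesses) / n
    start = rng.random() * step
    picks, running, index = [], 0.0, 0
    for i in range(n):
        pointer = start + i * step
        # Advance to the slot whose cumulative range contains the pointer.
        while running + fitnesses[index] <= pointer and index < len(fitnesses) - 1:
            running += fitnesses[index]
            index += 1
        picks.append(index)
    return picks

rng = random.Random(1)
parent_a, parent_b = [7, 1, 2, 3, 4], [5, 10, 20, 30, 40]
kids_one_point = one_point_crossover(parent_a, parent_b, rng)
kids_uniform = uniform_crossover(parent_a, parent_b, rng)
sus_picks = stochastic_universal_sampling([1.0, 1.0, 1.0, 1.0], 4, rng)
rws_picks = roulette_wheel_selection([1.0, 2.0, 3.0, 4.0], 4, rng)
```

With equal fitness values, SUS picks every individual exactly once, whereas RWS can pick the same individual several times by chance; that lower sampling variance is the usual reason for comparing the two.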

Each generation of chromosomes is used to classify the data sets using a maximum likelihood (MLHD) classifier.
• Fitness = 200 - (E1 + E2)
• E1 = cross-validation error rate
• E2 = independent test error rate
• The MLHD classifier involves a lot of math, but is based on Bayes' rule.
• Two previous rank-based methods were run on the same truncated data set for comparison.
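The fitness formula can be written out as a short sketch. It assumes E1 and E2 are expressed as percentages (0 to 100), so that a predictor set with no errors scores 200; the slides do not state the units, and the helper names below are hypothetical.

```python
def error_rate_percent(true_labels, predicted_labels):
    """Percentage of samples whose predicted class differs from the true class."""
    wrong = sum(1 for t, p in zip(true_labels, predicted_labels) if t != p)
    return 100.0 * wrong / len(true_labels)

def fitness(cross_validation_error, test_error):
    """Fitness = 200 - (E1 + E2), with both errors assumed to be percentages."""
    return 200.0 - (cross_validation_error + test_error)

# Made-up labels: 1 of 4 cross-validation predictions is wrong (E1 = 25%).
e1 = error_rate_percent([0, 1, 2, 2], [0, 1, 1, 2])
e2 = 10.0  # hypothetical independent test error, in percent
score = fitness(e1, e2)
```

Because both error rates enter with equal weight, a predictor set is rewarded only if it does well in cross-validation and on the independent test samples.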

Results
• Uniform crossover produced the best predictors in size ranges [11, 15] and [16, 20].
• One-point crossover was best in ranges [5, 10], [21, 25] and [26, 30].
• Predictive accuracies were higher when the GA was run against the truncated data set.

Results vs. Other Methods

Finally, the GA was compared to another method using SVM classification.
• The SVM performed best when all 16063 genes of a data set were used: 22% error.
• The GA used only 32 elements: 18% error.
