Explore computational methods for analyzing molecular array data, addressing challenges like inaccuracy, sparseness, and model complexity using genetic algorithms and Mantel statistics.
Computational Discrete Mathematics and Statistics for Molecular Array Data Bill Shannon Washington University School of Medicine
Molecular Biology “How Genes Work”, http://www.nigms.nih.gov
Gene Microarrays *Messenger RNA Levels [Figure: expression of genes A, B, and C in a normal cell versus a tumor cell] *Brenner, Jacob, Meselson (1961) An unstable intermediate carrying information from genes to ribosomes for protein synthesis. Nature, 190:576-581.
Microarrays (Leukemia PPG) 35 Probes Selected from ~50,000
Array Data Present New Data Analysis Challenges (Curse of Dimensionality) • Inaccuracy, or error, of a model becomes large very fast • sparseness (describing the data becomes impossible) • model complexity (too many interaction terms, non-linear effects, etc. to consider) • random multicollinearity (spurious correlations)
Regression (Curse of Dimensionality) • y = f(x) + error • sparseness = little local signal • model parameters not estimated accurately • unstable models over-fit data (not generalizable) • Non-parametric methods (e.g., CART, neural nets) • require a lot of model searching • use up degrees of freedom rapidly • little or no information left to determine significance
Cluster Analysis (Curse of Dimensionality) • Find structure in data • Many cluster results with same goodness-of-fit • Deciding among the models is impossible.
Classification Models (Curse of Dimensionality) • Predict group membership (e.g., tumor versus normal) • Three broad categories • geometric methods (discriminant analysis, CART) • probabilistic methods (Bayesian) • algorithmic methods (neural networks, k-NN) • Require training/validation datasets
Other Methods (Curse of Dimensionality) • Resampling (cross validation, bootstrapping), model averaging (bagging), or iterative re-weighting (boosting) • Multiple testing adjustment such as false discovery rate or permutation testing
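As one concrete instance of a false discovery rate adjustment, the sketch below implements the Benjamini-Hochberg step-up procedure. The slide does not say which FDR method was intended, so this choice, the function name, and the example p-values are assumptions for illustration.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected.

    Benjamini-Hochberg step-up rule: find the largest rank k such that
    p_(k) <= alpha * k / m, then reject the k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / m:
            cutoff = rank  # keep the largest rank that passes
    rejected = [False] * m
    for i in order[:cutoff]:
        rejected[i] = True
    return rejected


# Hypothetical per-gene p-values
p = [0.001, 0.008, 0.039, 0.041, 0.6]
flags = benjamini_hochberg(p, alpha=0.05)
```

With tens of thousands of probes, the same rule applies unchanged; only the length of the p-value list grows.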
Mantel Statistics • Transform standard NxP data matrices into NxN subject pairwise distances or similarities • Instead of analyzing NxP data matrix (P >> N) avoid the curse of dimensionality problem and analyze the NxN matrix
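The transformation from an NxP data matrix to an NxN subject-by-subject distance matrix can be sketched as follows. Euclidean distance is an assumption here (the slide does not name a metric), and `pairwise_distances` is a hypothetical helper name.

```python
import math


def pairwise_distances(data):
    """Return the N x N Euclidean distance matrix for an N x P data matrix."""
    n = len(data)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(data[i], data[j])))
            dist[i][j] = dist[j][i] = d  # distances are symmetric
    return dist


# Toy example: N = 3 subjects, P = 4 genes
data = [
    [0.0, 0.0, 0.0, 0.0],
    [3.0, 4.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
D = pairwise_distances(data)
```

However large P becomes, the result stays N x N, which is the point of working with distances when P >> N.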
Mantel Statistics Shannon (2008) Cluster Analysis, in Handbook of Statistics, Vol. 27, eds. Rao, Rao, Miller.
Mantel Statistics Signal + Noise Genes Signal Genes Only
Mantel Statistics Correlating DP with Dk<<P avoids curse of dimensionality! A positive Mantel correlation indicates that the genes in Dk<<P contain the same information as the genes in DP. Shannon, Watson, et al. (2002). Mantel statistics to correlate gene expression levels from microarrays with clinical covariates. Genet Epidemiol 23: 87-96.
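A minimal sketch of the Mantel correlation itself: the Pearson correlation between the corresponding off-diagonal entries of two distance matrices. The function names and toy matrices are illustrative assumptions, not the authors' code.

```python
import math


def upper_triangle(dist):
    """Flatten the entries above the diagonal of a square matrix."""
    n = len(dist)
    return [dist[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mantel_correlation(d1, d2):
    """Correlate two N x N distance matrices entry by entry."""
    return pearson(upper_triangle(d1), upper_triangle(d2))


# Toy distance matrices for N = 3 subjects: d_sub is d_full rescaled,
# so the two carry the same information about subject relationships.
d_full = [[0, 1, 2], [1, 0, 3], [2, 3, 0]]
d_sub = [[0, 2, 4], [2, 0, 6], [4, 6, 0]]
r = mantel_correlation(d_full, d_sub)
```

Because the entries of a distance matrix are not independent, significance is assessed by permutation rather than by the usual correlation test.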
GA-Mantel • Search algorithm to find signal genes • Solution representation • list of genes (10 123 456 798 835 888 923) • binary vector {0000100110000….00010} • Each solution maps to a Mantel correlation value • Assumption: the larger the correlation the more signal genes in the solution • Selection keeps solutions with high Mantel correlation Grefenstette, Thompson, Shannon, and Steinmeyer (2005): Genetic algorithms for feature selection using Mantel correlation scoring. Interface: Classification and Clustering 37th Symposium on the Interface. St. Louis, MO
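The two solution representations on this slide are interchangeable. The sketch below converts between them, assuming 0-based gene indices and a hypothetical total of 1,000 genes.

```python
def to_binary(gene_indices, n_genes):
    """Gene list -> binary vector with a 1 at each selected gene."""
    selected = set(gene_indices)
    return [1 if g in selected else 0 for g in range(n_genes)]


def to_indices(mask):
    """Binary vector -> sorted list of selected gene indices."""
    return [g for g, bit in enumerate(mask) if bit]


genes = [10, 123, 456, 798, 835, 888, 923]
mask = to_binary(genes, n_genes=1000)
assert to_indices(mask) == genes  # round trip recovers the gene list
```

The list form is compact when subsets are small relative to the number of genes; the binary form makes bitwise crossover and mutation straightforward.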
Gene Subset Selection • Given • a data set comprising N microarray experiments with g genes • Find: • a subset of genes that captures relevant relationships among the experiments • Goal: • reduce data for further analysis • identify meaningful biological markers for diagnosis
Genetic Algorithm 1. Randomly generate an initial population 2. Do until stopping criterion is met: Select individuals to be parents (biased by fitness). Produce offspring by recombination/mutation. Select individuals to die (biased by fitness). End Do. 3. Return a result.
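The loop above can be sketched as a small steady-state genetic algorithm over fixed-length gene subsets. The tournament selection, union-based recombination, mutation rate, and toy fitness function are all assumptions chosen for brevity; the slides do not specify these operators.

```python
import random


def run_ga(fitness, n_genes, subset_size, pop_size=30, steps=500, seed=0):
    rng = random.Random(seed)
    # 1. Randomly generate an initial population of gene subsets.
    pop = [rng.sample(range(n_genes), subset_size) for _ in range(pop_size)]
    scores = [fitness(ind) for ind in pop]
    # 2. Repeat until the stopping criterion (a fixed step budget) is met.
    for _ in range(steps):
        # Select two parents, each the fitter of a random pair.
        parents = []
        for _ in range(2):
            a, b = rng.sample(range(pop_size), 2)
            parents.append(pop[a] if scores[a] >= scores[b] else pop[b])
        # Recombination: draw the child from the union of parental genes.
        pool = sorted(set(parents[0]) | set(parents[1]))
        child = rng.sample(pool, subset_size)
        # Mutation: sometimes swap one gene for a gene not already present.
        if rng.random() < 0.5:
            new_gene = rng.choice([g for g in range(n_genes) if g not in child])
            child[rng.randrange(subset_size)] = new_gene
        # Select an individual to die: the less fit of a random pair.
        a, b = rng.sample(range(pop_size), 2)
        loser = a if scores[a] <= scores[b] else b
        pop[loser], scores[loser] = child, fitness(child)
    # 3. Return the best subset found and its fitness.
    best = max(range(pop_size), key=scores.__getitem__)
    return sorted(pop[best]), scores[best]


# Toy fitness (hypothetical): number of selected genes among genes 0-9.
signal = set(range(10))


def toy_fitness(subset):
    return sum(1 for g in subset if g in signal)


best_subset, best_score = run_ga(toy_fitness, n_genes=50, subset_size=5)
```

In GA-Mantel the toy fitness would be replaced by the Mantel correlation between the full-data and subset distance matrices.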
Fitness Evaluation for Gene Selection • Calculate DP using all genes • For each Subset(k) in current population: • Calculate Dk<<P • Correlate DP with Dk<<P • Use Mantel Correlation as fitness to select next population of solutions • Permute to compute P-values
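The fitness steps above can be sketched end to end: build DP from all genes, build Dk<<P from a candidate subset, correlate them, and obtain a p-value by permuting subject labels. Euclidean distance and the joint row/column permutation scheme (the standard Mantel test) are assumptions; all function names are hypothetical, and a real analysis would likely use an array library for speed.

```python
import math
import random


def distances(data, genes):
    """N x N Euclidean distances using only the listed gene columns."""
    n = len(data)
    return [[math.sqrt(sum((data[i][g] - data[j][g]) ** 2 for g in genes))
             for j in range(n)] for i in range(n)]


def mantel_r(d1, d2, order=None):
    """Pearson correlation of upper-triangle entries; `order` relabels d2."""
    n = len(d1)
    order = order or list(range(n))
    x = [d1[i][j] for i in range(n) for j in range(i + 1, n)]
    y = [d2[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mantel_fitness(data, subset, n_perm=999, seed=0):
    """Return (Mantel correlation, permutation p-value) for a gene subset."""
    rng = random.Random(seed)
    d_full = distances(data, range(len(data[0])))  # DP: all genes
    d_sub = distances(data, subset)                # Dk<<P: subset only
    observed = mantel_r(d_full, d_sub)
    hits = 0
    for _ in range(n_perm):
        order = list(range(len(data)))
        rng.shuffle(order)  # permute subjects (rows and columns together)
        if mantel_r(d_full, d_sub, order) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Toy data: 8 subjects in two clusters, 3 genes that all track the clusters.
pos = [0, 1, 2, 3, 10, 11, 12, 13]
data = [[p * c for c in (1.0, 2.0, 3.0)] for p in pos]
rho, p_value = mantel_fitness(data, subset=[0])
```

Inside a GA, only the correlation would be computed for every candidate; the permutation p-value would typically be reserved for the final subsets because it is far more expensive.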
GA on Artificial Data • Simulated data: • 100 experiments with 10,000 genes • 100 signal genes • 9900 noise genes • Two groups • Group 1 has signal genes sampled from N(0, 1) • Group 2 has signal genes sampled from N(1, 1) • GA Parameters • population size 200 • generations 200 • Outcome measures (averaged over 10 runs of the GA) • prevalence (signal, noise) – number of signal and noise genes in GA answer • correlation (signal, noise) – correlation of best subset distance matrix with the ‘full’ distance matrix • coverage - number of signal genes identified over all GA runs
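A sketch of how such a data set could be simulated. The slide does not give the group sizes or the distribution of the noise genes, so equal-sized groups and N(0, 1) noise are assumptions, as is placing the signal genes in the first columns.

```python
import random


def simulate(n_samples=100, n_genes=10_000, n_signal=100, seed=0):
    """Return (data, labels): an n_samples x n_genes matrix and group labels.

    Signal genes occupy the first n_signal columns: N(0, 1) in group 1 and
    N(1, 1) in group 2. Remaining genes are N(0, 1) noise in both groups
    (an assumption; the slides do not state the noise distribution).
    """
    rng = random.Random(seed)
    half = n_samples // 2
    labels = [1] * half + [2] * (n_samples - half)
    data = []
    for group in labels:
        shift = 1.0 if group == 2 else 0.0
        row = [rng.gauss(shift, 1.0) for _ in range(n_signal)]
        row += [rng.gauss(0.0, 1.0) for _ in range(n_genes - n_signal)]
        data.append(row)
    return data, labels


# Small instance for quick experimentation; the defaults match the slide.
data, labels = simulate(n_samples=20, n_genes=50, n_signal=5)
```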
GA on Artificial Data Length = 30 Prevalence: mean number of signal genes = 22.9 (0.7) 76.3% (std 0.53%) Correlation: mean rho for best subsets = 0.787 (0.009) p-value < 0.0001 Coverage: total signal genes identified across 10 runs = 65/100 Observation: solutions tend to converge to similar subsets. Same 4 signal genes appear in 90% of runs
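The prevalence and coverage measures reported here can be computed from the GA answers as sketched below. The function names and toy runs are hypothetical; only the definitions follow the slides.

```python
def prevalence(solution, signal_genes):
    """Number of signal genes in a single GA answer."""
    return sum(1 for g in solution if g in signal_genes)


def coverage(solutions, signal_genes):
    """Number of distinct signal genes found across all GA runs."""
    found = set()
    for solution in solutions:
        found.update(g for g in solution if g in signal_genes)
    return len(found)


# Toy example: genes 0-3 are signal, two GA runs of length 3
signal_genes = {0, 1, 2, 3}
runs = [[0, 1, 9], [1, 2, 8]]
per_run = [prevalence(run, signal_genes) for run in runs]
total = coverage(runs, signal_genes)
```

For the reported run, a mean prevalence of 22.9 signal genes in subsets of length 30 corresponds to 22.9 / 30 ≈ 76.3%.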
GA on Golub Data Set • Data set: Golub training set (38 x 7129) • Two Groups: • 27 samples from ALL patients • 11 samples from AML patients • GA searched for subsets of fixed length (10 to 50) • population = 200, generations = 200 • Mantel correlation tends to increase with subset size
Significant Feature Subsets Clustering of Samples using all genes Clustering of Samples using 50 genes from GA
Letting GA Select Subset Length • Data set: Golub training set (38 x 7129) • GA searched over variable length subsets (min=5 max=50) • Fitness penalty = d * length / 50 • population = 200, generations = 200 • Tradeoff between length of subsets and correlation score
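The length penalty can be sketched as follows. The slide gives the penalty term d * length / 50 but neither the value of d nor how the penalty enters the fitness; subtracting it from the Mantel correlation is an assumption, and the numbers are illustrative only.

```python
def penalized_fitness(rho, length, d, max_length=50):
    """Mantel correlation minus a penalty that grows with subset length."""
    return rho - d * length / max_length


# With a hypothetical d = 0.1, a longer subset must gain enough
# correlation to offset its larger penalty.
short = penalized_fitness(rho=0.75, length=10, d=0.1)   # penalty 0.02
long_ = penalized_fitness(rho=0.80, length=50, d=0.1)   # penalty 0.10
```

Larger values of d push the search toward shorter subsets, which is the tradeoff between subset length and correlation score noted above.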
Data Reduction • GA can also be used to find feature subsets that minimize rho • pop = 200 • length = 50 • data set = Golub • GA finds subsets with rho = 0 within 50 gens • Observation: GA appears to repeatedly converge to same regions of feature space • In 50 runs, 954/1546 (61%) of "noise" genes appear more than once in feature sets
GA in Experimental Data Analysis • Graft Versus Host Disease (GVHD) in bone marrow transplantation (leukemia) • T-cells in the transplanted bone marrow see the recipient as foreign and initiate an immune response that destroys host organs • Regulatory T-cells (Treg) suppress the immune response • Choi and DiPersio are studying the genetic mechanisms of Treg regulation
Mouse Array Experiment
GROUP   TREATMENT            ARRAYS
1       Naïve Treg           dec1, dec5
2       Activated Treg       dec2, dec6, dec10
3       PBST (Control)       dec3, dec7, dec11
4       Decitabine treated   dec4, dec8, dec12
~$1,000 per array in total costs: $12,000 worth of data, including dec9, which did not work
Mouse Array Experiment • Identify probes (genes) with similar mRNA levels between groups (gene by phenotype analysis)
Summary • GA-Mantel effective at identifying signal genes • Longer gene subsets associated with higher scores • tradeoff: higher correlations vs. smaller subsets • requires constraining growth of subsets in GA • GA effective at identifying noise genes • GA-Mantel can find genes associated with phenotype
Future Directions • RFA CA-08-005 (under review) • Optimize algorithm to improve coverage of solution space • Multiple solutions • Combine solutions (weak hierarchies) • Lung disease R01 (to be submitted) • Microarrays to identify disease subgroups across the bronchitis/emphysema continuum
Weak Hierarchies Day, McMorris (2003) Axiomatic Consensus Theory in Group Choice and Biomathematics, SIAM Frontiers in Applied Mathematics, Philadelphia, PA.