Computational Discrete Mathematics and Statistics for Molecular Array Data

Bill Shannon, Washington University School of Medicine

Molecular Biology “How Genes Work”, http://www.nigms.nih.gov

Microarrays: Messenger RNA* Levels
[Figure: mRNA levels of genes A, B, and C in a normal cell and in a tumor cell]
*Brenner, Jacob, Meselson (1961). An unstable intermediate carrying information from genes to ribosomes for protein synthesis. Nature 190: 576-581.

Microarrays (Leukemia PPG) 35 Probes Selected from ~50,000

Array Data Present New Data Analysis Challenges (Curse of Dimensionality)
• Inaccuracy, or error, of a model grows very quickly as the number of variables increases
  • sparseness (describing the data becomes impossible)
  • model complexity (too many interaction terms, non-linear effects, etc. to consider)
  • random multicollinearity (spurious correlations)
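The "spurious correlations" point can be illustrated with a small simulation. This is only a sketch: the sizes (10 subjects, 2,000 genes) and the seed are arbitrary choices, not values from the talk. Every "gene" is pure noise, yet with P >> N at least one of them correlates strongly with the outcome by chance.

```python
import math
import random

def pearson(a, b):
    """Pearson correlation between two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

rng = random.Random(0)            # fixed seed, arbitrary
n_subjects, n_genes = 10, 2000    # illustrative sizes with P >> N

# Outcome and "genes" are independent noise: there is no real signal.
y = [rng.gauss(0, 1) for _ in range(n_subjects)]
genes = [[rng.gauss(0, 1) for _ in range(n_subjects)] for _ in range(n_genes)]

# The best-looking gene is still just noise.
max_abs_r = max(abs(pearson(g, y)) for g in genes)
print(f"largest |r| among {n_genes} pure-noise genes: {max_abs_r:.2f}")
```

Picking genes by their marginal correlation alone would therefore select noise, which is one motivation for the multiple-testing adjustments and distance-based methods discussed later in the talk.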

Regression (Curse of Dimensionality)
• y = f(x) + error
  • sparseness = little local signal
  • model parameters not estimated accurately
  • unstable models over-fit data (not generalizable)
• Non-parametric methods (e.g., CART, neural nets)
  • require a lot of model searching
  • use up degrees of freedom rapidly
  • little or no information left to determine significance

Cluster Analysis (Curse of Dimensionality)
• Find structure in data
• Many cluster results with same goodness-of-fit
• Deciding among the models is impossible

Classification Models (Curse of Dimensionality)
• Predict group membership (e.g., tumor versus normal)
• Three broad categories
  • geometric methods (discriminant analysis, CART)
  • probabilistic methods (Bayesian)
  • algorithmic methods (neural networks, k-NN)
• Require training/validation datasets

Other Methods (Curse of Dimensionality)
• Resampling (cross validation, bootstrapping), model averaging (bagging), or iterative re-weighting (boosting)
• Multiple testing adjustment such as false discovery rate or permutation testing
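As one concrete instance of a false discovery rate adjustment, here is a minimal sketch of the Benjamini-Hochberg step-up procedure. The slide does not say which FDR method was meant, so this choice is an assumption, and the p-values below are made up for illustration.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected at FDR level alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p-value
    # Find the largest rank k with p_(k) <= alpha * k / m.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= alpha * rank / m:
            k_max = rank
    # Reject every hypothesis ranked at or below k_max.
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            rejected[i] = True
    return rejected

# Made-up per-gene p-values (not from the talk).
pvals = [0.041, 0.001, 0.205, 0.008, 0.039, 0.06, 0.074, 0.042]
rejected = benjamini_hochberg(pvals, alpha=0.05)
print(rejected)
```

With these values only the genes with p = 0.001 and p = 0.008 pass, even though five of the eight raw p-values are below 0.05.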

Mantel Statistics
• Transform a standard N x P data matrix into an N x N matrix of pairwise subject distances or similarities
• Instead of analyzing the N x P data matrix (P >> N), analyze the N x N matrix and so avoid the curse of dimensionality
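The N x P to N x N transformation can be sketched as follows. Euclidean distance is an assumption here (the slide allows any distance or similarity), and the toy matrix is made up.

```python
import math

def pairwise_distances(X):
    """Return the N x N Euclidean distance matrix for the rows of X (N subjects x P genes)."""
    n = len(X)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(X[i], X[j])))
            D[i][j] = D[j][i] = d   # distance matrices are symmetric
    return D

# Toy data: N = 3 subjects, P = 4 genes (made-up values).
X = [
    [0.0, 0.0, 0.0, 0.0],
    [3.0, 4.0, 0.0, 0.0],
    [0.0, 0.0, 6.0, 8.0],
]
D = pairwise_distances(X)
for row in D:
    print([round(v, 2) for v in row])
```

However many genes are measured, the result has only N(N-1)/2 distinct entries, which is what keeps the later analysis tractable.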

Mantel Statistics Shannon (2008) Cluster Analysis, in Handbook of Statistics, Vol. 27, eds. Rao, Rao, Miller.

Mantel Statistics

Mantel Statistics
[Figure: two panels, "Signal + Noise Genes" and "Signal Genes Only"]

Mantel Statistics
Correlating D_P (the distance matrix from all P genes) with D_k (the distance matrix from a subset of k << P genes) avoids the curse of dimensionality! A positive Mantel correlation indicates that the genes in D_k contain the same information as the genes in D_P.
Shannon, Watson, et al. (2002). Mantel statistics to correlate gene expression levels from microarrays with clinical covariates. Genetic Epidemiology 23: 87-96.
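A minimal sketch of the Mantel correlation between the all-gene distance matrix and a subset distance matrix is given below. It assumes Euclidean distances and a Pearson correlation over the upper-triangle entries; the function names and the toy data are illustrative, not taken from the cited paper.

```python
import math

def euclidean_distances(X, genes=None):
    """N x N distance matrix from the rows of X, optionally restricted to the listed gene columns."""
    cols = list(range(len(X[0]))) if genes is None else list(genes)
    n = len(X)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            D[i][j] = D[j][i] = math.sqrt(sum((X[i][g] - X[j][g]) ** 2 for g in cols))
    return D

def mantel_correlation(D1, D2):
    """Pearson correlation between the upper-triangle entries of two distance matrices."""
    n = len(D1)
    a = [D1[i][j] for i in range(n) for j in range(i + 1, n)]
    b = [D2[i][j] for i in range(n) for j in range(i + 1, n)]
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

# Toy data (made up): genes 0 and 1 separate subjects {0, 1} from {2, 3}; gene 2 is noise.
X = [
    [0.0, 0.1, 5.0],
    [0.2, 0.0, 1.0],
    [4.0, 4.1, 4.0],
    [4.2, 3.9, 0.5],
]
D_all = euclidean_distances(X)
rho_signal = mantel_correlation(D_all, euclidean_distances(X, genes=[0, 1]))
rho_noise = mantel_correlation(D_all, euclidean_distances(X, genes=[2]))
print(f"signal subset: {rho_signal:.2f}, noise subset: {rho_noise:.2f}")
```

In this toy case the two-gene signal subset reproduces the all-gene distances far better than the single noise gene does, which is the behavior the GA exploits.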

GA-Mantel
• Search algorithm to find signal genes
• Solution representation
  • list of genes (10 123 456 798 835 888 923)
  • binary vector {0000100110000….00010}
• Each solution maps to a Mantel correlation value
• Assumption: the larger the correlation, the more signal genes in the solution
• Selection keeps solutions with high Mantel correlation
Grefenstette, Thompson, Shannon, and Steinmeyer (2005). Genetic algorithms for feature selection using Mantel correlation scoring. Interface: Classification and Clustering, 37th Symposium on the Interface, St. Louis, MO.
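The two solution representations are interchangeable. A small sketch of the conversion follows; it assumes 0-based gene indices and a hypothetical total of 1,000 genes, neither of which is stated on the slide.

```python
def genes_to_mask(genes, n_genes):
    """Convert a list of selected gene indices into a 0/1 vector of length n_genes."""
    mask = [0] * n_genes
    for g in genes:
        mask[g] = 1
    return mask

def mask_to_genes(mask):
    """Convert a 0/1 vector back into the sorted list of selected gene indices."""
    return [i for i, bit in enumerate(mask) if bit]

subset = [10, 123, 456, 798, 835, 888, 923]   # the gene list shown on the slide
mask = genes_to_mask(subset, n_genes=1000)    # n_genes = 1000 is a hypothetical total
print(sum(mask), mask_to_genes(mask) == subset)
```

The list form is compact when subsets are small relative to the number of genes, while the binary vector makes classic bit-string crossover and mutation operators straightforward.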

Recombination

Mutation

Gene Subset Selection
• Given:
  • a data set comprising N microarray experiments with g genes
• Find:
  • a subset of genes that captures relevant relationships among the experiments
• Goal:
  • reduce data for further analysis
  • identify meaningful biological markers for diagnosis

Genetic Algorithm
1. Randomly generate an initial population
2. Do until a stopping criterion is met:
     Select individuals to be parents (biased by fitness).
     Produce offspring by recombination/mutation.
     Select individuals to die (biased by fitness).
   End Do.
3. Return a result.
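The loop above can be sketched for fixed-length gene subsets as follows. The specific operators are assumptions chosen for brevity, not the ones used in the talk: binary tournament for parent selection, recombination by sampling from the union of both parents, single-gene replacement for mutation, and truncation for deciding who dies. The toy fitness function is also hypothetical.

```python
import random

def run_ga(fitness, n_genes, subset_len, pop_size=20, generations=30,
           mutation_rate=0.2, seed=0):
    """Return (best subset, best fitness after each generation)."""
    rng = random.Random(seed)

    def tournament(pop):
        # Parent selection biased by fitness: the fitter of two random individuals.
        a, b = rng.sample(pop, 2)
        return a if fitness(a) >= fitness(b) else b

    # 1. Randomly generate an initial population of gene subsets.
    pop = [rng.sample(range(n_genes), subset_len) for _ in range(pop_size)]
    history = []
    # 2. Stopping criterion used here: a fixed number of generations.
    for _ in range(generations):
        offspring = []
        for _ in range(pop_size):
            p1, p2 = tournament(pop), tournament(pop)
            # Recombination: the child draws its genes from the union of both parents.
            child = rng.sample(sorted(set(p1) | set(p2)), subset_len)
            # Mutation: occasionally replace one gene with a random gene not yet present.
            if rng.random() < mutation_rate:
                new_gene = rng.randrange(n_genes)
                if new_gene not in child:
                    child[rng.randrange(subset_len)] = new_gene
            offspring.append(child)
        # Death biased by fitness: only the fittest pop_size individuals survive.
        pop = sorted(pop + offspring, key=fitness, reverse=True)[:pop_size]
        history.append(fitness(pop[0]))
    # 3. Return a result.
    return pop[0], history

# Toy fitness (hypothetical): how many of genes 0-4 does the subset contain?
signal = set(range(5))
best, history = run_ga(lambda s: len(signal & set(s)), n_genes=50, subset_len=5)
print(sorted(best), history[-1])
```

Because survivors are chosen from parents and offspring together, the best fitness never decreases from one generation to the next; in GA-Mantel the toy fitness would be replaced by the Mantel correlation of the subset.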

Fitness Evaluation for Gene Selection
• Calculate D_P using all genes
• For each Subset(k) in the current population:
  • Calculate D_k (k << P)
  • Correlate D_P with D_k
• Use the Mantel correlation as fitness to select the next population of solutions
• Permute to compute P-values
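The "permute to compute P-values" step can be sketched as a standard Mantel permutation test: subject labels of one matrix are shuffled (rows and columns together) and the correlation is recomputed. The one-sided test, the 999 permutations, and the toy distance matrices built from made-up 1-D coordinates are all assumptions for illustration.

```python
import math
import random

def upper_triangle(D, order):
    """Upper-triangle entries of D after relabeling subjects by `order`."""
    n = len(D)
    return [D[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]

def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def mantel_test(D_full, D_sub, n_perm=999, seed=0):
    """Return (observed Mantel correlation, one-sided permutation p-value)."""
    rng = random.Random(seed)
    n = len(D_full)
    identity = list(range(n))
    a = upper_triangle(D_full, identity)
    observed = pearson(a, upper_triangle(D_sub, identity))
    hits = 0
    for _ in range(n_perm):
        order = identity[:]
        rng.shuffle(order)   # permute subjects, i.e. rows and columns together
        if pearson(a, upper_triangle(D_sub, order)) >= observed:
            hits += 1
    # Count the observed arrangement itself so the p-value is never zero.
    return observed, (hits + 1) / (n_perm + 1)

# Toy matrices (made up): 8 subjects placed on a line, with slightly perturbed positions for D_sub.
full = [0, 1, 3, 7, 15, 31, 63, 127]
sub = [0, 1.5, 3, 7.5, 15, 30, 63, 127]
D_full = [[abs(a - b) for b in full] for a in full]
D_sub = [[abs(a - b) for b in sub] for a in sub]
rho, p = mantel_test(D_full, D_sub)
print(f"rho = {rho:.4f}, p = {p:.3f}")
```

Permuting whole subjects rather than individual distances preserves the dependence among entries that share a subject, which is why the Mantel test uses this scheme.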

GA on Artificial Data
• Simulated data:
  • 100 experiments with 10,000 genes
    • 100 signal genes
    • 9,900 noise genes
  • Two groups
    • Group 1 has signal genes sampled from N(0, 1)
    • Group 2 has signal genes sampled from N(1, 1)
• GA parameters
  • population size 200
  • generations 200
• Outcome measures (averaged over 10 runs of the GA)
  • prevalence (signal, noise): number of signal and noise genes in the GA answer
  • correlation (signal, noise): correlation of the best-subset distance matrix with the "full" distance matrix
  • coverage: number of signal genes identified over all GA runs
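A data set with this layout could be generated as sketched below. Two details are assumptions because the slide does not state them: the 100 experiments are split evenly between the two groups, and noise genes are drawn from N(0, 1) in both groups. The seed and the convention of placing signal genes first are arbitrary.

```python
import random

def simulate(n_experiments=100, n_genes=10_000, n_signal=100, shift=1.0, seed=0):
    """Return (X, labels): X is n_experiments x n_genes, signal genes occupy the first n_signal columns."""
    rng = random.Random(seed)
    half = n_experiments // 2                       # assumed even split between groups
    labels = [1] * half + [2] * (n_experiments - half)
    X = []
    for lab in labels:
        mu = 0.0 if lab == 1 else shift             # group 1: N(0, 1); group 2: N(1, 1)
        row = [rng.gauss(mu, 1.0) for _ in range(n_signal)]
        # Noise genes: assumed N(0, 1) regardless of group.
        row += [rng.gauss(0.0, 1.0) for _ in range(n_genes - n_signal)]
        X.append(row)
    return X, labels

X, labels = simulate()
print(len(X), len(X[0]), labels.count(1), labels.count(2))
```

Only 1% of the columns carry any group information, so a distance matrix computed from all 10,000 genes is dominated by noise.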

GA on Artificial Data
Length = 30
• Prevalence: mean number of signal genes = 22.9 (0.7); 76.3% (std 0.53%)
• Correlation: mean rho for best subsets = 0.787 (0.009), p-value < 0.0001
• Coverage: total signal genes identified across 10 runs = 65/100
• Observation: solutions tend to converge to similar subsets. The same 4 signal genes appear in 90% of runs
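The prevalence and coverage measures reported above can be computed from a set of GA answers as sketched here. The three "runs" are made-up toy subsets, not results from the talk, and the assumption that signal genes are indexed 0 to 99 is for illustration only.

```python
def prevalence(solution, signal_genes):
    """Number of signal genes in a single GA answer."""
    return sum(1 for g in solution if g in signal_genes)

def coverage(solutions, signal_genes):
    """Number of distinct signal genes found across all GA answers."""
    found = set()
    for solution in solutions:
        found.update(g for g in solution if g in signal_genes)
    return len(found)

signal_genes = set(range(100))   # assumed indexing: genes 0-99 are the signal genes
runs = [                         # made-up GA answers, one subset per run
    [0, 1, 2, 500],
    [1, 2, 3, 700],
    [2, 3, 4, 5],
]
per_run = [prevalence(r, signal_genes) for r in runs]
mean_prevalence = sum(per_run) / len(per_run)
total_coverage = coverage(runs, signal_genes)
print(per_run, round(mean_prevalence, 2), total_coverage)
```

High prevalence with low coverage, as in this toy case, is the pattern the slide describes: each run finds mostly signal genes, but different runs keep returning many of the same ones.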

GA on Golub Data Set
• Data set: Golub training set (38 x 7129)
• Two groups:
  • 27 samples from ALL patients
  • 11 samples from AML patients
• GA searched for subsets of fixed length (10 to 50)
  • population = 200, generations = 200
• Mantel correlation tends to increase with subset size

Significant Feature Subsets
[Figure: two panels, "Clustering of Samples using all genes" and "Clustering of Samples using 50 genes from GA"]

Letting GA Select Subset Length
• Data set: Golub training set (38 x 7129)
• GA searched over variable-length subsets (min = 5, max = 50)
  • Fitness penalty = d * length / 50
  • population = 200, generations = 200
• Tradeoff between length of subsets and correlation score
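One way to read the length penalty is sketched below. The slide does not define d or say how the penalty enters the fitness, so two assumptions are made: d is a tunable penalty weight, and the penalty is subtracted from the Mantel correlation. The example values are made up.

```python
def penalized_fitness(rho, length, d, max_length=50):
    """Mantel correlation minus a length penalty of d * length / max_length (assumed form)."""
    return rho - d * length / max_length

# Made-up example: a long subset with slightly higher correlation vs. a shorter one.
long_subset = penalized_fitness(rho=0.80, length=50, d=0.1)
short_subset = penalized_fitness(rho=0.75, length=20, d=0.1)
print(round(long_subset, 3), round(short_subset, 3))
```

Under this reading, larger d pushes the GA toward shorter subsets, and d = 0 recovers the unpenalized Mantel correlation.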

Data Reduction
• GA can also be used to find feature subsets that minimize rho
  • pop = 200
  • length = 50
  • data set = Golub
• GA finds subsets with rho = 0 within 50 generations
• Observation: GA appears to repeatedly converge to the same regions of feature space
  • In 50 runs, 954/1546 (61%) of "noise" genes appear more than once in feature sets

GA in Experimental Data Analysis
• Graft Versus Host Disease (GVHD) in bone marrow transplantation (leukemia)
  • T-cells in the transplanted bone marrow see the recipient as foreign and initiate an immune response that destroys host organs
  • Regulatory T-cells (Treg) suppress the immune response
• Choi and DiPersio are studying the genetic mechanisms of Treg regulation

Mouse Array Experiment

GROUP  TREATMENT           ARRAYS
1      Naïve Treg          dec1, dec5
2      Activated Treg      dec2, dec6, dec10
3      PBST (Control)      dec3, dec7, dec11
4      Decitabine treated  dec4, dec8, dec12

~$1,000 per array in total costs: $12,000 worth of data, including dec9, which did not work

Mouse Array Experiment • Identify probes (genes) with similar mRNA levels between groups (gene by phenotype analysis)

Act+Dec Vs Naïve Vs Control

Naïve+Dec Vs Act Vs Control

Summary
• GA-Mantel effective at identifying signal genes
• Longer gene subsets associated with higher scores
  • tradeoff: higher correlations vs. smaller subsets
  • requires constraining growth of subsets in GA
• GA effective at identifying noise genes
• GA-Mantel can find genes associated with phenotype

Future Directions
• RFA CA-08-005 (under review)
  • Optimize algorithm to improve coverage of solution space
  • Multiple solutions
  • Combine solutions (weak hierarchies)
• Lung disease R01 (to be submitted)
  • Microarrays to identify disease subgroups across the bronchitis/emphysema continuum

Weak Hierarchies Day, McMorris (2003) Axiomatic Consensus Theory in Group Choice and Biomathematics, SIAM Frontiers in Applied Mathematics, Philadelphia, PA.
