Computational Discrete Mathematics and Statistics for Molecular Array Data

Computational Discrete Mathematics and Statistics for Molecular Array Data. Bill Shannon Washington University School of Medicine. Molecular Biology. “How Genes Work”, http://www.nigms.nih.gov. A B C Gene. Microarrays. *Messenger RNA Levels. A B C Gene. Normal Cell.


Presentation Transcript


  1. Computational Discrete Mathematics and Statistics for Molecular Array Data. Bill Shannon, Washington University School of Medicine

  2. Molecular Biology “How Genes Work”, http://www.nigms.nih.gov

  3. Microarrays • Messenger RNA* levels of genes A, B, C • Normal cell vs. tumor cell *Brenner, Jacob, Meselson (1961) An unstable intermediate carrying information from genes to ribosomes for protein synthesis. Nature, 190:576-581.

  4. Microarrays (Leukemia PPG) 35 Probes Selected from ~50,000

  5. Array Data Present New Data Analysis Challenges (Curse of Dimensionality) • Inaccuracy, or error, of a model becomes large very fast • sparseness (description of the data is impossible) • model complexity (too many interaction terms, non-linear effects, etc. to consider) • random multicollinearity (spurious correlations)

  6. Regression (Curse of Dimensionality) • y = f(x) + error • sparseness = little local signal • model parameters not estimated accurately • unstable models over-fit data (not generalizable) • Non-parametric methods (e.g., CART, neural nets) • require a lot of model searching • use up degrees of freedom rapidly • little or no information left to determine significance

  7. Cluster Analysis (Curse of Dimensionality) • Find structure in data • Many cluster results with same goodness-of-fit • Deciding among the models is impossible.

  8. Classification Models (Curse of Dimensionality) • Predict group membership (e.g., tumor versus normal) • Three broad categories • geometric methods (discriminant analysis, CART) • probabilistic methods (Bayesian) • algorithmic methods (neural networks, k-NN) • Require training/validation datasets

  9. Other Methods (Curse of Dimensionality) • Resampling (cross validation, bootstrapping), model averaging (bagging), or iterative re-weighting (boosting) • Multiple testing adjustment such as false discovery rate or permutation testing
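
The slide names false discovery rate adjustment without fixing a procedure. As a minimal sketch, assuming the Benjamini-Hochberg step-up procedure (one common FDR method, not necessarily the one used in the talk), adjusted p-values can be computed as follows. The function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Made-up p-values for four probes, for illustration only.
raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)
```

Probes whose adjusted value falls below the chosen FDR level (for example 0.05) would be reported.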

  10. Mantel Statistics • Transform standard NxP data matrices into NxN subject pairwise distances or similarities • Instead of analyzing NxP data matrix (P >> N) avoid the curse of dimensionality problem and analyze the NxN matrix
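
The transformation from an NxP data matrix to an NxN matrix of subject pairwise distances might be sketched as below. Euclidean distance is an assumption (the slide allows any distance or similarity), and the function name, the `genes` argument, and the toy values are hypothetical.

```python
import math

def pairwise_distances(data, genes=None):
    """Return the N x N Euclidean distance matrix between the rows of `data`.

    `genes` optionally restricts the calculation to a subset of column indices.
    """
    if genes is None:
        genes = range(len(data[0]))
    genes = list(genes)
    n = len(data)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.sqrt(sum((data[i][g] - data[j][g]) ** 2 for g in genes))
            dist[i][j] = dist[j][i] = d  # distances are symmetric
    return dist

# Toy data: N = 3 subjects (rows), P = 4 genes (columns); values are made up.
expression = [
    [0.0, 0.0, 0.0, 0.0],
    [3.0, 4.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 2.0],
]
D_full = pairwise_distances(expression)               # all P genes
D_sub = pairwise_distances(expression, genes=[0, 1])  # a 2-gene subset
```

Whatever the value of P, the result is always NxN, which is what lets later steps work on subjects rather than on thousands of probes.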

  11. Mantel Statistics Shannon (2008) Cluster Analysis, in Handbook of Statistics, Vol. 27, eds. Rao, Rao, Miller.

  12. Mantel Statistics

  13. Mantel Statistics • Signal + Noise Genes • Signal Genes Only

  14. Mantel Statistics Correlating DP with Dk<<P avoids the curse of dimensionality! A positive Mantel correlation indicates that the genes in Dk<<P contain the same information as the genes in DP. Shannon, Watson, et al. (2002). Mantel statistics to correlate gene expression levels from microarrays with clinical covariates. Genetic Epidemiology 23: 87-96.
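
A Mantel correlation between two NxN distance matrices (here, the matrix built from all P genes and the matrix built from a subset of k genes) can be sketched as the Pearson correlation of their off-diagonal entries. Pearson is an assumption, since a rank correlation is also used in practice, and the function names and toy matrices are hypothetical.

```python
import math

def upper_triangle(matrix):
    """Flatten the entries above the diagonal of a square matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def mantel_correlation(d1, d2):
    """Correlate the pairwise distances held in two N x N matrices."""
    return pearson(upper_triangle(d1), upper_triangle(d2))

# Toy matrices for N = 3 subjects; values are made up.
d_all = [[0, 1, 2], [1, 0, 3], [2, 3, 0]]
d_same_order = [[0, 2, 4], [2, 0, 6], [4, 6, 0]]      # same ordering of pairs
d_reversed = [[0, 3, 2], [3, 0, 1], [2, 1, 0]]        # reversed ordering

r_same = mantel_correlation(d_all, d_same_order)
r_reversed = mantel_correlation(d_all, d_reversed)
```

Only the upper triangle is used because the matrices are symmetric with a zero diagonal, so the remaining entries carry no extra information.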

  15. GA-Mantel • Search algorithm to find signal genes • Solution representation • list of genes (10 123 456 798 835 888 923) • binary vector {0000100110000….00010} • Each solution maps to a Mantel correlation value • Assumption: the larger the correlation the more signal genes in the solution • Selection keeps solutions with high Mantel correlation Grefenstette, Thompson, Shannon, and Steinmeyer (2005): Genetic algorithms for feature selection using Mantel correlation scoring. Interface: Classification and Clustering 37th Symposium on the Interface. St. Louis, MO
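
The two solution representations on this slide carry the same information and can be converted into each other. The sketch below assumes 0-based gene indices and a hypothetical total of 1,000 genes; neither detail comes from the slides.

```python
def to_binary(gene_list, n_genes):
    """Encode a list of selected gene indices as a 0/1 vector of length n_genes."""
    vector = [0] * n_genes
    for g in gene_list:
        vector[g] = 1
    return vector

def to_gene_list(vector):
    """Decode a 0/1 vector back into the sorted list of selected gene indices."""
    return [i for i, bit in enumerate(vector) if bit == 1]

# Gene list from the slide; n_genes=1000 is an assumed array size.
genes = [10, 123, 456, 798, 835, 888, 923]
vector = to_binary(genes, n_genes=1000)
round_trip = to_gene_list(vector)
```

The list form is compact when few genes are selected out of tens of thousands, while the binary form makes bitwise recombination and mutation straightforward.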

  16. Recombination

  17. Mutation
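
The recombination and mutation slides carry only figures in this transcript, so the operators actually used are not recoverable. As a generic sketch for fixed-length gene subsets, and not necessarily the authors' operators, a child can draw its genes from the union of two parents, and a mutation can swap one gene for a gene outside the subset. All names and parameter values are hypothetical.

```python
import random

def recombine(parent_a, parent_b, rng):
    """Build a child of the same length as parent_a from the parents' union."""
    pool = sorted(set(parent_a) | set(parent_b))
    return sorted(rng.sample(pool, len(parent_a)))

def mutate(subset, n_genes, rng):
    """Replace one randomly chosen gene with a gene not currently in the subset."""
    current = set(subset)
    outside = [g for g in range(n_genes) if g not in current]
    child = list(subset)
    child[rng.randrange(len(child))] = rng.choice(outside)
    return sorted(child)

rng = random.Random(0)          # fixed seed so the example is repeatable
parent_a = [2, 5, 9, 14]
parent_b = [5, 7, 14, 18]
child = recombine(parent_a, parent_b, rng)
mutant = mutate(child, n_genes=20, rng=rng)
```

Both operators keep the subset length fixed and never duplicate a gene, which matches the fixed-length searches described for the Golub data.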

  18. Gene Subset Selection • Given • a data set comprising N microarray experiments with g genes • Find: • a subset of genes that captures relevant relationships among the experiments • Goal: • reduce data for further analysis • identify meaningful biological markers for diagnosis

  19. Genetic Algorithm 1. Randomly generate an initial population. 2. Do until a stopping criterion is met: select individuals to be parents (biased by fitness); produce offspring by recombination/mutation; select individuals to die (biased by fitness). End Do. 3. Return a result.
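
The three steps of the genetic algorithm can be sketched as a small steady-state loop. Several details are assumptions made only for illustration: binary tournaments for choosing parents and for choosing who dies, a 50% mutation rate, a fixed number of generations as the stopping criterion, and a toy fitness function standing in for the Mantel correlation.

```python
import random

def run_ga(fitness, n_genes, subset_len, pop_size, generations, seed=0):
    rng = random.Random(seed)
    # 1. Randomly generate an initial population of gene subsets.
    population = [sorted(rng.sample(range(n_genes), subset_len))
                  for _ in range(pop_size)]
    # 2. Loop until the stopping criterion (a generation budget) is met.
    for _ in range(generations):
        # Select parents, biased by fitness (winner of a random pair).
        parent_a = max(rng.sample(population, 2), key=fitness)
        parent_b = max(rng.sample(population, 2), key=fitness)
        # Recombination: the child samples genes from the parents' union.
        pool = sorted(set(parent_a) | set(parent_b))
        child = rng.sample(pool, subset_len)
        # Mutation: sometimes swap one gene for a gene outside the child.
        if rng.random() < 0.5:
            outside = [g for g in range(n_genes) if g not in child]
            child[rng.randrange(subset_len)] = rng.choice(outside)
        child.sort()
        # Select an individual to die, biased by fitness (loser of a random pair).
        loser = min(rng.sample(range(pop_size), 2),
                    key=lambda i: fitness(population[i]))
        population[loser] = child
    # 3. Return the best solution found.
    return max(population, key=fitness)

def toy_fitness(subset):
    """Hypothetical stand-in for the Mantel score: genes 0-4 count as signal."""
    return sum(1 for g in subset if g < 5)

best = run_ga(toy_fitness, n_genes=20, subset_len=3, pop_size=10, generations=200)
```

In the GA-Mantel setting, `toy_fitness` would be replaced by the Mantel correlation between the full-data distance matrix and the distance matrix of the candidate subset.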

  20. Fitness Evaluation for Gene Selection • Calculate DP using all genes • For each Subset(k) in current population: • Calculate Dk<<P • Correlate DP with Dk<<P • Use Mantel Correlation as fitness to select next population of solutions • Permute to compute P-values
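
The final bullet, permuting to compute P-values, can be sketched as a Mantel permutation test: the subject labels of one distance matrix are shuffled (rows and columns together), the correlation is recomputed, and the P-value is the share of permutations that score at least as high as the observed value. The one-sided test, the count of 999 permutations, and all names are assumptions.

```python
import math
import random

def _pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def mantel_permutation_test(d1, d2, n_perm=999, seed=0):
    """Return (observed Mantel correlation, one-sided permutation p-value)."""
    n = len(d1)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    x = [d1[i][j] for i, j in pairs]
    observed = _pearson(x, [d2[i][j] for i, j in pairs])
    rng = random.Random(seed)
    order = list(range(n))
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(order)
        # Relabel subjects in d2: rows and columns move together.
        permuted = [d2[order[i]][order[j]] for i, j in pairs]
        if _pearson(x, permuted) >= observed:
            hits += 1
    # Add one to numerator and denominator so the p-value is never zero.
    return observed, (hits + 1) / (n_perm + 1)

# Toy example: six subjects placed on a line at made-up positions.
positions = [0, 1, 3, 7, 15, 31]
D = [[abs(a - b) for b in positions] for a in positions]
r, p = mantel_permutation_test(D, D)
```

Permuting subject labels rather than individual matrix entries preserves the dependence among distances that share a subject, which is why ordinary correlation tables do not apply to Mantel statistics.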

  21. GA on Artificial Data • Simulated data: • 100 experiments with 10,000 genes • 100 signal genes • 9900 noise genes • Two groups • Group 1 has signal genes sampled from N(0, 1) • Group 2 has signal genes sampled from N(1, 1) • GA Parameters • population size 200 • generations 200 • Outcome measures (averaged over 10 runs of the GA) • prevalence (signal, noise) – number of signal and noise genes in GA answer • correlation (signal, noise) – correlation of best subset distance matrix with the ‘full’ distance matrix • coverage - number of signal genes identified over all GA runs
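
The simulated data set can be generated along the following lines. The slide does not give the group sizes, the noise distribution, or the positions of the signal genes, so the sketch assumes two groups of 50, noise genes drawn from N(0, 1) in both groups, and signal genes in the first 100 columns.

```python
import random

def simulate(n_experiments=100, n_genes=10_000, n_signal=100, seed=0):
    """Return (data, labels): rows are experiments, columns are genes."""
    rng = random.Random(seed)
    half = n_experiments // 2
    labels = [1] * half + [2] * (n_experiments - half)   # assumed 50/50 split
    data = []
    for group in labels:
        shift = 0.0 if group == 1 else 1.0               # N(0,1) vs N(1,1)
        signal = [rng.gauss(shift, 1.0) for _ in range(n_signal)]
        noise = [rng.gauss(0.0, 1.0) for _ in range(n_genes - n_signal)]
        data.append(signal + noise)                      # signal genes first
    return data, labels

data, labels = simulate()
```

With a mean shift of only one standard deviation per signal gene, no single gene separates the groups cleanly, which is what makes the subset search non-trivial.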

  22. GA on Artificial Data Length = 30 Prevalence: mean number of signal genes = 22.9 (0.7) 76.3% (std 0.53%) Correlation: mean rho for best subsets = 0.787 (0.009) p-value < 0.0001 Coverage: total signal genes identified across 10 runs = 65/100 Observation: solutions tend to converge to similar subsets. The same 4 signal genes appear in 90% of runs.

  23. GA on Golub Data Set • Data set: Golub training set (38 x 7129) • Two Groups: • 27 samples from ALL patients • 11 samples from AML patients • GA searched for subsets of fixed length (10 to 50) • population = 200, generations = 200 • Mantel correlation tends to increase with subset size

  24. Significant Feature Subsets • Clustering of samples using all genes • Clustering of samples using 50 genes from GA

  25. Letting GA Select Subset Length • Data set: Golub training set (38 x 7129) • GA searched over variable length subsets (min=5 max=50) • Fitness penalty = d * length / 50 • population = 200, generations = 200 • Tradeoff between length of subsets and correlation score
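
The length penalty on this slide can be sketched as below. The slide gives only the penalty term d * length / 50; subtracting it from the Mantel correlation, and the value d = 0.1, are assumptions made for illustration.

```python
def penalized_fitness(rho, length, d=0.1, max_length=50):
    """Mantel correlation minus a penalty that grows linearly with subset length."""
    return rho - d * length / max_length

# Two hypothetical candidates: a long subset with a slightly higher correlation
# and a short subset with a slightly lower one.
long_score = penalized_fitness(rho=0.80, length=50)
short_score = penalized_fitness(rho=0.78, length=10)
```

Under these made-up numbers the shorter subset scores higher (about 0.76 against about 0.70), which illustrates the tradeoff between subset length and correlation score.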

  26. Data Reduction • GA can also be used to find feature subsets that minimize rho • population = 200 • length = 50 • data set = Golub • GA finds subsets with rho = 0 within 50 generations • Observation: GA appears to repeatedly converge to the same regions of feature space • In 50 runs, 954/1546 (61%) of "noise" genes appear more than once in feature sets

  27. GA in Experimental Data Analysis • Graft Versus Host Disease (GVHD) in bone marrow transplantation (leukemia) • T-cells in the transplanted bone marrow see the recipient as foreign and initiate an immune response that destroys host organs • Regulatory T-cells (Treg) suppress the immune response • Choi and DiPersio are studying the genetic mechanisms of Treg regulation

  28. Mouse Array Experiment
      GROUP  TREATMENT           ARRAYS
      1      Naïve Treg          dec1, dec5
      2      Activated Treg      dec2, dec6, dec10
      3      PBST (Control)      dec3, dec7, dec11
      4      Decitabine treated  dec4, dec8, dec12
      ~$1,000 per array in total costs: $12,000 worth of data, including dec9, which did not work

  29. Mouse Array Experiment • Identify probes (genes) with similar mRNA levels between groups (gene by phenotype analysis)

  30. Act+Dec Vs Naïve Vs Control

  31. Naïve+Dec Vs Act Vs Control

  32. Summary • GA-Mantel effective at identifying signal genes • Longer gene subsets associated with higher scores • tradeoff: higher correlations vs. smaller subsets • requires constraining growth of subsets in GA • GA effective at identifying noise genes • GA-Mantel can find genes associated with phenotype

  33. Future Directions • RFA CA-08-005 (under review) • Optimize algorithm to improve coverage of solution space • Multiple solutions • Combine solutions (weak hierarchies) • Lung disease R01 (to be submitted) • Microarrays to identify disease subgroups across the bronchitis/emphysema continuum

  34. Weak Hierarchies Day, McMorris (2003) Axiomatic Consensus Theory in Group Choice and Biomathematics, SIAM Frontiers in Applied Mathematics, Philadelphia, PA.
