Bioinformatics Multifactor Dimensionality Reduction Kristel Van Steen, PhD, ScD (kristel.vansteen@ulg.ac.be) Université de Liège - Institut Montefiore 2008-2009
Outline • Setting the scene • Analysis methods for gene-gene interactions • Traditional vs non-Traditional • MDR, MB-MDR, FAM-MDR • The future: work in progress
Genetic Architecture of Disease • The number of genes that impact disease susceptibility • The distribution of alleles and genotypes at those genes • The manner in which the alleles and genotypes impact disease susceptibility (Weiss 1993)
Complications in disentangling? There are likely to be many susceptibility genes each with combinations of rare and common alleles and genotypes that impact disease susceptibility primarily through non-linear interactions with genetic and environmental factors
Terminology: Epistasis • Does evidence of statistical epistasis necessarily imply genetical or biological epistasis? (Moore 2004)
Analysis Methods Traditional vs Non-Traditional
Traditional methods involving single markers have limited use and more advanced and efficient methods are needed to identify gene interactions and epistatic patterns of susceptibility
Alternative Methods • Tree-based methods: • Recursive Partitioning (Helix Tree) • Random Forests (R, CART) • Pattern recognition methods: • Symbolic Discriminant Analysis (SDA) • Mining association rules • Neural networks (NN) • Support vector machines (SVM) • Data reduction methods: • DICE (Detection of Informative Combined Effects) • MDR (Multifactor Dimensionality Reduction) • Logic regression … (e.g., Onkamo and Toivonen 2006)
Gene Interaction Models • Non-parametric: • Appealing because no distributional assumptions on genotype-phenotype effect • Parametric: • Appealing because easy adjustment for confounding variables and main effects • Severe limitations in presence of too many independent variables in relation to number of observed outcome events
Out-of-control curse? • ~500,000 SNPs span 80% of common variation in genome (HapMap) • [Figure: counts rising from 5 x 10^5 through 1 x 10^11, 2 x 10^16 and 3 x 10^21 to 2 x 10^26]
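The counts on this slide appear to match the number of ways of choosing 1 to 5 SNPs out of roughly 500,000. That reading is an assumption, not something the slide states. A minimal sketch to check the orders of magnitude (the function name is illustrative):

```python
from math import comb

N_SNPS = 500_000  # approximate SNP count quoted on the slide


def n_combinations(n_snps: int, order: int) -> int:
    """Number of distinct SNP sets of the given size."""
    return comb(n_snps, order)


# Print the count for single SNPs up to 5-SNP combinations.
for order in range(1, 6):
    print(f"{order}-SNP combinations: {n_combinations(N_SNPS, order):.1e}")
```

Each additional SNP in the combination multiplies the search space by roughly 10^5, which is why exhaustive higher-order searches quickly become impractical.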
Curse of Dimensionality • Bellman R (1961) Adaptive control processes: A guided tour. Princeton University Press: “... Multidimensional variational problems cannot be solved routinely ... . This does not mean that we cannot attack them. It merely means that we must employ some more sophisticated techniques.”
Limitation of Regression • Having too many independent variables in relation to the number of observed outcome events • Assuming 10 bi-allelic loci: # of Parameters =
Limitation of Regression • Fewer than 10 outcome events per independent variable can lead to biased estimates of the regression coefficients and to an increase in Type I and Type II errors. • Rule of thumb: # of parameters P ≤ min(n_case, n_control)/10 - 1 • For 200 cases and 200 controls, this formula suggests that no more than 19 (= 200/10 - 1) parameters should be estimated in a logistic regression model.
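To make the mismatch concrete, the sketch below compares the rule-of-thumb limit with the size of a fully saturated model for 10 bi-allelic loci. The parameter count is an assumption, since the slide's own formula is not reproduced here: it supposes each locus is coded with two dummy variables (three genotypes) and that all interaction orders are included. Function names are illustrative.

```python
from math import comb


def max_parameters(n_case: int, n_control: int, events_per_parameter: int = 10) -> int:
    """Rule-of-thumb upper limit: min(n_case, n_control) / 10 - 1."""
    return min(n_case, n_control) // events_per_parameter - 1


def n_saturated_parameters(n_loci: int, dummies_per_locus: int = 2) -> int:
    """Main-effect plus interaction terms of every order (intercept excluded).

    Assumption: two dummy variables per bi-allelic locus.
    """
    return sum(comb(n_loci, k) * dummies_per_locus ** k for k in range(1, n_loci + 1))


print(max_parameters(200, 200))      # 19
print(n_saturated_parameters(10))    # 59048, i.e. 3**10 - 1
```

Under these assumptions the saturated model needs several thousand times more parameters than 200 cases and 200 controls can support.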
Multifactor Dimensionality Reduction (MDR) to tackle the dimensionality problem of interaction detection
MDR for Interaction Detection • MDR creates a one-dimensional multi-locus genotype variable (high and low risk), which is evaluated for its ability to classify and predict disease status through cross-validation and permutation testing. (Ritchie et al 2001; Hahn et al 2003)
MDR Steps (Ritchie et al 2003) • Split the data for cross-validation: 9/10 training data, 1/10 test data • 10 cross-validation runs give 10 best models • The model with minimum PE is the best n-locus model
Two Measures for Selection of Best n-locus model • Misclassification error: The proportion of incorrect classifications in the training set. • Prediction error (PE): The proportion of incorrect predictions in the test set.
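The two error measures can be illustrated with a minimal sketch of one training/test split. It assumes the usual MDR labelling rule (a genotype cell is high risk when its case:control ratio in the training data reaches a threshold, here 1.0) and treats cells unseen in training as low risk. Both choices, the function names and the toy data are assumptions for illustration only.

```python
from collections import Counter


def fit_high_risk_cells(genotypes, status, threshold=1.0):
    """Return the set of multi-locus genotype cells labelled high risk.

    genotypes: list of tuples, one multi-locus genotype per subject
    status:    list of 1 (case) / 0 (control)
    """
    cases, controls = Counter(), Counter()
    for g, y in zip(genotypes, status):
        (cases if y == 1 else controls)[g] += 1
    high = set()
    for cell in set(cases) | set(controls):
        # case:control ratio >= threshold, written without division
        if cases[cell] >= threshold * controls[cell]:
            high.add(cell)
    return high


def error_rate(high_cells, genotypes, status):
    """Proportion of subjects whose status disagrees with their cell label."""
    wrong = sum((g in high_cells) != (y == 1) for g, y in zip(genotypes, status))
    return wrong / len(status)


# Toy data for two SNPs (hypothetical values).
train_g = [(0, 0), (0, 0), (0, 1), (0, 1), (1, 0), (1, 0), (1, 1), (1, 1)]
train_y = [0, 0, 1, 1, 1, 0, 0, 0]
test_g = [(0, 1), (1, 1), (1, 0), (0, 0)]
test_y = [1, 1, 1, 0]

high = fit_high_risk_cells(train_g, train_y)
print(error_rate(high, train_g, train_y))  # misclassification error: 0.125
print(error_rate(high, test_g, test_y))    # prediction error (PE): 0.25
```

In a full analysis this would be repeated over every cross-validation split and every candidate SNP combination.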
Best Multi-factor Models Best 2-factor model Best 3-factor model Best 4-factor model Best 5-factor model Best 6-factor model . . Best n-factor model
Model Selection and Evaluation • Among the best n-factor models, the best model is: • The model with the minimum average PE. • The model with the maximum average cross-validation consistency (CVC). • Rule of parsimony: If there is a tie, select the smaller model.
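One way to encode these selection rules is sketched below. The slide does not say how PE and CVC are weighed against each other, so the ordering used here (PE first, then CVC, then model size) is an assumption, as are the dictionary keys and the example values.

```python
def select_best_model(models):
    """Pick one model from the best n-factor models.

    models: list of dicts with keys 'loci' (tuple of SNP indices),
            'pe' (average prediction error) and 'cvc' (average CVC).
    Assumed ordering: lowest PE, then highest CVC, then fewest loci.
    """
    return min(models, key=lambda m: (m["pe"], -m["cvc"], len(m["loci"])))


# Hypothetical candidates: the first two tie on PE and CVC.
candidates = [
    {"loci": (1, 6, 9), "pe": 0.20, "cvc": 10},
    {"loci": (1, 6), "pe": 0.20, "cvc": 10},
    {"loci": (2, 3, 4, 7), "pe": 0.31, "cvc": 6},
]

best = select_best_model(candidates)
print(best["loci"])  # (1, 6): the smaller of the two tied models
```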
Significance of the Final Model • Via permutation tests: • Randomize the case and control labels in the original dataset multiple times to create a set of permuted datasets. • Run MDR on each permuted dataset. • The maximum CVC and minimum PE identified for each permuted dataset are saved and used to create an empirical distribution for estimation of a P-value.
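The permutation procedure can be sketched generically. In the code below, `statistic` stands in for a full MDR run that returns the minimum PE; the toy statistic, the add-one correction and all names are assumptions made for illustration.

```python
import random


def permutation_p_value(statistic, genotypes, status, n_permutations=1000, seed=0):
    """Empirical p-value for a statistic where smaller is better (e.g. PE)."""
    rng = random.Random(seed)
    observed = statistic(genotypes, status)
    shuffled = list(status)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # randomize case/control labels
        if statistic(genotypes, shuffled) <= observed:
            hits += 1
    # Add-one correction (an assumed convention) avoids a p-value of exactly 0.
    return (hits + 1) / (n_permutations + 1)


def toy_error(genotypes, status):
    """Hypothetical stand-in for MDR: error of predicting 'case' when genotype == 1."""
    wrong = sum((g == 1) != (y == 1) for g, y in zip(genotypes, status))
    return wrong / len(status)


genotypes = [1] * 10 + [0] * 10
status = [1] * 10 + [0] * 10  # perfectly associated toy data
p = permutation_p_value(toy_error, genotypes, status)
print(p)
```

For a CVC-based test the comparison would be reversed, since larger CVC values are better.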
Example: through simulation • 200 cases and 200 controls • 10 SNPs: 1, 2, 3, …, 10 • Disease etiology due to interaction between SNP 1 and SNP 6 • Over 10 CVs and 10 runs
Advantages of MDR • Simultaneous detection of multiple genetic loci associated with a discrete clinical endpoint in the absence of main effects. • Non-parametric: Overcomes the “curse of dimensionality” faced by logistic regression models. • Three genotype groups are considered separately • Non-linear interactions between multiple polymorphisms in the absence of independent effects • Low false positive rates
Disadvantages of MDR • Need to introduce parametrics? • MDR in its initial layout cannot deal with main effects / confounding factors / non-dichotomous outcomes: • GMDR / OR-MDR • Low power in the presence of genetic heterogeneity
Power Simulation Set-Up • No noise • Single noise sources: 5% genotyping error (GE), 5% missing data (MS), 50% phenocopy (PC), 50% genetic heterogeneity (GH) • Two sources combined: GE+MS, … (6 models) • Three sources combined: GE+MS+PC, … (4 models) • All four combined: GE+MS+PC+GH • Total 16 models
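The counts on this slide (6, 4 and a total of 16) are consistent with taking every subset of the four noise sources. A short sketch under that assumption, with illustrative names:

```python
from itertools import combinations

NOISE_SOURCES = ("GE", "MS", "PC", "GH")


def noise_scenarios(sources=NOISE_SOURCES):
    """All subsets of the noise sources, from 'no noise' to all four combined."""
    return [combo
            for size in range(len(sources) + 1)
            for combo in combinations(sources, size)]


scenarios = noise_scenarios()
print(len(scenarios))                              # 16
print(sum(1 for s in scenarios if len(s) == 2))    # 6 two-source models
print(sum(1 for s in scenarios if len(s) == 3))    # 4 three-source models
```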
Disadvantages of MDR • Noteworthy: • Model selection on the basis of prediction accuracy • One single higher-order interaction model is proposed • Some important interactions could be missed due to pooling too many cells together
Model-Based MDR (MB-MDR) • MDR: X = {H, L} (high risk, low risk) • MB-MDR: X = {H, L, O} (high risk, low risk, no evidence)
MB-MDR in its simplest form • Step 1: New risk cell identification via an association test on each genotype cell c_j • Parametric or non-parametric test of association (OR_j) • Step 2: Test X on Y • Parametric or non-parametric (OR_H, OR_L)
MB-MDR in its simplest form • Step 3: assess significance • W = [b/se(b)]^2, b = ln(OR) • Adjust for number of combined cells in high and low risk category
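As a worked illustration of the Wald statistic W = [b/se(b)]^2 with b = ln(OR), the sketch below computes it from a 2x2 table. The standard-error formula (square root of the summed reciprocal cell counts) and the chi-square reference distribution with 1 degree of freedom are standard choices assumed here, not taken from the slide. The adjustment for the number of combined cells is not implemented.

```python
from math import erfc, log, sqrt


def wald_statistic(cases_in, controls_in, cases_out, controls_out):
    """Wald statistic for ln(OR) from a 2x2 table.

    'in' = subjects in the risk category being tested, 'out' = all others.
    Assumes no cell count is zero.
    """
    beta = log((cases_in * controls_out) / (controls_in * cases_out))
    se = sqrt(1 / cases_in + 1 / controls_in + 1 / cases_out + 1 / controls_out)
    return (beta / se) ** 2


def chi2_1df_p_value(w):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return erfc(sqrt(w / 2))


w = wald_statistic(30, 10, 10, 30)  # hypothetical counts, OR = 9
print(round(w, 2), chi2_1df_p_value(w))
```

A table with zero counts would need a continuity correction or an exact test before this formula could be applied.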
Improve power in the presence of heterogeneity • Power of MDR compared to MB-MDR under the aforementioned scenarios (Calle, Urrea, Malats, Van Steen 2008, submitted)
MB-MDR in its simplest form • Step 4: Adjusted p-values need to be corrected for multiple testing
From MB-MDR to FAM-MDR • Extension to families • Perform polygenic analysis using the complete pedigree structure but ignore marker data. • Derive residuals from this model (gives rise to independent quantitative “new” traits) • Submit to MB-MDR • Effect sizes can be derived using measured (multi-locus) genotype models on the selected combinations of markers. • Adjusted p-values need to be corrected for multiple testing
Motivation for FAM-MDR • The idea of removing “family trend due to genetic inheritance” was also adopted in the GRAMMAR approach of Aulchenko and colleagues.
FBAT? “For each particular method there are situations for which it is particularly well suited, and others where it performs badly compared to the best that can be done with that data… However, it is seldom known in advance which procedure will perform best or even well for any given problem.” (Hastie et al 2001)
Acknowledgements • Helpful discussions: Marylyn Ritchie and co-workers (USA), Malu Calle and Victor Urrea (Spain) • PhD students on the project: Jestinah Mahachie (e.g., MDR and longitudinal measurements), Vaness De Wit (e.g., MDR and multi-allelic markers; sparse cell management), Lizzy De Lobel (e.g., pre-screening algorithms) • Post-doc on the project: Tom Cattaert (e.g., FAM-MDR simulations)