840 likes | 999 Views
Bioinformatics R for Bioinformatics PART II Kristel Van Steen, PhD, ScD (kristel.vansteen@ulg.ac.be) Université de Liege - Institut Montefiore 2008-2009. Simplified Epistasis Testing.
E N D
BioinformaticsR for BioinformaticsPART IIKristel Van Steen, PhD, ScD(kristel.vansteen@ulg.ac.be)Université de Liege - Institut Montefiore2008-2009
SimplifiedEpistasisTesting • We shall now use logistic regression in R to test for epistatic interactions between locus 3 and another unlinked locus (locus 5). An epistatic interaction means that the combined effect of locus 3 and 5 is greater than the product (on the odds scale) or the sum (on the log odds scale) of the locus 3 and locus 5 individual effects. First get rid of the data in the memory and read in the new data. This data is the same as the original pedfile, but with an additional column giving genotype at (unlinked) locus 5: detach(casecon) newcasecon <- read.table("newcasecondata.txt", header=T)attach(newcasecon) You can look at the data by typing fix(newcasecon)
Cordellpractical(seestatisticalgenetics class) • Nextcreateappropriategenotype and case variables: case <- affected-1 g3 <- genotype(loc3_1, loc3_2) g5 <- genotype(loc5_1, loc5_2) The individualeffectsat locus 3 and 5 are nowcoded by the variables g3 and g5. Wecan test for association ateach locus separately: gcontrasts(g3) <- "genotype" logit (case ~ g3) anova(logit (case ~ g3)) gcontrasts(g5) <- "genotype" logit (case ~ g5) anova(logit (case ~ g5))
In order to investigate epistasis, it is more convenient to create new variables that code numerically for the number of copies of allele 2 in each genotypes count3<-allele.count(g3,2) count5<-allele.count(g5,2) Check you understand how variables count3 and count5 relate to g3 and g5 by typing g3 count3 g5 count5
We then create a variable that codes for the combined effect of locus 3 and 5 as follows: combo<-10*count3+count5 Check you understand how the 'combo' variable relates to g3 and g5 by typing g3 g5 combo Now we need to code each of these variables as 'factors' which means we simply consider the numeric codes to act as labels for the different categories rather than having numeric meaning: fact3<-factor(count3) fact5<-factor(count5) factcombo<-factor(combo)
Check that the analysis with the 'factors' gives the same results as you found previously with the genotype variables: anova(logit (case ~ fact3)) anova(logit (case ~ fact5)) Now test whether there is significant epistasis by typing anova(logit(case ~ fact3 + fact5 + factcombo)) 1-pchisq(9.59,4) This first fits the individual locus factors, and then adds in the extra effect of looking at the model with epistasis included (i.e. a model with 9 estimated parameters corresponding to the 9 genotype combinations), and tests the difference between the models. You should get a chi-squared of 9.59 on 4 df with p value 0.048 i.e. there is marginal evidence of epistasis. The above test is valid for testing for epistasis between linked or unlinked loci, although it does not allow for haplotype (phase) effects between linked loci. A more powerful test for epistasis between UNLINKED LOCI ONLY is to use 'case-only' analysis and test whether the genotypes at one locus predict those at the other, in the cases alone. This is only valid at unlinked loci, because at linked loci we expect genotypes at one locus to predict those at the other (even in controls) due to linkage disequilibrium.
To do this, we can use a chi squared test to look for correlation (association) between the loci within the case and control groups separately. First we need to set up 2 new vectors of genotypes for loci 3 and 5, using only the cases. To do this, we can take advantage of the fact that the data has been ordered in such a way that cases are the first 384 observations. (Check this by typing case or fix(newcasecon) ). So we can create genotype vectors just for the cases using the following commands caseg3<-g3[1:384] caseg5<-g5[1:384] Take a look at the vectors you have created by typing caseg3 caseg5 Now do a chi-squared test on the genotype variables to see if they are correlated with each other: table(caseg3,caseg5) chisq.test(caseg3,caseg5)
You should find much more significant evidence of epistasis (p value 0.0018) than you did using logistic regression. This is not surprising as the case-only test of interaction is a more powerful test. However, the case-only test does rely on the assumption that the two genotype variables g3 and g5 are uncorrelated in the general population. Strictly speaking, we cannot test this assumption as we do not have a population-based control sample (our controls are all unaffected). However, if the disease is rare, our controls should be reasonably close to an unselected sample. So we can use them to see if the genotype variables g3 and g5 are uncorrelated in the control population: contg3<-g3[385:1056] contg5<-g5[385:1056] contg3 contg5 table(contg3,contg5) chisq.test(contg3,contg5) You should find a non-significant p value (p=0.99). This suggests that the case-only analysis we did is valid, so there is indeed some reasonable (p=0.002) evidence for statistical interaction between these loci.
Running the command lines in R to test for epistasis