600 likes | 677 Views
Explore strategies for finding causal mutations through case-control studies, genetic correlations, and mutation analysis. Learn about SNP correlations, genealogies, recombination, and statistical tests for association mapping.
E N D
Association mapping for mendelian, and complex disorders Bafna, BfB
UG Bioinformatics specialization at UCSD Bafna, BfB
Abstraction of a causal mutation Bafna, BfB
Looking for the mutation in populations A possible strategy is to collect cases (affected) and control individuals, and look for a mutation that consistently separates the two classes. Next, identify the gene. Bafna, BfB
Looking for the causal mutation in populations Problem 1: many unrelated common mutations, around one every 1000bp Case Control Bafna, BfB
Case Control Bafna, BfB
Looking for the causal mutation in populations Problem 2: We may not sample the causal mutation. Case Control Bafna, BfB
How to hunt for disease genes • We are guided by two simple facts governing these mutations • Nearby mutations are correlated • Distal mutations are not Case Control Bafna, BfB
This lecture • Nearby mutations are correlated • Distal mutations are not Case Control • The bottom line: How do these facts help in finding disease genes? • The genetics: why should this happen? • The computation • Challenge of complex diseases. Bafna, BfB
The basics of association mapping 0 0 Case Control 0 • Sample a population of individuals at variant locations across the genome. Typically, these variants are single nucleotide polymorphisms (SNPs). • Create a new bi-allelic variant corresponding to cases and controls, and test for correlations. • By our assumptions, only the proximal variants will be correlated. • Investigate genes near the correlated variants. 0 1 1 1 1 Bafna, BfB
So, why should the proximal SNPs be correlated, and distal SNPs not? Bafna, BfB
A bit of evolution Time • Consider a fixed population (of chromosomes) evolving in time. • Each individual arises from a unique, randomly chosen parent from the previous generation Bafna, BfB
Genealogy of a chromosomal population Time Current (extant) population Bafna, BfB
Adding mutations Infinite sites assumption: A mutation occurs at most once at a site. Bafna, BfB
SNPs The collection of acquired mutations in the extant population describe the SNPs Bafna, BfB
Fixation and elimination • Not all mutations survive. • Some mutations get fixed, and are no longer polymorphic Bafna, BfB
Removing extinct genealogies Bafna, BfB
Removing fixed mutations Bafna, BfB
The coalescent Bafna, BfB
Disease mutation • We drop the ancestral chromosomes, and place the mutations on the internal branches. Bafna, BfB
Disease mutation • A causal mutation creates a clade of affected descendants. Bafna, BfB
Disease mutation • Note that the tree (genealogy) is hidden. • However, the underlying tree topology introduces a correlation between each pair of SNPs Bafna, BfB
What have we learnt? • The underlying genealogy creates a correlation between SNPs. • By itself, this is not sufficient, because distal SNPs might also be correlated. • Fortunately, for us the correlation between distal SNPs is quickly destroyed. Bafna, BfB
Recombination Bafna, BfB
Recombination • In our idealized model, we assume that each individual chromosome chooses two parental chromosomes from the previous generation Bafna, BfB
Multiple recombination change the local genealogy Bafna, BfB
A bit of evolution • Proximal SNPs are correlated, distal SNPs are not. • The correlation (Linkage disequilibirium) decays rapidly after 20-50kb Bafna, BfB
Basic statistics Bafna, BfB
Testing for correlation • In the absence of correlation Bafna, BfB
Testing for correlation • When correlated Bafna, BfB
Assigning confidence Observed Expected X X X X Bafna, BfB
Assigning confidence Observed Expected X X X X Bafna, BfB
Assigning confidence Observed Expected X X X X Bafna, BfB
Statistical tests of association Bafna, BfB
Tests for association: Pearson Cases Controls O1 MM Mm mm O2 • Case-control phenotype: • Build a 3X2 contingency table • Pearson test (2df)= O3 O4 O5 O6 Bafna, BfB
The χ2 test Cases Controls O1 O2 MM O3 O4 Mm O5 O6 mm • The statistic behaves like a χ2 distribution. • A p-value can be computed directly Bafna, BfB
Χ2 distribution properties A related distribution is the F-distribution Bafna, BfB
Likelihood ratio • Another way to check the extremeness of the distribution is by computing a (log) likelihood ratio. • We have two competing hypothesis. Let N be the total number of observations Bafna, BfB
LLR • An LLR value close to 0, implies that the null hypothesis is true. Asymptotically, the LLR statistic also follows the chi-square distribution. Bafna, BfB
Exact test • The chi-square test does not work so well when the numbers are small. • How can we compute an exact probability of seeing a specific distribution of values in the cells? • Remember: we know the marginals (# cases, # controls, Bafna, BfB
Fischer exact test Cases Controls a b MM c d Mm e f mm • Num: #ways of getting configuration (a,b,c,d,e,f) • Den: #ways of ensuring that the row sums and column sums are fixed Bafna, BfB
Fischer exact test • Remember that the probability of seeing any specific values in the cells is going to be small. • To get a p-value, we must sum over all similarly extreme values. How? Bafna, BfB
Test for association: Fisher exact test Cases Controls a b MM c d Mm e f mm • Here P is the probability of seeing the exact count. • The actual significance is computed by summing over all such tables that are at least this extreme. Bafna, BfB
Test for association: Fisher exact test Cases Controls a b MM c d Mm e f mm Bafna, BfB
Continuous outcomes • Instead of discrete (Case/control) data, we have real-valued phenotypes • Ex: Diastolic Blood Pressure • In this case, how do we test for association Bafna, BfB
Continuous outcome ANOVA • Often, the phenotypes are not offered as case-controls but like a continuous variable • Ex: blood-pressure measurements • Question: Are the mean values of the two groups significantly different? MM mm Bafna, BfB
Two-sided t-test • For two categories, ANOVA is also known as the t-test • Assume that the variables from the two sets are drawn from Normal distributions • Different means, equal variances • Null hypothesis is that they are both from the same distribution Bafna, BfB
t-test continued Bafna, BfB
Two-sample t-test • As the variance is not known, we use an estimate S, defined by • The T-statistic is given by • Significant deviations from 0 are used to reject the Null hypothesis Bafna, BfB
Two-sample t-test (unequal variances) • If the variances cannot be assumed to be equal, we use • The T-statistic is given by • Significant deviations from 0 are used to reject the Null hypothesis Bafna, BfB