Explore the concepts of population genetics, Hardy-Weinberg equilibrium, and linkage disequilibrium in disease gene mapping. Learn about simulating population data and the Wright-Fisher model of evolution. Understand the coalescent theory and its application in analyzing genetic data.
CSE280b: Population Genetics Vineet Bafna/Pavel Pevzner www.cse.ucsd.edu/classes/sp05/cse291 Vineet Bafna
Review • Hardy-Weinberg Equilibrium • Linkage Equilibrium
Recombination and Linkage Equilibrium • In a freely mixing population, an individual chromosome randomly chooses its parent from the available pool • With unimpeded recombination (Linkage Equilibrium), the individual freely chooses its two parent chromosomes, and then freely chooses alleles from the parents • What is the probability of seeing the allele pair <1 1>? [Figure: pool of parental chromosomes with 0/1 alleles]
Measures of LD • Consider two bi-allelic sites with alleles marked 0 and 1 • Define • P_00 = Pr[allele 0 at locus 1 and allele 0 at locus 2] • P_0* = Pr[allele 0 at locus 1] • P_*0 = Pr[allele 0 at locus 2] • Linkage equilibrium if P_00 = P_0* P_*0 • D = abs(P_00 - P_0* P_*0) = abs(P_01 - P_0* P_*1) = …
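The definition of D can be checked on a small sample. The following is a minimal sketch, assuming haplotypes are given as (locus 1, locus 2) pairs of 0/1 alleles; the function name `ld_d` and the sample data are illustrative and not from the slides.

```python
from collections import Counter

def ld_d(haplotypes):
    """Return D = |P_00 - P_0* P_*0| for a list of (locus1, locus2) 0/1 pairs."""
    n = len(haplotypes)
    counts = Counter(haplotypes)
    p00 = counts[(0, 0)] / n                               # both loci carry allele 0
    p0_any = sum(1 for a, _ in haplotypes if a == 0) / n   # P_0*
    p_any0 = sum(1 for _, b in haplotypes if b == 0) / n   # P_*0
    return abs(p00 - p0_any * p_any0)

# Hypothetical samples: fully correlated loci versus independent loci.
linked = [(0, 0), (0, 0), (1, 1), (1, 1)]
independent = [(0, 0), (0, 1), (1, 0), (1, 1)]
print(ld_d(linked))        # 0.25
print(ld_d(independent))   # 0.0
```

With allele frequencies of 0.5 at both loci, 0.25 is the largest value D can take, so the first sample is in complete disequilibrium.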
LD over time • With random mating and a fixed recombination rate r between the sites, linkage disequilibrium disappears over time • Let D(t) = LD at time t • P_00(t) = (1-r) P_00(t-1) + r P_0*(t-1) P_*0(t-1) • D(t) = P_00(t) - P_0*(t) P_*0(t) = P_00(t) - P_0*(t-1) P_*0(t-1), since allele frequencies do not change between generations (HW) • D(t) = (1-r) D(t-1) = (1-r)^t D(0)
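The recurrence can be iterated numerically to see the geometric decay. This is a small sketch under the slide's assumptions (random mating, constant allele frequencies); the function name and the starting frequencies are hypothetical.

```python
def ld_over_time(p00, p0_any, p_any0, r, generations):
    """Iterate P_00(t) = (1-r) P_00(t-1) + r P_0* P_*0 and return D(0), ..., D(generations)."""
    d_values = [p00 - p0_any * p_any0]
    for _ in range(generations):
        p00 = (1 - r) * p00 + r * p0_any * p_any0
        d_values.append(p00 - p0_any * p_any0)
    return d_values

# Start from complete disequilibrium (D(0) = 0.25) with r = 0.1.
d = ld_over_time(p00=0.5, p0_any=0.5, p_any0=0.5, r=0.1, generations=20)
closed_form = 0.25 * (1 - 0.1) ** 20
print(abs(d[20] - closed_form) < 1e-12)   # the iteration matches (1-r)^t D(0)
```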
LD over distance • Assumption • Recombination rate increases linearly with distance • Let r be the (constant) recombination rate per bp per generation, and d the distance in bp between the two sites • D(t) = P_00(t) - P_0* P_*0 = (1-r)^d P_00(t-1) + [1 - (1-r)^d] P_0* P_*0 - P_0* P_*0 = (1-r)^d D(t-1) • The assumption is reasonable, but recombination rates vary from region to region, adding to complexity
LD and disease mapping • Consider a mutation that is causal for a disease. • The goal of disease gene mapping is to discover which gene (locus) carries the mutation. • One option is to consider every polymorphism and check it directly, but: • There might be too many polymorphisms • Multiple mutations (even at a single locus) may lead to the same disease • Instead, consider a dense sample of polymorphisms that span the genome
LD can be used to map disease genes • LD decays with distance from the disease allele. • By plotting LD, one can shortlist the region containing the disease gene. [Figure: LD plotted along the chromosome for individuals labeled D or N, with 0/1 marker alleles]
LD and disease gene mapping problems • Marker density? • Complex diseases • Population sub-structure
Population Genetics • Often we look at these equilibria (Linkage/HW) and their deviations in specific populations • These deviations offer insight into evolution. • However, what is normal? • A combination of empirical (simulation) and theoretical insight helps distinguish between the expected and the unexpected.
Topic 2: Simulating population data • We described various population genetic concepts (HW, LD), and their applicability • The values of these parameters depend critically upon the population assumptions. • What if we do not have infinite populations? • No random mating (e.g., geographic isolation) • Sudden growth • Bottlenecks • Admixture • It would be nice to have a simulation of such a population to test various ideas. How would you do this simulation?
Wright Fisher Model of Evolution • Fixed population size from generation to generation • Random mating
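A forward simulation of this model can be sketched in a few lines. This is a minimal illustration, assuming haploid individuals that each carry a single 0/1 allele; the function name, population size, and number of generations are arbitrary choices, not values from the slides.

```python
import random

def wright_fisher(population, generations, rng):
    """Fixed size, random mating: every individual copies a uniformly chosen parent."""
    for _ in range(generations):
        population = [rng.choice(population) for _ in range(len(population))]
    return population

rng = random.Random(0)
start = [0] * 50 + [1] * 50            # allele 1 starts at frequency 0.5
end = wright_fisher(start, 200, rng)
print(sum(end) / len(end))             # frequency of allele 1 after drift
```

Running this for many generations typically ends with one allele fixed, which is why later slides drop fixed sites.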
Coalescent model • Insight 1: • Separate the genealogy from allelic states (mutations) • First generate the genealogy (who begat whom) • Assign an allelic state (0) to the ancestor. Drop mutations on the branches.
Coalescent model • Insight 1: • Assign an allelic state (0) to the ancestor. Drop mutations on the branches. • The number of mutations on a branch is proportional to the branch length. • Each site (locus) mutates at most once. • At the end, drop any fixed sites • How efficient is this? How many generations do we need to simulate today’s population? [Figure: ancestral sequence 0 0 0 0 0 over five loci, and the resulting individuals × loci matrix of 0/1 alleles]
Coalescent theory • Insight 2: • Much of the genealogy is irrelevant, because it disappears. • Better to go backwards
Coalescent • Note that in the Wright Fisher model, the population is freely mixing and of constant size N • One way to think about it is that each individual in the current generation selects a parent uniformly at random from the N individuals in the previous generation. • When two individuals choose the same parent, they coalesce. • Once they coalesce, they stay together. We continue until only one individual is left. • Note that this only gives a random topology with labeled leaves. The only thing of interest is branch length (number of generations to the most recent common ancestor, MRCA) [Figure: genealogy with leaves labeled 1 to 4]
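The backward process described here can be simulated directly, generation by generation. The sketch below tracks only the number of surviving lineages; the function name and the values n = 4, N = 100 are assumptions made for illustration.

```python
import random

def generations_to_mrca(n, N, rng):
    """Go back in time: each lineage picks a parent in 0..N-1; lineages sharing a parent merge."""
    lineages = n
    generations = 0
    while lineages > 1:
        parents = {rng.randrange(N) for _ in range(lineages)}   # distinct parents chosen
        lineages = len(parents)
        generations += 1
    return generations

rng = random.Random(1)
runs = [generations_to_mrca(4, 100, rng) for _ in range(500)]
print(sum(runs) / len(runs))   # average number of generations back to the MRCA
```

The average grows linearly with N, which motivates measuring time in units of N generations.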
Coalescent theory (Kingman) • Input • (Fixed population (N individuals), random mating) • Consider 2 individuals. • Probability that they coalesce in the previous generation (have the same parent) = 1/N • Probability that they do not coalesce after t generations = (1 - 1/N)^t ≈ e^(-t/N)
Coalescent theory • τ = t/N is time in units of N generations • Consider k individuals. • Probability that no pair coalesces after 1 generation ≈ 1 - k(k-1)/(2N) • Probability that no pair coalesces after t generations ≈ (1 - k(k-1)/(2N))^t ≈ e^(-τ k(k-1)/2)
Coalescent approximation • Insight 3: • Topology is independent of coalescent times • If you have n individuals, generate a random binary topology • Iterate (until one individual) • Pick a pair at random, and coalesce • Insight 4: • To generate coalescent times, there is no need to go back generation by generation. Generate n-1 random variables to get the n-1 coalescence times.
Coalescent approximation • At any step, there are 1 <= k <= n individuals • To generate the time to coalesce (k to k-1 individuals) • Pick a number from an exponential distribution with rate k(k-1)/2 • Mean time to coalescence (in units of N generations) = 2/(k(k-1))
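Insights 3 and 4 combine into a short procedure: pick a random pair for the topology and an exponential waiting time for the branch lengths. The following is a sketch, with the function name and the tuple-of-leaves representation chosen here for illustration.

```python
import random

def kingman_coalescent(n, rng):
    """Return (waiting_time, merged_leaves) events; times are in units of N generations."""
    lineages = [(i,) for i in range(n)]            # each lineage is a tuple of leaf labels
    events = []
    while len(lineages) > 1:
        k = len(lineages)
        wait = rng.expovariate(k * (k - 1) / 2)    # rate k(k-1)/2, mean 2/(k(k-1))
        a, b = rng.sample(range(k), 2)             # topology: pick a pair at random
        merged = lineages[a] + lineages[b]
        lineages = [x for i, x in enumerate(lineages) if i not in (a, b)] + [merged]
        events.append((wait, merged))
    return events

for wait, merged in kingman_coalescent(6, random.Random(0)):
    print(round(wait, 3), sorted(merged))
```

Note that N never appears in the code, because all times are already scaled by N.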
Mean time to coalesce • If there are k individuals, the probability of a coalescence in one generation is approximately k(k-1)/(2N) • Expected time to coalesce = 2N/(k(k-1)) generations
Typical coalescents • 4 random examples with n=6 (Note that we do not need to specify N. Why?) • Expected time to coalesce to 1 node?
Coalescent properties • Expected time for the last step: E(T_2) = 1 (in units of N generations) • The last step is about half of the total time to coalesce, since E(T_MRCA) = 2(1 - 1/n) • Studying a larger number of individuals does not change the numbers tremendously • Ex: the number of mutations in a population is proportional to the total branch length of the tree • E(T_tot) = Σ_{k=2..n} k · 2/(k(k-1)) = 2 Σ_{j=1..n-1} 1/j
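These expectations follow from the mean waiting time 2/(k(k-1)) while k lineages remain, and can be tabulated with a few lines of code. The function name below is illustrative; all times are in units of N generations.

```python
def expected_times(n):
    """Return (E[last step], E[time to MRCA], E[total branch length]) for n samples."""
    t_last = 1.0                                                   # E[T_2] = 2/(2*1)
    t_mrca = sum(2 / (k * (k - 1)) for k in range(2, n + 1))       # equals 2(1 - 1/n)
    t_total = sum(k * 2 / (k * (k - 1)) for k in range(2, n + 1))  # k branches last T_k each
    return t_last, t_mrca, t_total

for n in (2, 6, 50, 1000):
    print(n, expected_times(n))
```

The time to the MRCA stays below 2 however large n gets, while the total branch length grows only logarithmically in n.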
Variants (exponentially growing populations) • If the population is growing exponentially, the branch lengths become similar, or even star-like. Why? • With appropriate scaling of time, the same process can be extended to various scenarios: male-female, hermaphrodite, segregation, migration, etc.
Simulating population data • Generate a coalescent (topology + branch lengths) • For each branch, drop mutations at a rate proportional to its length • Generate sequence data • Note that the resulting sequences admit a perfect phylogeny. • Given such sequence data, can you reconstruct the coalescent tree? (Only the topology, not the branch lengths) • Also, note that all pairs of positions are correlated (should have high LD).
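The steps above can be sketched as one small simulator. It is an illustration under stated assumptions: every mutation creates a new site (each site mutates at most once), mutations fall on a branch as a Poisson process with a hypothetical per-unit-time rate `mut_rate` (which must be positive), and the function name is made up here.

```python
import random

def simulate_sequences(n, mut_rate, rng):
    """Coalescent tree plus mutations; returns n rows of 0/1 with one column per mutation."""
    active = [({i}, 0.0) for i in range(n)]    # (leaves below the lineage, time it started)
    now = 0.0
    columns = []                               # each column = set of leaves carrying allele 1

    def drop_mutations(leaves, length):
        # Exponential gaps between mutations give a Poisson count proportional to length.
        t = rng.expovariate(mut_rate)
        while t < length:
            columns.append(frozenset(leaves))
            t += rng.expovariate(mut_rate)

    while len(active) > 1:
        k = len(active)
        now += rng.expovariate(k * (k - 1) / 2)
        a, b = rng.sample(range(k), 2)
        for idx in (a, b):                     # the two merging branches end now
            leaves, born = active[idx]
            drop_mutations(leaves, now - born)
        merged = (active[a][0] | active[b][0], now)
        active = [x for i, x in enumerate(active) if i not in (a, b)] + [merged]
    return [[1 if i in col else 0 for col in columns] for i in range(n)]

for row in simulate_sequences(5, 2.0, random.Random(3)):
    print(row)
```

Because every column marks the leaves below a single branch, any two columns are nested or disjoint, which is the perfect phylogeny property; no site is fixed, since no branch below the root covers all leaves.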
Coalescent with Recombination • An individual may have one parent, or 2 parents
ARG: Coalescent with recombination • Given: mutation rate, recombination rate, population size 2N (diploid), and sample size n. • How can you generate the ARG (topology + branch lengths) efficiently? • How will you generate sequences for n individuals? • Given sequence data, can you reconstruct the ARG (topology)?
Recombination • Define r as the probability of recombining per generation. • Note that the scaled recombination parameter is a value which will be defined later • Assume k individuals in a generation. The following might happen: • An individual arises from a recombination event between two individuals (it will have 2 parents). • Two individuals coalesce. • Neither (each individual has a distinct parent). • Multiple events (low probability).
Recombination • We ignore the case of multiple (> 1) events in one generation • Pr(no recombination) = 1 - kr • Pr(no coalescence) ≈ 1 - k(k-1)/(4N), since there are 2N chromosomes • Consider scaled time in units of 2N generations. Thus the number of individuals increases with rate 2Nkr and decreases with rate k(k-1)/2 • The value 2rN is usually small, and therefore the process will ultimately coalesce to a single individual (MRCA)
ARG • What is the flaw in this procedure? • Let k = n • Define ρ = 2rN • Iterate until k = 1 • Choose the time from an exponential distribution with rate kρ + k(k-1)/2 • Pick the event as recombination with probability kρ / (kρ + k(k-1)/2) • If the event is a recombination, choose an individual to recombine and a position; else choose a pair to coalesce. • Update k, and continue
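The event loop of this procedure can be sketched while tracking only the number of lineages k. The rates used here are assumptions taken from the scaled rates on the previous slide (recombination at 2Nkr, coalescence at k(k-1)/2, time in units of 2N generations), with `rho` standing for 2rN; the function name is hypothetical, and the choice of individuals and positions is left out.

```python
import random

def arg_events(n, rho, rng):
    """Return (time, event, k_after) tuples until a single lineage remains."""
    k, t, events = n, 0.0, []
    while k > 1:
        coal_rate = k * (k - 1) / 2          # assumed coalescence rate
        rec_rate = k * rho                   # assumed recombination rate, rho = 2rN
        t += rng.expovariate(coal_rate + rec_rate)
        if rng.random() < rec_rate / (coal_rate + rec_rate):
            k += 1                           # recombination: one lineage splits into two parents
            events.append((t, "recombination", k))
        else:
            k -= 1                           # coalescence: two lineages merge
            events.append((t, "coalescence", k))
    return events

for event in arg_events(5, 0.5, random.Random(0)):
    print(event)
```

The loop ends because the coalescence rate grows quadratically in k while the recombination rate grows only linearly.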
Ancestral Recombination Graph
Simulating sequences on the ARG • Generate topology and branch lengths as before • For each recombination, generate a position. • Next, generate mutations at random on the branches, in proportion to branch length • For each mutation, select a position as well.
Review: Coalescent theory applications • Coalescent simulations allow us to test various hypotheses. The coalescent/ARG is usually not inferred, unlike in phylogenies.
Coalescent theory: example • Ex: ~1400 bp at the Sod locus in Drosophila • 10 taxa • 5 were identical. The other 5 had 55 mutations. • Q: Is this a chance event, or is there selection for this haplotype?
Coalescent application • 10000 coalescent simulations were performed on 10 taxa. • 55 mutations on the coalescent branches • Count the number of times 5 lineages are identical • The event happened in 1.1% of the cases. • Conclusion: selection, or some other mechanism explains this data.
Coalescent example: Out of Africa hypothesis • Looking at lineage-specific mutations might help discard the candelabra model. How? • How do we decide between the multi-regional and Out-of-Africa models? How do we decide if the ancestor was African?
Human Samples • We look at data from human samples • Gabriel et al. Science 2002. • 3 populations were sampled at multiple regions spanning the genome • 54 regions (average size 250 kb) • SNP density: 1 per 2 kb • 90 individuals from Nigeria (Yoruban) • 93 Europeans • 42 Asians • 50 African Americans
Population specific recombination • D’ was used as the measure of LD between SNP pairs. • SNP pairs were classified into one of the following categories • Strong LD • Strong evidence for recombination • Others (13% of cases) • This roughly favors Out-of-Africa. A coalescent simulation can help give confidence values on this. Gabriel et al., Science 2002
Recombination events and ρ • Given ρ and n, can you compute the expected number of recombination events? • It can be shown that E(n, ρ) = ρ log(n) • Questions that people are interested in • Given a set of sequences from a population, compute the recombination rate • Given a population, reconstruct the most likely history (as an ancestral recombination graph)
Re-constructing history without the coalescent
An algorithm for constructing a perfect phylogeny • We will consider the case where 0 is the ancestral state, and 1 is the mutated state. This restriction is removed later (unrooted case). • In any tree, each node (except the root) has a single parent. • It is sufficient to construct a parent for every node. • In each step, we add a column and refine some of the nodes containing multiple children. • Stop when all columns have been considered.
Inclusion Property • For any pair of columns i, j • i < j if and only if i_1 ⊇ j_1, where i_1 is the set of rows with a 1 in column i • Note that if i < j then the edge containing i is an ancestor of the edge containing j [Figure: edge i above edge j]
Example • Initially, there is a single clade r, and each node has r as its parent [Figure: star tree with root r and leaves A, B, C, D, E]
      1 2 3 4 5
  A   1 1 0 0 0
  B   0 0 1 0 0
  C   1 1 0 1 0
  D   0 0 1 0 1
  E   1 0 0 0 0
Sort columns • Sort columns according to the inclusion property (note that the columns of the example matrix are already sorted). • This can be achieved by considering the columns as binary representations of numbers (most significant bit in row 1) and sorting in decreasing order
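The sorting step can be written directly from this description. A minimal sketch, assuming the matrix is a list of 0/1 rows and using 0-based column indices; the function name is illustrative.

```python
def sort_columns(matrix):
    """Return column indices sorted by their value as binary numbers (row 1 = top bit), decreasing."""
    def as_number(j):
        return int("".join(str(row[j]) for row in matrix), 2)
    return sorted(range(len(matrix[0])), key=as_number, reverse=True)

example = [
    [1, 1, 0, 0, 0],   # A
    [0, 0, 1, 0, 0],   # B
    [1, 1, 0, 1, 0],   # C
    [0, 0, 1, 0, 1],   # D
    [1, 0, 0, 0, 0],   # E
]
print(sort_columns(example))   # [0, 1, 2, 3, 4]: already sorted
```

A column whose 1s are a superset of another column's 1s always gets the larger number, so it is placed first.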
Add first column • In adding column i • Check each edge and decide which side the column belongs on. • Finally, add a node if you can resolve a clade [Figure: after adding column 1, r has children u, B, D, and u has children A, C, E]
Adding other columns • Add other columns on edges using the ordering property [Figure: final tree. Edge 1 from r leads to E and to edge 2; edge 2 leads to A and to edge 4, which leads to C. Edge 3 from r leads to B and to edge 5, which leads to D]
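One way to realize the whole construction is to compute a parent for every node from the sorted columns. This is a sketch rather than the exact step-by-step refinement on the slides: it assumes 0 is the ancestral state and that no column is all zeros, names each internal node after the column on the edge entering it (`col1`, `col2`, ...), and returns `None` when the inclusion property fails.

```python
def perfect_phylogeny(matrix, names):
    """Return a child -> parent map, or None if no perfect phylogeny exists."""
    n_cols = len(matrix[0])
    ones = [frozenset(i for i, row in enumerate(matrix) if row[j]) for j in range(n_cols)]
    # Inclusion property: any two columns must be nested or disjoint.
    for a in range(n_cols):
        for b in range(a + 1, n_cols):
            shared = ones[a] & ones[b]
            if shared and shared != ones[a] and shared != ones[b]:
                return None
    # Sort columns as binary numbers (row 1 = most significant bit), decreasing.
    order = sorted(range(n_cols),
                   key=lambda j: int("".join(str(row[j]) for row in matrix), 2),
                   reverse=True)
    parent = {}
    for pos, j in enumerate(order):
        parent[f"col{j + 1}"] = "r"
        for k in reversed(order[:pos]):      # nearest earlier column that contains column j
            if ones[k] >= ones[j]:
                parent[f"col{j + 1}"] = f"col{k + 1}"
                break
    for i, name in enumerate(names):
        parent[name] = "r"
        for j in reversed(order):            # smallest clade containing this row
            if matrix[i][j]:
                parent[name] = f"col{j + 1}"
                break
    return parent

example = [
    [1, 1, 0, 0, 0],   # A
    [0, 0, 1, 0, 0],   # B
    [1, 1, 0, 1, 0],   # C
    [0, 0, 1, 0, 1],   # D
    [1, 0, 0, 0, 0],   # E
]
print(perfect_phylogeny(example, "ABCDE"))
```

On the example matrix this attaches E below edge 1, A below edge 2, C below edge 4, B below edge 3, and D below edge 5.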
Unrooted case • Switch the values in each column, so that 0 is the majority element. • Apply the algorithm for the rooted case
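The column switch is a simple preprocessing pass. A minimal sketch, assuming a list of 0/1 rows; the function name is illustrative, and ties (exactly half 1s) are left unchanged here as an arbitrary choice.

```python
def make_zero_majority(matrix):
    """Flip every column in which 1 is the strict majority, so that 0 becomes the majority."""
    n_rows = len(matrix)
    flip = [2 * sum(row[j] for row in matrix) > n_rows for j in range(len(matrix[0]))]
    return [[1 - v if flip[j] else v for j, v in enumerate(row)] for row in matrix]

print(make_zero_majority([[1, 0], [1, 0], [0, 1]]))   # [[0, 0], [0, 0], [1, 1]]
```

The output can then be passed to the rooted algorithm unchanged.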