CSE280b: Population Genetics
Vineet Bafna/Pavel Pevzner
www.cse.ucsd.edu/classes/sp05/cse291
Simulating population data
• Generate a coalescent (topology + branch lengths).
• For each branch, drop mutations along its length at the mutation rate.
• Generate sequence data.
• Note that the resulting sequence data admit a perfect phylogeny.
• Given such sequence data, can you reconstruct the coalescent tree? (Only the topology, not the branch lengths.)
• Also, note that all pairs of positions are correlated (they should have high LD).
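The three steps above can be sketched in Python. This is a minimal illustration, not the course's code: the names `simulate_sample` and `poisson_count` are hypothetical, time is assumed to be scaled so that k lineages coalesce at rate k(k-1)/2, mutations are assumed to fall at rate theta/2 per unit branch length, and every mutation is assumed to hit a new position (infinite sites).

```python
import random

def poisson_count(rate, length, rng):
    """Number of events of a Poisson process with the given rate on a branch of this length."""
    count, pos = 0, rng.expovariate(rate)
    while pos < length:
        count += 1
        pos += rng.expovariate(rate)
    return count

def simulate_sample(n, theta, rng):
    """Return n haplotypes (lists of 0/1, 0 = ancestral) from a coalescent without recombination."""
    lineages = [frozenset([i]) for i in range(n)]   # each lineage = set of samples below it
    sites = []                                      # one entry per mutation: the samples carrying it
    while len(lineages) > 1:
        k = len(lineages)
        t = rng.expovariate(k * (k - 1) / 2)        # time until the next coalescence
        for clade in lineages:                      # drop mutations on each branch segment
            sites.extend([clade] * poisson_count(theta / 2, t, rng))
        a, b = rng.sample(range(k), 2)              # choose the pair that coalesces
        merged = lineages[a] | lineages[b]
        lineages = [c for i, c in enumerate(lineages) if i not in (a, b)] + [merged]
    return [[int(i in clade) for clade in sites] for i in range(n)]

haplotypes = simulate_sample(5, theta=2.0, rng=random.Random(0))
for h in haplotypes:
    print("".join(map(str, h)))
```

Because every mutation is carried by exactly the samples below one branch, any two columns of the output are nested or disjoint, which is the perfect phylogeny property noted on the slide. Columns appear in the order the mutations were generated, not in positional order.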
Coalescent with Recombination
• An individual may have one parent, or 2 parents.
ARG: Coalescent with recombination
• Given: a mutation rate, a recombination rate, population size 2N (diploid), sample size n.
• How can you generate the ARG (topology + branch lengths) efficiently?
• How will you generate sequences for n individuals?
• Given sequence data, can you reconstruct the ARG (topology)?
Recombination
• Define r as the probability of recombining per generation.
• Assume k individuals in a generation. The following might happen:
  • An individual arises because of a recombination event between two individuals (it will have 2 parents).
  • Two individuals coalesce.
  • Neither (each individual has a distinct parent).
  • Multiple events (low probability).
Recombination
• We ignore the case of multiple (> 1) events in one generation.
• Pr(no recombination) = $1 - kr$
• Pr(no coalescence) = $1 - \binom{k}{2}\frac{1}{2N}$
• Consider scaled time in units of 2N generations. Going back in time, the number of individuals increases with rate $2Nkr$ and decreases with rate $\binom{k}{2}$.
• The value 2rN is usually small, and therefore the process will ultimately coalesce to a single individual (the MRCA).
ARG
What is the flaw in this procedure?
• Let k = n.
• Define $\rho = 2rN$ (the scaled recombination rate).
• Iterate until k = 1:
  • Choose the time to the next event from an exponential distribution with rate $k\rho + \binom{k}{2}$.
  • Pick the event as a recombination with probability $\frac{k\rho}{k\rho + \binom{k}{2}}$.
  • If the event is a recombination, choose an individual to recombine and a position; otherwise choose a pair to coalesce.
  • Update k, and continue.
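The procedure can be sketched as written, flaw and all. This is a hedged illustration: the function name `simulate_arg_events` is hypothetical, `rho` stands for the scaled recombination rate 2rN, and only the number of lineages is tracked (which individual recombines, and at which position, is left out).

```python
import random

def simulate_arg_events(n, rho, rng):
    """Follow the lineage count k back in time until k == 1.

    Returns a list of (time, event, k_after) tuples.
    """
    k, t, events = n, 0.0, []
    while k > 1:
        coal_rate = k * (k - 1) / 2        # number of pairs that can coalesce
        rec_rate = k * rho                 # any of the k lineages can recombine
        t += rng.expovariate(coal_rate + rec_rate)
        if rng.random() < rec_rate / (coal_rate + rec_rate):
            k += 1                         # a recombinant has two parents
            events.append((t, "recombination", k))
        else:
            k -= 1                         # two lineages merge into one
            events.append((t, "coalescence", k))
    return events

for time, event, k in simulate_arg_events(4, rho=0.5, rng=random.Random(0)):
    print(f"{time:.3f}  {event:13s}  k={k}")
```

Since the coalescence rate grows quadratically in k while the recombination rate grows only linearly, the loop terminates with probability 1.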
Ancestral Recombination Graph
Simulating sequences on the ARG
• Generate the topology and branch lengths as before.
• For each recombination, generate a position.
• Next, generate mutations at random along the branches.
• For each mutation, select a position as well.
• Generate sequence data.
• The program ms (Hudson) is a commonly used coalescent simulator.
Coalescent theory applications
• Coalescent simulations allow us to test various hypotheses. The coalescent/ARG is usually not inferred, unlike in phylogenies.
Coalescent theory: example
• Ex: ~1400 bp at the Sod locus in Drosophila.
• 10 taxa.
• 5 were identical. The other 5 had 55 mutations.
• Q: Is this a chance event, or is there selection for this haplotype?
Coalescent application
• 10000 coalescent simulations were performed on 10 taxa.
• 55 mutations were placed on the coalescent branches.
• Count the number of times 5 lineages are identical.
• The event happened in 1.1% of the cases.
• Conclusion: chance alone is an unlikely explanation; selection, or some other mechanism, explains these data.
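A rough sketch of such a simulation test follows. It is not the original analysis: the function names are hypothetical, a fixed number of mutations is assumed to be placed on branches with probability proportional to branch length, and "5 lineages are identical" is read here as "some group of at least 5 sampled sequences is identical".

```python
import random
from collections import Counter

def coalescent_branches(n, rng):
    """Return (clade, length) for every branch of a random coalescent tree on n samples."""
    lineages = [(frozenset([i]), 0.0) for i in range(n)]   # (samples below, length so far)
    branches = []
    while len(lineages) > 1:
        k = len(lineages)
        t = rng.expovariate(k * (k - 1) / 2)
        lineages = [(c, length + t) for c, length in lineages]
        a, b = rng.sample(range(k), 2)
        branches += [lineages[a], lineages[b]]             # these two branches end here
        merged = (lineages[a][0] | lineages[b][0], 0.0)
        lineages = [l for i, l in enumerate(lineages) if i not in (a, b)] + [merged]
    return branches

def largest_identical_group(n, num_mutations, rng):
    """Size of the largest set of identical sequences after placing the mutations."""
    branches = coalescent_branches(n, rng)
    clades = [c for c, _ in branches]
    weights = [length for _, length in branches]
    hit = rng.choices(clades, weights=weights, k=num_mutations)
    signatures = Counter(tuple(i in c for c in hit) for i in range(n))
    return max(signatures.values())

def fraction_with_identical(n, num_mutations, group, reps, rng):
    """Fraction of simulated trees in which at least `group` sequences are identical."""
    hits = sum(largest_identical_group(n, num_mutations, rng) >= group for _ in range(reps))
    return hits / reps

print(fraction_with_identical(10, 55, 5, reps=2000, rng=random.Random(0)))
```

The printed fraction plays the role of an empirical p-value under this neutral model; a small value suggests the observed pattern is hard to explain by chance.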
Coalescent example: Out of Africa hypothesis
• Looking at lineage-specific mutations might help discard the candelabra model. How?
• How do we decide between the multi-regional and the Out-of-Africa model? How do we decide if the ancestor was African?
Human Samples
• We look at data from human samples (Gabriel et al., Science 2002).
• 4 groups were sampled at multiple regions spanning the genome:
  • 90 individuals from Nigeria (Yoruba)
  • 93 Europeans
  • 42 Asians
  • 50 African Americans
• 54 regions (average size 250 Kb)
• SNP density: 1 per 2 Kb
Population specific recombination
• D' was used as the LD measure between SNP pairs.
• SNP pairs were classified into one of the following:
  • Strong LD
  • Strong evidence for recombination
  • Others (13% of cases)
• This roughly favors Out-of-Africa. A coalescent simulation can help give confidence values on this. (Gabriel et al., Science 2002)
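For reference, D' is the LD coefficient D rescaled by its maximum attainable magnitude given the allele frequencies. The sketch below uses the standard (Lewontin) definition; the function name `d_prime` and its argument names are illustrative only, and the thresholds used to call "strong LD" or "strong recombination" are not reproduced here.

```python
def d_prime(p_a, p_b, p_ab):
    """Return D' for two biallelic SNPs.

    p_a, p_b: frequencies of the chosen allele at each SNP;
    p_ab: frequency of the haplotype carrying both chosen alleles.
    """
    d = p_ab - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    return 0.0 if d_max == 0 else d / d_max

print(d_prime(0.5, 0.5, 0.5))    # only two haplotypes present: complete LD
print(d_prime(0.5, 0.5, 0.25))   # haplotype frequency equals the product: no LD
```

|D'| = 1 means that at most three of the four possible haplotypes are present, which is what one expects when no recombination has occurred between the two SNPs.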
Haplotype Blocks
• A haplotype block is a region of low recombination.
• Define a region as a block if fewer than 5% of the SNP pairs show strong evidence for recombination.
• Much of the genome is in blocks.
• The distribution of block sizes varies across populations.
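The block criterion on this slide reduces to a simple threshold test. A minimal sketch, assuming each SNP pair in the region has already been labelled (the label strings and the function name `is_haplotype_block` are hypothetical):

```python
def is_haplotype_block(pair_classes, max_recomb_fraction=0.05):
    """pair_classes: one label per SNP pair in the region, e.g.
    "strong_ld", "recombination" or "other"."""
    if not pair_classes:
        return False                      # no pairs, nothing to call
    recomb = sum(1 for c in pair_classes if c == "recombination")
    return recomb / len(pair_classes) < max_recomb_fraction

region = ["strong_ld"] * 96 + ["recombination"] * 4
print(is_haplotype_block(region))
```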
Testing Out-of-Africa
• Generate simulations with and without migration.
• Check the size of haplotype blocks.
  • Does it vary when migrations are allowed?
  • When the 'new' population has a bottleneck?
• If there was a bottleneck that created the European and Asian populations, can we say anything about the frequency of alleles that are 'African specific'?
  • Should they be high frequency, or low frequency, in African populations?
Haplotype Block: implications
• The genome is mostly partitioned into haplotype blocks.
• Within a block, there is extensive LD.
• Is this good, or bad, for association mapping?
Coalescent reconstruction
• Reconstructing likely coalescents
Re-constructing history in the absence of recombination
An algorithm for constructing a perfect phylogeny
• We will consider the case where 0 is the ancestral state and 1 is the mutated state. This assumption will be relaxed later.
• In any tree, each node (except the root) has a single parent.
• It is therefore sufficient to construct a parent for every node.
• In each step, we add a column and refine some of the nodes containing multiple children.
• Stop when all columns have been considered.
Inclusion Property
• For any pair of columns i, j:
  • i < j if and only if $i_1 \supseteq j_1$ (reading $i_1$ as the set of rows with a 1 in column i).
• Note that if i < j, then the edge containing i is an ancestor of the edge containing j.
Example
     1 2 3 4 5
  A  1 1 0 0 0
  B  0 0 1 0 0
  C  1 1 0 1 0
  D  0 0 1 0 1
  E  1 0 0 0 0
• Initially, there is a single clade r, and each node (A, B, C, D, E) has r as its parent.
Sort columns
• Sort the columns according to the inclusion property (note that the columns of the example matrix are already sorted).
• This can be achieved by treating each column as the binary representation of a number (most significant bit in row 1) and sorting in decreasing order.
Add first column
• When adding column i:
  • Check each edge and decide which side of it the column belongs on.
  • Finally, add a node if you can resolve a clade.
[Figure: after adding column 1, r has children u, B and D, and the new node u has children A, C and E.]
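Adding other columns
• Add the other columns on edges using the ordering property.
[Figure: the final tree. Edge 1 leaves r and leads to a node with child E and edge 2; edge 2 leads to a node with child A and edge 4, which leads to C. Edge 3 leaves r and leads to a node with child B and edge 5, which leads to D.]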
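The whole rooted construction (inclusion check, sorting, adding columns one at a time) can be sketched as follows. This is one possible implementation under the slides' assumption that 0 is ancestral; the function name `perfect_phylogeny` and the parent-map representation are illustrative choices. Each internal node is named after the column whose mutation lies on the edge entering it, and "r" is the root.

```python
def perfect_phylogeny(matrix):
    """matrix: dict taxon -> list of 0/1 (0 = ancestral).

    Returns a dict mapping every column (1-based) and every taxon to its
    parent, or None if the inclusion property fails.
    """
    taxa = list(matrix)
    m = len(matrix[taxa[0]])
    ones = {j: frozenset(t for t in taxa if matrix[t][j - 1]) for j in range(1, m + 1)}
    # Inclusion property: any two columns must be nested or disjoint.
    for a in ones.values():
        for b in ones.values():
            if a & b and not (a <= b or b <= a):
                return None
    # Sort columns as binary numbers (first row = most significant bit), decreasing.
    order = sorted(ones, key=lambda j: [matrix[t][j - 1] for t in taxa], reverse=True)
    parent = {}
    last = {t: "r" for t in taxa}           # deepest node on each taxon's path so far
    for j in order:
        members = ones[j]
        if not members:
            continue                        # column with no mutation carriers
        # After sorting, all carriers of column j hang below the same node.
        parent[j] = last[next(iter(members))]
        for t in members:
            last[t] = j
    for t in taxa:
        parent[t] = last[t]
    return parent

example = {
    "A": [1, 1, 0, 0, 0],
    "B": [0, 0, 1, 0, 0],
    "C": [1, 1, 0, 1, 0],
    "D": [0, 0, 1, 0, 1],
    "E": [1, 0, 0, 0, 0],
}
print(perfect_phylogeny(example))
```

On the example matrix this gives edge 1 and edge 3 below r, edge 2 below 1, edge 4 below 2, edge 5 below 3, with A under 2, B under 3, C under 4, D under 5 and E under 1.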
Unrooted case
• The important point is that the perfect phylogeny condition does not change when you interchange 1s and 0s in a column.
• Switch the values in each column where necessary, so that 0 is the majority element.
• Apply the algorithm for the rooted case.
• Homework: show that this is a correct algorithm.
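The column-switching step is small enough to show directly. A minimal sketch (the function name `make_zero_majority` is hypothetical; ties are left unflipped, which is an arbitrary choice):

```python
def make_zero_majority(matrix):
    """Flip every column in which 1 is the strict majority, so that 0 becomes the majority state."""
    taxa = list(matrix)
    m = len(matrix[taxa[0]])
    flip = [2 * sum(matrix[t][j] for t in taxa) > len(taxa) for j in range(m)]
    return {t: [matrix[t][j] ^ flip[j] for j in range(m)] for t in taxa}

print(make_zero_majority({"A": [1, 0], "B": [1, 0], "C": [0, 1]}))
```

The output matrix can then be passed to any rooted perfect phylogeny routine.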
Population Sub-structure
Population sub-structure can increase LD
• Consider two populations that were isolated and evolving independently.
• They might have different allele frequencies in some regions.
• Pick two regions that are far apart (LD is very low, close to 0).
  Pop. A: 0..1 0..1 0..0 1..1 0..1 0..1 0..1 0..1 0..1
          p1 = 0.1, q1 = 0.9, P11 = 0.1, D = 0.01
  Pop. B: 1..0 1..0 0..0 1..1 1..0 1..0 1..0 1..0 1..0
          p1 = 0.9, q1 = 0.1, P11 = 0.1, D = 0.01
Recent ad-mixing of population
• If the populations came together recently (Ex: African and European populations), artificial LD might be created.
  Pop. A+B: 0..1 0..1 0..0 1..1 0..1 0..1 0..1 0..1 0..1
            1..0 1..0 0..0 1..1 1..0 1..0 1..0 1..0 1..0
            p1 = 0.5, q1 = 0.5, P11 = 0.1, D = 0.1 - 0.25 = -0.15
• |D| = 0.15 (instead of 0.01), an increase of more than 10-fold.
• This spurious LD might lead to false associations.
• Other genetic events can cause LD to arise, and one needs to be careful.
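The arithmetic can be checked with a few lines of Python. The sketch assumes 10 haplotypes per population (eight of the common type plus one 0..0 and one 1..1), which is what the stated frequencies imply; the helper name `ld_d` is illustrative.

```python
def ld_d(haplotypes):
    """D = P11 - p1*q1 for two-locus haplotypes given as strings such as "01"."""
    n = len(haplotypes)
    p1 = sum(h[0] == "1" for h in haplotypes) / n    # frequency of 1 at the first locus
    q1 = sum(h[1] == "1" for h in haplotypes) / n    # frequency of 1 at the second locus
    p11 = sum(h == "11" for h in haplotypes) / n     # frequency of the 1..1 haplotype
    return p11 - p1 * q1

pop_a = ["01"] * 8 + ["00", "11"]
pop_b = ["10"] * 8 + ["00", "11"]
print(round(ld_d(pop_a), 2), round(ld_d(pop_b), 2), round(ld_d(pop_a + pop_b), 2))
# prints: 0.01 0.01 -0.15
```

Each population alone is close to linkage equilibrium, while the pooled sample shows D = -0.15 purely because the two populations differ in allele frequencies.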
Determining population sub-structure
• Given a mix of people, can you sub-divide them into ethnic populations?
• Turn the 'problem' of spurious LD into a clue.
• Find markers that are too far apart to show LD.
• If they do show LD (correlation), that indicates the existence of multiple populations.
• Sub-divide them into populations so that the LD disappears.
Determining Population sub-structure
• Same example as before:
  0..1 0..1 0..0 1..1 0..1 0..1 0..1 0..1 0..1
  1..0 1..0 0..0 1..1 1..0 1..0 1..0 1..0 1..0
• The two markers are too far apart to show any LD, yet they do show LD.
• However, if you split the sample so that all 0..1 are in one population and all 1..0 are in another, the LD disappears.
Iterative algorithm for population sub-structure
• Define:
  • N = number of individuals (each has a single chromosome)
  • k = number of sub-populations
  • $Z \in \{1..k\}^N$ is a vector giving the sub-population of each individual
  • $Z_i = k'$ means individual i is assigned to population k'
  • $X_{i,j}$ = allelic value for individual i at position j
  • $P_{k,j,l}$ = frequency of allele l at position j in population k
Example
• Ex: consider the following assignment:
  Z = 1: 0..1 0..1 0..0 1..1 0..1 0..1 0..1 0..1 0..1 0..1
  Z = 2: 1..0 1..0 0..0 1..1 1..0 1..0 1..0 1..0 1..0 1..0
• $P_{1,1,0} = 0.9$
• $P_{2,1,0} = 0.1$
Goal
• X is known.
• P, Z are unknown.
• The goal is to estimate Pr(P, Z | X).
• Various learning techniques can be employed:
  • $\max_{P,Z} \Pr(X \mid P,Z)$ (maximum likelihood estimate)
  • $\max_{P,Z} \Pr(X \mid P,Z)\,\Pr(P,Z)$ (MAP)
  • Sample P, Z from Pr(P, Z | X)
• Here a Bayesian (MCMC) scheme is employed to sample from Pr(P, Z | X). We will only consider a simplified version.
Algorithm: Structure
• Iteratively estimate
  • $(Z^{(0)},P^{(0)}), (Z^{(1)},P^{(1)}), \ldots, (Z^{(m)},P^{(m)})$
• After 'convergence', $Z^{(m)}$ is the answer.
• Iteration:
  • Guess $Z^{(0)}$
  • For m = 1, 2, ...
    • Sample $P^{(m)}$ from $\Pr(P \mid X, Z^{(m-1)})$
    • Sample $Z^{(m)}$ from $\Pr(Z \mid X, P^{(m)})$
• How is this sampling done?
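The iteration can be sketched end to end in a few lines. This is a deliberately simplified illustration, not the published method: the name `structure_sketch` is hypothetical, populations are numbered from 0, P is replaced by smoothed counts (with an assumed pseudocount) rather than sampled from a posterior, and loci are assumed independent within a population.

```python
import random

def structure_sketch(X, k, iters, rng, pseudo=1.0):
    """X: list of haplotypes (lists of 0/1). Returns an assignment Z with values in 0..k-1."""
    n, m = len(X), len(X[0])
    Z = [rng.randrange(k) for _ in range(n)]            # guess Z(0)
    for _ in range(iters):
        # Step 1: estimate P given X and Z (frequency of allele 1, smoothed by a pseudocount).
        P = []
        for pop in range(k):
            members = [X[i] for i in range(n) if Z[i] == pop]
            P.append([(sum(h[j] for h in members) + pseudo) / (len(members) + 2 * pseudo)
                      for j in range(m)])
        # Step 2: sample Z given X and P, one individual at a time.
        for i in range(n):
            weights = []
            for pop in range(k):
                like = 1.0
                for j in range(m):
                    like *= P[pop][j] if X[i][j] else 1 - P[pop][j]
                weights.append(like)
            Z[i] = rng.choices(range(k), weights=weights)[0]
    return Z

X = [[0, 1]] * 8 + [[0, 0], [1, 1]] + [[1, 0]] * 8 + [[0, 0], [1, 1]]
print(structure_sketch(X, k=2, iters=50, rng=random.Random(0)))
```

The pseudocount keeps every frequency strictly between 0 and 1, so no haplotype ever gets zero weight under both populations.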
Example
• Choose Z at random, so each individual is assigned to one of 2 populations. See the example:
  Z:         1    2    2    1    1    2    1    2    1    2
  Haplotype: 0..1 0..1 0..0 1..1 0..1 0..1 0..1 0..1 0..1 0..1
  Z:         1    2    2    1    1    2    1    2    2    1
  Haplotype: 1..0 1..0 0..0 1..1 1..0 1..0 1..0 1..0 1..0 1..0
• Now, we need to sample $P^{(1)}$ from $\Pr(P \mid X, Z^{(0)})$.
• Simply count:
  • $N_{k,j,l}$ = number of people in population k who have allele l at position j
  • $p_{k,j,l} = N_{k,j,l} / N_{k,j,*}$
Example
• $N_{k,j,l}$ = number of people in population k who have allele l at position j
• $p_{k,j,l} = N_{k,j,l} / N_{k,j,*}$
• With the random assignment of the previous example:
  • $N_{1,1,0} = 4$
  • $N_{1,1,1} = 6$
  • $p_{1,1,0} = 4/10$
  • $p_{1,2,0} = 4/10$
• Thus, we can sample $P^{(m)}$.
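The counting step can be reproduced directly. In this sketch the haplotypes and the assignment are the 20 individuals of the example; the function name `allele_frequencies` is illustrative, positions are 0-based in the code, and every population is assumed to be non-empty.

```python
def allele_frequencies(X, Z, k):
    """Return p[pop][j][l] = N[pop][j][l] / N[pop][j][*] for populations 1..k."""
    m = len(X[0])
    p = {}
    for pop in range(1, k + 1):
        members = [x for x, z in zip(X, Z) if z == pop]
        p[pop] = []
        for j in range(m):
            n0 = sum(1 for x in members if x[j] == "0")   # N[pop][j]["0"]
            p[pop].append({"0": n0 / len(members), "1": 1 - n0 / len(members)})
    return p

X = ["01", "01", "00", "11"] + ["01"] * 6 + ["10", "10", "00", "11"] + ["10"] * 6
Z = [1, 2, 2, 1, 1, 2, 1, 2, 1, 2, 1, 2, 2, 1, 1, 2, 1, 2, 2, 1]
p = allele_frequencies(X, Z, 2)
print(p[1][0]["0"], p[1][1]["0"])   # p_{1,1,0} and p_{1,2,0}: prints 0.4 0.4
```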
Sampling Z
• Pr[$Z_1 = 1$] = Pr["01" belongs to population 1]?
• Within a population, the positions should be in linkage equilibrium, i.e. independent (assuming HWE and LE).
• Pr["01" | Population 1] = $p_{1,1,0} \cdot p_{1,2,1}$ = (4/10)(6/10) = 0.24
• Pr["01" | Population 2] = $p_{2,1,0} \cdot p_{2,2,1}$ = (6/10)(4/10) = 0.24
• Pr[$Z_1 = 1$] = 0.24/(0.24 + 0.24) = 0.5
Sampling
• Suppose that, during the iteration, there is a bias:
  Z:         1    1    1    2    1    2    1    2    1    1
  Haplotype: 0..1 0..1 0..0 1..1 0..1 0..1 0..1 0..1 0..1 0..1
  Z:         2    2    2    1    2    2    1    2    2    1
  Haplotype: 1..0 1..0 0..0 1..1 1..0 1..0 1..0 1..0 1..0 1..0
• Then, in the next step of sampling Z, we will do the right thing:
  • Pr["01" | pop. 1] = $p_{1,1,0} \cdot p_{1,2,1}$ = 0.7 * 0.7 = 0.49
  • Pr["01" | pop. 2] = $p_{2,1,0} \cdot p_{2,2,1}$ = 0.3 * 0.3 = 0.09
  • Pr[$Z_1 = 1$] = 0.49/(0.49 + 0.09) ≈ 0.84
  • Pr[$Z_6 = 1$] = 0.49/(0.49 + 0.09) ≈ 0.84
• Eventually all "01" will end up in one population, and all "10" in a second population.
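Both calculations (the balanced case giving 0.5 and the biased case giving about 0.84) follow from one small function. The sketch assumes two populations with equal prior weight, which is what the slide's ratio implies; `prob_population_1` is a hypothetical name, and `p[pop][j]` holds the frequency of allele "0" at 0-based position j.

```python
def prob_population_1(haplotype, p):
    """Pr[Z_i = 1 | haplotype, P] for two equally weighted populations."""
    def likelihood(pop):
        out = 1.0
        for j, allele in enumerate(haplotype):
            # positions are treated as independent within a population (LE)
            out *= p[pop][j] if allele == "0" else 1 - p[pop][j]
        return out
    l1, l2 = likelihood(1), likelihood(2)
    return l1 / (l1 + l2)

balanced = {1: [0.4, 0.4], 2: [0.6, 0.6]}   # frequencies from the random assignment
biased = {1: [0.7, 0.3], 2: [0.3, 0.7]}     # frequencies from the biased assignment
print(prob_population_1("01", balanced))
print(prob_population_1("01", biased))
```

By symmetry, a "10" haplotype gets the complementary probability under the biased frequencies, which is why the two haplotype classes drift apart over successive iterations.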
Allowing for admixture
• Define $q_{i,k}$ as the fraction of individual i's genome that originated from population k.
• Iteration:
  • Guess $Z^{(0)}$
  • For m = 1, 2, ...
    • Sample $P^{(m)}, Q^{(m)}$ from $\Pr(P, Q \mid X, Z^{(m-1)})$
    • Sample $Z^{(m)}$ from $\Pr(Z \mid X, P^{(m)}, Q^{(m)})$
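Estimating Z (admixture case)
• Instead of estimating $\Pr(Z(i) = k \mid X, P, Q)$ (the origin of individual i is k), we estimate $\Pr(Z(i,j,l) = k \mid X, P, Q)$, so that each allele copy gets its own population of origin.
[Figure: the two chromosome copies (i,1) and (i,2) of individual i, with locus j marked.]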
Results on admixture prediction
Results: Thrush data
• For each individual, q(i) is plotted as the distance to the opposite side of the triangle.
• The assignment is reliable, and there is evidence of admixture.
Population Structure
• 377 locations (loci) were sampled in 1000 people from 52 populations.
• 6 genetic clusters were obtained, 5 of which corresponded to major geographic regions: Africa, Eurasia, East Asia, Oceania, America (Rosenberg et al., Science 2002).
Population sub-structure: research problem
• Systematically explore the effect of admixture. Can admixture be predicted for a locus, or for an individual?
• The sampling approach may or may not be appropriate. Formulate as an optimization/learning problem:
  • (Without admixture) Assign individuals to sub-populations so as to maximize linkage equilibrium and Hardy-Weinberg equilibrium in each of the sub-populations.
  • (With admixture) Assign (individual, locus) pairs to sub-populations.