
Equilibria in populations


Presentation Transcript


  1. Equilibria in populations Vineet Bafna

  2. Population data • Recall that we often study a population in the form of a SNP matrix • Rows correspond to individuals (or individual chromosomes), columns correspond to SNPs • The matrix is binary (why?) • The underlying genealogy is hidden. If the span is large, the genealogy is no longer a tree. Why? • Example matrix rows: 0100, 0011, 0011, 0011, 1000, 1000, 1000

  3. Basic Principles • In a ‘stable’ population, the distribution of alleles obeys certain laws • Not really, and the deviations are interesting • HW equilibrium (due to mixing in a population) • Linkage (dis)equilibrium (due to recombination)

  4. Hardy Weinberg equilibrium • Assume a population of diploid individuals • Consider a locus with 2 alleles, A and a • Define p (respectively, q): the frequency of A (resp. a) in the population • 3 genotypes: AA, Aa, aa • Q: What is the frequency of each genotype? • If various assumptions are satisfied (such as random mating, no natural selection), then • $P_{AA} = p^2$ • $P_{Aa} = 2pq$ • $P_{aa} = q^2$
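The three genotype frequencies on slide 4 can be computed directly from p. The following is a minimal sketch; the function name `hw_genotype_freqs` is hypothetical and not part of the slides.

```python
def hw_genotype_freqs(p):
    """Expected (AA, Aa, aa) frequencies under Hardy-Weinberg equilibrium.

    p is the frequency of allele A; q = 1 - p is the frequency of allele a.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("allele frequency must lie in [0, 1]")
    q = 1.0 - p
    return p * p, 2.0 * p * q, q * q


# Example: with p = 0.5 half of the population is expected to be heterozygous.
example = hw_genotype_freqs(0.5)
```

The three values always sum to 1, since p^2 + 2pq + q^2 = (p + q)^2.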

  5. Hardy Weinberg: why? • Assumptions: • Diploid • Sexual reproduction • Random mating • Bi-allelic sites • Large population size, … • Why? Each individual randomly picks its two parental chromosomes. Therefore, Pr(Aa) = pq + qp = 2pq, and so on. • [Figure: population rows 0100, 0011, 0011, 0011, 1000, 1000, 1000, followed by the rows 0100 and 1000]

  6. Hardy Weinberg: Generalizations • Multiple alleles $A_1, \dots, A_k$ with frequencies $p_1, \dots, p_k$ • By HW, $\Pr[A_iA_i] = p_i^2$ and $\Pr[A_iA_j] = 2p_ip_j$ for $i \ne j$ • Multiple loci?

  7. Hardy Weinberg: Implications • The allele frequency does not change from generation to generation. True or false? • It is observed that 1 in 10,000 Caucasians have the disease phenylketonuria. The disease mutation(s) are all recessive. What fraction of the population carries the mutation? • Males are 100 times more likely to have the ‘red’ type of color blindness than females. Why? • Individuals homozygous for S have sickle-cell disease. In an experiment, the counts of A/A : A/S : S/S were 9365 : 2993 : 29. Is HWE violated? Is there a reason for this violation? • A group of individuals was chosen in NYC. Would you be surprised if HWE was violated? • Conclusion: While the HW assumptions are rarely satisfied, the principle is still important as a baseline assumption, and significant deviations are interesting.
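Two of the questions on slide 7 reduce to short calculations. The sketch below assumes HWE for the phenylketonuria question and uses a standard chi-square goodness-of-fit test for the sickle-cell counts; the function names are hypothetical, and the choice of test is an assumption rather than something the slides specify.

```python
import math


def carrier_frequency(disease_frequency):
    """Heterozygote (carrier) frequency for a recessive disease under HWE."""
    q = math.sqrt(disease_frequency)  # affected genotype has frequency q^2
    p = 1.0 - q
    return 2.0 * p * q


def hwe_chi_square(n_AA, n_AS, n_SS):
    """Chi-square statistic comparing genotype counts with HWE expectations."""
    n = n_AA + n_AS + n_SS
    p = (2 * n_AA + n_AS) / (2 * n)  # estimated frequency of allele A
    q = 1.0 - p
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_AA, n_AS, n_SS)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


pku_carriers = carrier_frequency(1 / 10000)   # roughly 2% of the population
sickle_stat = hwe_chi_square(9365, 2993, 29)  # counts from the slide
```

For the sickle-cell counts the statistic is far larger than 3.84, the 5% critical value for one degree of freedom: there are many fewer S/S individuals and more A/S individuals than HWE predicts.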

  8. A modern example of HW application • SNP-chips can give us the allelic value at each polymorphic site based on hybridization. • Typically, the sites sampled have common SNPs (minor allele frequency > 10%) • We simply plot the 3 genotypes at each locus on 3 separate horizontal lines. • What is peculiar in the picture? • What is your conclusion?

  9. Linkage Equilibrium • HW equilibrium is the equilibrium in allele frequencies at a single locus. • Linkage equilibrium refers to the equilibrium between allele occurrences at two loci. • LE is due to recombination events • First, we will consider the case where no recombination occurs.

  10. Perfect phylogeny

  11. What if there were no recombinations? • Life would be simpler • Each individual sequence would have a single parent (even for higher ploidy) • True for mtDNA (maternal) and the Y chromosome (paternal) • The genealogical relationship is expressed as a tree. This principle can be used to track the ancestry of an individual

  12. Phylogeny • Recall that we often study a population in the form of a SNP matrix • Rows correspond to individuals (or individual chromosomes), columns correspond to SNPs • The matrix is binary (why?) • The underlying genealogy is hidden. If the span is large, the genealogy is no longer a tree. Why? • Example matrix rows: 0100, 0011, 0011, 0011, 1000, 1000, 1000

  13. Reconstructing Perfect Phylogeny • Input: SNP matrix M (n rows, m columns/sites/mutations) • Output: a tree with the following properties • Rows correspond to leaf nodes • We add mutations to edges • Each edge labeled with i splits the individuals into two subsets: individuals with a 1 in column i, and individuals with a 0 in column i • [Figure: an edge labeled i separating the rows with 1 in position i from the rows with 0 in position i]

  14. Reconstructing perfect phylogeny • Matrix (columns 1-8), rows: 01000000, 00110110, 00110100, 00110000, 10000000, 10001000, 10001001 • Each mutation can be labeled by the column number • Goal is to reconstruct the phylogeny • genographic atlas • [Figure: tree with edges labeled 2, 7, 6, 3, 4, 5, 1, 8]

  15. An algorithm for constructing a perfect phylogeny • We will consider the case where 0 is the ancestral state, and 1 is the mutated state. This restriction is lifted later (see the unrooted case). • In any tree, each node (except the root) has a single parent. • It is sufficient to construct a parent for every node. • In each step, we add a column and refine some of the nodes containing multiple children. • Stop when all columns have been considered.

  16. Columns • Define $i_0$: taxa (individuals) with a 0 at the i’th column • Define $i_1$: taxa (individuals) with a 1 at the i’th column

  17. Inclusion Property • For any pair of columns i, j, one of the following holds • $i_1 \subseteq j_1$ • $j_1 \subseteq i_1$ • $i_1 \cap j_1 = \emptyset$ • For any pair of columns i, j • i < j if and only if $i_1 \supset j_1$ • Note that if i < j then the edge containing i is an ancestor of the edge containing j
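The inclusion property on slide 17 can be tested by comparing the sets of rows that carry a 1 in each pair of columns. This is a minimal sketch using the example matrix from slide 18; the function names are hypothetical.

```python
def ones(matrix, j):
    """Set of row indices that have a 1 in column j."""
    return {r for r, row in enumerate(matrix) if row[j] == 1}


def satisfies_inclusion_property(matrix):
    """True if every pair of columns is nested or disjoint in its 1-sets."""
    n_cols = len(matrix[0]) if matrix else 0
    col_sets = [ones(matrix, j) for j in range(n_cols)]
    for i in range(n_cols):
        for j in range(i + 1, n_cols):
            a, b = col_sets[i], col_sets[j]
            if not (a <= b or b <= a or a.isdisjoint(b)):
                return False
    return True


# Rows A-E, columns 1-5, as in the worked example.
example = [
    [1, 1, 0, 0, 0],  # A
    [0, 0, 1, 0, 0],  # B
    [1, 1, 0, 1, 0],  # C
    [0, 0, 1, 0, 1],  # D
    [1, 0, 0, 0, 0],  # E
]
# Two columns that overlap without nesting violate the property.
conflict = [[1, 1], [1, 0], [0, 1]]
```

The pairwise check costs O(m^2 n) time for n rows and m columns, which is fine for illustration but slower than the sorted-column construction described on the following slides.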

  18. Perfect Phylogeny via Example • Matrix (columns 1-5): A = 1 1 0 0 0; B = 0 0 1 0 0; C = 1 1 0 1 0; D = 0 0 1 0 1; E = 1 0 0 0 0 • Initially, there is a single clade r, and each node has r as its parent • [Figure: root r with children A, B, C, D, E]

  19. Sort columns • Sort columns according to the inclusion property (note that the columns are already sorted here). • This can be achieved by considering the columns as binary representations of numbers (most significant bit in row 1) and sorting in decreasing order • Matrix (columns 1-5): A = 1 1 0 0 0; B = 0 0 1 0 0; C = 1 1 0 1 0; D = 0 0 1 0 1; E = 1 0 0 0 0
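The sorting step on slide 19 can be sketched as follows: each column is read as a binary number with the first row as the most significant bit, and column indices are returned in decreasing order of that number. The names `column_value` and `sort_columns` are hypothetical, and indices are 0-based, so slide column 1 is index 0.

```python
def column_value(matrix, j):
    """Read column j as a binary number, first row = most significant bit."""
    value = 0
    for row in matrix:
        value = value * 2 + row[j]
    return value


def sort_columns(matrix):
    """Column indices in decreasing order of their binary value."""
    n_cols = len(matrix[0])
    return sorted(range(n_cols),
                  key=lambda j: column_value(matrix, j),
                  reverse=True)


# Example matrix from the slides (rows A-E); its columns are already sorted.
example = [
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [1, 0, 0, 0, 0],
]
# The same matrix with its columns permuted.
shuffled = [
    [0, 1, 0, 1, 0],
    [0, 0, 0, 0, 1],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 1, 0, 0, 0],
]
```

If the 1-set of one column strictly contains the 1-set of another, the first column has the larger binary value, so ancestors are processed before their descendants.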

  20. Add first column • Matrix (columns 1-5): A = 1 1 0 0 0; B = 0 0 1 0 0; C = 1 1 0 1 0; D = 0 0 1 0 1; E = 1 0 0 0 0 • In adding column i • Check each individual and decide which side of the split it belongs to. • Finally, add a node if you can resolve a clade • [Figure: tree on the nodes r, u, B, D, A, C, E after adding column 1]

  21. Adding other columns • Matrix (columns 1-5): A = 1 1 0 0 0; B = 0 0 1 0 0; C = 1 1 0 1 0; D = 0 0 1 0 1; E = 1 0 0 0 0 • Add other columns on edges using the ordering property • [Figure: final tree rooted at r, with edges labeled 1, 2, 3, 4, 5 and nodes A, B, C, D, E]
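The construction walked through on slides 18 to 21 can be sketched as a trie over the sorted columns: each row follows, from the root, the edges of the columns in which it carries a 1. This is one possible implementation under the assumption that 0 is ancestral, not necessarily the exact procedure used in the lecture; the node layout (dictionaries with `children` and `taxa`) is hypothetical.

```python
def build_perfect_phylogeny(matrix, names):
    """Return the root of a rooted perfect phylogeny, or None if none exists.

    Each node is a dict: 'children' maps a column index (the mutation on
    the edge) to a child node, and 'taxa' lists the rows placed at the node.
    """
    n_cols = len(matrix[0])
    # Decreasing binary order: compare columns with the first row first.
    order = sorted(range(n_cols),
                   key=lambda j: [row[j] for row in matrix],
                   reverse=True)
    root = {"children": {}, "taxa": []}
    placed = set()  # columns that already label an edge
    for name, row in zip(names, matrix):
        node = root
        for j in order:
            if row[j] != 1:
                continue
            if j not in node["children"]:
                if j in placed:
                    return None  # column j would have to label two edges
                node["children"][j] = {"children": {}, "taxa": []}
                placed.add(j)
            node = node["children"][j]
        node["taxa"].append(name)
    return root


example = [
    [1, 1, 0, 0, 0],  # A
    [0, 0, 1, 0, 0],  # B
    [1, 1, 0, 1, 0],  # C
    [0, 0, 1, 0, 1],  # D
    [1, 0, 0, 0, 0],  # E
]
tree = build_perfect_phylogeny(example, ["A", "B", "C", "D", "E"])
no_tree = build_perfect_phylogeny([[1, 1], [1, 0], [0, 1]], ["x", "y", "z"])
```

In this sketch a taxon may sit at an internal node (E sits below edge 1, with A and C further down); hanging each such taxon as an extra leaf child gives a tree in which all rows are leaves. Column indices are 0-based, so edge 0 here is column 1 on the slides.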

  22. Unrooted case • The important point is that the perfect phylogeny condition does not change when you interchange 1s and 0s in a column. • Algorithm (unrooted) • Switch the values in each column, so that 0 is the majority element. • Apply the algorithm for the rooted case. • Relabel columns and individuals. • Show that this is a correct algorithm.
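The first step of the unrooted algorithm, switching columns so that 0 is the majority element, can be sketched as below. The function name `to_zero_major` is hypothetical, and leaving tied columns unflipped is an assumption; the slides do not say how ties are handled.

```python
def to_zero_major(matrix):
    """Flip every column in which 1 is the strict majority.

    Returns the new matrix and a list of booleans marking flipped columns,
    so that the original labels can be restored afterwards.
    """
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    flipped = [2 * sum(row[j] for row in matrix) > n_rows
               for j in range(n_cols)]
    new_matrix = [[1 - row[j] if flipped[j] else row[j]
                   for j in range(n_cols)]
                  for row in matrix]
    return new_matrix, flipped


# Column 0 has two 1s out of three rows, so it is flipped; column 1 is not.
zero_major, flips = to_zero_major([[1, 0], [1, 0], [0, 1]])
```

The returned flags make the relabeling step reversible: a mutation on a flipped column reads 1→0 in the original labels.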

  23. Unrooted perfect phylogeny • We transform matrix M to a 0-major matrix $M_0$. • If $M_0$ has a directed perfect phylogeny, M has a perfect phylogeny. • If M has a perfect phylogeny, does $M_0$ have a directed perfect phylogeny?

  24. Unrooted case • Theorem: If M has a perfect phylogeny, there exists a relabeling and a perfect phylogeny such that • The root is all 0s • For any SNP (column), #1s <= #0s • All edges are mutated 0→1

  25. Proof • Consider the perfect phylogeny of M. • Find the center: none of the clades has more than n/2 nodes. • Is this always possible? • Root at one of the 3 edges of the center, and direct all mutations 0→1 away from the root. QED • If the theorem is correct, then simply relabeling all columns so that the majority element is 0 is sufficient.

  26. ‘Homework’ Problems • What if there is missing data (an entry that can be 0 or 1)? • What if there are recurrent mutations?

  27. Linkage Disequilibrium

  28. Quiz • Recall that a SNP data-set is a ‘binary’ matrix. • Rows are individuals (chromosomes) • Columns are alleles at a specific locus • Suppose you have 2 SNP datasets of a contiguous genomic region but no other information • One from an African population, and one from a European population. • Can you tell which is which? • How long does the genomic region have to be?

  29. Linkage (Dis)-equilibrium (LD) • Consider sites A & B • Case 1: No recombination • Each new individual chromosome chooses a parent from the existing ‘haplotypes’ • Existing haplotypes (A B): 0 1, 0 1, 0 0, 0 0, 1 0, 1 0, 1 0, 1 0; new chromosome: 1 0

  30. Linkage (Dis)-equilibrium (LD) • Consider sites A & B • Case 2: diploidy and recombination • Each new individual chooses a parent from the existing alleles • Existing haplotypes (A B): 0 1, 0 1, 0 0, 0 0, 1 0, 1 0, 1 0, 1 0; new chromosome: 1 1

  31. Linkage (Dis)-equilibrium (LD) • Consider sites A & B • Case 1: No recombination • Each new individual chooses a parent from the existing ‘haplotypes’ • Pr[(A,B) = (0,1)] = 0.25 • Linkage disequilibrium • Case 2: Extensive recombination • Each new individual simply chooses an allele at each site independently • Pr[(A,B) = (0,1)] = 0.125 • Linkage equilibrium • Haplotypes (A B): 0 1, 0 1, 0 0, 0 0, 1 0, 1 0, 1 0, 1 0

  32. LD • In the absence of recombination, • Correlation between columns • The joint probability Pr[A=a, B=b] is different from Pr(a) Pr(b) • With extensive recombination • Pr(a, b) = Pr(a) Pr(b)

  33. Measures of LD • Consider two bi-allelic sites with alleles marked with 0 and 1 • Define • $P_{00}$ = Pr[allele 0 in locus 1, and 0 in locus 2] • $P_{0*}$ = Pr[allele 0 in locus 1] • Linkage equilibrium if $P_{00} = P_{0*} P_{*0}$ • $D = P_{00} - P_{0*} P_{*0} = -(P_{01} - P_{0*} P_{*1}) = \dots$
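As an illustration of the definition of D, the sketch below computes it from a list of two-site haplotypes, using the eight haplotypes shown on slide 31. The function name `ld_d` is hypothetical.

```python
def ld_d(haplotypes):
    """D = P00 - P0* P*0 for a list of (site 1 allele, site 2 allele) pairs."""
    n = len(haplotypes)
    p00 = sum(1 for a, b in haplotypes if a == 0 and b == 0) / n
    p0_star = sum(1 for a, _ in haplotypes if a == 0) / n
    p_star0 = sum(1 for _, b in haplotypes if b == 0) / n
    return p00 - p0_star * p_star0


# Haplotypes (A, B) from the slide: 01, 01, 00, 00, 10, 10, 10, 10.
slide_haplotypes = [(0, 1), (0, 1), (0, 0), (0, 0),
                    (1, 0), (1, 0), (1, 0), (1, 0)]
d_value = ld_d(slide_haplotypes)
```

Here P00 = 0.25, P0* = 0.5 and P*0 = 0.75, so D = 0.25 - 0.375 = -0.125. Equivalently, -(P01 - P0* P*1) = -(0.25 - 0.125) = -0.125, matching the two values 0.25 and 0.125 quoted on slide 31.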

  34. Other measures of LD • D' is obtained by dividing D by its largest possible value • Suppose $D = P_{00} - P_{0*} P_{*0} > 0$. Then the maximum value is $D_{\max} = \min\{P_{0*} P_{*1},\ P_{1*} P_{*0}\}$ • If $D < 0$, then the extreme (most negative) value is $D_{\max} = \max\{-P_{0*} P_{*0},\ -P_{1*} P_{*1}\}$ • $D' = D / D_{\max}$ • [Table: 2x2 table of site 1 alleles against site 2 alleles, with cells marked D or -D]
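The normalization on slide 34 can be sketched as follows. The bounds come from requiring every haplotype frequency to stay non-negative; the function name `d_prime` and the convention of returning 0 when the bound is 0 are assumptions.

```python
def d_prime(p00, p0_star, p_star0):
    """D' = D / Dmax, with Dmax chosen according to the sign of D."""
    p1_star = 1.0 - p0_star
    p_star1 = 1.0 - p_star0
    d = p00 - p0_star * p_star0
    if d >= 0:
        # P01 and P10 must stay non-negative.
        d_max = min(p0_star * p_star1, p1_star * p_star0)
    else:
        # P00 and P11 must stay non-negative.
        d_max = max(-p0_star * p_star0, -p1_star * p_star1)
    return 0.0 if d_max == 0 else d / d_max


# Slide 31 haplotypes: P00 = 0.25, P0* = 0.5, P*0 = 0.75.
slide_value = d_prime(0.25, 0.5, 0.75)
```

For the slide 31 haplotypes D = -0.125 and the bound is also -0.125, so D' = 1: the haplotype 11 is entirely absent, which is the most extreme disequilibrium possible at these allele frequencies.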

  35. Other measures of LD • D' is obtained by dividing D by its largest possible value • Ex: $D' = |P_{11} - P_{1*} P_{*1}| / D_{\max}$ • $\rho = D / (P_{1*} P_{0*} P_{*1} P_{*0})^{1/2}$ • Let N be the number of individuals • Show that $\rho^2 N$ is the $\chi^2$ statistic between the two sites • [Table: 2x2 table of counts for site 1 against site 2, with cell (0,0) equal to $P_{00}N$ and row total $P_{0*}N$]

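  36. Digression: The $\chi^2$ test • The statistic $\sum_i (O_i - E_i)^2 / E_i$ (observed counts $O_i$, expected counts $E_i$) approximately follows a $\chi^2$ distribution (a sum of squares of normal variables). • A p-value can be computed directly • [Table: 2x2 table of observed counts $O_1$, $O_2$ (row 0) and $O_3$, $O_4$ (row 1), with columns 0 and 1]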

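  37. Observed and expected • Observed counts (site 1 allele, site 2 allele): (0,0) = $P_{00}N$; (0,1) = $P_{01}N$; (1,0) = $P_{10}N$; (1,1) = $P_{11}N$ • Expected counts: (0,0) = $P_{0*}P_{*0}N$; (0,1) = $P_{0*}P_{*1}N$; (1,0) = $P_{1*}P_{*0}N$; (1,1) = $P_{1*}P_{*1}N$ • $\rho = D / (P_{1*} P_{0*} P_{*1} P_{*0})^{1/2}$ • Verify that $\rho^2 N$ is the $\chi^2$ statistic between the two sites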
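The identity on slide 37 can be checked numerically before proving it. The sketch below computes both sides from a 2x2 table of haplotype counts; the function name `rho_and_chi_square` is hypothetical, and the counts are those of the eight haplotypes on slide 31.

```python
import math


def rho_and_chi_square(counts):
    """Return (rho, chi-square, N) for a 2x2 table of haplotype counts.

    counts[a][b] is the number of haplotypes with allele a at site 1
    and allele b at site 2.
    """
    n = sum(sum(row) for row in counts)
    p_row = [sum(counts[a]) / n for a in (0, 1)]                  # P0*, P1*
    p_col = [(counts[0][b] + counts[1][b]) / n for b in (0, 1)]   # P*0, P*1
    d = counts[0][0] / n - p_row[0] * p_col[0]
    rho = d / math.sqrt(p_row[0] * p_row[1] * p_col[0] * p_col[1])
    chi2 = sum((counts[a][b] - n * p_row[a] * p_col[b]) ** 2
               / (n * p_row[a] * p_col[b])
               for a in (0, 1) for b in (0, 1))
    return rho, chi2, n


# Counts for haplotypes 00, 01, 10, 11 in the slide 31 example: 2, 2, 4, 0.
rho, chi2, n = rho_and_chi_square([[2, 2], [4, 0]])
```

For this table the expected counts are 3, 1, 3, 1, so the statistic is 1/3 + 1 + 1/3 + 1 = 8/3, and rho^2 N = (1/3)(8) = 8/3 as claimed. The function assumes both sites are polymorphic; otherwise the denominators are zero.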

  38. LD over time and distance • The number of recombination events between two sites can be assumed to be Poisson distributed. • Let r denote the recombination rate between two adjacent sites • r = # crossovers per bp per generation • The recombination rate between two sites l apart is rl

  39. LD over time • Decay in LD • Let $D^{(t)}$ = LD at time t between two sites • $r' = lr$ • $P^{(t)}_{00} = (1-r')\, P^{(t-1)}_{00} + r'\, P^{(t-1)}_{0*} P^{(t-1)}_{*0}$ • $D^{(t)} = P^{(t)}_{00} - P^{(t)}_{0*} P^{(t)}_{*0} = P^{(t)}_{00} - P^{(t-1)}_{0*} P^{(t-1)}_{*0}$ (Why?) • $D^{(t)} = (1-r')\, D^{(t-1)} = (1-r')^t\, D^{(0)}$
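The recurrence on slide 39 can be iterated directly to see the geometric decay. This is a minimal sketch that assumes, as the slide's "(Why?)" step does, that the allele frequencies stay constant from one generation to the next; the function name `ld_decay` is hypothetical.

```python
def ld_decay(p00, p0_star, p_star0, r_prime, generations):
    """Iterate P00(t) = (1 - r') P00(t-1) + r' P0* P*0 and record D(t)."""
    d_values = [p00 - p0_star * p_star0]
    for _ in range(generations):
        # Allele frequencies P0* and P*0 are held fixed across generations.
        p00 = (1.0 - r_prime) * p00 + r_prime * p0_star * p_star0
        d_values.append(p00 - p0_star * p_star0)
    return d_values


# Start from the slide 31 haplotype frequencies with r' = 0.1.
decay = ld_decay(0.25, 0.5, 0.75, 0.1, 10)
```

Subtracting the constant product P0* P*0 from both sides of the recurrence gives D(t) = (1 - r') D(t-1), so each entry of `decay` is 0.9 times the previous one.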

  40. LD over distance • Assumption • Recombination rate increases linearly with distance and time • LD decays exponentially. • The assumption is reasonable, but recombination rates vary from region to region, adding to complexity • This simple fact is the basis of disease association mapping.

  41. LD and disease mapping • Consider a mutation that is causal for a disease. • The goal of disease gene mapping is to discover which gene (locus) carries the mutation. • Consider every polymorphism, and check: • There might be too many polymorphisms • Multiple mutations (even at a single locus) that lead to the same disease • Instead, consider a dense sample of polymorphisms that span the genome

  42. LD can be used to map disease genes • LD decays with distance from the disease allele. • By plotting LD, one can shortlist the region containing the disease gene. • [Figure: plot of LD along the region, with individuals labeled D, N, N, D, D, N and alleles 0, 1, 1, 0, 0, 1]

  43. 269 individuals • 90 Yorubans • 90 Europeans (CEPH) • 44 Japanese • 45 Chinese • ~1M SNPs

  44. Haplotype blocks • It was found that recombination rates vary across the genome • How can the recombination rate be measured? • In regions with low recombination, you expect to see long haplotypes that are conserved. Why? • Typically, haplotype blocks do not span recombination hot-spots

  45. Long haplotypes • Chr 2 region with high $r^2$ value (implies little/no recombination) • History/genealogy can be explained by a tree (a perfect phylogeny) • Large haplotypes with high frequency

  46. 19q13

  47. LD variation across populations (Reich et al., Nature 411, 199-204, 10 May 2001) • LD is maintained up to 60 kb in the Swedish population, 6 kb in the Yoruban population

  48. Understanding Population Data • We described various population genetic concepts (HW, LD), and their applicability • The values of these parameters depend critically upon the population assumptions. • What if we do not have infinite populations? • No random mating (Ex: geographic isolation) • Sudden growth • Bottlenecks • Admixture • It would be nice to have a simulation of such a population to test various ideas. How would you do this simulation?
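One possible starting point for the simulation asked about on slide 48 is a neutral Wright-Fisher model of a finite population, which the slides do not name and which is used here only as an assumption. Each generation, the 2N chromosomes of the next generation are drawn independently from the current allele frequency. The function name and parameters are hypothetical.

```python
import random


def wright_fisher(pop_size, p0, generations, seed=0):
    """Allele frequency trajectory in a finite diploid population.

    pop_size is the number of diploid individuals (2 * pop_size chromosomes),
    p0 the starting frequency of the allele, seed the random seed.
    """
    rng = random.Random(seed)
    n_chrom = 2 * pop_size
    p = p0
    freqs = [p]
    for _ in range(generations):
        # Each chromosome copies a random parent, carrying the allele
        # with probability equal to its current frequency.
        count = sum(1 for _ in range(n_chrom) if rng.random() < p)
        p = count / n_chrom
        freqs.append(p)
    return freqs


trajectory = wright_fisher(pop_size=50, p0=0.3, generations=100)
```

In small populations the frequency drifts and can reach 0 or 1, which is one way a finite population departs from the infinite-population assumption behind HWE. Bottlenecks or sudden growth could be modeled by letting `pop_size` vary by generation, and admixture by mixing frequencies from two such populations.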
