Population Genetics Basics
Terminology review • Allele • Locus • Diploid • SNP
Single Nucleotide Polymorphisms • Infinite Sites Assumption: each site mutates at most once • Example haplotypes (one string per chromosome, one position per SNP): 00000101011, 10001101001, 01000101010, 01000000011, 00011110000, 00101100110
What causes variation in a population? • Mutations (may lead to SNPs) • Recombinations • Other genetic events (gene conversion) • Structural Polymorphisms
Recombination • Parental chromosomes: 00000000 and 11111111 • Recombinant: 00011111 (crossover after the third site)
Gene Conversion • Gene Conversion versus crossover • Hard to distinguish in a population
Structural polymorphisms • Large scale structural changes (deletions/insertions/inversions) may occur in a population.
Topic 1: Basic Principles • In a ‘stable’ population, the distribution of alleles obeys certain laws • Not really, and the deviations are interesting • HW Equilibrium • (due to mixing in a population) • Linkage (dis)-equilibrium • Due to recombination
Hardy Weinberg equilibrium • Consider a locus with 2 alleles, A and a • p (respectively, q = 1 - p) is the frequency of A (resp. a) in the population • 3 genotypes: AA, Aa, aa • Q: What is the frequency of each genotype? • If various assumptions are satisfied (such as random mating and no natural selection), then • $P_{AA} = p^2$ • $P_{Aa} = 2pq$ • $P_{aa} = q^2$
Hardy Weinberg: why? • Assumptions: • Diploid • Sexual reproduction • Random mating • Bi-allelic sites • Large population size, … • Why? Each individual draws its two chromosomes independently at random from the population. Therefore, Prob(Aa) = pq + qp = 2pq, and so on.
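The argument above can be checked numerically. The following is a minimal sketch, assuming the HW assumptions listed on the slide hold; the function names are illustrative and not part of the original slides.

```python
def hardy_weinberg(p):
    """Expected genotype frequencies (AA, Aa, aa) when allele A has frequency p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    q = 1.0 - p
    # Each chromosome is drawn independently: AA = p*p, Aa = pq + qp, aa = q*q.
    return p * p, 2.0 * p * q, q * q


def next_generation_allele_freq(p):
    """Frequency of A among the gametes produced by a population in HW proportions."""
    p_AA, p_Aa, _ = hardy_weinberg(p)
    # AA parents always transmit A; Aa parents transmit A half of the time.
    return p_AA + 0.5 * p_Aa


print(hardy_weinberg(0.3))
print(next_generation_allele_freq(0.3))
```

The second function returns p again (up to rounding), since p^2 + pq = p(p + q) = p. Under these assumptions the allele frequency stays the same from one generation to the next.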
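Hardy Weinberg: Generalizations • Multiple alleles $a_1, \ldots, a_k$ with frequencies $p_1, \ldots, p_k$ • By HW, $\Pr[a_i a_i] = p_i^2$ and $\Pr[a_i a_j] = 2 p_i p_j$ for $i \neq j$ • Multiple loci?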
Hardy Weinberg: Implications • The allele frequency does not change from generation to generation. Why? • It is observed that 1 in 10,000 Caucasians have the disease phenylketonuria. The disease mutation(s) are all recessive. What fraction of the population carries the mutation? • Males are 100 times more likely to have the 'red' type of color blindness than females. Why? • Conclusion: While the HW assumptions are rarely satisfied, the principle is still important as a baseline assumption, and significant deviations are interesting.
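One possible way to work the two numerical questions is sketched below. It assumes HW proportions hold, that all disease alleles can be pooled into one recessive allele a, and that the color-blindness allele is X-linked recessive. The function names are hypothetical.

```python
import math


def carrier_fraction(prevalence):
    """Fraction of heterozygous carriers (2pq) when affected individuals (aa) have frequency q^2."""
    q = math.sqrt(prevalence)   # frequency of the recessive disease allele
    p = 1.0 - q
    return 2.0 * p * q


def x_linked_allele_frequency(male_to_female_ratio):
    """Males (one X) are affected with probability q, females (two X) with q^2, so the ratio is 1/q."""
    return 1.0 / male_to_female_ratio


print(round(carrier_fraction(1 / 10_000), 4))   # 0.0198
print(x_linked_allele_frequency(100))           # 0.01
```

Under these assumptions about 2% of the population carries a phenylketonuria mutation, and a 100-fold male excess corresponds to an X-linked allele of frequency about 0.01.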
What if there were no recombinations? • Life would be simpler • Each individual sequence would have a single parent (even for higher ploidy) • The relationship is expressed as a tree.
The Infinite Sites Assumption • [Figure: tree of haplotypes. The root is 00000000; a mutation at site 3 gives 00100000; from there, a mutation at site 5 gives 00101000 and a mutation at site 8 gives 00100001.] • The different sites are linked. A 1 in position 8 implies 0 in position 5, and vice versa. • Some phenotypes could be linked to the polymorphisms • Some of the linkage is "destroyed" by recombination
Infinite sites assumption and Perfect Phylogeny • Each site is mutated at most once in the history. • All descendants must carry the mutated value, and all others must carry the ancestral value • [Figure: a tree edge labeled i; nodes below the edge have 1 in position i, all other nodes have 0 in position i.]
Perfect Phylogeny • Assume an evolutionary model in which no recombination takes place, only mutation. • The evolutionary history is explained by a tree in which every mutation is on an edge of the tree. Removing the edge that carries a mutation splits the tree into two subtrees: all the species in one subtree have a 0 at that site, and all species in the other have a 1. Such a tree is called a perfect phylogeny.
The 4-gamete condition • A column i partitions the set of species into two sets, i0 and i1 • A column is homogeneous w.r.t. a set of species if it has the same value for all species in the set. Otherwise, it is heterogeneous. • Example: column i has values A=0, B=0, C=0, D=1, E=1, F=1, so i0 = {A,B,C} and i1 = {D,E,F}, and i is heterogeneous w.r.t. {A,D,E}
4 Gamete Condition • There exists a perfect phylogeny if and only if for all pairs of columns (i,j), j is homogeneous w.r.t. i0 or w.r.t. i1. • Equivalently: there exists a perfect phylogeny if and only if for all pairs of columns (i,j), the 4 rows (0,0), (0,1), (1,0), (1,1) do not all occur
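The pairwise form of the condition translates directly into a check over all column pairs. This is a minimal sketch, assuming the data is a list of equal-length rows of 0/1 values; the function name is illustrative.

```python
from itertools import combinations


def find_four_gamete_violation(matrix):
    """Return a column pair (i, j) in which all four gametes occur, or None if there is none."""
    n_cols = len(matrix[0]) if matrix else 0
    for i, j in combinations(range(n_cols), 2):
        # Collect the distinct (value at i, value at j) combinations over all rows.
        gametes = {(row[i], row[j]) for row in matrix}
        if len(gametes) == 4:
            return i, j
    return None


# Rows A to E of the 5-site example matrix used in the slides that follow.
example = [
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [1, 0, 0, 0, 0],
]
print(find_four_gamete_violation(example))                            # None
print(find_four_gamete_violation([[0, 0], [0, 1], [1, 0], [1, 1]]))   # (0, 1)
```

The check costs O(m^2 n) time for n rows and m columns, which is fine for illustration but slower than the sorting-based construction.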
4-gamete condition: proof • [Figure: the edge carrying mutation i separates the species into the sets i0 and i1.] • Depending on which edge the mutation j occurs on, either i0 or i1 must be homogeneous w.r.t. j. • (only if) Every perfect phylogeny satisfies the 4-gamete condition • (if) If the 4-gamete condition is satisfied, does a perfect phylogeny exist?
An algorithm for constructing a perfect phylogeny • We will consider the case where 0 is the ancestral state, and 1 is the mutated state. This restriction is lifted later (unrooted case). • In any tree, each node (except the root) has a single parent. • It is sufficient to construct a parent for every node. • In each step, we add a column and refine some of the nodes containing multiple children. • Stop when all columns have been considered.
Inclusion Property • For any pair of columns i,j • i < j if and only if i1 ⊇ j1 • Note that if i < j then the edge containing i is an ancestor of the edge containing j • [Figure: edge i above edge j on a path from the root.]
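Example • Matrix (rows = species A to E, columns = sites 1 to 5): A = 1 1 0 0 0; B = 0 0 1 0 0; C = 1 1 0 1 0; D = 0 0 1 0 1; E = 1 0 0 0 0 • Initially, there is a single clade r, and each node has r as its parent • [Figure: root r with children A, B, C, D, E.]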
Sort columns • Sort columns according to the inclusion property (note that the columns are already sorted here). • This can be achieved by considering the columns as binary representations of numbers (most significant bit in row 1) and sorting in decreasing order • Matrix (rows = species A to E, columns = sites 1 to 5): A = 1 1 0 0 0; B = 0 0 1 0 0; C = 1 1 0 1 0; D = 0 0 1 0 1; E = 1 0 0 0 0
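The sorting step can be sketched as follows, reading each column top to bottom as a binary number. The function names are illustrative.

```python
def column_value(matrix, j):
    """Read column j as a binary number, with the first row as the most significant bit."""
    value = 0
    for row in matrix:
        value = (value << 1) | row[j]
    return value


def sort_columns(matrix):
    """Return the column indices (0-based) in decreasing order of their binary value."""
    n_cols = len(matrix[0])
    return sorted(range(n_cols), key=lambda j: column_value(matrix, j), reverse=True)


example = [
    [1, 1, 0, 0, 0],   # A
    [0, 0, 1, 0, 0],   # B
    [1, 1, 0, 1, 0],   # C
    [0, 0, 1, 0, 1],   # D
    [1, 0, 0, 0, 0],   # E
]
print([column_value(example, j) for j in range(5)])   # [21, 20, 10, 4, 2]
print(sort_columns(example))                          # [0, 1, 2, 3, 4]
```

If the 1-set of column i strictly contains the 1-set of column j, the first row where the two columns differ has a 1 in i and a 0 in j, so i receives the larger value and is placed first.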
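Add first column • In adding column i • For each node, check which side of the new edge it belongs on. • Finally, add a node if you can resolve a clade • Matrix (rows = species A to E, columns = sites 1 to 5): A = 1 1 0 0 0; B = 0 0 1 0 0; C = 1 1 0 1 0; D = 0 0 1 0 1; E = 1 0 0 0 0 • [Figure: root r with children u, B, D; the new node u has children A, C, E.]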
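Adding other columns • Add other columns on edges using the ordering property • Matrix (rows = species A to E, columns = sites 1 to 5): A = 1 1 0 0 0; B = 0 0 1 0 0; C = 1 1 0 1 0; D = 0 0 1 0 1; E = 1 0 0 0 0 • [Figure: final tree. Root r has edges 1 and 3. Below edge 1: leaf E and edge 2; below edge 2: leaf A and edge 4, which leads to C. Below edge 3: leaf B and edge 5, which leads to D.]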
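The whole rooted construction can be sketched compactly. The version below is not the slides' exact column-by-column refinement: it inserts each row's mutations, in sorted column order, as a path from the root, which is a common equivalent formulation. It assumes 0 is the ancestral state. The node names "r" and "m1", "m2", … are hypothetical labels for the root and for the node below each mutation edge.

```python
def build_perfect_phylogeny(matrix, names):
    """Return a {node: parent} map for a rooted perfect phylogeny, or None if none exists."""
    n_cols = len(matrix[0])

    def column_value(j):
        # Column read as a binary number, first row = most significant bit.
        return int("".join(str(row[j]) for row in matrix), 2)

    order = sorted(range(n_cols), key=column_value, reverse=True)

    parent = {}              # node -> parent node; "r" is the root
    child_via = {"r": {}}    # node -> {column: child reached through that mutation}
    placed = set()           # columns that already label an edge
    for name, row in zip(names, matrix):
        node = "r"
        for j in order:
            if row[j] != 1:
                continue
            if j not in child_via[node]:
                if j in placed:
                    return None          # mutation j would be needed on two edges
                placed.add(j)
                new_node = f"m{j + 1}"
                parent[new_node] = node
                child_via[node][j] = new_node
                child_via[new_node] = {}
            node = child_via[node][j]
        parent[name] = node              # the species hangs below its last mutation
    return parent


example = [
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [1, 0, 0, 0, 0],
]
tree = build_perfect_phylogeny(example, ["A", "B", "C", "D", "E"])
for node in sorted(tree):
    print(node, "->", tree[node])
```

On the example matrix this reproduces the tree on the slide: E and edge 2 below edge 1, A and edge 4 (leading to C) below edge 2, and B and edge 5 (leading to D) below edge 3. Species names are assumed not to clash with the generated internal node names.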
Unrooted case • In each column, swap 0 and 1 where necessary so that 0 is the majority value. • Apply the algorithm for the rooted case
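The relabeling step might look like the sketch below. The function name is illustrative, and leaving tied columns unchanged is an assumption made here, not something the slide specifies.

```python
def normalize_to_majority_zero(matrix):
    """Flip every column in which 1 is the strict majority; return (new_matrix, flipped_flags)."""
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    # A column is flipped when more than half of its entries are 1 (ties are left as they are).
    flipped = [2 * sum(row[j] for row in matrix) > n_rows for j in range(n_cols)]
    new_matrix = [
        [(bit ^ 1) if flipped[j] else bit for j, bit in enumerate(row)]
        for row in matrix
    ]
    return new_matrix, flipped


normalized, flipped = normalize_to_majority_zero([[1, 0], [1, 1], [0, 0]])
print(normalized)   # [[0, 0], [0, 1], [1, 0]]
print(flipped)      # [True, False]
```

The flags record which columns were relabeled, so the original allele labels can be restored after the rooted algorithm has been applied.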
Handling recombination • A tree is not sufficient as a sequence may have 2 parents • Recombination leads to loss of correlation between columns
Linkage (Dis)-equilibrium (LD) • Consider sites A & B • Haplotypes (A B): 01, 01, 00, 00, 10, 10, 10, 10 • Case 1: No recombination • Pr[(A,B) = (0,1)] = 2/8 = 0.25 • Linkage disequilibrium • Case 2: Extensive recombination • Pr[(A,B) = (0,1)] = Pr[A=0] Pr[B=1] = 0.5 × 0.25 = 0.125 • Linkage equilibrium
Recombination, and populations • Think of a population of N individual chromosomes. • The population remains stable from generation to generation. • Without recombination, each individual has exactly one parent chromosome from the previous generation. • With recombinations, each individual is derived from one or two parents. • We will formalize this notion later in the context of coalescent theory.
Linkage (Dis)-equilibrium (LD) • Consider sites A & B • Case 1: No recombination • Each new individual chromosome chooses a parent from the existing 'haplotypes' • Existing haplotypes (A B): 01, 01, 00, 00, 10, 10, 10, 10 • New chromosome: 10 (a copy of an existing haplotype)
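Linkage (Dis)-equilibrium (LD) • Consider sites A & B • Case 2: diploidy and recombination • Each new individual chooses an allele at each site from the existing alleles • Existing haplotypes (A B): 01, 01, 00, 00, 10, 10, 10, 10 • New chromosome: 11 (a haplotype not present before; A and B come from different parents)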
Linkage (Dis)-equilibrium (LD) • Consider sites A & B • Haplotypes (A B): 01, 01, 00, 00, 10, 10, 10, 10 • Case 1: No recombination • Each new individual chooses a parent from the existing 'haplotypes' • Pr[(A,B) = (0,1)] = 2/8 = 0.25 • Linkage disequilibrium • Case 2: Extensive recombination • Each new individual simply chooses an allele at each site independently • Pr[(A,B) = (0,1)] = Pr[A=0] Pr[B=1] = 0.5 × 0.25 = 0.125 • Linkage equilibrium
LD • In the absence of recombination, • Correlation between columns • The joint probability Pr[A=a,B=b] is different from P(a)P(b) • With extensive recombination • Pr(a,b)=P(a)P(b)
Measures of LD • Consider two bi-allelic sites with alleles marked with 0 and 1 • Define • P00 = Pr[Allele 0 in locus 1, and 0 in locus 2] • P0* = Pr[Allele 0 in locus 1] • Linkage equilibrium if P00 = P0* P*0 • D = abs(P00 - P0* P*0) = abs(P01 - P0* P*1) = …
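As an illustration of the definition, the sketch below computes the signed quantity P00 - P0* P*0 from a list of two-site haplotypes; its absolute value is the D defined above. The function name and the input format (a list of (allele at locus 1, allele at locus 2) tuples) are assumptions made for this example.

```python
from collections import Counter


def signed_ld(haplotypes):
    """Return P00 - P0* P*0 for a list of (allele at locus 1, allele at locus 2) pairs."""
    n = len(haplotypes)
    counts = Counter(haplotypes)
    p00 = counts[(0, 0)] / n
    p0_star = (counts[(0, 0)] + counts[(0, 1)]) / n   # allele 0 at locus 1
    p_star0 = (counts[(0, 0)] + counts[(1, 0)]) / n   # allele 0 at locus 2
    return p00 - p0_star * p_star0


# The eight haplotypes from the earlier two-site example: 01, 01, 00, 00, 10, 10, 10, 10.
sample = [(0, 1)] * 2 + [(0, 0)] * 2 + [(1, 0)] * 4
print(abs(signed_ld(sample)))                               # 0.125
print(signed_ld([(0, 0), (0, 1), (1, 0), (1, 1)]))          # 0.0
```

For the sample, P01 - P0* P*1 = 0.25 - 0.5 × 0.25 = 0.125, the same magnitude with the opposite sign, which is why the absolute values on the slide agree.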
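LD over time • With random mating and a fixed recombination rate r between the sites, linkage disequilibrium decays to zero • Let $D(t)$ = LD at time t (signed, without the absolute value) • $P_{00}(t) = (1-r)\,P_{00}(t-1) + r\,P_{0*}(t-1)\,P_{*0}(t-1)$ • Allele frequencies do not change from generation to generation, so $D(t) = P_{00}(t) - P_{0*}(t)\,P_{*0}(t) = P_{00}(t) - P_{0*}(t-1)\,P_{*0}(t-1)$ • Substituting the recurrence for $P_{00}(t)$: $D(t) = (1-r)\,D(t-1) = (1-r)^t\,D(0)$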
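The recurrence can be iterated directly to confirm the geometric decay. This is a minimal sketch, assuming constant allele frequencies and a fixed r; the function name and the starting values are illustrative.

```python
def ld_over_time(p00, p0_star, p_star0, r, generations):
    """Iterate P00(t) = (1 - r) P00(t-1) + r P0* P*0 and return [D(0), D(1), ..., D(generations)]."""
    d_values = [p00 - p0_star * p_star0]
    for _ in range(generations):
        # With probability 1 - r the haplotype is inherited intact; otherwise the
        # two alleles are drawn independently. Allele frequencies stay fixed.
        p00 = (1 - r) * p00 + r * p0_star * p_star0
        d_values.append(p00 - p0_star * p_star0)
    return d_values


decay = ld_over_time(p00=0.5, p0_star=0.5, p_star0=0.5, r=0.1, generations=10)
for t, d in enumerate(decay):
    print(t, round(d, 6), round(0.25 * 0.9 ** t, 6))
```

The two printed columns agree: the iterated values match the closed form (1 - r)^t D(0), here with D(0) = 0.25 and r = 0.1.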
LD over distance • Assumption • Recombination rate increases linearly with distance • LD decays exponentially with distance. • The assumption is reasonable, but recombination rates vary from region to region, adding to complexity • This simple fact is the basis of disease association mapping.
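A short worked form of this claim, combining the linearity assumption with the decay D(t) = (1-r)^t D(0): write the recombination rate between two sites at distance d as r = c d, where c is a hypothetical per-unit-distance rate introduced here only for illustration. For small c d,

```latex
D(t) = (1 - c\,d)^{t}\, D(0) \;\approx\; e^{-c\,d\,t}\, D(0), \qquad c\,d \ll 1 .
```

For a fixed number of generations t, LD therefore falls off roughly exponentially in the distance d.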
LD and disease mapping • Consider a mutation that is causal for a disease. • The goal of disease gene mapping is to discover which gene (locus) carries the mutation. • Naive approach: consider every polymorphism, and check whether it is associated with the disease. Problems: • There might be too many polymorphisms • Multiple mutations (even at a single locus) may lead to the same disease • Instead, consider a dense sample of polymorphisms that span the genome
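LD can be used to map disease genes • LD decays with distance from the disease allele. • By plotting LD, one can shortlist the region containing the disease gene. • [Figure: plot of LD along the chromosome; individuals labeled D (disease) or N (normal) in the order D N N D D N, with marker alleles 0 1 1 0 0 1.]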
LD and disease gene mapping problems • Marker density? • Complex diseases • Population sub-structure
Human Samples • We look at data from human samples • Gabriel et al. Science 2002. • 3 populations were sampled at multiple regions spanning the genome • 54 regions (average size 250 kb) • SNP density: 1 SNP per 2 kb • 90 individuals from Nigeria (Yoruban) • 93 Europeans • 42 Asians • 50 African Americans
Population specific recombination • D' was used as the measure of LD between SNP pairs. • SNP pairs were classified into one of the following • Strong LD • Strong evidence for recombination • Others (13% of cases) • This roughly favors Out-of-Africa. A coalescent simulation can help give confidence values on this. Gabriel et al., Science 2002
Haplotype Blocks • A haplotype block is a region of low recombination. • Define a region as a block if less than 5% of the SNP pairs show strong evidence for recombination • Much of the genome is in blocks. • The distribution of block sizes varies across populations.
Testing Out-of-Africa • Generate simulations with and without migration. • Check size of haplotype blocks. • Does it vary when migrations are allowed? • When the ‘new’ population has a bottleneck? • If there was a bottleneck that created European and Asian populations, can we say anything about frequency of alleles that are ‘African specific’? • Should they be high frequency, or low frequency in African populations?
Haplotype Block: implications • The genome is mostly partitioned into haplotype blocks. • Within a block, there is extensive LD. • Is this good, or bad, for association mapping?
Coalescent reconstruction • Reconstructing likely coalescents