CSE280b: Population Genetics. Vineet Bafna/Pavel Pevzner. www.cse.ucsd.edu/classes/sp05/cse291. Population Genetics. Individuals in a species (population) are phenotypically different. Often these differences are inherited (genetic). Studying these differences is important!
Population Genetics
• Individuals in a species (population) are phenotypically different.
• Often these differences are inherited (genetic).
• Studying these differences is important!
• Q: How predictive are these differences?
EX: Population Structure
[Figure: genetic clusters labeled Africa, Eurasia, East Asia, Oceania, America]
• 377 locations (loci) were sampled in 1000 people from 52 populations.
• 6 genetic clusters were obtained, 5 of which corresponded to major geographic regions (Rosenberg et al. Science 2002)
• Genetic differences can predict ethnicity.
Scope of these lectures
• Basic terminology
• Key principles
• Sources of variation
• HW equilibrium
• Linkage
• Coalescent theory
• Recombination/Ancestral Recombination Graph
• Haplotypes/Haplotype phasing
• Population sub-structure
• Structural polymorphisms
• Medical genetics basis: Association mapping/pedigree analysis
Alleles
• Genotype: genetic makeup of an individual
• Allele: a specific variant at a location
• The notion of alleles predates the concepts of gene and DNA.
• Initially, alleles referred to variants that described a measurable phenotype (round/wrinkled seed)
• Now, an allele might be a nucleotide on a chromosome, with no measurable phenotype.
• Humans are diploid: they have 2 copies of each chromosome.
• They may be heterozygous or homozygous at a location
• Other organisms (plants) have higher forms of ploidy.
• Additionally, some sites might have 2 allelic forms, or even many allelic forms.
What causes variation in a population?
• Mutations (may lead to SNPs)
• Recombinations
• Other genetic events (gene conversion)
• Structural Polymorphisms
Single Nucleotide Polymorphisms
Infinite Sites Assumption: Each site mutates at most once
00000101011
10001101001
01000101010
01000000011
00011110000
00101100110
Short Tandem Repeats
Sequence                     Repeat count
GCTAGATCATCATCATCATTGCTAG    4
GCTAGATCATCATCATTGCTAGTTA    3
GCTAGATCATCATCATCATCATTGC    5
GCTAGATCATCATCATTGCTAGTTA    3
GCTAGATCATCATCATTGCTAGTTA    3
GCTAGATCATCATCATCATCATTGC    5
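The repeat counts on this slide can be recomputed with a short script. This is a minimal sketch that assumes the repeated motif is ATC (the slide does not name the motif) and uses a hypothetical helper, `max_repeat_count`, to find the longest tandem run in each sequence.

```python
import re

def max_repeat_count(seq: str, motif: str = "ATC") -> int:
    """Number of copies in the longest tandem run of `motif` (assumed motif: ATC)."""
    runs = re.findall(f"(?:{re.escape(motif)})+", seq)
    return max((len(run) // len(motif) for run in runs), default=0)

# The six sequences shown on the slide.
sequences = [
    "GCTAGATCATCATCATCATTGCTAG",
    "GCTAGATCATCATCATTGCTAGTTA",
    "GCTAGATCATCATCATCATCATTGC",
    "GCTAGATCATCATCATTGCTAGTTA",
    "GCTAGATCATCATCATTGCTAGTTA",
    "GCTAGATCATCATCATCATCATTGC",
]

counts = [max_repeat_count(s) for s in sequences]
print(counts)  # [4, 3, 5, 3, 3, 5]
```

Repeating this over several such regions gives one count per locus, which is the vector of lengths used as a fingerprint.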
STR can be used as a DNA fingerprint
• Consider a collection of regions with variable length repeats.
• Variable length repeats will lead to variable length DNA
• The vector of lengths is a fingerprint
[Figure: matrix of repeat lengths indexed by individuals and loci; entries 4 2 3 3 5 1 3 2 3 1 5 3]
Recombination
[Figure: parental sequences 00000000 and 11111111 recombine to give 00011111]
Gene Conversion
• Gene conversion versus crossover
• Hard to distinguish in a population
Structural polymorphisms
• Large scale structural changes (deletions/insertions/inversions) may occur in a population.
Topic 1: Basic Principles
• In a 'stable' population, the distribution of alleles obeys certain laws
• Not really, and the deviations are interesting
• HW equilibrium (due to mixing in a population)
• Linkage (dis)equilibrium (due to recombination)
Hardy Weinberg equilibrium
• Consider a locus with 2 alleles, A and a
• p (respectively, q) is the frequency of A (resp. a) in the population
• 3 genotypes: AA, Aa, aa
• Q: What is the frequency of each genotype?
• If various assumptions are satisfied (such as random mating, no natural selection), then
• $P_{AA} = p^2$
• $P_{Aa} = 2pq$
• $P_{aa} = q^2$
Hardy Weinberg: why?
• Assumptions:
• Diploid
• Sexual reproduction
• Random mating
• Bi-allelic sites
• Large population size, …
• Why? Each individual randomly picks its two chromosomes. Therefore, Prob.(Aa) = pq + qp = 2pq, and so on.
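The argument above (each individual draws two chromosomes independently) can be checked numerically. The following is a small sketch; the function names `hardy_weinberg` and `allele_freq_A` are illustrative, not part of the lecture.

```python
def hardy_weinberg(p: float) -> dict:
    """Genotype frequencies for a bi-allelic locus under the HW assumptions."""
    q = 1.0 - p
    return {"AA": p * p, "Aa": 2 * p * q, "aa": q * q}

def allele_freq_A(genotypes: dict) -> float:
    """Frequency of allele A among the chromosomes of the given genotypes."""
    # Every AA individual contributes two copies of A, every Aa individual one.
    return genotypes["AA"] + 0.5 * genotypes["Aa"]

genotypes = hardy_weinberg(0.7)
print(genotypes)
print(allele_freq_A(genotypes))
```

With p = 0.7 the genotype frequencies are close to 0.49, 0.42 and 0.09, and the allele frequency recovered from them is again 0.7 (up to floating-point rounding), which illustrates why the allele frequency is unchanged in the next generation.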
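Hardy Weinberg: Generalizations
• Multiple alleles $A_1, \dots, A_k$ with frequencies $p_1, \dots, p_k$, where $\sum_i p_i = 1$
• By HW, $P_{A_i A_i} = p_i^2$ and $P_{A_i A_j} = 2 p_i p_j$ for $i \neq j$
• Multiple loci?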
Hardy Weinberg: Implications
• The allele frequency does not change from generation to generation. Why?
• It is observed that 1 in 10,000 Caucasians have the disease phenylketonuria. The disease mutation(s) are all recessive. What fraction of the population carries the disease allele?
• Males are 100 times more likely to have the "red" type of color blindness than females. Why?
• Conclusion: While the HW assumptions are rarely satisfied, the principle is still important as a baseline assumption, and significant deviations are interesting.
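The two numerical questions above can be worked through under HW. This sketch assumes that all disease alleles can be pooled into one recessive allele of frequency q, and that the color-blindness allele is X-linked recessive, so males (one X) are affected with probability q and females (two X) with probability q^2. The function names are illustrative.

```python
import math

def carrier_fraction(affected_fraction: float) -> float:
    """Heterozygote (carrier) frequency 2pq, given the affected frequency q^2."""
    q = math.sqrt(affected_fraction)
    return 2 * (1 - q) * q

def male_to_female_ratio(q: float) -> float:
    """Affected males (q) relative to affected females (q^2) for an X-linked recessive allele."""
    return q / (q * q)

# Phenylketonuria: q^2 = 1/10,000 gives q = 0.01, so about 2% are carriers.
print(carrier_fraction(1 / 10_000))
# A 100-fold male excess is what an allele frequency of q = 0.01 would produce.
print(male_to_female_ratio(0.01))
```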
Recombination
[Figure: parental sequences 00000000 and 11111111 recombine to give 00011111]
What if there were no recombinations?
• Life would be simpler
• Each individual sequence would have a single parent (even for higher ploidy)
• The relationship is expressed as a tree.
The Infinite Sites Assumption
[Figure: tree rooted at 00000000. A mutation at site 3 gives 00100000; from there, a mutation at site 5 gives 00101000 and a mutation at site 8 gives 00100001.]
• The different sites are linked. A 1 in position 8 implies 0 in position 5, and vice versa.
• Some phenotypes could be linked to the polymorphisms
• Some of the linkage is "destroyed" by recombination
Infinite sites assumption and Perfect Phylogeny
• Each site is mutated at most once in the history.
• All descendants must carry the mutated value, and all others must carry the ancestral value
[Figure: tree with a mutation i on one edge; the subtree below that edge has 1 in position i, the rest of the tree has 0 in position i]
Perfect Phylogeny
• Assume an evolutionary model in which no recombination takes place, only mutation.
• The evolutionary history is explained by a tree in which every mutation is on an edge of the tree. For each mutation, all the species in the sub-tree on one side of that edge contain a 0, and all species on the other side contain a 1. Such a tree is called a perfect phylogeny.
The 4-gamete condition
• A column i partitions the set of species into two sets, i0 and i1
• A column is homogeneous w.r.t. a set of species if it has the same value for all species in the set. Otherwise, it is heterogeneous.
• EX: i is heterogeneous w.r.t. {A,D,E}
Species  i
A        0
B        0
C        0
D        1
E        1
F        1
Here i0 = {A,B,C} and i1 = {D,E,F}.
4 Gamete Condition
• There exists a perfect phylogeny if and only if for all pairs of columns (i,j), j is homogeneous w.r.t. at least one of i0, i1.
• Equivalent to
• There exists a perfect phylogeny if and only if for all pairs of columns (i,j), the 4 rows (0,0), (0,1), (1,0), (1,1) do not all occur
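The second form of the condition translates directly into a pairwise check over columns. This is a minimal sketch; `four_gamete_violations` is an illustrative name, and the input is assumed to be a list of equal-length 0/1 strings, one per species.

```python
from itertools import combinations

def four_gamete_violations(matrix):
    """Return all column pairs (i, j) in which all four gametes 00, 01, 10, 11 occur."""
    n_cols = len(matrix[0])
    violations = []
    for i, j in combinations(range(n_cols), 2):
        gametes = {(row[i], row[j]) for row in matrix}
        if len(gametes) == 4:
            violations.append((i, j))
    return violations

# These rows fit on a tree, so no pair of columns shows all four gametes.
tree_like = ["000", "100", "110", "001"]
print(four_gamete_violations(tree_like))            # []

# Adding the row 011 creates all four gametes in column pairs (0, 1) and (1, 2).
print(four_gamete_violations(tree_like + ["011"]))  # [(0, 1), (1, 2)]
```

The check takes time proportional to the number of rows times the number of column pairs, which is fine for a sketch; it only tests existence of a perfect phylogeny and does not build the tree.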
4-gamete condition: proof
• (only if) Every perfect phylogeny satisfies the 4-gamete condition
• Depending on which edge the mutation j occurs on, either i0 or i1 must be homogeneous w.r.t. j.
• (if) If the 4-gamete condition is satisfied, does a perfect phylogeny exist?
[Figure: tree with mutations i and j on edges; the leaves split into i0 and i1]
Handling recombination
• A tree is not sufficient, as a sequence may have 2 parents
• Recombination leads to loss of correlation between columns
Linkage (Dis)-equilibrium (LD)
• Consider sites A & B
• Case 1: No recombination
• Each new individual chromosome chooses a parent from the existing 'haplotypes'
A B
0 1
0 1
0 0
0 0
1 0
1 0
1 0
1 0
1 0  (new chromosome, a copy of an existing haplotype)
Linkage (Dis)-equilibrium (LD)
• Consider sites A & B
• Case 2: diploidy and recombination
• Each new individual chooses an allele at each site from the existing alleles
A B
0 1
0 1
0 0
0 0
1 0
1 0
1 0
1 0
1 1  (new chromosome, a combination not present before)
Linkage (Dis)-equilibrium (LD)
• Consider sites A & B
• Case 1: No recombination
• Each new individual chooses a parent from the existing 'haplotypes'
• Pr[(A,B) = (0,1)] = 2/8 = 0.25
• Linkage disequilibrium
• Case 2: Extensive recombination
• Each new individual simply chooses an allele independently at each site
• Pr[(A,B) = (0,1)] = Pr[A=0] Pr[B=1] = 0.5 × 0.25 = 0.125
• Linkage equilibrium
A B
0 1
0 1
0 0
0 0
1 0
1 0
1 0
1 0
LD
• In the absence of recombination, there is correlation between columns
• The joint probability Pr[A=a, B=b] is different from P(a)P(b)
• With extensive recombination
• Pr(a,b) = P(a)P(b)
Measures of LD
• Consider two bi-allelic sites with alleles marked with 0 and 1
• Define
• $P_{00}$ = Pr[Allele 0 in locus 1, and 0 in locus 2]
• $P_{0*}$ = Pr[Allele 0 in locus 1]
• Linkage equilibrium if $P_{00} = P_{0*} P_{*0}$
• $D = |P_{00} - P_{0*} P_{*0}| = |P_{01} - P_{0*} P_{*1}| = \dots$
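As a concrete check of these definitions, D can be computed for the eight two-site haplotypes used in the earlier LD example (two copies of 01, two of 00, four of 10). This is a small sketch; `frequencies` and `ld_d` are illustrative names.

```python
def frequencies(haplotypes, a, b):
    """Return (joint, product): Pr[(A,B) = (a,b)] and Pr[A=a] * Pr[B=b]."""
    n = len(haplotypes)
    joint = sum(1 for h in haplotypes if h == (a, b)) / n
    p_a = sum(1 for h in haplotypes if h[0] == a) / n
    p_b = sum(1 for h in haplotypes if h[1] == b) / n
    return joint, p_a * p_b

def ld_d(haplotypes, a=0, b=0):
    """D = |P_ab - P_a* P_*b| for the chosen pair of alleles."""
    joint, product = frequencies(haplotypes, a, b)
    return abs(joint - product)

# Haplotypes (A, B) taken from the two-site table earlier in the lecture.
haplotypes = [(0, 1), (0, 1), (0, 0), (0, 0), (1, 0), (1, 0), (1, 0), (1, 0)]

print(frequencies(haplotypes, 0, 1))  # (0.25, 0.125)
print(ld_d(haplotypes))               # 0.125
```

The value of D is the same whichever of the four allele pairs is used, which is the chain of equalities in the definition.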
LD over time
• With random mating, and fixed recombination rate r between the sites, linkage disequilibrium will disappear
• Let D(t) = LD at time t
• $P^{(t)}_{00} = (1-r)\, P^{(t-1)}_{00} + r\, P^{(t-1)}_{0*} P^{(t-1)}_{*0}$
• $D(t) = P^{(t)}_{00} - P^{(t)}_{0*} P^{(t)}_{*0} = P^{(t)}_{00} - P^{(t-1)}_{0*} P^{(t-1)}_{*0}$ (by HW, allele frequencies do not change)
• $D(t) = (1-r)\, D(t-1) = (1-r)^t D(0)$
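The recursion can be iterated directly to confirm the geometric decay. This sketch assumes constant allele frequencies (as HW gives) and uses the signed quantity P00 - P0* P*0; `ld_trajectory` is an illustrative name and the starting frequencies are example values.

```python
def ld_trajectory(p00, p0x, px0, r, generations):
    """Iterate P00(t) = (1-r) P00(t-1) + r P0* P*0 and return D(0), ..., D(generations)."""
    d = [p00 - p0x * px0]
    for _ in range(generations):
        # Non-recombinant chromosomes keep the haplotype; recombinants pair alleles at random.
        p00 = (1 - r) * p00 + r * p0x * px0
        d.append(p00 - p0x * px0)
    return d

r = 0.1
trajectory = ld_trajectory(p00=0.25, p0x=0.5, px0=0.75, r=r, generations=10)
for t, d_t in enumerate(trajectory):
    # The last two columns agree: D(t) = (1 - r)^t D(0).
    print(t, d_t, (1 - r) ** t * trajectory[0])
```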
LD over distance
• Assumption: recombination rate increases linearly with distance
• Consequence: LD decays exponentially with distance.
• The assumption is reasonable, but recombination rates vary from region to region, adding to complexity
• This simple fact is the basis of disease association mapping.
LD and disease mapping
• Consider a mutation that is causal for a disease.
• The goal of disease gene mapping is to discover which gene (locus) carries the mutation.
• One option is to consider every polymorphism and check it for association with the disease, but:
• There might be too many polymorphisms
• Multiple mutations (even at a single locus) may lead to the same disease
• Instead, consider a dense sample of polymorphisms that span the genome
LD can be used to map disease genes
• LD decays with distance from the disease allele.
• By plotting LD, one can shortlist the region containing the disease gene.
[Figure: LD plotted along the chromosome; six individuals labeled D N N D D N (disease/normal) with marker alleles 0 1 1 0 0 1]
LD and disease gene mapping problems
• Marker density?
• Complex diseases
• Population sub-structure
Population Genetics
• Often we look at these equilibria (Linkage/HW) and their deviations in specific populations
• These deviations offer insight into evolution.
• However, what is normal?
• A combination of empirical (simulation) and theoretical insight helps distinguish between expected and unexpected.
Topic 2: Simulating population data
• We described various population genetic concepts (HW, LD), and their applicability
• The values of these parameters depend critically upon the population assumptions.
• What if we do not have infinite populations?
• No random mating (Ex: geographic isolation)
• Sudden growth
• Bottlenecks
• Admixture
• It would be nice to have a simulation of such a population to test various ideas. How would you do this simulation?
Wright Fisher Model of Evolution
• Fixed population size from generation to generation
• Random mating
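A forward simulation of this model takes only a few lines. The sketch below makes simplifying assumptions that are not on the slide: a haploid population of n chromosomes, a single bi-allelic site, no mutation, and each offspring copying a parent chosen uniformly at random from the previous generation. The name `wright_fisher` and its parameters are illustrative.

```python
import random

def wright_fisher(n, p0, generations, seed=None):
    """Track the frequency of one allele in a fixed-size population of n chromosomes."""
    rng = random.Random(seed)
    count = round(p0 * n)            # copies of the allele in generation 0
    trajectory = [count / n]
    for _ in range(generations):
        p = count / n
        # Each of the n offspring picks a random parent, so it carries
        # the allele with probability equal to the current frequency p.
        count = sum(1 for _ in range(n) if rng.random() < p)
        trajectory.append(count / n)
    return trajectory

# Frequencies drift at random; with small n the allele is quickly lost or fixed.
print(wright_fisher(n=20, p0=0.5, generations=30, seed=1))
```

Running this for many generations and many individuals is expensive, which is the motivation for the backward-in-time coalescent view.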
Coalescent model
• Insight 1: Separate the genealogy from allelic states (mutations)
• First generate the genealogy (who begat whom)
• Assign an allelic state (0) to the ancestor. Drop mutations on the branches.
Coalescent theory
• Insight 2: Much of the genealogy is irrelevant, because it disappears.
• Better to go backwards
Coalescent theory (Kingman)
• Input: fixed population (N individuals), random mating
• Consider 2 individuals.
• Probability that they coalesce in the previous generation (have the same parent) $= 1/N$
• Probability that they do not coalesce after t generations $= (1 - 1/N)^t \approx e^{-t/N}$
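Coalescent theory
• Time is measured in units of N generations
• Consider k individuals.
• Probability that no pair coalesces after 1 generation $= \prod_{i=1}^{k-1} (1 - i/N) \approx 1 - \binom{k}{2}/N$
• Probability that no pair coalesces after t generations $\approx \left(1 - \binom{k}{2}/N\right)^t \approx e^{-\binom{k}{2} t/N}$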
Coalescent approximation
• Insight 3: Topology is independent of coalescent times
• If you have n individuals, generate a random binary topology
• Iterate (until one individual remains)
• Pick a pair at random, and coalesce
• Insight 4: To generate coalescent times, there is no need to go back generation by generation
Coalescent approximation
• At any step, there are 1 <= k <= n individuals
• To generate the time to coalesce (k to k-1 individuals, for k >= 2)
• Pick a number from the exponential distribution with rate k(k-1)/2
• Mean time to coalescence = 2/(k(k-1))
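The two ingredients (a random pairing of lineages and independent exponential waiting times) can be combined into a short generator. This is a minimal sketch; `simulate_coalescent` is an illustrative name, times are in units of N generations, and the topology is returned as nested tuples over leaf labels 0..n-1.

```python
import random

def simulate_coalescent(n, seed=None):
    """Return (topology, times) for n sampled lineages.

    times[k] is the waiting time while k lineages remain.
    """
    rng = random.Random(seed)
    lineages = list(range(n))
    times = {}
    while len(lineages) > 1:
        k = len(lineages)
        # Waiting time until the next coalescence: exponential with rate k(k-1)/2.
        times[k] = rng.expovariate(k * (k - 1) / 2)
        # Pick a pair of lineages uniformly at random and merge them.
        i, j = rng.sample(range(k), 2)
        merged = (lineages[i], lineages[j])
        lineages = [x for idx, x in enumerate(lineages) if idx not in (i, j)]
        lineages.append(merged)
    return lineages[0], times

topology, times = simulate_coalescent(6, seed=1)
print(topology)
print(times)
```

Note that the population size N never appears in the code: it only sets the time unit.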
Typical coalescents
• 4 random examples with n=6 (Note that we do not need to specify N. Why?)
• Expected time to coalesce?
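Coalescent properties
• Expected time for the last step (k = 2): $E(T_2) = 2/(2 \cdot 1) = 1$
• The last step is about half of the total time to coalesce: $E(T_{MRCA}) = \sum_{k=2}^{n} \frac{2}{k(k-1)} = 2(1 - 1/n)$
• Studying a larger number of individuals does not change these numbers tremendously
• EX: The number of mutations in a population is proportional to the total branch length of the tree
• $E(T_{tot}) = \sum_{k=2}^{n} k \cdot \frac{2}{k(k-1)} = 2 \sum_{i=1}^{n-1} \frac{1}{i}$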
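These properties follow from summing the mean waiting times 2/(k(k-1)) of the coalescent approximation. The sketch below computes the expected time to the most recent common ancestor and the expected total branch length (k branches are present while k lineages remain); the function names are illustrative and times are in units of N generations.

```python
def expected_time_to_mrca(n):
    """Sum of mean waiting times 2/(k(k-1)) for k = n, ..., 2."""
    return sum(2 / (k * (k - 1)) for k in range(2, n + 1))

def expected_total_branch_length(n):
    """Each interval with k lineages contributes k branches of that length."""
    return sum(k * 2 / (k * (k - 1)) for k in range(2, n + 1))

for n in (2, 6, 100, 1000):
    t_mrca = expected_time_to_mrca(n)
    # The last step (k = 2) has mean 1, so its share of the depth is 1 / t_mrca.
    print(n, t_mrca, 1 / t_mrca, expected_total_branch_length(n))
```

The expected depth equals 2(1 - 1/n), so it never exceeds 2 however many individuals are sampled, while the total branch length grows only logarithmically in n.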
Variants (exponentially growing populations)
• If the population is growing exponentially, the branch lengths become similar, or even star-like. Why?
• With appropriate scaling of time, the same process can be extended to various scenarios: male-female, hermaphrodite, segregation, migration, etc.
Simulating population data
• Generate a coalescent (topology + branch lengths)
• For each branch, drop mutations at a rate proportional to its length
• Generate sequence data
• Note that the resulting sequence data admits a perfect phylogeny.
• Given such sequence data, can you reconstruct the coalescent tree? (Only the topology, not the branch lengths)
• Also, note that all pairs of positions are correlated (should have high LD).
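The steps above can be sketched end to end. The code below is an illustration under stated assumptions, not the course's implementation: the infinite sites model (every mutation creates a new column), a Poisson number of mutations on each branch, and a hypothetical parameter `mut_rate` giving the expected number of mutations per unit of branch length, since the slide does not specify the rate.

```python
import random

def poisson(rng, lam):
    """Poisson draw: count unit-rate exponential arrivals before time lam."""
    count, t = 0, rng.expovariate(1.0)
    while t < lam:
        count += 1
        t += rng.expovariate(1.0)
    return count

def simulate_sample(n, mut_rate, seed=None):
    """Return an n-row 0/1 matrix; rows are sampled sequences, columns are sites."""
    rng = random.Random(seed)
    # Each lineage is represented by the set of samples descending from it.
    lineages = [frozenset([i]) for i in range(n)]
    columns = []  # one entry per mutation: the samples that carry it
    while len(lineages) > 1:
        k = len(lineages)
        t = rng.expovariate(k * (k - 1) / 2)    # time until the next coalescence
        for lineage in lineages:                # mutations on each of the k branches
            for _ in range(poisson(rng, mut_rate * t)):
                columns.append(lineage)
        i, j = rng.sample(range(k), 2)          # coalesce a random pair
        merged = lineages[i] | lineages[j]
        lineages = [x for idx, x in enumerate(lineages) if idx not in (i, j)]
        lineages.append(merged)
    return [[1 if s in col else 0 for col in columns] for s in range(n)]

for row in simulate_sample(n=6, mut_rate=2.0, seed=3):
    print("".join(map(str, row)))
```

Because every column marks the descendants of a single branch, any two columns are either nested or disjoint, so the output always passes the 4-gamete test.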
Coalescent with Recombination
• An individual may have one parent, or 2 parents