Population Genetics Basics


Terminology review • Allele • Locus • Diploid • SNP

Single Nucleotide Polymorphisms • Infinite Sites Assumption: each site mutates at most once • Example haplotypes (one row per chromosome, one column per site):
00000101011
10001101001
01000101010
01000000011
00011110000
00101100110

What causes variation in a population? • Mutations (may lead to SNPs) • Recombinations • Other genetic events (gene conversion) • Structural Polymorphisms

Recombination • [Figure: parental chromosomes 00000000 and 11111111 cross over to give the recombinant 00011111.]

Gene Conversion • Gene Conversion versus crossover • Hard to distinguish in a population

Structural polymorphisms • Large scale structural changes (deletions/insertions/inversions) may occur in a population.

Topic 1: Basic Principles • In a ‘stable’ population, the distribution of alleles obeys certain laws • Not really, and the deviations are interesting • HW Equilibrium • (due to mixing in a population) • Linkage (dis)-equilibrium • Due to recombination

Hardy Weinberg equilibrium • Consider a locus with 2 alleles, A and a • p (respectively, q) is the frequency of A (resp. a) in the population, with p + q = 1 • 3 genotypes: AA, Aa, aa • Q: What is the frequency of each genotype? • If various assumptions are satisfied (such as random mating and no natural selection), then • $P_{AA} = p^2$ • $P_{Aa} = 2pq$ • $P_{aa} = q^2$

Hardy Weinberg: why? • Assumptions: • Diploid • Sexual reproduction • Random mating • Bi-allelic sites • Large population size, … • Why? Each individual randomly picks his two chromosomes. Therefore, Prob. (Aa) = pq+qp = 2pq, and so on.
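The argument above can be checked numerically. The following is a minimal sketch, not part of the original slides; the function name `hardy_weinberg` and the value p = 0.7 are illustrative assumptions. It also shows that the allele frequency recovered from the genotype frequencies equals the frequency in the previous generation.

```python
def hardy_weinberg(p):
    """Expected genotype frequencies at a bi-allelic locus under HW equilibrium.

    p is the frequency of allele A; q = 1 - p is the frequency of allele a.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("allele frequency must lie in [0, 1]")
    q = 1.0 - p
    # Each chromosome is drawn independently: AA = p*p, Aa = pq + qp, aa = q*q.
    return {"AA": p * p, "Aa": 2 * p * q, "aa": q * q}


freqs = hardy_weinberg(0.7)  # 0.7 is an arbitrary illustrative value

# Frequency of A in the next generation: all of AA plus half of Aa.
next_p = freqs["AA"] + freqs["Aa"] / 2
print(freqs, next_p)
```

Since p^2 + pq = p(p + q) = p, the allele frequency is unchanged after one round of random mating.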

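Hardy Weinberg: Generalizations • Multiple alleles $A_1, \dots, A_k$ with frequencies $p_1, \dots, p_k$ • By HW, $\Pr[A_i A_i] = p_i^2$ and $\Pr[A_i A_j] = 2 p_i p_j$ for $i \neq j$ • Multiple loci?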
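A short sketch of the multi-allele case, under the same HW assumptions. The function name `hw_multiallelic` and the three example frequencies are assumptions made for illustration, not values from the slides.

```python
from itertools import combinations_with_replacement


def hw_multiallelic(freqs):
    """HW genotype frequencies for alleles 0..k-1 with the given frequencies.

    Returns a dict keyed by unordered genotype (i, j) with i <= j.
    """
    if abs(sum(freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    genotypes = {}
    for i, j in combinations_with_replacement(range(len(freqs)), 2):
        if i == j:
            genotypes[(i, j)] = freqs[i] ** 2           # homozygote
        else:
            genotypes[(i, j)] = 2 * freqs[i] * freqs[j]  # heterozygote, two orders
    return genotypes


g = hw_multiallelic([0.5, 0.3, 0.2])  # illustrative frequencies
print(g)
```

The genotype frequencies sum to (p_1 + ... + p_k)^2 = 1, which is a quick sanity check on any such table.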

Hardy Weinberg: Implications • The allele frequency does not change from generation to generation. Why? • It is observed that 1 in 10,000 Caucasians have the disease phenylketonuria. The disease mutation(s) are all recessive. What fraction of the population carries the mutation? • Males are 100 times more likely to have the ‘red’ type of color blindness than females. Why? • Conclusion: While the HW assumptions are rarely satisfied, the principle is still important as a baseline assumption, and significant deviations are interesting.
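One way to work through the two numerical questions, as a hedged sketch rather than the lecturer's own solution. It assumes HW equilibrium holds, treats all phenylketonuria mutations as a single recessive allele of frequency q, and assumes the color blindness allele is X-linked recessive, so that males (one X chromosome) are affected with frequency q_x and females with frequency q_x^2. All variable names are illustrative.

```python
import math

# Phenylketonuria: affected individuals are recessive homozygotes, so q^2 = 1/10,000.
affected = 1 / 10_000
q = math.sqrt(affected)          # frequency of the disease allele
p = 1 - q
carriers = 2 * p * q             # unaffected heterozygotes
with_mutation = carriers + affected

# X-linked recessive trait: male frequency q_x, female frequency q_x^2,
# so the male-to-female ratio is 1 / q_x. A ratio of 100 implies q_x = 0.01.
male_to_female_ratio = 100
q_x = 1 / male_to_female_ratio

print(q, carriers, with_mutation, q_x)
```

Under these assumptions roughly 2% of the population carries a disease allele, even though only 0.01% is affected.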


What if there were no recombinations? • Life would be simpler • Each individual sequence would have a single parent (even for higher ploidy) • The relationship is expressed as a tree.

The Infinite Sites Assumption • [Figure: a tree rooted at 00000000. A mutation at site 3 gives 00100000; below it, a mutation at site 5 gives 00101000 and a mutation at site 8 gives 00100001.] • The different sites are linked. A 1 in position 8 implies 0 in position 5, and vice versa. • Some phenotypes could be linked to the polymorphisms • Some of the linkage is “destroyed” by recombination

Infinite sites assumption and Perfect Phylogeny • Each site is mutated at most once in the history. • All descendants must carry the mutated value, and all others must carry the ancestral value • [Figure: the edge carrying mutation i separates the subtree with 1 in position i from the rest of the tree, which has 0 in position i.]

Perfect Phylogeny • Assume an evolutionary model in which no recombination takes place, only mutation. • The evolutionary history is explained by a tree in which every mutation is on an edge of the tree. All the species in one sub-tree contain a 0, and all species in the other contain a 1. Such a tree is called a perfect phylogeny.

The 4-gamete condition • A column i partitions the set of species into two sets, i0 (species with a 0 in column i) and i1 (species with a 1 in column i) • A column is homogeneous w.r.t. a set of species if it has the same value for all species in the set. Otherwise, it is heterogeneous. • EX: i is heterogeneous w.r.t. {A,D,E}
Species:  A B C D E F
Column i: 0 0 0 1 1 1
i0 = {A,B,C}, i1 = {D,E,F}

4 Gamete Condition • There exists a perfect phylogeny if and only if, for all pairs of columns (i,j), j is homogeneous w.r.t. i0 or w.r.t. i1 (that is, j is not heterogeneous w.r.t. both). • Equivalent to • There exists a perfect phylogeny if and only if, for all pairs of columns (i,j), the 4 rows (0,0), (0,1), (1,0), (1,1) do not all occur.
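The pairwise test can be sketched directly. This is an illustrative implementation, not code from the lecture; the function name `four_gamete_violations` is an assumption, and the matrix is the five-species example used later in these slides (rows A to E, columns 1 to 5, here 0-indexed).

```python
from itertools import combinations


def four_gamete_violations(rows):
    """Return the column pairs (i, j) in which all four gametes occur."""
    n_cols = len(rows[0])
    violations = []
    for i, j in combinations(range(n_cols), 2):
        # Collect the distinct (value in column i, value in column j) pairs.
        gametes = {(row[i], row[j]) for row in rows}
        if len(gametes) == 4:
            violations.append((i, j))
    return violations


example = [
    [1, 1, 0, 0, 0],  # A
    [0, 0, 1, 0, 0],  # B
    [1, 1, 0, 1, 0],  # C
    [0, 0, 1, 0, 1],  # D
    [1, 0, 0, 0, 0],  # E
]
conflict = [[0, 0], [0, 1], [1, 0], [1, 1]]

print(four_gamete_violations(example))   # no violating pair
print(four_gamete_violations(conflict))  # the single pair (0, 1) violates
```

The check examines every pair of columns, so it takes time proportional to the number of rows times the square of the number of columns.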

4-gamete condition: proof • [Figure: the edge carrying mutation i splits the tree into the subtrees i0 and i1.] • Depending on which side of that edge the mutation j occurs, either i0 or i1 must be homogeneous w.r.t. j. • (only if) Every perfect phylogeny satisfies the 4-gamete condition • (if) If the 4-gamete condition is satisfied, does a perfect phylogeny exist?

An algorithm for constructing a perfect phylogeny • We will consider the case where 0 is the ancestral state, and 1 is the mutated state. This will be fixed later. • In any tree, each node (except the root) has a single parent. • It is sufficient to construct a parent for every node. • In each step, we add a column and refine some of the nodes containing multiple children. • Stop if all columns have been considered.

Inclusion Property • For any pair of columns i,j • i < j if and only if i1 ⊇ j1 (every species with a 1 in column j also has a 1 in column i) • Note that if i < j then the edge containing i is an ancestor of the edge containing j

Example • [Figure: a single clade r with children A, B, C, D, E.]
    1 2 3 4 5
A   1 1 0 0 0
B   0 0 1 0 0
C   1 1 0 1 0
D   0 0 1 0 1
E   1 0 0 0 0
Initially, there is a single clade r, and each node has r as its parent

Sort columns • Sort columns according to the inclusion property (note that the columns of the example matrix are already sorted). • This can be achieved by considering the columns as binary representations of numbers (most significant bit in row 1) and sorting in decreasing order
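A minimal sketch of the sorting step, assuming the matrix is given as a list of 0/1 rows. The function name `sort_columns` is illustrative. Each column is read top to bottom as a binary number, with row 1 as the most significant bit.

```python
def sort_columns(rows):
    """Return column indices sorted in decreasing order of their binary value."""
    n_cols = len(rows[0])

    def column_value(c):
        # Row 1 supplies the most significant bit of the column's number.
        return int("".join(str(row[c]) for row in rows), 2)

    return sorted(range(n_cols), key=column_value, reverse=True)


example = [
    [1, 1, 0, 0, 0],  # A
    [0, 0, 1, 0, 0],  # B
    [1, 1, 0, 1, 0],  # C
    [0, 0, 1, 0, 1],  # D
    [1, 0, 0, 0, 0],  # E
]
# Same matrix with its columns permuted to (5, 3, 1, 4, 2).
shuffled = [[row[c] for c in (4, 2, 0, 3, 1)] for row in example]

print(sort_columns(example))
print(sort_columns(shuffled))
```

If the 1-set of column j is contained in the 1-set of column i, the number for i is at least the number for j, so i is placed no later than j, as the inclusion property requires.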

Add first column • In adding column i • For each node, check on which side of the new edge it belongs. • Finally, add a new internal node if the column resolves a clade • [Figure: after adding column 1 of the example matrix, a new node u below r is the parent of A, C, and E; B and D remain children of r.]

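Adding other columns • Add other columns on edges using the ordering property • [Figure: final tree for the example matrix. Edges 1 and 3 leave the root r. Below edge 1 are leaf E and edge 2; below edge 2 are leaf A and edge 4, which leads to C. Below edge 3 are leaf B and edge 5, which leads to D.]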
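The construction can be sketched compactly for the rooted case (0 ancestral, 1 mutated). This is one standard way to realize the steps above, not necessarily the exact procedure used in the lecture: after sorting, the 1-columns of each row are chained in order, and the matrix admits a perfect phylogeny only if every column receives the same parent from every row. The function name `perfect_phylogeny` and the use of "r" for the root are assumptions; columns are 0-indexed.

```python
def perfect_phylogeny(rows, names):
    """Build a rooted perfect phylogeny, or return None if none exists.

    Returns (parent, leaf_parent): parent maps each mutation (column index)
    to the mutation directly above it or to "r"; leaf_parent maps each row
    name to the lowest mutation on its path from the root, or to "r".
    """
    n_cols = len(rows[0])
    order = sorted(
        range(n_cols),
        key=lambda c: int("".join(str(row[c]) for row in rows), 2),
        reverse=True,
    )
    parent, leaf_parent = {}, {}
    for name, row in zip(names, rows):
        previous = "r"
        for c in order:
            if row[c] == 1:
                # Every row carrying mutation c must agree on what lies above it.
                if parent.setdefault(c, previous) != previous:
                    return None
                previous = c
        leaf_parent[name] = previous
    return parent, leaf_parent


example = [
    [1, 1, 0, 0, 0],  # A
    [0, 0, 1, 0, 0],  # B
    [1, 1, 0, 1, 0],  # C
    [0, 0, 1, 0, 1],  # D
    [1, 0, 0, 0, 0],  # E
]
tree = perfect_phylogeny(example, "ABCDE")
no_tree = perfect_phylogeny([[1, 0], [0, 1], [1, 1]], "XYZ")
print(tree, no_tree)
```

On the example matrix the result agrees with the tree on the slide: mutations 1 and 3 hang off the root, 2 lies below 1, 4 below 2, and 5 below 3 (shifted by one because the code counts columns from 0).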

Unrooted case • Switch the values in each column, so that 0 is the majority element. • Apply the algorithm for the rooted case
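A sketch of the column-switching step, assuming the matrix is a list of 0/1 rows. The function name `make_zero_majority` is illustrative, and the choice to leave a column unchanged on an exact tie is an assumption of this sketch.

```python
def make_zero_majority(rows):
    """Flip every column in which 1 is the strict majority value.

    Returns the recoded matrix and a list of booleans marking flipped columns.
    """
    n_rows = len(rows)
    n_cols = len(rows[0])
    flipped = [2 * sum(row[c] for row in rows) > n_rows for c in range(n_cols)]
    recoded = [
        [value ^ 1 if flipped[c] else value for c, value in enumerate(row)]
        for row in rows
    ]
    return recoded, flipped


matrix = [[1, 1], [1, 0], [1, 0]]
recoded, flipped = make_zero_majority(matrix)
print(recoded, flipped)
```

The recoded matrix can then be passed to a rooted construction, with the flip list kept so that the original allele labels can be restored afterwards.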

Handling recombination • A tree is not sufficient as a sequence may have 2 parents • Recombination leads to loss of correlation between columns

Linkage (Dis)-equilibrium (LD) • Consider sites A & B • Haplotypes (A B): 01, 01, 00, 00, 10, 10, 10, 10 • Case 1: No recombination • Pr[A,B = (0,1)] = 2/8 = 0.25 • Linkage disequilibrium • Case 2: Extensive recombination • Pr[A,B = (0,1)] = Pr[A=0] Pr[B=1] = 0.5 × 0.25 = 0.125 • Linkage equilibrium


Recombination, and populations • Think of a population of N individual chromosomes. • The population remains stable from generation to generation. • Without recombination, each individual has exactly one parent chromosome from the previous generation. • With recombinations, each individual is derived from one or two parents. • We will formalize this notion later in the context of coalescent theory.

Linkage (Dis)-equilibrium (LD) • Consider sites A & B • Case 1: No recombination • Each new individual chromosome chooses a parent from the existing haplotypes • Haplotypes (A B): 01, 01, 00, 00, 10, 10, 10, 10; new chromosome: 10 (a copy of an existing haplotype)

Linkage (Dis)-equilibrium (LD) • Consider sites A & B • Case 2: diploidy and recombination • Each new individual chooses a parent from the existing alleles • Haplotypes (A B): 01, 01, 00, 00, 10, 10, 10, 10; new chromosome: 11 (a combination not present among the existing haplotypes)

Linkage (Dis)-equilibrium (LD) • Consider sites A & B • Haplotypes (A B): 01, 01, 00, 00, 10, 10, 10, 10 • Case 1: No recombination • Each new individual chooses a parent from the existing haplotypes • Pr[A,B = (0,1)] = 2/8 = 0.25 • Linkage disequilibrium • Case 2: Extensive recombination • Each new individual simply chooses an allele independently at each site • Pr[A,B = (0,1)] = Pr[A=0] Pr[B=1] = 0.5 × 0.25 = 0.125 • Linkage equilibrium

LD • In the absence of recombination, • Correlation between columns • The joint probability Pr[A=a,B=b] is different from P(a)P(b) • With extensive recombination • Pr(a,b)=P(a)P(b)

Measures of LD • Consider two bi-allelic sites with alleles marked with 0 and 1 • Define • $P_{00}$ = Pr[allele 0 at locus 1 and allele 0 at locus 2] • $P_{0*}$ = Pr[allele 0 at locus 1] • Linkage equilibrium if $P_{00} = P_{0*} P_{*0}$ • $D = |P_{00} - P_{0*} P_{*0}| = |P_{01} - P_{0*} P_{*1}| = \dots$
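As an illustration, D can be computed for the eight two-site haplotypes used in the earlier LD slides (01, 01, 00, 00, 10, 10, 10, 10). This is a sketch only; the function name `ld_d` and its argument names are assumptions.

```python
def ld_d(haplotypes, a=0, b=0):
    """|P_ab - P_a* P_*b| for alleles a (locus 1) and b (locus 2)."""
    n = len(haplotypes)
    p_ab = sum(1 for x, y in haplotypes if x == a and y == b) / n
    p_a = sum(1 for x, _ in haplotypes if x == a) / n
    p_b = sum(1 for _, y in haplotypes if y == b) / n
    return abs(p_ab - p_a * p_b)


haplotypes = [(0, 1), (0, 1), (0, 0), (0, 0), (1, 0), (1, 0), (1, 0), (1, 0)]

# D is the same whichever pair of alleles is used to define it.
d_values = {(a, b): ld_d(haplotypes, a, b) for a in (0, 1) for b in (0, 1)}
print(d_values)
```

For this sample the joint frequency of (0,1) is 0.25 while the product of the marginals is 0.125, so D = 0.125 and the two sites are in linkage disequilibrium.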

LD over time • With random mating and a fixed recombination rate r between the sites, linkage disequilibrium will disappear • Let D(t) = LD at time t • $P^{(t)}_{00} = (1-r)\,P^{(t-1)}_{00} + r\,P^{(t-1)}_{0*}\,P^{(t-1)}_{*0}$ • $D(t) = P^{(t)}_{00} - P^{(t)}_{0*} P^{(t)}_{*0} = P^{(t)}_{00} - P^{(t-1)}_{0*} P^{(t-1)}_{*0}$ • $D(t) = (1-r)\,D(t-1) = (1-r)^t D(0)$
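The recursion can be iterated numerically to confirm the geometric decay. This is a sketch under the slide's assumptions (random mating, constant allele frequencies); the function name `ld_trajectory`, the starting frequencies, and r = 0.1 are illustrative choices, not values from the lecture.

```python
def ld_trajectory(p00, p0_star, p_star0, r, generations):
    """Iterate P00(t) = (1 - r) P00(t-1) + r P0* P*0 and record D(t)."""
    d_values = [p00 - p0_star * p_star0]
    for _ in range(generations):
        # Allele frequencies P0* and P*0 stay fixed; only the haplotype
        # frequency P00 moves toward their product.
        p00 = (1 - r) * p00 + r * p0_star * p_star0
        d_values.append(p00 - p0_star * p_star0)
    return d_values


d = ld_trajectory(p00=0.25, p0_star=0.5, p_star0=0.75, r=0.1, generations=20)
closed_form = [(1 - 0.1) ** t * d[0] for t in range(21)]
print(d[:3], closed_form[:3])
```

Here D is kept signed, as in the recursion; its absolute value shrinks by the factor (1 - r) every generation.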

LD over distance • Assumption • Recombination rate increases linearly with distance • LD decays exponentially with distance. • The assumption is reasonable, but recombination rates vary from region to region, adding to complexity • This simple fact is the basis of disease association mapping.
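The link between the two bullets can be written out as a short derivation. It assumes the recombination rate between two sites at distance d is r(d) = ρd, where ρ is a hypothetical per-unit-distance rate introduced here for illustration, and it reuses D(t) = (1-r)^t D(0) from the previous slide.

```latex
r(d) = \rho\, d,
\qquad
D(t) = \bigl(1 - \rho d\bigr)^{t} D(0)
\;\approx\; e^{-\rho d t}\, D(0)
\quad \text{for } \rho d \ll 1 .
```

For a fixed number of generations t, LD therefore falls off approximately exponentially in the distance d.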

LD and disease mapping • Consider a mutation that is causal for a disease. • The goal of disease gene mapping is to discover which gene (locus) carries the mutation. • Consider every polymorphism, and check: • There might be too many polymorphisms • Multiple mutations (even at a single locus) that lead to the same disease • Instead, consider a dense sample of polymorphisms that span the genome

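LD can be used to map disease genes • LD decays with distance from the disease allele. • By plotting LD, one can short list the region containing the disease gene. • [Figure: LD plotted along the region. Six individuals labeled D (disease) or N (normal) in the order D, N, N, D, D, N carry marker alleles 0, 1, 1, 0, 0, 1.]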

LD and disease gene mapping problems • Marker density? • Complex diseases • Population sub-structure

Human Samples • We look at data from human samples • Gabriel et al. Science 2002. • 4 population samples were genotyped at multiple regions spanning the genome • 54 regions (average size 250 kb) • SNP density: 1 per 2 kb • 90 individuals from Nigeria (Yoruba) • 93 Europeans • 42 Asians • 50 African Americans

Population specific recombination • D’ was used as the measure of LD between SNP pairs. • SNP pairs were classified into one of the following • Strong LD • Strong evidence for recombination • Others (13% of cases) • This roughly favors Out-of-Africa. A coalescent simulation can help give confidence values on this. Gabriel et al., Science 2002

Haplotype Blocks • A haplotype block is a region of low recombination. • Define a region as a block if fewer than 5% of the SNP pairs show strong evidence for recombination • Much of the genome is in blocks. • The distribution of block sizes varies across populations.

Testing Out-of-Africa • Generate simulations with and without migration. • Check size of haplotype blocks. • Does it vary when migrations are allowed? • When the ‘new’ population has a bottleneck? • If there was a bottleneck that created European and Asian populations, can we say anything about frequency of alleles that are ‘African specific’? • Should they be high frequency, or low frequency in African populations?

Haplotype Block: implications • The genome is mostly partitioned into haplotype blocks. • Within a block, there is extensive LD. • Is this good, or bad, for association mapping?

Coalescent reconstruction • Reconstructing likely coalescents

Re-constructing history in the absence of recombination

