Population structure - Foundations to software. Andrew J. Eckert Section of Evolution and Ecology, University of California at Davis, Davis, CA 95616 USA Ph: (530) 754-5743 E-mail: ajeckert@ucdavis.edu. Eckert, Population Structure, 5-Aug-2008 1.
An example from foxtail pine (Pinus balfouriana)
Population structure in forest trees
Canonical questions
• To what extent do gene frequencies differ among populations of forest trees?
• How is gene flow structured among populations of forest trees?
• How can population structure inform us about other processes in forest trees?
Topics
• Hardy-Weinberg Equilibrium
• Wahlund effect and F-statistics
• Estimating F-statistics from real data
• Relationship between Fst and Nem
• Clustering methods
Hardy-Weinberg Principle and Estimation of Allele Frequencies in Populations
Mating Tables
Construct a mating table by assuming that:
• Genotype frequencies are the same in both sexes
• Mating is random with respect to the genotypes at a particular locus
• There is no segregation distortion or differential survival of zygotes
A generalized mating table
Genotype frequencies of newly formed zygotes
Now make 3 more assumptions:
• No mutation
• No drift
• All matings produce the same number of offspring on average
The frequency of each genotype in newly formed zygotes is then:
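The zygote frequencies follow from summing over the mating table. A sketch of that sum, assuming the parental genotype frequencies are written x11, x12 and x22 (x12 is used for heterozygotes later in these slides; x11 and x22 are assumed here by analogy) and p = x11 + x12/2 is the frequency of A1:

```latex
\begin{align*}
\mathrm{freq}(A_1A_1) &= x_{11}^2 + x_{11}x_{12} + \tfrac{1}{4}x_{12}^2
                       = \left(x_{11} + \tfrac{1}{2}x_{12}\right)^2 = p^2 \\
\mathrm{freq}(A_1A_2) &= 2\left(x_{11} + \tfrac{1}{2}x_{12}\right)\left(x_{22} + \tfrac{1}{2}x_{12}\right)
                       = 2p(1-p) \\
\mathrm{freq}(A_2A_2) &= \left(x_{22} + \tfrac{1}{2}x_{12}\right)^2 = (1-p)^2
\end{align*}
```

Each A1A1 term is the frequency of a mating type times the fraction of its offspring that are A1A1: all offspring of A1A1 x A1A1, half of A1A1 x A1A2 (which occurs with frequency 2 x11 x12), and a quarter of A1A2 x A1A2.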
More assumptions
To make those zygote genotype frequencies into adult genotype frequencies assume further:
• Generations do not overlap
• No differential survival among genotypes
Hardy-Weinberg Equilibrium (HWE)
This is HWE:
• Freq(A1A1 in zygotes) = $p^2$
• Freq(A1A2 in zygotes) = $2p(1-p)$
• Freq(A2A2 in zygotes) = $(1-p)^2$
Deviations from HWE must arise from violation of at least one of the previous assumptions. This is the power of HWE.
Null hypothesis: No deviation from HW
Procedure:
• Estimate allele frequencies (p, q)
  • For two alleles, the sampled allele count is a binomial random variable. For more than 2 alleles, it is multinomial.
  • Maximum likelihood and Bayesian methods are available to do this.
• Generate expected HW genotypic frequencies ($p^2$, $2pq$, $q^2$)
• Compare with observed genotypic frequencies using one of various test statistics:
  • $\chi^2$ goodness of fit (discussed here)
  • G test (similar to chi-square, based on a likelihood ratio)
  • Exact tests (small samples)
Chi-square test
$\chi^2 = \sum_{i=1}^{k} (O_i - E_i)^2 / E_i$
• O and E are observed and expected numbers (counts), not frequencies
• k = number of categories
• n = degrees of freedom
• This statistic is distributed as the sum of n independent squared standard normal random variables (mean = 0, variance = 1).
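The procedure (estimate p, generate expected HW counts, compare with observed counts) can be sketched in a few lines. This is a minimal illustration for one biallelic locus, not the output of any particular package; the function name and the genotype counts are hypothetical. With three genotype classes and one estimated allele frequency, the test has 3 - 1 - 1 = 1 degree of freedom, for which the 5% critical value is 3.841.

```python
def hwe_chi_square(n11, n12, n22):
    """Chi-square goodness-of-fit statistic for HWE at one biallelic locus.

    n11, n12, n22 are observed genotype counts (numbers, not frequencies).
    Returns the estimated frequency of allele A1, the expected counts,
    and the chi-square statistic.
    """
    n = n11 + n12 + n22
    p = (2 * n11 + n12) / (2 * n)          # allele frequency by gene counting
    q = 1 - p
    if p == 0 or q == 0:
        raise ValueError("locus is monomorphic; the test is undefined")
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n11, n12, n22)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return p, expected, chi2


# Hypothetical sample of 100 trees with a deficit of heterozygotes.
p_hat, expected_counts, chi2 = hwe_chi_square(30, 40, 30)
reject_hwe = chi2 > 3.841                   # 5% critical value, 1 df
```

Here the expected counts are 25, 50 and 25, so the statistic is 4.0 and HWE would be rejected at the 5% level.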
Relaxation of random mating and the fixation index (f)
• Now imagine that we have a mixture of randomly mating and selfing populations. The fraction of selfing individuals is .
More on f: A simple estimator
• Notice that x12 is an observed quantity and that 2pq is its expectation under HWE.
• If the genotype and allele frequencies were observed without error:
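With x12 observed and 2pq expected, the usual moment estimator is f = 1 - x12 / (2pq). A minimal sketch of that calculation, treating sample frequencies as if they were known without error; the function name and counts are hypothetical.

```python
def fixation_index(n11, n12, n22):
    """Estimate f = 1 - x12 / (2pq) from genotype counts at one biallelic locus."""
    n = n11 + n12 + n22
    p = (2 * n11 + n12) / (2 * n)   # frequency of allele A1
    x12 = n12 / n                   # observed heterozygote frequency
    return 1 - x12 / (2 * p * (1 - p))


# 40 heterozygotes observed where HWE predicts 50 gives f = 0.2.
f_hat = fixation_index(30, 40, 30)
```

A positive f indicates a deficit of heterozygotes relative to HWE, and a negative f indicates an excess.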
Wahlund Effect and Wright’s F-statistics
The Wahlund Effect
• Consider two subpopulations, each in HWE, at a single biallelic locus, where p1 and p2 are the frequencies of allele A in each subpopulation. If a sample is collected across both subpopulations, the heterozygosity (H) of the sample is:
• However, if the collection of subpopulations were in HWE:
• The Wahlund effect:
The Wahlund Effect - An example
p1 = 0.3, p2 = 0.7, mean p = 0.50
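The example can be worked through numerically. The sketch below assumes equally sized subpopulations, so the pooled heterozygosity is the plain average of the within-subpopulation values 2 p_i (1 - p_i), while a single panmictic population would have 2 p (1 - p) evaluated at the mean frequency. The function name is hypothetical.

```python
def wahlund(freqs):
    """Heterozygosity deficit from pooling subpopulations that are each in HWE.

    freqs: frequency of allele A in each (equally sized) subpopulation.
    Returns (H in the pooled sample, H expected under HWE for the whole,
    variance of allele frequency among subpopulations).
    """
    k = len(freqs)
    p_bar = sum(freqs) / k
    h_pooled = sum(2 * p * (1 - p) for p in freqs) / k
    h_hwe = 2 * p_bar * (1 - p_bar)
    var_p = sum((p - p_bar) ** 2 for p in freqs) / k
    return h_pooled, h_hwe, var_p


h_pooled, h_hwe, var_p = wahlund([0.3, 0.7])
deficit = h_hwe - h_pooled   # equals 2 * var_p
```

For p1 = 0.3 and p2 = 0.7 the pooled sample has H = 0.42 against an expectation of 0.50, and the deficit of 0.08 is twice the variance in allele frequency (0.04).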
What are the qualitative aspects of the Wahlund effect?
Incorporation of local inbreeding (f)
• Let Hi be the actual heterozygosity within individuals, Hs the expected heterozygosity within subpopulations (all at HWE) and Ht the expected heterozygosity across the entire set of subpopulations. We can then define:
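Wright's three fixation indices are conventionally defined from these heterozygosities as Fis = 1 - Hi/Hs, Fst = 1 - Hs/Ht and Fit = 1 - Hi/Ht, which implies (1 - Fit) = (1 - Fis)(1 - Fst). A short sketch with hypothetical input values:

```python
def f_statistics(h_i, h_s, h_t):
    """Wright's F-statistics from observed (h_i), within-subpopulation (h_s)
    and total (h_t) heterozygosities, all treated as known without error."""
    f_is = 1 - h_i / h_s
    f_st = 1 - h_s / h_t
    f_it = 1 - h_i / h_t
    return f_is, f_st, f_it


# Hypothetical values: Hs and Ht from two subpopulations with p = 0.3 and 0.7,
# and Hi reduced by local inbreeding of f = 0.2.
f_is, f_st, f_it = f_statistics(h_i=0.336, h_s=0.42, h_t=0.50)
```

The identity (1 - Fit) = (1 - Fis)(1 - Fst) is a useful check on any hand calculation.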
The meaning of Fst
• Reduction in variance due to population structure relative to maximum variance possible in a randomly mating population.
• Proportion of the total expected heterozygosity that is not accounted for by the expected heterozygosity within subpopulations, i.e. the share due to differences among subpopulations.
• Also, a measure of the probability that two gene copies chosen at random from two different subpopulations are identical-by-descent.
• See Slatkin (1991, Genet. Res. 58: 167-175) for the relationship of Fst to coalescence times among gene copies.
Estimating Fst from real data
So far we have assumed that we know quantities without error. However, you have sampled from a set of populations and therefore have three kinds of error:
• Error due to taking a sample from the existing populations.
• Error due to real differences among populations.
• Error arising from the fact that the existing populations lie along one of an infinitely large number of evolutionary trajectories that could have produced the observed gene frequencies.
A fixed effects estimator
• Nei and Chesser (1983) provide a bias-corrected version of Gst (Nei, 1973; PNAS 70: 3321-3323):
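A sketch of the correction as it is commonly written for a single locus: Hs = n/(n-1) [1 - mean(sum p_i^2) - Ho/(2n)] and Ht = 1 - sum(mean p_i)^2 + Hs/(n s) - Ho/(2 n s), where n is the harmonic mean sample size, s the number of subpopulations and Ho the mean observed heterozygosity, with Gst = 1 - Hs/Ht. The function name, argument layout and data are assumptions for illustration and should be checked against the original paper before use.

```python
def nei_chesser_gst(pop_freqs, pop_sizes, h_obs):
    """Bias-corrected Hs, Ht and Gst for one locus.

    pop_freqs: per-subpopulation lists of allele frequencies.
    pop_sizes: number of diploid individuals sampled per subpopulation.
    h_obs:     observed heterozygote frequency per subpopulation.
    """
    s = len(pop_freqs)
    n_harm = s / sum(1.0 / n for n in pop_sizes)      # harmonic mean sample size
    n_alleles = len(pop_freqs[0])
    ho = sum(h_obs) / s
    mean_sum_sq = sum(sum(p * p for p in freqs) for freqs in pop_freqs) / s
    mean_p = [sum(freqs[i] for freqs in pop_freqs) / s for i in range(n_alleles)]
    hs = n_harm / (n_harm - 1) * (1 - mean_sum_sq - ho / (2 * n_harm))
    ht = (1 - sum(p * p for p in mean_p)
          + hs / (n_harm * s) - ho / (2 * n_harm * s))
    return hs, ht, 1 - hs / ht


# Hypothetical data: two subpopulations of 50 trees, each in HWE.
hs, ht, gst = nei_chesser_gst([[0.3, 0.7], [0.7, 0.3]], [50, 50], [0.42, 0.42])
```

For these data the corrected Gst is slightly below the uncorrected value of 0.16, because part of the apparent differentiation is attributed to sampling.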
Precision of Gst
A random effects estimator
• Weir and Cockerham (1984; Evolution 38: 1358-1370) frame the estimator in ANOVA theory using random effects for one allele at a time (extension of Cockerham 1969, 1973):
• See Smouse and Williams (1982, Biometrics 38: 757-768) for a multivariate ANOVA approach.
Multilocus estimators and significance testing
• Weir and Cockerham suggest that multilocus estimates be weighted averages across loci assumed to be in linkage equilibrium. The weights are functions of the allele frequencies at the loci.
• Significance of the estimates is typically assessed by bootstrapping over loci to obtain 95% confidence intervals. The test is then: does my 95% CI overlap 0?
• Weir and Cockerham (1984) advocate using the jackknife over samples or loci to estimate variances for a given locus or the multilocus estimate, respectively.
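Bootstrapping over loci can be sketched as follows. The sketch assumes that each locus has already been reduced to a numerator and a denominator (for Weir and Cockerham's estimator, the among-population variance component and the total variance) and that the multilocus estimate is the ratio of their sums. The per-locus values, function names and the percentile interval are illustrative assumptions, not output from FSTAT or any other package.

```python
import random


def multilocus_fst(components):
    """Ratio-of-sums estimate from per-locus (numerator, denominator) pairs."""
    return sum(a for a, _ in components) / sum(d for _, d in components)


def bootstrap_ci(components, replicates=1000, seed=1):
    """Percentile 95% CI obtained by resampling loci with replacement."""
    rng = random.Random(seed)
    n = len(components)
    estimates = sorted(
        multilocus_fst([components[rng.randrange(n)] for _ in range(n)])
        for _ in range(replicates)
    )
    lower = estimates[(replicates * 25) // 1000]
    upper = estimates[(replicates * 975) // 1000 - 1]
    return lower, upper


# Hypothetical per-locus variance components for five loci.
loci = [(0.02, 0.40), (0.05, 0.45), (0.01, 0.30), (0.04, 0.50), (0.03, 0.35)]
point = multilocus_fst(loci)
lower, upper = bootstrap_ci(loci)
differs_from_zero = lower > 0
```

With only a handful of loci the interval is coarse; in practice many more loci are needed for the bootstrap to be informative.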
Software
• FSTAT http://www2.unil.ch/popgen/softwares/fstat.htm
• Arlequin http://lgb.unige.ch/arlequin/
• Genepop http://genepop.curtin.edu.au/
Extensions
• Estimation of pollen and seed movement in conifers (Ennos, 1994)
• Bayesian estimation: very good for dominant data (cf. HICKORY http://darwin.eeb.uconn.edu/hickory/hickory.html)
• Outlier detection: FDIST2 (http://www.rubic.rdg.ac.uk/~mab/software.html)
• Population specific Fst values (Weir and Hill, 2002):
• Models for differing data types: haplotype frequencies and divergence between haplotypes.
  • Rst: Stepwise mutation models incorporated (Slatkin, 1995)
  • Nst: DNA sequence models incorporated
Pollen-to-seed flow ratio
Ennos (1994; Heredity) showed that under the equilibrium conditions applicable to Fst in general:
• Without inbreeding:
• With inbreeding:
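The two expressions did not survive on this slide. As the result is usually quoted, with F_ST(b) estimated from biparentally inherited (nuclear) markers and F_ST(m) from maternally inherited markers, they take the form below. The symbols m_p and m_s for pollen and seed migration rates are notation assumed here, and the formulas should be checked against Ennos (1994) before use.

```latex
% Without inbreeding
\frac{m_p}{m_s} =
  \frac{\left(\dfrac{1}{F_{ST(b)}} - 1\right) - 2\left(\dfrac{1}{F_{ST(m)}} - 1\right)}
       {\dfrac{1}{F_{ST(m)}} - 1}

% With inbreeding (F_{IS} is the within-population inbreeding coefficient)
\frac{m_p}{m_s} =
  \frac{\left(\dfrac{1}{F_{ST(b)}} - 1\right)\left(1 + F_{IS}\right) - 2\left(\dfrac{1}{F_{ST(m)}} - 1\right)}
       {\dfrac{1}{F_{ST(m)}} - 1}
```

Setting F_IS = 0 in the second expression recovers the first.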
Outlier Detection: An example
cf. Beaumont and Balding (2004, Mol. Ecol. 13: 969-980) for a refinement of fdist2 into a hierarchical Bayesian model.
FST to Nem
• Under a number of assumptions that lead to the n-island model:
• where p is the correlation in gene frequencies among populations. Typically this is assumed to be 0.
• If the mutation rate is small enough that it can be ignored (especially if m >> u), then this simplifies to:
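The simplified island-model relationship is the familiar Fst ≈ 1 / (4 Ne m + 1), which rearranges to Ne m ≈ (1/Fst - 1) / 4. A minimal sketch of both directions; the function names are hypothetical and the result holds only under the island-model assumptions (equilibrium, equal population sizes, symmetric migration, negligible mutation).

```python
def fst_from_nem(nem):
    """Expected Fst for a given number of migrants per generation (Ne * m)."""
    return 1.0 / (4.0 * nem + 1.0)


def nem_from_fst(fst):
    """Island-model estimate of Ne * m implied by an Fst value."""
    if not 0 < fst <= 1:
        raise ValueError("Fst must be in (0, 1] for this conversion")
    return (1.0 / fst - 1.0) / 4.0


fst_at_one_migrant = fst_from_nem(1.0)   # one migrant per generation gives Fst = 0.2
nem_low_structure = nem_from_fst(0.05)   # weak structure implies Nem of about 4.75
```

Because the curve is steep at small Fst, modest errors in a low Fst estimate translate into large errors in Nem.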
Relationship between Nem and FST
[Figure: FST in relation to Nem, with Nem = 1 marked. High gene flow: populations not very divergent. Low gene flow: populations very divergent.]
Estimators of Nem
• Coalescent-based estimators of Nem (cf. Beerli and Felsenstein, 2001; PNAS 98: 4563-4568)
• Coalescent-based estimators of Nem plus other forces (cf. Kuhner, 2006; Bioinformatics 22: 768-770)
• Coalescent-based estimators of Nem plus other forces and population divergence (cf. Hey and Nielsen, 2007; PNAS 104: 2785-2790)
Coalescent-based analyses - Examples
AMOVA - Analysis of Molecular Variance
• Hierarchical fixation indices are conducive to inclusion of more levels. Before we had:
  1. Individual
  2. Subpopulation
  3. Total population
• We could instead have the following:
  1. Individual
  2. Subpopulation
  3. Groups of subpopulations
  4. Entire population (= all subpopulations)
• This can be properly addressed with AMOVA.
AMOVA - An Example of Hierarchical Levels
Fixation indices can be calculated among groups of populations (red lines; FCT) and among subpopulations within groups (green lines; FSC). This is done within an Analysis of Variance framework using (co)variance components within a general linear model.
An AMOVA table
Significance of variance components
Permutation analyses, depending upon the component:
• For FCT, this is typically done by permuting populations among groups.
• For FSC, this is done by permuting genotypes among populations within groups.
• For FST, this is done by permuting genotypes among populations across groups.
Isolation-by-Distance (IBD)
For populations that have large distributions, the n-island model may be inappropriate. For example, populations close in space may be less differentiated than those far away in space. If space is the primary driving force, then there should be a correlation between geographic and genetic distance among populations. This is tested by correlating pairwise FST with pairwise geographic distances using a matrix correlation function (i.e., the Mantel statistic).
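A Mantel test can be sketched with the standard library alone: compute the Pearson correlation between the upper triangles of the two distance matrices, then build a null distribution by permuting the population labels of one matrix (rows and columns together). The data below are hypothetical, the test is one-tailed (positive correlation), and the function names are assumptions for illustration.

```python
import math
import random


def upper_triangle(m):
    n = len(m)
    return [m[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)   # undefined if either matrix is constant


def mantel(geo, gen, permutations=999, seed=0):
    """Mantel r and one-tailed permutation p-value for two distance matrices."""
    n = len(geo)
    r_obs = pearson(upper_triangle(geo), upper_triangle(gen))
    rng = random.Random(seed)
    order = list(range(n))
    hits = 0
    for _ in range(permutations):
        rng.shuffle(order)              # relabel populations in one matrix
        shuffled = [[geo[order[i]][order[j]] for j in range(n)] for i in range(n)]
        if pearson(upper_triangle(shuffled), upper_triangle(gen)) >= r_obs:
            hits += 1
    return r_obs, (hits + 1) / (permutations + 1)


# Hypothetical transect: five populations whose pairwise FST rises with distance.
positions_km = [0, 1, 2, 4, 7]
geo = [[abs(a - b) for b in positions_km] for a in positions_km]
fst = [[0.01 * d for d in row] for row in geo]
r, p_value = mantel(geo, fst)
```

Permuting labels rather than individual matrix cells preserves the dependence among pairwise distances that share a population, which is the reason an ordinary correlation test is not appropriate here.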
IBD - An Example
[Figure: two panels plotting FST against distance (km). IBD present: Mantel r significant. IBD not present: Mantel r not significant.]
Clustering Methods
Bayesian Clustering as an Alternative to FST analyses
• Multilocus genotypes contain information about population structure.
• The underlying information is the same as that used in estimating FST.
• However, a model-based approach alleviates the need to define populations a priori, because the number of populations is a parameter of the model (Pritchard et al., 2000; Genetics 155: 945-959).
• So, what is the model?
The Model
• Parametric model for the frequency distribution of alleles in an unknown number of populations, each in HWE.
• Pritchard et al. assume a Dirichlet distribution for the allele frequencies.
• The basic idea is to approximate $\Pr(Z, P \mid X) \propto \Pr(Z)\,\Pr(P)\,\Pr(X \mid Z, P)$ using Markov Chain Monte Carlo. This is the Posterior Probability Distribution (PPD) of the allele frequencies (P) in the unknown populations of origin (Z) given the observed multilocus genotypes (X). [Loci are assumed to be in linkage equilibrium.]
http://pritch.bsd.uchicago.edu/structure.html
Inferences Using STRUCTURE
• Population structure present:
• This method can also infer the optimal number of clusters (= populations) K given the data. This is done by assuming a uniform prior on a range of possible values for K. The results are the posterior probabilities for each value of K in the prior.
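With a uniform prior, the posterior probability of each K is proportional to Pr(X | K), so the reported estimates of ln Pr(X | K) can be normalised directly. A sketch using hypothetical log-likelihood values; the maximum is subtracted before exponentiating to avoid numerical underflow, and the function name is an assumption.

```python
import math


def posterior_k(log_lik):
    """Posterior Pr(K | X) under a uniform prior on the K values supplied.

    log_lik maps each K to an estimate of ln Pr(X | K).
    """
    largest = max(log_lik.values())
    weights = {k: math.exp(v - largest) for k, v in log_lik.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


# Hypothetical estimates of ln Pr(X | K) for K = 1..5.
ln_prob = {1: -4356.0, 2: -3983.0, 3: -3982.0, 4: -3983.0, 5: -4006.0}
posterior = posterior_k(ln_prob)
best_k = max(posterior, key=posterior.get)
```

Here most of the posterior mass falls on K = 3, with the remainder split between K = 2 and K = 4. Small differences in ln Pr(X | K) between runs can shift these probabilities considerably, so replicated runs are advisable.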
Another method for inference of K
• The ΔK method of Evanno et al. (2005, Mol. Ecol. 14: 2611-2620):
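The statistic compares the second-order rate of change of L(K) = ln Pr(X | K) with its variability across replicate runs: ΔK = |L(K+1) - 2 L(K) + L(K-1)| / sd[L(K)]. The sketch below assumes the second difference is taken on the replicate means of L(K), which is one common way of computing it; the replicate values and function name are hypothetical.

```python
import statistics


def delta_k(runs):
    """Delta K for every K that has both neighbours K-1 and K+1.

    runs maps K to a list of ln Pr(X | K) values from replicate runs.
    """
    mean_l = {k: statistics.mean(v) for k, v in runs.items()}
    result = {}
    for k in sorted(runs):
        if k - 1 in runs and k + 1 in runs:
            second_diff = abs(mean_l[k + 1] - 2 * mean_l[k] + mean_l[k - 1])
            result[k] = second_diff / statistics.stdev(runs[k])
    return result


# Hypothetical replicate runs for K = 1..4.
replicate_runs = {
    1: [-4356.0, -4357.0, -4355.0],
    2: [-3983.0, -3984.0, -3982.0],
    3: [-3980.0, -3985.0, -3975.0],
    4: [-3985.0, -3986.0, -3984.0],
}
dk = delta_k(replicate_runs)
best_k = max(dk, key=dk.get)
```

ΔK cannot be evaluated at the smallest or largest K tested, so it can never select K = 1, and it is undefined when the replicate runs for a given K are identical (zero standard deviation).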
Post-processing of STRUCTURE runs
• Production of Q-plots: DISTRUCT
• Solving the label-switching problem and averaging across replicated runs: CLUMPP
http://rosenberglab.bioinformatics.med.umich.edu/
What are those strange plots?
• Plots of Q-values are nothing more than stacked barplots.
• These can be grouped into geographical regions or, if you have spatial coordinates, plotted onto a map using R scripts (available from the TESS website).
Other clustering methods
• TESS (prior on spatial coordinates using Dirichlet tessellations; Francois et al., 2006; Genetics 174: 805-816), available from: http://www-timc.imag.fr/Olivier.Francois/tess.html
• EIGENSOFT (PCA; Patterson et al., 2006; PLoS Genetics 2: e190)
  • Good for SNP data
What does it all mean? A point for discussion
• What is the “evolutionary” significance of population structure in forest trees?
• How would something like Sewall Wright’s shifting balance theory of evolution work in forest trees?