
Population structure - Foundations to software

Population structure - Foundations to software. Andrew J. Eckert Section of Evolution and Ecology, University of California at Davis, Davis, CA 95616 USA Ph: (530) 754-5743 E-mail: ajeckert@ucdavis.edu. Eckert, Population Structure, 5-Aug-2008 1.


Presentation Transcript


  1. Population structure - Foundations to software Andrew J. Eckert Section of Evolution and Ecology, University of California at Davis, Davis, CA 95616 USA Ph: (530) 754-5743 E-mail: ajeckert@ucdavis.edu

  2. An example from foxtail pine (Pinus balfouriana)

  3. Population structure in forest trees

  4. Canonical questions • To what extent do gene frequencies differ among populations of forest trees? • How is gene flow structured among populations of forest trees? • How can population structure inform us about other processes in forest trees?

  5. Topics • Hardy-Weinberg Equilibrium • Wahlund effect and F-statistics • Estimating F-statistics from real data • Relationship between Fst and Nem • Clustering methods

  6. Hardy-Weinberg Principle and Estimation of Allele Frequencies in Populations

  7. Mating Tables • Construct a mating table by assuming that: • Genotype frequencies are the same in both sexes • Mating is at random with respect to the genotypes at a particular locus • There is no segregation distortion or differential survival of zygotes.

  8. A generalized mating table

  9. Genotype frequencies of newly formed zygotes • Now make 3 more assumptions: • No mutation • No drift • All matings produce the same number of offspring on average • The frequency of each genotype in newly formed zygotes is then:

  10. More assumptions To make those zygote genotype frequencies into adult genotype frequencies assume further: • Generations do not overlap • No differential survival among genotypes

  11. Hardy-Weinberg Equilibrium (HWE) • This is HWE: • Freq(A1A1 in zygotes) = p^2 • Freq(A1A2 in zygotes) = 2p(1-p) • Freq(A2A2 in zygotes) = (1-p)^2 • Deviations from HWE must arise from violation of at least one of the previous assumptions. This is the power of HWE.

  12. Null hypothesis: No deviation from HW • Procedure: • Estimate allele frequencies • p, q • For two alleles, the sampled count of an allele follows a binomial distribution. For more than two alleles, the counts follow a multinomial distribution. • Maximum likelihood and Bayesian methods can be used to do this. • Generate expected HW genotypic frequencies • p^2, 2pq, q^2 • Compare with observed genotypic frequencies • Various test statistics: • χ² goodness of fit (discussed here) • G test (similar to chi-square, uses a likelihood ratio) • Exact tests (small samples)

  13. Chi-square test • Uses numbers (counts), not frequencies • k = number of categories • n = degrees of freedom • This statistic is distributed as the sum of n independent squared standard normal random variables (mean = 0 and variance = 1).
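
The procedure on slides 12 and 13 can be sketched in a few lines of Python. This is a minimal illustration, not the author's code: the function name `hwe_chi_square` and the genotype counts are hypothetical, and the test assumes a biallelic locus with both alleles observed, so that there is 1 degree of freedom (3 genotype classes, minus 1, minus 1 estimated allele frequency).

```python
import math

def hwe_chi_square(n11, n12, n22):
    """Chi-square goodness-of-fit test of HWE from genotype counts (biallelic locus)."""
    n = n11 + n12 + n22
    p = (2 * n11 + n12) / (2 * n)      # estimated frequency of allele A1
    q = 1 - p
    observed = (n11, n12, n22)
    expected = (n * p * p, 2 * n * p * q, n * q * q)   # expected counts, not frequencies
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper-tail probability of a chi-square variable with 1 df:
    # P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return p, chi2, p_value

# Hypothetical counts: 30 A1A1, 40 A1A2, 30 A2A2
p_hat, chi2, p_value = hwe_chi_square(30, 40, 30)
print(p_hat, chi2, round(p_value, 4))
```

With these counts the expected numbers are 25, 50 and 25, so the heterozygote deficit gives a chi-square of 4 on 1 degree of freedom.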

  14. Relaxation of random mating and the fixation index (f) • Now imagine that we have a mixture of randomly mating and selfing populations. The fraction of selfing individuals is .

  15. More on f: A simple estimator • Notice that x12 is an observed quantity and that 2pq is the expectation under HWE. • If the genotype and allele frequencies were observed without error:
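
The slide's own equation did not survive in the transcript. The usual textbook form of this estimator, which compares the observed heterozygote frequency x12 with its HWE expectation 2pq, is given here as an assumption about what the slide showed:

```latex
\hat{f} = 1 - \frac{x_{12}}{2pq}
% equivalently, the genotype frequencies with fixation index f are
x_{11} = p^2 + fpq, \qquad x_{12} = 2pq(1 - f), \qquad x_{22} = q^2 + fpq
```

For example, with p = q = 0.5 and an observed heterozygote frequency of 0.4, the estimate is 1 - 0.4/0.5 = 0.2.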

  16. Wahlund Effect and Wright’s F-statistics

  17. The Wahlund Effect • Consider two subpopulations each in HWE with a single biallelic locus where p1 and p2 are the allele frequencies in each subpopulation for allele A. If a sample is collected across both populations, the heterozygosity (H) of the sample is: • However, if the collection of subpopulations was in HWE: • The Wahlund effect:

  18. The Wahlund Effect - An example p1 = 0.3 p2 = 0.7 p = 0.50
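
The arithmetic behind this example can be checked with a short sketch. It assumes the two subpopulations contribute equally to the pooled sample; the function name `wahlund` is hypothetical.

```python
def wahlund(freqs):
    """Heterozygosity in a pooled sample of equal-sized subpopulations, each in HWE."""
    k = len(freqs)
    p_bar = sum(freqs) / k
    # Average heterozygosity actually present within the subpopulations
    h_s = sum(2 * p * (1 - p) for p in freqs) / k
    # Heterozygosity expected if the pooled sample were one HWE population
    h_t = 2 * p_bar * (1 - p_bar)
    # Variance in allele frequency among subpopulations
    var_p = sum((p - p_bar) ** 2 for p in freqs) / k
    return h_s, h_t, var_p

h_s, h_t, var_p = wahlund([0.3, 0.7])
print(h_s, h_t, var_p)
```

For p1 = 0.3 and p2 = 0.7 the pooled sample has a heterozygosity of 0.42 where 0.50 would be expected, and the deficit of 0.08 equals twice the variance in allele frequency among subpopulations (2 x 0.04).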

  19. What are the qualitative aspects of the Wahlund effect?

  20. Connection from the Wahlund effect to F-statistics

  21. Incorporation of local inbreeding (f) • Let Hi be the actual heterozygosity within individuals, Hs the expected heterozygosity within subpopulations all at HWE, and Ht the expected heterozygosity across the entire set of subpopulations. We can then define:
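
The definitions themselves are missing from the transcript. Wright's fixation indices are conventionally written in terms of these three heterozygosities as follows; this is the standard form and is assumed, not confirmed, to match the slide:

```latex
F_{IS} = \frac{H_S - H_I}{H_S}, \qquad
F_{ST} = \frac{H_T - H_S}{H_T}, \qquad
F_{IT} = \frac{H_T - H_I}{H_T}
% the three indices are linked by
(1 - F_{IT}) = (1 - F_{IS})(1 - F_{ST})
```

Using the values from the Wahlund example on slide 18 (Hs = 0.42, Ht = 0.50), Fst = (0.50 - 0.42)/0.50 = 0.16.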

  22. The meaning of Fst • Reduction in heterozygosity due to population structure relative to the heterozygosity expected in a randomly mating population. • Proportion of the total expected heterozygosity that is not accounted for by the expected heterozygosity within subpopulations. • Also, a measure of the probability that two gene copies chosen at random from two different subpopulations are identical-by-descent. • See Slatkin (1991, Genetical Res. 58: 167-175) for the relationship of Fst to coalescence times among gene copies.

  23. Estimating Fst from real data • So far, we have assumed that we know quantities without error. However, you have sampled from a set of populations and therefore have three kinds of error: • Error due to taking a sample from the existing populations. • Error due to real differences among populations. • Error arising from the fact that the existing populations exist along one of an infinitely large number of evolutionary trajectories producing the observed gene frequencies.

  24. A fixed effects estimator • Nei and Chesser (1983) provide a bias-corrected version of Gst (Nei, 1973; PNAS 70: 3321-3323):

  25. Precision of Gst

  26. A random effects estimator • Weir and Cockerham (1984; Evolution 38: 1358-1370) frame the estimator in ANOVA theory using random effects for one allele at a time (extension of Cockerham 1969, 1973): See Smouse and Williams (1982, Biometrics 38: 757-768) for a multivariate ANOVA approach.

  27. Multilocus estimators and significance testing • Weir and Cockerham suggest that multilocus estimates be weighted averages across loci assumed to be in linkage equilibrium. The weights are functions of the allele frequencies at the loci. • Significance of the estimates is typically assessed by bootstrapping over loci to get 95% confidence intervals. The test is then: does my 95% CI overlap 0? • Weir and Cockerham (1984) advocate using the jackknife over samples or loci to estimate variances for a given locus or the multilocus estimate, respectively.
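
Bootstrapping over loci can be sketched as below. This is an illustration under stated assumptions, not the procedure of any particular program: each locus is represented by a hypothetical (numerator, denominator) pair of variance components, the multilocus estimate is taken as the ratio of their sums, and a simple percentile interval is used.

```python
import random

def multilocus_fst(components):
    """Ratio of summed numerators to summed denominators across loci."""
    num = sum(a for a, _ in components)
    den = sum(b for _, b in components)
    return num / den

def bootstrap_ci(components, n_boot=1000, alpha=0.05, seed=1):
    """Percentile confidence interval from resampling loci with replacement."""
    rng = random.Random(seed)
    n_loci = len(components)
    estimates = sorted(
        multilocus_fst([components[rng.randrange(n_loci)] for _ in range(n_loci)])
        for _ in range(n_boot)
    )
    lo = estimates[int((alpha / 2) * n_boot)]
    hi = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Hypothetical per-locus components: (among-population, total)
loci = [(0.02, 0.40), (0.05, 0.45), (0.01, 0.30), (0.04, 0.50), (0.03, 0.35)]
point = multilocus_fst(loci)
ci = bootstrap_ci(loci)
print(point, ci)
```

If the lower bound of the interval lies above 0, the estimate would be judged significantly different from 0 under the test described on the slide. With only a handful of loci the interval is coarse, which is one reason many loci are preferred.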

  28. Software • FSTAT http://www2.unil.ch/popgen/softwares/fstat.htm • Arlequin http://lgb.unige.ch/arlequin/ • Genepop http://genepop.curtin.edu.au/

  29. Extensions • Estimation of pollen and seed movement in conifers (Ennos, 1994) • Bayesian estimation – very good for dominant data (cf. HICKORY http://darwin.eeb.uconn.edu/hickory/hickory.html) • Outlier detection. FDIST2 (http://www.rubic.rdg.ac.uk/~mab/software.html) • Population specific Fst values (Weir and Hill, 2002): • Models for differing data types: Haplotype frequencies and divergence between haplotypes. • Rst: Stepwise mutation models incorporated (Slatkin, 1995) • Nst: DNA sequence models incorporated

  30. Pollen-to-seed flow ratio Ennos (1994; Heredity) showed that under the equilibrium conditions applicable to Fst in general: Without inbreeding: With inbreeding:

  31. Outlier Detection: An example cf. Beaumont and Balding (2004, Mol. Ecol. 13:969-980) for a refinement of fdist2 into a hierarchical Bayesian model.

  32. FST to Nem • Under a number of assumptions that lead to the n-island model: • where p is the correlation in gene frequencies among populations. Typically this is assumed to be 0. • If the mutation rate is small enough that it can be ignored (especially if m >> u), then this simplifies to:
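
The simplified relationship is missing from the transcript. The sketch below uses the familiar island-model approximation Fst ≈ 1/(4Nem + 1), which is assumed here to be the simplification the slide refers to; the function names are hypothetical.

```python
def fst_from_nem(nem):
    """Equilibrium Fst under the island-model approximation, ignoring mutation."""
    return 1.0 / (4.0 * nem + 1.0)

def nem_from_fst(fst):
    """Invert the approximation to get the effective number of migrants per generation."""
    return (1.0 / fst - 1.0) / 4.0

for nem in (0.1, 1, 10):
    print(nem, round(fst_from_nem(nem), 3))
```

Under this approximation Nem = 1 corresponds to Fst = 0.2, and Fst falls off quickly as the number of migrants per generation rises. The inversion inherits every assumption of the island model, so values of Nem obtained this way are best read as rough indications.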

  33. Relationship between Nem and FST [Figure annotations: populations not very divergent due to high gene flow; populations very divergent due to low gene flow; Nem = 1]

  34. Estimators of Nem • Coalescent-based estimators of Nem (cf. Beerli and Felsenstein, 2001; PNAS 98: 4563-4568) • Coalescent-based estimators of Nem plus other forces (cf. Kuhner, 2006; Bioinformatics 22: 768-770). • Coalescent-based estimators of Nem plus other forces and population divergence (cf. Hey and Nielsen, 2007; PNAS 104: 2785-2790).

  35. Coalescent-based analyses - Examples

  36. AMOVA - Analysis of Molecular Variance • Hierarchical fixation indices are conducive to inclusion of more levels. Before we had: 1. Individual 2. Subpopulation 3. Total Population • We could instead have the following: • Individual • Subpopulation • Groups of subpopulations • Entire population (= all subpopulations) • This can be properly addressed with AMOVA.

  37. AMOVA - An Example of Hierarchical Levels Fixation indices can be calculated for among groups of populations (red lines; FCT) and among subpopulations within groups (green lines; FSC). This is done within an Analysis of Variance framework using (co)variance components within a general linear model.

  38. An AMOVA table

  39. Significance of variance components • Permutation analyses depending upon the component: • For FCT, this is typically done by permuting populations among groups. • For FSC, this is done by permuting genotypes among populations within groups. • For FST, this is done by permuting genotypes among populations among groups.

  40. Isolation-by-Distance (IBD) For populations that have large distributions, the n-island model may be inappropriate. For example, populations close in space may be less differentiated than those far away in space. If space is the primary driving force, then there should be a correlation between geographic and genetic distance among populations. This is tested by correlating pairwise FST with pairwise geographic distances using a matrix correlation function (i.e., the Mantel statistic).
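
A Mantel test of this kind can be sketched with the standard library alone. This is a minimal illustration with made-up data, not the implementation used by any of the packages named in the slides: the matrix correlation is the Pearson r between the upper triangles of the two distance matrices, and significance comes from jointly permuting the rows and columns of one matrix (one-sided test). All names and values are hypothetical.

```python
import math
import random

def upper_triangle(m, order=None):
    """Flatten the upper triangle, optionally after reordering populations."""
    n = len(m)
    idx = order if order is not None else list(range(n))
    return [m[idx[i]][idx[j]] for i in range(n) for j in range(i + 1, n)]

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def mantel(gen, geo, n_perm=999, seed=1):
    """Return the Mantel r and a one-sided permutation p-value."""
    rng = random.Random(seed)
    geo_vec = upper_triangle(geo)
    r_obs = pearson(upper_triangle(gen), geo_vec)
    n = len(gen)
    hits = 0
    for _ in range(n_perm):
        order = list(range(n))
        rng.shuffle(order)   # relabel populations in the genetic matrix
        if pearson(upper_triangle(gen, order), geo_vec) >= r_obs - 1e-12:
            hits += 1
    return r_obs, (hits + 1) / (n_perm + 1)

# Hypothetical data: five populations along a transect (positions in km),
# with pairwise FST exactly proportional to distance
coords = [0, 1, 3, 7, 15]
geo = [[abs(a - b) for b in coords] for a in coords]
gen = [[0.01 * d for d in row] for row in geo]
r, p_value = mantel(gen, geo)
print(r, p_value)
```

Permuting whole populations, not individual matrix cells, keeps the dependence among pairwise distances intact, which is the reason an ordinary correlation test is not appropriate here.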

  41. IBD - An Example [Figure: two plots of pairwise FST against distance (km). In one, IBD is present and the Mantel r is significant; in the other, IBD is not present and the Mantel r is not significant.]

  42. Clustering Methods

  43. Bayesian Clustering as an Alternative to FST analyses • Multilocus genotypes contain information about population structure. • The underlying information is the same as that used in estimating FST. • However, a model-based approach alleviates the need to define populations a priori, because the number of populations is a parameter of the model (Pritchard et al., 2000 Genetics 155: 945-959). • So, what is the model?

  44. The Model • Parametric model for the frequency distribution of alleles in an unknown number of populations, each in HWE. • Pritchard et al. assume a Dirichlet distribution for the allele frequencies. • The basic idea is to approximate, using Markov Chain Monte Carlo, the Posterior Probability Distribution (PPD) of the allele frequencies (P) in the unknown populations of origin (Z) given the observed multilocus genotypes (X). [Loci are assumed to be in linkage equilibrium.] [Equation labels on the slide: PPD, KP, DP, ML] http://pritch.bsd.uchicago.edu/structure.html
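
The target distribution itself is not reproduced in the transcript. In the notation of this slide, the posterior described by Pritchard et al. (2000) for the model without admixture is usually written as follows; it is offered as a reminder of the general form and may not match the slide exactly:

```latex
\Pr(Z, P \mid X) \propto \Pr(Z)\,\Pr(P)\,\Pr(X \mid Z, P)
```

Here Pr(Z) is the prior on populations of origin, Pr(P) is the Dirichlet prior on allele frequencies, and Pr(X | Z, P) is the likelihood of the genotypes, which under HWE and linkage equilibrium is a product of allele frequencies over loci and gene copies.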

  45. Inferences Using STRUCTURE • Population structure present: • This method can also infer the optimal number of clusters (= populations) K given the data. This is done by assuming a uniform prior on a range of possible values for K. The results are the posterior probabilities for each value of K in the prior.

  46. Another method for inference of K • The ΔK method of Evanno et al. (2005, Mol. Ecol. 14: 2611-2620):
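
The ΔK statistic can be sketched as follows. This is one common reading of the method, assumed here for illustration: the second-order rate of change is taken from the mean log probability of the data, L(K), across replicate runs and divided by the standard deviation of L(K) across those runs. The log probabilities below are made up, and the function name `delta_k` is hypothetical.

```python
from statistics import mean, stdev

def delta_k(ln_prob):
    """ln_prob maps K to a list of ln P(X|K) values from replicate runs.

    Returns {K: delta K} for every K that has both neighbours K-1 and K+1.
    Assumes the replicate runs at each K are not all identical (stdev > 0).
    """
    means = {k: mean(v) for k, v in ln_prob.items()}
    result = {}
    for k in sorted(ln_prob):
        if k - 1 in means and k + 1 in means:
            second_diff = abs(means[k + 1] - 2 * means[k] + means[k - 1])
            result[k] = second_diff / stdev(ln_prob[k])
    return result

# Hypothetical ln P(X|K) from two replicate runs per K
runs = {
    1: [-1000.0, -1002.0],
    2: [-800.0, -801.0],
    3: [-790.0, -795.0],
    4: [-789.0, -796.0],
}
dk = delta_k(runs)
best_k = max(dk, key=dk.get)
print(dk, best_k)
```

ΔK is undefined for the smallest and largest K tested, so the method cannot select K = 1. In practice many more than two replicate runs per K would be used.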

  47. Post-processing of STRUCTURE runs • Production of Q-plots: DISTRUCT • Solving the label-switching problem and averaging across replicated runs: CLUMPP http://rosenberglab.bioinformatics.med.umich.edu/

  48. What are those strange plots? • Plots of Q-values are nothing more than stacked barplots • These can be grouped into geographical regions or can be plotted onto a map if you have spatial coordinates using R scripts (available from the TESS website)

  49. Other clustering methods • TESS (prior on spatial coordinates using Dirichlet tessellations; Francois et al., 2006; Genetics 174: 805-816) available from: http://www-timc.imag.fr/Olivier.Francois/tess.html • EIGENSOFT (PCA; Patterson et al., 2006; PLoS Genetics 2: e190). • Good for SNP data

  50. What does it all mean? A point for discussion • What is the “evolutionary” significance of population structure in forest trees? • How would something like Sewall Wright’s shifting balance theory of evolution work in forest trees?
