Population Structure, Association Studies, and QTLs

katen

Presentation Transcript


  1. Population Structure, Association Studies, and QTLs Stat 115/215

  2. Structure Algorithm • One of the most widely used programs in population genetics (original paper cited >9,000 times since 2000) • Pritchard, Stephens and Donnelly (2000). Inference of Population Structure Using Multilocus Genotype Data. Genetics 155:945-959. • A very flexible model that can determine: • The most likely number of uniform groups (populations, K) • The genomic composition of each individual (admixture coefficients) • Possible population of origin

  3. A simple model of population structure • Individuals in our sample represent a mixture of K (unknown) ancestral populations. • Each population is characterized by (unknown) allele frequencies at each locus. • Within populations, markers are in Hardy-Weinberg and linkage equilibrium.

  4. The model • Let A1, A2, …, AK represent the (unknown) allele frequencies in each subpopulation • Let Z1, Z2, … , Zm represent the (unknown) subpopulation of origin of the sampled individuals – they are indicators • Assuming HWE and LE within subpopulations, the likelihood of an individual’s genotypes at various loci in subpopulation k is given by the product of the relevant allele frequencies:

  5. More details • The probability of observing a genotype at locus l by chance in a population is a function of the allele frequencies: • $P_l = p_i^2$ for homozygous loci • $P_l = 2 p_i p_j$ for heterozygous loci • Assuming no linkage among the markers, we have the product form as on the previous slide.
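The formula that closed slide 4 did not survive in this transcript. A plausible reconstruction from the definitions on slides 4 and 5 is sketched below; it is not necessarily the slide's exact notation. The symbols $X_i$ (the multilocus genotype of individual $i$), $L$ (the number of loci), and $p_{kla}$ (the frequency of allele $a$ at locus $l$ in subpopulation $k$) are assumptions introduced here.

```latex
\Pr(X_i \mid Z_i = k, A_k) = \prod_{l=1}^{L} P_l^{(k)},
\qquad
P_l^{(k)} =
\begin{cases}
p_{kla}^{2} & \text{if individual } i \text{ is homozygous } (a,a) \text{ at locus } l,\\[2pt]
2\,p_{kla}\,p_{klb} & \text{if individual } i \text{ is heterozygous } (a,b) \text{ at locus } l.
\end{cases}
```

The product over loci relies on linkage equilibrium, and the per-locus terms rely on Hardy-Weinberg equilibrium within each subpopulation.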

  6. Heuristics • If we knew the population allele frequencies in advance, then it would be easy to assign individuals (using Bayes' rule). • If we knew the individual assignments, it would be easy to estimate frequencies. • In practice, we don’t know either of these, but we have the Gibbs sampler!

  7. MCMC algorithm (for fixed K) • Start with random assignment of individuals to populations • Step 1: Gene frequencies in each population are estimated based on the individuals that are assigned to it. • Step 2: Individuals are assigned to populations based on gene frequencies in each population. • And this is repeated... • Estimation of K performed separately
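The two alternating steps can be sketched as a toy Gibbs sampler for the no-admixture case. This is a minimal illustration, not the Structure program itself: it assumes biallelic markers coded as 0/1/2 copies of one allele, a flat Beta(1, 1) prior on allele frequencies, and a uniform prior on population membership. The function name `gibbs_structure` and its parameters are hypothetical.

```python
import math
import random

def gibbs_structure(genotypes, K, n_iter=200, seed=0):
    """genotypes[i][l] = copies (0, 1, or 2) of one allele for individual i at locus l.
    Returns the population labels from the final Gibbs sweep."""
    rng = random.Random(seed)
    n, L = len(genotypes), len(genotypes[0])
    # Start with a random assignment of individuals to populations.
    z = [rng.randrange(K) for _ in range(n)]
    for _ in range(n_iter):
        # Step 1: sample allele frequencies given the current assignments.
        freqs = []
        for k in range(K):
            members = [g for g, zi in zip(genotypes, z) if zi == k]
            row = []
            for l in range(L):
                alt = sum(g[l] for g in members)
                ref = 2 * len(members) - alt
                p = rng.betavariate(1 + alt, 1 + ref)  # Beta(1, 1) prior (assumption)
                row.append(min(max(p, 1e-9), 1 - 1e-9))  # guard against log(0)
            freqs.append(row)
        # Step 2: sample each individual's population given the frequencies.
        for i, g in enumerate(genotypes):
            logp = []
            for k in range(K):
                # HWE + LE: product over loci. The factor 2 for heterozygotes
                # is the same for every k, so it cancels and is omitted.
                lp = sum(g[l] * math.log(freqs[k][l])
                         + (2 - g[l]) * math.log(1 - freqs[k][l])
                         for l in range(L))
                logp.append(lp)
            m = max(logp)
            weights = [math.exp(v - m) for v in logp]
            z[i] = rng.choices(range(K), weights=weights)[0]
    return z
```

Labels are arbitrary (population 0 and 1 may swap between runs), and a real analysis would average over many sweeps after burn-in rather than read off the last one.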

  8. Admixed individuals are mosaics of ancestral populations

  9. Two basic models

  10. Inferred from human populations

  11. More details

  12. Alternative approach • Structure is very computationally intensive • Often there is no clear best-supported K-value • An alternative is to use traditional multivariate statistics to find uniform groups • Principal Components Analysis is the most commonly used algorithm • EIGENSOFT (PCA, Patterson et al., 2006; PLoS Genetics 2:e190)

  13. Principal Component Analysis • Efficient way to summarize multivariate data like genotypes • Each axis passes through maximum variation in data, explains a component of the variation
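As a rough illustration of how the first axis is found, the sketch below centers each marker and uses power iteration to get the direction of maximum variance. It is a simplification of what tools such as EIGENSOFT do (for example, no per-marker scaling is applied), and the function name `first_pc_scores` is hypothetical. It assumes a genotype matrix coded 0/1/2 with at least one variable marker.

```python
import math
import random

def first_pc_scores(genotypes, n_iter=200, seed=0):
    """Return each individual's coordinate on the first principal component."""
    n, L = len(genotypes), len(genotypes[0])
    # Center each marker (column) at its sample mean.
    means = [sum(row[l] for row in genotypes) / n for l in range(L)]
    X = [[row[l] - means[l] for l in range(L)] for row in genotypes]
    rng = random.Random(seed)
    v = [rng.gauss(0, 1) for _ in range(L)]  # random starting direction
    for _ in range(n_iter):
        # Apply X^T X to v without forming the L x L matrix.
        scores = [sum(x * w for x, w in zip(row, v)) for row in X]
        v = [sum(X[i][l] * scores[i] for i in range(n)) for l in range(L)]
        norm = math.sqrt(sum(w * w for w in v))
        v = [w / norm for w in v]
    # Project every individual onto the leading axis.
    return [sum(x * w for x, w in zip(row, v)) for row in X]
```

Individuals from genetically distinct groups tend to fall at opposite ends of this axis; the sign of the axis itself is arbitrary.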

  14. Human population assignment with SNPs • Assayed 500,000 SNP genotypes for 3,192 Europeans • Used Principal Components Analysis to ordinate samples in space • High correspondence between sample ordination and geographic origin of samples • Individuals assigned to populations of origin with high accuracy

  15. Genetic Association Tests • Review of typical approach: chi-square test • 2x3 table (or 2x2 table) • Alternatively, we can do a logistic regression

  16. Genetic Models and Underlying Hypotheses • Genotypic value is the expected phenotypic value of a particular genotype • Genotypic Model • Hypothesis: all 3 different genotypes have different effects • AA vs. Aa vs. aa

  17. Genetic Models and Underlying Hypotheses • Dominant Model • Hypothesis: the genetic effects of AA and Aa are the same (assuming A is the minor allele) • AA and Aa vs. aa

  18. Genetic Models and Underlying Hypotheses • Recessive Model • Hypothesis: the genetic effects of Aa and aa are the same (A is the minor allele) • AA vs. Aa and aa

  19. Genetic Models and Underlying Hypotheses • Allelic Model • Hypothesis: the genetic effects of allele A and allele a are different • A vs. a

  20. Pearson’s Chi-squared Test • Genotypic Model: • Null Hypothesis: Independence • df = 2

  21. Pearson’s Chi-squared Test • Dominant Model: • Null Hypothesis: Independence • df = 1

  22. Pearson’s Chi-squared Test • Recessive Model: • Null Hypothesis: Independence • df = 1

  23. Pearson’s Chi-squared Test • Allelic Model: • Null Hypothesis: Independence • df = 1
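The contingency tables for these four slides are not in the transcript. The sketch below shows one way the 2x3 case-control genotype table could be collapsed under each model, following the groupings on slides 16 to 19. The function name `collapse` and the row/column layout are assumptions made for illustration.

```python
def collapse(table, model):
    """table = [[case_AA, case_Aa, case_aa], [ctrl_AA, ctrl_Aa, ctrl_aa]].
    Returns the contingency table tested under the given genetic model."""
    out = []
    for n_AA, n_Aa, n_aa in table:
        if model == "genotypic":      # AA vs. Aa vs. aa (2x3, df = 2)
            out.append([n_AA, n_Aa, n_aa])
        elif model == "dominant":     # AA and Aa vs. aa (2x2, df = 1)
            out.append([n_AA + n_Aa, n_aa])
        elif model == "recessive":    # AA vs. Aa and aa (2x2, df = 1)
            out.append([n_AA, n_Aa + n_aa])
        elif model == "allelic":      # A vs. a, counting alleles (2x2, df = 1)
            out.append([2 * n_AA + n_Aa, 2 * n_aa + n_Aa])
        else:
            raise ValueError(f"unknown model: {model}")
    return out
```

Note that the allelic table counts chromosomes rather than individuals, so its total is twice the sample size.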

  24. Test Statistic • Chi-squared Test Statistic: • O is the observed cell counts • E is the expected cell counts, under null hypothesis of independence
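The statistic itself is missing from the transcript; the standard Pearson form is the sum over cells of (O - E)^2 / E, with E computed from the row and column totals. A minimal sketch follows, with the hypothetical function name `chi_square`. The p-value helper only covers df = 1 and df = 2, the two cases that arise on the preceding slides, using the closed-form chi-squared tail probabilities for those degrees of freedom.

```python
import math

def chi_square(table):
    """Pearson chi-squared statistic and degrees of freedom for an r x c table."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n  # under independence
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

def chi_square_pvalue(stat, df):
    """Upper-tail probability; closed forms for df = 1 and df = 2 only."""
    if df == 1:
        return math.erfc(math.sqrt(stat / 2))
    if df == 2:
        return math.exp(-stat / 2)
    raise ValueError("only df = 1 or 2 supported in this sketch")
```

For example, `chi_square([[10, 20, 30], [30, 20, 10]])` tests the genotypic model on a 2x3 table with df = 2.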

  25. Other Options Fisher’s Exact Test: When sample size is small, the asymptotic approximation of null distribution is no longer valid. By performing Fisher’s exact test, exact significance of the deviation from a null hypothesis can be calculated. For a 2 by 2 table, the exact p-value can be calculated as:
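The formula that followed is not in the transcript. For a 2x2 table with fixed margins, the probability of a given table is hypergeometric, and one common two-sided p-value sums the probabilities of all tables that are no more likely than the observed one. The sketch below follows that convention, which is an assumption about what the slide showed; the function name `fisher_exact` is hypothetical.

```python
from math import comb

def fisher_exact(table):
    """Two-sided Fisher's exact p-value for table = [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    r1, r2 = a + b, c + d      # row totals
    c1 = a + c                 # first column total
    n = r1 + r2

    def prob(x):
        # Hypergeometric probability that the top-left cell equals x,
        # given the row and column totals.
        return comb(r1, x) * comb(r2, c1 - x) / comb(n, c1)

    p_obs = prob(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    # Sum over all tables at least as extreme (no more probable) as observed.
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-9))
```

The small tolerance in the comparison guards against floating-point ties between equally probable tables.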

  26. Association Tool • PLINK: http://pngu.mgh.harvard.edu/~purcell/plink/ • Case-control, TDT, quantitative traits.

  27. Mapping Quantitative Traits • Examples: weight, height, blood pressure, BMI, mRNA expression of a gene, etc. • Example: F2 intercross mice

  28. Quantitative traits (phenotypes) • 133 females from our earlier (NOD × B6) × (NOD × B6) cross • Trait 4 is the log count of a particular white blood cell type.

  29. Another representation of a trait distribution Note the equivalent of dominance in our trait distributions.

  30. A second example Note the approximate additivity in our trait distributions here.

  31. Trait distributions: a classical view In general we seek a difference in the phenotype distributions of the parental strains before we think seeking genes associated with a trait is worthwhile. But even if there is little difference, there may be many such genes. Our trait 4 is a case like this.

  32. Data and goals • Data • Phenotypes: $y_i$ = trait value for mouse i • Genotypes: $x_{ij}$ = 1/0 if mouse i is A/H at marker j (backcross); two dummy variables are needed for an intercross • Genetic map: locations of markers • Goals • Identify the (or at least one) genomic region, called a quantitative trait locus (QTL), that contributes to variation in the trait • Form confidence intervals for the QTL location • Estimate QTL effects

  33. Models: Genotype → Phenotype • Let y = phenotype, g = whole-genome genotype • Imagine a small number of QTLs with genotypes $g_1, \dots, g_p$ ($2^p$ or $3^p$ distinct genotypes for BC, IC resp.). • We assume $E(y \mid g) = \mu(g_1, \dots, g_p)$, $\mathrm{var}(y \mid g) = \sigma^2(g_1, \dots, g_p)$

  34. Models: Genotype → Phenotype, ctd • Homoscedasticity (constant variance): $\sigma^2(g_1, \dots, g_p) = \sigma^2$ (constant) • Normality of residual variation: $y \mid g \sim N(\mu_g, \sigma^2)$ • Additivity: $\mu(g_1, \dots, g_p) = \mu + \sum_j \Delta_j g_j$ ($g_j = 0/1$ for BC) • Epistasis: any deviations from additivity.

  35. Additivity, or non-additivity (BC)

  36. Additivity or non-additivity: F2

  37. The simplest method: ANOVA • Split mice into groups according to genotype at a marker • Do a t-test/ANOVA • Repeat for each marker • Adjust for multiplicity • LOD score = log10 likelihood ratio, comparing the single-QTL model to the “no QTL anywhere” model.
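A marker-by-marker scan can be sketched as follows. Under a normal model with a common variance, the log10 likelihood ratio reduces to (n/2) log10(RSS0 / RSS1), where RSS0 is the residual sum of squares around the overall mean and RSS1 the residual sum of squares around the genotype-group means. The function names `marker_lod` and `scan` are hypothetical; the sketch assumes complete genotype data, some residual variation within groups, and does not include the multiplicity adjustment.

```python
import math

def marker_lod(y, genotypes):
    """LOD score at one marker: single-QTL (group means) vs. no-QTL (one mean)."""
    n = len(y)
    grand_mean = sum(y) / n
    rss0 = sum((v - grand_mean) ** 2 for v in y)
    rss1 = 0.0
    for g in set(genotypes):
        # Split mice into groups according to genotype at this marker.
        group = [v for v, gi in zip(y, genotypes) if gi == g]
        group_mean = sum(group) / len(group)
        rss1 += sum((v - group_mean) ** 2 for v in group)
    return (n / 2) * math.log10(rss0 / rss1)

def scan(y, marker_matrix):
    """Repeat for each marker; marker_matrix[i][j] = genotype of mouse i at marker j."""
    n_markers = len(marker_matrix[0])
    return [marker_lod(y, [row[j] for row in marker_matrix])
            for j in range(n_markers)]
```

A genome-wide significance threshold would still be needed on top of these scores, for instance from a permutation test.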

  38. Interval mapping (IM) • Lander & Botstein (1989) • Take account of missing genotype data (uses the HMM) • Interpolates between markers • Maximum likelihood under a mixture model

  39. Interval mapping, cont • Imagine that there is a single QTL, at position z between two (flanking) markers • Let $q_i$ = genotype of mouse i at the QTL, and assume • $y_i \mid q_i \sim \mathrm{Normal}(\mu_{q_i}, \sigma^2)$ • We won’t know $q_i$, but we can calculate • $p_{ig} = \Pr(q_i = g \mid \text{marker data})$ • Then $y_i$, given the marker data, follows a mixture of normal distributions, with known mixing proportions (the $p_{ig}$). • Use an EM algorithm to get MLEs of $\theta = (\mu_A, \mu_H, \mu_B, \sigma)$. • Measure the evidence for a QTL via the LOD score, which is the log10 likelihood ratio comparing the hypothesis of a single QTL at position z to the hypothesis of no QTL anywhere.
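The EM step at a single putative QTL position can be sketched as below, taking the genotype probabilities given the marker data as already computed (in practice they come from the flanking markers and the genetic map, which this sketch does not cover). The function name `em_lod` and its arguments are hypothetical, and the code assumes every genotype class has nonzero total probability.

```python
import math

def normal_pdf(y, mu, sigma):
    return math.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def em_lod(y, probs, n_iter=100):
    """probs[i][g] = Pr(q_i = g | marker data), the known mixing proportions.
    Returns (genotype means, common SD, LOD score) at this position."""
    n, G = len(y), len(probs[0])
    # Null model: a single normal distribution, no QTL anywhere.
    mu0 = sum(y) / n
    sigma0 = math.sqrt(sum((v - mu0) ** 2 for v in y) / n)
    # Start from probability-weighted group means.
    mu = [sum(p[g] * v for p, v in zip(probs, y)) / sum(p[g] for p in probs)
          for g in range(G)]
    sigma = sigma0
    for _ in range(n_iter):
        # E-step: posterior probability of each QTL genotype for each mouse.
        w = []
        for v, p in zip(y, probs):
            dens = [p[g] * normal_pdf(v, mu[g], sigma) for g in range(G)]
            total = sum(dens)
            w.append([d / total for d in dens])
        # M-step: weighted means and a pooled variance.
        mu = [sum(wi[g] * v for wi, v in zip(w, y)) / sum(wi[g] for wi in w)
              for g in range(G)]
        sigma = math.sqrt(sum(wi[g] * (v - mu[g]) ** 2
                              for wi, v in zip(w, y) for g in range(G)) / n)
    loglik1 = sum(math.log10(sum(p[g] * normal_pdf(v, mu[g], sigma) for g in range(G)))
                  for v, p in zip(y, probs))
    loglik0 = sum(math.log10(normal_pdf(v, mu0, sigma0)) for v in y)
    return mu, sigma, loglik1 - loglik0
```

When the genotype probabilities are all 0 or 1 (the position sits on a fully typed marker), the result coincides with the marker-wise ANOVA comparison.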

  40. Epistasis, interactions, etc • How to find interactions? • Stepwise regression • BEAM (Zhang and Liu 2007)

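  41. Naïve Bayes model [Figure: diagram with nodes Y, X1, X2, X3, …, Xm]

  42. Augmented Naïve Bayes [Figure: diagram with node Y and marker nodes arranged in Group 0 (X01, X02), Group 1 (X11, X12, X13), Group 21 (X2.11, X2.12, X2.13), and Group 22 (X2.21, X2.22)]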

  43. Variable Selection with Interaction

  44. Acknowledgment • Terry Speed (some of the slides) • Karl Broman (U of Wisconsin) • Steven P. DiFazio (West Virginia U)
