
Population Stratification

Qunyuan Zhang, Division of Statistical Genomics. GEMS Course M21-621, Computational Statistical Genetics, Mar. 24, 2011. https://dsgweb.wustl.edu/qunyuan/presentations/PopStrat2011.pptx





Presentation Transcript


  1. Population Stratification Qunyuan Zhang Division of Statistical Genomics GEMS Course M21-621 Computational Statistical Genetics Mar. 24, 2011 https://dsgweb.wustl.edu/qunyuan/presentations/PopStrat2011.pptx

  2. What is Population Stratification (PS)? In the narrow sense, PS is the presence of a systematic difference in allele frequencies between subpopulations in a population, possibly due to different ancestry or origins, especially in the context of genetic association studies. Population stratification is also referred to as population structure. In the broad sense, PS can be regarded as the presence of differences in relatedness between individuals in a population, due to different subpopulations, family/pedigree structure, and/or cryptic relatedness.

  3. PS & False Positives. False positives (inflation): an association can be due to the underlying structure of the population, even when there is no disease-locus association.

  4. An Example of PS-caused False Positive • No disease-locus association within each sub-population. • Risk differs between sub-populations. • Allele frequency differs between sub-populations. • False disease-locus association in the mixed population (any allele with a higher frequency in the higher-risk sub-population appears to be a risk allele).
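The four bullets above can be checked with a small worked example. This is a minimal sketch: the sub-population sizes, carrier frequencies, and disease risks are made-up numbers, and "carrier status" stands in for the allele. Expected counts are used instead of random sampling, so the result is deterministic.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (1 df) for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def expected_table(n, carrier_freq, risk):
    """Expected counts when carrier status and disease are independent.

    Returns (carrier cases, carrier controls, non-carrier cases, non-carrier controls).
    """
    carriers = n * carrier_freq
    non_carriers = n - carriers
    return (carriers * risk, carriers * (1 - risk),
            non_carriers * risk, non_carriers * (1 - risk))


# Hypothetical sub-populations: both the allele and the disease are rarer in the first.
low = expected_table(1000, carrier_freq=0.2, risk=0.1)
high = expected_table(1000, carrier_freq=0.8, risk=0.4)
pooled = tuple(x + y for x, y in zip(low, high))

chi2_low = chi_square_2x2(*low)        # no association within sub-population 1
chi2_high = chi_square_2x2(*high)      # no association within sub-population 2
chi2_pooled = chi_square_2x2(*pooled)  # spurious association in the mixed sample

print(chi2_low, chi2_high, chi2_pooled)
```

Within each sub-population the statistic is essentially zero, while the pooled table gives a chi-square of about 86.4, far above the 3.84 threshold for p < 0.05 at 1 df.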

  5. Mantel-Haenszel Test for Stratification: (1) adjusted RR, (2) standard error, (3) chi-square test, with an example.
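The three quantities named on this slide can be sketched as follows. The formulas are the standard textbook ones: the Mantel-Haenszel risk ratio, the Greenland-Robins standard error of its logarithm, and the Cochran-Mantel-Haenszel chi-square without continuity correction. That these match the slide's own (lost) equations is an assumption, and the stratum counts are hypothetical.

```python
import math


def mantel_haenszel(strata):
    """Stratified analysis of 2x2 tables.

    strata: list of (a, b, c, d) = exposed cases, exposed non-cases,
            unexposed cases, unexposed non-cases, one tuple per stratum.
    Returns (adjusted RR, standard error of ln RR, CMH chi-square with 1 df).
    """
    num = den = var_num = 0.0
    observed = expected = variance = 0.0
    for a, b, c, d in strata:
        n1, n0 = a + b, c + d          # exposed / unexposed totals
        m1, m0 = a + c, b + d          # case / non-case totals
        n = n1 + n0
        num += a * n0 / n
        den += c * n1 / n
        var_num += (n1 * n0 * m1 - a * c * n) / n ** 2
        observed += a
        expected += n1 * m1 / n
        variance += n1 * n0 * m1 * m0 / (n ** 2 * (n - 1))
    rr = num / den
    se_log_rr = math.sqrt(var_num / (num * den))
    chi2 = (observed - expected) ** 2 / variance
    return rr, se_log_rr, chi2


# Hypothetical strata (two sub-populations) with no association inside either one.
strata = [(20, 180, 80, 720), (320, 480, 80, 120)]
rr_adjusted, se_log_rr, chi2_cmh = mantel_haenszel(strata)

# Crude RR from the pooled table, ignoring the strata.
a, b, c, d = (sum(col) for col in zip(*strata))
rr_crude = (a / (a + b)) / (c / (c + d))

print(rr_crude, rr_adjusted, se_log_rr, chi2_cmh)
```

With these counts the crude RR is 2.125, while the stratum-adjusted RR is 1 and the stratified chi-square is 0. That is the behaviour the test is meant to provide when the association comes only from the stratification.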

  6. Linear Model. Terms labeled on the slide: marker data; and Q, described variously as a population structure variable, genetic background variable, membership variable, subgroup/sub-population variable, or ancestry/admixture proportion variable. Usually Q is unknown and needs to be estimated.

  7. Estimating Q by Eigen-analysis. Singular value decomposition: X = U S V^T, where S holds the singular values and S^2 gives the eigenvalues. Q1, Q2, Q3 are eigenvectors of COV(X). References: Patterson et al. 2006; Price et al. 2006 (software EIGENSTRAT). Alternatives: SAS PROC PRINCOMP; R svd() and eigen().
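The relation between the SVD and the eigen-decomposition can be illustrated with NumPy. This is a rough sketch on simulated genotypes, not the EIGENSTRAT procedure itself: the data, sample sizes, and the choice to centre SNPs without variance normalisation are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 2 sub-populations x 50 individuals, 200 SNPs coded 0/1/2.
freqs = np.vstack([rng.uniform(0.1, 0.9, 200), rng.uniform(0.1, 0.9, 200)])
labels = np.repeat([0, 1], 50)
X = rng.binomial(2, freqs[labels]).astype(float)   # individuals x SNPs

Xc = X - X.mean(axis=0)                            # centre each SNP
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)  # Xc = U S V^T

# Eigen-decomposition of the individual-by-individual covariance matrix.
C = Xc @ Xc.T / (X.shape[1] - 1)
eigvals, eigvecs = np.linalg.eigh(C)               # eigh returns ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

Q = U[:, :3]                                       # Q1, Q2, Q3 for each individual

# Squared singular values match the eigenvalues up to the 1/(m-1) scaling.
print(np.allclose(eigvals[:3], S[:3] ** 2 / (X.shape[1] - 1)))
```

The columns of `Q` are what would then be used as structure covariates. In this simulation the first column alone separates the two sub-populations.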

  8. Eigen-analysis of HapMap Populations (figure with axes Q1 and Q2)

  9. Estimating Q by MLE (for admixed population) • G: observed genotypes of admixed [and parental] populations • Q: allele frequencies in parental populations • P: individual membership, to be estimated • Goal: obtain the P that maximizes Pr(G|P,Q) • Initialization: assign initial values for Q (randomly, or estimated from parental population genotype data) and P (randomly) • Step 1: solve for P(i) • Step 2: solve for Q(i) • Iterate Steps 1 and 2 until convergence. • Tang et al. Genetic Epidemiology, 2005(28): 289–301

  10. Estimating Q by MCMC (for admixed population) • Observed G: genotypes of admixed [and parental] populations • Unknown Z: admixed individuals' membership from ancestral populations • Problem: how to estimate Z? • Bayesian and Markov Chain Monte Carlo (MCMC) methods • Assume the number of ancestral populations, K (see next slide) • Define the prior distribution Pr(Z) under K • Use MCMC to sample from the posterior distribution Pr(Z|G) ∝ Pr(Z)∙Pr(G|Z) • Average over a large number of MCMC samples to obtain an estimate of Z • Falush et al. Genetics, 2003(164):1567–1587. Software: STRUCTURE

  11. Infer Population Number (K)

  12. Linear Model (an example including m Q-variables). Software: SAS PROC REG, PROC GENMOD; R lm(), glm(). The generalized versions can fit a binary/categorical y.
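A minimal sketch of what adding Q-variables to the regression does, using NumPy least squares in place of the SAS/R routines named on the slide. The simulated data, the single 0/1 membership variable `q` (m = 1), and the helper `ols` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000

q = np.repeat([0.0, 1.0], n // 2)                        # hypothetical membership variable Q
snp = rng.binomial(2, np.where(q == 1, 0.8, 0.2)).astype(float)
y = 1.0 * q + rng.normal(size=n)                         # trait depends on Q only, not on the SNP


def ols(y, *columns):
    """Least-squares coefficients for y ~ 1 + columns (intercept first)."""
    X = np.column_stack([np.ones(len(y)), *columns])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


beta_naive = ols(y, snp)[1]       # SNP effect when structure is ignored
beta_adj = ols(y, snp, q)[1]      # SNP effect with Q included as a covariate

print(beta_naive, beta_adj)
```

Ignoring Q produces a clearly non-zero SNP coefficient even though the SNP has no effect. Including Q pulls the estimate back toward zero.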

  13. Unified Mixed Model (more general). Terms labeled on the slide: inferred population membership; SNP(s); covariate(s); ID matrix; modeling the resemblance among individuals. V = ZGZ' + R
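For reference, the unified mixed model of Yu et al. (2006, Nature Genetics) is commonly written as below. Matching its terms to the labels on this slide is an assumption, since the slide's own notation may differ: X holds covariates, S the SNP(s), Q the inferred population membership, and Z is the incidence (identity) matrix for the random effect u that models resemblance among individuals, with Var(u) = G and Var(e) = R.

```latex
y = X\beta + S\alpha + Qv + Zu + e,
\qquad
\operatorname{Var}(y) = V = ZGZ' + R
```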

  14. Multi-Variate Normal Distribution (MVN) & Likelihood of Mixed Model. Based on MVN, the likelihood of the trait (y) is written in matrix form, with these labeled terms: number of individuals (in a pedigree); mean phenotype vector; n×n variance-covariance matrix; phenotype vector; kinship (IBD) matrix (n×n). V = ZGZ' + R
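The standard multivariate normal likelihood that fits these labels is given below. The symbols are assumptions chosen to match the labels: n is the number of individuals, y the phenotype vector, μ the mean phenotype vector, and V the n×n variance-covariance matrix.

```latex
L(\mu, V \mid y) = (2\pi)^{-n/2}\, |V|^{-1/2}
\exp\!\left\{ -\tfrac{1}{2}\,(y-\mu)'\, V^{-1} (y-\mu) \right\},
\qquad V = ZGZ' + R
```

The kinship matrix enters through G, so relatedness among individuals shapes the off-diagonal elements of V.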

  15. Kinship. Inbreeding coefficient: the inbreeding coefficient of an individual is the probability that the pair of alleles carried by the gametes that produced it are Identical By Descent (IBD). Identical By Descent (IBD): two alleles are copies of the same ancestral allele. Kinship/coancestry: the inbreeding coefficient of an individual is equal to the coancestry between its parents. For example, if parents X and Y have a child Z, then the inbreeding coefficient of Z = the coancestry between X and Y. Software: SAS (PROC INBREED), MERLIN, SPAGeDi, R (kinship, emma), etc. (pedigree and/or marker data needed).
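The X, Y, Z example can be worked through with the usual recursive definition of the kinship coefficient. This is a small sketch on a made-up pedigree in which X and Y are full sibs. It assumes founders are unrelated and non-inbred, and it is not the algorithm of any of the packages listed.

```python
from functools import lru_cache

# Hypothetical pedigree: id -> (father, mother); None marks an unknown (founder) parent.
# Individuals must be listed so that parents appear before their children.
PEDIGREE = {
    "GF": (None, None),
    "GM": (None, None),
    "X": ("GF", "GM"),
    "Y": ("GF", "GM"),   # X and Y are full sibs
    "Z": ("X", "Y"),     # Z is the child of X and Y
}
ORDER = {name: i for i, name in enumerate(PEDIGREE)}


@lru_cache(maxsize=None)
def kinship(i, j):
    """Kinship (coancestry) coefficient between individuals i and j."""
    if i is None or j is None:
        return 0.0                       # unknown parents are treated as unrelated
    if ORDER[i] < ORDER[j]:              # always recurse on the later-listed individual
        i, j = j, i
    father, mother = PEDIGREE[i]
    if i == j:
        return 0.5 * (1.0 + kinship(father, mother))
    return 0.5 * (kinship(father, j) + kinship(mother, j))


def inbreeding(i):
    """Inbreeding coefficient of i = coancestry between its parents."""
    father, mother = PEDIGREE[i]
    return kinship(father, mother)


print(kinship("X", "Y"), inbreeding("Z"), kinship("Z", "Z"))
```

For this pedigree the coancestry between X and Y is 0.25, so the inbreeding coefficient of Z is also 0.25. Evaluating `kinship` over all pairs gives the kinship matrix.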

  16. Kinship Matrix (expected probability of allele sharing among relatives)

  17. Resources for Mixed Model with Kinship Matrix

  18. Diagnosis of Inflation of False Positives • Inflation: more false positives than expected under the null • In GWAS, usually due to PS • Can be caused by inappropriate statistical methods even with no PS • May (not necessarily) indicate PS

  19. Theoretical Basis of Diagnosis. Under the null, p-values follow a uniform distribution on [0,1]. (Figures: histograms of p-values and Q-Q plots of -log10(p), each shown with no inflation and with inflation.)
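The points of such a Q-Q plot can be computed without any plotting library. This is a minimal sketch: the function name `qq_points` and the use of i/(n+1) as the expected quantiles are choices made here, and the p-values are simulated under the null.

```python
import math
import random


def qq_points(pvalues):
    """Return sorted (expected, observed) pairs of -log10(p) for a Q-Q plot."""
    n = len(pvalues)
    observed = sorted(-math.log10(p) for p in pvalues)
    # Expected order statistics of n Uniform(0, 1) p-values: i / (n + 1).
    expected = sorted(-math.log10(i / (n + 1)) for i in range(1, n + 1))
    return list(zip(expected, observed))


random.seed(0)
# Simulated null p-values; 1 - random() lies in (0, 1], which avoids log10(0).
null_p = [1.0 - random.random() for _ in range(10000)]

points = qq_points(null_p)
frac_below_005 = sum(p < 0.05 for p in null_p) / len(null_p)
print(frac_below_005, points[len(points) // 2])
```

Without inflation the points lie close to the diagonal and about 5% of p-values fall below 0.05. With inflation the observed values rise above the diagonal.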

  20. Inflation Rate (IR) (Devlin et al. 2004). Defined for a binary trait and for a continuous trait (Amin, Duijn, Aulchenko, 2007).
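A common way to compute the inflation rate is the median-based estimator: the median of the observed 1-df chi-square statistics divided by the null median, about 0.455. Whether this matches the slide's formulas for both trait types is an assumption, and the statistics below are simulated.

```python
import random
import statistics

CHI2_1DF_MEDIAN = 0.4549   # median of a 1-df chi-square (often rounded to 0.456)


def inflation_rate(chi2_stats):
    """Median observed 1-df chi-square divided by its expected median under the null."""
    return statistics.median(chi2_stats) / CHI2_1DF_MEDIAN


random.seed(0)
null_stats = [random.gauss(0.0, 1.0) ** 2 for _ in range(20000)]   # null: squared N(0, 1)
inflated_stats = [1.5 * x for x in null_stats]                     # hypothetical 1.5x inflation

lam_null = inflation_rate(null_stats)
lam_inflated = inflation_rate(inflated_stats)
print(lam_null, lam_inflated)
```

A value near 1 indicates no inflation. The artificially inflated statistics give a value near 1.5.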

  21. Genomic Control (by IR): for a binary trait; for a continuous trait; or based on p-values.
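The p-value-based variant can be sketched as: convert each p-value to a 1-df chi-square, divide by the inflation rate, and convert back. This uses only the Python standard library. The function names, the floor of lambda at 1 (a common convention), and the constructed p-values are assumptions, not the slide's own formulas.

```python
import math
from statistics import NormalDist, median

CHI2_1DF_MEDIAN = 0.4549


def chi2_to_p(x):
    """Upper-tail p-value of a 1-df chi-square statistic."""
    return math.erfc(math.sqrt(x / 2.0))


def p_to_chi2(p):
    """1-df chi-square statistic whose upper-tail p-value is p (0 < p <= 1)."""
    return NormalDist().inv_cdf(p / 2.0) ** 2


def genomic_control(pvalues):
    """Return (lambda, corrected p-values) after rescaling statistics by lambda."""
    chi2 = [p_to_chi2(p) for p in pvalues]
    lam = max(1.0, median(chi2) / CHI2_1DF_MEDIAN)   # assumed convention: never deflate
    return lam, [chi2_to_p(x / lam) for x in chi2]


# Hypothetical inflated p-values: null quantiles with statistics doubled.
grid = [(i - 0.5) / 999 for i in range(1, 1000)]
inflated_p = [chi2_to_p(2.0 * p_to_chi2(p)) for p in grid]

lam, corrected_p = genomic_control(inflated_p)
print(lam, median(inflated_p), median(corrected_p))
```

Here lambda comes out close to 2 and the median corrected p-value returns to about 0.5, as expected for a uniform distribution.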

  22. Practice • Download and unzip the data from dsgweb.wustl.edu/qunyuan/data/popstra2011hw.zip • Ignore pedigree.csv; test each SNP in snp.csv for association (with the trait in trait.csv). • Investigate the p-values to see if there is any inflation. • Try to explain why. • List some possible methods to reduce or control the inflation. • Choose one method and apply it to the data. • Does it work? • Try to explain why. • Clearly document each step of your analysis. • There is no standard answer; feel free to try anything you like! • Report back to linusan@wustl.edu and qunyuan@wustl.edu in one week. Thanks!
