Population Stratification

Qunyuan Zhang, Division of Statistical Genomics
GEMS Course M21-621, Computational Statistical Genetics, Mar. 24, 2011
https://dsgweb.wustl.edu/qunyuan/presentations/PopStrat2011.pptx

What is Population Stratification (PS)?
In the narrow sense, PS is the presence of a systematic difference in allele frequencies between subpopulations in a population, possibly due to different ancestry or origins, especially in the context of genetic association studies. Population stratification is also referred to as population structure.
In the broad sense, PS can be regarded as the presence of differences in relatedness between individuals in a population, due to different subpopulations, family/pedigree structure and/or cryptic relatedness.

PS & False Positives
False positives (inflation): an association can be due to the underlying structure of the population, even when there is no disease-locus association.

An Example of PS-caused False Positive
• No disease-locus association.
• Risk difference between sub-populations.
• Allele frequency difference between sub-populations.
• False disease-locus association in the mixed population (any allele with a higher frequency in the higher-risk sub-population appears to be a risk allele).

Mantel-Haenszel Test for Stratification
(1) Adjusted RR
(2) Standard error
(3) Chi-square test
An Example
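The false-positive scenario from the previous slide can be checked numerically with a stratified analysis. The sketch below is an assumption about what the three quantities look like in practice: it uses the textbook Mantel-Haenszel risk ratio, the Greenland-Robins standard error of its logarithm, and the Cochran-Mantel-Haenszel chi-square without continuity correction. The function names and the example counts are hypothetical, chosen so that the allele has no effect inside either sub-population.

```python
import math

def mantel_haenszel_rr(strata):
    """Stratified analysis of 2x2 tables, one table per sub-population.

    Each stratum is (a, b, c, d):
      a = carrier cases,     b = carrier non-cases,
      c = non-carrier cases, d = non-carrier non-cases.
    Returns (adjusted RR, SE of log adjusted RR, CMH chi-square with 1 df).
    """
    num = den = var_num = 0.0
    observed = expected = var_a = 0.0
    for a, b, c, d in strata:
        n1, n0 = a + b, c + d          # carriers, non-carriers
        m1, m0 = a + c, b + d          # cases, non-cases
        n = n1 + n0
        num += a * n0 / n
        den += c * n1 / n
        var_num += m1 * n1 * n0 / n**2 - a * c / n
        observed += a
        expected += n1 * m1 / n
        var_a += n1 * n0 * m1 * m0 / (n**2 * (n - 1))
    rr = num / den
    se_log_rr = math.sqrt(var_num / (num * den))
    chi2 = (observed - expected) ** 2 / var_a
    return rr, se_log_rr, chi2

def crude_rr(strata):
    """Risk ratio after pooling all strata into a single 2x2 table."""
    a, b, c, d = (sum(col) for col in zip(*strata))
    return (a / (a + b)) / (c / (c + d))

# Hypothetical counts: risk is 10% in sub-population 1 and 40% in
# sub-population 2, regardless of carrier status; carriers are 20% vs 80%.
example = [(20, 180, 80, 720), (320, 480, 80, 120)]
print(crude_rr(example))            # pooled analysis: RR well above 1
print(mantel_haenszel_rr(example))  # stratified analysis: RR of 1, chi-square of 0
```

With these counts the pooled risk ratio is 2.125 although the risk ratio inside each sub-population is exactly 1, which is the pattern the stratified estimate removes.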

Linear Model
The model relates the trait to marker data and a population structure variable Q.
Q is also described as a genetic background variable, membership variable, subgroup/sub-population variable, or ancestry/admixture proportion variable.
Usually Q is unknown and needs to be estimated.

Estimating Q by Eigen-analysis
$X = U S V^T$, where $S$ holds the singular values and $S^2$ gives the eigenvalues.
Q1, Q2, Q3, ...: eigenvectors of COV(X).
References: Patterson et al. 2006; Price et al. 2006 (software EIGENSTRAT).
Or SAS PROC PRINCOMP; R svd() and eigen().
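As a rough illustration of the decomposition, the sketch below uses NumPy in place of the SAS and R routines named on the slide. It assumes a genotype matrix with individuals in rows and markers in columns, coded 0/1/2, and centers each marker before the SVD. The function name `estimate_q` and the toy genotypes are hypothetical, and the normalization used by EIGENSTRAT is not reproduced here.

```python
import numpy as np

def estimate_q(genotypes, n_components=2):
    """Return the leading eigenvectors (one column per Q-variable) and eigenvalues.

    genotypes: individuals x markers array of allele counts (0, 1 or 2).
    """
    x = np.asarray(genotypes, dtype=float)
    x = x - x.mean(axis=0)                      # center each marker
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    eigenvalues = s ** 2                        # eigenvalues of x @ x.T
    return u[:, :n_components], eigenvalues[:n_components]

# Toy data: the first three individuals and the last three individuals
# come from two groups with very different allele frequencies.
example_genotypes = [
    [2, 2, 0, 0],
    [2, 1, 0, 0],
    [1, 2, 0, 1],
    [0, 0, 2, 2],
    [0, 1, 2, 2],
    [0, 0, 1, 2],
]
q, eigenvalues = estimate_q(example_genotypes)
print(q[:, 0])       # Q1: opposite signs for the two groups
print(eigenvalues)
```

The columns of `q` can then be used as covariates in an association model. The sign of each eigenvector is arbitrary, so only the separation between groups is meaningful.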

Eigen-analysis of HapMap Populations
[Figure: individuals plotted on eigenvector axes Q1 and Q2]

Estimating Q by MLE (for admixed population)
• G: observed genotypes of admixed [and parental populations]
• P: allelic frequencies in parental populations
• Q: individual membership, to be estimated
• Goal: obtain the Q that maximizes Pr(G|P,Q)
• Assign starting values for P (randomly, or estimated from parental population genotype data) and Q (randomly)
• Step 1: compute Q(i), the updated membership given the current P
• Step 2: compute P(i), the updated allelic frequencies given the current Q
• Iterate Steps 1 and 2 until convergence.
• Tang et al. Genetic Epidemiology, 2005(28): 289–301
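The membership update can be sketched with a small EM loop. This is a simplified illustration, not the procedure of Tang et al.: it assumes biallelic markers coded as 0/1/2 copies of a reference allele, treats the parental allele frequencies as known and fixed (strictly between 0 and 1), and therefore performs only the membership step. The function name `em_membership` and all inputs are hypothetical.

```python
def em_membership(genotypes, allele_freq, n_iter=200):
    """Estimate each individual's membership proportions by EM.

    genotypes[i][j]   : copies (0, 1, 2) of the reference allele at marker j
    allele_freq[k][j] : reference allele frequency at marker j in parental
                        population k, assumed known and in (0, 1)
    """
    n_pop = len(allele_freq)
    estimates = []
    for geno in genotypes:
        member = [1.0 / n_pop] * n_pop           # start from equal membership
        for _ in range(n_iter):
            counts = [0.0] * n_pop
            for j, g in enumerate(geno):
                # E-step: probability that an allele copy came from population k
                ref = [member[k] * allele_freq[k][j] for k in range(n_pop)]
                alt = [member[k] * (1.0 - allele_freq[k][j]) for k in range(n_pop)]
                ref_sum, alt_sum = sum(ref), sum(alt)
                for k in range(n_pop):
                    if g > 0:
                        counts[k] += g * ref[k] / ref_sum
                    if g < 2:
                        counts[k] += (2 - g) * alt[k] / alt_sum
            # M-step: membership = expected share of allele copies per population
            member = [c / (2 * len(geno)) for c in counts]
        estimates.append(member)
    return estimates

# Two parental populations with reference allele frequencies 0.9 and 0.1.
freq = [[0.9] * 10, [0.1] * 10]
individuals = [[2] * 10, [0] * 10, [1] * 10]
for row in em_membership(individuals, freq):
    print([round(v, 3) for v in row])
```

A full analysis would alternate this step with an update of the allele frequencies, as the iteration on the slide describes.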

Estimating Q by MCMC (for admixed population)
• Observed G: genotypes of admixed [and parental populations]
• Unknown Z: admixed individuals' membership from ancestral populations
• Problem: how to estimate Z?
• Bayesian and Markov Chain Monte Carlo (MCMC) methods
• Assume ancestral population number K (see next slide)
• Define prior distribution Pr(Z) under K
• Use MCMC to sample from posterior distribution Pr(Z|G) ∝ Pr(Z)∙Pr(G|Z)
• Average over a large number of MCMC samples to obtain an estimate of Z
• Falush et al. Genetics, 2003(164):1567–1587
Software: STRUCTURE

Infer Population Number (K)

Linear Model (an example including m Q-variables)
Software: SAS PROC REG, PROC GENMOD; R lm(), glm().
The generalized forms (PROC GENMOD, glm()) can fit binary/categorical y.
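One common way to write such a model is sketched below. The symbols $\mu$, $\beta$, $\gamma_k$ and $\varepsilon_i$ are illustrative assumptions rather than notation taken from the slide: $y_i$ is the trait of individual $i$, $x_i$ the genotype score at the tested marker, and $Q_{ik}$ the value of the $k$-th Q-variable.

```latex
y_i = \mu + \beta x_i + \sum_{k=1}^{m} \gamma_k Q_{ik} + \varepsilon_i,
\qquad \varepsilon_i \sim N(0, \sigma^2)

% Generalized form for a binary trait (logistic link):
\operatorname{logit}\Pr(y_i = 1) = \mu + \beta x_i + \sum_{k=1}^{m} \gamma_k Q_{ik}
```

The test of association is the test of $\beta = 0$ with the Q-variables kept in the model, so that differences explained by population structure are not attributed to the marker.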

Unified Mixed Model (more general)
Model terms: inferred population membership; SNP(s); covariate(s); ID matrix; a term modeling the resemblance among individuals.
$V = ZGZ' + R$

Multi-Variate Normal Distribution (MVN) & Likelihood of Mixed Model
Based on MVN, the likelihood of the trait (y) is written in matrix form in terms of:
• n: no. of individuals (in a pedigree)
• the phenotype vector
• the mean phenotype vector
• the n×n variance-covariance matrix
• the n×n kinship (IBD) matrix
$V = ZGZ' + R$
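For reference, the standard multivariate normal log-likelihood is written out below, together with one common way of writing a mixed model that has the terms listed for the unified model. The symbols $\beta$, $\alpha$, $v$, $u$, $e$, $\Phi$, $\sigma_g^2$ and $\sigma_e^2$ are assumptions introduced for illustration and may differ from the slide's notation.

```latex
% Mixed model: covariates X, SNP(s) S, population membership Q, random effects u
y = X\beta + S\alpha + Qv + Zu + e, \qquad u \sim N(0, G), \quad e \sim N(0, R)

% Mean vector and variance-covariance matrix of y
\mu = X\beta + S\alpha + Qv, \qquad V = ZGZ' + R

% MVN log-likelihood for n individuals
\ln L = -\frac{n}{2}\ln(2\pi) - \frac{1}{2}\ln\lvert V \rvert
        - \frac{1}{2}(y - \mu)' V^{-1} (y - \mu)

% A frequent choice when resemblance is modeled through the kinship matrix \Phi
G = 2\Phi\sigma_g^2, \qquad R = I\sigma_e^2
```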

Kinship
Inbreeding coefficient: the inbreeding coefficient of an individual is the probability that the pair of alleles carried by the gametes that produced it are identical by descent (IBD).
Identical by descent (IBD): two alleles are copies of the same ancestral allele.
Kinship/coancestry: the coancestry between two individuals is the probability that an allele drawn at random from one is IBD to an allele drawn at random from the other at the same locus. The inbreeding coefficient of an individual is equal to the coancestry between its parents. For example, if parents X and Y have a child Z, then the inbreeding coefficient of Z = the coancestry between X and Y.
Software: SAS (PROC INBREED), MERLIN, SPAGeDi, R (kinship, emma), etc. (need pedigree and/or marker data)

Kinship Matrix (expected probability of allele sharing among relatives)
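The expected kinship coefficients can be computed from a pedigree with the usual recursion: a founder has kinship 1/2 with itself and 0 with other founders, and a child's kinship with anyone listed earlier is the average of its parents' kinships with that person. The sketch below assumes the pedigree is ordered so that parents appear before their children, and it treats unknown parents as unrelated founders. The function name and the example pedigree are hypothetical.

```python
def kinship_matrix(pedigree):
    """Expected kinship coefficients from a pedigree.

    pedigree: list of (individual, father, mother) tuples, with parents listed
              before their children and None for unknown (founder) parents.
    Returns a dict mapping (id1, id2) to the kinship coefficient.
    """
    phi = {}

    def lookup(a, b):
        # Unknown parents are treated as unrelated to everyone.
        if a is None or b is None:
            return 0.0
        return phi[(a, b)]

    earlier = []
    for ind, father, mother in pedigree:
        for other in earlier:
            value = 0.5 * (lookup(father, other) + lookup(mother, other))
            phi[(ind, other)] = phi[(other, ind)] = value
        # Self-kinship is (1 + inbreeding coefficient) / 2, and the inbreeding
        # coefficient equals the coancestry between the parents.
        phi[(ind, ind)] = 0.5 * (1.0 + lookup(father, mother))
        earlier.append(ind)
    return phi

# Two founders, their two children, and a child of the two siblings.
example_pedigree = [
    ("F", None, None),
    ("M", None, None),
    ("S1", "F", "M"),
    ("S2", "F", "M"),
    ("C", "S1", "S2"),
]
phi = kinship_matrix(example_pedigree)
print(phi[("S1", "S2")])          # full sibs: 0.25
print(2 * phi[("C", "C")] - 1)    # inbreeding coefficient of C: 0.25
```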

Resources for Mixed Model with Kinship Matrix

Diagnosis of Inflation of False Positives
• Inflation: more false positives than expected under the null
• In GWAS, usually due to PS
• Can be caused by inappropriate statistical methods even with no PS
• May (but does not necessarily) indicate PS

Theoretical Basis of Diagnosis
Under the null, p-values follow a uniform distribution on [0,1].
[Figure: p-value histograms and Q-Q plots of -log10(p), with no inflation and with inflation]
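The coordinates of such a Q-Q plot can be computed directly from the p-values. The sketch below assumes all p-values are strictly positive and uses the plotting positions (i + 0.5)/n for the expected uniform quantiles, which is one of several conventions. The function name `qq_points` is hypothetical.

```python
import math

def qq_points(p_values):
    """Return (expected, observed) pairs of -log10(p) for a Q-Q plot.

    Under the null the points fall close to the diagonal; points rising
    above the diagonal across the whole range suggest inflation.
    """
    observed = sorted(p_values)
    n = len(observed)
    return [
        (-math.log10((i + 0.5) / n), -math.log10(p))
        for i, p in enumerate(observed)
    ]

# Evenly spread p-values lie on the diagonal; squaring them pushes
# the p-values toward zero and lifts the points above it.
uniform = [(i + 0.5) / 100 for i in range(100)]
inflated = [p ** 2 for p in uniform]
print(qq_points(uniform)[:3])
print(qq_points(inflated)[:3])
```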

Inflation Rate (IR)
For binary trait: Devlin et al. 2004
For continuous trait: Amin, van Duijn, Aulchenko, 2007

Genomic Control (by IR)
• For binary trait
• For continuous trait
• Or based on p-value
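A p-value based version of the correction can be sketched as follows. The sketch assumes the widely used median estimator, in which the inflation rate is the median of the 1-df chi-square statistics divided by the median of the chi-square distribution with 1 df (about 0.455), and it assumes p-values in (0, 1]. It may differ from the trait-specific formulas referred to on the slides. Function names are hypothetical, and only the Python standard library is used.

```python
from statistics import NormalDist, median

_NORMAL = NormalDist()

def p_to_chi2(p):
    """Convert a two-sided p-value to the matching 1-df chi-square statistic."""
    z = _NORMAL.inv_cdf(p / 2.0)
    return z * z

def chi2_to_p(chi2):
    """Convert a 1-df chi-square statistic back to a p-value."""
    return 2.0 * _NORMAL.cdf(-(chi2 ** 0.5))

NULL_MEDIAN = p_to_chi2(0.5)   # median of chi-square with 1 df, about 0.455

def inflation_rate(p_values):
    """Median-based inflation rate; values near 1 indicate no inflation."""
    return median(p_to_chi2(p) for p in p_values) / NULL_MEDIAN

def genomic_control(p_values):
    """Divide every chi-square statistic by the inflation rate (if above 1)."""
    lam = max(inflation_rate(p_values), 1.0)
    return [chi2_to_p(p_to_chi2(p) / lam) for p in p_values]

uniform = [(i + 0.5) / 1000 for i in range(1000)]
inflated = [p ** 2 for p in uniform]
print(inflation_rate(uniform))
print(inflation_rate(inflated))
print(inflation_rate(genomic_control(inflated)))
```

Because every statistic is divided by the same constant, the ranking of the SNPs is unchanged; only the overall scale of the test statistics is corrected.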

Practice
• Download and unzip the data from dsgweb.wustl.edu/qunyuan/data/popstra2011hw.zip
• Ignore pedigree.csv; test each SNP in snp.csv for association with the trait in trait.csv.
• Investigate the p-values to see if there is any inflation.
• Try to explain why.
• List some possible methods to reduce or control the inflation.
• Choose one method and apply it to the data.
• Does it work?
• Try to explain why.
• Clearly document each step of your analysis.
• There is no standard answer; feel free to try anything you like!
• Report back to linusan@wustl.edu and qunyuan@wustl.edu in one week. Thanks!
