Association Studies in Unrelated Populations: Perils and Opportunities. Elad Ziv, M.D.
Outline I. Population Structure, Admixture II. Population stratification: Confounding due to population structure/admixture III. Detecting and controlling for stratification IV. Recent admixture and linkage disequilibrium
Population Structure and Admixture: Definitions and Background
Definitions • Population structure: distinct, non-randomly mating subpopulations • Population admixture: distinct populations mix; mating may be random or non-random
Admixture • Two or more populations mix (individuals of mixed ancestry) [Figure labels: Population 1, Population 2]
Admixture • Under random mating
Origins of Population Structure 100-150,000 yrs
Human Population Structure Rosenberg et al Science 2002
Genetic vs. Geographic Distance Watkins et al Genome Res 2003
Population Stratification • Genetic background ancestry associated with phenotype • Leads to spurious associations
Diabetes in Pima Indians • Case-control study of the Gm haplotype and diabetes mellitus (DM) • All participants Pima Indians • Gm (3,5,13,14) protective against DM (RR 0.27) • Gm (3,5,13,14) also associated with having a Caucasian parent Knowler et al AJHG 1988
DM in Pima Indians • Stratifying by Caucasian ancestry -> no association • Real association: partial Caucasian ancestry is protective against DM
Confounding [Diagram nodes: Genetic Ancestry; Candidate Polymorphism; Trait/Disease]
Confounding [Diagram nodes: SES, culture, diet, other environmental factors; Genetic Ancestry; Candidate Polymorphism; Trait/Disease]
Magnitude of Confounding Ziv and Burchard, Pharmacogenomics 2003
Sample Size and Population Stratification • Simulations of the effect of population stratification as sample size increases • (a) Three populations are included in the study, with differences approximating those among East Asians, Europeans and African Americans; the relative risk of disease in the three populations is 1:1:3 • (b) Same as (a), but 80% of the cases come from one population and 10% from each of the other two Marchini et al Nat Gen 2004
Sample Size and Population Stratification • Simulations of the effect of population stratification as sample size increases with 2 populations which simulate differences among Asian subpopulations • (c) RR 1:1.3 • (d) RR 1:1.5 • (e) RR 1:2
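One way to see why the bias grows with sample size is to compute the allelic chi-square on expected counts for a marker that has no effect on disease. The sketch below is not the Marchini et al simulation. It assumes two equally sized source populations, a rare disease, and hypothetical allele frequencies and risk ratio chosen only for illustration; the function names are invented for this example.

```python
def case_control_allele_freqs(p1, p2, risk_ratio):
    """Expected allele frequency in cases and in controls for a null marker.

    Assumes two equally sized source populations and a rare disease, so
    controls are drawn 50/50 and cases are drawn in proportion to risk.
    """
    w_cases = risk_ratio / (1.0 + risk_ratio)   # share of cases from population 2
    w_controls = 0.5                            # share of controls from population 2
    f_cases = (1.0 - w_cases) * p1 + w_cases * p2
    f_controls = (1.0 - w_controls) * p1 + w_controls * p2
    return f_cases, f_controls


def chi2_on_expected_counts(f_cases, f_controls, n_per_group):
    """Allelic 2x2 chi-square evaluated on expected (not sampled) counts."""
    alleles = 2 * n_per_group                   # two alleles per person
    a, b = f_cases * alleles, (1.0 - f_cases) * alleles
    c, d = f_controls * alleles, (1.0 - f_controls) * alleles
    total = a + b + c + d
    return total * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


if __name__ == "__main__":
    # Hypothetical inputs: allele frequencies 0.2 vs 0.3, risk ratio 1:1.5.
    f_cases, f_controls = case_control_allele_freqs(0.2, 0.3, 1.5)
    for n in (500, 1000, 2000, 4000):
        print(n, round(chi2_on_expected_counts(f_cases, f_controls, n), 2))
```

Every cell of the table scales with the number of participants, so the statistic grows linearly with sample size and a fixed allele frequency difference eventually crosses any significance threshold.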
“Negative” Confounding • Assume a true association with a polymorphism • Assume the high-risk allele is less frequent in the high-risk population • (Other genes or environment account for the increased risk of the high-risk population) • Population stratification results in the high-risk allele being UNDER-represented among cases • Decreases power!
Summary • Population stratification is defined as the association of subpopulation membership with phenotype (disease) • Population stratification can lead to false positive and false negative associations • Positive confounding: any allele at higher frequency in the high-risk population (increased Type I error) • Negative confounding: a real risk allele is at higher frequency in the low-risk population (increased Type II error) • Effect of population stratification worsens with: increased risk difference between populations; increased allele frequency difference between populations; increased sample size
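Both directions of confounding can be illustrated with stratified 2x2 tables. The sketch below compares a crude (pooled) odds ratio with a Mantel-Haenszel odds ratio adjusted for subpopulation. All counts are invented for illustration and are not taken from any study cited in these slides; the function and variable names are likewise hypothetical.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for one 2x2 table: a, b = carrier cases, carrier controls;
    c, d = non-carrier cases, non-carrier controls."""
    return (a * d) / (b * c)


def crude_odds_ratio(strata):
    """Pool all strata into one table, ignoring ancestry."""
    a, b, c, d = (sum(cells) for cells in zip(*strata))
    return odds_ratio(a, b, c, d)


def mantel_haenszel_odds_ratio(strata):
    """Subpopulation-adjusted odds ratio (Mantel-Haenszel estimator)."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return numerator / denominator


# Hypothetical counts, one tuple (a, b, c, d) per subpopulation.
# Positive confounding: the allele has no effect (odds ratio 1 in each
# stratum) but is more common in the high-risk subpopulation (first tuple).
POSITIVE = [(160, 240, 240, 360), (20, 80, 180, 720)]
# Negative confounding: the allele doubles the odds in each stratum but is
# rarer in the high-risk subpopulation (second tuple).
NEGATIVE = [(100, 450, 50, 450), (100, 50, 450, 450)]

if __name__ == "__main__":
    for label, strata in (("positive", POSITIVE), ("negative", NEGATIVE)):
        print(label,
              "crude:", round(crude_odds_ratio(strata), 2),
              "adjusted:", round(mantel_haenszel_odds_ratio(strata), 2))
```

With these invented counts the crude odds ratios are about 1.45 and 0.72, while the adjusted values are 1.0 and 2.0: pooling creates an association in the first case and hides (here even reverses) a real one in the second.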
Alternatives • Matching by ethnicity • Family based studies • “Genetic” adjustment methods
Limitations of TDT (transmission disequilibrium test) • Lower power per person genotyped • High single-parent rate: African Americans 70%; Puerto Rican Americans 60%; Mexican Americans 38% • Diseases of late onset require sib-based controls • Power for gene-environment interactions limited
III. Detecting and Adjusting for Stratification Using Markers • A. Detecting Stratification • B. Genomic Control • C. Model Based Methods
Identifying confounding due to stratification • Population stratification = confounding • Measured confounders can be adjusted • Use genetic markers to adjust Pritchard & Rosenberg AJHG 1999
Detecting stratification • Genotype additional unlinked markers • For each marker calculate $\chi^2$ between cases and controls • Compare allele frequencies of cases vs. controls: $\sum \chi^2$ Pritchard & Rosenberg AJHG 1999
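The summed test can be sketched as follows, assuming biallelic markers so that each marker contributes one degree of freedom. This is a minimal illustration rather than the authors' software; the function names and the layout of the input tuples are assumptions made for this example.

```python
import math


def allelic_chi2(case_a, case_total, control_a, control_total):
    """1-df chi-square comparing allele A counts in cases vs. controls."""
    a, b = case_a, case_total - case_a
    c, d = control_a, control_total - control_a
    total = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # marker is monomorphic in this sample; no information
    return total * (a * d - b * c) ** 2 / denominator


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df."""
    half = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-half)
    else:
        a, q = 0.5, math.erfc(math.sqrt(half))
    # Recurrence Q(a + 1, x) = Q(a, x) + x^a * e^(-x) / Gamma(a + 1)
    while a < df / 2.0:
        q += half ** a * math.exp(-half) / math.gamma(a + 1.0)
        a += 1.0
    return q


def stratification_test(markers):
    """Sum 1-df statistics over unlinked biallelic markers.

    markers: list of (case_a, case_total, control_a, control_total) allele
    counts. Returns (summed chi-square, degrees of freedom, p-value).
    """
    total = sum(allelic_chi2(*marker) for marker in markers)
    df = len(markers)
    return total, df, chi2_sf(total, df)


if __name__ == "__main__":
    # Hypothetical allele counts for three unlinked markers.
    example = [(120, 400, 100, 400), (210, 400, 180, 400), (60, 400, 85, 400)]
    print(stratification_test(example))
```

A small p-value for the summed statistic suggests that cases and controls differ in genetic background across the genome, not just at a candidate locus.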
Detecting Stratification • Probability that stratification will be detected, with use of unlinked microsatellite markers (at the .05 significance level). Total sample size of 200 individuals. The two lines at tau = .0 are on top of one another. Population divergence (tau = .2) corresponds to 80,000 years. Pritchard and Rosenberg AJHG 1999
Controlling for stratification • Correct the test distribution: Devlin et al Biometrics 1999; Reich et al Genetic Epi 2000 • Model based approaches: Pritchard et al AJHG 2000; Satten et al AJHG 2001; Hoggart et al AJHG 2004 • Semi-parametric test of association: Zhang et al 2003
Genomic Control • Determine the rate of association of unlinked markers with phenotype • Re-set the null hypothesis based on the empirically derived distribution • Calculate a parameter $\lambda$ that represents the overall difference between cases and controls • $\lambda$ represents the inflation of the variance of the test statistic; $\lambda = 1$ -> no inflation -> no correction Devlin & Roeder 1999
Genomic Control • Advantages: no need to identify substructure; may require fewer markers • Limitations: need RANDOM markers; ancestry informative markers produce a conservative correction; correction applied evenly to all markers; may overcorrect for some markers
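A minimal sketch of the correction is shown below. It assumes one common estimator of the inflation parameter, the median of the 1-df chi-square statistics at the unlinked markers divided by the null median (about 0.456), and it floors the estimate at 1 so that statistics are never inflated. Both choices are assumptions for this example and are not stated on the slide; the function names are hypothetical.

```python
import statistics

# Approximate median of a 1-df chi-square distribution under the null.
NULL_MEDIAN_CHI2_1DF = 0.456


def inflation_factor(null_marker_chi2):
    """Estimate lambda from chi-square statistics at unlinked (null) markers."""
    lam = statistics.median(null_marker_chi2) / NULL_MEDIAN_CHI2_1DF
    return max(1.0, lam)  # assumption: never inflate the candidate statistic


def genomic_control_adjust(candidate_chi2, null_marker_chi2):
    """Divide the candidate marker's statistic by the estimated lambda."""
    return candidate_chi2 / inflation_factor(null_marker_chi2)


if __name__ == "__main__":
    # Hypothetical statistics at unlinked markers and at one candidate marker.
    null_stats = [0.1, 0.3, 0.912, 2.5, 5.0]
    print(inflation_factor(null_stats))
    print(genomic_control_adjust(8.0, null_stats))
```

The same divisor is applied to every candidate marker, which is why the method needs no model of the substructure but also cannot tailor the correction to markers with unusually large or small ancestry differences.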
Summary • Population stratification (difference between cases and controls) can be detected by unlinked markers • More markers are required for more subtle differences between cases and controls • Genomic control can be used to adjust the p value using a uniform correction factor estimated from markers • New data suggest that genomic control may under-correct with a limited number of markers and over-correct with a large number of markers • To be safe, use >200 markers with genomic control. More work likely in the future…
Model based approach • Use model to assign each person to: a unique subgroup (discrete model), or a fraction/% of ancestry (admixture model) • Adjust association by membership in subgroup • Similar to adjusting for confounding in multivariate logistic regression (ancestry is viewed as a confounder)
Model based approach • Advantages: applies correction based on individual markers; may be more informative about relationship between ancestry and phenotype • Limitations: need to correctly define subgroups; difficult with limited # markers; difficult when subpopulations are genetically similar
Model based approach • Maximum likelihood measurement of ancestry: use allele frequencies from ancestral populations to estimate each individual’s ancestry • Structure/Strat: use a clustering algorithm to derive population subgroups; assign each person to one or more subgroups; adjust association by membership in subgroup • Satten et al (missing latent variable): logistic/linear regression framework with a missing latent variable estimated by markers • ADMIXMAP • SPTA/Eigenstrat: uses a principal components analysis framework. Pritchard et al AJHG 2000; Satten et al AJHG 2001; McKeigue et al Annals of Hum Gen 2000; Zhang et al Gen Epi 2003; Price et al Nat Gen 2006
Maximum Likelihood Ancestry • Assume 2 population admixture • Assume $P_{1A}$, $P_{1a}$, $P_{2A}$, $P_{2a}$, the frequencies of the A and a alleles in populations 1 and 2 respectively, are known • The proportions of ancestry for each individual from populations 1 and 2 are $m$ and $1-m$, and are unknown • Each genotype has a probability of occurring in an individual with a given ancestral combination $m$, $1-m$ • $P(A) = mP_{1A} + (1-m)P_{2A}$ • $P(AA) = [mP_{1A} + (1-m)P_{2A}]^2$ • Calculate $P(Aa)$, $P(aa)$ in a similar way
Single Marker Likelihood • Assume $P_1(A) = 0.6$, $P_2(A) = 0.1$ ($\delta = 0.5$) • $P(AA) = [m(0.6) + (1-m)(0.1)]^2$ • $P(aa) = [m(0.4) + (1-m)(0.9)]^2$ • $P(Aa) = 1 - P(AA) - P(aa)$
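As a worked check of the single-marker formulas, the three genotype probabilities can be evaluated at one ancestry value. The choice $m = 0.5$ is only an illustration; the allele frequencies are those given on the slide.

```latex
\begin{aligned}
P(A) &= 0.6\,m + 0.1\,(1-m) = 0.1 + 0.5\,m, \qquad P(A)\big|_{m=0.5} = 0.35 \\
P(AA) &= 0.35^2 = 0.1225 \\
P(aa) &= \left[0.4(0.5) + 0.9(0.5)\right]^2 = 0.65^2 = 0.4225 \\
P(Aa) &= 1 - 0.1225 - 0.4225 = 0.455 = 2(0.35)(0.65)
\end{aligned}
```

The last equality confirms that the complement rule gives the same heterozygote probability as the usual $2pq$ term.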
Multi-Marker Likelihood • Suppose a person has N genotypes: $A_1A_1, A_2a_2, a_3a_3, \ldots, A_Na_N$ • $P(\text{all genotypes}) = \prod_i P(G_i)$ (likelihood) • $\log L = \log \prod_i P(G_i) = \log P(G_1) + \log P(G_2) + \cdots$ • Maximize $\log L$ with respect to $m$ (proportion ancestry) [Figure: likelihood assuming an individual with a mean of 50% ancestry and 10, 50, 100 markers, each with frequency $P_1 = 0.6$, $P_2 = 0.1$]
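The maximization can be sketched with a simple grid search over $m$. This is an illustrative implementation under the slide's assumptions (two source populations, known allele frequencies, independent markers), plus the assumption that all frequencies lie strictly between 0 and 1 so that no logarithm of zero occurs. Function names and the 0/1/2 genotype coding are choices made for this example.

```python
import math


def genotype_probability(n_a, m, p1, p2):
    """P(genotype) for an individual with ancestry fraction m from population 1.

    n_a is the number of A alleles carried (0, 1 or 2); p1 and p2 are the
    frequencies of allele A in populations 1 and 2.
    """
    p = m * p1 + (1.0 - m) * p2                  # P(A) given ancestry m
    return ((1.0 - p) ** 2, 2.0 * p * (1.0 - p), p ** 2)[n_a]


def log_likelihood(m, genotypes, freqs):
    """Sum of log genotype probabilities over independent markers."""
    return sum(math.log(genotype_probability(g, m, p1, p2))
               for g, (p1, p2) in zip(genotypes, freqs))


def estimate_ancestry(genotypes, freqs, steps=1000):
    """Grid-search maximum likelihood estimate of m on [0, 1]."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda m: log_likelihood(m, genotypes, freqs))


if __name__ == "__main__":
    # Hypothetical individual: 100 markers, each with P1 = 0.6 and P2 = 0.1,
    # carrying 10 AA, 50 Aa and 40 aa genotypes (70 A alleles out of 200).
    freqs = [(0.6, 0.1)] * 100
    genotypes = [2] * 10 + [1] * 50 + [0] * 40
    print(estimate_ancestry(genotypes, freqs))
```

In this example the observed A allele frequency is 0.35, which corresponds to $m = 0.5$ under the stated frequencies, so the grid search should return that value.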
Accuracy of MLE of Ancestry
Number of markers required for a given SD of the ancestry estimate
Average delta | SD .2 | SD .1 | SD .05 | SD .01
.9 | 4 | 16 | 62 | 1,544
.8 | 5 | 20 | 79 | 1,954
.7 | 7 | 26 | 103 | 2,552
.6 | 9 | 35 | 139 | 3,473
.5 | 13 | 50 | 200 | 5,000
.4 | 20 | 79 | 313 | 7,813
.3 | 35 | 139 | 556 | 13,889
.2 | 79 | 313 | 1,250 | 31,250
.1 | 313 | 1,250 | 5,000 | 125,000
Rosenberg et al AJHG 2003
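The tabulated values appear to follow the pattern n = 1 / (8 × delta² × SD²), rounded up. This is a regularity read off the numbers in the table, not a formula stated on the slide, so the sketch below should be treated as a convenience for interpolating the table rather than as the cited authors' derivation. Exact fractions are used so that rounding up is not affected by floating-point error.

```python
import math
from fractions import Fraction


def markers_required(delta, sd):
    """Markers needed for a target SD, using the pattern inferred from the table.

    delta and sd are passed as strings (e.g. "0.5") so they convert to exact
    fractions. Assumed relation: n = ceil(1 / (8 * delta^2 * sd^2)).
    """
    d, s = Fraction(delta), Fraction(sd)
    return math.ceil(1 / (8 * d * d * s * s))


if __name__ == "__main__":
    for delta in ("0.9", "0.5", "0.1"):
        print(delta, [markers_required(delta, sd)
                      for sd in ("0.2", "0.1", "0.05", "0.01")])
```

The inverse-square dependence on both delta and SD is the practical point: halving the allele frequency difference between source populations, or halving the target SD, multiplies the number of markers needed by four.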
Limitations of ML methods • Error in individual ancestry estimates due to finite # of markers: underestimates the association between ancestry (confounder) and phenotype; leads to under-correction of marker association • Error in specifying ancestral allele frequencies: leads to systematic bias in estimating ancestry; could lead to under- or over-correction
STRUCTURE/STRAT • Model of population substructure with admixture possible • Model is estimated in a Bayesian framework; it can incorporate prior information about ancestral populations, but can also “correct” it • Estimated using the association between unlinked markers and departures from Hardy-Weinberg equilibrium (HWE) • The larger the sample (of individuals and markers), the better the estimation • STRAT uses a likelihood ratio test to determine whether genotype is associated with disease independently of genetic background
STRUCTURE/STRAT Pritchard et al AJHG 2000
STRUCTURE/STRAT with Ancestry Informative Markers • Inferred population structure with five clusters, based on markers of highest and lowest informativeness Rosenberg et al AJHG 2003
SPTA: Semi-Parametric Test of Association • Uses a principal components method to estimate genetic background • Uses individual scores on principal components to correct association in the context of linear/logistic regression • Uses a semi-parametric method to incorporate the effect of genetic background into the statistical test Zhang et al Gen Epi 2003 Chen et al Ann Hum Gen 2003
SPTA Type-I error comparisons of the four tests at the nominal value of 5% under discrete populat... Chen et al Ann Hum Genetics 2003
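The principal components step can be sketched in a few lines. The code below is not the SPTA or Eigenstrat implementation: it is a bare power iteration on mean-centered allele counts, with no allele frequency scaling, and the function name and start vector are choices made for this illustration. The resulting scores are the kind of per-individual covariate that would then enter a regression of phenotype on genotype.

```python
import math


def first_pc_scores(genotypes, iterations=100):
    """Scores of each individual on the first principal component.

    genotypes: list of per-individual lists of allele counts (0, 1 or 2).
    Columns are mean-centered; the leading direction is found by power
    iteration on X^T X. The sign of the scores is arbitrary.
    """
    n, k = len(genotypes), len(genotypes[0])
    means = [sum(row[j] for row in genotypes) / n for j in range(k)]
    x = [[row[j] - means[j] for j in range(k)] for row in genotypes]
    v = [float(j + 1) for j in range(k)]         # arbitrary non-constant start
    for _ in range(iterations):
        scores = [sum(xi[j] * v[j] for j in range(k)) for xi in x]
        w = [sum(scores[i] * x[i][j] for i in range(n)) for j in range(k)]
        norm = math.sqrt(sum(wj * wj for wj in w))
        if norm == 0.0:
            raise ValueError("no genotype variation along the current direction")
        v = [wj / norm for wj in w]
    return [sum(xi[j] * v[j] for j in range(k)) for xi in x]


if __name__ == "__main__":
    # Hypothetical data: two groups of three people with opposite genotypes.
    group_1 = [[2, 2, 0, 0]] * 3
    group_2 = [[0, 0, 2, 2]] * 3
    print(first_pc_scores(group_1 + group_2))
```

In this toy data the two groups receive scores of opposite sign, which is the separation that a regression adjustment would then exploit.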
Model based approach • Advantages: applies correction based on individual markers; may be more informative about relationship between ancestry and phenotype • Limitations: need to correctly define subgroups; difficult with limited # markers; difficult when subpopulations are genetically similar
Summary • Several model based approaches • All require estimating background genetic ancestry from markers • Use background genetic ancestry to adjust association at candidate marker • In comparison to genomic control: may require more markers; can use ancestry informative markers; can identify “negative confounding” • In comparison to each other: possible advantage of method of Zhang et al; more work required