Association Studies in Unrelated populations: Perils and Opportunities

Association Studies in Unrelated populations: Perils and Opportunities Elad Ziv, M.D.

Outline I. Population Structure, Admixture II. Population stratification: Confounding due to population structure/admixture III. Detecting and controlling for stratification IV. Recent admixture and linkage disequilibrium

Population Structure and Admixture: Definitions and Background

Definitions Population Structure -distinct non-randomly mating subpopulations Population Admixture -distinct populations mix -mating may be random or non-random

Admixture • Two or more populations mix (individuals of mixed ancestry) Population 1 Population 2

Admixture • Under random mating

Origins of Population Structure 100-150,000 yrs

Human Population Structure Rosenberg et al Science 2002

Genetic vs. Geographic Distance Watkins et al Genome Res 2003

Population Stratification: A Special Case of Confounding

Population Stratification • Genetic background ancestry associated with phenotype • Leads to spurious associations

Diabetes in Pima Indians • Case-control study of HLA in DM • All participants Pima Indians • Gm (3,5,13,14) protective of DM (RR 0.27) • Gm (3,5,13,14) also associated with having a Caucasian parent Knowler et al AJHG 1987

DM in Pima Indians • Stratifying by Caucasian ancestry -> no association • Real association: partial Caucasian ancestry is protective of DM
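The pattern on the two Pima slides can be reproduced with made-up numbers. The sketch below assumes two hypothetical strata (no Caucasian ancestry vs. partial Caucasian ancestry) in which the marker has no effect on disease within either stratum, yet the crude odds ratio looks protective. It compares the crude odds ratio with a Mantel-Haenszel stratum-adjusted odds ratio. The counts and stratum names are illustrative assumptions, not data from Knowler et al.

```python
# Hypothetical counts, NOT data from Knowler et al.
# Each stratum is a 2x2 table:
#   (carrier cases, carrier controls, non-carrier cases, non-carrier controls)
strata = {
    "no_caucasian_ancestry":      (4, 6, 396, 594),   # high risk, marker rare
    "partial_caucasian_ancestry": (20, 80, 20, 80),   # low risk, marker common
}

def odds_ratio(a, b, c, d):
    """Odds ratio of a 2x2 table (carriers vs. non-carriers)."""
    return (a * d) / (b * c)

def crude_odds_ratio(tables):
    """Pool all strata into one table, ignoring ancestry."""
    a, b, c, d = (sum(col) for col in zip(*tables))
    return odds_ratio(a, b, c, d)

def mantel_haenszel_odds_ratio(tables):
    """Stratum-adjusted odds ratio: sum(a*d/n) / sum(b*c/n)."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return num / den

tables = list(strata.values())
within = {name: odds_ratio(*t) for name, t in strata.items()}
crude = crude_odds_ratio(tables)
adjusted = mantel_haenszel_odds_ratio(tables)

print("within-stratum ORs:", within)   # both equal 1: no real effect
print("crude OR:", round(crude, 3))    # well below 1: spurious "protection"
print("Mantel-Haenszel OR:", round(adjusted, 3))
```

In this toy example the marker tags ancestry, and ancestry carries the risk difference; stratifying removes the apparent association.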

Confounding Genetic Ancestry Candidate Polymorphism Trait/Disease

Confounding SES, culture, diet, other environmental factors Genetic Ancestry Candidate Polymorphism Trait/Disease

Magnitude of Confounding Ziv and Burchard, Pharmacogenomics 2003

Sample Size and Population Stratification • Simulations of the effect of population stratification as sample size increases • (a) 3 populations are included in the study, with differences approximating those of East Asians, Europeans and African Americans. Relative risk of disease in the 3 populations is 1:1:3 • (b) same as (a), but 80% of the cases come from only one population and 10% from each of the other 2 populations Marchini et al Nat Gen 2004

Sample Size and Population Stratification • Simulations of the effect of population stratification as sample size increases with 2 populations which simulate differences among Asian subpopulations • (c) RR 1:1.3 • (d) RR 1:1.5 • (e) RR 1:2

“Negative” Confounding • Assume a true association with a polymorphism • Assume the high risk allele is lower prevalence in the high risk population • (Other genes or environment account for the increased risk of the high risk population) • Population stratification results in the high risk allele being UNDER-represented among cases • Decreases Power!!
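A small numeric sketch of negative confounding, under assumed numbers: the allele truly raises risk 1.5-fold in both populations, but it is rare in the high risk population and common in the low risk one. All sizes, frequencies and risks below are hypothetical, and the helper name `pooled_relative_risk` is introduced only for this illustration.

```python
def pooled_relative_risk(populations):
    """Carrier vs. non-carrier risk ratio after pooling populations.

    populations: list of (size, carrier_freq, baseline_risk, relative_risk),
    where baseline_risk is the disease risk in non-carriers.
    """
    carrier_cases = carrier_total = 0.0
    noncarrier_cases = noncarrier_total = 0.0
    for size, carrier_freq, baseline_risk, relative_risk in populations:
        carriers = size * carrier_freq
        noncarriers = size - carriers
        carrier_total += carriers
        carrier_cases += carriers * baseline_risk * relative_risk
        noncarrier_total += noncarriers
        noncarrier_cases += noncarriers * baseline_risk
    return (carrier_cases / carrier_total) / (noncarrier_cases / noncarrier_total)

# Hypothetical populations: (size, carrier frequency, baseline risk, true RR)
high_risk_pop = (1000, 0.1, 0.30, 1.5)  # risk allele is rare here
low_risk_pop = (1000, 0.6, 0.05, 1.5)   # risk allele is common here

rr_high = pooled_relative_risk([high_risk_pop])
rr_low = pooled_relative_risk([low_risk_pop])
rr_pooled = pooled_relative_risk([high_risk_pop, low_risk_pop])

print("RR within high risk population:", round(rr_high, 3))
print("RR within low risk population:", round(rr_low, 3))
print("RR ignoring population:", round(rr_pooled, 3))
```

With these assumed inputs the pooled estimate falls below 1 even though the allele increases risk in each population separately, which is the under-representation among cases described on the slide.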

Summary • Population stratification is defined as the association of a subpopulation with phenotype (disease) • Population stratification can lead to false positive and false negative associations • Positive confounding: any allele with higher frequency in the high risk population (increased Type I error) • Negative confounding: a real risk allele has higher frequency in the low risk population (increased Type II error) • Effect of population stratification worsens with • increased risk difference between populations • increased allele frequency difference between populations • increased sample size

Alternatives • Matching by ethnicity • Family based studies • “Genetic” adjustment methods

Limitations of TDT • Lower power per person genotyped • High single parent rate • African Americans 70% • Puerto Rican Americans 60% • Mexican Americans 38% • Diseases of late onset require sib-based controls • Power for gene-environment interactions limited

III. Detecting and Adjusting for Stratification Using Markers A. Detecting Stratification B. Genomic Control C. Model Based Methods

Identifying confounding due to stratification • Population stratification = confounding • Measured confounders can be adjusted • Use genetic markers to adjust Pritchard & Rosenberg AJHG 1999

Detecting stratification • Genotype additional unlinked markers • For each marker calculate χ² between cases and controls • Compare allele frequencies of cases vs. controls: Σχ² Pritchard & Rosenberg AJHG 1999
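A minimal sketch of the summed test, assuming biallelic unlinked markers so that each marker contributes a 1 df allele-count χ² and the sum is referred to a χ² distribution with one degree of freedom per marker. The allele counts are invented for illustration, and `chi2_sf` is a small helper written here for integer degrees of freedom rather than a library call.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{i < df/2} (x/2)^i / i!
        term = math.exp(-half)
        total = term
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return total
    # Odd df: normal tail plus a finite series
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(2.0 * x / math.pi) * math.exp(-half)
    for r in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * r + 1)
    return total

def allele_chi2(case_A, case_a, control_A, control_a):
    """1 df chi-square comparing allele counts in cases vs. controls."""
    n = case_A + case_a + control_A + control_a
    margins = ((case_A + case_a) * (control_A + control_a)
               * (case_A + control_A) * (case_a + control_a))
    if margins == 0:
        return 0.0  # monomorphic marker carries no information
    diff = case_A * control_a - case_a * control_A
    return n * diff * diff / margins

# Hypothetical allele counts at four unlinked markers:
# (case A, case a, control A, control a)
markers = [
    (60, 140, 40, 160),
    (100, 100, 100, 100),
    (120, 80, 100, 100),
    (30, 170, 50, 150),
]

per_marker = [allele_chi2(*m) for m in markers]
total_chi2 = sum(per_marker)
df = len(markers)            # one df per biallelic marker (assumption)
p_value = chi2_sf(total_chi2, df)

print("per-marker chi2:", [round(x, 3) for x in per_marker])
print("sum chi2 =", round(total_chi2, 3), "df =", df, "p =", round(p_value, 4))
```

A small p value here would suggest that cases and controls differ in genetic background, since none of the markers is expected to be linked to the trait.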

Detecting Stratification: Probability that stratification will be detected with use of unlinked microsatellite markers (at the .05 significance level). Total sample size of 200 individuals. The two lines at tau = .0 are on top of one another. Population divergence (tau = .2) corresponds to 80,000 years. Pritchard and Rosenberg AJHG 1999

Detecting Stratification

Controlling for stratification • Correct the test distribution • Devlin et al Biometrics 1999 • Reich et al Genetic Epi 2000 • Model based approaches: • Pritchard et al AJHG 2000 • Satten et al AJHG 2001 • Hoggart et al AJHG 2004 • Semi-Parametric Test of Association: Zhang et al 2003

Genomic Control • Determine the rate of association of unlinked markers with phenotype • Re-set the null hypothesis based on the empirically derived distribution • Calculate a parameter λ that represents the overall difference between cases and controls • λ represents the inflation of the variance of the test statistic. λ = 1 -> no inflation -> no correction Devlin & Roeder 1999
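A rough sketch of the correction, assuming 1 df χ² statistics and a median-based estimate of λ (one common choice; a mean-based estimate is another). The null-marker statistics below are invented, the function names are hypothetical, and clamping λ at 1 is an assumption made here so that the correction never inflates a statistic.

```python
import statistics

# Median of a chi-square distribution with 1 df (approximately 0.4549).
CHI2_1DF_MEDIAN = 0.4549

def genomic_control_lambda(null_chi2):
    """Estimate the inflation factor from unlinked (null) marker statistics."""
    lam = statistics.median(null_chi2) / CHI2_1DF_MEDIAN
    return max(1.0, lam)  # assumption: never deflate below 1

def gc_adjust(candidate_chi2, lam):
    """Apply the same uniform correction to any candidate statistic."""
    return candidate_chi2 / lam

# Hypothetical 1 df chi-square values at unlinked markers
null_chi2 = [0.1, 0.5, 0.9, 1.3, 2.4, 0.7, 3.1]
lam = genomic_control_lambda(null_chi2)
adjusted = gc_adjust(8.0, lam)

print("lambda =", round(lam, 3))
print("candidate chi2 8.0 ->", round(adjusted, 3))
```

Because one λ is applied to every marker, a candidate whose allele frequencies differ little between subpopulations is corrected as hard as one that differs a lot.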

Genomic Control

Genomic Control • Advantages • No need to identify substructure • May require fewer markers • Limitations • Need RANDOM markers • Ancestry informative markers produce conservative correction • Correction applied evenly to all markers • May overcorrect for some markers

Summary • Population stratification (difference between cases and controls) can be detected by unlinked markers • More markers are required for more subtle differences between cases and controls • Genomic control can be used to adjust the p value using a uniform correction factor estimated from markers • New data suggest that • Genomic control may under-correct with a limited number of markers and over-correct with a large number of markers • To be safe, genomic control should use >200 markers. More work likely in the future…

Model based approach • Use model to assign each person to: • Unique subgroup (discrete model) • Fraction/% of ancestry (admixture model) • Adjust association by membership in subgroup • Similar to adjusting for confounding in multivariate logistic regression (Ancestry is viewed as a confounder)

Model based approach • Advantages • Applies correction based on individual markers • May be more informative about relationship between ancestry and phenotype • Limitations • Need to correctly define subgroups • Difficult with limited # markers • Difficult when subpopulations are genetically similar

Model based approach • Maximum Likelihood measurement of Ancestry • Use allele frequency from ancestral populations to estimate each individual’s ancestry • Structure/Strat • Use clustering algorithm to derive population subgroups • Assign each person to one or more subgroups • Adjust association by membership in subgroup • Satten et al: Missing Latent Variable • Logistic/Linear regression framework with a missing latent variable estimated by markers • ADMIXMAP • SPTA/Eigenstrat • Uses Principal Components Analysis Framework Pritchard et al AJHG 2000 Satten et al AJHG 2001 McKeigue et al Annals of Hum Gen 2000 Zhang et al Gen Epi 2003 Price et al Nat Gen 2006

Maximum Likelihood Ancestry • Assume 2 population admixture • Assume P1A, P1a, P2A, P2a, the frequencies of the A and a alleles in populations 1 and 2 respectively, are known • The proportions of ancestry for each individual from populations 1 and 2 are m and 1-m and are unknown • Each genotype has a probability in any individual that depends on that individual's ancestral combination m, 1-m • P(A) = mP1A + (1-m)P2A • P(AA) = [mP1A + (1-m)P2A]^2 • Calculate P(Aa), P(aa) in a similar way

Single Marker Likelihood • Assume P1(A) = 0.6, P2(A) = 0.1 (δ = 0.5) • P(AA) = [m(.6) + (1-m)(.1)]^2 • P(aa) = [m(.4) + (1-m)(.9)]^2 • P(Aa) = 1 - P(AA) - P(aa)

Multi-Marker Likelihood • Suppose a person has N genotypes: A1A1, A2a2, a3a3, ..., ANaN • P(All Genotypes) = Π P(Gi) (Likelihood) • Log L = Log Π P(Gi) = Log P(G1) + Log P(G2) + … • Max Log L with respect to m (proportion ancestry) Figure: likelihood assuming an individual with mean of 50% ancestry and 10, 50, 100 markers, each with frequency P1 = 0.6, P2 = 0.1
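The likelihood on the last few slides can be sketched in a few lines, assuming two source populations, known allele frequencies, independent markers and random union of alleles within the individual. Genotypes are coded as the number of A alleles (0, 1 or 2), the maximisation is a plain grid search over m, and the ten genotypes are invented; none of these choices come from the slides.

```python
import math

def genotype_prob(n_A, m, p1, p2):
    """P(genotype | m) for a marker with A frequency p1 in pop 1, p2 in pop 2."""
    p = m * p1 + (1.0 - m) * p2            # P(A) for this individual
    if n_A == 2:
        return p * p                        # P(AA)
    if n_A == 0:
        return (1.0 - p) * (1.0 - p)        # P(aa)
    return 2.0 * p * (1.0 - p)              # P(Aa) = 1 - P(AA) - P(aa)

def log_likelihood(m, genotypes, freqs):
    """Sum of log P(G_i | m) over markers."""
    total = 0.0
    for n_A, (p1, p2) in zip(genotypes, freqs):
        prob = max(genotype_prob(n_A, m, p1, p2), 1e-300)  # guard log(0)
        total += math.log(prob)
    return total

def mle_ancestry(genotypes, freqs, steps=1000):
    """Grid search for the m in [0, 1] that maximises the log likelihood."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda m: log_likelihood(m, genotypes, freqs))

# Ten hypothetical markers, each with P1(A) = 0.6 and P2(A) = 0.1
freqs = [(0.6, 0.1)] * 10
genotypes = [1, 0, 1, 0, 2, 0, 1, 0, 1, 1]   # counts of allele A, made up

m_hat = mle_ancestry(genotypes, freqs)
print("estimated proportion of population 1 ancestry:", m_hat)
```

With identical markers the estimate reduces to matching the observed A allele fraction (7 of 20 here) to mP1 + (1-m)P2; with heterogeneous markers the grid search does the same job marker by marker.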

Accuracy of MLE of Ancestry: number of markers required for a given SD of the ancestry estimate, by average delta. Rosenberg et al AJHG 2003

Average delta    SD .2    SD .1    SD .05    SD .01
.9               4        16       62        1,544
.8               5        20       79        1,954
.7               7        26       103       2,552
.6               9        35       139       3,473
.5               13       50       200       5,000
.4               20       79       313       7,813
.3               35       139      556       13,889
.2               79       313      1,250     31,250
.1               313      1,250    5,000     125,000
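The tabulated marker counts appear to match N = 1 / (8 · delta² · SD²), rounded up. One way to arrive at that expression is a worst-case binomial variance of 1/4 per allele, 2N alleles per person, and division by delta² to convert allele frequency error into ancestry error. This closed form is an inference from the numbers, not a formula stated on the slide, and the function name below is hypothetical. Exact fractions are used so that the rounding is not disturbed by floating point error.

```python
from fractions import Fraction
import math

def markers_required(delta, sd):
    """Markers needed for a target SD, assuming N = 1 / (8 * delta^2 * sd^2).

    delta and sd are passed as decimal strings so Fraction keeps them exact.
    """
    delta = Fraction(delta)
    sd = Fraction(sd)
    return math.ceil(1 / (8 * delta ** 2 * sd ** 2))

deltas = ["0.9", "0.8", "0.7", "0.6", "0.5", "0.4", "0.3", "0.2", "0.1"]
sds = ["0.2", "0.1", "0.05", "0.01"]

table = {d: [markers_required(d, s) for s in sds] for d in deltas}
for d in deltas:
    print(d, table[d])
```

Read this way, halving delta quadruples the number of markers needed, which is why ancestry informative markers with large delta are so much more efficient than random markers.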

Limitations of ML methods • Error in individual ancestry estimates due to finite # of markers • Underestimates the association between ancestry (confounder) and phenotype • Leads to under-correction of marker association • Error in specifying ancestral allele frequencies • Leads to systematic bias in estimating ancestry • Could lead to under or over-correction

STRUCTURE/STRAT • Model of population substructure with admixture possible • Model is estimated in a Bayesian Framework. Can incorporate prior information about ancestral populations, but can “correct” these • Estimated using the association between unlinked markers and discrepancies from HWE • The higher the sample size (of individuals and markers) the better the estimation • STRAT uses a likelihood ratio test to determine if genotype is associated with disease independently of genetic background

STRUCTURE/STRAT Pritchard et al AJHG 2000

STRUCTURE/STRAT with Ancestry Informative Markers: Inferred population structure with five clusters, based on markers of highest and lowest informativeness Rosenberg et al AJHG 2003

SPTA: Semi-Parametric Test of Association • Uses a principal components method to estimate genetic background • Uses individual scores on principal components to correct association in the context of linear/logistic regression • Uses a semi-parametric method to incorporate the effect of genetic background into the statistical test Zhang et al Gen Epi 2003 Chen et al Ann Hum Gen 2003

SPTA: Type-I error comparisons of the four tests at the nominal value of 5% under discrete populat... Chen et al Ann Hum Genetics 2003

Model based approach • Advantages • Applies correction based on individual markers • May be more informative about relationship between ancestry and phenotype • Limitations • Need to correctly define subgroups • Difficult with limited # markers • Difficult when subpopulations are genetically similar

Summary • Several model based approaches • All require estimating background genetic ancestry from markers • Use background genetic ancestry to adjust association at candidate marker • In comparison to genomic control • May require more markers • Can use ancestry informative markers • Can identify “negative confounding” • In comparison to each other • Possible advantage of method of Zhang et al • More work required

IV. Admixture Mapping

Recombination with Admixture

Linkage Disequilibrium in Admixed Populations
