
Association Studies in Unrelated populations: Perils and Opportunities

Association Studies in Unrelated populations: Perils and Opportunities. Elad Ziv, M.D. Outline. I. Population Structure, Admixture II. Population stratification: Confounding due to population structure/admixture III. Detecting and controlling for stratification


Presentation Transcript


  1. Association Studies in Unrelated populations: Perils and Opportunities. Elad Ziv, M.D.

  2. Outline I. Population Structure, Admixture II. Population stratification: Confounding due to population structure/admixture III. Detecting and controlling for stratification IV. Recent admixture and linkage disequilibrium

  3. Population Structure and Admixture: Definitions and Background

  4. Definitions • Population Structure: distinct, non-randomly mating subpopulations • Population Admixture: distinct populations mix; mating may be random or non-random

  5. Admixture • Two or more populations mix (individuals of mixed ancestry) Population 1 Population 2

  6. Admixture • Under random mating

  7. Origins of Population Structure • 100,000-150,000 yrs

  8. Human Population Structure Rosenberg et al Science 2002

  9. Genetic vs. Geographic Distance Watkins et al Genome Res 2003

  10. Population Stratification: A Special Case of Confounding

  11. Population Stratification • Genetic background ancestry associated with phenotype • Leads to spurious associations

  12. Diabetes in Pima Indians • Case-control study of the Gm haplotype in DM • All participants Pima Indians • Gm (3,5,13,14) protective against DM (RR 0.27) • Gm (3,5,13,14) also associated with having a Caucasian parent Knowler et al AJHG 1988

  13. DM in Pima Indians • Stratifying by Caucasian ancestry -> no association • Real association: partial Caucasian ancestry is protective against DM

  14. Confounding (diagram nodes) • Genetic Ancestry • Candidate Polymorphism • Trait/Disease

  15. Confounding (diagram nodes) • SES, culture, diet, other environmental factors • Genetic Ancestry • Candidate Polymorphism • Trait/Disease
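
A minimal numeric sketch of the confounding described on these slides, using made-up counts (not data from any study cited here). Two hypothetical subpopulations differ in both carrier frequency and disease risk, while the allele has no effect within either one. The function names, and the choice of the Mantel-Haenszel estimator as the stratified summary, are illustrative assumptions.

```python
# Hypothetical counts per subpopulation:
# (case carriers, case non-carriers, control carriers, control non-carriers)
strata = {
    "A (low risk, carrier freq 0.2)": (20, 80, 80, 320),
    "B (high risk, carrier freq 0.6)": (240, 160, 60, 40),
}


def odds_ratio(a, b, c, d):
    """Odds ratio of a 2x2 case/control by carrier/non-carrier table."""
    return (a * d) / (b * c)


def crude_odds_ratio(tables):
    """Pool all strata into one table, ignoring ancestry."""
    a, b, c, d = (sum(col) for col in zip(*tables))
    return odds_ratio(a, b, c, d)


def mantel_haenszel_or(tables):
    """Ancestry-stratified summary odds ratio (Mantel-Haenszel)."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return num / den


tables = list(strata.values())
stratum_ors = {name: odds_ratio(*t) for name, t in strata.items()}
crude = crude_odds_ratio(tables)
pooled = mantel_haenszel_or(tables)

for name, value in stratum_ors.items():
    print(f"subpopulation {name}: OR = {value:.2f}")
print(f"crude OR (ancestry ignored): {crude:.2f}")
print(f"stratified OR (Mantel-Haenszel): {pooled:.2f}")
```

With these invented numbers the allele looks like a risk factor in the pooled sample even though the odds ratio is 1 inside each subpopulation, which is the pattern the Pima example illustrates.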

  16. Magnitude of Confounding Ziv and Burchard, Pharmacogenomics 2003

  17. Sample Size and Population Stratification • Simulations of the effect of population stratification as sample size increases • (a) 3 populations are included in the study, with differences approximating those of East Asians, Europeans and African Americans. Relative risk of disease in the 3 populations is 1:1:3 • (b) same as (a), but 80% of the cases come from only one population and 10% from each of the other 2 populations Marchini et al Nat Gen 2004

  18. Sample Size and Population Stratification • Simulations of the effect of population stratification as sample size increases with 2 populations which simulate differences among Asian subpopulations • (c) RR 1:1.3 • (d) RR 1:1.5 • (e) RR 1:2

  19. “Negative” Confounding • Assume a true association with a polymorphism • Assume the high risk allele has lower prevalence in the high risk population • (Other genes or environment account for the increased risk of the high risk population) • Population stratification results in the high risk allele being UNDER-represented among cases • Decreases Power!!
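
A small sketch of negative confounding under stated assumptions: the allele has a true within-population odds ratio of 2 in both of two hypothetical subpopulations, but carriers are rarer in the higher-risk one. All frequencies, sample sizes and function names are invented for illustration, and the tables hold expected (non-integer) counts rather than sampled data.

```python
def expected_table(n_cases, n_controls, control_carrier_freq, true_or):
    """Expected (case carriers, case non-carriers,
    control carriers, control non-carriers) for one subpopulation."""
    f = control_carrier_freq
    # Carrier frequency among cases implied by the within-population OR.
    case_freq = true_or * f / (1 - f + true_or * f)
    return (n_cases * case_freq, n_cases * (1 - case_freq),
            n_controls * f, n_controls * (1 - f))


def odds_ratio(a, b, c, d):
    return (a * d) / (b * c)


def mantel_haenszel_or(tables):
    num = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return num / den


TRUE_OR = 2.0
# Low-risk subpopulation: few cases, allele common (carrier freq 0.5).
low_risk = expected_table(100, 400, 0.5, TRUE_OR)
# High-risk subpopulation: many cases, allele rare (carrier freq 0.1).
high_risk = expected_table(400, 100, 0.1, TRUE_OR)

tables = [low_risk, high_risk]
crude = odds_ratio(*(sum(col) for col in zip(*tables)))
stratified = mantel_haenszel_or(tables)

print(f"true within-population OR: {TRUE_OR:.2f}")
print(f"crude OR (ancestry ignored): {crude:.2f}")
print(f"stratified OR (Mantel-Haenszel): {stratified:.2f}")
```

In this contrived case the pooled estimate is not just attenuated but falls below 1, while the stratified estimate recovers the assumed effect. Milder differences between subpopulations would give attenuation without reversal.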

  20. Summary • Population stratification is defined as the association of subpopulation membership with phenotype (disease) • Population stratification can lead to false positive and false negative associations • Positive confounding: any allele at higher frequency in the high risk population (increased Type I error) • Negative confounding: a real risk allele is at higher frequency in the low risk population (increased Type II error) • Effect of population stratification worsens with • increased risk difference between populations • increased allele frequency difference between populations • increased sample size

  21. Alternatives • Matching by ethnicity • Family based studies • “Genetic” adjustment methods

  22. Limitations of TDT (transmission disequilibrium test) • Lower power per person genotyped • High single parent rate • African Americans 70% • Puerto Rican Americans 60% • Mexican Americans 38% • Diseases of late onset require sib-based controls • Power for gene-environment interactions limited

  23. III. Detecting and Adjusting for Stratification Using Markers • A. Detecting Stratification • B. Genomic Control • C. Model Based Methods

  24. Identifying confounding due to stratification • Population stratification = confounding • Measured confounders can be adjusted • Use genetic markers to adjust Pritchard & Rosenberg AJHG 1999

  25. Detecting stratification • Genotype additional unlinked markers • For each marker calculate $\chi^2$ between cases and controls • Compare allele frequencies of cases vs. controls using $\sum \chi^2$ Pritchard & Rosenberg AJHG 1999
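
The summed chi-square test on this slide can be sketched as follows. This is an illustrative reading, not the authors' code: each unlinked marker contributes a Pearson chi-square on its case/control allele counts, the statistics and degrees of freedom are added, and the total is referred to a chi-square distribution. The allele counts are invented, and the p-value uses the Wilson-Hilferty normal approximation only to keep the example within the standard library.

```python
from statistics import NormalDist


def allele_chi2(case_counts, control_counts):
    """Pearson chi-square and df for a 2 x k table of allele counts."""
    n_case, n_ctrl = sum(case_counts), sum(control_counts)
    total = n_case + n_ctrl
    chi2 = 0.0
    df = -1  # ends up as (number of observed alleles - 1)
    for case_obs, ctrl_obs in zip(case_counts, control_counts):
        col = case_obs + ctrl_obs
        if col == 0:
            continue  # allele not observed in either group
        df += 1
        for obs, row in ((case_obs, n_case), (ctrl_obs, n_ctrl)):
            expected = row * col / total
            chi2 += (obs - expected) ** 2 / expected
    return chi2, df


def chi2_upper_tail_approx(x, df):
    """Approximate P(chi-square_df > x) via Wilson-Hilferty."""
    mean = 1 - 2 / (9 * df)
    sd = (2 / (9 * df)) ** 0.5
    z = ((x / df) ** (1 / 3) - mean) / sd
    return 1 - NormalDist().cdf(z)


def summed_stratification_test(markers):
    """Sum per-marker statistics over unlinked markers."""
    stats = [allele_chi2(cases, controls) for cases, controls in markers]
    total_chi2 = sum(s for s, _ in stats)
    total_df = sum(d for _, d in stats)
    return total_chi2, total_df, chi2_upper_tail_approx(total_chi2, total_df)


# Hypothetical allele counts (cases, controls) at three unlinked markers.
markers = [
    ((60, 140), (40, 160)),
    ((100, 100), (100, 100)),
    ((30, 170), (50, 150)),
]
total_chi2, total_df, p_value = summed_stratification_test(markers)
print(f"sum of chi-square = {total_chi2:.3f} on {total_df} df, p ~ {p_value:.4f}")
```

A small p-value here would suggest that cases and controls differ in genetic background at markers that should be unrelated to disease.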

  26. Detecting Stratification • Probability that stratification will be detected, with use of unlinked microsatellite markers (at the .05 significance level). Total sample size of 200 individuals. The two lines at tau = .0 are on top of one another. Population divergence (tau = .2) corresponds to 80,000 years. Pritchard and Rosenberg AJHG 1999

  27. Detecting Stratification

  28. Controlling for stratification • Correct the test distribution • Devlin et al Biometrics 1999 • Reich et al Genetic Epi 2000 • Model based approaches: • Pritchard et al AJHG 2000 • Satten et al AJHG 2001 • Hoggart et al AJHG 2004 • Semi-Parametric Test of Association: Zhang et al 2003

  29. Genomic Control • Determine the rate of association of unlinked markers with phenotype • Re-set the null hypothesis based on empirically derived distribution • Calculate a parameter $\lambda$ that represents the overall difference between cases and controls • $\lambda$ represents the inflation of the variance of the test statistic. $\lambda = 1$ -> no inflation -> no correction Devlin & Roeder 1999
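
A minimal sketch of the genomic control idea, assuming 1-df chi-square association statistics. The slide does not say how the inflation parameter (lambda) is estimated; the median-based estimator below is one common choice and is used here as an assumption. The chi-square values and function names are made up for illustration.

```python
from statistics import NormalDist, median

# Median of a chi-square distribution with 1 df (about 0.455),
# obtained as the squared 75th percentile of a standard normal.
CHI2_1DF_MEDIAN = NormalDist().inv_cdf(0.75) ** 2


def genomic_control_lambda(null_marker_chi2):
    """Inflation factor estimated from unlinked (null) markers.

    Values below 1 are set to 1, so the correction never makes
    a test more significant.
    """
    lam = median(null_marker_chi2) / CHI2_1DF_MEDIAN
    return max(lam, 1.0)


def gc_adjust(chi2, lam):
    """Apply the same correction factor to any marker's statistic."""
    return chi2 / lam


# Hypothetical 1-df chi-square statistics at unlinked markers.
null_chi2 = [0.2, 0.5, 0.9, 1.4, 3.0]
lam = genomic_control_lambda(null_chi2)

candidate_chi2 = 8.0  # hypothetical statistic at the candidate marker
adjusted = gc_adjust(candidate_chi2, lam)
print(f"lambda = {lam:.3f}; candidate chi-square {candidate_chi2} -> {adjusted:.3f}")
```

Because one factor is applied to every marker, the sketch also shows why the correction is uniform: it cannot tell apart markers that are more or less affected by stratification.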

  30. Genomic Control

  31. Genomic Control • Advantages • No need to identify substructure • May require fewer markers • Limitations • Need RANDOM markers • Ancestry informative markers produce conservative correction • Correction applied evenly to all markers • May overcorrect for some markers

  32. Summary • Population stratification (difference between cases and controls) can be detected by unlinked markers • More markers required for more subtle differences between cases and controls • Genomic control can be used to adjust the p value using a uniform correction factor estimated from markers • New data suggest that • Genomic control may under-correct with a limited number of markers and over-correct with a large number of markers • To be safe, use >200 markers with genomic control. More work likely in the future…

  33. Model based approach • Use model to assign each person to: • Unique subgroup (discrete model) • Fraction/% of ancestry (admixture model) • Adjust association by membership in subgroup • Similar to adjusting for confounding in multivariate logistic regression (Ancestry is viewed as a confounder)

  34. Model based approach • Advantages • Applies correction based on individual markers • May be more informative about relationship between ancestry and phenotype • Limitations • Need to correctly define subgroups • Difficult with limited # markers • Difficult when subpopulations are genetically similar

  35. Model based approach • Maximum Likelihood measurement of Ancestry • Use allele frequency from ancestral populations to estimate each individual’s ancestry • Structure/Strat • Use clustering algorithm to derive population subgroups • Assign each person to one or more subgroups • Adjust association by membership in subgroup • Satten et al: Missing Latent Variable • Logistic/Linear regression framework with a missing latent variable estimated by markers • ADMIXMAP • SPTA/Eigenstrat • Uses Principal Components Analysis Framework Pritchard et al AJHG 2000 Satten et al AJHG 2001 McKeigue et al Annals of Hum Gen 2000 Zhang et al Gen Epi 2003 Price et al Nat Gen 2006

  36. Maximum Likelihood Ancestry • Assume 2 population admixture • Assume $P_{1A}$, $P_{1a}$, $P_{2A}$, $P_{2a}$, the frequencies of the A and a alleles in populations 1 and 2 respectively, are known • The proportions of each individual's ancestry from populations 1 and 2 are m and 1-m, and are unknown • For any individual, the probability of each genotype follows from that individual's ancestral combination m, 1-m • $P(A) = m P_{1A} + (1-m) P_{2A}$ • $P(AA) = [m P_{1A} + (1-m) P_{2A}]^2$ • Calculate P(Aa), P(aa) in a similar way

  37. Single Marker Likelihood • Assume $P_1(A) = 0.6$, $P_2(A) = 0.1$ ($\delta = 0.5$) • $P(AA) = [m(0.6) + (1-m)(0.1)]^2$ • $P(aa) = [m(0.4) + (1-m)(0.9)]^2$ • $P(Aa) = 1 - P(AA) - P(aa)$
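
The single-marker genotype probabilities on this slide can be written as a short function. This is a sketch of the slide's formulas only; the function and variable names are illustrative, and it assumes, as the slide does, that the two alleles of an individual are drawn independently given the ancestry proportion m.

```python
def genotype_probs(m, p1_A, p2_A):
    """Genotype probabilities at one marker for ancestry proportion m.

    m     : proportion of ancestry from population 1 (1 - m from population 2)
    p1_A  : frequency of allele A in population 1
    p2_A  : frequency of allele A in population 2
    """
    p_A = m * p1_A + (1 - m) * p2_A  # P(A) for this individual
    p_AA = p_A ** 2
    p_aa = (1 - p_A) ** 2
    p_Aa = 1 - p_AA - p_aa           # equals 2 * p_A * (1 - p_A)
    return {"AA": p_AA, "Aa": p_Aa, "aa": p_aa}


# Slide values: P1(A) = 0.6, P2(A) = 0.1; try a few ancestry proportions.
for m in (0.0, 0.5, 1.0):
    probs = genotype_probs(m, 0.6, 0.1)
    print(m, {g: round(p, 4) for g, p in probs.items()})
```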

  38. Multi-Marker Likelihood • Suppose a person has N genotypes: $A_1A_1, A_2a_2, a_3a_3, \ldots, A_Na_N$ • $P(\text{all genotypes}) = \prod_i P(G_i)$ (likelihood) • $\log L = \log \prod_i P(G_i) = \log P(G_1) + \log P(G_2) + \cdots$ • Maximize $\log L$ with respect to m (proportion ancestry) • Figure: likelihood assuming an individual with mean of 50% ancestry and 10, 50, 100 markers, each with frequency P1=0.6, P2=0.1
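
A sketch of maximizing the multi-marker log-likelihood over m. The slide does not say which optimizer is used; a simple grid search is assumed here for clarity. Genotypes are coded as the number of A alleles (0, 1 or 2), the example individual is invented, and allele frequencies are assumed to lie strictly between 0 and 1 so that every logarithm is defined.

```python
from math import log


def log_likelihood(m, genotypes, p1, p2):
    """Log-likelihood of ancestry proportion m over unlinked markers.

    genotypes[i] : number of A alleles (0, 1 or 2) at marker i
    p1[i], p2[i] : frequency of allele A in populations 1 and 2
    """
    total = 0.0
    for g, f1, f2 in zip(genotypes, p1, p2):
        p_A = m * f1 + (1 - m) * f2
        if g == 2:
            total += log(p_A ** 2)
        elif g == 1:
            total += log(2 * p_A * (1 - p_A))
        else:
            total += log((1 - p_A) ** 2)
    return total


def mle_ancestry(genotypes, p1, p2, steps=1000):
    """Grid-search estimate of m on [0, 1] with spacing 1/steps."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda m: log_likelihood(m, genotypes, p1, p2))


# Hypothetical individual typed at 20 markers, each with P1 = 0.6, P2 = 0.1:
# 2 AA, 10 Aa and 8 aa genotypes, i.e. 14 A alleles out of 40.
genotypes = [2] * 2 + [1] * 10 + [0] * 8
p1 = [0.6] * len(genotypes)
p2 = [0.1] * len(genotypes)

m_hat = mle_ancestry(genotypes, p1, p2)
print(f"estimated proportion of ancestry from population 1: {m_hat:.3f}")
```

With identical markers the maximum sits where the individual's expected A-allele frequency matches the observed 14/40 = 0.35, which corresponds to m = 0.5 for these frequencies.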

  39. Accuracy of MLE of Ancestry • Number of markers required for a given SD of the ancestry estimate, by average delta. Rosenberg et al AJHG 2003

      Average delta    SD .2    SD .1    SD .05     SD .01
      .9                   4       16        62      1,544
      .8                   5       20        79      1,954
      .7                   7       26       103      2,552
      .6                   9       35       139      3,473
      .5                  13       50       200      5,000
      .4                  20       79       313      7,813
      .3                  35      139       556     13,889
      .2                  79      313     1,250     31,250
      .1                 313    1,250     5,000    125,000
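
The tabulated marker counts are all consistent with n = ceil(1 / (8 · delta² · SD²)). That relation is inferred here from the numbers themselves and is not stated on the slide, so treat it as an assumption rather than the cited authors' formula. The sketch below uses exact fractions so that rounding up is not disturbed by floating-point error; the function name is illustrative.

```python
from fractions import Fraction
from math import ceil


def markers_needed(delta, sd):
    """Markers needed for a target SD, under the assumed relation
    n = ceil(1 / (8 * delta^2 * sd^2)).

    delta and sd are passed as strings (e.g. "0.5") so they are
    converted to exact fractions.
    """
    d, s = Fraction(delta), Fraction(sd)
    return ceil(1 / (8 * d ** 2 * s ** 2))


deltas = ["0.9", "0.8", "0.7", "0.6", "0.5", "0.4", "0.3", "0.2", "0.1"]
sds = ["0.2", "0.1", "0.05", "0.01"]

print("delta " + "".join(f"{'SD ' + s:>10}" for s in sds))
for delta in deltas:
    row = "".join(f"{markers_needed(delta, s):>10,}" for s in sds)
    print(f"{delta:<6}{row}")
```

The practical point of the table survives either way: halving the target SD, or halving the average allele frequency difference, roughly quadruples the number of markers needed.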

  40. Limitations of ML methods • Error in individual ancestry estimates due to finite # of markers • Underestimates the association between ancestry (confounder) and phenotype • Leads to under-correction of marker association • Error in specifying ancestral allele frequencies • Leads to systematic bias in estimating ancestry • Could lead to under or over-correction

  41. STRUCTURE/STRAT • Model of population substructure with admixture possible • Model is estimated in a Bayesian Framework. Can incorporate prior information about ancestral populations, but can “correct” these • Estimated using the association between unlinked markers and discrepancies from HWE • The higher the sample size (of individuals and markers) the better the estimation • STRAT uses a likelihood ratio test to determine if genotype is associated with disease independently of genetic background

  42. STRUCTURE/STRAT Pritchard et al AJHG 2000

  43. STRUCTURE/STRAT with Ancestry Informative Markers • Inferred population structure with five clusters, based on markers of highest and lowest informativeness Rosenberg et al AJHG 2003

  44. SPTA: Semi-Parametric Test of Association • Uses a principal components method to estimate genetic background • Uses individual scores on principal components to correct association in the context of linear/logistic regression • Uses a semi-parametric method to incorporate the effect of genetic background into the statistical test Zhang et al Gen Epi 2003 Chen et al Ann Hum Gen 2003

  45. SPTA • Type-I error comparisons of the four tests at the nominal value of 5% under discrete populat... Chen et al Ann Hum Genetics 2003

  46. Model based approach • Advantages • Applies correction based on individual markers • May be more informative about relationship between ancestry and phenotype • Limitations • Need to correctly define subgroups • Difficult with limited # markers • Difficult when subpopulations are genetically similar

  47. Summary • Several model based approaches • All require estimating background genetic ancestry from markers • Use background genetic ancestry to adjust association at candidate marker • In comparison to genomic control • May require more markers • Can use ancestry informative markers • Can identify “negative confounding” • In comparison to each other • Possible advantage of method of Zhang et al • More work required

  48. IV. Admixture Mapping

  49. Recombination with Admixture

  50. Linkage Disequilibrium in Admixed Populations
