Case-control association techniques in genetic studies. March 10, 2011. Karen Curtin, Ph.D. Division of Genetic Epidemiology and HCI Pedigree & Population Resource (PPR). Presentation outline. Background (genetics concepts). Basic case-control association.
Case-control association techniques in genetic studies March 10, 2011 Karen Curtin, Ph.D. Division of Genetic Epidemiology and HCI Pedigree & Population Resource (PPR)
Presentation outline • Background (genetics concepts) • Basic case-control association • Complex case-control association • Genome-wide association
The Human Genome: 6 billion DNA bases (Adenine, Cytosine, Guanine, or Thymine) License: Creative Commons Attribution 2.0
Genotype and Haplotype
Chromosome 1: …AGCCAAACTGAATTC…
Chromosome 2: …AGCCAAATTGGATTC…
At any locus (position on a chromosome): • Read across both chromosomes: Genotype (C/T at the first variable locus, A/G at the second) • Read along a chromosome: Haplotypes (C-A and T-G) • If allele T can predict allele G, the two alleles are in Linkage Disequilibrium (LD)
90% of genomic variants are SNPs • Single Nucleotide Polymorphism (SNP): two alternate forms (alleles) that differ in sequence at one point in a DNA segment Source: David Hall, Creative Commons Attribution 2.5 license
Genetic variants: Germline v Somatic • Germline variant/mutations • Inherited/In-born mutation • In all cells • In particular, in germline haploid cells • Heritable • Cell division - meiosis • Somatic variants/mutations • Acquired mutation • Only in an isolated number of cells (tumor site) • Generally not heritable • Cell division - mitosis
Hereditary mutation - meiosis [Figure: parent germ cells divide into HAPLOID daughter cells; daughter cells combine (X) into new DIPLOID zygotes]
Presentation outline • Background (genetics concepts) • Basic case-control association • Complex case-control association • Genome-wide association
Genetic variants in association studies Association: two characteristics (disease & genetic variant) occur more often together than expected by chance • Direct Association / Causal: Functional variant → Disease • Functional variant is involved in disease • Functional variant is associated with the disease • Indirect Association: Genetic variant → Functional variant → Disease • Genetic variant (SNP) is associated/correlated with underlying functional variant • Functional variant is involved in disease • Genetic variant (marker) is associated with disease (an initial step; the ultimate goal is to discover the causal variant)
Genetic association study Designs • Observational • Exposure variables • Genetic variants • Environmental factors • Classical association study designs • Unit of interest is an individual • Cohort study (cross-sectional or longitudinal) • Case-control study • Family-based association study • Unit of interest is a family unit
Case-Control Study • Sample individuals based on disease status and without knowledge of exposure status (e.g. genotype) • CASES (with disease) • CONTROLS (no disease) • Usually balanced design (#cases = #controls) • Retrospective • Neither prevalence nor incidence can be estimated
Types of Case-Control Study • Population-based • Risk estimates can be extrapolated to the source population • Could be nested in a cohort study • Selected sampling • Increases power to detect associations • Antoniou & Easton (2003) • Tests of independence are valid • True positive risks are exaggerated • Can not be extrapolated
Case-Control: Population-based • Source population • All individuals satisfying predefined criteria • Source cohort • A group that is ‘representative’ of the source population • CASES and CONTROLS occur in relation to population prevalence • CASES • Cases selected are ‘representative’ of cases in the source cohort • In particular, in terms of the exposure variables • CONTROLS • Controls selected are ‘representative’ of controls in the source cohort • In particular, in terms of the exposure variables • Odds Ratio (estimate of the relative risk) can be extrapolated back to the source population • Population Attributable Risk (PAR)
Case-Control: Selected Sampling • Source population • All individuals satisfying predefined criteria • Source cohort • A group that is ‘representative’ of the source population • CASES and CONTROLS occur in relation to population prevalence • CASES • Cases selected are in effect selectively sampled from cases in source cohort • Family history of disease, severe disease, early onset,… • CONTROLS • Controls selected are in effect selectively sampled from controls in source cohort • Screened negative, no family history,… • Association analyses are still valid and power may be increased • BUT… • Odds Ratio (estimate of the relative risk) cannot be extrapolated back to the source population
Case-Control Study: Odds Ratio
Disease           Exposure: Yes   Exposure: No
Cases (Yes)       a               b
Controls (No)     c               d
Odds Ratio: $\mathrm{OR} = \frac{a/b}{c/d} = \frac{a \times d}{b \times c}$
H0: OR = 1 same risk (no association) • OR > 1 indicates increased risk • OR < 1 indicates decreased risk (protective)
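The odds ratio formula on this slide can be sketched in a few lines. This is a minimal illustration only: the function name `odds_ratio` and the counts are hypothetical, not taken from the presentation.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table.

    a, b: exposed and unexposed cases
    c, d: exposed and unexposed controls
    """
    return (a * d) / (b * c)


# Hypothetical counts, chosen only to illustrate the calculation.
estimate = odds_ratio(a=30, b=70, c=20, d=80)
print(f"OR = {estimate:.3f}")
```

An estimate above 1 points toward increased risk among the exposed, and one below 1 toward a protective effect, as listed on the slide.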
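95% confidence intervals for the Odds Ratio Lower and upper bounds for the risk estimates. Two common methods:
1) $\left(e^{\ln(\mathrm{OR}) - 1.96\,\mathrm{se}(\ln(\mathrm{OR}))},\; e^{\ln(\mathrm{OR}) + 1.96\,\mathrm{se}(\ln(\mathrm{OR}))}\right)$ where $\mathrm{se}(\ln(\mathrm{OR})) = \sqrt{1/a + 1/b + 1/c + 1/d}$
2) $\left(\mathrm{OR}^{1 - 1.96/\chi},\; \mathrm{OR}^{1 + 1.96/\chi}\right)$ where $\chi$ is the square root of the chi-square statistic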
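Both interval methods can be sketched as follows, assuming the second method is the test-based interval in which $\chi$ is the square root of the 2×2 chi-square statistic. The function names `wald_ci` and `chi_based_ci` and the counts are hypothetical.

```python
import math


def wald_ci(a, b, c, d, z=1.96):
    """Method 1: interval built on the log odds ratio scale."""
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return math.exp(log_or - z * se), math.exp(log_or + z * se)


def chi_based_ci(a, b, c, d, z=1.96):
    """Method 2: OR ** (1 -/+ z / chi); undefined when OR is exactly 1 (chi = 0)."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    chi = math.sqrt(chi2)
    estimate = (a * d) / (b * c)
    bounds = (estimate ** (1 - z / chi), estimate ** (1 + z / chi))
    # Sorting keeps lower < upper even when the odds ratio is below 1.
    return min(bounds), max(bounds)


# Hypothetical counts: a, b = exposed/unexposed cases; c, d = exposed/unexposed controls.
print(wald_ci(30, 70, 20, 80))
print(chi_based_ci(30, 70, 20, 80))
```

On a table like this one the two methods give similar bounds; an interval that contains 1 is consistent with no association.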
chi-square test Compares observed values (O) with those expected under independence between rows and columns
Expected: $E = \frac{\text{row total} \times \text{column total}}{N}$
chi-square statistic, with (rows−1)×(columns−1) degrees of freedom: $\chi^2 = \sum \frac{(O - E)^2}{E} \sim \chi^2_{(\text{rows}-1)(\text{columns}-1)}$
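A minimal sketch of this computation, using only the Python standard library. The function names and the 2×3 table are hypothetical, and the p-value helper covers only 1 and 2 degrees of freedom, where the chi-square tail has a simple closed form.

```python
import math


def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


def chi_square_p_value(stat, df):
    """Upper-tail probability; closed forms exist only for df = 1 and df = 2."""
    if df == 1:
        return math.erfc(math.sqrt(stat / 2))
    if df == 2:
        return math.exp(-stat / 2)
    raise ValueError("this sketch handles only 1 or 2 degrees of freedom")


# Hypothetical genotype counts (CC, CT, TT): first row controls, second row cases.
stat, df = chi_square([[130, 50, 20], [110, 50, 40]])
print(stat, df, chi_square_p_value(stat, df))
```

For tables with more degrees of freedom, a statistics library would normally supply the tail probability.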
Test for Non-independence H0: Disease and exposure (genotype) are independent chi-square tests: contingency tables 2×3 genotype table (2 df) 2×2 grouped genotype table (1 df) • Dominant or recessive 2×3 ‘dose-dependent’ table • Armitage test for trend (1 df) 2×2 allele table (1 df)
Modeling genetic exposures • Exposure = genotype • Single variant with 2 alleles (SNP) • Three genotypes: CC, CT, TT • 2×3 contingency table (rows: Controls, Cases; columns: CC, CT, TT) • Chi-sq 2df • Chi-sq 1df (impose a linear dependency between columns)
Mode of Expression / Inheritance • Let allele C be disease causing • Examples of modes of expression are: • Dominant (TT vs. TC and CC) • Individuals heterozygous or homozygous for the C allele develop the disease • Recessive (TT and TC vs. CC) • Only individuals homozygous for the C allele develop the disease • Codominant (TT vs. TC vs. CC) • All three genotypes can be distinguished phenotypically • ‘Additive’ model: TC has r-fold risk, CC has 2r-fold risk
chi-square test
Expected counts under independence:
            CC    CT    TT    Totals
Controls    120   50    30    200
Cases       120   50    30    200
Totals      240   100   60    400
Chi-stat: $\chi^2 = \frac{(120-120)^2}{120} + \frac{(40-50)^2}{50} + \frac{(20-30)^2}{30} + \frac{(120-120)^2}{120} + \frac{(60-50)^2}{50} + \frac{(40-30)^2}{30} = 10.67$
p-value = 0.0048 (for a chi-square distribution with 2 df)
Genotypic relative risk • Assess risk (OR) for each genotype relative to the homozygous common genotype
Genotype (exposure)   CC   CT   TT
Controls              a    b    c
Cases                 d    e    f
CT vs. CC: $\mathrm{OR}_{het} = \frac{a \times e}{b \times d}$
TT vs. CC: $\mathrm{OR}_{hzv} = \frac{a \times f}{c \times d}$
chi-square test / genotypic relative risk
Expected counts under independence:
            CC    CT    TT    Totals
Controls    120   50    30    200
Cases       120   50    30    200
Totals      240   100   60    400
Chi-stat: $\chi^2 = \frac{(120-120)^2}{120} + \frac{(40-50)^2}{50} + \frac{(20-30)^2}{30} + \frac{(120-120)^2}{120} + \frac{(60-50)^2}{50} + \frac{(40-30)^2}{30} = 10.67$
p-value = 0.0048 (for a chi-square distribution with 2 df)
OR het (CT vs. CC) = 1.5 • OR hzv (TT vs. CC) = 2.0
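The two genotypic odds ratios can be reproduced from the observed counts that appear in the chi-statistic terms (controls 120, 40, 20 and cases 120, 60, 40 for CC, CT, TT). A minimal sketch, assuming cells a, b, c are the control counts and d, e, f the case counts; the function name is hypothetical.

```python
def genotypic_odds_ratios(controls, cases):
    """Return (OR_het, OR_hzv) relative to the first (reference) genotype.

    controls = (a, b, c) and cases = (d, e, f) are counts for CC, CT, TT.
    """
    a, b, c = controls
    d, e, f = cases
    or_het = (a * e) / (b * d)  # CT vs. CC
    or_hzv = (a * f) / (c * d)  # TT vs. CC
    return or_het, or_hzv


or_het, or_hzv = genotypic_odds_ratios(controls=(120, 40, 20), cases=(120, 60, 40))
print(or_het, or_hzv)
```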
Test for Non-independence H0: Disease and exposure (genotype) are independent chi-square tests: contingency tables 2×3 genotype table (2 df) 2×2 grouped genotype table (1 df) • Dominant or recessive 2×3 ‘dose-dependent’ table • Armitage test for trend (1 df) 2×2 allele table (1 df)
Dominant model for exposure Exposure = CT & TT genotypes - 2×2 test with 1 df
Genotype    CC   CT & TT
Controls    a    (b+c)
Cases       d    (e+f) = 100
$\mathrm{OR}_{dom} = \frac{a \times (e+f)}{d \times (b+c)} = 1.67$
Recessive model for exposure Exposure = TT genotype (vs. CC & CT) - 2×2 test w/ 1 df
Genotype    CC & CT         TT
Controls    (a+b) = 160     c
Cases       (d+e) = 180     f
$\mathrm{OR}_{rec} = \frac{(a+b) \times f}{(d+e) \times c} = 1.78$
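Collapsing the three genotype columns into two gives the dominant and recessive odds ratios. The sketch below assumes the observed counts used in the earlier chi-square example (controls 120, 40, 20 and cases 120, 60, 40 for CC, CT, TT), which reproduce the 1.67 and 1.78 shown on these two slides; the function names are hypothetical.

```python
def dominant_odds_ratio(controls, cases):
    """Exposure = CT or TT (carriers of the T allele) versus CC."""
    a, b, c = controls
    d, e, f = cases
    return (a * (e + f)) / (d * (b + c))


def recessive_odds_ratio(controls, cases):
    """Exposure = TT versus CC or CT."""
    a, b, c = controls
    d, e, f = cases
    return ((a + b) * f) / ((d + e) * c)


controls, cases = (120, 40, 20), (120, 60, 40)
print(round(dominant_odds_ratio(controls, cases), 2))
print(round(recessive_odds_ratio(controls, cases), 2))
```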
Test for Non-independence H0: Disease and exposure (genotype) are independent chi-square tests: contingency tables 2×3 genotype table (2 df) 2×2 grouped genotype table (1 df) • Dominant or recessive 2×3 ‘dose-dependent’ table • Armitage’s trend test (1 df) 2×2 allele table (1 df)
Armitage Trend Test (2×3 with 1 df) Assess departures from a fitted trend • Genotype scores: CC (x1=0), CT (x2=1), TT (x3=2) • Rows: Controls, Cases (row total R) • Column totals: n1, n2, n3; overall total N
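One common form of the trend statistic is $\chi^2 = \frac{N\,(N \sum r_i x_i - R \sum n_i x_i)^2}{R\,(N-R)\,(N \sum n_i x_i^2 - (\sum n_i x_i)^2)}$, where $r_i$ is the number of cases with genotype score $x_i$. The sketch below assumes that R on the slide denotes the total number of cases and reuses the observed counts from the earlier chi-square example; the function name is hypothetical.

```python
import math


def armitage_trend_test(controls, cases, scores=(0, 1, 2)):
    """Cochran-Armitage trend statistic (1 df) and its p-value."""
    totals = [c + k for c, k in zip(controls, cases)]  # n_i per genotype
    n = sum(totals)                                    # N
    r = sum(cases)                                     # R, total cases (assumption)
    sum_rx = sum(x * k for x, k in zip(scores, cases))
    sum_nx = sum(x * t for x, t in zip(scores, totals))
    sum_nxx = sum(x * x * t for x, t in zip(scores, totals))
    stat = n * (n * sum_rx - r * sum_nx) ** 2 / (
        r * (n - r) * (n * sum_nxx - sum_nx ** 2)
    )
    p_value = math.erfc(math.sqrt(stat / 2))  # chi-square upper tail, 1 df
    return stat, p_value


stat, p_value = armitage_trend_test(controls=(120, 40, 20), cases=(120, 60, 40))
print(stat, p_value)
```

The scores 0, 1, 2 encode allele dose, so the test has most power when risk rises roughly linearly with the number of T alleles.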
Example – genotypic relative risk and trend test Shephard et al. Cancer Res 2009
Test for Non-independence H0: Disease and exposure (genotype) are independent chi-square tests: contingency tables 2×3 genotype table (2 df) 2×2 grouped genotype table (1 df) • Dominant or recessive 2×3 ‘dose-dependent’ table • Armitage’s trend test (1 df) 2×2 allelic table (1 df)
Allelic Test • Exposure = Allele (T vs. C) • 2×2 table (1 df) for a single SNP • Count every allele (2 per person) • Doubles the sample size
Allele      C        T
Controls    2a+b     2c+b
Cases       2d+e     2f+e
$\mathrm{OR}_{allele} = \frac{(2a+b) \times (2f+e)}{(2c+b) \times (2d+e)}$ • OR = 1.633 (T vs. C allele)
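The allele-counting step can be sketched as below, again assuming the observed genotype counts from the earlier example (controls 120, 40, 20 and cases 120, 60, 40 for CC, CT, TT), which reproduce the 1.633 on this slide. The function names are hypothetical.

```python
def allele_counts(genotype_counts):
    """Convert (CC, CT, TT) genotype counts into (C, T) allele counts."""
    cc, ct, tt = genotype_counts
    return 2 * cc + ct, 2 * tt + ct


def allelic_odds_ratio(controls, cases):
    """Odds ratio for carrying a T allele rather than a C allele."""
    control_c, control_t = allele_counts(controls)
    case_c, case_t = allele_counts(cases)
    return (control_c * case_t) / (control_t * case_c)


print(allele_counts((120, 40, 20)))  # control alleles
print(allele_counts((120, 60, 40)))  # case alleles
print(round(allelic_odds_ratio((120, 40, 20), (120, 60, 40)), 3))
```

Because each person contributes two alleles, the resulting 2×2 table has twice as many observations as individuals.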
Example – allelic association (genotype columns 11, 12, 22) Xue et al. Arch Oral Bio 2009
More flexible techniques • If other factors may have an effect on disease status (affected/unaffected, case/control) • We want to account for these as covariates • We want to adjust for matching variables (age, sex, etc.) • Logistic regression • Logistic transformation (logit) • $\ln\left(\frac{p}{1-p}\right) = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots$ • Coefficients $\alpha$ and $\beta$’s are estimated using maximum likelihood estimation (MLE) • Test H0: $\beta = 0$ against H1: $\beta = \hat{\beta}$ using a likelihood ratio test (LRT) • Must decide on how to model the genetic exposure • genotype categories (i.e. CC, CT, TT), dominant, recessive, additive (allele dose)
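A single-predictor logistic regression can be fitted by Newton-Raphson with the standard library alone. The sketch below is illustrative: it has no covariates, uses grouped data, and applies a dominant coding (x = 0 for CC, x = 1 for CT or TT) to the counts from the earlier example, which are an assumption here. All function names are hypothetical; a real analysis with covariates would normally use a statistics package.

```python
import math


def _prob(alpha, beta, x):
    return 1 / (1 + math.exp(-(alpha + beta * x)))


def fit_logistic(groups, iterations=25):
    """groups: list of (x, n_cases, n_total). Returns MLEs (alpha, beta)."""
    alpha = beta = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y, n in groups:
            p = _prob(alpha, beta, x)
            g0 += y - n * p              # score for alpha
            g1 += x * (y - n * p)        # score for beta
            w = n * p * (1 - p)          # information weights
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        alpha += (h11 * g0 - h01 * g1) / det
        beta += (h00 * g1 - h01 * g0) / det
    return alpha, beta


def log_likelihood(groups, alpha, beta):
    return sum(
        y * math.log(_prob(alpha, beta, x)) + (n - y) * math.log(1 - _prob(alpha, beta, x))
        for x, y, n in groups
    )


def likelihood_ratio_test(groups):
    """LRT of H0: beta = 0 (1 df). Returns (beta_hat, statistic, p_value)."""
    alpha, beta = fit_logistic(groups)
    cases = sum(g[1] for g in groups)
    total = sum(g[2] for g in groups)
    alpha0 = math.log(cases / (total - cases))  # intercept-only MLE
    stat = 2 * (log_likelihood(groups, alpha, beta) - log_likelihood(groups, alpha0, 0.0))
    stat = max(stat, 0.0)
    return beta, stat, math.erfc(math.sqrt(stat / 2))


# Dominant coding: (x, cases, total) for CC and for CT or TT.
beta, stat, p_value = likelihood_ratio_test([(0, 120, 240), (1, 100, 160)])
print(math.exp(beta), stat, p_value)
```

With a single binary exposure and no covariates, exp(β) equals the odds ratio from the corresponding 2×2 table, which gives a convenient check on the fit.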
Example of logistic regression model with genetic exposure and covariates Slattery et al. IJC 2010
Assumptions for Validity • Independence of all individuals • Independent and identically distributed (iid) • Reasonable sample sizes • Contingency tables • Expected values all > 1 and 80% > 5 • Logistic regression • Minimum of 15-20 individuals per group • If violated • Simulate the null distribution for testing • Permutation test • e.g. Fisher’s exact test is an exhaustive permutation test • Monte Carlo simulation
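When expected counts are too small for the chi-square approximation, the null distribution can be simulated by shuffling case/control labels. The sketch below is one simple Monte Carlo permutation scheme with hypothetical function names and counts; it is not the procedure of any particular software package.

```python
import random


def chi_square_stat(table):
    """Pearson chi-square statistic; cells with zero expected count are skipped."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat


def permutation_p_value(controls, cases, n_perm=2000, seed=1):
    """Empirical p-value from random reassignment of case/control labels."""
    rng = random.Random(seed)
    k = len(controls)
    # One genotype index per individual, pooled over cases and controls.
    genotypes = [g for g in range(k) for _ in range(controls[g] + cases[g])]
    n_controls = sum(controls)
    observed = chi_square_stat([list(controls), list(cases)])
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(genotypes)
        ctrl = [0] * k
        for g in genotypes[:n_controls]:
            ctrl[g] += 1
        case = [controls[g] + cases[g] - ctrl[g] for g in range(k)]
        if chi_square_stat([ctrl, case]) >= observed - 1e-12:
            hits += 1
    # Adding 1 to numerator and denominator avoids reporting p = 0.
    return (hits + 1) / (n_perm + 1)


# Hypothetical small sample with sparse cells (CC, CT, TT).
print(permutation_p_value(controls=(8, 3, 1), cases=(4, 5, 3)))
```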
Presentation outline • Background (genetics concepts) • Basic case-control association • Complex case-control association • Genome-wide association
Performing haplotype analyses • Single locus • We observe genotypes, so testing is straightforward: count into a contingency table (rows: Controls, Cases; columns: CC, CT, TT)
Performing haplotype analyses • Multi-locus • Haplotypes are not directly observed • But can be estimated (EM/Bayesian…) • For some individuals, their haplotype pair can be inferred unambiguously • For many individuals they cannot • “Phase uncertainty” • All analyses of haplotypes must take into account the phase uncertainty in the data • Otherwise, type 1 errors increase
Haplotypes / Genotypes
Chromosome 1: …AGCTAAACTGGATT… (alleles C and G)
Chromosome 2: …AGCCAAACTGGATT… (alleles C and G)
Two-locus haplotypes: the haplotype pair must be C-G and C-G (UNAMBIGUOUS)
Estimating haplotypes
Genotypes               Haplotypes
Locus 1    Locus 2
CC         GG           C-G & C-G
CC         GA           C-G & C-A
CC         AA           C-A & C-A
CT         GG           C-G & T-G
CT         GA           ? (C-G & T-A) or (C-A & T-G) ?
CT         AA           C-A & T-A
TT         GG           T-G & T-G
TT         GA           T-G & T-A
TT         AA           T-A & T-A
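The table can be generated mechanically by trying every assignment of alleles to the two chromosomes. A minimal sketch with a hypothetical function name, where each genotype is written as a two-letter string per locus:

```python
from itertools import product


def compatible_haplotype_pairs(genotypes):
    """Return the set of unordered haplotype pairs consistent with the genotypes.

    genotypes: one two-letter string per locus, e.g. ["CT", "GA"].
    """
    pairs = set()
    # At each locus, choose which of the two alleles sits on the first chromosome.
    for choice in product((0, 1), repeat=len(genotypes)):
        first = "-".join(g[c] for g, c in zip(genotypes, choice))
        second = "-".join(g[1 - c] for g, c in zip(genotypes, choice))
        pairs.add(tuple(sorted((first, second))))
    return pairs


print(compatible_haplotype_pairs(["CC", "GA"]))  # one pair: unambiguous
print(compatible_haplotype_pairs(["CT", "GA"]))  # two pairs: phase is ambiguous
```

Only individuals heterozygous at two or more loci yield more than one pair, which is where phase uncertainty arises.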
Estimating haplotypes • Expectation-maximization (EM) algorithm • SNPHAP (Johnson et al 2001) • GCHap (Thomas 2003) • Bayesian MCMC approach • PHASE (Stephens et al 2001) • Both approaches assume independent individuals • Used to estimate • Population haplotype frequencies estimated from a set of individuals • Most likely haplotype pair for each individual
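For two SNPs, the EM idea can be shown compactly because only double heterozygotes (CT at locus 1, GA at locus 2) are ambiguous. The sketch below is a textbook-style illustration assuming random mating; it is not the algorithm of SNPHAP, GCHap, or PHASE, and the function name and counts are hypothetical.

```python
def em_two_locus(known_counts, n_double_het, iterations=100):
    """Estimate haplotype frequencies for two biallelic loci.

    known_counts: haplotype -> count from individuals whose phase is unambiguous,
                  with keys "C-G", "C-A", "T-G", "T-A".
    n_double_het: number of individuals whose phase is ambiguous.
    """
    total = sum(known_counts.values()) + 2 * n_double_het
    freq = {h: 0.25 for h in known_counts}
    for _ in range(iterations):
        # E-step: probability that a double heterozygote carries C-G with T-A.
        cis = freq["C-G"] * freq["T-A"]
        trans = freq["C-A"] * freq["T-G"]
        w = cis / (cis + trans)
        expected = dict(known_counts)
        expected["C-G"] += n_double_het * w
        expected["T-A"] += n_double_het * w
        expected["C-A"] += n_double_het * (1 - w)
        expected["T-G"] += n_double_het * (1 - w)
        # M-step: re-estimate frequencies from the expected haplotype counts.
        freq = {h: count / total for h, count in expected.items()}
    return freq


# Hypothetical data: 100 phase-known haplotypes plus 20 double heterozygotes.
known = {"C-G": 50, "C-A": 10, "T-G": 10, "T-A": 30}
print(em_two_locus(known, n_double_het=20))
```

The final value of the weight w for each ambiguous individual also indicates their most likely haplotype pair.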
Traditional methods for phase uncertainty • Likelihood based approach • Each individual can have multiple different haplotype pairs that are consistent with the genotype data • Some pairs of haplotypes are more or less likely than others • Each pair is given a weight • All possible haplotype pairs are considered in the case-control analysis • weighted by their probabilities
Simulation methods for phase uncertainty • Sample over the observed data • Instead of weighting all the possible haplotype pairs for every individual and incorporating all at once into the analysis • Sample one pair for each individual • Randomly and in proportion to the weights, select a haplotype pair for each individual • Perform the analysis as if those were observed • Repeat 1,000 times… • Average • SIMHAP (McCaskie et al.)
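The sample-analyse-average loop can be sketched generically. This is an illustration of the idea only, not SIMHAP's implementation: the weights, function names, and the toy "analysis" (counting copies of one haplotype) are all hypothetical.

```python
import random


def average_over_sampled_phases(individuals, analysis, n_repeats=1000, seed=0):
    """Average an analysis over randomly sampled haplotype-pair assignments.

    individuals: one dict per person mapping haplotype pair -> weight (probability).
    analysis: function taking a list of haplotype pairs and returning a number.
    """
    rng = random.Random(seed)
    results = []
    for _ in range(n_repeats):
        # Draw one pair per person, in proportion to the weights.
        sampled = [
            rng.choices(list(weights), weights=list(weights.values()))[0]
            for weights in individuals
        ]
        results.append(analysis(sampled))
    return sum(results) / n_repeats


def count_haplotype(target):
    """Toy analysis: number of copies of one haplotype in the sampled data."""
    def analysis(pairs):
        return sum(pair.count(target) for pair in pairs)
    return analysis


people = [
    {("C-G", "C-G"): 1.0},                          # unambiguous
    {("C-G", "T-A"): 0.8, ("C-A", "T-G"): 0.2},     # ambiguous double heterozygote
]
print(average_over_sampled_phases(people, count_haplotype("T-A")))
```

In practice the analysis step would be the case-control test itself, with the test statistics or effect estimates averaged across repeats.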
Simulation methods for phase uncertainty • Monte Carlo testing • Simulate the null, matched to the real data • Instead of weighting all the possible haplotype pairs for every individual and incorporating all at once into the analysis • Assign each individual their most likely haplotype pair • Cases and controls separately • Simulate null haplotype data • Null: Convert haplotypes to genotypes • Null: Estimate haplotypes • Null: Assign each individual their most likely haplotype pair • Real and null are matched • Test real data (with most likely haplotype pairs assigned) against the simulated null • hapMC (Thomas et al.)
Exponential explosion… high dimensional data • 1 SNP • 2 alleles → 1 test • 3 genotypes → 1+ tests • 2 SNP loci • 4 haplotypes • 3 SNP loci • 8 haplotypes • 10 SNP loci • 1024 haplotypes → many tests
Multi-locus… but how many, and which loci to test? • For example…20 tSNPs • Only perform single SNP analyses? • Perform tests on all 20-locus haplotypes? • Group all ‘rare’ haplotypes together • Cluster to reduce dimension • Multi-locus tests with subsets of 20 SNPs? • Subsets of which SNPs?
Data mining approach to haplotype construction – hapConstructor (Abo et al.) • Automatically builds haplotypes (or composite genotypes) • Non-contiguous SNPs • In a case-control framework • All SNP haplotypes are phased during 1st stage and used in all subset analyses • Starts with each single SNP locus • Forward-backward process driven by significance thresholds • Significance and false discovery rates (p-values and q-values) reported for the building process • Computationally challenging, potentially time intensive
Multilocus model building example using hapConstructor 16 SNPs Curtin et al. BMC Med Genet 2010
Multilocus haplotype association using hapConstructor Curtin et al. BMC Med Genet 2010