Strategies for gene identification in complex traits --- Association studies ---
What is an association study?
Objective: Is there a statistical relation between
• Phenotypic variation - presence/absence of a disease - levels of a disease-related trait
• Genomic variation at one or more sites
Principle: Compare 2 groups that are expected to differ in their prevalence of disease-susceptibility alleles
Analytical Issues in Genetic Association Studies • Sampling Design • Markers (typed; Map density) • Unit of Analysis • Statistical testing
Linkage disequilibrium between 2 tightly linked loci (Marker 1, Marker 2)
• Allelic association: f(i,j) ≠ f(i) × f(j)
• Haplotype frequency ≠ product of allele frequencies
• LD decays with time/generations and genetic distance (recombination)
Measures of allelic association: D’ (Lewontin’s); r² (correlation)
• 0 ≤ r² ≤ D’ ≤ 1
• D’ ~ recombinational events in the genomic region
• r² ~ the 2 SNPs carry the same information
• D’ can be high but not r²
• Figure panels: D’ = 1, r² < 1; D’ = 1, r² = 1
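To make the difference between the two measures concrete, here is a minimal Python sketch using the standard textbook definitions of D, D’ and r² for two biallelic loci. The function name `ld_measures` and the example frequencies are illustrative assumptions, not values from the slides.

```python
def ld_measures(p_ab, p_a, p_b):
    """Return (D, D', r2) for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A (marker 1) and B (marker 2)
    p_a, p_b: frequencies of allele A and allele B
    All inputs are hypothetical illustration values.
    """
    d = p_ab - p_a * p_b  # deviation from the product of allele frequencies
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2


# Allele B only ever occurs on haplotypes carrying A, but A is more common:
# D' is maximal while r2 stays well below 1.
print(ld_measures(p_ab=0.2, p_a=0.5, p_b=0.2))
# Both alleles always occur together: D' and r2 are both maximal.
print(ld_measures(p_ab=0.3, p_a=0.3, p_b=0.3))
```

The first call corresponds to the "D’ can be high but not r²" situation: no recombinant haplotype is observed, yet one SNP is a poor proxy for the other because their allele frequencies differ.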
Power in Population-based vs. Family-based Analysis
TDT
• Genotype 3 subjects/family
• Phenotype 1 subject/family
• Increased power with multiple affected sibs
• Generally immune to population stratification
• Family structure provides some error-checking and haplotype information
• Full trios may not be available
Case-Control
• Genotype 2 subjects to equal one trio
• Phenotype 2 subjects to equal one trio
• Increased power with 3:1 controls:cases
• Susceptible to population stratification
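As an illustration of the family-based test, the following Python sketch computes the TDT in its usual McNemar form: among heterozygous parents, transmissions of an allele to the affected child are compared with non-transmissions. The function name `tdt` and the counts 60/40 are hypothetical; the slides do not give a formula or data.

```python
import math


def tdt(transmitted, untransmitted):
    """TDT chi-square (1 df) and p-value.

    transmitted / untransmitted: number of times heterozygous parents did /
    did not transmit the allele of interest to an affected child.
    """
    total = transmitted + untransmitted
    if total == 0:
        raise ValueError("no informative (heterozygous) parents")
    chi2 = (transmitted - untransmitted) ** 2 / total
    # Survival function of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


chi2, p = tdt(transmitted=60, untransmitted=40)  # hypothetical counts
print(chi2, p)
```

Because only transmissions within families are compared, the statistic does not depend on allele frequency differences between subpopulations, which is why the design is generally immune to population stratification.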
Most common forms of markers
• Repeated sequences of 2, 3 or 4 nucleotides (microsatellites, STRs)
  - reasonably frequent in genome
  - highly polymorphic/informative; useful in linkage analysis
  - few disease susceptibility gene variants are likely STRs
• Single Nucleotide Polymorphisms (SNPs): “one” letter of the code is altered
  - very frequent in genome (1/500 to 1/1000 base pairs)
  - exonic SNPs may or may not cause an amino acid change
  - many disease susceptibility gene variants are likely SNPs
Unit of Analysis in Genetic Association Studies
• Allele vs. Genotype
  - Dominance can be considered in genotype analysis
  - Extra degree of freedom in genotype analysis
  - Not clear which is optimal
• Single SNP vs. Haplotype
  - Haplotypes capture evolutionary history
  - Need for haplotype imputation
  - Single SNP optimal if functional SNP is included
What are we hoping for from a genetic association study?
Situation of interest: trait variation is influenced by
• The typed variant: marker = causal variant (direct association)
OR
• A second variant: marker in LD with the causal variant (indirect association)
Likelihood of detecting a true association? • Genetic effects of the causal allele on trait susceptibility/variation --Relative Risk & allele frequency • LD between the marker and the causal variant (Marker map & LD patterns in the genomic region of the causal variant)
Detectable Genetic effects (1) Power under different Nominal P-values N=2,000 (1,000 cases + 1,000 controls)
Detectable Genetic effects (2) Power under different Nominal P-values N=2,000 (1,000 cases + 1,000 controls)
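The power curves themselves are not recoverable from the slide text. As a rough stand-in, the Python sketch below approximates the power of a two-sided allelic case-control test at several nominal P-values. It rests on assumptions that are not in the slides: a per-allele odds ratio (rather than a relative risk), two alleles counted per subject, and a normal approximation that ignores the far tail. The function name `allelic_power` and the example inputs are hypothetical.

```python
import math
from statistics import NormalDist


def allelic_power(p_control, odds_ratio, n_cases, n_controls, alpha):
    """Approximate power of a two-sided allelic test (normal approximation)."""
    # Allele frequency in cases implied by the assumed per-allele odds ratio
    odds_case = odds_ratio * p_control / (1 - p_control)
    p_case = odds_case / (1 + odds_case)
    # Each subject contributes two alleles
    se = math.sqrt(p_case * (1 - p_case) / (2 * n_cases)
                   + p_control * (1 - p_control) / (2 * n_controls))
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_effect = abs(p_case - p_control) / se
    return NormalDist().cdf(z_effect - z_alpha)


# N = 2,000 (1,000 cases + 1,000 controls), as on the slide;
# allele frequency 0.2 and odds ratio 1.5 are illustration values.
for alpha in (0.05, 1e-4, 1e-7):
    print(alpha, round(allelic_power(0.2, 1.5, 1000, 1000, alpha), 3))
```

The loop shows the qualitative point of the slide: for a fixed sample size and effect, power drops as the nominal P-value threshold becomes more stringent.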
Detectable Genetic effects?
Association is powerful to detect causal variants that are
- Common (>10%) with relatively modest effects (RR)
- Less common (~5%) but with substantial effects (RR>2)
Likelihood of detecting a true association?
Three situations: r² = 1 (direct); 0 < r² < 1; r² = 0
• For a given N, power ranges from maximal (r² = 1) to null (r² = 0)
• For a given power, the required N increases with 1/r²
r² = 1, 0.8, 0.5, 0.20, 0
N = 1,000, 1,250, 2,000, 5,000, ∞
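The sample sizes on this slide follow from dividing the sample size needed under direct association by r². A minimal Python sketch of that arithmetic (the function name `required_sample_size` is an illustrative assumption):

```python
def required_sample_size(n_direct, r2):
    """Sample size needed at a marker in LD (r2) with the causal variant,
    given the sample size n_direct needed when the causal variant is typed."""
    if not 0 < r2 <= 1:
        raise ValueError("r2 must be in (0, 1]; at r2 = 0 no finite N suffices")
    return round(n_direct / r2)


for r2 in (1, 0.8, 0.5, 0.2):
    print(r2, required_sample_size(1000, r2))
```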
Hot spots and Haplotype blocks
• LD is variable: recombination does not occur with equal probability at all points in the genome; there are « hot » and « cold » spots
• Recently, it has been suggested that the genome falls into « blocks », with little haplotype diversity within blocks. Mean block size seems to be ~14 kb in Caucasians and ~8 kb in Africans (but very variable; there are blocks up to 200 kb in size)
Detectable Causal Variants?
• Causal polymorphism is known and typed (direct association), or
• There are markers that are highly correlated with the causal variant:
  - The causal locus lies in a « cold » spot (« LD blocks »)
  - The « best » map density to be used will depend on the LD patterns of the region, with implications for statistical significance (multi-test correction)
Human Genome
• The human genome consists of about 3 × 10^9 base pairs (3-6 × 10^6 SNPs) and contains about 25,000 genes
• Much of the DNA is either in introns or in intergenic regions
• Trait variation: a few hundred (functional) variants may make a meaningful contribution to variation in any single phenotype
Consequence: the prior probability that a variant selected at random will influence a given trait is very low
Genetic variants to be typed? --- Choices have to be made --- Two complementary approaches: • Functional: incorporates assessments of the likely functional effect of variation within a gene or region of interest. • Tagging: exploits presence of LD in many parts of the genome.
Significance of association with AD, for SNPs immediately surrounding APOE (<100 kb) [Martin et al., AJHG, 2000]
Selection of variants: Functional approach
Target polymorphisms which are themselves putative causal variants.
Critical issues:
• Identification of candidate polymorphisms
• Beyond mutations altering the amino acid sequence (nsSNPs), little is known about the potential effect of non-coding sequence on gene regulation & expression
• MAF of functional variants is skewed (MAF<5%): power to detect uncommon variants with modest effects?
Potential to be the most powerful design (direct association), but may be limited to the discovery of some of the genetic causes of disease-related traits.
Selection of variants: Indirect Association
The polymorphism is a surrogate for the causal variant, but it is necessary to type several surrounding markers to have a high chance of picking up the indirect association.
Questions:
• Do we need to type all markers in the region?
• Can we reduce genotyping costs & multi-test burden without decreasing power « too much »?
Tagging approaches
Type a subset of variants that captures a high amount of the information in common regional haplotypes.
Various strategies exist (SNP & haplotype tagging), but there is still debate as to the best methods [Johnson et al. Nat Genet, 200]
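One simple way to picture SNP tagging is a greedy pairwise-r² selection: repeatedly pick the SNP that captures the most not-yet-captured SNPs at a chosen r² threshold. The Python sketch below illustrates that idea only; it is not the method of the cited paper, and the function name `greedy_tag_snps`, the 0.8 default threshold and the toy r² matrix are assumptions for illustration.

```python
def greedy_tag_snps(r2, threshold=0.8):
    """Pick tag SNPs from a symmetric matrix of pairwise r2 values.

    r2: list of lists, r2[i][j] = r2 between SNP i and SNP j.
    Returns the indices of the chosen tags, in the order they were picked.
    """
    n = len(r2)
    untagged = set(range(n))
    tags = []
    while untagged:
        best, best_captured = None, set()
        for i in range(n):
            # SNPs still uncaptured that candidate i would capture (itself included)
            captured = {j for j in untagged if j == i or r2[i][j] >= threshold}
            if len(captured) > len(best_captured):
                best, best_captured = i, captured
        tags.append(best)
        untagged -= best_captured
    return tags


# Toy region (hypothetical values): SNPs 0-2 form one block, SNPs 3-4 another.
toy_r2 = [
    [1.00, 0.90, 0.85, 0.10, 0.10],
    [0.90, 1.00, 0.95, 0.10, 0.10],
    [0.85, 0.95, 1.00, 0.20, 0.10],
    [0.10, 0.10, 0.20, 1.00, 0.90],
    [0.10, 0.10, 0.10, 0.90, 1.00],
]
print(greedy_tag_snps(toy_r2))  # two tags suffice for the five SNPs
```

Typing two SNPs instead of five reduces both genotyping cost and the number of tests, at the price of capturing the untyped SNPs only at r² ≥ 0.8 rather than directly.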
Power as a function of average spacing of tags (kb) [De Bakker, Nat Genet, Nov 2005]
Figure: tags picked at r² = 1, 0.8, 0.5 and 0.3, compared with random tags
A marker map density of ~1 tagSNP/5 kb (r² > 0.8) captures >80% of common variation
Tagging approach: Limits
• Less powerful than direct studies
• There cannot be a definite negative result, since we cannot exclude the possibility that a causal variant exists but is not picked up by the markers chosen
• Intrinsic biological merit of tagSNPs as markers for complex trait susceptibility variants?
« Common disease, common variant » hypothesis, supported by the few variants consistently shown to be associated with common diseases:
  - APOE & Alzheimer
  - Macular degeneration & Complement Factor H
In practical terms, an observed statistical association will be due to …
• Direct association: The allele itself is functional and directly affects the expression of the phenotype
• Indirect association: The allele is in linkage disequilibrium with an allele at another locus that directly affects the expression of the phenotype
• The finding could be due to chance or artifact, e.g., confounding or selection bias
Study design aims to maximize detection of “true” findings while controlling (minimizing) the rate of “false” findings
“False” Association findings
• Chance: measured by the nominal P value of the test, i.e., the prior probability that a typed marker is found associated when H0 (no association) is true.
• Multi-test problem: the rate of “false” findings of a given experiment increases with the number of markers tested.
• Solutions
  - Simulation: empirical p-values
  - Replication and/or use of multi-phase designs
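An empirical p-value by simulation can be sketched as a permutation test: case/control labels are shuffled many times, and the best statistic across all markers is recorded each time, so the result is already corrected for the number of markers tested. The Python sketch below is one minimal way to do this; the statistic (difference in mean allele count), the function names and the toy data are assumptions for illustration, not the procedure used by the author.

```python
import random


def mean_allele_diff(genotypes, labels):
    """Absolute difference in mean allele count between cases (1) and controls (0)."""
    cases = [g for g, y in zip(genotypes, labels) if y == 1]
    controls = [g for g, y in zip(genotypes, labels) if y == 0]
    return abs(sum(cases) / len(cases) - sum(controls) / len(controls))


def empirical_p_value(markers, labels, n_perm=1000, seed=0):
    """Permutation p-value for the best marker, corrected for multiple markers.

    markers: list of markers, each a list of genotypes (0/1/2) per subject.
    labels: list of 0/1 case-control labels, one per subject.
    """
    rng = random.Random(seed)
    observed = max(mean_allele_diff(m, labels) for m in markers)
    shuffled = list(labels)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any genotype-phenotype relation
        perm_max = max(mean_allele_diff(m, shuffled) for m in markers)
        if perm_max >= observed:
            exceed += 1
    # Add-one correction keeps the estimate strictly positive
    return (exceed + 1) / (n_perm + 1)


# Toy data (hypothetical): 10 cases then 10 controls, 2 markers.
labels = [1] * 10 + [0] * 10
markers = [[2] * 10 + [0] * 10,  # perfectly associated marker
           [0, 1] * 10]          # unassociated marker
print(empirical_p_value(markers, labels, n_perm=200, seed=1))
```

Shuffling the labels rather than the genotypes keeps the LD structure between markers intact in every permuted data set, which is what makes the max-statistic correction less conservative than treating correlated markers as independent tests.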
Multi-phase designs
Are efficient to reduce the multi-test problem. For example:
1. 2,000 cases + 2,000 controls with 500,000 SNP chip
2. Further 2,000 + 2,000 for best 100,000 SNPs
3. Further 4,000 + 4,000 for best 10,000 SNPs
Computation of the characteristics of such designs requires Monte Carlo integration; optimization is computationally intensive
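The three-phase example can be put in terms of genotyping effort with simple arithmetic. The Python sketch below counts genotypes (subjects × SNPs) per phase and compares the total with a single-stage design in which all 16,000 subjects are typed on the full chip. That single-stage comparator is an assumption added here for illustration; the slide does not state it.

```python
def genotypes_required(stages):
    """Total number of genotypes for a list of (n_subjects, n_snps) stages."""
    return sum(n_subjects * n_snps for n_subjects, n_snps in stages)


# Phases as listed on the slide (cases + controls per phase)
three_phase = [(4_000, 500_000), (4_000, 100_000), (8_000, 10_000)]
# Hypothetical comparator: everyone typed on the full 500,000 SNP chip
single_stage = [(16_000, 500_000)]

multi = genotypes_required(three_phase)
single = genotypes_required(single_stage)
print(multi, single, multi / single)
```

This counts genotyping effort only; the power and false-positive characteristics of the design are what require the Monte Carlo computation mentioned on the slide.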
“False” Association findings
• Artifact (confounding, selection bias, population stratification, genotyping errors): affects the prior probability of a “chance” finding. The significance of a finding is no longer controlled by the nominal P-value.
• Solutions
  - Careful matching of cases & controls
  - Use homogeneous populations
  - Use family-based controls
  - Use genomic control or other similar methods
  - Use QC methods for scoring genotyping errors (Clayton et al., Nat Genet, 2005)
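Genomic control, as it is commonly described, estimates an inflation factor λ from the median of the 1-df chi-square statistics across many markers and divides every statistic by it. The Python sketch below shows that median-based form; the function name `genomic_control`, the floor of λ at 1 and the example statistics are assumptions for illustration.

```python
from statistics import NormalDist, median

# Median of a chi-square distribution with 1 df (about 0.455)
CHI2_1DF_MEDIAN = NormalDist().inv_cdf(0.75) ** 2


def genomic_control(chi2_stats):
    """Return (lambda, corrected statistics) for 1-df chi-square statistics."""
    lam = median(chi2_stats) / CHI2_1DF_MEDIAN
    lam = max(lam, 1.0)  # assumption: never inflate statistics when lambda < 1
    return lam, [x / lam for x in chi2_stats]


# Hypothetical statistics whose median is about twice the expected value
lam, corrected = genomic_control([0.2, 0.91, 3.0])
print(lam, corrected)
```

In practice λ would be estimated from thousands of markers assumed to be mostly unassociated with the trait; three values are used here only to keep the example readable.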
Prospects for whole-genome screens
Estimated numbers of « common » SNPs (MAF>5%)
• Direct studies of nsSNPs: ~30,000 - 50,000 SNPs
• Indirect studies of genes: ~300,000 - 500,000 SNPs
• « Nearly » whole genome: 500,000 - 1,000,000
• Whole genome: ~2,000,000 - 4,000,000
Choice of markers
• Optimal choice of markers requires detailed mapping of LD, e.g. based on HapMap data
• Truly optimal solutions are computationally intensive. Current chip designers are using single-marker r² cluster-based algorithms
Choices of markers have to be made • The strategy used to define the subsets of variants to be typed has a substantial effect on the power & the quality of the study. • Greater understanding of genomic variation has allowed more logical choices. Nonetheless, variant selection is always a pragmatic compromise.
Research key questions
• Are common human diseases due to common variants or multiple rare variants?
• Will rare or common SNPs be better candidates for a particular disease?
• Can large differences between populations in the frequency of an allele be merely due to chance?