Genomewide Association Studies • 1. History • Linkage vs. Association • Power/Sample Size • 2. Human Genetic Variation: SNPs • 3. Direct vs. Indirect Association • Linkage Disequilibrium • 4. SNP selection, Coverage, Study Designs • 5. Genotyping Platforms • 6. Early (recent) GWA Studies
Risch and Merikangas 1996 • Sample Size for Association < Sample Size for Linkage
Sample Size Required • Linkage Analysis with affected sib pairs • Transmission Disequilibrium Test (TDT) • TDT with affected sib pairs
Affected Sib Pair Linkage Analysis • 2 siblings/family • Both sibs affected • IBD at the marker locus • Expect 50% on average
Identity By Descent [Figure: alleles inherited by Sibling 1 and Sibling 2; the four outcomes share 2, 1, 1, and 0 alleles IBD]
Identity By Descent • Expected number of alleles IBD = 2*25% + 1*50% + 0*25% = 1 allele = 50% sharing
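The expectation on this slide can be checked with a few lines of Python. This is a minimal sketch, assuming the 25% / 50% / 25% sharing probabilities quoted above; the names `ibd_probs` and `expected_ibd` are illustrative and not from the slides.

```python
# Probability that a sib pair shares 0, 1, or 2 alleles IBD at a locus
# unlinked to disease (the 25% / 50% / 25% split used on the slide).
ibd_probs = {0: 0.25, 1: 0.50, 2: 0.25}


def expected_ibd(probs):
    """Expected number of alleles shared IBD: sum of k * P(k)."""
    return sum(k * p for k, p in probs.items())


expected_alleles = expected_ibd(ibd_probs)
# Each sibling carries 2 alleles at the locus, so divide by 2 for a fraction.
sharing_fraction = expected_alleles / 2

print(f"expected alleles IBD: {expected_alleles}")
print(f"expected sharing: {sharing_fraction:.0%}")
```

Affected sib pairs that share noticeably more than this baseline at a marker are what the linkage test looks for.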
Sample Size Calculation [Table columns: Exposure Frequency, Effect Size, Identity By Descent (IBDM), Sample Size Required; annotations: High IBD sharing, Low IBD sharing]
TDT • Transmitted alleles vs. non-transmitted alleles [Pedigree: parents M1M2 and M2M2; child M1M2]
TDT • Transmitted alleles vs. non-transmitted alleles • $\mathrm{TDT} = \frac{(n_{12} - n_{21})^2}{n_{12} + n_{21}}$ • Asymptotically $\chi^2$ with 1 degree of freedom
TDT • For this one trio: $\mathrm{TDT} = \frac{(1 - 0)^2}{1 + 0} = 1$ • p-value = 0.32
TDT • For one hundred trios: $\mathrm{TDT} = \frac{(60 - 35)^2}{60 + 35} = 6.58$ • p-value = 0.01
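The TDT statistic and its p-value are easy to reproduce. The sketch below assumes the counts n12 and n21 are transmissions of each allele from heterozygous parents, and it uses the identity P(X > x) = erfc(sqrt(x / 2)) for a chi-square variable with 1 degree of freedom. The function names are illustrative, and the counts 60 and 35 are assumed values chosen because they give a statistic of about 6.58.

```python
import math


def tdt_statistic(n12, n21):
    """TDT statistic (n12 - n21)^2 / (n12 + n21) from transmission counts."""
    if n12 + n21 == 0:
        raise ValueError("need at least one informative transmission")
    return (n12 - n21) ** 2 / (n12 + n21)


def chi2_1df_pvalue(stat):
    """Upper-tail p-value for a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


# A single trio with one informative transmission.
single = tdt_statistic(1, 0)
p_single = chi2_1df_pvalue(single)

# Illustrative counts (an assumption) that give a statistic near 6.58.
many = tdt_statistic(60, 35)
p_many = chi2_1df_pvalue(many)

print(f"one trio:   TDT = {single:.2f}, p = {p_single:.2f}")
print(f"many trios: TDT = {many:.2f}, p = {p_many:.2f}")
```

The same proportional imbalance that is uninformative in one trio becomes significant once enough informative transmissions accumulate.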
Linkage • Good for Large Effect Sizes • Genomewide Association • Good for Modest Effect Sizes • Not good for rare disease alleles
Two Hypotheses • Common Disease-Common Variant • Common variants • Small to modest effects • Rare Variant • Rare variants • Larger effects
GWA Issues • Cost • Sample Size • Effect Size • Disease Allele Frequency • Multiple Testing • SNP selection • How many? • Which SNPs? • Available Genotyping Platforms
Types of Variants • Single Nucleotide Polymorphism (SNP) • Insertion/Deletion (indel) • Microsatellite or Short Tandem Repeat (STR)
What is a SNP? • Chromosome 1: TTCAGTCAGATCCTAGCCC / AAGTCAGTCTAGGATCGGG • Chromosome 2: TTCAGTCAGATCCCAGCCC / AAGTCAGTCTAGGGTCGGG • SNP: T vs. C at position 14 (A vs. G on the complementary strand)
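To make the definition concrete, the two top-strand sequences from this slide can be compared position by position. This is a minimal sketch that assumes the sequences are already aligned and of equal length; `mismatch_positions` is an illustrative helper, not a function from any genotyping toolkit.

```python
def mismatch_positions(seq1, seq2):
    """Return (1-based position, base in seq1, base in seq2) for each mismatch."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned and of equal length")
    return [
        (i + 1, a, b)
        for i, (a, b) in enumerate(zip(seq1, seq2))
        if a != b
    ]


# Top strands of the two chromosomes shown on the slide.
chrom1 = "TTCAGTCAGATCCTAGCCC"
chrom2 = "TTCAGTCAGATCCCAGCCC"

snps = mismatch_positions(chrom1, chrom2)
print(snps)
```

A plain positional comparison like this only works for substitutions; an insertion or deletion shifts every later base and needs a real alignment.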
What is an insertion/deletion? • Chromosome 1: TTCAGTCAGATCCTAGCCC / AAGTCAGTCTAGGATCGGG • Chromosome 2: TTCAGTCAGATCCCTAGCCC / AAGTCAGTCTAGGGATCGGG • Insertion/Deletion: one extra base pair on Chromosome 2
What is a microsatellite? • Chromosome 1: TTCACAGCAGCAGCAGAGCCC / AAGTGTCGTCGTCGTCTCGGG • Chromosome 2: TTCACAGCAGCAGAGCCC / AAGTGTCGTCGTCTCGGG • 4 vs. 3 trinucleotide repeats
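The repeat counts can be verified by finding the longest uninterrupted run of the repeat unit in each top strand. This sketch assumes the repeat unit is CAG, which is one reading of the sequences on the slide; `max_repeat_count` is an illustrative name.

```python
import re


def max_repeat_count(seq, unit):
    """Length, in repeat units, of the longest uninterrupted run of `unit`."""
    runs = re.findall(f"(?:{re.escape(unit)})+", seq)
    return max((len(run) // len(unit) for run in runs), default=0)


# Top strands of the two chromosomes shown on the slide.
chrom1 = "TTCACAGCAGCAGCAGAGCCC"
chrom2 = "TTCACAGCAGCAGAGCCC"

repeats1 = max_repeat_count(chrom1, "CAG")
repeats2 = max_repeat_count(chrom2, "CAG")
print(f"chromosome 1: {repeats1} repeats, chromosome 2: {repeats2} repeats")
```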
How many SNPs? • 6 billion humans • 12 billion chromosomes • 1% frequency SNP • 120 million copies of the minor allele
How many of these SNPs have we found? • dbSNP: http://www.ncbi.nlm.nih.gov/projects/SNP/ • 10,430,753 SNPs • 4,868,126 are “validated”
What Risch and Merikangas proposed: • 5 genetic polymorphisms per gene • 100,000 genes (1996) • = 500,000 genotypes per subject • Candidate Gene Study Design • All genes are candidates • Direct or Sequence-based approach • Causal variant is one of the variants tested
Indirect Association relies on LD Decay • Variants that are close will have high LD • Variants that are far apart will have low LD • Indirect Association is a form of Positional Cloning
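LD Decay • $E(D_t) = D_1 (1 - \theta)^t$, where $D_t$ is the current amount of LD, $D_1$ is the starting amount of LD, $\theta$ is the recombination fraction, and $t$ is the number of generations • If $\theta = 0.5$, LD decays at a rate of 50% per generation • If $\theta < 0.5$, LD decay is slower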
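The decay formula can be tabulated to show how much slower LD erodes between tightly linked variants. This is a rough sketch: `d_start` stands for the starting amount of LD and `theta` for the recombination fraction, and the starting value 0.25 and the value 0.01 for close markers are assumptions chosen only for illustration.

```python
def expected_ld(d_start, theta, generations):
    """Expected LD after `generations`: d_start * (1 - theta) ** generations."""
    if not 0.0 <= theta <= 0.5:
        raise ValueError("recombination fraction must lie in [0, 0.5]")
    if generations < 0:
        raise ValueError("generations must be non-negative")
    return d_start * (1.0 - theta) ** generations


d_start = 0.25  # illustrative starting LD (assumption)
for theta in (0.5, 0.01):
    trajectory = [expected_ld(d_start, theta, t) for t in (1, 10, 100)]
    print(f"theta = {theta}: " + ", ".join(f"{d:.4f}" for d in trajectory))
```

With unlinked loci the LD halves every generation, while at a recombination fraction of 0.01 most of it is still present after ten generations, which is what makes indirect association possible.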
Linkage Disequilibrium • $r^2 = \frac{(p_{AB}\, p_{ab} - p_{Ab}\, p_{aB})^2}{p_A\, p_a\, p_B\, p_b}$ • Haplotypes: AB, ab, Ab, aB
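The r2 measure can be computed directly from the four haplotype frequencies. The sketch below assumes the frequencies sum to 1 and derives the allele frequencies from them; the function name and the example frequencies are illustrative.

```python
def r_squared(p_AB, p_Ab, p_aB, p_ab):
    """r^2 between two biallelic loci from the four haplotype frequencies."""
    total = p_AB + p_Ab + p_aB + p_ab
    if abs(total - 1.0) > 1e-9:
        raise ValueError("haplotype frequencies must sum to 1")
    # Allele frequencies are the margins of the haplotype table.
    p_A, p_a = p_AB + p_Ab, p_aB + p_ab
    p_B, p_b = p_AB + p_aB, p_Ab + p_ab
    denominator = p_A * p_a * p_B * p_b
    if denominator == 0:
        raise ValueError("both loci must be polymorphic")
    d = p_AB * p_ab - p_Ab * p_aB
    return d * d / denominator


# Only AB and ab haplotypes exist: the two loci are perfect proxies.
print(r_squared(0.5, 0.0, 0.0, 0.5))
# All four haplotypes at equal frequency: no LD.
print(r_squared(0.25, 0.25, 0.25, 0.25))
```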
Indirect Association and LD • Sample size required for Direct Association, $n$ • Sample size for Indirect Association = $n / r^2$ • For $r^2 = 0.8$, increase is 25% • For $r^2 = 0.5$, increase is 100%
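The two percentages on this slide follow from dividing by r2. A minimal sketch, assuming a direct-association sample size of 1,000 purely for illustration; both function names are hypothetical.

```python
def indirect_sample_size(n_direct, r2):
    """Sample size for indirect association: n_direct / r2."""
    if not 0.0 < r2 <= 1.0:
        raise ValueError("r2 must lie in (0, 1]")
    return n_direct / r2


def percent_increase(r2):
    """Percent increase in sample size relative to direct association."""
    if not 0.0 < r2 <= 1.0:
        raise ValueError("r2 must lie in (0, 1]")
    return (1.0 / r2 - 1.0) * 100.0


n_direct = 1000  # illustrative direct-association sample size (assumption)
for r2 in (0.8, 0.5):
    n_needed = indirect_sample_size(n_direct, r2)
    print(f"r2 = {r2}: n = {n_needed:.0f} (+{percent_increase(r2):.0f}%)")
```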
Coverage • Percent of all SNPs captured by genotyped SNPs • More genotyped SNPs = better coverage
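Coverage can be sketched as the fraction of SNPs whose best r2 with any genotyped SNP reaches a chosen cutoff. The example below assumes that definition of "captured"; the SNP names and r2 values are hypothetical and not taken from any real data set.

```python
def coverage(best_r2_by_snp, threshold=0.8):
    """Fraction of SNPs whose best r2 with a genotyped SNP is >= threshold."""
    if not best_r2_by_snp:
        raise ValueError("need at least one SNP")
    captured = sum(1 for r2 in best_r2_by_snp.values() if r2 >= threshold)
    return captured / len(best_r2_by_snp)


# Hypothetical best r2 between each known SNP and the genotyped panel.
# A SNP that is itself on the panel has r2 = 1.0.
best_r2 = {"snp1": 1.0, "snp2": 0.92, "snp3": 0.65, "snp4": 0.40}

print(f"coverage at r2 >= 0.8: {coverage(best_r2, 0.8):.0%}")
print(f"coverage at r2 >= 0.5: {coverage(best_r2, 0.5):.0%}")
```

Relaxing the cutoff raises the apparent coverage of the same panel, which is why a coverage figure is only meaningful together with its r2 threshold.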
Diminishing Marginal Returns (Wang and Todd 2003) [Figure labels: $r^2 = 0.5$; $r^2 = 0.8$; 600,000 SNPs; 1,500,000 SNPs]
Number of SNPs needed to capture all SNPs • Depends on: • Population studied • Minor allele frequency of causal SNP • Level of LD ($r^2$) used as a cutoff • 1.4 million selected SNPs for • Caucasians/Asians • Minor allele frequency of 5% and above • $r^2 = 0.8$
The HapMap Project • Initial Goal: • 600,000 SNPs for indirect association • LD information between SNPs • Phase 1: 1 million SNPs • Phase 2: additional 2.9 million SNPs
HapMap • 270 subjects • 45 Chinese • 45 Japanese • 90 Yoruban • 90 European-American • Yoruban and European-American samples: 30 trios each (2 parents, 1 child)
HapMap • SNPs from dbSNP were genotyped • Looked for 1 every 5kb • SNP Validation • Polymorphic • Frequency • Haplotype Estimation • Haplotype tagging SNPs
Two approaches • Positional cloning • expand LD mapping to entire genome • Tool: HapMap SNPs • Candidate gene or Gene-based • Expand the number of genes to all genes • 25,000 genes • Tools: jSNPs, SeattleSNPs, NIEHSSNPs
Genome-wide Association • LD Based • Gene Based
Potentially Functional Regions of a Gene [Figure labels: cis regulator (?); promoter; amino acid coding; RNA processing; transcription regulation]
Comparison of Gene-based and Positional Cloning Designs • Positional Cloning • Agnostic (no biological knowledge needed) • Regulatory regions • SNP sets currently incomplete • Expensive • Gene-based • Efficient: Fewer SNPs need to be genotyped • May miss regulatory regions • Not all SNPs are known