Hunting Disease Genes in the Wilds of the Genome -- II

Richard A. Spritz, M.D. · April 8, 2010 · richard.spritz@ucdenver.edu · 303-724-3107 · HMGP

Why Find Disease Genes?

The Future? Personalized Medicine • Optimized individualized treatments based on genetic diagnosis of disease susceptibilities • Preventative treatments tailored to one’s specific disease risks (“personalized medicine”)

How Do You Find Disease Genes? I. Hypothesis-driven approaches • Candidate gene association • Candidate gene sequencing • Most hypotheses wrong! II. Hypothesis-free approaches • Genomewide association • (Genomewide expression) • Genomewide sequencing (exome, full-genome)

Common, Complex Diseases • Asthma • Autism • Obesity • Preterm birth • Cleft lip/palate • IBD • Diabetes • Cancers • Common traits like height

Common, Complex Diseases: Utility of Experimental Approaches [Figure: risk allele frequency (common to rare) versus effect size (odds ratio, small to large), with regions marked for GWAS, re-sequencing, and linkage]

Hypothesis-Driven Approaches: Candidate genes. Depends on: biological hypothesis (biological candidate); positional hypothesis / information (positional candidate) • Sometimes successful in Mendelian disorders • Low yield in polygenic, multifactorial (“complex”) disorders: pathogenic sequence variants not obvious, often present in normal individuals • Most hypotheses wrong!

Candidate Gene Association Study. Concept: Causal disease variation in gene suggested by known biology ‘tagged’ by nearby polymorphic DNA markers; test for co-occurrence. Because: DNA sequence variations very close together on the same piece of DNA will tend not to be separated by recombination over long periods, and so will be non-randomly co-inherited (“linkage disequilibrium”). Therefore: Genotype known variants in a candidate gene as surrogates for unknown disease-causing variants; can’t discover ‘new’ genes; most hypotheses wrong!
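The “linkage disequilibrium” idea on this slide can be made concrete with the two standard pairwise measures, D and r². The sketch below is illustrative only: the function name and the haplotype and allele frequencies are hypothetical and not taken from the lecture.

```python
def ld_stats(p_ab, p_a, p_b):
    """Return (D, r_squared) for two biallelic loci.

    p_ab: frequency of haplotypes carrying allele A at locus 1 and B at locus 2
    p_a:  frequency of allele A at locus 1
    p_b:  frequency of allele B at locus 2
    """
    # D is the excess of the AB haplotype over what independence would predict.
    d = p_ab - p_a * p_b
    # r^2 rescales D^2 by the allele-frequency variances, giving a value in [0, 1].
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, r2


# Hypothetical frequencies: the marker allele B travels with the causal allele A
# more often than expected by chance.
d, r2 = ld_stats(p_ab=0.18, p_a=0.20, p_b=0.25)
print(f"D = {d:.3f}, r^2 = {r2:.3f}")

# If the two alleles were inherited independently, D would be zero.
d_indep, r2_indep = ld_stats(p_ab=0.20 * 0.25, p_a=0.20, p_b=0.25)
```

A marker with high r² to the causal variant is a good surrogate for it; a marker with r² near zero carries no information about it, however close the two appear on a map.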

Candidate Gene Association Studies • Typically compares SNP allele (or genotype) frequencies in cases versus controls (“case-control” study design) • Easy statistics (Fisher exact test, Chi-square) • Must Bonferroni correct for multiple-testing • Must ethnically match cases and controls • Easy, cheap • Most powerful for common risk alleles • Can detect common alleles with small allele-specific effects (i.e. “complex”, polygenic traits) • Most common published type of “genetic study” • Most hypotheses wrong!
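As a rough illustration of the “easy statistics” mentioned above, the following sketch runs an allelic chi-square test on a 2x2 table of allele counts and then applies a Bonferroni threshold. The counts and the number of SNPs tested are made up for the example, and the function name is hypothetical; the p-value uses the closed form for a chi-square distribution with 1 degree of freedom.

```python
import math


def allelic_chi_square(case_a, case_b, ctrl_a, ctrl_b):
    """Pearson chi-square (1 df, no continuity correction) on allele counts.

    case_a / case_b: counts of allele A / allele B among case chromosomes
    ctrl_a / ctrl_b: counts of allele A / allele B among control chromosomes
    """
    n = case_a + case_b + ctrl_a + ctrl_b
    numerator = n * (case_a * ctrl_b - case_b * ctrl_a) ** 2
    denominator = (
        (case_a + case_b) * (ctrl_a + ctrl_b) * (case_a + ctrl_a) * (case_b + ctrl_b)
    )
    chi2 = numerator / denominator
    # Upper-tail probability of a chi-square variable with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical study: 1000 cases and 1000 controls, so 2000 alleles per group.
chi2, p_value = allelic_chi_square(case_a=460, case_b=1540, ctrl_a=400, ctrl_b=1600)

# Bonferroni: divide the nominal alpha by the number of tests performed.
alpha = 0.05
n_snps_tested = 10  # assumed for illustration
bonferroni_threshold = alpha / n_snps_tested

print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")
print("nominally significant:", p_value < alpha)
print("significant after Bonferroni:", p_value < bonferroni_threshold)
```

With these invented counts the SNP passes the nominal 0.05 cutoff but not the corrected one, which is the situation the multiple-testing bullet warns about.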

Two Fatal Flaws in Gene-by-Gene Case-Control Design • Must apply multiple-testing correction; true denominator often not known • Must ethnically match cases & controls; otherwise, differences in allele frequencies may reflect different genetic backgrounds of cases vs. controls, not disease association • Matching is difficult or impossible even in a “homogeneous” population; occult admixture (“stratification”) can lead to false-positives • Even true associations vary between populations • ~96% of published positive case-control associations are false-positives due to population stratification and publication bias

“Population stratification” and false-positive case-control genetic association studies [Figure: Population 1 and Population 2 (blue/green just indicates overall genetic background); Disease; Admixed Study Population 1/2; Prof. Wizard’s Case-Control Study comparing Cases and Controls: “Eureka!”]
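A small numerical sketch of how stratification can manufacture an association. All numbers are invented: a marker allele has different frequencies in two populations and no effect on disease in either, but cases happen to be drawn mostly from one population and controls mostly from the other.

```python
def mixed_allele_freq(frac_pop1, freq_pop1, freq_pop2):
    """Allele frequency in a sample drawn from two populations in given proportions."""
    return frac_pop1 * freq_pop1 + (1 - frac_pop1) * freq_pop2


def odds_ratio(freq_cases, freq_controls):
    """Allelic odds ratio from allele frequencies in cases and controls."""
    return (freq_cases / (1 - freq_cases)) / (freq_controls / (1 - freq_controls))


# Hypothetical marker allele frequencies; the marker is unrelated to disease.
FREQ_POP1 = 0.60
FREQ_POP2 = 0.20

# Within either population, cases and controls share the same allele frequency.
or_within_pop1 = odds_ratio(FREQ_POP1, FREQ_POP1)
or_within_pop2 = odds_ratio(FREQ_POP2, FREQ_POP2)

# Assumed sampling imbalance: 80% of cases but only 20% of controls
# come from population 1.
case_freq = mixed_allele_freq(0.80, FREQ_POP1, FREQ_POP2)
control_freq = mixed_allele_freq(0.20, FREQ_POP1, FREQ_POP2)
spurious_or = odds_ratio(case_freq, control_freq)

print(f"OR within each population: {or_within_pop1:.2f}, {or_within_pop2:.2f}")
print(f"case freq = {case_freq:.2f}, control freq = {control_freq:.2f}")
print(f"OR in the pooled study: {spurious_or:.2f}")
```

The pooled odds ratio is well above 1 even though the allele does nothing, purely because ancestry differs between the case and control groups.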

“Family-based” association studies: • Compare allele transmission from parents to patients • Much less prone to false-positives • Require nuclear families; difficult for adult disease (parents often not available/living)

“Family-Based” Association Studies • Avoid stratification; each family is its own control • “Transmission disequilibrium test” (TDT) compares the transmission frequency of marker alleles from parents to affected offspring in “trios” with the theoretical 50%
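The TDT comparison against 50% is commonly computed as a McNemar-type chi-square on transmissions from heterozygous parents. The sketch below assumes that standard form; the transmission counts are hypothetical.

```python
import math


def tdt(transmitted, untransmitted):
    """Transmission disequilibrium test statistic and p-value (1 df).

    transmitted:   times heterozygous parents passed the test allele
                   to an affected child
    untransmitted: times heterozygous parents passed the other allele
    """
    b, c = transmitted, untransmitted
    # Under no association, each allele is transmitted half the time,
    # so b and c should be about equal.
    chi2 = (b - c) ** 2 / (b + c)
    p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square upper tail, 1 df
    return chi2, p_value


# Hypothetical trio data: 220 informative (heterozygous) parental transmissions.
chi2, p_value = tdt(transmitted=130, untransmitted=90)
print(f"TDT chi2 = {chi2:.2f}, p = {p_value:.4f}")

# Exactly 50% transmission gives no evidence of association.
chi2_null, p_null = tdt(transmitted=100, untransmitted=100)
```

Only heterozygous parents are informative, which is one reason trio designs need more families than the raw sample size suggests.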

Hypothesis-Free Approaches Genome-Wide Association Studies (GWAS) • Relatively recent approach (>300 published): • Genotype hundreds of thousands to millions of SNPs across genome using microarrays; extremely expensive • Case-control or family-based (trio) design • Requires no hypotheses about pathogenesis; can discover new genes • Can discover common alleles with small effects • Can provide very fine localization

Hypothesis-free approaches: Genome-wide association studies (GWAS) • Can apply appropriate multiple testing correction • - “Genomewide significance” P < 5 × 10⁻⁸ • Still requires ethnic matching of cases and controls • - Can correct for population stratification: “principal components” analysis; genomic inflation factor, “genomic control” • Can discover new, unknown genes; power similar to candidate gene case-control study • Case-control “associations” require independent confirmation
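The genomewide significance threshold quoted above matches a Bonferroni correction of a 0.05 error rate for roughly one million independent tests. That count is the conventional assumption, not something stated on the slide. A minimal arithmetic sketch:

```python
import math

ALPHA = 0.05
N_INDEPENDENT_TESTS = 1_000_000  # assumed effective number of independent tests

# Bonferroni-style genomewide threshold.
threshold = ALPHA / N_INDEPENDENT_TESTS

# The same threshold on the -log10(P) scale used for Manhattan plots.
neg_log10_threshold = -math.log10(threshold)


def is_genomewide_significant(p_value, cutoff=threshold):
    """True if a per-SNP p-value passes the genomewide cutoff."""
    return p_value < cutoff


print(f"threshold = {threshold:.1e}")
print(f"-log10(threshold) = {neg_log10_threshold:.2f}")
print(is_genomewide_significant(3e-9), is_genomewide_significant(1e-6))
```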

The Genomewide Association Study (GWAS) Manolio TA. N Engl J Med 2010;363:166-176.

Meta-Analysis of Genomewide Association Studies Manolio TA. N Engl J Med 2010;363:166-176.

Genomewide Dataset: “Quantile-Quantile (QQ) Plot” [Figure: QQ plots with genomic inflation factor 1.11 and genomic inflation factor 1.00] Correct test statistics by the “genomic control” method
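One common way to compute the genomic inflation factor and apply the “genomic control” correction is sketched below: lambda is the median observed chi-square statistic divided by the median expected under the null (about 0.455 for 1 degree of freedom), and every statistic is divided by lambda. The five statistics are a toy list chosen so that lambda comes out near 1.11; a real GWAS would use every SNP in the dataset.

```python
import statistics

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_control(chi2_values):
    """Return (lambda, corrected statistics) using the genomic control method."""
    lam = statistics.median(chi2_values) / CHI2_1DF_MEDIAN
    # By convention lambda is not allowed to fall below 1,
    # so statistics are never inflated by the correction.
    lam = max(lam, 1.0)
    corrected = [x / lam for x in chi2_values]
    return lam, corrected


# Toy per-SNP chi-square statistics (hypothetical).
observed = [0.1, 0.3, 0.505, 0.9, 2.0]
lam, corrected = genomic_control(observed)
corrected_median = statistics.median(corrected)

print(f"lambda = {lam:.2f}")
print(f"median after correction = {corrected_median:.4f}")
```

After correction the median statistic sits at its null expectation, which corresponds to an inflation factor of 1.00.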

Genome-Wide Association Studies: “Manhattan plot” Per-SNP -log(P values) across the genome for association of SNP allele frequency differences between patients with generalized vitiligo and controls (all Caucasian)

Genome-Wide Association Studies • Very large number of SNPs tested (500,000 – 2,000,000) presents huge multiple-testing problem; requires at least ~1000 cases and ~1000 controls • Many SNPs in linkage disequilibrium (i.e. correlated); simple Bonferroni correction too strict (assumes independence) • Can minimize the number of SNPs genotyped by genotyping “tagSNPs” (SNPs that ‘tag’ specific haplotype blocks from HapMap) • “Significant” associations require confirmation by independent follow-up association study of specific SNPs to reduce multiple-testing complexity

Personalized Medicine: The case of the ‘missing heritability’ • Disease risk genes found by GWAS account for only a small fraction of genetic risk • > Type 1 diabetes: ~50 genes, ~6.5% of genetic risk • Are there a virtually unlimited number of additional genes, each conferring small additional risk? • > Maybe, but probably not • Have we under-estimated fraction of genetic risk already accounted for? • > Maybe. GWAS misses rare risk alleles • Have we over-estimated total genetic component of risk? • > Maybe, but not ten-fold

Hypotheses of Common, “Complex” Disease • Common disease, common variant hypothesis (Reich & Lander, 2001) • versus • Rare variant hypothesis (Pritchard, 2001; Pritchard and Cox, 2002)

Complex Diseases: Utility of Experimental Approaches [Figure: risk allele frequency (common to rare) versus effect size (odds ratio, small to large), with regions marked for GWAS, re-sequencing, and linkage]

Personalized Medicine: The case of the ‘missing heritability’ • Disease risk genes found by GWAS account for only a small fraction of genetic risk • > Type 1 diabetes: ~50 genes, ~6.5% of genetic risk • Implies that detailed prediction via personalized medicine may not be realistic • Are there a virtually unlimited number of additional genes, each conferring small additional risk? • > Maybe, but probably not • Have we under-estimated fraction of genetic risk already accounted for? • > Maybe. GWAS misses rare risk alleles • Have we over-estimated total genetic component of risk? • > Maybe, but not ten-fold • What does that mean for personalized medicine? Will it work? • > Maybe. Odds ratio vs. population attributable risk
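The closing contrast between odds ratio and population attributable risk can be illustrated with Levin's formula for the population attributable fraction, PAF = p(RR - 1) / (1 + p(RR - 1)), where p is the frequency of risk-allele carriers and RR is the relative risk (approximated by the odds ratio when disease is rare). The two variants below are hypothetical, and the lecture does not say which formula it had in mind.

```python
def population_attributable_fraction(carrier_freq, relative_risk):
    """Levin's population attributable fraction.

    carrier_freq:  proportion of the population carrying the risk allele
    relative_risk: risk in carriers divided by risk in non-carriers
                   (the odds ratio approximates this for rare disease)
    """
    excess = carrier_freq * (relative_risk - 1)
    return excess / (1 + excess)


# Hypothetical common variant with a small effect (typical GWAS finding).
paf_common = population_attributable_fraction(carrier_freq=0.30, relative_risk=1.2)

# Hypothetical rare variant with a large effect.
paf_rare = population_attributable_fraction(carrier_freq=0.001, relative_risk=5.0)

print(f"common, OR 1.2: PAF = {paf_common:.3f}")
print(f"rare,   OR 5.0: PAF = {paf_rare:.4f}")
```

A large odds ratio matters a great deal to the individual carrier, yet can explain very little disease in the population, while a weak common variant can account for more cases overall without being useful for individual prediction.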

Deep re-sequencing Combined hypothesis-based and hypothesis-free approaches • High-throughput DNA sequencing • Biological candidate genes • GWAS signals (specific genes or genes within regions) • Must distinguish potentially causal variants from non-pathological variation (1000 Genomes Project data will help) • Prioritize for follow-up functional analyses

Exome/Genome sequencing Hypothesis-free approach • High-throughput DNA sequencing • - Genome • - Exome (1% of genome) • Must distinguish potentially causal variants from non-pathological variation (1000 Genomes Project data will help) • Predict based on Mendelian inheritance • Compare across unrelated families • Prioritize for follow-up functional analyses

Variant Prioritization in Exome/Genome Sequencing • Missense (non-synonymous) substitutions • Most rare (<1%) missense may be deleterious • > MAQ, Bowtie, SOAP2 • Nonsense, frameshift mutations • Splice junction mutations • Exonic splice enhancer mutations • > SKIPPY • INDELs, CNVs, translocations • > GSNAP • ENSEMBL Regulatory Feature variants
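The filtering logic described on the last two slides (keep rare, potentially damaging variants, then compare across unrelated families) might be sketched as below. This is a toy illustration, not the output format of any of the tools named above: the record fields, gene names, consequence labels, and family data are all hypothetical.

```python
# Consequence classes treated as potentially damaging (labels are assumptions).
DAMAGING = {"missense", "nonsense", "frameshift", "splice_site"}


def candidate_genes(variants, max_freq=0.01):
    """Genes carrying at least one rare (< max_freq), potentially damaging variant."""
    return {
        v["gene"]
        for v in variants
        if v["consequence"] in DAMAGING and v["pop_freq"] < max_freq
    }


def shared_candidates(families):
    """Genes that are candidates in every family (intersection across families)."""
    gene_sets = [candidate_genes(variants) for variants in families.values()]
    return set.intersection(*gene_sets) if gene_sets else set()


# Hypothetical variant calls from affected members of two unrelated families.
families = {
    "family_1": [
        {"gene": "GENE_A", "consequence": "missense", "pop_freq": 0.0005},
        {"gene": "GENE_B", "consequence": "synonymous", "pop_freq": 0.0002},
        {"gene": "GENE_C", "consequence": "nonsense", "pop_freq": 0.0},
    ],
    "family_2": [
        {"gene": "GENE_A", "consequence": "frameshift", "pop_freq": 0.0},
        {"gene": "GENE_D", "consequence": "missense", "pop_freq": 0.12},
    ],
}

family_1_genes = candidate_genes(families["family_1"])
shared = shared_candidates(families)
print("family 1 candidates:", sorted(family_1_genes))
print("shared across families:", sorted(shared))
```

Genes that survive the intersection are the ones to prioritize for follow-up functional analyses. A population frequency source such as the 1000 Genomes Project data would supply the `pop_freq` values in practice.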
