Whole Genome Sequencing for Colorectal Cancer

Ulrike (Riki) Peters
Fred Hutchinson Cancer Research Center
University of Washington

Overview
• Significance and rationale
• Current efforts on rare and less frequent variants
• Specific aims and design of whole genome sequencing grant

Progress of Genomic Research (adapted from Green and Guyer, Nature 2011)
[Figure: five domains of genomic research (structure of genomes, biology of genomes, biology of diseases, advancing medicine, improving healthcare & prevention) across four periods: 1990-2003 (Human Genome Project), 2004-2010, 2011-2020, beyond 2020]

Examples of GWAS for Drug Targets
Examples of GWAS for Drug Repositioning
For additional examples, see Sanseau et al. Nat Biotechnol 2012

Use of GWAS Findings to Inform Screening Decisions (using breast cancer as example)
Colors show 10-year risk of breast cancer at different risk percentiles based on 13 GWAS loci
Average 10-year risk of breast cancer for a 50-year-old woman is 2.4%
So et al. Am J Hum Genet 2010

What is Known About the Genetic Contribution to Colorectal Cancer
Scandinavian Twin Registry, Lichtenstein et al. New Engl J Med 2000

Published and Newly Discovered Colorectal Cancer Susceptibility Loci
Colorectal Cancer GWAS
• 21 GWAS loci
• Each SNP associated with a modest increase in risk
Figure legend: Identified within GECCO
Houlston Nat Genet 2010; Tomlinson Nat Genet 2008; Zanke Nat Genet 2007; Haiman Nat Genet 2007; Hutter BMC Cancer 2010; Tenesa Nat Genet 2008; Tomlinson Nat Genet 2011; COGENT Nat Genet 2008; Jaeger Nat Genet 2008; Broderick Nat Genet 2007; Peters, Hunter Hum Genet 2011; Dunlop Nat Genet 2012; Peters Gastroenterol (submitted)

Estimated Total Number of GWAS Hits
Park et al. Nat Genet 2010
=> Known familial syndromes, such as FAP and Lynch syndrome, explain less than 3-5%

What Explains Missing Heritability of Cancer?
• Additional familial syndromes
• Heritable epigenomic variability
• Gene-gene and gene-environment interaction
• Less frequent and rare variants
• Structural variation / copy number variation (CNV)
• Others, or heritability may be overestimated

Most Genetic Variation is Rare
• Next-generation sequencing can identify rare variants
• GWAS only investigated ~15% of genetic variation
[Figure: x-axis is minor allele frequency; colors: green = ESP, orange = ENCODE, blue = HapMap]

Feasibility of Identifying Genetic Variants by Risk Allele Frequency and Strength of Genetic Effect
Manolio et al. Nature 2009

Overview
• Significance and rationale
• Current efforts on rare and less frequent variants
• Specific aims and design of whole genome sequencing grant

Current Efforts in GECCO (Genetics and Epidemiology of Colorectal Cancer Consortium) to Search for Less Frequent and Rare Variants
• Imputation to 1000 Genomes Project in ~28,000 samples with GWAS
• Exome chip genotyping
  • On about 25,000 samples
• CIDR pilot
  • Whole exome sequencing on 130 high-risk colorectal cancer cases + 30 controls
[Figure labels: FHCRC Coordinating Center; the global view of genetic contribution to colorectal cancer; ~30,000 subjects]
U01 and X01, Peters, 2009-2013

NHLBI Exome Sequencing Project
Whole exome sequencing of 7,000 European and African Americans to identify rare variants associated with common complex diseases
• Sequencing centers
  • Broad
  • University of Washington
• Cohorts
  • Women's Health Initiative
  • HeartGO
    • ARIC, CARDIA, CHS, FHS, JHS, MESA
  • LungGO
• Phenotypes
  • Early-onset MI
  • Early-onset/FH+ stroke
  • Extreme BMI/T2D
  • Extreme lipids
  • Extreme blood pressure
  • COPD
  • Pulmonary hypertension
  • Cystic fibrosis

Whole Exome vs Whole Genome
• Exome covers only 1-2% of genome
• 88% of all GWAS findings are outside of the well-studied protein-coding regions
• 78% of GWAS findings with MAF<5%

Junk No More: ENCODE Project Finds "Biochemical Functions for 80% of the Genome"
• The ENCODE Project Consortium, "An integrated encyclopedia of DNA elements in the human genome," Nature 2012

Overview
• Significance and rationale
• Current efforts in GECCO on rare and less frequent variants
• Specific aims and design of whole genome sequencing grant

Aims of the U01 Sequencing Grant
• Aim 1. To identify novel CRC susceptibility variants across the genome, mainly variants with allele frequency 0.1-5%
  • Rare variants <1%
  • Less frequent variants 1-5%
  • Common variants >5%
• Aim 2. To investigate whether known environmental risk factors for CRC modify genetic susceptibility to CRC (gene-environment interactions)
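The frequency classes named in Aim 1 can be written as a small lookup. This is only an illustrative sketch: the function names are hypothetical, and the handling of values exactly at the 1% and 5% boundaries is an assumption, since the slide does not say which class they belong to.

```python
def frequency_class(maf: float) -> str:
    """Map a minor allele frequency (0 to 0.5) to a class from the slide.

    Assumption: exactly 1% and exactly 5% are counted as "less frequent".
    """
    if not 0.0 <= maf <= 0.5:
        raise ValueError("minor allele frequency must lie between 0 and 0.5")
    if maf < 0.01:
        return "rare"
    if maf <= 0.05:
        return "less frequent"
    return "common"


def in_primary_target_range(maf: float) -> bool:
    """True when the variant falls in the 0.1-5% range that Aim 1 mainly targets."""
    return 0.001 <= maf <= 0.05


for maf in (0.0005, 0.005, 0.03, 0.20):
    print(f"MAF {maf:.4f}: {frequency_class(maf)}, "
          f"primary target: {in_primary_target_range(maf)}")
```

Note that a very rare variant (below 0.1%) is still "rare" by the slide's definition but sits outside the range the aim mainly targets.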

Study Design Overview R01; PI: Peters

Funding Information
• 17% budget cut
• 4 years instead of 5 years
• U01 designation
• Expected start date: before 9/30/12
Total budget cut 33%

[Study design flow diagram]
• Aim 1.1: Whole genome sequencing, N=1,600 cases, 1,600 controls
• Aim 1.2: Imputation of WGS data, N=9,129 cases, 11,728 controls
• Combined: N=10,729 cases, 13,328 controls; ~18M variants
• Aim 1: Association testing (individual & aggregated variants)
• Aim 2: Gene-environment interaction analyses (2-stage screening, weighted hypothesis, empirical Bayes)
• Aim 1.3: Replication, N=3,100 cases, 3,100 controls; ~3,000 variants
Total sample size is 13,829 cases and 16,428 controls
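The sample sizes in the study design add up as stated; a short check of the arithmetic is sketched below. The stage names and the label "discovery" for the sequenced plus imputed samples are shorthand introduced here, not terms from the slide.

```python
# (cases, controls) per stage, copied from the study design numbers.
STAGES = {
    "whole genome sequencing": (1_600, 1_600),
    "imputation into GWAS samples": (9_129, 11_728),
    "replication genotyping": (3_100, 3_100),
}


def combined(stage_names):
    """Sum cases and controls over the named stages."""
    cases = sum(STAGES[name][0] for name in stage_names)
    controls = sum(STAGES[name][1] for name in stage_names)
    return cases, controls


# "Discovery" is shorthand used here for the sequenced plus imputed samples.
discovery = combined(["whole genome sequencing", "imputation into GWAS samples"])
overall = combined(list(STAGES))

print(f"Discovery: {discovery[0]:,} cases, {discovery[1]:,} controls")
print(f"Total:     {overall[0]:,} cases, {overall[1]:,} controls")
```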

Classes of Genetic Variants Being Examined

Studies

Data Harmonization of Environmental Risk Factors
• Collecting 74 variables in 11 categories
• Multi-step collaborative process leading to common data elements with standardized definitions, permissible values and coding
Meta-analysis across 15 studies

Sequencing and Genotyping
• At Genome Sciences, University of Washington
• Whole genome sequencing
  • At lower depth
  • Illumina HiSeq
  • In years 1 to 3
  • Total ~1,600 cases and 1,600 controls
    • Year 1: ~600
    • Year 2: ~1,000
    • Year 3: ~1,700
• Replication genotyping
  • In years 3 and 4
  • 6,200 samples for 3,000 SNPs
  • 2,400 samples for 384 SNPs

Variant Calling Based on Sequencing Data
• Variant calling
  • Depends on depth of sequencing
  • Multi-sample calling improves accuracy; hence, we will call in batches with increasing numbers of samples
• Structural variation / copy number variant (CNV) calling
  • Indel and CNV calling is error-prone and requires genotyping follow-up
  • Follow-up genotyping on 384 SNPs in 1,600 samples

Imputation of Sequencing Data into GWAS
• Imputation
  • Use whole genome sequencing data as reference panel to impute into samples with only GWAS data
• Important points raised:
  • Imputation accuracy improves with increasing sample size of the reference panel (samples with whole genome sequencing data)
  • Imputation accuracy improves with denser GWAS platforms
• Follow-up genotyping on 384 SNPs in 800 samples
[Figure: whole genome sequence (3,200 samples, ~18M variants) and GWAS (19,000 samples)]
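One common way to quantify imputation accuracy for a single variant is the squared correlation between imputed allele dosages and directly genotyped allele counts in the same samples. The sketch below assumes the follow-up genotyping would be used this way, which the slide does not state; the function name and the toy numbers are made up for illustration.

```python
def dosage_r2(imputed, genotyped):
    """Squared Pearson correlation between imputed dosages (0.0 to 2.0)
    and genotyped allele counts (0, 1 or 2) for one variant.

    Returns NaN when either vector has no variation (e.g. a monomorphic SNP).
    """
    n = len(imputed)
    if n != len(genotyped) or n < 2:
        raise ValueError("need two equally long vectors with at least 2 samples")
    mean_x = sum(imputed) / n
    mean_y = sum(genotyped) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(imputed, genotyped))
    var_x = sum((x - mean_x) ** 2 for x in imputed)
    var_y = sum((y - mean_y) ** 2 for y in genotyped)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov * cov / (var_x * var_y)


# Toy data (invented): six samples at one SNP.
imputed_dosage = [0.1, 0.9, 1.8, 0.0, 1.1, 0.2]
genotype_count = [0, 1, 2, 0, 1, 0]
print(f"r2 = {dosage_r2(imputed_dosage, genotype_count):.3f}")
```

Values close to 1 indicate that the imputed dosages track the genotyped alleles well; rare variants typically need a larger reference panel to reach that level.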

Statistical Analysis
• Marginal and burden testing
  • Single variant test
  • Aggregated tests to test all rare variants across a defined region, such as a gene
  • Motivation:
    • Mendelian diseases show that multiple different mutations can lead to disease
    • Rare variants tested individually have limited power to show association (unless highly penetrant)
• Gene-environment interaction testing
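The motivation for aggregated tests can be illustrated with a simple collapsing test: each sample is scored as a carrier if it has any rare allele in the region, and carrier status is compared between cases and controls. This is a minimal sketch of one basic aggregation approach, not necessarily the test planned for the grant; the function names and the toy genotypes are invented for illustration.

```python
def carrier_table(genotypes, is_case):
    """Build a 2x2 table [[case carriers, case non-carriers],
    [control carriers, control non-carriers]].

    genotypes: one list per sample with rare-allele counts per variant.
    """
    table = [[0, 0], [0, 0]]
    for alleles, case in zip(genotypes, is_case):
        row = 0 if case else 1
        col = 0 if any(count > 0 for count in alleles) else 1
        table[row][col] += 1
    return table


def chi_square_2x2(table):
    """Pearson chi-square statistic (1 degree of freedom, no continuity correction)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


# Toy data (invented): 4 cases then 4 controls, 3 rare variants in one gene.
genotypes = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 0],
             [0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 1]]
is_case = [True] * 4 + [False] * 4

burden_table = carrier_table(genotypes, is_case)
burden_stat = chi_square_2x2(burden_table)

# Single-variant test for the first variant only, for comparison.
single_table = carrier_table([[g[0]] for g in genotypes], is_case)
single_stat = chi_square_2x2(single_table)

print("aggregated:", burden_table, round(burden_stat, 3))
print("variant 1: ", single_table, round(single_stat, 3))
```

In this toy example each variant is seen only once or twice, so any single-variant statistic is small, while the aggregated carrier status separates cases from controls more clearly. With counts this small an exact test would be preferable in practice.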

Advisory Committee
• NCI
  • Stephen Chanock
  • Daniela Seminara
  • Peggy Tucker
• Suggestions for external investigators
  • Mike Boehnke (U of Michigan)
  • Elaine Mardis (Washington U in St. Louis)
  • Nicole Soranzo (Wellcome Trust / Sanger Inst)
  • Stephen Thibodeau (Mayo Clinic, Rochester)

Timeline
