Human Sequencing

Human Sequencing Stefano Lise Bioinformatics & Statistical Genetics (BSG) Core The Wellcome Trust Centre for Human Genetics (WTCHG), Oxford Email: stefano@well.ox.ac.uk

Outline • Human genetic variation in health and disease • How do we identify pathogenic mutations amongst many genomic variants? • The WGS500 project • Whole-genome sequencing of 500 genomes of clinical significance

Human Genome • The (haploid) reference human genome is about 3 x 10^9 bases • Human genome is diploid => ~ 2 x 3 x 10^9 bases • The exome is ~ 30-60 Mb (1-2% of the genome) • Some more numbers (from GENCODE, Nov 2012) • 20,387 protein-coding genes • 81,626 protein-coding transcripts • 13,220 long non-coding RNA genes • 9,173 small non-coding RNA genes • 13,419 pseudogenes
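
The sizes quoted on this slide can be checked with a few lines of arithmetic. This is a minimal sketch using the rounded figures from the slide; the variable names are illustrative only.

```python
# Rounded figures taken from the slide; all names are illustrative.
HAPLOID_GENOME_BASES = 3e9          # ~3 x 10^9 bases
EXOME_RANGE_BASES = (30e6, 60e6)    # ~30-60 Mb

# A diploid genome carries two copies of each chromosome.
diploid_bases = 2 * HAPLOID_GENOME_BASES

# Fraction of the haploid genome covered by the exome (lower and upper bound).
exome_fractions = [size / HAPLOID_GENOME_BASES for size in EXOME_RANGE_BASES]

print(f"Diploid genome: {diploid_bases:.1e} bases")
for size, frac in zip(EXOME_RANGE_BASES, exome_fractions):
    print(f"Exome of {size / 1e6:.0f} Mb = {frac:.0%} of the haploid genome")
```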

Sequence Variants • Single nucleotide variants (SNV) • Small insertions/deletions (INDEL) • Structural variants • Large insertions/deletions • Inversions • Copy number variants • Translocations • …

Functional Consequences From Ensembl

Human Genome Variation • The 1000 Genomes Project (www.1000genomes.org) provides a catalogue of all (most) types of human genetic variation • Population-scale genome sequencing • Phase 1 (October 2012) • High-throughput sequencing of 1092 human genomes • Identified up to 98% of all SNPs with a frequency > 1% in the population • 1,500 additional genomes in the next (final) phase

Human Genome Variation (1000 Genomes Project, Nature 491, 56-65, 2012) • LOF = loss-of-function variant (stop-gain, frameshift indel, essential splice site) • Conserved sites = sites with GERP conservation score > 2

Rare and Common Diseases • Only 1 or 2 causal variants • Adapted from TA Manolio et al., Nature 461, 747-753 (2009)

Sequencing Strategies • Targeted sequencing • E.g. screening of known genes associated with cardiomyopathies or ataxia • Applications in clinical diagnostics • Whole exome sequencing • Protein-coding regions • Whole genome sequencing • Can detect all types of variation relevant to pathology in a single experiment • Still costly, but the cost is decreasing rapidly

Identifying causal variants: Assumptions and Filters • After variant calling, filter out low quality (confidence) calls • Variant is unique in patients or at least very rare in the general population, e.g. < 1% • Use of in-house databases too • Variant has complete penetrance: every carrier will have the phenotype • In general these steps will not identify the pathogenic variant uniquely but will restrict the list of candidates. Further analysis required
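
The filtering assumptions above can be sketched as a simple predicate over variant calls. This is an illustrative sketch, not the pipeline used in the project: the `Variant` record, the quality cut-off of 30 and the in-house count field are assumptions; only the < 1% population frequency threshold comes from the slide. The toy variants are invented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Variant:
    # Hypothetical record; field names are illustrative.
    chrom: str
    pos: int
    ref: str
    alt: str
    qual: float                      # caller-reported quality
    pop_af: Optional[float] = None   # population allele frequency; None = not observed
    inhouse_count: int = 0           # carriers in an in-house database


def passes_filters(v, min_qual=30.0, max_pop_af=0.01, max_inhouse=0):
    """Keep confident calls that are unique or very rare in the population."""
    if v.qual < min_qual:            # low-confidence call (cut-off is an assumption)
        return False
    if v.pop_af is not None and v.pop_af >= max_pop_af:
        return False                 # too common in the general population
    return v.inhouse_count <= max_inhouse


# Toy data, invented for illustration.
variants = [
    Variant("chr11", 100, "C", "A", qual=50.0),
    Variant("chr1", 200, "G", "T", qual=10.0),
    Variant("chr2", 300, "A", "G", qual=60.0, pop_af=0.05),
    Variant("chr3", 400, "T", "C", qual=60.0, pop_af=0.001, inhouse_count=12),
    Variant("chr4", 500, "T", "C", qual=60.0, pop_af=0.001),
]
kept = [v for v in variants if passes_filters(v)]
print([(v.chrom, v.pos) for v in kept])
```

As the slide notes, such filters only shorten the candidate list; they do not by themselves single out the pathogenic variant.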

Ideal Scenario • Variant is common amongst all affected and absent in all unaffected • Variant is in a gene with known function and disrupts the protein

(Almost) Ideal Scenario

Variant Prioritization • Focus first on protein-coding regions (exome) • Nonsense and missense mutations • Frame-shift indels • Essential splice-site disruptions • Easier to interpret the consequences of the variant • E.g. mutation affects catalytic residues in an enzyme • Targeted exome sequencing has been very successful in disease gene discovery • Cautionary note: on average each “normal, healthy” individual carries • 10-20 rare LOF variants • 2-5 rare, disease-associated variants
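
One way to express the prioritization above is to rank variants by their predicted coding consequence. The sketch below is an assumption-laden illustration: the consequence labels follow Sequence Ontology naming as used by Ensembl, but the numeric ranking and the `prioritise` helper are hypothetical and the variants are invented.

```python
# Lower rank = examined first. The ranking itself is an assumption.
SEVERITY = {
    "stop_gained": 0,              # nonsense
    "frameshift_variant": 0,       # frame-shift indel
    "splice_donor_variant": 0,     # essential splice site
    "splice_acceptor_variant": 0,  # essential splice site
    "missense_variant": 1,
    "synonymous_variant": 2,
    "intron_variant": 3,
}


def prioritise(variants):
    """Sort variants so that likely loss-of-function changes come first."""
    unknown_rank = len(SEVERITY)   # unrecognised consequences go last
    return sorted(variants, key=lambda v: SEVERITY.get(v["consequence"], unknown_rank))


# Toy input, invented for illustration.
calls = [
    {"id": "v1", "consequence": "intron_variant"},
    {"id": "v2", "consequence": "missense_variant"},
    {"id": "v3", "consequence": "stop_gained"},
    {"id": "v4", "consequence": "frameshift_variant"},
]
ranked_ids = [v["id"] for v in prioritise(calls)]
print(ranked_ids)
```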

Non-coding variants • Many functional elements lie outside protein-coding regions (ENCODE) • Variants can disrupt • Regulatory elements, e.g. transcription factor binding sites • Splicing regulatory elements (branch sites, intronic splicing enhancers/inhibitors, …) • ncRNA transcripts • … • Many non-coding variants in individual genome sequences lie in ENCODE-annotated functional regions • At least as many as in protein-coding genes

Disease models • Diseases can be • Mendelian • Dominant, recessive or X-linked • Sporadic • De novo mutation • Cancer • Driver mutations • Analysis strategy needs to be adjusted to each disease category

Autosomal Dominant Disease • Familial, inherited disorder • Search for heterozygous variants • Present in affected individuals, absent in non-affected ones • Linkage analysis can substantially narrow the genomic search space • E.g. genotype all family members on a SNP array and sequence one or two affected members
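
The dominant-model search can be sketched as a segregation check. This is a minimal illustration under stated assumptions: genotypes are coded as alternate-allele counts (0, 1, 2), a missing genotype is treated as 0, and the function name, sample names and variant identifiers are hypothetical.

```python
def dominant_candidates(genotypes, affected, unaffected):
    """Return variant ids that are heterozygous in every affected
    individual and absent from every unaffected one.

    genotypes: dict mapping variant id -> dict mapping sample -> alt allele count.
    Missing genotypes are treated as 0 (an assumption of this sketch).
    """
    hits = []
    for variant_id, calls in genotypes.items():
        in_all_affected = all(calls.get(s, 0) == 1 for s in affected)
        in_no_unaffected = all(calls.get(s, 0) == 0 for s in unaffected)
        if in_all_affected and in_no_unaffected:
            hits.append(variant_id)
    return hits


# Toy pedigree, invented for illustration.
genotypes = {
    "varA": {"aff1": 1, "aff2": 1, "unaff1": 0},   # segregates with disease
    "varB": {"aff1": 1, "aff2": 0, "unaff1": 0},   # missing in one affected
    "varC": {"aff1": 1, "aff2": 1, "unaff1": 1},   # carried by an unaffected
}
candidates = dominant_candidates(genotypes, ["aff1", "aff2"], ["unaff1"])
print(candidates)
```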

Recessive Disease • Suspected consanguinity • Search for homozygous variants • Heterozygous in parents • Homozygosity mapping by SNP arrays can substantially reduce the number of variants for follow-up • No indication of consanguinity • Search for compound heterozygous variants • Affected individual carries two separate variants in the same gene • Each parent carries one of the two variants
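
The two recessive models above can be sketched for a single parent-child trio. This is an illustrative sketch only: genotypes are coded as alternate-allele counts, and the record layout, function names and gene names are hypothetical.

```python
from collections import defaultdict


def homozygous_candidates(variants):
    """Homozygous in the child, heterozygous in both parents."""
    return [v["id"] for v in variants
            if v["child"] == 2 and v["mother"] == 1 and v["father"] == 1]


def compound_het_genes(variants):
    """Genes where the child carries at least one heterozygous variant
    inherited from the mother and a separate one inherited from the father."""
    by_gene = defaultdict(lambda: {"maternal": [], "paternal": []})
    for v in variants:
        if v["child"] != 1:
            continue
        if v["mother"] == 1 and v["father"] == 0:
            by_gene[v["gene"]]["maternal"].append(v["id"])
        elif v["father"] == 1 and v["mother"] == 0:
            by_gene[v["gene"]]["paternal"].append(v["id"])
    return {gene: origin for gene, origin in by_gene.items()
            if origin["maternal"] and origin["paternal"]}


# Toy trio, invented for illustration.
trio = [
    {"id": "v1", "gene": "GENE1", "child": 2, "mother": 1, "father": 1},
    {"id": "v2", "gene": "GENE2", "child": 1, "mother": 1, "father": 0},
    {"id": "v3", "gene": "GENE2", "child": 1, "mother": 0, "father": 1},
    {"id": "v4", "gene": "GENE3", "child": 1, "mother": 1, "father": 0},
]
hom = homozygous_candidates(trio)
comp_het = compound_het_genes(trio)
print(hom, comp_het)
```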

Sporadic Genetic Disease • Dominant disorder, parents are unaffected • Search for de novo mutations • Present in child and not in parents • Expect 50-100 de novo mutations in “normal, healthy” individual • Father’s age effect, 2 extra mutations per year (Kong et al, Nature 488, 471–475, 2012) • Sometimes difficult to distinguish from a recessive disease
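
A de novo search in a trio, and the paternal age effect quoted above, can be sketched as follows. The genotype coding (alternate-allele counts), function names and variant identifiers are assumptions of this sketch; the rate of 2 extra mutations per year of paternal age is the figure from the slide.

```python
def de_novo_candidates(trio_genotypes):
    """Variants carried by the child and absent from both parents.

    trio_genotypes: dict mapping variant id -> dict with alt allele counts
    for "child", "mother" and "father".
    """
    return [variant_id for variant_id, gt in trio_genotypes.items()
            if gt["child"] >= 1 and gt["mother"] == 0 and gt["father"] == 0]


def extra_paternal_mutations(age_difference_years, per_year=2):
    """Additional de novo mutations expected when the father is older
    by the given number of years (linear effect, as quoted on the slide)."""
    return per_year * age_difference_years


# Toy trio, invented for illustration.
trio = {
    "v1": {"child": 1, "mother": 0, "father": 0},   # de novo candidate
    "v2": {"child": 1, "mother": 1, "father": 0},   # inherited from mother
    "v3": {"child": 0, "mother": 0, "father": 1},   # not transmitted
}
found = de_novo_candidates(trio)
print(found, extra_paternal_mutations(10))
```

In practice apparent de novo calls are enriched for genotyping errors, so candidates from a filter like this would normally be validated independently.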

Cancer • Matched normal to tumour samples • Search for somatic variants • Present in tumour(s), absent in normal sample • Identify driver mutations • More on this tomorrow, JB Cazier’s lecture
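
The matched tumour/normal comparison reduces, in its simplest form, to a set difference. This is a minimal sketch with hypothetical names and invented calls; real somatic callers also model sequencing error, tumour purity and coverage, none of which is represented here.

```python
def somatic_variants(tumour_calls, normal_calls):
    """Variants present in the tumour and absent from the matched normal.

    Each call is a (chrom, pos, ref, alt) tuple.
    """
    return sorted(set(tumour_calls) - set(normal_calls))


# Toy calls, invented for illustration.
tumour = [("chr1", 100, "A", "T"), ("chr2", 200, "G", "C"), ("chr3", 300, "C", "T")]
normal = [("chr2", 200, "G", "C")]   # germline variant shared with the tumour
somatic = somatic_variants(tumour, normal)
print(somatic)
```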

Predicting Phenotypic Consequences • Methods based on comparative genomics • Evolution as a measure of deleteriousness • Variants at conserved positions more likely to be deleterious • Several conservation scores • phyloP - single-site score (http://compgen.bscb.cornell.edu/phast/) • GERP - single-site score (http://mendel.stanford.edu/sidowlab/downloads/gerp/index.html) • phastCons - region-based score (http://compgen.bscb.cornell.edu/phast/) • …

Conservation Scores: Benign vs Pathogenic Variants • Gilissen et al., European Journal of Human Genetics (2012) 20, 490–497

β-haemoglobin locus • From GM Cooper & J Shendure, Nature Reviews Genetics 12, 628-640 (2011)

Protein Sequence Variants • Most established methods. They exploit • Amino acid properties, e.g. charge, size, … • Structural information, e.g. local secondary structure, surface/core amino acid, … • Evolutionary information, e.g. pattern of observed substitutions • Database information, e.g. known binding site • Several methods available • SIFT (http://sift.bii.a-star.edu.sg/) • PolyPhen-2 (http://genetics.bwh.harvard.edu/pph2/) • …

PolyPhen-2 (http://genetics.bwh.harvard.edu/pph2/) • Prediction based on sequence, phylogenetic and structural information characterizing the substitution • 8 sequence-based properties • 3 structure-based properties • The 11 properties (features) are used as input to a probabilistic classifier • Trained to differentiate benign from pathogenic variants

Non-coding variants • A substantial fraction of disease causing mutations are not exonic • Probably under-represented in databases • Regulatory variants can have a large effect • More difficult to discover • Non-coding positions less conserved than coding positions • ENCODE has provided a detailed map of regulatory regions • Search for variants that disrupt a consensus sequence motif within a known binding site

Gene Prioritization Methods • Methods focus on genes rather than on variants • Identify the genes most likely to cause a given disease in a list of candidates • Methods combine heterogeneous pieces of information • Shared biological pathways with other disease genes • Orthologous genes involved in similar diseases in model organisms • Localization in affected tissue • …

Follow up • Definite proof of pathogenicity requires • Validation in independent patient cohort • But many diseases are genetically heterogeneous and caused by extremely rare variants • In vitro functional experiments • Evaluate molecular consequences, e.g. disruption of expression or protein folding • In vivo experiments in model organisms • Is the human phenotype reproduced in, e.g., a knock-out mouse?

Bioinformatics Challenges • How reliably can we read and annotate an individual’s genome? • How well can we interpret genetic variation in the context of a clinical presentation? • Community experiment to objectively assess computational methods • Critical Assessment of Genome Interpretation (CAGI 2012) • Distinguish between exomes of Crohn’s disease patients and healthy individuals • PGP genomes: predict clinical phenotypes from genome data, and match individuals to their health records • Whole genomes of a family affected by primary congenital glaucoma: discover the genetic basis of the disease • ... • Critical Assessment of Massive Data Analysis (CAMDA 2013) • Reliable variant calling • …

The WGS500 Project • Collaboration involving the WTCHG, Oxford BRC, Oxford University Hospitals and Illumina • Sequence 500 genomes of clinical significance • Mendelian diseases • Immunological disorders • Cancers • Target coverage: 25x (50x for cancer) • Diverse set of experimental designs • Familial: Linkage information • De novo: trios • Cancer: Tumour-normal, metastases, multiple-mets, .. • Substantial follow-up (screening and functional) to establish candidacy

Overview of processing (flow diagram) • 400 genomes • 100 genomes • Oxford Genomics • Illumina • QC • Read alignment and calls (Eland/Casava) • Large-scale CNV scan • Read alignment (Stampy) • Individual/group variant calls (Platypus) • Homozygosity scan • Reference-compressed archive • Union file • Individual genotypes • Annotation: Frequency (1000G, EVS) • Conservation • Coding consequence (x2) • Predicted effect (x3) • Pathogenicity (HGMD) • Regulatory annotation • Annotated genotypes • Web server

Case Study (PI: Dr A Nemeth) • 3 affected individuals from a highly consanguineous family • Childhood developmental ataxia • Cognitive impairment

Targeted Sequencing • Targeted sequencing on V3 using a panel of > 100 known ataxia genes • Found a homozygous stop codon in SPTBN2 • Mutation present as homozygous in all 3 affected individuals and as heterozygous in parents of V3, by Sanger sequencing • Mutations in SPTBN2 cause spinocerebellar ataxia type 5 (SCA5) • Sometimes referred to as “Lincoln ataxia” • Autosomal dominant, slowly progressing, adult onset • Is the cognitive impairment due to the mutation in SPTBN2? • Could be caused by mutations in a second gene (homozygous or compound heterozygous) • Investigated this possibility using a combination of SNP array and whole genome sequencing

Homozygosity Mapping • SNP array genotyped V1, V2, V3, IV3 and IV4 (~300K SNPs) • Identified regions of homozygosity (ROH) shared by V1, V2 and V3 and not present in either IV3 or IV4 • Homozygosity mapping with PLINK • Found 23 regions totalling 28.7 Mb • Largest segments on chromosome 11
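
The "shared by V1, V2 and V3" step can be sketched as an interval intersection over per-sample ROH calls. This is not PLINK's algorithm, only an illustration of combining regions once they have been called; the function names and coordinates are invented, and the further step of removing regions also present in IV3 or IV4 is omitted.

```python
def intersect(a, b):
    """Intersect two sorted lists of half-open (start, end) intervals."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def shared_roh(roh_by_sample):
    """Regions of homozygosity common to every sample in the dict."""
    samples = list(roh_by_sample)
    shared = roh_by_sample[samples[0]]
    for sample in samples[1:]:
        shared = intersect(shared, roh_by_sample[sample])
    return shared


# Toy ROH calls on one chromosome (coordinates invented for illustration).
roh = {
    "V1": [(0, 100), (200, 300)],
    "V2": [(50, 250)],
    "V3": [(60, 90), (210, 400)],
}
shared = shared_roh(roh)
total_length = sum(end - start for start, end in shared)
print(shared, total_length)
```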

Chromosome 11

Whole Genome Sequencing • Searched for rare, homozygous variants in shared ROH • Present in 1000 Genomes with an allele frequency < 1% • Not observed in other WGS500 samples • Found 68 candidate variants • Based on evolutionary conservation and available information in databases (e.g. HGMD) the only likely pathogenic variant is the stop codon in SPTBN2 • Also excluded a compound heterozygous model (data not shown)

SPTBN2 variant • The position is actually not well conserved • E.g. G->A in gorilla, baboon and mouse • GERP = -6.71 • PhyloP = -1.28 • TGT and TGC encode cysteine • TGA is a stop codon
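
The codon argument above can be made concrete with a small lookup. The sketch uses only the four TGN codons of the standard genetic code; the `consequence` helper is hypothetical, and the example substitutions are illustrative rather than a statement of the exact base change observed in the patients.

```python
# The four TGN codons of the standard genetic code.
CODONS = {"TGT": "Cys", "TGC": "Cys", "TGG": "Trp", "TGA": "Stop"}


def consequence(ref_codon, position, alt_base):
    """Classify a single-base substitution within a TGN codon.

    position is 0-based within the codon.
    """
    alt_codon = ref_codon[:position] + alt_base + ref_codon[position + 1:]
    ref_aa, alt_aa = CODONS[ref_codon], CODONS[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "Stop":
        return "stop-gain"
    return "missense"


# Third-base changes in a cysteine codon can be silent or truncating.
results = {
    "TGC>TGT": consequence("TGC", 2, "T"),
    "TGC>TGA": consequence("TGC", 2, "A"),
    "TGC>TGG": consequence("TGC", 2, "G"),
}
print(results)
```

This shows why poor conservation at a third codon position need not rule out pathogenicity: most changes there may be tolerated while one specific change introduces a stop.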

SPTBN2 knock-out mouse • Investigated a mouse knock-out of SPTBN2 (Mandy Jackson Lab, Edinburgh) • Ataxia (previously reported) • Morphological abnormalities in neurons from prefrontal cortex, an area believed to be important in humans for cognitive tasks • Deficits in object recognition tasks • The mouse model supports the hypothesis that both ataxia and cognitive impairment are caused by the recessive mutation in SPTBN2

WGS500 overview of findings (as of Dec 2012) • Project about 75% complete, with 292 samples (195 case studies) over 38 projects with initial analysis • In 75/195 cases there is at least one variant viewed by the PI and analysts as a strong candidate for causing (or strongly contributing to) the phenotype • 45/82 in Mendelian • 19/61 in Immune • 11/52 in Cancer • Papers in press/submitted to date on • Ataxia, CMS, CLL, Multiple adenomas
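
The per-category counts above can be checked against the overall 75/195 figure and turned into success rates. This is a small arithmetic sketch using only the numbers on the slide; the variable names are illustrative.

```python
# (cases with a strong candidate, total cases) per category, from the slide.
findings = {"Mendelian": (45, 82), "Immune": (19, 61), "Cancer": (11, 52)}

total_hits = sum(hits for hits, _ in findings.values())
total_cases = sum(cases for _, cases in findings.values())

# Fraction of cases with a strong candidate in each category.
rates = {name: hits / cases for name, (hits, cases) in findings.items()}

print(f"Overall: {total_hits}/{total_cases} = {total_hits / total_cases:.1%}")
for name, rate in rates.items():
    print(f"{name}: {rate:.1%}")
```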
