Human Sequencing. Stefano Lise Bioinformatics & Statistical Genetics (BSG) Core The Wellcome Trust Centre for Human Genetics (WTCHG), Oxford Email: stefano@well.ox.ac.uk. Outline. Human genetic variation in health and disease
Human Sequencing Stefano Lise Bioinformatics & Statistical Genetics (BSG) Core The Wellcome Trust Centre for Human Genetics (WTCHG), Oxford Email: stefano@well.ox.ac.uk
Outline • Human genetic variation in health and disease • How do we identify pathogenic mutations amongst many genomic variants? • The WGS500 project • Whole-genome sequencing of 500 genomes of clinical significance
Human Genome • The (haploid) reference human genome is about 3 x 10^9 bases • The human genome is diploid => ~ 2 x 3 x 10^9 bases • The exome is ~ 30-60 Mb (1-2% of the genome) • Some more numbers (from GENCODE, Nov 2012) • 20,387 protein-coding genes • 81,626 protein-coding transcripts • 13,220 long non-coding RNA genes • 9,173 small non-coding RNA genes • 13,419 pseudogenes
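The size figures on this slide can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above; the constant and function names are illustrative.

```python
# Genome size arithmetic using the figures quoted on the slide.
HAPLOID_BASES = 3 * 10**9          # approximate haploid reference genome size
DIPLOID_BASES = 2 * HAPLOID_BASES  # two copies of each chromosome


def exome_fraction(exome_bases, genome_bases=HAPLOID_BASES):
    """Return the exome size as a fraction of the haploid genome."""
    return exome_bases / genome_bases


if __name__ == "__main__":
    for exome_mb in (30, 60):
        frac = exome_fraction(exome_mb * 10**6)
        print(f"{exome_mb} Mb exome = {frac:.1%} of the haploid genome")
    print(f"Diploid genome: {DIPLOID_BASES:.1e} bases")
```

A 30-60 Mb exome therefore corresponds to the 1-2% range stated on the slide.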
Sequence Variants • Single nucleotide variants (SNV) • Small insertions/deletions (INDEL) • Structural variants • Large insertions/deletions • Inversions • Copy number variants • Translocations • …
Functional Consequences From Ensembl
Human Genome Variation • The 1000 Genomes Project (www.1000genomes.org) provides a catalogue of all (most) types of human genetic variation • Population-scale genome sequencing • Phase 1 (October 2012) • High-throughput sequencing of 1092 human genomes • Identified up to 98% of all SNPs with a frequency > 1% in the population • 1,500 additional genomes in the next (final) phase
Human Genome Variation (1000 Genomes Project, Nature 491, 56-65, 2012) • LOF = loss-of-function variant (stop-gain, frameshift indel, essential splice site) • Conserved sites = sites with GERP conservation score > 2
Rare and Common Diseases • Only 1 or 2 causal variants • Adapted from TA Manolio et al. Nature 461, 747-753 (2009)
Sequencing Strategies • Targeted sequencing • E.g. screening of known genes associated with cardiomyopathies or ataxia • Applications in clinical diagnostics • Whole exome sequencing • Protein-coding regions • Whole genome sequencing • Can detect all types of information relevant to pathology in a single experiment • Still costly, but cost is decreasing rapidly
Identifying causal variants: Assumptions and Filters • After variant calling, filter out low quality (confidence) calls • Variant is unique in patients or at least very rare in the general population, e.g. < 1% • Use of in-house databases too • Variant has complete penetrance: every carrier will have the phenotype • In general these steps will not identify the pathogenic variant uniquely but will restrict the list of candidates. Further analysis required
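The quality and frequency filters listed above can be sketched as a simple pass over called variants. This is a minimal illustration, not the pipeline used in the lecture: the `Variant` fields, the quality threshold of 30 and the in-house cut-off are all assumptions chosen for the example; only the 1% frequency limit comes from the slide.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    chrom: str
    pos: int
    ref: str
    alt: str
    qual: float         # caller quality score (hypothetical scale)
    pop_af: float       # allele frequency in a reference population
    inhouse_count: int  # carriers seen in an in-house database (hypothetical)


def filter_candidates(variants, min_qual=30.0, max_af=0.01, max_inhouse=0):
    """Keep confident calls that are rare in the population and in-house data."""
    kept = []
    for v in variants:
        if v.qual < min_qual:            # low-confidence call
            continue
        if v.pop_af >= max_af:           # too common, e.g. >= 1%
            continue
        if v.inhouse_count > max_inhouse:  # seen in unrelated in-house samples
            continue
        kept.append(v)
    return kept


if __name__ == "__main__":
    calls = [
        Variant("11", 1000, "C", "A", 50.0, 0.0, 0),
        Variant("11", 2000, "G", "T", 10.0, 0.0, 0),   # fails quality
        Variant("2", 3000, "A", "G", 60.0, 0.05, 0),   # fails frequency
    ]
    for v in filter_candidates(calls):
        print(v.chrom, v.pos, v.ref, v.alt)
```

As the slide notes, such filters shrink the candidate list but do not by themselves single out the pathogenic variant.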
Ideal Scenario • Variant is common amongst all affected and absent in all unaffected • Variant is in a gene with known function and disrupts the protein
Variant Prioritization • Focus first on protein-coding regions (exome) • Nonsense and missense mutations • Frame-shift indels • Essential splice site disruptions • Easier to interpret the consequences of the variant • E.g. mutation affects catalytic residues in an enzyme • Targeted exome sequencing has been very successful in disease gene discovery • Cautionary note: on average each “normal, healthy” individual carries • 10-20 rare LOF variants • 2-5 rare, disease-associated variants
Non-coding variants • Many functional elements lie outside protein-coding regions (ENCODE) • Variants can disrupt • Regulatory elements, e.g. transcription factor binding sites • Splicing regulatory elements (branch sites, intronic splicing enhancers/inhibitors, …) • ncRNA transcripts • … • Many non-coding variants in individual genome sequences lie in ENCODE-annotated functional regions • At least as many as in protein-coding genes
Disease models • Diseases can be • Mendelian • Dominant, recessive or X-linked • Sporadic • De novo mutation • Cancer • Driver mutations • Analysis strategy needs to be adjusted to each disease category
Autosomal Dominant Disease • Familial, inherited disorder • Search for heterozygous variants • Present in affected individuals, absent in non-affected ones • Linkage analysis can substantially narrow the genomic search space • E.g. run SNP arrays on all family members and sequence one or two affected members
Recessive Disease • Suspected consanguinity • Search for homozygous variants • Heterozygous in parents • Homozygosity mapping by SNP arrays can substantially reduce the number of variants for follow-up • No indication of consanguinity • Search for compound heterozygous variants • Affected individual carries two separate variants in the same gene • Each parent carries one of the two variants
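Both recessive search patterns can be expressed as genotype rules on a parent-child trio. The sketch below assumes genotypes are encoded as the number of alternate alleles (0, 1 or 2) and that each variant carries a gene label; the dictionary keys and function names are hypothetical.

```python
from collections import defaultdict


def homozygous_candidates(variants):
    """IDs of variants homozygous in the child and heterozygous in both parents."""
    return [
        v["id"] for v in variants
        if v["child"] == 2 and v["mother"] == 1 and v["father"] == 1
    ]


def compound_het_genes(variants):
    """Genes where the child carries one maternal-only and one paternal-only variant."""
    by_gene = defaultdict(lambda: {"maternal": [], "paternal": []})
    for v in variants:
        if v["child"] != 1:
            continue
        if v["mother"] == 1 and v["father"] == 0:
            by_gene[v["gene"]]["maternal"].append(v["id"])
        elif v["father"] == 1 and v["mother"] == 0:
            by_gene[v["gene"]]["paternal"].append(v["id"])
    # A gene qualifies only if it has at least one variant from each parent.
    return {g: d for g, d in by_gene.items() if d["maternal"] and d["paternal"]}


if __name__ == "__main__":
    trio = [
        {"id": "v1", "gene": "G1", "child": 2, "mother": 1, "father": 1},
        {"id": "v2", "gene": "G2", "child": 1, "mother": 1, "father": 0},
        {"id": "v3", "gene": "G2", "child": 1, "mother": 0, "father": 1},
    ]
    print(homozygous_candidates(trio))
    print(compound_het_genes(trio))
```

Requiring one variant from each parent is a stand-in for phasing: it ensures the two variants sit on different copies of the gene.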
Sporadic Genetic Disease • Dominant disorder, parents are unaffected • Search for de novo mutations • Present in child and not in parents • Expect 50-100 de novo mutations in a “normal, healthy” individual • Father’s age effect, 2 extra mutations per year (Kong et al., Nature 488, 471–475, 2012) • Sometimes difficult to distinguish from a recessive disease
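The de novo rule (present in the child, absent in both parents) is a one-line filter once trio genotypes are available. A minimal sketch, assuming genotypes are stored as alternate-allele counts keyed by a variant ID; the data layout is hypothetical.

```python
def de_novo_candidates(trio_genotypes):
    """Return IDs of variants seen in the child but in neither parent.

    trio_genotypes maps variant ID -> (child, mother, father), where each
    value is the number of alternate alleles (0, 1 or 2).
    """
    return [
        vid
        for vid, (child, mother, father) in trio_genotypes.items()
        if child >= 1 and mother == 0 and father == 0
    ]


if __name__ == "__main__":
    trio = {
        "chr1:12345:C>T": (1, 0, 0),  # candidate de novo
        "chr2:67890:G>A": (1, 1, 0),  # inherited from mother
        "chr3:13579:A>G": (0, 0, 0),  # not present in child
    }
    print(de_novo_candidates(trio))
```

In practice a missed heterozygous call in a parent looks identical to a de novo event, so candidates from a filter like this would need validation.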
Cancer • Matched normal to tumour samples • Search for somatic variants • Present in tumour(s), absent in normal sample • Identify driver mutations • More on this tomorrow, JB Cazier’s lecture
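With a matched normal, the somatic search reduces to a set difference on variant keys. The sketch below assumes each call set is a collection of hashable keys such as `(chrom, pos, ref, alt)`; the helper names are illustrative and real somatic callers model both samples jointly rather than subtracting call sets.

```python
def somatic_variants(tumour_calls, normal_calls):
    """Variants called in the tumour but not in the matched normal."""
    return set(tumour_calls) - set(normal_calls)


def shared_somatic(tumour_samples, normal_calls):
    """Somatic variants present in every tumour sample (e.g. primary and metastases)."""
    per_sample = [somatic_variants(t, normal_calls) for t in tumour_samples]
    return set.intersection(*per_sample) if per_sample else set()


if __name__ == "__main__":
    normal = {("17", 100, "C", "T")}
    primary = {("17", 100, "C", "T"), ("17", 200, "G", "A"), ("3", 50, "A", "C")}
    metastasis = {("17", 100, "C", "T"), ("17", 200, "G", "A")}
    print(somatic_variants(primary, normal))
    print(shared_somatic([primary, metastasis], normal))
```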
Predicting Phenotypic Consequences • Methods based on comparative genomics • Evolution as a measure of deleteriousness • Variants at conserved positions more likely to be deleterious • Several conservation scores • phyloP- single-site score (http://compgen.bscb.cornell.edu/phast/) • GERP - single-site score (http://mendel.stanford.edu/sidowlab/downloads/gerp/index.html) • phastCons – region-based score (http://compgen.bscb.cornell.edu/phast/) • …
Conservation Scores: Benign vs Pathogenic Variants • Gilissen et al, European Journal of Human Genetics (2012) 20, 490–497
β-haemoglobin locus • From GM Cooper & J Shendure, Nature Reviews Genetics 12, 628-640 (2011)
Protein Sequence Variants • Most established methods. They exploit • Amino acid properties, e.g. charge, size, … • Structural information, e.g. local secondary structure, surface/core amino acid, … • Evolutionary information, e.g. pattern of observed substitutions • Database information, e.g. known binding site • Several methods available • SIFT (http://sift.bii.a-star.edu.sg/) • PolyPhen-2 (http://genetics.bwh.harvard.edu/pph2/) • …
PolyPhen-2 (http://genetics.bwh.harvard.edu/pph2/) • Prediction based on sequence, phylogenetic and structural information characterizing the substitution • 8 sequence-based properties • 3 structure-based properties • The 11 properties (features) are used as input to a probabilistic classifier • Trained to differentiate benign from pathogenic variants
Non-coding variants • A substantial fraction of disease causing mutations are not exonic • Probably under-represented in databases • Regulatory variants can have a large effect • More difficult to discover • Non-coding positions less conserved than coding positions • ENCODE has provided a detailed map of regulatory regions • Search for variants that disrupt a consensus sequence motif within a known binding site
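One common way to ask whether a variant disrupts a consensus motif is to score the reference and alternate sequences against a position frequency matrix and compare. The sketch below is an illustration under stated assumptions: the 4-bp matrix `PFM` is invented for the example, the background is uniform, and a real analysis would take matrices from a motif database.

```python
import math

# Hypothetical position frequency matrix for a 4-bp motif with consensus ACGT.
PFM = [
    {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05},
    {"A": 0.05, "C": 0.85, "G": 0.05, "T": 0.05},
    {"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05},
    {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85},
]
BACKGROUND = 0.25  # assumed uniform base frequency


def log_odds_score(seq, pfm=PFM, background=BACKGROUND):
    """Sum of per-position log2 odds of the motif model versus background."""
    if len(seq) != len(pfm):
        raise ValueError("sequence length must match motif length")
    return sum(math.log2(pfm[i][base] / background) for i, base in enumerate(seq))


def motif_score_change(site_seq, offset, alt_base):
    """Score difference (alt minus ref) for a substitution inside a binding site."""
    alt_seq = site_seq[:offset] + alt_base + site_seq[offset + 1:]
    return log_odds_score(alt_seq) - log_odds_score(site_seq)


if __name__ == "__main__":
    delta = motif_score_change("ACGT", 1, "T")
    print(f"Score change for C>T at motif position 2: {delta:.2f} bits")
```

A strongly negative change marks a variant that moves the site away from the consensus; how large a drop matters biologically is not something this score alone can settle.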
Gene Prioritization Methods • Methods focus on genes rather than on variants • Identify the genes most likely to cause a given disease in a list of candidates • Methods combine heterogeneous pieces of information • Shared biological pathways with other disease genes • Orthologous genes involved in similar diseases in model organisms • Localization in affected tissue • …
Follow up • Definitive proof of pathogenicity requires • Validation in independent patient cohort • But many diseases are genetically heterogeneous and caused by extremely rare variants • In vitro functional experiments • Evaluate molecular consequences, e.g. disruption of expression or protein folding • In vivo experiments in model organisms • Is the human phenotype reproduced in, e.g., a knock-out mouse?
Bioinformatics Challenges • How reliably can we read and annotate an individual’s genome? • How well can we interpret genetic variation in the context of a clinical presentation? • Community experiment to objectively assess computational methods • Critical Assessment of Genome Interpretation (CAGI 2012) • Distinguish between exomes of Crohn’s disease patients and healthy individuals • PGP genomes: predict clinical phenotypes from genome data, and match individuals to their health records • Whole genomes of a family affected by primary congenital glaucoma: discover the genetic basis of the disease • ... • Critical Assessment of Massive Data Analysis (CAMDA 2013) • Reliable variant calling • …
The WGS500 Project • Collaboration involving the WTCHG, Oxford BRC, Oxford University Hospitals and Illumina • Sequence 500 genomes of clinical significance • Mendelian diseases • Immunological disorders • Cancers • Target coverage: 25x (50x for cancer) • Diverse set of experimental designs • Familial: Linkage information • De novo: trios • Cancer: Tumour-normal, metastases, multiple-mets, .. • Substantial follow-up (screening and functional) to establish candidacy
Overview of processing (flow diagram) • 400 genomes • 100 genomes • Oxford Genomics • Illumina • QC • Read alignment and calls (Eland/Casava) • Large-scale CNV scan • Read alignment (Stampy) • Individual/group variant calls (Platypus) • Homozygosity scan • Reference-compressed archive • Union file • Individual genotypes • Frequency (1000G, EVS) • Conservation • Coding consequence (x2) • Predicted effect (x3) • Pathogenicity (HGMD) • Regulatory annotation • Annotated genotypes • Web server
Case Study (PI: Dr A Nemeth) • 3 affected individuals from a highly consanguineous family • Childhood developmental ataxia • Cognitive impairment
Targeted Sequencing • Targeted sequencing on V3 using a panel of > 100 known ataxia genes • Found a homozygous stop codon in SPTBN2 • Sanger sequencing confirmed the mutation as homozygous in all 3 affected individuals and heterozygous in the parents of V3 • Mutations in SPTBN2 cause spinocerebellar ataxia type 5 (SCA5) • Sometimes referred to as “Lincoln ataxia” • Autosomal dominant, slowly progressing, adult onset • Is the cognitive impairment due to the mutation in SPTBN2? • Could be caused by mutations in a second gene (homozygous or compound heterozygous) • Investigated this possibility using a combination of SNP array and whole genome sequencing
Homozygosity Mapping • SNP array genotyped V1, V2, V3, IV3 and IV4 (~300K SNPs) • Identified regions of homozygosity (ROH) shared by V1, V2 and V3 and not present in either IV3 or IV4 • Homozygosity mapping with PLINK • Found 23 regions totalling 28.7 Mb • Largest segments on chromosome 11
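The idea behind the shared-ROH search can be sketched in a few lines. This is a deliberately simplified illustration and not PLINK's algorithm (which scans with a sliding window and tolerates some heterozygous or missing calls): it assumes genotypes are alternate-allele counts at ordered SNP positions, requires unbroken runs, and uses an arbitrary minimum of 3 SNPs. All function names are hypothetical.

```python
def find_runs(flags, positions, min_snps):
    """Return (start_pos, end_pos) for each run of >= min_snps consecutive True flags."""
    runs, start = [], None
    for i, flag in enumerate(list(flags) + [False]):  # sentinel closes a trailing run
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= min_snps:
                runs.append((positions[start], positions[i - 1]))
            start = None
    return runs


def shared_roh(positions, affected, unaffected, min_snps=3):
    """Runs where all affected share the same homozygous genotype,
    excluding runs that overlap a homozygous run in any unaffected individual."""
    shared_flags = [
        affected[0][i] in (0, 2) and all(g[i] == affected[0][i] for g in affected)
        for i in range(len(positions))
    ]
    candidates = find_runs(shared_flags, positions, min_snps)

    excluded = []
    for g in unaffected:
        excluded += find_runs([x in (0, 2) for x in g], positions, min_snps)

    def overlaps(a, b):
        return a[0] <= b[1] and b[0] <= a[1]

    return [r for r in candidates if not any(overlaps(r, e) for e in excluded)]


if __name__ == "__main__":
    pos = [100, 200, 300, 400, 500, 600, 700, 800]
    affected = [[2, 2, 2, 2, 1, 0, 0, 0], [2, 2, 2, 2, 0, 0, 0, 0]]
    parents = [[1, 2, 1, 1, 1, 0, 0, 0], [1, 1, 2, 1, 1, 1, 0, 1]]
    print(shared_roh(pos, affected, parents))
```

Restricting later variant filtering to the intervals returned here is what reduces the search space, as in the 28.7 Mb reported on the slide.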
Whole Genome Sequencing • Searched for rare, homozygous variants in shared ROH • Present in 1000 Genomes with an allele frequency < 1% • Not observed in other WGS500 samples • Found 68 candidate variants • Based on evolutionary conservation and available information in databases (e.g. HGMD) the only likely pathogenic variant is the stop codon in SPTBN2 • Also excluded a compound heterozygous model (data not shown)
SPTBN2 variant • The position is actually not well conserved • E.g. G->A in gorilla, baboon and mouse • GERP = -6.71 • PhyloP = -1.28 • TGT and TGC encode cysteine • TGA is a stop codon
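The codon logic on this slide explains why conservation scores mislead here: a third-base change between TGT and TGC is silent, while a change to TGA truncates the protein. A minimal sketch using the four TGN codons of the standard genetic code; the exact nucleotide change in the family is not stated on the slide, so the calls below only illustrate the principle.

```python
# The four TGN codons of the standard genetic code.
TG_CODONS = {"TGT": "Cys", "TGC": "Cys", "TGA": "Stop", "TGG": "Trp"}


def consequence(ref_codon, alt_codon):
    """Classify a codon change restricted to the TGN family."""
    ref_aa, alt_aa = TG_CODONS[ref_codon], TG_CODONS[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "Stop":
        return "stop-gain"
    if ref_aa == "Stop":
        return "stop-lost"
    return "missense"


if __name__ == "__main__":
    for ref, alt in [("TGT", "TGC"), ("TGC", "TGA"), ("TGT", "TGG")]:
        print(f"{ref} -> {alt}: {consequence(ref, alt)}")
```

Because the third base can vary between species without changing the amino acid, single-site scores such as GERP and phyloP come out low even though one particular substitution is highly damaging.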
SPTBN2 knock-out mouse • Investigated a mouse knock-out of SPTBN2 (Mandy Jackson Lab, Edinburgh) • Ataxia (previously reported) • Morphological abnormalities in neurons from the prefrontal cortex, an area believed to be important in humans for cognitive tasks • Deficits in object recognition tasks • The mouse model supports the hypothesis that both ataxia and cognitive impairment are caused by the recessive mutation in SPTBN2
WGS500 overview of findings (as of Dec 2012) • Project about 75% complete, with 292 samples (195 case studies) over 38 projects with initial analysis • In 75/195 cases there is at least one candidate viewed by the PI and analysts as a strong candidate for causing (or strongly contributing to) the phenotype • 45/82 in Mendelian • 19/61 in Immune • 11/52 in Cancer • Papers in press/submitted to date on • Ataxia, CMS, CLL, Multiple adenomas
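The per-category counts can be turned into rates and checked against the overall total with a short calculation. The sketch uses only the numbers quoted on the slide; the variable names are illustrative.

```python
# (cases with a strong candidate, total cases) per category, from the slide.
FINDINGS = {"Mendelian": (45, 82), "Immune": (19, 61), "Cancer": (11, 52)}


def candidate_rates(findings):
    """Fraction of cases with a strong candidate, per category."""
    return {name: hits / total for name, (hits, total) in findings.items()}


TOTAL_HITS = sum(hits for hits, _ in FINDINGS.values())
TOTAL_CASES = sum(total for _, total in FINDINGS.values())

if __name__ == "__main__":
    for name, rate in candidate_rates(FINDINGS).items():
        print(f"{name}: {rate:.0%}")
    print(f"Overall: {TOTAL_HITS}/{TOTAL_CASES} = {TOTAL_HITS / TOTAL_CASES:.0%}")
```

The category counts sum to the 75/195 quoted above, with Mendelian cases yielding a strong candidate most often.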