
Canadian Bioinformatics Workshops

Canadian Bioinformatics Workshops. www.bioinformatics.ca. Module 8 – Variants to Networks Part 1 – How to annotate variants and prioritize potentially relevant ones. Jüri Reimand Bioinformatics for Cancer Genomics May 25-29, 2015.



Presentation Transcript


  1. Canadian Bioinformatics Workshops www.bioinformatics.ca

  2. Module #: Title of Module 2

  3. Module 8 – Variants to Networks Part 1 – How to annotate variants and prioritize potentially relevant ones Jüri Reimand Bioinformatics for Cancer Genomics May 25-29, 2015 Informatics and Biocomputing Ontario Institute for Cancer Research

  4. Learning Objectives of Module I have detected somatic variants in a cancer sample. What information can I use to interpret them? • What variant annotations can I use? • How do impact prediction models work? • How to use an annotation tool: Annovar (LAB)

  5. Introduction

  6. Variant vs Gene Information We have to consider information at two levels: • Gene • Is the gene central to processes related to cancer? (e.g. proliferation, apoptosis, matrix degradation) • Is the gene sensitive to perturbation? (e.g. haploinsufficiency) • Variant • What is the variant effect on the gene product?

  7. Integrating Different Evidences Variant Recurrence Gene Product Function / Pathway Variant Gene Product Effect

  8. On Variant Size Small: 1-50 bp • SNV (Single Nucleotide Variants): 1 bp substitution, relatively easy to detect • Small In/dels: a bit more challenging to detect • Most available in databases; can be mapped by exact coordinates Medium: 100-1,000 bp • Insertions, Deletions, Translocations, Complex re-arrangements • Most challenging to detect • More tolerant mapping (e.g. partners of gene fusion) Large: > 5 kbp • Copy number variants relatively easy to detect using arrays, more challenging using next generation sequencing • More tolerant mapping (e.g. 50% reciprocal overlap, cytoband(s))
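The "50% reciprocal overlap" criterion for matching large variants can be sketched as follows. This is a minimal illustration, assuming half-open coordinates on the same chromosome; the function names are hypothetical and not taken from any tool mentioned in the slides.

```python
def reciprocal_overlap(start_a, end_a, start_b, end_b):
    """Return the smaller of the two overlap fractions for two intervals.

    Intervals are assumed to be half-open [start, end) on the same chromosome.
    """
    overlap = min(end_a, end_b) - max(start_a, start_b)
    if overlap <= 0:
        return 0.0
    # The overlap must cover a given fraction of BOTH intervals,
    # so the limiting value is the smaller of the two fractions.
    return min(overlap / (end_a - start_a), overlap / (end_b - start_b))


def same_cnv(a, b, threshold=0.5):
    """Treat two (start, end) intervals as the same event at >= threshold reciprocal overlap."""
    return reciprocal_overlap(a[0], a[1], b[0], b[1]) >= threshold


print(same_cnv((0, 100), (50, 150)))   # True: each interval is 50% covered
print(same_cnv((0, 100), (90, 1000)))  # False: only 10% of the first interval is covered
```

Requiring the overlap in both directions prevents a very large call from "matching" every small call that it happens to contain.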

  9. Variant Annotation Components • Variant database mapping • Allele frequencies from reference data-sets (1000G, NHLBI-ESP, ExAC) • dbSNP (sequence variation database) • COSMIC (somatic variant database) • Gene mapping (coding/splicing, UTR, intergenic) • Gene product effect type (e.g. loss of function, missense) • Coding Missense Effect Scoring • SIFT • PolyPhen2 • MutationAssessor • Other Effect Scoring • PhyloP (conservation) • CADD • Splicing-regulatory predictions

  10. Variant databases and allele frequencies

  11. 1000 Genomes (Phase 3) • Goal: • Identify all variants at > 1% frequency in represented human populations • Subjects: 2,504 • Apparently healthy • Ethnicities: caucasian European, admixed Latin Americans, African, South Asians, East Asians • Platform: Illumina • Low coverage (2-4x) whole genome • Exon (50x coverage)

  12. NHLBI-ESP • Goal: • discover heart, lung and blood disorder variants at frequency < 1% • Subjects: 6,503 (ESP 6500 release) • Not necessarily healthy (includes individuals with extreme subclinical traits and diseased) • Ethnicities: 2,203 African-Americans, 4,300 European-Americans • Platform: Illumina, exome sequencing (average 110x)

  13. ExAC (Exome Aggregation Consortium) • Goal: • Compile the largest set of exomes ever • Subjects: 60,706 (unrelated) • Not necessarily healthy: includes cardiovascular, autoimmune, schizophrenia and cancer, but removed individuals with severe pediatric disease • Ethnicities: non-Finnish European, Finnish, Latin Americans, African, South Asians, East Asians, Other • Platform: Illumina, exome • Variant calling: • GATK
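One way these reference allele frequencies are often used is to set aside variants that are common in the population. The sketch below assumes each variant is a dictionary with hypothetical keys `af_1000g`, `af_esp` and `af_exac`; the 1% threshold is only an illustrative default borrowed from the 1000 Genomes goal above, not a rule stated in the slides.

```python
def max_population_af(variant, sources=("af_1000g", "af_esp", "af_exac")):
    """Return the highest allele frequency reported by any reference data-set.

    A missing key or a None value means the variant was not observed there.
    """
    freqs = [variant[s] for s in sources if variant.get(s) is not None]
    return max(freqs) if freqs else 0.0


def is_rare(variant, threshold=0.01):
    """True when the variant is below the threshold in every reference data-set."""
    return max_population_af(variant) < threshold


# Hypothetical variants, for illustration only.
variants = [
    {"id": "var1", "af_1000g": 0.12, "af_exac": 0.10},
    {"id": "var2", "af_1000g": None, "af_esp": 0.0004},
    {"id": "var3"},
]
print([v["id"] for v in variants if is_rare(v)])  # ['var2', 'var3']
```

Taking the maximum across data-sets is a conservative choice: a variant that is common in any one represented population is treated as common.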

  14. dbSNP • Broad scope repository of “small” genetic variation (the NCBI counterpart for structural variants is dbVar) • Submissions before and after NGS era • Includes polymorphisms found in general population • Includes rare germline variants that are disease-associated (or suspected to be) • Includes somatic variants (also in COSMIC) • Good to look up variants • If you want to use it as a filter, make sure you first remove “clinically flagged” variants (somatic, germline) from the filter set

  15. COSMIC • “Catalogue Of Somatic Mutations In Cancer” • Reference database for somatic variation in cancer • Worth following up variants matching COSMIC entries • How many studies/samples was it found in? 1, many? • Does the variant overlap a hotspot? • Is the gene frequently mutated?
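The recurrence questions above (how many samples, is there a hotspot) can be sketched as a simple count. The input format, a list of `(gene, amino_acid_position, sample_id)` tuples, and the `min_samples` threshold are assumptions made for illustration; they do not reflect the COSMIC file format.

```python
def recurrence(observations):
    """Count distinct samples per (gene, amino acid position).

    observations: iterable of (gene, aa_position, sample_id) tuples.
    """
    samples = {}
    for gene, pos, sample in observations:
        samples.setdefault((gene, pos), set()).add(sample)
    return {site: len(ids) for site, ids in samples.items()}


def hotspots(observations, min_samples=3):
    """Return sites mutated in at least min_samples distinct samples."""
    counts = recurrence(observations)
    return sorted(site for site, n in counts.items() if n >= min_samples)


# Hypothetical observations, for illustration only.
obs = [
    ("GENE_A", 600, "s1"), ("GENE_A", 600, "s2"), ("GENE_A", 600, "s3"),
    ("GENE_A", 600, "s3"),  # the same sample reported twice counts once
    ("GENE_A", 470, "s4"),
    ("GENE_B", 12, "s1"), ("GENE_B", 12, "s5"),
]
print(hotspots(obs))  # [('GENE_A', 600)]
```

Counting distinct samples rather than rows avoids inflating recurrence when one sample is reported by several studies.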

  16. Gene mapping

  17. Gene Mapping: Types of Genes Types of genes: • Protein-coding genes • Non-protein-coding RNA genes (e.g. miRNA) • Different functional relevance • Different knowledge of variant effects

  18. Gene Mapping: Parts of Genes • Protein-coding genes have these parts: • UTR (transcribed, not translated) • Coding exons (translated) • Introns (spliced out, not translated) • Splice sites Also: • Upstream, downstream transcribed gene • Inter-genic

  19. Gene Mapping: Annovar’s priority system • Gene types and parts: what if they overlap..? • Whenever more than one mapping is possible, Annovar will follow this priority system • You can also ask Annovar to report all possible effects

  20. Gene Mapping: Annovar’s priority system [Figure: gene model of protein-coding gene G1, with the TSS (Transcription Start Site) of G1 marked, and non-coding RNA ncR1 (e.g. miRNA)]

  21. Gene Mapping: Annovar’s priority system [Figure: the G1 / ncR1 locus with the annotation assigned to each region. Labels shown: G1 Upstream, G1 UTR 5’, G1 Exonic, G1 Splicing**, G1 Intronic, G1 UTR 3’, G1 Downstream, ncR1, ncR1 Downstream] ** Splice sites after the first were omitted to avoid clutter
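The priority table itself did not survive in this transcript. The sketch below assumes the precedence order given in the ANNOVAR documentation (exonic = splicing > ncRNA > UTR5/UTR3 > intronic > upstream/downstream > intergenic) and uses simplified region labels; it illustrates the idea and is not Annovar's own code.

```python
# Assumed precedence, highest priority first; regions with equal rank are tied.
PRECEDENCE = {
    "exonic": 0, "splicing": 0,
    "ncRNA": 1,
    "UTR5": 2, "UTR3": 2,
    "intronic": 3,
    "upstream": 4, "downstream": 4,
    "intergenic": 5,
}


def top_annotations(candidates):
    """Keep only the highest-priority mappings for one variant.

    candidates: list of (gene, region) tuples, one per possible mapping.
    """
    best = min(PRECEDENCE[region] for _, region in candidates)
    return [(gene, region) for gene, region in candidates
            if PRECEDENCE[region] == best]


# A variant inside an intron of G1 that also hits the non-coding RNA ncR1.
print(top_annotations([("G1", "intronic"), ("ncR1", "ncRNA")]))
# [('ncR1', 'ncRNA')]

# A variant downstream of both genes: the tie is reported for both.
print(top_annotations([("G1", "downstream"), ("ncR1", "downstream")]))
# [('G1', 'downstream'), ('ncR1', 'downstream')]
```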

  22. Gene Mapping: Database • Goal: map our variants to (coding and non-coding) genes • RefSeq is the suggested database for transcribed gene and coding sequence definition • In the lab we will use Annovar with the RefSeq database • Other databases available: UCSC known genes, Ensembl

  23. Gene product effect type

  24. Gene Product Effect • Regulatory / other non-protein-coding sequences: difficult to establish what a change “means” (certain cases are easier, e.g. miRNA seed) • Protein-coding sequences: how is protein sequence affected? • Definitely easier to chase after protein effects • But don’t forget that other gene products exist…

  25. Gene Product Effect: Protein-coding • Stop-gain SNV: adds a STOP codon → truncated protein • Frameshift In/Del: shifts the reading frame → protein translated incorrectly from that point • Splicing: alters key sites guiding splicing • In-frame In/Del: removes/adds one or more amino acids • Stop-loss: loss of STOP codon → extra piece in the protein • Missense SNV: modifies one amino acid • Synonymous: no amino acid change
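For single-nucleotide changes inside a codon, the effect types listed above follow directly from the standard genetic code. The sketch below is a minimal illustration (coding-strand codons, SNVs only, no splicing or in/del handling); `classify_snv` is a hypothetical helper, not part of Annovar.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_snv(ref_codon, position, alt_base):
    """Classify a substitution at position 0, 1 or 2 of a coding-strand codon."""
    alt_codon = ref_codon[:position] + alt_base + ref_codon[position + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "stop-gain"
    if ref_aa == "*":
        return "stop-loss"
    return "missense"


print(classify_snv("GTG", 1, "A"))  # missense   (GTG Val -> GAG Glu)
print(classify_snv("GTG", 2, "A"))  # synonymous (GTG Val -> GTA Val)
print(classify_snv("TGG", 2, "A"))  # stop-gain  (TGG Trp -> TGA stop)
print(classify_snv("TAA", 0, "C"))  # stop-loss  (TAA stop -> CAA Gln)
```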

  26. Loss of Function (LOF) Variants Definition: Stop-gain, Frameshift, Splicing These are the more disruptive, BUT: • What percentage of the protein is affected? • Are there multiple transcript isoforms? • Splicing effect difficult to predict • Cryptic splice sites • Frameshift can be rescued by another frameshift or bypassed by splicing
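The first question above, what percentage of the protein is affected, reduces to simple arithmetic for a stop-gain or frameshift in a single transcript. The sketch below is a rough illustration that assumes one isoform and ignores the rescue mechanisms listed on the slide; the function name is hypothetical.

```python
def fraction_lost(aa_position, protein_length):
    """Fraction of the protein at or after the first affected residue (1-based)."""
    if not 1 <= aa_position <= protein_length:
        raise ValueError("aa_position must lie within the protein")
    return (protein_length - aa_position + 1) / protein_length


# A premature stop at residue 101 of a 400-residue protein removes 75% of it.
print(fraction_lost(101, 400))  # 0.75
```

A truncation near the C-terminus yields a small fraction, which is one reason not every LOF-type variant is equally disruptive.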

  27. Missense Variants: Tell Me More.. • How do we tell if a missense alters protein function? • Type of amino acid change (amino acid groups) • Conservation across species • Conserved protein domain • Secondary protein structure • Tertiary (3D) protein structure + simulation • Other functional features (e.g. phosphosite) • Machine learning model tying all of these together • What training set?

  28. Missense Example: Back to BRAF • BRAF V600E (T>A): somatic, pathogenic • BRAF V600A (T>C): somatic / germline, pathogenicity untested

  29. Conservation and Missense Variant Scoring Models

  30. Conservation • Conservation is a powerful and broadly used idea • How conserved is a given nucleotide or genomic interval, comparing different species to human? • How conserved is an amino acid in a protein sequence? • Available from UCSC (nucleotide conservation): • PhyloP score – useful to assess single variants • PhastCons score/element – useful to assess putative regulatory regions and genes not coding proteins • Multi-species alignment – generally useful

  31. Look for coding exons, UTRs and third nucleotide within codons

  32. PhyloP • PhyloP: test to detect if nucleotide substitution rates are faster or slower than expected under neutral drift • Only where aligned sequence available! • PhyloP score • Positive: conserved (e.g. PhyloP > 2) • Zero: neutral • Negative: more diverged than neutral • Species group: • All vertebrates • Only placental mammals • Only primates
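The sign convention above can be turned into a small helper. The cut-off of 2 comes from the slide's example; the category labels and the treatment of missing scores are assumptions made for illustration.

```python
def interpret_phylop(score, conserved_cutoff=2.0):
    """Map a PhyloP score to a coarse category.

    score is None when no aligned sequence is available at the position.
    """
    if score is None:
        return "no alignment"
    if score > conserved_cutoff:
        return "conserved"
    if score < 0:
        return "more diverged than neutral"
    return "near neutral"


for s in (4.3, 0.1, -1.8, None):
    print(s, interpret_phylop(s))
```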

  33. Conservation • Main caveat: • Conservation at a given position does not directly tell you the effect of your variant; it only tells you whether the position is important!

  34. Missense Variant Effect: Scoring Models Overview Criteria to keep in mind: • What features are used? • Nucleotide / amino acid conservation • Amino acid physicochemical properties • Direct scoring versus Machine learning • Machine learning models are heavily dependent on the training-set used • What data-set used for assessment / learning / optimization? • E.g. Activating / gain-of-function versus inactivating / loss-of-function mutations • E.g. Mendelian disorders (prevailingly loss-of-function) versus cancer (some are unique to cancer, e.g. drug resistance)

  35. SIFT • Broadly used, relatively old (first published: 2001) • Designed for deleterious mutation (i.e. disruptive of protein function) • Based uniquely on protein sequence (amino acid) conservation • Start from query protein sequence • Identify similar protein sequences (PSI-BLAST) • Multiple alignment of protein sequences (orthologs and paralogs) • Amino acid x residue probability matrix (PSSM) • For every residue, amino acid probability reweighted by amino acid diversity at the position (sum of frequency rank * frequency) → Score: probability of observing amino acid normalized by residue conservation; cut-off: 0.05 (based on case studies) Predicting deleterious amino acid substitutions. Ng PC, Henikoff S. Genome Res. 2001 May;11(5):863-74.
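The final scoring step can be illustrated on a single alignment column. This is a deliberately simplified sketch: it uses raw frequencies normalized by the most frequent amino acid and omits the sequence search, pseudocounts and diversity reweighting of the real method. The strict `<` comparison at the 0.05 cut-off is an assumption.

```python
from collections import Counter


def sift_like_scores(column):
    """Normalized probability of each amino acid observed in one alignment column.

    column: string with one amino acid letter per aligned sequence.
    """
    counts = Counter(column)
    top = max(counts.values())
    # Dividing by the most frequent residue puts the best-tolerated amino acid at 1.0.
    return {aa: n / top for aa, n in counts.items()}


def predict(column, alt_aa, cutoff=0.05):
    """Return (score, call) for substituting alt_aa at this position."""
    score = sift_like_scores(column).get(alt_aa, 0.0)
    return score, ("deleterious" if score < cutoff else "tolerated")


# Hypothetical column: 38 sequences carry V, 2 carry I.
column = "V" * 38 + "I" * 2
print(predict(column, "V"))  # (1.0, 'tolerated')
print(predict(column, "E"))  # (0.0, 'deleterious')
```

An amino acid never seen in the alignment scores 0 here; the real method softens this with pseudocounts, which matters when few sequences are aligned.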

  36. PolyPhen2 • Integrates multiple features • 8 sequence-based, 3 structure-based (nucleotide and amino acid level) (e.g. side chain volume change, overlap with PFAM domain, multiple alignment metrics) • Machine learning method (Naïve Bayes) → requires training set • Set 1: HumDiv • Positive: damaging alleles for known Mendelian disorders (Uniprot) • Negative: nondamaging differences between human proteins and related mammalian homologs • Performance in 5-fold cross-validation: (TP ~ 80%, FP ~ 10%), (TP ~ 90%, FP ~ 20%) • Set 2: HumVar • Positive: all human disease causing mutations (Uniprot) • Negative: non-synonymous SNPs without disease association • Richer model than SIFT • More biased towards training set(s) than SIFT A method and server for predicting damaging missense mutations. Adzhubei IA, Schmidt S, Peshkin L, […], Bork P, Kondrashov AS, Sunyaev SR. Nat Methods. 2010 Apr;7(4):248-9.
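To show why a Naïve Bayes model depends so strongly on its training set, here is a toy Bernoulli Naïve Bayes classifier. It is not PolyPhen2: the two binary features (`in_pfam_domain`, `large_volume_change`) and the eight training examples are invented for illustration, whereas the real model uses eleven features and the HumDiv / HumVar sets.

```python
import math


def train_naive_bayes(samples, labels):
    """Fit a Bernoulli Naive Bayes model with add-one smoothing.

    samples: list of tuples of 0/1 features; labels: list of 0/1
    (1 = damaging). Both classes must be present in the training data.
    """
    n_features = len(samples[0])
    model = {}
    for cls in (0, 1):
        rows = [s for s, y in zip(samples, labels) if y == cls]
        prior = len(rows) / len(samples)
        # P(feature = 1 | class), smoothed so it is never exactly 0 or 1.
        p_on = [(sum(r[j] for r in rows) + 1) / (len(rows) + 2)
                for j in range(n_features)]
        model[cls] = (prior, p_on)
    return model


def posterior_damaging(model, features):
    """Posterior probability of the damaging class for one feature vector."""
    log_score = {}
    for cls, (prior, p_on) in model.items():
        total = math.log(prior)
        for x, p in zip(features, p_on):
            total += math.log(p if x else 1 - p)
        log_score[cls] = total
    top = max(log_score.values())
    weights = {c: math.exp(v - top) for c, v in log_score.items()}
    return weights[1] / (weights[0] + weights[1])


# Invented training set: (in_pfam_domain, large_volume_change).
X = [(1, 1), (1, 1), (1, 0), (0, 1), (0, 0), (0, 0), (1, 0), (0, 0)]
y = [1, 1, 1, 1, 0, 0, 0, 0]
model = train_naive_bayes(X, y)
print(round(posterior_damaging(model, (1, 1)), 3))  # 0.889
print(round(posterior_damaging(model, (0, 0)), 3))  # 0.167
```

Every probability in the model is estimated from the training examples, so swapping the positive or negative set changes the predictions for the same variant.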

  37. MutationAssessor • Direct / theoretical model (no machine learning) • Based on amino acid conservation also specifically modeling conservation unique to protein subfamilies (can be regarded as an enhanced SIFT) • Entropy-based score based on protein sequence alignment • Performs well for (recurrent) somatic variants Predicting the functional impact of protein mutations: application to cancer genomics. Reva B, Antipin Y, Sander C. Nucleic Acids Res. 2011 Sep 1;39(17):e118
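The entropy idea behind such a score can be sketched on one alignment column. This shows only plain Shannon entropy as a conservation measure; the subfamily-specific component of MutationAssessor is not modeled, and the function name is hypothetical.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the amino acids in one alignment column.

    0 means a fully conserved position; higher values mean more diversity.
    """
    counts = Counter(column)
    total = len(column)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


print(column_entropy("VVVVVVVV") == 0)  # True: fully conserved column
print(column_entropy("VVVVIIII"))       # 1.0: two residues, equally frequent
print(column_entropy("VILMFAGT"))       # 3.0: eight different residues
```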

  38. CADD • Intended as a measure of “deleteriousness” for coding and non-coding sequence, not biased to known disease variation • However, it does not model gene-specific constraint in detail • Machine learning model (Linear SVM) • Negative training set: nearly fixed human alleles, variant if compared to inferred human-chimp ancestral genome • Positive training set: simulated variants based on mutation model aware of sequence context and primate substitution rates • Predictive features (63): VEP (Variant Effect Predictor) output, UCSC tracks, Encode tracks → includes missense predictions and nucleotide-level conservation • Performance assessment: using pathogenic variants from ClinVar, performs a bit better than PhyloP for all sites and than PolyPhen/SIFT for missense coding A general framework for estimating the relative pathogenicity of human genetic variants. Kircher M, Witten DM, Jain P, O'Roak BJ, Cooper GM, Shendure J. Nat Genet. 2014 Mar;46(3):310-5.

  39. CADD Pathogenic ClinVar vs NHLBI-ESP > 5%

  40. Splicing Regulatory Predictions • Goal: predict how SNVs affect exon inclusion / exclusion • Strategy: • Learn “Wild Type” splicing code based on reference genome sequence motifs and experimentally-measured splicing patterns in human tissues • “Mutant” code: predicts splicing change when variant alters splicing-guiding sequence motif • Does not learn based on known disease splicing alterations Science 2015

  41. Phosphorylation and other protein modifications • Post-translational modifications (PTMs) extend protein function • Human: >130,000 PTM sites, 12% of protein sequence • Enriched in inherited disease and somatic cancer mutations • Negatively selected in population • Often not detected with mutation assessment tools Reimand et al, 2013 Mol Sys Bio; 2015 PLOS Genet

  42. Effect Scoring: Concluding Remarks • Nucleotide-level conservation (PhyloP) is simple yet powerful, and multiple alignments can be additionally inspected • Missense scoring models are powerful, but their strengths and weaknesses need to be understood • Variants should always be reviewed putting all information in context • Consider conservation and effect scores using different models • Review the amino acid change and sequence context • Look for clusters of somatic variants and protein domains • Don’t forget gene-level information!

  43. We are on a Coffee Break & Networking Session
