
Canadian Bioinformatics Workshops

Canadian Bioinformatics Workshops. www.bioinformatics.ca. Module 7 - Part 1: Annotation of Somatic Coding Variants. Daniele Merico. Bioinformatics for Cancer Genomics, May 26-30, 2014. Annovar. Informatics Facility, The Centre for Applied Genomics (TCAG)


Presentation Transcript


  1. Canadian Bioinformatics Workshops www.bioinformatics.ca

  2. Module #: Title of Module 2

  3. Module 7 - Part 1: Annotation of Somatic Coding Variants. Daniele Merico. Bioinformatics for Cancer Genomics, May 26-30, 2014. Annovar. Informatics Facility, The Centre for Applied Genomics (TCAG), The Hospital for Sick Children

  4. Learning Objectives of Module: I have detected somatic variants in a cancer sample. What information can I use to interpret them? • What do different annotations mean? • How do missense effect prediction models work? • How to use an annotation tool: Annovar (LAB)

  5. Introduction

  6. Cancer Driver Discovery: Biological Knowledge vs Frequency • Small data-sets (1-10 subjects): variant previously reported; gene function/disease phenotype + variant effect • Large data-sets (> 100 subjects): variant recurrence; over-represented pathways/networks

  7. Variant vs Gene Information: We have to consider information at two levels: • Gene: Is the gene central to processes related to cancer (e.g. proliferation, apoptosis, matrix degradation)? Is the gene sensitive to perturbation (e.g. haploinsufficiency)? • Variant: What is the variant effect on the gene product?

  8. Passengers and Drivers (A Very Simplistic Model) • Important Activator + Activating Variant -> Cancer Drive • Important Repressor + Loss-of-function Variant -> Cancer Drive • Redundant Gene (or controlling unrelated process) -> No effect

  9. Passengers and Drivers (A Very Simplistic Model) • Important Repressor + Silent Variant -> No effect • Important Repressor + Loss-of-function Variant -> Cancer Drive
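
The "very simplistic model" on slides 8 and 9 can be read as a small lookup table. The sketch below encodes only the combinations the slides state; the string labels and the function name `predicted_outcome` are assumptions made for illustration, and any combination the slides do not mention is reported as not covered rather than guessed.

```python
# Combinations stated on slides 8 and 9 (labels are illustrative assumptions).
SIMPLISTIC_MODEL = {
    ("important activator", "activating"): "cancer drive",
    ("important repressor", "loss-of-function"): "cancer drive",
    ("important repressor", "silent"): "no effect",
}


def predicted_outcome(gene_role, variant_effect):
    """Return the outcome of the simplistic driver/passenger model."""
    # A redundant gene (or one controlling an unrelated process) has no
    # effect in this model, whatever the variant does to the gene product.
    if gene_role == "redundant":
        return "no effect"
    return SIMPLISTIC_MODEL.get(
        (gene_role, variant_effect), "not covered by the slides"
    )


print(predicted_outcome("important repressor", "loss-of-function"))
print(predicted_outcome("important repressor", "silent"))
```

The point of the table is that neither level is sufficient alone: the same loss-of-function effect matters in an important repressor and not in a redundant gene.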

  10. Integrating Different Types of Evidence • Variant Recurrence • Gene Product Function / Pathway • Variant Gene Product Effect

  11. On Variant Size • Small (1-50 bp): SNV (Single Nucleotide Variants), 1 bp substitution, relatively easy to detect; small in/dels, a bit more challenging to detect; most available in databases, can be mapped by exact coordinates • Medium (100-1,000 bp): insertions, deletions, translocations, complex re-arrangements; most challenging to detect; more tolerant mapping (e.g. partners of gene fusion) • Large (> 5 kbp): copy number variants, relatively easy to detect using arrays, more challenging using next generation sequencing; more tolerant mapping (e.g. 50% reciprocal overlap, cytoband(s))
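
The "50% reciprocal overlap" criterion for large variants can be made concrete with a short sketch. It assumes half-open coordinates `[start, end)` on the same chromosome; the function names and the example coordinates are hypothetical and are not taken from the workshop material.

```python
def reciprocal_overlap(start_a, end_a, start_b, end_b):
    """Smaller of the two overlap fractions for intervals [start, end)."""
    overlap = min(end_a, end_b) - max(start_a, start_b)
    if overlap <= 0:
        return 0.0
    # The shared segment must cover a given fraction of BOTH intervals,
    # so the limiting value is the smaller of the two fractions.
    return min(overlap / (end_a - start_a), overlap / (end_b - start_b))


def matches_known_cnv(called, known, threshold=0.5):
    """True if two (start, end) intervals reach the reciprocal overlap threshold."""
    return reciprocal_overlap(*called, *known) >= threshold


# Hypothetical called CNV and database CNV on the same chromosome.
called = (1_000_000, 1_020_000)
known = (1_005_000, 1_030_000)
print(reciprocal_overlap(*called, *known), matches_known_cnv(called, known))
```

Requiring the overlap in both directions stops a small call from "matching" a much larger database entry that merely contains it.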

  12. SNV and In/Del Annotation

  13. Variant Annotation Components • Variant database mapping • Allele frequencies from reference data-sets (1000G, NHLBI-ESP, CGI-46) • dbSNP (sequence variation database) • COSMIC (somatic variant database) • Gene mapping (coding/splicing, UTR, intergenic) • Gene product effect type (e.g. loss of function, missense) • Effect Scoring • PhyloP (conservation) • CADD • Coding Missense Effect Scoring • SIFT • PolyPhen2 • MutationAssessor

  14. Variant Annotation Components • Variant database mapping • Allele frequencies from reference data-sets (1000G, NHLBI-ESP, CGI-46) • dbSNP (sequence variation database) • COSMIC (somatic variant database) • Gene mapping (coding/splicing, UTR, intergenic) • Gene product effect type (e.g. loss of function, missense) • Effect Scoring • PhyloP (conservation) • CADD • Coding Missense Effect Scoring • SIFT • PolyPhen2 • MutationAssessor

  15. Allele Frequency Databases • Use: identify “suspect” somatic variants that show up as germline variants in these databases • 1000 Genomes • NHLBI-ESP • CGI-46 / CGI-69
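
As a rough illustration of this use, the sketch below flags a somatic call as "suspect" when any reference data-set reports it at or above a chosen population allele frequency. The 1% cut-off, the record layout, the data-set keys and the variant identifiers are all assumptions made for this example; the slides do not prescribe a threshold.

```python
def flag_suspect_germline(variants, max_population_af=0.01):
    """Annotate each call with its highest population allele frequency.

    `variants` is a list of dicts with a "population_af" mapping from
    data-set name to allele frequency (None when the variant is absent).
    """
    flagged = []
    for variant in variants:
        observed = [
            af for af in variant["population_af"].values() if af is not None
        ]
        highest = max(observed, default=0.0)
        flagged.append(
            {
                **variant,
                "max_af": highest,
                "suspect_germline": highest >= max_population_af,
            }
        )
    return flagged


# Invented calls, for illustration only.
calls = [
    {"id": "variant_A", "population_af": {"1000G": None, "ESP6500": None, "CG46": None}},
    {"id": "variant_B", "population_af": {"1000G": 0.12, "ESP6500": 0.10, "CG46": None}},
]
for call in flag_suspect_germline(calls):
    print(call["id"], call["max_af"], call["suspect_germline"])
```

Taking the maximum across data-sets reflects the later take-home message that the data-sets are complementary: a variant absent from one may still be common in another.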

  16. 1000 Genomes • Goal: identify all variants at > 1% frequency in represented human populations • Subjects: 1092 with available variants; 2500 at project completion • Launch date: 2007 • Many revisions (e.g. increased coverage)

  17. 1000 Genomes • Phase 1: variants available (*) • 1092 apparently healthy subjects • Ethnicities: • Represented: European, Black African, East Asian, Mixed Americans • Missing: South-east Asians, Indians, Middle-east, North Africans • 38.2 M SNPs, 3.9 M In/Dels • Platform: Illumina + SOLiD • Low coverage (2-4x) whole genome • Exome (50x coverage) • Variant calling: multiple methods including GATK Unified Genotyper • Phase 2: variant calling on-going • Phase 3: alignments just made available (*) version: 30 April 2012, ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521/README.phase1_integrated_release_version3_20120430

  18. NHLBI-ESP • Goal: discover heart, lung and blood disorder variants at frequency < 1% • Subjects: 6503 (ESP 6500 release) • Not necessarily healthy (includes individuals with extreme subclinical traits and diseased individuals) • Ethnicities: 2203 African-Americans, 4300 European-Americans • Platform: Illumina, exome sequencing (average 110x) • Variant calling: • SNV: glfMultiples + ad-hoc quality filtering • In/Dels: GATK Unified Genotyper

  19. CG-46 and CG-69 • Goal: variation in controls • Subjects: CG-46: 46 unrelated (recommended for allele frequencies); CG-69: CG-46 + two trios + extended CEU pedigree • Ethnicities: European, Black African, East Asian, Indian, Mexican • Platform: Complete Genomics (whole genome, 80x) • Variant calling: Complete Genomics pipeline

  20. CG-46

  21. Allele Frequency Databases: Take Home Messages • Different ethnic compositions • Whole genome / exome • Different platforms and (diploid) variant callers • Different sequencing depth <-> Different power for variant detection at different frequencies • Different number of subjects <-> Different capability to generalize across population • Data-sets are complementary • Constant updates, keep yourself updated!
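
The link between the number of subjects and the frequencies a data-set can represent can be illustrated with a back-of-the-envelope calculation: the chance that an allele at population frequency f is carried by at least one of N sampled diploid subjects is 1 - (1 - f)^(2N). This is a simplification that assumes independent autosomal chromosomes and perfect variant detection, so it ignores the sequencing depth and variant-calling differences the slide also lists. The cohort sizes are the ones quoted on the earlier slides; the function name is hypothetical.

```python
def prob_allele_sampled(allele_freq, n_subjects):
    """Probability that at least one of 2N chromosomes carries the allele."""
    return 1.0 - (1.0 - allele_freq) ** (2 * n_subjects)


# Subject counts quoted on the slides for each data-set.
cohorts = {"CG-46": 46, "1000 Genomes phase 1": 1092, "NHLBI-ESP": 6503}

for name, n_subjects in cohorts.items():
    for freq in (0.01, 0.001, 0.0001):
        chance = prob_allele_sampled(freq, n_subjects)
        print(f"{name}: f={freq} -> P(sampled at least once) = {chance:.3f}")
```

Even under these generous assumptions, a small cohort will often miss alleles around 1% frequency, which is one reason absence from a single data-set is weak evidence that a variant is somatic.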

  22. Variant Annotation Components • Variant database mapping • Allele frequencies from reference data-sets (1000G, NHLBI-ESP, CGI-46) • dbSNP (sequence variation database) • COSMIC (somatic variant database) • Gene mapping (coding/splicing, UTR, intergenic) • Gene product effect type (e.g. loss of function, missense) • Conservation • PhyloP • Missense Effect Scoring • SIFT • PolyPhen2 • MutationAssessor

  23. dbSNP • Broad scope repository of “small” genetic variation (the NCBI counterpart for structural variants is dbVar) • Submissions before and after NGS era • Includes polymorphisms found in the general population • Includes rare germline variants that are disease-associated (or suspected to be) • Includes somatic variants (also in COSMIC) • Good to look up variants • If you want to use it as a filter, make sure you first remove “clinically flagged” variants (somatic, germline) from the filter set

  24. COSMIC • “Catalogue Of Somatic Mutations In Cancer” • Reference database for somatic variation in cancer • Worth following up variants matching COSMIC entries • How many studies/samples was it found in? 1, many? • Does the variant overlap a hotspot? • Is the gene frequently mutated?
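
The follow-up questions on this slide (how many samples, hotspot or not, how often the gene is mutated) amount to counting distinct samples at two levels. The sketch below does this over a list of records; the record layout and every record in it are invented for illustration and are not COSMIC data or the COSMIC file format.

```python
from collections import Counter


def count_distinct_samples(records, key_fields):
    """Count distinct samples per key, e.g. per gene or per (gene, codon)."""
    seen = set()
    counts = Counter()
    for record in records:
        key = tuple(record[field] for field in key_fields)
        marker = (key, record["sample"])
        if marker not in seen:  # count each sample once per key
            seen.add(marker)
            counts[key] += 1
    return counts


# Invented records, for illustration only.
records = [
    {"gene": "BRAF", "codon": 600, "sample": "S1"},
    {"gene": "BRAF", "codon": 600, "sample": "S2"},
    {"gene": "BRAF", "codon": 600, "sample": "S2"},  # duplicate report
    {"gene": "BRAF", "codon": 469, "sample": "S3"},
    {"gene": "TP53", "codon": 273, "sample": "S1"},
]

per_codon = count_distinct_samples(records, ("gene", "codon"))
per_gene = count_distinct_samples(records, ("gene",))
print(per_codon.most_common(1))
print(per_gene)
```

Counting distinct samples rather than raw rows avoids inflating recurrence when the same tumour is reported by more than one study.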

  25. Looking up a well-established driver mutation in dbSNP and COSMIC (BRAF V600E)

  26. BRAF V600E: rs113488022 dbSNP

  27. BRAF V600E: rs113488022 dbSNP click

  28. BRAF V600E: rs113488022 dbSNP Clinical view • V600E (T>A): somatic, pathogenic • V600A (T>C): somatic / germline, pathogenicity untested

  29. BRAF V600E: rs113488022 dbSNP Clinical view • V600E (T>A): somatic, pathogenic • V600A (T>C): somatic / germline, pathogenicity untested • click

  30. BRAF V600E: rs113488022 From dbSNP to OMIM

  31. BRAF V600E

  32. click

  33. Variant Annotation Components • Variant database mapping • Allele frequencies from reference data-sets (1000G, NHLBI-ESP, CGI-46) • dbSNP (sequence variation database) • COSMIC (somatic variant database) • Gene mapping (coding/splicing, UTR, intergenic) • Gene product effect type (e.g. loss of function, missense) • Effect Scoring • PhyloP (conservation) • CADD • Coding Missense Effect Scoring • SIFT • PolyPhen2 • MutationAssessor

  34. Gene Mapping: Types of Genes • Protein-coding genes • Non-protein-coding RNA genes (e.g. miRNA) • Different functional relevance • Different knowledge of variant effects

  35. Gene Mapping: Parts of Genes • Protein-coding genes have these parts: • UTR (transcribed, not translated) • Coding exons (translated) • Introns (spliced out, not translated) • Splice sites • Also: • Upstream and downstream of the transcribed gene • Inter-genic

  36. Gene Mapping: Annovar’s priority system • Gene types and parts: what if they overlap..? • Whenever more than one mapping is possible, Annovar will follow this priority system

  37. Gene Mapping: Annovar’s priority system [Diagram: protein-coding gene G1, with its TSS (Transcription Start Site) marked, and a non-coding RNA ncR1 (e.g. miRNA)]

  38. Gene Mapping: Annovar’s priority system [Diagram: the G1 / ncR1 locus with each region labelled by the annotation Annovar reports. Labels shown: G1 Upstream, G1 UTR 5’, G1 Exonic, G1 Splicing (**), G1 Intronic, ncR1, G1 UTR 3’, G1 Downstream, ncR1 Downstream] ** Splice sites after the first were omitted to avoid clutter
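
The priority table itself is an image and is not in this transcript. As a sketch of how such a system resolves overlaps, the code below uses the precedence order as I understand it from Annovar's own documentation (exonic = splicing > ncRNA > UTR5/UTR3 > intronic > upstream/downstream > intergenic); treat that order, the simplified region labels and the function name as assumptions rather than as a copy of the slide.

```python
# Lower rank wins; regions sharing a rank are reported together.
# Order assumed from Annovar's documentation, with simplified labels.
RANK = {
    "exonic": 0, "splicing": 0,
    "ncRNA": 1,
    "UTR5": 2, "UTR3": 2,
    "intronic": 3,
    "upstream": 4, "downstream": 4,
    "intergenic": 5,
}


def resolve_mapping(candidates):
    """Keep only the highest-priority (gene, region) mappings for a variant."""
    if not candidates:
        return [(None, "intergenic")]
    best = min(RANK[region] for _, region in candidates)
    return [(gene, region) for gene, region in candidates if RANK[region] == best]


# A variant that is 5' UTR in one isoform and intronic in another
# (the situation described for KCNAB2 on a later slide).
print(resolve_mapping([("KCNAB2", "intronic"), ("KCNAB2", "UTR5")]))
# A variant intronic to G1 that also falls in the non-coding RNA ncR1.
print(resolve_mapping([("G1", "intronic"), ("ncR1", "ncRNA")]))
```

A consequence worth remembering is that the reported region is the most severe one across overlapping genes and isoforms, not necessarily the one relevant to the isoform expressed in the tumour.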

  39. Example of Annovar Output

  40. Example of Annovar Output

  41. Example of Annovar Output

  42. Example of Annovar Output

  43. Example of Annovar Output • More than one KCNAB2 isoform is present • Annovar reported the UTR5 and not the intron, following the priority rules

  44. Example of Annovar Output

  45. Example of Annovar Output • AG splicing acceptor intronic sequence becomes AA • This happens for both GORASP2 transcript isoforms • What will happen at the functional level..? Frameshift splicing?

  46. Splice Sites and Annovar: Annovar considers a +/-2 bp window around the intron/exon junction and reports the following splicing categories: • Splicing: 2 bp intronic • Splicing;exonic: 2 bp exonic General things to keep in mind: • The intronic site is much more biologically relevant • Other sequence features outside the +/- 2 bp splice site window may be important for guiding splicing • Splicing variants always need to be manually reviewed
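
The +/-2 bp window can be sketched as a position test against the exon boundaries of one transcript. This is an illustrative reimplementation of the rule as stated on the slide, not Annovar's code: it assumes 1-based inclusive exon coordinates sorted by position, treats only exon/intron junctions as splice sites (so the outer ends of the first and last exon are skipped), and uses hypothetical coordinates.

```python
def classify_splice_window(pos, exons, window=2):
    """Classify a position against the splice windows of one transcript.

    `exons` is a position-sorted list of 1-based inclusive (start, end) pairs.
    Returns "splicing" (intronic side), "splicing;exonic" (exonic side)
    or "outside splice window".
    """
    last = len(exons) - 1
    for index, (start, end) in enumerate(exons):
        if index > 0:  # junction between the previous intron and this exon
            if start - window <= pos < start:
                return "splicing"
            if start <= pos < min(start + window, end + 1):
                return "splicing;exonic"
        if index < last:  # junction between this exon and the next intron
            if end < pos <= end + window:
                return "splicing"
            if max(end - window, start - 1) < pos <= end:
                return "splicing;exonic"
    return "outside splice window"


# Hypothetical three-exon transcript.
exons = [(100, 200), (300, 400), (500, 600)]
for position in (299, 300, 202, 250, 100):
    print(position, classify_splice_window(position, exons))
```

Because the window is so narrow, a variant a few bases further into the intron is reported as plain intronic even if it disrupts splicing, which is why the slide asks for manual review.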

  47. AG splicing acceptor (intronic) is very well conserved across 46 vertebrates in UCSC
