Canadian Bioinformatics Workshops www.bioinformatics.ca
Module 7 - Part 1: Annotation of Somatic Coding Variants
Daniele Merico
Bioinformatics for Cancer Genomics, May 26-30, 2014
Annovar
Informatics Facility, The Centre for Applied Genomics (TCAG), The Hospital for Sick Children
Learning Objectives of Module
I have detected somatic variants in a cancer sample. What information can I use to interpret them?
• What do different annotations mean?
• How do missense effect prediction models work?
• How to use an annotation tool: Annovar (LAB)
Cancer Driver Discovery: Biological Knowledge vs Frequency
• Small data-sets (1-10 subjects)
  • Variant previously reported
  • Gene function/disease phenotype + variant effect
• Large data-sets (> 100 subjects)
  • Variant recurrence
  • Over-represented pathways/networks
Variant vs Gene Information
We have to consider information at two levels:
• Gene
  • Is the gene central to processes related to cancer? (e.g. proliferation, apoptosis, matrix degradation)
  • Is the gene sensitive to perturbation? (e.g. haploinsufficiency)
• Variant
  • What is the variant effect on the gene product?
Passengers and Drivers (A Very Simplistic Model)
• Important Activator + Activating Variant → Cancer Drive
• Important Repressor + Loss-of-function Variant → Cancer Drive
• Redundant Gene (or gene controlling an unrelated process) → No effect
Passengers and Drivers (A Very Simplistic Model)
• Important Repressor + Silent Variant → No effect
• Important Repressor + Loss-of-function Variant → Cancer Drive
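The simplistic model on these two slides can be written as a small lookup. This is only a sketch of the slide's logic; the function name and the string labels ("activator", "loss_of_function", ...) are hypothetical and chosen for illustration.

```python
def predicted_impact(gene_role: str, variant_effect: str) -> str:
    """Return the outcome of the simplistic passenger/driver model.

    gene_role:      "activator", "repressor" or "redundant" (hypothetical labels)
    variant_effect: "activating", "loss_of_function" or "silent" (hypothetical labels)
    """
    # Only these two gene/variant combinations drive cancer in the model.
    driver_pairs = {
        ("activator", "activating"),
        ("repressor", "loss_of_function"),
    }
    if (gene_role, variant_effect) in driver_pairs:
        return "cancer drive"
    return "no effect"


if __name__ == "__main__":
    print(predicted_impact("repressor", "loss_of_function"))
    print(predicted_impact("repressor", "silent"))
```

Real driver assessment integrates recurrence, gene function and variant effect; the lookup only mirrors the deliberately simplified slide.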
Integrating Different Lines of Evidence
• Variant recurrence
• Gene product function / pathway
• Variant effect on the gene product
On Variant Size
Small: 1-50 bp
• SNV (Single Nucleotide Variants): 1 bp substitution, relatively easy to detect
• Small In/dels: a bit more challenging to detect
• Most available in databases; can be mapped by exact coordinates
Medium: 100-1,000 bp
• Insertions, Deletions, Translocations, Complex re-arrangements
• Most challenging to detect
• More tolerant mapping (e.g. partners of gene fusion)
Large: > 5 kbp
• Copy number variants: relatively easy to detect using arrays, more challenging using next generation sequencing
• More tolerant mapping (e.g. 50% reciprocal overlap, cytoband(s))
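The "50% reciprocal overlap" criterion for large variants can be sketched as follows. The function names are hypothetical, coordinates are assumed to be 1-based and inclusive, and both intervals are assumed to lie on the same chromosome.

```python
def reciprocal_overlap(start_a: int, end_a: int, start_b: int, end_b: int) -> float:
    """Return the smaller of the two covered fractions (0.0 if no overlap).

    Coordinates are assumed 1-based inclusive, on the same chromosome.
    """
    overlap = min(end_a, end_b) - max(start_a, start_b) + 1
    if overlap <= 0:
        return 0.0
    len_a = end_a - start_a + 1
    len_b = end_b - start_b + 1
    # Reciprocal: the shared segment must cover enough of BOTH intervals.
    return min(overlap / len_a, overlap / len_b)


def matches_known_cnv(query, known, threshold: float = 0.5) -> bool:
    """True if two (start, end) intervals reach the reciprocal overlap threshold."""
    return reciprocal_overlap(query[0], query[1], known[0], known[1]) >= threshold


if __name__ == "__main__":
    print(reciprocal_overlap(1, 100, 51, 150))
    print(matches_known_cnv((1, 1000), (1, 100)))
```

Taking the minimum of the two fractions is what makes the test reciprocal: a small variant fully contained in a much larger one does not count as a match.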
Variant Annotation Components
• Variant database mapping
  • Allele frequencies from reference data-sets (1000G, NHLBI-ESP, CG-46)
  • dbSNP (sequence variation database)
  • COSMIC (somatic variant database)
• Gene mapping (coding/splicing, UTR, intergenic)
• Gene product effect type (e.g. loss of function, missense)
• Effect Scoring
  • PhyloP (conservation)
  • CADD
• Coding Missense Effect Scoring
  • SIFT
  • PolyPhen2
  • MutationAssessor
Allele Frequency Databases
Use: identify “suspect” somatic variants that show up as germline variants in these databases
• 1000 Genomes
• NHLBI-ESP
• CG-46 / CG-69
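A minimal sketch of how such a check might look once a variant has been annotated with population allele frequencies. The database keys, the 1% cutoff and the function names are assumptions made for illustration, not values from the slides.

```python
from typing import Dict, Optional


def max_population_af(freqs: Dict[str, Optional[float]]) -> float:
    """Highest allele frequency across databases; None means 'not observed'."""
    observed = [af for af in freqs.values() if af is not None]
    return max(observed, default=0.0)


def is_suspect_somatic(freqs: Dict[str, Optional[float]],
                       af_threshold: float = 0.01) -> bool:
    """Flag a somatic call that also appears as a germline variant.

    af_threshold is a hypothetical cutoff; choose it for your own study.
    """
    return max_population_af(freqs) >= af_threshold


if __name__ == "__main__":
    call = {"1000G": 0.03, "ESP6500": None, "CG46": 0.02}
    print(is_suspect_somatic(call))
```

Taking the maximum across databases is a conservative choice: the data-sets differ in ethnic composition, so a variant that is common in any one of them is worth a second look.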
1000 Genomes
• Goal: identify all variants at > 1% frequency in represented human populations
• Subjects:
  • 1092 with available variants
  • 2500 at project completion
• Launch date: 2007
• Many revisions (e.g. increase coverage)
1000 Genomes
• Phase 1: variants available (*)
  • 1092 apparently healthy subjects
  • Ethnicities:
    • Represented: European, Black African, East Asian, Mixed Americans
    • Missing: South-east Asians, Indians, Middle-east, North Africans
  • 38.2 M SNPs, 3.9 M In/Dels
  • Platform: Illumina + SOLiD
    • Low coverage (2-4x) whole genome
    • Exon (50x coverage)
  • Variant calling: multiple methods including GATK Unified Genotyper
• Phase 2: variant calling on-going
• Phase 3: alignments just made available
(*) version: 30 April 2012, ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521/README.phase1_integrated_release_version3_20120430
NHLBI-ESP
• Goal: discover heart, lung and blood disorder variants at frequency < 1%
• Subjects: 6503 (ESP 6500 release)
  • Not necessarily healthy (includes individuals with extreme subclinical traits and diseased)
• Ethnicities: 2203 African-Americans, 4300 European-Americans
• Platform: Illumina, exome sequencing (average 110x)
• Variant calling:
  • SNV: glfMultiples + ad-hoc quality filtering
  • In/Dels: GATK Unified Genotyper
CG-46 and CG-69
• Goal: variation in controls
• Subjects:
  • CG-46: 46 unrelated (recommended for allele frequencies)
  • CG-69: CG-46 + two trios + extended CEU pedigree
• Ethnicities: European, Black African, East Asian, Indian, Mexican
• Platform: Complete Genomics (whole genome, 80x)
• Variant calling: Complete Genomics pipeline
Allele Frequency Databases: Take Home Messages
• Different ethnic compositions
• Whole genome / exome
• Different platforms and (diploid) variant callers
• Different sequencing depth <-> Different power for variant detection at different frequencies
• Different number of subjects <-> Different capability to generalize across population
• Data-sets are complementary
• Constant updates, keep yourself updated!
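To give a feel for the link between sequencing depth and detection power, here is a rough back-of-the-envelope sketch. It assumes a heterozygous site, reads that sample the two alleles independently (a binomial model), no sequencing errors, and a hypothetical rule that at least two supporting reads are needed. The real projects use multi-sample callers, so this is intuition only and does not describe any pipeline.

```python
from math import comb


def prob_at_least_k_alt_reads(depth: int,
                              min_alt_reads: int = 2,
                              alt_fraction: float = 0.5) -> float:
    """P(at least min_alt_reads reads carry the alternate allele).

    Binomial model: each of `depth` reads independently shows the alternate
    allele with probability `alt_fraction` (0.5 for a heterozygous site).
    """
    p_too_few = sum(
        comb(depth, k) * alt_fraction ** k * (1 - alt_fraction) ** (depth - k)
        for k in range(min_alt_reads)
    )
    return 1.0 - p_too_few


if __name__ == "__main__":
    for depth in (2, 4, 50, 110):
        print(depth, round(prob_at_least_k_alt_reads(depth), 4))
```

Under these assumptions a single low-coverage genome misses a sizeable share of heterozygous sites, which is one reason low-coverage designs rely on many subjects and joint calling.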
Variant Annotation Components
• Variant database mapping
  • Allele frequencies from reference data-sets (1000G, NHLBI-ESP, CG-46)
  • dbSNP (sequence variation database)
  • COSMIC (somatic variant database)
• Gene mapping (coding/splicing, UTR, intergenic)
• Gene product effect type (e.g. loss of function, missense)
• Conservation
  • PhyloP
• Missense Effect Scoring
  • SIFT
  • PolyPhen2
  • MutationAssessor
dbSNP
• Broad scope repository of “small” genetic variation (the NCBI counterpart for structural variants is dbVar)
• Submissions before and after NGS era
• Includes polymorphisms found in the general population
• Includes rare germline variants that are disease-associated (or suspected to be)
• Includes somatic variants (also in COSMIC)
• Good to look up variants
• If you want to use it as a filter, make sure you first exclude “clinically flagged” variants (somatic, germline) from the filter set, so that they are not discarded
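The filtering advice above can be sketched as a small predicate. The record layout (a dict with a "clinically_flagged" field), the variant keys and the toy coordinates are all hypothetical; they stand in for whatever dbSNP-derived table you actually load.

```python
from typing import Dict, Tuple

# Variant key: (chromosome, position, ref allele, alt allele).
VariantKey = Tuple[str, int, str, str]


def keep_after_dbsnp_filter(variant: VariantKey,
                            dbsnp: Dict[VariantKey, dict]) -> bool:
    """Decide whether a somatic call survives a dbSNP-based filter.

    A call is discarded only if it is in dbSNP AND is not clinically flagged.
    """
    record = dbsnp.get(variant)
    if record is None:
        return True  # not in dbSNP: nothing to filter on
    return bool(record["clinically_flagged"])


if __name__ == "__main__":
    # Toy records with made-up coordinates, for illustration only.
    dbsnp = {
        ("chrT", 100, "A", "G"): {"clinically_flagged": False},  # common polymorphism
        ("chrT", 200, "A", "T"): {"clinically_flagged": True},   # known pathogenic
    }
    calls = [("chrT", 100, "A", "G"), ("chrT", 200, "A", "T"), ("chrT", 300, "C", "T")]
    print([c for c in calls if keep_after_dbsnp_filter(c, dbsnp)])
```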
COSMIC
• “Catalogue Of Somatic Mutations In Cancer”
• Reference database for somatic variation in cancer
• Worth following up variants matching COSMIC entries
  • How many studies/samples was it found in? One, or many?
  • Does the variant overlap a hotspot?
  • Is the gene frequently mutated?
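The follow-up questions above boil down to counting. A minimal sketch, assuming you have a flat list of (sample, gene, amino-acid position) observations; the input layout, the sample names and the two-sample cutoff are hypothetical and do not reflect COSMIC's own file format.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Observation = Tuple[str, str, int]  # (sample_id, gene, amino-acid position)


def count_recurrence(observations: Iterable[Observation]) -> Counter:
    """Number of distinct samples mutated at each (gene, position)."""
    unique = set(observations)  # count each sample at most once per site
    return Counter((gene, pos) for _sample, gene, pos in unique)


def hotspots(observations: Iterable[Observation],
             min_samples: int = 2) -> Dict[Tuple[str, int], int]:
    """Sites mutated in at least min_samples distinct samples."""
    counts = count_recurrence(observations)
    return {site: n for site, n in counts.items() if n >= min_samples}


if __name__ == "__main__":
    obs = [("s1", "BRAF", 600), ("s2", "BRAF", 600),
           ("s2", "BRAF", 600), ("s3", "TP53", 175)]
    print(hotspots(obs))
```

Summing the same counter by gene instead of by (gene, position) answers the third question, whether the gene as a whole is frequently mutated.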
Looking up a well-established driver mutation in dbSNP and COSMIC (BRAF V600E)
BRAF V600E: rs113488022 dbSNP
BRAF V600E: rs113488022 dbSNP Clinical view
• V600E (T>A): somatic, pathogenic
• V600A (T>C): somatic / germline, pathogenicity untested
BRAF V600E: rs113488022 From dbSNP to OMIM
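A quick way to sanity-check which nucleotide change gives which amino-acid change at BRAF codon 600 is to translate the codon by hand. The sketch below uses only the three standard genetic code entries it needs; the codon is written on the coding (transcript) strand, and since BRAF lies on the reverse genomic strand, the alleles listed in genomic coordinates are the complements. The function names are hypothetical.

```python
# Subset of the standard genetic code, enough for this example.
CODON_TO_AA = {"GTG": "V", "GAG": "E", "GCG": "A"}


def substitute(codon: str, index: int, new_base: str) -> str:
    """Return the codon with the base at `index` (0-based) replaced."""
    return codon[:index] + new_base + codon[index + 1:]


def describe_change(ref_codon: str, index: int, new_base: str,
                    residue_number: int) -> str:
    """Protein-level label, e.g. 'V600E', for a single-base substitution."""
    alt_codon = substitute(ref_codon, index, new_base)
    return f"{CODON_TO_AA[ref_codon]}{residue_number}{CODON_TO_AA[alt_codon]}"


if __name__ == "__main__":
    # BRAF codon 600 is GTG (Val); the variant hits the middle base.
    print(describe_change("GTG", 1, "A", 600))  # T>A
    print(describe_change("GTG", 1, "C", 600))  # T>C
```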
Gene Mapping: Types of Genes
• Protein-coding genes
• Non-protein-coding RNA genes (e.g. miRNA)
  • Different functional relevance
  • Different knowledge of variant effects
Gene Mapping: Parts of Genes
Protein-coding genes have these parts:
• UTR (transcribed, not translated)
• Coding exons (translated)
• Introns (spliced out, not translated)
• Splice sites
Also:
• Upstream or downstream of the transcribed gene
• Inter-genic
Gene Mapping: Annovar’s priority system
• Gene types and parts: what if they overlap?
• Whenever more than one mapping is possible, Annovar will follow this priority system
Gene Mapping: Annovar’s priority system
[Figure: a protein-coding gene G1, with its TSS (Transcription Start Site) marked, and a non-coding RNA ncR1 (e.g. miRNA). Segments along the locus are labelled with the mapping Annovar reports: G1 Upstream, G1 UTR 5’, G1 Exonic, G1 Splicing (**), G1 Intronic, ncR1, G1 UTR 3’, G1 Downstream, ncR1 Downstream.]
** Splice sites after the first were omitted to avoid clutter
Example of Annovar Output
• More than one KCNAB2 isoform is present
• Annovar reported the UTR5 and not the intron, following the priority rules
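The priority rule can be sketched as a rank lookup. The ranks below assume the precedence order listed in the ANNOVAR gene-annotation documentation (exonic = splicing > ncRNA > UTR5/UTR3 > intronic > upstream/downstream > intergenic); check them against the version you run. The function name is hypothetical and this is not Annovar's own code.

```python
from typing import Iterable, List

# Lower rank = higher priority. Assumed from the ANNOVAR documentation.
PRECEDENCE = {
    "exonic": 1, "splicing": 1,
    "ncRNA": 2,
    "UTR5": 3, "UTR3": 3,
    "intronic": 4,
    "upstream": 5, "downstream": 5,
    "intergenic": 6,
}


def pick_region(candidates: Iterable[str]) -> List[str]:
    """Return the highest-priority mapping(s) among all candidate mappings.

    Ties (e.g. UTR5 and UTR3 from different isoforms) are all returned.
    """
    unique = set(candidates)
    best_rank = min(PRECEDENCE[c] for c in unique)
    return sorted(c for c in unique if PRECEDENCE[c] == best_rank)


if __name__ == "__main__":
    # One isoform places the variant in the 5' UTR, another in an intron.
    print(pick_region(["intronic", "UTR5"]))
```

With the KCNAB2 example, one isoform maps the variant to the 5' UTR and another to an intron, and the lookup keeps the UTR5 mapping, as in the reported output.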
Example of Annovar Output
• AG splicing acceptor intronic sequence becomes AA
• This happens for both GORASP2 transcript isoforms
• What will happen at the functional level? Frameshift splicing?
Splice Sites and Annovar
Annovar considers a +/- 2 bp window around the intron/exon junction and reports the following splicing categories:
• Splicing: 2 bp intronic
• Splicing;exonic: 2 bp exonic
General things to keep in mind:
• The intronic site is much more biologically relevant
• Other sequence features outside the +/- 2 bp splice site window may be important for guiding splicing
Splicing variants always need to be manually reviewed
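The +/- 2 bp window can be sketched for a single exon as follows. The sketch assumes 1-based inclusive exon coordinates and an internal exon (both ends are real splice junctions); the function name and the "other" label are hypothetical, and the category labels follow the slide rather than Annovar's exact output strings.

```python
def classify_near_junction(pos: int, exon_start: int, exon_end: int,
                           window: int = 2) -> str:
    """Classify a position relative to one internal exon.

    Returns "splicing" (intronic, within `window` bp of the junction),
    "splicing;exonic" (exonic, within `window` bp of the junction),
    "exonic", or "other" (outside exon and window).
    """
    if exon_start <= pos <= exon_end:
        near_edge = (pos - exon_start < window) or (exon_end - pos < window)
        return "splicing;exonic" if near_edge else "exonic"
    in_intronic_window = (
        exon_start - window <= pos < exon_start
        or exon_end < pos <= exon_end + window
    )
    return "splicing" if in_intronic_window else "other"


if __name__ == "__main__":
    for pos in (97, 98, 99, 100, 101, 102, 200, 201, 202, 203):
        print(pos, classify_near_junction(pos, exon_start=100, exon_end=200))
```

A transcript-aware version would skip the outer ends of the first and last exon, which are not splice junctions, and, as noted above, signals outside the window can still matter for splicing.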
AG splicing acceptor (intronic) is very well conserved across 46 vertebrates in UCSC