
From Reads to Results Exome-seq analysis at CCBR

Justin Lack, March 8, 2015. Workflow for data analysis: read generation (FASTQ format, QC analysis, read trimming), read mapping, BAM processing, variant calling, variant annotation.



Presentation Transcript


  1. From Reads to Results: Exome-seq analysis at CCBR. Justin Lack, March 8, 2015

  2. Workflow for Data Analysis • Read Generation • Read Mapping • BAM Processing • Variant Calling • Variant Annotation

  3. Workflow for Data Analysis • Read Generation: FASTQ format, QC analysis, read trimming • Read Mapping • BAM Processing • Variant Calling • Variant Annotation

  4. FASTQ Data Format • FASTQ format: sequence ID, sequence, quality score
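
To make the three parts of a FASTQ record concrete, here is a minimal Python sketch that reads four-line records and decodes the quality string into Phred scores. It assumes the common Phred+33 encoding and unwrapped (single-line) sequences; the names `FastqRecord` and `parse_fastq` are illustrative and are not taken from the slides.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    read_id: str
    sequence: str
    qualities: List[int]  # one Phred score per base


def parse_fastq(handle: TextIO, offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream with four lines per read."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        # Assumption: ASCII code minus 33 gives the Phred score.
        scores = [ord(char) - offset for char in quality]
        yield FastqRecord(header[1:], sequence, scores)


# Made-up record for illustration only.
example = io.StringIO("@read1\nACGT\n+\nII5#\n")
records = list(parse_fastq(example))
print(records[0])
```

A Phred score Q corresponds to an error probability of 10^(-Q/10), so a score of 20 means a 1 in 100 chance that the base call is wrong.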

  5. Read Quality Assessment • Read quality analysis • Crucial to ensure high quality data • Can reveal issues in library preparation and sequence generation

  6. Read Quality Assessment • Read trimming • Trims reads for both adapter contamination and low quality • Absolutely essential for variant detection

  7. Read Quality Assessment • FastQC and trimming (Trimmomatic)
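
The two trimming steps named above (adapter removal and low-quality trimming) can be sketched in a few lines of Python. This is a simplified illustration, not Trimmomatic's implementation: the adapter is matched exactly, the sliding-window rule only approximates the idea of cutting once the mean quality in a window drops, and the window size and threshold are assumed example values.

```python
from typing import List, Tuple


def clip_adapter(sequence: str, qualities: List[int],
                 adapter: str) -> Tuple[str, List[int]]:
    """Cut the read at the first exact occurrence of the adapter."""
    position = sequence.find(adapter)
    if position == -1:
        return sequence, qualities
    return sequence[:position], qualities[:position]


def sliding_window_trim(sequence: str, qualities: List[int],
                        window: int = 4,
                        min_mean_quality: float = 15.0) -> Tuple[str, List[int]]:
    """Scan from the 5' end and cut at the first window whose mean
    quality falls below the threshold. Reads shorter than the window
    are returned unchanged."""
    for start in range(len(sequence) - window + 1):
        window_scores = qualities[start:start + window]
        if sum(window_scores) / window < min_mean_quality:
            return sequence[:start], qualities[:start]
    return sequence, qualities


# Made-up read whose last four bases have very low quality.
read_sequence = "ACGTACGTAC"
read_qualities = [30] * 6 + [2] * 4
trimmed_sequence, trimmed_qualities = sliding_window_trim(read_sequence,
                                                          read_qualities)
print(trimmed_sequence, trimmed_qualities)
```

Real trimmers also tolerate mismatches in the adapter and handle paired reads together, which this sketch leaves out.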

  8. Workflow for Data Analysis • Read Generation • Read Mapping: map reads to reference genome, alignment QC • BAM Processing • Variant Calling • Variant Annotation

  9. Read Mapping • Challenge: compare billions of short sequence reads against the human genome (3 Gb)
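
One way to see why an index makes this comparison tractable is a toy seed lookup: index every k-mer of the reference once, then let each k-mer of a read vote for a candidate start position. This hash-based sketch is a stand-in for illustration only; the BWA and Bowtie families use a compressed index based on the Burrows-Wheeler transform instead, and the reference, read, and `k` below are made up.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def build_kmer_index(reference: str, k: int) -> Dict[str, List[int]]:
    """Map every k-mer in the reference to the positions where it occurs."""
    index: Dict[str, List[int]] = defaultdict(list)
    for position in range(len(reference) - k + 1):
        index[reference[position:position + k]].append(position)
    return index


def candidate_positions(read: str, index: Dict[str, List[int]],
                        k: int) -> List[Tuple[int, int]]:
    """Return (reference start, supporting k-mer count), best first."""
    votes: Counter = Counter()
    for offset in range(len(read) - k + 1):
        for ref_position in index.get(read[offset:offset + k], ()):
            # A k-mer at read offset `offset` implies this read start.
            votes[ref_position - offset] += 1
    return votes.most_common()


reference = "TTGACCATGCAAGTCCGATAGGCT"
kmer_index = build_kmer_index(reference, k=4)
# Read taken from reference position 5 with one mismatch introduced.
read = "CATGCTAGTC"
best_hits = candidate_positions(read, kmer_index, k=4)
print(best_hits[0])
```

The lookup cost depends on the read length rather than the genome length, which is the property that real aligners exploit at much larger scale.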

  10. Different Alignment Algorithms • BWA – 2009 • BWA-SW – 2010 • BWA-MEM – 2013 • Bowtie – 2009 • Bowtie2 – 2012 • Gem – 2012 • Cushaw2 – 2014 • Novoalign. Source: Li, arXiv:1303.3997 (2013)

  11. Different Alignment Algorithms • BWA – 2009 • BWA-SW – 2010 • BWA-MEM – 2013 • Bowtie – 2009 • Bowtie2 – 2012 • Gem – 2012 • Cushaw2 – 2014 • Novoalign. Source: Li, arXiv:1303.3997 (2013)

  12. SAM/BAM Format • SAM (Sequence Alignment/Map) format • Single unified format for storing read alignments to a reference genome • BAM (Binary Alignment/Map) format • Binary equivalent of SAM • Advantages • Supports indexing • Compact size

  13. BAM File Format Header Data
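
As an illustration of the data section, the sketch below parses one SAM alignment line (the text form of a BAM record) and decodes its bitwise FLAG field. Header lines start with `@` and are not handled here. The example line is made up, and the names `SamRecord`, `parse_sam_line`, and `decode_flag` are illustrative; in practice a library such as pysam or `samtools view` would be used.

```python
from typing import NamedTuple, Set

# Subset of the FLAG bits defined by the SAM specification.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x400: "duplicate",
}


class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int  # 1-based leftmost mapping position
    mapq: int
    cigar: str
    seq: str
    qual: str


def parse_sam_line(line: str) -> SamRecord:
    """Parse the mandatory fields of one SAM alignment line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line has at least 11 fields")
    return SamRecord(fields[0], int(fields[1]), fields[2], int(fields[3]),
                     int(fields[4]), fields[5], fields[9], fields[10])


def decode_flag(flag: int) -> Set[str]:
    """Return the names of all FLAG bits that are set."""
    return {name for bit, name in FLAG_BITS.items() if flag & bit}


record = parse_sam_line("read1\t99\tchr1\t100\t60\t4M\t=\t200\t104\tACGT\tIIII")
print(record.rname, record.pos, sorted(decode_flag(record.flag)))
```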

  14. Alignment QC • Crucial for examining and summarizing quality of alignment at exome targets • GATK Depth of Coverage

  15. Alignment QC • Crucial for examining and summarizing quality of alignment at exome targets • Qualimap
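
The kind of summary these tools report at exome targets can be sketched as follows: the mean depth over a target interval and the fraction of target bases covered at or above a minimum depth. This is a toy calculation over made-up intervals, not the GATK or Qualimap implementation; coordinates are assumed to be 0-based and half-open, and the function name and threshold are illustrative.

```python
from typing import Iterable, Tuple


def target_coverage(alignments: Iterable[Tuple[int, int]],
                    target_start: int, target_end: int,
                    min_depth: int = 20) -> Tuple[float, float]:
    """Return (mean depth, fraction of bases with depth >= min_depth)
    for one target interval on a single chromosome."""
    depth = [0] * (target_end - target_start)
    for start, end in alignments:
        # Count only the part of the alignment that overlaps the target.
        for position in range(max(start, target_start), min(end, target_end)):
            depth[position - target_start] += 1
    mean_depth = sum(depth) / len(depth)
    fraction_covered = sum(d >= min_depth for d in depth) / len(depth)
    return mean_depth, fraction_covered


# Three made-up alignments overlapping a 10 bp target at positions 5 to 14.
reads = [(0, 10), (5, 15), (8, 20)]
mean_depth, fraction_at_3x = target_coverage(reads, 5, 15, min_depth=3)
print(mean_depth, fraction_at_3x)
```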

  16. BAM Visualization - IGV (figure labels: mismatches, reference)

  17. Workflow for Data Analysis • Read Generation • Read Mapping • BAM Processing: BAM/SAM alignment improvement • Variant Calling • Variant Annotation

  18. BAM Improvement • Short-read mappers are designed to balance accuracy and speed • The alignment algorithm can introduce errors, especially at challenging indels • Tools are designed to target specific systematic errors • Remove duplicates • Local realignment • Base quality recalibration

  19. Library Duplicates • Most next-generation sequencing platforms are NOT single-molecule sequencers • PCR amplification step in library preparation • Can result in duplicate DNA fragments in the final library prep • PCR-free protocols do exist, but they require large amounts of input DNA • Duplicates can result in false SNP calls • Duplicates manifest themselves as high read depth support

  20. Duplicates and False SNP Calls

  21. Remove Duplicates • Identify read-pairs where the outer ends map to the same position on the genome and remove all but 1 copy • Samtools: samtools rmdup or samtools rmdupse • Picard/GATK: MarkDuplicates
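
The rule above (same outer ends, keep one copy) can be sketched as a grouping step. The record layout and field names below are hypothetical, and keeping the pair with the highest summed base quality is one common tie-breaking choice assumed here for illustration; this is not the Samtools or Picard code.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def mark_duplicates(pairs: Iterable[Dict]) -> Set[str]:
    """Return the names of read pairs flagged as duplicates.

    Pairs sharing chromosome, both outer coordinates, and orientation
    are treated as copies of one original fragment."""
    groups: Dict[tuple, List[Dict]] = defaultdict(list)
    for pair in pairs:
        key = (pair["chrom"], pair["start"], pair["end"], pair["strand"])
        groups[key].append(pair)

    duplicates: Set[str] = set()
    for group in groups.values():
        # Keep the copy with the highest total base quality.
        best = max(group, key=lambda pair: pair["base_quality_sum"])
        duplicates.update(pair["name"] for pair in group if pair is not best)
    return duplicates


# Made-up read pairs; pairA and pairB share both outer ends.
read_pairs = [
    {"name": "pairA", "chrom": "chr1", "start": 100, "end": 350,
     "strand": "+", "base_quality_sum": 6000},
    {"name": "pairB", "chrom": "chr1", "start": 100, "end": 350,
     "strand": "+", "base_quality_sum": 5500},
    {"name": "pairC", "chrom": "chr1", "start": 120, "end": 350,
     "strand": "+", "base_quality_sum": 5800},
]
flagged = mark_duplicates(read_pairs)
print(flagged)
```

Marking rather than deleting duplicates keeps the reads in the BAM file while letting downstream tools skip them.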

  22. Local Realignment - indels • The trouble with mapping approaches

  23. Local Realignment - indels • The trouble with mapping approaches

  24. Local Realignment - indels • The trouble with mapping approaches

  25. Local Realignment - indels Local realignment

  26. Local realignment in GATK • Uses information from known SNPs/indels (dbSNP, 1000 Genomes) • Uses information from other reads • Smith-Waterman exhaustive alignment on select reads • Similar to GATK Haplotype Caller
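
For readers unfamiliar with Smith-Waterman, the sketch below computes the optimal local alignment score between a read and a reference window by dynamic programming. It uses a simple linear gap penalty and assumed scoring values; GATK's realigner uses its own parameters and additional logic, so this only illustrates the underlying recurrence.

```python
def smith_waterman_score(read: str, reference: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Return the best local alignment score (linear gap penalty)."""
    previous_row = [0] * (len(reference) + 1)
    best = 0
    for i in range(1, len(read) + 1):
        current_row = [0]
        for j in range(1, len(reference) + 1):
            substitution = match if read[i - 1] == reference[j - 1] else mismatch
            score = max(
                0,                                 # start a new local alignment
                previous_row[j - 1] + substitution,  # align the two bases
                previous_row[j] + gap,             # gap in the reference
                current_row[j - 1] + gap,          # gap in the read
            )
            current_row.append(score)
            best = max(best, score)
        previous_row = current_row
    return best


# A read carrying a 2 bp insertion relative to the reference window.
indel_score = smith_waterman_score("ACGTTTACGT", "ACGTACGT")
print(indel_score)
```

Because the cost grows with the product of the two sequence lengths, exhaustive alignment of this kind is only practical on selected reads and short windows, as the slide notes.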

  27. Quality scores issued by sequencers are inaccurate and biased • Quality scores are critical for all downstream analysis • Systematic biases are a major contributor to bad calls (https://www.broadinstitute.org/gatk/)

  28. Base Quality Recalibration

  29. Base Quality Recalibration in GATK • Align subsample of reads from a lane to human reference • Exclude all known dbSNP sites • Assume all other mismatches are sequencing errors • Compute a new calibration table based on mismatch rates per position on the read
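
The steps above can be sketched as a small table-building routine: skip known variant sites, count mismatches per read position, and convert each mismatch rate into an empirical Phred score. This is a simplified illustration with a single covariate (position on the read); the input layout and the +1/+2 pseudocounts that avoid a zero rate are assumptions, not GATK's exact model.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Site = Tuple[str, int]  # (chromosome, position)


def recalibration_table(observations: Iterable[Tuple[Site, int, bool]],
                        known_sites: Set[Site]) -> Dict[int, float]:
    """Return an empirical Phred quality per read position (cycle).

    Each observation is (site, cycle, is_mismatch) for one aligned base."""
    counts = defaultdict(lambda: [0, 0])  # cycle -> [mismatches, total]
    for site, cycle, is_mismatch in observations:
        if site in known_sites:
            continue  # a mismatch here may be a real variant, not an error
        counts[cycle][0] += int(is_mismatch)
        counts[cycle][1] += 1
    return {
        cycle: -10 * math.log10((mismatches + 1) / (total + 2))
        for cycle, (mismatches, total) in counts.items()
    }


# Made-up data: 1000 bases at cycle 1, 9 of them mismatching the reference,
# plus one mismatch at a known variant site that must be ignored.
observed = [(("chr1", i), 1, i < 9) for i in range(1000)]
observed.append((("chr1", 5000), 1, True))
table = recalibration_table(observed, known_sites={("chr1", 5000)})
print(round(table[1], 2))
```

If the sequencer had reported Q30 for these bases, the table would indicate that roughly Q20 is the better-supported value for that read position.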

  30. Base Quality Recalibration

  31. Workflow for Data Analysis • Read Generation • Read Mapping • BAM Processing • Variant Calling: germline variant detection, somatic variant detection, VCF files • Variant Annotation

  32. Germline Variant Detection • Mutations are hidden in the noise!

  33. Germline Variant Detection • Mutations are hidden in the noise! • Utilize GATK Haplotype Caller

  34. Germline Variant Detection • Mutations are hidden in the noise! • Utilize GATK Haplotype Caller • Genotype jointly to maximize information

  35. Germline Variant Detection • Mutations are hidden in the noise! • Utilize GATK Haplotype Caller • Genotype jointly to maximize information
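
To illustrate how a caller separates a real variant from sequencing noise, the sketch below computes diploid genotype likelihoods from the bases piled up at one site, using each base's quality as its error probability. This is a basic pileup model shown for intuition only; GATK HaplotypeCaller works differently (local reassembly of haplotypes and a pair-HMM), and the function name and data are made up.

```python
import math
from typing import Dict, Sequence


def genotype_log_likelihoods(bases: str, qualities: Sequence[int],
                             ref: str, alt: str) -> Dict[str, float]:
    """Return log10 likelihoods for genotypes 0/0, 0/1 and 1/1."""
    likelihoods = {}
    for genotype, alt_fraction in (("0/0", 0.0), ("0/1", 0.5), ("1/1", 1.0)):
        total = 0.0
        for base, quality in zip(bases, qualities):
            error = 10 ** (-quality / 10)
            # Probability of observing this base from a ref or alt chromosome;
            # an error is assumed equally likely to give any other base.
            from_ref = 1 - error if base == ref else error / 3
            from_alt = 1 - error if base == alt else error / 3
            total += math.log10((1 - alt_fraction) * from_ref
                                + alt_fraction * from_alt)
        likelihoods[genotype] = total
    return likelihoods


# Ten made-up Q30 reads: five support the reference A, five support G.
site_likelihoods = genotype_log_likelihoods("AAAAAGGGGG", [30] * 10, "A", "G")
best_genotype = max(site_likelihoods, key=site_likelihoods.get)
print(best_genotype, site_likelihoods)
```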

  36. Somatic Variant Detection • Genes and chromosomes can mutate in either somatic or germline tissue Mutation Detection

  37. An Example of Germline Variants (Robinson et al. 2011)

  38. An Example of Somatic Variants Normal Tumor

  39. Somatic Variant Detection • But somatic variant detection can be EXTREMELY difficult • Allelic fractions do not scale to ploidy
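
A short calculation shows why allelic fractions do not scale to ploidy. Under the standard simplifying assumptions of a clonal mutation and a diploid normal contaminant, the expected fraction of reads carrying the variant depends on tumor purity and local tumor copy number. The function name and the example values are illustrative.

```python
def expected_allele_fraction(purity: float, mutant_copies: int,
                             tumor_copy_number: int,
                             normal_copy_number: int = 2) -> float:
    """Expected variant allele fraction for a clonal somatic mutation.

    purity: fraction of cells in the sample that are tumor cells.
    mutant_copies: copies of the locus per tumor cell carrying the variant.
    """
    mutant_alleles = purity * mutant_copies
    total_alleles = (purity * tumor_copy_number
                     + (1 - purity) * normal_copy_number)
    return mutant_alleles / total_alleles


# A heterozygous germline variant sits near 0.5, but a somatic variant in a
# 40% pure diploid tumor is expected at only 0.2.
print(expected_allele_fraction(purity=1.0, mutant_copies=1, tumor_copy_number=2))
print(expected_allele_fraction(purity=0.4, mutant_copies=1, tumor_copy_number=2))
print(expected_allele_fraction(purity=0.5, mutant_copies=1, tumor_copy_number=4))
```

Subclonal mutations push the expected fraction lower still, which is one reason somatic signals can be hard to separate from noise.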

  40. Somatic Variant Detection • But somatic variant detection can be EXTREMELY difficult • Multiple additional sources of errors: low depth and/or tumor-contaminated normal; noise vs. event

  41. MuTect2 • Somatic caller that attempts to account for and model all of these sources of errors

  42. Variant Call Format (VCF) • VCF is a standardized format for storing DNA polymorphism data • SNPs, insertions, deletions and structural variants • With rich annotations • Indexed for fast data retrieval of variants from a range of positions • Store variant information across many samples • Record meta-data about the site: dbSNP accession, filter status, validation status • Very flexible format

  43. Example VCF
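
As a companion to the example, this sketch parses the eight fixed columns of one VCF data line (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) into a dictionary. The input line is made up, per-sample genotype columns are ignored, and `parse_vcf_line` is an illustrative name; established parsers such as pysam or cyvcf2 are the usual choice for real data.

```python
from typing import Any, Dict


def parse_vcf_line(line: str) -> Dict[str, Any]:
    """Parse the fixed columns of a tab-separated VCF data line."""
    chrom, pos, variant_id, ref, alt, qual, filter_status, info = (
        line.rstrip("\n").split("\t")[:8]
    )
    info_fields: Dict[str, Any] = {}
    for entry in info.split(";"):
        if entry == ".":
            continue  # '.' means no INFO annotations
        key, separator, value = entry.partition("=")
        # Keys without '=' are flags, for example DB (dbSNP membership).
        info_fields[key] = value if separator else True
    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": variant_id,
        "ref": ref,
        "alt": alt.split(","),  # several ALT alleles are comma-separated
        "qual": None if qual == "." else float(qual),
        "filter": filter_status,
        "info": info_fields,
    }


# Made-up variant record for illustration only.
variant = parse_vcf_line("chr1\t12345\t.\tA\tG,T\t50.0\tPASS\tDP=14;DB")
print(variant)
```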

  44. Workflow for Data Analysis • Read Generation • Read Mapping • BAM Processing • Variant Calling • Variant Annotation: genome annotation databases, AVIA…

  45. Annotation and Functional Prediction

  46. dbSNP • dbSNP is a free public archive, developed by NCBI, for genetic variation within and across different species (Sherry, Genome Res. 1999)

  47. 1000 Genomes Project • 15 million SNPs • 1 million short insertions/deletions • 20,000 structural variants (The 1000 Genomes Project Consortium, Nature 2010; www.1000genomes.org)

  48. COSMIC • COSMIC is the most comprehensive resource for exploring the impact of somatic mutations in human cancer (Forbes, Nucleic Acids Research 2015)

  49. COSMIC http://cancer.sanger.ac.uk/cancergenome/projects/cosmic/

  50. Lots lots more in AVIA!
