From Reads to Results: Exome-seq analysis at CCBR. Justin Lack, March 8, 2015.
Workflow for Data Analysis • Read Generation • Read Mapping • BAM Processing • Variant Calling • Variant Annotation
Workflow for Data Analysis • Read Generation: FASTQ format, QC analysis, read trimming • Read Mapping • BAM Processing • Variant Calling • Variant Annotation
FASTQ Data Format • Each FASTQ record holds a sequence ID, the sequence, and a quality score for each base
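To make the record layout concrete, here is a minimal Python sketch of a FASTQ reader. It assumes the common four-line layout (ID line starting with "@", sequence, a "+" separator, quality string) and Phred+33 quality encoding; the names `FastqRecord` and `parse_fastq` are illustrative and not part of any tool named in these slides.

```python
import io
from typing import Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    read_id: str
    sequence: str
    qualities: List[int]  # one Phred score per base


def parse_fastq(handle, offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from a four-line-per-read FASTQ stream.

    Assumption: qualities are Phred+33 encoded (offset=33).
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        scores = [ord(char) - offset for char in quality]
        yield FastqRecord(header[1:], sequence, scores)


# A made-up single-read example.
example = io.StringIO("@read1\nACGT\n+\nII5!\n")
records = list(parse_fastq(example))
print(records[0])
```

In this example the quality characters "I", "5" and "!" decode to Phred scores 40, 20 and 0.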
Read Quality Assessment • Read quality analysis • Crucial to ensure high quality data • Can reveal issues in library preparation and sequence generation
Read Quality Assessment • Read trimming • Removes adapter contamination and low-quality bases from reads • Essential for accurate variant detection
Read Quality Assessment • FastQC and trimming (Trimmomatic)
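The two trimming steps can be sketched in a few lines of Python. This is a simplified illustration, not Trimmomatic's algorithm: it clips an exact adapter match (full, or a partial match at the 3' end) and then removes trailing bases below a quality threshold. The function names, the `min_overlap` and `min_quality` defaults, and the adapter string used in the example are assumptions chosen for illustration.

```python
def clip_adapter(sequence, qualities, adapter, min_overlap=3):
    """Remove an exact adapter match, including a partial match at the 3' end."""
    index = sequence.find(adapter)
    if index == -1:
        # Look for the longest adapter prefix that ends the read.
        longest = min(len(adapter), len(sequence))
        for overlap in range(longest, min_overlap - 1, -1):
            if sequence.endswith(adapter[:overlap]):
                index = len(sequence) - overlap
                break
    if index == -1:
        return sequence, qualities
    return sequence[:index], qualities[:index]


def trim_trailing_low_quality(sequence, qualities, min_quality=20):
    """Drop bases from the 3' end until one meets the quality threshold."""
    end = len(sequence)
    while end > 0 and qualities[end - 1] < min_quality:
        end -= 1
    return sequence[:end], qualities[:end]


def trim_read(sequence, qualities, adapter, min_quality=20):
    sequence, qualities = clip_adapter(sequence, qualities, adapter)
    return trim_trailing_low_quality(sequence, qualities, min_quality)


# Illustrative adapter prefix and a made-up read ending in part of it.
ADAPTER = "AGATCGGAAGAGC"
read = "ACGTACGTAGATCGG"
quals = [30, 30, 30, 30, 30, 10, 25, 5] + [30] * 7
trimmed_seq, trimmed_quals = trim_read(read, quals, ADAPTER)
print(trimmed_seq, trimmed_quals)
```

Note that only trailing low-quality bases are removed here; the internal base with quality 10 is kept. Production trimmers typically use sliding windows and tolerate adapter mismatches.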
Workflow for Data Analysis • Read Generation • Read Mapping: map reads to reference genome, alignment QC • BAM Processing • Variant Calling • Variant Annotation
Read Mapping • Challenge: compare billions of short sequence reads against the human genome (about 3 Gb)
Different Alignment Algorithms • BWA – 2009 • BWA-SW – 2010 • BWA-MEM – 2013 • Bowtie – 2009 • Bowtie2 – 2012 • Gem – 2012 • Cushaw2 – 2014 • Novoalign. Reference: Li, arXiv:1303.3997 (2013)
SAM/BAM Format • SAM (Sequence Alignment/Map) format • Single unified format for storing read alignments to a reference genome • BAM (Binary Alignment/Map) format • Binary equivalent of SAM • Advantages • Supports indexing • Compact size
BAM File Format • Header • Data
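As a rough illustration of the header/data split, the sketch below parses the text (SAM) form: header lines start with "@", and each alignment line carries eleven mandatory tab-separated fields followed by optional tags. The field order and FLAG bit meanings follow the SAM specification; the helper names and the example read are made up for illustration.

```python
from typing import List, NamedTuple

# FLAG bit meanings as defined in the SAM specification.
FLAG_BITS = {
    0x1: "paired", 0x2: "proper_pair", 0x4: "unmapped", 0x8: "mate_unmapped",
    0x10: "reverse_strand", 0x20: "mate_reverse_strand",
    0x40: "first_in_pair", 0x80: "second_in_pair", 0x100: "secondary",
    0x200: "qc_fail", 0x400: "duplicate", 0x800: "supplementary",
}


class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int       # 1-based leftmost mapping position
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    tags: List[str]


def parse_sam_line(line: str) -> SamRecord:
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs 11 mandatory fields")
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10], fields[11:],
    )


def decode_flag(flag: int) -> List[str]:
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def split_header_and_records(lines):
    header = [line for line in lines if line.startswith("@")]
    records = [parse_sam_line(line) for line in lines if not line.startswith("@")]
    return header, records


sam_lines = [
    "@HD\tVN:1.6\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:1000",
    "read1\t99\tchr1\t100\t60\t4M\t=\t200\t104\tACGT\tIIII\tNM:i:0",
]
header, records = split_header_and_records(sam_lines)
print(records[0].rname, records[0].pos, decode_flag(records[0].flag))
```

For real BAM files a library such as pysam or the samtools command line handles the binary encoding and indexing; this sketch only shows the logical layout.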
Alignment QC • Crucial for examining and summarizing quality of alignment at exome targets • GATK Depth of Coverage
Alignment QC • Crucial for examining and summarizing quality of alignment at exome targets • Qualimap
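The kind of summary these tools report can be sketched as follows: given the read depth at each targeted base, compute the mean depth and the fraction of target bases covered at or above a few thresholds. This is an illustration only, not the output format of GATK Depth of Coverage or Qualimap; the function name, thresholds, and depths are assumptions.

```python
def summarize_coverage(depths, thresholds=(10, 20, 30)):
    """Summarize per-base depths over the exome targets."""
    if not depths:
        raise ValueError("no target positions supplied")
    total_positions = len(depths)
    summary = {"mean_depth": sum(depths) / total_positions}
    for threshold in thresholds:
        covered = sum(1 for depth in depths if depth >= threshold)
        summary[f"fraction_ge_{threshold}x"] = covered / total_positions
    return summary


# Made-up depths for eight target bases.
target_depths = [0, 5, 12, 25, 40, 60, 33, 18]
coverage = summarize_coverage(target_depths)
print(coverage)
```

A low fraction of targets at the chosen minimum depth is usually a more useful warning sign than the mean alone, since a few very deep regions can hide poorly covered ones.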
BAM Visualization - IGV • Figure labels: Mismatches, Reference
Workflow for Data Analysis • Read Generation • Read Mapping • BAM Processing: BAM/SAM alignment improvement • Variant Calling • Variant Annotation
BAM Improvement • Short-read mappers are designed to balance accuracy and speed • Their heuristics can introduce errors, especially at challenging indels • Dedicated tools target specific systematic errors: remove duplicates, local realignment, base quality recalibration
Library Duplicates • Most next-generation sequencing platforms are not single-molecule sequencers • The PCR amplification step in library preparation can produce duplicate DNA fragments in the final library • PCR-free protocols exist but require large amounts of input DNA • Duplicates show up as artificially high read-depth support • This can result in false SNP calls
Remove Duplicates • Identify read pairs whose outer ends map to the same genomic positions and keep only one copy • Samtools: samtools rmdup or samtools rmdupse • Picard/GATK: MarkDuplicates
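The idea can be sketched in Python as grouping read pairs by their outer coordinates and keeping one representative per group. This is a simplified illustration, not the samtools or Picard implementation: the `ReadPair` fields are hypothetical, and keeping the pair with the highest summed base quality is an assumed tie-breaking rule.

```python
from collections import defaultdict
from typing import NamedTuple


class ReadPair(NamedTuple):
    name: str
    chrom: str
    left_outer: int    # 5' position of the leftmost read
    right_outer: int   # 5' position of its mate
    orientation: str   # e.g. "FR"
    base_quality_sum: int


def find_duplicates(pairs):
    """Return names of pairs that duplicate a better pair with the same outer ends."""
    groups = defaultdict(list)
    for pair in pairs:
        key = (pair.chrom, pair.left_outer, pair.right_outer, pair.orientation)
        groups[key].append(pair)

    duplicates = set()
    for group in groups.values():
        # Keep the pair with the highest total base quality (assumed rule).
        best = max(group, key=lambda pair: pair.base_quality_sum)
        duplicates.update(pair.name for pair in group if pair is not best)
    return duplicates


# Made-up pairs: a and b share both outer ends, c differs at one end.
pairs = [
    ReadPair("a", "chr1", 100, 350, "FR", 1500),
    ReadPair("b", "chr1", 100, 350, "FR", 1400),
    ReadPair("c", "chr1", 100, 360, "FR", 1300),
]
duplicate_names = find_duplicates(pairs)
print(duplicate_names)
```

Here only pair "b" is flagged: it shares both outer ends with "a" and has the lower quality sum, while "c" ends at a different position and is kept.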
Local Realignment - indels • The trouble with mapping approaches
Local Realignment - indels Local realignment
Local realignment in GATK • Uses information from known SNPs/indels (dbSNP, 1000 Genomes) • Uses information from other reads • Smith-Waterman exhaustive alignment on select reads • Similar to GATK Haplotype Caller
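For readers unfamiliar with Smith-Waterman, the following is a minimal textbook version with a linear gap penalty. It is a sketch of the general algorithm only, not GATK's realigner; the scoring values (match +3, mismatch -3, gap -2) and the two short sequences are arbitrary illustrative choices.

```python
def smith_waterman(query, target, match=3, mismatch=-3, gap=-2):
    """Return (best score, aligned query, aligned target) for a local alignment."""
    rows, cols = len(query) + 1, len(target) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the dynamic-programming matrix; scores never drop below zero.
    for i in range(1, rows):
        for j in range(1, cols):
            pair_score = match if query[i - 1] == target[j - 1] else mismatch
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + pair_score,  # align the two bases
                score[i - 1][j] + gap,             # gap in target
                score[i][j - 1] + gap,             # gap in query
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero score is reached.
    aligned_query, aligned_target = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        pair_score = match if query[i - 1] == target[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + pair_score:
            aligned_query.append(query[i - 1])
            aligned_target.append(target[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            aligned_query.append(query[i - 1])
            aligned_target.append("-")
            i -= 1
        else:
            aligned_query.append("-")
            aligned_target.append(target[j - 1])
            j -= 1
    return best, "".join(reversed(aligned_query)), "".join(reversed(aligned_target))


result = smith_waterman("TGTTACGG", "GGTTGACTA")
print(result)
```

The exhaustive matrix fill is what makes this approach too slow for whole-genome mapping but affordable when applied only to selected reads around candidate indels.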
Quality scores issued by sequencers are inaccurate and biased • Quality scores are critical for all downstream analysis • Systematic biases are a major contributor to bad calls (https://www.broadinstitute.org/gatk/)
Base Quality Recalibration in GATK • Align a subsample of reads from a lane to the human reference • Exclude all known dbSNP sites • Assume all other mismatches are sequencing errors • Compute a new calibration table based on mismatch rates per position on the read
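The steps above can be sketched as a small counting exercise. A Phred score Q corresponds to an error probability of 10^(-Q/10), so an empirical quality can be computed from the observed mismatch rate in each bin. This sketch bins only by reported quality and read position and uses add-one smoothing; both are simplifying assumptions, the real GATK tool models more covariates, and all names here are illustrative.

```python
import math
from collections import defaultdict


def phred_to_error_prob(quality):
    return 10 ** (-quality / 10)


def error_prob_to_phred(prob):
    return -10 * math.log10(prob)


def build_recalibration_table(observations, known_sites):
    """Map (reported quality, read position) to an empirical Phred quality.

    observations: iterable of (site, reported_quality, cycle, is_mismatch).
    known_sites: set of sites (e.g. dbSNP positions) to exclude.
    """
    counts = defaultdict(lambda: [0, 0])  # [mismatches, total bases]
    for site, reported_quality, cycle, is_mismatch in observations:
        if site in known_sites:
            continue  # a mismatch here may be a real variant, not an error
        entry = counts[(reported_quality, cycle)]
        entry[0] += int(is_mismatch)
        entry[1] += 1

    table = {}
    for key, (mismatches, total) in counts.items():
        # Add-one smoothing avoids an infinite quality when no mismatch is seen.
        error_rate = (mismatches + 1) / (total + 2)
        table[key] = error_prob_to_phred(error_rate)
    return table


# Made-up data: bases reported as Q30 at read position 5.
observations = [(("chr1", i), 30, 5, False) for i in range(989)]
observations += [(("chr1", 1000 + i), 30, 5, True) for i in range(9)]
observations.append((("chr1", 5000), 30, 5, True))  # known variant site, skipped
table = build_recalibration_table(observations, known_sites={("chr1", 5000)})
print(table)
```

In this made-up data the bases were reported as Q30 (one error in 1,000) but mismatch at about one in 100, so the recalibrated quality comes out near Q20.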
Workflow for Data Analysis • Read Generation • Read Mapping • BAM Processing • Variant Calling: germline variant detection, somatic variant detection, VCF files • Variant Annotation
Germline Variant Detection • Mutations are hidden in the noise! • Utilize GATK Haplotype Caller • Genotype jointly to maximize information
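To illustrate why a statistical model is needed to separate variants from noise, here is a deliberately simple diploid genotype likelihood calculation for one site in one sample. It is not the HaplotypeCaller model and does no joint genotyping; it assumes independent reads, a single fixed per-base error rate, and only two alleles. The function names and the 1% error rate are illustrative.

```python
import math


def genotype_log_likelihoods(ref_count, alt_count, error_rate=0.01):
    """log10 likelihood of the read counts under each diploid genotype."""
    # Probability that a single read shows the alternate allele.
    alt_read_prob = {"0/0": error_rate, "0/1": 0.5, "1/1": 1 - error_rate}
    return {
        genotype: alt_count * math.log10(prob) + ref_count * math.log10(1 - prob)
        for genotype, prob in alt_read_prob.items()
    }


def call_genotype(ref_count, alt_count, error_rate=0.01):
    likelihoods = genotype_log_likelihoods(ref_count, alt_count, error_rate)
    return max(likelihoods, key=likelihoods.get)


# Two alternate reads out of 20 are better explained by sequencing error,
# while eight out of 20 support a heterozygous call.
noise_call = call_genotype(ref_count=18, alt_count=2)
het_call = call_genotype(ref_count=12, alt_count=8)
print(noise_call, het_call)
```

Joint genotyping extends this idea by sharing evidence across samples, so that weak support in one sample can be weighed against what is seen in the others.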
Somatic Variant Detection • Genes and chromosomes can mutate in either somatic or germline tissue
An Example of Germline Variants (Robinson et al. 2011)
An Example of Somatic Variants • Panels: Normal, Tumor
Somatic Variant Detection • Somatic variant detection can be extremely difficult • Allele fractions do not follow the 0, 0.5, or 1 expected from diploid germline genotypes
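A short worked calculation shows why somatic allele fractions drift away from germline expectations. Under a simple mixture assumption, the expected fraction of reads carrying a mutation depends on tumor purity, the tumor copy number at the locus, and how many of those copies are mutated. The function below is an illustrative sketch under those assumptions (clonal mutation, no sequencing error or mapping bias); its name and parameters are hypothetical.

```python
def expected_allele_fraction(purity, tumor_copy_number, mutated_copies,
                             normal_copy_number=2):
    """Expected variant allele fraction for a clonal somatic mutation."""
    if not 0 <= purity <= 1:
        raise ValueError("purity must be between 0 and 1")
    if mutated_copies > tumor_copy_number:
        raise ValueError("mutated copies cannot exceed tumor copy number")
    # Total copies of the locus contributed by tumor and normal cells.
    total_copies = purity * tumor_copy_number + (1 - purity) * normal_copy_number
    if total_copies == 0:
        raise ValueError("no copies of the locus in the sample")
    return purity * mutated_copies / total_copies


pure_diploid = expected_allele_fraction(1.0, 2, 1)      # behaves like a germline het
half_pure = expected_allele_fraction(0.5, 2, 1)         # diluted by normal cells
amplified_locus = expected_allele_fraction(0.3, 4, 1)   # low purity, four copies
print(pure_diploid, half_pure, amplified_locus)
```

A heterozygous mutation in a pure diploid tumor is expected at 0.5, but at 50% purity it drops to 0.25, and in a 30% pure tumor with four copies it falls to roughly 0.12, which is much closer to the noise level.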
Somatic Variant Detection • Somatic variant detection can be extremely difficult • Multiple additional sources of error • Low depth and/or tumor-contaminated normal • Noise vs. event
MuTect2 • Somatic caller that attempts to account for and model all of these sources of errors
Variant Call Format (VCF) • VCF is a standardized format for storing DNA polymorphism data • Covers SNPs, insertions, deletions and structural variants, with rich annotations • Indexed for fast retrieval of variants from a range of positions • Stores variant information across many samples • Records metadata about the site: dbSNP accession, filter status, validation status • Very flexible format
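The layout of a VCF data line can be illustrated with a small parser. Each line has eight fixed tab-separated columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO), optionally followed by a FORMAT column and one column per sample. The sketch below handles only that basic structure; the function names, sample names, and example record are made up for illustration.

```python
def parse_info(info):
    """Turn 'DP=30;DB' into {'DP': '30', 'DB': True}."""
    result = {}
    if info == ".":
        return result
    for item in info.split(";"):
        key, separator, value = item.partition("=")
        result[key] = value if separator else True  # flags carry no value
    return result


def parse_vcf_line(line, sample_names):
    fields = line.rstrip("\n").split("\t")
    chrom, pos, variant_id, ref, alt, qual, filter_status, info = fields[:8]
    record = {
        "chrom": chrom,
        "pos": int(pos),
        "id": variant_id,
        "ref": ref,
        "alt": alt.split(","),
        "qual": None if qual == "." else float(qual),
        "filter": filter_status,
        "info": parse_info(info),
        "samples": {},
    }
    if len(fields) > 9:
        format_keys = fields[8].split(":")
        for name, sample_field in zip(sample_names, fields[9:]):
            record["samples"][name] = dict(zip(format_keys, sample_field.split(":")))
    return record


# A made-up record with two hypothetical samples.
vcf_line = "chr1\t12345\t.\tA\tG\t50.0\tPASS\tDP=30;DB\tGT:DP\t0/1:14\t0/0:16"
variant = parse_vcf_line(vcf_line, sample_names=["tumor", "normal"])
print(variant)
```

For real data, an established reader such as pysam or cyvcf2 also handles the header definitions, indexing, and value typing that this sketch leaves out.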
Workflow for Data Analysis • Read Generation • Read Mapping • BAM Processing • Variant Calling • Variant Annotation: genome annotation databases, AVIA…
dbSNP • dbSNP is a free public archive of genetic variation within and across species, developed by NCBI (Sherry, Genome Res. 1999)
1000 Genomes Project • 15 million SNPs • 1 million short insertions/deletions • 20,000 structural variants (The 1000 Genomes Project Consortium, Nature 2010; www.1000genomes.org)
COSMIC • COSMIC is the most comprehensive resource for exploring the impact of somatic mutations in human cancer (Forbes, Nucleic Acids Research 2015)
COSMIC http://cancer.sanger.ac.uk/cancergenome/projects/cosmic/