Read Processing and Mapping: From Raw to Analysis-ready Reads

Read Processing and Mapping:From Raw to Analysis-ready Reads Ben Passarelli Quake Lab NGS Workshop MAY 30,2014

From Raw to Analysis-ready Reads Raw reads • Session Topics • Brief overview of high-throughput sequencing platforms • Understand read data formats and quality scores • Identify and fix some common read data problems • Find a genomic reference for mapping • Map reads to a reference genome • Understand alignment output • Sort, merge, index alignment for further analysis • Mark/eliminate duplicate reads • Locally realign at indels • Recalibrate base quality scores • How to get started • Read assessment and prep • Mapping • Duplicate • Marking • Local realignment • Base quality recalibration Analysis-ready reads

Sequencing Platforms at a Glance

Illumina Sequencing Platforms MiSeq NextSeq 500 HiSeq 2500

Single Cell Analysis Toolset • Built on R Statistics Package • Differential gene expression analysis and visualization • PCA • Unsupervised clustering • ANOVA (statistical hypothesis testing)

Sample to Raw Reads • Sample • Preparation QC and Quantification • Library Construction Sequencing Raw Reads C1 Single Cell Capture Imaging / Lysis Ampof DNA / cDNA AATI Fragment Analyzer Evaluate and Quantitate Harvested C1 DNA products NextSeq 500 300M or 800M Reads In ~24 hours

Solid Phase Amplification • Sequencing Steps • Clusters are linearized • Sequencing primer annealed • All labeled dNTPs added at each cycle • Intensity of different tags base call • Error Profile: substitutions Library DNA binds to Oligos Immobilized on Glass Flowcell Surface

Instrument Output Oxford Nanopore LifeTech Pacific Biosciences Illumina MiSeqNextSeqHiSeq MinION PGM Proton RS Base call file (.bcl) Standard flowgram file (.sff) Trace (.trc.h5) Pulse (.pls.h5) Base (.bas.h5) Squiggle (.h5) Sequence Data (FASTQ Format)

FASTQ Format (Illumina Example) Lane Tile Barcode Read Record Header Flow Cell ID Tile Coordinates @DJG84KN1:272:D17DBACXX:2:1101:12432:5554 1:N:0:AGTCAA CAGGAGTCTTCGTACTGCTTCTCGGCCTCAGCCTGATCAGTCACACCGTT + BCCFFFDFHHHHHIJJIJJJJJJJIJJJJJJJJJJIJJJJJJJJJIJJJJ @DJG84KN1:272:D17DBACXX:2:1101:12454:5610 1:N:0:AG AAAACTCTTACTACATCAGTATGGCTTTTAAAACCTCTGTTTGGAGCCAG + @@@DD?DDHFDFHEHIIIHIIIIIBBGEBHIEDH=EEHI>FDABHHFGH2 @DJG84KN1:272:D17DBACXX:2:1101:12438:5704 1:N:0:AG CCTCCTGCTTAAAACCCAAAAGGTCAGAAGGATCGTGAGGCCCCGCTTTC + CCCFFFFFHHGHHJIJJJJJJJI@HGIJJJJIIIJGIGIHIJJJIIIIJJ @DJG84KN1:272:D17DBACXX:2:1101:12340:5711 1:N:0:AG GAAGATTTATAGGTAGAGGCGACAAACCTACCGAGCCTGGTGATAGCTGG + CCCFFFFFHHHHHGGIJJJIJJJJJJIJJIJJJJJGIJJJHIIJJJIJJJ Read Bases Separator (with optional repeated header) Read Quality Scores NOTE: for paired-end runs, there is a second file with one-to-one corresponding headers and reads

Base Call Quality: Phred Quality Scores Phred* quality score Q with base-calling error probability P Q = -10 log10P * Name of first program to assign accurate base quality scores. From the Human Genome Project. SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS..................................................... ...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII...................... LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL.................................................... !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ | | | | | | 33 59 64 73 104 126 S - Sanger Phred+33 range: 0 to 40 I - Illumina 1.3+ Phred+64 range: 0 to 40 L - Illumina 1.8+ Phred+33 range: 0 to 41

Initial Read Assessment and Processing • Common problems that can affect analysis: • Low confidence base calls • typically toward ends of reads • criteria vary by application • Presence of adapter sequence in reads • poor fragment size selection • protocol execution or artifacts • Over-abundant sequence duplicates • Library contamination Raw reads • Read assessment and prep • Mapping • Duplicate • Marking • Local realignment • Base quality recalibration Analysis-ready reads

Quick Read Assessment: FastQC • Free Download • Download: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ • Tutorial : http://www.youtube.com/watch?v=bz93ReOv87Y • Samples reads (200K default): fast, low resource use

Read Assessment Example (Cont’d) Trim for base quality or adapters (run or library issue) Trim leading bases (library artifact)

Read Assessment Example (Cont’d) TruSeq Adapter, Index 9 5’ GATCGGAAGAGCACACGTCTGAACTCCAGTCACGATCAGATCTCGTATGCCGTCTTCTGCTTG

Comprehensive Read Assessment: Prinseq http://prinseq.sourceforge.net/

Selected Tools to Process Reads • Fastxtoolkit* http://hannonlab.cshl.edu/fastx_toolkit/ • (partial list) • FASTQ Information: Chart Quality Statistics and Nucleotide Distribution • FASTQ Trimmer: Shortening FASTQ/FASTA reads (removing barcodes or noise). • FASTQ Clipper: Removing sequencing adapters • FASTQ Quality Filter: Filters sequences based on quality • FASTQ Quality Trimmer: Trims (cuts) sequences based on quality • FASTQ Masker: Masks nucleotides with 'N' (or other character) based on quality • *defaults to old Illumina fastq (ASCII offset 64). Use –Q33 option. • SepPrephttps://github.com/jstjohn/SeqPrep • Adapter trimming • Merge overlapping paired-end read • Biopythonhttp://biopython.org, http://biopython.org/DIST/docs/tutorial/Tutorial.html • (for python programmers) • Especially useful for implementing custom/complex sequence analysis/manipulation • Galaxy http://galaxy.psu.edu • Great for beginners: upload data, point and click • Just about everything you’ll see in today’s presentations • SolexaQA2 http://solexaqa.sourceforge.net • Dynamic trimming • Length sorting (resembles read grouping of Prinseq)

Many Analysis Pipelines Start with Read Mapping Genotyping/Haplotyping Gene Expression https://www.broadinstitute.org/gatk/guide/best-practices?bpm=DNAseq Tumor/Normal Comparison https://www.broadinstitute.org/gatk/guide/best-practices http://www.appistry.com/sites/all/themes/appistry/files/pdfs/CGAS_download.pdf

Read Mapping Raw reads • Read assessment and prep • Mapping • Duplicate • Marking • Local realignment • Base quality recalibration http://www.broadinstitute.org/igv/ Analysis-ready reads

Sequence References and Annotations • http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/data.shtml • http://www.ncbi.nlm.nih.gov/guide/howto/dwn-genome • Comprehensive reference information • http://hgdownload.cse.ucsc.edu/downloads.html • Comprehensive reference, annotation, and translation information • ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle • References and SNP information data by GATK • Human only • http://cufflinks.cbcb.umd.edu/igenomes.html • Pre-indexed references and gene annotations for Tuxedo suite • Human, Mouse, Rat , Cow, Dog, Chicken, Drosophila, C. elegans, Yeast • http://www.repeatmasker.org

Fasta Sequence Format • One or more sequences per file • “>” denotes beginning of sequence or contig • Subsequent lines up to the next “>” define sequence • Lowercase base denotes repeat masked base • Contig ID may have comments delimited by “|” >chr1 … TGGACTTGTGGCAGGAATgaaatccttagacctgtgctgtccaatatggt agccaccaggcacatgcagccactgagcacttgaaatgtggatagtctga attgagatgtgccataagtgtaaaatatgcaccaaatttcaaaggctaga aaaaaagaatgtaaaatatcttattattttatattgattacgtgctaaaa taaccatatttgggatatactggattttaaaaatatatcactaatttcat … >chr2 … >chr3 …

Read Mapping

Read Mapping: BWA • BWA Features • Uses Burrows Wheeler Transform • fast • modest memory footprint (<4GB) • Accurate • Tolerates base mismatches • increased sensitivity • reduces allele bias • Gapped alignment for both single- and paired-ended reads • Automatically adjusts parameters based on read lengths and error rates • Native BAM/SAM output (the de facto standard) • Large installed base, well-supported • Open-source (no charge)

Read Mapping: Bowtie2 • Bowtie2 • Uses dynamic programming (edit distance scoring) • Eliminates need for realignment around indels • Can be tuned for different sequencing technologies • Multi-seed search - adjustable sensitivity • Input read length limited only by available memory • Fasta or Fastq input • Caveats • Longer input reads require much more memory • Trade-off parallelism with memory requirement Dynamic Programming Illustration http://bowtie-bio.sourceforge.net/bowtie2 Langmead B, Salzberg S. Fast gapped-read alignment with Bowtie 2, Nature Methods. 2012, 9:357-359

SAM (BAM) Format • Sequence Alignment/Map format • Universal standard • Human-readable (SAM) and compact (BAM) forms • Superset of FASTQ • Structure • Header • version, sort order, reference sequences, read groups, program/processing history • Alignment records

SAM/BAM Format: Header [benpassalign_genotype]$ samtools view -H allY.recalibrated.merge.bam @HD VN:1.0 GO:noneSO:coordinate @SQ SN:chrM LN:16571 @SQ SN:chr1 LN:249250621 @SQ SN:chr2 LN:243199373 @SQ SN:chr3 LN:198022430 … @SQ SN:chr19 LN:59128983 @SQ SN:chr20 LN:63025520 @SQ SN:chr21 LN:48129895 @SQ SN:chr22 LN:51304566 @SQ SN:chrX LN:155270560 @SQ SN:chrY LN:59373566 … @RG ID:86-191 PL:ILLUMINA LB:IL500 SM:86-191-1 @RG ID:BsK010 PL:ILLUMINA LB:IL501 SM:BsK010-1 @RG ID:Bsk136 PL:ILLUMINA LB:IL502 SM:Bsk136-1 @RG ID:MAK001 PL:ILLUMINA LB:IL503 SM:MAK001-1 @RG ID:NG87 PL:ILLUMINA LB:IL504 SM:NG87-1 … @RG ID:SDH023 PL:ILLUMINA LB:IL508 SM:SDH023 @PG ID:GATK IndelRealigner VN:2.0-39-gd091f72 CL:knownAlleles=[] targetIntervals=tmp.intervals.listLODThresholdForCleaning=5.0 consensusDeterminationModel=USE_READS entropyThreshold=0.15 maxReadsInMemory=150000 maxIsizeForMovement=3000 maxPositionalMoveAllowed=200 maxConsensuses=30 maxReadsForConsensuses=120 maxReadsForRealignment=20000 noOriginalAlignmentTags=false nWayOut=null generate_nWayOut_md5s=false check_early=false noPGTag=false keepPGTags=false indelsFileForDebugging=null statisticsFileForDebugging=null SNPsFileForDebugging=null @PG ID:bwaPN:bwa VN:0.6.2-r126 samtools to view bam header sort order reference sequence names with lengths read groups with platform, library and sample information program (analysis) history

SAM/BAM Format: Alignment Records [benpassalign_genotype]$ samtools view allY.recalibrated.merge.bam HW-ST605:127:B0568ABXX:2:1201:10933:3739 147 chr1 27675 60 101M = 27588 -188 TCATTTTATGGCCCCTTCTTCCTATATCTGGTAGCTTTTAAATGATGACCATGTAGATAATCTTTATTGTCCCTCTTTCAGCAGACGGTATTTTCTTATGC =7;:;<=??<=BCCEFFEJFCEGGEFFDF?BEA@DEDFEFFDE>EE@E@ADCACB>CCDCBACDCDDDAB@@BCADDCBC@BCBB8@ABCCCDCBDA@>:/ RG:Z:86-191 HW-ST605:127:B0568ABXX:3:1104:21059:173553 83 chr1 27682 60 101M = 27664 -119 ATGGCCCCTTCTTCCTATATCTGGTAGCTTTTAAATGATGACCATGTAGATAATCTTTATTGTCCCTCTTTCAGCAGACGGTATTTTCTTATGCTACAGTA 8;8.7::<?=BDHFHGFFDCGDAACCABHCCBDFBE</BA4//BB@BCAA@CBA@CB@ABA>A??@B@BBACA>?;A@8??CABBBA@AAAA?AA??@BB0 RG:Z:SDH023 * Many fields after column 12 deleted (e.g., recalibrated base scores) have been deleted for improved readability 2 3 4 5 6 8 9 1 11 10 http://samtools.sourceforge.net/SAM1.pdf

Compression is Big Win for HTS Data 33.8M 100bp Illumina reads Compression Ratio 6x 5x 4x 3x Improvement

Preparing for Next Steps Raw reads • Subsequent steps require sorted and indexed bams • Sort orders: karyotypic, lexicographical • Indexing improves analysis performance • Picard tools: fast, portable, free • http://picard.sourceforge.net/command-line-overview.shtml • Sort: SortSam.jar • Merge: MergeSamFiles.jar • Index: BuildBamIndex.jar • Order: sort, merge (optional), index • Read assessment and prep • Mapping • Duplicate • Marking • Local realignment • Base quality recalibration Analysis-ready reads

Duplicate Marking Raw reads $java -Xmx4g -jar <path to picard>/MarkDuplicates.jar\ INPUT=aligned.sorted.bam\ OUTPUT=aligned.sorted.dedup.bam\ VALIDATION_STRINGENCY=LENIENT \ METRICS_FILE=aligned.dedup.metrics.txt \ REMOVE_DUPLICATES=false \ ASSUME_SORTED=true http://picard.sourceforge.net/command-line-overview.shtml#MarkDuplicates • Read assessment and prep • Mapping • Duplicate • Marking • Local realignment • Base quality recalibration Analysis-ready reads

SAM/BAM Format: Alignment Records [benpassalign_genotype]$ samtools view allY.recalibrated.merge.bam HW-ST605:127:B0568ABXX:2:1201:10933:3739 147 chr1 27675 60 101M = 27588 -188 TCATTTTATGGCCCCTTCTTCCTATATCTGGTAGCTTTTAAATGATGACCATGTAGATAATCTTTATTGTCCCTCTTTCAGCAGACGGTATTTTCTTATGC =7;:;<=??<=BCCEFFEJFCEGGEFFDF?BEA@DEDFEFFDE>EE@E@ADCACB>CCDCBACDCDDDAB@@BCADDCBC@BCBB8@ABCCCDCBDA@>:/ RG:Z:86-191 http://picard.sourceforge.net/explain-flags.html http://samtools.sourceforge.net/SAM1.pdf

Local Realignment • BWT-based alignment is fast for matching reads to reference • Individual base alignments often sub-optimal at indels • Approach • Fast read mapping with BWT-based aligner • Realign reads at indel sites using gold standard (but much slower) Smith-Waterman algorithm • Benefits • Refines location of indels • Reduces erroneous SNP calls • Very high alignment accuracy in significantly less time, with fewer resources Raw reads • Read assessment and prep • Mapping • Duplicate • Marking • Local realignment • Base quality recalibration • 1Smith, Temple F.; and Waterman, Michael S. (1981). "Identification of Common Molecular Subsequences". Journal of Molecular Biology 147: 195–197. doi:10.1016/0022-2836(81)90087-5. PMID 7265238 Analysis-ready reads

Local Realignment Raw BWA alignment Post re-alignment at indels DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011 May;43(5):491-8. PMID: 21478889

Base Quality Recalibration • STEP 1: Find covariates at non-dbSNP sites using: • Reported quality score • The position within the read • The preceding and current nucleotide (sequencer properties) • java -Xmx4g -jar GenomeAnalysisTK.jar \ • -T BaseRecalibrator \ • -I alignment.bam\ • -R hg19/ucsc.hg19.fasta \ • -knownSiteshg19/dbsnp_135.hg19.vcf \ • -o alignment.recal_data.grp • STEP 2: Generate BAM with recalibrated base scores: • java -Xmx4g -jar GenomeAnalysisTK.jar \ • -T PrintReads \ • -R hg19/ucsc.hg19.fasta \ • -I alignment.bam\ • -BQSR alignment.recal_data.grp\ • -o alignment.recalibrated.bam Raw reads • Read assessment and prep • Mapping • Duplicate • Marking • Local realignment • Base quality recalibration Analysis-ready reads

Base Quality Recalibration (Cont’d)

Raw reads • Read assessment and prep • Mapping • Duplicate • Marking • Local realignment • Base quality recalibration Analysis-ready reads

Is there an easier way to get started?! http://galaxyproject.org/ Click on “Use Galaxy”

Getting Started

Raw reads • Read assessment and prep • Mapping • Duplicate • Marking • Local realignment • Base quality recalibration Analysis-ready reads

Read Processing and Mapping: From Raw to Analysis-ready Reads