360 likes | 567 Views
Read Processing and Mapping: From Raw to Analysis-ready Reads. Ben Passarelli Stem Cell Institute Genome Center NGS Workshop 12 September 2012. Samples to Information. Variant calling Gene expression Chromatin structure Methylome Immunorepertoires De novo assembly ….
E N D
Read Processing and Mapping:From Raw to Analysis-ready Reads Ben Passarelli Stem Cell Institute Genome Center NGS Workshop 12 September 2012
Samples to Information Variant calling Gene expression Chromatin structure Methylome Immunorepertoires De novo assembly …
Many Analysis Pipelines Start with Read Mapping Genotyping (GATK) RNA-seq (Tuxedo) http://www.broadinstitute.org/gsa/wiki/images/7/7a/Overall_flow.jpg http://www.broadinstitute.org/gatk/guide/topic?name=intro http://www.nature.com/nprot/journal/v7/n3/full/nprot.2012.016.html
From Raw to Analysis-ready Reads Raw reads • Session Topics • Understand read data formats and quality scores • Identify and fix some common read data problems • Find and prepare a genomic reference for mapping • Map reads to a genome reference • Understand alignment output • Sort, merge, index alignment for further analysis • Locally realign at indels to reduce alignment artifacts • Mark/eliminate duplicate reads • Recalibrate base quality scores • An easy way to get started • Read assessment and prep • Mapping • Local realignment • Duplicate marking • Base quality recalibration Analysis-ready reads
Instrument Output Illumina MiSeq Illumina HiSeq IonTorrent PGM Roche 454 Pacific Biosciences RS Images (.tiff) Cluster intensity file (.cif) Base call file (.bcl) Movie Trace (.trc.h5) Pulse (.pls.h5) Base (.bas.h5) Standard flowgram file (.sff) Sequence Data (FASTQ Format)
FASTQ Format (Illumina Example) Lane Tile Barcode Raw reads Read Record Header Flow Cell ID Tile Coordinates @DJG84KN1:272:D17DBACXX:2:1101:12432:5554 1:N:0:AGTCAA CAGGAGTCTTCGTACTGCTTCTCGGCCTCAGCCTGATCAGTCACACCGTT + BCCFFFDFHHHHHIJJIJJJJJJJIJJJJJJJJJJIJJJJJJJJJIJJJJ @DJG84KN1:272:D17DBACXX:2:1101:12454:5610 1:N:0:AG AAAACTCTTACTACATCAGTATGGCTTTTAAAACCTCTGTTTGGAGCCAG + @@@DD?DDHFDFHEHIIIHIIIIIBBGEBHIEDH=EEHI>FDABHHFGH2 @DJG84KN1:272:D17DBACXX:2:1101:12438:5704 1:N:0:AG CCTCCTGCTTAAAACCCAAAAGGTCAGAAGGATCGTGAGGCCCCGCTTTC + CCCFFFFFHHGHHJIJJJJJJJI@HGIJJJJIIIJGIGIHIJJJIIIIJJ @DJG84KN1:272:D17DBACXX:2:1101:12340:5711 1:N:0:AG GAAGATTTATAGGTAGAGGCGACAAACCTACCGAGCCTGGTGATAGCTGG + CCCFFFFFHHHHHGGIJJJIJJJJJJIJJIJJJJJGIJJJHIIJJJIJJJ Read Bases • Read assessment and prep Separator (with optional repeated header) Read Quality Scores • Duplicate marking • Local realignment • Mapping • Base quality recalibration Analysis-ready reads NOTE: for paired-end runs, there is a second file with one-to-one corresponding headers and reads
Base Call Quality: Phred Quality Scores Phred* quality score Q with base-calling error probability P Q = -10 log10P * Name of first program to assign accurate base quality scores. From the Human Genome Project. SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS..................................................... ...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII...................... LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL.................................................... !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ | | | | | | 33 59 64 73 104 126 S - Sanger Phred+33 range: 0 to 40 I - Illumina 1.3+ Phred+64 range: 0 to 40 L - Illumina 1.8+ Phred+33 range: 0 to 41
[benpass@solexalign]$ ls Sample_FS53_EPCAM+_CD10-_IL2270-18 Sample_FS53_EPCAM+_CD10+_IL2269-19 Sample_COH77_CD49F-_IL2275-13 Sample_COH77_CD49F+_CD66-_IL2274-14 Sample_COH77_CD49F+_CD66+_IL2273-15 Sample_COH74_EPCAM+_CD10-_IL2272-16 Sample_COH74_EPCAM+_CD10+_IL2271-17 Sample_COH69_EPCAM+_CD10-_IL2268-20 Sample_COH69_EPCAM+_CD10+_IL2267-21 [benpass@solexalign]$ lsSample_COH77_CD49F-_IL2275-13 COH77_CD49F-_IL2275-13_AGTCAA_L002_R1_001.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R1_002.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R1_003.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R1_004.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R2_001.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R2_002.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R2_003.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R2_004.fastq.gz gzip compressed [benpass@solexalign]$ ls Sample_FS53_EPCAM+_CD10-_IL2270-18 Sample_FS53_EPCAM+_CD10+_IL2269-19 Sample_COH77_CD49F-_IL2275-13 Sample_COH77_CD49F+_CD66-_IL2274-14 Sample_COH77_CD49F+_CD66+_IL2273-15 Sample_COH74_EPCAM+_CD10-_IL2272-16 Sample_COH74_EPCAM+_CD10+_IL2271-17 Sample_COH69_EPCAM+_CD10-_IL2268-20 Sample_COH69_EPCAM+_CD10+_IL2267-21 [benpass@solexalign]$ lsSample_COH77_CD49F-_IL2275-13 COH77_CD49F-_IL2275-13_AGTCAA_L002_R1_001.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R1_002.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R1_003.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R1_004.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R2_001.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R2_002.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R2_003.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R2_004.fastq.gz Read [benpass@solexalign]$ ls Sample_FS53_EPCAM+_CD10-_IL2270-18 Sample_FS53_EPCAM+_CD10+_IL2269-19 Sample_COH77_CD49F-_IL2275-13 Sample_COH77_CD49F+_CD66-_IL2274-14 Sample_COH77_CD49F+_CD66+_IL2273-15 Sample_COH74_EPCAM+_CD10-_IL2272-16 Sample_COH74_EPCAM+_CD10+_IL2271-17 Sample_COH69_EPCAM+_CD10-_IL2268-20 Sample_COH69_EPCAM+_CD10+_IL2267-21 [benpass@solexalign]$ lsSample_COH77_CD49F-_IL2275-13 COH77_CD49F-_IL2275-13_AGTCAA_L002_R1_001.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R1_002.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R1_003.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R1_004.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R2_001.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R2_002.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R2_003.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R2_004.fastq.gz Barcode [benpass@solexalign]$ ls Sample_FS53_EPCAM+_CD10-_IL2270-18 Sample_FS53_EPCAM+_CD10+_IL2269-19 Sample_COH77_CD49F-_IL2275-13 Sample_COH77_CD49F+_CD66-_IL2274-14 Sample_COH77_CD49F+_CD66+_IL2273-15 Sample_COH74_EPCAM+_CD10-_IL2272-16 Sample_COH74_EPCAM+_CD10+_IL2271-17 Sample_COH69_EPCAM+_CD10-_IL2268-20 Sample_COH69_EPCAM+_CD10+_IL2267-21 [benpass@solexalign]$ lsSample_COH77_CD49F-_IL2275-13 COH77_CD49F-_IL2275-13_AGTCAA_L002_R1_001.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R1_002.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R1_003.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R1_004.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R2_001.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R2_002.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R2_003.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R2_004.fastq.gz gzip compressed [benpass@solexalign]$ ls Sample_FS53_EPCAM+_CD10-_IL2270-18 Sample_FS53_EPCAM+_CD10+_IL2269-19 Sample_COH77_CD49F-_IL2275-13 Sample_COH77_CD49F+_CD66-_IL2274-14 Sample_COH77_CD49F+_CD66+_IL2273-15 Sample_COH74_EPCAM+_CD10-_IL2272-16 Sample_COH74_EPCAM+_CD10+_IL2271-17 Sample_COH69_EPCAM+_CD10-_IL2268-20 Sample_COH69_EPCAM+_CD10+_IL2267-21 [benpass@solexalign]$ lsSample_COH77_CD49F-_IL2275-13 COH77_CD49F-_IL2275-13_AGTCAA_L002_R1_001.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R1_002.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R1_003.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R1_004.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R2_001.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R2_002.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R2_003.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R2_004.fastq.gz gzip compressed File Organization Raw reads [benpass@solexalign]$ ls [benpass@solexalign]$ ls Sample_FS53_EPCAM+_CD10-_IL2270-18 Sample_FS53_EPCAM+_CD10+_IL2269-19 Sample_COH77_CD49F-_IL2275-13 Sample_COH77_CD49F+_CD66-_IL2274-14 Sample_COH77_CD49F+_CD66+_IL2273-15 Sample_COH74_EPCAM+_CD10-_IL2272-16 Sample_COH74_EPCAM+_CD10+_IL2271-17 Sample_COH69_EPCAM+_CD10-_IL2268-20 Sample_COH69_EPCAM+_CD10+_IL2267-21 [benpass@solexalign]$ lsSample_COH77_CD49F-_IL2275-13 COH77_CD49F-_IL2275-13_AGTCAA_L002_R1_001.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R1_002.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R1_003.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R1_004.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R2_001.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R2_002.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R2_003.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R2_004.fastq.gz [benpass@solexalign]$ ls Sample_FS53_EPCAM+_CD10-_IL2270-18 Sample_FS53_EPCAM+_CD10+_IL2269-19 Sample_COH77_CD49F-_IL2275-13 Sample_COH77_CD49F+_CD66-_IL2274-14 Sample_COH77_CD49F+_CD66+_IL2273-15 Sample_COH74_EPCAM+_CD10-_IL2272-16 Sample_COH74_EPCAM+_CD10+_IL2271-17 Sample_COH69_EPCAM+_CD10-_IL2268-20 Sample_COH69_EPCAM+_CD10+_IL2267-21 [benpass@solexalign]$ ls Sample_FS53_EPCAM+_CD10-_IL2270-18 Sample_FS53_EPCAM+_CD10+_IL2269-19 Sample_COH77_CD49F-_IL2275-13 Sample_COH77_CD49F+_CD66-_IL2274-14 Sample_COH77_CD49F+_CD66+_IL2273-15 Sample_COH74_EPCAM+_CD10-_IL2272-16 Sample_COH74_EPCAM+_CD10+_IL2271-17 Sample_COH69_EPCAM+_CD10-_IL2268-20 Sample_COH69_EPCAM+_CD10+_IL2267-21 • Read assessment and prep [benpass@solexalign]$ ls Sample_FS53_EPCAM+_CD10-_IL2270-18 Sample_FS53_EPCAM+_CD10+_IL2269-19 Sample_COH77_CD49F-_IL2275-13 Sample_COH77_CD49F+_CD66-_IL2274-14 Sample_COH77_CD49F+_CD66+_IL2273-15 Sample_COH74_EPCAM+_CD10-_IL2272-16 Sample_COH74_EPCAM+_CD10+_IL2271-17 Sample_COH69_EPCAM+_CD10-_IL2268-20 Sample_COH69_EPCAM+_CD10+_IL2267-21 [benpass@solexalign]$ lsSample_COH77_CD49F-_IL2275-13 COH77_CD49F-_IL2275-13_AGTCAA_L002_R1_001.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R1_002.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R1_003.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R1_004.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R2_001.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R2_002.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R2_003.fastq.gz COH77_CD49F-_IL2275-13_AGTCAA_L002_R2_004.fastq.gz Format • Base quality recalibration • Local realignment • Mapping • Duplicate marking Analysis-ready reads
Initial Read Assessment Raw reads • Read assessment and prep • Mapping • Local realignment • Duplicate marking • Base quality recalibration Analysis-ready reads Common problems that can affect analysis • Low confidence base calls • typically toward ends of reads • criteria vary by application • Presence of adapter sequence in reads • poor fragment size selection • protocol execution or artifacts • Over-abundant sequence duplicates • Library contamination
Initial Read Assessment: FastQC Raw reads • Read assessment and prep • Mapping • Local realignment • Duplicate marking • Base quality recalibration Analysis-ready reads • Free Download Download:http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ Tutorial : http://www.youtube.com/watch?v=bz93ReOv87Y • Samples reads (200K default): fast, low resource use
Read Assessment Examples ~71.48% of sequences are duplicates Sanger Quality Score by Cycle Median, Inner Quartile Range, 10-90 percentile range, Mean Read Duplication ~6% of sequences occur more than 10x ~8% of sampled sequences occur twice http://proteo.me.uk/2011/05/interpreting-the-duplicate-sequence-plot-in-fastqc Note: Duplication based on read identity, not alignment at this point
Read Assessment Example (Cont’d) Per base sequence content should resemble this…
Read Assessment Example (Cont’d) TruSeqAdapter, Index 9 5’ GATCGGAAGAGCACACGTCTGAACTCCAGTCACGATCAGATCTCGTATGCCGTCTTCTGCTTG
Read Assessment Example (Cont’d) Trim leading bases (library artifact) Trim for base quality or adapters (run or library issue)
Selected Tools to Process Reads • Fastxtoolkit* http://hannonlab.cshl.edu/fastx_toolkit/ • (partial list) • FASTQ Information: Chart Quality Statistics and Nucleotide Distribution • FASTQ Trimmer: Shortening FASTQ/FASTA reads (removing barcodes or noise). • FASTQ Clipper: Removing sequencing adapters • FASTQ Quality Filter: Filters sequences based on quality • FASTQ Quality Trimmer: Trims (cuts) sequences based on quality • FASTQ Masker: Masks nucleotides with 'N' (or other character) based on quality • *defaults to old Illumina fastq (ASCII offset 64). Use –Q33 option. • SepPrephttps://github.com/jstjohn/SeqPrep • Adapter trimming • Merge overlapping paired-end read • Biopythonhttp://biopython.org, http://biopython.org/DIST/docs/tutorial/Tutorial.html • (for python programmers) • Especially useful for implementing custom/complex sequence analysis/manipulation • Galaxy http://galaxy.psu.edu • Great for beginners: upload data, point and click • Just about everything you’ll see in today’s presentations
Read Mapping Raw reads • Read assessment and prep • Mapping • Local realignment • Duplicate marking • Base quality recalibration Analysis-ready reads http://www.broadinstitute.org/igv/
Read Mapping: Aligning to a Reference Raw reads • Read assessment and prep • Mapping • Local realignment • Duplicate marking • Base quality recalibration Analysis-ready reads
Read Mapping: BWA Raw reads • BWA Features • Uses Burrows Wheeler Transform • fast • modest memory footprint (<4GB) • Accurate • Tolerates base mismatches • increased sensitivity • reduces allele bias • Gapped alignment for both single- and paired-ended reads • Automatically adjusts parameters based on read lengths and error rates • Native BAM/SAM output (the de facto standard) • Large installed base, well-supported • Open-source (no charge) • Read assessment and prep • Mapping • Local realignment • Duplicate marking • Base quality recalibration Analysis-ready reads
Sequence References and Annotations http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/data.shtml http://www.ncbi.nlm.nih.gov/guide/howto/dwn-genome Comprehensive reference information http://hgdownload.cse.ucsc.edu/downloads.html Comprehensive reference, annotation, and translation information ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle References and SNP information data by GATK Human only http://cufflinks.cbcb.umd.edu/igenomes.html Pre-indexed references and gene annotations for Tuxedo suite Human, Mouse, Rat , Cow, Dog, Chicken, Drosophila, C. elegans, Yeast http://www.repeatmasker.org/
Fasta Sequence Format • One or more sequences per file • “>” denotes beginning of sequence or contig • Subsequent lines up to the next “>” define sequence • Lowercase base denotes repeat masked base • Contig ID may have comments delimited by “|” >chr1 … TGGACTTGTGGCAGGAATgaaatccttagacctgtgctgtccaatatggt agccaccaggcacatgcagccactgagcacttgaaatgtggatagtctga attgagatgtgccataagtgtaaaatatgcaccaaatttcaaaggctaga aaaaaagaatgtaaaatatcttattattttatattgattacgtgctaaaa taaccatatttgggatatactggattttaaaaatatatcactaatttcat … >chr2 … >chr3 …
Running BWA Input files: reference.fasta, read1.fastq.gz, read2.fastq.gz Step 1: Index the genome (~3 CPU hours for a human genome reference): bwa index -a bwtswreference.fasta Step 2: Generate alignments in Burrows-Wheeler transform suffix array coordinates: bwaalnreference.fasta read1.fastq.gz > read1.sai bwaalnreference.fasta read2.fastq.gz > read2.sai Apply option –q<quality threshold> to trim poor quality bases at 3'-ends of reads Step 3: Generate alignments in the SAM format (paired-end): bwasampereference.fasta read1.sai read2.sai \ read1.fastq.gz read2.fastq.gz > alignment_ouput.sam http://bio-bwa.sourceforge.net/bwa.shtml
Running BWA (Cont’d) Simple Form: bwasampereference.fasta read1.sai read2.sai \ read1.fastq.gz read2.fastq.gz > alignment.sam Output to BAM: bwasampereference.fasta read1.sai read2.sai \ read1.fastq.gz read2.fastq.gz | samtoolsview -Sbh - > alignment.bam With Read Group Information: bwasampe-r "@RG\tID:readgroupID\tLB:libraryname\tSM:samplename\tPL:ILLUMINA“ \ reference.fasta \ read1.sai read2.sai \ read1.fastq.gz read2.fastq.gz | samtools view -Sbh - > alignment.bam
SAM (BAM) Format Sequence Alignment/Map format • Universal standard • Human-readable (SAM) and compact (BAM) forms Structure • Header version, sort order, reference sequences, read groups, program/processing history • Alignment records
SAM/BAM Format: Header samtools to view bam header sort order reference sequence names with lengths read groups with platform, library and sample information program (analysis) history [benpassalign_genotype]$ samtools view -H allY.recalibrated.merge.bam @HD VN:1.0 GO:noneSO:coordinate @SQ SN:chrM LN:16571 @SQ SN:chr1 LN:249250621 @SQ SN:chr2 LN:243199373 @SQ SN:chr3 LN:198022430 … @SQ SN:chr19 LN:59128983 @SQ SN:chr20 LN:63025520 @SQ SN:chr21 LN:48129895 @SQ SN:chr22 LN:51304566 @SQ SN:chrX LN:155270560 @SQ SN:chrYLN:59373566 … @RG ID:86-191 PL:ILLUMINA LB:IL500SM:86-191-1 @RG ID:BsK010 PL:ILLUMINA LB:IL501SM:BsK010-1 @RG ID:Bsk136 PL:ILLUMINA LB:IL502SM:Bsk136-1 @RG ID:MAK001 PL:ILLUMINA LB:IL503SM:MAK001-1 @RG ID:NG87 PL:ILLUMINA LB:IL504SM:NG87-1 … @RG ID:SDH023 PL:ILLUMINA LB:IL508 SM:SDH023 @PG ID:GATK IndelRealigner VN:2.0-39-gd091f72CL:knownAlleles=[] targetIntervals=tmp.intervals.listLODThresholdForCleaning=5.0 consensusDeterminationModel=USE_READS entropyThreshold=0.15 maxReadsInMemory=150000 maxIsizeForMovement=3000 maxPositionalMoveAllowed=200 maxConsensuses=30 maxReadsForConsensuses=120 maxReadsForRealignment=20000 noOriginalAlignmentTags=false nWayOut=null generate_nWayOut_md5s=false check_early=false noPGTag=false keepPGTags=false indelsFileForDebugging=null statisticsFileForDebugging=null SNPsFileForDebugging=null @PG ID:bwaPN:bwa VN:0.6.2-r126
SAM/BAM Format: Alignment Records 8 3 4 5 6 9 1 11 10 http://samtools.sourceforge.net/SAM1.pdf [benpassalign_genotype]$ samtools view allY.recalibrated.merge.bam HW-ST605:127:B0568ABXX:2:1201:10933:3739 147 chr1 27675 60101M = 27588 -188 TCATTTTATGGCCCCTTCTTCCTATATCTGGTAGCTTTTAAATGATGACCATGTAGATAATCTTTATTGTCCCTCTTTCAGCAGACGGTATTTTCTTATGC =7;:;<=??<=BCCEFFEJFCEGGEFFDF?BEA@DEDFEFFDE>EE@E@ADCACB>CCDCBACDCDDDAB@@BCADDCBC@BCBB8@ABCCCDCBDA@>:/ RG:Z:86-191 HW-ST605:127:B0568ABXX:3:1104:21059:173553 83 chr1 27682 60101M = 27664 -119 ATGGCCCCTTCTTCCTATATCTGGTAGCTTTTAAATGATGACCATGTAGATAATCTTTATTGTCCCTCTTTCAGCAGACGGTATTTTCTTATGCTACAGTA 8;8.7::<?=BDHFHGFFDCGDAACCABHCCBDFBE</BA4//BB@BCAA@CBA@CB@ABA>A??@B@BBACA>?;A@8??CABBBA@AAAA?AA??@BB0 RG:Z:SDH023 * Many fields after column 12 deleted (e.g., recalibrated base scores) have been deleted for improved readability
Preparing for Next Steps Raw reads • Subsequent steps require sorted and indexed bams • Sort orders: karyotypic, lexicographical • Indexing improves analysis performance • Picard tools: fast, portable, free http://picard.sourceforge.net/command-line-overview.shtml Sort: SortSam.jar Merge: MergeSamFiles.jar Index: BuildBamIndex.jar • Order: sort, merge (optional), index • Read assessment and prep • Mapping • Local realignment • Duplicate marking • Base quality recalibration Analysis-ready reads
Local Realignment Raw reads • Read assessment and prep • Mapping • Local realignment • Duplicate marking • Base quality recalibration Analysis-ready reads • 1Smith, Temple F.; and Waterman, Michael S. (1981). "Identification of Common Molecular Subsequences". Journal of Molecular Biology 147: 195–197. doi:10.1016/0022-2836(81)90087-5. PMID 7265238 • BWT-based alignment is fast for matching reads to reference • Individual base alignments often sub-optimal at indels • Approach • Fast read mapping with BWT-based aligner • Realign reads at indel sites using gold standard (but much slower) Smith-Waterman1 algorithm • Benefits • Refines location of indels • Reduces erroneous SNP calls • Very high alignment accuracy in significantly less time, with fewer resources
Local Realignment Raw BWA alignment Post re-alignment at indels DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011 May;43(5):491-8. PMID: 21478889
Duplicate Marking Raw reads • Read assessment and prep • Mapping • Local realignment • Duplicate marking • Base quality recalibration Analysis-ready reads Covered in genotyping presentation Note that this is done after alignment
Base Quality Recalibration • STEP 1: Find covariates at non-dbSNP sites using: • Reported quality score • The position within the read • The preceding and current nucleotide (sequencer properties) • java -Xmx4g -jar GenomeAnalysisTK.jar \ • -T BaseRecalibrator \ • -I alignment.bam\ • -R hg19/ucsc.hg19.fasta \ • -knownSiteshg19/dbsnp_135.hg19.vcf\ • -o alignment.recal_data.grp • STEP 2: Generate BAM with recalibrated base scores: • java -Xmx4g -jar GenomeAnalysisTK.jar \ • -T PrintReads \ • -R hg19/ucsc.hg19.fasta \ • -I alignment.bam\ • -BQSR alignment.recal_data.grp\ • -o alignment.recalibrated.bam Raw reads • Read assessment and prep • Mapping • Local realignment • Duplicate marking • Base quality recalibration Analysis-ready reads
Getting Started Is there an easier way to get started?!!
Getting Started http://galaxy.psu.edu/ Click “Use Galaxy”
Getting Started http://galaxy.psu.edu/ Click “Use Galaxy”