NGS Data Processing: Advanced (Whole Genome/Exome Sequencing Data Analyses)
Alireza K. Jamayran
September 25, 2012
Topics
• Required Computing Skills
• Raw Genomic/Exomic data quality control and manipulation: Fastx toolkit…
• Genomic/Exomic Mapping: BWA, Bowtie, MAQ, SOAP3… (mapping raw data from 454, SOLiD, Illumina, Ion Torrent, etc.)
• Data manipulation (for SAM/BAM files): Pysam, Picard…
• Filtering, sorting, indexing, visualizing and variant calling (for SAM/BAM files): Samtools, GATK, IGV…
• Filtering and recalibrating variants (for BCF/VCF files)
• Variant annotation: SNPeff, ANNOVAR, SeattleSeq…
• Validating, merging, comparing and calculating statistics on VCF files: VCFtools
• Viewing, sorting and filtering variants: VarSifter
Required Computing Skills
• Basic understanding of server computing and server management (including database server management)
• Proficiency with different operating systems (Unix, Linux, Mac and Windows) and with command-line interface (CLI) tools
• Basic understanding of programming languages (Python, Perl, Java, C and C++)
• Cloud computing
• Handling large (terabyte-sized) files
• Crowd-sourcing
Most tools are open source
• Open-source ≠ Free
• It is important to read the license of each tool you use
• In most cases even the operating system that bioinformaticians use is open source (Linux distributions such as Ubuntu 12.04)
• Generally these operating systems and tools are made by computer geeks for computer geeks.
Different types of raw reads generated by different sequencing machines
• FASTA file format (SOLiD)
• FASTQ file format (Sanger, Solexa, Illumina)
• FASTA file format (454)
FASTQ file
• A FASTQ file normally uses four lines per sequence. Line 1 begins with a '@' character and is followed by a sequence identifier and an optional description (like a FASTA title line). Line 2 is the raw sequence letters. Line 3 begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again. Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence. A minimal FASTQ file might look like this:
• FASTQ files from the NCBI/EBI Sequence Read Archive often include a description, e.g.
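The four-line layout described above can be sketched in Python as follows. This is a minimal illustration, not a full parser: the names `FastqRecord` and `read_fastq` and the sample record are assumptions made for this example, and the sketch assumes each sequence sits on a single line (no wrapping).

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str      # identifier plus optional description, without the leading '@'
    sequence: str  # raw sequence letters (line 2)
    quality: str   # one quality symbol per sequence letter (line 4)


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream that uses exactly four lines per read."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of input
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@"):
            raise ValueError("line 1 must start with '@': %r" % header)
        if not plus.startswith("+"):
            raise ValueError("line 3 must start with '+': %r" % plus)
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ for %r" % header)
        yield FastqRecord(header[1:], sequence, quality)


# Hypothetical one-record input, made up for this example.
example = "@SEQ_ID example read\nGATTTGGGGTTCAAAGC\n+\n!''*((((***+))%%%\n"
records = list(read_fastq(io.StringIO(example)))
print(records[0].name, len(records[0].sequence))
```

The length check mirrors the rule that line 4 must contain as many symbols as line 2 contains letters.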
Sequences from the Illumina software use a systematic identifier:
Versions of the Illumina pipeline since 1.4 appear to use #NNNNNN instead of #0 for the multiplex ID, where NNNNNN is the sequence of the multiplex tag. With Casava 1.8 the format of the '@' line has changed:
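As a hedged sketch, the two identifier styles can be told apart with regular expressions. The field layouts and the two sample identifiers below follow the commonly documented forms and are assumptions here, since the slide's own examples are not preserved in this transcript. The function name `parse_illumina_header` is likewise illustrative.

```python
import re

# Older style: instrument:lane:tile:x:y#index/read
OLD_STYLE = re.compile(
    r"^@(?P<instrument>[^:]+):(?P<lane>\d+):(?P<tile>\d+):(?P<x>\d+):(?P<y>\d+)"
    r"#(?P<index>[^/]+)/(?P<read>[12])$"
)

# Casava 1.8 style: instrument:run:flowcell:lane:tile:x:y read:filtered:control:index
CASAVA_18 = re.compile(
    r"^@(?P<instrument>[^:]+):(?P<run>\d+):(?P<flowcell>[^:]+):(?P<lane>\d+):"
    r"(?P<tile>\d+):(?P<x>\d+):(?P<y>\d+) "
    r"(?P<read>[12]):(?P<filtered>[YN]):(?P<control>\d+):(?P<index>\S*)$"
)


def parse_illumina_header(line):
    """Return the identifier fields as a dict, tagged with the style that matched."""
    text = line.strip()
    for style, pattern in (("casava-1.8", CASAVA_18), ("pre-1.8", OLD_STYLE)):
        match = pattern.match(text)
        if match:
            return {"style": style, **match.groupdict()}
    raise ValueError("unrecognised Illumina identifier: %r" % line)


old = parse_illumina_header("@HWUSI-EAS100R:6:73:941:1973#0/1")
new = parse_illumina_header("@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG")
print(old["style"], old["lane"], new["style"], new["index"])
```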
Phred quality score
Phred quality scores are used for:
• Assessment of sequence quality
• Recognition and removal of low-quality sequence (end clipping)
• Determination of accurate consensus sequences
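End clipping can be illustrated with a short Python sketch that removes trailing bases whose quality falls below a threshold. The function name, the threshold of 20 and the Phred+33 offset are assumptions chosen for the example, not settings taken from any particular tool.

```python
def trim_low_quality_end(sequence, quality_line, threshold=20, offset=33):
    """Drop bases from the 3' end while their quality is below the threshold."""
    end = len(quality_line)
    while end > 0 and ord(quality_line[end - 1]) - offset < threshold:
        end -= 1
    return sequence[:end], quality_line[:end]


# 'I' encodes Q40 with offset 33; '!' and '#' encode Q0 and Q2.
trimmed_seq, trimmed_qual = trim_low_quality_end("ACGTAC", "IIII!#")
print(trimmed_seq, trimmed_qual)
```

Trimming stops at the first base, counted from the end, that meets the threshold. Low-quality bases inside the read are left in place.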
Quality
A quality value Q is an integer mapping of p (i.e., the probability that the corresponding base call is incorrect). Two different equations have been in use. The first is the standard Sanger variant to assess reliability of a base call, otherwise known as Phred quality score:
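The mapping between p and Q can be sketched in Python. The Sanger (Phred) variant is Q = -10 log10(p). The second equation is not preserved in this transcript, so it is assumed here to be the Solexa odds-based variant, Q = -10 log10(p / (1 - p)). Function names are illustrative.

```python
import math


def phred_quality(p):
    """Sanger/Phred variant: Q = -10 * log10(p)."""
    return -10 * math.log10(p)


def solexa_quality(p):
    """Assumed Solexa variant: Q = -10 * log10(p / (1 - p))."""
    return -10 * math.log10(p / (1 - p))


def error_probability(q):
    """Invert the Phred mapping: p = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


for p in (0.1, 0.01, 0.001):
    print(p, round(phred_quality(p)), round(solexa_quality(p), 2))
```

Under these formulas the two scales nearly coincide for small p and diverge for poor base calls, where the Solexa value can become negative.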
ASCII
American Standard Code for Information Interchange
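ASCII matters here because each quality value is stored as one printable character: the character code minus a fixed offset gives the score. A minimal sketch, assuming the usual offsets of 33 (Sanger) and 64 (Illumina 1.3+). The function name is illustrative.

```python
def decode_qualities(quality_line, offset=33):
    """Convert a FASTQ quality string into a list of integer scores."""
    return [ord(symbol) - offset for symbol in quality_line]


print(decode_qualities("!5I"))             # Sanger-style encoding, offset 33
print(decode_qualities("@Th", offset=64))  # Illumina 1.3+ style encoding, offset 64
```

Both calls decode to the same three scores, which shows why the offset has to be known before quality strings from different pipelines are compared.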
• FASTQ/A Reverse-Complement
• FASTQ/A Barcode splitter
• FASTA Formatter
• FASTA Nucleotide Changer
• FASTQ Quality Filter
• FASTQ Quality Trimmer
• FASTQ Masker
• FASTQ/A Quality Stats
• …
• FASTQ-to-FASTA converter
• FASTQ Information
• FASTQ/A Collapser
• FASTQ/A Trimmer
• FASTQ/A Renamer
• FASTQ/A Clipper
• FASTQ Quality Box plot Graph
Summary Statistics
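The kind of per-position summary that such a report contains can be sketched as follows. This is an illustration of the idea only: the column names, the Phred+33 offset and the function name are assumptions, and the output is not meant to match the format of any specific tool.

```python
import statistics


def per_position_stats(quality_lines, offset=33):
    """Summarise quality scores column by column (position 1 = first base)."""
    columns = {}
    for line in quality_lines:
        for position, symbol in enumerate(line, start=1):
            columns.setdefault(position, []).append(ord(symbol) - offset)
    rows = []
    for position in sorted(columns):
        scores = columns[position]
        rows.append({
            "position": position,
            "count": len(scores),
            "min": min(scores),
            "max": max(scores),
            "mean": statistics.mean(scores),
            "median": statistics.median(scores),
        })
    return rows


# Three hypothetical reads of unequal length.
stats = per_position_stats(["II5", "5I", "!I5"])
for row in stats:
    print(row)
```

Reads of different lengths simply contribute to fewer columns, so the count can drop toward the end of the read.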
Viewing summary statistics on a box plot
Then other QC and manipulations, such as:
• FASTA/Q Renamer
• FASTQ Quality Filter
• …
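A quality filter of this kind typically keeps a read only if a minimum percentage of its bases reach a minimum quality. The sketch below shows that rule in Python. The parameter names, default values and Phred+33 offset are assumptions for illustration, and this is not the toolkit's own code.

```python
def passes_quality_filter(quality_line, min_quality=20, min_percent=80.0, offset=33):
    """Return True if at least min_percent of bases have quality >= min_quality."""
    if not quality_line:
        return False  # an empty read carries no usable bases
    good = sum(1 for symbol in quality_line if ord(symbol) - offset >= min_quality)
    return 100.0 * good / len(quality_line) >= min_percent


# 'I' is Q40, '5' is Q20 and '!' is Q0 with offset 33.
for quality in ("IIII5", "IIII!", "II!!!"):
    print(quality, passes_quality_filter(quality))
```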
Format converters
• Biopython version 1.51 onwards (interconverts Sanger, Solexa and Illumina 1.3+)
• EMBOSS version 6.1.0 patch 1 onwards (interconverts Sanger, Solexa and Illumina 1.3+)
• BioPerl version 1.6.1 onwards (interconverts Sanger, Solexa and Illumina 1.3+)
• BioRuby version 1.4.0 onwards (interconverts Sanger, Solexa and Illumina 1.3+)
• BioJava version 1.7.1 to 1.8.x (interconverts Sanger, Solexa and Illumina 1.3+)
• MAQ can convert from Solexa to Sanger (use this patch to support Illumina 1.3+ files).
• fastx_toolkit: the included fastq_quality_converter program can convert Illumina to Sanger
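What these converters do to the quality line can be sketched in plain Python, without any of the libraries listed above. The sketch assumes the usual encodings: Sanger as Phred+33, Illumina 1.3+ as Phred+64, and Solexa as odds-based scores stored with offset 64. Function names are illustrative.

```python
import math


def illumina13_to_sanger(quality_line):
    """Re-encode Phred+64 (Illumina 1.3+) symbols as Phred+33 (Sanger)."""
    converted = []
    for symbol in quality_line:
        score = ord(symbol) - 64
        if score < 0:
            raise ValueError("symbol %r is below the Phred+64 range" % symbol)
        converted.append(chr(score + 33))
    return "".join(converted)


def solexa_to_phred(q_solexa):
    """Map a Solexa score to the Phred scale: Q = 10 * log10(10**(Qs / 10) + 1)."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


def solexa_to_sanger(quality_line):
    """Re-encode Solexa+64 symbols as Phred+33, rounding to the nearest integer."""
    return "".join(
        chr(round(solexa_to_phred(ord(symbol) - 64)) + 33) for symbol in quality_line
    )


print(illumina13_to_sanger("@Th"))
print(solexa_to_sanger(";@h"))
```

The Illumina 1.3+ case is a pure shift of character codes, while the Solexa case needs the logarithmic mapping because the two scales differ at low quality.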
Slide from Mark DePristo
Read Aligner Tools • Bfast • BioScope • Bowtie • BWA • CLC bio • CloudBurst • Eland/Eland2 • GenomeMapper • GnuMap • Karma • MAQ • MOM • Mosaik • MrFAST/MrsFAST • NovoAlign • PASS • PerM • RazerS • RMAP • SSAHA2 • Segemehl • SeqMap • SHRiMP • Slider/SliderII • SOAP/SOAP2 • SOAP3 • Srprism • Stampy • vmatch • ZOOM
Burrows-Wheeler Aligner (BWA)
Burrows-Wheeler Aligner (BWA) is an efficient program that aligns relatively short nucleotide sequences against a long reference sequence such as the human genome. It implements two algorithms, bwa-short and BWA-SW. The former works for query sequences shorter than 200bp and the latter for longer sequences up to around 100kbp. Both algorithms do gapped alignment. They are usually more accurate and faster on queries with low error rates.
AUTHOR
Heng Li at the Sanger Institute wrote the key source code and integrated the following code for BWT construction: bwtsw <http://i.cs.hku.hk/~ckwong3/bwtsw/>, implemented by Chi-Kwong Wong at the University of Hong Kong, and IS <http://yuta.256.googlepages.com/sais>, originally proposed by Nong Ge <http://www.cs.sysu.edu.cn/nong/> at Sun Yat-Sen University and implemented by Yuta Mori.
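The Burrows-Wheeler transform (BWT) that gives BWA its name can be illustrated with a toy Python sketch: build the transform from sorted rotations, then count exact occurrences of a pattern by backward search. This is a teaching sketch under simplifying assumptions (naive construction, exact matching only, no compressed index). It is not how BWA itself is implemented, and the function names are illustrative.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations, with '$' as end marker."""
    text += "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_string, pattern):
    """Count exact matches of pattern in the original text by backward search."""
    # first[c] = number of characters in the text that sort before c
    counts = {}
    for symbol in bwt_string:
        counts[symbol] = counts.get(symbol, 0) + 1
    first, total = {}, 0
    for symbol in sorted(counts):
        first[symbol] = total
        total += counts[symbol]

    low, high = 0, len(bwt_string)  # current interval of sorted rotations
    for symbol in reversed(pattern):
        if symbol not in first:
            return 0
        # occurrences of symbol before each boundary (a real index precomputes these)
        low = first[symbol] + bwt_string.count(symbol, 0, low)
        high = first[symbol] + bwt_string.count(symbol, 0, high)
        if low >= high:
            return 0
    return high - low


transformed = bwt("banana")
print(transformed, count_occurrences(transformed, "ana"))
```

The search reads the pattern from its last character to its first, and each step narrows an interval of sorted rotations, so the cost depends on the pattern length and not on the size of the reference.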
SYNOPSIS
SAM (Sequence Alignment/Map) format is a generic format for storing large nucleotide sequence alignments.
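A SAM alignment line is tab-separated text with eleven mandatory columns followed by optional tags. The sketch below splits one line into named fields and decodes the bitwise FLAG. The column names follow the current SAM specification, while the flag labels, the function name and the example line are assumptions made for this illustration.

```python
SAM_FIELDS = ("QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL")
INTEGER_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}

# Bit values from the SAM specification; the labels are informal.
FLAG_BITS = {
    0x1: "paired", 0x2: "proper_pair", 0x4: "unmapped", 0x8: "mate_unmapped",
    0x10: "reverse_strand", 0x20: "mate_reverse_strand",
    0x40: "first_in_pair", 0x80: "second_in_pair",
    0x100: "secondary", 0x200: "qc_fail", 0x400: "duplicate",
}


def parse_sam_line(line):
    """Split one alignment line into a dict of mandatory fields plus tags."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < len(SAM_FIELDS):
        raise ValueError("expected at least 11 tab-separated columns")
    record = {
        name: int(value) if name in INTEGER_FIELDS else value
        for name, value in zip(SAM_FIELDS, columns)
    }
    record["TAGS"] = columns[len(SAM_FIELDS):]
    record["FLAG_NAMES"] = [
        label for bit, label in FLAG_BITS.items() if record["FLAG"] & bit
    ]
    return record


# Hypothetical alignment line, made up for this example.
sam_record = parse_sam_line("read1\t99\tchr1\t100\t60\t5M\t=\t200\t105\tACGTA\tIIIII\tNM:i:0")
print(sam_record["RNAME"], sam_record["POS"], sam_record["FLAG_NAMES"])
```

Header lines (those starting with '@') are not handled here. A caller would skip them before passing lines to this function.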
Picard and Pysam
• Picard comprises Java-based command-line utilities that manipulate SAM files, and a Java API (SAM-JDK) for creating new programs that read and write SAM files. Both SAM text format and SAM binary (BAM) format are supported.
• Pysam is a Python module for reading and manipulating SAM files. It is a lightweight wrapper of the samtools C API. Pysam also includes an interface for tabix.
Common Variant Calling Tools
• Mosaik
  – Mapping (Illumina, 454, SOLiD)
  – Variant calling
• SOAP
  – Mapping (Illumina)
  – SNP calling
• Samtools
  – Data processing
  – Variant calling
• GLFtools
  – SNP calling
• QCALL
  – SNP calling
• GATK
  – Data processing
  – Variant calling
  – Variant manipulation
  – Variant QC
• DINDEL
  – Indel calling
Samtools - Utilities for the Sequence Alignment/Map (SAM) format
Samtools is a set of utilities that manipulate alignments in the BAM format. It imports from and exports to the SAM (Sequence Alignment/Map) format, does sorting, merging and indexing, and allows reads in any region to be retrieved swiftly.
AUTHOR
Heng Li from the Sanger Institute wrote the C version of samtools. Bob Handsaker from the Broad Institute implemented the BGZF library and Jue Ruan from Beijing Genomics Institute wrote the RAZF library. John Marshall and Petr Danecek contribute to the source code, and various people from the 1000 Genomes Project have contributed to the SAM format specification.
SYNOPSIS
BCFtools - Utilities for the Binary Call Format (BCF) and VCF
SYNOPSIS
• bcftools index in.bcf
• bcftools view in.bcf chr2:100-200 > out.vcf
• bcftools view -vc in.bcf > out.vcf 2> out.afs
Data Visualization with tview
Data Visualization with IGV
Data Visualization with IGB
Data Visualization with UCSC
The Genome Analysis Toolkit or GATK
• The Genome Analysis Toolkit or GATK is a software package developed at the Broad Institute to analyse next-generation resequencing data. The toolkit offers a wide variety of tools, with a primary focus on variant discovery and genotyping as well as strong emphasis on data quality assurance. Its robust architecture, powerful processing engine and high-performance computing features make it capable of taking on projects of any size.
SYNOPSIS
Slide from Mark DePristo
Variant Call Format (VCF)
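A VCF data line is tab-separated: eight fixed columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO), optionally followed by a FORMAT column and one column per sample. The sketch below parses one such line in plain Python. The function name and the example line are assumptions for illustration, and header lines starting with '#' are not handled.

```python
def parse_vcf_line(line):
    """Parse one VCF data line into a dict. INFO flags without '=' become True."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < 8:
        raise ValueError("expected at least 8 tab-separated columns")
    chrom, pos, variant_id, ref, alt, qual, filt, info = columns[:8]

    info_fields = {}
    if info != ".":
        for item in info.split(";"):
            key, separator, value = item.partition("=")
            info_fields[key] = value if separator else True

    samples = []
    if len(columns) > 9:
        keys = columns[8].split(":")  # FORMAT column names the per-sample fields
        samples = [dict(zip(keys, sample.split(":"))) for sample in columns[9:]]

    return {
        "CHROM": chrom, "POS": int(pos), "ID": variant_id, "REF": ref,
        "ALT": alt.split(","),
        "QUAL": None if qual == "." else float(qual),
        "FILTER": filt, "INFO": info_fields, "SAMPLES": samples,
    }


# Hypothetical data line with three samples.
vcf_record = parse_vcf_line(
    "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5;DB;H2"
    "\tGT:GQ:DP\t0|0:48:1\t1|0:48:8\t1/1:43:5"
)
print(vcf_record["CHROM"], vcf_record["POS"], vcf_record["ALT"], vcf_record["SAMPLES"][2]["GT"])
```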
Annotation Tools
• SeattleSeq Annotation, by Deborah Nickerson, University of Washington - conservation, HapMap freq, PolyPhen, clinical assoc., limited indels (it is an external server)
• ANNOVAR, by Kai Wang et al., Children’s Hospital of Philadelphia - exonic splicing, HGVS format, distance to nearest gene, indels (local scripts using local data downloaded from UCSC Genome Browser)
• SNPeff, by Pablo Cingolani - integration with GATK and Galaxy, can read and write VCF (local Java program using local data files)
• PIANNO / CDPred, by Praveen Cherukuri, NHGRI - Conserved Domain Prediction, dbSNP, indels (local scripts using UCSC Genome Browser SQL server)
snpEff
• snpEff is a variant annotation and effect prediction tool. It annotates and predicts the effects of variants on genes (such as amino acid changes).
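The idea of predicting an amino acid change can be shown with a toy Python sketch that mutates one base of a codon and compares the translations under the standard genetic code. This is only an illustration of the concept: a real annotator works from transcript models and handles many more cases, and the function name and effect labels used here are assumptions.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    first + second + third: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, first in enumerate(BASES)
    for j, second in enumerate(BASES)
    for k, third in enumerate(BASES)
}


def classify_snv(codon, position, alt_base):
    """Classify a single-base change at a 0-based position within a codon."""
    mutated = codon[:position] + alt_base + codon[position + 1:]
    ref_aa, alt_aa = CODON_TABLE[codon], CODON_TABLE[mutated]
    if ref_aa == alt_aa:
        effect = "synonymous"
    elif alt_aa == "*":
        effect = "stop_gained"
    elif ref_aa == "*":
        effect = "stop_lost"
    else:
        effect = "missense"
    return ref_aa, alt_aa, effect


for position, alt in ((0, "T"), (1, "T"), (2, "G")):
    print("GAA", position, alt, classify_snv("GAA", position, alt))
```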
VCFtools
• VCFtools is a program package designed for working with VCF files, such as those generated by the 1000 Genomes Project. The aim of VCFtools is to provide methods for working with VCF files: validating, merging, comparing and calculating some basic population genetic statistics.
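One example of a basic population genetic statistic is the alternate allele frequency at a site, computed from the samples' GT (genotype) fields. The sketch below shows that calculation in plain Python. It is an illustration of the statistic, not VCFtools code, and the function name is an assumption.

```python
import re


def alt_allele_frequency(genotypes):
    """Fraction of called alleles that are non-reference ('0' is the reference)."""
    alt_count = 0
    called = 0
    for genotype in genotypes:
        for allele in re.split(r"[/|]", genotype):  # '/' unphased, '|' phased
            if allele == ".":
                continue  # missing call, not counted
            called += 1
            if allele != "0":
                alt_count += 1
    return alt_count / called if called else None


print(alt_allele_frequency(["0|0", "1|0", "1/1"]))
print(alt_allele_frequency(["0/1", "./.", "0/0"]))
```

Missing calls are left out of the denominator, so the frequency is relative to the alleles that were actually genotyped.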
VarSifter
• "VarSifter" is a graphical Java program designed to display, sort, filter, and generally sift variation data from massively parallel sequencing experiments. It is designed to read exome-scale variation data in either a tab-delimited text file with header, or an uncompressed VCF file. These files should be pre-generated with the annotation information one would like to view.
Galaxy
• In addition to using the public Galaxy server (a.k.a. Main), you can also install your own instance of Galaxy, or create an instance of Galaxy on the cloud. Another option is to use one of the ever-increasing number of Public Galaxy Servers hosted by other organizations.
Reasons to Install Your Own Galaxy
You only need to download Galaxy if you plan to:
• Develop it further
• Add new tools
• Plug in new data sources, or
• Run a local production server for your site because you have
  • Sensitive data (e.g., clinical)
  • Large datasets or processing requirements that are too big to be processed on Main
Thank you