Introduction to Next-Generation Sequencing. Kihoon Yoon, Ph.D. Dept of Epidemiology & Biostatistics School of Medicine University of Texas Health Science Center at San Antonio. Outline. Sequencing technologies Applications Bioinformatics tools for short-read sequencing
Introduction to Next-Generation Sequencing
Kihoon Yoon, Ph.D.
Dept of Epidemiology & Biostatistics, School of Medicine
University of Texas Health Science Center at San Antonio
Outline
• Sequencing technologies
• Applications
• Bioinformatics tools for short-read sequencing
• Examples of Applications: ChIP-Seq / RNA-Seq
Sequencing technologies
• Next-next-…-generation: how many ‘next’s are there?
• First Generation: automated version of Sanger sequencing (DNA-sequencing method invented by Fred Sanger in the 1970s)
  • Takes 500 days to read one giga (billion) bases (Gb), about 1/3 of the human genome
  • 1000 bases per read / cost is high: $0.50 per 1000 bases
• Second Generation
  • Roche/454 sequencing machine from 454 Life Sciences (2005)
    • 450 bases per read / $0.02 per 1000 bases / 2 days per Gb
  • Solexa from Illumina (2006)
    • 75 bases per read / $0.001 per 1000 bases / 0.5 days per Gb
  • SOLiD from Applied Biosystems (2006)
    • 50 bases per read / $0.001 per 1000 bases / 0.5 days per Gb
• Next-Next-Gen – Third Generation?
  • HiSeq 2000 from Illumina: 0.04 days per Gb
  • Helicos HeliScope™ (www.helicosbio.com)
  • Pacific Biosciences SMRT (www.pacificbiosciences.com)
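To put the per-platform figures on a common scale, the short sketch below converts them into cost per Gb and days per human genome. It uses only the numbers quoted on this slide; the ~3 Gb genome size is an assumption inferred from "1 Gb is about 1/3 of the human genome", and the class and function names are illustrative.

```python
from dataclasses import dataclass

# Assumption: a human genome of ~3 Gb, inferred from "1 Gb = 1/3 of human genome".
GENOME_GB = 3.0


@dataclass
class Platform:
    name: str
    read_length: int    # bases per read
    usd_per_kb: float   # cost per 1000 bases
    days_per_gb: float  # time needed to read 1 Gb


# Figures as quoted on the slide.
PLATFORMS = [
    Platform("Sanger (automated)", 1000, 0.50, 500),
    Platform("Roche/454", 450, 0.02, 2),
    Platform("Illumina Solexa", 75, 0.001, 0.5),
    Platform("SOLiD", 50, 0.001, 0.5),
]


def usd_per_gb(p: Platform) -> float:
    """Cost of reading one billion bases (1 Gb = 1,000,000 kb)."""
    return p.usd_per_kb * 1_000_000


def days_per_genome(p: Platform) -> float:
    """Days needed to read one genome's worth of bases (1x, no redundancy)."""
    return p.days_per_gb * GENOME_GB


def reads_per_gb(p: Platform) -> float:
    """Number of reads needed to produce 1 Gb of raw sequence."""
    return 1_000_000_000 / p.read_length


if __name__ == "__main__":
    for p in PLATFORMS:
        print(f"{p.name:20s} ${usd_per_gb(p):>12,.0f}/Gb "
              f"{days_per_genome(p):>8.1f} days/genome "
              f"{reads_per_gb(p):>14,.0f} reads/Gb")
```

These are single-pass (1x) figures; a real project multiplies them by the sequencing depth it needs.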
First vs Second Generation Figure 1 from Shendure & Ji, 2008
Second Generation Sequencing 454, SOLiD Solexa Figure 2 from Shendure & Ji, 2008
NGS
• A typical procedure:
  • Sequencing
    • How deep?
  • Alignment
    • Align to a reference, assemble, or both
  • Experiment-specific analysis
• A ‘one-size-fits-all’ program does not exist
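The "How deep?" question is usually answered with the average coverage, depth = (number of reads × read length) / genome size. The sketch below is a back-of-the-envelope calculator for that relation; the function names and the example numbers (75-base reads, a 3 Gb genome, 30x target) are illustrative assumptions, not values from the slides.

```python
import math


def mean_coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of times each base is covered: N * L / G."""
    return n_reads * read_length / genome_size


def reads_needed(target_depth: float, read_length: int, genome_size: int) -> int:
    """Smallest number of reads that reaches the target average depth."""
    return math.ceil(target_depth * genome_size / read_length)


if __name__ == "__main__":
    # Hypothetical example: 30x coverage of a 3 Gb genome with 75-base reads.
    n = reads_needed(30, 75, 3_000_000_000)
    print(n, "reads give", mean_coverage(n, 75, 3_000_000_000), "x coverage")
```

This is an average only; actual coverage varies along the genome, and reads that fail to align reduce the effective depth.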
Applications • De novo sequence assembly • Whole Genome Assembly • Transcriptome Assembly • Short Sequence Alignment • Single read • Paired read • Genomic Variation Detection • Detection of Single Nucleotide Polymorphisms (SNPs) • Detection of Alternative Splicing Events • Detection of major/minor transcript isoforms
Applications RNA-Seq Table 2 from Shendure & Ji, 2008
Bioinformatics Tools Table 3 from Shendure & Ji, 2008
File Format
• Sequence Reads
  • fastq
  • fasta
• Alignment
  • Sequence Alignment Map (SAM)
    • http://samtools.sourceforge.net/SAM1.pdf
  • BAM (the compressed binary version of SAM, described in the same specification)
  • Samtools: http://samtools.sourceforge.net/
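As a rough illustration of the SAM format, each alignment line is tab-separated and begins with 11 mandatory fields (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL). The sketch below parses one such line; the example record is made up, and real work would normally go through samtools or a library rather than hand parsing.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flag
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # alignment description
    rnext: str   # reference name of the mate
    pnext: int   # position of the mate
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # base qualities


def parse_sam_line(line: str) -> SamRecord:
    """Parse the 11 mandatory fields of one SAM alignment line (optional tags ignored)."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 11:
        raise ValueError("a SAM alignment line needs at least 11 tab-separated fields")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]),
                     f[5], f[6], int(f[7]), int(f[8]), f[9], f[10])


def is_unmapped(rec: SamRecord) -> bool:
    return bool(rec.flag & 0x4)   # FLAG bit 0x4: read is unmapped


def is_reverse(rec: SamRecord) -> bool:
    return bool(rec.flag & 0x10)  # FLAG bit 0x10: read aligned to the reverse strand


# Hypothetical record, for illustration only.
EXAMPLE = "read1\t16\tchr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII"

if __name__ == "__main__":
    rec = parse_sam_line(EXAMPLE)
    print(rec.rname, rec.pos, rec.cigar, "reverse" if is_reverse(rec) else "forward")
```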
Data: Sequence Reads
Size of raw data: a challenge that calls for new compression algorithms
Data: Sequence Reads
Example from an Illumina sequencing read file (fastq)
Line 1: @EAS042_0001:1:1:1061:20798#0/1
Line 2: TNTCTGTGTCCTGGGGCATCAATGATAGTCACATAGTACTTGCTGGTCTCAAATTTCCACAAGGAGATATCAATGG
Line 3: +EAS042_0001:1:1:1061:20798#0/1
Line 4: aB\^^Y]a^]cde`daaYaaa_bc\\`b^Y\a\aaUQY\]a\`aa\W__]HVZ]VQF^[`UH]\J^F^T^\\I]__
Line 1: sequence identifier, starting with '@'
Line 2: raw sequence
Line 3: '+', optionally followed by the same identifier
Line 4: sequence quality scores from -5 to 62, encoded as ASCII 59 to 126
Will lossy compression work?
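The four-line record layout can be read with a few lines of code. The sketch below is a minimal parser plus a quality decoder; it assumes an ASCII offset of 64, which is what the slide's mapping of scores -5 to 62 onto ASCII 59 to 126 implies. Files from other platforms or newer pipelines commonly use an offset of 33 instead, so the offset is a parameter. The example record is a shortened, made-up read.

```python
from typing import List, Tuple


def parse_fastq(text: str) -> List[Tuple[str, str, str]]:
    """Return (read_id, sequence, quality_string) for each 4-line FASTQ record."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ input must contain a multiple of 4 non-empty lines")
    records = []
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        records.append((header[1:], seq, qual))
    return records


def decode_quality(qual: str, offset: int = 64) -> List[int]:
    """Convert quality characters to integer scores (score = ASCII code - offset)."""
    return [ord(c) - offset for c in qual]


# Shortened, hypothetical record for illustration.
EXAMPLE = "@read1/1\nTNTCTG\n+read1/1\naB^^Y]\n"

if __name__ == "__main__":
    for read_id, seq, qual in parse_fastq(EXAMPLE):
        print(read_id, seq, decode_quality(qual))
```

Note how the uncalled base (N) in the example carries the lowest score, which is the kind of structure a lossy compressor for quality strings might exploit.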
Example of Applications
• ChIP-Seq
  • Lets you assay where on the DNA a protein binds and how much of it is bound, such as a transcription factor bound to the start site of a gene, or histones of a certain type.
• RNA-Seq
  • Transcriptome sequencing
  • Substantial challenges exist for annotation
  • Should be able to reconstruct transcripts and accurately measure their relative abundance without reference to an annotated genome
ChIP-Seq Chromatin immunoprecipitation (ChIP) followed by high-throughput sequencing Figure 1 from Mardis, 2007
ChIP-Seq
• ChIP-chip: ChIP is coupled to DNA hybridization array (chip) technology
• This is the closest methodology to ChIP-seq, but its mapping precision is lower, and the dynamic range of the readout is significantly less.
Figure: Comparison of ChIP-seq and ChIP-chip. Representative signals from ChIP-seq (solid line) and ChIP-chip (dashed line) show both greater dynamic range and higher resolution with ChIP-seq. Whereas three binding peaks are identified using ChIP-seq, only one broad peak is detected using ChIP-chip. Liu et al. BMC Biology 2010 8:56 doi:10.1186/1741-7007-8-56
ChIP-Seq
• Three key steps
  • antibody selection (most crucial)
  • actual sequencing, which is subject to several possible biases
  • algorithmic analysis, including mapping and peak calling
• Short tags (around 25 to 35 bp) can be ambiguous in regions of high homology or in repeat regions
• Align and call peaks to detect active binding sites
  • Alignment tools: BWA, MAQ, SOAP, …
  • A large number of free and commercial peak-calling software packages: MACS, SICER, PeakSeq, SISSR, F-seq
• Pepke S, Wold B, Mortazavi A: Computation for ChIP-seq and RNA-seq studies. Nat Methods 2009, 6:S22-S32.
• Barski A, Zhao K: Genomic location analysis by ChIP-Seq. J Cell Biochem 2009, 107:11-18.
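To make the idea of peak calling concrete, here is a deliberately naive sketch: count aligned tag positions in fixed-width windows, keep windows with at least a minimum number of tags, and merge adjacent windows. It is not the algorithm of MACS, SICER, or any other package named above, which rely on statistical background models; the window size, threshold, and function name are arbitrary assumptions.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def call_peaks(tag_positions: Iterable[int],
               window: int = 200,
               min_tags: int = 5) -> List[Tuple[int, int]]:
    """Return merged [start, end) intervals whose windows hold >= min_tags tags."""
    # Count tags per fixed-width window (window index = position // window).
    counts = Counter(pos // window for pos in tag_positions)
    enriched = sorted(w for w, c in counts.items() if c >= min_tags)

    peaks: List[Tuple[int, int]] = []
    for w in enriched:
        start, end = w * window, (w + 1) * window
        if peaks and peaks[-1][1] == start:
            # Adjacent enriched window: extend the previous peak.
            peaks[-1] = (peaks[-1][0], end)
        else:
            peaks.append((start, end))
    return peaks


if __name__ == "__main__":
    # Hypothetical tag positions on one chromosome.
    tags = [100] * 3 + [150] * 3 + [250] * 5 + [1000] + [5000] * 2
    print(call_peaks(tags))
```

A fixed threshold like this ignores local background and control samples, which is one reason the false-positive rate of simple approaches is high.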
ChIP-Seq Shirley Pepke, Barbara Wold & Ali Mortazavi Nature Methods 6, S22 - S32 (2009) Published online: 15 October 2009 doi:10.1038/nmeth.1371
ChIP-Seq: Wilbanks et al. • Wilbanks EG, Facciotti MT (2010) Evaluation of Algorithm Performance in ChIP-Seq Peak Detection. PLoS ONE 5(7): e11471. doi:10.1371/journal.pone.0011471 Figure 1
ChIP-Seq: Wilbanks et al. Figure 7. Positional accuracy and precision. The distance between the predicted binding site and high confidence motif occurrences within 250 bp was calculated for different peak calling programs in the (A) NRSF….
ChIP-Seq: Wilbanks et al.
• Conclusion: It is a hard problem!
• A balance between sensitivity and specificity is desired when compiling the final candidate peak list
• High false positives!
• “We suggest that rather than focus solely on algorithmic development, equal or better gains could be made through careful consideration of experimental design and further development of sample preparations to reduce noise in the datasets.”
• New methods do not always give us clear ideas about the outcome….
• Biologists often do not think about the analysis in advance, and quantitative scientists have absolutely no idea what to recommend for the experiments. As a result, the experiments are likely to be inconclusive!
RNA-Seq Transcriptome Analysis Figure 5 | Overview of RNA-Seq. A RNA fraction of interest is selected, fragmented and reverse transcribed. The resulting cDNA can then be sequenced using any of the current ultra-high-throughput technologies to obtain ten to a hundred million reads, which are then mapped back onto the genome. The reads are then analyzed to calculate expression levels. Shirley Pepke, Barbara Wold & Ali Mortazavi Nature Methods 6, S22 - S32 (2009) Published online: 15 October 2009 doi:10.1038/nmeth.1371
RNA-Seq: Strategies Figure 1 from Haas & Zody, 2010
RNA-Seq: Strategies
• Alignment Strategy
  • Align to transcriptome
    • no new transcript discovery
  • Align to genome and exon-exon junction sequences
    • extremely large search space due to all possible exon combinations
• De novo assembly
  • Cufflinks
  • Scripture
Shirley Pepke, Barbara Wold & Ali Mortazavi Nature Methods 6, S22 - S32 (2009) Published online: 15 October 2009 doi:10.1038/nmeth.1371
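A small sketch shows why the junction search space grows quickly. Assuming, for illustration, that any upstream exon of a gene may splice to any downstream exon, a gene with n exons yields n(n-1)/2 candidate junctions. The exon coordinates and the function name below are hypothetical, and real junction libraries apply further constraints.

```python
from itertools import combinations
from math import comb
from typing import List, Tuple

Exon = Tuple[int, int]  # (start, end) genomic coordinates


def candidate_junctions(exons: List[Exon]) -> List[Tuple[int, int]]:
    """Pair the end of every exon with the start of every downstream exon.

    Exons are sorted by start coordinate first, so each pair is (upstream, downstream).
    """
    ordered = sorted(exons)
    return [(up[1], down[0]) for up, down in combinations(ordered, 2)]


if __name__ == "__main__":
    # Hypothetical four-exon gene.
    exons = [(100, 200), (300, 400), (500, 600), (700, 800)]
    print(candidate_junctions(exons))
    # The count grows quadratically with the number of exons per gene.
    for n in (4, 10, 20, 50):
        print(n, "exons ->", comb(n, 2), "candidate junctions")
```

Summed over every annotated gene, and extended to unannotated exons, this is the large search space the slide refers to.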
RNA-Seq • two major objectives of RNA-Seq experiments: • Identification of novel transcripts from the locations of regions covered in the mapping. • Estimation of the abundance of the transcripts from their depth of coverage in the mapping.
TopHat/Cufflinks
Cole Trapnell, Lior Pachter, and Steven L. Salzberg. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics (2009) 25(9): 1105-1111. doi:10.1093/bioinformatics/btp120
Cole Trapnell, Brian A Williams, Geo Pertea, Ali Mortazavi, Gordon Kwan, Marijke J van Baren, Steven L Salzberg, Barbara J Wold & Lior Pachter. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology 28, 511–515 (2010)
TopHat/Cufflinks Trapnell et al., 2010 Trapnell et al., 2009
Scripture • Mitchell Guttman, Manuel Garber, Joshua Z Levin, Julie Donaghey, James Robinson, Xian Adiconis, Lin Fan, Magdalena J Koziol, Andreas Gnirke, Chad Nusbaum, John L Rinn, Eric S Lander & Aviv Regev. Ab initio reconstruction of cell type–specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincRNAs. Nature Biotechnology 28, 503–510 (2010)
Scripture Figure 1 Figure 2 Guttman et al., 2010
RNA-Seq Software Shirley Pepke, Barbara Wold & Ali Mortazavi Nature Methods 6, S22 - S32 (2009) Published online: 15 October 2009 doi:10.1038/nmeth.1371
Quantitation
• Metric for RNA-Seq Expression
  • RPKM: reads per kilobase per million reads
• Count the number of reads which map to constitutive exon bodies. The set of constitutive exons was derived from Ensembl genes (hg18, UCSC genome browser), where an exon was defined to be constitutive if present in all transcripts for a given gene.
• Determine the number of uniquely mappable positions in the same set of constitutive exons. "Uniquely mappable" was defined as being a unique 32-mer in the genome and our junction database.
• Count the total number of uniquely mapping reads in each tissue or sample.
• Compute RPKM as the number of reads which map per kilobase of exon model per million mapped reads for each gene, for each tissue or sample.
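The final step can be written as RPKM = 10^9 × C / (N × L), where C is the number of reads mapped to the gene's exon model, L is its length in bases (here, the number of uniquely mappable positions), and N is the total number of mapped reads in the sample. A minimal sketch, with illustrative function and parameter names and made-up example numbers:

```python
def rpkm(reads_in_gene: int, mappable_length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of exon model per million mapped reads.

    RPKM = 1e9 * C / (N * L), equivalent to C / (L / 1e3) / (N / 1e6).
    """
    if mappable_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total mapped reads must be positive")
    return 1e9 * reads_in_gene / (total_mapped_reads * mappable_length_bp)


if __name__ == "__main__":
    # Hypothetical gene: 500 reads over 2,000 mappable bases,
    # in a sample with 10 million uniquely mapped reads.
    print(rpkm(500, 2_000, 10_000_000))
```

Dividing by length and by library size makes values comparable between genes of different sizes and between samples sequenced to different depths.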
RNA-Seq De novo assembly algorithms Post-transcriptional regulation
References
Metzker, M.L. (2010) Sequencing technologies - the next generation. Nat Rev Genet, 11, 31-46.
Mardis, E.R. (2008) Next-generation DNA sequencing methods. Annu Rev Genom Hum G, 9, 387-402.
Shendure, J. and Ji, H.L. (2008) Next-generation DNA sequencing. Nat Biotechnol, 26, 1135-1145.
Mardis, E.R. (2007) ChIP-seq: welcome to the new frontier. Nat Methods, 4, 613-614.
Wang, Z., Gerstein, M. and Snyder, M. (2009) RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet, 10, 57-63.
Haas, B.J. and Zody, M.C. (2010) Advancing RNA-Seq analysis. Nat Biotechnol, 28, 421-423.