Vanderbilt Center for Quantitative Sciences Summer Institute: Sequencing Analysis. Yan Guo
What is Sequencing? • Sequencing is the process of determining the precise order of nucleotides. • Non-high-throughput sequencing: Sanger sequencing, the basic chain-termination method, developed by Frederick Sanger in 1977. It generates all possible single-stranded DNA molecules complementary to a given template and beginning at a common 5' base.
The Pros and Cons of Sanger Sequencing • Pros: • Highly accurate • Targetable • Cons: • Costs about $15 per 1000 base pairs, so sequencing the roughly 3 billion base pairs of a whole human genome would cost about 3bil/1000 x $15 = $45m • Low detection rate of alternative alleles
Sequencing Type By Source • RNA: mRNA, Small RNA, Total RNA • DNA: Whole Genome or targeted (Exome, mitochondrial, genes of interest, etc)
Sequencing Data • Raw image data is more than 2TB per sample • Raw data is about 5-15GB per single-end sample or 10-30GB per paired-end sample for RNAseq or exome sequencing. Whole genome data can easily exceed 200GB per sample. • In general, disk space of about 5x the raw data size is needed to finish processing • Raw data is usually in FASTQ format, with base quality on the Phred scale • The older Illumina pipeline uses the Phred+64 encoding; the newer CASAVA 1.8 pipeline uses the Sanger (Phred+33) encoding.
Single end vs Paired end • Paired-end data is twice the size of single-end data. • Paired-end sequencing is more expensive than single-end sequencing. • Quality control is easier with paired-end data (insert size, duplicate removal) • Paired-end data provides more opportunities to detect structural variants.
What can you obtain from DNAseq? • SNPs (require only a normal or a tumor sample) • Somatic mutations (require a tumor-normal pair) • Copy number variation (works best with whole genome sequencing) • Small structural variants: insertions, deletions • Large structural variants: translocations, inversions
What can you obtain from RNAseq? • Gene expression • SNPs (only for expressed genes) • Novel splicing variants • Gene fusions • RNAseq has been used primarily as a replacement for microarray
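The two quality encodings mentioned above differ only in the ASCII offset added to each Phred score. The sketch below illustrates this under the usual assumptions: an offset of 33 for the Sanger encoding, 64 for the older Illumina encoding, and the standard relation Q = -10 log10(p) between a Phred score and the base-call error probability. The function names are hypothetical and not part of any pipeline named in these slides.

```python
def decode_qualities(quality_string, offset=33):
    """Turn a FASTQ quality string into a list of integer Phred scores."""
    scores = [ord(char) - offset for char in quality_string]
    if min(scores) < 0:
        # A negative score usually means the wrong offset was assumed.
        raise ValueError("negative Phred score; check the encoding offset")
    return scores


def phred_to_error_probability(score):
    """Probability that a base call with this Phred score is wrong."""
    return 10 ** (-score / 10)


def phred64_to_phred33(quality_string):
    """Re-encode an older Illumina (Phred+64) string as Sanger (Phred+33)."""
    return "".join(chr(ord(char) - 64 + 33) for char in quality_string)


# The same score of 40 is written 'h' in Phred+64 and 'I' in Phred+33.
sanger_scores = decode_qualities("I", offset=33)
illumina_scores = decode_qualities("h", offset=64)
```

Decoding a file with the wrong offset shifts every score by 31, which is why it helps to confirm the encoding before any quality filtering.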
How does RNAseq compare to Microarray? • Since 2008, people have been saying that RNAseq will replace microarray for gene expression profiling. • VANTAGE stopped offering microarray service earlier this year. • 1. Wang, Z., M. Gerstein, and M. Snyder, RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet, 2009. 10(1): p. 57-63. • 2. Shendure, J., The beginning of the end for microarrays? Nat Methods, 2008. 5(7): p. 585-7.
Data Distribution Guo, Y., et al., Large Scale Comparison of Gene Expression Levels by Microarrays and RNAseq Using TCGA Data. PLoS One, 2013. 8(8): p. e71462.
Result Consistency Guo, Y., et al., Large Scale Comparison of Gene Expression Levels by Microarrays and RNAseq Using TCGA Data. PLoS One, 2013. 8(8): p. e71462.
Raw data • @HWI-ST508:203:D078GACXX:8:1101:1296:1011 1:N:0:ATCACG • NTGGAGTCCTAGGCACAGCTCTAAGCCTCCTTATTCGAGCCGAGCTGGGCC • + • #4=DDDDDDDDDDE<DAEEEIDFEIEIEIEIIIIIIDEDDDDA@DDDDII@
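The four lines above form one FASTQ record: a header starting with "@", the bases, a "+" separator, and one quality character per base. A minimal sketch of reading such a record follows. It assumes the CASAVA 1.8 header layout (instrument:run:flowcell:lane:tile:x:y, then read:filtered:control:index) and the Phred+33 encoding; the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FastqRecord:
    instrument: str
    run: int
    flowcell: str
    lane: int
    tile: int
    x: int
    y: int
    read: int
    filtered: str
    control: int
    barcode: str
    sequence: str
    qualities: List[int]


def parse_fastq_record(lines, offset=33):
    """Parse the four lines of one FASTQ record (assumed CASAVA 1.8 header)."""
    header, sequence, separator, quality = [line.strip() for line in lines]
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    read_id, description = header[1:].split(" ", 1)
    instrument, run, flowcell, lane, tile, x, y = read_id.split(":")
    read, filtered, control, barcode = description.split(":")
    return FastqRecord(
        instrument, int(run), flowcell, int(lane), int(tile), int(x), int(y),
        int(read), filtered, int(control), barcode, sequence,
        [ord(char) - offset for char in quality],
    )


record = parse_fastq_record([
    "@HWI-ST508:203:D078GACXX:8:1101:1296:1011 1:N:0:ATCACG",
    "NTGGAGTCCTAGGCACAGCTCTAAGCCTCCTTATTCGAGCCGAGCTGGGCC",
    "+",
    "#4=DDDDDDDDDDE<DAEEEIDFEIEIEIEIIIIIIDEDDDDA@DDDDII@",
])
mean_quality = sum(record.qualities) / len(record.qualities)
```

In this record the first base is an uncalled "N", and its quality character "#" decodes to a Phred score of 2, the lowest in the read.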
Quality Control • Quality control should be conducted at multiple steps during sequencing data processing • Raw data • Alignment • Results (Expression for RNA, and SNP/mutation for DNA) Guo, Y., et al., Three-stage quality control strategies for DNA re-sequencing data. Brief Bioinform, 2013.
Raw Data QC - Tools • FastQC http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ • FASTX-Toolkit http://hannonlab.cshl.edu/fastx_toolkit/ • QC3 https://github.com/slzhao/QC3 • NGS QC Toolkit http://59.163.192.90:8080/ngsqctoolkit/
Alignment QC - Tools • QC3 https://github.com/slzhao/QC3 • QPLOT http://genome.sph.umich.edu/wiki/QPLOT • SAMStat http://samstat.sourceforge.net/
Expression QC - Tools • MultiRankSeq https://github.com/slzhao/MultiRankSeq
Clustering Algorithms • Start with a collection of n objects, each represented by a p-dimensional feature vector xi, i = 1, …, n. • The goal is to divide these n objects into k clusters so that objects within a cluster are more “similar” to each other than to objects in other clusters. k is usually unknown. • Popular methods: hierarchical, k-means, SOM, mixture models, etc.
Distance Calculation in Sequencing: Smith-Waterman algorithm • Sequence 1 = ACACACTA • Sequence 2 = AGCACACA • w(gap) = 0 • w(match) = +2 • w(a, −) = w(−, b) = w(mismatch) = −1
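A minimal sketch of the Smith-Waterman scoring recurrence for the two sequences above. It assumes a linear gap penalty taken from w(a, −) = w(−, b) = −1 and does not use the separate "w(gap) = 0" entry; the function name is hypothetical, and only the best local alignment score is returned, not the alignment itself.

```python
def smith_waterman_score(seq_a, seq_b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score under a linear gap penalty."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    # scores[i][j] is the best score of a local alignment ending at
    # seq_a[i - 1] and seq_b[j - 1]; row 0 and column 0 stay at zero.
    scores = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            scores[i][j] = max(
                0,                                      # start a new alignment
                scores[i - 1][j - 1] + substitution,    # match or mismatch
                scores[i - 1][j] + gap,                 # gap in seq_b
                scores[i][j - 1] + gap,                 # gap in seq_a
            )
            best = max(best, scores[i][j])
    return best


score = smith_waterman_score("ACACACTA", "AGCACACA")
```

Flooring each cell at zero is what makes the alignment local: a poorly matching prefix is discarded instead of dragging the score down.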
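Distance Calculation in Microarray • Pearson Correlation • Two profiles (vectors) x = (x1, …, xp) and y = (y1, …, yp) • r(x, y) = Σ (xi − x̄)(yi − ȳ) / sqrt(Σ (xi − x̄)^2 Σ (yi − ȳ)^2) • −1 ≤ Pearson Correlation ≤ +1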
Similarity Measurements • Euclidean Distance: d(x, y) = sqrt(Σ (xi − yi)^2)
Linkage • Single Linkage: D(X, Y) = min(d(x, y)), x ϵ X, y ϵ Y • Complete Linkage: D(X, Y) = max(d(x, y)), x ϵ X, y ϵ Y • Average Linkage: D(X, Y) = Σ d(x, y) / (|X| |Y|), summed over x ϵ X, y ϵ Y
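The two measures can disagree sharply, which a short sketch makes concrete. It assumes the standard definitions of Pearson correlation and Euclidean distance; the function names and the example profiles are hypothetical.

```python
import math


def pearson_correlation(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return covariance / (spread_x * spread_y)


def euclidean_distance(x, y):
    """Straight-line distance between two equal-length profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Hypothetical profiles: the second is the first scaled by 10.
low = [1.0, 2.0, 3.0, 4.0]
high = [10.0, 20.0, 30.0, 40.0]
correlation = pearson_correlation(low, high)
distance = euclidean_distance(low, high)
```

Here the profiles have the same shape, so the correlation is essentially +1, while the Euclidean distance is large because the magnitudes differ. Which behavior is wanted depends on whether expression level or expression pattern should drive the clustering.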
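The linkage rules above turn a distance between points into a distance between clusters. A minimal sketch follows, assuming Euclidean distance between points and the usual definition of average linkage as the mean of all pairwise distances; the function names and the example clusters are hypothetical.

```python
import math
from itertools import product


def single_linkage(cluster_x, cluster_y, d=math.dist):
    """Distance between the closest pair of points, one from each cluster."""
    return min(d(x, y) for x, y in product(cluster_x, cluster_y))


def complete_linkage(cluster_x, cluster_y, d=math.dist):
    """Distance between the farthest pair of points, one from each cluster."""
    return max(d(x, y) for x, y in product(cluster_x, cluster_y))


def average_linkage(cluster_x, cluster_y, d=math.dist):
    """Mean distance over all pairs of points, one from each cluster."""
    total = sum(d(x, y) for x, y in product(cluster_x, cluster_y))
    return total / (len(cluster_x) * len(cluster_y))


# Hypothetical one-dimensional clusters; pairwise distances are 3, 5, 2, 4.
cluster_a = [(0.0,), (1.0,)]
cluster_b = [(3.0,), (5.0,)]
linkages = (
    single_linkage(cluster_a, cluster_b),
    complete_linkage(cluster_a, cluster_b),
    average_linkage(cluster_a, cluster_b),
)
```

Hierarchical clustering repeatedly merges the two clusters with the smallest linkage distance, so the choice of rule changes which clusters merge first.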
Correction of Batch Effect Guo, Y., et al., Statistical strategies for microRNAseq batch effect reduction. Translational Cancer Research, 2014. 3(3): p. 260-265.
Normalization of RNAseq • Reads Per Kilobase per Million mapped reads (RPKM)
RNAseq Data Alignment • TopHat2 http://ccb.jhu.edu/software/tophat/index.shtml • MapSplice http://www.netlab.uky.edu/p/bioinfo/MapSplice
Gene Quantification • Cufflinks for RPKM http://cufflinks.cbcb.umd.edu/ • HTSeq for read count http://www-huber.embl.de/users/anders/HTSeq/doc/overview.html
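RPKM scales a gene's read count by the gene length (in kilobases) and by the sequencing depth (in millions of mapped reads). The sketch below assumes the commonly used form RPKM = 10^9 x C / (N x L), where C is the number of reads mapped to the gene, N the total number of mapped reads in the sample, and L the gene length in base pairs. The function name and the example numbers are hypothetical.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene per million mapped reads."""
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical gene: 500 reads on a 2,000 bp gene in a 20-million-read sample.
example_rpkm = rpkm(500, 2000, 20_000_000)
```

Dividing by length keeps long genes from looking more expressed merely because they collect more reads, and dividing by depth makes samples sequenced to different depths comparable.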
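Example of Quantile Normalization • Red = G1; Green = G2; Blue = G3; Yellow = G4; Black = G5 • [Figure: the original matrix and the matrix after sorting each sample column S1, S2, S3]
Take Average for Each Row • [Figure: the sorted matrix and the matrix after replacing each row with its average]
Reorder • Red = G1; Green = G2; Blue = G3; Yellow = G4; Black = G5 • [Figure: the averaged matrix with each value returned to its gene's original position]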
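The three slides above (sort each sample, average each row, reorder) can be sketched as follows. This assumes a genes x samples matrix with no tied values within a sample; ties would need an extra averaging step that is omitted here. The function name and the expression values are hypothetical.

```python
def quantile_normalize(matrix):
    """Quantile-normalize a genes x samples matrix given as a list of rows."""
    n_genes, n_samples = len(matrix), len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_samples)]

    # Step 1: sort each sample (column) independently.
    sorted_columns = [sorted(column) for column in columns]

    # Step 2: average across samples at each rank (each row of the sorted matrix).
    rank_means = [
        sum(column[rank] for column in sorted_columns) / n_samples
        for rank in range(n_genes)
    ]

    # Step 3: put each rank mean back at the gene's original position.
    normalized = [[0.0] * n_samples for _ in range(n_genes)]
    for j, column in enumerate(columns):
        gene_order = sorted(range(n_genes), key=lambda gene: column[gene])
        for rank, gene in enumerate(gene_order):
            normalized[gene][j] = rank_means[rank]
    return normalized


# Hypothetical expression values for genes G1..G5 in samples S1..S3.
expression = [
    [2.0, 4.0, 3.0],
    [5.0, 14.0, 8.0],
    [4.0, 8.0, 9.0],
    [3.0, 6.0, 5.0],
    [1.0, 7.0, 4.0],
]
normalized = quantile_normalize(expression)
```

After normalization every sample contains exactly the same set of values, and only their assignment to genes differs between samples.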
Differential Expression Analysis • Cuffdiff from Cufflinks package: Trapnell, C., et al., Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat Protoc, 2012. 7(3): p. 562-78. • DESeq http://bioconductor.org/packages/release/bioc/html/DESeq.html • edgeR http://www.bioconductor.org/packages/release/bioc/html/edgeR.html • NBPSeq http://cran.r-project.org/web/packages/NBPSeq/index.html • TSPM http://omictools.com/sequencing/rna-seq/normalization-de/tspm-r-s2496.html • baySeq http://www.bioconductor.org/packages/release/bioc/html/baySeq.html
Which Method Is the Best? Guo, Y., et al., Evaluation of read count based RNAseq analysis methods. BMC Genomics, 2013. 14 Suppl 8: p. S2.