"Discover the advantages of mRNA sequencing over microarrays, from increased sensitivity to no need for probe design. Learn about computational processing and methods for alignment, abundance estimation, and read count normalization in RNA-seq data analysis."
Introduction to RNA-seq Joel Parker, Ph.D.
Why mRNAseq?
• Measurement of differential expression
• There are at least four compelling reasons for choosing mRNA-seq instead of microarray-based technologies
  • Specificity of what is being measured
  • Reduced technical (batch) bias
  • Increased dynamic range and log ratio (FC) estimates
  • More sensitive detection of genes, transcripts, and differential expression
• Other reasons
  • Detection of expressed SNVs
  • Detection of fusions and other structural variations
  • No transcriptome definition is needed
  • No probes need to be designed or manufactured
  • Cost (will soon be equivalent to microarray on a per-assay basis)
Why mRNAseq? – Reduced Bias: Cell types separate biologically [Figure legend: CD19, CD8, CD14, CD4]
Why mRNAseq? – Reduced Processing Bias: Client’s miRNAseq samples were sequenced on 4 different machines at 2 different sites at different times over several months, with no apparent bias in the top principal components [Figure legend: GAIIx, HS-01, HS-02, HS-IL]
Library preparation: mRNA; RNA Capture (enrichment via hybridization); Total RNA (depletion of rRNA via hybridization); Blood, MT, etc.
Sequencing parameters: Read Length. Trapnell et al., Nature Biotechnology 31, 46–53 (2013). Precision = PPV; Recall = Sensitivity
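The two metrics named on this slide can be written out as a small sketch. The function name and the counts below are hypothetical and are not taken from Trapnell et al.; they only illustrate how precision (PPV) and recall (sensitivity) are computed from true positives, false positives, and false negatives.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Return (precision, recall) from confusion counts.

    precision = PPV         = TP / (TP + FP)
    recall    = sensitivity = TP / (TP + FN)
    """
    called = true_pos + false_pos   # everything the method reported
    actual = true_pos + false_neg   # everything that is really there
    precision = true_pos / called if called else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall


# Hypothetical example: 80 correct calls, 20 false calls, 40 missed.
precision, recall = precision_recall(80, 20, 40)
print(f"precision={precision:.3f} recall={recall:.3f}")
```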
Detection is Dependent on Depth PMID: 24888378
Computational Processing
• Technical variation (batch effects) from library preparation and sequencing is small; the sequencing strategy, especially depth, determines the level of repeatability and detection
• The raw results of sequencing require significant computational processing
  • Alignment: maximizing unambiguous alignments; alignment of reads that cross exon junctions; Ex: Bowtie, BWA, TopHat, MapSplice, STAR, ...
  • Abundance estimation: gene or transcript; handling alignments that are ambiguous in the transcriptome; Ex: Sailfish, RSEM, Cufflinks, MISO, Salmon, IsoEM, IsoInfer, Rseq, ...
  • Normalization of read counts: minimizing bias due to variation in the number of clusters available; Ex: total count (RPM), upper quartile, quantile, density
• Different algorithmic and computational strategies, and the reference genome and transcriptome definition, impact performance much more than SE vs. PE or 50 bp vs. 100 bp
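Two of the normalization options listed above, total count (RPM) and upper quartile, can be sketched as follows. This is a minimal illustration on one sample, not the implementation used by any particular pipeline. The gene names, counts, the `scale` constant, and the choice to take the upper quartile over nonzero counts are all assumptions made for the example.

```python
import statistics


def rpm(counts):
    """Total-count normalization: reads per million mapped reads."""
    total = sum(counts.values())
    return {gene: c * 1e6 / total for gene, c in counts.items()}


def upper_quartile(counts, scale=1000.0):
    """Divide by the 75th percentile of the nonzero counts.

    `scale` is an arbitrary constant (an assumption here) that keeps
    the normalized values in a readable range.
    """
    nonzero = [c for c in counts.values() if c > 0]
    q3 = statistics.quantiles(nonzero, n=4, method="inclusive")[2]
    return {gene: c * scale / q3 for gene, c in counts.items()}


# Hypothetical raw read counts for a single sample.
sample = {"geneA": 100, "geneB": 300, "geneC": 0, "geneD": 600}
print(rpm(sample))
print(upper_quartile(sample))
```

Both functions rescale one sample at a time, so a sample with more sequenced clusters does not appear to have higher expression across the board.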
Alignment: BWA, Bowtie alignment to transcriptome; count alignments. Trinity, Trans-Abyss. [Figure: reads aligned to the transcriptome, some marked X]
Example Concordant Gene [Figure labels: V2, V1] http://www.broadinstitute.org/igv/
Alignment TopHat, MapSplice, STAR Trinity, Trans-Abyss
Alignment Comparison Engstrom et al., Nature Methods 10, 1185-1191 (2013)
Alignment Comparison Splice Junction Accuracy Engstrom et al., Nature Methods 10, 1185-1191 (2013)
Computational Processing
• Technical variation (batch effects) from library preparation and sequencing is small; the sequencing strategy, especially depth, determines the level of repeatability and detection
• The raw results of sequencing require significant computational processing
  • Alignment: maximizing unambiguous alignments; alignment of reads that cross exon junctions; Ex: Bowtie, BWA, TopHat
  • Abundance estimation: gene or transcript; handling alignments that are ambiguous in the transcriptome; Ex: Sailfish, RSEM, Cufflinks, MISO, IsoEM, IsoInfer, Rseq, ...
  • Normalization of read counts: minimizing bias due to variation in the number of clusters available; Ex: total count (RPM), upper quartile, quantile, density
• Different algorithmic and computational strategies, especially the transcriptome definition, impact performance much more than SE vs. PE or 50 bp vs. 100 bp
Multireads: Reads Mapping to Multiple Genes/Transcripts HTSeq << PMID: 28784092 Wang X, Wu Z, Zhang X. Isoform abundance inference provides a more accurate estimation of gene expression levels in RNA-seq. J Bioinform Comput Biol. 2010 Dec;8 Suppl 1:177-92. PubMed PMID: 21155027.
Multireads: Reads Mapping to Multiple Genes/Transcripts [Figure: three genes (1 Long, 2 Medium, 3 Short) with unique reads and multireads; count labels 350, 200, 150, 100, 300, 50, 200; relative abundance for these genes, f1, f2, f3]
Approach 1: Ignore Multireads [Figure: three genes (1 Long, 2 Medium, 3 Short); count labels 350, 200, 150, 100, 300, 50, 200; relative abundance for these genes, f1, f2, f3] Nagalakshmi et al., Science 2008; Marioni et al., Genome Research 2008
Approach 1: Ignore Multireads [Figure: three genes (1 Long, 2 Medium, 3 Short); count labels 350, 200, 150, 100, 300, 50, 200]
• Over-estimates the abundance of genes with unique reads
• Under-estimates the abundance of genes with multireads
• Not an option at all if isoform expression is of interest
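The bias described in these bullets can be seen in a small sketch that simply discards every read aligning to more than one gene. The gene names, lengths, and read counts are hypothetical and are not the values from the slide figure. Dividing by full gene length is a simplifying assumption.

```python
def unique_only_abundance(reads, lengths):
    """Relative abundance from uniquely mapped reads only.

    reads:   list of sets, each holding the genes a read aligns to
    lengths: dict mapping gene name to gene length in bases
    """
    counts = {gene: 0 for gene in lengths}
    for hits in reads:
        if len(hits) == 1:              # multireads are thrown away
            (gene,) = hits
            counts[gene] += 1
    # Length-normalize, then rescale so the abundances sum to 1.
    rates = {g: counts[g] / lengths[g] for g in lengths}
    total = sum(rates.values())
    return {g: (r / total if total else 0.0) for g, r in rates.items()}


# Hypothetical data: 'medium' and 'short' share sequence, so 20 reads
# align to both of them and are ignored by this approach.
lengths = {"long": 300, "medium": 200, "short": 100}
reads = (
    [{"long"}] * 30
    + [{"medium"}] * 10
    + [{"short"}] * 10
    + [{"medium", "short"}] * 20
)
print(unique_only_abundance(reads, lengths))
```

In this toy data the 20 shared reads never count toward `medium` or `short`, so `long` takes a larger share of the total than it would if every read were used.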
Approach 2: Allocate Fraction of Multireads Using Estimates From Uniques [Figure: three genes (1 Long, 2 Medium, 3 Short); count labels 350, 200, 150, 100, 300, 50, 200; relative abundance for these genes, f1, f2, f3] Mortazavi et al., Nature Methods 2008. Ex: Sailfish, RSEM, Cufflinks
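A minimal sketch of the idea named in this slide title: each multiread is split among its candidate genes in proportion to the density of uniquely mapped reads on those genes. This is an illustration of the general idea under simplifying assumptions, not the exact procedure of any cited paper or tool. All names and numbers are hypothetical.

```python
def allocate_multireads(reads, lengths):
    """Return per-gene counts after fractional multiread allocation.

    reads:   list of sets, each holding the genes a read aligns to
    lengths: dict mapping gene name to gene length in bases
    """
    unique = {gene: 0.0 for gene in lengths}
    multireads = []
    for hits in reads:
        if len(hits) == 1:
            unique[next(iter(hits))] += 1
        else:
            multireads.append(hits)

    # Unique-read density per base is the initial abundance estimate.
    density = {g: unique[g] / lengths[g] for g in lengths}

    counts = dict(unique)
    for hits in multireads:
        weight = sum(density[g] for g in hits)
        for g in hits:
            # Fall back to an even split if no candidate has unique reads.
            share = density[g] / weight if weight else 1.0 / len(hits)
            counts[g] += share
    return counts


# Hypothetical data: 20 reads align to both 'medium' and 'short'.
lengths = {"long": 300, "medium": 200, "short": 100}
reads = (
    [{"long"}] * 30
    + [{"medium"}] * 10
    + [{"short"}] * 10
    + [{"medium", "short"}] * 20
)
print(allocate_multireads(reads, lengths))
```

Every read contributes a total weight of 1, so the allocated counts sum to the number of reads.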
Cufflinks PMID: 20436464
RSEM • Li and Dewey, 2011 • PMID: 21816040 • θ_i represents the probability that a fragment is derived from transcript i • [Figure panels: A) PE isoform; B) PE gene; C) SE isoform; D) SE gene]
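To make θ_i concrete, the sketch below runs a heavily simplified expectation-maximization loop that estimates, for each transcript, the probability that a fragment came from it. It is not RSEM's actual model: it assumes a fragment is equally likely to start anywhere on a transcript (weight θ_i / length_i) and ignores read quality, fragment-length distributions, orientation, and noise reads. Transcript names, lengths, and read sets are hypothetical.

```python
def em_theta(reads, lengths, n_iter=200):
    """Estimate theta[t] = P(fragment derived from transcript t).

    reads:   list of non-empty sets of transcripts each read is compatible with
    lengths: dict mapping transcript name to length in bases
    """
    transcripts = list(lengths)
    theta = {t: 1.0 / len(transcripts) for t in transcripts}
    for _ in range(n_iter):
        expected = {t: 0.0 for t in transcripts}
        # E-step: split each read among its compatible transcripts.
        for hits in reads:
            weights = {t: theta[t] / lengths[t] for t in hits}
            norm = sum(weights.values())
            for t, w in weights.items():
                expected[t] += w / norm if norm else 1.0 / len(hits)
        # M-step: theta is the expected fraction of reads per transcript.
        theta = {t: expected[t] / len(reads) for t in transcripts}
    return theta


# Hypothetical data: 20 reads are compatible with both transcripts.
lengths = {"t1": 200, "t2": 100}
reads = [{"t1"}] * 30 + [{"t2"}] * 10 + [{"t1", "t2"}] * 20
print(em_theta(reads, lengths))
```

Unlike a one-pass allocation from unique reads, the loop re-estimates the split of ambiguous reads on every iteration until the values of θ stop changing.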
Salmon Novelties • Streaming variational Bayes (VB) inference combined with batched VB or EM • Lightweight alignment through maximal exact matches • Transcript / gene abundance inference is abstracted from the alignment step [RSEM also permits this; sam-xlate in https://github.com/mozack/ubu/wiki]
Repeatability & Detection by Isoform Database • Larger reference transcriptomes result in reduced repeatability (left), but increased detection (right) • Detection - 73% of RefSeq, 66% of UCSC, and 52% of Ensembl