Introduction to RNA-seq

Introduction to RNA-seq Joel Parker, Ph.D.

Why mRNAseq?
• Measurement of differential expression
• There are at least four compelling reasons for choosing mRNA-seq instead of microarray-based technologies
  • Specificity of what is being measured
  • Reduced technical (batch) bias
  • Increased dynamic range and log ratio (FC) estimates
  • More sensitive detection of genes, transcripts, and differential expression
• Other reasons
  • Detection of expressed SNVs
  • Detection of fusions and other structural variations
  • No transcriptome definition is needed
  • No probes need to be designed or manufactured
  • Cost (will soon be equivalent on a per-assay basis with microarray)

Why mRNAseq? – Reduced Bias
Cell types separate biologically
[Figure labels: CD19, CD8, CD14, CD4]

Why mRNAseq? – Reduced Processing Bias
Client’s miRNAseq samples were sequenced on 4 different machines at 2 different sites at different times over several months, with no apparent bias in the top principal components
[Figure labels: GAIIx, HS-01, HS-02, HS-IL]

Library preparation
• mRNA
• RNA Capture: Enrichment via hybridization
• Total RNA: Depletion of rRNA via hybridization
• Blood, MT, etc.

PMID: 24888378

mRNA

Sequencing parameters: Read Length
Trapnell et al., Nature Biotechnology 31, 46–53 (2013)
Precision = PPV; Recall = Sensitivity
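The two definitions on this slide can be written out as a minimal Python sketch. The function name and the counts below are hypothetical, chosen only to illustrate the arithmetic; they are not values from Trapnell et al.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) from counts of called features.

    precision = positive predictive value (PPV)
    recall    = sensitivity
    """
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall


# Hypothetical counts, for illustration only.
precision, recall = precision_recall(
    true_positives=80, false_positives=20, false_negatives=40
)
print(f"precision (PPV) = {precision:.3f}, recall (sensitivity) = {recall:.3f}")
```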

Detection is Dependent on Depth PMID: 24888378

Liu et al., Bioinformatics (2014) 30 (3): 301-304.

Computational Processing
• Technical variation (batch effects) from library preparation and sequencing is small; the sequencing strategy, especially depth, directs the level of repeatability and detection
• The raw results of sequencing require significant computational processing
  • Alignment: Maximizing unambiguous alignments; alignment of reads that cross exon junctions; Ex: Bowtie, BWA, TopHat, MapSplice, STAR, ...
  • Abundance estimation: Gene or transcript; handling alignments that are ambiguous in the transcriptome; Ex: Sailfish, RSEM, Cufflinks, MISO, Salmon, IsoEM, IsoInfer, Rseq, ...
  • Normalization of read counts: Minimizing bias due to variation in number of clusters available; Ex: Total count (RPM), upper quartile, quantile, density
• Different algorithmic and computational strategies, and the reference genome and transcriptome definition, impact performance much more than SE vs. PE or 50 bp vs. 100 bp
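Two of the normalization options named above, total count (RPM) and upper quartile, can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not the implementation of any particular pipeline: the function names and toy counts are assumptions, and the upper quartile is taken over nonzero counts, which is one common convention.

```python
from statistics import quantiles


def rpm(counts):
    """Total-count normalization: reads per million for one sample."""
    total = sum(counts)
    return [c / total * 1e6 for c in counts]


def upper_quartile_scale(counts):
    """Divide one sample's counts by the 75th percentile of its nonzero counts."""
    nonzero = sorted(c for c in counts if c > 0)
    uq = quantiles(nonzero, n=4, method="inclusive")[2]  # third quartile
    return [c / uq for c in counts]


# Hypothetical gene counts: sample_b is sample_a sequenced twice as deeply.
sample_a = [10, 30, 60, 0]
sample_b = [20, 60, 120, 0]

rpm_a, rpm_b = rpm(sample_a), rpm(sample_b)
uq_a, uq_b = upper_quartile_scale(sample_a), upper_quartile_scale(sample_b)
print(rpm_a, rpm_b)
print(uq_a, uq_b)
```

In this toy case both methods remove the twofold depth difference, so the normalized values of the two samples agree.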

Alignment
[Figure labels: BWA, Bowtie alignment to transcriptome; Trinity, Trans-Abyss; Transcriptome; Count alignments]

Example Concordant
[Figure tracks: Gene, V2, V1] http://www.broadinstitute.org/igv/

Example Discordant 1
[Figure tracks: Gene, V2, V1]

Example Discordant 2
[Figure tracks: Gene, V2, V1]

Alignment
[Figure labels: TopHat, MapSplice, STAR; Trinity, Trans-Abyss]

Alignment Comparison Engstrom et al., Nature Methods 10, 1185-1191 (2013)

Alignment Comparison Splice Junction Accuracy Engstrom et al., Nature Methods 10, 1185-1191 (2013)

Computational Processing
• Technical variation (batch effects) from library preparation and sequencing is small; the sequencing strategy, especially depth, directs the level of repeatability and detection
• The raw results of sequencing require significant computational processing
  • Alignment: Maximizing unambiguous alignments; alignment of reads that cross exon junctions; Ex: Bowtie, BWA, TopHat
  • Abundance estimation: Gene or transcript; handling alignments that are ambiguous in the transcriptome; Ex: Sailfish, RSEM, Cufflinks, MISO, IsoEM, IsoInfer, Rseq, ...
  • Normalization of read counts: Minimizing bias due to variation in number of clusters available; Ex: Total count (RPM), upper quartile, quantile, density
• Different algorithmic and computational strategies, especially the transcriptome definition, impact performance much more than SE vs. PE or 50 bp vs. 100 bp

Multireads: Reads Mapping to Multiple Genes/Transcripts
HTSeq << PMID: 28784092
Wang X, Wu Z, Zhang X. Isoform abundance inference provides a more accurate estimation of gene expression levels in RNA-seq. J Bioinform Comput Biol. 2010 Dec;8 Suppl 1:177-92. PubMed PMID: 21155027.

Multireads: Reads Mapping to Multiple Genes/Transcripts
[Figure: three genes (1 Long, 2 Medium, 3 Short) with unique reads and multireads; figure labels: 350, 200, 150, 100, 300, 50, 200, N]
Relative abundance for these genes: f1, f2, f3

Approach 1: Ignore Multireads
[Figure: three genes (1 Long, 2 Medium, 3 Short); figure labels: 350, 200, 150, 100, 300, 50, 200]
Relative abundance for these genes: f1, f2, f3
Nagalakshmi et al., Science 2008; Marioni et al., Genome Research 2008

Approach 1: Ignore Multireads
[Figure: three genes (1 Long, 2 Medium, 3 Short); figure labels: 350, 200, 150, 100, 300, 50, 200, N]
• Over-estimates the abundance of genes with unique reads
• Under-estimates the abundance of genes with multireads
• Not an option at all if interested in isoform expression

Approach 2: Allocate Fraction of Multireads Using Estimates From Uniques
[Figure: three genes (1 Long, 2 Medium, 3 Short); figure labels: 350, 200, 150, 100, 300, 50, 200, N]
Relative abundance for these genes: f1, f2, f3
Mortazavi et al., Nature Methods 2008
Sailfish, RSEM, Cufflinks
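One simple reading of Approach 2 can be sketched as follows: each group of multireads is split among the genes it maps to, in proportion to an abundance estimate computed from unique reads alone. Here that estimate is assumed to be unique reads per base of gene length. The function name, gene names, and counts are hypothetical, and the sketch is not the exact procedure of any cited tool.

```python
def allocate_multireads(unique_counts, lengths, multireads):
    """Split multireads among genes in proportion to unique-read density.

    unique_counts: {gene: number of uniquely mapped reads}
    lengths:       {gene: gene length in bases}
    multireads:    list of (genes, n_reads), where genes is a tuple of the
                   genes that this group of n_reads maps to equally well
    Returns {gene: unique reads + allocated share of multireads}.
    """
    # Abundance estimate from unique reads only (assumption: reads per base).
    density = {g: unique_counts[g] / lengths[g] for g in unique_counts}
    totals = {g: float(c) for g, c in unique_counts.items()}
    for genes, n_reads in multireads:
        weight_sum = sum(density[g] for g in genes)
        for g in genes:
            # Fall back to an even split if no gene has any unique reads.
            share = density[g] / weight_sum if weight_sum > 0 else 1 / len(genes)
            totals[g] += n_reads * share
    return totals


# Hypothetical example: 80 reads map equally well to gene_a and gene_b.
unique_counts = {"gene_a": 200, "gene_b": 300}
lengths = {"gene_a": 2000, "gene_b": 1000}
totals = allocate_multireads(unique_counts, lengths, [(("gene_a", "gene_b"), 80)])
print(totals)
```

With unique-read densities of 0.1 and 0.3, gene_a receives a quarter of the shared reads and gene_b the rest, and the total read count is conserved. Ignoring the multireads (Approach 1) would simply leave the unique counts unchanged.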

PMID: 20436464 Cufflinks

RSEM
• Li and Dewey, 2011
• PMID: 21816040
θ_i represents the probability that a fragment is derived from transcript i
[Figure panels: A) PE isoform; B) PE gene; C) SE isoform; D) SE gene]
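The role of θ_i can be illustrated with a heavily simplified expectation-maximization loop. This is a sketch of the general idea only, not RSEM's actual model: it assumes each read is equally compatible with every transcript it aligns to, and it ignores the quality scores, fragment-length distribution, and noise model that RSEM uses. All names and numbers are hypothetical.

```python
def em_theta(compatibility, lengths, n_iter=200):
    """Estimate theta (fraction of fragments per transcript) by EM.

    compatibility: list with one tuple per read, naming the transcripts
                   that the read aligns to
    lengths:       {transcript: length in bases}
    """
    transcripts = list(lengths)
    theta = {t: 1 / len(transcripts) for t in transcripts}  # uniform start
    for _ in range(n_iter):
        expected = {t: 0.0 for t in transcripts}
        # E-step: fractionally assign each read; a read's chance of starting
        # at its position on transcript t is taken as theta[t] / lengths[t].
        for read_transcripts in compatibility:
            weights = {t: theta[t] / lengths[t] for t in read_transcripts}
            norm = sum(weights.values())
            for t, w in weights.items():
                expected[t] += w / norm
        # M-step: theta is the expected fraction of reads per transcript.
        theta = {t: expected[t] / len(compatibility) for t in transcripts}
    return theta


# Hypothetical data: 60 reads unique to t1, 20 unique to t2, 20 ambiguous.
reads = [("t1",)] * 60 + [("t2",)] * 20 + [("t1", "t2")] * 20
theta = em_theta(reads, {"t1": 1000, "t2": 1000})
print(theta)
```

For this toy input the estimates settle near 0.75 for t1 and 0.25 for t2: the ambiguous reads end up divided according to the abundances implied by all reads, rather than being fixed in advance from the unique reads.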

Salmon Novelties
• Streaming variational Bayes (VB) inference combined with batched VB or EM
• Lightweight alignment through maximal exact matches
• Transcript / gene abundance inference is abstracted from the alignment step [RSEM also permits this; sam-xlate in https://github.com/mozack/ubu/wiki]

Repeatability & Detection by Isoform Database
• Larger reference transcriptomes result in reduced repeatability (left), but increased detection (right)
• Detection: 73% of RefSeq, 66% of UCSC, and 52% of Ensembl

