RNA-seq: From experimental design to gene expression.

RNA-seq: From experimental design to gene expression. Steve Munger (@stevemunger), The Jackson Laboratory. UMaine Computational Methods in Biology/Genomics, September 29, 2014.

Outline: General overview of RNA-seq analysis.
• Introduction to RNA-seq
• The importance of a good experimental design
• Quality control
• Read alignment
• Quantifying isoform and gene expression
• Normalization of expression estimates

Understanding gene expression: Alwine et al., PNAS 1977; DeRisi et al., Science 1997. [Figure labels: ON/OFF; 1.5, 1.5, 10.5]

Next Generation Genome Sequencers: Illumina HiSeq and MiSeq, PacBio SMRT, 454 GS FLX, Oxford Nanopore

RNA-Seq: Sequencing Transcriptomes [Figure: RNA sequenced as many short reads such as AGCTA, ATGCTCA, TAGATGCTCA and AGCTAATCCTAG]

Applications of RNA-Seq Technology
• Gene expression analysis
• Novel exon discovery
• Allele-specific gene expression
• RNA editing
[Figure: short reads illustrating each application, including reads carrying an A or a G at a variable (A/G) position]

RNA-Seq: Total RNA -> mRNA -> mRNA after fragmentation -> cDNA -> adaptors ligated to cDNA -> single/paired-end sequencing

Challenges in RNA-Seq Work Flow: Study Design -> RNA isolation / Library Prep -> Sequencing Reads (SE or PE) -> Aligned Reads -> Quantified isoform and gene expression

Know your application: design your experiment accordingly
• Differential expression of highly expressed and well-annotated genes?
  • Less read depth per sample; more biological replicates.
  • No need for paired-end reads; shorter reads (50bp) may be sufficient.
  • Better to have 20 million 50bp reads than 10 million 100bp reads.
• Looking for novel genes/splicing/isoforms?
  • More read depth, paired-end reads from longer fragments.
• Allele-specific expression
  • “Good” genomes for both strains.
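One way to see why 20 million 50bp reads can beat 10 million 100bp reads for expression counting: both designs produce the same number of sequenced bases, but counts are made per read, not per base. The sketch below assumes a simple Poisson counting model, which is an assumption for illustration and not something stated in the slides; `poisson_cv` and the gene fraction of 1e-5 are hypothetical.

```python
import math


def poisson_cv(total_reads, gene_fraction):
    """Coefficient of variation of a gene's read count under a Poisson model.

    gene_fraction is the (hypothetical) share of all reads that come from the gene.
    """
    expected_count = total_reads * gene_fraction
    return 1 / math.sqrt(expected_count)


# Same total number of bases in both designs.
bases_short = 20_000_000 * 50
bases_long = 10_000_000 * 100

cv_short = poisson_cv(20_000_000, 1e-5)  # 20 million 50bp reads
cv_long = poisson_cv(10_000_000, 1e-5)   # 10 million 100bp reads
print(bases_short == bases_long, round(cv_short, 4), round(cv_long, 4))
```

Under this model, doubling the number of reads lowers the counting noise by a factor of sqrt(2), while doubling read length does not change the count at all.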

Good Experimental Design: Multiplexing, Replication, Randomization
• One Illumina HiSeq flowcell = 8 lanes
• Multiplex up to 24 samples in a lane
• 187 million SE reads per lane
• 374 million PE reads per lane
• Cost per lane:
  • 100 bases PE: $1800 - $2700
  • 50 bases PE: $1400 - $2100
  • 50 bases SE: $800 - $1200
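As a rough worked example of the numbers above, the following sketch estimates reads and cost per sample when several samples are multiplexed in one lane. The reads-per-lane figure and price range are taken from the slide; the function name `per_sample`, the even split across samples, and the choice of 12 samples are illustrative assumptions.

```python
# Per-lane figures quoted on the slide; everything else is illustrative.
SE_READS_PER_LANE = 187_000_000
COST_50BP_SE = (800, 1200)  # USD per lane, low and high estimate


def per_sample(reads_per_lane, cost_range, samples_per_lane):
    """Split one lane evenly across multiplexed samples (an idealisation)."""
    if not 1 <= samples_per_lane <= 24:
        raise ValueError("the slide quotes up to 24 samples per lane")
    reads = reads_per_lane // samples_per_lane
    low, high = (cost / samples_per_lane for cost in cost_range)
    return reads, low, high


reads, low, high = per_sample(SE_READS_PER_LANE, COST_50BP_SE, 12)
print(f"{reads:,} reads per sample, ${low:.0f}-${high:.0f} per sample")
```

In practice the split between barcoded samples is never perfectly even, so the per-sample read count should be read as an upper-level estimate.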

RNA-Seq Experimental Design: Randomization [Figure: samples from Experimental Group 1 and Experimental Group 2 assigned to two Illumina lanes, contrasting a Bad Design, a Good Design and a Better Design; Random.org is shown as a source of random assignments]
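The figure for this slide is not recoverable, so the sketch below shows only one possible reading of lane randomization: each experimental group is shuffled and then dealt across lanes, so that no lane holds a single group. The function `assign_lanes`, the sample names, and the use of Python's `random` module in place of Random.org are all assumptions for illustration.

```python
import random


def assign_lanes(groups, n_lanes=2, seed=None):
    """Randomly assign samples to lanes, balancing each group across lanes.

    groups: dict mapping group name -> list of sample ids (hypothetical names).
    Returns a dict mapping lane number (1-based) -> list of sample ids.
    """
    rng = random.Random(seed)
    lanes = {lane: [] for lane in range(1, n_lanes + 1)}
    for samples in groups.values():
        shuffled = list(samples)
        rng.shuffle(shuffled)
        # Deal the shuffled samples round-robin, starting at a random lane,
        # so each lane receives a near-equal share of every group.
        offset = rng.randrange(n_lanes)
        for i, sample in enumerate(shuffled):
            lanes[(i + offset) % n_lanes + 1].append(sample)
    return lanes


groups = {
    "group1": [f"G1_{i}" for i in range(1, 7)],
    "group2": [f"G2_{i}" for i in range(1, 7)],
}
for lane, samples in assign_lanes(groups, seed=1).items():
    print(lane, sorted(samples))
```

Recording the seed alongside the sample sheet makes the assignment reproducible.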

Challenges in RNA-Seq Work Flow: Study Design -> RNA isolation / Library Prep -> Sequencing Reads (SE or PE) -> Aligned Reads -> Quantified isoform and gene expression

Total RNA -> poly-A tail selection -> mRNA -> mRNA after fragmentation (not actually random) -> cDNA (“not so” random primers) -> size selection step (gel extraction) -> adaptors ligated to cDNA -> PCR amplification
• Adapter dimers

Sequence Read: Sanger fastq format
  @HISEQ2000_0074:8:1101:7544:2225#TAGCTT/1
  TCACCCGTAAGGTAACAAACCGAAAGTATCCAAAGCTAAAAGAAGTGGACGACGTGCTTGGTGGAGCAGCTGCATG
  +
  CCCFFFFFHHHHDHHJJJJJJJJIJJ?FGIIIJJJJJJIJJJJJJFHIJJJIJHHHFFFFD>AC?B??C?ACCAC>BB<<<>C@CCCACCCDCCIJ
Header fields in @HISEQ2000_0074:8:1101:7544:2225#TAGCTT/1:
• HISEQ2000_0074: instrument: run/flowcell id
• 8:1101: flowcell lane and tile number
• 7544:2225: X-Y coordinate in flowcell
• TAGCTT: index sequence
• /1: the member of a pair
Phred score: Q = -10 log10 P
• 10 indicates a 1 in 10 chance of error
• 20 indicates 1 in 100
• 30 indicates 1 in 1000
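The header layout and the Phred formula above can be sketched in a few lines. This is a minimal illustration that assumes the header follows exactly the pattern of the example record; the names `parse_header`, `phred_to_error_prob` and `decode_qualities` are hypothetical, and the offset of 33 is the standard Sanger FASTQ encoding.

```python
import re

# Pattern for headers shaped like the example on the slide.
HEADER = re.compile(
    r"^@(?P<instrument>[^:]+):(?P<lane>\d+):(?P<tile>\d+):"
    r"(?P<x>\d+):(?P<y>\d+)#(?P<index>[^/]+)/(?P<pair>[12])$"
)


def parse_header(line):
    """Split a read header into its named fields."""
    match = HEADER.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised header: {line!r}")
    fields = match.groupdict()
    for key in ("lane", "tile", "x", "y", "pair"):
        fields[key] = int(fields[key])
    return fields


def phred_to_error_prob(q):
    """Invert Q = -10 * log10(P) to get the error probability P."""
    return 10 ** (-q / 10)


def decode_qualities(qual_string, offset=33):
    """Sanger FASTQ stores each score Q as the character chr(Q + 33)."""
    return [ord(ch) - offset for ch in qual_string]


header = parse_header("@HISEQ2000_0074:8:1101:7544:2225#TAGCTT/1")
print(header["lane"], header["tile"], header["index"], header["pair"])
print(decode_qualities("CCCFF"), phred_to_error_prob(30))
```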

Quality control: To Trim or not to Trim?

Quality Control: Sequence quality per base position
• Bad data
  • High variance
  • Quality decreases with length
• Good data
  • Consistent
  • High quality along the reads

Per sequence quality distribution (NGS Data Preprocessing). Y = number of reads; X = mean sequence quality. [Figure: distributions for bad data, average data and good data]
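The per-sequence quality plot can be approximated by computing the mean Phred score of each read and counting reads per score. The sketch below assumes Sanger-encoded quality strings (offset 33); `mean_quality` and `quality_histogram` are hypothetical names, the toy strings are made up, and this is not necessarily how any particular QC tool computes the plot.

```python
from collections import Counter


def mean_quality(qual_string, offset=33):
    """Mean Phred score of one read's quality string."""
    scores = [ord(ch) - offset for ch in qual_string]
    return sum(scores) / len(scores)


def quality_histogram(qual_strings, offset=33):
    """Count reads (Y axis) at each rounded mean quality (X axis)."""
    return Counter(
        round(mean_quality(q, offset)) for q in qual_strings if q
    )


toy_qualities = ["IIII", "IIII", "5555", "####"]  # made-up quality strings
hist = quality_histogram(toy_qualities)
for score in sorted(hist):
    print(score, hist[score])
```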

Quality Control: Sequence Content Across Bases

K-mer content: counts the enrichment of every 5-mer within the sequence library. Bad: k-mer enrichment >= 10-fold at any individual base position.
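A simplified version of positional k-mer enrichment is sketched below. It compares how often each 5-mer is seen at a position with how often it would be seen there if its occurrences were spread evenly over all positions. This expected-count model is an assumption chosen for illustration; the statistic used by a given QC tool may differ, and `kmer_position_enrichment` is a hypothetical name.

```python
from collections import Counter, defaultdict


def kmer_position_enrichment(reads, k=5):
    """Return {(kmer, position): observed / expected} for every k-mer seen."""
    by_pos = defaultdict(Counter)  # position -> counts of k-mers starting there
    totals = Counter()             # k-mer -> count over all positions
    windows = Counter()            # position -> number of k-mer windows there
    for read in reads:
        for pos in range(len(read) - k + 1):
            kmer = read[pos:pos + k]
            by_pos[pos][kmer] += 1
            totals[kmer] += 1
            windows[pos] += 1
    n_windows = sum(windows.values())
    enrichment = {}
    for pos, counts in by_pos.items():
        for kmer, observed in counts.items():
            # Expected count if the k-mer were spread evenly over positions,
            # weighted by how many windows exist at this position.
            expected = totals[kmer] * windows[pos] / n_windows
            enrichment[(kmer, pos)] = observed / expected
    return enrichment


toy_reads = ["GATTACCCCC", "GATTAGGGGG", "GATTATTTTT"]  # made-up reads
scores = kmer_position_enrichment(toy_reads)
flagged = {key for key, fold in scores.items() if fold >= 10}
print(scores[("GATTA", 0)], flagged)
```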

K-mer content Most samples

Duplicated sequences. Good: non-unique sequences make up less than 20%. Bad: non-unique sequences make up more than 50%.
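The thresholds above can be applied to a set of reads as follows. The sketch assumes that "non-unique" means a read whose exact sequence occurs more than once, which is one possible definition; `non_unique_fraction` and `classify` are hypothetical names, and the "intermediate" label for the 20% to 50% range is an addition for completeness.

```python
from collections import Counter


def non_unique_fraction(sequences):
    """Fraction of reads whose exact sequence occurs more than once."""
    counts = Counter(sequences)
    duplicated = sum(n for n in counts.values() if n > 1)
    return duplicated / len(sequences)


def classify(fraction):
    """Apply the slide's thresholds: <20% good, >50% bad."""
    if fraction < 0.20:
        return "good"
    if fraction > 0.50:
        return "bad"
    return "intermediate"


toy_sequences = ["ACGT", "ACGT", "TTGA", "CCAG"]  # made-up reads
fraction = non_unique_fraction(toy_sequences)
print(fraction, classify(fraction))
```

A high fraction on its own does not distinguish PCR duplicates from reads of a very highly expressed gene.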

PCR duplicates or highly expressed genes? The case of Albumin: ~80,000x coverage here.

Tradeoffs to preprocessing data
• Signal/noise -> Preprocessing can remove low-quality “noise”, but the cost is information loss.
• Some uniformly low-quality reads map uniquely to the genome.
• Trimming reads to remove lower-quality ends can adversely affect alignment, especially if aligning to the genome and the read spans a splice site.
• Duplicated reads or just highly expressed genes?
• Most aligners can take quality scores into consideration.
• Currently, we do not recommend preprocessing reads aside from removing uniformly low-quality samples.

The problem with trimming all SE reads: 100bp reads vs. all reads trimmed to 75bp. Longer is better for splice-junction-spanning reads.
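A small calculation illustrates why longer reads help at splice junctions. If an aligner needs at least some minimum number of bases on each side of a junction, the number of read placements that can span it grows with read length. The minimum anchor of 10 bases is a hypothetical value, not a setting of any specific aligner, and `junction_spanning_placements` is a made-up name.

```python
def junction_spanning_placements(read_len, min_anchor):
    """Number of ways a read can cross a junction with at least
    min_anchor bases on each side of it."""
    return max(0, read_len - 2 * min_anchor + 1)


full_length = junction_spanning_placements(100, 10)
trimmed = junction_spanning_placements(75, 10)
print(full_length, trimmed)
```

Under this assumption, trimming from 100bp to 75bp removes roughly a third of the placements from which a read could support the same junction.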

Quality Control: Resources
• FASTX-Toolkit
  • http://hannonlab.cshl.edu/fastx_toolkit/
• FastQC
  • http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

RNA-Seq Work Flow: Study Design -> Total RNA -> Sequencing Reads (SE or PE) -> Aligned Reads -> Quantified isoform and gene expression

Alignment 101 [Figure: a 100bp read (ACATGCTGCGGA) compared against Chr 1, Chr 2 and Chr 3]

The perfect read: 1 read = 1 unique alignment. [Figure: the 100bp read ACATGCTGCGGA has a single match (✓) among Chr 1, Chr 2 and Chr 3]

Some reads will align equally well to multiple locations: “multireads”. 1 read, 3 valid alignments; only 1 alignment is correct. [Figure: the 100bp read ACATGCTGCGGA matches three identical locations, one marked ✓ and two marked ✗]

The worst case scenario: 1 read, 2 valid alignments; neither is correct. [Figure: the 100bp read ACATGCTGCGGA shown with candidate locations ACATGCTGCGGA and ACATGCTGCGGC, both marked ✗]

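Individual genetic variation may affect read alignment. [Figure: the 100bp read ATATGCTGCGGA, which differs from the reference at one base, shown with reference locations ACATGCTGCGGA and ACATGCTGCGGC and two alignments marked ✗]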
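The alignment cases above (unique reads, multireads, and reads affected by variants) can be reproduced with a toy brute-force aligner. This is a sketch for intuition only: real aligners use indexed data structures, and the chromosome sequences, the function names `hamming` and `align`, and the mismatch budget are all made up for the example.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def align(read, reference, max_mismatches=0):
    """Return (chrom, position, mismatches) for every placement of the read
    that stays within the mismatch budget."""
    hits = []
    for chrom, seq in reference.items():
        for pos in range(len(seq) - len(read) + 1):
            mismatches = hamming(read, seq[pos:pos + len(read)])
            if mismatches <= max_mismatches:
                hits.append((chrom, pos, mismatches))
    return hits


reference = {  # hypothetical toy chromosomes
    "chr1": "TTTTACATGCTGCGGCTTTT",
    "chr2": "GGGGACATGCTGCGGAGGGG",
    "chr3": "CCCCACATGCTGCGGACCCC",
}

multiread = align("ACATGCTGCGGA", reference)            # two exact hits
variant_exact = align("ATATGCTGCGGA", reference)        # read carries a variant
variant_relaxed = align("ATATGCTGCGGA", reference, 1)   # allow one mismatch
print(multiread, variant_exact, variant_relaxed)
```

Loosening the mismatch budget rescues the variant-carrying read, but it also makes additional, possibly wrong, locations eligible, which is the tradeoff the slides describe.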

Aligning Billions of Short Sequence Reads. Aligners: Bowtie, GSNAP, BWA, MAQ, BLAT. Designed to align short reads fast, but not always accurately. [Figure: reads aligned to Gene A and Gene B]

Aligning Sequence Reads [Figure: reads aligned to Gene A and Gene B]
• Gene family (orthologous/paralogous)
• Low-complexity sequence
• Alternatively spliced isoforms
• Pseudogenes
• Polymorphisms
• Indels
• Structural variants
• Reference sequence errors

Align to Genome or Transcriptome?

Aligning to Reference Genome: Exon First Alignment [Figure: reads aligned to Exon 1, Exon 2 and Exon 3]

Exon First Alignment: Pseudogene problem [Figure: a gene with Exon 1, Exon 2 and Exon 3 alongside a processed pseudogene made up of Exon 1, Exon 2 and Exon 3]

Align to Genome or Transcriptome?
Genome
• Advantages: Can align novel isoforms.
• Disadvantages: Difficult; spurious alignments, spliced alignment, gene families, pseudogenes.
Transcriptome

Reference Transcriptome [Figure: Gene 1 (Exon 1, Exon 2, Exon 3) and Gene 2 (Exon 1, Exon 2, Exon 3, Exon 4), each shown with Isoform 1, Isoform 2 and Isoform 3]

Align to Genome or Transcriptome?
Genome
• Advantages: Can align novel isoforms.
• Disadvantages: Difficult; spurious alignments, spliced alignment, gene families, pseudogenes.
Transcriptome
• Advantages: Easy; focused on the part of the genome that is known to be transcribed.
• Disadvantages: Reads that come from novel isoforms may not align at all or may be misattributed to a known isoform.

Better Approach: Aligning to Transcriptome and Genome
• Align to the transcriptome first.
• Align the remaining reads to the genome next.
• Advantages: relatively simple; overcomes the pseudogene and novel isoform problems.
• Tools: RUM, TopHat2, STAR
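The two-stage idea can be sketched with exact string matching: reads are first looked up in the known transcripts, and only reads with no transcript hit are then looked up in the genome. This is a toy illustration, not how RUM, TopHat2 or STAR are implemented; the sequences and the names `find_all` and `two_stage_align` are hypothetical.

```python
def find_all(read, sequences):
    """Exact-match placements of read in a dict of name -> sequence."""
    hits = []
    for name, seq in sequences.items():
        start = seq.find(read)
        while start != -1:
            hits.append((name, start))
            start = seq.find(read, start + 1)
    return hits


def two_stage_align(reads, transcriptome, genome):
    """Try the transcriptome first; fall back to the genome for the rest."""
    results = {}
    for read in reads:
        hits = find_all(read, transcriptome)
        stage = "transcriptome"
        if not hits:
            hits = find_all(read, genome)
            stage = "genome" if hits else "unaligned"
        results[read] = (stage, hits)
    return results


# Made-up gene: two exons separated by an intron.
exon1, intron, exon2 = "ACGTACGGAT", "GTAAGTTTCCAG", "CCTGAAGCTA"
genome = {"chr1": "TTTT" + exon1 + intron + exon2 + "TTTT"}
transcriptome = {"tx1": exon1 + exon2}

junction_read = exon1[-5:] + exon2[:5]  # spans the exon1-exon2 junction
intronic_read = intron[:10]             # stands in for an unannotated exon
results = two_stage_align(
    [junction_read, intronic_read, "GGGGGGGG"], transcriptome, genome
)
for read, (stage, hits) in results.items():
    print(read, stage, hits)
```

The junction-spanning read is found only in the transcript, while the read from unannotated sequence is found only in the genome, which is why the combined approach covers both cases.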

Conclusions
• There is no perfect aligner. Pick one well-suited to your application.
  • E.g. Want to identify novel exons? Don’t align only to the known set of isoforms.
• Visually inspect the resulting alignments. Setting a parameter a little too liberal or conservative can have a huge effect on alignment.
• Consider running the same fastq files through multiple alignment pipelines specific to each application.
  • Gene expression -> Bowtie to transcriptome
  • Exon discovery -> RUM or other hybrid mapper
  • Variant detection -> GSNAP, GATK, Samtools mpileup
• If your species has not been sequenced, use a de novo assembly method. Can also use the genome of a related species as a scaffold.

Output of most aligners: BAM/SAM file of reads and genome positions
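A SAM alignment line is tab-separated text with 11 mandatory fields, so the read name, chromosome and position can be pulled out with plain string handling. The sketch below is a minimal parser for illustration; the example line is made up, `SamRecord`, `parse_sam_line` and `is_unmapped` are hypothetical names, and in practice a dedicated library would normally be used for BAM/SAM files.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flag
    rname: str   # reference (chromosome) name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # alignment description, e.g. 12M
    seq: str     # read sequence
    qual: str    # base qualities


def parse_sam_line(line):
    """Parse one alignment line (not a header line starting with '@')."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line has 11 mandatory fields")
    return SamRecord(
        qname=fields[0], flag=int(fields[1]), rname=fields[2],
        pos=int(fields[3]), mapq=int(fields[4]), cigar=fields[5],
        seq=fields[9], qual=fields[10],
    )


def is_unmapped(record):
    """Flag bit 0x4 marks a read that has no alignment."""
    return bool(record.flag & 0x4)


# Made-up example line.
example = "read1\t0\tchr3\t1000\t255\t12M\t*\t0\t0\tACATGCTGCGGA\tIIIIIIIIIIII"
record = parse_sam_line(example)
print(record.qname, record.rname, record.pos, is_unmapped(record))
```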

Visualization of alignment data (BAM/SAM) • Genome browsers – UCSC, IGV, etc.
