
RNA-seq: From experimental design to gene expression.

Steve Munger (@stevemunger), The Jackson Laboratory. UMaine Computational Methods in Biology/Genomics, September 29, 2014.


Presentation Transcript


  1. RNA-seq: From experimental design to gene expression. Steve Munger @stevemunger The Jackson Laboratory UMaine Computational Methods in Biology/ Genomics September 29, 2014

  2. Outline General overview of RNA-seq analysis. • Introduction to RNA-seq • The importance of a good experimental design • Quality control • Read alignment • Quantifying isoform and gene expression • Normalization of expression estimates

  3. Understanding gene expression Alwine et al. PNAS 1977 DeRisi et al. Science 1997 ON/OFF 1.5 1.5 10.5

  4. Next Generation Genome Sequencers • Illumina HiSeq and MiSeq • PacBio SMRT • 454 GS FLX • Oxford Nanopore N

  5. RNA-Seq: Sequencing Transcriptomes [Figure: RNA sequenced as many short reads] N

  6. Applications of RNA-Seq Technology • Gene expression analysis • Novel exon discovery • Allele-specific gene expression • RNA editing [Figure: example short reads, including reads covering an A/G site] N

  7. RNA-Seq: Total RNA → mRNA → mRNA after fragmentation → cDNA → Adaptors ligated to cDNA → Single/Paired End Sequencing N

  8. Challenges in RNA-Seq Work Flow: Study Design → RNA isolation/Library Prep → Sequencing Reads (SE or PE) → Aligned Reads → Quantified isoform and gene expression N

  9. Know your application – Design your experiment accordingly • Differential expression of highly expressed and well annotated genes? • Smaller sample depth; more biological replicates • No need for paired end reads; shorter reads (50bp) may be sufficient. • Better to have 20 million 50bp reads than 10 million 100bp reads. • Looking for novel genes/splicing/isoforms? • More read depth, paired-end reads from longer fragments. • Allele specific expression • “Good” genomes for both strains. N
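The depth-versus-length advice on this slide can be illustrated with a simple counting argument. The sketch below assumes that the read count for a gene is roughly Poisson distributed, which is an illustrative assumption and not something the slide states. The gene's share of all reads (`gene_fraction`) is a hypothetical value. Both designs sequence the same total number of bases, but the design with more reads gives a less noisy count.

```python
import math

def expected_count_and_cv(total_reads, gene_fraction):
    """Return the expected read count for a gene and its coefficient of variation.

    Assumes Poisson counting noise (sd = sqrt(mean)), so CV = 1 / sqrt(mean).
    """
    expected = total_reads * gene_fraction
    cv = 1.0 / math.sqrt(expected)
    return expected, cv

gene_fraction = 1e-5  # hypothetical: the gene receives 1 in 100,000 reads

for label, n_reads in [("10 million x 100bp", 10_000_000),
                       ("20 million x 50bp", 20_000_000)]:
    expected, cv = expected_count_and_cv(n_reads, gene_fraction)
    print(f"{label}: expected count = {expected:.0f}, CV = {cv:.3f}")
```

Under this assumption, doubling the number of reads lowers the relative counting noise by a factor of the square root of 2, regardless of read length.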

  10. Good Experimental Design: Multiplexing, Replication, Randomization • One Illumina HiSeq flowcell = 8 lanes • Multiplex up to 24 samples in a lane • 187 million SE reads per lane • 374 million PE reads per lane • Cost per lane: • 100 bases PE: $1800 - $2700 • 50 bases PE: $1400 - $2100 • 50 bases SE: $800 - $1200 N
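The per-lane figures on this slide translate into per-sample depth and cost once a multiplexing level is chosen. The sketch below uses the slide's 187 million SE reads per lane and the $800 - $1200 range for 50 base SE. It assumes reads are split evenly among the samples in a lane, which real runs only approximate. The function names are illustrative.

```python
def reads_per_sample(reads_per_lane, samples_per_lane):
    """Reads each sample receives, assuming an even split across the lane."""
    return reads_per_lane / samples_per_lane

def cost_per_sample(cost_per_lane, samples_per_lane):
    """Lane cost divided evenly among the multiplexed samples."""
    return cost_per_lane / samples_per_lane

SE_READS_PER_LANE = 187_000_000       # from the slide
SE_50_COST_RANGE = (800, 1200)        # 50 bases SE, dollars per lane, from the slide

for n_samples in (6, 12, 24):
    depth = reads_per_sample(SE_READS_PER_LANE, n_samples)
    low = cost_per_sample(SE_50_COST_RANGE[0], n_samples)
    high = cost_per_sample(SE_50_COST_RANGE[1], n_samples)
    print(f"{n_samples} samples/lane: {depth / 1e6:.1f}M reads each, "
          f"${low:.0f} - ${high:.0f} per sample")
```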

  11. RNA-Seq Experimental Design: Randomization Experimental Group 1 Experimental Group 2 Two Illumina Lanes Random.org Bad Design N

  12. RNA-Seq Experimental Design: Randomization Experimental Group 1 Experimental Group 2 Two Illumina Lanes Random.org Bad Design Good Design N

  13. RNA-Seq Experimental Design: Randomization Experimental Group 1 Experimental Group 2 Two Illumina Lanes Random.org Better Design Bad Design Good Design N
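One way to avoid confounding experimental group with lane is to shuffle the samples within each group and then deal them out across lanes. This is a minimal sketch of that idea, not necessarily the exact scheme pictured on the slides. The sample names and the `assign_lanes` helper are hypothetical, and Python's `random` module stands in for Random.org.

```python
import random

def assign_lanes(groups, n_lanes, seed=None):
    """Randomly assign samples to lanes, balancing each group across lanes.

    groups: dict mapping group name -> list of sample names.
    Returns a list of lanes, each a list of sample names.
    """
    rng = random.Random(seed)
    lanes = [[] for _ in range(n_lanes)]
    for samples in groups.values():
        shuffled = list(samples)
        rng.shuffle(shuffled)                 # random order within the group
        for i, sample in enumerate(shuffled):
            lanes[i % n_lanes].append(sample)  # deal out round-robin
    return lanes

# Hypothetical samples: two experimental groups, four samples each.
groups = {
    "group1": ["g1_a", "g1_b", "g1_c", "g1_d"],
    "group2": ["g2_a", "g2_b", "g2_c", "g2_d"],
}
for lane_number, lane in enumerate(assign_lanes(groups, n_lanes=2, seed=2014), start=1):
    print(f"Lane {lane_number}: {sorted(lane)}")
```

With group sizes that divide evenly by the number of lanes, every lane receives the same number of samples from each group, so a lane effect cannot masquerade as a group effect.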

  14. Challenges in RNA-Seq Work Flow: Study Design → RNA isolation/Library Prep → Sequencing Reads (SE or PE) → Aligned Reads → Quantified isoform and gene expression N

  15. Total RNA poly-A tail selection mRNA mRNA after fragmentation Not actually random “Not So” random primers cDNA Size selection step (gel extraction) Adaptors ligated to cDNA PCR amplification • Adapter dimers S

  16. Sequence Read: Sanger fastq format @HISEQ2000_0074:8:1101:7544:2225#TAGCTT/1 TCACCCGTAAGGTAACAAACCGAAAGTATCCAAAGCTAAAAGAAGTGGACGACGTGCTTGGTGGAGCAGCTGCATG + CCCFFFFFHHHHDHHJJJJJJJJIJJ?FGIIIJJJJJJIJJJJJJFHIJJJIJHHHFFFFD>AC?B??C?ACCAC>BB<<<>C@CCCACCCDCCIJ Header fields in @HISEQ2000_0074:8:1101:7544:2225#TAGCTT/1: • Instrument: run/flowcell id • Flowcell lane and tile number • X-Y coordinate in flowcell • Index sequence • The member of a pair Phred score: Q = -10 log10 P • 10 indicates a 1 in 10 chance of error • 20 indicates 1 in 100 • 30 indicates 1 in 1000 SN
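The header fields and the Phred formula on this slide can be sketched in a few lines. The regular expression below is written only for the header layout shown on the slide (instrument:lane:tile:x:y#index/pair); other Illumina header styles differ. Quality characters are decoded with an ASCII offset of 33, which is the Sanger fastq convention. The function names are illustrative.

```python
import re

# Matches headers shaped like the slide's example only.
HEADER_RE = re.compile(
    r"^@(?P<instrument>[^:]+):(?P<lane>\d+):(?P<tile>\d+):(?P<x>\d+):(?P<y>\d+)"
    r"#(?P<index>[^/]+)/(?P<pair>[12])$"
)

def parse_header(line):
    """Split a fastq header of the slide's style into named fields."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized header: {line!r}")
    return match.groupdict()

def phred_scores(quality, offset=33):
    """Decode a quality string into Phred scores (Sanger offset 33)."""
    return [ord(char) - offset for char in quality]

def error_probability(q):
    """Invert Q = -10 log10 P to get the error probability P."""
    return 10 ** (-q / 10)

fields = parse_header("@HISEQ2000_0074:8:1101:7544:2225#TAGCTT/1")
print(fields)
for q in phred_scores("CCCFFFFFHHHH"):
    print(q, f"{error_probability(q):.5f}")
```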

  17. To Trim or not to Trim? Quality control S

  18. Quality Control: Sequence quality per base position • Bad data • High Variance • Quality Decrease with Length • Good data • Consistent • High Quality Along the reads S

  19. Per sequence quality distribution bad data Y= number of reads X= Mean sequence quality Average data NGS Data Preprocessing S

  20. Per sequence quality distribution bad data Good data Y= number of reads X= Mean sequence quality Average data NGS Data Preprocessing S

  21. Quality Control: Sequence Content Across Bases S

  22. K-mer content: counts the enrichment of every 5-mer within the sequence library. Bad: k-mer enrichment >= 10-fold at any individual base position. NGS Data Preprocessing

  23. K-mer content Most samples

  24. Duplicated sequences Good: non-unique sequences make up less than 20% Bad: non-unique sequences make up more than 50% NGS Data Preprocessing S
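The 20% and 50% thresholds refer to the share of reads whose sequence is not unique. A minimal sketch of that quantity follows, computed directly as the fraction of reads whose exact sequence occurs more than once. This is a simplification for illustration and is not claimed to be how FastQC calculates its duplication metric. The toy reads are hypothetical.

```python
from collections import Counter

def duplicate_fraction(reads):
    """Fraction of reads whose exact sequence appears more than once."""
    if not reads:
        return 0.0
    counts = Counter(reads)
    non_unique = sum(count for count in counts.values() if count > 1)
    return non_unique / len(reads)

# Hypothetical toy library: one sequence appears three times.
reads = ["ACGTAC", "ACGTAC", "ACGTAC", "TTGACC", "GGCATA", "CCATGA"]
fraction = duplicate_fraction(reads)
print(f"non-unique reads: {fraction:.0%}")
```

A high value alone does not distinguish PCR duplicates from reads that pile up on a very highly expressed gene.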

  25. PCR duplicates or highly expressed genes? The case of Albumin. ~80,000x coverage here S

  26. Tradeoffs to preprocessing data • Signal/noise -> Preprocessing can remove low-quality “noise”, but the cost is information loss. • Some uniformly low-quality reads map uniquely to the genome. • Trimming reads to remove lower quality ends can adversely affect alignment, especially if aligning to the genome and the read spans a splice site. • Duplicated reads or just highly expressed genes? • Most aligners can take quality scores into consideration. • Currently, we do not recommend preprocessing reads aside from removing uniformly low quality samples. S

  27. The problem with trimming all SE reads 100bp reads All reads trimmed to 75 bp Longer is better for splice junction spanning reads S

  28. Quality Control: Resources • FASTX-Toolkit • http://hannonlab.cshl.edu/fastx_toolkit/ • FastQC • http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ NGS Data Preprocessing S

  29. RNA-Seq Work Flow: Study Design → Total RNA → Sequencing Reads (SE or PE) → Aligned Reads → Quantified isoform and gene expression KB

  30. Alignment 101 100bp Read ACATGCTGCGGA Chr 3 ACATGCTGCGGA Chr 2 Chr 1 KB

  31. The perfect read: 1 read = 1 unique alignment. 100bp Read ACATGCTGCGGA ✓ Chr 3 ACATGCTGCGGA Chr 2 Chr 1 KB

  32. Some reads will align equally well to multiple locations. “Multireads” 100bp Read ✗ ACATGCTGCGGA ACATGCTGCGGA ✓ ✗ ACATGCTGCGGA 1 read 3 valid alignments Only 1 alignment is correct ACATGCTGCGGA KB

  33. The worst case scenario. 100bp Read ACATGCTGCGGA ✗ ACATGCTGCGGA ✗ ACATGCTGCGGC 1 read 2 valid alignments Neither is correct ACATGCTGCGGA KB

  34. Individual genetic variation may affect read alignment. 100bp Read ✗ ATATGCTGCGGA ACATGCTGCGGA ACATGCTGCGGC ✗ ACATGCTGCGGA KB
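The unique-read, multiread, and mismatch cases in the preceding slides can be reproduced with a brute-force toy aligner. The sketch below slides the read along each reference sequence and reports every position within a mismatch limit. It is for illustration only; real aligners use indexed search and also handle gaps and splicing. The reference sequences are hypothetical, built around the slide's example read.

```python
def find_alignments(read, reference, max_mismatches=0):
    """Return (chromosome, start, mismatches) for every acceptable placement."""
    hits = []
    for chrom, seq in reference.items():
        for start in range(len(seq) - len(read) + 1):
            window = seq[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                hits.append((chrom, start, mismatches))
    return hits

# Hypothetical toy reference: chr1 and chr3 contain the read exactly,
# chr2 contains a copy with one differing base.
reference = {
    "chr1": "TTTACATGCTGCGGATTT",
    "chr2": "GGACATGCTGCGGCGG",
    "chr3": "CCCCACATGCTGCGGACC",
}
read = "ACATGCTGCGGA"

print(find_alignments(read, reference))                    # exact matches only
print(find_alignments(read, reference, max_mismatches=1))  # allow one mismatch
```

Even with exact matching this read is a multiread, and loosening the mismatch limit adds a further candidate location, which is why parameter choices change alignment results.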

  35. Aligning Billions of Short Sequence Reads Gene A Gene B Aligners: Bowtie, GSNAP, BWA, MAQ, BLAT Designed to align short reads quickly, but not always accurately KB

  36. Aligning Sequence Reads Gene A Gene B • Gene family (orthologous/paralogous) • Low-complexity sequence • Alternatively spliced isoforms • Pseudogenes • Polymorphisms • Indels • Structural Variants • Reference sequence Errors KB

  37. Align to Genome or Transcriptome? Genome Transcriptome KB

  38. Aligning to Reference Genome: Exon First Alignment Exon 1 Exon 2 Exon 3 KB

  39. Aligning to Reference Genome: Exon First Alignment Exon 1 Exon 2 Exon 3 KB

  40. Aligning to Reference Genome: Exon First Alignment Exon 1 Exon 2 Exon 3 KB

  41. Exon First Alignment: Pseudo-gene problem Exon 1 Exon 2 Exon 3 Processed Pseudo-gene Exon 2 Exon 3 Exon 1 KB

  42. Exon First Alignment: Pseudo-gene problem Exon 1 Exon 2 Exon 3 Processed Pseudo-gene Exon 2 Exon 3 Exon 1 KB

  43. Exon First Alignment: Pseudo-gene problem Exon 1 Exon 2 Exon 3 Processed Pseudo-gene Exon 2 Exon 3 Exon 1 KB

  44. Align to Genome or Transcriptome? Genome Advantages: Can align novel isoforms. Disadvantages: Difficult, Spurious alignments, spliced alignment, gene families, pseudo genes Transcriptome KB

  45. Reference Transcriptome Exon 1 Gene 1 Exon 2 Exon 3 Isoform 1 Isoform 2 Isoform 3 Exon 1 Gene 2 Exon 2 Exon 3 Exon 4 Isoform 1 Isoform 2 Isoform 3 KB

  46. Align to Genome or Transcriptome? Genome Advantages: Can align novel isoforms. Disadvantages: Difficult, Spurious alignments, spliced alignment, gene families, pseudo genes Transcriptome Advantages: Easy, Focused to the part of the genome that is known to be transcribed. Disadvantages: Reads that come from novel isoforms may not align at all or may be misattributed to a known isoform. KB

  47. Better Approach: Aligning to Transcriptome and Genome • Align to Transcriptome first • Align the remaining reads to Genome next • Advantages: relatively simpler, overcomes the pseudo-gene and novel isoform problems • Tools: RUM, TopHat2, STAR KB
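The two-stage idea on this slide (transcriptome first, then genome for whatever is left) can be sketched with exact substring search. This toy does not reflect how RUM, TopHat2, or STAR are implemented. The sequences and the helper names are hypothetical and chosen so that one read spans an exon junction, one lies outside the annotated transcript, and one aligns nowhere.

```python
def first_exact_hit(read, sequences):
    """Return (name, position) of the first exact match, or None."""
    for name, seq in sequences.items():
        position = seq.find(read)
        if position != -1:
            return name, position
    return None

def two_stage_align(reads, transcriptome, genome):
    """Align to the transcriptome first; send unaligned reads to the genome."""
    results = {}
    for read in reads:
        hit = first_exact_hit(read, transcriptome)
        if hit is not None:
            results[read] = ("transcriptome",) + hit
            continue
        hit = first_exact_hit(read, genome)
        if hit is not None:
            results[read] = ("genome",) + hit
        else:
            results[read] = ("unaligned", None, None)
    return results

# Hypothetical toy data: tx1 joins two exons that are separated by an
# intron (AAAAAA) on chr1.
genome = {"chr1": "ACGTACGTAAAAAATTGACCGGGTTTCCC"}
transcriptome = {"tx1": "ACGTACGTTTGACC"}
reads = ["CGTTTGAC",   # spans the exon junction: found only in tx1
         "GGGTTTCC",   # outside the annotated transcript: found in chr1
         "TTTTTTTT"]   # matches nothing

for read, result in two_stage_align(reads, transcriptome, genome).items():
    print(read, result)
```

The junction-spanning read aligns in the first stage without any spliced-alignment logic, while the unannotated read is still recovered from the genome in the second stage.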

  48. Conclusions • There is no perfect aligner. Pick one well-suited to your application. • E.g. Want to identify novel exons? Don’t align only to the known set of isoforms. • Visually inspect the resulting alignments. Setting a parameter a little too liberal or conservative can have a huge effect on alignment. • Consider running the same fastq files through multiple alignment pipelines specific to each application. • Gene expression -> Bowtie to transcriptome • Exon discovery -> RUM or other hybrid mapper • Variant detection -> GSNAP, GATK, Samtools mpileup • If your species has not been sequenced, use a de novo assembly method. Can also use the genome of a related species as a scaffold. KB

  49. Output of most aligners: BAM/SAM file of reads and genome positions S

  50. Visualization of alignment data (BAM/SAM) • Genome browsers – UCSC, IGV, etc.
