
RNA sequencing, transcriptome and expression quantification

Henrik Lantz, BILS/SciLifeLab. Lecture synopsis: What is RNA-seq? Basic concepts. Mapping-based transcriptomics (genome-based). De novo based transcriptomics (genome-free). Expression counts and differential expression.

Presentation Transcript


  1. RNA sequencing, transcriptome and expression quantification Henrik Lantz, BILS/SciLifeLab

  2. Lecture synopsis
      • What is RNA-seq?
      • Basic concepts
      • Mapping-based transcriptomics (genome-based)
      • De novo based transcriptomics (genome-free)
      • Expression counts and differential expression
      • Transcript annotation

  3. RNA-seq
      [Figure: from gene to mRNA]
      DNA: exons and introns flanked by UTRs; introns start with GT and end with AG; start codon ATG; stop codon TAG, TAA or TGA
      Transcription
      Pre-mRNA: UTRs, start and stop codons, poly-A tail
      Splicing
      mRNA: UTRs, start and stop codons, poly-A tail
      Translation

  4. Overview of RNA-Seq
      Reconstruct original full-length transcripts
      From: http://www2.fml.tuebingen.mpg.de/raetsch/members/research/transcriptomics.html

  5. Common Data Formats for RNA-Seq
      FASTA format:
      >61DFRAAXX100204:1:100:10494:3070/1
      AAACAACAGGGCACATTGTCACTCTTGTATTTGAAAAACACTTTCCGGCCAT
      FASTQ format:
      @61DFRAAXX100204:1:100:10494:3070/1
      AAACAACAGGGCACATTGTCACTCTTGTATTTGAAAAACACTTTCCGGCCAT
      +
      ACCCCCCCCCCCCCCCCCCCCCCCCCCCCCBC?CCCCCCCCC@@CACCCCCA
      Quality values in increasing order:
      !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
      You might get the data in .sff or .bam format. FASTQ reads are easy to extract from both of these binary (compressed) formats!
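      The four-line FASTQ layout above can be read with a few lines of Python. This is only a minimal sketch: the names `parse_fastq` and `phred_scores` are made up for illustration, the whole input is held in memory, and the default quality offset of 33 is an assumption (older Illumina data may use 64, so check how your reads were encoded). The demo record is synthetic, not taken from the slide.

```python
from collections import namedtuple

FastqRecord = namedtuple("FastqRecord", ["name", "sequence", "quality"])


def parse_fastq(lines):
    """Return a list of FastqRecord from an iterable of FASTQ text lines."""
    cleaned = [line.rstrip("\n") for line in lines if line.strip()]
    if len(cleaned) % 4 != 0:
        raise ValueError("FASTQ input must contain 4 lines per record")
    records = []
    for i in range(0, len(cleaned), 4):
        header, sequence, separator, quality = cleaned[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record near %r" % header)
        if len(sequence) != len(quality):
            raise ValueError("sequence/quality length mismatch in %r" % header)
        records.append(FastqRecord(header[1:], sequence, quality))
    return records


def phred_scores(quality, offset=33):
    """Convert a quality string to integer scores (offset is an assumption)."""
    return [ord(char) - offset for char in quality]


# Synthetic two-record example (hypothetical read names).
demo = "@read1/1\nACGT\n+\nIIII\n@read2/1\nGGCA\n+\n!!II\n".splitlines()
for record in parse_fastq(demo):
    print(record.name, record.sequence, phred_scores(record.quality))
```

      Because the characters are listed "in increasing order", a higher character code always means a higher quality score; only the offset differs between encodings.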

  6. Paired-End

  7. Insert size
      [Figure: a DNA fragment with adapter+primer at each end; Read 1 and Read 2 are sequenced inwards from opposite ends. Labels: insert size, inner mate distance.]
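      The relation between the labels on this slide can be written as simple arithmetic. The sketch below assumes that "insert size" means the full sequenced fragment between the adapters, including both reads; some tools and papers use the term differently, so treat the definition (and the function name `inner_mate_distance`) as an assumption.

```python
def inner_mate_distance(insert_size, read1_length, read2_length):
    """Unsequenced gap between the two mates of a pair.

    Assumes insert_size covers the whole fragment, both reads included.
    A negative result means the two reads overlap.
    """
    return insert_size - read1_length - read2_length


# A 300 bp fragment sequenced with 2 x 50 bp reads leaves a 200 bp gap.
print(inner_mate_distance(300, 50, 50))
```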

  8. Paired-end gives you two files
      FASTQ format, read 1:
      @61DFRAAXX100204:1:100:10494:3070/1
      AAACAACAGGGCACATTGTCACTCTTGTATTTGAAAAACACTTTCCGGCCAT
      +
      ACCCCCCCCCCCCCCCCCCCCCCCCCCCCCBC?CCCCCCCCC@@CACCCCCA
      FASTQ format, read 2:
      @61DFRAAXX100204:1:100:10494:3070/2
      ATCCAAGTTAAAACAGAGGCCTGTGACAGACTCTTGGCCCATCGTGTTGATA
      +
      _^_a^cccegcgghhgZc`ghhc^egggd^_[d]defcdfd^Z^OXWaQ^ad

  9. Transcript Reconstruction from RNA-Seq Reads Nature Biotech, 2010

  10. Transcript Reconstruction from RNA-Seq Reads MAPPING TopHat

  11. Transcript Reconstruction from RNA-Seq Reads MAPPING TopHat Cufflinks

  12. Transcript Reconstruction from RNA-Seq Reads MAPPING Trinity The Tuxedo Suite: End-to-end Genome-based RNA-Seq Analysis Software Package TopHat GMAP Cufflinks

  13. Transcript Reconstruction from RNA-Seq Reads MAPPING Trinity TopHat Cufflinks

  14. Transcript Reconstruction from RNA-Seq Reads MAPPING Trinity TopHat GMAP Cufflinks

  15. Transcript Reconstruction from RNA-Seq Reads End-to-end Transcriptome-based RNA-Seq Analysis Software Package Trinity GMAP

  16. Basic concepts of mapping-based RNA-seq - Coverage
      [Figure: overlapping reads mapped to a reference genome; one region has 5x coverage, another 2x coverage]
      Coverage = number of reads at a certain position
      Higher coverage in RNA-seq => higher chance of sequencing low-abundance transcripts
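      The definition "coverage = number of reads at a certain position" can be sketched directly. The example below assumes each mapped read is given as a 0-based, half-open (start, end) interval on the reference and ignores gaps inside a read (such as introns in spliced reads); the function name and the toy intervals are made up for illustration.

```python
def per_base_coverage(read_intervals, reference_length):
    """Count how many reads cover each reference position.

    read_intervals: iterable of (start, end), 0-based, end exclusive.
    """
    # Record +1 where a read starts and -1 just past where it ends,
    # then accumulate a running sum along the reference.
    changes = [0] * (reference_length + 1)
    for start, end in read_intervals:
        start = max(start, 0)
        end = min(end, reference_length)
        if start < end:
            changes[start] += 1
            changes[end] -= 1
    coverage = []
    depth = 0
    for delta in changes[:reference_length]:
        depth += delta
        coverage.append(depth)
    return coverage


# Three overlapping toy reads on a 10 bp reference.
print(per_base_coverage([(0, 4), (2, 6), (3, 8)], 10))
# -> [1, 1, 2, 3, 2, 2, 1, 1, 0, 0]
```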

  17. Basic concepts of mapping-based RNA-seq - Spliced reads
      [Figure: from gene to mRNA]
      DNA: exons and introns flanked by UTRs; introns start with GT and end with AG; start codon ATG; stop codon TAG, TAA or TGA
      Transcription
      Pre-mRNA: UTRs, start and stop codons, poly-A tail
      Splicing
      mRNA: UTRs, start and stop codons, poly-A tail
      Translation

  18. RNA-seq - Spliced reads

  19. Pre-mRNA
      [Figure: from gene to mRNA]
      DNA: exons and introns flanked by UTRs; GT marked at the start of each intron; start codon ATG; stop codon TAG, TAA or TGA
      Transcription
      Pre-mRNA: UTRs, start and stop codons
      Splicing
      mRNA: UTRs, start and stop codons
      Translation

  20. Pre-mRNA

  21. Pre-mRNA

  22. Stranded RNA-seq

  23. Overview of the Tuxedo Software Suite
      • Bowtie (fast short-read alignment)
      • TopHat (spliced short-read alignment)
      • Cufflinks (transcript reconstruction from alignments)
      • Cuffdiff (differential expression analysis)
      • CummeRbund (visualization & analysis)

  24. Slide courtesy of Cole Trapnell

  25. TopHat-mapped reads

  26. Alignments are reported in a compact representation: SAM format
      Example record, one field per line (field index, then value):
       0  61G9EAAXX100520:5:100:10095:16477
       1  83
       2  chr1
       3  51986
       4  38
       5  46M
       6  =
       7  51789
       8  -264
       9  CCCAAACAAGCCGAACTAGCTGATTTGGCTCGTAAAGACCCGGAAA
      10  ###CB?=ADDBCBCDEEFFDEFFFDEFFGDBEFGEDGCFGFGGGGG
      11  MD:Z:67
      12  NH:i:1
      13  HI:i:1
      14  NM:i:0
      15  SM:i:38
      16  XQ:i:40
      17  X2:i:0
      SAM format specification: http://samtools.sourceforge.net/SAM1.pdf

  27. Alignments are reported in a compact representation: SAM format
      Example record, one field per line (field index, value, annotation):
       0  61G9EAAXX100520:5:100:10095:16477  (read name)
       1  83  (FLAGS stored as bit fields; 83 = 00001010011)
       2  chr1  (alignment target)
       3  51986  (position alignment starts)
       4  38
       5  46M  (compact description of the alignment in CIGAR format)
       6  =
       7  51789
       8  -264
       9  CCCAAACAAGCCGAACTAGCTGATTTGGCTCGTAAAGACCCGGAAA  (read sequence, oriented according to the forward alignment)
      10  ###CB?=ADDBCBCDEEFFDEFFFDEFFGDBEFGEDGCFGFGGGGG  (base quality values)
      11-17  MD:Z:67 NH:i:1 HI:i:1 NM:i:0 SM:i:38 XQ:i:40 X2:i:0  (metadata)
      SAM format specification: http://samtools.sourceforge.net/SAM1.pdf
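      The note that FLAGS are "stored as bit fields" can be made concrete by decoding the value 83 from the example record. The bit values below follow the SAM specification; the short descriptions are paraphrased, and the names `SAM_FLAG_BITS` and `decode_flag` are made up for this sketch.

```python
# Bit values as defined in the SAM specification; wording is paraphrased.
SAM_FLAG_BITS = {
    0x1: "read paired",
    0x2: "read mapped in proper pair",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read on reverse strand",
    0x20: "mate on reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "secondary alignment",
    0x200: "fails quality checks",
    0x400: "PCR or optical duplicate",
    0x800: "supplementary alignment",
}


def decode_flag(flag):
    """List the properties whose bits are set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAG_BITS.items() if flag & bit]


# 83 = 64 + 16 + 2 + 1
for description in decode_flag(83):
    print(description)
```

      For the example record this reports a paired read, mapped in a proper pair, on the reverse strand, and first in its pair. The reverse-strand bit is why the stored read sequence is "oriented according to the forward alignment" rather than as it came off the sequencer.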

  28. Alignments are reported in a compact representation: SAM format
      Example record, one field per line (field index, value, annotation):
       0  61G9EAAXX100520:5:100:10095:16477  (read name)
       1  83  (FLAGS stored as bit fields; 83 = 00001010011)
       2  chr1  (alignment target)
       3  51986  (position alignment starts)
       4  38
       5  46M  (compact description of the alignment in CIGAR format)
       6  =
       7  51789
       8  -264
       9  CCCAAACAAGCCGAACTAGCTGATTTGGCTCGTAAAGACCCGGAAA  (read sequence, oriented according to the forward alignment)
      10  ###CB?=ADDBCBCDEEFFDEFFFDEFFGDBEFGEDGCFGFGGGGG  (base quality values)
      11-17  MD:Z:67 NH:i:1 HI:i:1 NM:i:0 SM:i:38 XQ:i:40 X2:i:0  (metadata)
      Still not compact enough… Millions to billions of reads take up a lot of space!! Convert SAM to binary: BAM format.
      SAM format specification: http://samtools.sourceforge.net/SAM1.pdf
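      The CIGAR field (46M in the example record) can be unpacked into its operations. The sketch below uses the operation letters from the SAM specification; the helper names are made up, and the spliced CIGAR string in the demo is a hypothetical example, not taken from the slides. The N operation (skipped reference region) is what a spliced aligner uses to span an intron.

```python
import re

CIGAR_OPERATION = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    if cigar == "*":  # SAM uses '*' when no alignment is available
        return []
    operations = [(int(n), op) for n, op in CIGAR_OPERATION.findall(cigar)]
    if "".join("%d%s" % pair for pair in operations) != cigar:
        raise ValueError("invalid CIGAR string: %r" % cigar)
    return operations


def reference_span(operations):
    """Number of reference bases the alignment stretches over."""
    return sum(n for n, op in operations if op in "MDN=X")


def read_length(operations):
    """Number of read bases described by the CIGAR string."""
    return sum(n for n, op in operations if op in "MIS=X")


unspliced = parse_cigar("46M")         # CIGAR from the example record
spliced = parse_cigar("20M1000N26M")   # hypothetical read across an intron
print(unspliced, reference_span(unspliced), read_length(unspliced))
print(spliced, reference_span(spliced), read_length(spliced))
```

      Both reads are 46 bases long, but the hypothetical spliced read stretches over 1046 reference bases because of the 1000-base skipped region.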

  29. Samtools
      Tools for:
      • converting SAM <-> BAM
      • viewing BAM files (e.g. samtools view file.bam | less)
      • sorting BAM files, and lots more

  30. Visualizing Alignments of RNA-Seq reads

  31. Text-based Alignment Viewer
      % samtools tview alignments.bam target.fasta

  32. IGV

  33. IGV: Viewing Tophat Alignments

  34. Transcript Reconstruction Using Cufflinks From Martin & Wang, Nature Reviews Genetics, 2011

  35. Transcript Reconstruction Using Cufflinks From Martin & Wang, Nature Reviews Genetics, 2011

  36. Transcript Reconstruction Using Cufflinks From Martin & Wang, Nature Reviews Genetics, 2011

  37. GFF file format

  38. GFF3 file format

  39. GTF file format

  40. Transcript Reconstruction from RNA-Seq Reads Trinity The Tuxedo Suite: End-to-end Genome-based RNA-Seq Analysis Software Package TopHat GMAP Cufflinks

  41. Transcript Reconstruction from RNA-Seq Reads End-to-end Transcriptome-based RNA-Seq Analysis Software Package Trinity GMAP

  42. De novo transcriptome assembly
      No genome required
      Empower studies of non-model organisms:
      • expressed gene content
      • transcript abundance
      • differential expression

  43. The General Approach to De novo RNA-Seq Assembly Using De Bruijn Graphs

  44. Sequence Assembly via De Bruijn Graphs From Martin & Wang, Nat. Rev. Genet. 2011

  45. From Martin & Wang, Nat. Rev. Genet. 2011

  46. From Martin & Wang, Nat. Rev. Genet. 2011
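      The idea behind de Bruijn graph assembly can be sketched in a few lines: every k-mer in the reads becomes an edge from its first k-1 bases to its last k-1 bases, and contigs are read off by walking paths through the graph. This is a toy illustration only, with made-up function names and reads; it ignores sequencing errors, reverse complements and coverage, all of which real assemblers have to handle.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unbranched(graph, start):
    """Extend a contig from start while there is exactly one way forward."""
    contig = start
    node = start
    visited = {start}
    while len(graph.get(node, ())) == 1:
        (next_node,) = graph[node]
        if next_node in visited:  # stop instead of looping around a cycle
            break
        contig += next_node[-1]
        visited.add(next_node)
        node = next_node
    return contig


# Two toy reads that overlap by three bases.
toy_graph = build_de_bruijn(["ACGTC", "GTCAG"], k=4)
print(walk_unbranched(toy_graph, "ACG"))  # -> ACGTCAG
```

      In a transcriptome, a node with more than one outgoing edge may reflect alternative splicing rather than an error, which is one reason transcriptome assemblers treat branches differently from genome assemblers.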

  47. Contrasting Genome and Transcriptome Assembly
      Genome Assembly
      • Uniform coverage
      • Single contig per locus
      • Double-stranded
      Transcriptome Assembly
      • Exponentially distributed coverage levels
      • Multiple contigs per locus (alt splicing)
      • Strand-specific

  48. Trinity Aggregates Isolated Transcript Graphs
      Genome Assembly: single massive graph. Entire chromosomes represented.
      Trinity Transcriptome Assembly: many thousands of small graphs. Ideally, one graph per expressed gene.

  49. Trinity – How it works
      RNA-Seq reads
      Linear contigs
      de Bruijn graphs (thousands of disjoint graphs)
      Transcripts + isoforms

  50. Trinity output: A multi-fasta file
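      A multi-FASTA file is simply many FASTA records back to back, so it can be loaded with a short reader. The sketch below keeps only the first word of each header line as the identifier; the function name and the record names in the demo are hypothetical and do not reflect Trinity's actual header format.

```python
def read_multi_fasta(lines):
    """Return {identifier: sequence} from an iterable of FASTA text lines."""
    parts = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            words = line[1:].split()
            if not words:
                raise ValueError("FASTA header without an identifier")
            current = words[0]
            if current in parts:
                raise ValueError("duplicate identifier: %r" % current)
            parts[current] = []
        elif current is None:
            raise ValueError("sequence data found before the first header")
        else:
            parts[current].append(line)
    return {name: "".join(chunks) for name, chunks in parts.items()}


# Hypothetical two-transcript file; sequences may wrap over several lines.
demo = """>transcript_1 some description
ACGTACGT
ACGT
>transcript_2
GGGCCC
""".splitlines()

for name, sequence in read_multi_fasta(demo).items():
    print(name, len(sequence))
```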
