NGS Transcriptomic Workflows Hugh Shanahan & Jamie al-Nasir Royal Holloway, University of London


Setting the scene
• Transcriptome: the total sequence and abundance of RNA generated by a cell
• RNA is transcribed from DNA
• Genome is fixed for an organism
• Transcriptome is dynamic
  • Variation between tissues
  • Variation over time
• RNA transcripts are roughly 1,000 to 10,000 bases in length

Interested in
• How many copies of a particular transcript are there?
• What is its sequence?
  • The sequence comes from the genome, but alternative splicing means a transcript may not be a single contiguous block of DNA
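As a rough illustration of the "how many copies" question, the sketch below counts reads per transcript once each read has been assigned to one. The transcript IDs (`tx1`, `tx2`, `tx3`) and the assignment list are hypothetical, and real quantification would also normalise for things like transcript length, which this sketch ignores.

```python
from collections import Counter

# Hypothetical input: the transcript each sequenced read was assigned to.
assignments = ["tx1", "tx2", "tx1", "tx3", "tx1", "tx2"]

# Raw abundance: number of reads assigned to each transcript.
counts = Counter(assignments)

# Relative abundance: fraction of all assigned reads per transcript.
total = sum(counts.values())
fractions = {tx: n / total for tx, n in counts.items()}

for tx in sorted(counts):
    print(tx, counts[tx], round(fractions[tx], 3))
```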

Size of transcriptome will vary between species

Sequencing steps
• Fragment transcripts into shorter pieces (reads)
  • 100-300 bases long
  • Have many overlapping reads
• Amplify (make lots of copies of) the short reads
• Can sequence these short reads and then assemble them to reconstruct transcripts
• Size of data set depends on size of transcriptome but also on amount of fragmentation (sequencing depth)
• Can either assemble with a reference genome or de novo (very hard)
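The relationship between read length, number of reads and sequencing depth can be sketched with a toy simulation. This is a minimal sketch, not a model of real library preparation: the transcript is random, reads start at uniformly random positions, and the function names (`fragment`, `mean_depth`) and all parameter values are made up for illustration.

```python
import random

def fragment(transcript, n_reads, read_len, seed=0):
    """Sample n_reads substrings of length read_len at random start positions."""
    rng = random.Random(seed)
    last_start = len(transcript) - read_len
    starts = [rng.randint(0, last_start) for _ in range(n_reads)]
    return [(s, transcript[s:s + read_len]) for s in starts]

def mean_depth(reads, transcript_len):
    """Average number of reads covering each base of the transcript."""
    total_bases = sum(len(seq) for _, seq in reads)
    return total_bases / transcript_len

# Toy transcript of 2,000 random bases (assumed length, for illustration).
rng = random.Random(1)
transcript = "".join(rng.choice("ACGT") for _ in range(2000))

# 200 overlapping reads of 100 bases each.
reads = fragment(transcript, n_reads=200, read_len=100)
depth = mean_depth(reads, len(transcript))
print(depth)  # 200 reads * 100 bases / 2000 bases = 10.0
```

Doubling `n_reads` doubles both the mean depth and the volume of read data, which is why data set size depends on depth as well as on transcriptome size.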

NGS Workflow

Final points
• File formats have been updated to binary; they used to be flat text, so sizes were huge (Reference Genome: 39 Gbyte -> 0.8 Gbyte)
• Raw image data is actually discarded
• Discussion focuses on assembly and downstream analysis
• Much of this data is deposited in the Sequence Read Archive (SRA)
• We've papered over everything that happens before sequencing, i.e. the biochemical steps carried out
  • This is highly variable
  • These steps are not properly annotated
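To give a feel for why binary formats are smaller than flat text, here is a minimal sketch that packs four bases into one byte (2 bits per base instead of 8). It is an illustration only: the slides do not say which formats are meant, the function names are hypothetical, and the sketch assumes the sequence contains only A, C, G and T. Packing alone gives at most a 4x saving, so the larger reduction quoted above presumably involves compression as well.

```python
def pack_2bit(seq):
    """Pack a string of A/C/G/T into bytes, four bases per byte."""
    codes = {"A": 0, "C": 1, "G": 2, "T": 3}
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | codes[base]
        # Left-align a short final chunk so unpacking reads it first.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)

def unpack_2bit(data, n_bases):
    """Recover the first n_bases bases from packed bytes."""
    bases = "ACGT"
    seq = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            seq.append(bases[(byte >> shift) & 3])
    return "".join(seq[:n_bases])

seq = "ACGTACGTAC"
packed = pack_2bit(seq)
print(len(seq), "bytes as text ->", len(packed), "bytes packed")
restored = unpack_2bit(packed, len(seq))
```

The number of bases has to be stored alongside the packed bytes, because the padding in the last byte is otherwise indistinguishable from real `A` bases.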
