RNA-Seq datasets
Dan Lawson
New buzz word (old data)
• In the beginning there were ESTs...
• and then there was Roche 454...
• and then Solexa/Illumina.
• Why do we generate data sets?
• Who is producing data sets?
• Where do we obtain these?
• What can we use them for?
• How do we organise these?
VectorBase 2012
Why do we produce RNA-Seq data sets?
• Access to the transcriptome of an organism (speed v cost)
• Technical issues with the genome of that species (size, repeat content)
• Quantification of gene expression levels (absolute & relative)
• Analysis of these data sets both requires and can deliver improvements to the quality of the predicted gene structures
Who is producing RNA-Seq data sets?
• Almost all de novo genome sequencing projects, in order to produce a substrate for gene prediction
• Large studies (such as the Vosshall and Krzywinski DBPs)
• Small studies (such as Zwiebel chemosensors)
• XXXXX[orgn] AND study_type_transcriptome_analysis[prop]
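The query on the slide above uses Entrez-style field tags ([orgn], [prop]). A minimal sketch of how it could be filled in and turned into a search URL follows. It assumes the query is meant for the NCBI SRA database via the E-utilities esearch endpoint, which the slide does not state; the function names and the example organism are illustrative only, and XXXXX remains a placeholder for the species of interest.

```python
from urllib.parse import urlencode

# Assumption: NCBI E-utilities esearch endpoint, searching the SRA database.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def transcriptome_query(organism: str) -> str:
    """Fill the organism slot (XXXXX on the slide) of the query template."""
    return f"{organism}[orgn] AND study_type_transcriptome_analysis[prop]"


def esearch_url(organism: str, retmax: int = 20) -> str:
    """Build (but do not fetch) an esearch URL for the filled-in query."""
    params = {
        "db": "sra",  # assumed database
        "term": transcriptome_query(organism),
        "retmax": retmax,
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"
```

For example, esearch_url("Aedes aegypti") gives a URL that can be pasted into a browser; the sketch deliberately makes no network call itself.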
RNA-Seq data sets in VectorBase
• We do not want to be the archival database for these data sets (as they are large and will be very common)
• We do want to identify important sets and present some level of processed/analysed data
• All sets require some level of QC/filtering
• All sets require alignment back to a reference genome
• Default aligner has been bowtie (but we know this is sub-optimal)
• Other aligners used include inchworm, gsnap, bwa
• Output is a BAM file
• Use SAMtools to index the BAM files (so that Ensembl tools can use these sets; the tools are a viewer and a slicer)
• {To Do} Move indexed BAM files to the FTP site
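The align, convert and index steps listed above can be sketched as a small Python wrapper around the command-line tools. This is an illustration under stated assumptions, not the VectorBase pipeline: it assumes single-end reads, bowtie (version 1) with SAM output via -S, and samtools 1.x syntax for sort and index. The file names and function names are hypothetical, and QC/filtering is left out.

```python
import subprocess
from pathlib import Path


def build_commands(index_prefix, reads_fastq, out_dir):
    """Return the command lines for align -> sorted BAM -> BAM index.

    Assumptions: bowtie 1 ("bowtie -S <index> <reads> <out.sam>") and
    samtools 1.x ("sort -o", "index"). Output file names are hypothetical.
    """
    out_dir = Path(out_dir)
    sam = out_dir / "aligned.sam"
    bam = out_dir / "aligned.sorted.bam"
    return [
        # Align reads to the reference index, writing SAM.
        ["bowtie", "-S", str(index_prefix), str(reads_fastq), str(sam)],
        # Coordinate-sort and write BAM (indexing needs a sorted BAM).
        ["samtools", "sort", "-o", str(bam), str(sam)],
        # Create the .bai index so viewers can fetch regions quickly.
        ["samtools", "index", str(bam)],
    ]


def run_pipeline(index_prefix, reads_fastq, out_dir):
    """Run each step in order, stopping at the first failure."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for cmd in build_commands(index_prefix, reads_fastq, out_dir):
        subprocess.run(cmd, check=True)
```

Keeping command construction separate from execution makes it easy to swap in another aligner (for example gsnap or bwa) without touching the indexing step.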
Using RNA-Seq data: Gene prediction
• Aligned RNA-Seq data sets provide
  • Coverage plots which can be processed to transfrags
  • Exon-intron junction data
• Use in automated annotation (MAKER)
  • Requires assembly/clustering for performance reasons
  • Useful for providing training data for ab initio predictors
  • Transfrags should be used with caution in early rounds of MAKER
• Use in manual annotation (Apollo/Artemis)
  • Identification of novel predictions, exons
  • Confirmation/correction of intron junction data
  • Manual inclusion of untranslated regions (UTRs)
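The two kinds of evidence named above (transfrags from coverage, and exon-intron junctions) can be illustrated with a short sketch. It assumes per-base coverage is already available as a list of depths and that spliced alignments mark introns with the N operation in the SAM CIGAR string. The thresholds and function names are hypothetical, and real transfrag building would normally also use strand and junction information.

```python
import re


def coverage_to_transfrags(coverage, min_depth=1, min_length=1):
    """Return 0-based half-open (start, end) runs where depth >= min_depth."""
    frags = []
    start = None
    for pos, depth in enumerate(coverage):
        if depth >= min_depth:
            if start is None:
                start = pos  # a covered run begins here
        elif start is not None:
            if pos - start >= min_length:
                frags.append((start, pos))
            start = None
    # Close a run that extends to the end of the sequence.
    if start is not None and len(coverage) - start >= min_length:
        frags.append((start, len(coverage)))
    return frags


CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def junctions_from_cigar(pos, cigar):
    """Return 0-based half-open intron intervals implied by N operations.

    pos is the 0-based leftmost reference position of the alignment.
    """
    introns = []
    ref = pos
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op == "N":
            introns.append((ref, ref + n))
        if op in "MDN=X":  # operations that consume reference bases
            ref += n
    return introns
```

Raising min_depth or min_length trades sensitivity for fewer spurious fragments, which is one way to act on the caution about using transfrags in early rounds of MAKER.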
Using RNA-Seq data: Gene expression
• Use the abundance of reads in an RNA-Seq experiment to assay the level of expression for a locus
• Requires:
  • Aligned RNA-Seq data sets (BAM)
  • Annotation sets (GFF/GTF)
• Processed to give FPKM/RPKM values for expression levels
• Storage of these data in BASE2/GDAV (as discussed by Bob yesterday)
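As a worked illustration of the FPKM/RPKM values mentioned above: RPKM is the read count for a feature, divided by the feature length in kilobases and by the total number of mapped reads in millions. FPKM uses the same formula but counts fragments (read pairs) rather than individual reads. The sketch below assumes counts per gene have already been obtained from the BAM and GFF/GTF files; the function names and gene identifiers are hypothetical.

```python
def rpkm(read_count, feature_length_bp, total_mapped_reads):
    """Reads per kilobase of feature per million mapped reads.

    Passing fragment counts instead of read counts gives FPKM.
    """
    if feature_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total mapped reads must be positive")
    # count / (length / 1e3) / (total / 1e6) == count * 1e9 / (length * total)
    return read_count * 1e9 / (feature_length_bp * total_mapped_reads)


def rpkm_table(counts, lengths, total_mapped_reads):
    """Compute RPKM for every gene that has both a count and a length."""
    return {
        gene: rpkm(counts[gene], lengths[gene], total_mapped_reads)
        for gene in counts
        if gene in lengths
    }
```

For example, 500 reads on a 2,000 bp gene in a library of 10 million mapped reads gives an RPKM of 25.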
RNA-Seq visualization of coverage
• BAM viewer (VectorBase)
  • Good for a single lane (or a small number of lanes)
  • Flexible, user chooses which experiments to visualize
  • Becomes slow and unwieldy with a medium-large number of lanes
• Multiple experiments (FlyBase)
  • Good for multiple experiments
  • Pre-defined set of experiments
  • Fast response time
RNA-Seq questions #1
• Given limited space/speed
  • What are the key experiments we can support?
  • Criteria for defining these?
  • Pre/post publication data sets?
  • Shelf life for an RNA-Seq experiment?
• How do we aggregate across different experiments?
  • Coverage/Junctions
  • By species, developmental stage, body part, condition
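One possible shape for the aggregation question above is to sum junction read counts across experiments that share a metadata value. This is only a sketch of the idea, not a proposed design: the record layout (a dict per experiment with metadata fields and a "junctions" mapping) and all names are hypothetical.

```python
from collections import defaultdict


def aggregate_junctions(experiments, key):
    """Sum junction read counts across experiments sharing exp[key].

    Each experiment is a dict with metadata fields (for example "species",
    "stage", "body_part", "condition") and a "junctions" mapping of
    (seq_region, start, end) -> supporting read count.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for exp in experiments:
        group = exp[key]
        for junction, count in exp["junctions"].items():
            totals[group][junction] += count
    # Convert to plain dicts for easier inspection and serialisation.
    return {group: dict(juncs) for group, juncs in totals.items()}
```

The same grouping could be applied to per-base coverage; choosing which metadata field to group by is exactly the open question the slide raises.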