RNA-seq analysis Dr.Tech. Daniel Nicorici FIMM – Institute for Molecular Medicine Finland CSC - June 2, 2010
Outline • RNA sequencing overview • Finding fusion genes • Alternative splicing • Conclusions
RNA-seq • high-throughput sequencing technology for sequencing RNAs (in practice cDNAs, which carry the RNAs' content) • invaluable tool for the study of diseases like cancer • allows researchers to obtain information such as: • gene/transcript/exon expressions • alternative splicing • gene fusions • post-transcriptional mutations • single nucleotide variations • …
RNA-seq - cont’d • It greatly reduces the variability between experiments compared to other established measurement technologies like microarrays, exon arrays, etc. • Because the reads are short (cDNA is fragmented before sequencing), the bioinformatics analysis is challenging, e.g. • de novo assembly • alignment of sequenced reads • computation of gene/transcript/exon expressions
Reads in RNA-seq [Figure: a fragment with an adaptor at the 5’ end and an adaptor at the 3’ end; label: "This is sequenced (short reads)"] Fig. 1 – Adaptor and reads in RNA-seq
Reads in RNA-seq – cont’d [Figure: exons A, B, C and D shown on the chromosome and in the transcript, with reads marked "?"] Fig. 2 – Reads’ mappings at chromosome and transcript level
Why RNA-seq? • RNA-seq (~1000€/sample) provides: • exon/transcript expressions • gene expressions • alternative splicing events • SNPs • fusion genes • ... • Array technologies: • Exon array (alternative splicing) ~700€/sample • cDNA array ~600€/sample • SNPs array ~400€/sample • Exon array (fusion genes) ~700€/sample Fig. 3 – RNA-seq vs array technologies
General steps of RNA-seq analysis • Filtering of short reads • Aligning the reads against a reference • Computational analysis of the read alignments: • compute the gene/transcript/exon expressions • find new/known alternative splicing events • find new/known fusion genes • find new/known SNPs • Visualization
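The first step, filtering of short reads, could look roughly like the sketch below. It assumes that quality strings use the Illumina Phred+64 encoding and that a trailing run of "B" characters marks a low-quality read tail; both are assumptions, and the function name and thresholds are hypothetical, not taken from the slides.

```python
def trim_and_filter(sequence, quality, min_length=25, offset=64, min_mean_quality=20):
    """Trim a trailing run of 'B' quality characters, then filter the read.

    Returns (sequence, quality) for a read that passes, or None otherwise.
    The 'B' convention and the Phred+64 offset are assumptions.
    """
    # Drop the low-quality tail marked by trailing 'B' characters.
    trimmed_quality = quality.rstrip("B")
    trimmed_sequence = sequence[:len(trimmed_quality)]

    # Discard reads that became empty or too short after trimming.
    if not trimmed_sequence or len(trimmed_sequence) < min_length:
        return None

    # Discard reads whose mean base quality is too low.
    scores = [ord(char) - offset for char in trimmed_quality]
    if sum(scores) / len(scores) < min_mean_quality:
        return None

    return trimmed_sequence, trimmed_quality
```

Reads for which the function returns None would simply be left out of the alignment step.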
Examples of RNA-seq visualization Fig. 4 – Visualization using MapView
Examples of RNA-seq visualization – cont’d Fig. 5 – Coverage plot
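A coverage plot shows, for each position of a region, how many reads overlap it. A minimal sketch of that computation, assuming 0-based read start positions and a fixed read length (the function name and inputs are illustrative, and no normalization is applied because the slides do not say which one was used):

```python
def coverage(read_starts, read_length, region_length):
    """Per-base read depth over a region of the given length."""
    # Difference array: +1 where a read starts, -1 just past where it ends.
    diff = [0] * (region_length + 1)
    for start in read_starts:
        if 0 <= start < region_length:
            end = min(start + read_length, region_length)
            diff[start] += 1
            diff[end] -= 1

    # The running sum of the difference array is the depth at each position.
    depth = []
    running = 0
    for delta in diff[:region_length]:
        running += delta
        depth.append(running)
    return depth
```

For example, three reads of length 3 starting at positions 0, 2 and 2 in a region of length 6 give depths of 1, 1, 3, 2, 2, 0.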
Examples of RNA-seq visualization – cont’d [Figure: two coverage plots for gene ERBB2. Breast cancer: normalized coverage axis from 0.00 to 130.71. Normal breast: normalized coverage axis from 0.00 to 4.41.] Fig. 6 – Coverage plots visualization
Examples of RNA-seq visualization – cont’d Fig. 7 – Visualization of reads’ mappings using the UCSC browser
Examples of RNA-seq visualization – cont’d Fig. 8 – Visualization of coverages using UCSC browser
Examples of RNA-seq visualization – cont’d Fig. 9 – “Gel-like” visualization of coverages using UCSC browser
Examples of RNA-seq visualization – cont’d Fig. 10 – Histogram of distances between the paired-end reads
Examples of RNA-seq visualization – cont’d Fig. 11 – Visualization of candidate fusion genes
Finding fusion genes Steps: • Filter the reads (quality, B’s, etc.) • Align all reads to the genome • Align to the transcriptome all the reads which • map uniquely to the genome, or • do not map to the genome • Find the candidate fusion genes by looking for paired-end reads which map simultaneously to two different transcripts from two different genes • Find the fusion junction (e.g. by generating exon-exon combinations and finding which one the reads align to) • Filter the candidate fusion genes
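The candidate-finding step can be sketched as follows. This is only an illustration, assuming the alignments have already been reduced to one (transcript, transcript) pair per paired-end read; the function name, the `transcript_to_gene` lookup and the `min_pairs` threshold are hypothetical.

```python
from collections import Counter


def fusion_candidates(pair_mappings, transcript_to_gene, min_pairs=2):
    """Count read pairs whose two mates map to transcripts of different genes.

    pair_mappings: iterable of (mate1_transcript, mate2_transcript).
    transcript_to_gene: dict from transcript id to gene id.
    Returns {(gene_a, gene_b): supporting_pairs} with the gene pair sorted.
    """
    support = Counter()
    for mate1_transcript, mate2_transcript in pair_mappings:
        gene1 = transcript_to_gene.get(mate1_transcript)
        gene2 = transcript_to_gene.get(mate2_transcript)
        # Skip pairs with an unknown transcript or with both mates in one gene.
        if gene1 is None or gene2 is None or gene1 == gene2:
            continue
        # Sort so that (BCR, ABL1) and (ABL1, BCR) count as the same candidate.
        support[tuple(sorted((gene1, gene2)))] += 1

    # Keep only gene pairs supported by enough read pairs.
    return {genes: n for genes, n in support.items() if n >= min_pairs}
```

Sorting the gene pair discards the orientation of the fusion; a real analysis would probably keep it, since the junction search needs to know which gene is upstream.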
Finding fusion genes – cont’d • RNA-seq data for the leukemia K562 cell line [1] • Philadelphia chromosome with the known BCR-ABL fusion gene • ~15 000 candidate fusion genes found • ~85% of the candidate fusion genes are known paralogs or have no protein product!!! • 15 candidate fusion genes remain after additional filtering, and the known BCR-ABL is the number one candidate • Filtering of candidate fusion genes is highly necessary in order to reduce their large number (from tens of thousands to tens)!!!
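The two filters named above (known paralogs, no protein product) could be applied roughly as in this sketch. The paralog list and the set of protein-coding genes are assumed to come from an external annotation, and the function name and data layout are hypothetical; the additional filtering that brings the list down to 15 candidates is not described in the slides and is not reproduced here.

```python
def filter_candidates(candidates, paralog_pairs, protein_coding):
    """Drop candidate fusions between paralogs or involving non-coding genes.

    candidates: {(gene_a, gene_b): supporting_pairs}.
    paralog_pairs: set of frozenset({gene_a, gene_b}) for known paralogs.
    protein_coding: set of genes known to have a protein product.
    Returns a list of ((gene_a, gene_b), supporting_pairs), best supported first.
    """
    kept = []
    for (gene_a, gene_b), supporting_pairs in candidates.items():
        # Paralogs share sequence, so reads can map to both by mistake.
        if frozenset((gene_a, gene_b)) in paralog_pairs:
            continue
        # Both partners are required to have a protein product.
        if gene_a not in protein_coding or gene_b not in protein_coding:
            continue
        kept.append(((gene_a, gene_b), supporting_pairs))

    # Rank the surviving candidates by the number of supporting read pairs.
    return sorted(kept, key=lambda item: item[1], reverse=True)
```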
Alternative splicing • process by which a gene’s exons are joined in multiple ways during RNA splicing, forming different mRNAs • there is a large body of evidence linking alternative splicing to diseases like cancer • Shannon’s entropy from information theory has been used previously for finding the imbalance in transcript expression [2,3] • Jensen-Shannon divergence has been used for quantifying the relative changes in expression of transcripts [4] • MDL [5] can also be used for measuring the relative changes in expression of transcripts
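Both measures operate on the relative expression of a gene's transcripts. The sketch below uses the standard textbook definitions with base-2 logarithms; the function names are illustrative, and it is not claimed to match the exact variants used in [2,3,4].

```python
import math


def proportions(expressions):
    """Turn the transcript expressions of one gene into fractions summing to 1."""
    total = sum(expressions)
    if total <= 0:
        raise ValueError("the gene has no expressed transcript")
    return [value / total for value in expressions]


def shannon_entropy(p):
    """Entropy in bits: 0 if one transcript dominates, log2(n) if all n are equal."""
    return -sum(value * math.log2(value) for value in p if value > 0)


def jensen_shannon_divergence(p, q):
    """Divergence in bits between two samples: 0 if identical, at most 1."""
    mixture = [(a + b) / 2 for a, b in zip(p, q)]
    return shannon_entropy(mixture) - (shannon_entropy(p) + shannon_entropy(q)) / 2
```

A gene whose entropy is high in one sample and low in another, or whose divergence between two samples is large, is a candidate for a change in splicing.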
Alternative splicing – cont’d Steps: 1. Filter the reads (quality, B’s, etc.) 2. Align all reads to the genome 3. Align to the transcriptome all the reads which map uniquely to the genome or do not map to the genome 4. Compute (normalized) transcript expressions (e.g. RPKM) 5. Repeat steps 1-4 for all samples 6. Find relative changes/imbalances between the expressions of the transcripts of the same gene across the group of samples
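RPKM (reads per kilobase of transcript per million mapped reads) normalizes a read count by transcript length and sequencing depth. A minimal sketch, with illustrative argument names:

```python
def rpkm(read_count, transcript_length, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads.

    read_count: reads mapped to the transcript.
    transcript_length: transcript length in bases.
    total_mapped_reads: all mapped reads in the sample.
    """
    if transcript_length <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total read count must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (reads per million).
    return read_count * 1e9 / (transcript_length * total_mapped_reads)
```

For example, 500 reads on a 2000-base transcript in a sample with 10 million mapped reads give an RPKM of 25.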
Alternative splicing – cont’d Table 1 – Example of a gene with its five transcripts
Alternative splicing – cont’d • Computing the imbalance of transcript expression for the example from Table 1 using the MDL method [5]: • MDL’s advantage: the criterion for deciding between balanced and imbalanced is built in
Alternative splicing – cont’d • only the transcripts which are validated (e.g. there are reads which map only to the given transcript [3]) are used for finding the imbalances • for example, ~3500 alternatively spliced genes are found when comparing a prostate cancer control sample with a treated sample
Conclusions • RNA-seq data analysis: • is computationally intensive (compared to, for example, microarray analysis) • needs very good filtering criteria, based on biology and mathematics, in order to improve the quality of the results (i.e. a low number of false positives) • has no single established way of doing it • relies on many tools, e.g. aligners, samtools, etc., which are still work in progress • Visualization: • has multiple facets, i.e. read coverage, fusion genes, etc. • depends on the user profile: • biologist/medical doctor • bioinformatician
References [1] Berger M. et al., Integrative analysis of the melanoma transcriptome, Genome Research, Feb. 2010. [2] Ritchie W. et al., Entropy measures quantify global splicing disorders in cancer, PLOS Computational Biology, vol. 4, March 2008. [3] Gan Q. et al., Dynamic regulation of alternative splicing and chromatin structure in Drosophila gonads revealed by RNA-seq, Cell Research, May 2010. [4] Trapnell C. et al., Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation, Nature Biotechnology, vol. 28, May 2010. [5] Grünwald P., Minimum description length principle tutorial, in Advances in Minimum Description Length: Theory and Applications, P. Grünwald, I.J. Myung, and M. Pitt, Eds., pp. 22-79, MIT Press, Cambridge, 2005.
Acknowledgements • Olli Kallioniemi • Janna Saarela • Henrik Edgren • Astrid Murumägi • Sara Kangaspeska • Pekka Ellonen