Estimation of alternative splicing isoform frequencies from RNA-Seq data • Marius Nicolae • Computer Science and Engineering Department, University of Connecticut • Joint work with Serghei Mangul, Ion Mandoiu and Alex Zelikovsky
Outline • Introduction • EM Algorithm • Results • Conclusions and future work
RNA-Seq • Make cDNA & shatter into fragments • Sequence fragment ends • Map reads • Downstream analyses: Gene Expression (GE), Isoform Expression (IE), Isoform Discovery (ID) • [Figure: exon diagram with labels A, B, C, D, E]
Gene Expression Challenges • Read ambiguity (multireads) • What is the gene length? • [Figure: exon diagram with labels A, B, C, D, E]
Previous approaches to GE • Ignore multireads • [Mortazavi et al. 08]: fractionally allocate multireads based on unique read estimates • [Pasaniuc et al. 10]: EM algorithm for solving ambiguities • Gene length: sum of lengths of exons that appear in at least one isoform; this underestimates expression levels for genes with 2 or more isoforms [Trapnell et al. 10]
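The gene length convention mentioned above (sum of the lengths of exons that appear in at least one isoform) can be sketched as an interval union. This is only an illustration: the function name `union_exon_length`, the half-open `(start, end)` coordinates, and the exon positions are all hypothetical and do not come from the talk.

```python
def union_exon_length(isoforms):
    """Total length covered by at least one exon of at least one isoform.

    isoforms: list of isoforms, each a list of half-open (start, end) exon
    intervals. The representation is an assumption made for this sketch.
    """
    intervals = sorted(iv for iso in isoforms for iv in iso)
    total = 0
    cur_start = cur_end = None
    for start, end in intervals:
        if cur_end is None or start > cur_end:
            # Disjoint from the current run: close it and start a new one.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent: extend the current run.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


# Hypothetical gene with exons A=[0,100), B=[200,260), C=[300,450).
iso_abc = [(0, 100), (200, 260), (300, 450)]
iso_ac = [(0, 100), (300, 450)]
gene_len = union_exon_length([iso_abc, iso_ac])  # 100 + 60 + 150 = 310
```

In this toy gene the union length (310) exceeds the length of the shorter isoform A C (250), so normalizing its reads by the union length would give a lower value than normalizing by its own length. This is consistent with the underestimation concern attributed to [Trapnell et al. 10].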
Read Ambiguity in IE • [Figure: exon diagrams with labels A, B, C, D, E and A, C]
Previous approaches to IE • [Jiang & Wong 09]: Poisson model, single reads only • [Li et al. 10]: EM algorithm, single reads only • [Feng et al. 10]: convex quadratic program, pairs used only for ID • [Trapnell et al. 10]: extends Jiang's model to paired reads; fragment length distribution
Our contributions • EM algorithm for IE, using: single and paired reads; fragment length distribution; strand information; base quality scores • Solving GE by adding isoform levels
Outline • Introduction • EM Algorithm • Results • Conclusions and future work
Fragment length distribution • Paired reads • Single reads • [Figure: diagrams over exons A, B, C]
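The slide's figure did not survive extraction, so the following is only a plausible sketch of how a fragment length distribution could weight a read against a candidate isoform. It is not taken from the talk. The assumptions are: fragment lengths are normally distributed (mean 250, std. dev. 25, as in the experimental setup); a read pair implies one exact fragment length per isoform; a single read only bounds the fragment length by the distance to the isoform end. All function names are hypothetical.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def normal_cdf(x, mean, sd):
    """Probability that a normal variable is at most x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def paired_read_weight(fragment_length, mean=250.0, sd=25.0):
    # Assumption: both ends are mapped, so the fragment length implied by a
    # given isoform is known and can be scored with the density directly.
    return normal_pdf(fragment_length, mean, sd)


def single_read_weight(max_fragment_length, mean=250.0, sd=25.0):
    # Assumption: only one end is mapped, so the fragment can be any length
    # that still fits before the isoform end.
    return normal_cdf(max_fragment_length, mean, sd)


# Hypothetical pair whose ends lie in exons A and C. Exon B (60 bases here)
# is part of the fragment on isoform A B C but spliced out on isoform A C.
EXON_B_LENGTH = 60
w_abc = paired_read_weight(250)
w_ac = paired_read_weight(250 - EXON_B_LENGTH)
```

Under these assumptions the same pair is far more compatible with isoform A B C (implied fragment length 250) than with A C (implied length 190), which is the kind of disambiguation a fragment length model can provide.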
IsoEM algorithm • E-step • M-step • [Equations not recovered from the slide image]
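The E-step and M-step formulas were not recovered from the slide, so the block below is a generic EM sketch for resolving ambiguous reads, not the IsoEM update rules themselves. It assumes each read comes with a hypothetical dictionary of compatibility weights per isoform, and it estimates the fraction of reads attributed to each isoform. Length normalization, which is needed to turn read fractions into isoform frequencies, is deliberately omitted.

```python
def estimate_read_fractions(read_weights, n_isoforms, n_iter=200, tol=1e-9):
    """Generic EM over ambiguous reads (illustrative, not the IsoEM formulas).

    read_weights: one dict per read, mapping isoform index -> weight of the
    read under that isoform (hypothetical input format).
    Returns the estimated fraction of reads originating from each isoform.
    """
    fractions = [1.0 / n_isoforms] * n_isoforms
    for _ in range(n_iter):
        expected = [0.0] * n_isoforms
        # E-step: split each read among its compatible isoforms in proportion
        # to current fraction times read weight.
        for weights in read_weights:
            denom = sum(fractions[j] * w for j, w in weights.items())
            if denom == 0.0:
                continue  # read is incompatible with every isoform
            for j, w in weights.items():
                expected[j] += fractions[j] * w / denom
        total = sum(expected)
        if total == 0.0:
            break
        # M-step: renormalize the expected read counts.
        updated = [c / total for c in expected]
        delta = max(abs(a - b) for a, b in zip(updated, fractions))
        fractions = updated
        if delta < tol:
            break
    return fractions


# Toy input: 6 reads unique to isoform 0, 2 unique to isoform 1,
# and 4 reads equally compatible with both.
reads = [{0: 1.0}] * 6 + [{1: 1.0}] * 2 + [{0: 1.0, 1: 1.0}] * 4
fractions = estimate_read_fractions(reads, n_isoforms=2)
```

On this toy input the ambiguous reads end up split 3:1, matching the ratio of the unique reads. Gene-level values can then be obtained by summing over a gene's isoforms, in the spirit of "solving GE by adding isoform levels".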
Outline • Introduction • EM Algorithm • Results • Conclusions and future work
Experimental setup • Human genome, UCSC known isoforms • GNFAtlas2 gene expression levels • Uniform/geometric expression of gene isoforms • Normally distributed fragment lengths (mean 250, std. dev. 25)
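Two ingredients of this setup can be sketched in a few lines: splitting a gene's expression among its isoforms uniformly or geometrically, and drawing fragment lengths from a normal distribution with mean 250 and standard deviation 25. The geometric ratio of 0.5, the rounding and clamping of lengths, and the function names are assumptions for illustration; the slide does not specify them.

```python
import random


def isoform_proportions(n_isoforms, model="uniform", ratio=0.5):
    """Share of a gene's expression given to each of its isoforms.

    ratio is a hypothetical parameter: under the geometric model each
    isoform gets `ratio` times the share of the previous one.
    """
    if model == "uniform":
        return [1.0 / n_isoforms] * n_isoforms
    if model == "geometric":
        raw = [ratio ** i for i in range(n_isoforms)]
        total = sum(raw)
        return [r / total for r in raw]
    raise ValueError("model must be 'uniform' or 'geometric'")


def sample_fragment_lengths(n, mean=250.0, sd=25.0, seed=0):
    """Draw n fragment lengths, rounded to whole bases and kept >= 1."""
    rng = random.Random(seed)
    return [max(1, round(rng.gauss(mean, sd))) for _ in range(n)]


props = isoform_proportions(3, model="geometric")  # proportional to 1, 1/2, 1/4
lengths = sample_fragment_lengths(10000)
```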
Accuracy measurements • Error Fraction (EF): percentage of isoforms (or genes) with relative error larger than a given threshold t • Median Percent Error (MPE): threshold t for which EF is 50% • r²: coefficient of determination
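The three measures can be sketched as follows. Several details are assumptions, since the slide does not spell them out: relative error is taken as |estimate - truth| / truth, items with zero true value are skipped, and r² is computed as the squared Pearson correlation. The function names and the toy data are hypothetical.

```python
import statistics


def relative_errors(true_vals, est_vals):
    # Assumption: items with a true value of zero are skipped.
    return [abs(e - t) / t for t, e in zip(true_vals, est_vals) if t > 0]


def error_fraction(true_vals, est_vals, threshold):
    """Percentage of items whose relative error exceeds the threshold."""
    errs = relative_errors(true_vals, est_vals)
    return 100.0 * sum(1 for x in errs if x > threshold) / len(errs)


def median_percent_error(true_vals, est_vals):
    """Median relative error in percent (the threshold where EF is 50%)."""
    return 100.0 * statistics.median(relative_errors(true_vals, est_vals))


def r_squared(true_vals, est_vals):
    # Assumption: r^2 is the squared Pearson correlation coefficient.
    mt = statistics.fmean(true_vals)
    me = statistics.fmean(est_vals)
    cov = sum((t - mt) * (e - me) for t, e in zip(true_vals, est_vals))
    var_t = sum((t - mt) ** 2 for t in true_vals)
    var_e = sum((e - me) ** 2 for e in est_vals)
    return cov * cov / (var_t * var_e)


# Toy data: relative errors are 0.10, 0.10, 0.00 and 0.25.
truth = [10.0, 20.0, 40.0, 80.0]
estimate = [11.0, 18.0, 40.0, 100.0]
ef_5 = error_fraction(truth, estimate, 0.05)   # 3 of 4 items exceed 5%
mpe = median_percent_error(truth, estimate)
r2 = r_squared(truth, estimate)
```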
Isoform Error Fraction Curves • 30M single reads of length 25 • Main difference between IsoEM and RSEM is fragment length modeling
Gene Error Fraction Curves • 30M single reads of length 25
Read Length Effect • Fixed sequencing throughput (750Mb) • 50bp reads better than 100bp!
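With throughput fixed at 750Mb, the number of reads falls as read length grows. The arithmetic below makes that trade-off explicit. Linking it to the observed advantage of 50bp reads is an interpretation, not a claim made on the slide. The function name is hypothetical.

```python
THROUGHPUT_BASES = 750_000_000  # 750Mb, as stated on the slide


def reads_at_fixed_throughput(read_length, throughput=THROUGHPUT_BASES):
    """Number of whole reads of the given length that fit in the throughput."""
    return throughput // read_length


read_counts = {length: reads_at_fixed_throughput(length) for length in (25, 50, 100)}
# 25bp -> 30,000,000 reads; 50bp -> 15,000,000 reads; 100bp -> 7,500,000 reads
```

The 25bp case reproduces the 30M reads used for the error fraction curves, so those experiments sit at the same 750Mb throughput.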
Effect of Pairs & Strand Information • 1-60M 75bp reads • Pairs help, strand info doesn't • [Trapnell et al. 10]: r² = 0.95 for 13M PE reads
Outline • Introduction • EM Algorithm • Results • Conclusions and future work
Conclusions & Future Work • Presented an EM algorithm for isoform frequency estimation that exploits the fragment length distribution for both single and paired reads • Significant accuracy improvement over existing methods • Code and datasets to be released publicly soon • Ongoing extensions: confidence intervals; allele-specific isoform expression; testing for novel isoforms; integration with isoform discovery