Accurate Estimation of Gene Expression Levels from Digital Gene Expression Sequencing Data
Marius Nicolae and Ion Măndoiu (University of Connecticut, USA)
Outline
• DGE/SAGE-Seq protocol
• EM algorithm
• Experimental results
• Conclusions
RNA-Seq Protocol
• Make cDNA & shatter into fragments
• Sequence fragment ends
• Map reads
• Applications: Gene Expression (GE), Isoform Expression (IE), Isoform Discovery (ID)
[Figure: exons A B C D E; isoforms A B C and A C D E]
DGE Protocol
• Cleave with anchoring enzyme (AE)
• Attach primer for tagging enzyme (TE)
• Cleave with tagging enzyme
• Map tags
• Application: Gene Expression (GE)
[Figure: poly-A (AAAAA) transcripts with AE sites (CATG) and TE recognition site (TCCRAC); exons A B C D E]
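The protocol steps above can be mimicked in silico. The sketch below is illustrative only and is not the authors' code. It assumes that a tag starts at an AE recognition site and extends toward the 3' end, that the 3'-most cut site determines the observed tag, and that each site is cut independently with probability p (a simple geometric model of partial digest). The function name `candidate_tags` and all parameter defaults are hypothetical.

```python
def candidate_tags(transcript, site="CATG", tag_len=21, p=0.5):
    """List (tag, rank, probability) for each AE site, 3'-most site first.

    Assumptions (not taken from the slides): the tag begins at the AE site,
    and site number `rank` from the 3' end yields the tag with probability
    p * (1 - p) ** (rank - 1), i.e. all sites closer to the 3' end were
    left uncut.
    """
    # Collect every occurrence of the recognition site, overlaps included.
    positions = []
    start = transcript.find(site)
    while start != -1:
        positions.append(start)
        start = transcript.find(site, start + 1)

    tags = []
    for rank, pos in enumerate(reversed(positions), start=1):
        tag = transcript[pos:pos + tag_len]
        # Sites too close to the 3' end cannot yield a full-length tag.
        if len(tag) == tag_len:
            tags.append((tag, rank, p * (1 - p) ** (rank - 1)))
    return tags


if __name__ == "__main__":
    for tag, rank, prob in candidate_tags(
        "AACATGTTTTTTGGCATGCCCCCCAAAA", tag_len=8
    ):
        print(rank, tag, prob)
```

With p = 1 (complete digest) only the 3'-most site contributes, which corresponds to the setting where tags from upstream sites are never observed.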
Our Approach
Previous methods
• Discard ambiguous tags [Asmann et al. 09, Zaretzki et al. 10]
• Heuristics to rescue some ambiguous tags [Wu et al. 10]
New DGE-EM algorithm
• Uses all tags, including all ambiguous ones
• Uses quality scores
• Takes into account partial digest and gene isoforms
DGE-EM Algorithm
assign random values to all f(i)
while not converged
    init all n(i,j) to 0
    for each tag t
        for (i,j,w) in t
            E-step
    for each isoform i
        M-step
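The E-step and M-step formulas did not survive on this slide, so the following is only a sketch of one plausible instantiation, not the authors' implementation. It assumes each tag t is a list of (i, j, w) triples, where i is an isoform, j is the index of the AE site counted from the 3' end, and w is a weight derived from quality scores. It further assumes a geometric partial-digest model with cutting probability p, and that every isoform has at least one AE site. For reproducibility it starts from uniform f(i) rather than random values. The name `dge_em` and its parameters are hypothetical.

```python
def dge_em(tags, num_sites, p=0.5, max_iters=1000, tol=1e-9):
    """Estimate relative isoform frequencies f(i) by EM.

    tags:      list of tags; each tag is a list of (isoform, site_rank, weight)
    num_sites: dict mapping isoform -> number of AE sites (assumed >= 1)
    p:         assumed probability that the enzyme cuts at a given site
    """
    isoforms = sorted(num_sites)
    f = {i: 1.0 / len(isoforms) for i in isoforms}  # uniform start

    for _ in range(max_iters):
        # E-step: split each tag fractionally among compatible (i, j) pairs.
        n = {}
        for t in tags:
            scores = [w * f[i] * p * (1 - p) ** (j - 1) for (i, j, w) in t]
            total = sum(scores)
            if total == 0:
                continue
            for (i, j, _w), s in zip(t, scores):
                n[(i, j)] = n.get((i, j), 0.0) + s / total

        # M-step: sum expected counts per isoform, then correct for
        # transcripts in which none of the sites was cut.
        raw = {i: 0.0 for i in isoforms}
        for (i, _j), count in n.items():
            raw[i] += count
        for i in isoforms:
            raw[i] /= 1 - (1 - p) ** num_sites[i]
        z = sum(raw.values())
        new_f = {i: raw[i] / z for i in isoforms}

        converged = max(abs(new_f[i] - f[i]) for i in isoforms) < tol
        f = new_f
        if converged:
            break
    return f


if __name__ == "__main__":
    unique_a = [[("A", 1, 1.0)]] * 3
    unique_b = [[("B", 1, 1.0)]]
    ambiguous = [[("A", 1, 1.0), ("B", 1, 1.0)]] * 4
    print(dge_em(unique_a + unique_b + ambiguous, {"A": 1, "B": 1}))
```

In the toy input, the four ambiguous tags are divided in proportion to the current estimates, so the result settles at the 3:1 ratio implied by the unique tags instead of discarding half of the data.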
MAQC Data (UHRR, HBRR)
DGE
• 9 Illumina libraries, 238M 20bp tags [Asmann et al. 09]
• Anchoring enzyme DpnII (GATC)
RNA-Seq
• 6 libraries, 47-92M 35bp reads each [Bullard et al. 10]
qPCR
• Quadruplicate measurements for 832 Ensembl genes [MAQC Consortium 06]
Compared Algorithms
DGE
• Uniq [Asmann et al. 09, Zaretzki et al. 10]
• DGE-EM
RNA-Seq
• IsoEM [Nicolae et al. 10]
• Cufflinks [Trapnell et al. 10]
Synthetic Data
• 1-30M tags, lengths 14-26bp
• UCSC hg19 genome and known isoforms
• Simulated expression levels
    • Gene expression for 5 tissues from the GNFAtlas2
    • Geometric expression for the isoforms of each gene
• Anchoring enzymes from REBASE
    • DpnII (GATC) [Asmann et al. 09]
    • NlaIII (CATG) [Wu et al. 10]
    • CviJI (RGCY, R=G or A, Y=C or T)
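Degenerate recognition sites such as RGCY match several concrete sequences, so locating them takes slightly more than a plain string search. The sketch below is an illustration, not part of the authors' simulator. It expands the degenerate codes into a regular expression and reports all site positions, including overlapping ones. The helper names `site_regex` and `find_sites` are hypothetical, and the code table covers only the symbols needed here plus N.

```python
import re

# Subset of IUPAC nucleotide codes used by the enzymes listed above.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]",    # purine
    "Y": "[CT]",    # pyrimidine
    "N": "[ACGT]",  # any base
}


def site_regex(site):
    """Compile a recognition site into a regex that finds overlapping hits."""
    pattern = "".join(IUPAC[base] for base in site)
    # A zero-width lookahead lets finditer report overlapping matches.
    return re.compile("(?=(" + pattern + "))")


def find_sites(sequence, site):
    """Return the 0-based start position of every occurrence of `site`."""
    return [m.start() for m in site_regex(site).finditer(sequence)]


if __name__ == "__main__":
    seq = "AGCTGGCCTTGATC"
    for name, site in [("DpnII", "GATC"), ("NlaIII", "CATG"), ("CviJI", "RGCY")]:
        print(name, site, find_sites(seq, site))
```

A degenerate site occurs more often than a fixed four-base site, so more transcripts contain at least one cut site.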
MPE for 30M 21bp tags
RNA-Seq: 8.3 MPE
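The slide does not expand the abbreviation MPE. Assuming it denotes median percent error, that is, the median over genes of the relative error of the estimate, it could be computed as in the sketch below. The function name and the choice to skip genes with zero true expression are assumptions made for illustration.

```python
from statistics import median


def median_percent_error(estimated, truth):
    """Median over genes of 100 * |estimate - truth| / truth.

    estimated, truth: dicts mapping gene id -> expression level.
    Genes with zero true expression are skipped (an assumption), since
    their percent error is undefined. Missing estimates count as 0.
    """
    errors = [
        100 * abs(estimated.get(gene, 0.0) - level) / level
        for gene, level in truth.items()
        if level > 0
    ]
    return median(errors)


if __name__ == "__main__":
    truth = {"g1": 10, "g2": 20, "g3": 40}
    estimated = {"g1": 11, "g2": 15, "g3": 40}
    print(median_percent_error(estimated, truth))
```

Being a median, this statistic is insensitive to a few badly estimated genes, which makes it suitable for comparing protocols across many genes.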
Conclusions
• Introduced new DGE-EM algorithm
    • Improves accuracy over previous methods by using ambiguous tags and considering isoforms and partial digestion
    • Source code freely available at http://www.dna.engr.uconn.edu/software/DGE-EM
• First direct comparison of RNA-Seq and DGE protocols
    • Best inference algorithms yield comparable cost-normalized accuracy on MAQC data
    • Simulations suggest possible DGE protocol improvements
        • Enzymes with degenerate recognition sites (e.g. CviJI)
        • Optimizing cutting probability
Questions?
ACKNOWLEDGEMENTS
Work supported in part by NSF awards IIS-0546457 and IIS-0916948