Serghei Mangul Department of Computer Science Georgia State University

Maximum Likelihood Estimation of Incomplete Genomic Spectrum from HTS Data SergheiMangul Department of Computer Science Georgia State University • Joint work with Irina Astrovskaya, Marius Nicolae, BassamTork, Ion Mandoiu and Alex Zelikovsky

Outline • Introduction • ML Model • EM Algorithm • VSEM Algorithm • Experimental Results • RNA-Seq • 454 • Conclusions and future work WABI 2011, Max-Planck-Institute für Informatics, Saarbrücken, Germany

Advances in High-Throughput Sequencing (HTS) Roche/454 FLX Titanium 400-600 million reads/run 400bp avg. length Illumina HiSeq 2000 Up to 6 billion PE reads/run 35-100bp read length http://www.economist.com/node/16349358 SOLiD 4/5500 1.4-2.4 billion PE reads/run 35-50bp read length WABI 2011, Max-Planck-Institute für Informatics, Saarbrücken, Germany

MLModel reads R1 strings S1 R2 S2 R3 S3 R4 • Panel : bipartite graph • LEFT: genomic sequences (strings) • unknown frequencies • RIGHT: reads • observed frequencies • EDGES: probability of the read to be emitted by the string • weights are calculated based on the mapping of the reads to the strings WABI 2011, Max-Planck-Institute für Informatics, Saarbrücken, Germany

ML estimates of string frequencies • Probability that a read is sampled from string is proportional with its frequency f(j) • ML estimates for f(j)is given by n(j)/(n(1) + . . . + n(N)) • n(j) - number of reads sampled from string j WABI 2011, Max-Planck-Institute für Informatics, Saarbrücken, Germany

EM algorithm Initialization E-step: Compute the expected number n(j) of reads that come from string j under the assumption that string frequencies f(j) are correct M-step: For each string j, set the new value of f(j) equal to the portion of reads being originated by string j among all reads in the sample WABI 2011, Max-Planck-Institute für Informatics, Saarbrücken, Germany

ML Model Quality • How well the maximum likelihood model explain the reads • Measured by deviationbetween expected and observed read frequencies • expected read frequency: WABI 2011, Max-Planck-Institute für Informatics, Saarbrücken, Germany

VSEM : Virtual String EM (Incomplete) Panel + Virtual String with 0-weights in virtual string Update weights of reads in virtual string ML estimates of string frequencies EM EM Virtual String frequency change>ε? Output string frequencies and weights Compute expected read frequencies YES NO WABI 2011, Max-Planck-Institute für Informatics, Saarbrücken, Germany

Example : 1st iteration Incomplete Panel Full Panel reads reads R1 R1 strings strings S1 S1 R2 R2 S2 S2 R3 R3 S3 R4 R4 WABI 2011, Max-Planck-Institute für Informatics, Saarbrücken, Germany

Example : 1st iteration Incomplete Panel Full Panel reads reads R1 R1 strings strings S1 S1 R2 R2 S2 S2 R3 R3 VS S3 R4 R4 VS WABI 2011, Max-Planck-Institute für Informatics, Saarbrücken, Germany 14

Example : 1st iteration Incomplete Panel Full Panel reads reads R1 R1 strings strings S1 S1 R2 R2 S2 S2 R3 R3 VS S3 R4 R4 VS WABI 2011, Max-Planck-Institute für Informatics, Saarbrücken, Germany

Example : last iteration Incomplete Panel Full Panel reads reads R1 R1 strings strings S1 S1 R2 R2 S2 S2 R3 R3 VS S3 R4 R4 VS WABI 2011, Max-Planck-Institute für Informatics, Saarbrücken, Germany

VSEM : Virtual String EM • Decide if the panel is likely to be incomplete • Estimate total frequency of missing strings • Identify read spectrum emitted by missing strings WABI 2011, Max-Planck-Institute für Informatics, Saarbrücken, Germany

VSEM : Applications • RNA-Seq • inferring isoform expressions from RNA-Seq • Viral Quasispecies Sequencing by 454 pyrosequencing • inferring viral quasispecies spectrum from pyrosequencing shotgun reads WABI 2011, Max-Planck-Institute für Informatics, Saarbrücken, Germany

RNA-Seq Make cDNA & shatter into fragments Sequence fragment ends Map reads A B C D E Isoform Expression (IE) Gene Expression (GE) Isoform Discovery (ID) A B C A C D E WABI 2011, Max-Planck-Institute für Informatics, Saarbrücken, Germany

Previous Approach • IsoEM [Nicolae et al. 2011] – novel expectation-maximization algorithm for inference of alternative splicing isoform frequencies from RNA-Seq data • Single and/or paired reads • Fragment length distribution • Strand information • Base quality scores • Insert sizes (library preparation) WABI 2011, Max-Planck-Institute für Informatics, Saarbrücken, Germany

Simulation Setup • Human genome UCSC/CCDS known isoforms • UCSC : 66803 isoforms, 19372 genes • CCDS : 20829 isoforms , 17373 genes • GNFAtlas2 gene expression levels • geometric expression of gene isoforms • Normally distributed fragment lengths • Mean 250, std. dev. 25 WABI 2011, Max-Planck-Institute für Informatics, Saarbrücken, Germany

EXP1 : Reduced transcriptome data • Comparison between IsoEM and IsoVSEM on reduced transcriptome data • in every gene 25% of isoforms is missing • isoforms inside the gene - geometric distribution(p=0.5) • select genes with number of isoforms inside the gene is less or equal to 3. • removed isoforms with frequency 0.25 WABI 2011, Max-Planck-Institute für Informatics, Saarbrücken, Germany

EXP2 : CCDS panel • UCSC database represents the full panel • CCDS represents the incomplete panel • reads were generated from UCSC library of isoforms • only frequencies of known isoforms(CCDS) were estimated WABI 2011, Max-Planck-Institute für Informatics, Saarbrücken, Germany

Error Fraction Curves EXP1, 30M reads of length 25 WABI 2011, Max-Planck-Institute für Informatics, Saarbrücken, Germany

454 Pyrosequencing • Pyrosequencing =Sequencing by Synthesis. • GS FLX Titanium : • Divides the source genetic material into reads (300-800 bp) WABI 2011, Max-Planck-Institute für Informatics, Saarbrücken, Germany

Previous Approach • ViSpA [Astrovskaya et al. 2011] – viral spectrum assembling tool for inferring viral quasispecies sequences and their frequencies from pyroseqencing shotgun reads • align reads • built a read graph : • V – reads • E – overlap between reads • each path – candidate sequence • filter based on ML frequencies WABI 2011, Max-Planck-Institute für Informatics, Saarbrücken, Germany

reads removing duplicated & rare qsps ViSpA-VSEM assembled Qsps Qsps Library ViSPA Weighted assembler Stopping condition reads, weights VSEM Virtual String EM NO YES ViSpA ML estimator Viral Spectrum +Statistics WABI 2011, Max-Planck-Institute für Informatics, Saarbrücken, Germany

Simulation Setup • Real quasispecies sequences data from [von Hahn et al. 2006] • 44 sequences (1739 bp long) from the E1E2 region of Hepatitis C virus • populations sizes: 10, 20, 30, and 40 sequences • population distributions: geometric, skewed normal, uniform WABI 2011, Max-Planck-Institute für Informatics, Saarbrücken, Germany

Experimental Validation of VSEM • Detection of panel incompleteness • VSEM can detect >1% of missing strings • Improving quasispecies frequencies estimations • Detection of reads emitted by missing string • Correlation between predicted reads and reads emitted by missing strings >65% WABI 2011, Max-Planck-Institute für Informatics, Saarbrücken, Germany

VSEM improving frequencies estimates r - Correlation between real and predicted frequencies; err - average prediction error WABI 2011, Max-Planck-Institute für Informatics, Saarbrücken, Germany

ViSpAvsViSpA-VSEM • 100K reads from 10 QSPS • average length 300 r - Correlation between real and predicted frequencies; err - average prediction error WABI 2011, Max-Planck-Institute für Informatics, Saarbrücken, Germany

Conclusions • We propose VSEM, a novel modification of EM algorithm • improves the ML frequency estimations of multiple genomic sequences • identifies reads that belong to unassembled(missing) sequences • We applied VSEM to improve two tools: • IsoEM • ViSpA WABI 2011, Max-Planck-Institute für Informatics, Saarbrücken, Germany

Future work Assemble strings from the set of reads emitted by missing strings Improve other metagenomics tools WABI 2011, Max-Planck-Institute für Informatics, Saarbrücken, Germany

Acknowledgments NSF awards IIS-0546457 IIS-0916948, and DBI-0543365. NSF award IIS-0916401 Agriculture and Food Research Initiative Competitive Grant no. 2011-67016-30331 from the USDA National Institute of Food and Agriculture. WABI 2011, Max-Planck-Institute für Informatics, Saarbrücken, Germany

Thanks

Serghei Mangul Department of Computer Science Georgia State University