1 / 74

Inferring Viral Quasispecies Spectra from NGS Reads

Inferring Viral Quasispecies Spectra from NGS Reads. Ion M ă ndoiu Computer Science & Engineering Department University of Connecticut. Outline. Background Quasispecies spectrum reconstruction from shotgun NGS reads Quasispecies spectrum reconstruction from amplicon NGS reads

garret
Download Presentation

Inferring Viral Quasispecies Spectra from NGS Reads

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Inferring Viral Quasispecies Spectra from NGS Reads Ion Măndoiu Computer Science & Engineering Department University of Connecticut

  2. Outline • Background • Quasispecies spectrum reconstruction from shotgun NGS reads • Quasispecies spectrum reconstruction from ampliconNGS reads • Quasispecies spectrum reconstruction for IBV • Ongoing and future work

  3. Cost of DNA Sequencing http://www.economist.com/node/16349358

  4. Cost/Performance Comparison [Glenn 2011]

  5. De novo genome sequencing Genome re-sequencing RNA-Seq Non-coding RNAs Structural variation ChIP-Seq Methyl-Seq Metagenomics Paleogenomics Viral quasispecies … many more biological measurements “reduced” to NGS sequencing Applications

  6. RNA Virus Replication High mutation rate (~10-4) Lauring & Andino, PLoS Pathogens 2011

  7. How Are Quasispecies Contributing to Virus Persistence and Evolution? • Variants differ in • Virulence • Ability to escape immune response • Resistance to antiviral therapies • Tissue tropism Lauring & Andino, PLoS Pathogens 2011

  8. 454 Pyrosequencing Workflow

  9. Shotgun vs. Amplicon Reads • Shotgun reads • starting positions distributed ~uniformly • Amplicon reads • reads have predefined start/end positions covering fixed overlapping windows

  10. Quasispecies Spectrum Reconstruction (QSR) Problem • Given • Shotgun/ampliconpyrosequencing reads from a quasispecies population of unknown size and distribution • Reconstruct the quasispecies spectrum • Sequences • Frequencies

  11. Prior Work • Eriksson et al 2008 • maximum parsimony using Dilworth’s theorem, clustering, EM • Westbrooks et al. 2008 • min-cost network flow • Zagordiet al 2010-11 (ShoRAH) • probabilistic clustering based on a Dirichlet process mixture • Prosperiet al 2011 (amplicon based) • based on measure of population diversity • Huang et al 2011 (QColors) • Parsimonious reconstruction of quasispecies subsequences using constraint programming within regions with sufficient variability

  12. Outline • Background • Quasispecies spectrum reconstruction from shotgun NGS reads • Quasispecies spectrum reconstruction from ampliconNGS reads • Quasispecies spectrum reconstruction for IBV • Ongoing and future work

  13. ViSpA: Viral Spectrum Assembler • Key features • Error correction both pre-alignment (based on k-mers) and post-alignment • Quasispecies assembly based on maximum-bandwidth paths in weighted read graphs • Frequency estimation via EM on all reads • Freely available at http://alla.cs.gsu.edu/software/VISPA/vispa.html

  14. ViSpA Flow Read Error Correction Read Alignment Preprocessing of Aligned Reads Shotgun 454 reads Frequency Estimation Read Graph Construction Contig Assembly Quasispecies sequences w/ frequencies

  15. k-mer Error Correction [Skums et al.] Zhao X et al 2010 • Calculate k-mers and their frequencies (k-counts) • Assume that kmers with high k-counts (“solid” k-mers) are correct, while k-mers with low k-counts (“weak” k-mers) contain errors • Determine the threshold k-count (error threshold), which distinguishes solid kmers from weak k-mers. • Find error regions. • Correct the errors in error regions

  16. Iterative Read Alignment Read Alignment vs Reference Build Consensus Read Re-Alignment vs. Consensus More Reads Aligned? Yes No Post- processing

  17. 454 Sequencing Errors • Sequencing error rate ~ 0.1% • Most errors due to incorrect resolution of homopolymers • over-calls (insertions) • 65-75% of errors • under-calls (deletions) • 20-30% of errors

  18. Post-processing of Aligned Reads • Deletions in reads: D • Insertions into reference: I • Additional error correction: • Replace deletions supported by a single read with either the allele present in all other reads or N • Remove insertions supported by a single read

  19. Read Graph: Vertices ACTGGTCCCTCCTGAGTGT GGTCCCTCCT TGGTCACTCGTGAG ACCTCATCGAAGCGGCGTCCT Subread = completely contained in other read with ≤ n mismatches. Superreads = not subreads => vertices in the read graph

  20. Read Graph: Edges • Edge b/w two vertices if there is an overlap between superreads and they agree on their overlap with ≤ m mismatches • Transitive reduction

  21. Edge Cost Δ where j is the number of mismatches in overlap o, ε is 454 error rate Cost measures the uncertainty that two superreads belong to the same quasispecies. OverhangΔis the shift in start positions of two overlapping superreads.

  22. Contig Assembly - Path to Sequence • Compute an s-t-Max Bandwidth Path through each vertex (maximizing minimum edge cost) • Build coarse sequence out of each path’s superreads: • For each position: >70%-majority if it exists, otherwise N • Replace N’s in coarse sequence with weighted consensus obtained from all reads • Select unique sequences out of constructed sequences

  23. Frequency Estimation – EM Algorithm • E step: • M step: • Bipartite graph: • Qq is a candidate with frequency fq • Rr is a read with observed frequency or • Weight hq,r= probability that read r is produced by quasispecies q with j mismatches

  24. Experimental Validation • Simulations • Error-free reads from known HCV quasispecies • Reads with errors generated by FlowSim (Balser et al. 2010) • Real 454 reads • HIV and HCV data • Comparison with ShoRAH

  25. Simulations: Error-Free Reads • 44 real qsps (1739 bp long) from the E1E2 region of Hepatitis C virus (von Hahn et al. (2006)) • Simulated reads: • 4 populations sizes: 10, 20, 30, 40 sequences • Geometric distribution • The quasispecies population: • Number of reads between 20K and 100K • Read length distribution N(μ,400); μ varied from 200 to 500

  26. Results

  27. Simulations with FlowSim • 44 real quasispecies sequences (1739 bp long) from the E1E2 region of Hepatitis C virus (von Hahn et al. (2006)) • 30K reads with average length 350bp • 100 bootstrapping tests on 10% - reduced data • For the i-th (i = 1, .., 10) most frequent sequence assembled on the whole data, we record its reproducibility= percentage of runs when there is a match (exact or with at most k mismatches) among 10 most frequent sequences found on reduced data.

  28. Bootstraping Tests • ShoRAH outperforms ViSpA due to its read correction • If ViSpA is used on ShoRAH-corrected reads (ShoRAHreads+ViSpA), the results drastically improve

  29. 454 Reads of HIV Qsps • 55,611 reads (average read length 345bp) from ten 1.5Kbp long region of HIV-1 (Zagordi et al.2010) • No removal of low-quality reads • ~99% of reads has at least one indel • ~11.6 % of reads with at least one N • ShoRAH correctly infers only 2 qspssequences with <=4 mismatches • ViSpA correctly infers 5 qspswith <=2 mismatches , 2 qsps are inferred exactly

  30. Outline • Background • Quasispecies spectrum reconstruction from shotgun NGS reads • Quasispecies spectrum reconstruction from ampliconNGS reads • Quasispecies spectrum reconstruction for IBV • Ongoing and future work

  31. Amplicon Sequencing Challenges • Distinct quasispecies may be indistinguishable in an amplicon interval • Multiple reads from consecutive amplicons may match over theiroverlap

  32. Prosperi et al. 2011 • First published approach for amplicons • Based on the idea of guide distribution • choose most variable amplicon • extend to right/left with matching reads, breaking ties by rank

  33. Read Graph for Amplicons • K amplicons → K-staged read graph • vertices → distinct reads • edges → reads with consistent overlap • vertices, edges have a count function

  34. Read Graph • May transform bi-cliquesinto 'fork' subgraphs • common overlap is represented by fork vertex

  35. Observed vs Ideal Read Frequencies • Ideal frequency • consistent frequency across forks • Observed frequency (count) • inconsistent frequency across forks

  36. Fork Balancing Problem • Given • Set of reads and respective frequencies • Find • Minimal frequency offsets balancing all forks • Simplest approach is to scale frequencies from left to right

  37. Least Squares Balancing • Quadratic Program for read offsets • q – fork, oi – observed frequency, xi – frequency offset

  38. Fork Resolution: Parsimony 6 8 4 2 12 8 6 8 6 6 6 8 4 4 2 2 4 8 4 4 2 2 4 4 4 2 2 2 4 2 (a) (b)

  39. Fork Resolution: Max Likelihood • Given a forest, ML = # of ways to produce observed reads / 2^(#qsp): • Can be computed efficiently for trees: multiply by binomial coefficient of a leaf and its parent edge, prune the edge, and iterate • Solution (b) has a larger likelihood than (a) although both have 3 qsp’s • (a) (4 choose 2) * (8 choose 4) * (8 choose 4)/2^20 = 29400/2^20 ~ 2.8% • (b) (12 choose 6) * (4 choose 2)*(4 choose 2)/2^20 = 33264/2^20 ~ 3.3% 12 8 6 8 6 6 6 8 4 4 2 2 4 8 4 4 2 2 4 4 4 2 2 2 4 2 (a) (b)

  40. Fork Resolution: Min Entropy 12 8 6 8 6 6 6 8 4 4 2 • Solution (b) also has a lower entropy than (a) • (a) -[ (8/20)log(8/20) + (8/20)log(8/20) + (4/20)log(4/20) ] ~ 1.522 • (b) -[ (12/20)log(4/20) + (4/20)log(4/20) + (4/20)log(4/20) ] ~ 1.37 2 4 8 4 4 2 2 4 4 4 2 2 2 4 2 (a) (b)

  41. Local Optimization: Greedy Method

  42. Greedy Method

  43. Greedy Method

  44. Greedy Method

  45. Greedy Method

  46. Greedy Method

  47. Greedy Method

  48. Greedy Method

  49. Greedy Method

  50. Global Optimization: Maximum Bandwidth

More Related