Next-Generation Sequencing: Challenges and Opportunities


Presentation Transcript


  1. Next-Generation Sequencing: Challenges and Opportunities Ion Mandoiu Computer Science and Engineering Department University of Connecticut

  2. Outline • Background on high-throughput sequencing • Identification of tumor-specific epitopes • Estimation of gene and isoform expression levels • Viral quasispecies reconstruction • Future work

  3. Advances in High-Throughput Sequencing (HTS) • Roche/454 FLX Titanium: 400-600 million bases/run, 400bp avg. read length • Illumina HiSeq 2000: up to 6 billion PE reads/run, 35-100bp read length • SOLiD 4: 1.4-2.4 billion PE reads/run, 35-50bp read length • Source: http://www.economist.com/node/16349358

  4. Illumina Workflow – Library Preparation • Starting material: mRNA or genomic DNA

  5. Illumina Workflow – Cluster Generation

  6. Illumina Workflow – Sequencing by Synthesis

  7. Cost of Whole Genome Sequencing • C. Venter: Sanger @ 7.5x • J. Watson: 454 @ 7.4x • NA18507: Illumina @ 36x, SOLiD @ 12x

  8. HTS applications • HTS is a transformative technology • Numerous applications besides de novo genome sequencing: RNA-Seq, non-coding RNAs, ChIP-Seq, epigenetics, structural variation, metagenomics, paleogenomics, …

  9. Outline • Background on high-throughput sequencing • Identification of tumor-specific epitopes • Estimation of gene and isoform expression levels • Viral quasispecies reconstruction • Future work

  10. Genomics-Guided Cancer Immunotherapy • Tumor mRNA Sequencing (reads such as CTCAATTGATGAAATTGTTCTGAAACT, GCAGAGATAGCTAAAGGATACCGGGTT, …) → Tumor-Specific Epitopes (SYFPEITHI, ISETDLSLL, CALRRNESL, …) → Peptide Synthesis → Immune System Stimulation → Tumor Remission • Mouse image source: http://www.clker.com/clipart-simple-cartoon-mouse-2.html

  11. Bioinformatics Pipeline • Tumor mRNA reads → CCDS Mapping → CCDS mapped reads • Tumor mRNA reads → Genome Mapping → Genome mapped reads • Read Merging → Mapped reads • SNV Detection → Tumor-specific SNVs • Haplotyping → Close SNV haplotypes • Epitope Prediction → Tumor-specific epitopes • Primer Design → Primers for Sanger sequencing

  13. Mapping mRNA Reads • Image source: http://en.wikipedia.org/wiki/File:RNA-Seq-alignment.png

  14. Read Merging

  15. SNV Detection and Genotyping • Reference around locus i: AACGCGGCCAGCCGGCTTCTGTCGGCCAGCAGCCAGGAATCTGGAAACAATGGCTACAGCGTGC • Mapped reads: AACGCGGCCAGCCGGCTTCTGTCGGCCAGCCGGCAG, CGCGGCCAGCCGGCTTCTGTCGGCCAGCAGCCCGGA, GCGGCCAGCCGGCTTCTGTCGGCCAGCCGGCAGGGA, GCCAGCCGGCTTCTGTCGGCCAGCAGCCAGGAATCT, GCCGGCTTCTGTCGGCCAGCAGCCAGGAATCTGGAA, CTTCTGTCGGCCAGCCGGCAGGAATCTGGAAACAAT, CGGCCAGCAGCCAGGAATCTGGAAACAATGGCTACA, CCAGCAGCCAGGAATCTGGAAACAATGGCTACAGCG, CAAGCAGCCAGGAATCTGGAAACAATGGCTACAGCG, GCAGCCAGGAATCTGGAAACAATGGCTACAGCGTGC • Ri: set of mapped reads covering locus i • r(i): base call of read r at locus i • εr(i): probability of error in base call r(i) • Gi: genotype at locus i

  16. SNV Detection and Genotyping • Use Bayes' rule to calculate the posterior probability of each genotype and pick the genotype with the largest posterior

  17. SNV Detection and Genotyping • Calculate conditional probabilities by multiplying contributions of individual reads
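The two genotyping slides above can be illustrated with a minimal sketch. It assumes a diploid genotype, a uniform genotype prior by default, and a uniform substitution error model in which a wrong base call is equally likely to be any of the other three bases. These modeling choices and all function names are illustrative assumptions, not taken from the slides.

```python
from itertools import combinations_with_replacement
from math import exp, log

BASES = "ACGT"


def base_likelihood(call, allele, err):
    """P(base call | true allele), assuming errors are uniform over the 3 other bases."""
    return 1.0 - err if call == allele else err / 3.0


def genotype_posteriors(calls, errs, prior=None):
    """Posterior over the 10 unordered diploid genotypes at one locus.

    calls: base calls r(i) of the reads covering the locus
    errs:  error probabilities of those calls, each assumed to lie strictly in (0, 1)
    prior: optional dict genotype -> prior probability (uniform if omitted)
    """
    genotypes = list(combinations_with_replacement(BASES, 2))
    if prior is None:
        prior = {g: 1.0 / len(genotypes) for g in genotypes}
    log_post = {}
    for g in genotypes:
        lp = log(prior[g])
        # Multiply the contributions of individual reads (sum in log space);
        # each read is assumed to come from either allele with probability 1/2.
        for call, err in zip(calls, errs):
            lp += log(0.5 * base_likelihood(call, g[0], err)
                      + 0.5 * base_likelihood(call, g[1], err))
        log_post[g] = lp
    # Normalize (Bayes' rule), subtracting the maximum for numerical stability.
    top = max(log_post.values())
    weights = {g: exp(lp - top) for g, lp in log_post.items()}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}


def call_genotype(calls, errs, prior=None):
    """Return the genotype with the largest posterior and that posterior."""
    post = genotype_posteriors(calls, errs, prior)
    best = max(post, key=post.get)
    return best, post[best]
```

For example, `call_genotype("AAAACCCC", [0.01] * 8)` evaluates a locus covered by four A calls and four C calls, each with a 1% error probability.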

  18. Data Filtering

  19. Accuracy per RPKM bins

  21. Haplotyping • Human somatic cells are diploid, containing two sets of nearly identical chromosomes, one set derived from each parent. • Example pair of haplotypes with heterozygous variants: ACGTTACATTGCCACTCAATC--TGGA and ACGTCACATTG-CACTCGATCGCTGGA

  22. Haplotyping

  23. RefHap Algorithm • Reduce the problem to Max-Cut • Solve Max-Cut • Build haplotypes according to the cut • (Figure: weighted fragment graph with edge weights 4, -1, 1, 3, 1, 2, 1, -1, 3; resulting haplotypes h1 = 00110, h2 = 11001)
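The three steps on the RefHap slide can be sketched as follows. This is a hedged illustration, not the RefHap implementation: the edge weight (mismatches minus matches on shared SNV positions), the simple local-search Max-Cut heuristic, and all function names are assumptions chosen for brevity. Fragments are strings over `0`, `1` and `-` (position not covered).

```python
def edge_weight(f1, f2):
    """Mismatches minus matches over SNV positions covered by both fragments."""
    w = 0
    for a, b in zip(f1, f2):
        if a != "-" and b != "-":
            w += 1 if a != b else -1
    return w


def local_search_max_cut(weights):
    """Greedy local search: flip a vertex while doing so increases the cut weight."""
    n = len(weights)
    side = [0] * n
    improved = True
    while improved:
        improved = False
        for v in range(n):
            # Flipping v cuts its same-side edges (+w) and uncuts its crossing edges (-w).
            gain = sum(weights[v][u] if side[u] == side[v] else -weights[v][u]
                       for u in range(n) if u != v)
            if gain > 0:
                side[v] ^= 1
                improved = True
    return side


def build_haplotypes(fragments, side):
    """Majority vote per SNV; fragments on side 1 vote for the complementary allele."""
    h1 = []
    for j in range(len(fragments[0])):
        votes = 0
        for frag, s in zip(fragments, side):
            if frag[j] != "-":
                votes += 1 if (int(frag[j]) ^ s) == 1 else -1
        h1.append("1" if votes > 0 else "0")
    h1 = "".join(h1)
    h2 = "".join("1" if c == "0" else "0" for c in h1)
    return h1, h2


def phase(fragments):
    """Reduce to Max-Cut, solve it heuristically, and build the two haplotypes."""
    n = len(fragments)
    weights = [[edge_weight(fragments[u], fragments[v]) if u != v else 0
                for v in range(n)] for u in range(n)]
    return build_haplotypes(fragments, local_search_max_cut(weights))
```

Calling `phase(["001--", "-011-", "--110", "110--", "-100-", "--001"])` phases six error-free fragments sampled from the haplotype pair shown on the slide. Local search only guarantees a locally optimal cut, so noisy data may need a stronger Max-Cut heuristic.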

  25. Immunology Background • Source: J.W. Yewdell, E. Reits and J. Neefjes. Making sense of mass destruction: quantitating MHC class I antigen presentation. Nature Reviews Immunology, 3:952-961, 2003

  26. Epitope Prediction C. Lundegaard et al. MHC Class I Epitope Binding Prediction Trained on Small Data Sets. In Lecture Notes in Computer Science, 3239:217-225, 2004

  27. Results on Tumor Data

  28. Experimental Validation • Mutations reported by [Noguchi et al. 94] were found by the pipeline • Sanger sequencing confirmed 18 out of 20 mutations for MethA and 26 out of 28 mutations for CMS5 • Immunogenic potential under experimental validation in the Srivastava lab at UCHC

  29. Outline • Background on high-throughput sequencing • Identification of tumor-specific epitopes • Estimation of gene and isoform expression levels • Viral quasispecies reconstruction • Future work

  30. RNA-Seq • Make cDNA & shatter into fragments • Sequence fragment ends • Map reads • Analysis tasks: Gene Expression (GE), Isoform Expression (IE), Isoform Discovery (ID) • (Figure: reads mapped to exons A B C D E and to candidate isoforms)

  31. Alternative Splicing [Griffith and Marra 07]

  32. Challenges to Accurate Estimation of Gene Expression Levels • Read ambiguity (multireads) • What is the gene length? • (Figure: gene with exons A B C D E)

  33. Previous approaches to GE • Ignore multireads • [Mortazavi et al. 08]: Fractionally allocate multireads based on unique read estimates • [Pasaniuc et al. 10]: EM algorithm for solving ambiguities • Gene length: sum of lengths of exons that appear in at least one isoform => Underestimates expression levels for genes with 2 or more isoforms [Trapnell et al. 10]
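As a rough illustration of fractional multiread allocation, the sketch below splits each multiread among its candidate genes in proportion to their unique-read counts. It is a simplification: the published method may weight by length-normalized expression rather than raw counts, and the function name and input layout are hypothetical.

```python
def allocate_multireads(unique_counts, multireads):
    """Split each multiread among its candidate genes by unique-read counts.

    unique_counts: dict gene -> number of uniquely mapped reads
    multireads:    list of tuples, each listing the genes one multiread maps to
    Returns a dict gene -> fractional read count.
    """
    totals = {gene: float(count) for gene, count in unique_counts.items()}
    for genes in multireads:
        weights = [unique_counts.get(g, 0) for g in genes]
        total_weight = sum(weights)
        for gene, weight in zip(genes, weights):
            # With no unique-read evidence at all, fall back to an even split.
            share = weight / total_weight if total_weight > 0 else 1.0 / len(genes)
            totals[gene] = totals.get(gene, 0.0) + share
    return totals
```

With `unique_counts = {"A": 30, "B": 10}`, each multiread shared by A and B contributes 0.75 to A and 0.25 to B. Ignoring multireads corresponds to returning `unique_counts` unchanged.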

  34. Read Ambiguity in IE • (Figure: reads mapped to exons A B C D E that are compatible with more than one isoform, e.g. A C)

  35. Previous approaches to IE • [Jiang&Wong 09]: Poisson model + importance sampling, single reads • [Richard et al. 10]: EM algorithm based on Poisson model, single reads in exons • [Li et al. 10]: EM algorithm, single reads • [Feng et al. 10]: Convex quadratic program, pairs used only for ID • [Trapnell et al. 10]: Extends Jiang’s model to paired reads; fragment length distribution

  36. Our contribution • Unified probabilistic model and Expectation-Maximization Algorithm for IE considering • Single and/or paired reads • Fragment length distribution • Strand information • Base quality scores

  37. Read-Isoform Compatibility

  38. Fragment length distribution • Paired reads • (Figure: a read pair mapped to isoforms A B C and A C; the implied fragment lengths i and j have probabilities Fa(i) and Fa(j))

  39. Fragment length distribution • Single reads • (Figure: a single read mapped to isoforms A B C and A C, labeled with lengths i and j and the values Fa(i) and Fa(j))

  40. IsoEM algorithm E-step M-step
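The E-step and M-step equations did not survive in this transcript. The sketch below shows a generic EM iteration of this kind under stated assumptions: each read carries a precomputed compatibility weight per isoform (which could fold in fragment length, strand and base quality), the E-step distributes each read among isoforms in proportion to weight times current frequency, and the M-step sets frequencies proportional to expected read count divided by isoform length. It is not claimed to match IsoEM exactly, and all names are hypothetical.

```python
def isoform_em(read_weights, lengths, iterations=100):
    """Estimate relative isoform frequencies with a simple EM loop.

    read_weights: list of dicts, one per read, mapping isoform -> compatibility weight
    lengths:      dict isoform -> (effective) length, assumed positive
    Assumes at least one read is compatible with some isoform.
    """
    isoforms = list(lengths)
    freq = {j: 1.0 / len(isoforms) for j in isoforms}
    for _ in range(iterations):
        # E-step: expected number of reads originating from each isoform.
        expected = {j: 0.0 for j in isoforms}
        for weights in read_weights:
            denom = sum(w * freq[j] for j, w in weights.items())
            if denom == 0:
                continue  # read is incompatible with every isoform that has support
            for j, w in weights.items():
                expected[j] += w * freq[j] / denom
        # M-step: frequencies proportional to expected reads per unit length.
        norm = sum(expected[j] / lengths[j] for j in isoforms)
        freq = {j: expected[j] / lengths[j] / norm for j in isoforms}
    return freq
```

For instance, with two isoforms of equal length, 60 reads unique to the first, 20 unique to the second and 20 compatible with both, the ambiguous reads are progressively shifted toward the better-supported isoform. A fixed iteration count is used for simplicity; a convergence test on the change in `freq` would be a natural refinement.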

  41. Error Fraction Curves - Isoforms • 30M single reads of length 25 (simulated)

  42. Error Fraction Curves - Genes • 30M single reads of length 25 (simulated)

  43. Validation on MAQC Samples

  44. Outline • Background on high-throughput sequencing • Identification of tumor-specific epitopes • Estimation of gene and isoform expression levels • Viral quasispecies reconstruction • Future work

  45. Viral Quasispecies • RNA viruses (HIV, HCV) • Many replication mistakes • Quasispecies (qsps) = co-existing closely related variants • Variants differ in • virulence • ability to escape the immune system • resistance to antiviral therapies • tissue tropism • How do qsps contribute to viral persistence and evolution?

  46. 454 Pyrosequencing • Pyrosequencing = Sequencing by Synthesis • GS FLX Titanium: • Fragments (reads): 300-800 bp • Sequence of the reads • System software assembles reads into a single genome • We need software that assembles reads into multiple genomes!

  47. Quasispecies Spectrum Reconstruction (QSR) Problem • Given • pyrosequencing reads from a quasispecies population of unknown size and distribution • Reconstruct the quasispecies spectrum • sequences • frequencies

  48. ViSpA Viral Spectrum Assembler

  49. 454 Sequencing Errors • Error rate ~0.1% • Fixed number of incorporated bases vs. light intensity value • Incorrect resolution of homopolymers => • over-calls (insertions): 65-75% of errors • under-calls (deletions): 20-30% of errors

  50. Preprocessing of Aligned Reads • Deletions in reads (D): replace a deletion supported by only a single read with the allele present in all other reads, or with N if the other reads disagree • Insertions into the reference (I): remove insertions supported by only a single read • Imputation of missing values N
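The deletion and insertion rules above can be sketched on a column-wise alignment. The representation is an assumption made for illustration: the reference and every read are strings of equal length, `-` marks a gap, `.` marks a position a read does not cover, and a `-` in the reference marks an insertion column. Imputation of the resulting N values is not covered, and the function name is hypothetical.

```python
def preprocess_alignment(reference, reads):
    """Correct deletions and insertions that are supported by only a single read.

    reference: aligned reference string ('-' marks an insertion column)
    reads:     aligned read strings ('-' = gap, '.' = position not covered)
    Returns the cleaned reference and reads.
    """
    rows = [list(r) for r in reads]
    keep = []
    for col, ref_base in enumerate(reference):
        covering = [row for row in rows if row[col] != "."]
        if ref_base == "-":
            # Insertion column: drop it unless at least two reads carry the insertion.
            inserted = [row for row in covering if row[col] != "-"]
            if len(inserted) <= 1:
                continue
        else:
            deleted = [row for row in covering if row[col] == "-"]
            if len(deleted) == 1:
                # Single-read deletion: use the allele shared by all other reads, else N.
                others = {row[col] for row in covering if row[col] != "-"}
                deleted[0][col] = others.pop() if len(others) == 1 else "N"
        keep.append(col)
    new_reference = "".join(reference[c] for c in keep)
    new_reads = ["".join(row[c] for c in keep) for row in rows]
    return new_reference, new_reads
```

For example, `preprocess_alignment("AC-GT", ["ACTGT", "AC-G-", "AC-GT", ".C-GT"])` drops the insertion column carried by one read and fills the lone deletion in the second read. Deletions and insertions seen in two or more reads are left untouched, since they may reflect true variants.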
