Next-Generation Sequencing: Challenges and Opportunities

Ion Mandoiu, Computer Science and Engineering Department, University of Connecticut

Outline • Background on high-throughput sequencing • Identification of tumor-specific epitopes • Estimation of gene and isoform expression levels • Viral quasispecies reconstruction • Future work

Advances in High-Throughput Sequencing (HTS) • Roche/454 FLX Titanium: 400-600 million bases/run, 400bp avg. read length • Illumina HiSeq 2000: up to 6 billion PE reads/run, 35-100bp read length • SOLiD 4: 1.4-2.4 billion PE reads/run, 35-50bp read length • Source: http://www.economist.com/node/16349358

Illumina Workflow – Library Preparation • Input: mRNA or genomic DNA

Illumina Workflow – Cluster Generation

Illumina Workflow – Sequencing by Synthesis

Cost of Whole Genome Sequencing • C. Venter: Sanger@7.5x • J. Watson: 454@7.4x • NA18507: Illumina@36x, SOLiD@12x

HTS applications • HTS is a transformative technology • Numerous applications besides de novo genome sequencing: RNA-Seq, non-coding RNAs, ChIP-Seq, epigenetics, structural variation, metagenomics, paleogenomics, …

Genomics-Guided Cancer Immunotherapy • Tumor mRNA Sequencing (reads: CTCAATTGATGAAATTGTTCTGAAACT, GCAGAGATAGCTAAAGGATACCGGGTT, CCGGTATCCTTTAGCTATCTCTGCCTC, CTGACACCATCTGTGTGGGCTACCATG, …, AGGCAAGCTCATGGCCAAATCATGAGA) • Tumor Specific Epitopes (SYFPEITHI, ISETDLSLL, CALRRNESL, …) • Peptide Synthesis • Immune System Stimulation • Tumor Remission • Mouse Image Source: http://www.clker.com/clipart-simple-cartoon-mouse-2.html

Bioinformatics Pipeline (stage: output) • Input: tumor mRNA reads • CCDS Mapping: CCDS mapped reads • Genome Mapping: genome mapped reads • Read Merging: mapped reads • SNVs Detection: tumor-specific SNVs • Haplotyping: close SNV haplotypes • Epitope Prediction: tumor-specific epitopes • Primers Design: primers for Sanger sequencing

Mapping mRNA Reads http://en.wikipedia.org/wiki/File:RNA-Seq-alignment.png

Read Merging

SNV Detection and Genotyping • Figure: reads R_i aligned to the reference and covering locus i • r(i): base call of read r at locus i • ε_r(i): probability of error in base call r(i) • G_i: genotype at locus i

SNV Detection and Genotyping • Use Bayes rule to calculate posterior probabilities and pick the genotype with the largest one

SNV Detection and Genotyping • Calculate conditional probabilities by multiplying contributions of individual reads
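
The two steps above (Bayes' rule, then a per-read product of conditional probabilities) can be sketched as follows. This is only an illustration under stated assumptions, none of which come from the slides: a diploid genotype G_i, a uniform prior over the ten unordered genotypes, independent reads, and an erroneous base call that is equally likely to be any of the three other bases. All function names are hypothetical.

```python
from itertools import combinations_with_replacement
from math import exp, log

BASES = "ACGT"

def read_likelihood(base, err, genotype):
    """P(r(i) = base | G_i = genotype) for one read with error probability err.

    Assumes 0 < err < 1 and that each allele is sampled with probability 1/2.
    """
    total = 0.0
    for allele in genotype:
        total += 0.5 * ((1.0 - err) if base == allele else err / 3.0)
    return total

def genotype_posteriors(calls, prior=None):
    """calls: list of (base, error_probability) pairs for reads covering locus i."""
    genotypes = list(combinations_with_replacement(BASES, 2))
    if prior is None:
        # Assumption: uniform prior over the ten unordered diploid genotypes.
        prior = {g: 1.0 / len(genotypes) for g in genotypes}
    log_joint = {}
    for g in genotypes:
        lj = log(prior[g])
        # Multiply the contributions of individual reads (sum in log space).
        for base, err in calls:
            lj += log(read_likelihood(base, err, g))
        log_joint[g] = lj
    # Normalize with the max subtracted to avoid underflow at high coverage.
    m = max(log_joint.values())
    unnorm = {g: exp(v - m) for g, v in log_joint.items()}
    z = sum(unnorm.values())
    return {g: v / z for g, v in unnorm.items()}

def call_genotype(calls):
    """Return the genotype with the largest posterior and that posterior."""
    post = genotype_posteriors(calls)
    best = max(post, key=post.get)
    return best, post[best]
```

For example, `call_genotype([("A", 0.01)] * 6 + [("C", 0.01)] * 5)` selects the heterozygous genotype `("A", "C")`, while ten `A` calls select `("A", "A")`.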

Data Filtering

Accuracy per RPKM bins

Haplotyping • Human somatic cells are diploid, containing two sets of nearly identical chromosomes, one set derived from each parent. • Heterozygous variants distinguish the two copies: ACGTTACATTGCCACTCAATC--TGGA vs. ACGTCACATTG-CACTCGATCGCTGGA

Haplotyping

RefHap Algorithm • Reduce the problem to Max-Cut • Solve Max-Cut • Build haplotypes according to the cut • Figure: graph with edge weights 4, -1, 1, 3, 1, 2, 1, -1, 3 and haplotypes h1 = 00110, h2 = 11001
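
A minimal sketch of the three steps, assuming fragments are strings over 0/1 with "-" for uncovered positions. The edge weight (mismatches minus matches), the single-vertex local search used to approximate Max-Cut, and the majority-vote consensus are generic stand-ins chosen for illustration; they are not claimed to be RefHap's own scoring or heuristic. All names are hypothetical.

```python
def pair_weight(f, g):
    """Mismatches minus matches over positions covered by both fragments."""
    w = 0
    for a, b in zip(f, g):
        if a != "-" and b != "-":
            w += 1 if a != b else -1
    return w

def local_search_max_cut(weights):
    """Flip single vertices while the cut weight strictly increases."""
    n = len(weights)
    side = [0] * n
    improved = True
    while improved:
        improved = False
        for v in range(n):
            # Moving v turns same-side edges into cut edges and vice versa.
            gain = sum(weights[v][u] * (1 if side[u] == side[v] else -1)
                       for u in range(n) if u != v)
            if gain > 0:
                side[v] = 1 - side[v]
                improved = True
    return side

def build_haplotypes(fragments, side):
    """Majority vote per position; fragments on side 1 vote with flipped alleles."""
    h1 = []
    for j in range(len(fragments[0])):
        votes = 0  # positive favors allele 1 on h1
        for f, s in zip(fragments, side):
            if f[j] == "-":
                continue
            allele = int(f[j]) if s == 0 else 1 - int(f[j])
            votes += 1 if allele == 1 else -1
        h1.append("1" if votes > 0 else "0")
    h1 = "".join(h1)
    h2 = "".join("1" if c == "0" else "0" for c in h1)
    return h1, h2

def phase(fragments):
    n = len(fragments)
    weights = [[pair_weight(fragments[a], fragments[b]) for b in range(n)]
               for a in range(n)]
    return build_haplotypes(fragments, local_search_max_cut(weights))
```

With the made-up, error-free fragments `["001--", "110--", "-011-", "-100-", "--110", "--001"]`, `phase` recovers the haplotype pair 00110 / 11001 used as the example on the slide.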

Immunology Background J.W. Yewdell, E. Reits and J. Neefjes. Making sense of mass destruction: quantitating MHC class I antigen presentation. Nature Reviews Immunology, 3:952-961, 2003

Epitope Prediction C. Lundegaard et al. MHC Class I Epitope Binding Prediction Trained on Small Data Sets. In Lecture Notes in Computer Science, 3239:217-225, 2004

Results on Tumor Data

Experimental Validation • Mutations reported by [Noguchi et al. 94] were found by the pipeline • Sanger sequencing confirmed 18 out of 20 mutations for MethA and 26 out of 28 mutations for CMS5 • Immunogenic potential is under experimental validation in the Srivastava lab at UCHC

RNA-Seq • Make cDNA & shatter into fragments • Sequence fragment ends • Map reads • Tasks: Gene Expression (GE), Isoform Expression (IE), Isoform Discovery (ID) • Figure labels: exons A B C D E; isoforms A B C, A C, D E

Alternative Splicing [Griffith and Marra 07]

Challenges to Accurate Estimation of Gene Expression Levels • Read ambiguity (multireads) • What is the gene length? A B C D E

Previous approaches to GE • Ignore multireads • [Mortazavi et al. 08]: fractionally allocate multireads based on unique read estimates • [Pasaniuc et al. 10]: EM algorithm for solving ambiguities • Gene length: sum of lengths of exons that appear in at least one isoform → underestimates expression levels for genes with 2 or more isoforms [Trapnell et al. 10]
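
Fractional allocation of multireads can be illustrated with a short sketch. It assumes each multiread is split among its candidate genes in proportion to their unique-read density (unique reads per base), with an even split when no candidate has unique reads. The weighting, the fallback, and the function name are assumptions made for this example and are not taken from the cited papers.

```python
def allocate_multireads(unique_counts, lengths, multireads):
    """Split each multiread among its candidate genes.

    unique_counts: dict gene -> number of uniquely mapped reads
    lengths:       dict gene -> gene length in bases
    multireads:    list of lists, each holding the genes a read maps to
    """
    # Unique-read density serves as the initial expression estimate.
    density = {g: unique_counts[g] / lengths[g] for g in unique_counts}
    counts = {g: float(c) for g, c in unique_counts.items()}
    for genes in multireads:
        total = sum(density[g] for g in genes)
        for g in genes:
            # Even split when none of the candidate genes has unique reads.
            share = density[g] / total if total > 0 else 1.0 / len(genes)
            counts[g] += share
    return counts
```

For two genes of equal length with 30 and 10 unique reads, four shared multireads are split 3:1, giving counts of about 33 and 11. Ignoring the multireads instead would leave the counts at 30 and 10.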

Read Ambiguity in IE A B C D E A C

Previous approaches to IE • [Jiang & Wong 09]: Poisson model + importance sampling, single reads • [Richard et al. 10]: EM algorithm based on Poisson model, single reads in exons • [Li et al. 10]: EM algorithm, single reads • [Feng et al. 10]: convex quadratic program, pairs used only for ID • [Trapnell et al. 10]: extends Jiang’s model to paired reads, fragment length distribution

Our contribution • Unified probabilistic model and Expectation-Maximization Algorithm for IE considering • Single and/or paired reads • Fragment length distribution • Strand information • Base quality scores

Read-Isoform Compatibility

Fragment length distribution • Paired reads • Figure: isoforms A B C and A C, with implied fragment lengths i and j and probabilities Fa(i) and Fa(j)

Fragment length distribution • Single reads • Figure: isoforms A B C and A C, with implied fragment lengths i and j and probabilities Fa(i) and Fa(j)

IsoEM algorithm • E-step • M-step
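
The slide's formulas did not survive extraction, so the following is a generic EM sketch for isoform frequencies rather than a reproduction of IsoEM. It assumes each read r carries a weight w(r, j) for every isoform j it is compatible with (standing in for fragment length, strand and base quality terms), and that each isoform has a known positive effective length. The symbols f(j) and n(j), the parameter names, and the fixed iteration count are all assumptions of this example.

```python
def em_isoform_frequencies(read_weights, eff_len, iterations=200):
    """Estimate isoform frequencies f(j) by Expectation-Maximization.

    read_weights: one dict per read, mapping compatible isoform -> weight w(r, j)
    eff_len:      dict isoform -> effective length (> 0)
    Assumes at least one read has a nonzero weight.
    """
    isoforms = list(eff_len)
    f = {j: 1.0 / len(isoforms) for j in isoforms}
    for _ in range(iterations):
        # E-step: expected number of reads n(j) generated by each isoform.
        n = {j: 0.0 for j in isoforms}
        for w in read_weights:
            total = sum(f[j] * w[j] for j in w)
            if total == 0:
                continue
            for j in w:
                n[j] += f[j] * w[j] / total
        # M-step: frequencies proportional to expected reads per unit length.
        s = sum(n[j] / eff_len[j] for j in isoforms)
        f = {j: (n[j] / eff_len[j]) / s for j in isoforms}
    return f
```

With 60 reads compatible only with isoform ABC (effective length 300) and 40 reads compatible only with AC (effective length 200), both isoforms have the same read density, so the estimated frequencies are 0.5 each.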

Error Fraction Curves - Isoforms • 30M single reads of length 25 (simulated)

Error Fraction Curves - Genes • 30M single reads of length 25 (simulated)

Validation on MAQC Samples

Viral Quasispecies • RNA viruses (HIV, HCV) • Many replication mistakes • Quasispecies (qsps) = co-existing closely related variants • Variants differ in • virulence • ability to escape the immune system • resistance to antiviral therapies • tissue tropism • How do qsps contribute to viral persistence and evolution?

454 Pyrosequencing • Pyrosequencing = sequencing by synthesis • GS FLX Titanium: • Fragments (reads): 300-800 bp • Sequence of the reads • System software assembles reads into a single genome • We need software that assembles reads into multiple genomes!

Quasispecies Spectrum Reconstruction (QSR) Problem • Given • pyrosequencing reads from a quasispecies population of unknown size and distribution • Reconstruct the quasispecies spectrum • sequences • frequencies

ViSpA: Viral Spectrum Assembler

454 Sequencing Errors • Error rate ~0.1% • Fixed number of incorporated bases vs. light intensity value • Incorrect resolution of homopolymers => • over-calls (insertions): 65-75% of errors • under-calls (deletions): 20-30% of errors

Preprocessing of Aligned Reads • Deletions in reads (D): replace a deletion confirmed by only a single read with either the allele value present in all other reads or N • Insertions into reference (I): remove insertions confirmed by only a single read • Imputation of missing values (N)
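
The deletion and insertion rules above can be sketched as follows. The data layout is an assumption made for this example: each read is a `(start, seq)` pair aligned to reference coordinates with "-" marking a deletion, and insertions are kept separately as `(reference_position, inserted_bases)` pairs per read. The function names are hypothetical, and the imputation of N values is not covered here.

```python
from collections import Counter

def fix_single_read_deletions(reads):
    """Repair deletions that only one read supports.

    reads: list of (start, seq); seq is aligned to the reference, '-' = deletion.
    """
    # Collect, for every reference column, the characters of all covering reads.
    columns = {}
    for idx, (start, seq) in enumerate(reads):
        for k, ch in enumerate(seq):
            columns.setdefault(start + k, []).append((idx, ch))
    fixed = [list(seq) for _, seq in reads]
    for pos, calls in columns.items():
        deleted = [idx for idx, ch in calls if ch == "-"]
        if len(deleted) != 1:
            continue  # no deletion here, or it is confirmed by several reads
        others = {ch for _, ch in calls if ch != "-"}
        idx = deleted[0]
        # Use the allele shared by all other reads, otherwise mark as unknown.
        fixed[idx][pos - reads[idx][0]] = others.pop() if len(others) == 1 else "N"
    return [(start, "".join(chars)) for (start, _), chars in zip(reads, fixed)]

def drop_single_read_insertions(insertions):
    """Keep only insertions seen in more than one read.

    insertions: one set per read of (reference_position, inserted_bases).
    """
    support = Counter(ins for per_read in insertions for ins in per_read)
    return [{ins for ins in per_read if support[ins] > 1} for per_read in insertions]
```

For instance, in `[(0, "ACGT"), (0, "AC-T"), (1, "CGTA")]` the lone deletion is replaced by `G`, whereas a deletion whose other reads disagree on the allele becomes `N`.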
