Reconstruction of Haplotype Spectra from NGS Data
Ion Mandoiu
UTC Associate Professor in Engineering Innovation
Department of Computer Science & Engineering
University of Connecticut
Haplotype Spectra Reconstruction
• Given NGS reads, reconstruct:
  • Full length sequences
  • Sequence frequencies
• Example applications:
  • Single individual haplotyping
  • Allele specific transcriptome reconstruction
  • Viral quasispecies reconstruction
Single Individual Haplotyping
• Somatic cells are diploid, containing two nearly identical copies of each autosomal chromosome
• Heterozygous loci found by mapping reads to reference genome
• Long haplotype fragments can be generated by sequencing fosmid pools [Duitama et al. 2012]
RefHap Algorithm [Duitama et al. 2012]
• Reduce the problem to Max-Cut
• Solve Max-Cut
• Build haplotypes according to the cut
[Figure: graph on fragments f1-f4 with signed edge weights (-1, 1, 3, 1, -1); the cut yields haplotypes h1 = 00110 and h2 = 11001]
Chr. 22, 32k SNPs, 14k fragments
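The three steps above can be illustrated with a minimal Python sketch. It is not the RefHap implementation: the edge weight (mismatches minus matches on shared SNPs), the single-flip local search, and the four toy fragments are all assumptions chosen for illustration, and the Max-Cut heuristic used in [Duitama et al. 2012] may differ.

```python
from itertools import combinations

def edge_weight(f, g):
    """Mismatches minus matches over SNPs covered by both fragments ('-' = not covered)."""
    w = 0
    for a, b in zip(f, g):
        if a != '-' and b != '-':
            w += 1 if a != b else -1
    return w

def local_search_max_cut(fragments):
    """Assumed heuristic: flip one fragment at a time while the cut weight increases."""
    n = len(fragments)
    w = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        w[i][j] = w[j][i] = edge_weight(fragments[i], fragments[j])
    side = [0] * n
    improved = True
    while improved:
        improved = False
        for v in range(n):
            # Flipping v turns its same-side edges into cut edges and vice versa.
            gain = sum(w[v][u] if side[u] == side[v] else -w[v][u]
                       for u in range(n) if u != v)
            if gain > 0:
                side[v] ^= 1
                improved = True
    return side

def build_haplotypes(fragments, side):
    """Consensus for side 0 (side-1 fragments vote with complemented alleles);
    the second haplotype is the complement of the first."""
    h1 = []
    for pos in range(len(fragments[0])):
        votes = 0
        for f, s in zip(fragments, side):
            if f[pos] == '-':
                continue
            allele = int(f[pos]) ^ s  # complement alleles of side-1 fragments
            votes += 1 if allele == 1 else -1
        h1.append('1' if votes > 0 else '0')
    h1 = ''.join(h1)
    h2 = ''.join('1' if c == '0' else '0' for c in h1)
    return h1, h2

# Hypothetical fragments over 5 heterozygous SNPs.
fragments = ["0011-", "-1001", "--110", "110--"]
side = local_search_max_cut(fragments)
haplotypes = build_haplotypes(fragments, side)
```

On these toy fragments the cut separates fragments 1 and 3 from fragments 2 and 4, and the two consensus haplotypes are 00110 and 11001. Local search only guarantees a locally optimal cut, so on real data it would be one of several possible heuristics.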
Haplotype Spectra Reconstruction
• Given short sequence fragments, reconstruct:
  • Full length sequences
  • Sequence frequencies
• Example applications:
  • Single individual haplotyping
  • Allele specific transcriptome reconstruction
  • Viral quasispecies reconstruction
Transcriptome Reconstruction Challenge: Alternative Splicing [Griffith and Marra 07]
[Figure: a gene with exons 1 2 3 4 5 6 7 and four alternatively spliced transcripts]
t1: 1 2 3 4 5 6 7
t2: 1 3 4 5 6 7
t3: 1 2 3 4 5 7
t4: 1 3 4 5 7
TRIP: Transcriptome Reconstruction using Integer Programming
• Map the RNA-Seq reads to genome
• Construct Splice Graph - G(V,E)
  • V: exons
  • E: splicing events
• Generate candidate transcripts
  • Depth-first-search (DFS)
• Filter candidate transcripts
  • Fragment length distribution (FLD)
  • Integer programming
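The candidate-generation step can be sketched as a depth-first enumeration of source-to-sink paths in the splice graph. This is a minimal illustration, not TRIP's code: the adjacency-dict representation, the function name `enumerate_transcripts`, and the toy graph (exons 2 and 6 each optionally skipped) are assumptions.

```python
def enumerate_transcripts(splice_graph, sources, sinks):
    """Return every source-to-sink exon path, found by depth-first search.

    splice_graph maps an exon id to the exons reachable from it by one
    splicing event; it is assumed to be acyclic (exons ordered along the genome).
    """
    paths = []

    def dfs(exon, path):
        path.append(exon)
        if exon in sinks:
            paths.append(tuple(path))
        for nxt in splice_graph.get(exon, ()):
            dfs(nxt, path)
        path.pop()  # backtrack

    for s in sources:
        dfs(s, [])
    return paths

# Hypothetical splice graph over exons 1..7.
splice_graph = {1: [2, 3], 2: [3], 3: [4], 4: [5], 5: [6, 7], 6: [7]}
candidates = enumerate_transcripts(splice_graph, sources=[1], sinks={7})
```

With two independent optional exons the sketch yields four candidate transcripts. The number of paths can grow exponentially with the number of splicing events, which is one reason a filtering step follows.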
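How to filter?
• Select the smallest set of putative transcripts that yields a good statistical fit between the fragment length distributions
  • empirically determined during library preparation
  • implied by “mapping” read pairs
[Figure: transcript 1-2-3 (segments of 200, 200, 200) with implied fragment length 500; transcript 1-3 (segments of 200, 200) with implied fragment length 300; fragment length distribution: mean 500, std. dev. 50]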
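To make the idea of an "implied" fragment length concrete, the sketch below computes the fragment length a read pair would have under each candidate transcript and compares it with the empirical distribution through a z-score. The coordinate convention (exon id plus 0-based offset), the function names, and the choice of a z-score are assumptions for illustration; the integer program that TRIP actually solves is not shown on the slide and is not reproduced here.

```python
def implied_fragment_length(transcript, exon_len, start, end):
    """Fragment length if the read pair originates from `transcript`.

    start = (exon id, 0-based offset of the fragment's first base in that exon)
    end   = (exon id, 0-based offset of the fragment's last base in that exon)
    Returns None when the transcript cannot contain the pair.
    """
    (s_exon, s_off), (e_exon, e_off) = start, end
    if s_exon not in transcript or e_exon not in transcript:
        return None
    i, j = transcript.index(s_exon), transcript.index(e_exon)
    if i > j:
        return None
    if i == j:
        return e_off - s_off + 1
    length = exon_len[s_exon] - s_off                          # rest of the first exon
    length += sum(exon_len[x] for x in transcript[i + 1:j])    # whole exons in between
    length += e_off + 1                                        # prefix of the last exon
    return length

def z_score(length, mean, sd):
    """Standardized deviation from the empirical fragment length distribution."""
    return (length - mean) / sd

# Hypothetical setting: three exons of 200 bases, FLD with mean 500 and std. dev. 50.
exon_len = {1: 200, 2: 200, 3: 200}
start, end = (1, 50), (3, 149)
len_123 = implied_fragment_length((1, 2, 3), exon_len, start, end)
len_13 = implied_fragment_length((1, 3), exon_len, start, end)
```

Here the pair implies a 500-base fragment on transcript 1-2-3 (z-score 0) and a 300-base fragment on transcript 1-3 (z-score -4), so under these assumed numbers the pair is much better explained by the transcript that includes exon 2.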
Haplotype Spectra Reconstruction
• Given short sequence fragments, reconstruct:
  • Full length sequences
  • Sequence frequencies
• Example applications:
  • Single individual haplotyping
  • Allele specific transcriptome reconstruction
  • Viral quasispecies reconstruction
RNA Virus Replication
• High mutation rate (~10^-4)
[Lauring & Andino, PLoS Pathogens 2011]
Shotgun vs. Amplicon Reads
• Shotgun reads starting positions distributed ~uniformly
• Amplicon reads have predefined start/end positions covering fixed overlapping windows
Reconstruction from Shotgun Reads: ViSpA
[Figure: pipeline] Shotgun reads → Read Error Correction → Read Alignment → Preprocessing of Aligned Reads → Read Graph Construction → Contig Assembly → Frequency Estimation → Quasispecies sequences w/ frequencies
Reconstruction from Amplicon Reads: VirA
[Figure: pipeline] Error-corrected read data (SAM/BAM) + Reference in FASTA format → Estimate Amplicons → Amplicon Read Graph → Max-Bandwidth Paths → Frequency Estimation → Viral population variants with frequencies
Amplicon Read Graph
• K amplicons represented by K-layer read graph
• Vertices ⇔ distinct reads
• Edges ⇔ reads with consistent overlap
• Vertices have count function c(v)
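A minimal sketch of such a layered graph follows. It assumes reads are already error-corrected and trimmed to their amplicon window, that the overlap between consecutive windows is a known positive number of bases, and that "consistent overlap" means exact agreement on those bases; the function name and data layout are hypothetical and are not taken from VirA.

```python
from collections import Counter

def build_amplicon_read_graph(layers, overlap):
    """Build a K-layer read graph.

    layers[k]  : list of read strings observed for amplicon k
    overlap[k] : number of bases (> 0) shared by amplicon windows k and k+1
    Returns (count, edges) where count[(k, read)] is c(v) for that vertex and
    edges holds pairs ((k, u), (k + 1, v)) of reads that agree on the overlap.
    """
    count = {}
    for k, reads in enumerate(layers):
        for read, c in Counter(reads).items():
            count[(k, read)] = c  # one vertex per distinct read

    edges = set()
    for k in range(len(layers) - 1):
        ov = overlap[k]
        for u in set(layers[k]):
            for v in set(layers[k + 1]):
                if u[-ov:] == v[:ov]:  # suffix of u matches prefix of v
                    edges.add(((k, u), (k + 1, v)))
    return count, edges

# Hypothetical data: two amplicons whose windows overlap by 2 bases.
layers = [["ACGT", "ACGT", "ACGA"], ["GTTT", "GATT", "GTTT"]]
count, edges = build_amplicon_read_graph(layers, overlap=[2])
```

In this toy input each layer has two distinct reads, and only the pairs that agree on the 2-base overlap are connected, so the graph has four vertices and two edges.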
Read Graph Transformation
• Heuristic to reduce edges in dense graphs
• Replace bipartite cliques with star subgraphs
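The edge saving from this replacement can be shown with a short sketch: a complete bipartite subgraph between vertex sets A and B has |A|·|B| edges, while a star through one added hub vertex needs only |A| + |B|. How the bipartite cliques are detected is not described on the slide, so the sketch takes them as given; the function name and the hub vertex label are hypothetical.

```python
def replace_biclique_with_star(edges, left, right, hub):
    """Replace all left-to-right edges of a complete bipartite subgraph
    with edges left -> hub and hub -> right.

    edges : set of directed (u, v) pairs
    left, right : vertex sets forming the bipartite clique
    hub : label of the new center vertex (must not already be in the graph)
    """
    biclique = {(u, v) for u in left for v in right}
    if not biclique <= edges:
        raise ValueError("left x right is not a complete bipartite subgraph")
    new_edges = edges - biclique
    new_edges |= {(u, hub) for u in left}
    new_edges |= {(hub, v) for v in right}
    return new_edges

# Hypothetical 3 x 3 bipartite clique: 9 edges before, 6 after.
left = {"a1", "a2", "a3"}
right = {"b1", "b2", "b3"}
edges = {(u, v) for u in left for v in right}
star_edges = replace_biclique_with_star(edges, left, right, hub="hub0")
```

Every left vertex still reaches every right vertex through the hub, so path connectivity between the two layers is preserved while the edge count drops from 9 to 6 in this example.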
Challenges
• Scalability
  • Exploit inherent sparsity of biological instances
  • E.g., exact scaffolding algorithm using non-serial dynamic programming based on SPQR trees
• Flexibility
  • Long (noisy) reads + short
  • Heterogeneous data, e.g., RNA-Seq + TSSeq + PolyA-Seq
• Quantifying reconstruction uncertainty
  • Compute intensive, e.g., bootstrapping
Acknowledgements
Sahar Al Seesi, Mazhar Kahn, Rachel O’Neill, Alexander Artyomenko, Adrian Caciula, Nicholas Mancuso, Serghei Mangul, Bassam Tork, Alex Zelikovsky, Jorge Duitama, Irina Astrovskaya, Pavel Skums