Sequencing by Hybridization: Algorithmic Reconstruction. By: Shuai Cheng, Li. Presentation for CS482/682.
Outline • Background and Problem Formulation • Classical Method for Sequencing by Hybridization • Standard Method for Sequencing by Hybridization with Gapped Probes
Sequence Reconstruction • Background • Sequence reconstruction can be done using gel electrophoresis. • Sequences of length several thousand can be reconstructed.
History of SBH • Sequencing by hybridization (SBH) was proposed by several research groups around 1988 and 1989. • SBH is a potential sequencing method • Strezoska et al. (1991) reconstructed a 100 bp DNA sample • Morris and Huang (1999) reconstructed a 125 bp DNA sample • This is far from 1000 bp • Gapped probes • Preparata et al. 2000, a major breakthrough • Can reconstruct up to 10,000 bp theoretically
Model for SBH • First step, biochemical • A chip called a microarray detects (ideally) all the k-mers in a given DNA sample • This step is referred to as hybridization • The set of k-mers is referred to as the spectrum • Each k-mer is referred to as a probe • Second step, combinatorial • Algorithmic reconstruction of the original sequence from the set of k-mers
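The combinatorial side of this model can be illustrated with a minimal Python sketch. It assumes an ideal, error-free hybridization, and the function name `spectrum` is chosen here for illustration only.

```python
def spectrum(sequence, k):
    """Return the set of all length-k substrings (k-mers) of `sequence`.

    This models an ideal chip: every k-mer that occurs is detected,
    and repeated k-mers are reported only once because the result is a set.
    """
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


# Example: the 3-mer spectrum of a short sample.
print(sorted(spectrum("ACGCATC", 3)))
```

Because the spectrum is a set, it carries no information about where each k-mer occurs or how often, which is why reconstruction is a separate algorithmic step.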
SBH, example • DNA sample → hybridization → spectrum for k=3 • Problem: Reconstruct the sequence from the spectrum
SBH, example • Two samples may result in the same spectrum • The reconstruction process may need more information to construct a unique sequence
Outline • Background and Problem Formulation • Classical Method for Sequencing by Hybridization • Standard Method for Sequencing by Hybridization with Gapped Probes
SBH and Eulerian Path • Example: • Spectrum = {ACG, ATC, CAT, CGC, GCA} • Model each (k-1)-mer as a node • Nodes: AC, AT, CA, CG, GC, TC
SBH and Eulerian Path • Example: • Spectrum = {ACG, ATC, CAT, CGC, GCA} • Model each (k-1)-mer of a probe as a node • There is a directed edge <u, v> iff u is the (k-1)-prefix of a probe p (a k-mer) and v is the (k-1)-suffix of the same probe p • Probe ACG gives the edge AC → CG
SBH and Eulerian Path • Example: • Spectrum = {ACG, ATC, CAT, CGC, GCA} • Model each (k-1)-mer as a node • There is a directed edge <u, v> iff u is the (k-1)-prefix of a probe p (a k-mer) and v is the (k-1)-suffix of the same probe p • Probe ATC gives the edge AT → TC
SBH and Eulerian Path • Example: • Spectrum = {ACG, ATC, CAT, CGC, GCA} • Model each (k-1)-mer as a node • There is a directed edge <u, v> iff u is the (k-1)-prefix of a probe p (a k-mer) and v is the (k-1)-suffix of the same probe p • A directed graph G is formed; |V| and |E| are bounded by O(n), where n is the size of the spectrum • Edges: AC → CG, AT → TC, CA → AT, CG → GC, GC → CA
SBH and Eulerian Path • An Eulerian path is a path that traverses each edge of the graph exactly once • AC → CG → GC → CA → AT → TC • The sequence ACGCATC is recovered • The path can be found in O(n) time if one exists • Multiple paths are possible
SBH and Eulerian Path • Algorithm based on Eulerian path • Given the input spectrum S, create a graph G in which • Each vertex represents a (k-1)-prefix or (k-1)-suffix of some length-k probe in S • For each length-k probe, there is an edge from the vertex representing its (k-1)-prefix to the vertex representing its (k-1)-suffix. • Find an Eulerian path of G, and reconstruct the sequence from the path
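The algorithm above can be sketched in Python as follows. This is a minimal illustration rather than the presenter's implementation: it uses Hierholzer's method to find the Eulerian path, assumes an error-free spectrum in which each k-mer occurs once, and the name `reconstruct` is hypothetical.

```python
from collections import defaultdict


def reconstruct(kmers):
    """Return one sequence whose k-mers are exactly `kmers`, or None."""
    kmers = sorted(set(kmers))
    if not kmers:
        return ""
    graph = defaultdict(list)      # (k-1)-mer -> list of successor (k-1)-mers
    balance = defaultdict(int)     # out-degree minus in-degree per node
    for kmer in kmers:
        prefix, suffix = kmer[:-1], kmer[1:]
        graph[prefix].append(suffix)
        balance[prefix] += 1
        balance[suffix] -= 1

    # An Eulerian path needs at most one node with one surplus outgoing edge.
    starts = [node for node, b in balance.items() if b == 1]
    if any(abs(b) > 1 for b in balance.values()) or len(starts) > 1:
        return None
    start = starts[0] if starts else kmers[0][:-1]

    # Hierholzer's method: walk until stuck, then back up and splice.
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if graph[node]:
            stack.append(graph[node].pop())
        else:
            path.append(stack.pop())
    path.reverse()

    if len(path) != len(kmers) + 1:   # some edge was unreachable
        return None
    return path[0] + "".join(node[-1] for node in path[1:])


print(reconstruct({"ACG", "ATC", "CAT", "CGC", "GCA"}))
```

Each edge is pushed and popped once, so the running time is linear in the size of the spectrum, in line with the O(n) bound stated earlier.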
Uniqueness • ATGCGTGGCA and ATGGCGTGCA have the same spectrum • Spectrum = {ATG, TGC, GCG, CGT, GTG, TGG, GGC, GCA} • Nodes: AT, TG, GC, CG, GT, GG, CA • Each sequence corresponds to a different Eulerian path in the same graph
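The ambiguity can be checked by enumerating every Eulerian path. The sketch below uses plain backtracking, which is exponential in general and is only meant for toy inputs; the function name `all_reconstructions` is an assumption made for illustration, and k is assumed to be at least 2.

```python
def all_reconstructions(kmers):
    """Return every sequence that uses each k-mer in `kmers` exactly once."""
    kmers = frozenset(kmers)
    k = len(next(iter(kmers)))
    results = set()

    def extend(sequence, unused):
        if not unused:
            results.add(sequence)
            return
        tail = sequence[-(k - 1):]          # last (k-1) symbols so far
        for kmer in unused:
            if kmer[:-1] == tail:           # k-mer overlaps the current tail
                extend(sequence + kmer[-1], unused - {kmer})

    for first in kmers:                     # try every k-mer as the start
        extend(first, kmers - {first})
    return sorted(results)


spectrum = {"ATG", "TGC", "GCG", "CGT", "GTG", "TGG", "GGC", "GCA"}
print(all_reconstructions(spectrum))
```

For this spectrum the enumeration finds exactly the two sequences shown on the slide, so the spectrum alone cannot decide between them.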
Uniqueness of the Reconstruction • String Rearrangement • Transpositions • attAG_CAatcaAG*CAacc • attAG*CAatcaAG_CAacc • Expected number of such cases: $\binom{n}{4}\left(\frac{1}{4}\right)^{2(k-1)}\cdot\frac{3}{4}$ • Requiring $\binom{n}{4}\left(\frac{1}{4}\right)^{2(k-1)}\cdot\frac{3}{4} < 1$ gives $n < 2^{0.25}\cdot 2^{k}$ • k = 8 gives n < 305, which is far too short • So the approach is of little practical use even if the error-free assumption holds • Rotations • attACG_GCAacc • attACG_’GCAacc
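The bound on n follows from a short calculation. The sketch below assumes the large-n approximation $\binom{n}{4} \approx n^{4}/24$, which the slide does not state explicitly.

```latex
\begin{align*}
\binom{n}{4}\left(\frac{1}{4}\right)^{2(k-1)}\cdot\frac{3}{4} &< 1 \\
\frac{n^{4}}{24}\cdot\frac{16}{4^{2k}}\cdot\frac{3}{4} &< 1 \\
\frac{n^{4}}{2\cdot 2^{4k}} &< 1 \\
n &< 2^{0.25}\cdot 2^{k} \approx 1.19\cdot 2^{k}
\end{align*}
```

For k = 8 this gives roughly 1.19 × 256 ≈ 304.4, consistent with the limit n < 305 quoted above.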
Overcoming the problems • Many new approaches have been suggested • PSBH: positional information is given (Broude et al. 1994) • Provides a set of possible start positions for each probe • NP-complete • Sequencing by hybridization in rounds, or interactive sequencing (Margaritis and Skiena 1995) • Uses more experiments to resolve the ambiguity • Gapped probes (Preparata et al. 2000)
Gapped probes (universal bases) • Probe scheme • A binary string • E.g.: 1111 • Which gives ordinary k-mers • E.g.: 110101 • A probe is obtained by positioning the pattern along the sequence and extracting the symbols sampled by the 1s of the pattern
Gapped probes (universal bases) • DNA sample and probe scheme 110101 → hybridization → spectrum • Problem: Reconstruct the sequence from the spectrum
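Extracting a gapped spectrum can be sketched as below. The sketch assumes error-free hybridization and writes each gap (universal base) as `*`, following the notation used in the worked example on later slides; the names `mask` and `gapped_spectrum` are illustrative.

```python
def mask(window, pattern):
    """Keep the symbols under the 1s of `pattern`, replace the rest by '*'."""
    return "".join(c if bit == "1" else "*" for c, bit in zip(window, pattern))


def gapped_spectrum(sequence, pattern):
    """Slide `pattern` along `sequence` and collect every sampled probe."""
    v = len(pattern)
    return {mask(sequence[i:i + v], pattern)
            for i in range(len(sequence) - v + 1)}


print(sorted(gapped_spectrum("ACGCATCG", "110101")))
```

With the all-ones pattern 1111 the same function returns the ordinary 4-mer spectrum, so classical probes are a special case of this scheme.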
Gapped Probing Scheme • (s,r)-probing scheme • probe pattern = $1^{s}(0^{s-1}1)^{r}$ • probe length = v = s(r+1) • Number of 1s is s+r • E.g.: (2,2)-probing scheme • 110101 • length is 6 • # of 1s is 4 • The # of 1s is the dominating factor for the microarray size ($4^{\#\text{ of 1s}}$ probes, generally)
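The pattern definition is easy to check numerically. The following sketch builds the pattern for given s and r and reports its length, its number of 1s, and the resulting array size; `probe_pattern` is a name chosen here for illustration.

```python
def probe_pattern(s, r):
    """Build the (s, r) pattern: s ones, then r blocks of (s-1) zeros and a one."""
    return "1" * s + ("0" * (s - 1) + "1") * r


for s, r in [(2, 2), (4, 4)]:
    pattern = probe_pattern(s, r)
    ones = pattern.count("1")
    # Length should equal s*(r+1); the array needs 4**ones distinct probes.
    print((s, r), pattern, len(pattern), ones, 4 ** ones)
```

For the (2,2) scheme this reproduces 110101 with length 6 and four 1s, matching the example above.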
Example • Spectrum: (shown as a figure in the original slides) • Initial putative sequence: ACGCA
Example • A probe is a feasible extension of a putative sequence t if its (v-1)-prefix matches the (v-1)-suffix of t at the sampled positions. • Putative sequence: ACGCA • Feasible probe: AC*C*T • New putative sequence: ACGCAT
Example • A probe is a feasible extension of a putative sequence t if its (v-1)-prefix matches the (v-1)-suffix of t at the sampled positions. • Putative sequence: ACGCAT • Feasible probes: CG*A*C and CG*A*A • New putative sequences: ACGCATC and ACGCATA
Example • A probe is a feasible extension of a putative sequence t if its (v-1)-prefix matches the (v-1)-suffix of t at the sampled positions. • Putative sequence ACGCATC: probe GC*T*G gives ACGCATCG • Putative sequence ACGCATA: probe GC*T*G gives ACGCATAG
Example • A probe is a feasible extension of a putative sequence t if its (v-1)-prefix matches the (v-1)-suffix of t at the sampled positions. • Putative sequence ACGCATCG: probe CA*C*G gives ACGCATCGG • Putative sequence ACGCATAG: no further extension, so this path is dropped
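A single extension step from this example can be sketched as follows. The spectrum used here contains only the five probes that appear on the example slides, which is an assumption, since the full spectrum figure is not part of this transcript; the function names are illustrative.

```python
def mask(window, pattern):
    """Keep the symbols under the 1s of `pattern`, replace the rest by '*'."""
    return "".join(c if bit == "1" else "*" for c, bit in zip(window, pattern))


def feasible_extensions(putative, spectrum, pattern):
    """Return the symbols that extend `putative` by a probe in `spectrum`."""
    v = len(pattern)
    if len(putative) < v - 1:
        return []
    tail = putative[-(v - 1):]              # (v-1)-suffix of the putative sequence
    return [base for base in "ACGT"
            if mask(tail + base, pattern) in spectrum]


# Assumed spectrum: only the probes named in the example.
spectrum = {"AC*C*T", "CG*A*C", "CG*A*A", "GC*T*G", "CA*C*G"}
for t in ["ACGCA", "ACGCAT", "ACGCATC", "ACGCATA", "ACGCATCG", "ACGCATAG"]:
    print(t, feasible_extensions(t, spectrum, "110101"))
```

The branch at ACGCAT has two feasible symbols, and the path through ACGCATAG has none, which is the situation the branching rule of the reconstruction algorithm is designed to resolve.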
Reconstruction Algorithm (Gapped) • Symbol-by-symbol extension • Algorithm: given the current putative sequence, consider all 4 possible extensions. Let C be the set of feasible extensions. • |C| = 0: end of the construction • |C| = 1: extend the putative sequence • |C| > 1: the algorithm attempts a breadth-first extension of all paths. • Paths are killed when they cannot be extended further. • Branching is extended up to a maximum depth H • H is some threshold • H is larger than rs+1
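The three cases above can be put together in a short Python sketch. It is a simplified illustration under several assumptions not taken from the slides: the seed of length v-1 is known, the spectrum is error-free, failure is signalled with a `ValueError`, and a hypothetical `max_length` guard stops runaway extension.

```python
def mask(window, pattern):
    """Keep the symbols under the 1s of `pattern`, replace the rest by '*'."""
    return "".join(c if bit == "1" else "*" for c, bit in zip(window, pattern))


def feasible(seq, spectrum, pattern):
    """Symbols whose appended probe occurs in the spectrum."""
    tail = seq[-(len(pattern) - 1):]
    return [b for b in "ACGT" if mask(tail + b, pattern) in spectrum]


def reconstruct_gapped(seed, spectrum, pattern, max_depth, max_length=10_000):
    """Extend `seed` symbol by symbol; branch breadth-first on ambiguity."""
    seq = seed
    while len(seq) < max_length:
        candidates = feasible(seq, spectrum, pattern)
        if not candidates:                       # |C| = 0: construction ends
            return seq
        if len(candidates) == 1:                 # |C| = 1: plain extension
            seq += candidates[0]
            continue
        # |C| > 1: explore all paths level by level, up to depth H.
        frontier = [seq + b for b in candidates]
        for _ in range(max_depth):
            if len({p[len(seq)] for p in frontier}) <= 1:
                break                            # only one branch survives
            frontier = [p + b for p in frontier
                        for b in feasible(p, spectrum, pattern)]
        first_symbols = {p[len(seq)] for p in frontier}
        if len(first_symbols) != 1:              # ambiguous or all paths died
            raise ValueError("cannot resolve branch at position %d" % len(seq))
        seq += first_symbols.pop()               # commit the surviving symbol
    return seq


# Assumed spectrum: only the probes named in the worked example.
spectrum = {"AC*C*T", "CG*A*C", "CG*A*A", "GC*T*G", "CA*C*G"}
print(reconstruct_gapped("ACGCA", spectrum, "110101", max_depth=6))
```

Here `max_depth=6` plays the role of H and is chosen to exceed rs+1 = 5 for the (2,2) scheme. Only one symbol is committed after each branch resolution, so a later ambiguity is examined afresh from the extended sequence.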
Failure of the algorithm • The reconstruction algorithm will fail if there are many fooling probes • E.g.: two extant paths are identical except in their initial symbols • Figure: the correct path alongside an incorrect path sustained by fooling probes
The Gapped Approach • The running time is O(n) with high probability. • Optimal in the sense that it achieves the information-theory bound. • The information-theory bound is $O(4^{k})$, where k is the # of 1s in the probe scheme • For the (4,4)-probing scheme, sequences of length > 10,000 can be reconstructed theoretically • Gapped microarrays “can be produced” • Not realistic, since it assumes an error-free spectrum
Research on realistic data simulation • Truncated Branch and Bound Algorithm • H.W. Leong, F.P. Preparata, W.K. Sung and H. Willy. On the control of hybridization noise in DNA Sequencing-by-Hybridization. WABI (2002). • Very slow • Tabu Search • J. Blazewicz, P. Formanowicz, M. Kasprzak, W.T. Markiewicz, A. Swiercz. Tabu search algorithm for DNA sequencing by hybridization with isothermic libraries. Comput Biol Chem. 2004 Feb;28(1):11-9. • Does not address the gapped case • Other Approaches • Takaho A. Endo. Probabilistic nucleotide assembling method for sequencing by hybridization. Bioinformatics 2004 20(14):2181-2188. • Does not address the gapped case • E. Halperin, S. Halperin, T. Hartman, R. Shamir. Handling Long Targets and Errors in Sequencing by Hybridization. Complexity and Cryptography Seminar, Weizmann Institute, 2002.
Conclusion and Problems • Algorithm for SBH with k-mer probes • Error-free assumption • If not error free, then it is NP-hard • Algorithm for SBH with gapped probes • Error-free assumption • If not error free, for a specified probe scheme, is it NP-complete?
References • P.A. Pevzner. Computational Molecular Biology: An Algorithmic Approach. MIT Press 2000. • F.P. Preparata and E. Upfal. Sequencing-by-Hybridization at the information-theory bound: An optimal algorithm. International Conference on Computational Molecular Biology (2000). • F.P. Preparata. Sequencing by Hybridization Revisited: The Analog-Spectrum Proposal. IEEE Transactions on Computational Biology and Bioinformatics. Vol. 1, No. 1, January-March 2004.