Sequencing by Hybridization

Sequencing by Hybridization: Algorithmic Reconstruction. By: Shuai Cheng, Li. Presentation for CS482/682.


Presentation Transcript


  1. Sequencing by Hybridization: Algorithmic Reconstruction • By: Shuai Cheng, Li • Presentation for CS482/682

  2. Outline • Background and Problem Formulation • Classical Method for Sequencing by Hybridization • Standard Method for Sequencing by Hybridization with Gapped Probes

  3. Sequence Reconstruction • Background • Sequence reconstruction can be done using gel electrophoresis. • Sequences several thousand bases long can be reconstructed this way.

  4. History of SBH • Sequencing by hybridization (SBH) was proposed by several research groups around 1988-1989. • SBH is a potential alternative sequencing method • Strezoska et al. (1991) reconstructed a 100 bp DNA sample • Morris and Huang (1999) reconstructed a 125 bp DNA sample • Both are far short of 1000 bp • Gapped probes • Preparata et al. (2000): a major breakthrough • Can reconstruct up to 10,000 bp in theory

  5. Model for SBH • First step, biochemical • A chip called a microarray detects (ideally) all the k-mers in a given DNA sample • This step is referred to as hybridization • The set of k-mers is referred to as the spectrum • Each k-mer is referred to as a probe • Second step, combinatorial • Algorithmic reconstruction of the original sequence from the set of k-mers
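
The combinatorial step starts from the spectrum, so it helps to see how an ideal, error-free spectrum relates to a sequence. The following is a minimal sketch; the function name `spectrum` and the input sequence are illustrative choices, not part of the slides.

```python
def spectrum(seq: str, k: int) -> set[str]:
    """Return the set of all k-mers occurring in seq.

    This models an ideal hybridization experiment: no missing probes
    and no false positives (an assumption of this sketch).
    """
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


# Illustrative input; a set is used, so repeated k-mers collapse to one probe.
print(sorted(spectrum("ACGCATC", 3)))
# ['ACG', 'ATC', 'CAT', 'CGC', 'GCA']
```

Because the spectrum is a set, it records neither the order of the probes nor how often each one occurs, which is why reconstruction is a nontrivial problem.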

  6. SBH, example • DNA sample (figure not preserved)

  7. SBH, example • DNA sample → hybridization → spectrum for k=3 (figure not preserved)

  8. SBH, example • DNA sample → hybridization → spectrum for k=3 (figure not preserved) • Problem: reconstruct the sequence from the spectrum

  9. SBH, example • Two samples may result in the same spectrum • The reconstruction process may need more information to determine a unique sequence

  11. Outline • Background and Problem Formulation • Classical Method for Sequencing by Hybridization • Standard Method for Sequencing by Hybridization with Gapped Probes

  12. SBH and Eulerian Path • Example: Spectrum = {ACG, ATC, CAT, CGC, GCA} • Model each (k-1)-mer as a node • Nodes: AC, AT, CA, CG, GC, TC

  13. SBH and Eulerian Path • Example: Spectrum = {ACG, ATC, CAT, CGC, GCA} • Model each (k-1)-mer taken from a probe as a node • There is a directed edge <u, v> iff u is the (k-1)-prefix and v is the (k-1)-suffix of the same probe p (a k-mer) • Nodes: AC, AT, CA, CG, GC, TC • Edge shown: probe ACG gives AC → CG

  14. SBH and Eulerian Path • Example: Spectrum = {ACG, ATC, CAT, CGC, GCA} • Model each (k-1)-mer as a node • There is a directed edge <u, v> iff u is the (k-1)-prefix and v is the (k-1)-suffix of the same probe (a k-mer) • Nodes: AC, AT, CA, CG, GC, TC • Edges shown: probe ACG gives AC → CG; probe ATC gives AT → TC

  15. SBH and Eulerian Path • Example: Spectrum = {ACG, ATC, CAT, CGC, GCA} • Model each (k-1)-mer as a node • There is a directed edge <u, v> iff u is the (k-1)-prefix and v is the (k-1)-suffix of the same probe (a k-mer) • A directed graph G is formed; |V| and |E| are both O(n), where n is the size of the spectrum • Nodes: AC, AT, CA, CG, GC, TC (figure not preserved)

  16. SBH and Eulerian Path • An Eulerian path is a path that traverses each edge of the graph exactly once • AC → CG → GC → CA → AT → TC • The sequence ACGCATC is recovered • The path can be found in O(n) time if one exists • Multiple paths are possible • Nodes: AC, AT, CA, CG, GC, TC (figure not preserved)

  17. SBH and Eulerian Path • Algorithm based on Eulerian path • Given the input spectrum S, create a graph G in which • each vertex represents the (k-1)-prefix or (k-1)-suffix of a length-k probe in S • each length-k probe contributes a directed edge from the vertex for its (k-1)-prefix to the vertex for its (k-1)-suffix • Find an Eulerian path of G, and reconstruct the sequence from the path
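
The graph construction and path search described above can be sketched as follows. This is an illustrative implementation, not the presenter's code: the function name `eulerian_reconstruct` is hypothetical, Hierholzer's algorithm is one standard linear-time way to find an Eulerian path, and the sketch assumes an error-free spectrum in which every k-mer occurs once in the target.

```python
from collections import defaultdict


def eulerian_reconstruct(spectrum):
    """Return one sequence whose k-mers are exactly `spectrum`, or None.

    Vertices are (k-1)-mers; each probe is an edge prefix -> suffix.
    Assumes each probe is used exactly once (an assumption of this sketch).
    """
    if not spectrum:
        return None
    graph = defaultdict(list)
    out_deg, in_deg = defaultdict(int), defaultdict(int)
    for probe in spectrum:
        u, v = probe[:-1], probe[1:]
        graph[u].append(v)
        out_deg[u] += 1
        in_deg[v] += 1

    # An Eulerian path starts at a vertex with one more outgoing than
    # incoming edge; if none exists (a cycle), any vertex with an edge works.
    nodes = set(out_deg) | set(in_deg)
    start = next((x for x in nodes if out_deg[x] - in_deg[x] == 1), None)
    if start is None:
        start = next(iter(graph))

    # Hierholzer's algorithm: walk until stuck, then back up and splice.
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if graph[u]:
            stack.append(graph[u].pop())
        else:
            path.append(stack.pop())
    path.reverse()

    # Validate: consecutive vertices must overlap, and the edges walked
    # must be exactly the probes of the spectrum.
    steps = list(zip(path, path[1:]))
    if any(a[1:] != b[:-1] for a, b in steps):
        return None
    if sorted(a + b[-1] for a, b in steps) != sorted(spectrum):
        return None
    return path[0] + "".join(v[-1] for v in path[1:])


print(eulerian_reconstruct({"ACG", "ATC", "CAT", "CGC", "GCA"}))  # ACGCATC
```

When the graph has several Eulerian paths, this sketch returns only one of them, and which one depends on the order in which edges are stored.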

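  18. Uniqueness • ATGCGTGGCA and ATGGCGTGCA have the same spectrum • Spectrum = {ATG, TGC, GCG, CGT, GTG, TGG, GGC, GCA} • Both graphs have nodes AT, TG, GC, CG, GT, GG, CA; each sequence corresponds to a different Eulerian path (figure not preserved)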
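
To make the ambiguity concrete, the sketch below enumerates every sequence that is consistent with a given spectrum by brute-force backtracking. The function name `all_reconstructions` is hypothetical, the search is exponential in the worst case, and it assumes k ≥ 2 and that each probe is used exactly once.

```python
def all_reconstructions(spectrum):
    """List every sequence that uses each probe of `spectrum` exactly once.

    Brute-force backtracking, for illustration on small inputs only.
    Assumes all probes share the same length k >= 2.
    """
    probes = frozenset(spectrum)
    results = set()

    def extend(seq, remaining):
        if not remaining:
            results.add(seq)
            return
        for p in remaining:
            # p may follow if its (k-1)-prefix equals the current suffix.
            if p[:-1] == seq[-(len(p) - 1):]:
                extend(seq + p[-1], remaining - {p})

    for first in probes:
        extend(first, probes - {first})
    return sorted(results)


s = {"ATG", "TGC", "GCG", "CGT", "GTG", "TGG", "GGC", "GCA"}
print(all_reconstructions(s))
# ['ATGCGTGGCA', 'ATGGCGTGCA']
```

Both sequences from the slide are returned, and no others, so the spectrum alone cannot distinguish between them.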

  19. Uniqueness of the Reconstruction • String rearrangement • Transpositions • attAG_CAatcaAG*CAacc • attAG*CAatcaAG_CAacc • The expected number of such cases is $\binom{n}{4}\left(\frac{1}{4}\right)^{2(k-1)}\frac{3}{4}$ • Requiring $\binom{n}{4}\left(\frac{1}{4}\right)^{2(k-1)}\frac{3}{4} < 1$ gives $n < 2^{0.25} \cdot 2^{k}$ • k = 8 gives n < 305, which is far too short • So the method is of little use even if the error-free assumption holds • Rotations • attACG_GCAacc • attACG_’GCAacc
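
The step from the expectation to the length bound is not spelled out on the slide. One way to recover it, assuming the usual approximation $\binom{n}{4} \approx n^4/24$ for large n (an assumption of this sketch, not stated in the slides), is:

```latex
\binom{n}{4}\left(\frac{1}{4}\right)^{2(k-1)}\frac{3}{4} < 1
\;\Longrightarrow\;
\frac{n^{4}}{24}\cdot\frac{3}{4} < 4^{2(k-1)}
\;\Longrightarrow\;
n^{4} < 32\cdot 2^{4(k-1)} = 2^{4k+1}
\;\Longrightarrow\;
n < 2^{k+0.25} = 2^{0.25}\cdot 2^{k}.
% For k = 8: 2^{8.25} \approx 304.4, consistent with n < 305.
```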

  20. Overcoming the problems • Many new approaches have been suggested • PSBH: positional information is given (Broude et al. 1994) • A set of possible start positions is provided for each probe • NP-complete • Sequencing by hybridization in rounds, or interactive sequencing (Margaritis and Skiena 1995) • Uses additional experiments to resolve the ambiguity • Gapped probes (Preparata et al. 2000)

  21. Gapped probes (universal bases) • Probe scheme • A binary string • E.g., 1111, which gives ordinary k-mers • E.g., 110101, a gapped pattern • A probe is obtained by positioning the pattern along the sequence and extracting the symbols sampled by the 1s of the pattern

  22. Gapped probes (universal bases) • DNA sample → hybridization with probe scheme 110101 → spectrum (figure not preserved) • Problem: reconstruct the sequence from the spectrum
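
Sliding a probe pattern along a sequence can be sketched as below. The function name `gapped_spectrum` and the input sequence are illustrative; unsampled positions are written as `*`, following the notation used for gapped probes in the later example slides, and an error-free spectrum is assumed.

```python
def gapped_spectrum(seq: str, pattern: str) -> set[str]:
    """Return the gapped probes obtained by sliding `pattern` along `seq`.

    Positions where the pattern has '1' keep the sequence symbol;
    positions with '0' (universal bases) are written as '*'.
    """
    v = len(pattern)
    return {
        "".join(c if bit == "1" else "*"
                for c, bit in zip(seq[i:i + v], pattern))
        for i in range(len(seq) - v + 1)
    }


print(sorted(gapped_spectrum("ACGCATCGG", "110101")))
# ['AC*C*T', 'CA*C*G', 'CG*A*C', 'GC*T*G']
```

With the all-ones pattern `1111` the same function returns ordinary 4-mers, so contiguous probes are a special case of the scheme.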

  23. Gapped Probing Scheme • (s,r)-probing scheme • Probe pattern = $1^{s}(0^{s-1}1)^{r}$ • Probe length = v = s(r+1) • The number of 1s is s+r • E.g., (2,2)-probing scheme • 110101 • Length is 6 • Number of 1s is 4 • The number of 1s is the dominating factor for the microarray size (generally $4^{\text{number of 1s}}$)
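
The pattern definition and the size formulas can be checked with a few lines of code. This is a small sketch; the function name `probe_pattern` is a hypothetical helper.

```python
def probe_pattern(s: int, r: int) -> str:
    """Build the (s, r) probing pattern: s ones, then r copies of (s-1 zeros, one 1)."""
    return "1" * s + ("0" * (s - 1) + "1") * r


for s, r in [(2, 2), (4, 4)]:
    pattern = probe_pattern(s, r)
    ones = pattern.count("1")
    # Length should equal s(r+1); number of ones should equal s+r.
    print(pattern, len(pattern) == s * (r + 1), ones == s + r, 4 ** ones)
# 110101 True True 256
# 11110001000100010001 True True 65536
```

The last column is the number of distinct probes, which is 4 raised to the number of 1s; for the (4,4) scheme that is the same array size as for ordinary 8-mers.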

  24. Example • Spectrum: (figure not preserved) • Initial putative sequence: ACGCA

  25. Example • A probe is a feasible extension of a putative sequence t if its (v-1)-prefix matches the suffix of t at the sampled positions • Putative sequence: ACGCA • Matching probe: AC*C*T • New putative sequence: ACGCAT

  26. Example • A probe is a feasible extension of a putative sequence t if its (v-1)-prefix matches the suffix of t at the sampled positions • Putative sequence: ACGCAT • Matching probe: CG*A*C, giving new putative sequence ACGCATC • Matching probe: CG*A*A, giving new putative sequence ACGCATA

  27. Example • A probe is a feasible extension of a putative sequence t if its (v-1)-prefix matches the suffix of t at the sampled positions • Putative sequence: ACGCATC • Matching probe: GC*T*G, giving new putative sequence ACGCATCG • Putative sequence: ACGCATA • Matching probe: GC*T*G, giving new putative sequence ACGCATAG

  28. Example • A probe is a feasible extension of a putative sequence t if its (v-1)-prefix matches the suffix of t at the sampled positions • Putative sequence: ACGCATCG • Matching probe: CA*C*G, giving new putative sequence ACGCATCGG • Putative sequence: ACGCATAG • No further extension

  29. Reconstruction Algorithm (Gapped) • Symbol-by-symbol extension • Algorithm: given the current putative sequence, consider all 4 possible extensions. Let C be the set of feasible extensions. • |C| = 0: end of the construction • |C| = 1: extend the putative sequence • |C| > 1: the algorithm attempts a breadth-first extension of all paths • A path is dropped when it cannot be extended further • Branching is explored up to a maximum depth H • H is a threshold • H is larger than rs+1
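
The extension rule can be sketched in code as follows. This is a simplified illustration under stated assumptions, not the published algorithm: the names `feasible_symbols` and `extend_sequence` are hypothetical, probes are written with `*` at unsampled positions, the spectrum is error-free, a seed of at least v-1 symbols is given, and `max_len` is an added guard against endless extension. Ambiguity that survives to depth H is reported as a failure.

```python
BASES = "ACGT"


def feasible_symbols(seq, pattern, spectrum):
    """Symbols b for which the probe sampled at the end of seq + b is in the spectrum."""
    v = len(pattern)
    symbols = []
    for b in BASES:
        window = (seq + b)[-v:]
        if len(window) < v:
            continue  # seed too short to test a full probe
        probe = "".join(c if m == "1" else "*" for c, m in zip(window, pattern))
        if probe in spectrum:
            symbols.append(b)
    return symbols


def extend_sequence(seed, pattern, spectrum, H, max_len=10_000):
    """Symbol-by-symbol extension with breadth-first lookahead of depth at most H."""
    seq = seed
    while len(seq) < max_len:
        cands = feasible_symbols(seq, pattern, spectrum)
        if not cands:            # |C| = 0: end of the construction
            return seq
        if len(cands) == 1:      # |C| = 1: extend the putative sequence
            seq += cands[0]
            continue
        # |C| > 1: grow all paths level by level; dead paths drop out.
        paths = [seq + b for b in cands]
        depth = 1
        while len({p[len(seq)] for p in paths}) > 1:
            if depth >= H:
                raise ValueError("ambiguity not resolved within depth H")
            paths = [p + b for p in paths
                     for b in feasible_symbols(p, pattern, spectrum)]
            depth += 1
        if not paths:
            raise ValueError("all branches died at the same depth")
        seq += paths[0][len(seq)]  # commit only the agreed next symbol
    return seq


# Probes of ACGCATCGG under pattern 110101, plus one fooling probe CG*A*A.
spec = {"AC*C*T", "CG*A*C", "CG*A*A", "GC*T*G", "CA*C*G"}
print(extend_sequence("ACGCA", "110101", spec, H=6))  # ACGCATCGG
```

In this run the branch created by the fooling probe dies after two further symbols, so the lookahead commits to the other branch. With H = 2 the same input raises an error, because the spurious branch is still alive when the depth limit is reached.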

  30. Failure of the algorithm • The reconstruction algorithm fails if there are many fooling probes • E.g., two extant paths that are identical except in their initial symbols: the correct path, and an incorrect path due to fooling probes (figure not preserved)

  31. The Gapped Approach • The running time of the algorithm is O(n) with high probability • Optimal in the sense that it achieves the information-theory bound • The information-theory bound is $O(4^{k})$, where k is the number of 1s in the probe scheme • For a (4,4)-probing scheme, sequences of length > 10,000 can be reconstructed in theory • Gapped microarrays “can be produced” • Not realistic, since it assumes an error-free spectrum

  32. Research on realistic data simulation • Truncated branch-and-bound algorithm • H.W. Leong, F.P. Preparata, W.K. Sung, and H. Willy. On the control of hybridization noise in DNA Sequencing-by-Hybridization. WABI (2002). • Very slow • Tabu search • J. Blazewicz, P. Formanowicz, M. Kasprzak, W.T. Markiewicz, and A. Swiercz. Tabu search algorithm for DNA sequencing by hybridization with isothermic libraries. Comput Biol Chem. 2004 Feb;28(1):11-19. • Does not address the gapped case • Other approaches • Takaho A. Endo. Probabilistic nucleotide assembling method for sequencing by hybridization. Bioinformatics 2004, 20(14):2181-2188. • Does not address the gapped case • E. Halperin, S. Halperin, T. Hartman, and R. Shamir. Handling Long Targets and Errors in Sequencing by Hybridization. Complexity and Cryptography Seminar, Weizmann Institute, 2002.

  33. Conclusion and Problems • Algorithm for SBH with k-mer probes • Relies on the error-free assumption • Without it, the problem is NP-hard • Algorithm for SBH with gapped probes • Relies on the error-free assumption • Without it, for a specified probe scheme, is the problem NP-complete?

  34. References • P.A. Pevzner. Computational Molecular Biology: An Algorithmic Approach. MIT Press, 2000. • F.P. Preparata and E. Upfal. Sequencing-by-Hybridization at the information-theory bound: An optimal algorithm. International Conference on Computational Molecular Biology (2000). • F.P. Preparata. Sequencing by Hybridization Revisited: The Analog-Spectrum Proposal. IEEE Transactions on Computational Biology and Bioinformatics, Vol. 1, No. 1, January-March 2004.
