Sequencing by Hybridization

Sequencing by Hybridization: Algorithmic Reconstruction • By: Shuai Cheng Li • Presentation for CS482/682

Outline • Background and Problem Formulation • Classical Method for Sequencing by Hybridization • Standard Method for Sequencing by Hybridization with Gapped Probes

Sequence Reconstruction • Background • Sequence reconstruction can be done using gel electrophoresis. • Sequences several thousand bases long can be reconstructed.

History of SBH • Sequencing by hybridization (SBH) was proposed by several research groups around 1988-1989. • SBH is a potential alternative sequencing method • Strezoska et al. (1991) reconstructed a 100 bp DNA sample • Morris and Huang (1999) reconstructed a 125 bp DNA sample • This is far from 1000 bp • Gapped probes • Preparata et al. 2000, a major breakthrough • Can reconstruct up to 10,000 bp theoretically

Model for SBH • First step, biochemical • A chip called a microarray detects (ideally) all the k-mers in a given DNA sample • This step is referred to as hybridization • The set of k-mers is referred to as the spectrum • Each k-mer is referred to as a probe • Second step, combinatorial • Algorithmic reconstruction of the original sequence from the set of k-mers

SBH, example • DNA sample → hybridization → spectrum for k=3 • Problem: Reconstruct the sequence from the spectrum
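
The idealized hybridization step can be sketched in a few lines of Python. This is only an illustration of the error-free model described above; the function name `spectrum` is chosen here for the example and does not come from the slides.

```python
def spectrum(sample, k):
    """Return the set of all length-k substrings (k-mers) of `sample`.

    Models an ideal, error-free microarray: every k-mer that occurs in the
    sample is reported once, with no position or multiplicity information.
    """
    return {sample[i:i + k] for i in range(len(sample) - k + 1)}


print(sorted(spectrum("ACGCATC", 3)))  # ['ACG', 'ATC', 'CAT', 'CGC', 'GCA']
```

Because the result is a set, a k-mer that occurs twice in the sample appears only once in the spectrum, which is one reason reconstruction can be ambiguous.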

SBH, example • Two samples may result in the same spectrum • The reconstruction process may need more information to construct a unique sequence

Outline • Background and Problem Formulation • Classical Method for Sequencing by Hybridization • Standard Method for Sequencing by Hybridization with Gapped Probes

SBH and Eulerian Path • Example: • Spectrum = {ACG, ATC, CAT, CGC, GCA} • Model each (k-1)-mer that is a prefix or suffix of a probe as a node: AC, AT, CA, CG, GC, TC • There is a directed edge <u, v> iff u is the (k-1)-prefix and v is the (k-1)-suffix of the same probe p (a k-mer); e.g., probe ACG gives the edge <AC, CG> and probe ATC gives the edge <AT, TC> • A directed graph G is formed; |V| and |E| are bounded by O(n), where n is the size of the spectrum
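
A minimal sketch of the graph construction for this example, assuming the spectrum is given as a Python set of strings. The helper name `build_graph` is hypothetical; the point is only that each probe contributes exactly one edge, so |E| = n and |V| ≤ 2n.

```python
from collections import defaultdict


def build_graph(spectrum):
    """Map each (k-1)-mer prefix to the list of (k-1)-mer suffixes it points to."""
    graph = defaultdict(list)
    for probe in sorted(spectrum):               # sorted only for a stable order
        graph[probe[:-1]].append(probe[1:])      # edge <prefix, suffix>
    return dict(graph)


graph = build_graph({"ACG", "ATC", "CAT", "CGC", "GCA"})
nodes = set(graph) | {v for targets in graph.values() for v in targets}
edges = sum(len(targets) for targets in graph.values())
print(sorted(nodes), edges)  # ['AC', 'AT', 'CA', 'CG', 'GC', 'TC'] 5
```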

SBH and Eulerian Path • An Eulerian path is a path that traverses each edge of the graph exactly once • AC → CG → GC → CA → AT → TC • Sequence ACGCATC will be identified • The path can be found in O(n) time if there is one • Multiple paths are possible

SBH and Eulerian Path • Algorithm Based on Eulerian Path • Given the input spectrum S, create a graph G with • Each vertex representing a (k-1)-prefix or (k-1)-suffix of a length-k probe in S • For each length-k probe, create a directed edge from the vertex representing its (k-1)-prefix to the vertex representing its (k-1)-suffix. • Find an Eulerian path of G, and reconstruct the sequence from the path
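
The steps above can be sketched as follows. The slides do not say which Eulerian-path routine is used; this sketch uses Hierholzer's algorithm, which runs in time linear in the number of edges. The function name `reconstruct` is an assumption made for the example, and the spectrum is assumed to be an error-free set of k-mers.

```python
from collections import defaultdict


def reconstruct(spectrum):
    """Return one sequence spelled by an Eulerian path, or None if none exists."""
    graph = defaultdict(list)
    balance = defaultdict(int)                 # out-degree minus in-degree
    for probe in sorted(spectrum):
        u, v = probe[:-1], probe[1:]
        graph[u].append(v)
        balance[u] += 1
        balance[v] -= 1
    if not graph:
        return None
    starts = [node for node, b in balance.items() if b == 1]
    if len(starts) > 1 or any(abs(b) > 1 for b in balance.values()):
        return None                            # degree condition violated
    start = starts[0] if starts else next(iter(graph))

    # Hierholzer's algorithm: walk until stuck, then back up and splice.
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if graph[node]:
            stack.append(graph[node].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    if len(path) != len(spectrum) + 1:         # some edge was unreachable
        return None
    return path[0] + "".join(node[-1] for node in path[1:])


print(reconstruct({"ACG", "ATC", "CAT", "CGC", "GCA"}))  # ACGCATC
```

When several Eulerian paths exist, this returns just one of them, so the result is only guaranteed to equal the original sample when the path is unique.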

Uniqueness • The sequences ATGCGTGGCA and ATGGCGTGCA have the same spectrum (k=3) • Spectrum = {ATG, TGC, GCG, CGT, GTG, TGG, GGC, GCA} • [Figure: two drawings of the graph on nodes AT, TG, GC, CG, GT, GG, CA]
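
To see the ambiguity concretely, the sketch below enumerates every Eulerian path of the probe graph by backtracking. It is an illustration only: the function name `all_reconstructions` is hypothetical, and the search is exponential in the worst case, so it is suitable for toy spectra like this one.

```python
from collections import defaultdict


def all_reconstructions(spectrum):
    """Return every sequence spelled by some Eulerian path of the probe graph."""
    edges = defaultdict(list)
    for probe in spectrum:
        edges[probe[:-1]].append(probe[1:])
    results = set()

    def extend(node, text, used):
        if used == len(spectrum):              # every edge used exactly once
            results.add(text)
            return
        for i, nxt in enumerate(edges[node]):
            if nxt is not None:
                edges[node][i] = None          # mark the edge as used
                extend(nxt, text + nxt[-1], used + 1)
                edges[node][i] = nxt           # backtrack

    for start in list(edges):                  # try every node as a start
        extend(start, start, 0)
    return results


probes = {"ATG", "TGC", "GCG", "CGT", "GTG", "TGG", "GGC", "GCA"}
print(sorted(all_reconstructions(probes)))  # ['ATGCGTGGCA', 'ATGGCGTGCA']
```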

Uniqueness of the Reconstruction • String Rearrangement • Transpositions • attAG_CAatcaAG*CAacc • attAG*CAatcaAG_CAacc • The expected number of such cases is $\binom{n}{4}\left(\tfrac{1}{4}\right)^{2(k-1)}\tfrac{3}{4}$ • Requiring $\binom{n}{4}\left(\tfrac{1}{4}\right)^{2(k-1)}\tfrac{3}{4} < 1$ gives $n < 2^{0.25} \cdot 2^{k}$ • k=8 gives n < 305, which is bad • So the method is of little use even if the error-free assumption holds • Rotations • attACG_GCAacc • attACG_’GCAacc
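
The bound on n can be checked with a short derivation. It assumes the large-n approximation $\binom{n}{4} \approx n^{4}/24$, which is an assumption made here and is not stated on the slide.

```latex
\begin{align*}
\binom{n}{4}\left(\tfrac{1}{4}\right)^{2(k-1)}\tfrac{3}{4} &< 1 \\
\frac{n^{4}}{24}\cdot\frac{3}{4}\cdot 4^{-2(k-1)} &< 1
  && \text{using } \tbinom{n}{4}\approx\tfrac{n^{4}}{24} \\
n^{4} &< 32\cdot 16^{\,k-1} = 2\cdot 16^{k} \\
n &< 2^{0.25}\cdot 2^{k} \\
k = 8:\quad n &< 2^{0.25}\cdot 256 \approx 304.4
\end{align*}
```

The last line agrees with the n < 305 figure quoted for k=8.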

Overcoming the problems • Many new approaches have been suggested • PSBH: positional information is given (Broude et al. 1994) • Provides a set of possible start positions for each probe • NP-complete • Sequencing by hybridization in rounds, or interactive sequencing (Margaritis and Skiena 1995) • Uses more experiments to resolve the ambiguity • Gapped probes (Preparata et al. 2000)

Gapped probes (universal bases) • Probe scheme • A binary string • Eg: 1111 • Which will give us ordinary k-mers (here k=4) • Eg: 110101 • A probe is obtained by positioning the pattern along the sequence and extracting the symbols sampled by the 1s of the pattern
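
A small sketch of how a probe scheme samples a sequence, again under the error-free assumption. The function name `gapped_spectrum` is chosen for this example; unsampled positions are written as `*`, following the notation used in the worked example later in the slides.

```python
def gapped_spectrum(sample, pattern):
    """Slide `pattern` along `sample`; keep symbols under 1s, write '*' under 0s."""
    v = len(pattern)
    probes = set()
    for i in range(len(sample) - v + 1):
        window = sample[i:i + v]
        probes.add("".join(ch if bit == "1" else "*"
                           for ch, bit in zip(window, pattern)))
    return probes


print(sorted(gapped_spectrum("ACGCATCGG", "110101")))
# ['AC*C*T', 'CA*C*G', 'CG*A*C', 'GC*T*G']
```

With an all-ones pattern such as `1111`, the same function returns the ordinary 4-mer spectrum.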

Gapped probes (universal bases) • DNA sample → hybridization with probe scheme 110101 → spectrum • Problem: Reconstruct the sequence from the spectrum

Gapped Probing Scheme • (s,r)-probing scheme • probe pattern = $1^{s}(0^{s-1}1)^{r}$ • probe length = v = s(r+1) • Number of 1s is s+r • Eg: (2,2)-probing scheme • 110101 • length is 6 • # of 1s is 4 • The # of 1s is the dominating factor for the microarray size (generally $4^{k}$ probes for k 1s)
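
The pattern definition can be checked directly. In this sketch the function name `probe_pattern` is hypothetical; it simply spells out s ones followed by r blocks of s-1 zeros and a one.

```python
def probe_pattern(s, r):
    """Binary pattern of the (s, r)-probing scheme: 1^s (0^(s-1) 1)^r."""
    return "1" * s + ("0" * (s - 1) + "1") * r


pattern = probe_pattern(2, 2)
print(pattern, len(pattern), pattern.count("1"))  # 110101 6 4

# For the (4,4) scheme: length s(r+1) = 20, s + r = 8 ones,
# so an array with 4**8 = 65536 probes.
p44 = probe_pattern(4, 4)
print(len(p44), p44.count("1"), 4 ** p44.count("1"))  # 20 8 65536
```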

Example • Spectrum: [figure] • Initial putative sequence: ACGCA

Example • A probe is a feasible extension of a putative sequence t if its (v-1)-prefix matches the suffix of t (a * in the probe matches any symbol). • Spectrum: [figure] • Putative sequence: ACGCA • Feasible probe AC*C*T gives the new putative sequence ACGCAT

Example • A probe is a feasible extension of a putative sequence t if its (v-1)-prefix matches the suffix of t (a * in the probe matches any symbol). • Spectrum: [figure] • Putative sequence: ACGCAT • Feasible probe CG*A*C gives the new putative sequence ACGCATC • Feasible probe CG*A*A gives the new putative sequence ACGCATA

Example • A probe is a feasible extension of a putative sequence t if its (v-1)-prefix matches the suffix of t (a * in the probe matches any symbol). • Spectrum: [figure] • Putative sequence ACGCATC: feasible probe GC*T*G gives the new putative sequence ACGCATCG • Putative sequence ACGCATA: feasible probe GC*T*G gives the new putative sequence ACGCATAG

Example • A probe is a feasible extension of a putative sequence t if its (v-1)-prefix matches the suffix of t (a * in the probe matches any symbol). • Spectrum: [figure] • Putative sequence ACGCATCG: feasible probe CA*C*G gives the new putative sequence ACGCATCGG • Putative sequence ACGCATAG: no further extension

Reconstruction Algorithm (Gapped) • Symbol-by-symbol extension • Algorithm: Given the current putative sequence, consider all 4 possible extensions. Let C be the set of feasible extensions. • |C| = 0: end of the reconstruction • |C| = 1: extend the putative sequence • |C| > 1: the algorithm attempts a breadth-first extension of all paths. • Paths are killed when they cannot be extended further. • Branching is extended up to a maximum depth H • H is some threshold • H is larger than rs+1
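
The extension rule above can be sketched as follows. This is a simplified illustration, not the published algorithm: it assumes an error-free spectrum written with `*` at unsampled positions, a seed of at least v-1 known symbols, and a pattern of length at least 2. The names `mask`, `feasible`, `reconstruct`, and the `max_length` cap (added here so the loop always terminates) are assumptions of this example; `max_depth` plays the role of H.

```python
def mask(window, pattern):
    """Write a window as a gapped probe: symbols under 1s, '*' under 0s."""
    return "".join(ch if bit == "1" else "*" for ch, bit in zip(window, pattern))


def feasible(seq, spectrum, pattern):
    """Symbols c for which the probe ending at c is present in the spectrum."""
    tail = seq[-(len(pattern) - 1):]
    return [c for c in "ACGT" if mask(tail + c, pattern) in spectrum]


def reconstruct(seed, spectrum, pattern, max_depth, max_length=1000):
    """Extend `seed` symbol by symbol, branching breadth-first up to `max_depth`."""
    if len(seed) < len(pattern) - 1:
        raise ValueError("seed must contain at least v-1 symbols")
    seq = seed
    while len(seq) < max_length:
        candidates = feasible(seq, spectrum, pattern)
        if not candidates:                        # |C| = 0: stop
            break
        if len(candidates) > 1:                   # |C| > 1: explore all paths
            paths = [seq + c for c in candidates]
            for _ in range(max_depth):
                # Paths with no feasible extension are dropped (killed).
                paths = [p + c for p in paths
                         for c in feasible(p, spectrum, pattern)]
                candidates = sorted({p[len(seq)] for p in paths})
                if len(candidates) <= 1:
                    break
            if len(candidates) != 1:
                raise ValueError("cannot resolve branching at position %d" % len(seq))
        seq += candidates[0]                      # commit the surviving symbol
    return seq


# Probes of ACGCATCGGA under 110101, plus the fooling probe CG*A*A.
probes = {"AC*C*T", "CG*A*C", "GC*T*G", "CA*C*G", "AT*G*A", "CG*A*A"}
print(reconstruct("ACGCA", probes, "110101", max_depth=3))  # ACGCATCGGA
```

In this run the branch through ACGCATA dies after two further symbols, so the symbol C is committed. With `max_depth=1` the two branches are still both alive when the depth limit is reached, and the sketch raises an error instead of guessing.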

Failure of the algorithm • The reconstruction algorithm will fail if there are many fooling probes • Eg: Two extant paths are identical except in their initial symbols • [Figure: the correct path and an incorrect path due to fooling probes]

The Gapped Approach • The running time of the algorithm is O(n) with high probability. • Optimal in the sense that it achieves the information-theory bound. • The information-theory bound is $O(4^{k})$, where k is the # of 1s in the probe scheme • For a (4,4)-probe, sequences of length > 10,000 can be reconstructed theoretically • Gapped microarrays “can be produced” • Not realistic, since it assumes an error-free spectrum

Research on realistic data simulation • Truncated Branch and Bound Algorithm • H.W. Leong, F.P. Preparata, W.K. Sung and H. Willy. On the control of hybridization noise in DNA Sequencing-by-Hybridization. WABI (2002). • Very slow • Tabu Search • J. Blazewicz, P. Formanowicz, M. Kasprzak, W.T. Markiewicz, A. Swiercz. Tabu search algorithm for DNA sequencing by hybridization with isothermic libraries. Comput Biol Chem. 2004 Feb;28(1):11-9. • Does not address the gapped case • Other Approaches • Takaho A. Endo. Probabilistic nucleotide assembling method for sequencing by hybridization. Bioinformatics 2004 20(14):2181-2188 • Does not address the gapped case • E. Halperin, S. Halperin, T. Hartman, R. Shamir. Handling Long Targets and Errors in Sequencing by Hybridization. Complexity and Cryptography Seminar, Weizmann Institute, 2002

Conclusion and Problems • Algorithm for SBH with k-mer probes • Error-free assumption • If not error-free, then it is NP-hard • Algorithm for SBH with gapped probes • Error-free assumption • If not error-free, for a specified probe scheme, is it NP-complete?

References • P.A. Pevzner. Computational Molecular Biology: An Algorithmic Approach. MIT Press, 2000. • F.P. Preparata and E. Upfal. Sequencing-by-Hybridization at the information-theory bound: An optimal algorithm. International Conference on Computational Molecular Biology (2000). • F.P. Preparata. Sequencing by Hybridization Revisited: The Analog-Spectrum Proposal. IEEE/ACM Transactions on Computational Biology and Bioinformatics, Vol. 1, No. 1, January-March 2004.
