Reconstruction of DNA sequencing by hybridization Ji-Hong Zhang, Ling-Yun Wu and Xiang-Sun Zhang ZHANGroup@aporc.org Institute of Applied Mathematics, AMSS, CAS
Bioinformatics • Human Genome Project • Large molecule data in biology, such as DNA and protein • Knowledge of mathematics, computer science, information science, physics, system science, management science as well as biology • Genomics • DNA sequencing • Gene prediction • Sequence alignment
DNA Sequencing …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA TATATATATACGTCGTCGT ACTGATGACTAGATTACAG ACTGATTTAGATACCTGAC TGATTTTAAAAAAATATT…
DNA Sequencing (shotgun) • Target DNA is cut many times at random • Forward-reverse linked reads (~500 bp each) separated by a known distance
DNA Sequencing (SBH) • DNA array (DNA chip) with 4^3 = 64 probes (one per possible 3-tuple) • Target DNA: AAATGCG
Sequencing by Hybridization • Hybridize target to array containing a spot for each possible k-tuple (k-mer) • The spectrum of a sequence • multi-set of all its k-long substrings (k-tuples) • Goal • reconstruct the sequence from its spectrum • Pevzner (1989): reconstruction is polynomial • But …
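The spectrum definition above can be made concrete with a few lines of Python. This is a minimal sketch for the error-free case; the function name `spectrum` is an illustrative choice, not something taken from the slides.

```python
from collections import Counter


def spectrum(sequence: str, k: int) -> Counter:
    """Return the multiset of all k-long substrings (k-tuples) of sequence."""
    if k <= 0 or k > len(sequence):
        raise ValueError("k must satisfy 1 <= k <= len(sequence)")
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


# Target from the DNA chip slide, probed with 3-tuples.
print(sorted(spectrum("AAATGCG", 3).elements()))
# ['AAA', 'AAT', 'ATG', 'GCG', 'TGC']
```

A `Counter` is used rather than a set so that repeated k-tuples keep their multiplicity, which matches the multiset wording of the definition.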
Uniqueness of Reconstruction • Different sequences can have the same spectrum • Example: the spectrum {ACT, CTA, TAC} is shared by ACTAC and TACTA • [Figure: non-uniqueness probability]
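To see the ambiguity directly, one can enumerate every sequence that uses each k-tuple of a spectrum exactly once. The sketch below does this by plain backtracking, assuming an error-free spectrum; the helper name `reconstructions` is hypothetical, and the approach is only meant for tiny examples.

```python
from collections import Counter


def reconstructions(spec: Counter, k: int) -> set:
    """Enumerate all sequences whose k-tuple multiset equals spec."""
    total = sum(spec.values())
    results = set()

    def extend(seq: str, remaining: Counter, used: int) -> None:
        if used == total:
            results.add(seq)
            return
        suffix = seq[-(k - 1):] if k > 1 else ""
        for base in "ACGT":
            nxt = suffix + base
            if remaining[nxt] > 0:
                remaining[nxt] -= 1          # consume one copy of the k-tuple
                extend(seq + base, remaining, used + 1)
                remaining[nxt] += 1          # undo on backtrack

    for start in list(spec):
        remaining = Counter(spec)
        remaining[start] -= 1
        extend(start, remaining, 1)
    return results


print(sorted(reconstructions(Counter(["ACT", "CTA", "TAC"]), 3)))
# ['ACTAC', 'CTACT', 'TACTA']
```

Besides the two sequences named on the slide, the enumeration also finds CTACT, which has the same three 3-tuples.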
Experiment Errors • Hybridization experiments are error-prone • False negative error • A k-tuple appears in the target DNA but does not appear in its measured spectrum • Repetition of a k-tuple (its extra occurrences are not detected) • False positive error • A k-tuple does not appear in the target DNA but does appear in its measured spectrum
Sequencing by Hybridization • Target DNA: ……TTTTACGC…… → Spectrum • Errors: positive (misread) / negative (missing, repetition) • Ideal case: TTT TTT TTA TAC ACG CGC • With errors: TTT TTT TTA TAC ACG CGC TGA
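The three kinds of error can be illustrated by comparing an ideal spectrum with a measured one. In the sketch below the measured set is a hypothetical example (the slide does not say which k-tuples were lost), the measured spectrum is assumed to be a plain set without multiplicities, and `classify_errors` is an illustrative name.

```python
from collections import Counter


def classify_errors(target: str, measured: set, k: int) -> dict:
    """Compare the ideal spectrum of target with a measured set of k-tuples."""
    ideal = Counter(target[i:i + k] for i in range(len(target) - k + 1))
    return {
        # negative errors: k-tuples of the target that were not measured
        "missing": sorted(t for t in ideal if t not in measured),
        # negative errors: measured k-tuples whose repeats cannot be seen
        "repeated": sorted(t for t, n in ideal.items() if n > 1 and t in measured),
        # positive errors: measured k-tuples that are not in the target
        "false_positive": sorted(t for t in measured if t not in ideal),
    }


# Hypothetical measurement of the fragment TTTTACGC with 3-tuples.
measured = {"TTT", "TAC", "ACG", "CGC", "TGA"}
print(classify_errors("TTTTACGC", measured, 3))
# {'missing': ['TTA'], 'repeated': ['TTT'], 'false_positive': ['TGA']}
```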
SBH Reconstruction Problem • In the case of error-free SBH experiments • A desired solution of SBH is simply a feasible solution that includes every k-tuple in the spectrum • For the general case • No information is available beyond the spectrum and the length of the target DNA • A feasible solution composed of a maximum-cardinality subset of the spectrum is a reasonable desired solution
SBH Reconstruction Problem • Ideal case (without repetitions and errors) • Equivalent to finding an Eulerian path in a corresponding graph (Pevzner, 1989) • Solvable in linear time (Fleischner, 1990) • The general case is NP-hard • Branch and bound • Heuristics • Extensions • PSBH (Positional SBH) • SBH with length error
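For the ideal case, the Eulerian-path formulation can be sketched as follows: each k-tuple becomes an edge from its (k-1)-long prefix to its (k-1)-long suffix, and a path that uses every edge once spells out a reconstruction. The code below uses Hierholzer's algorithm as a stand-in for the linear-time method cited on the slide; it assumes k ≥ 2 and an error-free spectrum, and the function name is an illustrative choice.

```python
from collections import Counter, defaultdict


def eulerian_reconstruction(kmers: list) -> str:
    """Rebuild one sequence from an error-free list of k-tuples (k >= 2)."""
    graph = defaultdict(list)
    balance = defaultdict(int)  # out-degree minus in-degree per vertex
    for kmer in kmers:
        prefix, suffix = kmer[:-1], kmer[1:]
        graph[prefix].append(suffix)
        balance[prefix] += 1
        balance[suffix] -= 1

    # Start where one outgoing edge is unmatched; otherwise start anywhere.
    start = next((v for v, b in balance.items() if b == 1), kmers[0][:-1])

    # Hierholzer's algorithm with an explicit stack.
    stack, path = [start], []
    while stack:
        vertex = stack[-1]
        if graph[vertex]:
            stack.append(graph[vertex].pop())
        else:
            path.append(stack.pop())
    path.reverse()

    sequence = path[0] + "".join(v[-1] for v in path[1:])

    # Verify the result: its k-tuples must match the input exactly.
    k = len(kmers[0])
    rebuilt = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    if rebuilt != Counter(kmers):
        raise ValueError("spectrum does not admit an Eulerian path")
    return sequence


print(eulerian_reconstruction(["AAA", "AAT", "ATG", "TGC", "GCG"]))
# AAATGCG
```

The final check re-derives the spectrum of the result, so inputs with errors or disconnected pieces raise an exception instead of returning a wrong sequence. When several Eulerian paths exist, the function returns only one of them.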
Motivations • Give criteria that determine the most likely k-tuples at both ends and in the middle of all possible reconstructions of the target DNA • These criteria greatly reduce ambiguities in the reconstruction of DNA • Transform the negative errors into positive errors • This enables us to handle both types of errors easily • Separate the repetitions from both types of errors
Methods • Estimate the number of k-tuples that do not occur in a solution • Adjacency matrix (connection matrix) • Gives a lower bound, over all solutions running from k-tuple i to k-tuple j, on the number of k-tuples that do not occur
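The slide does not define the connection matrix itself, so the following is only a generic overlap-based sketch and not the authors' construction. It assumes that entry (i, j) records how many k-tuples would have to be missing if k-tuple j directly followed k-tuple i, computed from the longest suffix-prefix overlap; the names `shift` and `missing_between` are hypothetical.

```python
def shift(a: str, b: str) -> int:
    """Smallest s >= 1 such that a[s:] equals the first k-s letters of b."""
    k = len(a)
    for s in range(1, k):
        if a[s:] == b[:k - s]:
            return s
    return k  # no overlap at all: b starts right after a ends


def missing_between(kmers: list) -> list:
    """Matrix whose (i, j) entry is shift(kmers[i], kmers[j]) - 1."""
    return [[shift(a, b) - 1 for b in kmers] for a in kmers]


# Spectrum of TTTTACGC with TTA lost (hypothetical negative error).
kmers = ["TTT", "TAC", "ACG", "CGC"]
for row in missing_between(kmers):
    print(row)
```

In this example the entry for (TTT, TAC) is 1, reflecting the single missing k-tuple TTA between them, while (TAC, ACG) is 0 because the two overlap in k-1 letters.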
Methods • Determine the most likely k-tuples at both ends • Reconstruct from the most likely end pairs to get an upper bound for the SBH problem • Purge the end pairs that cannot yield a better solution than the current upper bound
Methods • Transform the negative errors into positive errors • Artificial k-tuples • Fill in all the possible gaps due to false negative errors • Negative error level • The maximum number of consecutive missing k-tuples allowed • Reduces the number of artificial k-tuples
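One way to picture gap filling is to take two measured k-tuples that overlap in fewer than k-1 letters and generate the k-tuples that would lie between them. The sketch below is an assumption-laden illustration, not the authors' procedure: it considers only the longest overlap, and `artificial_ktuples` and `max_missing` (standing in for the negative error level) are hypothetical names.

```python
from typing import Optional


def artificial_ktuples(a: str, b: str, max_missing: int) -> Optional[list]:
    """Return the k-tuples needed to bridge a and b, or None if the gap
    would require more than max_missing consecutive missing k-tuples."""
    k = len(a)
    for s in range(1, min(max_missing + 1, k) + 1):
        if a[s:] == b[:k - s]:
            merged = a + b[k - s:]            # a followed by the new letters of b
            return [merged[i:i + k] for i in range(1, s)]
    return None


print(artificial_ktuples("TTT", "TAC", max_missing=1))
# ['TTA']
print(artificial_ktuples("TAC", "ACG", max_missing=1))
# []
print(artificial_ktuples("CGC", "TTT", max_missing=1))
# None
```

A smaller `max_missing` rejects more pairs outright, which is consistent with the slide's remark that the negative error level reduces the number of artificial k-tuples.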
Computational Experiments • 109 DNA sequences from GenBank • Simulated SBH experiments • Error models • Random (probabilistic model) • Systematic (one-base mismatch model)
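A simulated SBH experiment of this kind might look like the sketch below. The slides give no details of the simulation, so everything here is an assumption: the function name, the parameters `negative_rate`, `n_positive`, and `mismatch`, and the choice to report the measured spectrum as a set. The full candidate pool for the random model is only practical for small k.

```python
import itertools
import random

BASES = "ACGT"


def simulate_spectrum(target: str, k: int, negative_rate: float,
                      n_positive: int, mismatch: bool = False,
                      seed: int = 0) -> set:
    """Return a measured spectrum (set of k-tuples) with simulated errors."""
    rng = random.Random(seed)
    ideal = {target[i:i + k] for i in range(len(target) - k + 1)}

    # Negative errors: each true k-tuple is dropped with probability negative_rate.
    kept = {t for t in sorted(ideal) if rng.random() >= negative_rate}

    # Positive errors: either any k-tuple, or only one-base mismatches of true ones.
    if mismatch:
        pool = {t[:i] + b + t[i + 1:] for t in ideal for i in range(k) for b in BASES}
    else:
        pool = {"".join(p) for p in itertools.product(BASES, repeat=k)}
    pool -= ideal
    extra = rng.sample(sorted(pool), min(n_positive, len(pool)))
    return kept | set(extra)


print(sorted(simulate_spectrum("TTTTACGC", 3, negative_rate=0.2, n_positive=1)))
```

Fixing `seed` makes a run reproducible, which is convenient when comparing reconstruction algorithms on the same simulated data.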
Conclusions • The ideal case (without repetitions and errors) can be solved in polynomial time (Pevzner, 1989) • The general case is NP-hard • Design efficient algorithms • Ji-Hong Zhang, Ling-Yun Wu and Xiang-Sun Zhang. A new approach to the reconstruction of DNA sequencing by hybridization. Bioinformatics, vol 19(1), pages 14-21, 2003. • Xiang-Sun Zhang, Ji-Hong Zhang and Ling-Yun Wu. Combinatorial optimization problems in the positional DNA sequencing by hybridization and its algorithms. System Sciences and Mathematics, vol 3, 2002. (in Chinese) • Ling-Yun Wu, Ji-Hong Zhang and Xiang-Sun Zhang. Application of neural networks in the reconstruction of DNA sequencing by hybridization. In Proceedings of the 4th ISORA, 2002.