
Major Application: Finding Homologies

Major Application: Finding Homologies. (C) Mark Gerstein, Yale University, bioinfo.mbb.yale.edu/mbb452a. AutoSimS: local two-sequence alignment is the basis of sequence analysis, and perhaps the most widely used tool in computational molecular biology [1].


Presentation Transcript


  1. Major Application: Finding Homologies (C) Mark Gerstein, Yale University bioinfo.mbb.yale.edu/mbb452a

  2. AutoSimS
     • Local two-sequence alignment is the basis of sequence analysis, and perhaps the most widely used tool in computational molecular biology [1]
     • The parameters of the most popular local sequence alignment tools, including BLAST and FASTA, are set in one of two ways:
       • Default: set for the “average case,” which may not be appropriate for the sequences being examined
       • Custom: manual settings can be difficult to choose and usually require fine-tuning through several manual trials
     • AutoSimS (Automated Sequence Similarity Search) contains three modules:
       • A modified version of SIM/DDS (Similarity / DNA-DNA Sequence) [2, 3] for finding similar regions
       • Adaptive simulated annealing (ASA) [4] for optimizing the parameters of SIM/DDS
       • An AI decision-making system (not implemented) for guiding the adaptive simulated annealing

  3. (SIM/DDS) Similarity / DNA-DNA Sequence
     • Integrates features from Smith-Waterman, BLAST, FASTA and Haste (Hash-Accelerated Search) [5]
     • Rated as one of the fastest and least space-consuming (linear space complexity) tools for universal sequence alignment [6]
     • Provides tradeoffs between sensitivity and speed through more than a dozen parameters
     • Our modified SIM/DDS introduces more cutoffs
       • Increases flexibility of control
       • Sequence filtering
       • Word masking
       • Reduces the impact of short and exact matches
       • Allows adjusting sensitivity for weak similarity
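To make the Smith-Waterman ingredient concrete, here is a minimal sketch of local alignment scoring. It is not the SIM/DDS implementation: the function name, the scoring values (`match=2`, `mismatch=-1`, `gap=-1`) and the linear gap penalty are assumptions chosen for illustration. It returns only the best score, which can be computed in linear space by keeping one row of the dynamic-programming table at a time.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score between strings a and b.

    Illustrative only: scoring values and the linear gap model are assumed.
    """
    prev = [0] * (len(b) + 1)  # previous row of the DP table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Extend the diagonal with a match or a mismatch.
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

Parameters such as the match, mismatch and gap values are the kind of settings that change which regions score highest, which is why tuning them matters.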

  4. (ASA) Adaptive Simulated Annealing
     • Uses global and statistical optimization techniques that can handle complex, non-linear search spaces
     • Offers several improvements over the original simulated annealing technique:
       • Computational complexity: exponential temperature schedule for annealing
       • Completeness: decreases the chance of missing optima
       • Generality: more options to better fit the problem being solved
     • Most attractive feature: individual consideration given to each parameter's range, annealing-time-dependent sensitivity, and probability density distribution
     • Provides up to 100 options
     • Facilitates incorporation into the AutoSimS model
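The following is a minimal sketch of plain simulated annealing with an exponential temperature schedule and per-parameter ranges. It is not Ingber's ASA code [4], which adds per-parameter temperatures and generating distributions. The function name `anneal`, its defaults, and the step-size rule are assumptions made for illustration.

```python
import math
import random


def anneal(objective, ranges, t0=1.0, decay=0.01, steps=2000, seed=0):
    """Maximize objective(params) over box-constrained parameters.

    ranges is a list of (low, high) pairs, one per parameter.
    """
    rng = random.Random(seed)
    current = [rng.uniform(lo, hi) for lo, hi in ranges]
    current_val = objective(current)
    best, best_val = list(current), current_val
    for t in range(steps):
        temp = t0 * math.exp(-decay * t)  # exponential cooling
        # Perturb one parameter; the step shrinks as the system cools.
        i = rng.randrange(len(ranges))
        lo, hi = ranges[i]
        step = (hi - lo) * max(temp / t0, 0.01) * rng.uniform(-1.0, 1.0)
        candidate = list(current)
        candidate[i] = min(hi, max(lo, candidate[i] + step))
        cand_val = objective(candidate)
        delta = cand_val - current_val
        # Always accept improvements; accept worse moves with prob exp(delta/T).
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current, current_val = candidate, cand_val
            if current_val > best_val:
                best, best_val = list(current), current_val
    return best, best_val
```

A caller would pass any function that maps a parameter list to a score, for example `anneal(lambda p: -(p[0] - 3) ** 2, [(-10, 10)])`.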

  5. AutoSimS Model
     [Diagram. Labels, in extraction order: User Preferences; AI Decision-Making Module (not implemented); Sequence Data; Data Selection; Knowledge Base; Modified SIM/DDS; Parameters; Parameter Search; Set of possible parameters with exponential probability; Parameter Evaluation; Exponential Annealing; Value of objective function; ASA; Preferred similarity regions]

  6. Summary of Model
     • ASA works as a “wrapper” program to select parameters for SIM/DDS
     • With properly specified search spaces, objective function and successor heuristics determined by the AI decision-making system, ASA is used to find the optimal parameter setting of the modified SIM/DDS program. This leads to finding better similar regions
     • Even though the above-mentioned information currently has to be given to ASA manually, we find it easier to do so and let ASA tune the parameters for SIM/DDS than to tune SIM/DDS’s parameters by hand
     • Adding the AI decision-making module will make AutoSimS nearly autonomous by automatically providing most of the information ASA needs
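The wrapper idea can be sketched as a closure that hides the aligner behind a single "parameters in, score out" function, which is all an optimizer needs to see. Everything named here is hypothetical: `make_objective`, `toy_aligner`, and the parameter keys `word_size` and `match_score` are stand-ins for illustration and do not reflect the real SIM/DDS interface or its parameters.

```python
from statistics import mean


def make_objective(run_aligner, seq_a, seq_b):
    """Wrap an aligner so an optimizer only sees params -> score."""
    def objective(params):
        scores = run_aligner(seq_a, seq_b, params)
        # Average the region scores; an empty result scores 0.
        return mean(scores) if scores else 0.0
    return objective


def toy_aligner(seq_a, seq_b, params):
    """Hypothetical stand-in for an aligner: one score per shared word."""
    k = params["word_size"]
    words_a = {seq_a[i:i + k] for i in range(len(seq_a) - k + 1)}
    words_b = {seq_b[i:i + k] for i in range(len(seq_b) - k + 1)}
    return [k * params["match_score"] for _ in words_a & words_b]
```

With this shape, swapping in a different aligner or a different averaging rule changes only the wrapped function, not the search procedure.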

  7. Results
     • AHSC (Average of High-Scoring Chain Scores) may be used as an ASA objective function to find parameters yielding highly similar regions
     • We find that close-to-optimal parameter settings are difficult to find manually, and that many different parameter settings yield close-to-optimal search results
     • An automatic search for parameters may be effective
     • Adaptive simulated annealing may be a preferred search technique

     Three runs of our modified SIM/DDS program, using parameters selected by adaptive simulated annealing for a pair of DNA sequences of 100 and 200 letters, yield similar results but with different parameter settings.

     ASA settings:
     • Annealing schedule: T = 20 * exp(-0.005*t) if t < 100, and 0 otherwise
     • Acceptance function: exp( E / T )
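The ASA settings listed on this slide can be written out as a short sketch. The schedule is taken directly from the slide. The acceptance rule rests on an assumption: E is read as the change in the objective (candidate minus current) under maximization, with improvements always accepted and nothing worse accepted once T reaches 0. The function names are illustrative.

```python
import math


def temperature(t):
    """Annealing schedule from the slide: 20 * exp(-0.005 * t) for t < 100."""
    return 20.0 * math.exp(-0.005 * t) if t < 100 else 0.0


def acceptance_probability(delta_e, temp):
    """Probability of accepting a move that changes the objective by delta_e.

    Assumption: E on the slide is this change, and the objective is maximized.
    """
    if delta_e >= 0:
        return 1.0  # improvements are always accepted
    if temp <= 0:
        return 0.0  # frozen: reject every worsening move
    return math.exp(delta_e / temp)
```

Under this schedule the temperature falls only from 20 to about 12.2 before the cutoff at t = 100, so worsening moves stay fairly likely until the search freezes.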

  8. Future Work
     • Implement the AI decision-making system, including the decision analysis and knowledge base system
     • Experiment on a large number of different types of molecular biological sequences to determine the proper annealing temperature schedules and successor heuristics and/or their parameters
     • Parallelize AutoSimS
     • Incorporate core ideas of more efficient very large-scale sequence comparison techniques, such as LSH (Locality-Sensitive Hashing) [7]
     • Generate statistical estimates for the local alignment score distributions [1], which will be used in AutoSimS’s decision-making system
     • Explore different ASA objective functions, which may improve results

  9. Conclusion
     • ASA’s ability to fit complex functions, i.e. nonlinear search spaces with multiple variables, allows it to find a suitable set of parameters for SIM/DDS
     • Incorporating an AI decision-making system into our ASA-SIM/DDS program should enhance our ability to achieve almost autonomous two-sequence similarity analysis with high-volume throughput and acceptable performance
     • Our use of simulated annealing to find a suitable set of parameters can be adapted to other bioinformatics analysis programs, such as those for alignment and clustering

  10. References
     [1] Altschul, S. F., Bundschuh, R., Olsen, R. and Hwa, T., The Estimation of Statistical Parameters for Local Alignment Score Distributions. Nucleic Acids Research, Vol. 29, No. 2, 351–361, 2001
     [2] Jiang, T., Xu, Y. and Zhang, M. Q., Current Topics in Computational Molecular Biology. MIT Press, 2002
     [3] Huang, X. and Miller, W., A Time-Efficient, Linear-Space Local Similarity Algorithm. Advances in Applied Mathematics, Vol. 12, 337–357, 1991
     [4] Ingber, L., Simulated Annealing: Practice versus Theory. Mathl. Comput. Modelling, Vol. 18, No. 11, 29–57, 1993
     [5] Borkowski, J. A., Smith, C. P. and Huang, X., PFP: A Flexible Integrated Filtering and Masking Tool. Paracel Inc., Pasadena, CA
     [6] Tech Topics, Michigan Technological University, Vol. XXVIII, No. 9, Nov. 3, 1995
     [7] Buhler, J., Efficient Large-Scale Sequence Comparison by Locality-Sensitive Hashing. Bioinformatics, Vol. 17, No. 5, 419–428, 2001
