Major Application: Finding Homologies

(C) Mark Gerstein, Yale University, bioinfo.mbb.yale.edu/mbb452a

AutoSimS
• Local two-sequence alignment is the basis of sequence analysis and perhaps the most widely used tool in computational molecular biology [1]
• The parameters of the most popular local sequence alignment tools, including BLAST and FASTA, are set in one of two ways:
  • Default: set for the “average case,” which may not be appropriate for the sequences being examined
  • Custom: manual settings may be difficult to choose and usually require fine-tuning through several manual trials
• AutoSimS (Automated Sequence Similarity Search) contains three modules:
  • A modified version of SIM/DDS (Similarity / DNA-DNA Sequence) [2, 3] for finding similar regions
  • Adaptive simulated annealing (ASA) [4] for optimizing the parameters of SIM/DDS
  • An AI decision-making system (not implemented) for guiding the adaptive simulated annealing

(SIM/DDS) Similarity / DNA-DNA Sequence
• Integrates features from Smith-Waterman, BLAST, FASTA and Haste (Hash-Accelerated Search) [5]
• Rated as one of the fastest and most space-efficient (linear space complexity) tools for universal sequence alignment [6]
• Provides tradeoffs between sensitivity and speed through more than a dozen parameters
• Our modified SIM/DDS introduces more cutoffs
  • Increases flexibility of control
  • Sequence filtering
  • Word masking
  • Reduces the impact of short and exact matches
  • Allows adjusting sensitivity for weak similarity
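To make the role of the scoring parameters concrete, here is a minimal sketch of the Smith-Waterman local alignment recurrence that SIM/DDS draws on. It is not the SIM/DDS algorithm itself: the function name, the linear gap penalty and the default scores are assumptions chosen for illustration. Only two rows of the score table are kept, so the score is computed in linear space.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score of strings a and b.

    Illustrative only: linear gap penalty, score only (no traceback).
    The default scores are assumptions, not SIM/DDS defaults.
    """
    prev = [0] * (len(b) + 1)   # scores for the previous row of the table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                   # start a new local alignment here
                prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                prev[j] + gap,       # gap in b
                curr[j - 1] + gap,   # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best
```

Changing `match`, `mismatch` or `gap` changes which regions score highest, which is why the choice of such parameters matters for the sequences being examined.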

(ASA) Adaptive Simulated Annealing
• Uses global and statistical optimization techniques that are able to handle complex, non-linear search spaces
• Several improvements over the original simulated annealing technique
  • Computational complexity: exponential temperature schedule for annealing
  • Completeness: decreases the chance of missing optima
  • Generality: more options to better fit the problem being solved
• Most attractive feature: individual consideration given to the parameter range, annealing-time-dependent sensitivities, and probability density distribution of each parameter
• Provides up to 100 options
• Facilitates incorporation into the AutoSimS model
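The per-parameter treatment can be sketched as follows. This is a minimal illustration that assumes the exponential schedule and the generating distribution described for ASA in [4]; the function names and constants are hypothetical, and the real ASA code adds re-annealing and many more options.

```python
import math
import random


def asa_temperature(t0, c, k, dim):
    """Exponential schedule T(k) = t0 * exp(-c * k**(1/dim)).

    t0, c: per-parameter constants (illustrative values are up to the caller)
    k: annealing-time index, dim: number of parameters being tuned
    """
    return t0 * math.exp(-c * k ** (1.0 / dim))


def asa_step(value, lo, hi, temp, rng):
    """Propose a new value for one parameter restricted to [lo, hi].

    Requires temp > 0. The offset y lies in [-1, 1] and concentrates
    near 0 as temp falls, so proposals narrow as annealing proceeds.
    """
    while True:
        u = rng.random()
        sign = 1.0 if u >= 0.5 else -1.0
        y = sign * temp * ((1.0 + 1.0 / temp) ** abs(2.0 * u - 1.0) - 1.0)
        candidate = value + y * (hi - lo)
        if lo <= candidate <= hi:   # resample until the range is respected
            return candidate
```

Each parameter carries its own range and temperature, so a parameter the objective is sensitive to can be cooled on a different schedule from one it barely depends on.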

AutoSimS Model
[Figure: block diagram of the AutoSimS model. Labeled components: User Preferences; AI Decision-Making Module (not implemented); Sequence Data; Data Selection; Knowledge Base; Modified SIM/DDS; Parameters; ASA, comprising Parameter Search (set of possible parameters with exponential probability), Parameter Evaluation (value of objective function) and Exponential Annealing; Preferred similarity regions.]

Summary of Model
• ASA works as a “wrapper” program that selects parameters for SIM/DDS
• With properly specified search spaces, objective function and successor heuristics determined by the AI decision-making system, ASA is used to find the optimal parameter setting for the modified SIM/DDS program. This leads to finding better similar regions
• Even though this information currently has to be given to ASA manually, we find it easier to supply it and let ASA tune the parameters of SIM/DDS than to tune those parameters by hand
• Adding the AI decision-making module will make AutoSimS nearly autonomous by automatically providing most of the information ASA needs

Results
• AHSC (Average of High-Scoring Chain Scores) may be used as an ASA objective function to find parameters yielding highly similar regions
• We find that close-to-optimal parameter settings are difficult to find manually, and that many different parameter settings yield close-to-optimal search results
• An automatic search for parameters may be effective
• Adaptive simulated annealing may be a preferred search technique

Three runs of our modified SIM/DDS program on a pair of DNA sequences of 100 and 200 letters, using parameters selected by adaptive simulated annealing, yield similar results but with different parameter settings.

ASA settings:
• Annealing schedule: T = 20 * exp(-0.005 * t) if t < 100, and 0 otherwise
• Acceptance function: exp(E / T)
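The settings above can be sketched as a small annealing loop. This is an illustration under stated assumptions, not the program used for the runs: E is taken to be the change in the objective (candidate minus current) with the objective maximized, T = 0 is taken to mean that only non-worsening moves are accepted, and `toy_objective` with its two integer parameters is a hypothetical stand-in for running the modified SIM/DDS and computing AHSC.

```python
import math
import random


def temperature(t):
    """Annealing schedule from the slide: 20*exp(-0.005*t) for t < 100, else 0."""
    return 20.0 * math.exp(-0.005 * t) if t < 100 else 0.0


def accept(delta, temp, rng):
    """Accept with probability exp(delta / temp); delta is the assumed E."""
    if delta >= 0:
        return True            # never reject a non-worsening move
    if temp <= 0:
        return False           # assumption: at T = 0 the search is greedy
    return rng.random() < math.exp(delta / temp)


def anneal(objective, start, neighbor, steps=150, seed=0):
    """Maximize objective over parameter settings; return (best, best_score)."""
    rng = random.Random(seed)
    current, current_score = start, objective(start)
    best, best_score = current, current_score
    for t in range(steps):
        candidate = neighbor(current, rng)
        score = objective(candidate)
        if accept(score - current_score, temperature(t), rng):
            current, current_score = candidate, score
            if score > best_score:
                best, best_score = candidate, score
    return best, best_score


def toy_objective(params):
    """Hypothetical stand-in for AHSC; peaks at (8, 3)."""
    a, b = params
    return -((a - 8) ** 2) - ((b - 3) ** 2)


def toy_neighbor(params, rng):
    """Change one of the two hypothetical integer parameters by +/-1."""
    values = list(params)
    values[rng.randrange(len(values))] += rng.choice((-1, 1))
    return tuple(values)


if __name__ == "__main__":
    print(anneal(toy_objective, (1, 1), toy_neighbor))
```

Running the loop with different seeds can end at different parameter settings with similar scores, which mirrors the observation that many settings yield close-to-optimal results.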

Future Work
• Implement the AI decision-making system, including the decision analysis and knowledge base system
• Experiment on a large number of different types of molecular biological sequences to determine the proper annealing temperature schedules and successor heuristics and/or their parameters
• Parallelize AutoSimS
• Incorporate core ideas of more efficient very large-scale sequence comparison techniques, such as LSH (Locality-Sensitive Hashing) [7]
• Generate statistical estimates for the local alignment score distributions [1], which will be used in AutoSimS’s decision-making system
• Explore different ASA objective functions, which may improve results

Conclusion
• ASA’s ability to fit complex functions, i.e. nonlinear search spaces with multiple variables, allows it to find a suitable set of parameters for SIM/DDS
• Incorporating an AI decision-making system into our ASA-SIM/DDS program should enhance our ability to achieve almost autonomous two-sequence similarity analysis with high-volume throughput and acceptable performance
• Our use of simulated annealing to find a suitable set of parameters can be adapted to other bioinformatics analysis programs, such as alignment and clustering

References
[1] Altschul, S. F., Bundschuh, R., Olsen, R. and Hwa, T., The Estimation of Statistical Parameters for Local Alignment Score Distributions. Nucleic Acids Research, Vol. 29, No. 2, 351–361, 2001
[2] Jiang, T., Xu, Y. and Zhang, M. Q., Current Topics in Computational Molecular Biology. MIT Press, 2002
[3] Huang, X. and Miller, W., A Time-Efficient, Linear-Space Local Similarity Algorithm. Advances in Applied Mathematics, Vol. 12, 337–357, 1991
[4] Ingber, L., Simulated Annealing: Practice versus Theory. Mathl. Comput. Modelling, Vol. 18, No. 11, 29–57, 1993
[5] Borkowski, J. A., Smith, C. P. and Huang, X., PFP: A Flexible Integrated Filtering and Masking Tool. Paracel Inc., Pasadena, CA
[6] Tech Topics, Michigan Technological University, Vol. XXVIII, No. 9, Nov. 3, 1995
[7] Buhler, J., Efficient Large-Scale Sequence Comparison by Locality-Sensitive Hashing. Bioinformatics, Vol. 17, No. 5, 419–428, 2001
