120 likes | 261 Views
Optimatization of a New Score Function for the Detection of Remote Homologs. Kann et al. Introduction. New method to calculate a score function, aiming to optimize the ability to discriminate between homologs and non-homologs Existing software uses the following to compute an alignment score:.
E N D
Optimatization of a New Score Function for the Detection of Remote Homologs Kann et al
Introduction • New method to calculate a score function, aiming to optimize the ability to discriminate between homologs and non-homologs • Existing software uses the following to compute an alignment score:
Number of times AA i is aligned with AA j Number of gaps in alignment Number of residues in each gap beyond one Score function / Substitution matrix Contribution to score for AA match/mismatch Contribution to score for gap initialization Contribution to score for gap extension
Current Methods to Calculate Homology • p(Sr > x): probability that a random pair of proteins of the same length would have that score • E: expected number of random proteins in the db that would have at least that score • P: probability that there is at least one random pair with a higher score • As p(Sr > x), E, P increase, the likelihood that the given pair is homologous decreases
Current Score Matrices • PAM (percent accepted mutations) – Dayhoff • GCB, JTT: used to apply to larger sequence datasets • BLOSUM62 – Henikoff & Henikoff, constructed using a dataset of aligned sequence blocks • STR – protein sequences aligned based on their observed structures
Limitations of Current Score Functions • Current score functions assume independent evolution of each location, overlooking correlations • Score functions derived from a db of properly aligned proteins, not on alignments between random sequences • Gap penalty a priori
Theory Z score for alignment: • Characterize the significance of alignment score by calculating the likelihood that this score or higher would be obtained by a random match • Account for variations in E with the length of the proteins
Theory • Score function optimized by maximizing the confidence <C> over the training set • Avoids dependence on extreme E values (easily detected or overly distant homologies) • Eliminates contribution of falsely identified homologies (overly distant)
Database Preparation • Use set of known homologs whose homology cannot be reliably determined with standard pairwise comparison, in order to optimize score function for detection of distant homologs • Training set: 900 pairs of protein in same COG with < 25% sequence identity
Optimization of Score Function • Align using BLOSOM62 matrix • Calculate Z and C for each pair of homologs, then averaged over pairs in training set to yield <C> • Generate initial alignments using gap penalties that yielded highest C values • ~10 cycles of optimization and realignments until score function converged
Results • Small changes in gap penalties: most of the improvement cones from refinements of • OPTIMA: resulting score function • has significantly improved average confidence <C> value compared with other score matrices • <p(Sr > x)>, <P> significantly decreased
Summary • Aim: optimize score matrix to discriminate between homologs and non-homologs • OPTIMA score function: more successful at discriminating between homologs and non-homologs compared with standard score matrices • Gap penalties treated as additional parameters to be optimized