Design of score functions for recognition of protein folds. T. Galor. Joint work with Ron Elber and Thorsten Joachims. Structures are vital to understanding protein function. Researchers want to know: what are the active-site residues for an enzymatic reaction?
Design of score functions for recognition of protein folds. T. Galor. Joint work with Ron Elber and Thorsten Joachims.
Structures are vital to understanding protein function • Researchers want to know: • What are the active-site residues for an enzymatic reaction? • Where are the active sites for signal transmission? The HIV protease plays a major role in cell infection. There is a need for a tool that finds protein structure. For drug design it is important to know the binding sites.
From homologues (evolutionarily related proteins) to structure. We present procedures for identifying homologous structures in a given library of structures that spans the Protein Data Bank. • An annotated homologous structure may give a clue to the function of the probe protein. • The homologous structure may be used as a scaffold for modeling the probe structure.
Sequence alignment. S1: AWHFFAI S2: AHGI. Only sequence information is used, both for the query and for the target.
Some homologues do not share high sequence similarity. Myoglobin (1mba) and leghemoglobin (1bin:A) have similar structures, but their sequence identity is only 14%. When sequence similarity fails, we can use a different similarity measure: threading.
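As a small illustration of what "sequence identity" measures, here is a minimal Python sketch. The function name, the gap character and the choice of denominator (columns where both sequences have a residue) are assumptions made for this example; several conventions for the denominator are in use, and the slides do not say which one gives the 14% figure.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent of aligned, gap-free columns in which the two residues match.

    Assumption: both strings are rows of the same pairwise alignment,
    so they have equal length and use `gap` for gap positions.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only the columns where both sequences have a residue.
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != gap and b != gap]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)
```

For example, with the hypothetical alignment of AWHFFAI against A-H--GI, four columns are gap-free and three of them match.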
Fold prediction by threading when sequence identity is low. Evaluate how well an amino acid sequence fits into a known three-dimensional (3D) protein structure. Test the fitness of the probe sequence against structures from an existing library. Choose the most significant fit among the sequence/structure matches. Example probe sequence: MGFPIPDPYV … KGKI
Definition of protein characteristics. A protein is characterized by its amino acid sequence S (S: AW….HI) or by its structure X (X: s1s0….s2s2). A structure site can be classified according to the number of contacts with its neighbors; the number of contacts determines whether a site is buried or exposed. Polar sites have a low number of contacts. Hydrophobic sites have more.
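The classification of sites by contact number can be sketched in a few lines of Python. Everything specific here is an assumption for illustration: one coordinate per residue, a 6.5 Å distance cutoff, and the bin edges that map contact counts to the labels s0 to s3. The slides do not give the actual cutoff or bins.

```python
import math

def contact_counts(coords, cutoff=6.5):
    """Count, for each site, how many other sites lie within `cutoff`.

    coords: one (x, y, z) tuple per residue (assumed representation).
    cutoff: contact distance in angstroms (assumed value).
    """
    n = len(coords)
    counts = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) < cutoff:
                counts[i] += 1
                counts[j] += 1
    return counts

def site_types(counts, bin_edges=(2, 4, 6)):
    """Map contact counts to labels s0..s3 using hypothetical bin edges.

    A site with fewer than 2 contacts is s0 (most exposed); a site with
    6 or more contacts is s3 (most buried).
    """
    return ["s%d" % sum(c >= edge for edge in bin_edges) for c in counts]
```

Applied to every residue of a target structure, this yields a string of site labels of the kind written as X above.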
Focus on threading. Sequence information is used for the probe protein (S1: AWGHKI) and structural information for the target (X2: s1s0s2s3s0s3).
Many ways to align two proteins. S1: AWHFFAI S2: AHGI. There are many different alignments of two sequences.
The need to score the alignments. The score of an alignment is built from two sets of counts, each weighted by its cost: the number of amino acids of type ai placed at sites of type sj, and the number of gaps placed at sites of type sj.
The cost matrix W. A numerical value is assigned to every cell according to how well an amino acid fits a structural site. The values may be simple scores or more elaborate ones, related to chemical similarity or to observed frequencies.
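To make the role of the cost matrix concrete, the following Python sketch scores one given alignment as counts times costs. The data layout (a list of residue/site pairs with None marking a gap) and the numerical costs in the usage note are invented for illustration and are not the parameters used in this work.

```python
from collections import Counter

def alignment_counts(alignment):
    """Count how often each (amino acid, site type) or (gap, site type) pair occurs.

    alignment: list of (residue, site_type) pairs; residue is None where
    a gap is placed at that structural site (assumed representation).
    """
    return Counter(("gap" if aa is None else aa, site) for aa, site in alignment)

def alignment_score(alignment, cost):
    """Sum of count * cost over all pair types present in the alignment.

    cost: dict mapping (amino acid or "gap", site type) to a number,
    playing the role of the cost matrix W.
    """
    counts = alignment_counts(alignment)
    return sum(n * cost[key] for key, n in counts.items())
```

With the toy alignment [("A", "s1"), ("W", "s0"), (None, "s2"), ("A", "s1")] and the made-up costs {("A", "s1"): -1.0, ("W", "s0"): 0.5, ("gap", "s2"): 2.0}, the pair (A, s1) is counted twice and the other two entries once each.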
How to find the optimal alignment. Example: S1: PVRC, S2: MPR, X2: s1s1s3. Given a cost matrix and gap costs, dynamic programming finds the optimal alignment in O(nm) time, out of an exponential number (2^(n+m)) of possible alignments.
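The dynamic programming step can be sketched as a global alignment of a sequence onto a string of site types that minimizes the total cost. This is a generic Needleman-Wunsch-style recursion written for illustration, not the LOOPP implementation; the function name, the site-dependent `gap_cost` dictionary and the flat `skip_cost` for a residue left unaligned are assumptions.

```python
def thread(seq, sites, cost, gap_cost, skip_cost=1.0):
    """Return the minimum total cost of aligning `seq` onto `sites`.

    seq:       amino acid sequence of the probe, e.g. "MPR"
    sites:     list of site types of the target, e.g. ["s1", "s1", "s3"]
    cost:      dict (amino acid, site type) -> cost of that placement
    gap_cost:  dict site type -> cost of leaving that site empty
    skip_cost: cost of a residue aligned to no site (assumed constant)
    """
    n, m = len(seq), len(sites)
    inf = float("inf")
    # E[i][j]: best cost of aligning the first i residues to the first j sites.
    E = [[inf] * (m + 1) for _ in range(n + 1)]
    E[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:  # place residue i at site j
                E[i][j] = min(E[i][j], E[i - 1][j - 1] + cost[(seq[i - 1], sites[j - 1])])
            if j > 0:            # gap at site j
                E[i][j] = min(E[i][j], E[i][j - 1] + gap_cost[sites[j - 1]])
            if i > 0:            # residue i left unaligned
                E[i][j] = min(E[i][j], E[i - 1][j] + skip_cost)
    return E[n][m]
```

The table has (n+1)(m+1) cells and each cell is filled in constant time, which is where the O(nm) running time comes from. Recovering the alignment itself, rather than only its cost, would need an additional traceback over the same table.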
Present alignment scores are not accurate enough. There is a need for • a better scoring cost, • a better gap cost, or • a statistical measure that accompanies the similarity measure (Seq2Seq/Seq2Struc) and enhances the signal.
The Z score measure • Define a set of random sequences and thread each one onto the same target structure. • Compare the alignment cost of the true probe sequence with the average cost of the random sequences. • This gives an estimate of the significance of the probe's score relative to the typical score of a random sequence.
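The Z score procedure can be sketched in Python as follows. Two choices are assumptions made for this example: the random sequences are generated by shuffling the probe (which preserves its composition), and the sign is chosen so that a probe with lower cost than the random average gets a positive Z score. The costs themselves would come from threading each sequence onto the target, which is outside this sketch.

```python
import random
import statistics

def shuffled_sequences(seq, count, seed=0):
    """Generate `count` random sequences by shuffling the residues of `seq`."""
    rng = random.Random(seed)
    result = []
    for _ in range(count):
        residues = list(seq)
        rng.shuffle(residues)
        result.append("".join(residues))
    return result

def z_score(probe_cost, random_costs):
    """Distance of the probe cost from the random mean, in standard deviations.

    Positive values mean the probe scores better (lower cost) than a
    typical random sequence threaded onto the same target.
    """
    mean = statistics.mean(random_costs)
    spread = statistics.pstdev(random_costs)
    if spread == 0:
        raise ValueError("random costs have zero spread; Z score is undefined")
    return (mean - probe_cost) / spread
```

A typical use would thread the probe and each shuffled sequence onto the same target, collect the costs, and pass them to `z_score`.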
A novel method for designing energy cost function parameters. Current methods try to recognize the native structure versus decoys. Difficulties: the result is not exact, and the number of alignments is exponential. But the goal is to identify homologues. The answer: recognize homologues versus decoys.
The steps to design energy cost function parameters • An energy function. • A training set. • An algorithm to train proteins to recognize their correct folds. • An algorithm to solve the “training” conditions: use mathematical programming to solve the set of equations. • Evaluation of the new score parameters: evaluate the performance on an independent set.
Some notation: alignment. An alignment of protein SI onto protein SII results in a path g. Recall that there are many possible paths (2^(n+m)); together they form the set of alignments.
The cost matrix definition. The total energy of the alignment, denoted Etotal, is used as a measure of the similarity between the two proteins.
The cost function. Our cost function assigns a score to each alignment; the score depends on the alignment and on the alignment coefficients.
Optimize the parameter W such that the native energy is the lowest. Instead of solving for the unknown optimal path, we solve for all paths. Unfortunately, the number of inequalities is exponentially large.
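One way to write this condition, assuming (as the counts-and-costs description of the score suggests) that the energy is linear in the parameters. The symbols n, nat and dec are notational assumptions introduced here, and the path is written γ where the notation slide writes g; the original equations are not recoverable from the slide text.

```latex
E(\gamma; w) \;=\; \sum_k w_k \, n_k(\gamma) \;=\; w \cdot n(\gamma),
\qquad
\Delta E \;=\; E(\gamma_{\mathrm{dec}}; w) - E(\gamma_{\mathrm{nat}}; w)
\;=\; w \cdot \bigl[\, n(\gamma_{\mathrm{dec}}) - n(\gamma_{\mathrm{nat}}) \,\bigr] \;>\; 0
\quad \text{for every decoy path } \gamma_{\mathrm{dec}} .
```

Under this assumption each inequality is linear in w, so the whole set defines a cone of acceptable parameter vectors; the difficulty is only that there is one inequality per decoy path.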
Use a statistical machine learning algorithm: find the middle point of the cone of parameter vectors w allowed by the constraints. [Figure: the constraints and the feasible cone in the (w1, w2) plane.]
Algorithm (converges in polynomial time): 1. Set i = 0 and start from wi. 2. Compute the optimal alignments using dynamic programming. 3. Solve ΔE > 0 to obtain wi+1. 4. If ||wi - wi+1|| < 0.001, stop; otherwise set i = i + 1 and return to step 2.
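The control flow of this iteration can be sketched in Python. This is an outline under stated assumptions, not the implementation used in this work: `find_violations` stands for the dynamic programming step that returns the count differences (decoy minus native) of alignments that currently violate ΔE > 0, and `perceptron_solve` is a simple margin perceptron used here as a stand-in for the SVM solver. All names, the margin and the learning rate are hypothetical.

```python
import math

def dot(w, d):
    return sum(wi * di for wi, di in zip(w, d))

def perceptron_solve(constraints, w, margin=1.0, rate=0.1, max_pass=1000):
    """Stand-in solver (NOT an SVM): adjust w until w . d >= margin for all d."""
    w = list(w)
    for _ in range(max_pass):
        updated = False
        for d in constraints:
            if dot(w, d) < margin:
                w = [wi + rate * di for wi, di in zip(w, d)]
                updated = True
        if not updated:
            break
    return w

def train(w0, find_violations, solve=perceptron_solve, tol=1e-3, max_iter=100):
    """Alternate between finding violated constraints and re-solving for w.

    Stops when two successive parameter vectors differ by less than `tol`,
    mirroring the ||w_i - w_(i+1)|| < 0.001 test of the algorithm.
    Returns the final parameters and the number of iterations used.
    """
    w = list(w0)
    constraints = []
    for iteration in range(1, max_iter + 1):
        constraints.extend(find_violations(w))  # role of the DP step
        w_next = solve(constraints, w)          # role of "solve delta E > 0"
        if math.dist(w, w_next) < tol:
            return w_next, iteration
        w = w_next
    return w, max_iter
```

The point of the design is that only the constraints generated by the currently optimal alignments are ever stored, so the exponentially large set of inequalities is never enumerated.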
Toy example, new method (sequence alignment). Number of sets in training: 1169. Number of errors: 225. Number of iterations: 6. Native: 1chg (chymotrypsinogen A), length 245. Homologue: 2alp. [Figure: -ΔE versus set number.]
Toy example, old method. Number of sets in training: 1169. Number of errors: 725. Native: 1chg (chymotrypsinogen A), length 245. Homologue: 2alp. [Figure: -ΔE versus set label.]
LOOPP http://ser-loopp.tc.cornell.edu/loopp.html • Input: a query sequence and a cost model. • Compute the cost of aligning the query to the proteins in the LOOPP database. • Compute the Z score for the subset of proteins with the highest scores. • Select the best prediction using a number of scores. • Output: a list of homologue targets and their alignments.
Design of score parameters. [Diagram: a master node; nodes 1 to n, each running LOOPP; SVM; analysis with a Perl script.]
SUMMARY • We introduce a consistent method for designing cost function parameters using threading, which also makes it possible to design gap parameters. • We overcome the problem of solving an exponential number of inequalities by using an iterative SVM formulation.
Tools to determine protein folds • The number of new protein sequences is growing exponentially relative to the number of protein structures solved by experimental methods. • There is a need for a protein annotation tool that uses sequence and structural information. • The method needs to be quick • and to give reliable answers.