350 likes | 509 Views
Remote Homology detection : A motif based approach. CS 6890: Bioinformatics - Dr. Yan Swati Adhau 04/14/06. Outline of the Presentation. Motivation Introduction Description (Remote Homology Detection) Methods Results & Discussion Q and A. Motivation.
E N D
Remote Homology detection:A motif based approach CS 6890: Bioinformatics - Dr. Yan Swati Adhau 04/14/06
Outline of the Presentation • Motivation • Introduction • Description (Remote Homology Detection) • Methods • Results & Discussion • Q and A
Motivation • Remote homology detection is the problem of detecting homology in case of low sequence similarity. • A method based on presence of discrete sequence motifs for detecting remote homology. • The motif content of a pair of sequences is used to define a similarity that is used as a kernel for support vector machine (SVM) classifier.
Testing of method is done upon two remote homology detection tasks 1) Prediction of previously unseen SCOP family (Structural classification of Proteins). 2) Prediction of an Enzyme class given other enzymes that have a similar function on other substrates.
Introduction • Protein Homology detection is one of the most important problems in computational biology. • Homology is generally established by sequence similarity. • Two established methods 1) Smith Waterman algorithm 2) Blast • Protein sequence motifs are an alternative method of detecting sequence similarity
Intro(continued) • By focussing on limited highly conserved regions of proteins, motifs can often reveal important clues to a proteins role. • Motifs often represent functionally important regions such as catalytic sites, binding sites and structural motifs. • The Blocks+ database combines various databases such as pFAM, PRINTs, ProDom, DOMO and InterPro. eMotif database contains discrete sequence motifs constructed from blocks of BLOCKS+. • This paper uses discrete sequence motifs extracted from the eBLOCKS database using the eMOTIF method.
Intro(Continued) • Based upon the motif content of a pair of sequence we introduce sequence similarity measure. • This paper uses an SVM method. • SVM method is shown to perform better than methods for Fisher-Kernel method, SAM T-98 and PSI-BLAST. • When a sequence similarity is shown to be a dot product in some space it is called the kernel. • In this paper we use protein motifs to construct a kernel that can be computed efficiently which performs better than a kernel based on BLAST or Smith-Waterman scores.
Remote Homology Detection This method was tested on the following two tasks:- 1) Prediction of a SCOP family when trained on other families in that family’s fold. 2) Prediction of the function of an enzyme when the training set contains enzyme that have same general functions but different substrates.
BackGround of the first dataset • The first dataset is composed of sequences of domains from the SCOP database. Objective:- To detect homology at the SCOP superfamily level. Recognizing a SCOP family when the training set contains other families in the family’s superfamily.
…contd • This specifies the +ve examples in the test set and training set. • The –ve examples are taken from outside of the family’s fold. • A random family is chosen to belong to -ve test set & rest of the families in it’s superfamily are added to negative training set.
The second dataset • We use the classification of Enzymes to simulate remote homology. • The function of an enzyme is given by EC number given it to by Enzyme Commision. • EC number is like n1.n2.n3.n4 For eg 1.1.3.13 for alcohol oxidase. n1 – 1-6 :- indicates the type of chemical reaction catalyzed by the enzyme. n2 – specifies donor molecule. n3 – specifies the acceptor. n4 – specifies the substrate.
…contd • In this paper author concentrates on oxidoreductase (n1 = 1). • A classifier is trained to predict oxidoreductases with a certain function (n2 & n3). • The classifier will be tested on oxidoreductases with adifferent substrate (n4) than those it was trained on.
For eg. EC class 1.14.13.8 Positive examples of training set. EC class 1.14.13.39 Positive examples of test set. • So the similarity between the +ve training & test may not be very high. • Negative test & training set are defined analogusly.
MethodsThe Motif kernel • When the similarity is a dot product it is called a kernel. The method is as follows:- Each position in the motif represents the variability in the column in a block from multiple sequence alignment. For eg the motif [as].dkf[filmv]..[filmv]…l[ast]. [filmv] is a substitution group.
….contd • If the pattern of amino acids that appear in a column of a block does not match any substitution group, then the motif contains the wild card symbol ‘.’ . • A sequence will or match above motif if it has either an a an s in some position, then any character, then d, k, f & so on, matching until the end of motif
A sequence x contains a motif m, if x contains m at some position. • A sequence x can be represented in vector space indexed by a set of motifs M as follows:- F(x) = (fm(x))mЄM where fm(x) is the number of occurences of the motif m in x. We can define motif kernel as K(x, x’) = F(x) F(x’)
As in the most cases a motif appears only once in sequence, this kernel will count the number of motifs that are common to both sequence. Q Why are we using eBlocks database over other motif databases to define a motif kernel? Ans:- • Usage of databases like PROSITE & the eMOTIF presents a problem in the evaluation of performance of the kernel. • The eBLOCKS database are generated in an unsupervised way • Increased coverage of eBLOCKS set of BLOCKS.
Computing the Motif kernel • To compute the motif content of each sequence; the subsequent computation of the kernel is simply a dot product between the vectors. • To facilitate the efficient computation of the motif content of a sequence, the motif database is stored in TRIE which is defined as follows. Let m be a motif over the alphabet A U S U {.} Every prefix of m has a node. Let m1 and m2 be prefixes of m; there is an edge from m1 to m2 if lm2l = lm1l +1. To compute all the motifs that are contained in x at any position, this search is started at each position of x. __
The Blast kernel • A query sequence by its BLAST scores against the training set is represented. • This representation in conjuction with SVMs was used to address the problem of remote homology detection • Results were better than Fisher-kernel method.
Classification Methods • We report results using two classification methods:- 1) SVMs 2) K-Nearest-Neighbour. SVM f(x) = w.x + b w weight vector b constant bias Query is classified according to the sign of f.
As a consequence of optimization process, the weight vector can be expressed as a weighted sum of the Support Vectors(SV):- w = S bixi • The decision function is now written as f(x) = S bixi * x + b • In terms of kernel function, the decision is expressed as:- f(x) = S bi K(xi , x) + b iÎSV iÎSV iÎSV
KNN classifier • We use a KNN classifier with a continuous valued decision functions. • A score for class j is defined as fj(x) = S K(xi , x) • kNNj(x)is the set of k nearest neighbors of x in class j. iÎkNNj(x)
Metrics • We consider two metrics for asessing the performance of a classifier • ROC (area under receiver operator characteristic). • RFP (the median rate of false positive) • The ROC curve describes the tradeoff between sensitivity and specificity. • More specifically we use ROC50 curve, which counts true positives only up to the first 50 false positives. • The RFP score of a positive test sequence x is the fraction of negative test sequences that have a value of the decision function that is at least as high as the value of the decision function of x.
Results • Use of astral database to obtain protein domain sequences of the SCOP database. • Retained only superfamilies having atleast two families that have atleast 10 members in each family. • A dataset with1639 domains in 23 superfamilies & 56 families was yielded. • Protein sequences annotated with EC numbers were extracted from SwissProt database. • The extracted dataset has 2187 enzymes in 65 classes. • To generate Blast kernel, authors ran an all vs all BLAST on two datasets using default parameters & E value cut off 0.1. • To generate motif kernel, datasets were computed with eBLOCKS sequence motifs using the TRIE method.
Results …contd • A family by family comparison of classification performance of the motif-SVM & BLAST-SVM methods is provided in figure in next slide. • On the SCOP task the motif-SVM method performs significantly better than BLAST-SVM method with a p-value of 3.9 * 10-9. in a wilcoxon signed rank test for the ROC50 score. • In enzyme classification task there is no significant difference in ROC50 scores. • Similar behavior is observed in the median RFP and RFP50. • The results were similar when Smith-Waterman algorithm was used instead of BLAST.
Results …contd • The motif kernel in figure 4 shows the similarity between the families in superfamily whereas none is detected by the BLAST kernel. • Increased sensitivity of motif kernel.
Results …contd • Figure 5 shows the comparison of the SVM-based method to the one that uses KNN as a classifier. • In both the motif and BLAST kernels, SVM based classifier performs significantly better than corresponding KNN classifier.
Discussion • This paper showed that an SVM classifier that uses motif kernel performs significantly better than SVM that uses a BLAST/Smith-Waterman kernel on a remote homology detection problem derived from SCOP database. • Both methods performed equally well on the task of Enzyme detection. • BLAST kernel & motif kernel worked significantly better when used in conjunction with an SVM rather than a Nearest Neighbor classifier. • Despite the relative success of motif method, there were many SCOP families & EC classes that were not detected using this method.