Protein Family Classification using Sparse Markov Transducers. Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology (ISMB2000), pp. 134-145. E. Eskin, W.N. Grundy, and Y. Singer. Cho, Dong-Yeon.
Abstract • Classifying proteins into families using sparse Markov transducers (SMTs) • Estimation of a probability distribution conditioned on an input sequence • Similar to probability suffix trees • Allowing for wild-cards • Two models • Efficient data structures
Introduction • Protein Classification • Pairwise similarity • Creating profiles for protein families • Consensus patterns using motifs • HMM-based approaches • Probability suffix trees (PSTs) • A PST is a model that predicts the next symbol in a sequence based on the previous symbols. • This approach is based on the presence of common short sequences (motifs) throughout the protein family. • One drawback of PSTs is that they rely on exact matches to the conditional sequences. • Example (3-hydroxyacyl-CoA dehydrogenase): VAVIGSGT and VGVLGLGT do not match each other exactly, but both match V*V*G*GT, where * is a wild card.
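The wild-card example above can be checked with a few lines of code. This is a minimal sketch, not part of the original slides; the function name `matches` is hypothetical, and `*` is assumed to match exactly one symbol.

```python
def matches(pattern: str, sequence: str, wildcard: str = "*") -> bool:
    """Return True if `sequence` fits `pattern`; the wildcard matches any one symbol."""
    if len(pattern) != len(sequence):
        return False
    return all(p == wildcard or p == s for p, s in zip(pattern, sequence))


pattern = "V*V*G*GT"
for seq in ("VAVIGSGT", "VGVLGLGT"):
    # Both motif instances fit the wild-card pattern ...
    print(seq, matches(pattern, seq))

# ... although they do not match each other exactly, which is what a PST would need.
print(matches("VAVIGSGT", "VGVLGLGT"))
```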
Sparse Markov Transducers (SMTs) • A generalization of PSTs • An SMT can condition the probability model on a sequence that contains wild-cards. • In a transducer, the input symbol alphabet and output symbol alphabet can be different. • Two methods • Single amino acid • Protein family • Efficient data structure • Experiments • Pfam database of protein families
Sparse Markov Transducers • A Markov Transducer of Order L • Conditional probability distribution • Xk are random variables over an input alphabet • Yk is a random variable over an output alphabet • Sparse Markov Transducer • Conditional probability distribution • φ: wild card • Two approaches for SMT-based protein classification • A prediction model for each family: single amino acid • A single model for the entire database: protein family
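The two conditional distributions named on this slide were images and did not survive extraction. The block below is a reconstruction from the bullet definitions (order L, inputs X, output Y, a wild-card symbol written here as φ, a name inferred from the MAX_PHI parameter), offered as an assumption rather than a quotation of the slide. The indices n_i, k and t_i are introduced here for illustration: n_i is the number of consecutive wild cards before the i-th conditioning symbol.

```latex
% Markov transducer of order L: condition on the L most recent inputs
P\left(Y_t \mid X_t X_{t-1} \cdots X_{t-(L-1)}\right)

% Sparse Markov transducer: \phi^{n_i} stands for n_i consecutive wild cards
P\left(Y_t \mid \phi^{n_1} X_{t_1} \phi^{n_2} X_{t_2} \cdots \phi^{n_k} X_{t_k}\right),
\qquad t_i = t - \sum_{j=1}^{i} n_j - (i-1)
```

With all n_i = 0 the second form reduces to the first, which is the sense in which the sparse model generalizes the ordinary one.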
Sparse Markov Trees • Representationally equivalent to SMTs • The topology of a tree encodes the positions of the wild-cards in the conditioning sequence of the probability distribution.
Training a Prediction Tree • A set of training examples • The input symbols are used to identify which leaf node is associated with that training example. • The output symbol is then used to update the count of the appropriate predictor. • The predictor keeps counts of each output symbol seen by that predictor. • We smooth each count by adding a constant value to the count of each output symbol (cf. Dirichlet distribution). • Example (leaf u1): training examples (DACDADDDCAA, C) and (CAAAACAD, D); for the query AACCAAA the prediction is C: 0.5, D: 0.5.
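The counting-and-smoothing step can be sketched as follows. The class name `Predictor`, the pseudocount value 0.5, and the restriction of the output alphabet to C and D are all assumptions made for illustration; the slide only states that a constant is added to each count.

```python
from collections import Counter


class Predictor:
    """Leaf predictor: output-symbol counts with additive smoothing."""

    def __init__(self, alphabet, pseudocount=0.5):
        self.alphabet = list(alphabet)
        self.pseudocount = pseudocount  # assumed value, not from the slides
        self.counts = Counter()

    def update(self, symbol):
        # One training example reached this leaf with this output symbol.
        self.counts[symbol] += 1

    def probability(self, symbol):
        # (count + constant) / (total count + constant * alphabet size)
        total = sum(self.counts[s] for s in self.alphabet)
        total += self.pseudocount * len(self.alphabet)
        return (self.counts[symbol] + self.pseudocount) / total


# Leaf u1 from the slide: it has seen output C once and output D once.
u1 = Predictor(alphabet=("C", "D"))
u1.update("C")
u1.update("D")
print(u1.probability("C"), u1.probability("D"))
```

Because the two counts are equal, both symbols get probability 0.5 whatever the smoothing constant is, which agrees with the slide's example.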
Mixture of Sparse Prediction Trees • We do not know which tree topology can best estimate the distribution. • A mixture technique employs a weighted sum of trees as a predictor. • The weight of each tree is updated for each input string in the data set, based on how well the tree performed in predicting the output. • The prior probability of a tree is defined by the topology of the tree.
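A minimal sketch of the mixture idea, assuming the common multiplicative (Bayesian) update in which each tree's weight is multiplied by the probability it assigned to the observed output. The function names and the toy numbers are hypothetical, and the slides do not spell out the exact update rule.

```python
def mixture_predict(weights, tree_probs):
    """Weighted average of the per-tree probabilities for one output symbol."""
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, tree_probs)) / total


def update_weights(weights, tree_probs_for_observed):
    """Multiply each tree's weight by the probability it gave the observed output."""
    return [w * p for w, p in zip(weights, tree_probs_for_observed)]


# Two trees with equal prior weight (in the SMT setting the prior
# would come from the tree topology).
weights = [0.5, 0.5]

# Tree 0 predicted the observed output well, tree 1 poorly.
weights = update_weights(weights, [0.9, 0.1])

# On the next example the mixture leans toward tree 0's opinion.
print(mixture_predict(weights, [0.8, 0.2]))
```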
Implementation of SMTs • Two important parameters • MAX_DEPTH: the maximum depth of the tree • MAX_PHI: the maximum number of wild-cards at every node • Example: ten trees in the mixture if MAX_DEPTH = 2 and MAX_PHI = 1
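The count of ten can be reproduced by a short recursion under some stated assumptions: a three-symbol alphabet (A, C, D, as in the slide examples), every wild card consuming one level of depth, and each internal node choosing a single wild-card count shared by all of its branches. The function `count_trees` is a hypothetical helper, not part of the original implementation.

```python
from functools import lru_cache


def count_trees(max_depth: int, max_phi: int, alphabet_size: int) -> int:
    """Count tree topologies reachable under MAX_DEPTH and MAX_PHI."""

    @lru_cache(maxsize=None)
    def count(depth: int) -> int:
        total = 1  # option 1: this node is a leaf
        for n_wild in range(max_phi + 1):
            # Children sit below n_wild wild cards plus one real symbol.
            child_depth = depth + n_wild + 1
            if child_depth <= max_depth:
                # One independent subtree per symbol of the alphabet.
                total += count(child_depth) ** alphabet_size
        return total

    return count(0)


print(count_trees(max_depth=2, max_phi=1, alphabet_size=3))  # 10
```

The ten topologies break down as 1 (root only) + 8 (root with no wild card, each of three children either a leaf or expanded once) + 1 (root with one wild card, children at depth 2).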
Template tree • We only store those nodes which are reached during training (example inputs: AA, AC and CD).
Efficient Data Structures • Performance of the SMT typically improves with higher MAX_PHI and MAX_DEPTH. • Memory usage becomes the bottleneck because it restricts these parameters to values that allow the tree to fit in memory.
Lazy Evaluation • We store the tails of the training sequences and recompute the part of the tree on demand when necessary. • Example with EXPAND_SEQUENCE_COUNT = 4: ACDACAC(D); ACDACAC(A), DACADAC(C), DACAAAC(D), ACACDAC(A), ADCADAC(D)
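One plausible reading of lazy evaluation is sketched below: a node keeps the unprocessed tails of the sequences that reach it and only builds its children once EXPAND_SEQUENCE_COUNT tails have accumulated. The class `LazyNode`, the left-to-right consumption of tails, the omission of wild-card branches, and the treatment of the parenthesized symbol as the output are all simplifying assumptions, not details confirmed by the slides.

```python
EXPAND_SEQUENCE_COUNT = 4  # value shown on the slide


class LazyNode:
    """Tree node that defers building its subtree until enough data arrives."""

    def __init__(self):
        self.counts = {}      # output symbol -> count at this node
        self.pending = []     # stored (tail, output) pairs, not yet pushed down
        self.children = None  # stays None until the node is expanded

    def add(self, tail, output):
        self.counts[output] = self.counts.get(output, 0) + 1
        if self.children is not None:
            self._push_down(tail, output)
            return
        self.pending.append((tail, output))
        if len(self.pending) >= EXPAND_SEQUENCE_COUNT:
            # Expand on demand: replay the stored tails into new children.
            self.children = {}
            stored, self.pending = self.pending, []
            for t, o in stored:
                self._push_down(t, o)

    def _push_down(self, tail, output):
        if not tail:
            return
        child = self.children.setdefault(tail[0], LazyNode())
        child.add(tail[1:], output)


root = LazyNode()
examples = [("ACDACAC", "A"), ("DACADAC", "C"), ("DACAAAC", "D"),
            ("ACACDAC", "A"), ("ADCADAC", "D")]
for tail, output in examples:
    root.add(tail, output)

# The root expanded after the fourth example; its children are still lazy.
print(sorted(root.children), root.children["A"].children)
```

The memory saving comes from the children: they hold a short list of tails instead of a fully built subtree until enough sequences reach them.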
Methodology • Data • Two versions of the Pfam database • Version 1.0: for comparison with previous results • Version 5.2: the latest version • 175 protein families • A total of 15610 single domain protein sequences containing a total of 3560959 residues • Training and test data with a ratio of 4:1 for each family • Example: transmembrane receptor family: 530 protein sequences (424 training + 106 test) • The 424 sequences of the training set give 108858 subsequences that are used to train the model.
Building SMT Prediction Models • A prediction model for each protein family • A sliding window of size 11 • Prediction of the middle symbol a6 using neighboring symbols • The input symbols are a5 a7 a4 a8 a3 a9 a2 a10 a1 a11. • MAX_DEPTH = 7 and MAX_PHI = 1 • Classification of a Sequence using an SMT Prediction Model • Computation of the likelihood for an unknown sequence • A sequence is classified into a family by computing the likelihood of the fit for each of the 175 models.
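The window construction and the likelihood-based classification can be sketched as follows. The helper names, the `predict(inputs, output)` callable standing in for a trained per-family model, and the use of an unnormalized summed log-likelihood are assumptions for illustration; the toy models are not real SMTs.

```python
import math


def interleaved_window(window: str):
    """Split an odd-length window into (input symbols, middle symbol).

    For a1..a11 the inputs come out as a5 a7 a4 a8 a3 a9 a2 a10 a1 a11.
    """
    mid = len(window) // 2
    inputs = []
    for offset in range(1, mid + 1):
        inputs.append(window[mid - offset])  # left neighbour
        inputs.append(window[mid + offset])  # right neighbour
    return "".join(inputs), window[mid]


def examples_from(sequence: str, size: int = 11):
    """All (inputs, output) pairs from sliding a window over the sequence."""
    return [interleaved_window(sequence[i:i + size])
            for i in range(len(sequence) - size + 1)]


def log_likelihood(sequence, predict, size=11):
    """Sum of log P(middle symbol | neighbours) over all windows."""
    return sum(math.log(predict(x, y)) for x, y in examples_from(sequence, size))


def classify(sequence, models):
    """Pick the family whose model gives the sequence the highest likelihood."""
    return max(models, key=lambda family: log_likelihood(sequence, models[family]))


print(interleaved_window("ABCDEFGHIJK"))  # ('EGDHCIBJAK', 'F')

# Toy stand-ins for two trained family models (hypothetical).
models = {
    "family_a": lambda inputs, output: 0.9 if output == "F" else 0.1,
    "family_b": lambda inputs, output: 0.5,
}
print(classify("ABCDEFGHIJK", models))
```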
Building the SMT Classifier Model • Estimation of the probability over protein families given a sequence of amino acids • Input sequence: an amino acid sequence from a protein family • Output symbol: the protein family name • A sliding window of 10 amino acids: a1,…,a10 • MAX_DEPTH=5 and MAX_PHI=1 • Classification of a Sequence using an SMT Classifier • Each position of the sequence gives us a probability over the 175 families measuring how likely the substring originated from each family.
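A sketch of how the per-position family probabilities might be combined into one decision. The slide does not state the combination rule, so summing log-probabilities across positions is an assumption here, as are the function names and the toy `family_distribution` callable that stands in for the trained classifier.

```python
import math


def windows(sequence: str, size: int = 10):
    """All length-`size` substrings a1..a10 of the sequence."""
    return [sequence[i:i + size] for i in range(len(sequence) - size + 1)]


def classify_sequence(sequence, family_distribution, families, size=10):
    """Accumulate log P(family | window) over positions and pick the best family."""
    scores = {family: 0.0 for family in families}
    for window in windows(sequence, size):
        distribution = family_distribution(window)  # dict: family -> probability
        for family in families:
            scores[family] += math.log(distribution[family])
    best = max(scores, key=scores.get)
    return best, scores


def toy_distribution(window):
    # Hypothetical stand-in for the SMT classifier's output at one position.
    if window[0] == "A":
        return {"family_a": 0.8, "family_b": 0.2}
    return {"family_a": 0.4, "family_b": 0.6}


best, scores = classify_sequence("AAACCCCCCCCC", toy_distribution,
                                 ["family_a", "family_b"])
print(best)
```

Unlike the per-family prediction models, this design needs only one trained model: each window yields a distribution over all 175 families at once.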
Results • Time-Space-Performance tradeoffs
Results of Protein Classification using SMTs • The SMT models outperform the PST models. • SMT Classifier > SMT Prediction > PST Prediction
Discussion • Sparse Markov Transducers (SMTs) • We have presented two methods for protein classification using sparse Markov transducers (SMTs). • Future Work • Incorporating biological information into the model such as Dirichlet mixture priors • Combining a generative and discriminative model • Using both positive and negative examples in training