Studying the Protein Folding Problem by Means of a New Data Mining Approach by Huy N.A. Pham and Evangelos Triantaphyllou Department of Computer Science, Louisiana State University 298 Coates Hall, Baton Rouge, LA 70803 Email: hpham15@lsu.edu and trianta@lsu.edu ICDM 2005 Workshop on Temporal Data Mining: Algorithms, Theory and Applications November 27-30, 2005, Houston, TX This research was done under the LBRN program (www.lbrn.lsu.edu)
Brief introduction • The structure prediction problem for proteins plays an important role in understanding the protein folding process. • This problem is NP-hard. • This research proposes a novel classification approach based on a new data mining technique. • The technique tries to balance the overfitting and overgeneralization properties of the derived models.
Outline • Introduction to: • Classification • The Protein Folding Problem • Classification methods • The overfitting and overgeneralization problem • The Binary Expansion Algorithm (BEA) • Experimental evaluation • Summary
Introduction to Classification • We are given a collection of records that constitute the training set: • Each record contains a set of attributes and the class that it belongs to. • We are asked to find a model that describes the records of each class as a function of the values of their attributes. • The goal is to use this model to classify new records whose class is unknown (see the sketch below). • Typical Applications: • Credit approval • Target marketing • Medical diagnosis • Treatment effectiveness analysis
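A minimal sketch of this train-then-classify workflow (the library choice, scikit-learn, and the toy credit data are our own illustrative assumptions, not part of the slides):

```python
# Hypothetical credit-approval example: each training record is a vector of
# attribute values (age, income) with a known class label.
from sklearn.tree import DecisionTreeClassifier

X_train = [[25, 20000], [40, 65000], [52, 90000], [23, 15000]]
y_train = ["deny", "approve", "approve", "deny"]

# Find a model that describes each class as a function of the attributes.
model = DecisionTreeClassifier().fit(X_train, y_train)

# Use the model to classify a new record whose class is unknown.
print(model.predict([[45, 70000]]))
```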
Introduction to the protein folding problem • At least two distinct, though related, tasks can be stated: • Structure Prediction Problem (Protein Folding Problem): given a protein amino acid sequence, determine its 3D folded shape. • Pathway Prediction Problem: given a protein amino acid sequence and its 3D structure, determine the time-ordered sequence of folding events.
Introduction to the protein folding problem - Cont'd • Protein folding is the problem of finding the 3D structure of a protein from its amino acid sequence. • There are 20 different types of amino acids (labelled with one-letter codes: A, C, G, ...) => A protein is a sequence of amino acids (e.g. AGGCT...). • The folding problem is to find how this amino acid chain (1D structure) folds into its 3D structure. => Classification problem.
Introduction to the protein folding problem - Cont'd • A protein is classified into one of four structural classes [Levitt and Chothia, 1976] according to its secondary structure components: • all-α (α-helix) • all-β (β-strand) • α/β • α+β
Outline • Introduction to Classification and Protein folding problem • Classification methods • The overfitting and overgeneralization problem • The Binary Expansion Algorithm - BEA • Experimental evaluation • Summary
Classification methods • Decision trees • A flow-chart-like tree structure. • An internal node denotes a test on an attribute. • A branch represents an outcome of the test. • Leaf nodes represent class labels or class distributions. • Use the decision tree to classify an unknown sample. • Bayesian classification • Calculates explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems. • Genetic algorithms • Based on an analogy to biological evolution.
Classification methods - Cont'd • Fuzzy set approaches • Use values between 0.0 and 1.0 to represent the degree of membership. • Attribute values are converted to fuzzy values. • Compute the truth values for each predicted category. • Rough set approaches • Approximately or "roughly" define equivalence classes. • K-Nearest Neighbor algorithms • Classify a sample from its K nearest neighbors, e.g. by majority vote (or by their mean value, for numeric prediction); see the sketch below.
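A from-scratch K-nearest-neighbor sketch (the function name and toy data are our own, illustrative only):

```python
from collections import Counter
import math

def knn_classify(train, query, k=3):
    # train: list of (attribute_vector, class_label) pairs; query: a vector.
    # Sort training records by Euclidean distance to the query point.
    nearest = sorted(train, key=lambda rec: math.dist(rec[0], query))[:k]
    # Majority vote among the k closest records decides the class.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [((1.0, 1.0), "A"), ((1.2, 0.9), "A"), ((5.0, 5.1), "B"), ((4.8, 5.3), "B")]
print(knn_classify(train, (1.1, 1.0)))  # -> 'A'
```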
Classification methods - Cont'd • Neural Networks (NNs) • A problem-solving paradigm modeled after the physiological functioning of the human brain. • The firing of a neuron is modeled by input, output, and threshold functions. • The network "learns" from problems to which the answers are known and produces answers to entirely new problems of the same type. • Support Vector Machines (SVMs) • Data that are non-separable in N dimensions have a higher chance of being separable if mapped into a space of higher dimension (see the sketch below). • Use a linear hyperplane to partition the high-dimensional feature space.
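A toy illustration of the higher-dimension idea (our own example; real SVMs apply such mappings implicitly through a kernel function):

```python
# One-dimensional data: the "inner" class sits between two "outer" groups,
# so no single threshold on x can separate the classes.
points = [(-3.0, "outer"), (-2.0, "outer"), (-0.5, "inner"),
          (0.4, "inner"), (2.0, "outer"), (3.0, "outer")]

# Map each point x to (x, x**2): in 2D the classes are separated by the
# horizontal line y = 2, a linear hyperplane in the mapped space.
for x, label in points:
    y = x ** 2
    side = "above" if y > 2 else "below"
    print(f"x={x:5.1f} -> (x, x^2)=({x:5.1f}, {y:5.2f}) lies {side} y=2: {label}")
```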
Outline • Introduction to Classification and Protein folding problem • Classification methods • The overfitting and overgeneralization problem • The Binary Expansion Algorithm - BEA • Experimental evaluation • Summary
Overfitting and overgeneralization in Classification • Existing algorithms have produced classification and prediction systems that are sometimes highly accurate and sometimes not so accurate, for no apparent reason. • A growing belief is that the root of the problem is the overfitting and overgeneralization behavior of such systems. • Overfitting means that the extracted model describes the behavior of known data very well but does poorly on new data points. • Overgeneralization occurs when the system uses the available data and then attempts to analyze vast amounts of data that it has not yet seen. For example: • A generated decision tree may overfit the training data. • The SVM method may overgeneralize the training data. => Develop an algorithm that balances overfitting and overgeneralization.
A multi-class prediction method • One-vs-Others method (Dubchak et al., 1999; Brown et al., 2000) • Partition the K classes into a two-class problem: one class contains the proteins of one "true" class, and the "others" class combines all the remaining classes. • A two-class classifier is trained for this two-class problem. • Then partition the K classes into another two-class problem: one class contains another original class, and the "others" class contains the rest. • Another two-class classifier is trained. • This procedure is repeated for each of the K classes, leading to K trained two-class classifiers (see the sketch below).
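A hedged sketch of this One-vs-Others scheme (scikit-learn and the SVC classifier are our own choices; the cited works used their own SVM and NN implementations):

```python
from sklearn.svm import SVC

def train_one_vs_others(X, y, classes):
    # One binary classifier per class: "this class" (1) vs. "all others" (0).
    classifiers = {}
    for cls in classes:
        binary_labels = [1 if label == cls else 0 for label in y]
        classifiers[cls] = SVC().fit(X, binary_labels)
    return classifiers

def predict_one_vs_others(classifiers, x):
    # Assign the class whose binary classifier is most confident
    # (largest signed distance from its separating hyperplane).
    return max(classifiers, key=lambda cls: classifiers[cls].decision_function([x])[0])
```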
Outline • Introduction to Classification and Protein folding problem • Classification methods • The overfitting and overgeneralization problem • The Binary Expansion Algorithm - BEA • Experimental evaluation • Summary
Some basic concepts • A clause: a description of a small area of the state space covering examples of a given class. • Homogenous Clause (HC): an area covering a set of examples of a given class and unclassified examples uniformly. • Any clause of a given class may be partitioned into a set of smaller homogenous clauses. Example: B, A1, and A2 are homogenous clauses, while A is a non-homogenous clause. A can be partitioned into two smaller homogenous clauses A1 and A2. The example is a 2D representation; higher-dimensional cases can be treated similarly. => Unclassified examples covered by clause B can more accurately be assumed to belong to the same class than those in the original clause A.
Some basic concepts - Cont'd • Whether a clause is a homogenous clause can be decided by using the standard deviation of its cell densities. • The clause is superimposed by a hyper-grid with sides of some length h, and the density of every cell is computed. If all cells have the same density (standard deviation = 0), then it is a (perfectly) homogenous clause; otherwise (standard deviation > 0) it is not. • The density of a cell [Richard, 2001]: f(x) = (1 / (n·h^D)) · Σᵢ K((x - xᵢ)/h), where n = #(examples in the cell), D = #(dimensions), and K = a kernel function (a simple counting version is sketched below). [Figure: clause A superimposed on a hyper-grid has equal densities in all cells (standard deviation = 0); clause B has unequal cell densities (standard deviation > 0).]
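A minimal counting sketch of the homogeneity test (our own simplification: cell densities are approximated by raw counts over occupied cells rather than by the kernel estimate above):

```python
from collections import Counter
import statistics

def is_homogenous(points, h, tol=0.0):
    # points: list of D-dimensional tuples; h: side length of the grid cells.
    # Assign each point to the hyper-grid cell containing it.
    cells = Counter(tuple(int(coord // h) for coord in p) for p in points)
    # Homogenous when the cell densities have (near-)zero standard deviation.
    # (A stricter test would also include the empty cells the clause spans.)
    return statistics.pstdev(cells.values()) <= tol

print(is_homogenous([(0.1, 0.1), (1.1, 0.2), (0.2, 1.1), (1.2, 1.2)], h=1.0))  # True
print(is_homogenous([(0.1, 0.1), (0.2, 0.2), (0.3, 1.1), (1.2, 1.2)], h=1.0))  # False
```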
Some basic concepts - Cont'd • The density expresses how many classified examples exist in a given clause of the state space. • The density of a homogenous clause is the number of examples of a given class per unit area. [Figure: two unit-radius homogenous clauses A and B; the density of A is greater than the density of B.]
BEA • Main idea of the algorithm: Input: positive and negative examples. Output: a suitable classification. • Find positive and negative homogenous clauses using any clustering algorithm. • Sort the homogenous clauses by density. • For each homogenous clause, one or more new areas are created: • If its density > a threshold, expand it by F's radius = C's radius + (G's radius - C's radius)/(2·D), where F = the expanded area, C = the original area, G = the enveloping area, and D = the clause's density, and accept some noisy examples. Stopping conditions for expansion: F's radius ≤ D · C's radius and #(noisy points) ≤ (D · n)/100. • Else, reduce it into smaller homogenous clauses. • Use the expanded homogenous clauses to classify the new testing data. [Figure: a positive clause C with density D = 6 inside its enveloping area G, surrounded by negative examples.]
BEA - Cont'd • Main Algorithm: Input: positive and negative examples. Output: a suitable classification. Step 1: Find positive and negative clauses using the k-means clustering-based approach with the Euclidean distance. Step 2: Find positive and negative homogenous clauses from the positive and negative clauses, respectively. Step 3: Sort the positive and negative homogenous clauses by density. Step 4: FOR each homogenous clause C DO: If its density > a threshold (= (max - min)/2 of the densities) then: - Expand C using its density D. - Accept (D·n)/100 noisy examples, where n = #(its examples). Else: Reduce C into smaller homogenous clauses by considering each cell of its hyper-grid as a new homogenous clause. (A Python sketch of this loop follows.)
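A high-level sketch of this main loop (the four helper functions are assumed placeholders standing in for the sub-steps the slides describe, not the authors' implementation):

```python
from sklearn.cluster import KMeans
import numpy as np

def bea(positives, negatives, find_hcs, density, expand, reduce_clause, k=5):
    """Skeleton of BEA's main loop. find_hcs, density, expand, and
    reduce_clause are caller-supplied stand-ins for the slides' sub-steps."""
    clauses = []
    for examples in (positives, negatives):
        X = np.asarray(examples)
        # Step 1: find clauses with k-means (Euclidean distance).
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(X)
        # Step 2: split each cluster into homogenous clauses.
        for c in range(k):
            clauses += find_hcs(X[labels == c])
    # Step 3: sort the homogenous clauses by density.
    clauses.sort(key=density, reverse=True)
    # Step 4: threshold = (max - min)/2 of the densities, as on the slide.
    ds = [density(cl) for cl in clauses]
    threshold = (max(ds) - min(ds)) / 2
    result = []
    for clause in clauses:
        if density(clause) > threshold:
            result.append(expand(clause))    # also accept (D*n)/100 noisy examples
        else:
            result += reduce_clause(clause)  # one new clause per hyper-grid cell
    return result
```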
BEA - Cont'd • Example: BEA in 2D. [Figure: the positive clauses are first refined into homogenous clauses, which are then expanded into extended homogenous clauses (extended HCs).]
Correctness of improvement • Definition: e is improved by e' (written e > e') if, for all contexts C such that C[e] and C[e'] are closed, whenever C[e] converges in n steps then C[e'] also converges in k steps, where k ≤ n [Sands, 1998]. • BEA: • Uses the k-means clustering-based approach to find the positive and negative sets. • Let e denote the results obtained from the k-means clustering-based approach and e' denote the results obtained from BEA. Certainly C[e] and C[e'] are closed. Moreover, C[e'] can accept more examples, since all homogenous clauses are expanded from e. • Accepts noisy examples. • => e is improved by e', i.e., e is refined to e'.
Outline • Introduction to Classification and Protein folding problem • Classification methods • The overfitting and overgeneralization problem • The Binary Expansion Algorithm - BEA • Experimental evaluation • Summary
Accuracy measures for multi-class classification • The accuracy of two-class problems involves calculating true positive rates and false positive rates. • The accuracy, Q, of multi-class problems can be determined from the true class rates [Rost & Sander, 1993; Baldi et al., 2000]: qi = ci/ni, where ni = #(examples in the i-th class) and ci = #(correctly classified examples in the i-th class); wi = ni/N, where N = the total number of examples, so that Q = Σi wi·qi (see the sketch below).
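A small sketch of these measures (our own code; the weighted sum for Q is our reading of the slide's definitions of qi and wi):

```python
from collections import Counter

def multiclass_accuracy(true_labels, predicted_labels):
    n = Counter(true_labels)                     # n_i: examples per true class
    c = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)  # c_i
    N = len(true_labels)                         # total number of examples
    q = {cls: c[cls] / n[cls] for cls in n}      # q_i = c_i / n_i
    Q = sum((n[cls] / N) * q[cls] for cls in n)  # Q = sum_i w_i * q_i
    return q, Q

q, Q = multiclass_accuracy(["a", "a", "b", "b"], ["a", "b", "b", "b"])
print(q, Q)  # {'a': 0.5, 'b': 1.0} 0.75
```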
Experiments • Assess the algorithm for two-class problems. • Source: http://www.csie.ntu.edu.tw/~cjlin/methods/guide/data/ [Table: two-class benchmark results for BEA.]
Experiments - Cont'd • The BEA provides a 15.5% improvement in classification accuracy vs. C. J. Lin's SVMs.
Experiments - Cont'd • Assess the algorithm for two-class problems. Source:http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary
Experiments - Cont'd • A test bed of the algorithm for the protein folding problem • Source of data sets: http://www.nersc.gov/~cding/protein by Ding and Dubchak, 2001. • Six parameter datasets extracted from protein sequences. • Use the One-vs-Others method for the four-class problem. • Use the Independent Test method in the experiments. • BEA represents a protein as an n-dimensional vector corresponding to the composition of the n amino acids in the protein (see the sketch below).
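A sketch of this amino-acid-composition representation (the function name and toy sequence are our own; here n = 20, one dimension per standard amino acid):

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes

def composition_vector(sequence):
    # Relative frequency of each amino acid in the protein sequence.
    return [sequence.count(aa) / len(sequence) for aa in AMINO_ACIDS]

vec = composition_vector("AGGCT" * 10)  # toy sequence of 50 residues
print(len(vec), round(sum(vec), 6))     # -> 20 1.0
```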
Experiments - Cont'd • The average results obtained from [Ding and Dubchak, 2001] and [Zerrin, 2004] for the 27-class dataset: Q1: the average accuracy of the SVMs with the independent test method in [Ding and Dubchak, 2001, Table 6, p. 11]. Q2: the average accuracy of the Neural Networks with the independent test method in [Ding and Dubchak, 2001, Table 6, p. 11]. Q3: the average accuracy of the SVMsAAC method in [Zerrin, 2004]. Q4: the average accuracy of the SVMstrioAAC method in [Zerrin, 2004].
Experiments - Cont'd • Results obtained from BEA for the 4-class problem:
Experiments - Cont'd • The BEA provides: • A 10% improvement in classification accuracy over the SVMsAAC method on the Amino Acid Composition data type. • Approximately a 36% improvement over Ding's SVMs.
Summary • This research was done to: • Enhance our understanding of the performance of a new data mining algorithm. • Propose a new approach, based on balancing the overfitting and overgeneralization properties, to enhance the performance of data mining algorithms. • Make a contribution to a hot area of Bioinformatics by achieving highly accurate results in predicting protein folding properties. • Future work will focus on: • Testing the BEA with other applications. • Improving the performance of the approach by: • Improving the accuracy of the algorithm by finding a suitable density for homogenous clauses. • Decreasing the execution time by using parallel computing techniques. • Studying a multi-class classification algorithm.
References • Zerrin Isik et al., "Protein Structural Class Determination Using Support Vector Machines", Lecture Notes in Computer Science - ISCIS 2004, vol. 3280, p. 82, Oct. 2004. http://people.sabanciuniv.edu/~berrin/methods/fold-classification-iscis04.pdf • A. C. Tan et al., "Multi-Class Protein Fold Classification Using a New Ensemble Machine Learning Approach", Genome Informatics, 14:206-217, 2003. http://www.brc.dcs.gla.ac.uk/~actan/methods/actanGIW03.pdf • Chris H. Q. Ding et al., "Multi-class protein fold recognition using Support Vector Machines and Neural Networks", Bioinformatics, 17:349-358, 2001. http://www.kernel-machines.org/methods/upload_4192_bioinfo.ps • D. Sands, "Improvement Theory and its Applications", in A. D. Gordon and A. M. Pitts, editors, Higher Order Operational Techniques in Semantics, Publications of the Newton Institute, pp. 275-306, Cambridge University Press, 1998.
Thank you! Any questions?