Protein Local 3D Structure Prediction by Super Granule Support Vector Machines (Super GSVM)

Protein Local 3D Structure Prediction by Super Granule Support Vector Machines (Super GSVM) Dr. Bernard Chen Assistant Professor Department of Computer Science University of Central Arkansas Fall 2009

Goal of the Dissertation • The main purpose is trying to obtain and extract protein sequence motifs information which are universally conserved and across protein family boundaries. • And then use these information to do Protein Local 3D Structure Prediction

ResearchFlow Part1 Bioinformatics Knowledge and Dataset Collection Part2 Discovering Protein Sequence Motifs Part3 Motif Information Extraction Part4 Protein Local Tertiary Structure Prediction

Data set

HSSP matrix: 1b25

Representation of Segment • Sliding window size: 9 • Each window corresponds to a sequence segment, which is represented by a 9 × 20 matrix plus additional nine corresponding secondary structure information obtained from DSSP. • More than 560,000 segments (413MB) are generated by this method. • DSSP: Obtain 2nd Structure information

Original dataset Fuzzy C-Means Clustering Information Granule 1 ... Information Granule M New Improved or Greedy K-means Clustering ... New Improved or Greedy K-means Clustering Join Information Final Sequence Motifs Information Granular Computing Model

Reduce Time-complexity Wei’s method: 1285968 sec (15 days) * 6 = 7715568 sec (90 days) Granular Model: 154899 sec + 231720 sec * 6 = 1545219 sec (18 days) (FCM exe time) (2.7 Days)

Comparison of Quality Measures

Super GSVM-FE Motivation • First, the information we try to generate is about sequence motifs, but the original input data are derived from whole protein sequences by a sliding window technique; • Second, during fuzzy c-means clustering, it has the ability to assign one segment to more than one information granule.

Original dataset Fuzzy C-Means Clustering Information Granule 1 Information Granule M ... Five iterations of traditional K-maens Five iterations of traditional K-maens ... Greedy K-means Clustering Greedy K-means Clustering … For Each Cluster … For Each Cluster ... Ranking SVM Feature Elimination Ranking SVM Feature Elimination For Each Cluster For Each Cluster … … ... Collect Survived Segments Collect Survived Segments ... Greedy K-means Clustering Greedy K-means Clustering Join Information Final Sequence Motifs Information Super GSVM-FE Additional Portion

Extracted Motif Information

3D information • 3D information is generated from PDB (Protein Data Bank), • an example of 1a3c PDB file

Testing Data • The latest release of PISCES includes 4345 PDB files. • Compare with the dataset in our experiment, 2419 PDB files are excluded. • Therefore, we regard our 2710 protein files as the training dataset and 2419 protein files as the independent testing dataset.

Testing Data • We convert the testing dataset by the approach we introduced • more than 490,000 segments are generated as testing dataset.

Training dataset Fuzzy C-Means Clustering Information Granule 1 Information Granule M ... Five iterations of traditional K-means Five iterations of traditional K-means ... Greedy K-means Clustering Greedy K-means Clustering … For Each Cluster … For Each Cluster ... Train Ranking SVM and then Eliminate 20% lower rank members Train Ranking SVM and then Eliminate 20% lower rank members Collect all extracted clusters and Ranking-SVMs Find the closest cluster within a given distance threshold Feed to the belonging SVM If the rank belongs to cluster Independent testing Dataset All Sequence clusters All Ranking SVMs Predict the local 3D structure If not, find the next closest cluster Super GSVM

Prediction Accuracy

Prediction Coverage

Future Works • Incorporate Chou-Fasman parameter for SVM training

Training dataset Fuzzy C-Means Clustering Information Granule 1 Information Granule M ... Five iterations of traditional K-means Five iterations of traditional K-means ... Greedy K-means Clustering Greedy K-means Clustering … For Each Cluster … For Each Cluster ... Build Decision Tree Build Decision Tree Collect all extracted clusters and Ranking-SVMs Find the closest cluster within a given distance threshold Feed to the belonging DT If the rank belongs to cluster Independent testing Dataset All Sequence clusters Test by DT Predict the local 3D structure If not, find the next closest cluster Future Works • For each cluster, instead of building SVM model, we build Decision Tree instead

Protein Local 3D Structure Prediction by Super Granule Support Vector Machines (Super GSVM)

Protein Local 3D Structure Prediction by Super Granule Support Vector Machines (Super GSVM)

Presentation Transcript

Nonlinear Data Discrimination via Generalized Support Vector Machines

Support Vector Machines

Support Vector Machines

Support Vector Machines

Transmembrane Protein Topology Prediction Using Support Vector Machines

Support Vector Machines

Support Vector Machines

Support Vector Machines

Support Vector Machines

Support Vector Machines

Using Support Vector Machines for transmembrane protein topology prediction Tim Nugent

Support Vector Machines

Protein Local 3D Structure Prediction by Super Granule Support Vector Machines (Super GSVM)

Support Vector Machines

Support Vector Machines

Support Vector Machines

Support Vector Machines

Protein structure prediction

Support Vector Machines

Support Vector Machines