260 likes | 461 Views
Protein Local 3D Structure Prediction by Super Granule Support Vector Machines (Super GSVM). Dr. Bernard Chen Assistant Professor Department of Computer Science University of Central Arkansas Fall 2009. Goal of the Dissertation.
E N D
Protein Local 3D Structure Prediction by Super Granule Support Vector Machines (Super GSVM) Dr. Bernard Chen Assistant Professor Department of Computer Science University of Central Arkansas Fall 2009
Goal of the Dissertation • The main purpose is trying to obtain and extract protein sequence motifs information which are universally conserved and across protein family boundaries. • And then use these information to do Protein Local 3D Structure Prediction
ResearchFlow Part1 Bioinformatics Knowledge and Dataset Collection Part2 Discovering Protein Sequence Motifs Part3 Motif Information Extraction Part4 Protein Local Tertiary Structure Prediction
Representation of Segment • Sliding window size: 9 • Each window corresponds to a sequence segment, which is represented by a 9 × 20 matrix plus additional nine corresponding secondary structure information obtained from DSSP. • More than 560,000 segments (413MB) are generated by this method. • DSSP: Obtain 2nd Structure information
ResearchFlow Part1 Bioinformatics Knowledge and Dataset Collection Part2 Discovering Protein Sequence Motifs Part3 Motif Information Extraction Part4 Protein Local Tertiary Structure Prediction
Original dataset Fuzzy C-Means Clustering Information Granule 1 ... Information Granule M New Improved or Greedy K-means Clustering ... New Improved or Greedy K-means Clustering Join Information Final Sequence Motifs Information Granular Computing Model
Reduce Time-complexity Wei’s method: 1285968 sec (15 days) * 6 = 7715568 sec (90 days) Granular Model: 154899 sec + 231720 sec * 6 = 1545219 sec (18 days) (FCM exe time) (2.7 Days)
ResearchFlow Part1 Bioinformatics Knowledge and Dataset Collection Part2 Discovering Protein Sequence Motifs Part3 Motif Information Extraction Part4 Protein Local Tertiary Structure Prediction
Super GSVM-FE Motivation • First, the information we try to generate is about sequence motifs, but the original input data are derived from whole protein sequences by a sliding window technique; • Second, during fuzzy c-means clustering, it has the ability to assign one segment to more than one information granule.
Original dataset Fuzzy C-Means Clustering Information Granule 1 Information Granule M ... Five iterations of traditional K-maens Five iterations of traditional K-maens ... Greedy K-means Clustering Greedy K-means Clustering … For Each Cluster … For Each Cluster ... Ranking SVM Feature Elimination Ranking SVM Feature Elimination For Each Cluster For Each Cluster … … ... Collect Survived Segments Collect Survived Segments ... Greedy K-means Clustering Greedy K-means Clustering Join Information Final Sequence Motifs Information Super GSVM-FE Additional Portion
ResearchFlow Part1 Bioinformatics Knowledge and Dataset Collection Part2 Discovering Protein Sequence Motifs Part3 Motif Information Extraction Part4 Protein Local Tertiary Structure Prediction
3D information • 3D information is generated from PDB (Protein Data Bank), • an example of 1a3c PDB file
3D information • 3D information is generated from PDB (Protein Data Bank), • an example of 1a3c PDB file
Testing Data • The latest release of PISCES includes 4345 PDB files. • Compare with the dataset in our experiment, 2419 PDB files are excluded. • Therefore, we regard our 2710 protein files as the training dataset and 2419 protein files as the independent testing dataset.
Testing Data • We convert the testing dataset by the approach we introduced • more than 490,000 segments are generated as testing dataset.
Training dataset Fuzzy C-Means Clustering Information Granule 1 Information Granule M ... Five iterations of traditional K-means Five iterations of traditional K-means ... Greedy K-means Clustering Greedy K-means Clustering … For Each Cluster … For Each Cluster ... Train Ranking SVM and then Eliminate 20% lower rank members Train Ranking SVM and then Eliminate 20% lower rank members Collect all extracted clusters and Ranking-SVMs Find the closest cluster within a given distance threshold Feed to the belonging SVM If the rank belongs to cluster Independent testing Dataset All Sequence clusters All Ranking SVMs Predict the local 3D structure If not, find the next closest cluster Super GSVM
Future Works • Incorporate Chou-Fasman parameter for SVM training
Training dataset Fuzzy C-Means Clustering Information Granule 1 Information Granule M ... Five iterations of traditional K-means Five iterations of traditional K-means ... Greedy K-means Clustering Greedy K-means Clustering … For Each Cluster … For Each Cluster ... Build Decision Tree Build Decision Tree Collect all extracted clusters and Ranking-SVMs Find the closest cluster within a given distance threshold Feed to the belonging DT If the rank belongs to cluster Independent testing Dataset All Sequence clusters Test by DT Predict the local 3D structure If not, find the next closest cluster Future Works • For each cluster, instead of building SVM model, we build Decision Tree instead