Protein Classification Using Averaged Perceptron SVM
Eugene Ie
CS6772 Project Presentation, 12/03/2003
Protein Sequence Classification
• Protein = (Σ)*, where |Σ| = 20 amino acids
• Easy to sequence proteins, difficult to obtain structure
[Figure: a known sequence, with its 3D structure, class, and function to be predicted ("?")]
Sequence: VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
Class: Globin family, Globin-like superfamily
Function: Oxygen transport
Sequence Alignment vs. Classification
• Sequence similarity through alignment
  SGFIEEDELKLFL
  SGFIEEEELKFVL
  [Figure labels: close homology, distant homology]
• Sequence classification for remote homology
  [Figure label: Classifier]
Structural Hierarchy of Proteins (SCOP)
• Remote homologs:
  • Structure and function conserved
  • Sequence similarity: low
[Figure: SCOP hierarchy (Fold, Superfamily, Family) marking the positive and negative training and test sets]
Remote Homology Detection
• Discriminative supervised learning approach to protein classification
• Approach: Support Vector Machines with String Kernels
  • C. Leslie, E. Eskin, J. Weston, and W. Noble, Mismatch String Kernels for SVM Protein Classification.
  • C. Leslie and R. Kuang, Fast Kernels for Inexact String Matching.
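To make the idea of a string kernel concrete, here is a minimal Python sketch of the exact-match k-spectrum kernel, which is the zero-mismatch special case of the mismatch kernel cited above. The function names and the default k = 3 are illustrative assumptions, not taken from the cited papers or from the presentation.

```python
from collections import Counter


def spectrum_features(seq, k=3):
    """Count every contiguous k-mer in seq (the k-spectrum feature map)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(a, b, k=3):
    """Inner product of the k-mer count vectors of sequences a and b."""
    fa, fb = spectrum_features(a, k), spectrum_features(b, k)
    if len(fa) > len(fb):
        fa, fb = fb, fa  # iterate over the smaller table
    # Counter returns 0 for k-mers that are absent from fb.
    return sum(count * fb[kmer] for kmer, count in fa.items())
```

Two sequences that share many short substrings get a large kernel value even when no global alignment is computed, which is what lets an SVM work directly on raw sequences.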
QP SVM Training
• Sequence Training Data
  >VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
  …
  >TYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
  Total: n sequences + n labels
• QP Solver (slow)
• Learned Weights and Bias (from KKT)
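For reference, the textbook soft-margin SVM dual that a QP solver optimizes, together with the usual KKT-based recovery of the bias, can be written as below. This is the standard formulation and is assumed here; the slide's own equations were not preserved, and the symbols C, \alpha_i, and K are the conventional ones rather than the presenter's.

```latex
\max_{\alpha}\; \sum_{i=1}^{n} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j \, y_i y_j \, K(x_i, x_j)
\quad \text{s.t.} \quad 0 \le \alpha_i \le C, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0

% Bias from the KKT conditions, using any i with 0 < \alpha_i < C:
b = y_i - \sum_{j=1}^{n} \alpha_j y_j \, K(x_j, x_i)

% Resulting classifier:
f(x) = \operatorname{sign}\!\Big( \sum_{i=1}^{n} \alpha_i y_i \, K(x_i, x) + b \Big)
```

The objective involves all n² kernel values, which is one reason the QP route is slow compared with an online algorithm.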
Averaged Perceptron SVM Training
• Training Algorithm: Y. Freund and R. Schapire, Large Margin Classification Using the Perceptron Algorithm.
Averaged Perceptron SVM Training
• Sequence Training Data
  >VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
  …
  >TYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
  Total: n sequences + n labels
• Run Perceptron Algorithm, iterating for t epochs
• Output: Final Weight Vector, Voting Weights
• Generalized Bound for k
  • s = number of dimensions in feature space
  • k = number of mistakes made during perceptron run
• SCOP experiments show: for average n ~ 1000, average k ~ 50-60
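The training loop can be sketched in Python as follows, loosely following the voted perceptron of Freund and Schapire in its kernelized form. This is a minimal sketch under assumptions, not the presenter's Java code: the function names, the tuple-based toy inputs, and the `dot` kernel are all hypothetical, and a real run would plug in a string kernel instead.

```python
def dot(a, b):
    """Toy linear kernel on equal-length tuples (stand-in for a string kernel)."""
    return sum(p * q for p, q in zip(a, b))


def train_voted_perceptron(xs, ys, kernel, epochs=1):
    """Kernel perceptron that records mistakes and voting weights.

    Returns (mistakes, counts):
      mistakes[j-1] is the training index of the j-th mistake;
      counts[j] is the number of examples classified correctly by the
      prediction vector built from the first j mistakes (its voting weight),
      counting the example that created it.
    """
    mistakes = []  # the "mistake indices"
    counts = [0]   # counts[0] belongs to the initial zero vector
    for _ in range(epochs):
        for i, (x, y) in enumerate(zip(xs, ys)):
            # Current prediction vector is the sum of y_m * phi(x_m) over mistakes.
            score = sum(ys[m] * kernel(xs[m], x) for m in mistakes)
            if y * score > 0:
                counts[-1] += 1     # current vector survives one more example
            else:
                mistakes.append(i)  # update: add y_i * phi(x_i)
                counts.append(1)
    return mistakes, counts
```

Only the examples on which a mistake occurs are stored, so the model size is k rather than n, which is consistent with the k ~ 50-60 versus n ~ 1000 figures reported on the slide.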
Averaged Perceptron SVM Classification
• Testing algorithm: evaluated through a recurrence relation over M, the set of "mistake indices"
• Note: only k kernel products with the unknown sequence x need to be computed
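One way to realize the "only k kernel products" observation is sketched below in Python. It assumes the averaged prediction rule of Freund and Schapire, where each intermediate prediction vector's score is weighted by its survival count; the exact formulas from the slide were not preserved, so the function name, argument layout, and the `dot` toy kernel are hypothetical.

```python
def dot(a, b):
    """Toy linear kernel on equal-length tuples (stand-in for a string kernel)."""
    return sum(p * q for p, q in zip(a, b))


def predict_averaged(x, xs, ys, mistakes, counts, kernel):
    """Classify x with the averaged perceptron.

    mistakes: training indices where the perceptron erred, in order.
    counts:   voting weights, with counts[j] belonging to the vector built
              from the first j mistakes (len(counts) == len(mistakes) + 1).
    """
    partial = 0.0  # score of the initial zero vector on x
    total = counts[0] * partial
    for j, m in enumerate(mistakes, start=1):
        # Recurrence: score_j = score_{j-1} + y_m * K(x_m, x),
        # so each mistake costs exactly one kernel evaluation.
        partial += ys[m] * kernel(xs[m], x)
        total += counts[j] * partial
    return 1 if total > 0 else -1
```

Replacing `counts[j] * partial` with `counts[j] * sign(partial)` would give the voted rather than the averaged rule; the recurrence and the kernel cost are the same in both cases.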
Implementation Details
• Built on top of protclass (Protein Classification) platform
  • Java Platform
• Classification Task
  • Hash table scan instead of Mismatch Trie
  • Generate mismatch mappings once using shifts
  • Dynamic kernel matrix storage
  • Still needs debugging
• Speed/Space Performance
  • ~80% reduction in space requirement
  • ~50% reduction in training time
  • ~50% reduction in testing time
  • Mainly from simple online algorithm
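The hash-table idea can be illustrated with a small Python sketch of a (k, m)-mismatch kernel that computes each k-mer's mismatch neighborhood once and caches it in a dictionary. This is an assumption-laden illustration, not the protclass implementation: it does not reproduce the shift-based mapping mentioned above, and all names, including the `cache` parameter, are hypothetical.

```python
from collections import Counter
from itertools import combinations, product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20-letter alphabet


def mismatch_neighbors(kmer, m=1, alphabet=AMINO_ACIDS):
    """All strings within Hamming distance m of kmer, including kmer itself."""
    neighbors = {kmer}
    for r in range(1, m + 1):
        for positions in combinations(range(len(kmer)), r):
            for letters in product(alphabet, repeat=r):
                chars = list(kmer)
                for pos, letter in zip(positions, letters):
                    chars[pos] = letter
                neighbors.add("".join(chars))
    return neighbors


def mismatch_features(seq, k=3, m=1, cache=None):
    """Mismatch feature counts for seq, reusing cached neighborhoods."""
    cache = {} if cache is None else cache
    feats = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer not in cache:  # generate each mapping only once
            cache[kmer] = mismatch_neighbors(kmer, m)
        feats.update(cache[kmer])  # +1 for every neighbor of this k-mer
    return feats


def mismatch_kernel(a, b, k=3, m=1, cache=None):
    """Inner product of the mismatch feature vectors of a and b."""
    fa = mismatch_features(a, k, m, cache)
    fb = mismatch_features(b, k, m, cache)
    return sum(count * fb[s] for s, count in fa.items())
```

Passing the same `cache` dictionary across calls lets the neighborhoods be shared between all sequences in a dataset, trading memory for repeated enumeration work.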