A Novel Method of Protein Secondary Structure Prediction with High Segment Overlap Measure: SVM Approach, by S. Hua & Z. Sun. Presenter: Sancar Adali
Outline • Problem Statement • SVM primer • Methods and Parameters • Results • Conclusion • Discussion
Problem Statement • Predicting protein secondary structure from amino acid sequences • Secondary structure: the local shape of the polypeptide chain, determined by hydrogen bonds between amino acid residues • In the big picture, the goal is to solve the tertiary/quaternary structure in 3D; secondary structure prediction is the more easily tackled intermediate problem
Some important terminology • residue: when two or more amino acids combine to form a peptide, the elements of water are removed, and what remains of each amino acid is called an amino acid residue • segment: a group of consecutive residues that form one secondary structure element, such as a helix, coil, or sheet • C-terminus: the residue that has a free carboxyl group… • N-terminus: the residue in a peptide that has a free amino group… [International Union of Pure and Applied Chemistry definitions]
Problem Statement • Our main goal is to find the sequence-to-structure relationship; the amino acid sequence of a protein is usually known • The full problem is complicated by the many interactions/bonds between amino acid residues, so we start at a lower level of complexity and try to predict secondary structure
Amino acids to Proteins • [Diagram: two amino acids (side chains R1 and R2) join through a peptide bond formed between the carboxyl group of one and the amino group of the other; the resulting chain runs from the N-terminus (free amino group) to the C-terminus (free carboxyl group)]
Linear Decision Boundary • Number of classes: 2 • Output y ∈ {-1, 1}; input x ∈ R^n; training set L = ((x1, y1), (x2, y2), …, (xk, yk)) • Weight vector w ∈ R^n • Hyperplane: H = {x : w·x + b = 0}
SVM(Support Vector Machine) • Linearly Separable Training Samples
SVM (Support Vector Machine) • Original idea: choose H by maximizing, with respect to w and b, the quantity min{ ||x - xi|| : w·x + b = 0, i = 1, …, k } • Equivalently, maximize the margin (the distance from the hyperplane to the closest positive and negative training points)
Solving Linearly Separable Case • Hyperplane: H = {x : w·x + b = 0} • Margin M = d+ + d-, where d+ (d-) is the distance from H to the closest positive (negative) training point • Choose w, b to maximize the margin; then fSVM(x) = sign(w·x + b) • Since the data points are linearly separable, there exist w and b with yi(w·xi + b) > 0 for all i = 1, …, k
Solving Linearly Separable Case • Rescale w, b so that the points closest to the hyperplane satisfy |w·xi + b| = 1 (standardizing the distance); then yi(w·xi + b) ≥ 1 for all i = 1, 2, …, k • This gives two planes parallel to H: H1 = {x : w·x + b = 1} and H2 = {x : w·x + b = -1}
Linearly Separable Case Solution • The distance between H1 and H2 is 2/||w||
Linearly Separable Case Solution • We want to maximize the distance between the hyperplanes H1 and H2, which means minimizing ||w|| • The problem becomes a constrained optimization problem: minimize (1/2)||w||² subject to yi(w·xi + b) ≥ 1 for all i
The decision boundary depends only on a subset of the training samples; these are called "support vectors" • Introduce nonnegative Lagrange multipliers αi ≥ 0 to solve the constrained problem • Solution: w = Σi αi yi xi, and the decision function is fSVM(x) = sign(Σi αi yi (xi·x) + b); only the support vectors have αi > 0 (see the dual formulation below)
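For reference (standard SVM material, not reproduced from the slide), the Lagrangian dual of the problem above, which is what is actually solved, is

$$\max_{\alpha \ge 0} \; \sum_{i=1}^{k} \alpha_i \;-\; \frac{1}{2}\sum_{i=1}^{k}\sum_{j=1}^{k} \alpha_i \alpha_j\, y_i y_j\, (x_i \cdot x_j) \qquad \text{subject to } \sum_{i=1}^{k} \alpha_i y_i = 0,$$

with $w = \sum_i \alpha_i y_i x_i$ and $f_{\mathrm{SVM}}(x) = \operatorname{sign}\!\big(\sum_i \alpha_i y_i (x_i \cdot x) + b\big)$. Only the support vectors contribute with $\alpha_i > 0$.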
Not Linearly Separable Case • Introduce slack variables ξi ≥ 0 to take classification errors into account: yi(w·xi + b) ≥ 1 - ξi • If ξi > 1, we make an error on sample i • Add an error term to the objective function: minimize (1/2)||w||² + C Σi ξi
Not Linearly Separable Case • The solution is again w = Σi αi yi xi, now with the box constraint 0 ≤ αi ≤ C on the multipliers
Nonlinear SVM • Map the training samples, via some map Φ, to a higher-dimensional space where they become linearly separable, and solve the decision problem there • Important observation: training samples appear only through dot products, xi·xj or, after the mapping, Φ(xi)·Φ(xj)
Nonlinear SVM • We don't need to know Φ explicitly; we only need the dot products Φ(xi)·Φ(xj), which can be represented by a kernel function K(xi, xj) = Φ(xi)·Φ(xj)
Nonlinear SVM • Φ never has to be computed explicitly: it appears neither in the decision function nor in the training step, where only the kernel K is used (see the check below)
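A quick numeric check (not from the article) that a kernel really is a dot product in some feature space, using the degree-2 polynomial kernel, whose feature map can be written down explicitly:

```python
# For x, y in R^2, the kernel K(x, y) = (x.y)^2 equals Phi(x).Phi(y)
# with the explicit map Phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2).
import numpy as np

def phi(v):                          # explicit feature map into R^3
    return np.array([v[0]**2, np.sqrt(2) * v[0] * v[1], v[1]**2])

def k(x, y):                         # kernel evaluated directly in input space
    return np.dot(x, y) ** 2

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(k(x, y), np.dot(phi(x), phi(y)))   # both print 1.0 (up to rounding)
```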
Typical Kernels • Polynomial: K(x, y) = (x·y + 1)^d • Gaussian RBF: K(x, y) = exp(-r ||x - y||²) ← THE ONE USED IN ARTICLE • Sigmoid: K(x, y) = tanh(κ x·y + θ)
Methods • Use of SVMs with the radial basis function kernel • Three classes: H (α-helix), E (β-sheet), C (coil, and the remaining states) • Construction of tertiary (three-class) classifiers from multiple binary classifiers (SVMs are binary classifiers)
SVM Implementation • Gaussian RBF kernel with width parameter r = 0.10; the authors claim this parameter is not critical and report trying other values • C: the penalty for classification errors. It must be chosen carefully to balance overfitting (penalty too high) against underfitting (penalty too low); the optimal value of C was selected by 7-fold cross-validation in the paper (a setup sketch follows below)
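A minimal sketch (not the authors' code) of the kind of binary RBF-kernel SVM described here, using scikit-learn on toy data; assuming the slide's r plays the role of scikit-learn's gamma, and with C = 1.5 as a placeholder rather than the cross-validated value:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.integers(0, 2, size=(100, 105)).astype(float)  # toy stand-in for 21*5-dim windows
y_train = rng.choice([-1, 1], size=100)                       # toy "H" vs "not H" labels

clf = SVC(kernel="rbf", gamma=0.10, C=1.5)   # gamma and C are illustrative placeholders
clf.fit(X_train, y_train)
print(len(clf.support_))                     # number of support vectors (discussed in Results)
```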
Encoding Amino Acid Sequence Information • 20 amino acids + 1 symbol for positions beyond the N- or C-terminus: 21 binary unit vectors (1,0,…,0), (0,1,…,0), … • Concatenate these vectors over each sliding window • For window length ℓ, the dimension of the feature vector is 21·ℓ
Encoding Example • Assume ℓ = 5 and (for our purposes) only 3 amino acids, encoded Arg = (1,0,0,0), Lys = (0,1,0,0), Met = (0,0,1,0); positions beyond the two ends of the chain are encoded as (0,0,0,1) • The feature vector for the Arg* residue in the sequence Arg, Lys, Arg*, Met, Arg, Met, Met is the window centered on Arg*, i.e. xi = (1,0,0,0; 0,1,0,0; 1,0,0,0; 0,0,1,0; 1,0,0,0) • This is the input to the SVM (see the sketch below)
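A minimal sketch of this window encoding (illustrative only; the toy 4-symbol alphabet stands in for the 21 units per position used in the paper):

```python
import numpy as np

ALPHABET = ["Arg", "Lys", "Met", "TERM"]  # TERM marks positions beyond the chain ends

def one_hot(symbol):
    v = np.zeros(len(ALPHABET))
    v[ALPHABET.index(symbol)] = 1.0
    return v

def encode_window(sequence, center, window=5):
    """Concatenate one-hot vectors over a window centered on `center`."""
    half = window // 2
    parts = []
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < len(sequence):
            parts.append(one_hot(sequence[pos]))
        else:
            parts.append(one_hot("TERM"))    # off-chain positions get the terminus code
    return np.concatenate(parts)             # dimension = window * len(ALPHABET)

seq = ["Arg", "Lys", "Arg", "Met", "Arg", "Met", "Met"]
x = encode_window(seq, center=2)              # feature vector for the third residue (Arg*)
print(x.reshape(5, 4))                        # one row per window position
```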
Determination of Optimal Window Length • Determines the dimension of the feature vector • Determined using 7-fold cross-validation (sketch below)
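A sketch of how the window length could be selected by 7-fold cross-validation; the data here are random placeholders, and in practice X and y would come from the window encoder above and the reduced DSSP labels:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_residues = 300
results = {}
for window in (5, 9, 13):                                 # candidate window lengths (illustrative)
    X = rng.integers(0, 2, size=(n_residues, 21 * window)).astype(float)  # toy encoded windows
    y = rng.choice([-1, 1], size=n_residues)              # toy binary labels
    results[window] = cross_val_score(SVC(kernel="rbf", gamma=0.10), X, y, cv=7).mean()
print(results)   # pick the window length with the best mean CV accuracy
```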
Data Sets • RS126 set: 126 non-homologous proteins. Non-homologous here means that no two proteins in the set share more than 25% sequence identity over a stretch of more than 80 residues. • CB513 set: 513 non-homologous proteins; almost all sequences of the RS126 set are included in CB513. Here non-homology is judged by the SD score (the distance of the alignment score from the mean score of randomized sequences, in standard deviations): an SD score of ≥ 5 is regarded as homology. This is a more stringent measure than percentage identity; in fact, 11 pairs of proteins in the RS126 set are sequence-similar under the SD score even though they pass the percentage-identity criterion. • The CB513 set contains 16 chains shorter than 30 residues, and short chains usually decrease accuracy.
Protein Secondary Structure Definition • DSSP assigns eight classes: H (α-helix), G (3₁₀-helix), I (π-helix), E (β-strand), B (isolated β-bridge), T (turn), S (bend), and - (rest) • For prediction these are reduced to the three states H, E, and C • Two reduction schemes are compared (Reduction 1 and Reduction 2), differing in how the minor DSSP classes are grouped into H, E, and C
Reliability Index (of single classifications) • Defined as a discretized function of the sample's distance to the separating hyperplane • Comparing RI against the accuracy measure shows that prediction accuracy increases monotonically with increasing RI (a toy discretization is sketched below)
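A rough sketch of how such a reliability index could be computed by binning SVM decision values; the binning constants here are placeholders, not the paper's definition:

```python
import numpy as np

def reliability_index(decision_values, scale=1.0, max_ri=9):
    """Discretize |distance to hyperplane| into integer bins 0..max_ri."""
    ri = np.floor(np.abs(decision_values) / scale).astype(int)  # larger distance -> larger RI
    return np.clip(ri, 0, max_ri)

print(reliability_index(np.array([-0.2, 0.7, 2.4, 5.3])))  # -> [0 0 2 5]
```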
Accuracy Measures • Q3: estimate of P(Ŷ = Y), the fraction of all residues predicted correctly • Qα (α = H, E, C): estimate of P(Ŷ = α | Y = α), the fraction of residues observed in state α that are predicted correctly • Y is the true class/state variable and Ŷ is the output of the classifier (a small computation sketch follows)
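A small sketch of how these per-residue estimates could be computed from observed and predicted state strings (illustrative, not the authors' code):

```python
import numpy as np

def q_scores(y_true, y_pred):
    """Return overall Q3 and per-state accuracies Q_H, Q_E, Q_C."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    q3 = np.mean(y_true == y_pred)
    per_state = {s: np.mean(y_pred[y_true == s] == s) for s in ("H", "E", "C")}
    return q3, per_state

print(q_scores(list("HHHEECCC"), list("HHEEECCC")))  # (0.875, {'H': 0.667, 'E': 1.0, 'C': 1.0})
```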
Segment Overlap Measure • A more realistic way of measuring accuracy than per-residue accuracy, in terms of relevance to the 3D structure • Problems with per-residue accuracy: proteins with the same 3D fold differ by about 12% in secondary structure assignment, so the maximum meaningful Q3 is about 88%; and the ends of segments vary among proteins with the same 3D structure, so their per-residue classification is less relevant to determining the protein structure • SOV accounts for the type and position of secondary structure segments rather than individual residues, and for variation of the residues at the ends of segments (see the formula below)
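For reference, the segment overlap score for a state α is usually defined as follows (this is the commonly cited 1999 formulation of SOV; the exact variant adopted in the paper may differ in detail):

$$\mathrm{SOV}_\alpha = 100 \times \frac{1}{N_\alpha} \sum_{(s_1,s_2)} \frac{\operatorname{minov}(s_1,s_2) + \delta(s_1,s_2)}{\operatorname{maxov}(s_1,s_2)}\, \operatorname{len}(s_1),$$

where the sum runs over pairs of overlapping observed segments $s_1$ and predicted segments $s_2$ of state α, minov is the length of their actual overlap, maxov the length of their total extent, $N_\alpha$ is a normalization over the observed segments, and

$$\delta(s_1,s_2) = \min\bigl\{\operatorname{maxov}-\operatorname{minov},\ \operatorname{minov},\ \lfloor \operatorname{len}(s_1)/2 \rfloor,\ \lfloor \operatorname{len}(s_2)/2 \rfloor \bigr\}.$$

The δ term is what tolerates small shifts at segment ends, which is exactly the point made in the last bullet above.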
Accuracy Measures • Correlation measure (Matthews correlation coefficient) for each state α ∈ {H, E, C}: Cα = (pα·nα - oα·uα) / sqrt((pα + oα)(pα + uα)(nα + oα)(nα + uα)) • where pα: number of true positives, nα: number of true negatives, oα: number of false positives, uα: number of false negatives (a small helper is sketched below)
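A tiny helper (illustrative only) computing the per-state correlation from the four counts:

```python
from math import sqrt

def correlation(p, n, o, u):
    """p: true positives, n: true negatives, o: false positives, u: false negatives."""
    denom = sqrt((p + o) * (p + u) * (n + o) * (n + u))
    return (p * n - o * u) / denom if denom else 0.0

print(correlation(p=40, n=45, o=5, u=10))  # example counts
```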
Tertiary Classifier Design • Build three one-versus-rest binary classifiers: H/~H, E/~E, C/~C • Assign the class whose classifier places the sample at the maximum distance from its hyperplane: SVM_MAX_D
Tertiary Classifier Design • The one-versus-rest classifiers (H/~H, E/~E, C/~C) and the one-versus-one classifiers (E/H, E/C, H/C) can also be combined by a vote (SVM_VOTE) or by feeding their outputs to a neural network (SVM_NN) • Combine all of the tertiary classifiers designed so far to form a jury: SVM_JURY (a combination sketch follows)
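A minimal sketch (not the authors' implementation) of two combination rules in the spirit of SVM_MAX_D and a simple vote, assuming a dict of trained scikit-learn binary one-versus-rest classifiers keyed by state and using +1/-1 labels; the paper's SVM_VOTE additionally uses the one-versus-one classifiers, omitted here for brevity:

```python
import numpy as np

def predict_max_d(classifiers, X):
    """SVM_MAX_D-style rule: pick the class whose classifier is most confident."""
    states = list(classifiers)                                    # e.g. ["H", "E", "C"]
    d = np.column_stack([classifiers[s].decision_function(X) for s in states])
    return [states[i] for i in np.argmax(d, axis=1)]

def predict_vote(classifiers, X):
    """Majority vote over the one-vs-rest classifiers (ties broken arbitrarily)."""
    states = list(classifiers)
    votes = np.column_stack([(classifiers[s].predict(X) == 1).astype(int) for s in states])
    return [states[i] for i in np.argmax(votes, axis=1)]
```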
Results, Details – Support Vectors • In typical pattern recognition problems the ratio of support vectors to training samples is much lower • Interpretation: recall the leave-one-out bound on the expected generalization error, E[error] ≤ E[#SVs] / (number of training samples); with such a high fraction of support vectors, this bound is much higher than in typical pattern recognition problems
Conclusions • Application of the SVM method to secondary structure prediction, with good results (an improvement over PHD) • Importance of the accuracy measure used to evaluate algorithms, and of the class-reduction method • Use of a machine learning method to attack the protein structure problem at a lower level of complexity: the general protein folding problem in three dimensions requires too much computational power, so this is a cheap alternative
Critique of Article • The number of support vectors is very large, which calls the performance of the SVM into question • The article does not present the machine learning aspects clearly; it is too focused on results