Protein Secondary Structure Prediction ? ? TDVEAAVNSLVNLYLQASYLS ?
Protein secondary structure prediction
• Input: a protein sequence
• Output: for each residue, its secondary structure (SS) class: alpha-helix, beta-strand, or loop
Servers for SS prediction
• AGADIR - An algorithm to predict the helical content of peptides
• APSSP - Advanced Protein Secondary Structure Prediction Server
• CFSSP - Chou & Fasman Secondary Structure Prediction Server
• GOR - Garnier et al, 1996
• HNN - Hierarchical Neural Network method (Guermeur, 1997)
• HTMSRAP - Helical TransMembrane Segment Rotational Angle Prediction
• Jpred - A consensus method for protein secondary structure prediction at University of Dundee
• JUFO - Protein secondary structure prediction from sequence (neural network)
• NetSurfP - Protein Surface Accessibility and Secondary Structure Predictions
• NetTurnP - Prediction of Beta-turn regions in protein sequences
• nnPredict - University of California at San Francisco (UCSF)
• Porter - University College Dublin
• PredictProtein - PHDsec, PHDacc, PHDhtm, PHDtopology, PHDthreader, MaxHom, EvalSec from Columbia University
• Prof - Cascaded Multiple Classifiers for Secondary Structure Prediction
• PSA - BioMolecular Engineering Research Center (BMERC) / Boston
• PSIpred - Various protein structure prediction methods at Bloomsbury Centre for Bioinformatics
• SOPMA - Geourjon and Delage, 1995
• Scratch Protein Predictor
• DLP-SVM - Domain linker prediction using SVM at Tokyo University of Agriculture and Technology
A lot!
SS prediction methods (approximate accuracy)
• ~80%: Other improvements: environment, solvent accessibility (ongoing)
• ~70%: Machine learning techniques: SVM, neural network (2004/5)
• ~60%: Conditional probabilities: GOR method (1978)
• ~50%: Most basic idea, probabilities: Chou-Fasman method (1974)
Protein secondary structure prediction: machine learning approach
[Diagram labels: Query; BLASTp; SwissProt; psiBLAST, MaxHom; MSA (Query plus four Subject sequences); Known structures; Machine Learning Approach; example output HHHLLLHHHEEE]
Evaluating secondary structure prediction methods
• Assume you have a new method for SS prediction.
• Given the following sequence, you get this result:
GLGGYMLGSAMSRPMIHFGNDWEDRYYRENMYRYPNQVYYRPVDQYSNQNNFVHDCVNIT
---EEEEEEE---EEEE-------HHHHHHHH-----EEEE---------EEEEEEEEEE
Coil: - , Beta strand: E , Alpha helix: H
• How can you assess how good your result is?
• Compare it to the TRUTH, assuming this structure exists (what if it doesn’t?).
• Calculate the percentage of amino acids whose secondary structure class (helix, coil, or sheet) is correctly predicted (Q3).
Evaluating secondary structure prediction methods
Original sequence:
GLGGYMLGSAMSRPMIHFGNDWEDRYYRENMYRYPNQVYYRPVDQYSNQNNFVHDCVNIT
Prediction:
---EEEEEEE---EEEE-------HHHHHHHH-----EEEE---------EEEEEEEEEE
Truth (from a PDB file):
-----EE-------------HHHHHHHHHH--------EE--------HHHHHHH-----
Evaluating secondary structure prediction methods
Sequence:   GLGGYMLGSAMSRPMIHFGNDWEDRYYRENMYRYPNQVYYRPVDQYSNQNNFVHDCVNIT
Prediction: ---EEEEEEE---EEEE-------HHHHHHHH-----EEEE---------EEEEEEEEEE
Truth:      -----EE-------------HHHHHHHHHH--------EE--------HHHHHHH-----
Match:      YYYNNYYNNNYYYNNNNYYYNNNNYYYYYYNNYYYYYNYYNYYYYYYYNNNNNNNNNNNN
• Overall, there are 60 amino acids.
• The number of correctly predicted residues (Y) is 31.
• So the Q3 score of this method would be 31/60 = 51.67%.
• What can be the problem with such a calculation?
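The Q3 calculation on this slide can be reproduced with a few lines of Python. This is only a minimal sketch: the function name `q3` is an illustrative choice, and the count is taken over the prediction and truth strings exactly as shown, which contain 60 positions each.

```python
def q3(predicted: str, observed: str) -> float:
    """Fraction of positions whose predicted class equals the observed class."""
    if len(predicted) != len(observed):
        raise ValueError("prediction and truth must have the same length")
    return sum(p == o for p, o in zip(predicted, observed)) / len(observed)

# Strings copied from the slide (- = coil, E = beta strand, H = alpha helix).
prediction = "---EEEEEEE---EEEE-------HHHHHHHH-----EEEE---------EEEEEEEEEE"
truth      = "-----EE-------------HHHHHHHHHH--------EE--------HHHHHHH-----"

matches = sum(p == o for p, o in zip(prediction, truth))
print(f"{matches} of {len(truth)} residues correct, Q3 = {q3(prediction, truth):.2%}")
# prints: 31 of 60 residues correct, Q3 = 51.67%
```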
Evaluating secondary structure prediction methods
• What can be the problem with such a calculation?
• Assume that alpha helix is the SS of 60% of the residues.
• Then a constant prediction of alpha helix for every residue would yield a Q3 of 60%.
• Q3 therefore rewards overprediction of the secondary structure classes that are most common in the database.
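The same effect can be seen on the example from the previous slides. The sketch below, offered as an illustration rather than part of the original exercise, builds a constant predictor that always outputs the most common class in the truth string. In that particular string the most common class happens to be coil rather than helix, but the argument is the same.

```python
from collections import Counter

# Truth string copied from the earlier slide (- = coil, E = strand, H = helix).
truth = "-----EE-------------HHHHHHHHHH--------EE--------HHHHHHH-----"

# Find the most frequent class and predict it for every residue.
majority_class, majority_count = Counter(truth).most_common(1)[0]
constant_prediction = majority_class * len(truth)

baseline_q3 = sum(p == o for p, o in zip(constant_prediction, truth)) / len(truth)
print(f"always predicting '{majority_class}' gives Q3 = {baseline_q3:.0%}")
```

On this string the constant predictor scores higher than the 31 correct residues of the example method, even though it carries no information about the structure.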
Evaluating secondary structure prediction methods
• There are other ways to measure the agreement between the result and the ‘truth’.
• Most of them rely on ratios between the following counts:
• True positive (TP) = correctly identified
• True negative (TN) = correctly rejected
• False positive (FP) = incorrectly identified
• False negative (FN) = incorrectly rejected
Evaluating secondary structure prediction methods
• For instance, for the α-helix:
• TP: number of α-helix residues that are correctly predicted.
• TN: number of residues observed in β-strands and loops that are not predicted as α-helix.
• FP: number of residues incorrectly predicted in α-helix conformation.
• FN: number of residues observed in α-helices but predicted to be either in β-strands or loops.
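The four definitions above treat one class as "positive" and the other two as "negative". A minimal sketch of that one-vs-rest counting is shown below. The function name `confusion_counts` and the 10-residue strings are made up for illustration and are not taken from the slides.

```python
def confusion_counts(predicted: str, observed: str, cls: str) -> dict:
    """Count TP/TN/FP/FN for one class, treating all other classes as negative."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for p, o in zip(predicted, observed):
        if p == cls and o == cls:
            counts["TP"] += 1   # predicted cls, observed cls
        elif p == cls:
            counts["FP"] += 1   # predicted cls, observed something else
        elif o == cls:
            counts["FN"] += 1   # observed cls, predicted something else
        else:
            counts["TN"] += 1   # neither predicted nor observed as cls
    return counts

# Hypothetical toy data (- = coil, E = strand, H = helix).
toy_prediction = "HHHH--EE--"
toy_truth      = "-HHHHH-E--"

for cls in "HE-":
    print(cls, confusion_counts(toy_prediction, toy_truth, cls))
```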
Sensitivity and specificity
• Sensitivity and specificity are statistical measures of the performance of a binary classification test.
• Sensitivity measures the proportion of actual positives that are correctly identified as such (e.g. the percentage of sick people who are correctly identified as having the condition).
• Specificity measures the proportion of negatives that are correctly identified (e.g. the percentage of healthy people who are correctly identified as not having the condition).
Sensitivity and specificity
• Question: If the predictor perfectly predicts the truth, what would be the sensitivity rate? The specificity rate?
• Answer: A perfect predictor would be described as ______% sensitivity (i.e. predict all people from the sick group as sick) and ______% specificity (i.e. not predict anyone from the healthy group as sick).
Sensitivity and specificity
• For any test, there is usually a trade-off between the measures.
• For example: in an airport security setting in which one is testing for potential threats to safety, scanners may be set to trigger on low-risk items like belt buckles and keys (low specificity), in order to reduce the risk of missing objects that do pose a threat to the aircraft and those aboard (high sensitivity).
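The trade-off can be illustrated by moving a decision threshold over a set of scores. In the sketch below the scanner scores and threat labels are entirely hypothetical. Sensitivity is computed as TP / (TP + FN) and specificity as TN / (TN + FP).

```python
# Hypothetical scanner scores (higher = more suspicious) and whether each
# scanned item really was a threat. These numbers are invented for illustration.
scores    = [0.10, 0.20, 0.30, 0.35, 0.40, 0.55, 0.60, 0.70, 0.80, 0.90]
is_threat = [False, False, False, True, False, False, True, False, True, True]

def rates(threshold: float) -> tuple:
    """Return (sensitivity, specificity) when items scoring >= threshold are flagged."""
    tp = sum(s >= threshold and t for s, t in zip(scores, is_threat))
    fn = sum(s < threshold and t for s, t in zip(scores, is_threat))
    tn = sum(s < threshold and not t for s, t in zip(scores, is_threat))
    fp = sum(s >= threshold and not t for s, t in zip(scores, is_threat))
    return tp / (tp + fn), tn / (tn + fp)

for threshold in (0.25, 0.50, 0.75):
    sensitivity, specificity = rates(threshold)
    print(f"threshold {threshold:.2f}: sensitivity {sensitivity:.2f}, specificity {specificity:.2f}")
```

Lowering the threshold flags more items, so fewer threats are missed (higher sensitivity) at the cost of more false alarms (lower specificity).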
Exercise
Calculate the specificity and sensitivity of the alpha helix prediction in the following SS prediction:
Original sequence:
GLGGYMLGSAMSRPMIHFGNDWEDRYYRENMYRYPNQVYYRPVDQYSNQNNFVHDCVNIT
Prediction:
---EEEEEEE---EEEE-------HHHHHHHH-----EEEE---------EEEEEEEEEE
Truth (from a PDB file):
-----EE-------------HHHHHHHHHH--------EE--------HHHHHHH-----
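Answer
Alpha helix (H), counted over the 60 aligned positions:
Prediction: ---EEEEEEE---EEEE-------HHHHHHHH-----EEEE---------EEEEEEEEEE
Truth:      -----EE-------------HHHHHHHHHH--------EE--------HHHHHHH-----
• TP (alpha-helix residues correctly identified) = 6
• FP (residues incorrectly identified as alpha helix) = 2
• FN (alpha-helix residues incorrectly rejected) = 4 + 7 = 11
• TN = 60 - (6 + 2 + 11) = 41
• Sensitivity = TP / (TP + FN) = 6 / 17 ≈ 35%
• Specificity = TN / (TN + FP) = 41 / 43 ≈ 95%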
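The counts in this answer can be checked directly against the two strings. The following is a minimal sketch with an illustrative helper name, `one_vs_rest`. Note that the strings as shown contain 60 positions, so TN here is derived from that length.

```python
# Strings copied from the exercise (- = coil, E = strand, H = helix).
prediction = "---EEEEEEE---EEEE-------HHHHHHHH-----EEEE---------EEEEEEEEEE"
truth      = "-----EE-------------HHHHHHHHHH--------EE--------HHHHHHH-----"

def one_vs_rest(predicted: str, observed: str, cls: str = "H") -> tuple:
    """Return (TP, FP, FN, TN) for class `cls` against all other classes."""
    tp = sum(p == cls and o == cls for p, o in zip(predicted, observed))
    fp = sum(p == cls and o != cls for p, o in zip(predicted, observed))
    fn = sum(p != cls and o == cls for p, o in zip(predicted, observed))
    tn = len(observed) - tp - fp - fn
    return tp, fp, fn, tn

tp, fp, fn, tn = one_vs_rest(prediction, truth)
sensitivity = tp / (tp + fn)   # share of true helix residues that were found
specificity = tn / (tn + fp)   # share of non-helix residues left unflagged

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"sensitivity = {sensitivity:.1%}, specificity = {specificity:.1%}")
```

The low sensitivity next to the high specificity shows that this prediction rarely calls a helix where there is none, but misses most of the helical residues.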
[Figure labels: MSA; final SS prediction; buried/exposed prediction; reliability score]
Jpred 3 – SS prediction server
Original sequence:
GLGGYMLGSAMSRPMIHFGNDWEDRYYRENMYRYPNQVYYRPVDQYSNQNNFVHDCVNIT
Jpred prediction + reliability:
-----HHHH------------HHHHHHHHHHH-------------------EEE------
997500000026777567776017899988721577400467777777773000000699
Truth (from a PDB file):
-----EE-------------HHHHHHHHHH--------EE--------HHHHHHH-----
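One way to use the reliability line is to check whether the prediction is more accurate where the score is high. The sketch below assumes that each digit is a per-residue confidence from 0 (low) to 9 (high), aligned position by position with the prediction. The function name `q3_at_confidence` and the cut-off values are illustrative choices, not part of Jpred.

```python
# Strings copied from the slide (- = coil, E = strand, H = helix).
jpred       = "-----HHHH------------HHHHHHHHHHH-------------------EEE------"
reliability = "997500000026777567776017899988721577400467777777773000000699"
truth       = "-----EE-------------HHHHHHHHHH--------EE--------HHHHHHH-----"

def q3_at_confidence(predicted, observed, confidence, min_score=0):
    """Q3 restricted to residues whose reliability digit is >= min_score.

    Returns (q3, number_of_residues_kept); q3 is None if nothing is kept.
    """
    kept = [(p, o) for p, o, c in zip(predicted, observed, confidence)
            if int(c) >= min_score]
    if not kept:
        return None, 0
    return sum(p == o for p, o in kept) / len(kept), len(kept)

for min_score in (0, 5, 7):
    score, n = q3_at_confidence(jpred, truth, reliability, min_score)
    print(f"reliability >= {min_score}: {n} residues, Q3 = {score:.1%}")
```

With `min_score=0` every residue is kept, so the first line is the ordinary Q3 of the Jpred prediction against the truth.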