460 likes | 635 Views
Tutorial 9. Secondary Structures. Agenda. Evaluating secondary structure prediction tools (and bioinformatics tools in general) Jpred - SS prediction server Pfam - protein families DB Uni-Prot - a central repository of proteins Cool story of the day: How can you help cure diseases
E N D
Tutorial 9 Secondary Structures
Agenda • Evaluating secondary structure prediction tools (and bioinformatics tools in general) • Jpred- SS prediction server • Pfam- protein families DB • Uni-Prot - a central repository of proteins • Cool story of the day: How can you help cure diseases by playing a game? (Fold-It)
Protein Secondary Structure Prediction ? ? TDVEAAVNSLVNLYLQASYLS ?
Protein secondary structure prediction • Input: protein sequence • Output: for each residue its associated Secondary structure (SS): alpha-helix, beta-strand, or loop.
Evaluating secondary structure prediction methods • Assume you have a new method for SS prediction. • Given the following sequence you get the result: GLGGYMLGSAMSRPMIHFGNDWEDRYYRENMYRYPNQVYYRPVDQYSNQNNFVHDCVNIT ---EEEEEEE---EEEE-------HHHHHHHH-----EEEE---------EEEEEEEEEE Coil: - , Beta strand: E , Alpha helix: H How can you assess how good your result is? Compare it to the TRUTH, assuming this structure exists. (what if it doesn’t?) Calculate the percentage of amino acids whose secondary structure class (helix, coil, or sheet) is correctly predicted.
Evaluating secondary structure prediction methods Original sequence: GLGGYMLGSAMSRPMIHFGNDWEDRYYRENMYRYPNQVYYRPVDQYSNQNNFVHDCVNIT Prediction: ---EEEEEEE---EEEE-------HHHHHHHH-----EEEE---------EEEEEEEEEE Truth (from a PDB file): -----EE-------------HHHHHHHHHH--------EE--------HHHHHHH-----
Evaluating secondary structure prediction methods GLGGYMLGSAMSRPMIHFGNDWEDRYYRENMYRYPNQVYYRPVDQYSNQNNFVHDCVNIT ---EEEEEEE---EEEE-------HHHHHHHH-----EEEE---------EEEEEEEEEE -----EE-------------HHHHHHHHHH--------EE--------HHHHHHH----- YYYNNYYNNNYYYNNNNYYYNNNNYYYYYYNNYYYYYNYYNYYYYYYYNNNNNNNNNNNN • Overall, there are 61 AA. • Number of correctly predicted (Y) is 31. • So the score of this method would be: 50.81% What can be the problem with such calculation?
Evaluating secondary structure prediction methods • What can be the problem with such calculation? • Assume that alpha helix is the SS of 60% of the residues. • Then a constant prediction of alpha helices would yield a score of 60%. • This method rewards over prediction of more common secondary structure classes in the database.
Evaluating secondary structure prediction methods There are other ways to measure correlation between the result and the ‘truth’. Most of them rely on the ratio between: True positive (TP) = correctly identified True negative (TN) = correctly rejected False positive (FP) = incorrectly identified False negative (FN) = incorrectly rejected
Evaluating secondary structure prediction methods • For instance, for the α-helix: • TP: number of α-helix residues that are correctly predicted. • TN: number of residues observed in β-strands and loops that are not predicted as α-helix. • FP: number of residues incorrectly predicted in α-helix conformation. • FN: number of residues observed in α-helices but predicted to be either in β-strands or loops.
Sensitivity and specificity • Sensitivity and specificity are statistical measures of the performance of a classification test. • Sensitivity measures the proportion of actual positives which are correctly identified as such (e.g. the percentage of sick people who are correctly identified as having the condition). • Specificity measures the proportion of negatives which are correctly identified (e.g. the percentage of healthy people who are correctly identified as not having the condition).
Sensitivity and specificity • Question: • If the predictor perfectly predicts the truth, what would be the sensitivity rate? The specificity rate? • Answer: • A perfect predictor would be described as ______% sensitivity (i.e. predict all people from the sick group as sick) and ______% specificity (i.e. not predict anyone from the healthy group as sick).
Sensitivity and specificity • For any classification test, there is usually a trade-off between the measures. • Example: in an airport security setting in which one is testing for potential threats to safety, scanners may be set to trigger on low-risk items like belt buckles and keys (low specificity), in order to reduce the risk of missing objects that do pose a threat to the aircraft and those aboard (high sensitivity).
Bioinformatics examples of sensitivity vs. specificity • Mapping sequenced reads to a reference genome: when increasing allowed number of mismatches per read Sensitivity Specificity • BLAST: when lowering a P-value threshold Sensitivity Specificity • K-means clustering: when increasing the number of neighbors K Sensitivity Specificity
Exercise Calculate the specificity and sensitivity of the alpha helix prediction in the following SS prediction: Original sequence: GLGGYMLGSAMSRPMIHFGNDWEDRYYRENMYRYPNQVYYRPVDQYSNQNNFVHDCVNIT Prediction: ---EEEEEEE---EEEE-------HHHHHHHH-----EEEE---------EEEEEEEEEE Truth (from a PDB file): -----EE-------------HHHHHHHHHH--------EE--------HHHHHHH-----
Answer Pred. ---EEEEEEE---EEEE-------HHHHHHHH-----EEEE---------EEEEEEEEEE -----EE-------------HHHHHHHHHH--------EE--------HHHHHHH----- Alpha helix: • TP = 6 • FP=2 • FN=4+7=11 • TN=61-(6+2+11)=42 Truth TP - Alpha helices Correctly identified FP - Alpha helices Incorrectly identified FN - Alpha helices incorrectly rejected
MSA Final SS prediction Buried/exposed prediction Reliability score
Jpred 3 – SS prediction server Original sequence: GLGGYMLGSAMSRPMIHFGNDWEDRYYRENMYRYPNQVYYRPVDQYSNQNNFVHDCVNIT Jpred Prediction + reliability: -----HHHH------------HHHHHHHHHHH-------------------EEE------ 997500000026777567776017899988721577400467777777773000000699 Truth (from a PDB file): -----EE-------------HHHHHHHHHH--------EE--------HHHHHHH-----
Pfam http://pfam.sanger.ac.uk/ Proteins are generally composed of one or more functional regions, commonly termed domains. Different combinations of domains give rise to the diverse range of proteins found in nature. The Pfam database is a large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs).
Glossary Domain A structural unit which can be found in multiple protein contexts. Domains are long motifs (30-100 aa). Family A collection of related proteins
What kind of domains can we find in Pfam? Trusted Domains Repeats Fragment Domains Nested Domains Disulfide bonds Important residues (e.g active sites) Trans membrane domains
Domains Domain range and score
Description Structure info Gene Ontology Links
UniProt http://www.uniprot.org/ • The Universal Protein Resource (UniProt) is a central repository of protein sequence, function, classification and cross reference. • It was created by joining the information contained in swiss-Prot and TrEMBL.
Protein search Uniprot input Reviewed protein
Sequence download Uniprot output Accession number Protein status organism length
Information for one protein General information annotations
General keywords GO annotation (MF, BP, CC)
Alternative splicing isoforms Features in the sequence
Sequences References
Cool Story of the day How can you help cure diseases by playing a game?
Foldit is an online game in which humans try to solve one of the hardest computational problems in biology: protein folding. You don't need to know anything about biology to play the game, although a little background will help. http://fold.it/portal/
Even small proteins have on the order of 1000 degrees of freedom. Human reasoning may optimize the complex algorithms.