CENTER FOR INTEGRATIVE BIOINFORMATICS VU. Master Course Sequence Alignment, Lecture 9: Database searching (3)
Dot-plots: a simple way to visualise sequence similarity. Filter: 6/10 residues have to match... Can be a bit messy, though...
Dot-plots, what about...
• Insertions/deletions (DNA and proteins)
• Duplications, e.g. tandem repeats (DNA and proteins)
• Inversions (DNA)
Dot plots are calculated using a diagonal window of preset length that is slid through the search matrix. Typically the central cell holds the window score (e.g. sum, average).
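A minimal sketch of the windowed dot-plot described above, assuming simple identity matching (a real tool would typically score with a substitution matrix). The function name `dot_plot` and its parameters are illustrative, not taken from the lecture; the defaults mirror the "6/10 residues have to match" filter.

```python
def dot_plot(seq_a, seq_b, window=10, min_matches=6):
    """Return the set of (i, j) cells whose diagonal window passes the filter.

    The window is centred (as far as possible) on cell (i, j) and runs along
    the diagonal; the cell gets a dot if at least `min_matches` of the
    `window` position pairs are identical. Positions outside either sequence
    count as mismatches (an assumption of this sketch).
    """
    half = window // 2
    dots = set()
    for i in range(len(seq_a)):
        for j in range(len(seq_b)):
            matches = 0
            for k in range(-half, window - half):
                a, b = i + k, j + k
                if 0 <= a < len(seq_a) and 0 <= b < len(seq_b):
                    if seq_a[a] == seq_b[b]:
                        matches += 1
            if matches >= min_matches:
                dots.add((i, j))
    return dots


if __name__ == "__main__":
    # Self-comparison of a short sequence with a tight filter.
    for cell in sorted(dot_plot("ACGT", "ACGT", window=3, min_matches=3)):
        print(cell)
```

Lowering `min_matches` relative to `window` makes the plot more sensitive but also messier, which is the trade-off the filter is meant to control.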
Dot-plots, self-comparison: direct repeat, tandem repeat, inverted repeat
Protein structure hierarchical levels
PRIMARY STRUCTURE (amino acid sequence): VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH
SECONDARY STRUCTURE (helices, strands)
TERTIARY STRUCTURE (fold)
QUATERNARY STRUCTURE (oligomers)
Globin fold: α protein, myoglobin (PDB: 1MBN). Helices are labelled ‘A’ (blue) to ‘H’ (red). The D helix can be missing in some globins: what happens with the alignment?
β sandwich: β protein, immunoglobulin (PDB: 7FAB)
TIM barrel: α/β protein, Triose phosphate IsoMerase (PDB: 1TIM)
Pyruvate kinase (phosphotransferase)
• β barrel regulatory domain
• α/β barrel catalytic substrate binding domain
• α/β nucleotide binding domain
What does this mean for alignment?
• Alignments need to be able to skip anything from single secondary structural elements to complete domains (i.e. putting gaps opposite these motifs in the shorter sequence).
• Depending on the gap penalties chosen, the algorithm might have difficulty making such long gaps (for example when using high affine gap penalties), resulting in an incorrect alignment.
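To make the gap-penalty point concrete, here is a small arithmetic sketch. It assumes one common affine convention, in which the first gap position costs `gap_open` and each further position costs `gap_extend`; other conventions exist, and the penalty values below are purely illustrative rather than taken from the lecture.

```python
def affine_gap_penalty(length, gap_open, gap_extend):
    """Penalty for one gap of `length` positions under an affine scheme.

    Assumed convention: gap_open for the first position, gap_extend for
    every additional one. A gap of length 0 costs nothing.
    """
    if length <= 0:
        return 0
    return gap_open + gap_extend * (length - 1)


if __name__ == "__main__":
    # Hypothetical gap lengths: a missing helix vs. a missing domain.
    for label, length in [("helix", 30), ("domain", 120)]:
        low = affine_gap_penalty(length, gap_open=10, gap_extend=1)
        high = affine_gap_penalty(length, gap_open=12, gap_extend=2)
        print(label, length, low, high)
```

With the higher extension penalty, the cost of skipping a whole domain roughly doubles, so the aligned flanking regions must score correspondingly higher before the algorithm prefers the single long gap over a wrong but gap-poor alignment.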
What does this mean for homology searching?
• Database searching algorithms just need to decide if the alignment score is good enough for inferring homology
• Sometimes, alignments can be incorrect but the score can be close enough for the database searching method to correctly identify the DB sequence as a homolog (or not)
• However, for distant hits alignments become crucial
Sequence Analysis/Database Searching: finding relationships between genes and gene products of different species, including those at large evolutionary distances
Compared to the preceding plot, RMSD is better able to pin-point relationships between more divergent sequences (RMSD stays relatively small for a longer time as compared to PAM distance) – Structure more conserved than sequence. Note that the spread around RMSD is larger
Structural superpositioning RMSD: how far are equivalenced Cα atoms separated on average?
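As a sketch of the quantity in question: RMSD is the root of the mean squared distance between equivalenced atoms, so it is not a plain average distance. The function below assumes the two structures have already been superposed and that the Cα atoms are supplied as equally long lists of (x, y, z) tuples in equivalenced order; finding the optimal superposition itself is not shown.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equivalenced coordinate lists.

    coords_a[i] and coords_b[i] are (x, y, z) tuples for the i-th pair of
    equivalenced C-alpha atoms, after superposition.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


if __name__ == "__main__":
    # Two made-up atom pairs: one coincides, one is 2 units apart.
    a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    b = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
    print(rmsd(a, b))
```

Because distances are squared before averaging, a few badly superposed atom pairs raise the RMSD more than they would raise a simple mean distance.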
Two superposed protein structures with two well-superposed helices. Red: well superposed. Blue: low match quality. The human (PDB code 1kjs) and pig (PDB code 1c5a) C5 anaphylatoxin proteins are superposed.
How to assess homology search methods
• We need an annotated database, so we know which sequences belong to which homologous (super)families
• Examples of databases of homologous families are PFAM, Homstrad or Astral
• The idea is to take a protein sequence from a given homologous family, run the search method, and then assess how well the method has carried out the search
• This should be repeated for many query sequences, after which the overall performance can be measured
Example
C; family: zinc finger -- CCHH-type
C; class: small
C; reordered by kitschorder 1.0a
C; reordered by kitschorder 1.0a
C; last update 7/9/98
>P1;1zaa1
structureX:1zaa: 3 :C: 33 :C:zinc-finger (ZIF268, domain 1):Mus musculus:2.10:18.20
------RPYACPVESCDRRFSRSDELTRHI-RI-HTGQK*
>P1;1zaa2
structureX:1zaa: 34 :C: 61 :C:zinc-finger (ZIF268, domain 2):Mus musculus:2.10:18.20
-------PFQCRI--CMRNFSRSDHLTTHI-RT-HTGEK*
>P1;1zaa3
structureX:1zaa: 62 :C: 87 :C:zinc-finger (ZIF268, domain 3):Mus musculus:2.10:18.20
-------PFACDI--CGRKFARSDERKRHT-KI-HLR--*
>P1;1ard
structureN:1ard: 102 : : 130 : :zinc-finger (transcription factor ADR1):Saccharomyces cerevisiae:-1.00:-1.00
------RSFVCEV--CTRAFARQEHLKRHY-RS-HTNEK*
>P1;1znf
structureN:1znf: 1 : : 25 : :zinc-finger (XFIN, 31st domain):Xenopus laevis:-1.00:-1.00
--------YKCGL--CERSFVEKSALSRHQ-RV-HKN--*
>P1;2drp2
structureX:2drp: 137 :A: 165:A:zinc-finger (tramtrack, domain 2):Drosophila melanogaster:2.80:19.30
----NVKVYPCPF--CFKEFTRKDNMTAHV-KIIHK---*
>P1;3znf
structureN:3znf: 1 : : 30 : :zinc-finger (enhancer binding protein):Homo sapiens:-1.00:-1.00
------RPYHCSY--CNFSFKTKGNLTKHMKSKAHSKK-*
>P1;5znf
structureN:5znf: 1 : : 30 : :zinc-finger (ZFY-6T):Homo sapiens:-1.00:-1.00
------KTYQCQY--CEYRSADSSNLKTHIKTK-HSKEK*
You can also look at superposed structures..
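Sequence searching (figure): a QUERY is searched against a DATABASE, and a threshold T separates the POSITIVES from the NEGATIVES. Above the threshold: True Positives and False Positives. Below the threshold: True Negatives and False Negatives.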
So what have we got?
              Observed P   Observed N
Predicted P   TP           FP
Predicted N   FN           TN
Receiver Operating Characteristic (ROC) curve
• Plot Sensitivity (TP/(TP+FN)) against 1 - Specificity (1 - TN/(FP+TN)), where the latter is called the error
• Sensitivity is also called Coverage
Axes: Sensitivity (vertical) against Error = 1 - specificity (horizontal)
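A minimal sketch of how such a curve can be traced from a single database search, assuming the hits come as (score, is_true_homolog) pairs with higher scores being better and with the true labels taken from an annotated benchmark. The function name `roc_points` is illustrative, and tied scores are not treated specially here.

```python
def roc_points(hits):
    """Return (error, sensitivity) points for a ranked hit list.

    `hits` is a list of (score, is_true_homolog) pairs. The threshold is
    lowered one hit at a time, from the best score to the worst; after each
    step we record error = FP / (FP + TN) and sensitivity = TP / (TP + FN).
    """
    ranked = sorted(hits, key=lambda hit: hit[0], reverse=True)
    total_pos = sum(1 for _, is_homolog in ranked if is_homolog)
    total_neg = len(ranked) - total_pos
    if total_pos == 0 or total_neg == 0:
        raise ValueError("need at least one true homolog and one non-homolog")
    points = [(0.0, 0.0)]  # threshold above the best score: nothing accepted
    tp = fp = 0
    for _, is_homolog in ranked:
        if is_homolog:
            tp += 1
        else:
            fp += 1
        points.append((fp / total_neg, tp / total_pos))
    return points


if __name__ == "__main__":
    # Made-up scores and labels for five database sequences.
    example = [(90, True), (80, True), (70, False), (60, True), (50, False)]
    for error, sensitivity in roc_points(example):
        print(error, sensitivity)
```

A method that ranks all true homologs ahead of all non-homologs produces a curve that rises to sensitivity 1 before the error leaves 0.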
Database Search Algorithms: Sensitivity, Selectivity
• Sensitivity – the ability to detect weak similarities between sequences (often due to long evolutionary separation). Increasing sensitivity reduces false negatives, i.e. those database sequences similar to the query, but rejected. Sensitivity (or Coverage) = TP / (TP + FN)
• Selectivity – the ability to screen out similarities due to chance. Increasing selectivity reduces false positives, i.e. those sequences recognized as similar when they are not. Selectivity (or Positive Predictive Value) = TP / (TP + FP)
• Specificity also describes the ability of the method to select proper hits. Specificity = TN / (TN + FP)
Courtesy of Gary Benson (ISSCB 2003)
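The three measures can be computed directly from the four counts, as in this small sketch. The function name `search_metrics` and the counts in the usage example are made up for illustration.

```python
def search_metrics(tp, fp, tn, fn):
    """Compute sensitivity, selectivity and specificity from hit counts.

    tp, fp, tn, fn are the numbers of true positives, false positives,
    true negatives and false negatives for one search threshold.
    """
    return {
        "sensitivity": tp / (tp + fn),  # coverage
        "selectivity": tp / (tp + fp),  # positive predictive value
        "specificity": tn / (tn + fp),
    }


if __name__ == "__main__":
    # Hypothetical search: 10 true homologs in a database of 100 sequences.
    print(search_metrics(tp=8, fp=2, tn=88, fn=2))
```

In a database search the negatives usually vastly outnumber the positives, so specificity tends to stay close to 1 even when a noticeable fraction of the reported hits is wrong; selectivity is the measure that exposes this.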
COG – Clusters of Orthologous Groups
• Orthologues found using bi-directional best hit searching with PSI-BLAST
• All COG family members are supposed to have the same function
• Searching with an unknown sequence only needs to hit a single member of a COG family; the annotation can then be transferred
Example: COG2813
http://www.ncbi.nlm.nih.gov/COG/
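The bi-directional best hit idea can be sketched as follows. This is only the pairwise step, not the COG construction procedure itself, and it assumes the best hits have already been extracted from the search output into two dictionaries; the gene identifiers are hypothetical.

```python
def bidirectional_best_hits(best_a_to_b, best_b_to_a):
    """Return sorted (gene_a, gene_b) pairs that are each other's best hit.

    best_a_to_b maps each gene of genome A to its best-scoring hit in
    genome B; best_b_to_a is the same for searches in the other direction.
    """
    return sorted(
        (gene_a, gene_b)
        for gene_a, gene_b in best_a_to_b.items()
        if best_b_to_a.get(gene_b) == gene_a
    )


if __name__ == "__main__":
    # Hypothetical best hits between two small genomes.
    a_to_b = {"a1": "b1", "a2": "b3", "a3": "b2"}
    b_to_a = {"b1": "a1", "b2": "a3", "b3": "a9"}
    print(bidirectional_best_hits(a_to_b, b_to_a))
```

Requiring the hit to hold in both directions filters out cases where a gene merely finds its closest paralogue in the other genome.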
Structure-based function prediction • SCOP (http://scop.berkeley.edu/) is a protein structure classification database where proteins are grouped into a hierarchy of families, superfamilies, folds and classes, based on their structural and functional similarities
Structure-based function prediction • SCOP hierarchy – the top level: 11 classes
Structure-based function prediction: example classes (figure): All-alpha protein, All-beta protein, Alpha-beta protein, Coiled-coil protein, membrane protein
Structure-based function prediction • SCOP hierarchy – the second level: 800 folds
Structure-based function prediction • SCOP hierarchy - third level: 1294 superfamilies
Structure-based function prediction • SCOP hierarchy - fourth level: 2327 families
Structure-based function prediction
• Using a sequence-structure alignment method, one can predict that a protein belongs to a SCOP family, superfamily or fold
• Proteins predicted to be in the same SCOP family are orthologous
• Proteins predicted to be in the same SCOP superfamily are homologous
• Proteins predicted to be in the same SCOP fold are structurally analogous
Hierarchy (figure): folds, superfamilies, families