Master Course Sequence Alignment Lecture 9 Database searching (3)

Center for Integrative Bioinformatics VU

Dot-plots: a simple way to visualise sequence similarity. Filter: 6/10 residues have to match... Can be a bit messy, though...

Dot-plots, what about... • Insertions/deletions -- DNA and proteins • Duplications (e.g. tandem repeats) -- DNA and proteins • Inversions -- DNA. Dot plots are calculated using a diagonal window of preset length that is slid through the search matrix; typically the central cell holds the window score (e.g. sum, average).
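The windowed filter described above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own implementation: the function name `dot_plot` and its parameters are hypothetical, the window score is assumed to be a plain count of identical residues, and the defaults mirror the "6/10 residues have to match" filter mentioned earlier.

```python
def dot_plot(seq_a, seq_b, window=10, min_matches=6):
    """Return the set of (i, j) cells whose diagonal window passes the filter.

    A window of `window` residues is slid along every diagonal; the number of
    identical positions is the window score, and the central cell of the
    window receives a dot when the score reaches `min_matches`.
    """
    half = window // 2
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            # Count identical residues along the diagonal starting at (i, j).
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            if matches >= min_matches:
                dots.add((i + half, j + half))
    return dots


# Self-comparison of a sequence containing a tandem repeat (toy example).
dots = dot_plot("ACGTACGT", "ACGTACGT", window=4, min_matches=4)
```

In a self-comparison the main diagonal is always filled; the off-diagonal dots are what reveal repeats. To look for inversions in DNA one would, under the same assumptions, compare against the reverse complement of the second sequence.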

Dot-plots, self-comparison: direct repeat, tandem repeat, inverted repeat


Protein structure hierarchical levels: PRIMARY STRUCTURE (amino acid sequence) VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH; SECONDARY STRUCTURE (helices, strands); TERTIARY STRUCTURE (fold); QUATERNARY STRUCTURE (oligomers)

Globin fold: α protein, myoglobin, PDB: 1MBN. Helices are labelled ‘A’ (blue) to ‘H’ (red). The D helix can be missing in some globins: what happens with the alignment?

β sandwich: β protein, immunoglobulin, PDB: 7FAB

TIM barrel: α/β protein, Triose phosphate IsoMerase, PDB: 1TIM

Pyruvate kinase (phosphotransferase): β barrel regulatory domain; α/β barrel catalytic substrate binding domain; α/β nucleotide binding domain

What does this mean for alignment? • Alignments need to be able to skip anything from secondary structural elements to complete domains (i.e. putting gaps opposite these motifs in the shorter sequence). • Depending on the gap penalties chosen, the algorithm might have difficulty making such long gaps (for example when using high affine gap penalties), resulting in an incorrect alignment.
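To see why long gaps are hard to make, it helps to write down what an affine gap penalty charges. The sketch below assumes the common convention "opening penalty for the first gapped residue, extension penalty for each further one"; other conventions exist, and the function name and numeric values are illustrative only.

```python
def affine_gap_cost(length, gap_open=10.0, gap_extend=0.5):
    """Penalty for one gap of `length` residues under an affine scheme.

    Assumed convention: the first gapped residue costs `gap_open`, every
    additional residue costs `gap_extend`.
    """
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * (length - 1)


# Skipping a 21-residue helix with a low versus a high extension penalty.
cheap = affine_gap_cost(21, gap_open=10.0, gap_extend=0.5)
costly = affine_gap_cost(21, gap_open=10.0, gap_extend=2.0)
```

With the higher extension penalty the same gap costs far more, so the aligner may prefer to force mismatched residues against each other rather than open the long gap that the structure actually calls for.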

What does this mean for homology searching? • Database searching algorithms just need to decide if the alignment score is good enough for inferring homology • Sometimes, alignments can be incorrect but the score can be close enough for the database searching method to correctly identify the DB sequence as a homolog (or not) • However, for distant hits alignments become crucial

Sequence Analysis/Database Searching Finding relationships between genes and gene products of different species, including those at large evolutionary distances

Compared to the preceding plot, RMSD is better able to pinpoint relationships between more divergent sequences (RMSD stays relatively small for longer than PAM distance does): structure is more conserved than sequence. Note that the spread around RMSD is larger.

Structural superpositioning RMSD: how far are equivalenced Cα atoms separated on average?
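As a worked illustration, RMSD is the square root of the mean squared distance between equivalenced atoms. The sketch below assumes the two structures have already been superposed and that the Cα coordinates are supplied as equally long lists of (x, y, z) tuples in equivalenced order; finding the optimal superposition itself is not shown, and the function name is hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equivalenced coordinate lists.

    Both inputs are lists of (x, y, z) tuples; position k in one list is
    assumed to be equivalenced with position k in the other.
    """
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


# Two toy "structures" of two atoms each; only the second atom is displaced.
value = rmsd([(0, 0, 0), (1, 0, 0)], [(0, 0, 0), (1, 0, 2)])
```

Because the distances are squared before averaging, a few badly superposed atoms dominate the value, which is one reason to report well-superposed and poorly matched regions separately.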

Two superposed protein structures with two well-superposed helices. Red: well superposed. Blue: low match quality. The C5 anaphylatoxin proteins from human (PDB code 1kjs) and pig (PDB code 1c5a) are superposed.

How to assess homology search methods • We need an annotated database, so we know which sequences belong to what homologous (super)families • Examples of databases of homologous families are PFAM, Homstrad or Astral • The idea is to take a protein sequence from a given homologous family, then run the search method, and then assess how well the method has carried out the search • This should be repeated for many query sequences and then the overall performance can be measured

Example
C; family: zinc finger -- CCHH-type
C; class: small
C; reordered by kitschorder 1.0a
C; reordered by kitschorder 1.0a
C; last update 7/9/98
>P1;1zaa1
structureX:1zaa: 3 :C: 33 :C:zinc-finger (ZIF268, domain 1):Mus musculus:2.10:18.20
------RPYACPVESCDRRFSRSDELTRHI-RI-HTGQK*
>P1;1zaa2
structureX:1zaa: 34 :C: 61 :C:zinc-finger (ZIF268, domain 2):Mus musculus:2.10:18.20
-------PFQCRI--CMRNFSRSDHLTTHI-RT-HTGEK*
>P1;1zaa3
structureX:1zaa: 62 :C: 87 :C:zinc-finger (ZIF268, domain 3):Mus musculus:2.10:18.20
-------PFACDI--CGRKFARSDERKRHT-KI-HLR--*
>P1;1ard
structureN:1ard: 102 : : 130 : :zinc-finger (transcription factor ADR1):Saccharomyces cerevisiae:-1.00:-1.00
------RSFVCEV--CTRAFARQEHLKRHY-RS-HTNEK*
>P1;1znf
structureN:1znf: 1 : : 25 : :zinc-finger (XFIN, 31st domain):Xenopus laevis:-1.00:-1.00
--------YKCGL--CERSFVEKSALSRHQ-RV-HKN--*
>P1;2drp2
structureX:2drp: 137 :A: 165:A:zinc-finger (tramtrack, domain 2):Drosophila melanogaster:2.80:19.30
----NVKVYPCPF--CFKEFTRKDNMTAHV-KIIHK---*
>P1;3znf
structureN:3znf: 1 : : 30 : :zinc-finger (enhancer binding protein):Homo sapiens:-1.00:-1.00
------RPYHCSY--CNFSFKTKGNLTKHMKSKAHSKK-*
>P1;5znf
structureN:5znf: 1 : : 30 : :zinc-finger (ZFY-6T):Homo sapiens:-1.00:-1.00
------KTYQCQY--CEYRSADSSNLKTHIKTK-HSKEK*
You can also look at superposed structures.

Sequence searching: a QUERY is searched against a DATABASE. Hits scoring above the threshold (T) are POSITIVES (true positives and false positives); sequences below it are NEGATIVES (true negatives and false negatives).

So what have we got
              Observed P   Observed N
Predicted P   TP           FP
Predicted N   FN           TN

Sensitivity and Specificity – medical world

Receiver Operating Characteristic (ROC) curve • Plot Sensitivity (TP/(TP+FN)) against 1 - Specificity (1 - TN/(FP+TN)), where the latter is called error. Sensitivity is also called Coverage. Axes: Sensitivity (vertical) against Error = 1 - specificity (horizontal).
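One way to obtain the points of such a curve is to rank the database hits by score and lower the acceptance threshold one hit at a time. The sketch below assumes each hit is a (score, is_true_homolog) pair taken from an annotated benchmark, that higher scores are better, and that tied scores need no special handling; all names are illustrative.

```python
def roc_points(scored_hits):
    """Return (error, sensitivity) points for a ranked list of database hits.

    `scored_hits` is a list of (score, is_true_homolog) pairs. The threshold
    is lowered past one hit at a time, starting above the best score.
    """
    ranked = sorted(scored_hits, key=lambda hit: hit[0], reverse=True)
    total_pos = sum(1 for _, is_homolog in ranked if is_homolog)
    total_neg = len(ranked) - total_pos
    if total_pos == 0 or total_neg == 0:
        raise ValueError("need at least one homolog and one non-homolog")
    points = [(0.0, 0.0)]  # threshold above every score: nothing accepted
    tp = fp = 0
    for _, is_homolog in ranked:
        if is_homolog:
            tp += 1
        else:
            fp += 1
        # error = FP / (FP + TN) = 1 - specificity; sensitivity = TP / (TP + FN)
        points.append((fp / total_neg, tp / total_pos))
    return points


hits = [(90, True), (80, True), (70, False), (60, True), (50, False)]
points = roc_points(hits)
```

A method that ranks all true homologs above all non-homologs produces a curve that rises straight to sensitivity 1 before any error is made; the closer a curve stays to that corner, the better the search method.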

Database Search Algorithms: Sensitivity, Selectivity • Sensitivity: the ability to detect weak similarities between sequences (often due to long evolutionary separation). Increasing sensitivity reduces false negatives, i.e. those database sequences similar to the query, but rejected. Sensitivity (or Coverage) = TP / (TP + FN) • Selectivity: the ability to screen out similarities due to chance. Increasing selectivity reduces false positives, i.e. those sequences recognized as similar when they are not. Selectivity (or Positive Predictive Value) = TP / (TP + FP) • Specificity also describes the ability of the method to select proper hits. Specificity = TN / (TN + FP) Courtesy of Gary Benson (ISSCB 2003)
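The three formulas above translate directly into code. The sketch below is a plain restatement of them; the function name and the example counts are made up for illustration, and all denominators are assumed to be non-zero.

```python
def search_metrics(tp, fp, tn, fn):
    """Sensitivity, selectivity and specificity from confusion-matrix counts.

    Assumes every denominator is non-zero, i.e. the benchmark contains
    homologs and non-homologs and the method reports at least one hit.
    """
    return {
        "sensitivity": tp / (tp + fn),  # coverage
        "selectivity": tp / (tp + fp),  # positive predictive value
        "specificity": tn / (tn + fp),
    }


# Toy search: 10 true homologs in a database of 100 sequences,
# 8 of them found, plus 2 unrelated sequences reported as hits.
metrics = search_metrics(tp=8, fp=2, tn=88, fn=2)
```

In a large database true negatives vastly outnumber everything else, so specificity stays close to 1 even for a sloppy method; selectivity is usually the more telling number in that setting.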

COG: Clusters of Orthologous Groups • Orthologues are found using bi-directional best hit searching with PSI-BLAST • All COG family members are supposed to have the same function • A search with an unknown sequence only needs to hit a single member of a COG family; the annotation can then be transferred. Example: COG2813 http://www.ncbi.nlm.nih.gov/COG/
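The bi-directional best hit criterion can be sketched as follows. This is only an illustration of the criterion, not the actual COG construction pipeline: it assumes the search results for each direction are already available as dictionaries mapping (query, subject) pairs to scores, and the gene identifiers are hypothetical.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    `scores` maps (query_id, subject_id) pairs to a similarity score.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical scores from searching genome A against genome B and back.
a_vs_b = {("a1", "b1"): 200, ("a1", "b2"): 50, ("a2", "b2"): 120, ("a3", "b1"): 90}
b_vs_a = {("b1", "a1"): 190, ("b2", "a2"): 110, ("b2", "a1"): 40}
pairs = bidirectional_best_hits(a_vs_b, b_vs_a)
```

Here `a3` finds `b1` as its best hit, but `b1` prefers `a1`, so the pair is rejected; requiring agreement in both directions is what guards against pairing a gene with a paralog of its true orthologue.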

Structure-based function prediction • SCOP (http://scop.berkeley.edu/) is a protein structure classification database where proteins are grouped into a hierarchy of families, superfamilies, folds and classes, based on their structural and functional similarities

Structure-based function prediction • SCOP hierarchy – the top level: 11 classes

Structure-based function prediction: example classes are all-alpha protein, all-beta protein, alpha-beta protein, membrane protein and coiled-coil protein

Structure-based function prediction • SCOP hierarchy – the second level: 800 folds

Structure-based function prediction • SCOP hierarchy - third level: 1294 superfamilies

Structure-based function prediction • SCOP hierarchy - fourth level: 2327 families

Structure-based function prediction • Using a sequence-structure alignment method, one can predict that a protein belongs to a SCOP family, superfamily or fold • Proteins predicted to be in the same SCOP family are orthologous • Proteins predicted to be in the same SCOP superfamily are homologous • Proteins predicted to be in the same SCOP fold are structurally analogous

Profile wander
