Appendix: Automated Methods for Structure Comparison

• Basic problem: how can any two given structures be compared automatically in a meaningful way?
• How can distant relationships be recognized?

program   method
DALI      distance matrix comparison (basis for the FSSP structural classification)
SSAP      dynamic programming (used in CATH to classify topologies)
VAST      converts secondary structures to vectors and aligns the vectors

Structure comparison is fairly easy when two proteins are very similar
• When two proteins are so similar that their sequences can be reliably aligned (say, >35% identical), structure comparison can proceed from the sequence alignment:
1. Align the sequences
   sequence 1: YIREV-GKL
   sequence 2: YITQVRNKA
2. Superpose the structures to minimize the RMSD for equivalent residue pairs in the alignment
Note: these structures do not correspond to the sequences above
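The two steps above can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, percent identity is assumed to be counted over the ungapped alignment columns, and the RMSD function assumes the two coordinate sets have already been superposed (finding the optimal superposition itself, e.g. with the Kabsch algorithm, is not shown).

```python
import math

def equivalent_pairs(aln1, aln2):
    """Return (i, j) residue index pairs for alignment columns without a gap."""
    pairs = []
    i = j = 0  # running residue indices in sequence 1 and sequence 2
    for a, b in zip(aln1, aln2):
        if a != "-" and b != "-":
            pairs.append((i, j))
        if a != "-":
            i += 1
        if b != "-":
            j += 1
    return pairs

def percent_identity(aln1, aln2):
    """Percent of ungapped columns with identical residues (assumed definition)."""
    aligned = [(a, b) for a, b in zip(aln1, aln2) if a != "-" and b != "-"]
    same = sum(1 for a, b in aligned if a == b)
    return 100.0 * same / len(aligned)

def rmsd(coords1, coords2, pairs):
    """RMSD over equivalent residue pairs; coordinates must already be superposed."""
    total = sum(math.dist(coords1[i], coords2[j]) ** 2 for i, j in pairs)
    return math.sqrt(total / len(pairs))

pairs = equivalent_pairs("YIREV-GKL", "YITQVRNKA")
print(pairs)                                        # residue 5 of seq. 2 is unpaired
print(percent_identity("YIREV-GKL", "YITQVRNKA"))   # 50.0
```

The gap in sequence 1 means that residue 5 of sequence 2 has no partner, so it is simply left out of the RMSD calculation.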

It is harder when the proteins are very different...
• If one cannot align the sequences reliably, how does one establish which residues, if any, play equivalent structural roles in the two proteins?
• The answer is to align the structures directly, in such a way that structural equivalences between the two proteins are revealed
• We will discuss how the distance-matrix-based algorithm of DALI solves this problem

Distance Matrices
• 2D representation of a 3D structure
• Plot the sequence against itself
• Identify pairs of residues that are close to each other in space
• Usually the distance between C-alpha atoms is used
• Closeness between residues shows up as dark parts of the matrix
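As a minimal sketch of the idea, the following Python computes a C-alpha distance matrix and a simple contact map from a list of coordinates. The toy coordinates and the 8 Å cutoff are assumptions chosen for illustration, not values from the slides.

```python
import math

def distance_matrix(ca_coords):
    """Pairwise distances between C-alpha atoms; entry [i][j] is residue i vs. j."""
    return [[math.dist(p, q) for q in ca_coords] for p in ca_coords]

def contact_map(dmat, cutoff=8.0):
    """True where two residues are closer than the cutoff (the 'dark' cells)."""
    return [[d <= cutoff for d in row] for row in dmat]

# Toy chain: four residues on a straight line, 3.8 Å apart (made-up coordinates).
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
dmat = distance_matrix(coords)
for row in contact_map(dmat):
    print("".join("#" if close else "." for close in row))
```

The matrix is symmetric with zeros on the diagonal, which is why plots of it are often shown as a single triangle.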

Distance matrices

• Different substructures, such as secondary or supersecondary structures, give rise to distinct patterns in the matrix, e.g. antiparallel vs. parallel beta-sheets
• In principle, one could recognize structural similarity between two proteins by comparing patterns in their distance matrices, but it's not that simple

Problem: two structures with the same topology may differ in the precise location of secondary structure elements along the sequence, i.e. loop lengths may differ
Same fold, different matrices

Or two proteins with a common architecture may differ in connectivity (topology)...
Both are three-stranded antiparallel beta-sheets
How might we compare their distance matrices to reveal this similarity?

DALI algorithm
• It is not useful to compare entire matrices
• Instead, chop the distance matrices into all possible submatrices of 6x6 amino acids
• Compare this set of submatrices for pattern similarities rather than comparing the entire matrices
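The chopping and comparing step can be sketched as follows. This is a simplified illustration rather than DALI's actual scoring: the mean absolute difference between submatrices stands in for DALI's own similarity score, and the function names and the tolerance `tol` are hypothetical.

```python
def submatrix(dmat, i, j, size=6):
    """The size x size block of dmat: rows i..i+size-1, columns j..j+size-1."""
    return [row[j:j + size] for row in dmat[i:i + size]]

def all_submatrices(dmat, size=6):
    """Yield ((i, j), block) for every size x size block of the matrix."""
    n = len(dmat)
    for i in range(n - size + 1):
        for j in range(n - size + 1):
            yield (i, j), submatrix(dmat, i, j, size)

def mean_abs_difference(sub_a, sub_b):
    """Average absolute difference between corresponding distances."""
    diffs = [abs(a - b) for ra, rb in zip(sub_a, sub_b) for a, b in zip(ra, rb)]
    return sum(diffs) / len(diffs)

def matching_submatrices(dmat_a, dmat_b, size=6, tol=1.0):
    """Pairs of block positions whose distance patterns agree within tol (assumed)."""
    matches = []
    for pos_a, sub_a in all_submatrices(dmat_a, size):
        for pos_b, sub_b in all_submatrices(dmat_b, size):
            if mean_abs_difference(sub_a, sub_b) <= tol:
                matches.append((pos_a, pos_b))
    return matches
```

Comparing every block against every other block is expensive for real proteins, so this brute-force loop is only meant to show what is being compared, not how to do it efficiently.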

1. Identify a pair of matching submatrices within the two matrices, and make an initial sequence alignment from this match...

2. Identify a second pair which overlaps the first (contains one common structural element)

3. Combine overlapping pairs into an overall alignment of structurally equivalent sequence regions

4. Rearrange and “collapse” the matrix according to the aligned regions of the sequence. Now the common structural elements are aligned, as are the structurally equivalent residues in the sequence!
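Step 4 can be illustrated with a toy example. The sketch below assumes the aligned residue indices are already known (in practice they come out of steps 1 to 3), and all names and coordinates are made up. Protein B is built as protein A with its two halves swapped along the chain, so the full matrices differ but the collapsed ones agree.

```python
import math

def distance_matrix(coords):
    """Pairwise distance matrix for a list of 3D points."""
    return [[math.dist(p, q) for q in coords] for p in coords]

def collapse(dmat, aligned_idx):
    """Keep only rows and columns of aligned residues, in alignment order."""
    return [[dmat[i][j] for j in aligned_idx] for i in aligned_idx]

def mean_abs_difference(mat_a, mat_b):
    """Average absolute difference between corresponding matrix entries."""
    diffs = [abs(a - b) for ra, rb in zip(mat_a, mat_b) for a, b in zip(ra, rb)]
    return sum(diffs) / len(diffs)

# Toy "protein A" and a "protein B" with the same points in a different chain order.
coords_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0),
            (7.6, 5.0, 0.0), (3.8, 6.0, 0.0), (0.0, 9.0, 0.0)]
coords_b = coords_a[3:] + coords_a[:3]

# Assumed alignment: residue idx_a[k] of A is equivalent to residue idx_b[k] of B.
idx_a = [0, 1, 2, 3, 4, 5]
idx_b = [3, 4, 5, 0, 1, 2]

da, db = distance_matrix(coords_a), distance_matrix(coords_b)
before = mean_abs_difference(da, db)
after = mean_abs_difference(collapse(da, idx_a), collapse(db, idx_b))
print(before > 0.0, after == 0.0)
```

Because the index list for B is not in sequence order, the same collapse operation also handles the case where segments are swapped along the chain.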

All together now...

The Power of DALI
• DALI is quite powerful because it can recognize architectural similarities even when topologies are different.
• It is also flexible because it can be made more topologically restrictive (i.e. no swapping of segments in the chain allowed) to focus on closer relationships.

FSSP uses DALI alignments to classify structures
all PDB entries (8320)
  -> eliminate similar sequences
representative set of structures (947)
  -> divide into domains
representative set of domains (1484)
  -> align domains with DALI!
  -> group domains into fold types (clusters of similar structures) and make a set of representatives of each fold (540)

Judging DALI alignments
• Z-score: how much better than average the alignment is, i.e. how many standard deviations its score lies above the mean of a distribution of alignment scores for random pairs of proteins. >16 very close, 8-16 pretty close, <8 not so close.
• RMSD: root mean square deviation of the alpha carbons for the matching portion of the structures.
• LALI: length of the alignment (the recognizably matching portion of the structures).
• LSEQ2: total length of the sequence being matched.
• %IDE: % sequence identity between the two sequences.
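The Z-score definition and the rule-of-thumb categories above can be written out as a short sketch. The raw scores here are invented numbers, the function names are hypothetical, and using the sample standard deviation is an assumption; how DALI derives its background distribution is not covered here.

```python
import statistics

def z_score(score, random_scores):
    """Standard deviations by which score exceeds the mean of the random scores."""
    mean = statistics.mean(random_scores)
    sd = statistics.stdev(random_scores)  # sample standard deviation (assumed)
    return (score - mean) / sd

def classify(z):
    """Rule-of-thumb categories quoted above."""
    if z > 16:
        return "very close"
    if z >= 8:
        return "pretty close"
    return "not so close"

# Invented raw alignment scores for random protein pairs: mean 10, std. dev. 2.
background = [8.0, 10.0, 12.0]
z = z_score(30.0, background)
print(z, classify(z))  # 10.0 pretty close
```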

If you go into FSSP and search for a particular structure, you'll get an output of its best DALI alignments with other structures:

STRID2   Z     RMSD  LALI  LSEQ2  %IDE  PROTEIN
1plc     24.4  0.0   99    99     100   Plastocyanin (cu2+, ph 6.0)
2pcy     23.4  0.2   99    99     100   Apo-plastocyanin (pH 6.0)
1bqk     12.1  2.0   89    124    29    pseudoazurin
1aac     11.0  1.9   84    104    24    amicyanin
1ibzA    9.1   2.5   83    111    19    nitrosocyanin
1qhqA    8.3   2.4   87    139    29    auracyanin
1rcy     8.2   2.5   90    151    17    rusticyanin biological_unit
1qniA    7.7   2.2   78    572    19    nitrous-oxide reductase
1kcw     7.1   2.4   81    1017   17    ceruloplasmin biological_unit
2cuaA    7.0   2.2   80    122    15    cua fragment
1nwpA    6.7   3.1   85    128    24    azurin
