Centre for Integrative Bioinformatics VU (IBIVU)

Bioinformatics master course: DNA/Protein structure-function analysis and prediction. Lecture 5: Protein Fold Families • Centre for Integrative Bioinformatics VU (IBIVU) • Faculty of Sciences / Faculty of Earth & Life Sciences

Protein structure evolution: Insertion/deletion of secondary structural elements can ‘easily’ be done at loop sites

Protein structure evolution: Insertion/deletion of structural domains can ‘easily’ be done at loop sites

Fold classification • four broad structural protein fold classes: • all-α • all-β • α/β (α mixed with β), • α+β (separated α and β regions)

The first protein structure in 1960: Myoglobin - α fold

There are a number of examples of small proteins (or peptides) that consist of little more than a single helix. A striking example is alamethicin, a transmembrane voltage-gated ion channel that acts as a peptide antibiotic.

Coiled-coil domains: Tropomyosin. This long protein is involved in muscle contraction.

Alpha-helix interaction: The interface areas of two packed helices should have complementary surfaces. An α-helix surface can be thought of as consisting of grooves and ridges, like a screw thread: for instance, the side chains of every 4th residue form an “i+4” ridge (because there are 3.6 residues per turn). The direction of this ridge is 26° from the direction of the helix axis. Therefore, if two helices pack such that a ridge from each fits into the other's groove, the expected angle between them is 2 × 26° = 52°. Indeed, the observed distribution of this angle between packed α-helices shows a sharp peak at 50°. Ridges can also be formed by other stacking patterns of residues, such as every 3rd residue, or indeed every residue. The “i+4” ridge is believed to be the most common because residues at every 4th position have side chains that are more closely aligned than in “i+3” or “i+1” ridges. http://swissmodel.expasy.org/course/text/chapter4.htm
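The alignment argument on this slide can be checked with a few lines of arithmetic. The sketch below assumes an ideal helix with exactly 3.6 residues per turn and takes the 26° ridge angle from the slide as given; the function names are illustrative only.

```python
# Minimal sketch: angular offsets of side chains around an ideal alpha-helix.
# Assumption: exactly 3.6 residues per turn (100 degrees per residue).
RESIDUES_PER_TURN = 3.6
DEG_PER_RESIDUE = 360.0 / RESIDUES_PER_TURN


def angular_offset(step):
    """Signed angle in degrees, within (-180, 180], between side chains i and i+step."""
    angle = (step * DEG_PER_RESIDUE) % 360.0
    return angle - 360.0 if angle > 180.0 else angle


def packing_angle(ridge_a_deg, ridge_b_deg):
    """Inter-helix angle when a ridge of one helix sits in the groove of the other."""
    return ridge_a_deg + ridge_b_deg


# Offsets for the three ridge types named on the slide: i+1, i+3 and i+4.
offsets = {step: angular_offset(step) for step in (1, 3, 4)}

# The ridge type whose residues are most closely aligned around the axis.
best_aligned = min(offsets, key=lambda step: abs(offsets[step]))

# Ridge angle of 26 degrees is the value quoted on the slide, not derived here.
I4_RIDGE_ANGLE = 26.0
expected_angle = packing_angle(I4_RIDGE_ANGLE, I4_RIDGE_ANGLE)
```

Under these assumptions the i+4 side chains are offset by about 40° around the axis, against about 60° for i+3 and 100° for i+1, which is consistent with the i+4 ridge being the best aligned.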

Helix-turn-helix and 4-helix bundles: Here is a diagram of interleukin-2, human growth hormone, granulocyte-macrophage colony-stimulating factor (GM-CSF) and interleukin-4.

Beta-proteins

Beta-sheet structures: porin

Greek key β-strand motif

Greek key β-strand motif. Structure: gamma-crystallin

α/β fold: Flavodoxin fold, 5(βα) fold

α/β fold: Flavodoxin family - TOPS diagrams (Flores et al., 1994)

Beta-alpha-beta structures

Alpha-beta barrel

Plait motif

αβα 3-layer motifs (2 layers of helices with a β-sheet in between) are often specified as x-y-z (e.g. 4-14-5), where x is the number of helices in the first helical layer, y is the number of strands in the β-sheet, and z is the number of helices in the second helical layer
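The x-y-z notation can be read mechanically. The following is a minimal sketch of a parser; the class and field names are hypothetical and chosen only to mirror the definitions on the slide.

```python
from typing import NamedTuple


class ThreeLayerMotif(NamedTuple):
    """Hypothetical container for an x-y-z three-layer motif description."""
    helices_first_layer: int   # x
    strands_in_sheet: int      # y
    helices_second_layer: int  # z


def parse_three_layer(spec: str) -> ThreeLayerMotif:
    """Parse a string such as '4-14-5' into its three counts."""
    parts = spec.strip().split("-")
    if len(parts) != 3:
        raise ValueError(f"expected three dash-separated counts, got {spec!r}")
    x, y, z = (int(part) for part in parts)
    return ThreeLayerMotif(x, y, z)


# The example from the slide: 4 helices, 14 strands, 5 helices.
motif = parse_three_layer("4-14-5")
```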

For α+β proteins, there are no good classification systems. You can only count…

How many folds – Chothia 1992: The first explicit estimate of the number of protein families was made by Chothia in 1992. At that time about 120 structural families were known. Chothia summarized the results of several genome projects and found that the chance of a random protein belonging to one of the known sequence families is approximately 1/3. According to a sequence comparison of the PDB with sequence databases (Sander, Schneider 1991), about 1/4 of all sequences were similar to one of the PDB entries at the 25% identity level. Assuming an equal distribution of proteins among the families, Chothia concluded that the total number of protein structural families should be about 120 × 3 × 4 = 1440.
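The extrapolation on this slide is a single scaling step. The sketch below simply reproduces that arithmetic with exact fractions; the variable and function names are illustrative, and the two fractions are the rounded values quoted on the slide.

```python
from fractions import Fraction

# Values quoted on the slide.
known_structural_families = 120
frac_in_known_sequence_families = Fraction(1, 3)  # random protein falls in a known sequence family
frac_similar_to_pdb_entry = Fraction(1, 4)        # sequences similar to a PDB entry at 25% identity


def estimate_total_families(known, frac_sequence, frac_structure):
    """Scale the known families up by the fraction of sequence space they cover.

    Assumes, as the slide does, that proteins are spread equally over families.
    """
    return known / (frac_sequence * frac_structure)


estimate = estimate_total_families(
    known_structural_families,
    frac_in_known_sequence_families,
    frac_similar_to_pdb_entry,
)
```

The result depends entirely on the equal-distribution assumption, which the later slides on Zipf's law call into question.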

How many folds – Alexandrov & Go, 1994, updated: The Pfam-2.1 database consists of 101,724 protein domains from SwissProt (Bairoch & R., 1996) release 34, clustered into 13,816 families. There were also 7,694 proteins of 30 or more amino acids in SwissProt-34 that are not present in Pfam and are not similar to other proteins. We have added them to the database, which now contains 109,418 domains in 21,510 families. We have eliminated very similar sequences from the database to make it more homogeneous. The final classification contained 60,601 domains, distributed over 21,510 families. All families were ranked by the number of sequences in each family. The resulting distribution fits Zipf’s law well (http://wwww.bionet.nsc.ru/bgrs/thesis/100/)

How many folds: $n(r) = a\,r^{-b}$, where r is the rank of the family, n(r) is the number of proteins in the r-th family, a is a scaling constant that depends on the number of proteins in the dataset, and b ≈ 0.64. The constant b does not depend on the size of the dataset.

How many folds (cont.) Distribution of protein sequences among protein families. The distribution is clearly far from uniform. Its shape is described very well by Zipf’s law: $n(r) = a\,r^{-b}$, with a = 640 and b = 0.64. The correlation coefficient of this approximation is 0.992.
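A power law of this form becomes a straight line on log-log axes, so a and b can be recovered by ordinary least squares on log n(r) against log r. The sketch below shows that procedure on synthetic data generated from the slide's parameters; whether the original authors fitted the curve this way is an assumption.

```python
import math


def zipf_size(rank, a=640.0, b=0.64):
    """Family size predicted by n(r) = a * r**(-b), using the slide's a and b as defaults."""
    return a * rank ** (-b)


def fit_zipf(sizes):
    """Estimate (a, b) from family sizes listed in order of rank (largest first).

    Fits log n = log a - b * log r by least squares.
    """
    xs = [math.log(rank) for rank in range(1, len(sizes) + 1)]
    ys = [math.log(size) for size in sizes]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic, noise-free family sizes; real data would scatter around the line.
synthetic_sizes = [zipf_size(rank) for rank in range(1, 1001)]
a_fit, b_fit = fit_zipf(synthetic_sizes)
```

On real family counts the fit would not be exact, and the correlation coefficient of the log-log regression is the natural goodness-of-fit measure.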

Fold number according to Alexandrov & Go 60,000 protein sequence families in 14,000 different folds

Fold number according to Alexandrov & Go: An important feature of Zipf’s distribution is that it has a very long tail of clusters with only a few members each. For example, if b = 0.7, half of all proteins are found in 10% of all clusters.
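The "half of all proteins in 10% of clusters" statement can be checked numerically. The sketch below sums Zipf-distributed cluster sizes directly and also evaluates the continuum approximation, in which the share held by the top fraction f of clusters is f^(1-b). The cluster count of 21,510 is borrowed from the earlier slide purely for illustration.

```python
def share_in_top_clusters(n_clusters, b, top_fraction=0.10):
    """Fraction of all proteins held by the largest clusters under n(r) ~ r**(-b).

    The scaling constant a cancels in the ratio, so it is omitted.
    """
    sizes = [rank ** (-b) for rank in range(1, n_clusters + 1)]
    n_top = int(n_clusters * top_fraction)
    return sum(sizes[:n_top]) / sum(sizes)


def continuum_share(b, top_fraction=0.10):
    """Same quantity when the sums are replaced by integrals (valid for b < 1)."""
    return top_fraction ** (1.0 - b)


discrete_share = share_in_top_clusters(21510, 0.7)
approx_share = continuum_share(0.7)
```

Both values come out close to one half, in line with the slide; the discrete sum is slightly lower than the continuum estimate for a finite number of clusters.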

General fold classification systems The definitions of four broad structural classes, all-α, all-β, α/β, and α+β, based on secondary structure compositions and β-sheet topologies [Levitt & Chothia, 1976] represented the first step towards a global characterization of the protein fold space. These definitions have been generally accepted and are being used by many classification systems to organize the fold hierarchy [Murzin et al., 1995; Orengo et al., 1997]. However, there is a need for methods to represent the full range of structural relationships among folds for a better understanding of the organizing principles and features of the protein fold space.

General fold classification systems (cont.) The fold family trees such as those built by Efimov [1997], Zhang and Kim [2000] and Taylor [2002] are very informative, but the construction of such trees involves extensive manual operations and, sometimes, considerable human judgment. An alternative approach is to apply a uniform measure of structural similarity across all fold types and map the structural relationships into a low-dimensional space. Two such maps have been introduced: one is represented in the CATH database by Orengo and colleagues [1997] and the other in the DALI database by Holm and Sander [1993]. Although the two maps are based on different structural alignment algorithms and multivariate analysis methods, they give similar two-dimensional projections featuring three large clusters corresponding to α, β, and α/β folds, respectively.

General fold classification system references
Levitt, M. and C. Chothia, Structural patterns in globular proteins. Nature, 1976. 261(5561): p. 552-8.
Murzin, A.G., et al., SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol, 1995. 247(4): p. 536-40.
Orengo, C.A., et al., CATH--a hierarchic classification of protein domain structures. Structure, 1997. 5(8): p. 1093-108.
Taylor, W.R., A 'periodic table' for protein structures. Nature, 2002. 416(6881): p. 657-60.
Orengo, C.A., et al., Identification and classification of protein fold families. Protein Eng, 1993. 6(5): p. 485-500.

General fold classification system references (cont.)
Efimov, A.V., Structural trees for protein superfamilies. Proteins, 1997. 28(2): p. 241-60.
Zhang, C. and S.H. Kim, A comprehensive analysis of the Greek key motifs in protein beta-barrels and beta-sandwiches. Proteins, 2000. 40(3): p. 409-19.
Holm, L. and C. Sander, Protein structure comparison by alignment of distance matrices. J Mol Biol, 1993. 233(1): p. 123-38.

Fold distribution: A metric matrix distance geometry method was applied to all pair-wise “distances” (structural dissimilarities) to assign three-dimensional coordinates to a set of 498 SCOP folds, such that the distance between two folds is inversely correlated with their DALI alignment score. The results of the mapping are shown in the figure on the left.

The first 20 eigenvalues of the metric matrix calculated from the 498 × 498 DALI structural alignment scores.

Plotting the first 3 eigenvectors, i.e., the eigenvectors corresponding to the three largest eigenvalues. Again, notice the segregation of the four main structural classes.

The same as the preceding slide, but from another angle…
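The fold map on the preceding slides rests on a standard construction: turn pairwise dissimilarities into a metric (Gram) matrix, take its leading eigenvectors, and use them as coordinates. The sketch below is a generic, pure-Python version of that idea (classical multidimensional scaling with power iteration), not the authors' implementation. How DALI scores were converted into dissimilarities is not stated on the slides, so the input here is a toy distance matrix.

```python
import math
import random


def metric_matrix(dist):
    """Double-centred matrix B = -1/2 * J * D^2 * J from a symmetric distance matrix."""
    n = len(dist)
    squared = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in squared]
    grand_mean = sum(row_mean) / n
    return [[-0.5 * (squared[i][j] - row_mean[i] - row_mean[j] + grand_mean)
             for j in range(n)] for i in range(n)]


def top_eigenpairs(matrix, k, iterations=500, seed=0):
    """Largest k eigenpairs of a symmetric matrix by power iteration with deflation."""
    rng = random.Random(seed)
    n = len(matrix)
    work = [row[:] for row in matrix]
    pairs = []
    for _ in range(k):
        vec = [rng.random() for _ in range(n)]
        for _ in range(iterations):
            nxt = [sum(work[i][j] * vec[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in nxt))
            if norm == 0.0:
                break
            vec = [x / norm for x in nxt]
        # Rayleigh quotient gives the eigenvalue for the converged unit vector.
        value = sum(vec[i] * sum(work[i][j] * vec[j] for j in range(n)) for i in range(n))
        pairs.append((value, vec))
        # Deflate: remove the component just found before searching for the next one.
        for i in range(n):
            for j in range(n):
                work[i][j] -= value * vec[i] * vec[j]
    return pairs


def embed(dist, dims=3):
    """Coordinates whose pairwise distances approximate the input distances."""
    pairs = top_eigenpairs(metric_matrix(dist), dims)
    return [[math.sqrt(max(value, 0.0)) * vec[i] for value, vec in pairs]
            for i in range(len(dist))]


# Toy input: four points on a 3 x 1 rectangle, described only by their distances.
toy_points = [(0.0, 0.0), (3.0, 0.0), (3.0, 1.0), (0.0, 1.0)]
toy_dist = [[math.dist(p, q) for q in toy_points] for p in toy_points]
coords = embed(toy_dist, dims=2)
```

For the 498 × 498 fold matrix a numerical library would be the practical choice; power iteration is used here only to keep the sketch dependency-free. The decay of the eigenvalues indicates how many dimensions are needed before the embedding stops improving.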

Comparing fold usage between two species in the eubacterial domain (Chlamydia versus Aquifex, A) and between species of two different domains (Chlamydia of bacteria versus Halobacterium of archaea, B). The usages of the 498 folds by the second organism are subtracted from the fold usages by the first organism. A contour surface (mesh) is then constructed and set at the values of 0.4% for blue and –0.4% for red. Regions within the blue contour include folds that appear more frequently in the first organism, whereas regions within the red contour include folds that occur more frequently in the second organism.
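The subtraction step described here can be sketched per fold. The example below is a simplification: it applies the ±0.4% threshold to individual folds rather than to a contour surface in the three-dimensional fold map, and the fold names and counts are invented for illustration.

```python
def usage_percent(fold_counts):
    """Convert raw counts of proteins per fold into percentages of the proteome."""
    total = sum(fold_counts.values())
    return {fold: 100.0 * count / total for fold, count in fold_counts.items()}


def usage_difference(first, second):
    """Fold usage of the first organism minus that of the second, in percentage points."""
    p_first, p_second = usage_percent(first), usage_percent(second)
    folds = sorted(set(p_first) | set(p_second))
    return {fold: p_first.get(fold, 0.0) - p_second.get(fold, 0.0) for fold in folds}


def label_folds(differences, threshold=0.4):
    """Tag each fold as enriched in the first organism, the second, or neither."""
    labels = {}
    for fold, diff in differences.items():
        if diff >= threshold:
            labels[fold] = "first"    # would fall inside the blue contour
        elif diff <= -threshold:
            labels[fold] = "second"   # would fall inside the red contour
        else:
            labels[fold] = "similar"
    return labels


# Hypothetical counts, not taken from any genome.
organism_a = {"fold_A": 50, "fold_B": 30, "fold_C": 20}
organism_b = {"fold_A": 40, "fold_B": 30, "fold_C": 25, "fold_D": 5}
diff = usage_difference(organism_a, organism_b)
labels = label_folds(diff)
```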

CATH database: Class, Architecture, Topology, Homologous superfamily

CATH database

Structural Classification of Proteins (SCOP) database
• All alpha proteins
• All beta proteins
• Alpha and beta proteins (a/b) - Mainly parallel beta sheets (beta-alpha-beta units)
• Alpha and beta proteins (a+b) - Mainly antiparallel beta sheets (segregated alpha and beta regions)
• Multi-domain proteins (alpha and beta) - Folds consisting of two or more domains belonging to different classes
• Membrane and cell surface proteins and peptides - Does not include proteins in the immune system

Structural Classification of Proteins (SCOP) database (cont.)
• Small proteins - Usually dominated by metal ligand, heme, and/or disulfide bridges
• Coiled coil proteins - Not a true class
• Low resolution structures - Not a true class
• Peptides - Peptides and fragments. Not a true class
• Designed proteins - Experimental structures of proteins with essentially non-natural sequences. Not a true class

SCOP • Gold standard of protein classification • In essence, the work of a single man (Alexei Murzin) • The classification has been constructed manually by visual inspection and comparison of structures, but with the assistance of tools to make the task manageable and help provide generality.

SCOP The different major levels in the hierarchy are: • Family: Clear evolutionary relationship. Proteins clustered together into families are clearly evolutionarily related. Generally, this means that pairwise residue identities between the proteins are 30% and greater. However, in some cases similar functions and structures provide definitive evidence of common descent in the absence of high sequence identity; for example, many globins form a family though some members have sequence identities of only 15%.
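The 30% guideline refers to pairwise residue identity, which is computed from an alignment. The sketch below shows one common convention, counting identical residues over columns where neither sequence has a gap; that choice of denominator is an assumption, the sequences are invented, and SCOP family assignment itself is a manual judgement rather than a threshold test.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of identical residues over ungapped columns of a pairwise alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != gap and y != gap]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)


# Guideline value from the slide; exceptions such as the globins fall below it.
FAMILY_IDENTITY_GUIDELINE = 30.0

# Hypothetical aligned fragments: 9 ungapped columns, 7 of them identical.
identity = percent_identity("ACDEFGHIKL", "ACDQFG-IKV")
meets_guideline = identity >= FAMILY_IDENTITY_GUIDELINE
```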

SCOP The different major levels in the hierarchy are: • Superfamily: Probable common evolutionary origin. Proteins that have low sequence identities, but whose structural and functional features suggest that a common evolutionary origin is probable, are placed together in superfamilies. For example, actin, the ATPase domain of the heat shock protein, and hexokinase together form a superfamily.

SCOP The different major levels in the hierarchy are: • Fold: Major structural similarity. Proteins are defined as having a common fold if they have the same major secondary structures in the same arrangement and with the same topological connections. Different proteins with the same fold often have peripheral elements of secondary structure and turn regions that differ in size and conformation. In some cases, these differing peripheral regions may comprise half the structure. Proteins placed together in the same fold category may not have a common evolutionary origin: the structural similarities could arise just from the physics and chemistry of proteins favouring certain packing arrangements and chain topologies.

DALI database • Based upon the DALI method for structural superpositioning. The programme optimises the overlay of distance plots (see next slide) • Fully automatic • Database contains clusters of protein families (e.g. a giant PDB structures tree) and structural alignments • Database is consistent, but grouping is not done manually by experts

DALI database: Contact Maps. Fig (c): contact map of ROP (lower triangle) and 256B (upper triangle). Fig (d): ‘collapsed’ ROP (lower triangle) and difference contact plot (upper triangle).
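The distance plots and contact maps that DALI compares are derived from intramolecular residue-residue distances. The sketch below builds such a matrix from a list of coordinates and thresholds it into a contact map. The 8 Å cutoff and the toy coordinates are assumptions for illustration, and the sketch does not reproduce DALI's alignment or scoring.

```python
import math


def distance_matrix(coords):
    """All pairwise distances between residue positions (e.g. C-alpha atoms)."""
    return [[math.dist(p, q) for q in coords] for p in coords]


def contact_map(coords, cutoff=8.0):
    """Boolean matrix: True where two residues lie within the cutoff distance."""
    return [[d <= cutoff for d in row] for row in distance_matrix(coords)]


def distance_difference(matrix_a, matrix_b):
    """Element-wise absolute difference between two equally sized distance matrices."""
    return [[abs(a - b) for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(matrix_a, matrix_b)]


# Hypothetical four-residue fragment, residues spaced 3.8 angstroms apart on a line.
toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
toy_contacts = contact_map(toy_coords)
toy_self_difference = distance_difference(distance_matrix(toy_coords),
                                          distance_matrix(toy_coords))
```

Because a distance matrix is symmetric, two structures can share one plot, one in each triangle, which is how the figures on this slide are laid out.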

PROTOMAP database (Linial et al.)
• Number of proteins in the database (May 2000): 365,174 (341,645 after merging identical entries); number of clusters: 18,140; number of singletons: 43,219 (of which 14,384 are satellites of other clusters)
• Provides software to group new protein sequences
• Fully automatic
• Classifies the Swiss-Prot + TrEMBL (translated EMBL) databases

Folds: how many?
• Chothia (1992): approx. 1,000 folds
• Estimates vary from 1,000 to 15,000
• With 30,000 human genes, ≥3 genes per fold on average (but think about alternative splicing)
Chothia, C., Proteins. One thousand families for the molecular biologist. Nature, 1992. 357(6379): p. 543-4.
Zhang, C. and C. DeLisi, Estimating the number of protein folds. J Mol Biol, 1998. 284(5): p. 1301-5.
