Classification: understanding the diversity and principles of protein structure and function MCSG 2001 structures

Protein structure classification • Main reference: Robert B. Russell (2002) Classification of Protein Folds. Molecular Biotechnology 20:17-28. • Importance: central to studies of protein structure, function, and evolution • Philosophy: phyletic vs. phenetic • Method: structure comparison + human knowledge

Philosophy of classification • Phyletic: based on phylogenetic relationship • Phenetic: based on the study of phenomena (phenomenological), i.e. on observed similarity

Classification Unit: Domain, a LEGO piece Ranganathan

From domain to assembly • Domains are shuffled, duplicated and fused to make proteins • On average, a domain is 173 a.a. long, compared to 466 a.a. for a yeast protein • Most natural domain sequences assume one of a few thousand folds, of which ~1000 are already known • There is no satisfactory estimate yet for the number of macromolecular complexes • On average, a yeast complex may consist of 7.5 proteins Sali et al. 2003
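The two averages quoted above imply a rough domain count per protein. A back-of-the-envelope sketch, assuming the whole chain is covered by domains with no linker residues (an assumption made here, not a claim from the slide):

```python
# Averages quoted on the slide (Sali et al. 2003).
AVG_DOMAIN_LEN = 173         # residues per domain
AVG_YEAST_PROTEIN_LEN = 466  # residues per yeast protein


def domains_per_protein(protein_len: float, domain_len: float) -> float:
    """Rough estimate: assumes domains tile the chain with no linkers."""
    return protein_len / domain_len


estimate = domains_per_protein(AVG_YEAST_PROTEIN_LEN, AVG_DOMAIN_LEN)
print(f"~{estimate:.1f} domains per average yeast protein")
```

A ratio of averages is only indicative; the real distribution of domains per protein is skewed.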

Distribution of Protein size Swiss-prot

Structural vs. functional domain

Russian doll: a conceptual problem Singh

Approaches • Hierarchical • Based on the types and arrangements of secondary structures • Unit (level): domain • Domain assignment - structural vs. functional (fold or function in isolation) - automated assignment methods (structure vs. sequence)

A. P. Singh

Assignment of Class • All α or all β (could be subjective) • α/β (βαβ unit) or α+β • Other classes
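One way to see why class assignment can be subjective is to write it as a rule with an explicit cutoff. The sketch below is illustrative only: the input is a per-residue secondary-structure string (H = helix, E = strand, anything else = coil), and both the 10% cutoff and the "a strand-helix-strand run means α/β" rule are assumptions chosen for this example, not the rules used by SCOP or CATH.

```python
from itertools import groupby


def sse_elements(ss: str) -> str:
    """Collapse a per-residue string into ordered elements,
    e.g. 'EEEECCHHHHCCEEE' -> 'EHE'."""
    return "".join(key for key, _ in groupby(ss) if key in "HE")


def assign_class(ss: str, minor_cutoff: float = 0.10) -> str:
    """Heuristic class assignment; minor_cutoff is a hypothetical threshold."""
    if not ss:
        raise ValueError("empty secondary-structure string")
    helix = ss.count("H") / len(ss)
    strand = ss.count("E") / len(ss)
    if helix < minor_cutoff and strand < minor_cutoff:
        return "other"
    if strand < minor_cutoff:
        return "all-alpha"
    if helix < minor_cutoff:
        return "all-beta"
    # Mixed content: an alternating strand-helix-strand run hints at
    # beta-alpha-beta units (alpha/beta); otherwise call it alpha+beta.
    return "alpha/beta" if "EHE" in sse_elements(ss) else "alpha+beta"


print(assign_class("EEEECCHHHHHHCCEEEECCHHHHHHCCEEEE"))
print(assign_class("EEEECCEEEECCEEEECCHHHHHHCCHHHHHH"))
```

Moving `minor_cutoff` up or down changes which proteins count as "all α" or "all β", which is exactly the subjectivity the slide points out.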

Class assignment could be subjective

All-alpha structures

All-beta structures Superoxide dismutase

Alpha/beta structures Open twisted sheet Closed barrel

β-α-β motif (barrel) (sheet)

α/β vs. α+β

Assignment of Fold • Defined by the number, type, and arrangement of SSEs • Connectivity (e.g. circular permutation, scrambled proteins)
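The connectivity point can be illustrated on the sequence order of SSEs alone. In the sketch below, each domain is reduced to a string of elements in chain order (H = helix, E = strand); a circular permutation is then a rotation of that string, and a scrambled protein has the same elements in a different, non-rotated order. This is a deliberately simplified model: it ignores the 3D arrangement that a real fold comparison would also need, and the function name is hypothetical.

```python
def compare_connectivity(a: str, b: str) -> str:
    """Compare two SSE-order strings such as 'EHEHE'."""
    if a == b:
        return "same connectivity"
    # b is a rotation of a exactly when it occurs inside a + a.
    if len(a) == len(b) and b in a + a:
        return "circular permutation"
    # Same element counts, but not related by a rotation.
    if sorted(a) == sorted(b):
        return "scrambled"
    return "different SSE content"


print(compare_connectivity("HEEHE", "EHEHE"))
print(compare_connectivity("HHEEE", "HEHEE"))
```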

Assignment of Superfamily • Homologous even in the absence of significant sequence similarity - certain level of structural similarity - unusual structural features - low but significant sequence similarity from structural alignment - key active site residues - sequence similarity bridges • Divergence vs. convergence

Divergent vs. convergent evolution • Divergent evolution: descent from a common ancestor; sequences become variant due to mutation • Convergent evolution: no common ancestor; structures become similar due to functional or physical constraints

Anti-freeze protein: convergent evolution crystal.biochem.queensu.ca

Homologous fold Ranganathan

Analogous fold Ranganathan

Analogous or homologous? Scallop myosin regulatory domain (C chain) vs. aldehyde oxidoreductase (A chain) [figure: termini labeled N, C and N’, C’]

Assignment of Family • significant sequence similarity
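"Significant sequence similarity" is usually first approximated by percent identity over an alignment. A minimal sketch, assuming the two sequences are already aligned with `-` as the gap character; the 30% cutoff is borrowed from the identity threshold quoted elsewhere in these slides and is used here as an illustrative assumption, not as a definition of a family.

```python
def percent_identity(aln_a: str, aln_b: str) -> float:
    """Percent identity over aligned, non-gap columns."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(aln_a, aln_b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(x == y for x, y in pairs)
    return 100.0 * matches / len(pairs)


FAMILY_CUTOFF = 30.0  # hypothetical threshold, in percent identity

pid = percent_identity("ACDEFGHIK-", "ACDQFGHLKW")
print(f"{pid:.1f}% identity, same family candidate: {pid >= FAMILY_CUTOFF}")
```

Identity alone says nothing about statistical significance for short alignments, so a real pipeline would also consider alignment length or an E-value.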

Classification databases • SCOP - careful assignment of evolutionary relationships; homologous vs. analogous • CATH - Class, Architecture, Topology, Homologous superfamily • FSSP - a list of structural neighbors

CATH Class: SSE composition & packing Architecture: overall shape of domain, ignore SSE connectivity Topology (Fold): consider connectivity Homologous superfamily: a common ancestor Singh
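The four CATH levels are commonly written as a dotted code, one number per level (C.A.T.H). The sketch below parses such a code and reports the deepest level two domains share; the type and function names are hypothetical, and the codes are used only as examples of the dotted format.

```python
from typing import NamedTuple, Optional


class CathNode(NamedTuple):
    class_: int
    architecture: int
    topology: int
    homologous_superfamily: int


LEVELS = ("Class", "Architecture", "Topology", "Homologous superfamily")


def parse_cath(code: str) -> CathNode:
    """Parse a four-part dotted code such as '3.40.50.720'."""
    parts = [int(p) for p in code.split(".")]
    if len(parts) != 4:
        raise ValueError(f"expected 4 levels, got {len(parts)}: {code!r}")
    return CathNode(*parts)


def deepest_shared_level(a: CathNode, b: CathNode) -> Optional[str]:
    """Name of the deepest level at which a and b still agree, or None."""
    shared = None
    for name, x, y in zip(LEVELS, a, b):
        if x != y:
            break
        shared = name
    return shared


print(deepest_shared_level(parse_cath("3.40.50.720"), parse_cath("3.40.50.300")))
```

Two domains that agree down to Topology but differ at the last level share a fold without being assigned a common ancestor.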

Classification databases

Genome-scale structure analysis Curr. Opin. Str. Biol., 2003

genome-scale structure annotation

Some statistics • 80% of sequence families belong to 400 folds (the top 10 folds account for 40% of sequence families) • >60% of genes encode multi-domain proteins (80% for eukaryotes) • ~50,000 protein families and ~150,000 singletons • structural superfamilies ~1800 (+/-50) and ~10,000 unifolds • 50-60% of distant homologs (<25% seq. id.) can be recognized by profile-based sequence comparison methods (e.g. PSI-BLAST, HMMs, etc.) • 50-60% of the enzymes in yeast and E. coli are common, and >80% of pathways are shared

superfolds, superfamilies, supersites • TIM barrel, Rossmann-like, ferredoxin-like, β-propellers, 4-helix bundle, Ig-like, β-jelly rolls, oligonucleotide/oligosaccharide-binding (OB) fold, SH3-like. • Structure -> function (only 50% correct)

Structure implicates function?

Assessing the Progress of Structural Genomics Projects 1 Nov. 2002, Science

Target Tracking by PDB (Sep 2002)

PDB content growth (May 2005)

Some statistics • 11 SG consortia contributed 316 non-redundant PDB entries comprising 459 CATH and 393 SCOP domains. • 14% of the targets have a homolog (>30% sequence identity) solved by another consortium. • 67% of SG domains in CATH are unique vs. 21% of non-SG domains. • 19% and 11% contributed new superfamilies and new folds, respectively. • The structures allow new and reliable homology models for 9287 non-redundant gene sequences in 208 completely sequenced genomes.
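A few derived ratios make these figures easier to compare. The sketch below only recombines numbers quoted on the slide; the variable names are introduced here for illustration.

```python
# Figures quoted on the slide.
pdb_entries = 316
cath_domains = 459
scop_domains = 393
unique_sg = 0.67      # fraction of SG domains in CATH that are unique
unique_non_sg = 0.21  # same fraction for non-SG domains

cath_per_entry = cath_domains / pdb_entries
scop_per_entry = scop_domains / pdb_entries
novelty_ratio = unique_sg / unique_non_sg

print(f"CATH domains per entry: {cath_per_entry:.2f}")
print(f"SCOP domains per entry: {scop_per_entry:.2f}")
print(f"SG domains are unique ~{novelty_ratio:.1f}x as often as non-SG domains")
```

The different per-entry counts for CATH and SCOP reflect that the two databases draw domain boundaries differently for the same structures.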

PSI Structure Statistics 2002-2003 • Unique structures (30% seq. ID): PSI 70%, PDB 10% • New folds: PSI 12%, PDB 3% NIGMS Protein Structure Initiative

Average total cost per structure • PSI pilot phase: 2001 $650 K (7 centers); 2002 $400 K (9 centers); 2003 $240 K; 2004 ?; 2005 $100 K (goal) • PSI-2 production phase: 2006-10 $50 K (goal) • Comparison: ~$250-300 K NIGMS Protein Structure Initiative
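The cost trend can be made explicit with a short calculation. This sketch uses only the three reported pilot-phase figures and the $50 K production-phase goal; year labels are abbreviated to two digits, and the function name is hypothetical.

```python
# Reported average cost per structure, in thousands of USD.
cost_k = {"01": 650, "02": 400, "03": 240}
PSI2_GOAL_K = 50  # production-phase goal quoted on the slide


def yearly_change(costs: dict) -> dict:
    """Percent change relative to the previous reported year."""
    years = list(costs)
    return {
        cur: 100.0 * (costs[cur] - costs[prev]) / costs[prev]
        for prev, cur in zip(years, years[1:])
    }


changes = yearly_change(cost_k)
fold_reduction_to_goal = cost_k["01"] / PSI2_GOAL_K

for year, pct in changes.items():
    print(f"year {year}: {pct:+.1f}% vs. previous year")
print(f"Goal implies a {fold_reduction_to_goal:.0f}-fold reduction from year 01")
```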

PSI Pilot Phase -- Lessons Learned • Structural genomics pipelines can be constructed and scaled-up • High throughput operation works for many proteins • Genomic approach works for structures • Bottlenecks remain for some proteins • A coordinated, 5-year target selection policy must be developed • Homology modeling methods need improvement NIGMS Protein Structure Initiative

PSI-2 Production Phase (2005) • Interacting network for high throughput protein structure determination with three components • Large-scale centers for protein structure production of selected targets • Specialized centers for technology development leading to high throughput structure determination of difficult proteins • Specialized centers for protein structures relevant to disease (other NIH Institutes and Centers) • Included in NIH Structural Biology Roadmap plans NIGMS Protein Structure Initiative

Computational structural genomics

Summary table

Fold occurrence matrix

Common Folds

Unique Folds
