Classification: understanding the diversity and principles of protein structure and function. MCSG 2001 structures. Protein structure classification. Main reference: Robert B. Russell (2002) Classification of Protein Folds. Molecular Biotechnology 20:17-28.
Protein structure classification • Main reference: Robert B. Russell (2002) Classification of Protein Folds. Molecular Biotechnology 20:17-28. • Importance: central to studies of protein structure, function, and evolution • Philosophy: phyletic vs. phenetic • Method: structure comparison + human knowledge
Philosophy of classification • Phyletic: based on phylogenetic (evolutionary) relationship • Phenetic: based on observed similarity (phenomenological), irrespective of ancestry
Classification Unit: Domain, a LEGO piece Ranganathan
From domain to assembly • Domains are shuffled, duplicated, and fused to make proteins • An average domain is 173 a.a. long, compared to 466 a.a. for an average yeast protein • Most natural domain sequences assume one of a few thousand folds, of which ~1000 are already known • There is no satisfactory estimate yet for the number of macromolecular complexes • On average, a yeast complex may consist of 7.5 proteins Sali et al. 2003
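The averages above imply a rough domain count per protein and per complex. The arithmetic can be sketched as follows; it assumes, purely for illustration, that a protein is made of average-sized domains with no linker regions, so the result is a back-of-the-envelope figure, not a number from Sali et al.

```python
# Averages quoted on the slide (Sali et al. 2003).
AVG_DOMAIN_LENGTH = 173          # residues per domain
AVG_YEAST_PROTEIN_LENGTH = 466   # residues per yeast protein
AVG_PROTEINS_PER_COMPLEX = 7.5   # proteins per yeast complex


def domains_per_protein(protein_len=AVG_YEAST_PROTEIN_LENGTH,
                        domain_len=AVG_DOMAIN_LENGTH):
    """Rough domain count, assuming (hypothetically) no linkers or disorder."""
    return protein_len / domain_len


domains = domains_per_protein()
domains_per_complex = domains * AVG_PROTEINS_PER_COMPLEX

print(f"~{domains:.1f} domains per yeast protein")
print(f"~{domains_per_complex:.0f} domains per yeast complex")
```

Under these assumptions an average yeast protein holds between two and three domains, which is consistent with the view of domains as reusable building blocks.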
Distribution of Protein size Swiss-prot
Approaches • Hierarchical • Based on the types and arrangements of secondary structures • Unit (level): domain • Domain assignment - structural vs. functional (fold or function in isolation) - automated assignment methods (structure vs. sequence)
Assignment of Class • All α or all β (could be subjective) • α/β (βαβ unit) or α+β • Other classes
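A minimal sketch of how class assignment could be automated from a per-residue secondary-structure string. The input encoding ('H' for helix, 'E' for strand, anything else for coil), the 10% cutoff, and the use of a β-α-β pattern to separate α/β from α+β are all assumptions made for illustration; they are not the rules used by SCOP or CATH, and a string alone cannot tell whether strands are parallel.

```python
from itertools import groupby


def sse_order(ss):
    """Collapse a per-residue string into its element order, e.g. 'EEE-HHH-EEE' -> 'EHE'."""
    return "".join(kind for kind, _ in groupby(ss) if kind in "HE")


def assign_class(ss, min_fraction=0.10):
    """Crude class assignment; min_fraction is a hypothetical cutoff."""
    if not ss:
        raise ValueError("empty secondary-structure string")
    helix = ss.count("H") / len(ss)
    strand = ss.count("E") / len(ss)
    has_helix = helix >= min_fraction
    has_strand = strand >= min_fraction
    if has_helix and not has_strand:
        return "all-alpha"
    if has_strand and not has_helix:
        return "all-beta"
    if has_helix and has_strand:
        # A strand-helix-strand run is taken as a stand-in for the bab unit.
        return "alpha/beta" if "EHE" in sse_order(ss) else "alpha+beta"
    return "other"


print(assign_class("EEEE-HHHHHH-EEEE-HHHHHH-EEEE"))
print(assign_class("EEEE-EEEE-EEEE-HHHHHH-HHHHHH"))
```

The fixed cutoff is exactly where the subjectivity noted on the slide enters: a domain with a single short helix flips between classes depending on the threshold chosen.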
All-beta structures Superoxide dismutase
Alpha/beta structures Open twisted sheet Closed barrel
β-α-β motif (barrel) (sheet)
Assignment of Fold • Defined by the number, type, and arrangement of SSEs • Connectivity (e.g. circular permutation, scrambled proteins)
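The connectivity point can be made concrete with a small check for circular permutation. This is a sketch under a strong assumption: the two domains have already been superposed, and each secondary-structure element carries a label (hypothetical names such as "b1", "a1") identifying its spatially equivalent partner. Each list gives the elements in chain order.

```python
def rotations(elements):
    """All cyclic rotations of a list of SSE labels."""
    return [elements[k:] + elements[:k] for k in range(len(elements))]


def is_circular_permutation(a, b):
    """True if b has the same elements as a in the same cyclic order but a different start."""
    a, b = list(a), list(b)
    return len(a) == len(b) and a != b and b in rotations(a)


# Hypothetical labels: same SSEs in space, different order along the chain.
reference = ["b1", "a1", "b2", "a2", "b3"]
permuted = ["b2", "a2", "b3", "b1", "a1"]    # termini moved, cyclic order kept
scrambled = ["b1", "b2", "a1", "a2", "b3"]   # order changed, not a rotation

print(is_circular_permutation(reference, permuted))
print(is_circular_permutation(reference, scrambled))
```

A circularly permuted pair shares the number, type, and spatial arrangement of elements, so whether it counts as the same fold depends on how strictly connectivity is weighed.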
Assignment of Superfamily • Homologous even in the absence of significant sequence similarity - certain level of structural similarity - unusual structural features - low but significant sequence similarity from structural alignment - key active site residues - sequence similarity bridges • Divergence vs. convergence
Divergent vs. convergent evolution • Divergent evolution: descent from a common ancestor; the proteins diverge through mutation • Convergent evolution: no common ancestor; the proteins become similar because of functional or physical constraints
Anti-freeze protein: convergent evolution crystal.biochem.queensu.ca
Homologous fold Ranganathan
Analogous fold Ranganathan
Analogous or homologous? Scallop Myosin Regulatory Domain, C chain, vs. Aldehyde Oxidoreductase, A chain (figure: chain termini labeled N, C and N’, C’)
Assignment of Family • significant sequence similarity
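A minimal sketch of the sequence-similarity test behind family assignment: percent identity over a pairwise alignment. The inputs are assumed to be two already-aligned strings with '-' for gaps. Counting only columns where both sequences have a residue is one of several conventions, and the 30% cutoff is an assumption for illustration, not a threshold stated on this slide.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over columns where both aligned sequences have a residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b)
             if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(x == y for x, y in pairs)
    return 100.0 * matches / len(pairs)


def same_family(aligned_a, aligned_b, threshold=30.0):
    """Family-level call using a hypothetical identity cutoff."""
    return percent_identity(aligned_a, aligned_b) >= threshold


print(percent_identity("ACDEFGHIKL", "ACDEFGHIKV"))
print(same_family("ACDEFGHIKL", "AWWWWWWWWW"))
```

Identity alone ignores alignment length and statistical significance, so a real pipeline would also require a minimum aligned length or an E-value.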
Classification databases • SCOP - careful assignment of evolutionary relationships; homologous vs. analogous • CATH - A:architecture • FSSP - a list of structural neighbors
CATH • Class: SSE composition & packing • Architecture: overall shape of domain, ignore SSE connectivity • Topology (Fold): consider connectivity • Homologous superfamily: a common ancestor Singh
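The four levels map naturally onto a dotted numeric code, one number per level. The sketch below parses such codes and reports the deepest level two domains share; the class name `CathCode`, the helper `deepest_shared_level`, and the example code values are illustrative assumptions, not part of any CATH software.

```python
from typing import NamedTuple, Optional

LEVEL_NAMES = ("class", "architecture", "topology", "homologous superfamily")


class CathCode(NamedTuple):
    class_: int
    architecture: int
    topology: int
    homologous_superfamily: int

    @classmethod
    def parse(cls, code: str) -> "CathCode":
        """Parse a four-part dotted code such as '3.40.50.720'."""
        parts = code.split(".")
        if len(parts) != 4:
            raise ValueError(f"expected 4 dot-separated numbers, got {code!r}")
        return cls(*(int(p) for p in parts))

    def prefix(self, level: int) -> str:
        """Code truncated to the first `level` levels, e.g. prefix(2) -> '3.40'."""
        return ".".join(str(v) for v in self[:level])


def deepest_shared_level(a: CathCode, b: CathCode) -> Optional[str]:
    """Name of the deepest level at which two codes still agree, or None."""
    shared = None
    for name, x, y in zip(LEVEL_NAMES, a, b):
        if x != y:
            break
        shared = name
    return shared


first = CathCode.parse("3.40.50.720")    # illustrative code values
second = CathCode.parse("3.40.50.300")
print(first.prefix(2))
print(deepest_shared_level(first, second))
```

Two domains that agree through Topology but not Homologous superfamily share a fold without evidence of common ancestry, which is the analogous-versus-homologous distinction drawn earlier.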
Genome-scale structure analysis Curr. Opin. Str. Biol., 2003
Some statistics • 80% of sequence families belong to 400 folds (the top 10 folds account for 40% of sequence families) • >60% of genes encode multi-domain proteins (80% for eukaryotes) • ~50,000 protein families and ~150,000 singletons • structural superfamilies ~1800 (+/-50) and ~10,000 unifolds • 50-60% of distant homologs (<25% seq. id.) can be recognized by profile-based sequence comparison methods (e.g. PSI-BLAST, HMMs) • 50-60% of the enzymes in yeast and E. coli are common, and >80% of pathways are shared
Superfolds, superfamilies, supersites • TIM barrel, Rossmann-like, ferredoxin-like, β-propellers, 4-helix bundle, Ig-like, β-jelly rolls, oligonucleotide/oligosaccharide binding (OB) fold, SH3-like • Structure -> function (only 50% correct)
Assessing the Progress of Structural Genomics Projects 1 Nov. 2002, Science
Some statistics • 11 SG consortia contributed 316 non-redundant PDB entries comprising 459 CATH and 393 SCOP domains • 14% of the targets have a homolog (>30% sequence identity) solved by another consortium • 67% of SG domains in CATH are unique vs. 21% of non-SG domains • 19% and 11% contributed new superfamilies and new folds, respectively • The structures allow new and reliable homology models for 9287 non-redundant gene sequences in 208 completely sequenced genomes
PSI Structure Statistics 2002-2003 • Unique structures (30% seq. ID): PSI 70%, PDB 10% • New folds: PSI 12%, PDB 3% NIGMS Protein Structure Initiative
Average total cost per structure • PSI Pilot phase: 2001 $650 K (7 centers); 2002 $400 K (9 centers); 2003 $240 K; 2004 ?; 2005 $100 K (goal) • PSI-2 Production phase: 2006-10 $50 K (goal) • Comparison: ~$250-300 K NIGMS Protein Structure Initiative
PSI Pilot Phase -- Lessons Learned • Structural genomics pipelines can be constructed and scaled-up • High throughput operation works for many proteins • Genomic approach works for structures • Bottlenecks remain for some proteins • A coordinated, 5-year target selection policy must be developed • Homology modeling methods need improvement NIGMS Protein Structure Initiative
PSI-2 Production Phase (2005) • Interacting network for high throughput protein structure determination with three components • Large-scale centers for protein structure production of selected targets • Specialized centers for technology development leading to high throughput structure determination of difficult proteins • Specialized centers for protein structures relevant to disease (other NIH Institutes and Centers) • Included in NIH Structural Biology Roadmap plans NIGMS Protein Structure Initiative