Comprehensive strategy for integrated target selection in structural genomics
Burkhard Rost, CUBIC, Columbia University
http://cubic.bioc.columbia.edu/mis/talks/
http://cubic.bioc.columbia.edu
Comprehensive strategy for integrated target selection
• Our research goal and current reality
• Unit: sequence-structure families. Goal: cover entire families with good models
• STAGE 1: CHOP + CLUP + filtering -> novel automatic organization of sequence-structure space
• STAGE 2: Refined, manual selection -> model all family members? stop-work/hold-work?
• STAGE 3: Explore experimental structure
• Answers and perspectives
• How many structures needed for completion?
• Euka-proka-archae: overlap?
• Why collaborate on targets?
• Multiplexing helpful?
• High-throughput protein production in eukaryotes?
Sequence-structure family
Figure: sequence-structure families U and U’
EVA: comparative modelling
Figure: cumulative distribution of accuracy and coverage (PSI-BLAST 10^-3)
Marc Marti-Renom & Andrej Sali (UCSF)
http://eva.compbio.ucsf.edu/~eva/cm/
http://cubic.bioc.columbia.edu/eva
V Eyrich, MA Marti-Renom, D Przybylski, A Fiser, F Pazos, A Valencia, A Sali & B Rost (2001) Bioinformatics 17, 1242-1243
MA Marti-Renom, MS Madhusudhan, A Fiser, B Rost, A Sali (2002) Structure 10, 435-440
How to decide when we exclude/include?
C Sander & R Schneider 1991 Proteins, 9, 56-68
B Rost 1999 Prot Engng, 12, 85-94
Scooping families from proteomes, in practice
Problems:
• domains
• overlaps
Choose targets: single-linkage clustering
~100,000 eukaryotic proteins (yeast, fly, worm, weed, human)
22,112 clusters
46,318 in largest cluster
NONSENSE!
Conclusions:
• NO clustering of full-length proteins
• have to chop into structural-domain-like fragments (single-linkage DOES work on PrISM)
Liu, Hegyi, Acton, Montelione & Rost 2003 Proteins, in press
Liu & Rost 2003 Proteins, submitted
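Why single-linkage clustering of full-length proteins collapses into one giant cluster can be illustrated with a toy example. This is a minimal sketch, not the CLUP implementation: the protein names, the similarity links, and the function `single_linkage` are all hypothetical. A single two-domain protein (P2) is enough to chain two unrelated single-domain proteins into the same cluster; chopping P2 into domain-like fragments first keeps the two families apart.

```python
from collections import defaultdict


def single_linkage(items, links):
    """Cluster items so that any chain of pairwise links ends up in one cluster.

    Implemented with a simple union-find; `links` is a list of (a, b) pairs
    that are considered similar (hypothetical input, e.g. from sequence search).
    """
    parent = {item: item for item in items}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    clusters = defaultdict(set)
    for item in items:
        clusters[find(item)].add(item)
    # Largest cluster first, ties broken alphabetically.
    return sorted((sorted(c) for c in clusters.values()),
                  key=lambda c: (-len(c), c))


# Toy data (assumption): P1 has domain A, P3 has domain B, P2 has both.
full_length = ["P1", "P2", "P3"]
full_length_links = [("P1", "P2"), ("P2", "P3")]  # P2 matches both neighbours

# The same proteins after chopping into domain-like fragments.
fragments = ["P1_A", "P2_A", "P2_B", "P3_B"]
fragment_links = [("P1_A", "P2_A"), ("P2_B", "P3_B")]

print(single_linkage(full_length, full_length_links))
print(single_linkage(fragments, fragment_links))
```

With full-length proteins everything lands in one cluster; with fragments the A and B families separate, which is the motivation for chopping before clustering.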
CHOP proteins into structural domains Liu & Rost 2003 Proteins, submitted
CHOP: dissection of proteins into domains
Average domain length
• in proteins with ≥ 2 domains: ~100 residues
• in proteins with 1 domain: 1.7-3 times longer
Single-domain proteins:
• 61% in PDB
• 28% in 62 proteomes
Liu, Hegyi, Acton, Montelione & Rost 2003 Proteins, in press
Liu & Rost 2003 Proteins, submitted
To take or not to take
Take a fragment if it has > 50 globular residues and no known 3D structure
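The take/skip rule can be written as a one-line predicate. This is a sketch under stated assumptions: the function name `take_fragment`, the constant name, and the example fragments are hypothetical, and both inputs (number of globular residues, whether a 3D structure is known) are assumed to be computed elsewhere.

```python
MIN_GLOBULAR_RESIDUES = 50  # threshold stated on the slide


def take_fragment(globular_residues: int, has_known_3d: bool) -> bool:
    """Return True if the fragment passes the slide's rule:
    more than 50 globular residues and no known 3D structure."""
    return globular_residues > MIN_GLOBULAR_RESIDUES and not has_known_3d


# Hypothetical fragments: (name, globular residues, known 3D structure?)
candidates = [
    ("frag_a", 120, False),  # long enough, no structure: take
    ("frag_b", 50, False),   # exactly 50 is not > 50: skip
    ("frag_c", 200, True),   # structure already known: skip
]
selected = [name for name, n, known in candidates if take_fragment(n, known)]
print(selected)
```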
Structural residue coverage in reality (any)
Figure labels: 53% of residues to do!; ~28%; ~19%
J Liu & B Rost 2002 Bioinformatics, 18, 922-933
If you believe 53% is pessimistic ... 53% residue coverage today based on E-value 1!!
Clustering after CHOP: 21,000 fragment clusters
Jinfeng Liu
• 103,796 eukaryotic proteins (Yeast, Fly, Worm, Arabidopsis, Human/30)
• 247,222 domain-like fragments
• 167,717 no PDB (E-value 10^-1, HSSP-distance -3)
• 44,718 not good for us (membrane, coil, SEG, NORS, signal peptide)
• 122,999 to go
• 95,330 non-singleton
Liu, Montelione & Rost 2003 Proteins, in press
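The fragment counts on this slide form a simple filtering funnel, and the arithmetic can be checked directly. The sketch below assumes that the 44,718 unsuitable fragments are a subset of the 167,717 fragments without a PDB match (the numbers are consistent with that reading); variable and function names are illustrative.

```python
# Counts copied from the slide; names are illustrative.
fragments_total = 247_222       # domain-like fragments after CHOP
fragments_no_pdb = 167_717      # no PDB match at the slide's thresholds
fragments_unsuitable = 44_718   # membrane, coil, SEG, NORS, signal peptide
fragments_to_go = 122_999       # candidates that remain


def remaining_after_filter(no_pdb: int, unsuitable: int) -> int:
    """Fragments left after removing unsuitable ones from those without a PDB match."""
    return no_pdb - unsuitable


remaining = remaining_after_filter(fragments_no_pdb, fragments_unsuitable)
assert remaining == fragments_to_go  # 167,717 - 44,718 = 122,999

share_of_all = remaining / fragments_total
print(f"{remaining:,} fragments remain ({share_of_all:.1%} of all fragments)")
```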
Main goal of Stage 2 analysis
Diana Murray, Cornell
• Refine Stage 1 automatic target selection through manual sequence analysis
• Concept: USE comparative modeling and structural features directly for refined target selection
• For each sequence-structure family from Stage 1: predict the minimal set of experimental structures needed to model the entire family with high quality.
Refinement protocol for new 3D structures
Target re-prioritization based on weekly PDB updates
Diana Murray, Cornell
Input: PDB + NESG cluster
Toolbox:
1. Fold recognition and sequence-to-structure profiles
2. Comparative modeling (PrISM, Nest)
3. Structure evaluation tools (e.g. Verify3D)
4. Calculate biophysical properties
Recommend to do an additional structure if:
1) NESG-cluster members poorly modeled
2) Biophysical properties of models incompatible with known function
3) Models suggest novel functionality
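The three recommendation criteria combine with a logical OR: any one of them is enough to recommend another structure. A minimal sketch follows, assuming the three judgements have already been made by the Stage 2 analysis; the class `FamilyAssessment`, its field names, and both functions are hypothetical and not part of any NESG tool.

```python
from dataclasses import dataclass


@dataclass
class FamilyAssessment:
    """Outcome of the (manual) Stage 2 analysis for one NESG cluster.

    All fields are assumptions made for illustration.
    """
    poorly_modeled_members: int              # members without a good model
    properties_conflict_with_function: bool  # biophysics vs. known function
    models_suggest_novel_function: bool


def reasons_for_additional_structure(a: FamilyAssessment) -> list[str]:
    """List which of the slide's three criteria are met."""
    reasons = []
    if a.poorly_modeled_members > 0:
        reasons.append("cluster members poorly modeled")
    if a.properties_conflict_with_function:
        reasons.append("biophysical properties incompatible with known function")
    if a.models_suggest_novel_function:
        reasons.append("models suggest novel functionality")
    return reasons


def recommend_additional_structure(a: FamilyAssessment) -> bool:
    """Recommend another structure if at least one criterion is met."""
    return bool(reasons_for_additional_structure(a))


# Hypothetical cases: a fully covered family and one with a novel-function hint.
covered = FamilyAssessment(0, False, False)
novel = FamilyAssessment(0, False, True)
print(recommend_additional_structure(covered), recommend_additional_structure(novel))
```

Returning the list of reasons, rather than only a boolean, keeps the recommendation traceable when targets are re-prioritized after each PDB update.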
Example of stop-work recommendation
Diana Murray, Cornell
Target status (SPINE/ZebaView):
IR21: solved, PDB: 1MOS
ET28: Purified
JR15: Expressed
TT777: Expressed
GR7: Expressed
AR12: Cloned
WR204: Selected
XR4: Expressed
Experimental structure of IR21 yielded high-quality models for all members of this NESG sequence/structure family -> stop work
Two structures required to cover family: predicted by Stage 2 analysis and verified by Stage 3 analysis
Diana Murray, Cornell
NESG family: HR291 (99% identical to 1P9O), AR1731, HR2295, KR12, DR11 breaks into two clusters: A = (HR291, AR1731, HR2295) and B = (KR12, DR11)
Recommendation: Solve structure of KR12 (purified)
Model suggests novel function: 30S ribosomal protein S27
Diana Murray, Cornell
NESG ID: GR2; PDB ID: 1QXF; Archaeoglobus fulgidus S27e
Figure panels: archaeal structure; model for human homologue
The S27e protein family has only archaeal and eukaryotic members. Archaea and eukaryotes share a conserved hydrophobic motif (yellow). Only eukaryotes have an N-terminal extension, and their models have strikingly different electrostatic properties.
Human protein recommended for structure determination!
Summary Stage 2 refinement
Diana Murray, Cornell
• Statistics:
• Many families currently under investigation
Hold-work recommendation:
• family member at advanced experimental stage
• predicted to yield good models for entire family
-> hold-work for members at early experimental stages; re-assess once structure done!
Exploit structure to speculate about function
Sharon Goldsmith & Barry Honig
• 43 with no previous annotation about function (defined by ‘no publication in biological journal’); 39 analyzed
• 31 result in some predictions about function
  • 8 clear success: functional annotation achieved, e.g. predicted active site based on structure; typically: confirmation of annotation transfer
  • 23 some hints (16 ‘hypothetical proteins’), e.g. some clue about active site; mostly completely new!
• 8 no clue
Answers
• How many structures needed for completion?
• Euka-proka-archae: overlap?
• Why collaborate on targets?
• Multiplexing helpful?
• High-throughput protein production in eukaryotes?
How many targets for prokaryotes + archaea?
16,000 min
8,000 give:
• 72% fragments
• 72% proteins
• 67% residues
How many targets for euka-proka-archae?
8,000
8,000 give:
• 67% fragments
• 67% proteins
• 59% residues
BUT: 50% of residues remaining
Overlap between euka-proka-archae?
• ~60% of fragments from eukaryotes have no sequence-structure family member from prokaryotes or archaea
• much higher for ‘largest 8,000’:
  • 2,690 (34%) proka+archae only
  • 4,277 (53%) euka only
  • 1,033 (13%) mix
• surprisingly small overlap overall
• even lower for largest families
• most big families are eukaryotic!
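The percentages for the ‘largest 8,000’ families follow directly from the three counts, which sum to exactly 8,000. A short check, with illustrative names only:

```python
# Counts for the 'largest 8,000' families, copied from the slide.
largest_families = {
    "proka+archae only": 2_690,
    "euka only": 4_277,
    "mix": 1_033,
}


def percentages(counts: dict[str, int]) -> dict[str, int]:
    """Share of each category in percent, rounded to the nearest integer."""
    total = sum(counts.values())
    return {name: round(100 * n / total) for name, n in counts.items()}


total_families = sum(largest_families.values())
shares = percentages(largest_families)
print(total_families, shares)
```

Only the ‘mix’ category contains families shared across kingdoms, which is the small overlap the slide points out.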
Why collaborate on target list?
32% overlap
Competition between consortia has already hampered success rate considerably!
Does multiplexing help?
Date: 2003-07-28
~4%
Multiplex DOUBLES success rate!
Integrated strategy
• NESG unique, comprehensive, integrated strategy optimized to organize sequence space in structural terms:
  • Stage 1: CHOP+CLUP+filter yields high success in focusing on sequence-structure families
  • Stage 2: detailed refinement embeds comparative models into selection and optimizes structural coverage for family
  • Stage 3: use experimental structure to increase structural family coverage and to allow functional exploitation
• Needed to do ‘em all:
  • ~38,000 non-singletons
  • 8,000 largest -> 50% of the residues that remain!
• Genomics: Surprises + our structural perspective changed the ‘world’! The revolutions continue ...
Thanksgiving
Data: Jinfeng Liu (CUBIC), Hedi Hegyi & Phil Carter (CUBIC), Marc Marti-Renom (UCSF)
NESG: Guy Montelione (Rutgers), Barry Honig (Columbia), Diana Murray (Cornell, NYC), Tom Acton (Rutgers), Liang Tong & John Hunt (Columbia), George DeTitta (Buffalo), Cheryl Arrowsmith (Toronto), Wayne Hendrickson (Columbia)
EVA: Andrej Sali & Marc Marti-Renom (UCSF), Alfonso Valencia (Madrid), Volker Eyrich, Ingrid Koh & Dariusz Przybylski (CUBIC)
$$: NIH/NSF