Selecting Targets which Probe Family and Function Space
• How many protein families can we identify in the genomes with/without structures?
• Which families should we target to maximise the structural coverage of the genomes?
• Can we optimise function coverage?
CATH, Gene3D
NIH Funded Midwest Consortium
James Bray, David Lee, Russell Marsden, Annabel Todd, Janet Thornton, Andrzej Joachimiak
MCSG Site Visit, Argonne, January 30, 2003
Identify protein families in the genomes [Figure: protein families]
Identify domain families and consider domain compositions of the protein families [Figure: domain families]
Identify structurally characterised domain families [Figure: domain family of known structure]
Protein Families in Complete Genomes with Structural/Functional Annotations
Gene3D: Buchan, Thornton, Orengo, Genome Research (2002), NAR (2002)
650,000 protein sequences from 120 completed genomes
• 14 eukaryotic genomes including human, mouse, fly, worm
• 92 bacterial genomes
• 14 archaeal genomes
Currently being updated with 30 more complete genomes
Clustering Sequences into Protein Families of Known Domain Composition
PFscape - Protein Family Landscape
• BLAST all the sequences from 120 completed genomes against each other and cluster into protein families
• For each protein family identify domain composition (by mapping CATH and Pfam domains)
TRIBE-MCL - Markov Clustering: Enright & Ouzounis, Genome Research, 2002
SAM-T99 - sequence mapping of CATH & Pfam: Karplus et al., NAR, 2000
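The clustering step can be illustrated with a minimal Markov clustering sketch. This is not TRIBE-MCL itself: it is a toy pure-Python version of the expansion/inflation idea, and the function names, the edge weights (standing in for BLAST-derived similarity scores), and the parameter defaults are all assumptions made for illustration.

```python
def normalise_columns(m):
    """Scale each column so that it sums to 1 (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]


def markov_cluster(n, edges, inflation=2.0, iterations=30):
    """Toy Markov clustering of n sequences.

    edges: (i, j, weight) triples; the weights are hypothetical similarity
    scores, not real BLAST output.
    """
    # Similarity matrix with self-loops, so isolated sequences stay valid.
    m = [[0.0] * n for _ in range(n)]
    for i in range(n):
        m[i][i] = 1.0
    for a, b, w in edges:
        m[a][b] = m[b][a] = w
    m = normalise_columns(m)

    for _ in range(iterations):
        # Expansion: square the matrix (two-step random walks).
        m = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
        # Inflation: raise entries to a power, then renormalise columns.
        m = normalise_columns([[x ** inflation for x in row] for row in m])

    # Each non-empty row lists the members attracted to that row's node.
    clusters = set()
    for row in m:
        members = frozenset(j for j, x in enumerate(row) if x > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


# Two groups of three mutually similar sequences plus one singleton.
toy_edges = [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
             (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0)]
print(markov_cluster(7, toy_edges))  # [[0, 1, 2], [3, 4, 5], [6]]
```

A higher `inflation` value tends to give finer-grained clusters, which is one way to read the "granularity of clustering" explored on the next slide; a real run on ~650,000 sequences would need a sparse-matrix implementation.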
Consistency of TribeMCL Clusters for Genes of Known Structure in CATH Database
[Figure: percentage of genes with common family annotation against granularity of clustering]
PFscape: clustering ~650,000 genes from 120 complete genomes
[Figure: Protein Family 1, Protein Family 2, Protein Family 3, Protein Family 4]
• ~50,000 protein families of 2 or more sequences, ~60,000 singletons
• on average 10-15% of sequences in a genome are singletons
Mapping CATH and Pfam Domains onto Genome Sequences
• Library of profiles (HMMs) built for representative sequences from each CATH and Pfam domain superfamily
• E-value thresholds validated by structure comparison
[Figure: protein sequences from genomes -> scan against CATH & Pfam SAM-T99 HMM library (1467 CATH, 6190 Pfam) -> assign domains to CATH and Pfam families]
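The assignment step can be sketched as a simple filter over HMM hits. This is a minimal illustration, not the Gene3D pipeline: the hit tuple layout, the `max_evalue` cutoff of 1e-3, the overlap rule (best E-value wins), and all identifiers below are assumptions, since the slides do not give the validated thresholds.

```python
def assign_domains(hits, max_evalue=1e-3):
    """Turn HMM hits into a domain composition per protein.

    hits: (protein, family, start, end, evalue) tuples (hypothetical layout).
    Hits above max_evalue are rejected; where accepted regions would overlap
    on the same protein, the hit with the better E-value is kept.
    """
    accepted = []
    for hit in sorted(hits, key=lambda h: h[4]):  # best E-value first
        protein, family, start, end, evalue = hit
        if evalue > max_evalue:
            continue
        overlaps = any(p == protein and start <= e and s <= end
                       for p, _, s, e, _ in accepted)
        if not overlaps:
            accepted.append(hit)

    # Report families in N- to C-terminal order for each protein.
    composition = {}
    for protein, family, start, end, _ in sorted(accepted,
                                                 key=lambda h: (h[0], h[2])):
        composition.setdefault(protein, []).append(family)
    return composition


# Made-up hits: family names are placeholders, not real CATH/Pfam entries.
toy_hits = [
    ("geneA", "cath_fam1", 10, 150, 1e-20),
    ("geneA", "pfam_fam7", 20, 140, 1e-8),    # overlaps the better CATH hit
    ("geneA", "pfam_fam9", 200, 400, 1e-5),
    ("geneB", "pfam_fam2", 1, 100, 0.5),      # fails the E-value cutoff
]
print(assign_domains(toy_hits))  # {'geneA': ['cath_fam1', 'pfam_fam9']}
```

Proteins with no accepted hit, such as `geneB` here, are simply absent from the result; in the deck's terms they would fall into unannotated or NewFam regions.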
Performance of Sequence Mapping Method
Percentage of remote, structurally validated CATH homologues (<35% sequence identity) identified by SAM-T99
[Figure: 1D-HMM (SAM-T99); percentage of homologues found against error rate]
Library of 1D-HMM models detects >80% of remote homologues
50,000 protein families in Gene3D
Use HMMs to identify CATH and Pfam domains in the genome sequences
[Figure: CATH, Pfam and NewFam domain compositions for protein families in Gene3D]
CATH and Pfam domain families cover roughly 60-90% of genome sequences
[Figure: Gene3D database; percentage of sequences annotated (20-100) by CATH and Pfam for each organism, grouped into Archaea, Bacteria and Eukaryotes]
Gene3D Database: Protein Families in 120 Completed Genomes
• 120 genomes clustered into ~50,000 protein families
• structural domain assignments from CATH
• functional domain assignments from Pfam
• domain compositions for each protein family
• Also: SWISS-PROT, EC, COGs, GO, KEGG annotations
Gene3D Iterative Profile Search Methodology
http://www.biochem.ucl.ac.uk/bsm/Gene3D
Buchan, Thornton, Orengo, 2002, Genome Research
Recent update submitted to Proteins (2004)
Maximise structural coverage of the genomes by targeting the largest domain families
[Figure: three panels (CATH, Pfam, NewFam), each showing percentage of families against number of non-identical relatives]
• NewFam families are very small
• Target large structurally uncharacterised Pfam families to increase structural coverage of genomes
Genome Coverage by Domain Families
[Figure: percentage of non-singleton domain sequences in 120 completed genomes (0-100) against domain families ordered by size (0-30,000)]
~70% of genome sequences are contained in the ~2000 largest CATH and/or Pfam domain families (1345 Pfam families with no structural representative)
-> Target large structurally uncharacterised Pfam families to increase coarse-grained structural coverage of the genomes
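The coverage curve described above amounts to ranking families by size and accumulating the fraction of domain sequences they contain. The sketch below shows that calculation on made-up numbers; the function name, the input layout, and the toy family sizes are assumptions for illustration, not Gene3D data.

```python
def families_for_coverage(families, target=0.70):
    """Pick the largest families until `target` of all sequences is covered.

    families: dict mapping family name -> (n_sequences, has_structure).
    Returns (chosen families, structurally uncharacterised ones among them,
    fraction of sequences covered).
    """
    total = sum(size for size, _ in families.values())
    if total == 0:
        return [], [], 0.0

    # Largest families first; ties keep their input order.
    ranked = sorted(families.items(), key=lambda kv: kv[1][0], reverse=True)

    chosen, covered = [], 0
    for name, (size, has_structure) in ranked:
        if covered / total >= target:
            break
        chosen.append((name, has_structure))
        covered += size

    # Coarse-grained targets: chosen families with no known structure.
    targets = [name for name, has_structure in chosen if not has_structure]
    return [name for name, _ in chosen], targets, covered / total


# Toy data: (number of domain sequences, structure known?)
toy_families = {
    "famA": (50, True), "famB": (20, False), "famC": (10, True),
    "famD": (10, False), "famE": (5, False), "famF": (5, False),
}
chosen, targets, coverage = families_for_coverage(toy_families, target=0.60)
print(chosen, targets)  # ['famA', 'famB'] ['famB']
```

In this toy case two of six families already pass the 60% mark, which mirrors the shape of the real curve: a small number of large families accounts for most of the sequences, and only those without a structural representative need to be targeted.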
Fine Grained Target Selection
[Figure: Structural Family (CATH); Profile Family (HMM based/Pfam); Close Sequence Family (30%ID)]
2000 of the largest domain families cover 70% of genome sequences (~650 CATH + ~1350 Pfam families)
How many fine grained targets should be selected to provide good homology models for all the relatives in these families?
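One way to think about the question above is that each close sequence family (30% identity) needs one solved structure, so the number of targets is the number of representatives required to place every relative within 30% identity of at least one of them. The greedy sketch below illustrates that counting; the function name, the greedy strategy, and the toy identities are assumptions and not the method used to derive the figures in this deck.

```python
def pick_targets(members, identity, threshold=0.30):
    """Greedily choose representatives for close sequence families.

    members:  sequence identifiers, in the order they should be considered.
    identity: function (a, b) -> fractional sequence identity (hypothetical;
              a real run would take this from pairwise alignments).
    A sequence becomes a new target only if no existing target is at or
    above the identity threshold to it.
    """
    targets = []
    for seq in members:
        if not any(identity(seq, rep) >= threshold for rep in targets):
            targets.append(seq)
    return targets


# Toy pairwise identities; unlisted pairs are treated as unrelated (0.10).
toy_identity = {
    frozenset(("a", "b")): 0.45,
    frozenset(("a", "c")): 0.20,
    frozenset(("b", "c")): 0.25,
    frozenset(("c", "d")): 0.60,
}


def lookup(x, y):
    return toy_identity.get(frozenset((x, y)), 0.10)


print(pick_targets(["a", "b", "c", "d"], lookup))  # ['a', 'c']
```

Here four relatives collapse to two targets. Applied to every large family, a count of this kind grows quickly, which is why the number of fine grained targets is so much larger than the number of families.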
45,000 targets are needed to give good homology models for 70% of eukaryotic and prokaryotic domains?
[Figure: percentage of non-singleton domain sequences against number of targets for close sequence families; curves for prokaryotes, eukaryotes, and eukaryotes plus prokaryotes; 25,000, 30,000 and 45,000 targets marked]
Target Selection Strategy
• ~2000 of the largest CATH and/or Pfam families cover >70% of domain sequences in the genomes
• it is not feasible to target all the close sequence families in these families to build good homology models for all relatives (45,000 targets)
• accurate homology models are not needed for all families
-> target sequence families of biological or medical interest (these could be small families or singletons)
-> target additional representatives in very large families, especially functionally diverse families
Domain Recurrences in the Genomes
[Figure: number of families against occurrences; values 730 and 570 marked; large, extensively duplicated families highlighted]
[Figure: structural family (CATH); profile family (Pfam); close sequence family (30%)]
In these very large families we will need finer grained selection of targets to understand the evolution of new functions/biological roles in different organisms
Changes in Domain Partnerships can Modulate Function
67% of enzyme families in CATH show variation in functional properties of relatives
[Figure: domain duplication; domain fusion; change in domain partner]
• In >87% of families -> changes in substrate specificity modulated by changes in domain partners
• In >92% of these families -> conservation or semi-conservation of reaction chemistry
Change in Domain Partner Modulates Function
[Figure: Methionine Aminopeptidase Type 1 (1mat); Creatinase (1chmA); labels: dimer/small molecule substrates, monomer/protein substrates]
[Figure: profile family (Pfam); close sequence family (30%)]
Representative structures for large families may also help to identify functional families
ProFunc: Predicting Functional Sites
Laskowski and Thornton
[Figure: surface clefts; residue conservation; conserved surface patches; most likely binding site]
Representative Structures for Superfamilies will help identify Functional Subfamilies
[Figure: Superfamily containing family_1 to family_5, grouped into functional clusters]
Functional subclusters identified by:
- domain partnerships from Gene3D
- sequence conservation
- functional annotations stored in Gene3D
- results from ProFunc analysis
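The first of these criteria, domain partnerships, can be illustrated by grouping the relatives of a superfamily according to the other domains found in the same protein. This is a minimal sketch under assumed inputs: the function name, the dictionary layout, and the gene and domain identifiers are hypothetical, and the other three criteria (sequence conservation, stored annotations, ProFunc results) are not modelled.

```python
from collections import defaultdict


def partner_subclusters(compositions, superfamily):
    """Group proteins containing `superfamily` by their partner domains.

    compositions: dict mapping gene -> list of domain family identifiers
                  (a hypothetical stand-in for Gene3D domain compositions).
    Returns a dict mapping frozenset of partner domains -> list of genes.
    """
    groups = defaultdict(list)
    for gene, domains in sorted(compositions.items()):
        if superfamily in domains:
            partners = frozenset(d for d in domains if d != superfamily)
            groups[partners].append(gene)
    return dict(groups)


# Made-up compositions for five genes.
toy = {
    "g1": ["sfX", "domP"],
    "g2": ["domP", "sfX"],
    "g3": ["sfX"],
    "g4": ["sfX", "domQ"],
    "g5": ["domQ"],          # does not contain the superfamily of interest
}
for partners, genes in partner_subclusters(toy, "sfX").items():
    print(sorted(partners), genes)
```

Each resulting group is a candidate functional subcluster: relatives that share the same partners are more likely to share substrate specificity, so at least one representative structure per group would be a reasonable starting point.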