Genome Analysis to Select Targets which Probe Fold and Function Space
• How many protein superfamilies and families can we identify in the proteomes?
• How many structures are needed to cover a high fraction of prokaryotic and eukaryotic families?
• Targeting Universal Recurrent Superfamilies (SCOP/CATH/Pfam) to optimise coverage of fold and function space
Midwest Consortium: Russell Marsden, Alastair Grant, David Lee, Annabel Todd, Janet Thornton, Andrzej Joachim
MCSG Site Visit, Argonne, January 30, 2003
Protein Families in Complete Genomes with Structural/Functional Annotations
Gene3D: Buchan, Thornton, Orengo, Genome Research (2002)
800,000 protein sequences from 120 completed genomes:
• 14 eukaryotic genomes, including human, mouse, rat, plant, fly, worm and fugu
• 92 bacterial genomes
• 14 archaeal genomes
Clustering Sequences into Protein Superfamilies of Known Domain Composition
PFscape - Protein Family Landscape
• BLAST all the sequences from the 120 completed genomes against each other and cluster them into protein families
• For each sequence, identify CATH and Pfam domains
TRIBE-MCL - Markov Clustering (Enright & Ouzounis, Genome Research, 2002)
SAM-T99 - sequence mapping of CATH & Pfam (Karplus et al., NAR, 2000)
Clustering ~800,000 genes from 120 complete genomes
[Figure: PFscape clusters the genes into gene superfamilies; Gene Superfamilies 1 to 4 shown]
~50,000 gene superfamilies of 2 or more sequences, 150,000 singletons
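The clustering step can be illustrated with a minimal sketch of Markov clustering, the general technique that TRIBE-MCL applies to BLAST similarities. This is not the TRIBE-MCL implementation: the function names, the toy similarity matrix, the inflation value and the fixed iteration count are all assumptions chosen for illustration.

```python
def normalise_columns(m):
    """Scale each column of a square matrix so that it sums to 1 (in place)."""
    n = len(m)
    for j in range(n):
        col_sum = sum(m[i][j] for i in range(n))
        for i in range(n):
            m[i][j] /= col_sum
    return m


def mcl(similarity, inflation=2.0, iterations=50):
    """Basic Markov clustering on a symmetric similarity matrix.

    Hypothetical helper: alternates expansion (matrix squaring) and
    inflation (element-wise power followed by column normalisation).
    """
    n = len(similarity)
    m = [[float(similarity[i][j]) for j in range(n)] for i in range(n)]
    for i in range(n):
        if m[i][i] == 0.0:
            m[i][i] = 1.0  # self-loops keep every column sum non-zero
    normalise_columns(m)
    for _ in range(iterations):
        # Expansion: let flow spread along longer paths.
        m = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
        # Inflation: strengthen strong flows, weaken weak ones.
        m = [[value ** inflation for value in row] for row in m]
        normalise_columns(m)
    # Each row that still carries flow defines a cluster of columns.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


# Toy similarity matrix (made-up values): sequences 0-2 hit each other,
# as do sequences 3-4, with no hits between the two groups.
toy = [
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
]
clusters = mcl(toy)
```

In practice the matrix entries would be derived from BLAST scores or E-values, and a sparse representation would be needed at the scale of ~800,000 sequences.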
Mapping CATH and Pfam Domains onto Genome Sequences
• A library of HMMs is built for representative sequences from each CATH and Pfam domain superfamily
[Figure: protein sequences from the genomes are scanned against the CATH & Pfam SAM-T99 HMM library to assign domains to CATH and Pfam superfamilies]
Performance of Sequence Mapping Method
Percentage of remote, structurally validated CATH homologues (<35% sequence identity) identified by SAM-T99
[Figure: percentage of homologues found against error rate for the 1D-HMM (SAM-T99)]
The library of 1D-HMM models detects ~80% of remote homologues
Use HMMs to annotate Gene Superfamilies with CATH and Pfam domains
[Figure: Gene Superfamilies 1 to 4 labelled CATH, Pfam or NewFam]
50,000 Gene Superfamilies
Merge superfamilies with the same domain combinations
[Figure: Gene Superfamilies 1, 2 and 3]
Gene3D: 50,000 -> 36,000 Superfamilies
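A minimal sketch of the merging step, assuming each gene superfamily has already been annotated with an ordered tuple of domain identifiers. The function name, the superfamily ids and the domain labels are hypothetical, and treating domain order as significant is an assumption the slide does not confirm.

```python
from collections import defaultdict


def merge_by_domain_combination(annotations):
    """Group gene superfamily ids that share the same domain combination.

    annotations: dict mapping a superfamily id to a tuple of domain ids.
    Returns a dict mapping each domain combination to the ids that carry it.
    """
    merged = defaultdict(list)
    for superfamily_id, domains in annotations.items():
        merged[tuple(domains)].append(superfamily_id)
    return dict(merged)


# Made-up annotations for three gene superfamilies.
example = {
    "GS1": ("cath_a", "pfam_b"),
    "GS2": ("cath_a", "pfam_b"),
    "GS3": ("cath_c",),
}
merged = merge_by_domain_combination(example)
```

Here GS1 and GS2 collapse into one merged superfamily, which is the kind of reduction behind the 50,000 -> 36,000 figure.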
Superfamilies Further Classified into Families
Multi-linkage clustering: relatives in each sequence family have 35% or more sequence identity
[Figure: a superfamily divided into families (35% ID)]
For good homology models, one structure is needed for each family within a superfamily
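The family classification can be sketched as a greedy clustering in which a sequence joins a family only if it shares at least 35% identity with every existing member. This reading of "multi-linkage" is an assumption, as are the function name, the identity lookup and the toy values; the actual Gene3D procedure may differ.

```python
def multi_linkage_clusters(ids, identity, threshold=35.0):
    """Greedy clustering: every pair inside a family meets the threshold.

    ids: sequence identifiers, processed in the given order.
    identity: hypothetical function returning percent identity for a pair.
    """
    clusters = []
    for seq in ids:
        for cluster in clusters:
            if all(identity(seq, other) >= threshold for other in cluster):
                cluster.append(seq)
                break
        else:
            # No existing family accepts this sequence: start a new one.
            clusters.append([seq])
    return clusters


# Made-up pairwise identities; unlisted pairs default to 0%.
table = {
    frozenset(("a", "b")): 60.0,
    frozenset(("a", "c")): 40.0,
    frozenset(("b", "c")): 20.0,
}


def toy_identity(x, y):
    return table.get(frozenset((x, y)), 0.0)


families = multi_linkage_clusters(["a", "b", "c", "d"], toy_identity)
```

Sequence c is close to a but not to b, so it starts its own family. The result depends on input order, which a production method would need to handle.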
Percentage of Sequence Families with and without Close Structural Homologues (>35% identity)
[Figure: percentage of families with no close PDB homologue, for NewFam, CATH and Pfam]
Number of domain superfamilies and families with no close structural homologue:
CATH (1400) + Pfam (4100) + NewFam (46,384) = 51,844 Superfamilies
CATH (60,360) + Pfam (53,907) + NewFam (56,973) = 171,240 Families
Preferentially Target Largest Superfamilies
[Figure: number of superfamilies containing a given number of non-identical relatives, as a percentage of the total, against the number of non-identical relatives; panels for CATH, Pfam and NewFam]
Fitted power-laws (with gradients): CATH (-0.4), Pfam (-1.0), NewFam (-1.9)
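A power-law gradient of this kind can be estimated as the slope of a least-squares line through the log-log size distribution. The sketch below assumes that approach; the slide does not say how the fits were made, and the function name and data are illustrative.

```python
import math


def power_law_gradient(sizes, counts):
    """Least-squares slope of log10(count) against log10(size)."""
    xs = [math.log10(s) for s in sizes]
    ys = [math.log10(c) for c in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Made-up distribution: the count falls tenfold for each tenfold rise in size.
sizes = [1, 10, 100, 1000]
counts = [1000, 100, 10, 1]
gradient = power_law_gradient(sizes, counts)
```

A steeper (more negative) gradient, as reported for NewFam, means that large superfamilies are rarer relative to small ones.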
Proteome Coverage by Superfamilies
[Figure: percentage of proteomes (number of non-identical proteins in 120 completed genomes) against superfamilies ordered by size]
~70% of the proteomes are contained in the < 2500 largest CATH + Pfam + NewFam target superfamilies
Proteome Coverage by Superfamilies
[Figure: percentage of proteomes (120 completed genomes) against superfamilies ordered by size, for CATH (superfamilies of known fold), Pfam and NewFam]
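The coverage curves amount to a cumulative sum over superfamilies ordered by size. A minimal sketch, assuming coverage is measured as the fraction of proteins that belong to the selected superfamilies; the function name and the toy sizes are illustrative.

```python
def superfamilies_needed(sizes, target_fraction):
    """Number of largest superfamilies needed to reach the target coverage.

    sizes: number of non-identical proteins in each superfamily.
    target_fraction: required fraction of all proteins, e.g. 0.7.
    """
    ordered = sorted(sizes, reverse=True)
    total = sum(ordered)
    covered = 0
    for rank, size in enumerate(ordered, start=1):
        covered += size
        if covered >= target_fraction * total:
            return rank
    return len(ordered)


# Made-up superfamily sizes summing to 100 proteins.
toy_sizes = [5, 40, 10, 25, 15, 5]
needed = superfamilies_needed(toy_sizes, 0.7)
```

With these toy sizes the three largest superfamilies (40 + 25 + 15 = 80 proteins) are needed to pass 70% coverage.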
What Fraction of the Proteomes is covered by Bacterial Family Targets?
[Figure: percentage of proteomes (120 completed genomes) against number of target families (0 to 200,000), for prokaryotes, eukaryotes, and eukaryotes plus prokaryotes]
~100,000 prokaryotic targets cover nearly 60% of proteomes
How many family targets cover a significant proportion of the eukaryotes and/or prokaryotes?
[Figure: percentage of kingdom proteomes (120 completed genomes) against number of target families, for prokaryotes, eukaryotes, and eukaryotes plus prokaryotes; 25,000, 30,000 and 45,000 target families marked]
25,000 - 45,000 family targets cover 70% of proteomes (< 2500 largest superfamily targets)
Target Selection Strategy
• The < 2500 largest superfamily targets cover 70% of the proteomes
• This corresponds to 25,000 - 45,000 family targets
• Accurate homology models are not needed for all families
• Target families of biological interest or containing human homologues with disease association
• Target families from functionally diverse superfamilies to understand how changes in the structure can modify function
• For example, Universal, Highly Recurrent Superfamilies are an interesting biological subset with diverse functions
Universal CATH Domain Superfamilies
[Figure: proportion of CATH domain annotations for 30 representative eukaryotic and prokaryotic organisms]
~60-70% of CATH domain annotations within each organism are from < 200 universal CATH superfamilies common to all kingdoms of life, some of which are very extensively duplicated
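A minimal sketch of how universal superfamilies and their share of an organism's annotations could be computed, assuming "universal" means present in every kingdom. The function names, kingdom labels and superfamily ids are hypothetical.

```python
def universal_superfamilies(superfamilies_by_kingdom):
    """Return the superfamilies that occur in every kingdom."""
    sets = [set(s) for s in superfamilies_by_kingdom.values()]
    return set.intersection(*sets) if sets else set()


def universal_fraction(domain_annotations, universal):
    """Fraction of one organism's domain annotations from universal superfamilies."""
    if not domain_annotations:
        return 0.0
    hits = sum(1 for sf in domain_annotations if sf in universal)
    return hits / len(domain_annotations)


# Made-up superfamily content per kingdom.
by_kingdom = {
    "eukaryotes": {"sf_a", "sf_b", "sf_c"},
    "bacteria": {"sf_a", "sf_b", "sf_d"},
    "archaea": {"sf_a", "sf_b"},
}
universal = universal_superfamilies(by_kingdom)

# One organism's domain annotations; repeats model duplicated domains.
organism = ["sf_a", "sf_a", "sf_b", "sf_d"]
fraction = universal_fraction(organism, universal)
```

Counting every annotation, including repeats, is what lets extensively duplicated superfamilies dominate the proportion.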
Domain Recurrences in the Genomes 730 570 number of superfamilies Highly Recurrent, Extensively Duplicated Superfamilies occurrences
56 Universal and Highly Recurrent Superfamilies
[Figure: COG functional annotation (25 functional categories: U O S R Z Y V W T N M D L A J B K Q P I H G F E C), grouped as Poorly characterised, Information storage and processing, Cellular processes and signalling, and Metabolism]
Categories named on the slide: E (Amino acid metabolism), J (Translation and protein biosynthesis), K (Transcription), T (Signal transduction)
Analysis in bacterial genomes showed that 56 Universal Superfamilies recurred in proportion to the genome size and accounted for 45% of the CATH domain annotations
15,000 bacterial family targets
In Functionally Diverse Superfamilies Select More Targets
Target the relative with the most neighbours for which a homology model can be built or function assigned
For >95% confidence when inheriting functional properties, homologues should have at least 60% identity (Todd, Valencia, Rost)
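Choosing the relative with the most neighbours can be sketched as a count of relatives within an identity threshold. The 60% default follows the functional-inheritance figure on the slide; the function names, the identity lookup and the toy values are assumptions.

```python
def pick_target(ids, identity, threshold=60.0):
    """Return the relative with the most neighbours at or above the threshold.

    identity: hypothetical function returning percent identity for a pair.
    Ties are resolved in favour of the relative listed first.
    """
    def neighbour_count(seq):
        return sum(1 for other in ids
                   if other != seq and identity(seq, other) >= threshold)

    return max(ids, key=neighbour_count)


# Made-up pairwise identities; unlisted pairs default to 0%.
pairs = {
    frozenset(("a", "b")): 70.0,
    frozenset(("a", "c")): 65.0,
    frozenset(("b", "c")): 40.0,
}


def toy_identity(x, y):
    return pairs.get(frozenset((x, y)), 0.0)


target = pick_target(["b", "a", "c"], toy_identity)
```

A structure for relative a would support function assignment for both b and c, whereas b or c would each cover only one neighbour. Lowering the threshold to 35% would give the analogous choice for homology modelling.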
Representative Structures for Superfamilies will help identify Functional Families
[Figure: a superfamily divided into functional clusters S60_1 to S60_5]
Functional clusters are identified by sequence conservation
Annotations (GO, KEGG, Pfam, EC, COGs, SWISS-PROT) are stored in Gene3D
Target Selection Strategy
• Targeting the 2500 largest superfamilies will cover a significant proportion (70%) of the proteomes
• For good homology models, 25,000 - 45,000 family targets are needed
• Preferentially select targets from medically important and/or structurally and functionally diverse superfamilies
• For example, targeting Universal and Recurrent superfamilies which exhibit significant structural and functional divergence will help to improve function prediction methods