170 likes | 301 Views
Set for a model building : 67 microbial genomes with identified protein sequences (Table 1) Set for a model estimation : 73 microbial genomes (Table 2). METHOD: Family Classification Scheme. PSI-BLAST search against all sequences in Genbank. Domain parsing. Domain clustering.
E N D
Set for a model building: 67 microbial genomes with identified protein sequences (Table 1) • Set for a model estimation: 73 microbial genomes (Table 2) METHOD:Family Classification Scheme • PSI-BLAST search against all sequences in Genbank. • Domain parsing. • Domain clustering. • The model was tuned on 50,000 randomly chosen full-length SwissProt protein sequences containing domains present in a 721 family subset of PfamA. • E-value (e-4) and number of rounds (6) for (1) and rules for (3) have been chosen to maximize the agreement between the generated and PfamA families.
METHOD:Domain boundaries are identified based on the location of indels in the PSI-BLAST alignment The result is that 96% of single domain PfamA proteins are predicted as such, but only 24% of PfamA two-domain proteins are predicted correctly.
METHOD:Evaluation of domain family construction on SCOP40 1224 superfamilies (5226 domains, 760 folds) • SCOP domains were clustered into families using the authors’ procedure (PSI-BLAST against nr and clustering). • For comparison, the same SCOP domains were clustered using BLAST, PSI-BLAST, SAM-T99 (HMM) and PRC (profile-profile) methods. • 2) Generated families were compared with the SCOP superfamilies: any pair of domains in a generated family presented in the same SCOP superfamily is considered as a true positive, presented in different SCOP fold – as a false positive.
METHOD: Evaluation of domain family construction on SCOP40 1224 superfamilies (5226 domains, 760 folds) The threshold of 1% ratio of false positives versus true positives Authors’ method gives 32% of TPs HMM and PRC detect 28% of TPs PSI-BLAST detects 18% of true positives BLAST detects 9% of true positives
RESULT 1: Distribution of domain family size for 67 genomes (178,310 proteins, 249,574 domains, 31,874 families) 87.7% coverage would be provided by 6072 structures
RESULT 2: Estimation of the number of families in 1000 microbial genomes 1) Examination of a number of families as a function of the number of genomes through 100 times repeated procedure of randomly chosen genomes among 67. 67 genomes
RESULT 2:Estimationof the number of families in 1000 microbial genomes 2) Extrapolation of the data on 1000 genomes. ~1,000,000 families for 10,000 genomes The model predicts ~250,000 families for 1000 genomes The model predicts ~140,000 singletons for 1000 genomes
RESULT 2: Estimation of the number of families in 1000 microbial genomes 3) Testing of the extrapolation model on the complete set of 140 genomes.
RESULT 3:Structural coverage of protein families for 67 genomes • For each non-membrane family with more than 2 members (4907 families), a sequence profile (PSSM) was obtained using blastgp. • PDB sequences was run against PSSMs using RPS-BLAST. Profile with E-value e-2 was considered to represent a family. Such family was considered to be structurally covered. Overall, 20% of all families have representative structures. 3926 structures would be required to complete the coverage.
RESULT 3:Structural coverage of protein families for 67 genomes Suggested strategy of obtaining representative structures for the largest families first: ~4000 structures are required to obtain complete coverage of all families with three or more members, covering 88% of the domains in these genomes.
RESULT 4:Estimation of structural coverage for 1000 microbial genomes 1) Examination of a number of structures needed to obtain coverage of various fractions of proteins (assuming structures for the largest families are obtained first) through 100 times repeated simulation of randomly chosen genomes among 67.
RESULT 4:Estimation of structural coverage for 1000 microbial genomes 2) Extrapolation of the data on 1000 genomes. . A total of 250,000 structures would be required to obtain 100% coverage of protein domains families from 1000 genomes
RESULT 4:Estimation of structural coverage for 1000 microbial genomes 2) Extrapolation of the data on 1000 genomes. Representative structures for about 8,000 families will provide 70% coverage
RESULT 4:Estimation of structural coverage for 1000 microbial genomes 3) Testing of the extrapolation model on the complete set of 140 genomes. ?
DISCUSSION • Domain boundaries detection method. It doesn’t split at many domain boundaries that are obvious at the structure level. The result is that 96% of single domain PfamA proteins are predicted as such, but only 24% of PfamA two-domain proteins are predicted correctly. • Clusterization procedure was tuned on a set of PfamA domains. PfamA omits a large number of evolutionary relationships (placing related proteins in different families). PfamA also focuses on larger families, whereas the genome data are dominated by smaller families. As a result, a better method for PfamA may not necessarily be optimum on genome data. • The final procedure of family generation was benchmarked on a set of SCOP superfamilies. SCOP may be unrepresentative of proteins in genomes as a whole (not including disorder proteins and that form part of complexes).