Protein World SARA 12-12-2002 Amsterdam Tim Hulsen
Genome sequencing • Since 1995: sequencing of complete ‘genomes’ (DNA): A/C/G/T order ACGTCATCGTAGCTAGCTAGTCGTACGTATG TGCAGTAGCATCGATCGATCAGCATGCATAC • At this moment more than 80 genomes have been sequenced and published, of all kinds of organisms: • Animals • Plants • Fungi • Bacteria
Genomes -> Proteins • ‘Transcription’ and ‘translation’ of specific regions of the genome lead to proteins, which consist of twenty types of ‘amino acids’: ATG ACG CTG AGC TGC GGA CGT TGA -> MTLSCGR (TGA is a stop codon) • Proteins are responsible for all kinds of life processes • All the proteins that an organism can produce are together called the ‘proteome’ • Sequence comparisons make it possible to classify proteins
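The translation step on this slide can be sketched in a few lines of Python. This is only an illustration: the table below is a hypothetical excerpt of the standard genetic code holding just the eight codons from the example, and the function name `translate` is made up for this sketch. Note that the result includes the leading M (methionine) encoded by the ATG start codon.

```python
# Minimal sketch: translate the slide's example codons with the standard genetic code.
# Only the eight codons used on the slide are listed; a full table has 64 entries.
CODON_TABLE = {
    "ATG": "M",  # methionine, also the usual start codon
    "ACG": "T",  # threonine
    "CTG": "L",  # leucine
    "AGC": "S",  # serine
    "TGC": "C",  # cysteine
    "GGA": "G",  # glycine
    "CGT": "R",  # arginine
    "TGA": "*",  # stop codon
}


def translate(dna: str) -> str:
    """Translate a coding DNA string codon by codon until a stop codon."""
    dna = dna.replace(" ", "").upper()
    protein = []
    # Step through complete codons only; ignore a trailing partial codon.
    for i in range(0, len(dna) - len(dna) % 3, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATG ACG CTG AGC TGC GGA CGT TGA"))  # prints MTLSCGR
```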
Protein families • e.g. The GPCR family: • Sequence comparison helps in predicting the function of new proteins
Determining protein functions • Function of 40-50% of the new proteins is unknown • Understanding of protein functions and relationships is important for: • Study of fundamental biological processes • Drug design • Genetic engineering
Sequence comparison • Smith-Waterman dynamic programming algorithm (1981): calculates the similarity/distance between two sequences: Query ---PLIT-LETRESV- Subject NEQPKVTMLETRQTAD (similar residues were shown in bold on the original slide) • Results in a SW-score that measures how similar the two sequences are to each other • Disadvantage: the score depends on the sequence length • After the alignments, the proteins are ‘clustered’ (divided into families) according to their similarity
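A minimal sketch of how a Smith-Waterman score is computed, for illustration only. The scoring scheme here (match = 2, mismatch = -1, linear gap penalty = -2) and the function name `smith_waterman_score` are assumptions made for this example; protein comparisons in practice normally use a substitution matrix and affine gap penalties, and the slides do not say which parameters were used.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Return the best local alignment score between sequences a and b.

    Simplified scoring (an assumption for this sketch): fixed match/mismatch
    values and a linear gap penalty.
    """
    best = 0
    prev = [0] * (len(b) + 1)  # previous row of the DP matrix
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# The shared stretch LETR scores the same regardless of the flanking residues.
print(smith_waterman_score("GGLETRGG", "AALETRCC"))  # prints 8
# The slide's example pair (gaps removed from the query):
print(smith_waterman_score("PLITLETRESV", "NEQPKVTMLETRQTAD"))
```

The length dependence mentioned on the slide is easy to see with this sketch: a sequence aligned with itself scores 2 per residue, so doubling its length doubles the score even though the sequences are no "more similar".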
Existing databases • Domain-based clusterings: Prosite, Pfam, ProDom, Prints, Domo, Blocks • Protein-based clusterings: ProtoMap, COGs, Systers, PIR, ClusTr • Structural classifications: SCOP, CATH, FSSP • Why should there be another database?
Another method • Enhanced Smith-Waterman algorithm: Monte-Carlo evaluation (Lipman et al., 1984) • How likely is it that two sequences are similar but not related? • One of the two sequences is randomized and the score is recalculated (200 times). Randomization leads to sequences with the same length and the same composition, but a different order • The method leads to the calculation of the Z-value, where S(A,B) is the SW-score of the original pair and µ and σ are the mean and standard deviation of the scores obtained with the randomized sequences:
$$Z(A,B) = \frac{S(A,B) - \mu}{\sigma}$$
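The Monte-Carlo evaluation described on this slide can be sketched as follows. This is a simplified illustration, not the implementation used for Protein World: the scoring parameters, the function names `sw_score` and `z_value`, the fixed random seed, and the handling of a zero standard deviation are all assumptions made for the example. Only the shuffle count of 200 comes from the slide.

```python
import random
import statistics


def sw_score(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Smith-Waterman local alignment score with a simplified scoring scheme."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def z_value(a: str, b: str, n_shuffles: int = 200, seed: int = 0) -> float:
    """Estimate Z(A,B) = (S(A,B) - mu) / sigma by shuffling sequence b."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = sw_score(a, b)
    residues = list(b)
    shuffled_scores = []
    for _ in range(n_shuffles):
        # Shuffling keeps length and composition, but changes the order.
        rng.shuffle(residues)
        shuffled_scores.append(sw_score(a, "".join(residues)))
    mu = statistics.mean(shuffled_scores)
    sigma = statistics.pstdev(shuffled_scores)
    if sigma == 0:
        # Assumption for this sketch: no spread in the random scores.
        return float("inf") if observed > mu else 0.0
    return (observed - mu) / sigma


seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # made-up example sequence
print(z_value(seq, seq))  # a sequence compared with itself gives a high Z-value
```

Because the shuffled sequences have the same length and composition as the original, both effects are already contained in µ and σ, which is why the Z-value is far less sensitive to them than the raw SW-score.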
Advantages • The obtained Z-value is a much more reliable measure of sequence similarity than the SW-score: • The SW-score depends on sequence length, the Z-value does not • Amino acid bias does not affect the Z-value • Independent of the database size • Easier updating of the database, without a total recalculation
Disadvantage • LOTS of calculation time needed, especially when all proteins in all proteomes are compared to each other (“all-against-all”)! -> SARA
SARA calculation • Proteomes of 82 organisms compared ‘all-against-all’ using the Monte Carlo algorithm: more than 400,000 proteins! • 21,600 CPU days (~520,000 CPU hours) • Equivalent to 21,600 PCs running in parallel for 24 hours, or 1 PC running for ~60 years • Using the supercomputer TERAS (1024-CPU SGI Origin 3800) at SARA: less than two months!
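The figures on this slide can be checked with a little arithmetic. The sketch below uses only the numbers quoted above; the ideal-case estimate assumes all 1024 CPUs are available full time with perfect scaling, which is an assumption of this example and not a claim from the talk, and the pair count treats 400,000 as the exact number of proteins.

```python
cpu_days = 21_600

# Convert the total compute budget to CPU hours.
cpu_hours = cpu_days * 24                # 518,400, i.e. roughly 520,000

# One PC working alone, in years.
single_pc_years = cpu_days / 365         # about 59.2, i.e. roughly 60 years

# Ideal case on TERAS: every one of the 1024 CPUs busy the whole time.
teras_cpus = 1024
ideal_days = cpu_days / teras_cpus       # about 21.1 days

# Upper bound on the number of distinct protein pairs in an all-against-all run.
n_proteins = 400_000
pairs = n_proteins * (n_proteins - 1) // 2   # 79,999,800,000

print(cpu_hours, round(single_pc_years, 1), round(ideal_days, 1), pairs)
```

The ideal case of about three weeks is consistent with the reported "less than two months", since a shared machine is rarely available in full for a single project.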
Parties involved • Gene-IT (Paris, France) • SARA (Amsterdam, the Netherlands) • CMBI (Nijmegen, the Netherlands) • Organon (Oss, the Netherlands) • EBI (Hinxton, UK)
Supporting parties • Financed by NCF, foundation in support of supercomputing • Under the auspices of BioASP, the new Dutch knowledge and service center for Bioinformatics
Results available through BioASP • http://www.bioasp.nl • Log in and click on links ‘Research’ and ‘Protein World’
Results available through BioASP • Organism selection screen:
Results available through BioASP • Results screen:
Results available through BioASP • Alignment screen:
Conclusions • Currently the most comprehensive and most accurate data-set of protein comparisons • A start for a maintainable and unique database of all proteins currently known • A rich data-source for clustering, data-mining and orthology determination
Orthology determination • Orthologs: genes/proteins in different species that derive from a common ancestor • Orthologs often have the same function • Interesting! Information from other species could help in annotating a protein
Thank you for your attention Any questions?