Protein World

Protein World SARA 12-12-2002 Amsterdam Tim Hulsen

Genome sequencing
• Since 1995: sequencing of complete ‘genomes’ (DNA): A/C/G/T order
  ACGTCATCGTAGCTAGCTAGTCGTACGTATG
  TGCAGTAGCATCGATCGATCAGCATGCATAC
• At this moment more than 80 genomes have been sequenced and published, of all kinds of organisms:
  • Animals
  • Plants
  • Fungi
  • Bacteria

Genomes → Proteins
• ‘Transcription’ and ‘translation’ of specific regions of the genome lead to proteins, which consist of twenty types of ‘amino acids’:
  ATG ACG CTG AGC TGC GGA CGT TGA -> MTLSCGR (TGA is a stop codon)
• Proteins are responsible for all kinds of life processes
• All the proteins that an organism can produce are together called the ‘proteome’
• Sequence comparisons make it possible to classify proteins
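
The codon-to-amino-acid step on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming the standard genetic code (NCBI translation table 1) and a reading frame that starts at the first base; the names `CODON_TABLE` and `translate` are made up for this example. It keeps the methionine (M) encoded by the start codon ATG and stops at the first stop codon.

```python
# Standard genetic code (NCBI translation table 1), with bases ordered T, C, A, G.
# "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Build the 64-entry codon lookup table, e.g. "ATG" -> "M".
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna: str) -> str:
    """Translate a coding DNA sequence, stopping at the first stop codon."""
    dna = dna.replace(" ", "").upper()
    protein = []
    for pos in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[pos:pos + 3]]
        if amino_acid == "*":  # stop codon: translation ends here
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATG ACG CTG AGC TGC GGA CGT TGA"))
```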

Protein families
• E.g. the GPCR family
• Sequence comparison helps in predicting the function of new proteins

Determining protein functions
• Function of 40-50% of the new proteins is unknown
• Understanding of protein functions and relationships is important for:
  • Study of fundamental biological processes
  • Drug design
  • Genetic engineering

Sequence comparison
• Smith-Waterman dynamic programming algorithm (1981): calculates the similarity/distance between two sequences:
  Query    ---PLIT-LETRESV-
  Subject  NEQPKVTMLETRQTAD
• Results in a SW-score that measures how similar the two sequences are to each other
• Disadvantage: the score depends on sequence length
• After the alignments, the proteins are ‘clustered’ (divided into families) according to their similarity
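
A minimal sketch of the Smith-Waterman recurrence, to show where the SW-score comes from. It assumes a simple match/mismatch scheme and a linear gap penalty (the values 2, -1 and -2 are arbitrary choices for this illustration); real protein comparisons normally use an amino acid substitution matrix and affine gap penalties, and nothing here is meant to reproduce the scoring used in the project.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Return the best local alignment score between sequences a and b."""
    prev = [0] * (len(b) + 1)  # previous row of the dynamic programming matrix
    best = 0
    for ch_a in a:
        curr = [0]  # first column is always 0 (a local alignment may start anywhere)
        for j, ch_b in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ch_a == ch_b else mismatch)
            # A cell never drops below 0: that is what makes the alignment local.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


print(smith_waterman_score("PLITLETRESV", "NEQPKVTMLETRQTAD"))
# Length dependence: a perfect match twice as long scores twice as high.
print(smith_waterman_score("LETR", "LETR"), smith_waterman_score("LETRLETR", "LETRLETR"))
```

The last line illustrates the disadvantage named on the slide: the raw score grows with the length of the matching region, so scores of short and long proteins are hard to compare directly.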

Existing databases
• Domain-based clusterings: Prosite, Pfam, ProDom, Prints, Domo, Blocks
• Protein-based clusterings: ProtoMap, COGs, Systers, PIR, ClusTr
• Structural classifications: SCOP, CATH, FSSP
Why should there be another database?

Another method
• Enhanced Smith-Waterman algorithm: Monte-Carlo evaluation (Lipman et al., 1984)
• How likely is it that two sequences are similar but not related?
• One of the two sequences is randomized and the score is recalculated (200 times). Randomization yields sequences with the same length and the same composition, but a different order
• The method leads to the calculation of the Z-value:
  Z(A,B) = (S(A,B) - µ) / σ
  where S(A,B) is the SW-score of the original pair, and µ and σ are the mean and standard deviation of the scores obtained with the randomized sequences
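
The Monte-Carlo evaluation can be sketched as follows. This is an illustration under stated assumptions, not the implementation used for the project: the scoring function is the same simplified Smith-Waterman variant (match/mismatch plus linear gap penalty) rather than a substitution-matrix version, the names `z_value` and `n_shuffles` are made up for this example, and the handling of σ = 0 is an arbitrary choice for the degenerate case.

```python
import random
import statistics


def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score (simplified scoring, linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ch_a == ch_b else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def z_value(a: str, b: str, n_shuffles: int = 200, rng=None) -> float:
    """Z(A,B) = (S(A,B) - mu) / sigma, with mu and sigma from shuffled copies of b."""
    rng = rng or random.Random()
    observed = smith_waterman_score(a, b)
    residues = list(b)
    shuffled_scores = []
    for _ in range(n_shuffles):
        # Shuffling keeps length and composition, but destroys the order.
        rng.shuffle(residues)
        shuffled_scores.append(smith_waterman_score(a, "".join(residues)))
    mu = statistics.mean(shuffled_scores)
    sigma = statistics.pstdev(shuffled_scores)
    if sigma == 0:  # degenerate case: all shuffled scores are identical
        return float("inf") if observed > mu else 0.0
    return (observed - mu) / sigma


seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
print(z_value(seq, seq, rng=random.Random(0)))
```

Each Z-value costs `n_shuffles + 1` alignments instead of one, which is where the large calculation time of this method comes from.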

Advantages
• The obtained Z-value is a very reliable measure of sequence similarity, compared to the SW-score:
  • The SW-score depends on sequence length, the Z-value does not
  • Amino acid bias does not affect the Z-value
  • Independent of the database size
• Easier updating of the database, without a total recalculation

Disadvantage
• LOTS of calculation time needed, especially when all proteins in all proteomes are compared to each other (“all-against-all”)! → SARA

SARA calculation
• Proteomes of 82 organisms compared ‘all-against-all’ using the Monte Carlo algorithm: more than 400,000 proteins!
• 21,600 CPU days (~520,000 CPU hours)
• = 21,600 PCs running in parallel for 24 hours, or 1 PC running for ~60 years
• Using the supercomputer TERAS (1024-CPU SGI Origin 3800) at SARA: less than two months!
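
The figures on this slide can be checked with a few lines of arithmetic. The sketch below uses 400,000 as a lower bound for the number of proteins and assumes perfect parallel scaling over all 1024 CPUs, which is an idealization; the variable names are made up for this example.

```python
CPU_DAYS = 21_600        # total compute reported on the slide
N_PROTEINS = 400_000     # lower bound ("more than 400,000 proteins")
N_CPUS = 1024            # CPUs in TERAS

cpu_hours = CPU_DAYS * 24              # total CPU hours
single_pc_years = CPU_DAYS / 365       # one PC running non-stop
ideal_wall_days = CPU_DAYS / N_CPUS    # assumes perfect scaling on all CPUs
ordered_pairs = N_PROTEINS ** 2        # all-against-all, both directions

print(f"{cpu_hours:,} CPU hours")
print(f"{single_pc_years:.1f} years on a single PC")
print(f"{ideal_wall_days:.1f} days on {N_CPUS} CPUs (ideal case)")
print(f"{ordered_pairs:,} ordered protein pairs")
```

The ideal wall-clock time of about three weeks is a lower bound, which is consistent with the "less than two months" quoted on the slide.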

Parties involved
• Gene-IT (Paris, France)
• SARA (Amsterdam, the Netherlands)
• CMBI (Nijmegen, the Netherlands)
• Organon (Oss, the Netherlands)
• EBI (Hinxton, UK)

Supporting parties
• Financed by NCF, foundation in support of supercomputing
• Under the auspices of BioASP, the new Dutch knowledge and service center for Bioinformatics

Results available through BioASP
• http://www.bioasp.nl
• Log in and click on the links ‘Research’ and ‘Protein World’

Results available through BioASP • Organism selection screen:

Results available through BioASP • Results screen:

Results available through BioASP • Alignment screen:

Conclusions
• Currently the most comprehensive and most accurate data-set of protein comparisons
• A start for a maintainable and unique database of all proteins currently known
• A rich data-source for clustering, data-mining and orthology determination

Orthology determination
• Orthologs: genes/proteins in different species that derive from a common ancestor
• Orthologs often have the same function
• Interesting! Information from other species could help in annotating a protein

Thank you for your attention. Any questions?
