1 / 62

Domains and domain databases

C. E. N. T. E. R. F. O. R. I. N. T. E. G. R. A. T. I. V. E. B. I. O. I. N. F. O. R. M. A. T. I. C. S. V. U. Lecture 17:. Domains and domain databases. Introduction to Bioinformatics. Protein interaction domains. http://pawsonlab.mshri.on.ca/html/domains.html.

pello
Download Presentation

Domains and domain databases

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. C E N T E R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U Lecture 17: Domains and domain databases Introduction to Bioinformatics

  2. Protein interaction domains http://pawsonlab.mshri.on.ca/html/domains.html

  3. Domain function Active site / binding cleft

  4. Protein-protein (domain-domain) interaction Shape complementarity

  5. A domain is a: • Compact, semi-independent unit (Richardson, 1981). • Stable unit of a protein structure that can fold autonomously (Wetlaufer, 1973). • Recurring functional and evolutionary module (Bork, 1992). “Nature is a tinkerer and not an inventor” (Jacob, 1977).

  6. Identification of domains is essential for: • High resolution structures (e.g. Pfuhl & Pastore, 1995). • Sequence analysis (Russell & Ponting, 1998) • Multiple alignment methods • Sequence database searches • Prediction algorithms • Fold recognition • Structural/functional genomics

  7. Domain connectivity linker

  8. False sequence relationship by local search

  9. Domain size The size of individual structural domains varies widely from 36 residues in E-selectin to 692 residues in lipoxygenase-1 (Jones et al., 1998), the majority (90%) having less than 200 residues (Siddiqui and Barton, 1995) with an average of about 100 residues (Islam et al., 1995). Small domains (less than 40 residues) are often stabilised by metal ions or disulphide bonds. Large domains (greater than 300 residues) are likely to consist of multiple hydrophobic cores (Garel, 1992).

  10. Analysis of chain hydrophobicity in multidomain proteins

  11. Analysis of chain hydrophobicity in multidomain proteins

  12. Domain characteristics Domains are genetically mobile units, and multidomain families are found in all three kingdoms (Archaea, Bacteria and Eukarya) underlining the finding that ‘Nature is a tinkerer and not an inventor’ (Jacob, 1977). The majority of genomic proteins, 75% in unicellular organisms and more than 80% in metazoa, are multidomain proteins created as a result of gene duplication events (Apic et al., 2001). Domains in multidomain structures are likely to have once existed as independent proteins, and many domains in eukaryotic multidomain proteins can be found as independent proteins in prokaryotes (Davidson et al., 1993).

  13. Protein function evolution-Gene (domain) duplication - Active site Chymotrypsin

  14. Detecting Structural Domains • A structural domain may be detected as a compact, globular substructure with more interactions within itself than with the rest of the structure (Janin and Wodak, 1983). • Therefore, a structural domain can be determined by two shape characteristics: compactness and its extent of isolation (Tsai and Nussinov, 1997). • Measures of local compactness in proteins have been used in many of the early methods of domain assignment (Rossmann et al., 1974; Crippen, 1978; Rose, 1979; Go, 1978) and in several of the more recent methods (Holm and Sander, 1994; Islam et al., 1995; Siddiqui and Barton, 1995; Zehfus, 1997; Taylor, 1999).

  15. Detecting Structural Domains However, approaches encounter problems when faced with discontinuous or highly associated domains and many definitions will require manual interpretation. Consequently there are discrepancies between assignments made by domain databases (Hadley and Jones, 1999).

  16. Taylor method (1999) • Easy and clever • Uses spin glass theory (disordered magnetic systems) to delineate domains in a protein 3D structure • Steps: • Take sequence with residue numbers (1..N) • Look at neighbourhood of each residue (first shell) • If (“average nghhood residue number” > res no) resno = resno+1 else resno = resno-1 • If (convergence) then take regions with identical “residue number” as domains and terminate

  17. SNAPDRAGONDomain boundary prediction protocol using sequence information alone (Richard George) • Input: Multiple sequence alignment (MSA) and predicted secondary structure • Generate 100 DRAGON models for the protein structure associated with the MSA • DRAGON folds proteins based on the requirement that (conserved) hydrophobic residues cluster together • (Predicted) secondary structures are used to further estimate distances between residues (e.g. between the first and last residue in a b-strand). • It first constructs a random high dimensional Ca (and pseudo Cb) distance matrix • Distance geometry is used to find the 3D conformation corresponding to a prescribed matrix of desired distances between residues (by gradual inertia projection and based on input MSA and predicted secondary structure) • Assign domain boundaries to each of the models (Taylor, 1999) • Sum proposed boundary positions within 100 models along the length of the sequence, and smooth boundaries using a weighted window George R.A. and Heringa J.(2002) SnapDRAGON - a method to delineate protein structural domains from sequence data, J. Mol. Biol. 316, 839-851.

  18. SnapDragon Folds generated by Dragon Multiple alignment Boundary recognition (Taylor, 1999) Predicted secondary structure Summed and Smoothed Boundaries CCHHHCCEEE

  19. Lysozyme 4lzm PDB DRAGON

  20. Methyltransferase 1sfe PDB DRAGON

  21. Phosphatase 2hhm-A PDB DRAGON

  22. Domain fusion For example, vertebrates have a multi-enzyme protein (GARs-AIRs-GARt) comprising the enzymes GAR synthetase (GARs), AIR synthetase (AIRs), and GAR transformylase (GARt) 1. In insects, the polypeptide appears as GARs-(AIRs)2-GARt. However, GARs-AIRs is encoded separately from GARt in yeast, and in bacteria each domain is encoded separately (Henikoff et al., 1997). 1GAR: glycinamide ribonucleotide synthetase AIR: aminoimidazole ribonucleotide synthetase

  23. Domain fusion Genetic mechanisms influencing the layout of multidomain proteins include gross rearrangements such as inversions, translocations, deletions and duplications, homologous recombination, and slippage of DNA polymerase during replication (Bork et al., 1992). Although genetically conceivable, the transition from two single domain proteins to a multidomain protein requires that both domains fold correctly and that they accomplish to bury a fraction of the previously solvent-exposed surface area in a newly generated inter-domain surface.

  24. Domain swapping Domain swapping is a structurally viable mechanism for forming oligomeric assemblies (Bennett et al., 1995). In domain swapping, a secondary or tertiary element of a monomeric protein is replaced by the same element of another protein. Domain swapping can range from secondary structure elements to whole structural domains. It also represents a model of evolution for functional adaptation by oligomerization, e.g. of oligomeric enzymes that have their active site at sub-unit interfaces (Heringa and Taylor, 1997).

  25. Domain databases

  26. BLOCKS Domain database Blocks are short multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins. The rationale behind searching a database of blocks is that information from multiply aligned sequences is present in a concentrated form, reducing noise and increasing sensitivity to distant relationships. The BLOCKS Database (Henikoff et al., 2000) is automatically generated by looking for the most highly conserved regions in groups of proteins documented in various domain databases. Version 13.0 of the BLOCKS database consists of 8656 sequence blocks generated specifically from proteins in the PROSITE database.

  27. COGS Domain database The COGs (Clusters of Orthologous Groups) database is a phylogenetic classification of the proteins encoded within complete genomes (Tatusov et al., 2001). It primarily consists of bacterial and archaeal genomes. Operational definition of orthology is based on bidirectional best hit Incorporation of the larger genomes of multicellular eukaryotes into the COG system is achieved by identifying eukaryotic proteins that fit into already existing COGs. Eukaryotic proteins that have orthologs within different COGs are split into their individual domains. The COGs database currently consists of 3166 COGs including 75,725 proteins from 44 genomes.

  28. COGs: the beginning (1997) In order to extract the maximum amount of information from the rapidly accumulating genome sequences, all conserved genes need to be classified according to their homologous relationships. Comparison of proteins encoded in seven complete genomes from five major phylogenetic lineages and elucidation of consistent patterns of sequence similarities allowed the delineation of 720 clusters of orthologous groups (COGs). Each COG consists of individual orthologous proteins or orthologous sets of paralogs from at least three lineages. Orthologs typically have the same function, allowing transfer of functional information from one member to an entire COG. This relation automatically yields a number of functional predictions for poorly characterized genomes. The COGs comprise a framework for functional and evolutionary genome analysis.

  29. COG2813:16S RNA G1207 methylase RsmC COG members are mapped onto the genomes included in the DB

  30. PRINTS database • PRINTS is a compendium of protein fingerprints. • A fingerprint is a group of conserved motifs used to characterise a protein family; its diagnostic power (false positives and false negatives) is refined by iterative scanning of a SWISS-PROT/TrEMBL composite database. • Usually the motifs do not overlap, but are separated along a sequence, though they may be contiguous in 3D-space. • Fingerprints can encode protein folds and functionalities more flexibly and powerfully than can single motifs, full diagnostic potency deriving from the mutual context provided by motif neighbours • PRINTS contains the most discriminating groups of regular expressions for each protein sequence • Release 31.0 of PRINTS contains 1,550 entries, encoding 9,531 individual motifs.

  31. BETAHEAM: 2 of 5 PRINTS motifs making the fingerprint INITIAL MOTIF SETS BETAHAEM1 Length of motif = 17 Motif number = 1 Beta haemoglobin motif I - 1 PCODE ST INT GRLLVVYPWTQRYFDSF HBB1_RAT 29 29 GRLLVVYPWTQRYFDSF HBB1_MOUSE 29 29 GRLLVVYPWTQRFFEHF HBB_ALCAA 28 28 GRLLVVYPWTQRFFEHF HBB_ODOVI 28 28 GRLLVVYPWTQRFFESF HBB_BOVIN 28 28 GRLLVVYPWTQRFFESF HBB_ATEGE 29 29 GRLLVVYPWTQRFFESF HBB_HUMAN 29 29 GRLLVVYPWTQRFFESF HBB_ANTPA 29 29 ARLLIVYPWTQRFFASF HBB_ANAPL 29 29 SRCLIVYPWTQRHFSGF HBB_NOTAN 29 29 BETAHAEM2 Length of motif = 16 Motif number = 2 Beta haemoglobin motif II - 1 PCODE ST INT DLSSASAIMGNPKVKA HBB1_RAT 47 1 DLSSASAIMGNAKVKA HBB1_MOUSE 47 1 DLSTADAVMHNAKVKE HBB_ALCAA 46 1 DLSSAGAVMGNPKVKA HBB_ODOVI 46 1 DLSTADAVMNNPKVKA HBB_BOVIN 46 1 DLSTPDAVMSNPKVKA HBB_ATEGE 47 1 DLSTPDAVMGNPKVKA HBB_HUMAN 47 1 DLSNAGAVMGNAKVKA HBB_ANTPA 47 1 NLSSPTAILGNPMVRA HBB_ANAPL 47 1 NLYNAEAILGNANVAA HBB_NOTAN 47 1 After iteration the number of sequences for each motif can grow dramatically. Both the initial motifs (example here) and final motifs are provided to the user

  32. The PRODOM Database ProDom is a comprehensive set of protein domain families automatically generated from the SWISS-PROT and TrEMBL sequence databases

  33. The PRODOM Database ProDom (Corpet et al., 2000) is a database of protein domain families automatically generated from SWISSPROT and TrEMBL sequence databases (Bairoch and Apweiler, 2000) using a novel procedure based on recursive PSI-BLAST searches (Altschul et al., 1997). Release 2001.2 of ProDom contains 283,772 domain families, 101,957 having at least 2 sequence members. ProDom-CG (Complete Genome) is a version of the ProDom database which holds genome-specific domain data.

  34. The PROSITE Database PROSITE is a database of protein families and domains. It consists of biologically significant sites, patterns and profiles that help to reliably identify to which known protein family (if any) a new sequence belongs PROSITE (Hofmann et al., 1999) is a good source of high quality annotation for protein domain families. A PROSITE sequence family is represented as a pattern or profile, providing a means of sensitive detection of common protein domains in new protein sequences. PROSITE release 16.46 contains signatures specific for 1,098 protein families or domains. Each of these signatures comes with documentation providing background information on the structure and function of these proteins.

  35. The PROSITE Database A PROSITE sequence family is represented as a pattern or a profile. A pattern is given as a regular expression (next slide) The generalised profiles used in PROSITE carry the same increased information as compared to classical profiles as Hidden Markov Models (HMMs).

  36. Regular expressions Alignment ADLGAVFALCDRYFQ SDVGPRSCFCERFYQ ADLGRTQNRCDRYYQ ADIGQPHSLCERYFQ Regular expression [AS]-D-[IVL]-G-x4-{PG}-C-[DE]-R-[FY]2-Q {PG} = not (P or G) For short sequence stretches, regular expressions are often more suitable to describe the information than alignments (or profiles)

  37. Regular expressions Regular expression No. of exact matches in DB D-A-V-I-D 71 D-A-V-I-[DENQ] 252 [DENQ]-A-V-I-[DENQ] 925 [DENQ]-A-[VLI]-I-[DENQ] 2739 [DENQ]-[AG]-[VLI]2-[DENQ] 51506 D-A-V-E 1088

  38. Rationale for regular expressions • “I want to see all sequences that ... • ... contain a C” --- C • ... contain a C or an F” -- [CF] • ... contain a C and an F” -- (C.*F | F.*C) (‘|’ means ‘or’ and ‘.*’ means don’t care for any length) • ... contain a C immediately followed by an F” -- CF • ... contain a C later followed by an F” -- C.*F • ... begin with a C” -- ^C (‘^’ means ‘starting with’) • ... do not contain a C” -- {C} • ... contain at least three Cs” -- C3- • ... contain exactly three Cs” -- C3 • ... has a C at the seventh position” -- .6C • ... either contain a C, an E, and an F in any order except CFE, unless there are also at most three Ps, or there is a ....

  39. Regex limitations • regex cannot remember indeterminate counts !!! • “I want to see all sequences with ... • ... six Cs followed by six Ts” • C6T6 • ... any number of Cs followed by any number of Ts” • C*T* • ... Cs followed by an equal number of Ts” (This cannot be done..) • CnTn • (CT|CCTT|CCCTTT|C4T4| ... )?

  40. The PFAM Database Pfam is a large collection of multiple sequence alignments and hidden Markov models covering many common protein domains and families. For each family in Pfam you can: • Look at multiple alignments • View protein domain architectures • Examine species distribution • Follow links to other databases • View known protein structures • Search with Hidden Markov Model (HMM) for each alignment

  41. The PFAM Database Pfam is a database of two parts, the first is the curated part of Pfam containing over 5193 protein families (Pfam-A). Pfam-A comprises manually crafted multiple alignments and profile-HMMs . To give Pfam a more comprehensive coverage of known proteins we automatically generate a supplement called Pfam-B. This contains a large number of small families taken from the PRODOM database that do not overlap with Pfam-A. Although of lower quality Pfam-B families can be useful when no Pfam-A families are found.

  42. The PFAM Database Sequence coverage Pfam-A : 73% (Gr) Sequence coverage Pfam-B : 20% (Bl) Other (Grey)

  43. A PFAM alignment CYB_TRYBB/1-197          M...LYKSG..EKRKG..LLMSGC.....LYR.....IYGVGFSLGFFIALQIIC..GVCLAWLFFSCFICSNWYFVLFLCYB_MARPO/1-208          M.ARRLSILKQPIFSTFNNHLIDY.....PTPSNISYWWGFGSLAGLCLVIQILTGVFLAMHYTPHVDLAFLSVEHIMR.CYB_HETFR/1-205          MATNIRKTH..PLLKIINHALVDL.....PAPSNISAWWNFGSLLVLCLAVQILTGLFLAMHYTADISLAFSSVIHICR.CYB_STELO/1-204          M.TNIRKTH..PLMKILNDAFIDL.....PTPSNISSWWNFGSLLGLCLIMQILTGLFLAMHYTPDTTTAFSSVAHICR.CYB_ASCSU/1-196          ...........MKLDFVNSMVVSL.....PSSKVLTYGWNFGSMLGMVLGFQILTGTFLAFYYSNDGALAFLSVQYIMY.CYB6_SPIOL/1-210         M.SKVYDWF..EERLEIQAIADDITSKYVPPHVNIFYCLGGITLT..CFLVQVATGFAMTFYYRPTVTDAFASVQYIMT.CYB6_MARPO/1-210         M.GKVYDWF..EERLEIQAIADDITSKYVPPHVNIFYCLGGITLT..CFLVQVATGFAMTFYYRPTVTEAFSSVQYIMT.CYB6_EUGGR/1-210         M.SRVYDWF..EERLEIQAIADDVSSKYVPPHVNIFYCLGGITFT..CFIIQVATGFAMTFYYRPTVTEAFLSVKYIMN.CYB_TRYBB/1-197          WDFDLGFVIRSVHICFTSLLYLLLYIHIFKSITLIILFDTH..IL....VWFIGFILFVFIIIIAFIGYVLPCTMMSYWGCYB_MARPO/1-208          .DVKGGWLLRYMHANGASMFFIVVYLHFFRGLY....YGSY..ASPRELVWCLGVVILLLMIVTAFIGYVLPWGQMSFWGCYB_HETFR/1-205          .DVNYGWLIRNIHANGASLFFICIYLHIARGLY....YGSY..LLKE..TWNIGVILLFLLMATAFVGYVLPWGQMSFWGCYB_STELO/1-204          .DVNYGWFIRYLHANGASMFFICLYAHMGRGLY....YGSY..MFQE..TWNIGVLLLLTVMATAFVGYVLPWGQMSFWGCYB_ASCSU/1-196          .EVNFGWIFRVLHFNGASLFFIFLYLHLFKGLF....FMSY..RLKK..VWVSGIVILLLVMMEAFMGYVLVWAQMSFWACYB6_SPIOL/1-210         .EVNFGWLIRSVHRWSASMMVLMMILHVFRVYL....TGGFKKPREL..TWVTGVVLGVLTASFGVTGYSLPWDQIGYWACYB6_MARPO/1-210         .EVNFGWLIRSVHRWSASMMVLMMILHIFRVYL....TGGFKKPREL..TWVTGVILAVLTVSFGVTGYSLPWDQIGYWACYB6_EUGGR/1-210         .EVNFGWLIRSIHRWSASMMVLMMILHVCRVYL....TGGFKKPREL..TWVTGIILAILTVSFGVTGYSLPWDQVGYWACYB_TRYBB/1-197          LTVFSNIIATVPILGIWLCYWIWGSEFINDFTLLKLHVLHV.LLPFILLIILILHLFCLHYFMCYB_MARPO/1-208          ATVITSLASAIPVVGDTIVTWLWGGFSVDNATLNRFFSLHY.LLPFIIAGASILHLAALHQYGCYB_HETFR/1-205          ATVITNLLSAFPYIGDTLVQWIWGGFSIDNATLTRFFAFHF.LLPFLIIALTMLHFLFLHETGCYB_STELO/1-204          ATVITNLLSAIPYIGTTLVEWIWGGFSVDKATLTRFFAFHF.ILPFIITALAAVHLLFLHETGCYB_ASCSU/1-196          SVVITSLLSVIPVWGFAIVTWIWSGFTVSSATLKFFFVLHF.LVPWGLLLLVLLHLVFLHETGCYB6_SPIOL/1-210         VKIVTGVPDAIPVIGSPLVELLRGSASVGQSTLTRFYSLHTFVLPLLTAVFMLMHFLMIRKQGCYB6_MARPO/1-210         VKIVTGVPEAIPIIGSPLVELLRGSVSVGQSTLTRFYSLHTFVLPLLTAIFMLMHFLMIRKQGCYB6_EUGGR/1-210         VKIVTGVPEAIPLIGNFIVELLRGSVSVGQSTLTRFYSLHTFVLPLLTATFMLGHFLMIRKQG The coloured markup was created by Jalview (Michele Clamp) Alignments are colored using the ClustalX scheme in Jalview (orange:glycine (G); yellow: Proline (P); blue: small and hydrophobic amino-acids (A, V, L, I, M, F, W); green: hydroxyl and amine amino-acids (S, T, N, Q); red: charged amino-acids (D, E, R, K); cyan: histidine (H) and tyrosine(Y)).

More Related