1 / 35

Integration of Sequence and 3D structure Databases

Integration of Sequence and 3D structure Databases. EMBL-Bank DNA sequences. Uniprot Protein Sequences. Array-Express Microarray Expression Data. EnsEMBL Human Genome Gene Annotation. EMSD Macromolecular Structure Data. Integration With Uniprot eFamily Project Future Plans.

mdecker
Download Presentation

Integration of Sequence and 3D structure Databases

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Integration of Sequence and 3D structure Databases

  2. EMBL-BankDNA sequences Uniprot Protein Sequences Array-Express Microarray Expression Data EnsEMBL Human Genome Gene Annotation EMSD Macromolecular Structure Data

  3. Integration With Uniprot eFamily Project Future Plans

  4. Integration With UniProt UniProt (Universal Protein Resource) is the world's most comprehensive catalogue of information on proteins. It is a central repository of protein sequence and function created by joining the information contained in Swiss-Prot, TrEMBL, and PIR. http://www.ebi.ac.uk/uniprot/index.html

  5. Services MSD / UniProt: Two Different Database Systems UniProt MSD Agreed common mechanism for exchange of information. Services “One of the major benefits of using databases for data storage is for data sharing”

  6. MSD/Uniprot Collaboration • Collaboration between MSD (Sameer Velankar, Phil McNeil) and UniProt (Virginie Mittard, Daniel Barrell) groups • Depends upon • Clean UniProt (UNP) cross references in the DBREF records for each chain (where possible) • Clean taxonomy ids for each PDB chain • Taxonomy for PDB Source and UniProt OS must be the same

  7. Cleanup of the DBREF records in the PDB entries • Cleanup of the UniProt cross references in PDB entries • Cleanup of Source Information • NCBI Taxonomy IDs • Cleanup of the Reference information • Update UniProt entries • Source, Reference, Secondary structure information • Supply Additional Information • revision date, experimental method, resolution, R-factor • Residue-by-residue mapping between MSD and UniProt enables chimaeras to be handled correctly

  8. Sequence Schema

  9. PDB CHAIN UNP SERIAL PDB_RES PDB_SEQ UNP_RES UNP_RES ANNOTATION 1HG1 A P06608 1 ALA 22 A NOT OBSERVED 1HG1 A P06608 2 ASP 23 D NOT OBSERVED 1HG1 A P06608 3 LYS 24 K NOT OBSERVED 1HG1 A P06608 4 4 LEU 25 L 1HG1 A P06608 5 5 PRO 26 P 1HG1 A P06608 6 6 ASN 27 N 1HG1 A P06608 7 7 ILE 28 I 1HG1 A P06608 8 8 VAL 29 V 1HG1 A P06608 9 9 ILE 30 I 1HG1 A P06608 10 10 LEU 31 L 1HG1 A P06608 11 11 ALA 32 A Residue by Residue Mapping to UniProt

  10. Display of Mappings

  11. Integration With IntEnz IntEnz is the name for the Integrated relational Enzyme database and is the most up-to-date version of the Enzyme Nomenclature. The IntEnz relational database implemented and supported by the EBI is the master copy of the Enzyme Nomenclature data. MSD uses the UniProt accession code(s) mapped to each chain to link to the IntEnz EC number This done directly via the MSD and IntEnz Oracle relational databases http://www.ebi.ac.uk/intenz/index.html

  12. eFamily http://www.efamily.org.uk/ The eFamily project is designed to integrate the information contained in five of the major protein databases.

  13. eFamily Core Activities • To integrate the information contained in the five major protein databases. • The member databases (CATH, SCOP, MSD, Interpro, and Pfam) contain information describing protein domains. • For SCOP, CATH and MSD the data is primarily concerned with 3D structures • In InterPro and Pfam the focus is mainly on the sequences. • It is often difficult for biologists to navigate from protein sequence to protein structure and back again. • eFamilyaimsto provide the scientific community with a coherent and rich view of protein families that allow users seamlessly to navigate between the worlds of protein structure and protein sequence, by improved data resources and integration via grid technologies.

  14. DATA INTEGRATION Common Domains definition HMM prediction UniProt CATH SCOP GO GO Mapping & curation Mapping per residue Curated MSD mapping Residues/Sequence Mapping start – end Curated PROSITE Curated InterPro Pfam GO Curated

  15. Complexity of Mappings InterPro-UniProt(s) • An InterPro entry is a collection of one or more UniProt entries • Unlike PDB concept of CHAIN does not exist in UniProt • UniProt entry is always numbered from 1 to N • PDB SEQRES Residue numbering is from 1 to N • PDB CHAIN (ATOM Records) Residue numbering is not necessarily 1 to N • UniProt to PDB Mapping can be one to many • PDB CHAIN to UniProt Mapping can be one to many UniProt-PDBCHAIN(S) CATH/SCOP DOMAIN PDBCHAIN(S) CATH/SCOP DOMAIN UniProt InterPro-CATH/SCOP

  16. Swiss-Prot Residue Range PDB Residue Range Chains MSD-SCOP Mapping for 1cbw SCOP Domain

  17. Swiss-Prot Residue Range Chains PDB Residue Range MSD-CATH Mapping for 1cbw CATH Domains

  18. Swiss-Prot Residue Range Chains PDB Residue Range Pfam Domain MSD-Pfam Mapping for 1cbw

  19. Practical Applications of Database Integration

  20. Mappings Used in Pfam Pfam now uses UniProt to structure mapping from MSD Search Database Saves duplication of effort and weeks of compute Use mapping for annotation of alignments Pfam domains highlighted on structure of RuBisCo (8ruc)

  21. Mappings Used in Interpro

  22. Mappings Used in SCOP

  23. Comparison of SCOP, CATH and Pfam Domains SCOP, CATH and Pfam have developed web-services for describing their particular domain families. These services can be queried with a protein identifier, protein accession or PDB identifier. The databases use the MSD/UniProt mapping to translate between the sequence and structure domains

  24. XML & Web Services The eFamily project has developed a XML schema to describe: • Domains • Annotation • Sequence Alignments • Structure Alignments This will be used to provide web-services as part of the eFamily project. More information about the XML schema is available at -http://www.efamily.org.uk/xml/efamily/documentation/efamily.shtml We are also developing a perl based API for the eFamily XML which will be available from eFamily site as well as via bio-perl. The MSD residue-by-residue mapping is made available in XML format based on the eFamily schema.

  25. Mapping Annotation

  26. Future Plans

  27. Integration of IntAct database- IntAct provides a freely available, open source database system and analysis tools for protein interaction data. http://www.ebi.ac.uk/intact/index.jsp

  28. Residue Mapping Program 1 • Makes use of cleaned-up cross-reference & taxonomy data, SEQRES and ATOM/HETATM records from the PDB and the sequence from the UniProt entry to align and map each residue. • Makes connected segments from the PDB ATOM/HETATM records for each chain • These are then aligned against the SEQRES records and all the alignments for the segments are merged to get the SEQRES-ATOM alignment • This enables any unobserved residues to be considered

  29. Residue Mapping Program 2 • A similar operation is performed on the UniProt sequence and connected segments from the ATOM/HETATM records to get the UNP-ATOM alignment • The SEQRES-ATOM and UNP-ATOM alignments are then merged to get the final alignment • This is repeated for each chain in the PDB archive (with a UNP cross-reference • The mapping is loaded into the MSD relational database and validated

  30. Integrating data from MSD into CATH • Protocols have been developed for regular imports of a subset of MSD data warehouse into a local CATH database set up in ORACLE 9i • For example, information on the biological unit and on protein-ligand interactions will be integrated to increase functional annotations for CATH domain families

  31. MSD & CATH Data Exchange • Two step process of data synchronisation • Data are moved from the MSD search database to the CATH-UCL site using a combination of Oracle Export/Import and SQL*Loader utilities • Subsequent updates in the MSD database are pushed to theCATH site using an incremental replication mechanism. • Data from the CATH site are pushed to the MSD site, using the same two step process • The two databases are synchronised

  32. Structure SCOP CATH Sequence UniProt (neé Swiss-Prot /Trembl/PIR), InterPro, Go, Pfam Function IntEnz Literature Medline

More Related