
Guest Lecture Graduate level course MCB221b - Mechanistic Enzymology Tobias Kind – November 2007



Presentation Transcript


  1. Bio-Chemical databases Guest Lecture Graduate level course MCB221b - Mechanistic Enzymology Tobias Kind – November 2007 • Database concepts - what is a “good” database (DB) • How is data stored and queried and curated • Enzyme DBs, Protein and peptide DBs, small molecule DBs This document is hyperlinked (pictures and green text). To use WWW links in this PPT switch to slide show mode.

  2. Databases – very short primer (*) • Database interface – what you see • Database queries – what you ask the database • Database objects – where the data is stored (indexes and tables) • Database types – relational databases, object-oriented databases, flat-file DBs • Database brands – Oracle, MySQL, Apache Derby, IBM DB2, PostgreSQL, MS SQL Server • Database query language – how a database can be programmed (SQL) • Database dump file – the whole database in a single (*.dmp) file • Database ontology – the database vocabulary and the relationships it uses • Database semantics – capturing meaning by grammar or logical analysis (*) You can study this for several years and get a PhD in computer and database sciences.

  3. What is a good database? As in normal life, it's important to distinguish between good and evil. • A good DB: • allows multiple input queries • exports in multiple output formats • connects to other DBs • is curated (meaning checked for errors by humans or machines) • is regularly updated (daily to yearly) • costs money (your money or taxpayers' money) or time • allows bulk download (millions of data sets can be downloaded) • has open interfaces (APIs) for query requests • A bad DB: • allows only single requests (which have to be typed manually) • is not a database but just a list or a table • has no link-out and no link-in • allows no bulk download • is not curated • … Picture source: wikimedia.org

  4. Exchange formats – SBML, XML, BioPAX • XML format – general purpose data format (CML for storing chemical data); example for methane: <?xml version="1.0" ?> <molecule id="m1"> <atomArray> <atom id="a1" elementType="C" x2="-3.0333333015441895" y2="2.9166667461395264" /> </atomArray> <bondArray> </bondArray> </molecule> • BioPAX format – used for representing pathway data (data exchange format) • SBML format – represents models of biochemical reaction networks • SDF format – general purpose chemical structure format (small molecules) • RDF format (reaction data file) – format for storing chemical reactions (small molecules) • PDB format – general purpose chemical structure format (proteins)
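Because CML is plain XML, any XML parser can read it. A minimal sketch, assuming the methane snippet from the slide is held in a string (the function name `read_atoms` and the dictionary keys are made up for illustration):

```python
import xml.etree.ElementTree as ET

# The CML fragment shown on the slide (a single carbon atom, i.e. methane
# drawn without explicit hydrogens).
CML = """<?xml version="1.0" ?>
<molecule id="m1">
  <atomArray>
    <atom id="a1" elementType="C" x2="-3.0333333015441895" y2="2.9166667461395264" />
  </atomArray>
  <bondArray>
  </bondArray>
</molecule>"""


def read_atoms(cml_text):
    """Return one dict per <atom> element with its id, element and 2D coordinates."""
    root = ET.fromstring(cml_text)
    atoms = []
    for atom in root.iter("atom"):
        atoms.append({
            "id": atom.get("id"),
            "element": atom.get("elementType"),
            "x": float(atom.get("x2")),
            "y": float(atom.get("y2")),
        })
    return atoms


atoms = read_atoms(CML)
for a in atoms:
    print(a["id"], a["element"], a["x"], a["y"])
```

Real CML files usually declare an XML namespace, in which case the tag names passed to `iter` would need the namespace prefix; the slide's fragment has none.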

  5. SBML (Systems Biology Markup Language) Source: Akira Funahashi – Cell Designer Tutorial • List of programs that support SBML (more than 200) from sbml.org • List of curated and published SBML models (around 200) from the BioModels DB

  6. APIs, Mashups, SQL • Application programming interfaces (APIs) are important to connect and automate data exchange between local programs and databases. Example: NCBI SOAP or PubChem PUG (Power User Gateway) can be used to download certain data via the web to another service or to a local program • Mashups and integration services use new web technology (RDF, Yahoo Pipes) to combine data sources and create new knowledge or enhance usage • SQL is used for programming databases. Large database table (columns: yr, subject, winner): • 1901 Chemistry Jacobus H. van 't Hoff • 1902 Chemistry Emil Fischer • 1903 Chemistry Svante Arrhenius • 1904 Chemistry Sir William Ramsay • 1905 Chemistry Adolf von Baeyer • 1906 Chemistry Henri Moissan • 1907 Chemistry Eduard Buchner • 1908 Chemistry Ernest Rutherford • 1909 Chemistry Wilhelm Ostwald • 1910 Chemistry Otto Wallach • … SQL query: SELECT yr, subject, winner FROM nobel WHERE yr = 1909 AND subject = 'Chemistry' Result (columns: yr, subject, winner): 1909 Chemistry Wilhelm Ostwald. Visit the SQL Zoo
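The table, query and result from this slide can be tried locally. A minimal sketch using Python's built-in `sqlite3` module and an in-memory database (the choice of SQLite and the helper name `query_nobel` are assumptions for illustration; the slide does not name a database engine):

```python
import sqlite3

# Rows copied from the slide's example table (yr, subject, winner).
ROWS = [
    (1901, "Chemistry", "Jacobus H. van 't Hoff"),
    (1902, "Chemistry", "Emil Fischer"),
    (1903, "Chemistry", "Svante Arrhenius"),
    (1904, "Chemistry", "Sir William Ramsay"),
    (1905, "Chemistry", "Adolf von Baeyer"),
    (1906, "Chemistry", "Henri Moissan"),
    (1907, "Chemistry", "Eduard Buchner"),
    (1908, "Chemistry", "Ernest Rutherford"),
    (1909, "Chemistry", "Wilhelm Ostwald"),
    (1910, "Chemistry", "Otto Wallach"),
]


def query_nobel(year, subject):
    """Build the table in memory and run the slide's SELECT with parameters."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE nobel (yr INTEGER, subject TEXT, winner TEXT)")
        conn.executemany("INSERT INTO nobel VALUES (?, ?, ?)", ROWS)
        cursor = conn.execute(
            "SELECT yr, subject, winner FROM nobel WHERE yr = ? AND subject = ?",
            (year, subject),
        )
        return cursor.fetchall()
    finally:
        conn.close()


result = query_nobel(1909, "Chemistry")
print(result)
```

SQLite compares text case-sensitively with `=` by default, so the subject has to be spelled exactly as it is stored in the table; other database engines may behave differently.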

  7. Database front-ends (a good one) Enhanced NCI Database Browser Release 2 (CACTVS DB) • Small molecule DB with revolutionary web front-end (2001) • Multiple input and output (export) methods • Allows matching of molecule lists against the DB (as SMILES, CAS, NCI number) • Links to other services • Visualization modes (2D, 3D) • 20 different molecular output formats (SDF, CML, SMILES) • Export to other (computational) services • 30 different query modes

  8. Database visualization • Visualize complex networks; uses plug-in technology from different sources • Map your own compound data (proteins, genes, molecules) onto networks • Perform literature searches with enzymes, genes, small molecules Source: Cytoscape.org Start Cytoscape via Java Web Start

  9. Uber-portals (NCBI ENTREZ)

  10. Database and tools integration • Frameworks • Portals • Mashups Example: Gaggle Sources: Wikimedia; http://gaggle.systemsbiology.org/docs/geese/

  11. Gaggle – integration of tools and database services. Reference: Shannon PT, Reiss DJ, Bonneau R, Baliga NS. The Gaggle: an open-source software system for integrating bioinformatics software and data sources. BMC Bioinformatics. 2006 Mar 28;7:176. Picture source: Wikimedia. Use Gaggle

  12. Use or build your own local database. Example: LipidMaps DB with Instant-JChem • Download the whole LipidMaps DB (10,000 lipids) as an SDF file [LINK] • Use Instant-JChem as data DB, molecule DB, reaction DB [LINK] • Perform data and molecule queries on your laptop (PC, Linux, Mac) (…also works with KEGG/Biometa DB)
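A downloaded SDF file can also be inspected without any special software, since it is a text format: each record is a molfile followed by `> <FIELD>` data items and is terminated by a line containing `$$$$`. A minimal sketch, assuming a tiny two-record sample; the records, the field names `NAME` and `FORMULA`, and the function names are made up for illustration and are not taken from the LipidMaps download:

```python
# Toy SDF text with two records; a real download would be read from a file.
SAMPLE_SDF = """\
Methane
  sketch

  1  0  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
M  END
> <NAME>
methane

> <FORMULA>
CH4

$$$$
Water
  sketch

  1  0  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
M  END
> <NAME>
water

> <FORMULA>
H2O

$$$$
"""


def parse_record(lines):
    """Turn the lines of one SDF record into a title plus a dict of data fields."""
    record = {"title": lines[0].strip() if lines else "", "fields": {}}
    field = None
    for line in lines:
        if line.startswith(">"):
            # Data header such as "> <FORMULA>": take the text between < and >.
            start, end = line.find("<"), line.rfind(">")
            field = line[start + 1:end] if start != -1 and end > start else None
            if field:
                record["fields"][field] = []
        elif field is not None:
            if line.strip() == "":
                field = None  # a blank line ends the current data item
            else:
                record["fields"][field].append(line.strip())
    record["fields"] = {k: "\n".join(v) for k, v in record["fields"].items()}
    return record


def parse_sdf(text):
    """Split SDF text on the $$$$ delimiter lines and parse each record."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            records.append(parse_record(current))
            current = []
        else:
            current.append(line)
    return records


records = parse_sdf(SAMPLE_SDF)
for rec in records:
    print(rec["title"], rec["fields"])
```

This only reads the title line and the data fields; interpreting the atom and bond blocks (for structure searches) is what dedicated tools such as Instant-JChem are for.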

  13. Welcome to the (database) jungle! • ChemBioGrid – collection of most chemistry databases, current number ~ 156 • Pathguide.org – collection of pathway, enzyme, metabolite DBs, current number ~ 231 • Chemistry related (big players): PubChem, CAS (subscription), Beilstein (subscription), ChemSpider (fast growing) • Important for chemistry/metabolomics: spectral databases (NMR, mass spectral databases), compound property DBs • Pathway and enzyme related: KEGG, BRENDA, Reactome, ExPASy, MetaCyc

  14. Pathguide.org Pathguide is a meta-database: Comprehensive collection of pathway, small molecule, enzyme, protein interaction databases

  15. Enzyme and kinetics related databases • KDBI - Kinetic Data of Bio-molecular Interactions database: http://bidd.nus.edu.sg/group/kdbi/ • SABIO-RK - SABIO-Reaction Kinetics Database: http://sabio.villa-bosch.de/SABIORK/ • BRENDA - Comprehensive Enzyme Information System: http://www.brenda.uni-koeln.de/ • EMP - Enzymes and Metabolic Pathways Database: http://www.empproject.com/ • ENZYME - Enzyme nomenclature database (ExPASy): http://www.expasy.ch/enzyme/ • IntEnz - Integrated relational Enzyme database: http://www.ebi.ac.uk/intenz/index.html • TECR - Thermodynamics of Enzyme-Catalyzed Reactions: http://xpdb.nist.gov/enzyme_thermodynamics/ • REBASE - Restriction Enzyme Database: http://rebase.neb.com/ • Precise - Predicted and Consensus Interaction Sites in Enzymes: http://precise.bu.edu/ Source: Pathguide; own search

  16. PubChem • Most important small molecule DB • There was no large open chemistry DB until 10 years ago (!) • All records can be downloaded via FTP • All other small molecule DBs link to PubChem • PubChem Compounds (true chemicals) • PubChem Substances (formulations, mixtures) • Substructure search and multiple other options Go to PubChem
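Besides FTP bulk download, PubChem can be queried programmatically. The sketch below uses PUG REST, a URL-based interface that PubChem introduced after this 2007 lecture (it is not the PUG service named in the slides), so treat it as an assumption about today's service rather than part of the original material. The helper names and the example compound are made up for illustration, and the network call is defined but not executed here:

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Base address of PubChem's PUG REST service (newer than this lecture).
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def property_url(name, properties=("MolecularFormula", "TPSA")):
    """Build a PUG REST URL that looks up a compound by name and asks for properties."""
    return (
        f"{PUG_REST}/compound/name/{quote(name)}"
        f"/property/{','.join(properties)}/JSON"
    )


def fetch_properties(name, properties=("MolecularFormula", "TPSA")):
    """Fetch and decode the JSON answer (needs network access; not called here)."""
    with urlopen(property_url(name, properties), timeout=30) as response:
        return json.load(response)


url = property_url("adenosine triphosphate")
print(url)
```

Calling `fetch_properties("adenosine triphosphate")` on a machine with internet access would return the decoded JSON; please respect PubChem's usage policies when sending many requests.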

  17. CAS SciFinder • 33 million molecules and 60 million peptides/proteins • Largest reaction DB (14 million reactions) and literature DB • A must for chemists and biochemists/biologists • No bulk download, no good import/export, no link-outs • Only a proprietary Windows interface (no plug-ins) • No text mining (requires AnaVist) Download SciFinder

  18. BRENDA - Comprehensive Enzyme Information System

  19. BRENDA 3D model output with Jmol Example: BRENDA connection to the RCSB Protein Data Bank Visit BRENDA

  20. KEGG – Pathway DB • KEGG ID: C00002 (ATP) • KEGG pathway map ID: map00195 (Photosynthesis) • KEGG reaction ID: R05668 (ATP + NAD reaction) Visit KEGG

  21. Reactome – curated pathway maps Example: Skypainter, map your given KEGG IDs to pathways Visit Reactome

  22. Outlook for the database lesson • Curation, curation, curation (costs money) • Internalize the good DB / bad DB scheme and apply it whenever you enter a DB portal • Learn some basic database programming (Ruby on Rails, Java, SQL); using bioinformatics and chemoinformatics approaches is crucial for research • Learn how to import, store and handle database search results on your local computer (simple: parse the important data with regular expressions) • Don’t be overwhelmed by the database jungle; take some time to play around. In the end, automation and clever use of DB tools will drive innovation in your research • The multiple unique identifier problem (KEGG ID, PubChem ID, CAS number) and the biological naming problem still exist • The systems biology and chemistry database worlds still differ in terms of re-use. Most of the chemistry data published (including molecules) is not machine readable and hence can’t be automatically harvested by software robots.
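The advice to parse search results with regular expressions, and the multiple-identifier problem, can both be illustrated in a few lines. A minimal sketch, assuming free text that mentions KEGG compound IDs (a "C" followed by five digits, as in C00002) and CAS numbers; the sample text, the function names and the deliberately invalid number 12-34-5 are made up for illustration:

```python
import re

# KEGG compound IDs look like C00002; CAS numbers look like 56-65-5.
KEGG_COMPOUND = re.compile(r"\bC\d{5}\b")
CAS_NUMBER = re.compile(r"\b\d{2,7}-\d{2}-\d\b")


def cas_checksum_ok(cas):
    """Check the CAS check digit: weight the digits 1, 2, 3, ... from the right."""
    digits = cas.replace("-", "")
    body, check = digits[:-1], int(digits[-1])
    total = sum(i * int(d) for i, d in enumerate(reversed(body), start=1))
    return total % 10 == check


def extract_identifiers(text):
    """Collect KEGG compound IDs and checksum-valid CAS numbers from free text."""
    return {
        "kegg": KEGG_COMPOUND.findall(text),
        "cas": [c for c in CAS_NUMBER.findall(text) if cas_checksum_ok(c)],
    }


SAMPLE = "ATP (KEGG C00002, CAS 56-65-5) was measured; the entry 12-34-5 is a typo."
found = extract_identifiers(SAMPLE)
print(found)
```

The checksum step shows why pattern matching alone is not enough: 12-34-5 has the shape of a CAS number but fails the check digit, so it is dropped.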

  23. Reading list: databases • The Gaggle: An open-source software system for integrating bioinformatics software and data sources • Correcting ligands, metabolites, and pathways • Large-Scale Annotation of Small-Molecule Libraries Using Public Databases

  24. Homework for homework discussion III (30 min) • Find three bad or evil databases in the biochemistry/chemistry world; please give a reason for each in a short sentence. • Find the year in which most papers about “enzyme kinetics” were published using SciFinder (use Explore, enter the search term, then Analyze by year) • Find the molecules which were analyzed most in papers regarding “enzyme kinetics” and “crickets” using SciFinder (use Explore, then Analyze by CAS Number) • Find the price for 1 g ATP from Pfaltz & Bauer (in SciFinder use Locate Substance, then use the Erlenmeyer icon for price info) • Go to BRENDA and find out how many coronavirus types are in the DB (use TaxExplorer and query) • Go to BRENDA and find out how many enzymes are listed as resistant against perchloric acid; report the publication title (go to BRENDA, Advanced search) • Go to the KEGG Ligand DB and find the KEGG numbers for D-Hexose and ATP • Go to KEGG Reaction Prediction (e-zyme): how many similar reactions occur between D-Hexose and ATP? (Enter the above KEGG IDs, press view structures, then press compute) • Go to PubChem: what is the PubChem compound ID (CID) and the topological polar surface area for Tobias acid? Picture source: MS Office

  25. Pathways and enzymes: http://www.biocarta.com/pathfiles/h_etcPathway.asp# • SQL learning: http://sqlzoo.net/ • Databases: http://www.google.com/search?hl=en&q=enzyme+kinetics+database&btnG=Google+Search • SQL for biologists: “I’m a biologist Jim, not a programmer” • SQL for biologists: “SciView part 5: interview with Alexei Drummond” Thank you! Thanks to all Wikimedia.org contributors for pictures! Thanks to Dinesh Kumar (FiehnLab) for discussions.
