Bio-Chemical databases. Guest Lecture Graduate level course MCB221b - Mechanistic Enzymology Tobias Kind – November 2007. Database concepts - what is a “good” database (DB) How is data stored and queried and curated Enzyme DBs, Protein and peptide DBs, small molecule DBs.
Bio-Chemical databases
Guest Lecture, Graduate level course MCB221b - Mechanistic Enzymology
Tobias Kind – November 2007
• Database concepts - what is a “good” database (DB)
• How is data stored and queried and curated
• Enzyme DBs, Protein and peptide DBs, small molecule DBs
This document is hyperlinked (pictures and green text). To use WWW links in this PPT, switch to slide show mode.
Databases – very short primer (*)
• Database interface – what you see
• Database queries – what you ask the database
• Database objects – where the data is stored (indexes and tables)
• Database types – relational databases, object-oriented databases, flat file DBs
• Database brands – Oracle, MySQL, Apache Derby, IBM DB2, PostgreSQL, MS SQL Server
• Database query language – how a database can be programmed (SQL)
• Database dump file – the whole database in a single (*.dmp) file
• Database ontology – the database vocabulary and the relationships it uses
• Database semantics – capturing meaning by grammar or logical analysis
(*) You can study this for several years and get a PhD in computer and database sciences.
What is a good database?
As in normal life, it's important to distinguish between good and evil.
• Good DB:
  • allows multiple input queries
  • exports in multiple output formats
  • connects to other DBs
  • is curated (meaning checked for errors by humans or machines)
  • is regularly updated (daily, yearly)
  • costs money (your money or taxpayers' money) or time
  • allows bulk download (millions of data sets can be downloaded)
  • has open interfaces (APIs) for query requests
• Bad DB:
  • allows only single requests (which have to be typed manually)
  • is not a database but just a list or table
  • has no link-out and no link-in
  • allows no bulk download
  • is not curated
  • …
Source: wikimedia.org
Exchange formats – SBML, XML, BioPAX
XML format – general purpose data format (CML for storing chemical data). Example (methane):
<?xml version="1.0" ?>
<molecule id="m1">
  <atomArray>
    <atom id="a1" elementType="C" x2="-3.0333333015441895" y2="2.9166667461395264" />
  </atomArray>
  <bondArray>
  </bondArray>
</molecule>
BioPAX format – used for representing pathway data (data exchange format)
SBML format – representing models of biochemical reaction networks
SDF format – general purpose chemical structure format (small molecules)
RDF format – format for storing chemical reactions (small molecules)
PDB format – general purpose chemical structure format (proteins)
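The CML snippet on this slide can be read with any XML parser. The following is a minimal sketch using Python's standard library; the function name read_atoms and the returned dictionary layout are illustrative assumptions. Real CML files usually declare an XML namespace, which this sketch does not handle.

```python
import xml.etree.ElementTree as ET

# The methane example from the slide (no XML namespace declared).
CML = """<?xml version="1.0" ?>
<molecule id="m1">
  <atomArray>
    <atom id="a1" elementType="C" x2="-3.0333333015441895" y2="2.9166667461395264" />
  </atomArray>
  <bondArray>
  </bondArray>
</molecule>"""


def read_atoms(cml_text):
    """Return the molecule id and a list of atoms with 2D coordinates."""
    root = ET.fromstring(cml_text)
    atoms = []
    for atom in root.iter("atom"):
        atoms.append({
            "id": atom.get("id"),
            "element": atom.get("elementType"),
            "x": float(atom.get("x2")),
            "y": float(atom.get("y2")),
        })
    return root.get("id"), atoms


if __name__ == "__main__":
    molecule_id, atoms = read_atoms(CML)
    print(molecule_id, atoms)
```

Because the data is tagged, a program can pick out the element type and coordinates by name instead of guessing column positions, which is the main advantage of XML-based exchange formats.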
SBML (Systems Biology Markup Language) Source: Akira Funahashi – Cell Designer Tutorial • List of supported SBML programs (more than 200) from sbml.org • List of curated and published SBML models (around 200) from biomodels DB
APIs, Mashups, SQL
• Application programming interfaces (APIs) connect local programs to databases and automate the data exchange between them. Example: NCBI SOAP or PubChem PUG (Power User Gateway) can be used to download certain data via the web to another service or to a local program
• Mashups and integration services use new web technology (RDF, Yahoo Pipes) to combine data sources and create new knowledge or enhance usage
• SQL is used for programming databases
Large database table:
yr    subject    winner
1901  Chemistry  Jacobus H. van 't Hoff
1902  Chemistry  Emil Fischer
1903  Chemistry  Svante Arrhenius
1904  Chemistry  Sir William Ramsay
1905  Chemistry  Adolf von Baeyer
1906  Chemistry  Henri Moissan
1907  Chemistry  Eduard Buchner
1908  Chemistry  Ernest Rutherford
1909  Chemistry  Wilhelm Ostwald
1910  Chemistry  Otto Wallach
…
SQL query:
SELECT yr, subject, winner FROM nobel WHERE yr = 1909 AND subject = 'Chemistry'
Result:
yr    subject    winner
1909  Chemistry  Wilhelm Ostwald
Visit the SQL Zoo
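The query on this slide can be tried locally without installing a database server. The following is a minimal sketch using SQLite through Python's standard sqlite3 module; the table is rebuilt in memory from the rows shown on the slide, and the helper names build_nobel_db and winners are illustrative assumptions.

```python
import sqlite3

# Rows copied from the slide's example table.
ROWS = [
    (1901, "Chemistry", "Jacobus H. van 't Hoff"),
    (1902, "Chemistry", "Emil Fischer"),
    (1903, "Chemistry", "Svante Arrhenius"),
    (1904, "Chemistry", "Sir William Ramsay"),
    (1905, "Chemistry", "Adolf von Baeyer"),
    (1906, "Chemistry", "Henri Moissan"),
    (1907, "Chemistry", "Eduard Buchner"),
    (1908, "Chemistry", "Ernest Rutherford"),
    (1909, "Chemistry", "Wilhelm Ostwald"),
    (1910, "Chemistry", "Otto Wallach"),
]


def build_nobel_db():
    """Create an in-memory database holding the example table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE nobel (yr INTEGER, subject TEXT, winner TEXT)")
    conn.executemany("INSERT INTO nobel VALUES (?, ?, ?)", ROWS)
    conn.commit()
    return conn


def winners(conn, year, subject):
    """Run the slide's SELECT with the values passed as parameters."""
    cursor = conn.execute(
        "SELECT yr, subject, winner FROM nobel WHERE yr = ? AND subject = ?",
        (year, subject),
    )
    return cursor.fetchall()


if __name__ == "__main__":
    conn = build_nobel_db()
    print(winners(conn, 1909, "Chemistry"))
```

SQLite compares text case-sensitively by default, so querying for 'chemistry' in lower case returns no rows here; other database systems may behave differently depending on their collation settings.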
Database front-ends (a good one)
Enhanced NCI Database Browser Release 2 (CACTVS DB)
• Small molecule DB with revolutionary web front-end (2001)
• Multiple input and output (export) methods
• Allows matching of molecule lists against DB (as SMILES, CAS, NCI number)
• Links to other services
• Visualization modes (2D, 3D)
• 20 different molecular output formats (SDF, CML, SMILES)
• Export to various other (computational) services
• 30 different query modes
Database visualization
• Visualize complex networks; uses plug-in technology from different sources
• Map your own compound data (proteins, genes, molecules) onto networks
• Perform literature searches with enzymes, genes, small molecules
Source: Cytoscape.org
Start Cytoscape via Java Web Start
Database and tools integration Gaggle Source: WIKIMEDIA • Frameworks • Portals • Mashups Source: http://gaggle.systemsbiology.org/docs/geese/
Gaggle
Integration of tools and database services
Source: WIKIMEDIA
Shannon PT, Reiss DJ, Bonneau R, Baliga NS. The Gaggle: an open-source software system for integrating bioinformatics software and data sources. BMC Bioinformatics. 2006 Mar 28;7:176.
Use Gaggle
Use or build your own local database
Example: LipidMaps DB with Instant-JChem
• Download the whole LipidMaps DB (10,000 lipids) as an SDF file [LINK]
• Use Instant-JChem as data DB, molecule DB, reaction DB [LINK]
• Perform data and molecule queries on your laptop (PC, Linux, Mac)
(…also works with KEGG/Biometa DB)
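A downloaded SDF file can also be inspected with a few lines of code before loading it into a tool such as Instant-JChem. The following is a minimal sketch of an SDF reader in plain Python. It assumes the usual SDF layout: a molfile block ending in "M  END", data items introduced by "> <FIELD>" lines, and records separated by "$$$$". The two sample records and their field names (NAME, FORMULA) are invented placeholders, not LipidMaps data.

```python
# Two made-up records, used only to exercise the parser.
SAMPLE = """methane
  sketch

  1  0  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
M  END
> <NAME>
methane

> <FORMULA>
CH4

$$$$
water
  sketch

  1  0  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
M  END
> <NAME>
water

> <FORMULA>
H2O

$$$$
"""


def _parse_record(lines):
    """Turn the lines of one SDF record into a title and a dict of data fields."""
    title = lines[0].strip() if lines else ""
    fields = {}
    name = None
    past_molblock = False
    for line in lines:
        if not past_molblock:
            # Skip the structure block; data items start after "M  END".
            past_molblock = line.startswith("M  END")
            continue
        if line.startswith(">"):
            start, end = line.find("<"), line.rfind(">")
            name = line[start + 1:end] if 0 <= start < end else None
            if name is not None:
                fields[name] = []
        elif name is not None:
            if line.strip():
                fields[name].append(line)
            else:
                name = None  # a blank line ends the current data item
    return {"title": title, "fields": {k: "\n".join(v) for k, v in fields.items()}}


def parse_sdf(text):
    """Split SDF text on '$$$$' separator lines and parse each record."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            if current:
                records.append(_parse_record(current))
            current = []
        else:
            current.append(line)
    if any(line.strip() for line in current):
        records.append(_parse_record(current))
    return records


if __name__ == "__main__":
    for record in parse_sdf(SAMPLE):
        print(record["title"], record["fields"])
```

The sketch ignores the structure block itself; reading atoms and bonds is better left to a cheminformatics toolkit. For simple data queries, such as filtering records by a field value, the parsed dictionaries are already enough.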
Welcome to the (database) jungle!
• ChemBioGrid – collection of most chemistry databases (current number ~156)
• Pathguide.org – collection of pathway, enzyme, metabolite DBs (current number ~231)
• Chemistry related (big players): PubChem, CAS (subscription), Beilstein (subscription), ChemSpider (fast growing)
• Important for chemistry/metabolomics: spectral databases (NMR, mass spectral databases), compound property DBs
• Pathway, enzyme related: KEGG, BRENDA, Reactome, ExPASy, MetaCyc
Pathguide.org Pathguide is a meta-database: Comprehensive collection of pathway, small molecule, enzyme, protein interaction databases
Enzyme and kinetics related databases
• KDBI - Kinetic Data of Bio-molecular Interactions database: http://bidd.nus.edu.sg/group/kdbi/
• SABIO-RK - SABIO-Reaction Kinetics Database: http://sabio.villa-bosch.de/SABIORK/
• BRENDA - Comprehensive Enzyme Information System: http://www.brenda.uni-koeln.de/
• EMP - Enzymes and Metabolic Pathways Database: http://www.empproject.com/
• ENZYME - Enzyme nomenclature database (ExPASy): http://www.expasy.ch/enzyme/
• IntEnz - Integrated relational Enzyme database: http://www.ebi.ac.uk/intenz/index.html
• TECR - Thermodynamics of Enzyme-Catalyzed Reactions: http://xpdb.nist.gov/enzyme_thermodynamics/
• REBASE - Restriction Enzyme Database: http://rebase.neb.com/
• Precise - Predicted and Consensus Interaction Sites in Enzymes: http://precise.bu.edu/
Source: Pathguide; own search
PubChem
• Most important small molecule DB
• There was no large open chemistry DB until 10 years ago (!)
• All records can be downloaded via FTP
• All other small molecule DBs link to PubChem
• PubChem Compounds (true chemicals)
• PubChem Substances (formulations, mixtures)
• Substructure search and multiple other options
Go to PubChem
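Bulk FTP download suits whole-database work; for single lookups a web API is more convenient. The following is a minimal sketch that uses PubChem's PUG REST interface, which postdates this lecture. It assumes the URL scheme https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/<name>/property/<properties>/JSON and the property names MolecularFormula and TPSA; the helper names property_url and fetch_properties are illustrative. Check the current PubChem documentation before relying on it.

```python
import json
import urllib.parse
import urllib.request

# Assumed base address of the PUG REST service.
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def property_url(name, properties=("MolecularFormula", "TPSA")):
    """Build a PUG REST URL that looks up properties by compound name."""
    quoted = urllib.parse.quote(name, safe="")
    return f"{PUG_REST}/compound/name/{quoted}/property/{','.join(properties)}/JSON"


def fetch_properties(name, properties=("MolecularFormula", "TPSA")):
    """Send the request and return the decoded JSON (needs network access)."""
    with urllib.request.urlopen(property_url(name, properties), timeout=30) as response:
        return json.load(response)


if __name__ == "__main__":
    # Only prints the URL; call fetch_properties("Tobias acid") to query PubChem.
    print(property_url("Tobias acid"))
```

Keeping the URL construction separate from the network call makes the request easy to inspect in a browser first, and easy to batch later for lists of compound names.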
CAS SciFinder
• 33 million molecules and 60 million peptides/proteins
• Largest reaction DB (14 million reactions) and literature DB
• A must for chemists and biochemists/biologists
• No bulk download, no good import/export, no link-outs
• Only a proprietary Windows interface (no plug-ins)
• No text mining (requires ANAVIST)
Download SciFinder
BRENDA
Example: BRENDA connection to the RCSB Protein Data Bank, with 3D model output in Jmol
Visit BRENDA
KEGG – Pathway DB
• KEGG ID: C00002 (ATP)
• KEGG pathway map ID: map00195 (Photosynthesis)
• KEGG reaction ID: R05668 (ATP + NAD reaction)
Visit KEGG
Reactome – curated pathway maps Example: Skypainter, map your given KEGG IDs to pathways Visit Reactome
Outlook for the database lesson
• Curation, curation, curation (costs money)
• Internalize the good DB and bad DB scheme and apply it when you enter a DB portal
• Learn some basic database programming (Ruby on Rails, Java, SQL); using bioinformatics and cheminformatics approaches is crucial for research
• Learn how to import, store and handle database search results on your local computer (simple: parse important data with regular expressions)
• Don’t be overwhelmed by the database jungle; take some time to play around. In the end, automation and clever use of DB tools will innovate your research
• The multiple unique identifier problem (KEGG ID, PubChem ID, CAS number) and the biological naming problem still exist
• The systems biology and chemistry database worlds still differ in terms of re-use. Most of the chemistry data published (including molecules) is not machine readable, hence can’t be automatically harvested by software robots.
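The advice to parse search results with regular expressions can be sketched as follows. The patterns assume that KEGG compound IDs are "C" plus five digits, reaction IDs "R" plus five digits and pathway maps "map" plus five digits (the shapes of the IDs used in this lecture), and that CAS numbers have the form digits-two digits-check digit. The sample text and the function names are illustrative.

```python
import re

# Assumed identifier shapes; adjust if a database uses a different scheme.
PATTERNS = {
    "kegg_compound": re.compile(r"\bC\d{5}\b"),
    "kegg_reaction": re.compile(r"\bR\d{5}\b"),
    "kegg_map": re.compile(r"\bmap\d{5}\b"),
    "cas_number": re.compile(r"\b\d{2,7}-\d{2}-\d\b"),
}


def extract_ids(text):
    """Collect the unique identifiers of each kind found in free text."""
    return {kind: sorted(set(pattern.findall(text))) for kind, pattern in PATTERNS.items()}


def cas_checksum_ok(cas):
    """Validate a CAS number: the weighted digit sum modulo 10 equals the check digit."""
    digits = cas.replace("-", "")
    body, check = digits[:-1], int(digits[-1])
    total = sum(int(d) * weight for weight, d in enumerate(reversed(body), start=1))
    return total % 10 == check


if __name__ == "__main__":
    sample = "Search hits: C00002 (ATP, CAS 56-65-5), R05668, map00195, C00002 again"
    found = extract_ids(sample)
    print(found)
    print([cas for cas in found["cas_number"] if cas_checksum_ok(cas)])
```

Validating the CAS check digit filters out many false matches, such as date-like strings, which a pattern alone cannot distinguish from real registry numbers.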
Reading list: databases
• The Gaggle: An open-source software system for integrating bioinformatics software and data sources
• Correcting ligands, metabolites, and pathways
• Large-Scale Annotation of Small-Molecule Libraries Using Public Databases
Homework for homework discussion III (30 min)
• Find three bad or evil databases in the biochemistry/chemistry world; please give a reason for each in a short sentence.
• Find the year in which most papers about “enzyme kinetics” were published using SciFinder (use Explore, enter the search term, then Analyze year)
• Find the molecules which were analyzed most in papers regarding “enzyme kinetics” and “crickets” using SciFinder (use Explore, then Analyze CAS Number)
• Find the price for 1 g ATP from Pfaltz & Bauer (in SciFinder use locate substance, then use the Erlenmeyer icon for price info)
• Go to Brenda and find out how many coronavirus types are in the DB (use TaxExplorer and query)
• Go to Brenda and find out how many enzymes are listed as resistant against perchloric acid; report the publication title (go to Brenda, Advanced search)
• Go to KEGG Ligand DB and find the KEGG numbers for D-Hexose and ATP
• Go to KEGG Reaction Prediction (e-zyme): how many similar reactions occur between D-Hexose and ATP? (Enter the above KEGG IDs, press view structures, press compute)
• Go to PubChem: what is the PubChem compound ID (CID) and the topological polar surface area for Tobias acid?
Pathways and enzymes: http://www.biocarta.com/pathfiles/h_etcPathway.asp#
SQL learning: http://sqlzoo.net/
Databases: http://www.google.com/search?hl=en&q=enzyme+kinetics+database&btnG=Google+Search
SQL biologists: I’m a biologist Jim, not a programmer
SQL biologists: SciView part 5: interview with Alexei Drummond
Thank you!
Thanks to all Wikimedia.org contributors for pictures!
Thanks to Dinesh Kumar (FiehnLab) for discussions.