
Interoperation of Molecular Biology Databases

Interoperation of Molecular Biology Databases. Peter D. Karp, Ph.D. Bioinformatics Research Group SRI International Menlo Park, CA pkarp@ai.sri.com. Main Message. Interoperation of molecular-biology databases is a challenging problem of critical importance


Presentation Transcript


  1. Interoperation of Molecular Biology Databases
     Peter D. Karp, Ph.D.
     Bioinformatics Research Group, SRI International, Menlo Park, CA
     pkarp@ai.sri.com

  2. Main Message
     • Interoperation of molecular-biology databases is a challenging problem of critical importance
     • DOE should initiate a program in interoperation of molecular biology databases
     • Pursue both warehouse approach and multidatabase approach
     • Major progress possible within 5 years

  3. Motivations
     • Important biological problems require access to multiple bioinformatics databases
     • Different problems require different sets of databases
     • Hundreds of bioinformatics databases exist
         • Nucleic Acids Research 32:2004 – Database issue
         • Nucleic Acids Research DB list: http://www3.oup.co.uk/nar/database/a/
         • 350 databases listed in 2002
         • 560 databases listed in 2004
     • Applications of integration include
         • Complex queries
         • Comparison of overlapping sources
         • Data mining

  4. Bioinformatics Databases
     • Tremendous progress in point-and-click access for biologist users
     • Less progress toward providing a computable, interoperable infrastructure for large-scale data mining
     • Every large-scale mining/learning problem requires time-consuming crafting of input/training datasets

  5. Warehouse Approach vs. Multidatabase Approach
     • Multidatabase query approaches assume databases are in a queryable DBMS
     • Most sites that do operate DBMSs do not allow remote query access because of security and loading concerns
     • Users want to control data stability
     • Users want to control hardware applied to problem
     • Internet bandwidth limits query throughput
     • Users need to capture, integrate, and publish locally produced data of different types
     • Replicating and refreshing very large sources is expensive
     • Multidatabase and warehouse approaches are complementary

  6. SRI BioWarehouse Project Goal
     • Create a toolkit for constructing bioinformatics database warehouses that integrate sets of bioinformatics databases into one physical DBMS

  7. BioWarehouse Approach
     • Warehouse schema defines many bioinformatics datatypes
     • Create loaders for public bioinformatics DBs
         • Parse file format for the DB
         • Apply semantic transformations
         • Insert database into warehouse tables
     • Oracle and MySQL implementations
     • Warehouse query access mechanisms
         • SQL queries via JDBC, Lisp, Perl, ODBC, OAA
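
The three loader steps on this slide (parse, transform, insert) can be sketched roughly as follows. This is an illustrative sketch only, not the BioWarehouse code: it uses Python's built-in sqlite3 as a stand-in for the Oracle/MySQL back ends, the input is a simplified, ENZYME-style flat file made up for the example, and the `enzyme` table, its columns, and all function names are assumptions.

```python
import sqlite3

# Assumed, simplified flat-file input: two-letter line codes, '//' ends a record.
SAMPLE = """\
ID   1.1.1.1
DE   Alcohol dehydrogenase.
//
ID   1.1.1.2
DE   Alcohol dehydrogenase (NADP(+)).
//
"""

def parse_records(text):
    """Step 1: parse the file format into raw records (line code -> list of values)."""
    record = {}
    for line in text.splitlines():
        if line.startswith("//"):
            if record:
                yield record
            record = {}
        elif line.strip():
            code, _, value = line.partition(" ")
            record.setdefault(code, []).append(value.strip())

def transform(record):
    """Step 2: semantic transformation into the (hypothetical) warehouse vocabulary."""
    return {
        "ec_number": record["ID"][0],
        # Join continuation lines and drop the source format's trailing period.
        "name": " ".join(record.get("DE", [])).rstrip("."),
    }

def load(conn, text, dataset_name):
    """Step 3: insert the transformed rows into a hypothetical warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS enzyme (ec_number TEXT, name TEXT, dataset TEXT)"
    )
    rows = [
        (r["ec_number"], r["name"], dataset_name)
        for r in map(transform, parse_records(text))
    ]
    conn.executemany("INSERT INTO enzyme VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)

conn = sqlite3.connect(":memory:")
loaded = load(conn, SAMPLE, "ENZYME-sample")
rows = conn.execute("SELECT ec_number, name FROM enzyme ORDER BY ec_number").fetchall()
print(loaded, rows)
```

Keeping the parse and transform steps separate means a new source database would only need its own parser and mapping; the insert step and the query side stay plain SQL.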

  8. Warehouse Schema
     • Manages many bioinformatics datatypes simultaneously
         • Pathways, Reactions, Chemicals
         • Proteins, Genes, Replicons
         • Sequences, Sequence Features
         • Organisms, Taxonomic relationships
         • Computations (sequence matches)
         • Citations, Controlled vocabularies
         • Links to external databases
     • Each type of warehouse object implemented through one or more relational tables (currently 43)

  9. Warehouse Schema
     • Manages multiple datasets simultaneously
     • Dataset = Single version of a database
     • Allows version comparison
     • Multiple software tools or experiments require access to different versions
     • Each dataset is a warehouse entity
     • Every warehouse object is registered in a dataset
     • Different databases storing the same biological datatypes are coerced into same warehouse tables
     • Design of most datatypes inspired by multiple databases
     • Representational tricks to decrease schema bloat
         • Single space of primary keys
         • Single set of satellite tables such as for synonyms, citations, comments, etc.
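
The dataset and key-space ideas on this slide might look something like the sketch below. It is a minimal illustration under stated assumptions, not the actual BioWarehouse schema: sqlite3 stands in for Oracle/MySQL, and every table, column, and function name (`entry`, `wid`, `dataset`, `protein`, `synonym`, `other_wid`) as well as the sample accessions are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Single space of primary keys: every warehouse object gets its ID from one table.
CREATE TABLE entry   (wid INTEGER PRIMARY KEY AUTOINCREMENT, kind TEXT NOT NULL);
-- A dataset (one version of one database) is itself a warehouse entity.
CREATE TABLE dataset (wid INTEGER PRIMARY KEY REFERENCES entry(wid),
                      name TEXT NOT NULL, version TEXT NOT NULL);
-- Every object is registered in a dataset.
CREATE TABLE protein (wid INTEGER PRIMARY KEY REFERENCES entry(wid),
                      dataset_wid INTEGER NOT NULL REFERENCES dataset(wid),
                      accession TEXT NOT NULL, name TEXT);
-- One satellite table serves all object types, because IDs never collide.
CREATE TABLE synonym (other_wid INTEGER NOT NULL REFERENCES entry(wid),
                      syn TEXT NOT NULL);
""")

def new_wid(kind):
    """Allocate an ID from the shared key space."""
    return conn.execute("INSERT INTO entry (kind) VALUES (?)", (kind,)).lastrowid

def add_dataset(name, version):
    wid = new_wid("dataset")
    conn.execute("INSERT INTO dataset VALUES (?, ?, ?)", (wid, name, version))
    return wid

def add_protein(dataset_wid, accession, name, synonyms=()):
    wid = new_wid("protein")
    conn.execute("INSERT INTO protein VALUES (?, ?, ?, ?)",
                 (wid, dataset_wid, accession, name))
    conn.executemany("INSERT INTO synonym VALUES (?, ?)",
                     [(wid, s) for s in synonyms])
    return wid

# Two versions of the same (made-up) source database live side by side.
v1 = add_dataset("ExampleDB", "1.0")
v2 = add_dataset("ExampleDB", "2.0")
add_protein(v1, "ACC_A", "Example kinase")
add_protein(v2, "ACC_A", "Example kinase", synonyms=["EK1"])
add_protein(v2, "ACC_B", "Example phosphatase")

# Version comparison: accessions present in version 2.0 but not in 1.0.
new_in_v2 = [row[0] for row in conn.execute(
    """SELECT accession FROM protein
       WHERE dataset_wid = ?
         AND accession NOT IN (SELECT accession FROM protein WHERE dataset_wid = ?)
       ORDER BY accession""",
    (v2, v1))]
print(new_in_v2)
```

Because all IDs come from one table, the single `synonym` table can point at a protein, a gene, or any other object type without needing a separate synonym table per type.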

  10. Current Databases Supported by BioWarehouse
     • BioCyc
         • 15 genomes and metabolic networks
     • Swiss-Prot, TrEMBL
         • 1.3M proteins
     • ENZYME
     • KEGG
     • NCBI Taxonomy
     • CMR
         • 105 genomes, 250K genes, 250K proteins
     • Applications:
         • DARPA BioSpice program on biological simulation
         • Study of sequence coverage of known enzymes

  11. Summary
     • Interoperation of molecular-biology databases is a challenging problem of critical importance
     • DOE should initiate a program in interoperation of molecular biology databases
     • Pursue both warehouse approach and multidatabase approach
     • Major progress possible within 5 years
