Interoperation of Molecular Biology Databases
Peter D. Karp, Ph.D.
Bioinformatics Research Group, SRI International, Menlo Park, CA
pkarp@ai.sri.com
Main Message
• Interoperation of molecular-biology databases is a challenging problem of critical importance
• DOE should initiate a program in interoperation of molecular biology databases
• Pursue both warehouse approach and multidatabase approach
• Major progress possible within 5 years
Motivations
• Important biological problems require access to multiple bioinformatics databases
• Different problems require different sets of databases
• Hundreds of bioinformatics databases exist
  • Nucleic Acids Research 32:2004 (Database issue)
  • Nucleic Acids Research DB list: http://www3.oup.co.uk/nar/database/a/
  • 350 databases listed in 2002
  • 560 databases listed in 2004
• Applications of integration include
  • Complex queries
  • Comparison of overlapping sources
  • Data mining
Bioinformatics Databases
• Tremendous progress in point-and-click access for biologist users
• Less progress toward providing a computable, interoperable infrastructure for large-scale data mining
• Every large-scale mining/learning problem requires time-consuming crafting of input/training datasets
Warehouse Approach vs. Multidatabase Approach
• Multidatabase query approaches assume databases are in a queryable DBMS
• Most sites that do operate DBMSs do not allow remote query access because of security and loading concerns
• Users want to control data stability
• Users want to control hardware applied to problem
• Internet bandwidth limits query throughput
• Users need to capture, integrate and publish locally produced data of different types
• Replicating and refreshing very large sources is expensive
• Multidatabase and warehouse approaches are complementary
SRI BioWarehouse Project Goal
• Create a toolkit for constructing bioinformatics database warehouses that integrate sets of bioinformatics databases into one physical DBMS
BioWarehouse Approach
• Warehouse schema defines many bioinformatics datatypes
• Create loaders for public bioinformatics DBs
  • Parse file format for the DB
  • Apply semantic transformations
  • Insert database into warehouse tables
• Oracle and MySQL implementations
• Warehouse query access mechanisms
  • SQL queries via JDBC, Lisp, Perl, ODBC, OAA
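The three loader steps above (parse, transform, insert) can be sketched roughly as follows. This is an illustrative sketch only, not BioWarehouse code: the `ID`/`DE`/`SY` flat-file format, the `protein` and `synonym` tables, and every function name are hypothetical, and Python's built-in SQLite stands in for the Oracle or MySQL warehouse.

```python
import sqlite3

# Hypothetical flat-file records in a made-up line-coded format; each real
# source database has its own format and needs its own parser.
SAMPLE = """\
ID   P001
DE   alcohol dehydrogenase
SY   ADH; Alcohol DH
//
ID   P002
DE   Hexokinase
SY   HK
//
"""

def parse_records(text):
    """Step 1: parse the file format into one dict per record."""
    record = {}
    for line in text.splitlines():
        if line.startswith("//"):          # record terminator
            if record:
                yield record
            record = {}
        elif line.strip():
            record[line[:2]] = line[2:].strip()

def transform(record):
    """Step 2: map source fields onto the (assumed) warehouse columns."""
    synonyms = [s.strip() for s in record.get("SY", "").split(";") if s.strip()]
    return {"accession": record["ID"], "name": record["DE"], "synonyms": synonyms}

def load(conn, dataset, text):
    """Step 3: insert the transformed records into warehouse tables."""
    conn.execute("CREATE TABLE IF NOT EXISTS protein "
                 "(accession TEXT, name TEXT, dataset TEXT)")
    conn.execute("CREATE TABLE IF NOT EXISTS synonym (accession TEXT, syn TEXT)")
    for rec in map(transform, parse_records(text)):
        conn.execute("INSERT INTO protein VALUES (?, ?, ?)",
                     (rec["accession"], rec["name"], dataset))
        conn.executemany("INSERT INTO synonym VALUES (?, ?)",
                         [(rec["accession"], s) for s in rec["synonyms"]])
    conn.commit()

conn = sqlite3.connect(":memory:")         # stand-in for the warehouse DBMS
load(conn, "toy-db-v1", SAMPLE)

# Once loaded, the data is reachable with ordinary SQL.
rows = conn.execute(
    "SELECT p.name, COUNT(s.syn) FROM protein p "
    "LEFT JOIN synonym s ON s.accession = p.accession "
    "GROUP BY p.accession, p.name ORDER BY p.accession").fetchall()
```

Here `rows` holds one (name, synonym count) pair per loaded protein. A loader for a second source would reuse `load` and the same tables, replacing only the parser and the field mapping.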
Warehouse Schema
• Manages many bioinformatics datatypes simultaneously
  • Pathways, Reactions, Chemicals
  • Proteins, Genes, Replicons
  • Sequences, Sequence Features
  • Organisms, Taxonomic relationships
  • Computations (sequence matches)
  • Citations, Controlled vocabularies
  • Links to external databases
• Each type of warehouse object implemented through one or more relational tables (currently 43)
Warehouse Schema
• Manages multiple datasets simultaneously
  • Dataset = Single version of a database
  • Allows version comparison
  • Multiple software tools or experiments require access to different versions
• Each dataset is a warehouse entity
• Every warehouse object is registered in a dataset
• Different databases storing the same biological datatypes are coerced into same warehouse tables
  • Design of most datatypes inspired by multiple databases
• Representational tricks to decrease schema bloat
  • Single space of primary keys
  • Single set of satellite tables such as for synonyms, citations, comments, etc.
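One way to picture the dataset registration, the single key space, and the shared satellite tables is the sketch below. The table and column names (`dataset`, `entry`, `wid`, `synonym`) are assumptions chosen for illustration and are not taken from the actual BioWarehouse schema. SQLite again stands in for Oracle or MySQL, and for brevity the dataset table keeps its own keys.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dataset (wid INTEGER PRIMARY KEY, name TEXT, version TEXT);
-- One table hands out keys for every object, so keys never collide across types,
-- and it records which dataset each object belongs to.
CREATE TABLE entry (wid INTEGER PRIMARY KEY AUTOINCREMENT,
                    dataset_wid INTEGER REFERENCES dataset(wid));
CREATE TABLE protein  (wid INTEGER PRIMARY KEY REFERENCES entry(wid), name TEXT);
CREATE TABLE chemical (wid INTEGER PRIMARY KEY REFERENCES entry(wid), name TEXT);
-- A single satellite table serves all object types through the shared key.
CREATE TABLE synonym (other_wid INTEGER REFERENCES entry(wid), syn TEXT);
""")

def new_object(table, dataset_wid, name, synonyms=()):
    """Register an object in a dataset, then store its type-specific row."""
    wid = conn.execute("INSERT INTO entry (dataset_wid) VALUES (?)",
                       (dataset_wid,)).lastrowid
    conn.execute(f"INSERT INTO {table} (wid, name) VALUES (?, ?)", (wid, name))
    conn.executemany("INSERT INTO synonym VALUES (?, ?)",
                     [(wid, s) for s in synonyms])
    return wid

# Two versions of the same (toy) database coexist as separate datasets.
v1 = conn.execute("INSERT INTO dataset (name, version) VALUES ('ToyDB', '1')").lastrowid
v2 = conn.execute("INSERT INTO dataset (name, version) VALUES ('ToyDB', '2')").lastrowid
new_object("protein", v1, "hexokinase", ["HK"])
new_object("protein", v2, "hexokinase", ["HK"])
new_object("protein", v2, "enolase")
new_object("chemical", v2, "pyruvate", ["pyruvic acid"])

def protein_names(dataset_wid):
    rows = conn.execute(
        "SELECT p.name FROM protein p JOIN entry e ON e.wid = p.wid "
        "WHERE e.dataset_wid = ?", (dataset_wid,))
    return {r[0] for r in rows}

# Version comparison: proteins present in version 2 but not in version 1.
added_in_v2 = protein_names(v2) - protein_names(v1)
```

Because every key comes from `entry`, the one `synonym` table can refer to a protein or a chemical without a type column, which is the kind of saving the "schema bloat" bullet points at.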
Current Databases Supported by BioWarehouse
• BioCyc
  • 15 genomes and metabolic networks
• Swiss-Prot, TrEMBL
  • 1.3M proteins
• ENZYME
• KEGG
• NCBI Taxonomy
• CMR
  • 105 genomes, 250K genes, 250K proteins
• Applications:
  • DARPA BioSpice program on biological simulation
  • Study of sequence coverage of known enzymes
Summary
• Interoperation of molecular-biology databases is a challenging problem of critical importance
• DOE should initiate a program in interoperation of molecular biology databases
• Pursue both warehouse approach and multidatabase approach
• Major progress possible within 5 years