Overview of Genome Databases Peter D. Karp, Ph.D. SRI International pkarp@ai.sri.com www-db.stanford.edu/dbseminar/seminar.html
Talk Overview • Definition of bioinformatics • Motivations for genome databases • Issues in building genome databases
Definition of Bioinformatics • Computational techniques for management and analysis of biological data and knowledge • Methods for disseminating, archiving, interpreting, and mining scientific information • Computational theories of biology • Genome databases are a subfield of bioinformatics
Motivations for Bioinformatics • Growth in molecular-biology knowledge (literature) • Genomics • Study of genomes through DNA sequencing • Industrial Biology
Example Genomics Datatypes • Genome sequences • DOE Joint Genome Institute • 511M bases in Dec 2001 • 11.97G bases since Mar 1999 • Gene and protein expression data • Protein-protein interaction data • Protein 3-D structures
Genome Databases • Experimental data • Archive experimental datasets • Retrieving past experimental results should be faster than repeating the experiment • Capture alternative analyses • Lots of data, simpler semantics • Computational symbolic theories • Complex theories become too large to be grasped by a single mind • The database is the theory • Biology is very much concerned with qualitative relationships • Less data, more complex semantics
Bioinformatics • Distinct intellectual field at the intersection of CS and molecular biology • Distinct field because researchers in the field must know CS, biology, and bioinformatics • Spectrum from CS research to biology service • Rich source of challenging CS problems • Large, noisy, complex data-sets and knowledge-sets • Biologists and funding agencies demand working solutions
Bioinformatics Research • algorithms + data structures = programs • algorithms + databases = discoveries • Combine sophisticated algorithms with the right content: • Properly structured • Carefully curated • Relevant data fields • Proper amount of data
Reference on Major Genome Databases • Nucleic Acids Research Database Issue • http://nar.oupjournals.org/content/vol30/issue1/ • 112 databases
What are Database Goals and Requirements? • What problems will the database be used to solve? • Who are the users and what is their expertise?
What is its Organizing Principle? • Different DBs partition the space of genome information in different dimensions • Experimental methods (Genbank, PDB) • Organism (EcoCyc, Flybase)
What is its Level of Interpretation? • Laboratory data • Primary literature (Genbank) • Review (SwissProt, MetaCyc) • Does DB model disagreement?
What are its Semantics and Content? • What entities and relationships does it model? • How does its content overlap with similar DBs? • How many entities of each type are present? • Sparseness of attributes and statistics on attribute values
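One way to quantify attribute sparseness is the fraction of entities that lack a value for each attribute. The following is a minimal sketch, assuming records are available as plain Python dictionaries; the function name `attribute_sparseness` and the toy records are hypothetical and not taken from any real database.

```python
from collections import Counter

def attribute_sparseness(records, attributes):
    """Return {attribute: fraction of records where it is missing or None}."""
    if not records:
        return {attr: 0.0 for attr in attributes}
    missing = Counter()
    for rec in records:
        for attr in attributes:
            if rec.get(attr) is None:
                missing[attr] += 1
    return {attr: missing[attr] / len(records) for attr in attributes}

# Hypothetical toy records, invented purely for illustration.
toy_proteins = [
    {"name": "protein A", "mw_exp": 54.1, "pi_exp": None},
    {"name": "protein B", "mw_exp": None, "pi_exp": None},
    {"name": "protein C", "mw_exp": 12.0, "pi_exp": 6.5},
    {"name": "protein D", "mw_exp": None},
]
sparseness = attribute_sparseness(toy_proteins, ["name", "mw_exp", "pi_exp"])
```

Reporting such fractions alongside entity counts gives users a quick sense of which attributes they can rely on.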
What are Sources of its Data? • Potential information sources • Laboratory instruments • Scientific literature • Manual entry • Natural-language text mining • Direct submission from the scientific community • Genbank • Modification policy • DB staff only • Submission of new entries by scientific community • Update access by scientific community
What DBMS is Employed? • None • Relational • Object oriented • Frame knowledge representation system
Distribution / User Access • Multiple distribution forms enhance access • Browsing access with visualization tools • API • Portability
What Validation Approaches are Employed? • None • Declarative consistency constraints • Programmatic consistency checking • Internal vs external consistency checking • What types of systematic errors might the DB contain?
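The difference between declarative and programmatic checking can be sketched as follows. This is an illustration only, assuming SQLite (through Python's standard `sqlite3` module) as a stand-in DBMS; the simplified `Protein` table, the helper names, and the toy rows are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Declarative constraints: the DBMS itself rejects any row that violates them.
conn.execute("""
    CREATE TABLE Protein (
        WID        INTEGER PRIMARY KEY,
        Name       TEXT NOT NULL,
        Fragment   TEXT CHECK (Fragment IN ('T', 'F')),
        AASequence TEXT
    )
""")

def try_insert(row):
    """Return True if the row was accepted, False if a constraint rejected it."""
    try:
        conn.execute("INSERT INTO Protein VALUES (?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False

accepted = try_insert((1, "toy protein 1", "F", "MKV"))
rejected = not try_insert((2, "toy protein 2", "X", "MKV"))  # 'X' fails the CHECK
try_insert((3, "toy protein 3", "T", "MK1V"))  # satisfies the constraints

# Programmatic check: a rule run as code over the stored data. Here it flags
# sequences with characters outside the 20 standard one-letter residue codes.
STANDARD_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")

def proteins_with_odd_residues(connection):
    rows = connection.execute("SELECT WID, AASequence FROM Protein ORDER BY WID")
    return [wid for wid, seq in rows if seq and set(seq) - STANDARD_RESIDUES]

suspect = proteins_with_odd_residues(conn)
```

Declarative constraints are enforced on every write, while programmatic checks are typically run periodically and can encode rules that are awkward to express in the schema.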
Database Documentation • Schema and its semantics • Format • API • Data acquisition techniques • Validation techniques • Size of different classes • Coverage of subject matter • Sparseness of attributes • Error rates • Update frequency
Relationship of Database Field to Bioinformatics • Scientists generally unaware of basic DB principles • Complex queries vs click-at-a-time access • Data model • Defined semantics for DB fields • Controlled vocabularies • Regular syntax for flatfiles • Automated consistency checking • Most biologists take one programming class • Evolution of typical genome database • Finer points of DB research off their radar screen • Handful of DB researchers work in bioinformatics
Database Field • For many years, the majority of bioinformatics DBs did not employ a DBMS • Flatfiles were the rule • Scientists want to see the data directly • Commercial DBMSs too expensive, too complex • DBAs too expensive • Most scientists do not understand • Differences between BA, MS, PhD in CS • CS research vs applications • Implications for project planning, funding, bioinformatics research
Recommendation • Teaching scientists programming is not enough • Teaching scientists how to build a DBMS is irrelevant • Teach scientists basic aspects of databases and symbolic computing • Database requirements analysis • Data models, schema design • Knowledge representation, ontologies • Formal grammars • Complex queries • Database interoperability
BioSPICE Bioinformatics Database Warehouse Peter Karp, Dave Stringer-Calvert, Tom Lee, Kemal Sonmez SRI International http://www.BioSPICE.org/
Project Goal • Create a toolkit for constructing bioinformatics database warehouses that collect together a set of bioinformatics databases into one physical DBMS
Motivations • Important bioinformatics problems require access to multiple bioinformatics databases • Hundreds of bioinformatics databases exist • Nucleic Acids Research 30(1) 2002 – DB issue • Nucleic Acids Research DB list: 350 DBs at http://www3.oup.co.uk/nar/database/a/ • Different problems require different sets of databases
Motivations • Combining multiple databases allows for data verification and complementation • Simulation problems require access to data on pathways, enzymes, reactions, genetic regulation
Why is the Multidatabase Approach Not Sufficient? • Multidatabase query approaches assume databases are in a DBMS • Internet bandwidth limits query throughput • Most sites that do operate DBMSs do not allow remote SQL access because of security and loading concerns • Control data stability • Need to capture, integrate and publish locally produced data of different types • Multidatabase and Warehouse approaches complementary
Scenario 1 • BioSPICE scientist wants to model multiple metabolic pathways in a given organism • Enumerate pathways and reactions • What enzymes catalyze each reaction? • What genes code for each enzyme? • What control regions regulate each gene?
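Once the relevant databases share one DBMS, the first three questions in this scenario reduce to a single join. The sketch below assumes SQLite as a stand-in for Oracle or MySQL; every table name, column name, and row is hypothetical and is not the actual warehouse schema. Control regions are omitted for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Pathway  (WID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Reaction (WID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Protein  (WID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Gene     (WID INTEGER PRIMARY KEY, Name TEXT);
    -- Hypothetical link tables for the many-to-many relationships.
    CREATE TABLE PathwayReaction (PathwayWID INTEGER, ReactionWID INTEGER);
    CREATE TABLE ReactionEnzyme  (ReactionWID INTEGER, ProteinWID INTEGER);
    CREATE TABLE GeneProduct     (GeneWID INTEGER, ProteinWID INTEGER);

    -- Toy rows, invented for illustration.
    INSERT INTO Pathway  VALUES (1, 'toy pathway');
    INSERT INTO Reaction VALUES (10, 'toy reaction 1'), (11, 'toy reaction 2');
    INSERT INTO Protein  VALUES (20, 'toy enzyme A'), (21, 'toy enzyme B');
    INSERT INTO Gene     VALUES (30, 'geneA'), (31, 'geneB');
    INSERT INTO PathwayReaction VALUES (1, 10), (1, 11);
    INSERT INTO ReactionEnzyme  VALUES (10, 20), (11, 21);
    INSERT INTO GeneProduct     VALUES (30, 20), (31, 21);
""")

# For one pathway: each reaction, the enzyme catalyzing it, and the gene
# coding for that enzyme.
rows = conn.execute("""
    SELECT pw.Name, r.Name, p.Name, g.Name
    FROM Pathway pw
    JOIN PathwayReaction pr ON pr.PathwayWID = pw.WID
    JOIN Reaction r         ON r.WID = pr.ReactionWID
    JOIN ReactionEnzyme re  ON re.ReactionWID = r.WID
    JOIN Protein p          ON p.WID = re.ProteinWID
    JOIN GeneProduct gp     ON gp.ProteinWID = p.WID
    JOIN Gene g             ON g.WID = gp.GeneWID
    WHERE pw.Name = ?
    ORDER BY r.WID
""", ("toy pathway",)).fetchall()
```

Answering the same question across separately hosted databases would require one remote query per step, which is the cost the warehouse approach is meant to avoid.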
Approach • Oracle and MySQL implementations • Warehouse schema defines many bioinformatics datatypes • Create loaders for public bioinformatics DBs • Parse file format for the DB • Semantic transformations • Insert database into warehouse tables • Warehouse query access mechanisms • SQL queries via Perl, ODBC, OAA
Example: Swiss-Prot DB • Version 40.0 describes 101K proteins in a 320MB file • Each protein described as one block of records (an entry) in a large text file • Loader tool parses file one entry at a time • Creates new entries in a set of warehouse tables
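Reading the file one entry at a time keeps memory use independent of file size. In the Swiss-Prot flat-file format each entry ends with a line containing `//`, so a streaming reader can be sketched as below. This is an illustration in Python, not the project's Java loader; the function name `iter_entries` and the two toy entries are hypothetical.

```python
import io

def iter_entries(handle):
    """Yield each flat-file entry as a list of lines, without the '//' terminator."""
    entry = []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith("//"):
            if entry:
                yield entry
            entry = []
        elif line:
            entry.append(line)
    if entry:  # tolerate a final entry that lacks a terminator
        yield entry

# Two toy entries, invented for illustration; a real loader would pass an
# open file handle instead of an in-memory string.
sample = io.StringIO(
    "ID   TOY1_EXAMPLE\n"
    "AC   X00001;\n"
    "//\n"
    "ID   TOY2_EXAMPLE\n"
    "AC   X00002;\n"
    "//\n"
)
entries = list(iter_entries(sample))
```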
Warehouse Schema • Manages many bioinformatics datatypes simultaneously • Pathways, Reactions, Chemicals • Proteins, Genes, Replicons • Citations, Organisms • Links to external databases • Each type of warehouse object implemented through one or more relational tables (currently 43)
Warehouse Schema • Databases on our wish list: • Genbank (nucleotide sequences) • Protein expression database • Protein-protein interactions database • Gene expression database • NCBI Taxonomy database • Gene Ontology • CMR
Warehouse Schema • Manages multiple datasets simultaneously • Dataset = Single version of a database • Support alternative measurements and viewpoints • Version comparison • Multiple software tools or experiments that require access to different versions • Each dataset is a warehouse entity • Every warehouse object is registered in a dataset
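Because every object carries a reference to its dataset, two versions of the same database can sit side by side and be compared with ordinary queries. A minimal sketch, assuming SQLite as a stand-in DBMS; the `Accession` column, the table layouts, and all rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DataSet (WID INTEGER PRIMARY KEY, Name TEXT, Version TEXT);
    CREATE TABLE Protein (WID INTEGER PRIMARY KEY, Accession TEXT, DataSetWID INTEGER);

    -- Two versions of one toy database, loaded as separate datasets.
    INSERT INTO DataSet VALUES (1, 'ToyDB', '1.0'), (2, 'ToyDB', '2.0');
    INSERT INTO Protein VALUES (10, 'X00001', 1), (11, 'X00002', 1),
                               (12, 'X00002', 2), (13, 'X00003', 2);
""")

def accessions(dataset_wid):
    """Return the set of accession numbers registered in one dataset."""
    rows = conn.execute(
        "SELECT Accession FROM Protein WHERE DataSetWID = ?", (dataset_wid,)
    )
    return {acc for (acc,) in rows}

old, new = accessions(1), accessions(2)
added = sorted(new - old)      # present only in the newer version
removed = sorted(old - new)    # present only in the older version
```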
Warehouse Schema • Different databases storing the same biological types are coerced into same warehouse tables • Design of most datatypes inspired by multiple databases • Representational tricks to decrease schema bloat • Single space of primary keys • Single set of satellite tables such as for synonyms, citations, comments, etc.
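The two tricks named above work together: if identifiers are unique across all tables, one satellite table can serve every object type without recording which table a row belongs to. The sketch below assumes SQLite; the names `SynonymTable`, `OtherWID`, and `add_object` are hypothetical and chosen only for illustration.

```python
import itertools
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Protein (WID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Pathway (WID INTEGER PRIMARY KEY, Name TEXT);
    -- One satellite table for all object types: OtherWID may refer to a row
    -- in any object table, which is unambiguous because WIDs never repeat.
    CREATE TABLE SynonymTable (OtherWID INTEGER NOT NULL, Syn TEXT NOT NULL);
""")

next_wid = itertools.count(1)          # single counter shared by all tables
OBJECT_TABLES = {"Protein", "Pathway"}

def add_object(table, name, synonyms=()):
    """Insert an object with a warehouse-wide unique ID and its synonyms."""
    if table not in OBJECT_TABLES:     # guard the interpolated table name
        raise ValueError(f"unknown table: {table}")
    wid = next(next_wid)
    conn.execute(f"INSERT INTO {table} (WID, Name) VALUES (?, ?)", (wid, name))
    conn.executemany(
        "INSERT INTO SynonymTable VALUES (?, ?)", [(wid, s) for s in synonyms]
    )
    return wid

def synonyms_of(wid):
    rows = conn.execute(
        "SELECT Syn FROM SynonymTable WHERE OtherWID = ? ORDER BY Syn", (wid,)
    )
    return [syn for (syn,) in rows]

protein_wid = add_object("Protein", "toy protein", ["TP1", "toy-prot"])
pathway_wid = add_object("Pathway", "toy pathway", ["TPW"])
```

Without a shared key space, each object type would need its own synonym, citation, and comment tables, multiplying the table count.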
Warehouse Schema • Examples • Protein data from Swiss-Prot, TrEMBL, KEGG, and EcoCyc all loaded into same relational tables • Pathway data from MetaCyc and KEGG are loaded into the same relational tables
Example: Swiss-Prot DB
ID   1A11_CUCMA     STANDARD;      PRT;   493 AA.
AC   P23599;
DT   01-NOV-1991 (Rel. 20, Created)
DT   01-NOV-1991 (Rel. 20, Last sequence update)
DT   15-DEC-1998 (Rel. 37, Last annotation update)
DE   1-AMINOCYCLOPROPANE-1-CARBOXYLATE SYNTHASE CMW33 (EC 4.4.1.14) (ACC
DE   SYNTHASE) (S-ADENOSYL-L-METHIONINE METHYLTHIOADENOSINE-LYASE).
GN   ACS1 OR ACCW.
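Each line of an entry starts with a two-letter code (ID, AC, DT, DE, GN) that identifies the kind of data on that line, and codes such as DE can repeat when a value continues across lines. A minimal sketch of splitting the entry above into fields, written in Python for illustration; the project's loader uses an ANTLR-generated parser, and the helper `parse_entry` is hypothetical.

```python
from collections import defaultdict

ENTRY = """\
ID   1A11_CUCMA     STANDARD;      PRT;   493 AA.
AC   P23599;
DT   01-NOV-1991 (Rel. 20, Created)
DT   01-NOV-1991 (Rel. 20, Last sequence update)
DT   15-DEC-1998 (Rel. 37, Last annotation update)
DE   1-AMINOCYCLOPROPANE-1-CARBOXYLATE SYNTHASE CMW33 (EC 4.4.1.14) (ACC
DE   SYNTHASE) (S-ADENOSYL-L-METHIONINE METHYLTHIOADENOSINE-LYASE).
GN   ACS1 OR ACCW.
"""

def parse_entry(text):
    """Group the lines of one entry by their leading line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if not line.strip() or line.startswith("//"):
            continue
        code, _, data = line.partition(" ")
        fields[code].append(data.strip())
    return dict(fields)

fields = parse_entry(ENTRY)
entry_name = fields["ID"][0].split()[0]
accessions = [a.strip() for a in " ".join(fields["AC"]).split(";") if a.strip()]
description = " ".join(fields["DE"])   # rejoin the continued DE lines
gene_names = fields["GN"][0].rstrip(".").split(" OR ")
```

The extracted values are what a loader would then map onto warehouse columns and satellite tables.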
How Swiss-Prot is Loaded into the Warehouse • Register Swiss-Prot in Datasets table • Create entry in Entry and Protein tables for each Swiss-Prot protein • Satellite tables store • Protein synonyms, citations, comments, accession numbers, organism, sequence features, subunits/complexes, DB links
Protein Table
CREATE TABLE Protein (
  WID                  NUMBER,          -- The warehouse ID of this protein
  Name                 VARCHAR2(500),   -- Common name of the protein
  AASequence           VARCHAR2(4000),  -- Amino-acid sequence for this protein
  Charge               NUMBER,          -- Charge of the chemical
  Fragment             CHAR(1),         -- Is this protein a fragment or not, T or F
  MolecularWeightCalc  NUMBER,          -- Molecular weight calculated from sequence. Units: Daltons.
  MolecularWeightExp   NUMBER,          -- Molecular weight determined through experimentation. Units: Daltons.
  PICalc               VARCHAR2(50),    -- pI calculated from its sequence.
  PIExp                VARCHAR2(50),    -- pI value determined through experimentation.
  DataSetWID           NUMBER           -- Reference to the data set from which the entity came
);
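The loading steps can be sketched against a cut-down version of the Protein table above. This is an illustration only, assuming SQLite types in place of Oracle's; the `DataSet` and `Entry` table layouts and both helper functions are hypothetical, and the sequence is left NULL because the slides do not show it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- DataSet and Entry are hypothetical minimal tables; Protein keeps a
    -- subset of the columns from the slide, using SQLite types.
    CREATE TABLE DataSet (WID INTEGER PRIMARY KEY, Name TEXT, Version TEXT);
    CREATE TABLE Entry   (OtherWID INTEGER PRIMARY KEY, DataSetWID INTEGER);
    CREATE TABLE Protein (
        WID        INTEGER PRIMARY KEY,
        Name       TEXT,
        AASequence TEXT,
        Fragment   TEXT,
        DataSetWID INTEGER
    );
""")

def register_dataset(wid, name, version):
    """Step 1: record the database version being loaded."""
    conn.execute("INSERT INTO DataSet VALUES (?, ?, ?)", (wid, name, version))
    return wid

def load_protein(wid, dataset_wid, name, sequence=None, fragment=None):
    """Step 2: create the Entry row and the Protein row for one protein."""
    conn.execute("INSERT INTO Entry VALUES (?, ?)", (wid, dataset_wid))
    conn.execute(
        "INSERT INTO Protein VALUES (?, ?, ?, ?, ?)",
        (wid, name, sequence, fragment, dataset_wid),
    )

dataset_wid = register_dataset(1, "Swiss-Prot", "40.0")
load_protein(2, dataset_wid, "1-AMINOCYCLOPROPANE-1-CARBOXYLATE SYNTHASE CMW33")
loaded = conn.execute("SELECT WID, Name, DataSetWID FROM Protein").fetchall()
```

Synonyms, citations, and the other satellite data would be inserted in the same pass, keyed by the protein's warehouse ID.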
Database Loaders • Loader tool defined for each DB to be loaded into Warehouse • Example loaders available in several languages • Loaders • KEGG (C) • BioCyc collection of 15 pathway DBs (C) • Swiss-Prot (Java) • ENZYME (Java)
Terminology • Model Organism Database (MOD): DB describing genome and other information about an organism • Pathway/Genome Database (PGDB): MOD that combines information about • Pathways, reactions, substrates • Enzymes, transporters • Genes, replicons • Transcription factors, promoters, operons, DNA binding sites • BioCyc: Collection of 15 PGDBs at BioCyc.org • EcoCyc, AgroCyc, YeastCyc
Loader Architecture (diagram) • Grammar for Swiss-Prot → ANTLR Parser Generator → Parser for Swiss-Prot • Swiss-Prot Datafile → Parser for Swiss-Prot → Oracle Loadable File, SQL Insert Commands
Example Warehouse Uses • Check completeness of data sources • Count reactions in ENZYME database with (and without) associated protein sequences in SWISS-PROT database: • 3870 reactions in ENZYME • 1662 reactions (43%) with a sequence in SWISS-PROT • 2208 reactions (57%) without a sequence in SWISS-PROT • Count number of distinct non-partial EC numbers in SWISS-PROT: • 1554 distinct EC numbers in SWISS-PROT (non-partial)
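Counts of this kind become short SQL queries once both databases are in the same warehouse. The sketch below assumes SQLite and assumes that reactions and proteins are matched through an `ECNumber` column, which may differ from how the actual warehouse links them; the tables and rows are toy data and do not reproduce the figures above. Partial EC numbers are those containing a `-` placeholder.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical simplified tables with toy rows.
    CREATE TABLE Reaction (WID INTEGER PRIMARY KEY, ECNumber TEXT);
    CREATE TABLE Protein  (WID INTEGER PRIMARY KEY, ECNumber TEXT, AASequence TEXT);
    INSERT INTO Reaction VALUES (1, '1.1.1.1'), (2, '2.7.1.1'), (3, '4.4.1.14');
    INSERT INTO Protein  VALUES (10, '1.1.1.1', 'MKV'),
                                (11, '1.1.1.1', 'MKL'),
                                (12, '4.4.1.-', 'MAA');
""")

total = conn.execute("SELECT COUNT(*) FROM Reaction").fetchone()[0]

# Reactions for which at least one protein with a sequence shares the EC number.
with_seq = conn.execute("""
    SELECT COUNT(*) FROM Reaction r
    WHERE EXISTS (
        SELECT 1 FROM Protein p
        WHERE p.ECNumber = r.ECNumber AND p.AASequence IS NOT NULL
    )
""").fetchone()[0]
without_seq = total - with_seq

# Distinct EC numbers on proteins, excluding partial ones such as '4.4.1.-'.
distinct_ec = conn.execute("""
    SELECT COUNT(DISTINCT ECNumber) FROM Protein
    WHERE ECNumber NOT LIKE '%-%'
""").fetchone()[0]
```

Using `EXISTS` counts each reaction once even when several proteins share its EC number, which a plain join would not.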