Database Technology in Bioinformatics. Philip McNeil. European Bioinformatics Institute. The Information Challenge Database Technologies Which Do You Choose? Data Modelling Some Database Features In Use at EBI. The Information Challenge. Many new data intensive methodologies
Database Technology in Bioinformatics Philip McNeil European Bioinformatics Institute
• The Information Challenge
• Database Technologies
• Which Do You Choose?
• Data Modelling
• Some Database Features In Use at EBI
Today’s Research
• Many new data-intensive methodologies
  - Combinatorial Chemistry
  - Genomics (including Structural Genomics)
  - High Throughput Screening
  - Proteomics
  - Transgenics
  - Microarrays
• Generates ever-increasing amounts of data:
[Figure: genome information, in megabases]
‘Data Waves’
• Size
• Complexity
• Integration
Molecular biology has become information intensive
G R Y S P L E M   CAGTAGTGCACATCATTCGTCAATGCATACTGCACTAACCACACAGTAC
G R Y S P L N M   CAGTAAAGCACATCATTCGTCAATGCATACTGCACTAACCACACAGTAC
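The two fragments on this slide differ at only a few positions. A minimal sketch of how such differences might be located, assuming the sequences are already aligned and of equal length; the helper name `mismatches` is hypothetical, and the sequences are copied from the slide.

```python
def mismatches(a: str, b: str) -> list[int]:
    """Return the 0-based positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


# Sequences copied from the slide
dna_1 = "CAGTAGTGCACATCATTCGTCAATGCATACTGCACTAACCACACAGTAC"
dna_2 = "CAGTAAAGCACATCATTCGTCAATGCATACTGCACTAACCACACAGTAC"
pep_1 = "GRYSPLEM"
pep_2 = "GRYSPLNM"

dna_diff = mismatches(dna_1, dna_2)  # positions that differ in the nucleotide fragments
pep_diff = mismatches(pep_1, pep_2)  # positions that differ in the peptide fragments
```

Even this trivial comparison needs sequences held in a consistent, machine-readable form, which is the point the slide makes.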
Macromolecular Information
• Nucleotide Sequence
• Protein Sequence
• Protein Structure
• Protein Function
[Figure label: Data Complexity]
Proliferation of Databases
• Biological information is all interrelated
  - DNA sequence, Protein sequence, Structure, Function
• Specialist databases for organisms
  - HIV, Drosophila, C. elegans
• Specialist databases for functions
  - Eukaryotic promoters, Transcription factors
• Specialist databases for diseases and genes
  - P53, Haemophilia B
But...
• Artificial boundaries between databases
• Coarse links between databases
• Multitude of exchange formats
• Lack of robustness
• Varied quality
Meeting the Information Challenge
• Improved quality and integrity of data
• Data need to be well structured and robustly defined
• Flexible infrastructure to meet rapidly changing requirements
• Open frameworks, management and analysis tools
• Integrate diverse data sources
Evolution of DBMS Technology Adapted from: Barry - The Object Database Handbook (1996)
Database Systems In Use Today
• Essentially four different types:
  - File System (Flat Files)
  - Relational database (RDBMS)
  - Object oriented database (ODBMS)
  - Object-relational database (ORDBMS)
• Coming?
  - XML
File Systems
• All computers have them!
• Most of the world’s data still consists of old file systems and legacy data
• Many bioinformatics databases are still distributed as flat files:
  - EMBL-Bank
  - MSD/PDB
  - SWISS-PROT/TrEMBL
Relational Databases
• Well understood, mature technology
• Most widely used DBMS
• Standards:
  - SQL92 (although all vendors used proprietary extensions)
  - SQL99: support for objects & other extensions
  - SQL2003: XML
Object-Relational Databases
• Extended the SQL92 data model:
  - User defined, complex data types
  - Types, subtypes, inheritance
  - References (‘OIDs’)
• Now supported by the SQL99 standard
• Many major relational databases now have object extensions
Object Oriented Databases
• Persistent data store for objects created by object oriented programming languages
• Language bindings: C++, Smalltalk, Java
• Standard: Object Query Language (OQL)
  - Not implemented by many vendors
• Now mainly used in niche areas
  - CAD/CAM, AI, telecoms
XML Databases
• Three different types defined by the XML:DB initiative:
  - Native XML Database
  - XML Enabled Database
  - Hybrid XML Database
Native XML Database
• Defines a model for an XML document, and stores and retrieves documents according to that model
• The XML document is the fundamental unit of storage (cf. row in a relational database)
• Can be built on various underlying storage models (RDBMS, OODBMS, indexed compressed files)
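To make "the document is the fundamental unit of storage" concrete, here is a minimal in-memory sketch using Python's standard `xml.etree.ElementTree`. It is a toy under stated assumptions, not a native XML database: the `collection` dictionary, the `store` and `query` helpers, and the sample documents are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical in-memory "collection": document id -> parsed XML document
collection: dict[str, ET.Element] = {}


def store(doc_id: str, xml_text: str) -> None:
    """Store a whole XML document under an identifier."""
    collection[doc_id] = ET.fromstring(xml_text)


def query(path: str) -> list[tuple[str, str]]:
    """Apply a (limited) XPath expression to every stored document."""
    hits = []
    for doc_id, root in collection.items():
        for node in root.findall(path):
            hits.append((doc_id, node.text or ""))
    return hits


store("e1", "<entry><organism>Drosophila</organism><length>49</length></entry>")
store("e2", "<entry><organism>HIV</organism><length>120</length></entry>")

organisms = query("./organism")
# A document comes back as a document, with its structure intact
roundtrip = ET.tostring(collection["e1"], encoding="unicode")
```

Storage and retrieval both operate on whole documents, and queries are phrased against document structure rather than against tables.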
XML Enabled Database
• Has an added XML mapping layer
• Original XML metadata & structure may be lost
• Data retrieved as XML may not have originated in XML
• Data manipulation via e.g. DOM or SAX, or via SQL
• Oracle, Microsoft, IBM use this approach
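The mapping-layer idea can be sketched with Python's standard `sqlite3` and `xml.etree.ElementTree` modules. This only illustrates the principle and is not how any particular vendor implements it; the table, the accession values and the `rows_to_xml` helper are made up for the example.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entry (accession TEXT PRIMARY KEY, organism TEXT)")
conn.executemany(
    "INSERT INTO entry VALUES (?, ?)",
    [("X00001", "Drosophila"), ("X00002", "HIV")],  # hypothetical rows
)


def rows_to_xml(connection: sqlite3.Connection) -> str:
    """Mapping layer: present relational rows as an XML document."""
    root = ET.Element("entries")
    cursor = connection.execute(
        "SELECT accession, organism FROM entry ORDER BY accession"
    )
    for accession, organism in cursor:
        entry = ET.SubElement(root, "entry", accession=accession)
        ET.SubElement(entry, "organism").text = organism
    return ET.tostring(root, encoding="unicode")


xml_text = rows_to_xml(conn)
```

The XML produced here never existed as a document: it is generated from rows on request, which is why the original document structure cannot be assumed to survive in this kind of system.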
Hybrid XML Database
• Can be treated as either a Native XML Database or an XML Enabled Database
• Example: Ozone
Flat Files
• Can store complex data (e.g. PDB)
• Can be indexed (at least for simple datatypes)
• Can be made publicly available in simple form
• Well suited for human browsing
• Avoid cost of database software
• Platform independent
• Easy to prepare for WWW
• Cheap
Limitations of Flat File ‘Database’
• Low data reliability, security & integrity
  - decentralised data and therefore decentralised control
• Inadequate data structuring
  - difficult to provide an adequate model of ‘the real world’
  - variety of formats, lack of robustness
• Difficult to get answers to ad-hoc queries
  - no query language; data files are distinct
  - sophisticated query tools have been developed, e.g. SRS
• Low responsiveness to change
  - data and programs are not independent
  - hard to integrate
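The point that "data and programs are not independent" shows up as soon as one tries to query a flat file. The sketch below parses made-up records in a simplified layout loosely modelled on two-letter line codes with `//` as record terminator; the records, the field codes chosen and the `parse_records` helper are assumptions for illustration only.

```python
# Made-up records in a simplified layout: a two-letter line code, then the
# value from column 6 onwards; "//" ends a record. Real flat files define
# many more line types.
FLAT_FILE = """\
ID   REC1
DE   Example promoter region
OS   Drosophila melanogaster
//
ID   REC2
DE   Example envelope gene
OS   Human immunodeficiency virus 1
//
"""


def parse_records(text: str) -> list[dict[str, str]]:
    """Split the text into records and collect fields by line code."""
    records, current = [], {}
    for line in text.splitlines():
        if line.startswith("//"):
            records.append(current)
            current = {}
        elif line.strip():
            code, value = line[:2], line[5:].strip()
            # Repeated line codes are joined into a single field
            current[code] = f"{current[code]} {value}" if code in current else value
    return records


records = parse_records(FLAT_FILE)
# The "query" is ordinary program code tied to this one file layout
drosophila_ids = [r["ID"] for r in records if "Drosophila" in r.get("OS", "")]
```

Any change to the file layout means changing the parser, and every new question means writing new code rather than a new query.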
Benefits of Relational Databases
• Store large amounts of relatively simple data as tables of rows & columns
• Scalability
• Sound theoretical basis
• High security & reliability
• Performance
  - Query optimization
  - Parallel processing
• Strong support, tools, etc.
Limitations of Relational Databases
• Cannot adequately support complex data
  - complex data are stored as ‘BLOBs’
  - BLOBs can be retrieved, but not searched, indexed or manipulated
• Restricted set of data types even for less complex data
  - Numbers, character strings, dates
• An inadequate model of ‘the real world’
  - entity/relationship model
  - loss of semantics
• Expensive
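Both sides of the relational trade-off can be seen in a small sketch using Python's standard `sqlite3` module. The table, column names and rows are hypothetical; the BLOB column holds arbitrary bytes standing in for complex data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE protein (
           accession TEXT PRIMARY KEY,
           organism  TEXT,
           length    INTEGER,
           structure BLOB)"""
)
conn.executemany(
    "INSERT INTO protein VALUES (?, ?, ?, ?)",
    [  # hypothetical rows; the bytes stand in for opaque structure data
        ("P1", "Drosophila", 120, b"\x00\x01\x02"),
        ("P2", "HIV", 99, b"\x03\x04"),
        ("P3", "Drosophila", 310, b"\x05\x06\x07\x08"),
    ],
)

# Simple typed columns: an ad-hoc question is a one-line declarative query
long_fly = conn.execute(
    "SELECT accession FROM protein "
    "WHERE organism = ? AND length > ? ORDER BY accession",
    ("Drosophila", 200),
).fetchall()

# The BLOB is returned whole; its internal structure is opaque to SQL
blob = conn.execute(
    "SELECT structure FROM protein WHERE accession = ?", ("P2",)
).fetchone()[0]
```

The query over `organism` and `length` needs no custom code, whereas anything inside `structure` has to be decoded by the application after retrieval.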
Benefits of Object-Relational Databases
• Close to the relational model, but benefit from some OO concepts
• Can handle large amounts of complex data
  - ‘smart BLOBs’
• Plug-in extensibility
  - cartridges & datablades
• Good ad-hoc query capability: SQL99
• High security & reliability
Limitations of Object-Relational Databases
• A compromise solution, merging two paradigms
  - underlying model still relational
• Less than perfect support for object extensions
• Even more expensive
Benefits of ODBMSs
• Support complex data structures
• Provide a much better implementation of ‘the real world’ model
  - object oriented models map well: support for OO concepts
  - little loss of semantics
• Vendor-specific:
  - Good performance?
  - Scalability?
• Ad-hoc queries are possible with OQL
• Closely integrated with programming languages
Limitations of ODBMSs
• Hard to learn?
• Vendor specific:
  - Difficult to query? Few currently support OQL, a few support SQL; queries may have to be written in a 3GL, e.g. C++
  - Performance? Security? Reliability? Scalability? Backup & Recovery?
• Also expensive
• Few tools
Which do you choose?
It depends on the data and what you want to do with them:
Michael Stonebraker: “Object-relational DBMS - The Next Wave”, Illustra whitepaper, http://www.informix.com/informix/corpinfo/zines/whitpprs
DBMS to store, flat files to communicate
• Most major data repositories in molecular biology have moved to using commercial RDBMS packages to manage their collections
• Most groups still collect and deliver the information using flat file protocols and formats; XML is becoming dominant here
Modelling Comes First
• Start with a conceptual model
• Can be done using different approaches
  - objects
  - entity relationship
• This can be implemented using different physical database systems and programming languages (not always without difficulty!)
• Remember: your database will only be as good as the data model it supports
UML
• UML stands for Unified Modeling Language
• The UML combines elements from
  - Data Modelling concepts (Entity Relationship Diagrams)
  - Business Modelling (work flow)
  - Object Modelling
  - Component Modelling
• UML is the OMG standard language for visualizing, specifying, constructing, and documenting the artifacts of a software-intensive system
UML: Basics
• Classes
• Attributes
• Links
• Operations
• Set and get (implicit)
• Checks and constraints
[Class diagram: Study, Experiment, Conditions and ExpDim, linked with multiplicities 1 and *; attributes shown: +details: String, +serial: Int, +name: String, +ndim: Int, +temperature: Float, +pH: Float, +dim: Int; operation shown: +__init__()]
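As a rough illustration of how such a class diagram maps onto code, the sketch below expresses the four classes named on the slide as Python dataclasses. The extracted slide does not show which attribute belongs to which class or exactly which classes are linked, so the assignment of attributes, the link names (`experiments`, `dims`, `conditions`) and the `ndim` check are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class ExpDim:
    dim: int


@dataclass
class Conditions:
    # Assumed attribute placement; the slide does not make it explicit
    serial: int
    temperature: float
    pH: float


@dataclass
class Experiment:
    serial: int
    ndim: int
    details: str = ""
    dims: list = field(default_factory=list)        # assumed 1-to-* link to ExpDim
    conditions: list = field(default_factory=list)  # assumed *-to-* link to Conditions

    def __post_init__(self):
        # "Checks and constraints": a hypothetical validity rule
        if self.ndim < 1:
            raise ValueError("ndim must be at least 1")


@dataclass
class Study:
    name: str
    details: str = ""
    experiments: list = field(default_factory=list)  # assumed 1-to-* link


study = Study("demo study")
exp = Experiment(serial=1, ndim=2)
exp.dims = [ExpDim(1), ExpDim(2)]
exp.conditions.append(Conditions(serial=1, temperature=298.0, pH=7.0))
study.experiments.append(exp)
```

Attributes become typed fields, links become collections of referenced objects, and constraints become checks run when an object is created.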
Sequence Schema
The MSD Data Warehouse
• Database designed for queries and analysis
• Facilitates synchronisation with other databases
• Repository for derived data
• Modular
From Deposition to Distribution
[Diagram labels: Deposition, Stage1, Warehouse, Search-Warehouse; replication, transformation, distribution]
Representing Macromolecular Structures
[Diagram labels: Exp. Result, Assembly, Chains, Residues, Atoms; CHAIN ENTRY ASSEMBLY ALT ASSEMBLY DATA RESIDUE ATOM DATA ATOM MODEL]
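The hierarchy named on this slide (entry, chains, residues, atoms) can be held in linked relational tables. The following is a minimal sketch with Python's standard `sqlite3` module; it is not the MSD schema, and the table layouts, column names and rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical tables: each level references its parent level
    CREATE TABLE entry   (entry_id TEXT PRIMARY KEY);
    CREATE TABLE chain   (chain_id INTEGER PRIMARY KEY,
                          entry_id TEXT REFERENCES entry(entry_id),
                          code TEXT);
    CREATE TABLE residue (residue_id INTEGER PRIMARY KEY,
                          chain_id INTEGER REFERENCES chain(chain_id),
                          name TEXT);
    CREATE TABLE atom    (atom_id INTEGER PRIMARY KEY,
                          residue_id INTEGER REFERENCES residue(residue_id),
                          name TEXT);

    INSERT INTO entry   VALUES ('demo');
    INSERT INTO chain   VALUES (1, 'demo', 'A');
    INSERT INTO residue VALUES (1, 1, 'GLY'), (2, 1, 'ALA');
    INSERT INTO atom    VALUES (1, 1, 'N'), (2, 1, 'CA'),
                               (3, 2, 'N'), (4, 2, 'CA'), (5, 2, 'CB');
    """
)

# Walk the hierarchy from entry down to atoms with joins
atoms_per_residue = conn.execute(
    """
    SELECT r.name, COUNT(a.atom_id)
    FROM entry e
    JOIN chain   c ON c.entry_id   = e.entry_id
    JOIN residue r ON r.chain_id   = c.chain_id
    JOIN atom    a ON a.residue_id = r.residue_id
    WHERE e.entry_id = 'demo'
    GROUP BY r.residue_id, r.name
    ORDER BY r.residue_id
    """
).fetchall()
```

Each level of the structure is one table, and questions that span levels become joins along the parent references.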
Technical Details
• Need for staging databases
• Transformation mechanism
• Deposition to warehouse refresh
• Replication/Distribution
• Query optimisation
• Interfacing with the warehouse (API, web, management tools)
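A staging-to-warehouse refresh of the kind listed above might look roughly like the sketch below, written with Python's standard `sqlite3` module. It is a toy under stated assumptions: the table names, columns, status values and the `refresh_warehouse` helper are hypothetical and do not describe the actual MSD transformation mechanism.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical staging table: loosely typed, as received from deposition
    CREATE TABLE staging_entry (entry_id TEXT, resolution TEXT, status TEXT);
    -- Hypothetical warehouse table: cleaned and typed for querying
    CREATE TABLE warehouse_entry (entry_id TEXT PRIMARY KEY, resolution REAL);

    INSERT INTO staging_entry VALUES
        ('demo1', '1.80',   'released'),
        ('demo2', '2.50',   'on hold'),
        ('demo3', ' 3.10 ', 'released');
    """
)


def refresh_warehouse(connection: sqlite3.Connection) -> None:
    """Rebuild the warehouse table from released staging rows."""
    with connection:  # one transaction: commit on success, roll back on error
        connection.execute("DELETE FROM warehouse_entry")
        connection.execute(
            """
            INSERT INTO warehouse_entry (entry_id, resolution)
            SELECT entry_id, CAST(TRIM(resolution) AS REAL)
            FROM staging_entry
            WHERE status = 'released'
            """
        )


refresh_warehouse(conn)
warehouse = conn.execute(
    "SELECT entry_id, resolution FROM warehouse_entry ORDER BY entry_id"
).fetchall()
```

Running the refresh inside a single transaction means readers of the warehouse see either the old contents or the new ones, never a half-loaded table.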