160 likes | 180 Views
A Database Platform for Bioinformatics. Sandeepan Banerjee Oracle Corporation. Background. Need massive storage for more and more genomic and proteomic data generated in database
E N D
A Database Platform for Bioinformatics Sandeepan Banerjee Oracle Corporation
Background • Need massive storage for more and more genomic and proteomic data generated in database • Need high-performance computing platform to search data, identify similarities and patterns within genomic data and unify the slices of distributed developed knowledge
Steps to genomic projects • Divide the chromosomes into smaller fragments that can be isolated • Order these fragments to correspond to their respective locations on the chromosomes. • Determine the sequence of bases A,T,C & G in each fragment. • Annotate the regions of sequenced chromosomes with their function • Catalogue the differences in sequences
Computing for cataloguing • Any two individuals differ in about 1/1000 of their genetic material, i.e. about 3 million base pairs. • The global population is now about 6 billion. • So a full cataloguing of all sequence differences will run to 18*1015 entries.
Traditional Database • Few databases have had a native ability to deal with complex data • Hard to handle high-dimensional data Ex. Query on structural similarity: Given a particular sequence, what other sequences resembling this sequence exist in the database?
BLAST • A set of similarity search programs • Hard to handle due to too complex, too large and far too custom-built • Degrade performing when interactions with database increase • Query optimizations not easy • Hard to manage with database as a whole system.
Four technologies needed • Extensibility database architecture • Data mining and Data Warehousing • Data integration technologies • Internet portal technologies
Extending Databases • User-defined Types • User-defined operators • Domain-specific indexing • Optimizer extensibility
User-defined Types • Oracle Type System • Object types – structure is fully known to the database • Opaque types – not known to the database
User-defined operators • Define domain-specific operatorsresembles() • Can be invoked anywhere built-in operators can be used. Like in Select command:SELECT ID FROM DNATABLE WHERE Contains(fragment, `GCCATA`);
Extensible Indexing • Cooperative indexingUser-supplied implementations and the Oracle server cooperate to build and maintain indexes for complex types such as genetic, text or spatial data. • User implemented Indextype.
Extensible Optimizer • Gives developers control over the three main inputs used by the optimizer:statistics, selectivity, and cost.
Mining Sequence Data • Oracle Darwin for bioinformatics
Integrating Heterogeneous Data • Sequence data will be distributed all over the institutes. • Annotations to this data will make it change and grow all the time. • Oracle use ODBC and OLEDB to connect non-Oracle database system to do query, search, insert, delete.
Portal Technologies • ‘Soft Goods’ Sales • Visualization • Security & Access Control
Questions? • What’s the benefit for Oracle compared with BLAST? • Are there any other technologies required for the Bioinformatics database platform? • Is there anything Darwin can’t do for data mining in the bioinformatics database?