Enhancing Bioinformatics Database Platform with Oracle Technologies

A Database Platform for Bioinformatics Sandeepan Banerjee Oracle Corporation

Background • Need massive storage for more and more genomic and proteomic data generated in database • Need high-performance computing platform to search data, identify similarities and patterns within genomic data and unify the slices of distributed developed knowledge

Steps to genomic projects • Divide the chromosomes into smaller fragments that can be isolated • Order these fragments to correspond to their respective locations on the chromosomes. • Determine the sequence of bases A,T,C & G in each fragment. • Annotate the regions of sequenced chromosomes with their function • Catalogue the differences in sequences

Computing for cataloguing • Any two individuals differ in about 1/1000 of their genetic material, i.e. about 3 million base pairs. • The global population is now about 6 billion. • So a full cataloguing of all sequence differences will run to 18*1015 entries.

Traditional Database • Few databases have had a native ability to deal with complex data • Hard to handle high-dimensional data Ex. Query on structural similarity: Given a particular sequence, what other sequences resembling this sequence exist in the database?

BLAST • A set of similarity search programs • Hard to handle due to too complex, too large and far too custom-built • Degrade performing when interactions with database increase • Query optimizations not easy • Hard to manage with database as a whole system.

Four technologies needed • Extensibility database architecture • Data mining and Data Warehousing • Data integration technologies • Internet portal technologies

Extending Databases • User-defined Types • User-defined operators • Domain-specific indexing • Optimizer extensibility

User-defined Types • Oracle Type System • Object types – structure is fully known to the database • Opaque types – not known to the database

User-defined operators • Define domain-specific operatorsresembles() • Can be invoked anywhere built-in operators can be used. Like in Select command:SELECT ID FROM DNATABLE WHERE Contains(fragment, `GCCATA`);

Extensible Indexing • Cooperative indexingUser-supplied implementations and the Oracle server cooperate to build and maintain indexes for complex types such as genetic, text or spatial data. • User implemented Indextype.

Extensible Optimizer • Gives developers control over the three main inputs used by the optimizer:statistics, selectivity, and cost.

Mining Sequence Data • Oracle Darwin for bioinformatics

Integrating Heterogeneous Data • Sequence data will be distributed all over the institutes. • Annotations to this data will make it change and grow all the time. • Oracle use ODBC and OLEDB to connect non-Oracle database system to do query, search, insert, delete.

Portal Technologies • ‘Soft Goods’ Sales • Visualization • Security & Access Control

Questions? • What’s the benefit for Oracle compared with BLAST? • Are there any other technologies required for the Bioinformatics database platform? • Is there anything Darwin can’t do for data mining in the bioinformatics database?

Enhancing Bioinformatics Database Platform with Oracle Technologies

Enhancing Bioinformatics Database Platform with Oracle Technologies

Presentation Transcript

Tools for BioInformatics

Statistics for bioinformatics

Bioinformatics: A New Vision for Academics

Talify Searching for a platform

Bioinformatics approaches for…

A Massively Parallel Architecture for Bioinformatics

Bioinformatics for Research

A Platform for Innovation

Bioinformatics for Biologists

Perl for Bioinformatics

BioWarehouse: A Bioinformatics Database Warehouse

Programming for Bioinformatics

Database Technology in Bioinformatics

ILRI/BECA Bioinformatics Platform Introduction

Portals for Bioinformatics

CAP5510 – Bioinformatics Database Searches for Biological Sequences

Introduction of bioinformatics and Biological Database

CAP5510 – Bioinformatics Database Searches for Biological Sequences or Imperfect Alignments

A Biology Introduction for Bioinformatics

CAP5510 – Bioinformatics Database Searches for Biological Sequences

Bioinformatics Platform Three-tier Architecture

Integromics: a grid-enalbled platform for integration of advanced bioinformatics tools and data