Genomics Algebra A New, Integrating Data Model, Language, and Tool for Processing and Querying Genomic Information. Joachim Hammer and Markus Schneider University of Florida CIDR 2003 Asilomar, CA Jan. 5-8, 2003. Overview. Data Management Problems in Bioinformatics Proposed Solution
Genomics Algebra: A New, Integrating Data Model, Language, and Tool for Processing and Querying Genomic Information
Joachim Hammer and Markus Schneider, University of Florida
CIDR 2003, Asilomar, CA, Jan. 5-8, 2003
Overview
• Data Management Problems in Bioinformatics
• Proposed Solution
• Genomics Algebra and Unifying Database
• Summary and Expected Impact
Bioinformatics
• Growing field of problems in the biological sciences that require the application of computing and mathematics
• The term "bioinformatics" was coined in the mid-1980s
• Genome Projects
  • Construct detailed genetic and physical maps of a variety of organisms
  • E.g., the Human Genome Project
• Functional Genomics
  • What do genes do and how do they interact?
  • E.g., drug discovery, agro-food, pharmacogenomics (individualized medicine)
Why is Bioinformatics Important?
• Acquiring sequences is the first step …
• Ultimate goal is to decipher the structural, functional, and evolutionary information encoded in the language of biological sequences
  • Alphabet (amino acids), words (motifs), sentences (proteins)
  • Decoding an unknown language
• To date, unable to predict structure (i.e., words and sentences) from sequence
  • Mostly pattern-matching techniques: detect similarity between sequences and infer related structures and functions
• Number of experimentally determined protein structures is VERY small
An Information Revolution …
• Emergence of rapid DNA sequencing and high-throughput gene analysis techniques
• Flood of genomic data
  • Nucleic acid and protein sequences, motifs, folding units, modules, interaction information, etc.
  • Complex data, e.g., sequential lists, deeply nested record structures, image & video data
• Data stored in more than 500 repositories
  • E.g., EMBL (150 GB, 2001), GenBank, SWISS-PROT, SANGER Centre (20 TB, 2001), …
• Sequence repositories increase 4x per year
• Known sequence data outweighs protein structural data ~100:1 (sequence/structure deficit)
… and the Resulting Problems for Biologists
• Scientists are overwhelmed by data that awaits further refinement and analysis
• Number and size of available data sources are continuously growing
  • Overlapping and conflicting information
  • Proliferation of interfaces and portals
  • Familiar sources sometimes disappear or get merged
  • Little or no agreement on terminology
• Unmanageable query results
• Forced to understand low-level data management
  • Often required to learn and write SQL or code in some other programming language (Perl)
• Noisy data
  • E.g., an estimated 30-60% of sequences in GenBank are erroneous
Corresponding CS Problems
• Management of heterogeneous, autonomous sources
• Missing standard for genomic data representation
• Formatted files prevail over conventional database representations (few sources use DBMSs)
• Lots of redundancies and inconsistencies
• Many different interfaces (e.g., Web-based, specialized GUIs and retrieval packages)
• Query languages not suitable for intended users
• Limited interaction functionality of repositories
• Query results are often unmanageable
CS Problems Cont'd
• Low-level treatment of data
  • Users manipulate strings and integers instead of genes and sequences
  • No high-level operations either
• Lack of extensibility of software managing sources
  • Not possible to integrate new, specialty evaluation functions
• Extraction of new knowledge from existing sources without much computational support
• Integration of new knowledge into repositories is tedious
  • E.g., no personal scratch pad that can be integrated with existing data
• Dealing with uncertainty and erroneous data
  • E.g., frameshift problem
State-of-the-Art
• Current research is focused mainly on integrating existing repositories
  • Federated and query-driven approaches (e.g., SRS, BioNavigator, DiscoveryLink, K2/Kleisli, Tambis, …)
  • Work on standardizing terminology and representations (e.g., Gene Ontology, EcoCyc, …)
• Analysis is performed outside of the repositories
  • Sequence similarity search: e.g., Basic Local Alignment Search Tool (BLAST) and its derivatives, …
  • Visualization tools: e.g., BEAUTY, BioWidgets, …
• Complex middleware tiers between end-users and the data servers
• Inefficient, lots of user involvement (human query processor)
Iterative Query and Analysis
While not done …
• Construct a database query
• Store query output
• Analyze query results
[Diagram: loop of Query Relevant Database(s), Store Query Output, Analyze Output, repeated until Done?]
Fundamental Challenge
• Development of a more principled approach to genomic data management
• Leverage capabilities provided by modern DBMSs
• Tightly integrated services
• Shield scientists from low-level data management details as much as possible
Integrating Approach to Genomics Data Management
• Extensible Genomics Algebra
  • Formal data model, query language, and software for representing, storing, retrieving, querying, and manipulating genomic information
  • Provides a set of high-level genomic data types (GDTs) together with genomic operations or functions
• Unifying Database
  • Persistent storage for high-level, structured GDT values of Genomics Algebra
  • Warehouse for data from existing genomic repositories
Mini Genomics Algebra
types
  codon, aminoAcid, gene, primaryTranscript, mRNA, protein
operators
  decode: codon → aminoAcid
    "given a codon, computes the corresponding amino acid"
  transcribe: gene → primaryTranscript
    "given a gene, returns its primary transcript"
  splice: primaryTranscript → mRNA
    "given a primary transcript, removes its introns to produce the mRNA"
  translate: mRNA → protein
    "given a messenger RNA, determines the corresponding protein"
  …
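
The signature above can be mirrored in ordinary code. What follows is only a minimal sketch in Python, not the authors' implementation. The class names, the exon-interval representation, and the use of plain strings for codons and amino acids are assumptions made for illustration, and the biology is deliberately simplified: the gene is taken to be the coding strand, and translation starts at the first AUG.

```python
from dataclasses import dataclass
from typing import Tuple

BASES = "UCAG"
# Standard genetic code with codons ordered UUU, UUC, UUA, UUG, UCU, ...
# '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"


@dataclass(frozen=True)
class Gene:
    dna: str                            # coding strand, 5'->3' (assumption)
    exons: Tuple[Tuple[int, int], ...]  # half-open [start, end) intervals


@dataclass(frozen=True)
class PrimaryTranscript:
    rna: str
    exons: Tuple[Tuple[int, int], ...]


@dataclass(frozen=True)
class MRNA:
    rna: str


@dataclass(frozen=True)
class Protein:
    residues: str  # one-letter amino acid codes


def decode(codon: str) -> str:
    """decode: codon -> aminoAcid (returns '*' for a stop codon)."""
    i, j, k = (BASES.index(base) for base in codon)
    return AMINO_ACIDS[16 * i + 4 * j + k]


def transcribe(gene: Gene) -> PrimaryTranscript:
    """transcribe: gene -> primaryTranscript (T replaced by U)."""
    return PrimaryTranscript(gene.dna.replace("T", "U"), gene.exons)


def splice(transcript: PrimaryTranscript) -> MRNA:
    """splice: primaryTranscript -> mRNA (keep exons, drop introns)."""
    return MRNA("".join(transcript.rna[s:e] for s, e in transcript.exons))


def translate(mrna: MRNA) -> Protein:
    """translate: mRNA -> protein (first AUG up to the first stop codon)."""
    start = mrna.rna.find("AUG")
    if start < 0:
        return Protein("")
    residues = []
    for pos in range(start, len(mrna.rna) - 2, 3):
        amino_acid = decode(mrna.rna[pos:pos + 3])
        if amino_acid == "*":
            break
        residues.append(amino_acid)
    return Protein("".join(residues))


# Toy gene: exon ATGGCC, intron GTAAGT, exon AAATAA (made-up sequence).
gene = Gene("ATGGCCGTAAGTAAATAA", exons=((0, 6), (12, 18)))
protein = translate(splice(transcribe(gene)))
```

For the toy gene, `protein.residues` is `"MAK"`. Giving each stage its own type means a mistaken call such as `translate(gene)` can be flagged by a static type checker, which is one reading of the slide's point about working with genes and sequences rather than raw strings.
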
What Can We Do with a Genomics Algebra?
• Can use the algebra to formally express existing biological operations
  • E.g., given a DNA fragment and a sequence, return true if the fragment contains the specified sequence: contains(frag, "ATTGCCATA")
• Create new operations using function composition
  • E.g., express the central dogma of molecular biology as translate(splice(transcribe(g)))
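
Both ideas on this slide can be tried with a few lines of Python. This is a hedged sketch only. The `compose` helper, the lower-case convention for introns, and the three-entry codon table are assumptions for illustration; the operators are toy stand-ins on plain strings rather than real genomic data types.

```python
from functools import reduce
from typing import Callable


def contains(fragment: str, sequence: str) -> bool:
    """True if the DNA fragment contains the given sequence."""
    return sequence in fragment


def compose(*operators: Callable) -> Callable:
    """compose(f, g, h)(x) == f(g(h(x)))."""
    return lambda x: reduce(lambda acc, op: op(acc), reversed(operators), x)


# Toy stand-ins for the algebra's operators.
def transcribe(gene: str) -> str:
    return gene.replace("T", "U")


def splice(primary_transcript: str) -> str:
    # Assumption: introns are written in lower case, exons in upper case.
    return "".join(ch for ch in primary_transcript if ch.isupper())


TOY_CODE = {"AUG": "M", "GCC": "A", "AAA": "K"}  # deliberately partial table


def translate(mrna: str) -> str:
    residues = []
    for pos in range(0, len(mrna) - 2, 3):
        amino_acid = TOY_CODE.get(mrna[pos:pos + 3])
        if amino_acid is None:  # stop codon, or codon outside the toy table
            break
        residues.append(amino_acid)
    return "".join(residues)


# A new operation built purely by composing existing ones.
central_dogma = compose(translate, splice, transcribe)

found = contains("GGATTGCCATAC", "ATTGCCATA")
protein = central_dogma("ATGGCCgtaagtAAATAA")
```

Here `found` is `True` and `protein` is `"MAK"`. The composed `central_dogma` is itself an operator from gene to protein, so it can be reused or composed further like any built-in operation.
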
Research Challenges
• What data types and operations do we need?
  • Need comprehensive ontology defining terminology, data objects, and operations
• Formalize definition of GDTs and operations
  • Vague or lacking knowledge of many biological processes makes this hard
• Implement algebra
  • Design of data structures and efficient algorithms for genomic operations
  • Must be extensible
  • Suitable for integration with a database system
Unifying Database
• Persistent storage manager for Genomics Algebra
• Integrated repository (warehouse) for genomics sources
  • GUS (U Penn) is the only other known genomics warehouse prototype system
• Provides superior query processing performance in multi-source environments
• Ability to maintain and annotate extracted source data after it has been cleansed, reconciled, and corrected
• Option to preserve historical data from those repositories that do not archive their contents
Integrated System Architecture
[Diagram components: Genomics Algebra; GUI; Extensible DBMS (Oracle, DB2, …); DBMS-specific Adapter; ETL; Unifying Database with a public space and several user spaces; External Repositories (e.g., GenBank, NCBI, …)]
Implementation
• Adapter provides DBMS-specific coupling mechanism between Genomics Algebra and DBMS
  • Use UDT mechanism (opaque types and user-defined operators linked as external functions)
  • Supported by all major DB vendors
• User interface component consisting of
  • Biological query language together with graphical output
  • XML application as standardized exchange format for sharing genomics data
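
To give a feel for user-defined operators linked as external functions, here is a small sketch that uses SQLite through Python's standard `sqlite3` module. This is a stand-in chosen for brevity, not the Oracle or DB2 UDT mechanism the slide refers to. Sequences are stored as plain TEXT rather than opaque types, and the table name, column names, and sample rows are hypothetical.

```python
import sqlite3


def contains(fragment: str, sequence: str) -> int:
    """External function: 1 if the fragment contains the sequence, else 0."""
    return 1 if sequence in fragment else 0


conn = sqlite3.connect(":memory:")
# Register the Python function so it can be called from SQL as contains(a, b).
conn.create_function("contains", 2, contains)

# Hypothetical table and rows, for illustration only.
conn.execute("CREATE TABLE fragments (id TEXT PRIMARY KEY, dna TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO fragments VALUES (?, ?)",
    [("f1", "GGATTGCCATAC"), ("f2", "CCCGGGTTTAAA")],
)

# The genomic operation is evaluated inside the query, next to the data.
hits = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM fragments WHERE contains(dna, ?) ORDER BY id",
        ("ATTGCCATA",),
    )
]
conn.close()
```

After the query, `hits` is `["f1"]`. The point of the sketch is the coupling style: the operation runs inside the DBMS query processor, so users do not have to export data and analyze it in a separate tool.
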
Research Challenges
• Design of the integrated schema
  • Iterative process with input from domain experts
• Detecting changes in underlying sources
  • Push capabilities are slowly being offered
  • Tools for computing what has changed
• Database maintenance
  • View maintenance problem
  • Derived data (annotations) based on update must be recomputed
  • Knowing provenance of data could be used to determine which annotations need to be recomputed
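
The last two challenges can be illustrated with a toy sketch: compare two snapshots of a source to find changed records, then use recorded provenance to select the annotations that need recomputation. The function names, the snapshot format (record id mapped to sequence), and all identifiers below are hypothetical. A real warehouse would also have to handle annotations derived from other annotations.

```python
from typing import Dict, Set


def changed_records(old: Dict[str, str], new: Dict[str, str]) -> Set[str]:
    """Ids of records that were added, deleted, or modified between snapshots."""
    return {rid for rid in old.keys() | new.keys() if old.get(rid) != new.get(rid)}


def stale_annotations(provenance: Dict[str, Set[str]], changed: Set[str]) -> Set[str]:
    """Annotations whose provenance includes at least one changed record."""
    return {ann for ann, sources in provenance.items() if sources & changed}


# Hypothetical snapshots of one source: record id -> sequence.
old_snapshot = {"r1": "ATGGCC", "r2": "TTTAAA", "r3": "GGGCCC"}
new_snapshot = {"r1": "ATGGCC", "r2": "TTTAAG", "r4": "CCC"}

# Hypothetical provenance: annotation id -> source records it was derived from.
provenance = {"ann1": {"r1"}, "ann2": {"r1", "r2"}, "ann3": {"r3"}}

changed = changed_records(old_snapshot, new_snapshot)
to_recompute = stale_annotations(provenance, changed)
```

In this example `changed` is `{"r2", "r3", "r4"}` and `to_recompute` is `{"ann2", "ann3"}`, while `ann1` is left alone because its only source record did not change.
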
Vision and Expected Impact
• Advocate a "back to the roots" strategy of database technology for bioinformatics
• Fundamental change in the way biologists analyze data
  • Single interface specifically designed for biologists
  • No need to become "computer scientists"
• New knowledge about design and implementation of a biological type system and its operations
• Demonstrate extensibility of modern DBMSs
• Help development of algebras for other applications