
Joachim Hammer and Markus Schneider University of Florida CIDR 2003 Asilomar, CA Jan. 5-8, 2003

Genomics Algebra: A New, Integrating Data Model, Language, and Tool for Processing and Querying Genomic Information


Presentation Transcript


  1. Genomics Algebra: A New, Integrating Data Model, Language, and Tool for Processing and Querying Genomic Information • Joachim Hammer and Markus Schneider, University of Florida • CIDR 2003, Asilomar, CA, Jan. 5-8, 2003

  2. Overview • Data Management Problems in Bioinformatics • Proposed Solution • Genomics Algebra and Unifying Database • Summary and Expected Impact CIDR 2003

  3. Bioinformatics • Growing field of problems in the biological sciences that require the application of computing and mathematics • The term “bioinformatics” was coined in the mid-1980s • Genome Projects • Construct detailed genetic and physical maps of a variety of organisms • E.g., Human Genome Project • Functional Genomics • What do genes do and how do they interact? • E.g., drug discovery, agro-food, pharmacogenomics (individualized medicine)

  4. Why is Bioinformatics Important? • Acquiring sequences is the first step … • Ultimate goal is to decipher the structural, functional, and evolutionary information encoded in the language of biological sequences • Alphabet (amino acids), words (motifs), sentences (proteins) • Decoding an unknown language • To date, unable to predict structure (i.e., words and sentences) from sequence • Mostly pattern-matching techniques: detect similarity between sequences and infer related structures and functions • Number of experimentally determined protein structures is VERY small

  5. An Information Revolution … • Emergence of rapid DNA sequencing and high-throughput gene analysis techniques • Flood of genomic data • Nucleic acid and protein sequences, motifs, folding units, modules, interaction information, etc. • Complex data, e.g., sequential lists, deeply nested record structures, image & video data • Data stored in more than 500 repositories • E.g., EMBL (150 GB, 2001), GenBank, SWISS-PROT, SANGER Centre (20 TB, 2001), … • Sequence repositories increase 4x per year • Known sequence data outweighs protein structural data ~100:1 (sequence/structure deficit)

  6. … and the Resulting Problems for Biologists • Scientists are overwhelmed by data that is awaiting further refinement and analysis • Number and size of available data sources continuously growing • Overlapping and conflicting information • Proliferation of interfaces and portals • Familiar sources sometimes disappear or get merged • Little or no agreement on terminology • Unmanageable query results • Forced to understand low-level data management • Often required to learn and write SQL or code in some other programming language (e.g., Perl) • Noisy data • E.g., estimated that 30-60% of sequences in GenBank are erroneous

  7. Corresponding CS Problems • Management of heterogeneous, autonomous sources • Missing standard for genomic data representation • Formatted files prevail over conventional database representations (few sources use DBMSs) • Lots of redundancies and inconsistencies • Many different interfaces (e.g., Web-based, specialized GUIs and retrieval packages) • Query languages not suitable for intended users • Limited interaction functionality of repositories • Query results are often unmanageable

  8. CS Problems Cont’d • Low-level treatment of data • Users manipulate strings and integers instead of genes and sequences • No high-level operations either • Lack of extensibility of software managing sources • Not possible to integrate new, specialty evaluation functions • New knowledge must be extracted from existing sources without much computational support • Integration of new knowledge into repositories is tedious • E.g., no personal scratch pad that can be integrated with existing data • Dealing with uncertainty and erroneous data • E.g., frameshift problem

  9. State-of-the-Art • Current research is focused mainly on integrating existing repositories • Federated and query-driven approaches (e.g., SRS, BioNavigator, DiscoveryLink, K2/Kleisli, Tambis, …) • Work on standardizing terminology and representations (e.g., Gene Ontology, EcoCyc, …) • Analysis is performed outside of the repositories • Sequence similarity search: e.g., Basic Local Alignment Search Tool (BLAST) and its derivatives, … • Visualization tools: e.g., BEAUTY, BioWidgets, … • Complex middleware tiers between end-users and the data servers • Inefficient, lots of user involvement (human query processor)

  10. Iterative Query and Analysis • While not done …: construct a database query, store the query output, analyze the query results • [Flowchart: Query Relevant Database(s) → Store Query Output → Analyze Output → Done?]
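The loop on slide 10 can be sketched in a few lines of Python. This is only an illustration of the manual workflow the slide describes, not code from the presentation: the function names, the in-memory `DATABASE`, and the motif-shortening strategy are all assumptions chosen to keep the example self-contained.

```python
# Hypothetical stand-in for a sequence repository.
DATABASE = ["ATTGCCATA", "GGCATTAGC", "TTGACCA"]

def construct_query(feedback):
    # First round: start with a long motif; later rounds reuse the refined one.
    return "ATTGCCATAGG" if feedback is None else feedback

def run_query(motif):
    # "Query relevant database(s)": exact substring search over the toy data.
    return {"motif": motif, "hits": [s for s in DATABASE if motif in s]}

def analyze(output):
    # "Analyze output": stop on a hit (or a very short motif),
    # otherwise drop the last base and ask for another round.
    if output["hits"] or len(output["motif"]) <= 3:
        return None, True
    return output["motif"][:-1], False

def iterative_analysis(max_rounds=20):
    stored = []          # "Store query output"
    feedback = None
    for _ in range(max_rounds):
        output = run_query(construct_query(feedback))
        stored.append(output)
        feedback, done = analyze(output)
        if done:
            break
    return stored

rounds = iterative_analysis()
```

Each pass through the loop corresponds to one trip around the flowchart; in practice the scientist performs the `construct_query` and `analyze` steps by hand, which is the "human query processor" problem noted on slide 9.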

  11. Fundamental Challenge • Development of a more principled approach to genomic data management • Leverage capabilities provided by modern DBMS • Services tightly integrated • Shields scientists from knowing low-level data management details as much as possible

  12. Integrating Approach to Genomics Data Management • Extensible Genomics Algebra • Formal data model, query language, and software for representing, storing, retrieving, querying, and manipulating genomic information • Provides a set of high-level genomic data types (GDTs) together with genomic operations or functions • Unifying Database • Persistent storage for high-level, structured GDT values of Genomics Algebra • Warehouse for data from existing genomic repositories

  13. Mini Genomics Algebra • types: codon, aminoAcid, gene, primaryTranscript, mRNA, protein • operators: • decode: codon → aminoAcid “given a codon, computes the corresponding amino acid” • transcribe: gene → primaryTranscript “given a gene, returns its primary transcript” • splice: primaryTranscript → mRNA “given a primary transcript, removes its introns to produce the mRNA” • translate: mRNA → protein “given a messenger RNA, determines the corresponding protein” • …

  14. What Can We Do with a Genomics Algebra? • Can use the algebra to formally express existing biological operations • E.g., given a DNA fragment and a sequence, return true if the fragment contains the specified sequence: contains(frag, “ATTGCCATA”) • Create new operations using function composition • E.g., express the central dogma of molecular biology as translate(splice(transcribe(g)))
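The two ideas on slide 14 can be sketched briefly in Python. The `contains` predicate below assumes exact, case-insensitive substring matching, which the slide does not specify, and `compose` is a generic helper introduced for this example. The three operator bodies are placeholders that only record the order of application; they are not real biological implementations.

```python
from functools import reduce
from typing import Callable

def contains(fragment: str, sequence: str) -> bool:
    """True if the DNA fragment contains the sequence (exact match assumed)."""
    return sequence.upper() in fragment.upper()

def compose(*functions: Callable) -> Callable:
    """compose(f, g, h)(x) is f(g(h(x)))."""
    return reduce(lambda f, g: lambda x: f(g(x)), functions)

# Placeholder operators that only tag each stage of the pipeline.
def transcribe(gene):
    return f"primaryTranscript({gene})"

def splice(transcript):
    return f"mRNA({transcript})"

def translate(mrna):
    return f"protein({mrna})"

# A new operation built purely by composition.
gene_to_protein = compose(translate, splice, transcribe)

found = contains("GGATTGCCATACC", "ATTGCCATA")   # True
result = gene_to_protein("g")   # "protein(mRNA(primaryTranscript(g)))"
```

The composed `gene_to_protein` is itself an operator of the algebra, so it can be named, stored, and reused in later queries like any built-in operation.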

  15. Research Challenges • What data types and operations do we need? • Need comprehensive ontology defining terminology, data objects, and operations • Formalize definition of GDTs and operations • Vague or lacking knowledge of many biological processes makes this hard • Implement algebra • Design of data structures and efficient algorithms for genomic operations • Must be extensible • Suitable for integration with a database system

  16. Unifying Database • Persistent storage manager for Genomics Algebra • Integrated repository (warehouse) for genomics sources • GUS (U Penn) is the only other known genomics warehouse prototype system • Provides superior query processing performance in multi-source environments • Ability to maintain and annotate extracted source data after it has been cleansed, reconciled, and corrected • Option to preserve historical data from those repositories that do not archive their contents

  17. Integrated System Architecture • [Architecture diagram; labels: Genomics Algebra • GUI • Extensible DBMS (Oracle, DB2, …) • DBMS-specific Adapter • ETL • Unifying Database • External Repositories (e.g., GenBank, NCBI, …) • public space • user space (several)]

  18. Implementation • Adapter provides DBMS-specific coupling mechanism between Genomics Algebra and DBMS • Use UDT mechanism (opaque types and user-defined operators linked as external functions) • Supported by all major DB vendors • User interface component consisting of • Biological query language together with graphical output • XML application as standardized exchange format for sharing genomics data
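The coupling idea on slide 18, a user-defined operator linked into the DBMS as an external function, can be illustrated with Python's built-in `sqlite3` module. SQLite is used here only as a convenient stand-in; the slides name Oracle and DB2, whose UDT mechanisms differ, and SQLite has no opaque types, so this sketch covers only the user-defined operator half. The table layout and the function name `dna_contains` are assumptions made for the example.

```python
import sqlite3

def dna_contains(fragment: str, sequence: str) -> int:
    # Algebra operation implemented outside the DBMS; returns 1/0 for SQL.
    return int(sequence in fragment)

conn = sqlite3.connect(":memory:")

# Register the Python function as a two-argument SQL function.
conn.create_function("dna_contains", 2, dna_contains)

conn.execute("CREATE TABLE fragments (id INTEGER PRIMARY KEY, dna TEXT)")
conn.executemany(
    "INSERT INTO fragments (dna) VALUES (?)",
    [("GGATTGCCATACC",), ("TTGACCA",)],
)

# The operator can now appear directly in a query predicate.
rows = conn.execute(
    "SELECT id FROM fragments WHERE dna_contains(dna, ?)",
    ("ATTGCCATA",),
).fetchall()
```

With the operator registered inside the database, the filtering happens during query evaluation rather than in a separate analysis step outside the repository.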

  19. Research Challenges • Design of the integrated schema • Iterative process with input from domain experts • Detecting changes in underlying sources • Push capabilities are slowly being offered • Tools for computing what has changed • Database maintenance • View maintenance problem • Derived data (annotations) based on updated source data must be recomputed • Knowing provenance of data could be used to determine which annotations need to be recomputed
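The last two bullets of slide 19 suggest a two-step maintenance procedure: compute what changed in a source, then use provenance to find the affected annotations. A minimal Python sketch follows, assuming that source snapshots are dictionaries from record ID to content and that provenance is a mapping from each annotation to the record IDs it was derived from. The record and annotation identifiers are invented for the example.

```python
from typing import Dict, Set

def changed_ids(old: Dict[str, str], new: Dict[str, str]) -> Set[str]:
    """IDs that were added, removed, or modified between two snapshots."""
    return {k for k in old.keys() | new.keys() if old.get(k) != new.get(k)}

def stale_annotations(provenance: Dict[str, Set[str]],
                      changed: Set[str]) -> Set[str]:
    """Annotations derived from at least one changed source record."""
    return {name for name, sources in provenance.items() if sources & changed}

# Hypothetical snapshots of a source taken at two points in time.
old_snapshot = {"GB:1": "ATG", "GB:2": "CCC", "SP:9": "MAF"}
new_snapshot = {"GB:1": "ATG", "GB:2": "CCT", "SP:10": "MK"}

# Hypothetical provenance: annotation -> source records it depends on.
provenance = {
    "ann-a": {"GB:1"},
    "ann-b": {"GB:1", "GB:2"},
    "ann-c": {"SP:9"},
}

changed = changed_ids(old_snapshot, new_snapshot)
to_recompute = stale_annotations(provenance, changed)
```

Here `ann-a` depends only on an unchanged record and is left alone, which is the saving that provenance tracking offers over recomputing every derived annotation after each update.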

  20. Vision and Expected Impact • Advocate a “back to the roots” strategy of database technology for bioinformatics • Fundamental change in the way biologists analyze data • Single interface specifically designed for biologists • No need to become “computer scientists” • New knowledge about design and implementation of biological type system and its operations • Demonstrate extensibility of modern DBMS • Help development of algebras for other applications
