A flexible graph-based controlled vocabulary engine

Presentation Transcript


  1. A flexible graph-based controlled vocabulary engine
     Johann Visagie <johann@egenetics.com>

  2. Background
     • Implementation of a controlled vocabulary engine
     • Basis for a more complex profiling system that will aid in the identification of disease gene candidates by integrating:
       • transcript information
       • standardised controlled vocabulary of expression terms
       • genomic sequence
       • genetic mapping information

  3. Structure of the Vocabulary
     • Orthogonal set of hierarchical schemas (trees)
     • Each schema describes an expression domain, e.g.:
       • Anatomical site, Pathology, Development Stage, Cell Type
     • A tree's nodes are associated with terms describing expression states in that tree's domain
     • Mapped 6937 cDNA libraries (incl. dbEST, SAGE), each to one or more nodes in as many trees as possible
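As a rough illustration of the structure on this slide, the sketch below models each schema as a tree of term nodes and attaches one cDNA library to nodes in more than one schema. The names `Node`, `Schema` and `annotate`, the terms, and the library identifier are all hypothetical; none of them is taken from the engine itself.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Node:
    """One term in a schema (illustrative only, not the engine's API)."""
    term: str
    children: List["Node"] = field(default_factory=list)
    libraries: Set[str] = field(default_factory=set)

    def add_child(self, term: str) -> "Node":
        child = Node(term)
        self.children.append(child)
        return child

    def walk(self):
        """Yield this node followed by all of its descendants."""
        yield self
        for child in self.children:
            yield from child.walk()


@dataclass
class Schema:
    """A single expression domain, held as a tree."""
    domain: str
    root: Node

    def find(self, term: str) -> Optional[Node]:
        for node in self.root.walk():
            if node.term == term:
                return node
        return None


def annotate(library_id, schemas, placements):
    """Attach one library to each (domain, term) pair in placements."""
    for domain, term in placements:
        node = schemas[domain].find(term)
        if node is None:
            raise KeyError(f"{term!r} not found in schema {domain!r}")
        node.libraries.add(library_id)


# Two orthogonal schemas with made-up terms.
anatomy = Schema("anatomy", Node("anatomical site"))
anatomy.root.add_child("liver")
pathology = Schema("pathology", Node("pathology"))
pathology.root.add_child("cancer")
schemas = {"anatomy": anatomy, "pathology": pathology}

# One library mapped into both trees; the identifier is invented.
annotate("lib_0001", schemas, [("anatomy", "liver"), ("pathology", "cancer")])
```

Because the schemas are independent of one another, a library can be placed in as many of them as its annotation supports.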

  4. Graph-based implementation
     • 2nd iteration
     • Python modules implementing hierarchical data structures, based on generalised graph library
     • Flexible enough for future experimentation (different data structures, multiple relationship types, etc.)
     • All operations in-memory
     • Overcomes most limitations of prior implementation
       • Forced unique terms, limited to pure trees, speed issues, database-centricity
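One way to picture a hierarchy layered on a general graph is sketched below. It is a minimal in-memory example under stated assumptions, not the library the slides refer to: the `Graph` class, its method names, the relation labels `"child"` and `"part_of"`, and the node identifiers are all invented for illustration. Nodes are keyed by an identifier rather than by term, so terms need not be unique, and a node may have more than one parent.

```python
from collections import defaultdict


class Graph:
    """Minimal in-memory directed graph with labelled edges (hypothetical API)."""

    def __init__(self):
        self.terms = {}                  # node id -> term (terms may repeat)
        self._out = defaultdict(set)     # (node id, relation) -> target ids

    def add_node(self, node_id, term):
        self.terms[node_id] = term

    def add_edge(self, source, target, relation="child"):
        if source not in self.terms or target not in self.terms:
            raise KeyError("both endpoints must be added before the edge")
        self._out[(source, relation)].add(target)

    def successors(self, node_id, relation="child"):
        return set(self._out.get((node_id, relation), ()))

    def descendants(self, node_id, relation="child"):
        """Return every node reachable from node_id along one relation type."""
        seen, stack = set(), [node_id]
        while stack:
            for nxt in self.successors(stack.pop(), relation):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen


g = Graph()
g.add_node("n1", "cardiovascular system")
g.add_node("n2", "liver")
g.add_node("n3", "hepatic artery")

# "n3" has two parents, which a pure tree could not express.
g.add_edge("n1", "n3")
g.add_edge("n2", "n3")

# A second, independent relationship type on the same nodes.
g.add_edge("n3", "n2", relation="part_of")
```

Keeping the relation label on each edge means a hierarchy is just a traversal over one relation, so other relationship types can be added without changing the node storage.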

  5. Query language
     • Parser for a simplistic Boolean query language:
       • pathology:cancer AND (anatomy:liver OR anatomy:stomach)
     • Implicit "query sets"
     • Tool for the power user
     • Each query term resolves to a set of nodes in a tree (the node matching the term, and all its children), which maps to a set of cDNA libraries
     • Note: Multiple orthogonal classification domains allow for construction of almost arbitrary query resolution
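The resolution step described on this slide can be sketched with a small recursive-descent evaluator. This is an illustration under assumptions, not the engine's parser: the `CHILDREN` and `LIBRARIES` tables and the library names are invented, "all its children" is read here as all descendants, and AND is assumed to bind more tightly than OR, which the slides do not specify.

```python
import re

# Hypothetical data: per-domain child terms, and libraries mapped to each node.
CHILDREN = {
    "pathology": {"cancer": ["carcinoma"], "carcinoma": []},
    "anatomy": {"liver": [], "stomach": []},
}
LIBRARIES = {
    ("pathology", "cancer"): {"lib1"},
    ("pathology", "carcinoma"): {"lib2", "lib3"},
    ("anatomy", "liver"): {"lib2"},
    ("anatomy", "stomach"): {"lib1"},
}

TOKEN = re.compile(r"\s*(\(|\)|AND\b|OR\b|[\w-]+:[\w-]+)")


def tokenize(query):
    tokens, pos = [], 0
    query = query.rstrip()
    while pos < len(query):
        match = TOKEN.match(query, pos)
        if match is None:
            raise ValueError(f"unexpected input at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def resolve(domain, term):
    """Libraries mapped to the matching node or to any node below it."""
    tree = CHILDREN[domain]
    if term not in tree:
        raise KeyError(f"unknown term {domain}:{term}")
    found, stack = set(), [term]
    while stack:
        current = stack.pop()
        found |= LIBRARIES.get((domain, current), set())
        stack.extend(tree[current])
    return found


def evaluate(query):
    """Evaluate a Boolean query to a set of library identifiers."""
    tokens = tokenize(query)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        token = peek()
        pos += 1
        return token

    def parse_or():
        result = parse_and()
        while peek() == "OR":
            advance()
            result = result | parse_and()   # OR is set union
        return result

    def parse_and():
        result = parse_atom()
        while peek() == "AND":
            advance()
            result = result & parse_atom()  # AND is set intersection
        return result

    def parse_atom():
        token = advance()
        if token == "(":
            result = parse_or()
            if advance() != ")":
                raise ValueError("missing closing parenthesis")
            return result
        if token is None or ":" not in token:
            raise ValueError(f"expected domain:term, got {token!r}")
        domain, term = token.split(":", 1)
        return resolve(domain, term)

    result = parse_or()
    if peek() is not None:
        raise ValueError(f"unexpected token {peek()!r}")
    return result


hits = evaluate("pathology:cancer AND (anatomy:liver OR anatomy:stomach)")
```

With the made-up data above, `hits` contains `lib1` and `lib2`: `pathology:cancer` picks up the libraries of its child `carcinoma` as well as its own, and the intersection with the two anatomy terms keeps only libraries annotated in both domains.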

  6. Interfaces
     • Python API
     • Under development:
       • SOAP v1.1
       • DAS v1.5 (under investigation)
       • wxPython-based GUI
         • Curation
         • Query interface for users

  7. Application
     • Proved its worth in a number of SANBI research projects
     • Components of controlled vocabulary system are in use by a number of groups

  8. Acknowledgements
     • Soraya Bardien-Kruger
     • Alan Christoffels
     • Tania Hide
     • Winston Hide
     • Paul Hüsler
     • Janet Kelso
     • Damian Smedley
