The Prometheus Taxonomic Database
Cédric Raguenaud, Jessie Kennedy, Peter Barclay
Napier University, Edinburgh
http://www.dcs.napier.ac.uk/~prometheus
Contents
• What is taxonomy?
• What are the features of taxonomic data/processes?
• Which database?
• The Prometheus approach
• Schema example
• Particularities of the model
• Example queries
• Summary & Conclusions
What is plant taxonomy?
[Figure: six panels, (i) to (vi), in which the same coloured shapes (red squares, yellow round shapes, purple diamond shapes) are grouped in different ways under the ranks family, tribe, genus, species and variety.]
Plant Taxonomy Data
• The data is hierarchical
• Multiple overlapping hierarchies co-exist
  • distinct hierarchies need to be identified for manipulation and extraction
  • explicit relationships (=> graphs)
  • querying is recursive and depends on the context of the relationships
• Nodes in the hierarchy are aggregate objects
  • they also have associations to other objects outside the hierarchy
  • relationships must differentiate between association and aggregation
  • extraction of composite objects is required
• Levels of the hierarchy bear information
  • ranks are biologically significant (e.g. "genus" vs "species")
• Domain-specific rules are important
  • data is derived based on domain-specific rules
  • defining rules requires the definition of constraints
  • positioning of objects in a hierarchy depends on domain-specific constraints (e.g. family names must end with -aceae)
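A minimal Python sketch of two of these features: several classifications sharing the same nodes, queried recursively within one classification at a time, and a rank-dependent naming rule (botanical family names end in -aceae). The taxon names, the classification labels "A" and "B", and the helper names are illustrative assumptions, not data or code from Prometheus.

```python
from collections import defaultdict

# Each edge is (classification, parent, child). The same taxon may sit under
# different parents in different classifications. All names are made up.
edges = [
    ("A", "Xaceae", "Alpha"),
    ("A", "Alpha", "Alpha one"),
    ("B", "Xaceae", "Beta"),
    ("B", "Beta", "Alpha one"),
]

def children_by_classification(edges):
    """Index children by (classification, parent) so hierarchies stay distinct."""
    index = defaultdict(list)
    for classification, parent, child in edges:
        index[(classification, parent)].append(child)
    return index

def descendants(index, classification, taxon):
    """Recursively collect descendants of a taxon within one classification."""
    found = []
    for child in index.get((classification, taxon), []):
        found.append(child)
        found.extend(descendants(index, classification, child))
    return found

def valid_family_name(name):
    """Hypothetical constraint check: a family name must end with -aceae."""
    return name.endswith("aceae")

index = children_by_classification(edges)
print(descendants(index, "A", "Xaceae"))
print(descendants(index, "B", "Xaceae"))
print(valid_family_name("Xaceae"), valid_family_name("Alpha"))
```

The species "Alpha one" appears under a different genus in each classification, which is why the recursive lookup has to carry the classification along with the taxon.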
Which Database?
• Existing taxonomic databases are inadequate due to:
  • simplicity of their model of taxonomy
    • support for single classifications only
  • limitations of the underlying database:
    • Relational model: limited semantics, no explicit relationships, no recursive querying
    • Graph models: limited semantics, often no constraints
    • Semi-structured data: limited semantics, no a priori schema
    • Object-oriented models: limited support for relationships, no recursive querying
• Need an OODB with relationships + graph functionality
  • OODBs with relationships already exist (e.g. OMS, Albano's, GraphDB)
    • but they are limited (e.g. no QL, no semantics for relationships, or no constraints)
    • or based on uncommon models (e.g. the collection-based model of Albano)
Prometheus Approach
• Prometheus Model
  • ODMG model extended with relationships as first-class constructs
    • association and aggregation
    • cardinality, traversability, shareability, dependency …
  • Reduces the gap between design and implementation
  • Attributes on relationships are used to distinguish classifications
• POOL
  • OQL + operators for manipulating relationships and graphs
    • query relationship objects
    • define a query on aggregation relationships only
    • specify a particular path to be followed through a hierarchy
    • specify the transitive closure of a relationship
    • return a hierarchy as a structure
• Prometheus prototype implemented
  • using POET (an ODMG OODB) and Java
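To illustrate what "relationships as first-class constructs" can mean in practice, here is a small Python sketch in which each relationship is an object with its own attributes and can be queried directly. The class names, field names, and the `select` helper are assumptions made for this example; they are not the Prometheus or POOL API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Taxon:
    name: str

@dataclass(frozen=True)
class Relationship:
    """A relationship stored as an object in its own right."""
    kind: str            # "aggregation" or "association"
    origin: Taxon
    destination: Taxon
    classification: str  # attribute that tells classifications apart

family = Taxon("Xaceae")
genus = Taxon("Alpha")
rels = [
    Relationship("aggregation", family, genus, "A"),
    Relationship("aggregation", family, genus, "B"),
    Relationship("association", genus, Taxon("some author"), "A"),
]

def select(rels, **criteria):
    """Query the relationship objects themselves by attribute value."""
    return [r for r in rels
            if all(getattr(r, key) == value for key, value in criteria.items())]

in_a = select(rels, kind="aggregation", classification="A")
print(len(in_a), in_a[0].destination.name)
```

Because the classification label lives on the relationship rather than on the taxa, the same two taxa can be linked in two classifications without duplicating either object.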
Simple Taxonomic Schema
[Figure: UML class diagram of the schema. Classes, with the attributes listed next to them on the slide: Author (givenNames, surname, DOB, DOD), AuthorAbbreviation (theAbbreviation), Date, Circumscription, Epithet (theName), Typedefinition, Name (TheValidity, calculatedFullNameNoAuthor, calculatedFullName), Specimen (barCode, herbarium, collectionNumber, latitude, longitude, Note), Rank (theBinomial, theName), Placement, Publication (thePublication, thePage), PublicationAbbreviation (theAbbreviation), ReferenceDatabase (theReference). Relationship labels, mostly with 0..n multiplicities: authors, theAuthors, theAuthor, theAuthorAbbreviations, collector, theCircAuthor, theCircPublication, theCircumscription, theDate, theEpithet, theRank, nextRank, previousRank, theRef, thePublicationAbbreviation, thePublication, LinkToDet, LinkToType.]
Relationships in the DB
• The semantics of relationships (e.g. composition) can vary
  • Prometheus implements all these semantics by providing a set of behaviours, constraints, and flags that can be combined
  • e.g. once a classification is published, it is unchangeable (even if it includes mistakes)
    • the theCircumscription relationship implements the "not changeable" behaviour
• Directionality of relationships is important
  • for propagation of operations (e.g. deletion of a composition)
  • as groups at any level contain groups at lower levels
    • a family contains several genera, each of which contains several species
• Attributes of relationships are important
  • classifying is independent of the objects classified
  • relationships build the classification
  • attributes of relationships differentiate classifications
  • the system is a generic classification system
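The idea of combining behaviours and flags on a relationship can be sketched in Python as follows. The flag names `changeable` and `propagate_delete`, the exception class, and the `delete` helper are hypothetical names chosen for this example, not the behaviours as implemented in Prometheus.

```python
class FrozenRelationshipError(Exception):
    """Raised when an unchangeable (published) relationship is modified."""

class Relationship:
    def __init__(self, origin, destination, changeable=True, propagate_delete=False):
        self.origin = origin
        self.destination = destination
        self.changeable = changeable              # False once published
        self.propagate_delete = propagate_delete  # deleting origin deletes destination

    def retarget(self, new_destination):
        if not self.changeable:
            raise FrozenRelationshipError("published relationship cannot change")
        self.destination = new_destination

def delete(obj, relationships, deleted=None):
    """Delete obj, following propagating relationships from origin to destination."""
    deleted = set() if deleted is None else deleted
    if obj in deleted:
        return deleted
    deleted.add(obj)
    for rel in relationships:
        if rel.origin == obj and rel.propagate_delete:
            delete(rel.destination, relationships, deleted)
    return deleted

hierarchy = [
    Relationship("family", "genus", propagate_delete=True),
    Relationship("genus", "species", propagate_delete=True),
    Relationship("genus", "author"),  # plain association: no propagation
]
published = Relationship("name", "specimen", changeable=False)

print(sorted(delete("family", hierarchy)))
```

Deletion only travels in the direction of the relationship, so deleting "species" leaves "genus" and "family" untouched, and the associated "author" is never removed.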
Example Queries - 1
[Figure: fragment of the schema diagram showing Name, Epithet, Typedefinition, Specimen and Rank with the theEpithet, theRank and LinkToType relationships.]
• Querying relationships
  • Select the Names whose rank is Genus.
      select n from Name n where n.theRank.destination.theName = "Genus"
  • theRank is a relationship class.
  • n is considered the origin of theRank in the query, and the relationship should be followed only from source to destination
    • i.e. no reverse traversal of the relationship.
• Downcast operator
  • Select the Names whose type is called graveolens.
      select n from Name n where n.LinkToType[Name].theEpithet.theName = "graveolens"
  • the type of the object targeted by the destination attribute of the TaxonomicType relationship should be Name, and not Typedefinition as shown in the model.
  • All objects that are not of type Name are discarded, with no error reported.
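The downcast behaviour can be approximated in plain Python as a type filter applied before the next step of the path. This is only a sketch: the classes below are simplified stand-ins for the schema classes, the sample data is invented, and `downcast` is a hypothetical helper rather than a POOL operator implementation.

```python
from dataclasses import dataclass

@dataclass
class Epithet:
    theName: str

@dataclass
class Specimen:
    barCode: str

@dataclass
class Name:
    theEpithet: Epithet
    LinkToType: list  # destinations may be Name or Specimen objects

def downcast(objects, cls):
    """Keep only objects of the requested class; others are dropped silently."""
    return [obj for obj in objects if isinstance(obj, cls)]

type_name = Name(Epithet("graveolens"), [])
names = [
    Name(Epithet("first"), [type_name]),          # its type is a Name
    Name(Epithet("second"), [Specimen("E001")]),  # its type is a Specimen
    type_name,
]

# Rough equivalent of:
#   select n from Name n
#   where n.LinkToType[Name].theEpithet.theName = "graveolens"
result = [
    n for n in names
    if any(t.theEpithet.theName == "graveolens"
           for t in downcast(n.LinkToType, Name))
]
print([n.theEpithet.theName for n in result])
```

Without the filter, the Specimen in the second entry would raise an AttributeError when the path asks for theEpithet; with it, that entry is skipped without an error.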
Example Queries - 2
[Figure: fragment of the schema diagram showing Author, Circumscription, Epithet, Typedefinition, Specimen, Name, Placement, Publication and Rank with the theCircumscription, theEpithet, theRank, LinkToType, thePublication and theAuthors relationships.]
• Aggregate operator
  • Select the Names whose circumscription contains the specimen whose bar code is "X"
      select shallow aggregate n from Name n where n.theCircumscription[Specimen].barCode = "X"
  • extracts the Name objects that satisfy the criterion, then finds for each Name object all objects aggregated to form the concept of Name.
• Transitive closure
  • Select the Names that contain, or whose subordinate Names contain, the specimen whose bar code is "X"
      select n from Name n where n.theCircumscription[Name]*.theCircumscription.destination[Specimen].barCode = "X"
  • a relationship class is used as a simple regular expression
  • follow zero or more theCircumscription relationships to find the Name objects containing the specimen "X".
  • "*": repetition of a path between 0 and n times; "?": an optional path; "+": repetition of a path one or more times
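The transitive-closure path (zero or more Name-to-Name steps, then one step to a Specimen) can be sketched in Python as a recursive search. The dictionary layout, the tuple encoding of specimens, and all names are assumptions for illustration only.

```python
# circumscription maps each Name to the Names and specimens it directly contains.
# Specimens are ("specimen", bar_code) tuples; all data is made up.
circumscription = {
    "Xaceae": ["Alpha", "Beta"],
    "Alpha": ["Alpha one"],
    "Alpha one": [("specimen", "X")],
    "Beta": [("specimen", "Z")],
}

def contains_specimen(name, bar_code, seen=None):
    """Follow zero or more Name-to-Name steps, then one step to a specimen."""
    seen = set() if seen is None else seen
    if name in seen:  # guard against cycles in the data
        return False
    seen.add(name)
    for member in circumscription.get(name, []):
        if member == ("specimen", bar_code):
            return True
        if isinstance(member, str) and contains_specimen(member, bar_code, seen):
            return True
    return False

result = sorted(n for n in circumscription if contains_specimen(n, "X"))
print(result)
```

The zero-repetition case is what lets "Alpha one" match directly, while "Alpha" and "Xaceae" match through one and two intermediate Names respectively.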
Example Queries - 3
[Figure: fragment of the schema diagram showing Circumscription, Specimen and Name with the theCircumscription relationship.]
• Follow operator
  • Select the names hierarchy
      select n, n.theCircumscription from Name n follow theCircumscription
  • the follow clause tells the query engine that Name objects in the result set must be related by a theCircumscription relationship object.
  • a hierarchy is a directed connected graph.
  • Therefore, the answer to such a query is a set of connected graphs.
• XLINK
  • Select the names that have specimen "X" in their circumscription
      select n from Name n where n.theCircumscription[Name]*.theCircumscription[Specimen].barCode = "X" xlink
  • finds Name objects that are related to the Specimen "X" via one or more theCircumscription relationships in a single hierarchy.
  • Without xlink, any path relating a Name to a Specimen would be followed and hierarchies would be mixed up.
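To make "the answer is a set of connected graphs" concrete, the following Python sketch groups relationship edges into weakly connected graphs. It is an assumed illustration of the shape of a follow result, using invented edges; it does not reproduce how the Prometheus query engine evaluates follow.

```python
from collections import defaultdict

# (origin Name, destination Name) pairs of theCircumscription; made-up data.
edges = [
    ("Xaceae", "Alpha"),
    ("Alpha", "Alpha one"),
    ("Yaceae", "Gamma"),
]

def connected_hierarchies(edges):
    """Group directed edges into weakly connected graphs."""
    neighbours = defaultdict(set)
    for origin, destination in edges:
        neighbours[origin].add(destination)
        neighbours[destination].add(origin)
    seen, graphs = set(), []
    for start in sorted(neighbours):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:  # depth-first walk ignoring edge direction
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(neighbours[node] - component)
        seen |= component
        graphs.append(sorted(e for e in edges if e[0] in component))
    return graphs

for graph in connected_hierarchies(edges):
    print(graph)
```

Each returned list keeps the original directed edges, so the hierarchy structure survives in the result rather than being flattened into a set of names.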
Example Queries - 4
[Figure: fragment of the schema diagram showing Author, Circumscription, Specimen, Name and Publication with the theCircumscription, theCircAuthor and theCircPublication relationships.]
• Integrity of graphs in path expressions
  • Select the names containing specimen "X" in their circumscription, where the classification was published in "Y"
      select n from Name n, theCircumscription c where n.c[Name]*.c[Specimen].barCode = "X" xlink where c.theCircPublication.thePublication = "Y"
  • finds all Name objects containing the specimen in their circumscription at any depth
  • but only according to the one publication that is declared in the xlink clause.
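The effect of restricting a path to a single published classification can be sketched in Python by filtering relationships on their publication before walking upward from the specimen. The tuple layout, the publication labels "Y" and "W", and the `names_containing` helper are assumptions for this example, not POOL semantics as implemented.

```python
# Each theCircumscription relationship carries the publication of its
# classification. Tuples are (publication, origin, destination); made-up data.
rels = [
    ("Y", "Xaceae", "Alpha"),
    ("Y", "Alpha", "specimen:X"),
    ("W", "Xaceae", "Beta"),
    ("W", "Beta", "specimen:X"),
    ("W", "Yaceae", "Alpha"),  # Alpha is placed elsewhere in publication W
]

def names_containing(specimen, publication=None):
    """Names reaching the specimen. If a publication is given, every step on
    the path must belong to that publication (an xlink-style constraint)."""
    usable = [r for r in rels if publication is None or r[0] == publication]
    found, frontier = set(), {specimen}
    while frontier:
        parents = {o for _, o, d in usable if d in frontier} - found
        found |= parents
        frontier = parents
    return sorted(found)

print(names_containing("specimen:X", publication="Y"))
print(names_containing("specimen:X"))
```

In the unconstrained call, "Yaceae" is returned only because a step from publication W is chained onto a step from publication Y, which is the kind of mixed-hierarchy answer the single-publication constraint rules out.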
Summary & Conclusions
• New model (schema) of plant taxonomy defined
  • extensive use of relationships
• Plant taxonomy DBMS implemented using the Prometheus DB
  • in the final stages of testing by taxonomists
  • stores all examples of data provided
  • can answer all queries posed
  • demo available via an HTTP interface
• Soon available for download
• Conclusion
  • Explicit relationships in the DB provide ways to improve
    • modelling power and the mapping between model and implementation
    • support for graph structures
  • QL support is necessary to profit from relationships
    • increased power of ad hoc querying without being domain specific
Acknowledgements
• Collaborators
  • Dr Mark Watson, Dr Martin Pullan, Dr Mark Newman, Royal Botanic Garden, Edinburgh
• Funding
  • UK Engineering and Physical Sciences Research Council and Biotechnology and Biological Sciences Research Council - Bioinformatics Initiative
• Project page: http://www.dcs.napier.ac.uk/~prometheus
• Demo: http://146.176.18.75:8080