LSID and TCS deployment in the Catalogue of Life TDWG 2007, Bratislava Richard J. White, Andrew C. Jones & Ewen R. Orme Cardiff University, UK e.r.orme | andrew.c.jones | r.j.white@cs.cf.ac.uk
TDWG Infrastructure Project • This talk describes work in progress on a project being carried out at Cardiff University and by Species 2000, supported by GBIF, titled "TCS and LSID deployment in the Catalogue of Life" • This talk has a similar title to the project, with a small concession to the fact that the presentation will be in the LSID session
Objectives of the project To add support for Life Sciences Identifiers (LSIDs) and the Taxon Concept Schema (TCS) to the Catalogue of Life (CoL), by implementing a limited set of changes to the CoL which will • act as a test-bed or demonstrator and • inform discussion based on this prototype • leading to a plan for deployment of LSIDs and TCS in the CoL from 2008 It is hoped this project will lead to • increased use of TDWG standards, • accelerated LSID deployment and uptake of TCS, which will in turn • assist providers and users to ascribe data unambiguously to specified taxon concepts, and • speed the growth of shared biodiversity data resources
The Catalogue of Life The CoL partners, Species 2000 and ITIS, have • built a checklist intended to contain all the world's species, which • acts as a framework for the organisation and enrichment of species biodiversity data by individuals, institutions and projects (some of which were mentioned by Frank Bisby in his talk yesterday) • is delivered as the Annual and Dynamic Checklists (AC and DC)
Project participants Cardiff University, UK to investigate feasible solution(s), implement, set up an experimental system as a basis for discussion and planning: • Ewen Orme (technical and implementation issues) • Andrew Jones (ideas and concepts) • Richard White (blame for failure) Species 2000 (Secretariat at the University of Reading, UK) to survey needs and capabilities in the light of the demonstration system, test it, help CoL decide how and when to deploy LSIDs in a staged manner: • Frank Bisby (policy) • Yuri Roskov (testing)
Objectives of this talk • to tell you about our progress so far • and what we plan to do between now and December 2007 • to give an example of a data integrator adopting LSIDs and TCS • to stimulate discussion which will help us with the project • and ensure optimal cooperation between the CoL, TDWG and our data providers and consumers Therefore many of the slides pose questions rather than provide answers!
Progress (1) We have • set up experimental installations of • the CoL Spice hub software and cache • the Annual Checklist (database and user interfaces) • addressed some Spice portability issues • established an LSID resolution service to support the use of CoL LSIDs • currently available for testing • the exact LSIDs are provisional and are NOT for real use yet! • designed and implemented provisional RDF/TCS responses from the LSID resolver, generated from the experimental AC
Progress (2) • We will shortly modify the experimental AC database and software to issue provisional LSIDs for taxon concepts • Later we will modify the Spice hub and cache to do the same for the Dynamic Checklist
Generating CoL LSIDs • An extra LSID field in the experimental copy of AC 2007 • Making the LSIDs available to users • on the local AC client • on the AC web site • Current LSID format is urn:lsid:lsid.sp2000.org:ac2007:159044
Access to our resolver • Directly (if you know the IP address and port details) • via Firefox plug-in • via TDWG proxy (in due course) • We wish to encourage discussion of how users will resolve LSIDs
Demo of the Firefox plug-in (OK, it’s not a demo, it’s a set of screen-shots) • The TDWG Firefox LSID resolver plug-in • converts an LSID to a URL • forwards it to a (configured known) resolver • creates a digest from the returned RDF • Enter: lsidres:urn:lsid:sp2000.cs.cf.ac.uk:AC2007:159044 • Later, it will permit lsidres:urn:lsid:lsid.sp2000.org:ac2007:159044
Direct access to the resolver (with a listing of the raw RDF) • One can enter this in the URL field of any browser: http://stilgar.cs.cf.ac.uk:8090/authority/metadata?lsid=urn:lsid:sp2000.cs.cf.ac.uk:AC2007:159044 • which produces the following response in RDF • Here is a response for Chlorosarcinopsis negevensis. • I have split it up into sections to make it easier to read
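The direct-access URL above follows a simple pattern: a resolver base address plus an "lsid" query parameter. A minimal sketch of building such a URL is given here. The base address is copied from the slide and may no longer be live; the function name metadata_url is hypothetical, and this is not the plug-in's actual code.

```python
from urllib.parse import quote

# Assumption: base address copied from the slide; the service may have moved.
RESOLVER_BASE = "http://stilgar.cs.cf.ac.uk:8090/authority/metadata"


def metadata_url(lsid: str, base: str = RESOLVER_BASE) -> str:
    """Return the HTTP GET URL that asks a resolver for an LSID's metadata."""
    # Leave ':' unescaped so the result matches the form shown on the slide;
    # anything else that is unsafe in a query string is percent-encoded.
    return f"{base}?lsid={quote(lsid, safe=':')}"


url = metadata_url("urn:lsid:sp2000.cs.cf.ac.uk:AC2007:159044")
print(url)
```

Passing a different base argument would point the same request at another resolver, for example a proxy.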
Vocabularies used First, the vocabularies (namespaces) used (RDF and three TCS elements) are declared:
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:TaxonConcept="http://rs.tdwg.org/ontology/voc/TaxonConcept#"
    xmlns:TaxonName="http://rs.tdwg.org/ontology/voc/TaxonName#"
    xmlns:TaxonRank="http://rs.tdwg.org/ontology/voc/TaxonRank#">
[In what follows I've abbreviated "http://rs.tdwg.org/ontology/voc/" to "..."]
Accepted name The resolver first returns two names. The first is the accepted name and the second is a synonym, but there is nothing to indicate that yet - these are just their names:
<rdf:Description rdf:nodeID="A0">
  <rdf:type rdf:resource="...TaxonName#TaxonName"/>
  <TaxonName:rank rdf:resource="...TaxonRank#Species"/>
  <TaxonName:nameComplete>Chlorosarcinopsis negevensis</TaxonName:nameComplete>
  <TaxonName:genusPart>Chlorosarcinopsis</TaxonName:genusPart>
  <TaxonName:specificEpithet>negevensis</TaxonName:specificEpithet>
  <TaxonName:authorship>I. Friedmann &amp; R. Ocampo-Paus</TaxonName:authorship>
</rdf:Description>
Synonym
<rdf:Description rdf:nodeID="A1">
  <rdf:type rdf:resource="...TaxonName#TaxonName"/>
  <TaxonName:rank rdf:resource="...TaxonRank#Species"/>
  <TaxonName:nameComplete>Neochlorosarcina negevensis</TaxonName:nameComplete>
  <TaxonName:genusPart>Neochlorosarcina</TaxonName:genusPart>
  <TaxonName:specificEpithet>negevensis</TaxonName:specificEpithet>
  <TaxonName:authorship>(I. Friedmann &amp; R. Ocampo-Paus) S. Watanabe</TaxonName:authorship>
</rdf:Description>
Taxon concept The resolver then returns a set of concepts. The first is the concept for the species to which the LSID refers (the LSID populates the "about" attribute); it has an accepted name and two relationships referred to by nodeIDs (which ultimately link the species to its synonym and genus):
<rdf:Description rdf:about="urn:lsid:lsid.sp2000.org:ac2007:174752">
  <rdf:type rdf:resource="...TaxonConcept#TaxonConcept"/>
  <TaxonConcept:hasName rdf:nodeID="A0"/>
  <TaxonConcept:hasRelationship rdf:nodeID="A3"/>
  <TaxonConcept:hasRelationship rdf:nodeID="A5"/>
</rdf:Description>
Synonym concept The concept for the synonym, which has a name and a relationship, defined later, to its accepted taxon concept:
<rdf:Description rdf:nodeID="A4">
  <rdf:type rdf:resource="...TaxonConcept#TaxonConcept"/>
  <TaxonConcept:hasName rdf:nodeID="A1"/>
  <TaxonConcept:hasRelationship rdf:nodeID="A2"/>
</rdf:Description>
HasSynonym The remaining concepts are for three relationships. The "HasSynonym" relationship from the accepted concept to the synonym concept:
<rdf:Description rdf:nodeID="A3">
  <rdf:type rdf:resource="...TaxonConcept#Relationship"/>
  <TaxonConcept:relationshipCategory rdf:resource="...TaxonConcept#HasSynonym"/>
  <TaxonConcept:fromTaxon rdf:resource="urn:lsid:lsid.sp2000.org:ac2007:174752"/>
  <TaxonConcept:toTaxon rdf:nodeID="A4"/>
</rdf:Description>
IsSynonymFor The reverse "IsSynonymFor" relationship from the synonym concept to the accepted concept:
<rdf:Description rdf:nodeID="A2">
  <rdf:type rdf:resource="...TaxonConcept#Relationship"/>
  <TaxonConcept:relationshipCategory rdf:resource="...TaxonConcept#IsSynonymFor"/>
  <TaxonConcept:fromTaxon rdf:nodeID="A4"/>
  <TaxonConcept:toTaxon rdf:resource="urn:lsid:lsid.sp2000.org:ac2007:174752"/>
</rdf:Description>
Genus Finally, a relationship from the accepted species concept to its genus concept (which, being a separate taxon with its own LSID, is not contained in this document):
<rdf:Description rdf:nodeID="A5">
  <rdf:type rdf:resource="...TaxonConcept#Relationship"/>
  <TaxonConcept:relationshipCategory rdf:resource="...TaxonConcept#IsChildTaxonOf"/>
  <TaxonConcept:fromTaxon rdf:resource="urn:lsid:lsid.sp2000.org:ac2007:174752"/>
  <TaxonConcept:toTaxon rdf:resource="urn:lsid:lsid.sp2000.org:ac2007:10868"/>
</rdf:Description>
</rdf:RDF>
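To show how a client might follow the nodeID links in a response like the one above, here is a minimal sketch using Python's standard xml.etree.ElementTree. It is not part of the project software. The embedded SAMPLE is a cut-down copy of the response with the "..." abbreviation expanded to full URIs, and the function name summarise is hypothetical.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
TC = "http://rs.tdwg.org/ontology/voc/TaxonConcept#"
TN = "http://rs.tdwg.org/ontology/voc/TaxonName#"

# ElementTree exposes namespaced attributes as "{namespace-uri}local-name".
NODE_ID = f"{{{RDF}}}nodeID"
ABOUT = f"{{{RDF}}}about"
RESOURCE = f"{{{RDF}}}resource"

# Cut-down copy of the response shown on the slides (an illustration only).
SAMPLE = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:TaxonConcept="{TC}" xmlns:TaxonName="{TN}">
  <rdf:Description rdf:nodeID="A0">
    <TaxonName:nameComplete>Chlorosarcinopsis negevensis</TaxonName:nameComplete>
  </rdf:Description>
  <rdf:Description rdf:about="urn:lsid:lsid.sp2000.org:ac2007:174752">
    <TaxonConcept:hasName rdf:nodeID="A0"/>
    <TaxonConcept:hasRelationship rdf:nodeID="A5"/>
  </rdf:Description>
  <rdf:Description rdf:nodeID="A5">
    <TaxonConcept:relationshipCategory rdf:resource="{TC}IsChildTaxonOf"/>
    <TaxonConcept:fromTaxon rdf:resource="urn:lsid:lsid.sp2000.org:ac2007:174752"/>
    <TaxonConcept:toTaxon rdf:resource="urn:lsid:lsid.sp2000.org:ac2007:10868"/>
  </rdf:Description>
</rdf:RDF>"""


def summarise(rdf_xml, lsid):
    """Return (accepted name, [(relationship category, target), ...]) for lsid."""
    root = ET.fromstring(rdf_xml)
    # Index the blank nodes so that nodeID references can be followed.
    nodes = {d.get(NODE_ID): d for d in root if d.get(NODE_ID)}
    concept = next(d for d in root if d.get(ABOUT) == lsid)
    name_node = nodes[concept.find(f"{{{TC}}}hasName").get(NODE_ID)]
    name = name_node.findtext(f"{{{TN}}}nameComplete").strip()
    relationships = []
    for ref in concept.findall(f"{{{TC}}}hasRelationship"):
        rel = nodes[ref.get(NODE_ID)]
        category = rel.find(f"{{{TC}}}relationshipCategory").get(RESOURCE)
        target = rel.find(f"{{{TC}}}toTaxon")
        # The target is either another LSID or a blank node inside this document.
        to = target.get(RESOURCE) or target.get(NODE_ID)
        relationships.append((category.rsplit("#", 1)[-1], to))
    return name, relationships


name, relationships = summarise(SAMPLE, "urn:lsid:lsid.sp2000.org:ac2007:174752")
print(name, relationships)
```

A general client would more likely use a proper RDF library; plain XML parsing is used here only because the response has a flat, predictable shape.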
Questions to be resolved For what kinds of entities will the CoL issue LSIDs? (Note that this is not the same question as which entities will be represented as <TaxonConcepts> in the RDF.) • only accepted species taxon concepts [yes] or • all names (including synonyms)? [no, to be done by the nomenclator projects?] • taxa at other levels • lower ("infra-specific") taxa (CoL limits itself to only one infra-specific level, i.e. a species may have either subspecies or varieties but not both) • higher taxa (CoL limits itself to genus, family, order, class, phylum; but in practice also permits subgenera, superfamilies, and the eight "top-level" nodes above the phylum level)
What is returned by the resolver? • Should it return all available data? [yes] • or just enough to let the user query the AC or DC for more information? [no] • If only accepted taxa have LSIDs, the resolver response can include all synonyms • if all names get LSIDs (e.g. from nomenclators), • should all the names relating to a taxon be returned in one response? [yes] • or return only one concept per response? [no, need to call the resolver again] How are higher taxa linked? • should the resolver return TCS about a single accepted taxon concept [yes, call the resolver again to "navigate"] • or embed all linked taxa (above and below) [no, could be huge]
Types of synonymy in the CoL and how they are handled in RDF/TCS (provisional!)
Propagating LSIDs and other GUIDs We will demonstrate • receiving GUIDs from a data provider • and making them available in the metadata returned by the LSID resolver. We will summarise the options for doing this in a preliminary plan to be refined by the Catalogue of Life partners in November 2007
Decisions about deployment • to be made by discussion with Sp2000 and ITIS
Timetable for completion of the project • We will complete the project by the end of December 2007, including • drafting updated documentation of the enhanced CDM and schemas
Questions (1) To be resolved during the rest of the project, and interesting areas for wider consideration and future research, some of which will be described in the presentation. How to express the different types of CoL LSIDs within the allowed LSID syntax: • urn:lsid:sp2000.org:ac2007:159044 • urn:ac2007.lsid:sp2000.org:taxon:159044 • urn:dc.lsid:sp2000.org:taxon:159044
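As an aid to that discussion, here is a minimal sketch that checks the three candidates against the general LSID pattern urn:lsid:authority:namespace:object[:revision]. It assumes this simplified reading of the LSID syntax and does not validate the characters allowed in each part. The names Lsid and parse_lsid are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]


# Simplified pattern: the literal prefix "urn:lsid:" followed by three
# colon-separated parts and an optional fourth part (the revision).
_LSID = re.compile(r"^urn:lsid:([^:]+):([^:]+):([^:]+)(?::([^:]+))?$", re.IGNORECASE)


def parse_lsid(text: str) -> Optional[Lsid]:
    """Return the parts of an LSID, or None if text does not fit the pattern."""
    match = _LSID.match(text)
    return Lsid(*match.groups()) if match else None


candidates = [
    "urn:lsid:sp2000.org:ac2007:159044",
    "urn:ac2007.lsid:sp2000.org:taxon:159044",
    "urn:dc.lsid:sp2000.org:taxon:159044",
]
for candidate in candidates:
    print(candidate, "->", parse_lsid(candidate))
```

Under this reading only the first candidate fits, because the other two alter the "urn:lsid:" prefix. The checklist edition would then have to be carried in the authority, namespace or revision part instead.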
Questions (2) • How will users (human or software) obtain LSIDs in the first place? Do we need to set up a modified version of the AC (and eventually DC) Web Service? [yes] • How the GUIDs (not necessarily LSIDs) that the data providers might supply will be propagated through the hub • The expectations of users concerning the activities that LSIDs will help them with
TCS “Completeness” Can all the information in the Sp2000 standard data set (CDM) be translated into TCS? • so that it could be turned back into a Sp2000 data set without loss of information? • could Sp2000 usefully employ any features of TCS not currently in the CDM? • such as finer distinctions between types of synonymy and other relationships between names and taxa?
Taxonomic hierarchy issues Whether and how the taxonomic hierarchy will be navigable using LSIDs • It is natural to assign LSIDs to higher taxa • Using the LSID Resolution Service to navigate up and down the taxonomic hierarchy raises some interesting issues • going up is straightforward (at least while only one hierarchy is considered) • going down may involve subtaxa at different levels • the children of one node may not all be at the same level • for example, some genera in an order may not have been classified in any family ("incertae sedis") • you can’t give an LSID to an unnamed node, especially because you don't know whether two unnamed sister nodes are supposed to be the same or not
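The "going up" case can be sketched as repeated resolver calls, each returning the target of the "IsChildTaxonOf" relationship. The example below replaces the resolver with an in-memory table containing only the one parent link shown in the RDF example. The table PARENTS and the function lineage are hypothetical names, and a real client would make an HTTP request and parse the response at each step.

```python
from typing import Callable, Dict, List, Optional

# Stand-in for resolver calls: maps an LSID to the target of its
# "IsChildTaxonOf" relationship. Only the species-to-genus link comes from
# the example response; the parent of the genus is not shown on the slides.
PARENTS: Dict[str, Optional[str]] = {
    "urn:lsid:lsid.sp2000.org:ac2007:174752": "urn:lsid:lsid.sp2000.org:ac2007:10868",
    "urn:lsid:lsid.sp2000.org:ac2007:10868": None,
}


def lineage(lsid: str, parent_of: Callable[[str], Optional[str]]) -> List[str]:
    """Walk upwards from lsid, one lookup per step, until no parent is returned."""
    path: List[str] = []
    seen = set()
    current: Optional[str] = lsid
    # The 'seen' set guards against a cycle in faulty data.
    while current is not None and current not in seen:
        seen.add(current)
        path.append(current)
        current = parent_of(current)
    return path


path = lineage("urn:lsid:lsid.sp2000.org:ac2007:174752", PARENTS.get)
print(path)
```

Going down is not sketched, since a node may have many children at different levels, which is the open issue raised above.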
More questions • How will users (human or software) obtain LSIDs for entities of interest? • What are users’ expectations concerning the tasks that LSIDs might assist, • including navigating the taxonomic hierarchy and • linking data to taxa? • What is the role of CoL LSIDs in building the biodiversity information systems of the future?
Summary We hope to: • improve the compatibility of the protocols and public software interfaces used by Species 2000 with TDWG standards • increase the usefulness of the CoL to users, including GBIF, by • improving the CoL’s compatibility with other biodiversity tools, • supplying its information to clients expressed as taxon concepts, • enhancing interoperability between data providers and consumers by means of LSIDs referring to these concepts • The updated Spice protocol, documentation and enhanced Spice software will be available for use by other projects to build species information systems for their own purposes.
Homework for TDWG delegates • Informal meeting at 18:00 on Tuesday (today) in the vestibule (at the registration desk end) - all interested persons welcome! • Further information about this project and its progress, updated periodically, will be placed at http://spice.cs.cf.ac.uk/lsid/ (but don’t look there just yet!) • Email us at e.r.orme | andrew.c.jones | r.j.white@cs.cf.ac.uk