Digital Object Identifiers for Science Data. Norman Paskin, International DOI Foundation.
Outline
• What is a DOI
• Persistent identification and resolution
• Functional components
• Example applications in science data (2)
• Open Q&A
• Slides are available electronically
  • some slides are hidden (run in presentation view)
  • sequence of builds 29 - 49
• References are in penultimate slide
  • See especially article for Data Science Journal 2005
  • Print handout of Internet Registries article
• Possible other topics for discussion: semantic interoperability (data integration)
Digital Object Identifier = DOI
• A name (not a location) for an entity on digital networks
AND/OR
• A system for persistent and actionable identification and interoperable exchange of managed information on digital networks
  • Standards-based components
• Developed as cross-industry, cross-sector, not-for-profit effort managed by an open membership collaborative development body
  • International DOI Foundation (IDF)
• In widespread use now:
  • Over 15 million assigned, over 1000 naming authorities (users)
  • Key feature of scientific primary publishing as part of CrossRef system
  • Adopted for government documents (EC, OECD, UK, etc.)
• In use, is a mechanism “behind the scenes”
  • e.g. looks like a URL in a web context
• One application is an interoperable common system for identification of science data; two projects are considered as examples:
  • TIB project (citation of primary data sets)
  • Names for Life (biological taxonomy)
Identification and resolution
• Identifier:
  • A unique label for an entity involved in a transaction
  • Note the ambiguity of “identifier”:
    • Label alone (e.g. ISBN)
    • Specification alone (e.g. URN)
    • Implemented specification (e.g. DOI, bar code)
  • Ideally, persistent
  • Ideally, actionable …
• Resolution:
  • The process in which an identifier is the input (a request) to a network service to receive in return specific output(s)
• Both concepts are in principle neutral as to technology implementation
  • Abstract concepts, but implementations typically at least “internet” TCP/IP (the more general the better, e.g. not just “Web”)
Technical and social infrastructure issues
Persistence
• "It is intended that the lifetime of a [persistent identifier] be permanent. That is, the [persistent identifier] will be globally unique forever, and may well be used as a reference to a resource well beyond the lifetime of the resource it identifies or of any naming authority involved in the assignment of its name."
• [Persistent Identifier] = URN in IETF RFC 1737: Functional Requirements for Uniform Resource Names (http://www.ietf.org/rfc/rfc1737.txt)
Interoperability
• Persistence is one dimension of interoperability:
  • “persistence is interoperability with the future”
• We know what we mean, but others may not.
  • Identifiers assigned in one context may be encountered, and may be re-used, in another place (or time), without consulting the assigner. You can’t assume that your assumptions will be known to someone else.
  • Interoperability = the possibility of use in services outside the direct control of the issuing assigner
• Identifiers may be opaque or may be meaningful, but meaning only makes sense in context
  • Normally, an opaque string is the safest assumption
  • User communities define rules (social infrastructure; namespaces)
  • e.g. the recent chemistry identifier proposal InChI is an interesting “meaningful” identifier
• Interoperability guarantees others can interpret even if they do not know the rules
Two principles for persistent identification
1. Obvious: Assign ID to resource
  • Once assigned, the number must identify the same resource
  • Beyond the lifetime of the resource, or the assigner
2. Less obvious: Assign resource to ID
  • The resource must be “identified”
  • Must ensure it is always the same thing (bound)
  • Describe the resource “content” [with precision]
  • Failure to do this will ultimately break interoperability
• How far do we go in each? Depends on what we think is “good enough”
• Technologists have focussed on (1) [and “bags of bits/data structures”]
• The content/rights world has focussed on (2) [and on “intellectual content”]
• Both viewpoints valid
• (2) is now becoming more relevant
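The two principles can be sketched as a toy registry that never reassigns an identifier and refuses to register one without a description of what it identifies. This is an illustration only: the `Registry` class, its method names, and the description fields are hypothetical and are not part of the DOI system.

```python
class Registry:
    """Toy registry illustrating the two principles; not the DOI system itself."""

    def __init__(self):
        self._records = {}  # identifier -> description of the resource

    def assign(self, identifier, description):
        # Principle 1: once assigned, an identifier must keep identifying the same resource.
        if identifier in self._records:
            raise ValueError(f"{identifier} is already assigned and cannot be rebound")
        # Principle 2: the resource must be described, so others know what is identified.
        if not description.get("what"):
            raise ValueError("a description of the identified resource is required")
        self._records[identifier] = dict(description)

    def describe(self, identifier):
        # Return a copy so callers cannot alter the bound description.
        return dict(self._records[identifier])

    def withdraw_resource(self, identifier):
        # The resource goes away, but the identifier and its description persist.
        self._records[identifier]["available"] = False
```

The design choice here is that withdrawing a resource changes its state but never frees the identifier for reuse, which mirrors "beyond the lifetime of the resource".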
Resolution and “What are we identifying?”
• Resolution: The process in which an identifier is the input (a request) to a network service to receive in return specific output(s)
• “Point and click” is what I do (URL model), so:
  • “what I point to (resolve to and get) is what is identified”, right?
• No
  • Point and click “get” is not referencing
  • Can identify but not “get” directly things that are intangible (works), or fugitive (performances), or that change (“Today’s NY Times”), or people and concepts
  • Pointing and clicking can return different things in different contexts, or give multiple options
• Identifier identifies an entity. Pointing and clicking is a service about that entity
  • Entities can be physical, abstract, tangible, intangible, things, people, concepts, instances, …
• Resolution provides a mechanism to describe the resource “content” through a service which delivers a description
What are we identifying? Document on screen Abstract work? Manifestation of abstract work? Version? This HTML file? All/some of these?
Identification and resolution
• Resolvable identifiers must specify:
  • Agreed numbering syntax
  • Resolution mechanism
  • Data model to define “what it is we are identifying”
  • Technical and social infrastructure to implement
  • (compare physical world bar codes)
• These could be assembled ad hoc, or offered as a packaged system (e.g. DOI)
DOI is the combination of these four components:
• Numbering scheme
• Internet Resolution
• Data Model
• Policies
DOI syntax can include any existing identifier “label”, formal or informal, of any entity
• An identifier “container”, e.g.
  • 10.1234/NP5678
  • 10.5678/ISBN-0-7645-4889-4
  • 10.2224/2004-10-ISO-DOI
• NISO Z39.84, DOI Syntax
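The "container" idea can be illustrated with a minimal sketch that splits a DOI into a prefix and a suffix at the first slash, leaving the suffix (which may hold an existing identifier such as an ISBN) untouched. The function name `split_doi` is hypothetical, and the check that the prefix starts with `10.` is an assumption based on the examples on the slide rather than a full implementation of NISO Z39.84.

```python
def split_doi(doi):
    """Split a DOI into (prefix, suffix) at the first slash."""
    doi = doi.strip()
    # Accept an optional "doi:" label, as used in citations.
    if doi.lower().startswith("doi:"):
        doi = doi[4:].strip()
    prefix, sep, suffix = doi.partition("/")
    # Assumption: every prefix begins with "10.", as in all examples on the slide.
    if not sep or not prefix.startswith("10.") or not suffix:
        raise ValueError(f"not a DOI: {doi!r}")
    return prefix, suffix
```

For example, `split_doi("10.5678/ISBN-0-7645-4889-4")` returns `("10.5678", "ISBN-0-7645-4889-4")`: the existing ISBN label is carried inside the suffix unchanged.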
Internet resolution allows a DOI to link to any & multiple pieces of current data
• Resolve from DOI to data
  • initially to location (URL): persistence
  • May be to multiple data:
    • Multiple locations
    • Metadata
    • Services
  • Extensible framework: to allow user-defined data types
• Uses the Handle System
  • Implementing URI/URN concept
  • Running on TCP/IP (common co-inventor)
  • IETF RFCs 3650, 3651, 3652
  • To be in GRID Globus tool kit
  • Full Unicode compliance
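The idea of one DOI resolving to several typed pieces of current data can be sketched with an in-memory table. This is not the Handle System protocol or its API: the `Value` fields, the type labels, the `example.org` addresses, and both functions are assumptions made for illustration.

```python
from collections import namedtuple

# One typed value held against an identifier (field names are illustrative).
Value = namedtuple("Value", ["index", "type", "data"])

# Hypothetical record: one DOI resolving to several pieces of current data.
RECORDS = {
    "10.1234/NP5678": [
        Value(1, "URL", "http://example.org/np5678"),
        Value(2, "URL", "http://mirror.example.org/np5678"),
        Value(3, "METADATA", "http://example.org/np5678/metadata"),
    ]
}

def resolve(doi, wanted_type=None):
    """Return all values for a DOI, optionally filtered by type."""
    values = RECORDS.get(doi)
    if values is None:
        raise KeyError(f"unknown identifier: {doi}")
    if wanted_type is None:
        return list(values)
    return [v for v in values if v.type == wanted_type]

def update_location(doi, index, new_url):
    """Change where a value points; the DOI itself never changes."""
    values = RECORDS[doi]
    for i, v in enumerate(values):
        if v.index == index:
            values[i] = v._replace(data=new_url)
            return
    raise KeyError(f"no value with index {index}")
```

Persistence in this sketch comes from the indirection: when a resource moves, only the stored value is updated, and every citation of the DOI keeps working.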
DOI Data Model = Metadata tools:
• a data dictionary to define +
• a grouping mechanism to relate
• Necessary for interoperability
  • “Enabling information that originates in one context to be used in another in ways that are as highly automated as possible”.
• Able to use existing metadata
  • Mapped using a standard dictionary
  • Can describe any entity at any level of granularity
• indecsDD, which incorporates ISO MPEG 21 RDD
  • IDF is the MPEG21 RDD registration authority
<indecs> Data Dictionary + DOI AP framework
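"Mapped using a standard dictionary" can be illustrated with a small sketch in which records from two different metadata schemes are re-expressed in shared terms so that they become comparable. The scheme names, element names, and the `dd:` terms below are invented for illustration; they are not taken from the <indecs> Data Dictionary.

```python
# Hypothetical mappings from scheme-specific element names to shared dictionary terms.
# The term names are invented for illustration; they are not real iDD terms.
MAPPINGS = {
    "dublin_core": {"title": "dd:Title", "creator": "dd:Creator"},
    "proprietary": {"name": "dd:Title", "made_by": "dd:Creator"},
}

def to_common(scheme, record):
    """Re-express a record using shared dictionary terms, dropping unmapped fields."""
    mapping = MAPPINGS[scheme]
    return {mapping[key]: value for key, value in record.items() if key in mapping}
```

Two records describing the same data set in different schemes then compare equal after mapping, which is the sense in which existing metadata can be reused without forcing everyone onto one scheme.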
DOI policies allow any model for practical implementations
• Implementation through IDF
  • Governance and agreed scope, policy, “rules of the road”
  • Technical infrastructure: resolution mechanism, proxy servers, mirrors, back-up, central dictionary
  • Social infrastructure: persistence commitments, fall-back procedures, cost-recovery (self-sustaining), shared use of system
• Not a standard but a Registration Authority/maintenance agency
• IDF delegates through Registration Agencies
  • Each can develop own applications
  • Use in “own brand” ways appropriate for their community (e.g. CrossRef)
DOI: combination of components
• Identify: DOI syntax can include any existing identifier, formal or informal, of any entity, e.g.
  • 10.2341/0-7645-4889-1
  • 10.5678/978-0-7645-4889-4
  • 10.1000/ISBN 0764548891
  • 10.1234/Norman_presentation
  • 10.2224/2004-10-28-ISO-DOI
• Describe: DOI metadata can be of any type, standard or proprietary, e.g.
  • OnixForBooks
  • OnixForSerials
  • IEEE/LOM
  • MARC
  • Dublin Core
  • Proprietary scheme
  • (to interoperate with anyone else in the DOI network, map to the <indecs> Data Dictionary (iDD))
• Resolve: The Handle resolution technology allows you to access any kind of service associated with your DOI
  • A package of services is an Application Profile
  • Services can include metadata services
DOI and scientific data
• DOI is already the core technology for maintaining cross-reference
  • persistent links between a citation and internet access to article
  • CrossRef system used by 350+ publishers representing bulk of STM articles (as pre-publication link builder), both for-profit and not-for-profit, OA, www.crossref.org
  • 9,000 DOIs per day added to CrossRef
  • Over 12 million DOIs now registered with CrossRef
  • Over 850,000 assigned to books and conference proceedings
• Several projects suggested to IDF using DOIs for data (not connected with CrossRef)
  • physico-chemical property data; biological microscopy images
  • See Paskin, ICSTI 2002 paper
• Some sectors have developed their own identifiers
  • e.g. Life Science Identifier (I3C/IBM): simple URN mechanism, non-generic, non-global, but very useful in bio-informatics
  • These can be incorporated into a DOI if needed to make globally interoperable and extensible
• Two projects in particular have developed DOI applications:
(1) TIB: Citation of Primary Data
• Problem: re-use of existing data sets
  • Attribution of data source: make data publications citable in a standard way (cf. articles Citation Index)
  • Archiving of data in context so as to be discoverable and interoperable (usable by others)
• Background
  • CODATA National Committee WG, grant-aided by DFG (Sept 2001 to May 2002): Report "Concept of Citing Scientific Primary Data"
  • Continuation as project for pilot implementation funded by DFG Oct 2003 to Oct 2005 at TIB (German National Library of Science & Technology)
  • Development of DOI registration agency for Data
• Solution:
  • DOIs for data sets, with associated metadata
  • Core management metadata applicable to all datasets
  • Structured metadata extensible to specific science disciplines
  • Follows principles of DOI Data Model
(1) Citation of Primary Data: illustration of solution
• During her research for the World Data Center Climate (WDCC), Dr. Weather gains primary data about the weather in Hannover in the year 2003.
• Primary data is tested, evaluated, stored and administrated at the WDCC.
• Primary data is registered and allocated a DOI at the TIB
  • With quality control of metadata, no change once allocated, etc.
• Dr. Weather can now cite this with a resolvable DOI, e.g. DOI:10.1594/WDCC/W_Han_2003_MMB_2
  • 10.1594 (prefix) = TIB as the registration agency
  • WDCC = research institute
  • W_Han_2003_MMB_2 = internal name of the data
• DOI is resolvable directly, or via http as http://dx.doi.org/10.1594/WDCC/W_Han_2003_MMB_2
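The example DOI can be taken apart and turned into its http form with a short sketch. The proxy address is the one given on the slide; the function names are hypothetical, and the three-part reading of the DOI (prefix, institute, internal name) is the convention of this TIB example, not a general rule for all DOIs.

```python
from urllib.parse import quote

PROXY = "http://dx.doi.org/"  # proxy address given on the slide

def proxy_url(doi):
    """Build the http form of a DOI for use in a web context."""
    # Keep "/" readable; percent-encode other characters that are unsafe in a URL path.
    return PROXY + quote(doi, safe="/")

def wdcc_parts(doi):
    """Split the example DOI into prefix, institute, and internal data name."""
    # Assumes exactly the prefix/institute/name layout used in the TIB example.
    prefix, institute, name = doi.split("/", 2)
    return {"prefix": prefix, "institute": institute, "name": name}
```

Calling `proxy_url("10.1594/WDCC/W_Han_2003_MMB_2")` yields the http address shown on the slide, and `wdcc_parts` recovers the registration agency prefix, the research institute, and the internal name of the data.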
(1) Citation of Primary Data: illustration of solution
Usage scenario 1: attribution of source
• Dr. Storm is reading publications from Dr. Weather in a journal and would like to analyse her data under different aspects.
• Can resolve the DOI to obtain the data set for use
• In his publication “Comparison of the weather from Hannover and Miami” Dr. Storm cites Dr. Weather’s data using its DOI, referring to the uniqueness and own identity of the original data.
• Citation example: Weather, 2003: “Weather in Hannover for 2003” doi:10.1594/WDCC/W_Han_2003_MMB_2
Usage scenario 2: archiving for re-use
• Mr. Nice is writing a paper about the sales figures of ice cream in Hannover in 2003, but he has no information about the weather.
• Searches via TIB central registration agency metadata search
• Result is doi:10.1594/WDCC/W_Han_2003_MMB_2
• He resolves the DOI to find the data.
• The metadata refers him to the WDCC as publisher and data archive.
• In his paper he cites the data using the DOI.
(2) Names for Life: Biological taxonomy
• Problem: “Future-proofing biological nomenclature”
  • See Garrity and Lyons, OMICS, 2003
  • For a given nomenclature in a biological taxonomy, change occurs
    • e.g. new species recognised; species reassigned as the founding species of new genera; synonyms; species split into subspecies which later became separate species
    • resulting in changes of names, genera, families, classes, relationships over time
  • How does a researcher keep track?
• Solution: DOI proposed as tool
  • a data model of nomenclature and taxonomy
    • enabling disambiguation of synonyms and competing taxonomies
  • a metadata resolution service
    • enabling dissemination of archived and updated information objects through persistent links
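The proposed solution can be sketched as a data model in which each taxon keeps one persistent identifier while names are attached to it over time, so that synonyms can be looked up from any of them. The identifier strings, field names, and functions are hypothetical, and the sample entries are loosely based on the Alteromonas example in the slides; they should not be read as an authoritative taxonomic record or as the Names for Life implementation.

```python
# Each taxon gets one persistent identifier; names come and go around it.
# Identifier strings and entries are hypothetical sample data.
NAMES = [
    {"taxon_id": "10.9999/taxon-0001", "name": "Alteromonas communis", "since": 1972},
    {"taxon_id": "10.9999/taxon-0001", "name": "Marinomonas communis", "since": 1984},
    {"taxon_id": "10.9999/taxon-0002", "name": "Alteromonas macleodii", "since": 1972},
]

def taxon_for(name):
    """Find the persistent identifier a name is attached to."""
    for entry in NAMES:
        if entry["name"] == name:
            return entry["taxon_id"]
    raise KeyError(name)

def synonyms(name):
    """All names ever attached to the same taxon, oldest first."""
    taxon_id = taxon_for(name)
    entries = sorted(
        (e for e in NAMES if e["taxon_id"] == taxon_id),
        key=lambda e: e["since"],
    )
    return [e["name"] for e in entries]

def current_name(name):
    """The most recently assigned name for the taxon behind `name`."""
    return synonyms(name)[-1]
```

A researcher holding an old name can call `synonyms` to search the literature under every name the taxon has carried, while a citation of the identifier stays valid however the names change.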
(2) Names for Life: illustration of problem
[Figure: animated sequence of builds showing how the nomenclature changes between 1972 and 2004. The first build shows the genus Alteromonas in 1972 with the species macleodii (T), communis and vaga. Later builds (1973, 1976, 1977, 1978, 1979, 1981, 1982, 1984, 1986, 1987, 1988, 1990, 1992, 1995, 1997, 2000, 2001, 2002, 2004) add new species and move existing ones, so that by 2004 the names are spread across five genera (Oceanospirillum, Marinomonas, Alteromonas, Shewanella, Pseudoalteromonas), with some species names appearing under more than one genus, plus "11 others".]
(2) Names for Life: illustration of solution
[Figure: network of information objects (name, combined name, taxon, exemplar, nomos) carrying DOIs, with links from the web and links to the web connecting them to journal articles, strain records, gene annotations, and any online information.]
(2) Names for Life: illustration of solution (dissemination)
• name / combined name: Look up this name and all its synonyms in PubMed
• taxon: Compare this name to the current state (contents) of the taxon
• exemplar / nomos: Determine whether this exemplar is part of a taxon in another nomos
By reasoning over information objects, construct services that can be offered through multiple resolution.
Further reading
• Paskin, Norman. "Digital Object Identifiers for scientific data". Paper presented at 19th International CODATA Conference, Berlin, 10 November 2004. http://www.doi.org/topics/041110CODATAarticleDOI.pdf
• Project announced to develop DOIs for scientific data: http://www.doi.org/news/TIBNews.html
• Garrity, G. M.; Lyons, C. "Future-proofing biological nomenclature". Omics, 2003, Volume 7, Number 1, pp. 31-33. Pre-publication version at http://www.eecs.umich.edu/~jag/wdmbio/garrity.htm
• Harris, Jerald D. "'Published Works' in the electronic age: recommended amendments to Articles 8 and 9 of the Code". Bulletin of Zoological Nomenclature, 61(3), September 2004, pp. 138-148.
• Dyson, Esther. "Online Registries: The DNS and Beyond...". Release 1.0, September 2003. [print only: summary at doi:10.1340/309registries]
• DOI progress report: D-Lib Magazine (online) [http://www.dlib.org/dlib/june03/paskin/06paskin.html]
• Paskin, Norman. "Identification and Metadata: Components of DRM Systems". In E. Becker et al. (eds), "Digital Rights Management", Lecture Notes in Computer Science (Springer-Verlag, 2003), pp. 26-61 [http://www.doi.org/topics/drm_paskin_20030113_b1.pdf]
• DOI factsheets etc.: http://www.doi.org/factsheets.html
n.paskin@doi.org
www.doi.org