Globally Unique Identifiers and Life Science Identifiers
Dave Thau
thau@learningsite.com
University of Kansas
California Academy of Sciences
www.learningsite.com
Outline
• Describe Globally Unique Identifiers
• Show how they’re relevant
• Describe one GUID system (LSIDs)
• Outline some issues around using GUIDs for TDWG-related activities
• Provide some resources
• Open discussion
GUID Is Not An Ugly Word
It’s guid to be merry and wise,
It’s guid to be honest and true,
Robert Burns, Here’s a Health to Them that’s Awa’
Pteroptochos tarnii, AKA Guidguid
Image from: animaldiversity.ummz.umich.edu
GUID: Globally Unique Identifier
• A short name for a complex entity
• Useful for locating information about the entity
• Each name identifies only one entity
• There is some sense of permanence
Some things which fit this description
• GenBank accession numbers: AP006480.1
• US Patent numbers: 5443036 (laser guided cat exercise)
• Digital Object Identifier: 10.121/3212
In Our Domain
SDD Document – Representing some data set.
<ClassName id="1">
  <Label>
    <Representation language="en">
      <Text>Cypselurus heterurus (Rafinesque, 1810)</Text>
    </Representation>
  </Label>
  <Link>
    <LSID>lsid.gbif.net:www.fishbase.org:1029</LSID>
  </Link>
  <Rank>sp</Rank>
</ClassName>
Napier Schema Document – Representing some taxon.
<TaxonConcept id="urn:lsid:bioguid.org:seek:121212" type="original">
  <Name type="scientific">
    <NameSimple>Canis lupus</NameSimple>
  </Name>
  …
  <Relationships>
    <Relationship type="is child of">
      <ToTaxonConcept ref="urn:lsid:bioguid.org:seek:5743" />
    </Relationship>
  </Relationships>
</TaxonConcept>
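As a rough illustration of how a consumer might pull identifiers out of the taxon document above, the following Python sketch parses a trimmed copy of that fragment with the standard library. The elided portion of the slide's XML is simply left out, and the helper name `concept_links` is hypothetical, not part of any schema or toolkit.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the taxon fragment from the slide; the elided part is omitted.
TAXON_XML = """
<TaxonConcept id="urn:lsid:bioguid.org:seek:121212" type="original">
  <Name type="scientific">
    <NameSimple>Canis lupus</NameSimple>
  </Name>
  <Relationships>
    <Relationship type="is child of">
      <ToTaxonConcept ref="urn:lsid:bioguid.org:seek:5743" />
    </Relationship>
  </Relationships>
</TaxonConcept>
"""


def concept_links(xml_text):
    """Return (concept id, simple name, [(relationship type, target id), ...])."""
    root = ET.fromstring(xml_text)
    name = root.findtext("Name/NameSimple")
    links = [
        (rel.get("type"), rel.find("ToTaxonConcept").get("ref"))
        for rel in root.iter("Relationship")
    ]
    return root.get("id"), name, links


concept_id, name, links = concept_links(TAXON_XML)
print(concept_id, name, links)
```

The point of the sketch is that both the concept and its parent are referenced purely by GUID, so the relationship can be followed without knowing where either record is stored.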
Features of a GUID system
• Global uniqueness scoped to Internet
• Should be easily resolvable by a computer or human
• Should identify things down to whatever level of granularity necessary
• Should not be limited to proprietary systems
• Should serve up all sorts of data
  • Database records
  • Text files
  • Images
• It would be nice if the identifier had associated metadata
Life Science Identifiers
• Official standard of the Object Management Group (OMG)
• Support for metadata and authentication
• Supports multiple protocols (e.g. HTTP, SOAP)
• Can serve up data in any format
• Decentralized – anyone can issue an LSID
• LSID code available in Java and Perl
• A young standard, but increasingly used
Organizations Using LSIDs
• National Center for Biotechnology Information (NCBI)
  • PubMed
  • GenBank
• European Bioinformatics Institute (EBI)
• US Long Term Ecological Research Network (LTER)
• BioMOBY – a biological database interoperability program (biomoby.org)
• Open Bioinformatics Foundation (open-bio.org)
• myGrid – a BioGRID project (mygrid.org.uk)
LSID Format
urn:lsid:bioguid.org:seek:117866:v1
• urn – indicates that this is a URN
• lsid – indicates that it’s an LSID-type URN
• bioguid.org – the authority that issued the LSID
  • Doesn’t have to be a domain name – but for now probably should be.
  • bioguid.org does not necessarily have the data or metadata.
  • There may not even be a machine called bioguid.org.
• seek – a name space id internal to that authority
  • The name space is meaningless to systems outside that authority.
• 117866 – the local identifier within that authority
  • Also internal to the authority
• v1 – an optional version number
  • If no version, no trailing colon either.
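The field breakdown above can be sketched as a small parser. This is a minimal illustration, not code from the IBM toolkit: the names `LSID` and `parse_lsid` are hypothetical, and the only rules it enforces are the ones listed on the slide (five or six colon-separated parts, no empty parts, so no trailing colon when the version is absent).

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str               # e.g. bioguid.org
    namespace: str               # meaningful only inside the authority
    object_id: str               # local identifier within the authority
    revision: Optional[str] = None  # optional version


def parse_lsid(text):
    """Split an LSID string into its parts, or raise ValueError."""
    parts = text.split(":")
    if len(parts) not in (5, 6):
        raise ValueError("expected 5 or 6 colon-separated parts: %r" % text)
    # The "urn" and "lsid" prefixes are compared case-insensitively.
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError("not an LSID-type URN: %r" % text)
    # An empty part also catches a trailing colon with no version after it.
    if any(part == "" for part in parts[2:]):
        raise ValueError("empty component in %r" % text)
    revision = parts[5] if len(parts) == 6 else None
    return LSID(parts[2], parts[3], parts[4], revision)


print(parse_lsid("urn:lsid:bioguid.org:seek:117866:v1"))
print(parse_lsid("urn:lsid:bioguid.org:seek:117866"))
```

Note that the parser deliberately does nothing with the authority beyond storing it: as the slide says, the authority name need not be a machine that holds the data.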
Data and Metadata
• An LSID has data
  • Examples
    • The gene sequence in GenBank
    • The actual LTER data set, maybe in Excel, or in a text file
  • The data should never change
• An LSID also has metadata
  • Example metadata
    • The format of the data
    • A display title for clients displaying the LSID
    • Dublin Core metadata
    • Anything you want
  • The metadata can change
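One way to picture the split between fixed data and changeable metadata is a record whose data is read-only while its metadata is an ordinary dictionary. The sketch below is only an illustration: the class `LsidRecord` is hypothetical, and using a SHA-256 digest to check that the data has not changed is an assumption of this example, not something the LSID standard prescribes.

```python
import hashlib


class LsidRecord:
    """Hypothetical record: data is fixed at creation, metadata may change."""

    def __init__(self, lsid, data, metadata):
        self.lsid = lsid
        self._data = bytes(data)          # private copy; no setter is offered
        self._digest = hashlib.sha256(self._data).hexdigest()
        self.metadata = dict(metadata)    # free to change over time

    @property
    def data(self):
        return self._data

    @property
    def digest(self):
        return self._digest

    def update_metadata(self, **fields):
        self.metadata.update(fields)


record = LsidRecord(
    "urn:lsid:bioguid.org:seek:117866:v1",
    b"ACGT",
    {"title": "Example sequence", "format": "text/plain"},
)
record.update_metadata(title="Example sequence (renamed)")
print(record.metadata["title"], record.digest[:12])
```

Because `data` is a property without a setter, an attempt to assign `record.data` raises `AttributeError`, which mirrors the rule that changed data needs a new LSID (or version) rather than an edit in place.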
Example LSIDs
• An LTER fish abundance data set:
  • urn:lsid:limnology.wisc.edu:dataset:ntlfi02
• A PubMed reference:
  • urn:lsid:ncbi.nlm.nih.gov.lsid.biopathways.org:pubmed:12441808
• A GenBank sequence:
  • urn:lsid:ncbi.nlm.nih.gov.lsid.biopathways.org:genbank_gi:30350027
How LSIDs work
LSID client: maybe Launchpad, maybe Haystack, maybe BioFerret, maybe myGRID, maybe yours!
1. Find the authority for this LSID
  • The client finds the DNS record for the authority and resolves it to get the address of the authority
  • DNS returns the LSID authority server
2. Query the authority for available services
  • The LSID authority returns WSDL for this LSID
3. Choose a service, get the goods
  • Data store and metadata store, reached over HTTP, SOAP, FTP, or others
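The three steps can be sketched as a single function with the network pieces passed in, so the flow is visible without a live authority. Everything here is a stand-in: the lookup name `_lsid._tcp.<authority>` assumes the DNS SRV convention for LSID authorities, the host `lsid.example.org` and port are invented, and a plain dictionary takes the place of the WSDL a real authority would return.

```python
def resolve_lsid(lsid, dns_lookup, list_services, fetch):
    """Follow the three resolution steps using caller-supplied helpers."""
    authority = lsid.split(":")[2]
    # Step 1: ask DNS where the authority's server lives.
    host, port = dns_lookup("_lsid._tcp." + authority)
    # Step 2: ask the authority which services it offers for this LSID.
    services = list_services(host, port, lsid)
    # Step 3: choose a service (here simply the first data endpoint) and fetch.
    endpoint = services["data"][0]
    return fetch(endpoint, lsid)


# In-memory stand-ins for DNS, the authority, and the data store (all invented).
FAKE_DNS = {"_lsid._tcp.limnology.wisc.edu": ("lsid.example.org", 8080)}


def fake_dns_lookup(name):
    return FAKE_DNS[name]


def fake_list_services(host, port, lsid):
    base = "http://%s:%d" % (host, port)
    return {"data": [base + "/data"], "metadata": [base + "/metadata"]}


def fake_fetch(endpoint, lsid):
    return "GET %s?lsid=%s" % (endpoint, lsid)


result = resolve_lsid(
    "urn:lsid:limnology.wisc.edu:dataset:ntlfi02",
    fake_dns_lookup,
    fake_list_services,
    fake_fetch,
)
print(result)
```

Keeping the authority lookup separate from the data fetch is what lets the authority name differ from the machine that actually stores the data or metadata.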
LSID Promises
• I promise to never change the data behind an LSID.
• I will make sure my LSIDs are being served, or give them to someone who can do it.
• I will give my LSIDs metadata – at least give them a title and a format.
Other GUID systems
• URLs
  • Files move
  • The data change
  • Unstructured metadata
• UUIDs – 128-bit values, unique for all practical purposes
  • 58f202ac-22cf-11d1-b12d-002035b29092
  • No resolution
  • No metadata
• Handle System / DOIs (10.12/2312)
  • Non-standard protocol
  • Centralized resolution
  • Unstructured metadata (for Handle System)
  • High costs (for DOI)
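To make the UUID entry concrete, the short Python sketch below inspects the example UUID from the slide with the standard `uuid` module and then generates a fresh one. It only illustrates the point made above: a UUID carries a few structural fields, but nothing in it says where to resolve it or what it describes.

```python
import uuid

# The example UUID from the slide.
example = uuid.UUID("58f202ac-22cf-11d1-b12d-002035b29092")

print(len(example.bytes) * 8)   # size in bits
print(example.version)          # which UUID generation scheme produced it
print(example.variant)          # which UUID layout it follows

# A freshly generated random UUID; no registry or authority is consulted.
fresh = uuid.uuid4()
print(fresh)
```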
Issues For This Community
• What gets a GUID?
• For each of those things, what’s the data, what’s the metadata?
• One GUID per item?
• Centralization – who issues GUIDs?
What Gets a GUID?
• These things probably should get GUIDs
  • Taxonomic concepts
  • Specimens
  • Publications
  • People
• These things might get GUIDs
  • Taxonomic names
  • Journals
  • Data providers
  • Observations
Specimen Data? Metadata?
• If specimens get a GUID – what does it identify?
  • The physical specimen?
  • A collection’s database record of the specimen?
  • What about multiple labels?
• Main question – what doesn’t change about a specimen?
• Other main question – how should the data be represented?
  • Darwin Core includes the specimen’s current institution location. That is not a good fit for the data of a GUID, since the location may change.
One GUID Per Item?
• No GUID system inherently enforces a 1:1 mapping between GUID and data.
• Everyone should TRY to limit the number of GUIDs per item.
• Should there be any centralization to help achieve this?
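Since nothing in a GUID system enforces the 1:1 mapping, any check has to be added on top. The sketch below shows one hypothetical approach an index could take: group GUIDs by a digest of the data they serve and report groups with more than one GUID. The sample LSIDs and authorities are invented, and byte-identical data is a deliberately narrow definition of "the same item"; real duplicates would usually need domain-specific matching.

```python
import hashlib
from collections import defaultdict


def duplicate_guids(records):
    """Return sorted groups of GUIDs whose data is byte-for-byte identical."""
    by_digest = defaultdict(list)
    for guid, data in records.items():
        by_digest[hashlib.sha256(data).hexdigest()].append(guid)
    return sorted(sorted(group) for group in by_digest.values() if len(group) > 1)


# Invented sample: two authorities issued GUIDs for the same name string.
sample = {
    "urn:lsid:authority-a.example.org:names:1": b"Canis lupus",
    "urn:lsid:authority-b.example.org:taxa:77": b"Canis lupus",
    "urn:lsid:authority-a.example.org:names:2": b"Cypselurus heterurus",
}

print(duplicate_guids(sample))
```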
Degrees of Centralization
• An index
  • List your GUID authority in an index so your GUIDs are easy to find.
• A central authority
  • One authority could be responsible for issuing GUIDs to the community for specific types of information – you’d have to get one from here.
    • GBIF?
    • The IC_Ns? (ICZN, ICBN, …)
    • lsidauthority.org?
  • This would help enforce a 1:1 mapping of GUIDs and data items.
  • It would also relieve data providers of the need to maintain their own authorities.
  • It MAY also reduce the likelihood of GUIDs becoming unresolvable.
  • It may also be infeasible, technically or socially.
• A respected authority
  • With LSIDs, an authority can be set up to serve its own GUIDs and proxy other authorities.
  • This would help enforce a 1:1 mapping for those who use the authority.
  • It may also be more feasible.
LSID Resources
• LSID articles and code from IBM
  • http://www-124.ibm.com/developerworks/oss/lsid/#whatislsid
• Current LSID specification
  • http://www.omg.org/cgi-bin/doc?dtc/04-05-01
• Launchpad – an LSID resolver for Windows IE
  • Available from the first link
• A website which resolves LSIDs
  • http://lsid.biopathways.org/resolver/
• URN specification
  • http://www.ietf.org/rfc/rfc2141.txt
Acknowledgements
• My work on GUIDs has been funded by the SEEK project – seek.ecoinformatics.org.
• SEEK is funded by National Science Foundation award 0225676.
• Thanks to Ben Szekely at IBM for his LSID articles, his LSID Java code, and for answering all my questions.
Questions for Discussion
• Do we need GUIDs?
• What gets a GUID?
• One GUID per item?
• Centralization?