This comprehensive guide covers the Darwin Core Archives checklist, extensions for interoperability, best practices, publishing support tools, and guidelines for Darwin Core standards. Learn about taxonomy, data formats, identifiers, classifications, nomenclature, and more.
Darwin Core Archives • Checklist Archives • Checklist Extensions • Archive Tools • Checklist Bank • Markus Döring & David Remsen, GBIF 2010
Darwin Core • Ratified in 2009 • Significant additions/refinements • Ongoing process • Set of terms • http://rs.tdwg.org/dwc/terms/index.htm • Not tied to technology • Use Text Guidelines for DwC-A • http://rs.tdwg.org/dwc/terms/guides/text/index.htm
Darwin Core Archives for interoperability • Simplicity • Complete datasets, compressed • Allow for rich dataset metadata • Single CSV with header as minimal requirement • Flexible • 1:many extensions • Schema descriptor meta.xml • Property mapping to column or global value • GNA exchange format • Standard extensions • Taxonomic core conventions • Controlled vocabularies
Best Practices • Include dataset metadata file or URL • inside <archive metadata="..."> • GBIF recognises EML files • For simplicity, a Dublin Core XML file will do • Data file format • UTF-8 • tab-delimited or CSV files • header row • NULL as empty string, not "\N" or "NULL"
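The data file conventions above can be illustrated with a minimal Python sketch. It assumes a tab-delimited file with a header row; the function name `read_rows`, the `NULL_TOKENS` set and the sample rows are made up for illustration and are not part of any GBIF tool.

```python
import csv
import io

# Tokens that some exports use for missing values. The slide recommends an
# empty string instead; this set is illustrative, not an official list.
NULL_TOKENS = {"\\N", "NULL"}


def read_rows(text, delimiter="\t"):
    """Yield one dict per data row, keyed by the header row.

    Values equal to a null token are replaced by the empty string.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    for row in reader:
        yield {key: ("" if value in NULL_TOKENS else value)
               for key, value in row.items()}


# Hypothetical sample data: a header row plus two records.
sample = (
    "taxonID\tscientificName\tparentNameUsageID\n"
    "1\tPlantae\t\\N\n"
    "2\tRubus\tNULL\n"
)
rows = list(read_rows(sample))
```

When reading from disk, opening the file with `encoding="utf-8"` matches the UTF-8 recommendation.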
Dwc:Taxon – Identifier • Relational data, Record ID is the primary key that other id terms relate to • = TaxonID for checklist archives • = OccurrenceID for occurrence archives • TaxonConceptID • Asserting that taxa have a shared concept • ScientificNameID • Link out to some optional name identifier, GUID really • Identifiers are plain strings, can be any format • Literal terms, e.g. parentNameUsage • All Dwc ID terms have such a literal friend • Redundant if id terms are used • to be avoided for relations, e.g. homonyms
Dwc:Taxon - Classification • Classification only for accepted taxa, not synonyms • parentNameUsageID • Allows for arbitrary ranks and levels • Beware infinite loops • Root with parentID=NULL or parentID=recordID • Denormalised (prefer the use of parentNameUsageID) • Kingdom, Phylum, Class, Order, Family, Genus, Subgenus • No explicit records required for higher taxa • TaxonRank • String, but recommended vocabulary http://rs.gbif.org/vocabulary/gbif/rank.xml • Examples http://code.google.com/p/gbif-ecat/wiki/publishingClassifications
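The parentNameUsageID rules above (root marked by an empty parent or a parent equal to the record's own ID, and the warning about infinite loops) can be sketched as a small Python function. The name `classification` and the `parent_of` mapping are assumptions made for this example.

```python
def classification(taxon_id, parent_of):
    """Return the chain of record IDs from taxon_id up to its root.

    parent_of maps a record ID to its parentNameUsageID, with "" standing
    for NULL. A record is treated as a root when its parent is empty or
    equal to its own ID. A ValueError is raised if the chain loops.
    A parent ID missing from parent_of ends the chain at that ID.
    """
    path = []
    seen = set()
    current = taxon_id
    while True:
        if current in seen:
            raise ValueError(f"classification loop at {current!r}")
        seen.add(current)
        path.append(current)
        parent = parent_of.get(current, "")
        if parent == "" or parent == current:
            return path
        current = parent


# Hypothetical data: 3 -> 2 -> 1 (root), 9 is a self-referencing root,
# and a/b point at each other.
parent_of = {"1": "", "2": "1", "3": "2", "9": "9", "a": "b", "b": "a"}
lineage = classification("3", parent_of)
```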
Dwc:Taxon - Synonyms • Synonyms are records in core file • But classification should be ignored • acceptedNameUsageID • Synonyms point to the accepted/valid name usage • Accepted names have NULL or point to themselves • pro parte synonyms concatenate all accepted IDs with the | symbol • taxonomicStatus • Accepted, (hetero-/homotypic) synonym, misapplied • See http://rs.gbif.org/vocabulary/gbif/taxonomic_status.xml • nameAccordingTo • sec. / sensu part of taxon concepts
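A minimal Python sketch of the acceptedNameUsageID conventions above, assuming records are plain dicts keyed by Darwin Core term names. The helper names `accepted_ids` and `is_synonym` are hypothetical.

```python
def accepted_ids(record):
    """Return the accepted taxon IDs that a record points to.

    An empty acceptedNameUsageID means the record is itself accepted, so
    its own taxonID is returned. Pro parte synonyms list several accepted
    IDs separated by the | symbol.
    """
    raw = record.get("acceptedNameUsageID", "")
    ids = [part.strip() for part in raw.split("|") if part.strip()]
    return ids or [record["taxonID"]]


def is_synonym(record):
    """True when the record points to an accepted usage other than itself."""
    return accepted_ids(record) != [record["taxonID"]]


# Hypothetical records.
accepted = {"taxonID": "1", "acceptedNameUsageID": ""}
self_ref = {"taxonID": "2", "acceptedNameUsageID": "2"}
pro_parte = {"taxonID": "5", "acceptedNameUsageID": "1|2"}
```

This sketch looks only at the identifier column; a real import would also consult taxonomicStatus.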
Dwc:Taxon – Nomenclature • scientificName • full name with authorship • genus, subgenus, specificEpithet, verbatimTaxonRank, infraspecificEpithet, scientificNameAuthorship • namePublishedIn • nomenclaturalStatus • nomenclaturalCode • http://rs.gbif.org/vocabulary/gbif/nomenclatural_code.xml • originalNameUsageID • Basionym, Pointer to usage that first established the name
Dwc Extensions - Basics • One to many relation, schema descriptor meta.xml • id column required to join extensions • rowType specifies the class of records / extension • Property mapping to column or global value • List of allowed properties with • Definition, examples, further link • Mandate Vocabulary • Basic data types: string, integer, decimal, boolean, date, dateTime • Centrally hosted at http://rs.gbif.org • Staging environment • Production is manually moderated, but open to community
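To make the descriptor concrete, here is a hedged Python sketch that reads a small meta.xml and extracts the rowType, the id column, and the property mapping to columns or global values. The element and attribute names follow the Darwin Core text guide; the file names, the chosen terms, the default value and the helper `describe` are illustrative assumptions only.

```python
import xml.etree.ElementTree as ET

NS = {"dwc": "http://rs.tdwg.org/dwc/text/"}

# Illustrative descriptor: one core file and one 1:many extension.
META = """<archive xmlns="http://rs.tdwg.org/dwc/text/">
  <core rowType="http://rs.tdwg.org/dwc/terms/Taxon" ignoreHeaderLines="1">
    <files><location>taxa.txt</location></files>
    <id index="0"/>
    <field index="1" term="http://rs.tdwg.org/dwc/terms/scientificName"/>
    <field default="ICZN" term="http://rs.tdwg.org/dwc/terms/nomenclaturalCode"/>
  </core>
  <extension rowType="http://rs.gbif.org/terms/1.0/VernacularName" ignoreHeaderLines="1">
    <files><location>vernacular.txt</location></files>
    <coreid index="0"/>
    <field index="1" term="http://rs.tdwg.org/dwc/terms/vernacularName"/>
  </extension>
</archive>"""


def describe(table):
    """Summarise a <core> or <extension> element as a plain dict."""
    columns, defaults = {}, {}
    for field in table.findall("dwc:field", NS):
        term = field.get("term")
        if field.get("index") is not None:
            columns[term] = int(field.get("index"))   # mapped to a column
        if field.get("default") is not None:
            defaults[term] = field.get("default")     # global value
    # The core uses <id>, extensions use <coreid> to join back to the core.
    id_el = table.find("dwc:id", NS)
    if id_el is None:
        id_el = table.find("dwc:coreid", NS)
    return {
        "rowType": table.get("rowType"),
        "file": table.findtext("dwc:files/dwc:location", namespaces=NS),
        "id_column": int(id_el.get("index")),
        "columns": columns,
        "defaults": defaults,
    }


root = ET.fromstring(META)
core = describe(root.find("dwc:core", NS))
extensions = [describe(ext) for ext in root.findall("dwc:extension", NS)]
```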
Dwc:Taxon Extensions • Frozen soon for GNA “Simple Exchange Format” http://rs.gbif.org/extension/gbif/1.0/ • Vernaculars • Distribution • Bibliography • Alternative ids & links. Webpage, LSID, DOI, JSON, etc • Candidates for further extensions • species info • images • nomenclatural acts & name relations • concept relations • type specimen
Darwin Core Tools Publishing support
DwC-A Reader Java library • Provides iterators across star schema • Dwc terms and GNA extension terms as enumerations
Validator Status: Under Evaluation http://tools.gbif.org/dwca-validator/
Integrated Publishing Toolkit • Compose EML Metadata • Connect to database • Upload Data • Transform to DWCA • Publish via GBIF http://ipt.gbif.org Status: Stable release – end 2010
Guidelines and Best Practices • DB Admin skills • Database export • No tools required • Successful pilots • Ireland • NBN UK • Norway • Avian Knowledge network • IPNI • IRMNG Status: Drafts for November campaign (see roadmap)
Authoring Descriptor XML Status: Ready for Review Metafile http://tools.gbif.org/dwca-assistant/
Excel Spreadsheet Templates Status: Ready for Review/Testing
Spreadsheet Processor Status: Ready for Review http://tools.gbif.org/spreadsheet-processor/
Checklist Bank Indexing checklists
GBIF Checklist Bank • Rich index to checklists and their content • All of Dwc Taxon and GNA Simple Format extensions: Vernacular names, Identifier & Links, Distribution, References • ~35 million name usages, 90 datasets + 8500 derived from occurrence index • Checklists • DwC-A created by • Publisher • Adapters (CoL, ITIS, NCBI, USDA, GRIN, TreeOfLife) • Manual transformation, static • No versioning • 4 main types: taxonomic, nomenclatural, occurrences, thematic
Name Usages • Checklists are made up of name usages: a plain name string with optionally: • Classification • Taxonomic status, e.g. synonym, misapplied name • Original name, i.e. basionym • According to, i.e. taxon concept • Nomenclatural status • Original publication
Lexical Grouping • Name strings are parsed and grouped • Correct & incorrect spellings • Homonyms in several groups • Semiautomatic process largely based on canonical, year and higher classification • Allows for • Fuzzy matching • Checklist crosswalk • Rubus silvaticus • Rubus sylvaticus • Rubus silvaticum • Rubus silvaticus Weihe & Nees • Vertebrata [animal subphylum] • Vertebrate • Vertebrata Cuvier, 1812 • Vertebrata [algae genus] • Vertebrata Gray • Vertebrata S.F. Gray, 1821 • Gerardia paupercula var. borealis (Pennell) Deam • Gerardia paupercula (Gray) Britt. var. borealis (Pennell) Deam • Gerardia paupercula (A.Gray) Britton var. borealis (Pennell) Deam • Gerardia paupercula borealis • Gerardia paupercula borealis (Pennell) Deam
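The idea of grouping spelling variants can be illustrated with a toy Python sketch. It is not the Checklist Bank algorithm: it uses only a crude canonical form and string similarity from the standard library's `difflib`, and ignores year and higher classification, so it cannot separate homonyms. The functions `canonical` and `group_names` and the 0.9 threshold are assumptions made for this example.

```python
import difflib
import re


def canonical(name):
    """Crude canonical form: first token plus all purely lowercase tokens.

    Authorship ("Weihe & Nees", "(Pennell) Deam") and rank markers with a
    dot ("var.") are dropped. Lowercase author particles such as "ex"
    would wrongly be kept; a real name parser is needed for that.
    """
    tokens = name.split()
    if not tokens:
        return ""
    epithets = [t for t in tokens[1:] if re.fullmatch(r"[a-z-]+", t)]
    return " ".join([tokens[0]] + epithets).lower()


def group_names(names, threshold=0.9):
    """Greedily put each name into the first group with a similar canonical."""
    groups = []  # list of (representative canonical, member names)
    for name in names:
        key = canonical(name)
        for representative, members in groups:
            ratio = difflib.SequenceMatcher(None, representative, key).ratio()
            if ratio >= threshold:
                members.append(name)
                break
        else:
            groups.append((key, [name]))
    return [members for _, members in groups]


names = [
    "Rubus silvaticus",
    "Rubus sylvaticus",
    "Rubus silvaticum",
    "Rubus silvaticus Weihe & Nees",
    "Vertebrata Cuvier, 1812",
]
groups = group_names(names)
```

With these inputs the four Rubus spellings fall into one group and Vertebrata into another; the Gerardia variants from the slide all reduce to the same canonical string.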
Nomenclatural Grouping • Grouping homotypic names • Original name relation • Homotypic synonyms • Not yet available
Checklist Bank Portal • Preliminary until new GBIF portal complete • Browse & Search • Statistics • Links to source pages • Flickr Images
Checklist Bank Webservices • Common API to all resources • RESTful JSON services • search names, usages, checklists • navigate classification • http://ecat-dev.gbif.org/api/clb
Importing Darwin Core • Highly relational data • Challenges faced • Syntactically damaged sources • Wrong mappings, charsets, unescaped line breaks or field delimiters • Data Quality • Broken referential integrity • Non-names, e.g. “Unallocated Family” • No standard vocabularies for ranks, status, etc • Name strings have several publishing options • ScientificName, Authorship, Genus + epithets + rank • Classification has several publishing options • Normalised (parentUsage / parentUsageID) or flat via Linnean Ranks
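One of the data quality problems listed above, broken referential integrity, can be checked with a short Python sketch. It assumes records are dicts keyed by Darwin Core term names with "" for NULL; the function `broken_references` is a hypothetical helper, not part of any GBIF importer.

```python
def broken_references(
    records,
    id_terms=("parentNameUsageID", "acceptedNameUsageID", "originalNameUsageID"),
):
    """List (taxonID, term, target) for every ID that matches no record."""
    known = {record["taxonID"] for record in records}
    problems = []
    for record in records:
        for term in id_terms:
            value = record.get(term, "")
            # Only acceptedNameUsageID may hold several |-separated IDs
            # (pro parte synonyms); other terms hold a single ID.
            targets = value.split("|") if term == "acceptedNameUsageID" else [value]
            for target in targets:
                if target and target not in known:
                    problems.append((record["taxonID"], term, target))
    return problems


# Hypothetical records: record 3 points at two IDs that do not exist.
records = [
    {"taxonID": "1", "parentNameUsageID": ""},
    {"taxonID": "2", "parentNameUsageID": "1"},
    {"taxonID": "3", "parentNameUsageID": "7", "acceptedNameUsageID": "2|8"},
]
problems = broken_references(records)
```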
GBIF Nub • Synthetic “union taxonomy”, checklist #1 • Lexical group = nub name usage • Classification based on prioritized checklists • Align to 8 CoL kingdoms • Fixed accepted ranks: • Linnean + subfamily, subgenus, section, subspecies, variety, form • Other ranks become “Intermediate rank” synonyms • Homotypic synonyms only • Work in progress!
Personal Name Lists • User accounts with personal name lists • Name string + kingdom/nom code • Add classifications, status, distribution, vernaculars, etc from one or more indexed checklists • Also on the fly via webservices • but only for already indexed name strings • In development …