Darwin Core Archives Checklist Archives Checklist Extensions Archive Tools Checklist Bank Markus Döring & David Remsen, GBIF 2010
Darwin Core • Ratified in 2009 • Significant additions/refinements • Ongoing process • Set of terms • http://rs.tdwg.org/dwc/terms/index.htm • Not tied to technology • Use Text Guidelines for DwC-A • http://rs.tdwg.org/dwc/terms/guides/text/index.htm
Darwin Core Archives for interoperability • Simplicity • Complete datasets, compressed • Allow for rich dataset metadata • A single CSV file with a header row is the minimal requirement • Flexible • 1:many extensions • Schema descriptor meta.xml • Property mapping to column or global value • GNA exchange format • Standard extensions • Taxonomic core conventions • Controlled vocabularies
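The descriptor mentioned above can be sketched in a few lines of Python. This is a minimal illustration only, assuming the element and attribute names of the Darwin Core Text Guidelines; the file names taxa.txt and eml.xml, the column positions, and the global value "ICZN" are hypothetical choices made for the example.

```python
import xml.etree.ElementTree as ET

DWC_TEXT_NS = "http://rs.tdwg.org/dwc/text/"
DWC_TERMS = "http://rs.tdwg.org/dwc/terms/"


def q(tag):
    """Qualify a tag name with the Darwin Core text namespace."""
    return f"{{{DWC_TEXT_NS}}}{tag}"


def build_meta_xml():
    """Return a meta.xml string describing one tab-delimited taxon core file."""
    ET.register_namespace("", DWC_TEXT_NS)
    # metadata points at the dataset metadata file (hypothetical name)
    archive = ET.Element(q("archive"), {"metadata": "eml.xml"})
    core = ET.SubElement(archive, q("core"), {
        "rowType": DWC_TERMS + "Taxon",
        "encoding": "UTF-8",
        "fieldsTerminatedBy": "\\t",   # literal backslash-t in the XML
        "linesTerminatedBy": "\\n",
        "ignoreHeaderLines": "1",      # the data file has one header row
    })
    files = ET.SubElement(core, q("files"))
    ET.SubElement(files, q("location")).text = "taxa.txt"  # hypothetical file
    # Column 0 holds the record id that extensions join on
    ET.SubElement(core, q("id"), {"index": "0"})
    # Property mapped to a column
    ET.SubElement(core, q("field"),
                  {"index": "1", "term": DWC_TERMS + "scientificName"})
    # Property mapped to a global value: no index, only a default
    ET.SubElement(core, q("field"),
                  {"term": DWC_TERMS + "nomenclaturalCode", "default": "ICZN"})
    return ET.tostring(archive, encoding="unicode")


if __name__ == "__main__":
    print(build_meta_xml())
```

A field with an index maps a term to a column, while a field with only a default applies one value to every record in the file.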
Best Practices • Include dataset metadata file or URL • inside <archive metadata=“...”> • GBIF recognises EML files • For simplicity, a Dublin Core XML file is sufficient • Data file format • UTF8 • tab or csv files • header row • NULL as empty string, not “\N” or “NULL”
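The data file recommendations above can be sketched with Python's standard csv module. The function names and the sample rows are hypothetical; the point is only to show a header row, tab delimiters, and NULL written as an empty string rather than “\N” or “NULL”.

```python
import csv
import io

# Values that database exports commonly use for NULL
NULL_MARKERS = {None, "\\N", "NULL"}


def clean(value):
    """Return an empty string for any NULL marker, else the value as text."""
    return "" if value in NULL_MARKERS else str(value)


def write_data_file(handle, header, rows):
    """Write a header row followed by tab-delimited data rows."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    for row in rows:
        writer.writerow([clean(value) for value in row])


if __name__ == "__main__":
    buffer = io.StringIO()
    write_data_file(
        buffer,
        ["taxonID", "scientificName", "parentNameUsageID"],
        [[1, "Plantae", None], [2, "Rubus", "NULL"]],
    )
    print(buffer.getvalue())
```

To write a real file in UTF-8, open it with open(path, "w", encoding="utf-8", newline="") and pass the handle instead of the in-memory buffer.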
Dwc:Taxon – Identifier • Relational data, Record ID is the primary key that other id terms relate to • = TaxonID for checklist archives • = OccurrenceID for occurrence archives • TaxonConceptID • Asserting that taxa have a shared concept • ScientificNameID • Link out to some optional name identifier, GUID really • Identifiers are plain strings, can be any format • Literal terms, e.g. parentNameUsage • All Dwc ID terms have such a literal friend • Redundant if id terms are used • to be avoided for relations, e.g. homonyms
Dwc:Taxon - Classification • Classification only for accepted taxa, not synonyms • parentNameUsageID • Allows for arbitrary ranks and levels • Beware infinite loops • Root with parentID=NULL or parentID=recordID • Denormalised (prefer the use of parentNameUsageID) • Kingdom,Phylum,Class,Order,Family,Genus,Subgenus • No explicit records required for higher taxa • TaxonRank • String, but recommended vocabulary http://rs.gbif.org/vocabulary/gbif/rank.xml • Examples http://code.google.com/p/gbif-ecat/wiki/publishingClassifications
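The warning about infinite loops can be checked before publishing. The sketch below assumes the classification has already been loaded into a plain dictionary mapping each taxonID to its parentNameUsageID; the function names and sample IDs are hypothetical.

```python
def is_root(taxon_id, parent_of):
    """A root has no parent or points to itself."""
    parent = parent_of.get(taxon_id)
    return parent in (None, "") or parent == taxon_id


def find_loops(parent_of):
    """Return the IDs whose parent chain never reaches a root."""
    looping = set()
    for start in parent_of:
        seen = {start}
        node = start
        while not is_root(node, parent_of):
            node = parent_of[node]
            if node in seen:          # visited before: the chain is circular
                looping.add(start)
                break
            seen.add(node)
    return looping


if __name__ == "__main__":
    parent_of = {
        "1": None,   # root with parent NULL
        "2": "1",
        "3": "2",
        "4": "5",    # 4 and 5 point at each other
        "5": "4",
        "6": "6",    # root that points to itself
        "7": "4",    # hangs off the loop
    }
    print(sorted(t for t in parent_of if is_root(t, parent_of)))
    print(sorted(find_loops(parent_of)))
```

A parent ID that does not exist in the file ends the walk without being reported here, since that is a referential integrity problem rather than a loop.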
Dwc:Taxon - Synonyms • Synonyms are records in the core file • But classification should be ignored • acceptedNameUsageID • Synonyms point to the accepted/valid name usage • Accepted names have NULL or point to themselves • Pro parte synonyms concatenate all accepted IDs with the | symbol • taxonomicStatus • Accepted, (hetero-/homotypic) synonym, misapplied • See http://rs.gbif.org/vocabulary/gbif/taxonomic_status.xml • nameAccordingTo • sec. / sensu part of taxon concepts
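The acceptedNameUsageID conventions above translate into a small helper. This is a sketch assuming each record is a dictionary keyed by Darwin Core term names; the helper names and sample records are hypothetical.

```python
def accepted_ids(record):
    """Return the accepted usage IDs a record points to.

    An empty value means the record is itself accepted; pro parte
    synonyms list several IDs separated by the | symbol.
    """
    raw = record.get("acceptedNameUsageID") or ""
    ids = [part.strip() for part in raw.split("|") if part.strip()]
    return ids or [record["taxonID"]]


def is_accepted(record):
    """Accepted names have NULL or point to themselves."""
    return accepted_ids(record) == [record["taxonID"]]


if __name__ == "__main__":
    records = [
        {"taxonID": "10", "acceptedNameUsageID": ""},       # accepted
        {"taxonID": "11", "acceptedNameUsageID": "11"},     # accepted, self
        {"taxonID": "12", "acceptedNameUsageID": "10"},     # synonym
        {"taxonID": "13", "acceptedNameUsageID": "10|11"},  # pro parte
    ]
    for record in records:
        print(record["taxonID"], is_accepted(record), accepted_ids(record))
```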
Dwc:Taxon – Nomenclature • scientificName • full name with authorship • genus, subgenus, specificEpithet, verbatimTaxonRank, infraspecificEpithet, scientificNameAuthorship • namePublishedIn • nomenclaturalStatus • nomenclaturalCode • http://rs.gbif.org/vocabulary/gbif/nomenclatural_code.xml • originalNameUsageID • Basionym, Pointer to usage that first established the name
Dwc Extensions - Basics • One to many relation, schema descriptor meta.xml • id column required to join extensions • rowType specifies the class of records / extension • Property mapping to column or global value • List of allowed properties with • Definition, examples, further link • Mandate Vocabulary • Basic data types: string, integer, decimal, boolean, date, dateTime • Centrally hosted at http://rs.gbif.org • Staging environment • Production is manually moderated, but open to community
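The one-to-many relation between a core file and an extension can be illustrated without any special tooling. The sketch below assumes both files are tab-delimited with a header row and that the join column is called taxonID in both; the helper names and sample data are hypothetical.

```python
import csv
import io
from collections import defaultdict


def read_tab(text):
    """Parse tab-delimited text with a header row into dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def join_star(core_rows, extension_rows, id_column="taxonID"):
    """Pair every core record with its extension records (possibly none)."""
    by_core_id = defaultdict(list)
    for row in extension_rows:
        by_core_id[row[id_column]].append(row)
    return [(core, by_core_id.get(core[id_column], [])) for core in core_rows]


CORE = (
    "taxonID\tscientificName\n"
    "1\tPuma concolor (Linnaeus, 1771)\n"
    "2\tFelis catus Linnaeus, 1758\n"
)
VERNACULARS = (
    "taxonID\tvernacularName\tlanguage\n"
    "1\tcougar\ten\n"
    "1\tpuma\tes\n"
)

if __name__ == "__main__":
    for core, names in join_star(read_tab(CORE), read_tab(VERNACULARS)):
        print(core["scientificName"], [n["vernacularName"] for n in names])
```

Each core record appears exactly once, and an extension may contribute zero, one or many rows to it.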
Dwc:Taxon Extensions • Frozen soon for GNA “Simple Exchange Format” http://rs.gbif.org/extension/gbif/1.0/ • Vernaculars • Distribution • Bibliography • Alternative ids & links. Webpage, LSID, DOI, JSON, etc • Candidates for further extensions • species info • images • nomenclatural acts & name relations • concept relations • type specimen
Darwin Core Tools Publishing support
DwC-A Reader Java library • Provides iterators across star schema • Dwc terms and GNA extension terms as enumerations
Validator Status: Under Evaluation http://tools.gbif.org/dwca-validator/
Integrated Publishing Toolkit • Compose EML Metadata • Connect to database • Upload Data • Transform to DWCA • Publish via GBIF http://ipt.gbif.org Status: Stable release – end 2010
Guidelines and Best Practices • DB Admin skills • Database export • No tools required • Successful pilots • Ireland • NBN UK • Norway • Avian Knowledge network • IPNI • IRMNG Status: Drafts for November campaign (see roadmap)
Authoring Descriptor XML Status: Ready for Review Metafile http://tools.gbif.org/dwca-assistant/
Excel Spreadsheet Templates Status: Ready for Review/Testing
Spreadsheet Processor Status: Ready for Review http://tools.gbif.org/spreadsheet-processor/
Checklist Bank Indexing checklists
GBIF Checklist Bank • Rich index to checklists and their content • All of Dwc Taxon and GNA Simple Format extensions: Vernacular names, Identifier & Links, Distribution, References • ~35 million name usages, 90 datasets + 8500 derived from occurrence index • Checklists • DwC-A created by • Publisher • Adapters (CoL, ITIS, NCBI, USDA, GRIN, TreeOfLife) • manual Transformation, static • No versioning • 4 main types: taxonomic, nomenclatural, occurrences, thematic
Name Usages • Checklists are made up of name usages: a plain name string with optionally: • Classification • Taxonomic status, e.g. synonym, misapplied name • Original name, i.e. basionym • According to, i.e. taxon concept • Nomenclatural status • Original publication
Lexical Grouping • Name strings are parsed and grouped • Correct & incorrect spellings • Homonyms in several groups • Semiautomatic process largely based on canonical, year and higher classification • Allows for • Fuzzy matching • Checklist crosswalk • Rubus silvaticus • Rubus sylvaticus • Rubus silvaticum • Rubus silvaticus Weihe & Nees • Vertebrata [animal subphylum] • Vertebrate • Vertebrata Cuvier, 1812 • Vertebrata [algae genus] • Vertebrata Gray • Vertebrata S.F. Gray, 1821 • Gerardia paupercula var. borealis (Pennell) Deam • Gerardia paupercula (Gray) Britt. var. borealis (Pennell) Deam • Gerardia paupercula (A.Gray) Britton var. borealis (Pennell) Deam • Gerardia paupercula borealis • Gerardia paupercula borealis (Pennell) Deam
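A rough idea of lexical grouping can be given with the Rubus and Gerardia examples above. This sketch is not the Checklist Bank's process: the canonical() heuristic is a crude stand-in for a real name parser (it would mishandle lowercase author particles such as "de" or "ex"), the 0.9 similarity threshold is an arbitrary assumption, and year and higher classification are ignored, so homonyms such as the two Vertebrata would be merged incorrectly.

```python
from difflib import SequenceMatcher


def canonical(name):
    """Crude canonical form: genus plus all-lowercase alphabetic epithets.

    Rank markers ("var."), authors ("Deam", "(Pennell)") and "&" are
    dropped because they are not purely lowercase letters.
    """
    tokens = name.split()
    epithets = [t for t in tokens[1:] if t.isalpha() and t.islower()]
    return " ".join([tokens[0]] + epithets)


def group_names(names, threshold=0.9):
    """Greedily group names whose canonical forms are nearly identical."""
    groups = []  # list of (representative canonical form, member names)
    for name in names:
        canon = canonical(name)
        for representative, members in groups:
            ratio = SequenceMatcher(None, representative.lower(),
                                    canon.lower()).ratio()
            if ratio >= threshold:
                members.append(name)
                break
        else:
            groups.append((canon, [name]))
    return groups


if __name__ == "__main__":
    names = [
        "Rubus silvaticus",
        "Rubus sylvaticus",
        "Rubus silvaticum",
        "Rubus silvaticus Weihe & Nees",
        "Gerardia paupercula var. borealis (Pennell) Deam",
        "Gerardia paupercula (Gray) Britt. var. borealis (Pennell) Deam",
        "Gerardia paupercula borealis",
    ]
    for representative, members in group_names(names):
        print(representative, len(members))
```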
Nomenclatural Grouping • Grouping homotypic names • Original name relation • Homotypic synonyms • Not yet available
Checklist Bank Portal • Preliminary until new GBIF portal complete • Browse & Search • Statistics • Links to source pages • Flickr Images
Checklist Bank Webservices • Common API to all resources • RESTful JSON services • search names, usages, checklists • navigate classification • http://ecat-dev.gbif.org/api/clb
Importing Darwin Core • Highly relational data • Challenges faced • Syntactically damaged sources • Wrong mappings, charsets, unescaped line breaks or field delimiters • Data Quality • Broken referential integrity • Non-names, e.g. “Unallocated Family” • No standard vocabularies for ranks, status, etc • Name strings have several publishing options • ScientificName, Authorship, Genus + epithets + rank • Classification has several publishing options • Normalised (parentNameUsage / parentNameUsageID) or flat via Linnean Ranks
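Broken referential integrity is one of the challenges that a publisher can test for before export. The sketch below assumes records are dictionaries keyed by Darwin Core term names and that multi-valued references use the | separator; the function name and sample records are hypothetical.

```python
ID_TERMS = ("parentNameUsageID", "acceptedNameUsageID", "originalNameUsageID")


def dangling_references(records, id_field="taxonID", ref_fields=ID_TERMS):
    """Return (record id, field, missing id) for every unresolved reference."""
    known = {record[id_field] for record in records}
    problems = []
    for record in records:
        for field in ref_fields:
            for ref in (record.get(field) or "").split("|"):
                ref = ref.strip()
                if ref and ref not in known:
                    problems.append((record[id_field], field, ref))
    return problems


if __name__ == "__main__":
    records = [
        {"taxonID": "1", "parentNameUsageID": ""},
        {"taxonID": "2", "parentNameUsageID": "1"},
        {"taxonID": "3", "parentNameUsageID": "99"},          # no record 99
        {"taxonID": "4", "acceptedNameUsageID": "2|77"},      # no record 77
    ]
    for problem in dangling_references(records):
        print(problem)
```

Empty values are skipped, so roots and accepted names with NULL references are not reported.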
GBIF Nub • Synthetic “union taxonomy”, checklist #1 • Lexical group = nub name usage • Classification based on prioritized checklists • Align to 8 CoL kingdoms • Fixed accepted ranks: • Linnean + subfamily, subgenus, section, subspecies, variety, form • Other ranks become “Intermediate rank” synonyms • Homotypic synonyms only • Work in progress!
Personal Name Lists • User accounts with personal name lists • Name string + kingdom/nom code • Add classifications, status, distribution, vernaculars, etc from one or more indexed checklists • Also on the fly via webservices • but only for already indexed name strings • In development …