Moving Targets: Integrating semistructured data
Pepé Ciardelli & Marc Geoffroy
Botanic Garden and Botanical Museum Berlin-Dahlem, Dept. of Biodiversity Informatics
TDWG 2000, Bratislava
The project
• Euro+Med Plantbase: on-line database for the vascular plants of Europe and the Mediterranean region; http://ww2.bgbm.org/EuroPlusMed
• 2722 pages in Microsoft Word, human generated
• 11,755 accepted taxa
• 18,119 synonyms
• Distribution tables
The actors
• Senior taxonomist – his baby; knows what a <TAB> is
• Junior taxonomist – technically sophisticated; proofs the data post-import
• Programmer – taxonomically sophisticated
• Programmer – taxonomically unsophisticated; naively believes every case can be caught with code
“Moving targets”
• Problems of notation
  • Files 1-10: 1865 (Mar.-Jun.)
  • File 11: 1902 [Oct.]
• Taxonomic rules can vary
  • One fine morning, species can be included in species
  • Collective species notation significantly different
  • Complex groups like Hieracium require extensive notation
• Communication generally good about taxonomic “moving targets”; about notation changes, not so much
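The notation drift between files 1-10 and file 11 can be absorbed by a pattern that accepts both bracket styles. The following is a minimal sketch, assuming the two examples on the slide differ only in round versus square brackets; the pattern name `DATE_NOTE` and the function `parse_date_note` are hypothetical and not part of the project's code.

```python
import re

# Hypothetical pattern: a four-digit year followed by a month note enclosed
# in either round or square brackets, covering both notations on the slide.
DATE_NOTE = re.compile(r"(?P<year>\d{4})\s*[\(\[](?P<months>[^\)\]]+)[\)\]]")


def parse_date_note(text):
    """Return (year, months) if text contains a dated note, else None."""
    match = DATE_NOTE.search(text)
    if match is None:
        return None
    return int(match.group("year")), match.group("months").strip()


print(parse_date_note("1865 (Mar.-Jun.)"))
print(parse_date_note("1902 [Oct.]"))
```

Accepting both forms in one pattern keeps the importer tolerant of the change without needing a per-file switch. A new notation in a later file would still return `None` and can be routed to manual review.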
The moving targets challenge
• Import 18 files, generated by hand, over time
• Build error-tolerant software that constantly evolves and improves
• Know when to say “enough!” – what is the best use of limited human resources?
Added wrinkle
• Taxonomist does not review his own work to confirm it’s been parsed correctly
• Absolutely essential: junior taxonomist intimately familiar not only with content, but with senior taxonomist
• Anticipate problems based on experience
The re-import
• For final revisions, database exported back into original Word format
• Senior taxonomist’s fine eye for detail confirmed that initial imports were successful
• Re-import presents opportunity to use most efficient workflow based on experience
Hard-won lessons
• Put data into XML format to catch “fatal errors” – i.e. typos that deviate from rudimentary markup
• Identify records likely to cause errors, capture for manual check post-import
• Run additional parsing software after the initial import
• Key realization: really not so many exceptions after all
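The first lesson, using XML to surface “fatal errors”, amounts to a well-formedness check before import. A minimal sketch using Python's standard `xml.etree.ElementTree` follows; the element names `taxon` and `name` are invented for illustration and do not reflect the project's actual markup.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return None if xml_text is well-formed XML, else a short error message."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        # ParseError.position holds the (line, column) where parsing failed.
        line, column = err.position
        return f"line {line}, column {column}: {err}"
    return None


# Hypothetical records: the second is missing the closing </name> tag.
good = "<taxon><name>Hieracium</name></taxon>"
bad = "<taxon><name>Hieracium</taxon>"

print(check_well_formed(good))
print(check_well_formed(bad))
```

A check like this only catches markup-level typos; records that are well-formed but taxonomically odd still need the manual post-import check listed above.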
The Taxonomic Web Editor
• Built to edit checklists stored in a Berlin Model DB
• What’s missing: knowing where to look for errors
• Based on experience, programmers provide taxonomists with (blessedly short) lists of suspect taxa
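A list of suspect taxa could be produced by running a few cheap heuristics over the imported names. The sketch below is purely illustrative: the slides do not say which criteria the programmers used, so the three checks (unbalanced brackets, stray whitespace, question marks) and both function names are assumptions.

```python
def suspect_reasons(name):
    """Return a list of reasons why a name string looks suspicious (assumed heuristics)."""
    reasons = []
    if name.count("(") != name.count(")") or name.count("[") != name.count("]"):
        reasons.append("unbalanced brackets")
    if "  " in name or "\t" in name:
        reasons.append("unexpected whitespace")
    if "?" in name:
        reasons.append("question mark")
    return reasons


def build_suspect_list(names):
    """Map each suspicious name to its reasons; clean names are left out."""
    suspects = {}
    for name in names:
        reasons = suspect_reasons(name)
        if reasons:
            suspects[name] = reasons
    return suspects


# Invented example strings, not taken from the Euro+Med data.
for name, reasons in build_suspect_list(
    ["Hieracium murorum L.", "Hieracium (murorum L."]
).items():
    print(name, "->", ", ".join(reasons))
```

Keeping the reasons alongside each name tells the taxonomist what to look for, which helps keep the list short enough to be reviewed.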
“In”-reference parser
• After import, all “in”-refs marked as “preliminary”
• Parse what fits regular expression patterns
• Parse the rest by hand
• Results:
  • 14,303 “in”-references
  • Only 26 unresolved, all of which were typos, not unmatched patterns
“In”-reference parser drawback
• Regular expressions are not everyone’s cup of tea
Specific solution to a specific problem
• Complex, somewhat quirky system of notation developed over decades
• Close relationship between taxonomists and programmers
• Limited human resources
• Re-usability not a goal of the project
• The right mix of automation and data massaging
Acknowledgments
• Mattfeld-Quadbeck Foundation
• Association of Friends of the BGBM
• Global Biodiversity Information Facility (GBIF)