BiOnym: A flexible workflow approach to taxon name matching
Edward Vanden Berghe (VUB), Nicolas Bailly (WorldFish), Caselyn Aldemita (FIN), Fabio Fiorellato (FAO), Gianpaolo Coro (CNR), Anton Ellenbroek (FAO), Pasquale Pagano (CNR)
Improving the current matchers
• Propose several Taxonomic Authority Files as references to match against
• Make the control of the matching workflow flexible and customizable (e.g., selection of the sequence of the matching methods)
• Give full control to advanced users [but still provide a set of default/standard workflow(s) for basic users]
TDWG Annual Conference 2013, Firenze, Italy, 31st October 2013
BiOnym approach
• There is no one size that fits all!
• Some applications are ‘fault intolerant’
  • E.g. compilation of authority lists
  • Have to minimise ‘false positives’, at the expense of less automation
• Others are less sensitive to mistakes
  • E.g. synonymy expansion in a biogeographic query: find distribution records of a single species under different names or spelling variations
• Will require different choices
A flexible workflow for taxon name matching: BiOnym
[Flow diagram: Raw input string (e.g. Gadis morua Lineus 1759) → Parsing and pre-processing → Taxon Matcher 1 → Taxon Matcher 2 → … → Taxon Matcher n → Post-processing → Matching name (Gadus morhua (Linnaeus, 1758))]
• Reference sources: FishBase, ASFIS, WoRMS, any in DwC-A
• Matchers:
  • GSAy (new)
  • Lexical distances
    • Levenshtein
    • Soundex
    • Trigrams
• Workflows:
  • BiOnym (new): user control
  • Emulation of Taxamatch
  • YASMEEN (new)
Developed in iMarine infrastructure
• iMarine (D4Science): e-infrastructure
• VREs: Virtual Research Environments exploiting data and tools in the infrastructure
• …
[Diagram labels: VREs; gCube Infrastructure]
… and outside: iMarine Data Bonanza
• Private Cloud
• Commercial Cloud
[Diagram labels: Standards; Policies; Guidelines; Procedures]
iMarine: Storage and Computing as Service TB Currently Used 330 CPU Cores Currently Allocated
Statistical Manager: Resources and Sharing
BiOnym: Outline
• Components
  • Taxonomic Authority Files
  • Matchers
  • Pre- and post-processing: parsers, synonym expansion, taxon resolution, performance statistics
• Development frameworks
  • For Matchers
  • For Workflows (= sequence of Matchers)
• Experiments
• Results
• Conclusions
Available Taxonomic Authority Files
• CoL: Catalogue of Life
• NCBI: National Center for Biotechnology Information
• IRMNG: Interim Register of Marine and Non-marine Genera
• ITIS: Integrated Taxonomic Information System
• WoRMS: World Register of Marine Species
• ASFIS: List of Species for Fishery Statistics Purposes; for commercial aquatic species
• FishBase (+ info from CofF: Catalog of Fishes): for finfishes
Pre-Processing: name format standard
• Split names into atomic components (genus, species, authority, author, year) if necessary (Dima Mozzherin’s parser)
• Align variations in complementary words: var./v., aff., conf./cf., comma in authority, etc.
• Customize character/string substitutions
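The alignment of complementary words can be sketched as below. This is a minimal illustration and not the BiOnym pre-processor: the substitution table, the choice of "var." and "cf." as the canonical forms, and the function name `normalize_name` are all assumptions made for the example.

```python
import re

# Hypothetical substitution table (variant pattern -> canonical form).
# Which spelling counts as canonical is an assumption of this sketch.
QUALIFIER_VARIANTS = {
    r"\bv\.": "var.",
    r"\bconf\.": "cf.",
}


def normalize_name(raw: str) -> str:
    """Align spelling variations so that later matchers compare like with like."""
    # Collapse runs of whitespace and trim the ends.
    name = " ".join(raw.split())
    # Replace each qualifier variant with its canonical form.
    for pattern, canonical in QUALIFIER_VARIANTS.items():
        name = re.sub(pattern, canonical, name)
    # Drop the comma between author and a four-digit year.
    name = re.sub(r",\s*(\d{4})", r" \1", name)
    return name


cleaned = normalize_name("Gadus  morhua v. callarias Linnaeus, 1758")
```

Keeping the substitutions in a table mirrors the "customize character/string substitutions" point: a user could extend the table without touching the function.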
Matchers: principle
• Input:
  • Standard formatted file of names
  • Customized parameters (e.g., thresholds for distances)
  • Character substitutions
    • E.g. dropping gender suffix
    • E.g. fuzzy matching of Tony Rees
• A unique algorithm (e.g., one lexical distance):
  • Using the customized parameters
• Output: a set of names with matching rate
  • One subset being considered as matched
  • One subset considered as non-matching
• The output of a matcher can be used as the input of another one
Matchers: the built-in matchers
• Lexical (Levenshtein) dist.: the minimum number of single-character edits (insertion, deletion, substitution) required to change one word into the other
• Soundex-like dist.: an algorithm relying on an encoding of how phonemes are pronounced in English. Our variant does not compress phonetic information
• Trigrams / N-grams dist.: a similarity measure between sequences of letter triplets (a trigram representation) extracted from the input strings
• One domain-knowledge based matcher (GSAy) … to be applied first in the context of Systematics
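Two of the built-in distances can be illustrated with a short sketch. The Levenshtein function follows the standard dynamic-programming definition given above. The trigram similarity is computed here as a Jaccard ratio over padded trigram sets, which is an assumption of this example: the slides do not say which trigram formula BiOnym uses.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def trigrams(s: str) -> set:
    """Set of letter triplets, padded so that word boundaries also count."""
    padded = f"  {s.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all trigrams (Jaccard ratio, assumed here)."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


edits = levenshtein("Gadis morua", "Gadus morhua")
similarity = trigram_similarity("Gadis morua", "Gadus morhua")
```

For the misspelled input used earlier in the talk, the edit distance is 2 (one substitution, one insertion), while the trigram similarity falls strictly between 0 and 1.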
Matchers: GSAy process (1)
• Complete match: GSAy
• Parentheses issue: GSAY
• Gender agreement issues: GSrAy
• Gender agreement and parentheses: GSrAY
• Year issues: GSA
• Year and gender agreement issues: GSrA
Matchers: GSAy process (2)
• Author issues, misspelling or wrong: GSY, GSrY
• Author and year issues, homonyms: GS, GSr
• Genus issues, other combinations: SAy, SrAY
Matchers: GSAy process (3)
• Genus issues, other combinations: SrAy, SrAY
• Species misspellings … but also … species described in same genus by same author in same paper: GAy, GAY
[Diagram labels: Matched names; Non-matching names; other matchers]
Matcher: GSAy examples
Workflow development framework
• Builds flexible workflows for name matching
• A Java framework based on the gCube system (http://www.gcube-system.org/)
• Allows exploiting Cloud Computing facilities
• Presents Java interfaces for building string pre-processing, parsing and post-processing steps
• Allows defining character substitutions
• Allows adding new Matchers as plug-ins
Workflow development framework
• A series of operators acting as switches:
  • First apply ‘transformation’ (e.g. character substitution)
  • Then calculate distance between all possible pairs of names
• Each switch decides whether a pair of names should be considered as ‘matches’, and splits the input list into:
  • ‘matched’ names
  • ‘non-matching’ names
• Parameters in each switch are customizable
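A single switch, as described above, might look like the following sketch. The framework itself is Java on gCube; this Python version only illustrates the idea. The class name `Switch`, its fields, and the use of `difflib.SequenceMatcher` as a stand-in similarity are assumptions of the example, not the framework's API.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, Dict, List, Tuple


def ratio(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; any lexical distance could be plugged in."""
    return SequenceMatcher(None, a, b).ratio()


@dataclass
class Switch:
    transform: Callable[[str], str]          # e.g. a character substitution
    similarity: Callable[[str, str], float]  # score for one pair of names
    threshold: float                         # customizable decision parameter

    def apply(self, names: List[str],
              reference: List[str]) -> Tuple[Dict[str, str], List[str]]:
        """Split `names` into matched (name -> reference) and non-matching."""
        # Assumes a non-empty reference list.
        transformed_refs = [(self.transform(r), r) for r in reference]
        matched: Dict[str, str] = {}
        non_matching: List[str] = []
        for name in names:
            t = self.transform(name)
            # Score every (name, reference) pair and keep the best one.
            score, best = max((self.similarity(t, tr), r)
                              for tr, r in transformed_refs)
            if score >= self.threshold:
                matched[name] = best
            else:
                non_matching.append(name)
        return matched, non_matching


switch = Switch(transform=str.lower, similarity=ratio, threshold=0.9)
matched, non_matching = switch.apply(
    ["Gadus morua", "Thunnus albacares"],
    ["Gadus morhua", "Salmo salar"],
)
```

With these hypothetical inputs the misspelled "Gadus morua" clears the 0.9 threshold against "Gadus morhua", and "Thunnus albacares" is returned as non-matching, ready to be handed to a less restrictive switch.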
Matcher framework: YASMEEN (FAO)
• Yet Another Species Matching Execution ENvironment
• Based on COMET: COncept Matching Engine and Tools
• YASMEEN: a set of data models, formats and tools to perform species matching identification
• Multiple matchlets, each dealing with a specific attribute of the species data model (genus, species, author etc.)
• New matchlets can be designed (just a few lines of code) and plugged in
• Reference data in DwC-A format
• Full support for distributed computation (split IN & REF data / join results)
Workflow builder: YASMEEN (FAO)
• Used as a matcher in BiOnym workflow
• But can work as a standalone specific workflow
• When used as standalone:
  • Lexical matchlets' scores can be computed with a combination of different strategies (Levenshtein distance, Soundex similarity, N-grams similarity)
  • Overall matching score for an input / reference data pair is a weighted combination of the triggered matchlets' scores
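The weighted combination of matchlet scores can be sketched as follows. The slides do not give YASMEEN's formula, so a normalised weighted mean over the triggered matchlets is assumed here. The weights echo those quoted on the YASMEEN results slide, while the individual scores and the function name `overall_score` are invented for illustration.

```python
from typing import Dict, Optional

# Weights as quoted on the YASMEEN results slide; the use made of them
# below (a normalised weighted mean) is an assumption of this sketch.
WEIGHTS = {"genus": 75, "species": 100, "author": 50, "year": 25}


def overall_score(scores: Dict[str, Optional[float]],
                  weights: Dict[str, float]) -> float:
    """Weighted mean of the scores of the matchlets that were triggered.

    A matchlet counts as triggered when it returned a score (not None),
    for example because both names carry the attribute it compares.
    """
    triggered = {k: s for k, s in scores.items() if s is not None}
    total_weight = sum(weights[k] for k in triggered)
    if total_weight == 0:
        return 0.0
    return sum(weights[k] * s for k, s in triggered.items()) / total_weight


# Hypothetical per-attribute scores for one input / reference pair.
score = overall_score(
    {"genus": 0.5, "species": 1.0, "author": None, "year": None},
    WEIGHTS,
)
```

Dividing by the weight of the triggered matchlets only means that a name without author or year is not penalised for the missing attributes.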
Workflow: BiOnym
• Workflow consists of three parts
  • Preprocessing (including possibly parsing)
  • Chain of matchers; output from one is input to the next
  • Postprocessing
    • E.g. present ‘ambiguous’ matches to end user
    • E.g. calculate performance statistics
• Chain of matchers
  • Most restrictive first
  • Those based on domain knowledge first
  • Test names matched in one step are not passed on to next matcher
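The three-part workflow can be sketched as a small chain in which names matched at one step are withheld from the next. This is an illustration under stated assumptions, not the BiOnym implementation: the two matchers, the whitespace-only pre-processing and the result format are all hypothetical, and `difflib.get_close_matches` stands in for a real fuzzy matcher.

```python
from difflib import get_close_matches
from typing import Callable, Dict, List, Optional

Matcher = Callable[[str, List[str]], Optional[str]]


def exact_matcher(name: str, reference: List[str]) -> Optional[str]:
    """Most restrictive step: accept identical strings only."""
    return name if name in reference else None


def fuzzy_matcher(name: str, reference: List[str]) -> Optional[str]:
    """Less restrictive step: accept the closest reference above a cutoff."""
    close = get_close_matches(name, reference, n=1, cutoff=0.8)
    return close[0] if close else None


def run_workflow(raw_names: List[str], reference: List[str],
                 matchers: List[Matcher]) -> Dict[str, dict]:
    # 1. Pre-processing (here: whitespace clean-up only).
    pending = [" ".join(n.split()) for n in raw_names]
    results: Dict[str, dict] = {}
    # 2. Chain of matchers, most restrictive first.
    for step, matcher in enumerate(matchers, start=1):
        still_pending = []
        for name in pending:
            hit = matcher(name, reference)
            if hit is None:
                still_pending.append(name)  # passed on to the next matcher
            else:
                results[name] = {"match": hit, "step": step}
        pending = still_pending
    # 3. Post-processing: report what no matcher could resolve.
    for name in pending:
        results[name] = {"match": None, "step": None}
    return results


results = run_workflow(
    ["Gadus  morhua", "Gadus morua", "Xyz abc"],
    ["Gadus morhua", "Salmo salar"],
    [exact_matcher, fuzzy_matcher],
)
```

Recording the step at which each name was matched gives the post-processing stage what it needs to flag lower-confidence matches for the end user.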
FAO use case
Workflow: Emulation of Tony Rees’ Taxamatch
• Normalization of the species name into its root, disregarding gender issues in the taxon name
• Modified Damerau-Levenshtein Distance algorithm (MDLD): the number of character replacements, deletions or insertions (Damerau variants also count transpositions) needed to make the two strings the same
• Phonetic algorithm (e.g. Soundex)
• Authority matching: detects similarity in substrings
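The first two steps can be approximated with the sketch below. It is a simplified stand-in and not Rees' algorithm: the gender-ending list is a hypothetical three-item table, and the distance is the plain "optimal string alignment" form of Damerau-Levenshtein rather than the modified version (MDLD) used by Taxamatch.

```python
# Hypothetical, deliberately short list of Latin gender endings.
GENDER_ENDINGS = ("us", "um", "a")


def epithet_root(epithet: str) -> str:
    """Strip a gender ending so that albus / alba / album share one root."""
    word = epithet.lower()
    for ending in GENDER_ENDINGS:
        if word.endswith(ending) and len(word) - len(ending) >= 2:
            return word[: -len(ending)]
    return word


def osa_distance(a: str, b: str) -> int:
    """Edit distance counting substitution, insertion, deletion and
    transposition of two adjacent characters (optimal string alignment)."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]


same_root = epithet_root("albus") == epithet_root("alba")
swapped = osa_distance("typus", "tyups")
```

Counting a swap of neighbouring letters as one edit rather than two is what distinguishes this family of distances from plain Levenshtein, and it suits typing errors in names.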
Post-processing
• Modalities governing how the results of the matching process are used/presented to the end user
• Will depend on the needs of the end user
• Examples:
  • Synonymy expansion of queries in a biogeographical system
  • Reconciliation of checklists from different sources, for same area and taxon
  • Presenting end user with ambiguous matches
Experiments: an R implementation
• Experimental system implemented in R and PostgreSQL
  • R is a thin wrapper around PostgreSQL statements
  • SQL used for the heavy lifting
    • Makes use of trigram indexes, for example
• Tool for communication and prototyping
• Developing tools to analyse performance
  • Generate confusion matrix …
  • For identical test sets, different workflows
  • Quantitatively compare sets of options and/or matchers
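The performance analysis can be sketched as a tally of true hits, false hits and non hits against a test set with known answers. The experimental system itself is written in R and PostgreSQL; Python is used here only for consistency with the other sketches, and the function name, the data layout and the example names are assumptions.

```python
from typing import Dict, Optional


def hit_counts(results: Dict[str, Optional[str]],
               expected: Dict[str, str]) -> Dict[str, int]:
    """Tally one workflow's output against a test set with known answers."""
    counts = {"true_hits": 0, "false_hits": 0, "non_hits": 0}
    for name, correct in expected.items():
        found = results.get(name)
        if found is None:
            counts["non_hits"] += 1    # nothing returned for this name
        elif found == correct:
            counts["true_hits"] += 1   # returned the right reference name
        else:
            counts["false_hits"] += 1  # returned a wrong reference name
    return counts


# Hypothetical test set and outputs of two hypothetical workflows.
expected = {"Gadus morua": "Gadus morhua",
            "Salmo salr": "Salmo salar",
            "Thunus alalunga": "Thunnus alalunga"}
strict = {"Gadus morua": "Gadus morhua",
          "Salmo salr": None,
          "Thunus alalunga": None}
lenient = {"Gadus morua": "Gadus morhua",
           "Salmo salr": "Salmo salar",
           "Thunus alalunga": "Thunnus albacares"}

comparison = {label: hit_counts(output, expected)
              for label, output in (("strict", strict), ("lenient", lenient))}
```

Running identical test sets through different workflows and comparing the tallies makes the trade-off visible: the lenient settings recover more names at the cost of a false hit.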
Effectiveness
[Figure: example graph comparing performance of different settings (generated with R); series: false hits, true hits, non hits]
Results of experiments (YASMEEN)
• Genus / species misspellings:
  IN: Lacheepenseer Perthseecous → REF: Acipenser persicus (Borodin, 1897)
  Scientific name matchlet, using Levenshtein similarity → 61.5%
• No separation between genus / species, relevant misspelling:
  IN: acipnesreppeerseekoos → REF: Acipenser persicus (Borodin, 1897)
  Scientific name matchlet, using Levenshtein similarity → 47.6%
• Inverted genus / species:
  IN: Platorhynchus Scaphirhincus → REF: Scaphirhincus platorhynchus (Rafinesque, 1820)
  Scientific name matchlet, using n-grams similarity → Score: 100.0%
• Relevant misspellings (resolved with support from authorities data):
  IN: Casphinhi Platynchurs (Rafinesk, 1820) → REF: Scaphirhynchus platorynchus (Rafinesque, 1820)
  Genus matchlet (wgt: 75), Species matchlet (wgt: 100), Author name matchlet (wgt: 50), Year matchlet (wgt: 25), using Levenshtein similarity (wgt: 100) and Soundex similarity (wgt: 50) → 58.2%
Experiments: a simple interface
Efficiency
• Application of the first version of the BiOnym workflow (1000 species names)
• First run: only one Worker node (~ 1 CPU)
• Second run: Cloud Computing facilities assigned by the iMarine e-Infrastructure (computation distributed over 19 Worker nodes)
• Result: time reduction of 76.7%
• This means that the workflow can also be used in interactive systems (no need for batch processing)
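The reported time reduction can be restated as a speed-up factor with a little arithmetic. Only the 76.7% reduction and the 19 nodes come from the slide; the function name and the per-node efficiency figure are additions made here for illustration.

```python
def speedup_from_reduction(reduction: float) -> float:
    """Speed-up factor implied by a fractional reduction in run time.

    If the distributed run takes (1 - reduction) of the single-node time,
    the speed-up is the inverse of that remaining fraction.
    """
    return 1.0 / (1.0 - reduction)


TIME_REDUCTION = 0.767  # reported on the slide
WORKER_NODES = 19       # reported on the slide

speedup = speedup_from_reduction(TIME_REDUCTION)
# Share of the ideal (linear) speed-up actually obtained per node.
parallel_efficiency = speedup / WORKER_NODES
```

A 76.7% reduction corresponds to a speed-up of about 4.3 times, or roughly 23% of the ideal linear speed-up over 19 nodes, which suggests that part of the workflow or its overhead does not parallelise.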
Results of experiments: Matchers
• Search term:
  • Rhincodon typu Linneaus, 1758
• Output using GSAy:
  • Rhincodon typus Smith, 1828 → Score is 73%
• Output using Taxamatch:
  • Rhineodon typus Smith, 1828
  • Rhiniodon typus (Smith, 1828)
  • Rhinodon typicus Müller & Henle, 1839
  • Rhinodon typicus Smith, 1845
Conclusions: Results
• Workflows
  • Building of pre-set (default) workflows
  • On-the-fly setting
• Integrating taxonomic/nomenclatural knowledge (GSAy)
• Making the best of previous matchers (Taxamatch and its various subsequent implementations) and other technologies (uBio/GNA/GBIF parser)
• Effectiveness and efficiency increased in the iMarine e-infrastructure
Conclusions
• Plans: interface (Nov.), tests (Dec.), open (Jan.)
• Other Taxonomic Authority Files
  • FADA (BioFresh) / PESI / …
• Name reconciliation
• Beyond scientific names
  • Common names / Vessels / …
• Integration of new matchers as matching methods are developed
Future added value?
• Storage of knowledge
• Make available the matches between raw and published names (and current valid names)
• Self-learning system
• Build a community of practice (CoP), not alone … GNA, BioVeL
• Collaborative development
Special Thanks
• Tony Rees (CSIRO)
• Dmitry (Dima) Mozzherin (GNA project)
Authors
• Edward Vanden Berghe, Vrije Universiteit Brussel (VUB), Brussels, Belgium
• Nicolas Bailly, WorldFish, and FishBase Information and Research Group (FIN), Los Baños, Philippines
• Caselyn Aldemita, FishBase Information and Research Group (FIN), Los Baños, Philippines
• Fabio Fiorellato & Anton Ellenbroek, Fisheries Statistics and Information (FIPS), FAO, Rome, Italy
• Gianpaolo Coro & Pasquale Pagano, Istituto di Scienza e Tecnologie dell'Informazione A. Faedo (ISTI), CNR, Pisa, Italy