Detecting spelling errors in taxonomic databases

Detecting spelling errors in taxonomic databases Eduardo Dalcin Instituto Nacional de Pesquisas da Amazônia eduardo@dalcin.org Richard White, Alex Gray Computer Science, Cardiff University Eduardo Dalcin - TDWG - St Petersburg

Outline • Experiences with techniques for detecting and correcting spelling errors in scientific names • Dictionaries of complete scientific names are often unavailable • Algorithms for detecting errors without using vocabularies • Results of some experiments • database error rates • performance of the algorithms • cost effectiveness Eduardo Dalcin - TDWG - St Petersburg

Introduction • Many problems can affect the quality of data in taxonomic databases, and some of the ways to resolve these issues are discussed in other talks at this meeting. • Species names play a key role in biodiversity information systems • They (or meaningless unique codes corresponding to them) are the most common values linking related information in different tables and databases • Because of their unfamiliarity to many of the contributors and editors of databases, they are particularly prone to errors. • The causes and frequencies of different types of errors in scientific names are usually unknown. Eduardo Dalcin - TDWG - St Petersburg

A problem in Biodiversity Informatics Species names are the key to biodiversity information – inaccurate names cause data loss: • In single databases where species records may not be found if the species name is wrong • In large biodiversity information systems assembled from several sources, where the names are (or have been) used to link the data sources together Eduardo Dalcin - TDWG - St Petersburg

A problem in Biodiversity Informatics • Thus the accuracy of species names may affect the reliability and usability of species information systems • Techniques to handle the problem semi-automatically can be developed • Litchi (later – this afternoon) • Error detection and correction Eduardo Dalcin - TDWG - St Petersburg

Previous research in error detection • Automatic spelling error detection and correction have been researched since the early 1960’s in relation to three main problems: • spelling checkers in word processor systems • retrieval of the correct records in databases • word recognition in OCR and “pen-based interface” devices Eduardo Dalcin - TDWG - St Petersburg

Types of error detection and correction • Use of vocabularies of scientific name components • akin to those used by conventional spelling checkers. However, suitable dictionaries of complete scientific names are frequently unavailable. • Without the use of vocabularies Eduardo Dalcin - TDWG - St Petersburg

Availability of dictionaries • Spelling error detection in taxonomic databases is straightforward when it involves scientific names of higher taxa such as family names, where reference dictionaries could be used • In contrast, species names, as a binomial combination of genus name and specific epithet, need a different approach because, in most practical situations, a reliable independent dictionary of species names in the same taxonomic domain is not available. Eduardo Dalcin - TDWG - St Petersburg

Objective • Detecting scientific name spelling errors in taxonomic databases aims to minimise “one of the more tedious tasks of taxonomists: to check and update long lists of species names” (Froese, 1997) Eduardo Dalcin - TDWG - St Petersburg

Experiments in error detection • The objective is to detect potential spelling errors in scientific names, using similarity algorithms in order to identify pairs of scientific names which have a high degree of similarity to each other but are not exactly the same. • The results of some experiments will be discussed and summarised, particularly in terms of database error rates and cost effectiveness Eduardo Dalcin - TDWG - St Petersburg

Databases used in tests • ILDIS: International Legume Database and Information Service (plant family Fabaceae) • MARINE: Marine Invertebrates of New Zealand (National Institute of Water & Atmosphere, Wellington ) • SP2K: Species 2000 (Annual Checklist 2002) • PMA: Atlantic Rain Forest Plant Database (Rio de Janeiro Botanic Garden ) • CNIP: North-east Brazil Plant Checklist Database (Federal University of Pernambuco) • A “control” database into which known errors had ben introduced Eduardo Dalcin - TDWG - St Petersburg

Preliminary processing • In an initial step, the presence of invalid characters was detected in all databases • These characters were removed in order to avoid compromising the execution of the spelling error algorithms Eduardo Dalcin - TDWG - St Petersburg

Sample character errors Eduardo Dalcin - TDWG - St Petersburg

Algorithms 1: phonetic similarity • Phonetic matching is used to identify strings that may be of similar pronunciation, regardless of their actual spelling • Phonetic similarity algorithms encode strings of letters (names) according to their spoken sounds. Names that share the same phonetic codes are assumed to be similar and may include erroneous names. Eduardo Dalcin - TDWG - St Petersburg

Phonetic similarity algorithms • The Soundex algorithm (originally developed to search for names in the 1890 USA census files) • For example, reynold and renauld are both reduced to r543 • The Phonix algorithm (developed to be used in a library package for retriieval from bibliographical databases) • The Phonix algorithm is a Soundex variant. Letters are mapped to codes using a similar algorithm with a slightly different set of codes, but before this mapping about 160 letter-group transformations are used to standardise the string and resulting codes Eduardo Dalcin - TDWG - St Petersburg

Algorithms 2: string similarity String similarity algorithms compute a numerical estimate of the similarity between two string • N-grams (Bi-grams and tri-grams) • Levenshtein distance • Skeleton-Key Eduardo Dalcin - TDWG - St Petersburg

N-gram algorithm • N-gramsare n-letter subsequences of words or strings, where n is usually one, two or three letters • If n is greater than 1, n-grams may overlap, in other words a letter may appear in more than one n-gram. • The n-gramalgorithm counts the number of n-grams which the two strings have in common, and assigns a similarity coefficient. • Bi-grams and tri-grams were investigated Eduardo Dalcin - TDWG - St Petersburg

Levenshtein algorithm • The Levenshtein distance is the number of single-character deletions, insertions, or substitutions required to transform one string into the other. Eduardo Dalcin - TDWG - St Petersburg

Skeleton-Key algorithm • Assumes that consonants carry more information than vowels, but is not phonetic • The Skeleton-key of a word consists of its first letter, followed by the remaining consonants and finally the remaining vowels, both in order of appearance. Duplicated letters are eliminated Eduardo Dalcin - TDWG - St Petersburg

Test environment • A series of software tools were developed in Object Pascal for the Delphi compiler • The first set of tools were used to test the algoorithm implementation Eduardo Dalcin - TDWG - St Petersburg

Error classification tool • Single errors (Hit) • extra or one missing letter • one wrong letter • transposition of two adjacent letters • Multiple errors (Hit) • more than one “single error” (Vicia faba : Viica fabba) • Unrelated (false alarm) • different species epithet • different genus • Doubtful Eduardo Dalcin - TDWG - St Petersburg

Eduardo Dalcin - TDWG - St Petersburg

Number of comparisons made Eduardo Dalcin - TDWG - St Petersburg

Error detection and correction algorithms • Algorithms based on character encoding, such as Soundex, Phonix, and Skeleton Key, are faster because the codes can be generated in advance for each name • Bigram, Trigram and Levenshtein Distance algorithms depend on comparison of the original strings, and thus showed the worst performance Eduardo Dalcin - TDWG - St Petersburg

Large number of “false alarms” Eduardo Dalcin - TDWG - St Petersburg

“False alarms” in ILDIS • The large number of false alarms in the ILDIS database can be explained by the large number of scientific names which belong to each genus, because it is a database with a restricted taxonomic domain (one family), increasing the average similarity of names Eduardo Dalcin - TDWG - St Petersburg

The Marine Invertebrates of New Zealand Database showed a particularly large percentage of wrong names, explained by the process of construction of the database, from free text with no data quality control at data entry Percentage of errors by database Eduardo Dalcin - TDWG - St Petersburg

Percentage of errors • Despite the building techniques and database administration efforts and tools, all five databases tested show spelling errors • However, databases built with no data entry control, such as the Marine Invertebrates database, show the highest number of spelling errors Eduardo Dalcin - TDWG - St Petersburg

Types of error Eduardo Dalcin - TDWG - St Petersburg

Genuine errors v. false alarms Eduardo Dalcin - TDWG - St Petersburg

No need for a dictionary • The approach used a variety of detection techniques, using the database or names list itself as a data dictionary; and correction techniques, using similarity algorithms to detect scientific names with a degree of similarity which suggest a spelling error • Lack of dependence on a separate dictionary makes for an independent process, which can be executed by stand-alone tools without dependence on external dictionaries, which may not exist in many groups Eduardo Dalcin - TDWG - St Petersburg

Referring hits to a dictionary • However, a test program which used the IPNI database as a reference was able to classify 92% of “hits” automatically as false alarms, thus greatly reducing the human input required Eduardo Dalcin - TDWG - St Petersburg

But a list of genera is useful • Large numbers of “false alarms” associated with algorithms that find more genuine errors: • Need for auxiliary tools to minimize the requirement for human interaction in order to identify the genuine errors • 14% of false alarms were generated by differences in the generic name, so the use of a reliable genus dictionary could reduce the number of “false alarms”, and enhance the practical efficiency of these tools. Eduardo Dalcin - TDWG - St Petersburg

Eduardo Dalcin - TDWG - St Petersburg

Error limitation policies • Low error rate in the genus name for the first three databases (CNIP, PMA and ILDIS) because those database were populated using a specialised taxonomic database management system (Alice) which offers a genus dictionary from which editors may choose to select existing genus names • The low rate for genus errors in the Species 2000 database reflects the adoption of data entry control polices and data control during the compilation process Eduardo Dalcin - TDWG - St Petersburg

Future work • Improved algorithms for detecting close matches, including Soundex and Phonix modified to be more appropriate for scientific names based on Latin and Greek • Incorporation of dictionaries, where available • Extension to authority strings in scientific names • Extension to other data types, such as the names of geographical areas, etc. • Integration with other tools such as Litchi (later talk – this afternoon) to help bioinformaticians deal with nomenclature Eduardo Dalcin - TDWG - St Petersburg

References • Dalcin E.C. (2004) PhD thesis (Southampton University, unpublished) • MSS in preparation • Reports by Arthur Chapman on GBIF web site Acknowledgements • Plantas do Nordeste (Brazil – Kew collaboration with funding from Brazilian and UK governments) • Arthur Chapman, Andrew Jones Eduardo Dalcin - TDWG - St Petersburg

Detecting spelling errors in taxonomic databases

Detecting spelling errors in taxonomic databases

Presentation Transcript

Taxonomic Grouping

taxonomic classification

Taxonomic Classifications

Common Spelling Errors

Taxonomic ontologies: Bridging phylogenetic and taxonomic history

Taxonomic Groups

Ontology Databases: Detecting Inconsistencies in the Gene Ontology using Not-gadgets

Biodiversity Information Standards (TDWG: Taxonomic Databases Working Group)

The Comedy of Errors: Assessing Citation Help in Databases

Methods for detecting errors in VAT Turnover data

Taxonomic Analysis

Taxonomic databases: The SEEK and VegBank experience

Taxonomic

Automatic Technology for Detecting Fatal SW Errors Before Testing

Detecting Multi-cycle Errors using Invariance Information

Taxonomic Classification:

Spelling in KS2

Detection of Spelling Errors in Swedish Clinical Text

Detecting Memory Access Errors via Illegal Write Monitoring

Taxonomic Systems

Probabilistic Detection of Context-Sensitive Spelling Errors

Common Spelling Errors From Online Essay Homework Helpers