380 likes | 495 Views
Detecting spelling errors in taxonomic databases. Eduardo Dalcin Instituto Nacional de Pesquisas da Amazônia eduardo@dalcin.org Richard White , Alex Gray Computer Science, Cardiff University. Outline.
E N D
Detecting spelling errors in taxonomic databases Eduardo Dalcin Instituto Nacional de Pesquisas da Amazônia eduardo@dalcin.org Richard White, Alex Gray Computer Science, Cardiff University Eduardo Dalcin - TDWG - St Petersburg
Outline • Experiences with techniques for detecting and correcting spelling errors in scientific names • Dictionaries of complete scientific names are often unavailable • Algorithms for detecting errors without using vocabularies • Results of some experiments • database error rates • performance of the algorithms • cost effectiveness Eduardo Dalcin - TDWG - St Petersburg
Introduction • Many problems can affect the quality of data in taxonomic databases, and some of the ways to resolve these issues are discussed in other talks at this meeting. • Species names play a key role in biodiversity information systems • They (or meaningless unique codes corresponding to them) are the most common values linking related information in different tables and databases • Because of their unfamiliarity to many of the contributors and editors of databases, they are particularly prone to errors. • The causes and frequencies of different types of errors in scientific names are usually unknown. Eduardo Dalcin - TDWG - St Petersburg
A problem in Biodiversity Informatics Species names are the key to biodiversity information – inaccurate names cause data loss: • In single databases where species records may not be found if the species name is wrong • In large biodiversity information systems assembled from several sources, where the names are (or have been) used to link the data sources together Eduardo Dalcin - TDWG - St Petersburg
A problem in Biodiversity Informatics • Thus the accuracy of species names may affect the reliability and usability of species information systems • Techniques to handle the problem semi-automatically can be developed • Litchi (later – this afternoon) • Error detection and correction Eduardo Dalcin - TDWG - St Petersburg
Previous research in error detection • Automatic spelling error detection and correction have been researched since the early 1960’s in relation to three main problems: • spelling checkers in word processor systems • retrieval of the correct records in databases • word recognition in OCR and “pen-based interface” devices Eduardo Dalcin - TDWG - St Petersburg
Types of error detection and correction • Use of vocabularies of scientific name components • akin to those used by conventional spelling checkers. However, suitable dictionaries of complete scientific names are frequently unavailable. • Without the use of vocabularies Eduardo Dalcin - TDWG - St Petersburg
Availability of dictionaries • Spelling error detection in taxonomic databases is straightforward when it involves scientific names of higher taxa such as family names, where reference dictionaries could be used • In contrast, species names, as a binomial combination of genus name and specific epithet, need a different approach because, in most practical situations, a reliable independent dictionary of species names in the same taxonomic domain is not available. Eduardo Dalcin - TDWG - St Petersburg
Objective • Detecting scientific name spelling errors in taxonomic databases aims to minimise “one of the more tedious tasks of taxonomists: to check and update long lists of species names” (Froese, 1997) Eduardo Dalcin - TDWG - St Petersburg
Experiments in error detection • The objective is to detect potential spelling errors in scientific names, using similarity algorithms in order to identify pairs of scientific names which have a high degree of similarity to each other but are not exactly the same. • The results of some experiments will be discussed and summarised, particularly in terms of database error rates and cost effectiveness Eduardo Dalcin - TDWG - St Petersburg
Databases used in tests • ILDIS: International Legume Database and Information Service (plant family Fabaceae) • MARINE: Marine Invertebrates of New Zealand (National Institute of Water & Atmosphere, Wellington ) • SP2K: Species 2000 (Annual Checklist 2002) • PMA: Atlantic Rain Forest Plant Database (Rio de Janeiro Botanic Garden ) • CNIP: North-east Brazil Plant Checklist Database (Federal University of Pernambuco) • A “control” database into which known errors had ben introduced Eduardo Dalcin - TDWG - St Petersburg
Preliminary processing • In an initial step, the presence of invalid characters was detected in all databases • These characters were removed in order to avoid compromising the execution of the spelling error algorithms Eduardo Dalcin - TDWG - St Petersburg
Sample character errors Eduardo Dalcin - TDWG - St Petersburg
Algorithms 1: phonetic similarity • Phonetic matching is used to identify strings that may be of similar pronunciation, regardless of their actual spelling • Phonetic similarity algorithms encode strings of letters (names) according to their spoken sounds. Names that share the same phonetic codes are assumed to be similar and may include erroneous names. Eduardo Dalcin - TDWG - St Petersburg
Phonetic similarity algorithms • The Soundex algorithm (originally developed to search for names in the 1890 USA census files) • For example, reynold and renauld are both reduced to r543 • The Phonix algorithm (developed to be used in a library package for retriieval from bibliographical databases) • The Phonix algorithm is a Soundex variant. Letters are mapped to codes using a similar algorithm with a slightly different set of codes, but before this mapping about 160 letter-group transformations are used to standardise the string and resulting codes Eduardo Dalcin - TDWG - St Petersburg
Algorithms 2: string similarity String similarity algorithms compute a numerical estimate of the similarity between two string • N-grams (Bi-grams and tri-grams) • Levenshtein distance • Skeleton-Key Eduardo Dalcin - TDWG - St Petersburg
N-gram algorithm • N-gramsare n-letter subsequences of words or strings, where n is usually one, two or three letters • If n is greater than 1, n-grams may overlap, in other words a letter may appear in more than one n-gram. • The n-gramalgorithm counts the number of n-grams which the two strings have in common, and assigns a similarity coefficient. • Bi-grams and tri-grams were investigated Eduardo Dalcin - TDWG - St Petersburg
Levenshtein algorithm • The Levenshtein distance is the number of single-character deletions, insertions, or substitutions required to transform one string into the other. Eduardo Dalcin - TDWG - St Petersburg
Skeleton-Key algorithm • Assumes that consonants carry more information than vowels, but is not phonetic • The Skeleton-key of a word consists of its first letter, followed by the remaining consonants and finally the remaining vowels, both in order of appearance. Duplicated letters are eliminated Eduardo Dalcin - TDWG - St Petersburg
Test environment • A series of software tools were developed in Object Pascal for the Delphi compiler • The first set of tools were used to test the algoorithm implementation Eduardo Dalcin - TDWG - St Petersburg
Error classification tool • Single errors (Hit) • extra or one missing letter • one wrong letter • transposition of two adjacent letters • Multiple errors (Hit) • more than one “single error” (Vicia faba : Viica fabba) • Unrelated (false alarm) • different species epithet • different genus • Doubtful Eduardo Dalcin - TDWG - St Petersburg
Number of comparisons made Eduardo Dalcin - TDWG - St Petersburg
Error detection and correction algorithms • Algorithms based on character encoding, such as Soundex, Phonix, and Skeleton Key, are faster because the codes can be generated in advance for each name • Bigram, Trigram and Levenshtein Distance algorithms depend on comparison of the original strings, and thus showed the worst performance Eduardo Dalcin - TDWG - St Petersburg
Large number of “false alarms” Eduardo Dalcin - TDWG - St Petersburg
“False alarms” in ILDIS • The large number of false alarms in the ILDIS database can be explained by the large number of scientific names which belong to each genus, because it is a database with a restricted taxonomic domain (one family), increasing the average similarity of names Eduardo Dalcin - TDWG - St Petersburg
The Marine Invertebrates of New Zealand Database showed a particularly large percentage of wrong names, explained by the process of construction of the database, from free text with no data quality control at data entry Percentage of errors by database Eduardo Dalcin - TDWG - St Petersburg
Percentage of errors • Despite the building techniques and database administration efforts and tools, all five databases tested show spelling errors • However, databases built with no data entry control, such as the Marine Invertebrates database, show the highest number of spelling errors Eduardo Dalcin - TDWG - St Petersburg
Types of error Eduardo Dalcin - TDWG - St Petersburg
Genuine errors v. false alarms Eduardo Dalcin - TDWG - St Petersburg
No need for a dictionary • The approach used a variety of detection techniques, using the database or names list itself as a data dictionary; and correction techniques, using similarity algorithms to detect scientific names with a degree of similarity which suggest a spelling error • Lack of dependence on a separate dictionary makes for an independent process, which can be executed by stand-alone tools without dependence on external dictionaries, which may not exist in many groups Eduardo Dalcin - TDWG - St Petersburg
Referring hits to a dictionary • However, a test program which used the IPNI database as a reference was able to classify 92% of “hits” automatically as false alarms, thus greatly reducing the human input required Eduardo Dalcin - TDWG - St Petersburg
But a list of genera is useful • Large numbers of “false alarms” associated with algorithms that find more genuine errors: • Need for auxiliary tools to minimize the requirement for human interaction in order to identify the genuine errors • 14% of false alarms were generated by differences in the generic name, so the use of a reliable genus dictionary could reduce the number of “false alarms”, and enhance the practical efficiency of these tools. Eduardo Dalcin - TDWG - St Petersburg
Error limitation policies • Low error rate in the genus name for the first three databases (CNIP, PMA and ILDIS) because those database were populated using a specialised taxonomic database management system (Alice) which offers a genus dictionary from which editors may choose to select existing genus names • The low rate for genus errors in the Species 2000 database reflects the adoption of data entry control polices and data control during the compilation process Eduardo Dalcin - TDWG - St Petersburg
Future work • Improved algorithms for detecting close matches, including Soundex and Phonix modified to be more appropriate for scientific names based on Latin and Greek • Incorporation of dictionaries, where available • Extension to authority strings in scientific names • Extension to other data types, such as the names of geographical areas, etc. • Integration with other tools such as Litchi (later talk – this afternoon) to help bioinformaticians deal with nomenclature Eduardo Dalcin - TDWG - St Petersburg
References • Dalcin E.C. (2004) PhD thesis (Southampton University, unpublished) • MSS in preparation • Reports by Arthur Chapman on GBIF web site Acknowledgements • Plantas do Nordeste (Brazil – Kew collaboration with funding from Brazilian and UK governments) • Arthur Chapman, Andrew Jones Eduardo Dalcin - TDWG - St Petersburg