
An Evaluation of Taxonomic Name Recognition (TNR) in the Biodiversity Heritage Library (BHL)



Presentation Transcript


Qin Wei (1), Chris Freeland (2) and P. Bryan Heidorn
(1) Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign, qinwei2@uiuc.edu
(2) Missouri Botanical Garden, chris.freeland@mobot.org
http://www.biodiversitylibrary.org

[Figure 1: BHL digitization process and location of errors. The pipeline runs from Internet Archive page images through OCR and text mining against the uBio database to the BHL web portal; OCR errors, TNR errors, and authority file errors arise along the way.]

The BHL has incorporated TaxonFinder, a taxonomic name finding algorithm and service provided by uBio.org, into its portal for the identification and verification of taxonomic name strings found within the digitized BHL corpus. An eight-week evaluation was performed to determine the factors affecting the accuracy of the results returned. We explored and analyzed three factors influencing performance: 1) Optical Character Recognition (OCR), which transforms page images into text; 2) the TNR matching algorithms that identify taxonomic names within the text (an illustrative sketch of this matching step follows the transcript); and 3) the completeness of NameBank, which is used as an authority file for name verification.

Table 2: Performances
*TaxonFinder Error = 3003 - 1056 (OCR Error) - 92 (NameBank) - 621 (Correctly Found Names) = 1234 (a worked check of this decomposition follows the transcript).

Table 3: Top OCR errors

Table 4: Similarities between TaxonFinder and FAT
*"Same" is the number of names found by both TaxonFinder and FAT; "Same Is Valid Name" is the number of those shared names that were also identified by domain experts (biologists).

Table 5: Performances of TaxonFinder and FAT
*"Without_OCR_Error" excludes from the evaluation the names that were not correctly recognized by OCR; "With_OCR_Error" includes all names, whether correctly or incorrectly recognized by OCR.
*The different numbers of names identified by biologists reflect the different mechanisms of TaxonFinder and FAT: TaxonFinder removes duplicate names within a page, while FAT does not. To compare the algorithms fairly, we apply the same mechanism when evaluating each.
*TaxonFinder was developed by uBio; FAT (short for Finds All Taxonomic names) was developed by Sautter et al.

Table 6: NameBank Evaluation for TaxonFinder

Table 7: Overall NameBank Evaluation

The performance of the whole text mining system is evaluated with two standard information retrieval measures, Precision (P) and Recall (R). Precision is the proportion of matched strings that are valid names; in our case it reflects the algorithm's ability to exclude non-valid names from its results. Recall is the proportion of valid names in the whole database that were returned as true positives; it reflects the ability to find all valid names in the database. To express the tradeoff between Precision and Recall we report a single measure, the F-score, their harmonic mean (a small computational sketch follows the transcript):

F-score = 2 * (Precision * Recall) / (Precision + Recall)

For additional information about the evaluation, including the datasets used, visit: http://bhlnameevaluation.wikispaces.com/
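
The TNR step in Figure 1 matches candidate name strings from OCR text against an authority file such as uBio's NameBank. The sketch below is only a minimal illustration of that dictionary-matching idea; it is not uBio's TaxonFinder algorithm or API, and the AUTHORITY_NAMES set and find_taxon_names function are hypothetical stand-ins.

```python
import re

# Hypothetical, tiny stand-in for an authority file such as uBio NameBank.
# A real authority file contains millions of name strings.
AUTHORITY_NAMES = {
    "Quercus alba",
    "Quercus rubra",
    "Rosa canina",
}

# Candidate pattern for Latin binomials: a capitalized genus followed by a
# lowercase epithet. Real TNR tools use far richer rules (abbreviated genera,
# authorship strings, fuzzy matching of OCR-damaged names, and so on).
BINOMIAL = re.compile(r"\b([A-Z][a-z]+ [a-z]{3,})\b")

def find_taxon_names(page_text: str) -> list[str]:
    """Return candidate binomials from OCR text that match the authority file."""
    candidates = BINOMIAL.findall(page_text)
    return [name for name in candidates if name in AUTHORITY_NAMES]

if __name__ == "__main__":
    ocr_text = "Specimens of Quercus alba and Rosa canina were collected near the river."
    print(find_taxon_names(ocr_text))  # ['Quercus alba', 'Rosa canina']
```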
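
The note under Table 2 attributes the names missed by the pipeline to three sources: OCR error, gaps in NameBank, and the TaxonFinder algorithm itself. A quick arithmetic check of that decomposition, using only the counts quoted in the note:

```python
# Counts quoted in the note under Table 2.
names_identified_by_biologists = 3003
ocr_errors = 1056          # names not correctly recognized by OCR
namebank_gaps = 92         # names absent from the NameBank authority file
correctly_found = 621      # names TaxonFinder returned correctly

# The remaining misses are attributed to the TaxonFinder algorithm itself.
taxonfinder_errors = (names_identified_by_biologists
                      - ocr_errors - namebank_gaps - correctly_found)
print(taxonfinder_errors)  # 1234
```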
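
The evaluation measures are standard precision, recall, and F-score, with the F-score formula exactly as given above. A minimal sketch; the counts in the usage example are hypothetical placeholders, not results from the evaluation tables:

```python
def precision(true_positives: int, returned: int) -> float:
    """Proportion of returned name strings that are valid names."""
    return true_positives / returned

def recall(true_positives: int, valid_in_corpus: int) -> float:
    """Proportion of valid names in the corpus returned as true positives."""
    return true_positives / valid_in_corpus

def f_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall: 2PR / (P + R)."""
    return 2 * p * r / (p + r)

# Hypothetical counts for illustration only; substitute the counts
# reported in the evaluation tables.
p = precision(true_positives=90, returned=100)
r = recall(true_positives=90, valid_in_corpus=120)
print(round(f_score(p, r), 3))  # 0.818
```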
