Automated Georeferencing of Natural History Museum Data
Nelson E. Rios

Design: Natural Language Processing

A locality description, along with country, state and county information, is input into GEOLocate. Georeferencing begins by standardizing the locality description string into a common-terms format. For example, distances mentioned in a locality string are converted to miles. Once standardized, the locality string is parsed into key geographic identifiers. Example identifiers used by GEOLocate include named places, navigable river miles, highway names, water body names, legal locations and displacement patterns. These identifiers are used to determine geographic coordinates through database lookups and geographic calculations. The resulting coordinates are ranked based on the type of information found within the string and plotted on the digital map display for user verification, correction and error determination.

Abstract

It is estimated that the number of biological specimens in US museums and herbaria exceeds 750 million. In the vast majority of instances the collection location is recorded as a string of text and typically lacks geographic coordinates. We have developed a tool that interprets the descriptive locality text associated with natural history collections data, determines geographic coordinates, and allows the user to verify and correct those coordinates. Traditional methods for georeferencing collection data from text descriptions are tedious and time consuming, typically involving finding the locality on either hardcopy or digital maps, plotting the locality and determining the coordinates. Our tool, GEOLocate, considerably reduces the time required to georeference locality information. It took one staff member approximately 1.5 years to georeference the 15,000 unique locality descriptions within the Tulane fish collection.
Time trials with GEOLocate suggest that this job could have been accomplished in under 6 months.

Discussion

Using GEOLocate can significantly reduce the time required to georeference natural history data. GEOLocate was able to assign coordinates to over 98% of the locality data tested. This initial assignment of coordinates should be considered only a "rough" pass at the data, and each record should be visually inspected and corrected as necessary. Locality records with incorrect or missing county information typically have greater error in the resulting coordinates, because the search area is larger when county information is absent. Depending on the quality of the original locality data, georeferencing results can be improved by first checking the locality dataset for misspelled, missing, incorrect and/or ambiguous information.

Application: Test Bed Results

11,521 unique locality descriptions containing geographic coordinates were extracted from the TUMNH database and imported into GEOLocate. Of these, 11,295 records were auto-assigned coordinates by GEOLocate within a 3-hour period. 36% of the georeferenced records were within 1 mile of the original coordinates, and 83% were within 15 miles of the original, permitting easy verification and correction on the map display. In time trials, GEOLocate users averaged 45-60 seconds to georeference, verify and correct a locality record.

Acknowledgements

I would like to thank the following for reviewing early versions of GEOLocate: James S. Albert, Jonathan Armbruster, Jeremy Bartley, Andy Bentley, Stephanie Coste, Paul David, Bud Freeman, John Friel, Tom Giermakowski, Robert Glaubitz, Sara J. Gottlieb, Brendan Haley, Chad Hargrave, Dean Hendrickson, Mikaela Howie, Denny Hugg, Janeen Jones, Edie Marsh, Kris McNyset, Jonathon Rothman, Barbara Scudder, Steph Smith and John Wieczorek. This research was supported by a grant from the National Science Foundation (DBI-0131053).
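The test-bed comparison reported above (the share of auto-assigned coordinates falling within 1 mile or 15 miles of the original coordinates) can be illustrated with a short sketch. This is not GEOLocate's own code: the function names `haversine_miles` and `share_within` are hypothetical, and the great-circle (haversine) distance is an assumed choice of metric, since the poster does not say how distances were measured.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; an assumed constant


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to guard against tiny floating-point overshoot above 1.
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(min(1.0, a)))


def share_within(pairs, threshold_miles):
    """Fraction of (original, computed) coordinate pairs within the threshold.

    `pairs` is a list of ((lat, lon), (lat, lon)) tuples.
    """
    if not pairs:
        return 0.0
    hits = sum(
        1 for original, computed in pairs
        if haversine_miles(*original, *computed) <= threshold_miles
    )
    return hits / len(pairs)
```

Calling `share_within(pairs, 1)` and `share_within(pairs, 15)` on a list of original and auto-assigned coordinates would yield the two kinds of percentages quoted in the results.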
Introduction

The Tulane University Fish Collection, with 7.1 million fluid-preserved specimens in over 190,000 lots, collected from over 15,000 locations worldwide, is one of the largest collections in the world and is recognized as a "National Center of Ichthyology Resource Collection". During the early 1990s, the entire collection was computerized and georeferenced. Georeferencing the collection took nearly 2 years and required labor-intensive lookups in both paper and digital maps. This experience, along with the resulting dataset of georeferenced information, became a test bed for the development of GEOLocate, an automated georeferencing system for natural history information. GEOLocate is a software tool that enables researchers to easily assign geographic coordinates to a descriptive string of locality information, visualize the location, and make corrections as necessary.

Features

• Drag and drop coordinate correction
• Option to 'snap' to found waterbody (U.S. only)
• Bridge crossing detection (U.S. only)
• Batch georeferencing
• File input via .xml, .csv or delimited .txt
• Polygon error determination
• Multiple coordinate determination
• Supports entire United States, Mexico and Canada
• Street level mapping for United States
• Overview plotting of input datasets
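The steps described in the Design section (standardize distances to miles, parse a displacement pattern, look up the named place, then calculate coordinates) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not GEOLocate's implementation: every function name, the regular expressions, the flat-earth offset approximation and the one-entry `gazetteer` dictionary with approximate coordinates are hypothetical stand-ins for the tool's actual parser and database lookups.

```python
import math
import re

KM_TO_MILES = 0.621371
MILES_PER_DEGREE_LAT = 69.0  # rough figure, adequate for a sketch
BEARINGS = {"N": 0, "NE": 45, "E": 90, "SE": 135,
            "S": 180, "SW": 225, "W": 270, "NW": 315}

_KM = re.compile(r"(\d+(?:\.\d+)?)\s*(?:kilometers?|km)\b", re.IGNORECASE)
_MI = re.compile(r"(\d+(?:\.\d+)?)\s*(?:miles?|mi)\b\.?", re.IGNORECASE)
_DISPLACEMENT = re.compile(
    r"(\d+(?:\.\d+)?) mi (NE|NW|SE|SW|N|S|E|W) of ([A-Za-z .'-]+)",
    re.IGNORECASE,
)


def standardize(locality):
    """Collapse whitespace and express every distance as '<number> mi'."""
    text = " ".join(locality.split())
    text = _KM.sub(lambda m: f"{float(m.group(1)) * KM_TO_MILES:.2f} mi", text)
    # Normalizes 'miles', 'mile' and 'mi.' to 'mi' (a trailing period is dropped).
    text = _MI.sub(lambda m: f"{float(m.group(1)):g} mi", text)
    return text


def parse_displacement(standardized):
    """Return (miles, direction, place) for '<n> mi <dir> of <place>', else None."""
    m = _DISPLACEMENT.search(standardized)
    if m is None:
        return None
    return float(m.group(1)), m.group(2).upper(), m.group(3).strip()


def offset(lat, lon, miles, bearing_deg):
    """Shift a point by a distance and bearing (flat-earth approximation)."""
    theta = math.radians(bearing_deg)
    dlat = miles * math.cos(theta) / MILES_PER_DEGREE_LAT
    dlon = miles * math.sin(theta) / (
        MILES_PER_DEGREE_LAT * math.cos(math.radians(lat)))
    return lat + dlat, lon + dlon


def georeference(locality, gazetteer):
    """Resolve a displacement-style locality against a place-name lookup."""
    parsed = parse_displacement(standardize(locality))
    if parsed is None:
        return None
    miles, direction, place = parsed
    anchor = gazetteer.get(place.lower())
    if anchor is None:
        return None
    return offset(anchor[0], anchor[1], miles, BEARINGS[direction])


# Hypothetical gazetteer entry; coordinates are approximate and illustrative.
gazetteer = {"bogalusa": (30.79, -89.85)}
result = georeference("5 km N of Bogalusa", gazetteer)
```

A fuller system would handle the other identifier types listed in the Design section (river miles, highways, water bodies, legal locations) and rank multiple candidate coordinates; this sketch covers only the displacement pattern.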