1 / 20

Digitizing California Arthropod Collections

Digitizing California Arthropod Collections. Peter Oboyski, Phuc Nguyen, Serge Belongie , Rosemary Gillespie Essig Museum of Entomology University of California Berkeley, California, USA. Who is CalBug ?. Essig Museum of Entomology California Academy of Sciences

yorick
Download Presentation

Digitizing California Arthropod Collections

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Digitizing California Arthropod Collections Peter Oboyski, Phuc Nguyen, Serge Belongie, Rosemary Gillespie Essig Museum of Entomology University of California Berkeley, California, USA

  2. Who is CalBug? Essig Museum of Entomology California Academy of Sciences California State Collection of Arthropods Bohart Museum, UC Davis Entomology Research Museum, UC Riverside San Diego Natural History Museum LA County Museum Santa Barbara Museum of Natural History

  3. Digitization workflow Optical Character Recognition (OCR) & Automated data parsing (Optional) Sort by locality, date, sex, etc. Online crowd-sourcing of manual data entry Manually enter data into MySQL database Remove labels, add unique identifier Replace labels, return to collection Take digital image, name and save file Geographic referencing Aggregate data in online cache Temporospatial analyses Error checking Handling & Imaging Data Capture Data Manipulation

  4. Why Image Specimens/Labels? • Data capture can be done remotely • Magnify difficult to read labels • Potential for OCR • Verbatim digital archive of label data

  5. 1st generation - DinoLitedigital microscope

  6. 2nd generation – Digital Camera (Canon G9)

  7. Higher resolution Labels flat & unobstructed Scale bar, controlled light Important to add species name to image or file name EMEC218958 Paracotalpaursina.jpg ~150,000 images waiting to database

  8. Data capture Optical Character Recognition (OCR) & Automated data parsing Using our own MySQL database (EssigDB) Built-in error checking Data carry-over one record to next Taxonomy automatically added “Notes from Nature” Collaboration with Zooniverse Citizen Scientist transcription of labels Collaboration with UC San Diego Improved word spotting& OCR Online crowd-sourcing of manual data entry Manually enter data into MySQL database

  9. Notes from Nature Citizen Science data transcription

  10. Integrating OCR with crowd sourcing Spotting words within images Copy-paste, highlight-drag fields Auto-detecting repeated “words” eg. species, states, counties Providing an additional “vote” for transcription consensus

  11. The OCR challenge for specimen labels DETECTION: Finding text in a complex matrix Machine-typed vs. hand-written labels Sliding window classifier creating text bounding boxes >95% detection and localization using pixel-overlap measures

  12. Current Progress in OCR recognition RECOGNITION: Using TesseractOCR engine Machine Type 74% accuracy for word-level 82% accuracy for character-level Hand Writing 5.4% accuracy for word-level 9.2% accuracy for character-level

  13. Where do we go from here? Improved recognition of hand-writing Incorporate OCR into crowd sourcing Develop (semi-) automated data parsing

  14. Thank you http://calbug.berkeley.edu

More Related