Technology Support for ESSSS Marshall Breeding Director for Innovative Technology and Research Vanderbilt University Library Founder and Publisher, Library Technology Guides http://www.librarytechnology.org/ http://twitter.com/mbreeding Progress, Issues, and Challenges ESSSS Digital Archive Workshop February 4, 2012
Turning Pages on Paper to Digital Images • Digitizing in the field involves many compromises compared to what can be done in more controlled settings • Access to archives may be of limited duration • Arbitrary and political • Materials deteriorating rapidly • Practices related to physical preservation tend to be minimal • Must be light, fast, and inexpensive
Achieve best results possible • Maximize quality and consistency • Handheld digital cameras • Rapid advancement in capabilities • Early images done at lower resolutions compared with what is possible today • Fixed camera stands • Consistency in orientation and framing • Organization of Images (folders / image names)
Image Standards • TIFF: Currently regarded as best image format for archiving images • RAW: Native proprietary format of a camera • JPEG: Compressed images for display on the Web • Data lost during compression: non-reversible • VU system creates multiple sizes of JPEG images • JPEG2000 • Supports lossless compression • Not well supported on the Web
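The slide notes that the VU system creates multiple sizes of JPEG images from each master. A minimal sketch of how derivative dimensions could be computed while preserving aspect ratio is shown here. The function name `derivative_sizes` and the edge lengths (150, 800, 2000 pixels) are illustrative assumptions, not the values used by the VU system.

```python
def derivative_sizes(width, height, max_edges=(150, 800, 2000)):
    """Return {max_edge: (w, h)} for each derivative, preserving aspect ratio.

    The default edge lengths are hypothetical placeholders.
    """
    longest = max(width, height)
    sizes = {}
    for edge in max_edges:
        if edge >= longest:
            # Never upscale: a derivative larger than the master adds no detail.
            sizes[edge] = (width, height)
        else:
            # Integer arithmetic keeps the result deterministic.
            sizes[edge] = (
                max(1, width * edge // longest),
                max(1, height * edge // longest),
            )
    return sizes
```

Because JPEG compression is non-reversible, each derivative would be generated from the archival master rather than from another JPEG.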
Bringing Images to the Web • Take advantage of infrastructure developed by the Vanderbilt University Library to manage images • Digital Library framework: • Presentation and functionality created in Perl-based interface • Data and Metadata stored in MySQL relational tables • ODBC connectivity between presentation layer and MySQL • Microsoft Windows Server/IIS for Web server • Images reside on digital storage provided by the Vanderbilt University Library
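The slide says that data and metadata live in relational tables, but it does not give the schema. The sketch below shows one way volumes and their page images could be related. It uses Python's built-in `sqlite3` as a stand-in for MySQL over ODBC, and every table, column, and function name is a hypothetical chosen for illustration.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with hypothetical volume and page tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE volume (
            volume_id INTEGER PRIMARY KEY,
            title     TEXT NOT NULL
        );
        CREATE TABLE page (
            page_id    INTEGER PRIMARY KEY,
            volume_id  INTEGER NOT NULL REFERENCES volume(volume_id),
            sequence   INTEGER NOT NULL,  -- position of the page in the volume
            image_file TEXT NOT NULL      -- path on library-provided storage
        );
        """
    )
    return conn


def pages_for_volume(conn, volume_id):
    """Return the image files of one volume in reading order."""
    rows = conn.execute(
        "SELECT image_file FROM page WHERE volume_id = ? ORDER BY sequence",
        (volume_id,),
    )
    return [row[0] for row in rows]
```

A presentation layer would call something like `pages_for_volume` after the user selects a volume, then render the returned images in sequence.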
Digital Preservation • Disaster Recovery • Ability to restore files in the case of any hardware, software, or human error • Digital Preservation • Commitment and processes in place to preserve digital information for the very long term • Multiple replications • Migration of data into future formats as current standards become obsolete
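Multiple replications only protect a collection if the copies are known to match. One common way to verify this is a checksum comparison, sketched below under the assumption that the master and its replicas are reachable as ordinary files. The function names are hypothetical, and the slides do not say which fixity method, if any, the project uses.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large TIFF masters need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def mismatched_replicas(master, replicas):
    """Return the replica paths that are missing or differ from the master."""
    expected = sha256_of(master)
    bad = []
    for replica in replicas:
        replica = Path(replica)
        if not replica.is_file() or sha256_of(replica) != expected:
            bad.append(replica)
    return bad
```

Running a check like this on a schedule would surface silent corruption while a good copy still exists to restore from.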
Building structure through Metadata • Metadata structure based on Dublin Core • Volume-level descriptive metadata • Courtney Campbell designed metadata structure and is analyzing volumes to populate metadata for each volume • EXIF Data extracted from images into the individual records for each page • Page-level structure • Supports ability to select volumes and browse page images
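To make the Dublin Core structure concrete, here is a minimal sketch that serializes a volume-level record as XML using the standard Dublin Core element namespace. The wrapper element `record`, the function name, and any field values passed in are assumptions for illustration. The actual metadata structure designed for the project is not shown in the slides.

```python
import xml.etree.ElementTree as ET

# Namespace of the 15 core Dublin Core elements (title, creator, date, ...).
DC_NS = "http://purl.org/dc/elements/1.1/"


def dublin_core_record(fields):
    """Serialize a dict of Dublin Core element names to an XML string."""
    ET.register_namespace("dc", DC_NS)
    record = ET.Element("record")  # hypothetical wrapper element
    for name, value in fields.items():
        element = ET.SubElement(record, f"{{{DC_NS}}}{name}")
        element.text = value
    return ET.tostring(record, encoding="unicode")
```

For example, `dublin_core_record({"title": "Placeholder volume title", "language": "es"})` produces a record with `dc:title` and `dc:language` elements. Both values here are placeholders.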
Demonstration • Image management environment • Interface • Metadata • Page Images
Turning Pages into Data • The contents of the page images contain valuable data • Page images can be read by humans but do not support essential features: search, computer analysis, etc. • Full value of these collections can be realized through transcription
Challenges in transcription • Page characteristics • Handwritten by many different hands • Many names and numbers • Spanish language • Varying contrast • Many defects: water damage, insects, etc.
Human transcription • Scholars who work with pages of interest can create transcriptions manually • Optical character recognition? • Highly accurate for typescript • Not effective for handwritten manuscripts
Crowdsourcing • Find ways to have large numbers of people create transcript snippets • Google uses crowdsourcing to improve transcripts for the Google Books project.
Google ReCAPTCHA: • “Digitizing books one word at a time” • Each transaction transcribes one or two words • Each word is transcribed many times • Results compared to determine correct version
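The idea of comparing many transcriptions of the same word can be sketched as a simple majority vote. This is an illustration of the general technique only, not a description of how ReCAPTCHA is implemented. The function name and the 0.6 agreement threshold are assumptions.

```python
from collections import Counter


def consensus(transcriptions, min_agreement=0.6):
    """Return the agreed reading, or None if no reading is common enough.

    Readings are compared after trimming whitespace and lowercasing, so the
    original capitalization is not preserved in the result.
    """
    normalized = [t.strip().lower() for t in transcriptions if t.strip()]
    if not normalized:
        return None
    reading, count = Counter(normalized).most_common(1)[0]
    # Accept the top reading only if enough contributors agree on it.
    return reading if count / len(normalized) >= min_agreement else None
```

Words that fail to reach agreement would be sent to more contributors or flagged for review by a scholar.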
Crowdsourcing to Transcribe ESSSS • Scholars contribute any transcriptions created as they work with any given set of pages • Students assigned to create transcriptions • Language, history, LIS • Collaboration with an organization with ReCAPTCHA-like infrastructure