Down and Dirty Digitization: Everything you need to know about putting content online • Roy Tennant, California Digital Library
Outline • Project Planning • Selecting Material to Digitize • Digitization Purpose • Basic Imaging Principles • Capturing Images • Editing Images • Best Practices • Conversion to Text • Metadata • Access Systems • Skills Required of Staff • Preservation
Project Planning • Who will do the work? • What systems will be required? • What are the specifications for images and metadata? • How much will the project cost? • Who will own and manage the digital products that will be produced? Steve Chapman, from Handbook for Digital Projects, NEDCC
Selecting Material to Digitize • Publishing rights • Available support/funding opportunity • Critical mass • Uniqueness • Reputation • Audience and potential use • Diversity of material type • Ability to stand on its own and fit in with other collections
What Do We Preserve? • The body or the soul? • The artifact • The intellectual content • How do we decide that the artifact has preservation value? • Who decides?
The Artifact • The “look and feel” • The experience of interacting with a specific object • Consequences: • Choices for providing access are limited • Time and money spent on recreating the artifact may be better spent on increasing access • In some cases, preserving the look and feel actually harms other uses
Written Material • Handwritten texts (diaries, etc.), or those with handwritten notations (manuscript drafts, etc.) can easily be considered to have artifactual value • But how much artifactual value do printed texts have? • And born-digital texts? • What’s it worth to you?
“If the goal of preservation is persistent utility, then functionality rather than aesthetics should drive system design.” - Stephen Chapman, “Content Follows Form: Preservation via Systems Design,” Microform & Imaging Review
Persistent Utility • Form must be allowed to be altered or destroyed to retain or enhance function • If function cannot be retained or enhanced, then form should be preserved
Considerations for Retaining Items in Original Format • Age • Evidential value • Aesthetic value • Scarcity • Associational value • Market value • Exhibition value
“The issue is not to evaluate the artifact per se to determine what survives and what does not…The issue is the need to agree on a method for interrogating the individual artifact, that would, in a climate of finite resources, help make a good decision about whether and how to preserve it.” — Council on Library and Information Resources, The Evidence in Hand: the Report of the Task Force on the Artifact in Library Collections
How Do We Preserve It? • Preservation costs by method, as calculated by the Library of Congress Preservation Directorate
Types of Materials • Printed text/simple line art • Mixed • Halftones • Manuscripts • Continuous tone • From Anne Kenney et al., Moving Theory into Practice
Benchmarking • The process whereby you determine your digitization requirements using the material you will digitize
Resolution • The number of pixels in a given area defines the resolution of an image • Figure labels: one pixel; 1”; 500 x 1,000 pixels
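As a rough illustration of the definition above, the pixel dimensions of a scan follow directly from the size of the original and the scanning resolution. The function name and the 8.5 x 11 inch example are assumptions chosen for illustration, not values from the slide.

```python
def pixel_dimensions(width_in, height_in, dpi):
    """Return (width_px, height_px) for an original scanned at the given dpi."""
    return round(width_in * dpi), round(height_in * dpi)


# A letter-size page (8.5 x 11 inches) scanned at 300 dpi.
print(pixel_dimensions(8.5, 11, 300))  # (2550, 3300)
```

Doubling the dpi doubles each dimension and therefore quadruples the number of pixels to store.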
Dynamic Range (bit-depth) • 1 bit = black or white • 8 bits = 256 shades • 16 bits = thousands • 24 bits = millions • 36 bits = billions • Sample images: 1 bit, 8 bit grayscale, 8 bit color, 24 bit color (format labels shown: GIF, GIF, JPEG)
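The counts on this slide come from a single rule: each extra bit doubles the number of values a pixel can hold. A minimal sketch (the function name is an assumption for illustration):

```python
def tonal_levels(bits_per_pixel):
    """Number of distinct values a pixel can take at the given bit depth."""
    return 2 ** bits_per_pixel


for bits in (1, 8, 16, 24, 36):
    print(f"{bits:>2} bits = {tonal_levels(bits):,} values")
```

This gives 2 values for 1 bit, 256 for 8 bits, 65,536 for 16 bits, about 16.8 million for 24 bits, and about 68.7 billion for 36 bits.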
RGB Color Space • Color channels: Red, Green, Blue • 8 bits per channel = 24 bit color image • 12 bits per channel = 36 bit color image
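One way to see why three channels of 8 bits make a 24 bit pixel is to pack the three channel values into a single integer. This is only a sketch under assumed names (`pack_rgb`, `unpack_rgb`). Real image formats define their own byte layouts.

```python
def pack_rgb(red, green, blue, bits_per_channel=8):
    """Pack three channel values into one integer, red in the highest bits."""
    limit = 1 << bits_per_channel
    for value in (red, green, blue):
        if not 0 <= value < limit:
            raise ValueError(f"channel value {value} does not fit in {bits_per_channel} bits")
    return (red << (2 * bits_per_channel)) | (green << bits_per_channel) | blue


def unpack_rgb(pixel, bits_per_channel=8):
    """Split a packed pixel back into its (red, green, blue) channel values."""
    mask = (1 << bits_per_channel) - 1
    return (
        (pixel >> (2 * bits_per_channel)) & mask,
        (pixel >> bits_per_channel) & mask,
        pixel & mask,
    )


orange = pack_rgb(255, 128, 0)
print(hex(orange), unpack_rgb(orange))  # 0xff8000 (255, 128, 0)
```

Passing `bits_per_channel=12` gives the 36 bit case, where each channel ranges from 0 to 4095.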
Image Compression • Lossless — the image is unchanged after compression (no image data is lost) • Typical file size: 50% of original • Example: LZW compression • Lossy — the image is altered after compression (image data is lost) • Example: JPEG
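The difference between the two families can be shown with a small round trip. This sketch uses Python's built-in zlib (DEFLATE) as a stand-in for LZW, and a crude bit-dropping step as a stand-in for JPEG. Neither is the actual algorithm named on the slide, and the synthetic pixel data is far more compressible than a typical scan.

```python
import zlib

# Synthetic 8-bit grayscale data: a smooth ramp repeated 64 times (illustrative only).
pixels = bytes(range(256)) * 64

# Lossless: decompressing returns exactly the bytes that went in.
packed = zlib.compress(pixels)
restored = zlib.decompress(packed)
lossless_ok = restored == pixels

# Lossy (crude stand-in): keep only the top 4 bits of every sample.
# The discarded low bits cannot be recovered afterwards.
quantized = bytes(p & 0xF0 for p in pixels)
lossy_changed = quantized != pixels

print(len(pixels), len(packed), lossless_ok, lossy_changed)
```

The lossless copy is smaller on disk but bit-for-bit identical once restored, while the lossy copy differs from the original permanently.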
TIFF • Tagged Image File Format • Most often used to save “master versions” of images (unedited) • Can be compressed or uncompressed
CompuServe GIF • Graphics Interchange Format (GIF) • Maximum 8 bits/pixel: 256 colors (shades) • Good for: • Text and line art • Thumbnails • Not good for: • Full-color pictures • Anything that requires more than 256 colors
JPEG • Joint Photographic Experts Group • JPEG is actually a compression scheme; the image file format is JFIF (JPEG File Interchange Format) • Good for: • Full-color pictures • Anything that requires more than 256 colors • Not good for: • Text or line art
New Image Formats • Portable Network Graphics (PNG) - from the W3C to replace the CompuServe GIF format and provide more capabilities • JPEG2000 - an upgrade of the JPEG format • Flashpix - from a consortium of commercial companies, to provide much higher-resolution images in a way that allows speedy network delivery • MrSID - from LizardTech, good for large format materials (maps, panoramic photos, etc.)
Capturing Images • Technologies • Digital Cameras • Flatbed Scanners • Film Scanners • Kodak PhotoCD • Outsourcing • Standards and Best Practices
Digital Cameras • Phase One PowerPhase FX: 10,500 x 12,600 pixels, 760MB (48 bit RGB) • BetterLight Super6K: 6,000 x 8,000 pixels, 136MB (24 bit RGB), $16,990
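The file sizes quoted for these cameras can be checked with simple arithmetic: pixels times bits per pixel, divided by 8 for bytes. The sketch below assumes uncompressed storage and binary megabytes (2**20 bytes). The function name is made up for illustration.

```python
def uncompressed_size_mb(width_px, height_px, bits_per_pixel):
    """Uncompressed image size in binary megabytes (2**20 bytes)."""
    total_bytes = width_px * height_px * bits_per_pixel / 8
    return total_bytes / 2 ** 20


print(round(uncompressed_size_mb(10_500, 12_600, 48)))  # 757
print(round(uncompressed_size_mb(6_000, 8_000, 24)))    # 137
```

Both results land within about 1 MB of the figures on the slide, which suggests the quoted sizes are for uncompressed captures.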
Flatbed Scanners • Minimum requirements: • 600 x 1200 dpi optical resolution • 36-bit color • Not for slides or transparencies, best for 8 1/2” x 11” or 8 1/2” x 14” originals • Sheet feeder (often optional) helpful for digitizing text
Film Scanners • For 35mm slides and negatives; others available for larger formats • $600 - $3,000 • Most around 2700-4000 dpi, 30-36 bit color
Kodak PhotoCD • Take pictures with a normal camera, but have your pictures “developed” onto a PhotoCD • Uses a proprietary image format (ImagePAC), but offers very high resolution (4 different resolutions)
Outsourcing: Pros and Cons • Benefits: • No ramp-up costs (both time and money) • Probably higher quality, at least to begin with • High volume capability • Drawbacks: • May be more costly if you have underutilized staff time • No internal capability or experience developed (that is, when the money runs out, so does your chance to do anything more) • Rare items may require in-house digitization
Outsourcing: How • Write an RFQ (Request for Quote) outlining: • Type and amount of material being digitized • Quality requirements • Volume per unit of time requirements • For RFQ guidance and samples, see RLG Tools for Digital Imaging: • www.rlg.org/preserv/RLGtools.html
Digital Image Work Flow • Original: TIFF or PCD, 10-100+MB • Steps: rotate, crop, retouch, adjust brightness/contrast, resize, sharpen • Derivatives: JPEG (100K), GIF (10K) • Figure labels: RGB color space; indexed color space; stored offline; stored online
Editing Images • Rotating • Cropping • Retouching • Adjusting • Resizing • Sharpening • Saving
Conversion to Text • Optical Character Recognition (OCR) software is required (Caere OmniPage Pro, Xerox TextBridge, etc.) • Quality and typography of originals is key • When OCR accuracy falls below 99.5%, it is less expensive to have the text re-keyed offshore • For some applications, uncorrected text is sufficient
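To apply the 99.5% rule of thumb, accuracy first has to be measured against a trusted transcription of a sample page. The sketch below assumes character-level accuracy and uses the standard library's `difflib` for the comparison. The function names and the sample strings are hypothetical, and other accuracy definitions (for example word-level) would give different numbers.

```python
from difflib import SequenceMatcher

REKEY_THRESHOLD = 0.995  # the 99.5% figure from the slide


def char_accuracy(reference, ocr_text):
    """Fraction of reference characters that the OCR output reproduces in order."""
    if not reference:
        return 1.0
    matcher = SequenceMatcher(None, reference, ocr_text, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(reference)


def should_rekey(reference, ocr_text):
    """True when measured accuracy falls below the re-keying threshold."""
    return char_accuracy(reference, ocr_text) < REKEY_THRESHOLD


# One wrong character out of twelve: well below the threshold.
print(should_rekey("digitization", "digitizatlon"))  # True
```

In practice the sample should be much longer than a single word, since one error in a short string swings the percentage heavily.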
Imaging Best Practices • General guidelines for archival versions: • Photos, illustrations, maps, etc.: • 300-600dpi • 24-36 bit color • B/W Text document: • 300-600dpi • 8 bit grayscale • Negatives and Slides: • 2000-4000 pixels in longest dimension • 24-36 bit color for color; 8 bit grayscale for B/W
Imaging Best Practices “The key to image quality is not to capture at the highest resolution or bit depth possible, but to match the conversion process to the informational content of the original, and to scan at that level--no more, no less.”— Moving Theory Into Practice
Metadata: Types • Structured description of an object or collection of objects • Three basic types: • descriptive - e.g., title, creator, subject - used for discovery • administrative - e.g., resolution, bit depth - used for managing the collection • structural - e.g., table of contents page, page 34, etc. - used for navigation
Metadata: Appropriate Level • Collection-level access: • Discovery metadata describes the collection • Example: Archival finding aid encoded in SGML; see http://www.oac.cdlib.org/ • Item-level access: • Discovery metadata describes the item • Example: individual metadata records for each item; see http://jarda.cdlib.org/cgi-bin/imagesearch.pl
Collection Level Access • Diagram labels: Search Interface (library catalog or dedicated); Individual Finding Aids; Images
Item Level Access • Diagram labels: Search Interface (dedicated); Finding Aids; Images
Metadata: Granularity • <name>William Randolph Hearst</name> • <name> <first>William</first> <middle>Randolph</middle> <last>Hearst</last></name> • Consider all uses for the metadata • Design for the most granular use • Store it in a machine-parseable format
Metadata: Qualification • <name role="creator">William Randolph Hearst</name> • <subject scheme="LCSH">Builder -- Castles -- Southern California</subject>
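Qualifiers such as `role` and `scheme` pay off when software filters on them. The sketch below wraps the two elements from this slide in a hypothetical `<record>` element (not part of the original) and reads the attributes with Python's standard XML parser. The helper names are also assumptions.

```python
import xml.etree.ElementTree as ET

# The <record> wrapper is hypothetical; the two child elements come from the slide.
RECORD = """<record>
  <name role="creator">William Randolph Hearst</name>
  <subject scheme="LCSH">Builder -- Castles -- Southern California</subject>
</record>"""


def names_with_role(record_xml, role):
    """Return the text of every <name> whose role attribute matches."""
    root = ET.fromstring(record_xml)
    return [el.text for el in root.findall("name") if el.get("role") == role]


def subjects_by_scheme(record_xml):
    """Group <subject> text by its scheme attribute."""
    root = ET.fromstring(record_xml)
    grouped = {}
    for el in root.findall("subject"):
        grouped.setdefault(el.get("scheme", "unqualified"), []).append(el.text)
    return grouped


print(names_with_role(RECORD, "creator"))
print(subjects_by_scheme(RECORD))
```

Without the qualifiers, a program could not tell a creator from a subject of the work, or a controlled LCSH heading from a free-text keyword.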
Metadata: Machine Parseability • The ability to pull apart and reconstruct metadata via software • For example, this: <name> <first>William</first> <middle>Randolph</middle> <last>Hearst</last></name> • Can easily become this: <DC.creator>Hearst, William Randolph</DC.creator>
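The conversion on this slide can be sketched in a few lines with Python's standard XML parser. The function name is an assumption, and the "Last, First Middle" ordering is simply the one the slide shows.

```python
import xml.etree.ElementTree as ET

GRANULAR_NAME = (
    "<name><first>William</first><middle>Randolph</middle>"
    "<last>Hearst</last></name>"
)


def to_dc_creator(name_xml):
    """Rebuild a granular <name> element as a 'Last, First Middle' DC.creator element."""
    name = ET.fromstring(name_xml)
    first = name.findtext("first", default="")
    middle = name.findtext("middle", default="")
    last = name.findtext("last", default="")
    # Join whichever given-name parts are present.
    given = " ".join(part for part in (first, middle) if part)
    creator = ET.Element("DC.creator")
    creator.text = f"{last}, {given}" if given else last
    return ET.tostring(creator, encoding="unicode")


print(to_dc_creator(GRANULAR_NAME))
# <DC.creator>Hearst, William Randolph</DC.creator>
```

Going the other way, from the single string back to separate first, middle, and last elements, would require guessing where the name splits, which is the argument for storing the most granular form.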
Metadata: Standards • Metadata: • Collection Level: • Encoded Archival Description (EAD) - lcweb.loc.gov/ead/ • Item Level: • MARC • Dublin Core - purl.org/DC/ • MODS - www.loc.gov/standards/mods/ • Harvesting: • Open Archives Initiative, www.openarchives.org
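For the harvesting entry, an Open Archives Initiative harvester asks a repository for records with an ordinary HTTP request. The sketch below only builds such a request URL. `ListRecords` and `oai_dc` are part of the OAI protocol, while the base URL `https://example.org/oai` and the function name are placeholders, not a real repository or API.

```python
from urllib.parse import urlencode

# Placeholder endpoint; substitute the base URL of an actual repository.
BASE_URL = "https://example.org/oai"


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build an OAI ListRecords request for the given metadata format."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


print(list_records_url(BASE_URL))
# https://example.org/oai?verb=ListRecords&metadataPrefix=oai_dc
```

The response is an XML document of metadata records, by default unqualified Dublin Core, which a harvester can then parse and index.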
Access Systems • Exhibit • Browse • Search
Access Systems: Exhibit • Goals: • Inviting • Easy to navigate • Highlight selected parts of a collection • Teach • Requirements: • Great graphic design • Informative and succinct commentary • Interesting subject matter