
Digital Reformatting of Text

Digital Reformatting of Text. Aaron Choate, Digital Library Production Services, The University of Texas Libraries. From last time: calculating potential file size (no really… this time we got it!): file size = (height x width x bit-depth x dpi²) / 8 bits per byte. Imaging benchmarking.


Presentation Transcript


  1. Digital Reformatting of Text • Aaron Choate • Digital Library Production Services • The University of Texas Libraries

  2. From last time: • Calculating potential file size (no really… this time we got it!) • file size = (height x width x bit-depth x dpi²) / 8 bits per byte
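
The formula on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming height and width are measured in inches and that the result is the uncompressed size in bytes; the function name `file_size_bytes` and the sample page dimensions are not from the slides.

```python
def file_size_bytes(height_in, width_in, bit_depth, dpi):
    """Uncompressed image size in bytes.

    height_in, width_in: page dimensions in inches (assumed unit)
    bit_depth: bits per pixel (1 = bitonal, 8 = grayscale, 24 = color)
    dpi: scanning resolution in dots per inch
    """
    # Pixels = (height * dpi) * (width * dpi), hence the dpi squared term.
    total_bits = height_in * width_in * bit_depth * dpi ** 2
    return total_bits / 8  # 8 bits per byte


if __name__ == "__main__":
    # A letter-size page (8.5 x 11 in) at three common settings.
    for label, depth, dpi in [("bitonal", 1, 600),
                              ("grayscale", 8, 300),
                              ("color", 24, 300)]:
        size = file_size_bytes(11, 8.5, depth, dpi)
        print(f"{label:9s} {dpi} dpi: {size / 1_000_000:.1f} MB")
```

Because resolution is squared, doubling the dpi quadruples the file size, while doubling the bit depth only doubles it.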

  3. Imaging: Benchmarking • Subjective evaluation becomes more problematic when the goal is legibility rather than fidelity.

  4. Imaging: Benchmarking • Physical type, size and presentation

  5. Imaging: Benchmarking • Physical condition • Darkening pages • Fading ink • Stains • Bleed-through • Uneven printing • Fold lines • Smearing

  6. Imaging: Benchmarking • Document classification • Simple text / printed line art • Distinct-edge-based representation: bitonal? • Manuscripts • Soft-edge-based representation: grayscale / color • Mixed material

  7. Imaging: Benchmarking • Medium and support • Support – (paper, clay tablet, etc.) • Thin paper? (bleed-through) • Medium – (graphite pencil, inks, etc.) • Fading of ink • Variations in color or density

  8. Imaging: Benchmarking • Tonal representation

  9. Imaging: Benchmarking • Color appearance • Is color reproduction necessary to the document’s meaning? • What purpose does the color serve? • How important is maintaining the color appearance?

  10. Imaging: Benchmarking • Detail • Printed text: measure the height of the smallest lowercase letter that typifies the item or group of items. • Manuscripts, line art: measure the finest stroke width that must be represented and characterize the needed level of quality.

  11. Imaging: Benchmarking • QI (Quality Index) • Defines detail as character height • Comes from the ANSI/AIIM preservation microfilming standard for determining requirements for text legibility • Defines a range from barely legible through excellent that maps to technical test targets

  12. Imaging: Benchmarking • Line pairs • Excellent = 8 line pairs • Good = 5 line pairs • Marginal = 3.6 line pairs • Barely legible = 3.0 line pairs

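  13. Imaging: Benchmarking • Digital QI (h = character height in mm; the factor 0.039 converts mm to inches) • Bitonal (only black pixels): QI = (dpi x 0.039h) / 3; h = 3QI / (0.039 x dpi); dpi = 3QI / (0.039h) • Tonal images (grayscale for printed text): QI = (dpi x 0.039h) / 2; h = 2QI / (0.039 x dpi); dpi = 2QI / (0.039h)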
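
The digital QI formulas and the line-pair thresholds from the previous slide can be combined into a small calculator. This is a sketch under stated assumptions: h is taken to be in millimetres, the names `qi`, `required_dpi` and `rating` are invented for illustration, and the "below barely legible" label for values under 3.0 is not from the slides.

```python
MM_TO_INCH = 0.039  # rounded conversion factor used on the slide


def _divisor(bitonal):
    # The slide divides by 3 for bitonal capture and by 2 for tonal capture.
    return 3 if bitonal else 2


def qi(dpi, h_mm, bitonal=True):
    """Quality Index for a given resolution and character height (mm)."""
    return dpi * MM_TO_INCH * h_mm / _divisor(bitonal)


def required_dpi(target_qi, h_mm, bitonal=True):
    """Resolution needed to reach target_qi for characters h_mm tall."""
    return _divisor(bitonal) * target_qi / (MM_TO_INCH * h_mm)


def rating(qi_value):
    """Map a QI value to the legibility labels on slide 12."""
    if qi_value >= 8:
        return "excellent"
    if qi_value >= 5:
        return "good"
    if qi_value >= 3.6:
        return "marginal"
    if qi_value >= 3.0:
        return "barely legible"
    return "below barely legible"  # label assumed, not on the slide


if __name__ == "__main__":
    h = 1.0  # smallest lowercase letter is 1 mm tall
    print(round(qi(600, h), 1), rating(qi(600, h)))
    print(round(required_dpi(8, h)), "dpi bitonal for excellent")
    print(round(required_dpi(8, h, bitonal=False)), "dpi grayscale for excellent")
```

The smaller divisor for tonal capture means grayscale scanning reaches the same QI at roughly two thirds of the bitonal resolution.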

  14. Text Capture • Methods • Rekeying • OCR • Accuracy …

  15. Software • ScanSoft – OmniPage Pro • ABBYY – FineReader • Adobe Acrobat … • Prime Recognition – PrimeOCR

  16. Encoding

  17. XML vs SGML • SGML (Standard Generalized Markup Language) is the granddaddy of all markup languages. • XML is a subset of SGML intended as the format for use on the Internet. • XML attempts to fill the gap between SGML, which can be used for just about anything, and HTML, which is severely limited and is currently being abused because of this (table structures for layout, clear 1-pixel GIFs, etc.).

  18. XML: DTDs vs Schemas

  19. XML: TEI • Text Encoding Initiative • Initially launched in 1987, the TEI is an international and interdisciplinary standard that helps libraries, museums, publishers, and individual scholars represent all kinds of literary and linguistic texts for online research and teaching, using an encoding scheme that is maximally expressive and minimally obsolescent.

  20. XML: TEI • Levels of encoding • Level 1: Fully Automated Conversion and Encoding • Level 2: Minimal Encoding • Level 3: Simple Analysis • Level 4: Basic Content Analysis • Level 5: Scholarly Encoding Projects
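
To give a feel for what a lightly encoded TEI text looks like, here is a hedged sketch that parses a tiny sample document with Python's standard library. The sample follows the TEI P5 namespace and element names, which is an assumption since the slides do not say which TEI version was used; the sample text and the variable names are invented for illustration, and the sketch is not tied to any particular encoding level.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"  # TEI P5 namespace (assumed version)

# A deliberately tiny sample: a header describing the file, then the text.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Sample pamphlet</title></titleStmt>
      <publicationStmt><p>Unpublished illustration.</p></publicationStmt>
      <sourceDesc><p>Transcribed from a page image.</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>First paragraph of the transcription.</p>
      <p>Second paragraph of the transcription.</p>
    </body>
  </text>
</TEI>"""

# fromstring raises ParseError if the sample is not well-formed XML.
root = ET.fromstring(SAMPLE)
ns = {"tei": TEI_NS}

title = root.find("tei:teiHeader/tei:fileDesc/tei:titleStmt/tei:title", ns).text
paragraphs = [p.text for p in root.findall("tei:text/tei:body/tei:p", ns)]

print(title)
print(len(paragraphs), "paragraphs")
```

Richer encoding adds more elements inside the body (names, dates, structural divisions), while the header and overall skeleton stay the same. Note that this only checks well-formedness; validating against a DTD or schema needs a separate tool.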

  21. Character sets • Unicode – Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.

  22. Unicode • Greek & Coptic character sets
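
The "unique number for every character" idea can be seen directly with Python's standard `unicodedata` module. This is a small illustration; the helper names `describe` and `in_greek_and_coptic` are invented here, and the block range U+0370 to U+03FF is the Unicode "Greek and Coptic" block.

```python
import unicodedata

# The Unicode "Greek and Coptic" block covers U+0370 through U+03FF.
GREEK_AND_COPTIC = range(0x0370, 0x0400)


def describe(ch):
    """Return (code point label, official Unicode name) for one character."""
    return f"U+{ord(ch):04X}", unicodedata.name(ch)


def in_greek_and_coptic(ch):
    """True if the character's code point falls in the Greek and Coptic block."""
    return ord(ch) in GREEK_AND_COPTIC


if __name__ == "__main__":
    samples = [
        "A",        # Latin capital A
        "\u03b1",   # Greek small alpha
        "\u03a9",   # Greek capital omega
        "\u03e3",   # Coptic small shei
    ]
    for ch in samples:
        code, name = describe(ch)
        print(code, name, in_greek_and_coptic(ch))
```

The same code point identifies the character regardless of platform or program; only the byte encoding used to store it (UTF-8, UTF-16, and so on) varies.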

  23. Software • XMetaL • oXygen • Cooktop

  24. Software • MetaE
