
Chapter Four

Chapter Four. Documents: The raw material. How to Build a Digital Library, Ian H. Witten and David Bainbridge. Documents: building blocks of digital libraries; many different standards for documents; internationalization; fixed versus fluid; permanent versus transient; indexing.


Presentation Transcript


  1. Chapter Four Documents: The raw material How to Build a Digital Library Ian H. Witten and David Bainbridge

  2. Documents • Building blocks of digital libraries • Many different standards for documents • Internationalization • Fixed versus fluid • Permanent versus transient • Indexing

  3. Standards Organizations • American National Standards Institute (ANSI) • International Organization for Standardization (ISO)

  4. Representing Characters • EBCDIC • Extended Binary Coded Decimal Interchange Code • Represented in 8 bits • ASCII (1968) • American Standard Code for Information Interchange • Represented with 7 bits • Does not support many foreign languages • Many expansions made to the basic ASCII character set • ISCII (1983) • Indian Script Code for Information Interchange • Hindi and related languages • GB and Big-5 for Chinese

  5. Unicode • Successor of ASCII • ISO-10646 (1993) • Universal • Aims to represent ALL the world’s languages • Default encoding for HTML and XML • Development began in 1988 as a joint effort between Apple and Xerox • Unicode standard continues to evolve • Round-trip compatibility – Unicode can be mapped to/from any character set without loss

  6. Unicode Character Set • Unicode standard is massive • Two subsets of standard: ISO 10646-1/2 • 94,000 characters defined • Represents scripts • Scripts versus languages • Punctuation shared among scripts • Universal character set – characters at the core of Unicode

  7. Five Zones of Unicode • Alphabetic scripts (Western languages, Latin, Greek, Cyrillic, Hebrew and Arabic) • Ideographic scripts (Chinese, Japanese, Korean) • Other characters (Braille, mathematical symbols) • Surrogates • Reserved codes

  8. Composite and Combining Characters: Unicode Terms • Character: abstract form of a letter • Glyph: a particular rendition of a character on a page • Different fonts → different glyphs • Unicode does not distinguish between different glyphs • Characters are abstract members of linguistic scripts, not graphic entities • Code Point: a Unicode value, specified by prefixing U+ • Includes ligatures and combining diaeresis • Canonical and compatibility equivalence • Deprecated characters • Code Range: range of values that characters span
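The difference between canonical and compatibility equivalence can be seen with Python's standard `unicodedata` module. This is a minimal sketch; the sample characters (ü and the "fi" ligature) and the helper `code_points` are chosen here for illustration and do not come from the slides.

```python
import unicodedata

def code_points(text):
    """Return the code points of a string in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]

precomposed = "\u00fc"   # ü as one code point, U+00FC
combining = "u\u0308"    # u followed by U+0308 COMBINING DIAERESIS

# Both render as the same glyph but are different code point sequences.
print(code_points(precomposed), code_points(combining))
print(precomposed == combining)

# Canonical equivalence: NFC composes, NFD decomposes.
nfc = unicodedata.normalize("NFC", combining)
nfd = unicodedata.normalize("NFD", precomposed)

# Compatibility equivalence: the ligature U+FB01 maps to "fi" only
# under the compatibility forms (NFKC/NFKD), not under NFC.
ligature = "\ufb01"
nfkc_ligature = unicodedata.normalize("NFKC", ligature)
nfc_ligature = unicodedata.normalize("NFC", ligature)
print(code_points(nfkc_ligature), code_points(nfc_ligature))
```

For a digital library this suggests normalizing text to one form before indexing, so that a query typed with a precomposed character still matches a document stored with combining characters.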

  9. Unicode Character Encoding • UTF: Unicode character set Transformation Format • UTF-32 • ISO standard uses a 32-bit (4-byte) value • Unicode consortium uses 21 bits • 21 bits address 32 planes (5 bits) of 65,536 characters (16 bits); Unicode assigns code points only in the first 17 planes (up to U+10FFFF) • Basic multilingual plane (living languages) • Supplementary multilingual plane (historic scripts, other alphabets) • Supplementary ideographic plane (ancient Chinese ideographs) • Supplementary special-purpose plane (tags for languages) • UTF-16, UTF-8 • Hindi and related scripts (ISCII)
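As a rough illustration of how the three transformation formats differ in size, the sketch below encodes a few characters with Python's built-in codecs and reports which plane each belongs to. The sample characters and the helper `plane_of` are assumptions made for this example.

```python
def plane_of(ch):
    """Plane number: the bits above the low 16 bits of the code point."""
    return ord(ch) >> 16

samples = [
    "A",            # U+0041, basic Latin
    "\u0939",       # U+0939, Devanagari letter HA
    "\u4e2d",       # U+4E2D, a CJK ideograph
    "\U00010348",   # U+10348, Gothic letter HWAIR (supplementary plane)
]

for ch in samples:
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-be")   # big-endian, no byte-order mark
    utf32 = ch.encode("utf-32-be")
    print(f"U+{ord(ch):04X} plane {plane_of(ch)}: "
          f"UTF-8 {len(utf8)} bytes, UTF-16 {len(utf16)} bytes, "
          f"UTF-32 {len(utf32)} bytes")

# Characters outside the basic multilingual plane need a surrogate
# pair (two 16-bit units) in UTF-16.
surrogate_pair = "\U00010348".encode("utf-16-be").hex()
print(surrogate_pair)
```

UTF-32 is fixed-width but wasteful for mostly-ASCII text, while UTF-8 and UTF-16 are variable-width; which is more compact depends on the scripts a collection contains.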

  10. Representing Documents • Plain text • Full-text indexing • Bag of words • Inversion of the text • Inverted files • Granularity of document • Granularity of index • Word segmentation • Chinese and Japanese are written without spaces • Spacing in Chinese sentences can completely change the meaning
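A toy version of full-text indexing by inversion might look like the following. It treats each document as a bag of words and indexes at document granularity; the function names and the three sample documents are hypothetical, and the regular-expression tokenizer assumes words are separated by spaces or punctuation, which is exactly the assumption that fails for Chinese and Japanese text.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Split text into lowercase word tokens (space-delimited scripts only)."""
    return re.findall(r"\w+", text.lower())

def build_inverted_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

docs = {
    1: "Digital libraries store documents",
    2: "Documents are the raw material",
    3: "Libraries index raw text",
}
index = build_inverted_index(docs)
print(search(index, "raw documents"))
print(search(index, "libraries"))
```

A finer-grained index would store word positions within each document as well, which costs more space but allows phrase and proximity queries.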

  11. Page Description Languages • Device independence • PostScript (also a programming language) • First commercially developed page description language (1985) • Fonts: Type 1, TrueType, OpenType • Text extraction • Using PostScript in a digital library

  12. Page Description Languages • Portable Document Format (PDF) • Successor to PostScript • PDF versus PostScript • Not a full-scale programming language • New features for interactive display • Random access to pages • Hierarchically structured content • Navigation within a document • Hyperlinks • File format: header, objects, cross-references, trailer • Searchable image option

  13. Word-Processor Documents • Rich Text Format • 240-page specification • Document-level metadata • Software for conversion to HTML is available • Native Word formats • Binary • Proprietary • LaTeX format • Typed formatting commands • Non-proprietary

  14. Representing Images • Lossless image compression • GIF • PNG • JPEG-Lossless • JPEG-2000 • Lossy image compression • JPEG • Progressive refinement

  15. Representing Audio and Video • Evolution of signals over time • Sample rate • Samples per second • Multimedia compression • Codec • Asymmetry • Redundancy
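To see why sample rate matters and why compression is needed, the uncompressed data rate of sampled audio can be worked out as sample rate × bits per sample × channels. The sketch below uses the well-known CD audio parameters (44,100 samples per second, 16 bits, stereo) as an example; they are not stated on the slide, and the function names are made up for illustration.

```python
def pcm_bit_rate(sample_rate_hz, bits_per_sample, channels):
    """Uncompressed (PCM) data rate in bits per second."""
    return sample_rate_hz * bits_per_sample * channels

def pcm_bytes(seconds, sample_rate_hz, bits_per_sample, channels):
    """Storage in bytes for a given duration of uncompressed audio."""
    return pcm_bit_rate(sample_rate_hz, bits_per_sample, channels) * seconds // 8

# CD-quality stereo audio
cd_rate = pcm_bit_rate(44_100, 16, 2)
cd_minute = pcm_bytes(60, 44_100, 16, 2)

print(f"{cd_rate:,} bit/s")           # 1,411,200 bit/s
print(f"{cd_minute:,} bytes/minute")  # 10,584,000 bytes/minute
```

At roughly 10 MB per minute before compression, audio collections of any size quickly make codec choice a practical concern.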

  16. MPEG • ISO Moving Picture Experts Group (1988) • Audio and Video at 1.5 Mbit/second • Family of standards • MPEG-1 • Low resolution video, 30 fps, near CD quality • Layer 3 – MP3 • MPEG-2 • Higher quality video (DVD) • Supports interlaced images (Broadcast TV) • Multichannel audio

  17. MPEG • MPEG-3 abandoned • MPEG-4 • Low bandwidth networks – mobile and WWW • Object based (vs. signal based) • Interactive • Strategies for identifying and managing intellectual property • MPEG-7 • Metadata description for content delivered via MPEG-1,2,4 • MPEG-21 • Multimedia lifecycle • Interoperability

  18. MPEG • Television standards: NTSC, PAL, SECAM • Digital television and video standard: CCIR 601 • MPEG video • Frames: intra (I), predicted (P), bidirectional (B) • MPEG audio • Acoustic masking • Three compression layers • Mixed media • Time-stamped packets are multiplexed into a single stream • Typically video takes 1.2 Mbit/s and audio takes 0.3 Mbit/s
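The idea of multiplexing time-stamped packets into one stream can be sketched as a merge of two already-ordered packet lists. This is only a toy model: the tuple layout, timestamps, and payload labels are invented for the example and bear no relation to the actual MPEG systems-layer packet format.

```python
import heapq

def multiplex(*streams):
    """Merge packet streams, each sorted by timestamp, into one stream."""
    return list(heapq.merge(*streams, key=lambda packet: packet[0]))

# Hypothetical packets: (timestamp in seconds, kind, payload label)
video = [(0.000, "video", "v0"), (0.040, "video", "v1"), (0.080, "video", "v2")]
audio = [(0.000, "audio", "a0"), (0.026, "audio", "a1"), (0.052, "audio", "a2")]

stream = multiplex(video, audio)
for timestamp, kind, payload in stream:
    print(f"{timestamp:.3f} {kind} {payload}")
```

A decoder reading such a stream can route each packet to the audio or video decoder by its kind and use the timestamps to keep the two in sync.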

  19. Other Multimedia Formats • Audio and Video • AVI (Microsoft) • QuickTime (Apple) • Streaming • RealAudio, RealVideo, RealOne (RealNetworks) • ASF (Microsoft) • Audio only • WAV (Microsoft, IBM) • AIFF (Apple) • AU (Sun)

  20. Multimedia in a Digital Library • Indexing and browsing structures • Text-based • Content-based • Summarizing audio and video • Digitizing media • Linear resolution, color depth, frame rate, sample rate • Preservation issues
