Chapter Four Documents: The raw material How to Build a Digital Library Ian H. Witten and David Bainbridge
Documents • Building blocks of digital libraries • Many different standards for documents • Internationalization • Fixed versus fluid • Permanent versus transient • Indexing
Standards Organizations • American National Standards Institute (ANSI) • International Organization for Standardization (ISO)
Representing Characters • EBCDIC • Extended Binary Coded Decimal Interchange Code • Represented in 8 bits • ASCII (1968) • American Standard Code for Information Interchange • Represented with 7 bits • Does not support many foreign languages • Many expansions made to the basic ASCII character set • ISCII (1983) • Indian Script Code for Information Interchange • Hindi and related languages • GB and Big-5 for Chinese
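As a small illustration of the 7-bit limit mentioned above, the following sketch checks whether a string fits in ASCII. The helper name `is_ascii_7bit` and the sample strings are assumptions made for this example, not part of the slides.

```python
def is_ascii_7bit(text: str) -> bool:
    """Return True if every character has a code below 128 (7 bits)."""
    return all(ord(ch) < 128 for ch in text)


# Hypothetical sample strings: English, French, and Hindi (Devanagari).
samples = ["digital library", "biblioth\u00e8que", "\u092a\u0941\u0938\u094d\u0924\u0915"]

for sample in samples:
    print(repr(sample), "fits in ASCII:", is_ascii_7bit(sample))

# Encoding to ASCII raises an error for any character outside the 7-bit range.
try:
    "biblioth\u00e8que".encode("ascii")
except UnicodeEncodeError as err:
    print("cannot encode:", err.reason)
```

Only the English sample passes, which is why extensions such as ISCII and, later, Unicode were needed.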
Unicode • Successor to ASCII • ISO 10646 (1993) • Universal • Aims to represent all the world’s languages • Default encoding for HTML and XML • Development began in 1988 as a joint effort between Apple and Xerox • Unicode standard continues to evolve • Round-trip compatibility: text in an established character set can be converted to Unicode and back without loss
Unicode Character Set • Unicode standard is massive • Two parts of the standard: ISO 10646-1 and ISO 10646-2 • 94,000 characters defined • Represents scripts • Scripts versus languages • Punctuation shared among scripts • Universal character set: the characters at the core of Unicode
Five Zones of Unicode • Alphabetic scripts (Western languages, Latin, Greek, Cyrillic, Hebrew and Arabic) • Ideographic scripts (Chinese, Japanese, Korean) • Other characters (Braille, mathematical symbols) • Surrogates • Reserved codes
Composite and Combining Characters: Unicode Terms • Character: abstract form of a letter • Glyph: a particular rendition of a character on a page • Different fonts produce different glyphs • Unicode does not distinguish between different glyphs • Characters are abstract members of linguistic scripts, not graphic entities • Code Point: a Unicode value, written in hexadecimal with the prefix U+ • Includes ligatures and combining diaeresis • Canonical and compatibility equivalence • Deprecated characters • Code Range: range of values that characters span
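The terms above can be made concrete with Python's standard `unicodedata` module. This is a minimal sketch: the choice of "ü" and the "fi" ligature as examples is an assumption for illustration, and the helper `code_points` is hypothetical.

```python
import unicodedata

precomposed = "\u00fc"   # LATIN SMALL LETTER U WITH DIAERESIS (one code point)
combining = "u\u0308"    # 'u' followed by COMBINING DIAERESIS (two code points)
ligature = "\ufb01"      # LATIN SMALL LIGATURE FI


def code_points(text: str) -> list[str]:
    """List the code points of a string in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


print(code_points(precomposed), code_points(combining))

# The two strings look the same but differ code point by code point.
print("raw equality:", precomposed == combining)

# Canonical equivalence: NFC composes, NFD decomposes.
print("NFC equal:", unicodedata.normalize("NFC", combining) == precomposed)
print("NFD equal:", unicodedata.normalize("NFD", precomposed) == combining)

# Compatibility equivalence: only the NFK* forms replace the ligature.
print("NFC keeps ligature:", unicodedata.normalize("NFC", ligature) == ligature)
print("NFKC expands it:", unicodedata.normalize("NFKC", ligature))
```

A digital library that indexes or searches text would typically normalize both documents and queries to one form so that equivalent strings match.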
Unicode Character Encoding • UTF: Unicode character set Transformation Format • UTF-32 • ISO standard uses a 32 bit (4 byte) value • Unicode consortium uses 21 bits • A 5-bit plane number plus a 16-bit offset could address 32 planes of 65,536 characters; Unicode defines 17 planes (U+0000 to U+10FFFF) • Basic multilingual plane (living languages) • Supplementary multilingual plane (historic scripts, other alphabets) • Supplementary ideographic plane (ancient Chinese ideographs) • Supplementary special-purpose plane (tags for languages) • UTF-16, UTF-8 • Hindi and related scripts (ISCII)
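The relationship between planes and the three transformation formats can be sketched in a few lines of Python. The helper names `plane` and `encoded_sizes` and the sample characters are assumptions chosen for this example.

```python
def plane(ch: str) -> int:
    """Plane number: the bits above the low 16 bits of the code point."""
    return ord(ch) >> 16


def encoded_sizes(ch: str) -> dict[str, int]:
    """Bytes needed for one character in each encoding (no byte-order mark)."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-be")),
        "utf-32": len(ch.encode("utf-32-be")),
    }


samples = {
    "Latin A": "A",                        # U+0041, basic multilingual plane
    "Devanagari A": "\u0905",              # U+0905, basic multilingual plane
    "CJK ideograph": "\u4e2d",             # U+4E2D, basic multilingual plane
    "Linear B syllable": "\U00010000",     # supplementary multilingual plane
    "CJK Extension B": "\U00020000",       # supplementary ideographic plane
}

for name, ch in samples.items():
    print(f"{name}: U+{ord(ch):04X} plane {plane(ch)} {encoded_sizes(ch)}")
```

UTF-32 always uses four bytes, UTF-16 uses two bytes inside the basic multilingual plane and four (a surrogate pair) outside it, and UTF-8 uses one to four bytes depending on the code point.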
Representing Documents • Plain text • Full-text indexing • Bag of words • Inversion of the text • Inverted files • Granularity of document • Granularity of index • Word segmentation • Chinese and Japanese are written without spaces • Spacing in Chinese sentences can completely change the meaning
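Inverting a text collection can be sketched as follows. This is a minimal illustration, not the indexing scheme of any particular digital library system: the function names, the toy documents, and the choice of whole documents as the index granularity are all assumptions.

```python
import re
from collections import defaultdict


def build_inverted_index(documents: dict[int, str]) -> dict[str, set[int]]:
    """Map each word to the set of document ids that contain it."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in documents.items():
        # Treat each document as a bag of lowercase words.
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(doc_id)
    return index


def search(index: dict[str, set[int]], query: str) -> set[int]:
    """Return ids of documents that contain every word in the query."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


docs = {
    1: "Digital libraries store documents",
    2: "Documents are the raw material",
    3: "Libraries index the raw text",
}
index = build_inverted_index(docs)
print(search(index, "raw documents"))
print(search(index, "libraries"))
```

The regular expression splits on non-word characters, which works for space-delimited languages only. Chinese and Japanese text would need a separate word segmentation step before indexing.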
Page Description Languages • Device independence • PostScript (also a programming language) • First commercially developed page description language (1985) • Fonts: Type 1, TrueType, OpenType • Text extraction • Using PostScript in a digital library
Page Description Languages • Portable Document Format (PDF) • Successor to PostScript • PDF versus PostScript • Not a full-scale programming language • New features for interactive display • Random access to pages • Hierarchically structured content • Navigation within a document • Hyperlinks • File format: header, objects, cross-references, trailer • Searchable image option
Word-Processor Documents • Rich Text Format • 240-page specification • Document-level metadata • Software for conversion to HTML is available • Native Word formats • Binary • Proprietary • LaTeX format • Typed formatting commands • Non-proprietary
Representing Images • Lossless image compression • GIF • PNG • JPEG-Lossless • JPEG-2000 • Lossy image compression • JPEG • Progressive refinement
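The difference between lossless and lossy compression can be shown on a toy byte sequence. This sketch uses `zlib` (the DEFLATE method that PNG builds on) for the lossless case and simple quantization for the lossy case. Quantization here is a stand-in chosen for brevity and is not the transform coding that JPEG actually uses; the pixel values and the helper `quantize` are assumptions.

```python
import zlib

# Hypothetical 8-bit grayscale pixel values, repeated to give zlib something to work with.
pixels = bytes([10, 10, 10, 12, 200, 200, 201, 255] * 64)

# Lossless: decompression restores the input exactly.
compressed = zlib.compress(pixels)
restored = zlib.decompress(compressed)
print("lossless round trip exact:", restored == pixels)
print("sizes:", len(pixels), "->", len(compressed))


def quantize(data: bytes, step: int) -> bytes:
    """Lossy step: round each value down to a multiple of `step`."""
    return bytes((value // step) * step for value in data)


# Lossy: fine detail is discarded before compression and cannot be recovered.
coarse = quantize(pixels, 16)
print("lossy result differs from original:", coarse != pixels)
print("largest per-pixel error:", max(abs(a - b) for a, b in zip(pixels, coarse)))
```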
Representing Audio and Video • Evolution of signals over time • Sample rate • Samples per second • Multimedia compression • Codec • Asymmetry • Redundancy
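Sample rate translates directly into data volume, which is why compression matters for audio and video. The arithmetic below is a sketch that assumes uncompressed PCM audio with CD parameters (44,100 samples per second, 16 bits per sample, 2 channels); these figures and the function names are assumptions not stated on the slide.

```python
def pcm_bit_rate(sample_rate_hz: int, bits_per_sample: int, channels: int) -> int:
    """Bits per second of uncompressed PCM audio."""
    return sample_rate_hz * bits_per_sample * channels


def storage_bytes(bit_rate: int, seconds: int) -> int:
    """Bytes needed to store a signal at the given bit rate."""
    return bit_rate * seconds // 8


# Assumed CD audio parameters.
cd_rate = pcm_bit_rate(44_100, 16, 2)
print("CD audio bit rate:", cd_rate, "bit/s")
print("one minute of CD audio:", storage_bytes(cd_rate, 60), "bytes")
```

At roughly 1.4 Mbit/s, one minute of uncompressed stereo audio occupies about 10 MB.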
MPEG • ISO Moving Picture Experts Group (1988) • Audio and Video at 1.5 Mbit/second • Family of standards • MPEG-1 • Low resolution video, 30 fps, near CD quality • Layer 3 – MP3 • MPEG-2 • Higher quality video (DVD) • Supports interlaced images (Broadcast TV) • Multichannel audio
MPEG • MPEG-3 abandoned • MPEG-4 • Low bandwidth networks – mobile and WWW • Object based (vs. signal based) • Interactive • Strategies for identifying and managing intellectual property • MPEG-7 • Metadata description for content delivered via MPEG-1,2,4 • MPEG-21 • Multimedia lifecycle • Interoperability
MPEG • Television standards: NTSC, PAL, SECAM • Digital television and video standard: CCIR 601 • MPEG video • Frames: intra (I), predicted (P), bidirectional (B) • MPEG audio • Acoustic masking • Three compression layers • Mixed media • Time-stamped packets are multiplexed into a single stream • Typically, video takes 1.2 Mbit/s and audio takes 0.3 Mbit/s
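Multiplexing time-stamped packets can be sketched as a merge of two sorted sequences. This is a simplified illustration and not the actual MPEG systems layer: the `Packet` type, the timestamps, and the payloads are hypothetical.

```python
import heapq
from typing import Iterable, NamedTuple


class Packet(NamedTuple):
    timestamp_ms: int
    stream: str
    payload: bytes


def multiplex(*streams: Iterable[Packet]) -> list[Packet]:
    """Merge streams, each already sorted by timestamp, into one ordered stream."""
    return list(heapq.merge(*streams, key=lambda packet: packet.timestamp_ms))


# Hypothetical packets: video every 40 ms, audio every 26 ms.
video = [Packet(t, "video", b"V") for t in (0, 40, 80)]
audio = [Packet(t, "audio", b"A") for t in (0, 26, 52, 78)]

muxed = multiplex(video, audio)
for packet in muxed:
    print(packet.timestamp_ms, packet.stream)

# The typical figures quoted above add up to the 1.5 Mbit/s MPEG-1 target.
total_mbit_per_s = round(1.2 + 0.3, 1)
print("combined rate:", total_mbit_per_s, "Mbit/s")
```

A decoder reading the single stream can use the timestamps to route packets back to the audio and video decoders and keep them synchronized.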
Other Multimedia Formats • Audio and Video • AVI (Microsoft) • QuickTime (Apple) • Streaming • RealAudio, RealVideo, RealOne (RealNetworks) • ASF (Microsoft) • Audio only • WAV (Microsoft, IBM) • AIFF (Apple) • AU (Sun)
Multimedia in a Digital Library • Indexing and browsing structures • Text-based • Content-based • Summarizing audio and video • Digitizing media • Linear resolution, color depth, frame rate, sample rate • Preservation issues