lis508 lecture 2: characters to textual documents Thomas Krichel 2002-09-30
Structure • Character sets • Coded character set • Character encoding
Literature • Norton “new inside the PC” chapter 4 • http://www.danbbs.dk/~erikoest/bb_terms.htm • http://wwwinfo.cern.ch/asdoc/WWW/publications/ictp99/ictp99N2705.html • http://www.cl.cam.ac.uk/~mgk25/unicode.html
Recall from last lecture • UCS is a character set defined by the ISO. • The most important characters are in the basic multilingual plane (BMP). It has room for 2^16 = 65536 characters. • UCS characters in the BMP can be represented by two bytes. • Characters outside the BMP need more space.
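The two-byte claim can be checked with a short Python sketch. It assumes UTF-16 (big-endian, no byte-order mark) as the two-byte representation, which the slide does not name; the helper names `in_bmp` and `utf16_byte_length` are made up for this illustration.

```python
def in_bmp(ch: str) -> bool:
    """True if the character's code point lies in the basic multilingual plane."""
    return ord(ch) <= 0xFFFF


def utf16_byte_length(ch: str) -> int:
    """Number of bytes the character occupies in UTF-16 (big-endian, no BOM)."""
    return len(ch.encode("utf-16-be"))


# Three BMP characters and one character outside the BMP (a musical symbol).
samples = ["A", "\u00e9", "\u20ac", "\U0001D11E"]

for ch in samples:
    print(f"U+{ord(ch):04X}  in BMP: {in_bmp(ch)}  bytes: {utf16_byte_length(ch)}")
```

Under this assumption the BMP characters take two bytes each, while the character outside the BMP takes four.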
Unicode • The Unicode Consortium is an industry consortium. • The Unicode Standard published by the Unicode Consortium corresponds to the BMP of ISO 10646. All characters are at the same positions and have the same names in both standards. • In addition, the Unicode Standard defines much more semantics associated with some of the characters. There is a free online book at http://www.unicode.org/unicode/uni2book/u2.html
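Position, name and extra semantics can be looked up programmatically. The following is a minimal sketch using Python's standard `unicodedata` module, which ships the Unicode character database; it is not part of the original lecture, and the helper `describe` is a hypothetical name. The general category stands in here for the "semantics" the standard attaches to a character.

```python
import unicodedata


def describe(ch: str) -> tuple[str, str, str]:
    """Return (position, official name, general category) for one character."""
    position = f"U+{ord(ch):04X}"
    return position, unicodedata.name(ch), unicodedata.category(ch)


for ch in ["A", "\u20ac"]:
    print(describe(ch))
```

For example, `describe("A")` gives the position `U+0041`, the name `LATIN CAPITAL LETTER A` and the category `Lu` (uppercase letter).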
Application • Word and Wordpad give the option to input Unicode characters • Insert symbol • Hex sequence followed by ALT-X • You may not see the character if you do not have a font for it. • Wordpad and Notepad allow you to save the Unicode file in various encodings. When in doubt, use Unicode UTF-8. • likely to be the most widely supported • does not alter ASCII text
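Both points on this slide can be illustrated in Python: turning a hex sequence into a character, and checking that UTF-8 leaves ASCII text byte-for-byte unchanged. This is only a sketch; `from_hex` is a hypothetical helper that mimics what the hex-plus-ALT-X shortcut does, not how Word implements it.

```python
def from_hex(hex_code: str) -> str:
    """Turn a hex code point such as '20AC' into the corresponding character."""
    return chr(int(hex_code, 16))


ascii_text = "plain ASCII text"
euro = from_hex("20AC")  # the euro sign

# ASCII text has exactly the same bytes in ASCII and in UTF-8.
same_bytes = ascii_text.encode("utf-8") == ascii_text.encode("ascii")
print("ASCII unchanged under UTF-8:", same_bytes)

# A non-ASCII character is stored differently depending on the encoding.
print("UTF-8 bytes:    ", euro.encode("utf-8").hex())
print("UTF-16-LE bytes:", euro.encode("utf-16-le").hex())
```

The same character therefore produces different files depending on the encoding chosen when saving, which is why the choice matters.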
What is a textual document? • A text is a sequence of characters. • A textual document is a text with some formatting • Font • Font shape (e.g. italics) • Spacing and other “lay-out” issues • Why are librarians concerned about textual documents?
Creation of textual documents • Pure text editors only create text. • Usually text is created with word-processing software. This surrounds the text with digital gibberish that explains the formatting. • Formatting instructions are dependent on the word-processing software. • Why is this bad?
Storing of textual documents • The most widely used format is PDF • It is based on PostScript, a language that describes documents. • Support for fonts • Support for inclusion of non-textual files • PDF compresses PostScript files • Proprietary format owned by Adobe Inc. • Requires special software • Also bad for digital preservation
http://openlib.org/home/krichel Thank you for your attention!