lis508 lecture 2: characters to textual documents

lis508 lecture 2: characters to textual documents. Thomas Krichel 2002-09-30. Structure. Character sets Coded character set Character encoding. Literature. Norton “new inside the PC” chapter 4 http://www.danbbs.dk/~erikoest/bb_terms.htm

Presentation Transcript


  1. lis508 lecture 2: characters to textual documents Thomas Krichel 2002-09-30

  2. Structure • Character sets • Coded character set • Character encoding

  3. Literature • Norton “new inside the PC” chapter 4 • http://www.danbbs.dk/~erikoest/bb_terms.htm • http://wwwinfo.cern.ch/asdoc/WWW/publications/ictp99/ictp99N2705.html • http://www.cl.cam.ac.uk/~mgk25/unicode.html

  4. Recall from last lecture • UCS is a character set defined by the ISO. • The most important characters are in the basic multilingual plane (BMP). It has 2^16 = 65536 code positions. • UCS characters in the BMP can be represented by two bytes. • Other characters need more space.
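The two-byte claim can be checked with a minimal Python sketch. It assumes the UTF-16 big-endian codec as a stand-in for the two-byte UCS representation (the two agree for BMP characters); the two sample characters are chosen here for illustration and do not come from the lecture.

```python
# Sample characters picked for illustration (not from the lecture).
bmp_char = "\u20ac"          # EURO SIGN, U+20AC, inside the BMP
non_bmp_char = "\U0001d11e"  # MUSICAL SYMBOL G CLEF, U+1D11E, outside the BMP

# Number of code positions in the BMP.
bmp_positions = 2 ** 16

# Assumption: UTF-16 (big-endian) stands in for the two-byte UCS encoding.
bmp_bytes = bmp_char.encode("utf-16-be")
non_bmp_bytes = non_bmp_char.encode("utf-16-be")

print(bmp_positions)                            # 65536
print(len(bmp_bytes), bmp_bytes.hex())          # 2 20ac
print(len(non_bmp_bytes), non_bmp_bytes.hex())  # 4 d834dd1e
```

The BMP character fits in two bytes, while the character outside the BMP needs four bytes in this encoding.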

  5. Unicode • The Unicode Consortium is an industry consortium. • The Unicode Standard published by the Unicode Consortium corresponds to the BMP of ISO 10646. All characters are at the same positions and have the same names in both standards. • In addition, the Unicode Standard defines much more semantics associated with some of the characters. There is a free online book at http://www.unicode.org/unicode/uni2book/u2.html

  6. Application • Word and Wordpad give the option to input Unicode characters • Insert symbol • Hex sequence followed by ALT-X • You may not see the character if you do not have a font for it. • Wordpad and Notepad allow you to save the Unicode file in various encodings. When in doubt, use Unicode UTF-8. • likely to be the most widely supported • leaves ASCII text unchanged
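Why UTF-8 is safe for ASCII text can be seen in a short Python sketch. The sample string and the code point 00E9 are assumptions chosen for illustration; the hex value plays the role of the sequence typed before ALT-X.

```python
# Pure ASCII text produces exactly the same bytes in ASCII and in UTF-8.
ascii_text = "Hello"
ascii_bytes = ascii_text.encode("ascii")
utf8_bytes = ascii_text.encode("utf-8")
print(ascii_bytes == utf8_bytes)  # True

# A character given by its hex code point, as with the ALT-X input method.
char = chr(0x00E9)  # LATIN SMALL LETTER E WITH ACUTE

# The same character saved in two different encodings gives different bytes.
as_utf8 = char.encode("utf-8")
as_utf16 = char.encode("utf-16-le")
print(as_utf8.hex())   # c3a9
print(as_utf16.hex())  # e900
```

A file containing only ASCII characters is therefore byte-for-byte identical whether it is saved as ASCII or as UTF-8, while non-ASCII characters get a multi-byte sequence.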

  7. Textual documents

  8. What is a textual document? • A text is a sequence of characters. • A textual document is a text with some formatting • Font • Font shape (e.g. italics) • Spacing and other “lay-out” issues • Why are librarians concerned about textual documents?

  9. Creation of textual documents • Pure text editors only create text. • Usually text is created with word processing software. This surrounds text with digital gibberish that explains the formatting. • Formatting instructions depend on the word processing software. • Why is this bad?

  10. Storing of textual documents • Most widely used is PDF • It is based on a language called PostScript that describes documents. • Support for fonts • Support for inclusion of non-textual files • PDF compresses PostScript files • Proprietary format owned by Adobe Inc. • Requires special software • Also bad for digital preservation

  11. http://openlib.org/home/krichel Thank you for your attention!
