Encoding and fonts • Edward Garrett, Software Developer, ELAR
Some issues • Data types: • diverse scripts • multilingual data • IPA and other transcriptional notations • Modes: • representation (in some scheme) • storage (using some encoding) • display (in browser, word processor, etc.) • input (with various OS, keyboards, etc.) • Your issues and challenges: • data/problems to look at now? • Friday AM advice clinic
Representing data • Symbols • Encoding: character sets, Unicode • Fonts • Relationships (e.g. links) • Structures (e.g. hierarchies)
Representing textual data • Plain text • Lacks formatting information • Transfer between applications • Internal memory • Saved in files • Encodings • Unicode • Markup • XML • HTML
Plain text • What is it? • Try saving a document as plain text … in TextEdit …
Definitions • Background on digital data storage: • Bit: 0, 1 • Byte: 8 bits, e.g. 00101100 • Definitions from Jukka “Yucca” Korpela’s article: • Character repertoire: a set of characters (from one or more scripts) constituting the data that can be represented • Character code: a mapping that gives each character in a repertoire a distinct numeric identifier • Character encoding: a method of mapping sequences of character codes into sequences of bytes
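The three definitions above can be seen side by side in a short Python sketch. The sample character is chosen here purely for illustration; it is not taken from the slides.

```python
# One character viewed at the three levels defined above.
char = "\u00e9"                            # a member of the repertoire: é
code = ord(char)                           # its character code (a number)
utf8_bytes = char.encode("utf-8")          # one encoding of that code
latin1_bytes = char.encode("iso-8859-1")   # a different encoding of the same code

print(f"character code: U+{code:04X}")
print("UTF-8 bytes:     ", utf8_bytes.hex(" "))
print("ISO 8859-1 bytes:", latin1_bytes.hex(" "))
```

The character code stays the same in both cases; only the byte sequence that stores it differs.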
Character encodings: ISO 8859-1 • ISO 8859-1: uses 1 byte (8 bits) to encode characters for most of the Western European languages
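A minimal Python sketch of what "1 byte per character" means in practice, and of what happens with a character outside the ISO 8859-1 repertoire. The helper name `fits_latin1` and the sample strings are illustrative assumptions.

```python
def fits_latin1(text):
    """Return True if every character of text can be stored in ISO 8859-1."""
    try:
        text.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False

western = "na\u00efve caf\u00e9"          # naïve café
encoded = western.encode("iso-8859-1")

print(len(western), "characters ->", len(encoded), "bytes")
print(fits_latin1(western))               # Western European text fits
print(fits_latin1("\u0259"))              # IPA schwa is outside the repertoire
```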
Unicode • International standard (ISO 10646) • Industry standard (Unicode Consortium) • Aims to code all characters from all of the world’s scripts - over 1 million code points • Privileges character semantics, not glyphic representations • Multiple encoding methods • Referencing a character: U+nnnn (in hexadecimal, base 16) • Most commonly used characters are in the Basic Multilingual Plane (first 65,536 code points)
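The U+nnnn notation and the BMP boundary can be checked with a few lines of Python. This is a sketch; the helper names `u_plus` and `in_bmp` are made up for this example.

```python
import sys

def u_plus(ch):
    """Format a character's code point in U+nnnn notation (hexadecimal)."""
    return f"U+{ord(ch):04X}"

def in_bmp(ch):
    """True if the character lies in the Basic Multilingual Plane."""
    return ord(ch) <= 0xFFFF

# Code points run from 0 to sys.maxunicode, so there are maxunicode + 1 of them.
total_code_points = sys.maxunicode + 1

for ch in ["A", "\u0250", "\U00010330"]:
    print(u_plus(ch), "in BMP" if in_bmp(ch) else "outside BMP")
print(total_code_points, "code points in total")
```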
Unicode encodings • UTF-32: maps each code to 4 bytes; inefficient, as most commonly used characters are in the BMP • UTF-16: maps each code to either one 2-byte unit or two (a surrogate pair): • efficient and widely used • Good for the BMP • UTF-8: maps each code to 1-4 bytes • Particularly compact for Western European languages • Most widely supported across various internet protocols
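To compare the three encodings, the following Python sketch counts the bytes each one needs for a few sample characters. The samples and the helper name `byte_lengths` are assumptions for illustration; the big-endian codec variants are used so that no byte order mark is added to the count.

```python
def byte_lengths(ch):
    """Number of bytes needed for one character in each Unicode encoding."""
    encodings = ("utf-8", "utf-16-be", "utf-32-be")
    return {enc: len(ch.encode(enc)) for enc in encodings}

samples = {
    "Latin A": "A",
    "e with acute": "\u00e9",
    "Tibetan ka": "\u0f40",
    "Gothic ahsa (outside BMP)": "\U00010330",
}

for label, ch in samples.items():
    print(label, byte_lengths(ch))
```

Only the character outside the BMP needs two 2-byte units in UTF-16, while UTF-32 spends 4 bytes on every character regardless.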
Character semantics vs. glyphs • No difference between e, e, and e (the same character, whatever the typeface or style) • IPA letter [c], voiceless palatal plosive, is the same character as Roman c • No separate characters for cursive scripts or joined-up handwriting
Character semantics vs. glyphs • Examples • U+0041 LATIN CAPITAL LETTER A • U+0410 CYRILLIC CAPITAL LETTER A • U+0391 GREEK CAPITAL LETTER ALPHA • IPA digraphs • “Never use a character just because it looks right.”
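The three capital letters listed above usually look identical on screen, yet they are distinct characters. A short Python sketch using the standard `unicodedata` module makes the difference visible:

```python
import unicodedata

# Three characters that typically share one glyph shape.
lookalikes = ["\u0041", "\u0410", "\u0391"]
names = [unicodedata.name(ch) for ch in lookalikes]

for ch, name in zip(lookalikes, names):
    print(f"U+{ord(ch):04X}  {name}")

# Despite looking alike, no two of them compare equal.
all_distinct = len(set(lookalikes)) == len(lookalikes)
print("all distinct:", all_distinct)
```

A search for a word typed with the Latin letter will not find the same-looking word typed with the Cyrillic or Greek one, which is why choosing a character by appearance causes trouble.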
Precomposed characters • Complex characters combining a base character with one or more diacritics - treated as equivalent to their decomposed sequences • A relevant case study: "Challenges in Writing Bih" [http://test.elar.soas.ac.uk/node/5]
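The equivalence between precomposed and decomposed forms can be sketched in Python with `unicodedata.normalize`. The sample characters are chosen for illustration and are not taken from the Bih case study.

```python
import unicodedata

precomposed = "\u00e9"        # é as a single precomposed character
decomposed = "e\u0301"        # e followed by COMBINING ACUTE ACCENT

# The raw strings differ, even though they are canonically equivalent.
raw_equal = precomposed == decomposed

# Normalising both to the same form (NFC or NFD) makes them comparable.
nfc_equal = unicodedata.normalize("NFC", decomposed) == precomposed
nfd_equal = unicodedata.normalize("NFD", precomposed) == decomposed

# A base character with two diacritics: e with circumflex and acute.
stacked = unicodedata.normalize("NFD", "\u1ebf")

print(raw_equal, nfc_equal, nfd_equal)
print([f"U+{ord(c):04X}" for c in stacked])
```

Normalising text to one form before comparing or searching avoids missing matches that differ only in how the diacritics are stored.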
Compatibility characters • Similar to their decompositions, but not equivalent; they include extra information (formatting, etc.)
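A minimal Python sketch of the difference: canonical normalisation (NFC) leaves a compatibility character alone, while compatibility normalisation (NFKC) replaces it with its decomposition and drops the extra information. The ligature and the superscript digit are illustrative choices.

```python
import unicodedata

ligature = "\ufb01"      # LATIN SMALL LIGATURE FI
superscript = "\u00b2"   # SUPERSCRIPT TWO

# Canonical normalisation does not touch compatibility characters.
nfc_ligature = unicodedata.normalize("NFC", ligature)

# Compatibility normalisation replaces them, losing the formatting information.
nfkc_ligature = unicodedata.normalize("NFKC", ligature)
nfkc_superscript = unicodedata.normalize("NFKC", superscript)

print(nfc_ligature == ligature)
print(nfkc_ligature, nfkc_superscript)
```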
Pre-composed and compatibility characters • Why do they exist, if counter to Unicode’s focus on character semantics over glyphic representation? • Compatibility with prior encodings • No such new characters will be accepted into Unicode
Things to watch out for • An example to illustrate the difference between: • Text rendering • Document encoding
Take away message • Just because characters aren’t rendered properly doesn’t mean that they aren’t there. • Just because characters are rendered properly doesn’t guarantee that they will stay that way. • Beware your platform’s default encoding (probably not Unicode).
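The first point can be sketched in Python: text saved as UTF-8 and then read with the wrong encoding is displayed incorrectly, but the original bytes are untouched and the text can be recovered. ISO 8859-1 is assumed here as the wrong guess; a real platform default may be a different encoding.

```python
original = "caf\u00e9"                     # café
stored = original.encode("utf-8")          # the bytes actually saved in the file

# Reading the bytes with the wrong encoding produces garbled text.
misread = stored.decode("iso-8859-1")

# Reading the very same bytes with the right encoding restores the text.
reread = stored.decode("utf-8")

# The garbled text can be repaired by reversing the wrong decoding step.
repaired = misread.encode("iso-8859-1").decode("utf-8")

print(misread)
print(reread == original, repaired == original)
```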
Adding markup • Not only should the document be encoded in Unicode, it must also declare its encoding (e.g. in the XML declaration or an HTML meta tag).
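As a sketch of a self-declaring document, the following Python code writes a small XML document as UTF-8 with an explicit XML declaration, using the standard `xml.etree.ElementTree` module. The element name `text` and the sample content are assumptions for illustration.

```python
import io
import xml.etree.ElementTree as ET

root = ET.Element("text")                  # hypothetical element name
root.text = "caf\u00e9 \u0259"             # non-ASCII content: é and schwa

# Serialise as UTF-8 and ask for the XML declaration to be written.
buffer = io.BytesIO()
ET.ElementTree(root).write(buffer, encoding="utf-8", xml_declaration=True)
data = buffer.getvalue()

# A parser can now read the encoding from the document itself.
parsed = ET.fromstring(data)

print(data.decode("utf-8"))
print(parsed.text == root.text)
```

Without the declaration, a reader of the file has to guess the encoding, and a wrong guess leads to garbled display of the non-ASCII characters.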
Exercises • What's wrong with these Unicode words? • Character encoding exercises I • http://test.elar.soas.ac.uk/taxonomy/term/1