Encoding and fonts Edward Garrett Software Developer, ELAR

Presentation Transcript


  1. Encoding and fonts • Edward Garrett • Software Developer, ELAR

  2. Some issues • Data types: • diverse scripts • multilingual data • IPA and other transcriptional notations • Modes: • representation (in some scheme) • storage (using some encoding) • display (in browser, word processor, etc.) • input (with various OS, keyboards, etc.) • Your issues and challenges: • data/problems to look at now? • Friday AM advice clinic

  3. Representing data • Symbols • Encoding: character sets, Unicode • Fonts • Relationships (eg links) • Structures (eg hierarchies)

  4. Representing textual data • Plain text • Lacks formatting information • Transfer between applications • Internal memory • Saved in files • Encodings • Unicode • Markup • XML • HTML

  5. Plain text • What is it? • Try saving a document as plain text … in TextEdit …

  6. Definitions • Background on digital data storage: • Bit: 0, 1 • Byte: 8 bits, e.g. 00101100 • Definitions from Jukka "Yucca" Korpela’s article: • Character repertoire: a set of characters (from one or more scripts) constituting the data that can be represented • Character code: a mapping that gives each character in a repertoire a distinct numeric identifier • Character encoding: a method of mapping sequences of character codes into sequences of bytes
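The three definitions above can be seen side by side for a single character. This is a minimal Python sketch, not part of the original slides; the choice of é as the sample character is an assumption made for illustration.

```python
# One character, one character code, two different byte encodings.
ch = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE (sample chosen for illustration)

code_point = ord(ch)                     # character code: the numeric identifier
latin1_bytes = ch.encode("iso-8859-1")   # character encoding no. 1
utf8_bytes = ch.encode("utf-8")          # character encoding no. 2

print(f"U+{code_point:04X}")   # U+00E9
print(latin1_bytes.hex())      # e9
print(utf8_bytes.hex())        # c3a9
```

The character code stays the same in both cases; only the byte sequence produced by the encoding differs.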

  7. Character encodings: ISO 8859-1 • ISO 8859-1: uses 1 byte (8 bits) per character to encode the characters of most Western European languages
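One consequence of a 1-byte encoding is that anything outside its 256-character repertoire cannot be stored at all. A small Python sketch of that limit follows; the helper name `fits_latin1` and the sample strings are hypothetical, chosen only to illustrate the point.

```python
def fits_latin1(text: str) -> bool:
    """Return True if every character in text can be encoded in ISO 8859-1."""
    try:
        text.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False

french = "fran\u00e7ais"                          # Western European: fits
russian = "\u0440\u0443\u0441\u0441\u043a\u0438\u0439"  # Cyrillic: does not fit
ipa = "\u0283\u026ap"                             # IPA esh and small capital I: do not fit

for sample in (french, russian, ipa):
    print(ascii(sample), fits_latin1(sample))
```

When the text does fit, each character occupies exactly one byte, so `len(french.encode("iso-8859-1"))` equals `len(french)`.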

  8. Unicode • International standard (ISO/IEC 10646) • Industry standard (Unicode Consortium) • Aims to code all characters from all of the world’s scripts, with room for over 1 million code points • Privileges character semantics, not glyphic representations • Multiple encoding methods • Referencing a character: U+nnnn (in hexadecimal, base 16) • Most commonly used characters are in the Basic Multilingual Plane (first 65,536 code positions)
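The U+nnnn notation and the BMP boundary can be checked directly. The following Python sketch uses the standard `unicodedata` module; the function name `describe` and the two sample characters are assumptions for illustration, not taken from the slides.

```python
import unicodedata

def describe(ch: str) -> str:
    """Format a character as U+nnnn with its Unicode name and plane."""
    cp = ord(ch)
    # The Basic Multilingual Plane covers U+0000 through U+FFFF.
    plane = "BMP" if cp <= 0xFFFF else "supplementary"
    return f"U+{cp:04X} {unicodedata.name(ch, '<unnamed>')} ({plane})"

print(describe("\u0259"))      # an IPA letter inside the BMP
print(describe("\U00010330"))  # a Gothic letter outside the BMP
```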

  9. Unicode encodings • UTF-32: maps each code to 4 bytes; inefficient, as most commonly used characters are in the BMP • UTF-16: maps each code to either one 2-byte unit or two (a surrogate pair) • Efficient and widely used • Good for the BMP • UTF-8: maps each code to 1-4 bytes • Particularly compact for Western European languages • Most widely supported across various internet protocols
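The size trade-offs listed above can be measured per character. This is a small Python sketch under stated assumptions: the helper `byte_lengths` is hypothetical, and the big-endian codec variants are used only so that no byte-order mark is added to the counts.

```python
def byte_lengths(ch: str) -> dict:
    """Number of bytes needed for one character in each Unicode encoding."""
    # The "-be" variants are used so that no byte-order mark is counted.
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-be", "utf-32-be")}

samples = {
    "a": "a",                  # ASCII letter
    "e-acute": "\u00e9",       # Western European letter
    "eng": "\u014b",           # IPA / African-language letter
    "thai ko kai": "\u0e01",   # BMP character beyond U+07FF
    "gothic ahsa": "\U00010330",  # character outside the BMP
}

for label, ch in samples.items():
    print(label, byte_lengths(ch))
```

UTF-32 is always 4 bytes, UTF-16 needs 4 bytes only outside the BMP, and UTF-8 grows from 1 to 4 bytes as the code point value increases.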

  10. Character semantics vs. glyphs • No difference between e, e, and e (the same letter set in different typefaces or styles) • IPA letter [c], the voiceless palatal plosive, is the same character as Roman c • No separate characters for cursive scripts, joined up handwriting

  11. Character semantics vs. glyphs • Examples • U+0041 LATIN CAPITAL LETTER A • U+0410 CYRILLIC CAPITAL LETTER A • U+0391 GREEK CAPITAL LETTER ALPHA • IPA digraphs • “Never use a character just because it looks right.”
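The three capital letters listed above typically render identically, yet they are distinct characters. A short Python sketch using the standard `unicodedata` module makes the difference visible; the variable name `lookalikes` is an illustrative choice.

```python
import unicodedata

# Three characters that usually look the same on screen.
lookalikes = ["\u0041", "\u0410", "\u0391"]

for ch in lookalikes:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# They never compare equal, so a search for one will not find the others.
print(lookalikes[0] == lookalikes[1])  # False
```

Printing the code point and name is a quick way to check that a character is the intended one and not merely one that looks right.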

  12. Precomposed characters • Complex characters involving a base character and one or more diacritics - treated as equivalent to the corresponding decomposed sequences • A relevant case study: "Challenges in Writing Bih" [http://test.elar.soas.ac.uk/node/5]
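Equivalence between a precomposed character and its decomposed sequence only becomes visible to software after normalization. The sketch below uses Python's `unicodedata.normalize`; the sample letter (e with circumflex and dot below) is an assumption chosen for illustration and is not taken from the Bih case study.

```python
import unicodedata

precomposed = "\u1ec7"           # single character: e with circumflex and dot below
decomposed = "e\u0323\u0302"     # e + combining dot below + combining circumflex
reordered = "e\u0302\u0323"      # same marks typed in the other order

# Raw comparison treats them as different strings.
print(precomposed == decomposed)   # False

# After normalization all three spellings agree.
nfc = [unicodedata.normalize("NFC", s) for s in (precomposed, decomposed, reordered)]
nfd = [unicodedata.normalize("NFD", s) for s in (precomposed, decomposed, reordered)]
print(len(set(nfc)), len(set(nfd)))  # 1 1
```

Normalizing to one form (NFC or NFD) before searching, sorting, or comparing avoids mismatches between visually identical words.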

  13. Compatibility characters • Similar to their decompositions, but not equivalent; they include extra information (formatting, etc.)
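The difference between canonical and compatibility equivalence shows up in which normalization form changes a character. A minimal Python sketch, assuming the fi ligature and the superscript two as sample compatibility characters:

```python
import unicodedata

ligature = "\ufb01"      # LATIN SMALL LIGATURE FI
superscript = "\u00b2"   # SUPERSCRIPT TWO

for ch in (ligature, superscript):
    nfc = unicodedata.normalize("NFC", ch)    # canonical: leaves the character alone
    nfkc = unicodedata.normalize("NFKC", ch)  # compatibility: replaces it
    print(ascii(ch), ascii(nfc), ascii(nfkc))
```

NFKC discards the extra formatting information (the ligature becomes "fi", the superscript becomes "2"), which is useful for searching but loses detail, so it is usually applied to a copy of the data rather than to the archived original.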

  14. Pre-composed and compatibility characters • Why do they exist, if counter to Unicode’s focus on character semantics over glyphic representation? • Compatibility with prior encodings • No such new characters will be accepted into Unicode

  15. Things to watch out for • An example to illustrate the difference between: • Text rendering • Document encoding

  16. Take away message • Just because characters aren’t rendered properly doesn’t mean that they aren’t there. • Just because characters are rendered properly doesn’t guarantee that they will stay that way. • Beware your platform’s default encoding (probably not Unicode).
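The first point above can be demonstrated by reading UTF-8 bytes with the wrong encoding and then undoing the mistake. This Python sketch assumes a two-letter IPA string as sample data and ISO 8859-1 as the wrongly assumed encoding.

```python
original = "\u014b\u0259"            # IPA eng + schwa (sample data for illustration)
stored = original.encode("utf-8")    # the bytes actually saved in the file

# A program that assumes ISO 8859-1 shows four wrong characters...
misread = stored.decode("iso-8859-1")
print(ascii(misread), len(misread))

# ...but the underlying bytes are intact, so the text can be recovered.
recovered = misread.encode("iso-8859-1").decode("utf-8")
print(recovered == original)   # True
```

The garbled display was a decoding problem, not data loss. Data is lost only if the misread text is re-saved in an encoding that cannot represent it.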

  17. Adding markup • Not only should the document be Unicode, but it must declare itself as Unicode.
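A document declares itself as Unicode through its markup, for example an XML declaration, and the bytes on disk must match that declaration. A minimal Python sketch, assuming a hypothetical file name `word.xml` and element name `word`:

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

# The declaration states the encoding; the file is written with the same encoding.
xml_text = '<?xml version="1.0" encoding="UTF-8"?>\n<word>\u014b\u0259</word>\n'

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "word.xml"           # hypothetical file name
    path.write_text(xml_text, encoding="utf-8")  # never rely on the platform default
    raw = path.read_bytes()

# The parser reads the declaration from the raw bytes and decodes accordingly.
root = ET.fromstring(raw)
print(ascii(root.text))
```

Passing `encoding="utf-8"` explicitly when writing guards against the platform default mentioned in the previous slide.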

  18. Exercises • What's wrong with these Unicode words? • Character encoding exercises I • http://test.elar.soas.ac.uk/taxonomy/term/1

  19. Your questions and issues
