Unicode. As software and the web become global, we need characters for far more languages than English. We'll look at Unicode as a solution.
Unicode
• Unicode contains a repertoire of more than 110,000 characters covering 100 scripts. Think not only of Chinese, Cyrillic, and Hebrew, but also of Cherokee, Runic, Mandaic, Bamum, Tagalog, and so on.
PACS – 11/16/13
• Because a Unicode character often needs more than one byte, Unicode text requires special handling.
• When Unicode text is not handled correctly, web pages show boxes, question marks, or jumbles of random characters instead of the intended text.
• From the Unicode Consortium: “Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.”
• “Fundamentally, computers just deal with numbers. They store letters and other characters by assigning a number for each one. Before Unicode was invented, there were hundreds of different encoding systems for assigning these numbers. No single encoding could contain enough characters: for example, the European Union alone requires several different encodings to cover all its languages. Even for a single language like English no single encoding was adequate for all the letters, punctuation, and technical symbols in common use.”
• Some history: Early computers processed only numbers. When it became clear that characters needed to be processed too, each manufacturer devised its own solution. Word lengths varied (12, 16, 18, 24, 32, or 36 bits), and so did byte lengths (4, 6, or 8 bits).
• Most implementations supported upper case only.
• Punctuation support was sporadic.
• Six-bit bytes supported only 64 different characters, which is not enough for both upper and lower case.
• The American Standard Code for Information Interchange (ASCII) is a character-encoding scheme, originally based on the English alphabet, that encodes 128 specified characters as 7-bit binary integers: the digits 0-9, the letters a-z and A-Z, some basic punctuation symbols, some control codes that originated with Teletype machines, and a blank space. Work on the standard started in 1960. It was first published in 1963, revised in 1967, and most recently updated in 1986.
• ASCII used 7 bits per character. This left the eighth bit free for use as a parity bit on paper tape or magnetic tape.
• Most early network links were 7-bit, and the SMTP spec called for 7-bit characters. That's why binary files in emails are base-64 encoded: the encoding converts the file to an equivalent string of 7-bit characters that can be transferred successfully.
• As 8-bit bytes became standard, more uses were found for the top 128 characters in the 256-character space.
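The base-64 point above can be checked with a minimal Python sketch. The sample bytes are made up for illustration, and Python's standard `base64` module stands in for whatever a mail client actually uses.

```python
import base64

# Arbitrary binary data, including bytes with the high (eighth) bit set.
binary = bytes([0x00, 0x7F, 0x80, 0xC3, 0xFF])

# Base-64 output is drawn only from A-Z, a-z, 0-9, '+', '/' and '='.
encoded = base64.b64encode(binary)
decoded = base64.b64decode(encoded)

# Every encoded byte fits in 7 bits, so a 7-bit link cannot damage it.
all_seven_bit = all(b < 0x80 for b in encoded)

print(encoded.decode("ascii"))
print(all_seven_bit, decoded == binary)
```

The cost is size: every 3 input bytes become 4 output characters.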
• Many systems used the top 128 characters for block graphics for gaming or charting.
• DEC first devised a Multinational Character Set, which had the accented characters needed by most European languages plus a few more special symbols.
• Apple made its own set, Mac OS Roman, which included math symbols in addition to the diacritical marks.
• PostScript had its own set.
• Microsoft came up with Windows-1252, which included more special symbols in the 80-9F positions.
• ANSI codified the 256-character extension of ASCII.
• What a mess! Enter Unicode. While ASCII is limited to 128 characters, Unicode supports far more by separating two concepts: unique identification (using natural numbers called code points) and encoding (into 8-, 16-, or 32-bit code units, in formats called UTF-8, UTF-16, and UTF-32).
• To allow backward compatibility, the 128 ASCII characters and the 256 ISO-8859-1 (Latin-1) characters are assigned Unicode/UCS code points that are the same as their codes in the earlier standards.
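A minimal Python sketch of the split between code point and encoding. The choice of "é" as the sample character is an assumption made for illustration.

```python
ch = "\u00e9"  # "é", LATIN SMALL LETTER E WITH ACUTE

# Identification: one code point, independent of any byte format.
code_point = ord(ch)  # 233, written U+00E9

# Encoding: the same code point serialized three different ways.
utf8 = ch.encode("utf-8")       # 2 bytes: C3 A9
utf16 = ch.encode("utf-16-be")  # 2 bytes: 00 E9
utf32 = ch.encode("utf-32-be")  # 4 bytes: 00 00 00 E9

# Backward compatibility: the Latin-1 byte value equals the code point.
latin1 = ch.encode("latin-1")   # 1 byte: E9

for name, data in [("UTF-8", utf8), ("UTF-16", utf16),
                   ("UTF-32", utf32), ("Latin-1", latin1)]:
    print(f"U+{code_point:04X} as {name}: {data.hex()}")
```

The `-be` suffix fixes big-endian byte order and suppresses the byte order mark, which keeps the output easy to compare.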
• The most common encoding is UTF-8, which represents every character in 1 to 4 bytes carrying up to 21 bits of data.
• The first 128 characters (US-ASCII) need one byte. The next 1,920 characters need two bytes to encode. This covers almost all Latin-derived alphabets, and also the Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac, and Tāna alphabets, as well as Combining Diacritical Marks.
• Three bytes are needed for characters in the rest of the Basic Multilingual Plane (which contains virtually all common characters).
• Four bytes are needed for characters in the other planes of Unicode, which include less common CJK (Chinese, Japanese, Korean) characters and various historic scripts and mathematical symbols.
• A few examples:
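The one-to-four-byte ranges can be illustrated with a short Python sketch. The four sample characters are chosen here for illustration and are not taken from the original slide.

```python
# One sample from each UTF-8 length class (escapes keep the source pure ASCII).
samples = [
    "A",           # U+0041, US-ASCII
    "\u00e9",      # U+00E9, e with acute (Latin-1 range)
    "\u20ac",      # U+20AC, euro sign (rest of the Basic Multilingual Plane)
    "\U0001F600",  # U+1F600, an emoji outside the Basic Multilingual Plane
]

byte_lengths = []
for ch in samples:
    encoded = ch.encode("utf-8")
    byte_lengths.append(len(encoded))
    hex_bytes = " ".join(f"{b:02X}" for b in encoded)
    print(f"U+{ord(ch):04X}: {len(encoded)} byte(s): {hex_bytes}")
```

In each multi-byte sequence the first byte announces the length and every following byte has the form 10xxxxxx, which is what makes a stray or missing continuation byte detectable.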
Some sequences of bytes are invalid:
• Bytes that the Unicode standard never allows in UTF-8 (0xC0, 0xC1, and 0xF5 through 0xFF)
• An unexpected continuation byte
• A start byte not followed by enough continuation bytes
• An overlong encoding, i.e. one padded with more leading zero bits than needed
• A 4-byte sequence (starting with 0xF4) that decodes to a value greater than U+10FFFF
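Each invalid case listed above can be exercised with a strict decoder. This is a minimal Python sketch: the helper name `is_valid_utf8` and the sample byte strings are assumptions made for illustration.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as UTF-8 under Python's strict decoder."""
    try:
        data.decode("utf-8")  # error handling is 'strict' by default
        return True
    except UnicodeDecodeError:
        return False

# One hypothetical sample per category of invalid sequence.
invalid_samples = {
    "byte that is never valid": b"\xff",
    "unexpected continuation byte": b"\x80",
    "start byte missing a continuation byte": b"\xe2\x82",
    "overlong encoding of '/'": b"\xc0\xaf",
    "value above U+10FFFF": b"\xf4\x90\x80\x80",
}

results = {label: is_valid_utf8(data) for label, data in invalid_samples.items()}
for label, ok in results.items():
    print(f"{label}: {'valid' if ok else 'invalid'}")

# For contrast, the complete three-byte euro sign is accepted.
euro_ok = is_valid_utf8(b"\xe2\x82\xac")
```

Rejecting overlong forms matters for security: a filter that looks for the single byte 0x2F would otherwise miss a "/" smuggled in as 0xC0 0xAF.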
• To work around the limitations of legacy encodings, HTML is designed so that any Unicode character can be represented inside an HTML document by a numeric character reference: a sequence of characters that explicitly spells out the Unicode code point of the character being represented.
• A character reference takes the form &#N;, where N is either a decimal number for the Unicode code point or a hexadecimal number, in which case it must be prefixed by x (&#xN;). The characters that make up a numeric character reference are representable in every encoding approved for use on the Internet.
• In HTML, there is also a standard set of 252 named character entities for characters that are either not found in certain character encodings or are markup-sensitive in some contexts (for example, angle brackets and quotation marks).
• Although any Unicode character can be referenced by its numeric code point, some HTML document authors prefer to use these named entities where possible, as they are less cryptic and were better supported by early browsers.
• Character entities can be included in an HTML document via entity references, which take the form &EntityName;, where EntityName is the name of the entity.
• For example, &mdash;, equivalent to &#8212; or &#x2014;, represents U+2014, the em dash character "—", even if the character encoding used doesn't contain that character.
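The equivalence of the three reference forms can be sketched in Python. The helper `to_numeric_reference` is a hypothetical name introduced here, and `html.unescape` from the standard library stands in for a browser's entity decoding.

```python
import html

def to_numeric_reference(ch: str, hexadecimal: bool = False) -> str:
    """Spell out a single character as an HTML numeric character reference."""
    code_point = ord(ch)
    return f"&#x{code_point:X};" if hexadecimal else f"&#{code_point};"

em_dash = "\u2014"

decimal_ref = to_numeric_reference(em_dash)                # decimal form
hex_ref = to_numeric_reference(em_dash, hexadecimal=True)  # hexadecimal form

# The named entity and both numeric forms decode to the same character.
decoded = [html.unescape(ref) for ref in ("&mdash;", decimal_ref, hex_ref)]

print(decimal_ref, hex_ref, all(d == em_dash for d in decoded))
```

Because the references use only ASCII characters, the page stays correct even when it is served in an encoding that has no em dash.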
• Even if every byte of every code point survives transmission, you still need a font that includes the required characters.
• Arial: 3,415 characters.
• Arial Unicode MS: 38,917 characters.
• If your font can't display a given character, a box or question mark is typically shown instead.
Most non-ASCII characters result from:
• ‘Fancy’ punctuation characters from MS Office apps. For example, look at the curly single quotes around ‘Fancy’ in this bullet!
• Special characters for cents, degrees, math symbols, etc.
• International languages needing diacritical marks or non-Latin letters.
• How does Unicode affect PHP coding?
• The character encoding for the HTML file may be set wrong by the server or inside the file.
• The main logic problems stem from the fact that the length of a string in bytes will often be greater than the number of characters displayed.
• Note that this affects field sizes in MySQL. A field may need up to 4x the bytes to hold the same number of characters.
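The gap between byte length and character count is easy to demonstrate. The sketch below uses Python rather than PHP, and the sample strings are assumptions chosen for illustration.

```python
text = "na\u00efve caf\u00e9"  # "naïve café"

char_count = len(text)                  # characters (code points)
byte_count = len(text.encode("utf-8"))  # bytes once stored as UTF-8

print(f"{char_count} characters, {byte_count} bytes")

# Worst case: characters outside the Basic Multilingual Plane take 4 bytes each.
emoji = "\U0001F600" * 3
worst_case_ratio = len(emoji.encode("utf-8")) / len(emoji)
```

A column or buffer sized in bytes therefore has to be planned around the worst-case ratio, not the typical one.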
• Set the encoding in the HTML: <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
• You'll also have to make sure the server declares the same encoding. Scripts can use the header function: header('Content-Type: text/html; charset=UTF-8');
• Do not EVER use the byte-oriented functions that convert case (strtolower, strtoupper, ucfirst, ucwords) or claim to be case-insensitive (str_ireplace, stristr, strcasecmp) on UTF-8 text.
• Think twice before using functions that count characters: strlen returns bytes, not characters, and str_split and wordwrap may corrupt a string by cutting a multi-byte character in half.
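Why byte-oriented functions are risky can be sketched by analogy. In the Python below, the `bytes` type stands in for a byte-oriented string function. It illustrates the failure mode and is not a reproduction of PHP's actual behavior, which may vary by version and locale.

```python
text = "caf\u00e9"          # "café"
raw = text.encode("utf-8")  # 5 bytes: the final "é" occupies two of them

# Cutting at a byte offset can land in the middle of a multi-byte character.
truncated = raw[:4]
try:
    truncated.decode("utf-8")
    cut_is_valid = True
except UnicodeDecodeError:
    cut_is_valid = False  # the string now ends in half a character

# Byte-level case conversion only knows the ASCII letters...
byte_upper = raw.upper().decode("utf-8")  # "é" is left in lower case
# ...while character-level conversion handles the accented letter too.
char_upper = text.upper()

print(cut_is_valid, byte_upper, char_upper)
```

The same two hazards, split characters and partial case mapping, are what the warnings above are guarding against.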
• Sorting becomes a challenge. Consider the different representations of vowels with diacritical marks.
• Regular expressions will have trouble deciding which characters are letters, among other problems.
• Character case conversion becomes much harder because lower/upper pairs appear throughout the code tables.
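A minimal Python sketch of the sorting problem. The word list and the helper `accent_insensitive_key` are assumptions made for illustration, and stripping accents is a crude stand-in for real locale-aware collation.

```python
import unicodedata

words = ["zebra", "\u00e9clair", "eclair", "apple"]

# Plain sorting compares code points, so "é" (U+00E9) lands after "z".
by_code_point = sorted(words)

def accent_insensitive_key(word: str):
    """Sort by base letters first, using the original word as a tie-breaker."""
    decomposed = unicodedata.normalize("NFD", word)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    return (base, word)

by_base_letter = sorted(words, key=accent_insensitive_key)

# The same visible character can also be stored in two different ways.
composed = "\u00e9"     # one code point
decomposed = "e\u0301"  # "e" followed by a combining acute accent
same_after_nfc = unicodedata.normalize("NFC", decomposed) == composed

print(by_code_point)
print(by_base_letter)
print(composed == decomposed, same_after_nfc)
```

Normalizing text to a single form before comparing or storing it avoids treating the two spellings of "é" as different strings.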
• Generating Unicode characters is not easy either.
• Any editing must be done with a Unicode-aware program.
• Editors that are not Unicode-aware will mangle files by converting Unicode European characters into their ANSI equivalents.
• MS Word is a capable Unicode editor.
• See unicode.org for information about generating characters. CJK languages pose special challenges.
Links:
• htmlpurifier.org/docs/enduser-utf8.html
• www.hotpeachpages.net/a/characters.html
• www.phpwact.org/php/i18n/charsets
• phputf8.sourceforge.net
• unicode.org
• www.alanwood.net/unicode/