Computer Structure, Recitation 1: Representing Characters in Hardware
A programmer who doesn't care about character encoding is not much better than a medical doctor who doesn't believe in germs. Joel Spolsky
Introduction
• Computers are considered "number crunchers".
• Humans work with characters.
• Character data is not just alphabetic characters; it also includes digits, punctuation, spaces, etc. Most keys in the central part of the keyboard (except modifiers such as Shift and Caps Lock) produce characters.
• Everything a computer stores is represented by binary sequences.
• We use standard encodings (mappings from characters to binary sequences) to represent characters.
תמר שרוט, נועם חזון
Introduction (2)
• Integers are represented in two's complement because it has nice mathematical properties; for example, the same addition circuit works for positive and negative numbers.
• Character data has no such properties, so assigning binary codes to characters is somewhat arbitrary.
• The most common character representation is ASCII, the American Standard Code for Information Interchange.
• The ASCII code defines which character each binary sequence represents.
The ASCII code
The ASCII code (2)
• There are two reasons to use ASCII:
  • It gives a way to represent characters.
  • It is an accepted standard.
• A different bit pattern is used for each character that needs to be represented.
• A nice property: the lowercase letters are contiguous and in alphabetical order, and so are the uppercase letters and the digits. Applications:
  • 'a' < 'b'; 'A' < 'B'; '0' < '1'.
  • 'a' - 'A' = 'b' - 'B' = … = 'z' - 'Z' = 32.
  • '1' - '0' = 1 - 0.
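As a small illustration of why contiguous codes are convenient, here is a Python sketch. The helper names `to_upper_ascii` and `digit_value` are made up for this example and are not part of the slides; each function assumes its argument is a single ASCII character.

```python
def to_upper_ascii(ch: str) -> str:
    """Upper-case one ASCII letter using the fixed offset between the two cases."""
    if "a" <= ch <= "z":
        return chr(ord(ch) - (ord("a") - ord("A")))  # the offset is 32
    return ch  # anything else is returned unchanged


def digit_value(ch: str) -> int:
    """Numeric value of one ASCII digit character."""
    if not "0" <= ch <= "9":
        raise ValueError(f"not an ASCII digit: {ch!r}")
    return ord(ch) - ord("0")  # '0' is 48, '1' is 49, and so on


print(ord("a") - ord("A"))  # 32
print(to_upper_ascii("q"))  # Q
print(digit_value("7"))     # 7
```

The same trick is what makes comparisons such as 'a' < 'b' behave like alphabetical order for plain English letters.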
The ASCII code (3)
• Note:
  • 'a' ≠ 'A'.
  • 0 ≠ '0' ('0' = 48).
• Codes 0 to 31 (and 127, DEL) are control characters that affect how text is processed and are generally not printable. Code 32 is the space character.
• There are 128 (= 2^7) ASCII characters, so a code fits in 7 bits.
• The eighth bit of the byte was often used as a parity bit to detect transmission errors.
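The parity idea can be sketched in a few lines of Python. This assumes even parity (the slides do not say which convention is meant), and the function names are hypothetical.

```python
def add_even_parity(code: int) -> int:
    """Return an 8-bit value: the 7-bit ASCII code with an even-parity bit in bit 7."""
    if not 0 <= code < 128:
        raise ValueError("not a 7-bit ASCII code")
    parity = bin(code).count("1") % 2  # 1 when the number of 1-bits is odd
    return code | (parity << 7)


def parity_ok(byte: int) -> bool:
    """True when the 8-bit value contains an even number of 1-bits."""
    return bin(byte).count("1") % 2 == 0


sent = add_even_parity(ord("C"))  # 'C' = 0b1000011 has three 1-bits
print(hex(sent))                  # 0xc3
print(parity_ok(sent))            # True
print(parity_ok(sent ^ 0b100))    # False: one flipped bit is detected
```

A single parity bit detects any odd number of flipped bits but cannot tell which bit changed, and it misses an even number of flips.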
ASCII's disadvantage
• The greatest disadvantage: it is biased toward the English character set.
• Missing:
  • Mathematical symbols.
  • Letters of other European languages (and of Hebrew).
• Solution: use the 8th bit as well (Extended ASCII). This gives up to 256 characters, which is plenty for most alphabet-based languages.
Extended ASCII
• Problems:
  • Not enough for Asian languages whose writing systems have thousands of characters.
  • Can't combine more than one language: the same byte can mean é in one encoding and ג in another, which garbles email sent from France to Israel and vice versa.
• Code pages: different character encodings that are identical only in the first 128 codes (the ASCII part).
• Works reasonably well in small networks where everyone uses the same encoding.
• Problem: the Internet!
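The é / ג clash can be reproduced with Python's built-in codecs. The choice of the two DOS code pages `cp437` and `cp862` is an assumption made for this sketch; the slides do not name the code pages they have in mind.

```python
raw = bytes([0x82])  # a single byte above the 7-bit ASCII range

print(raw.decode("cp437"))  # é  (original IBM PC code page)
print(raw.decode("cp862"))  # ג  (Hebrew DOS code page)

# The lower half (the ASCII part) is the same in both code pages.
ascii_byte = bytes([0x41])
print(ascii_byte.decode("cp437") == ascii_byte.decode("cp862"))  # True
```

The byte itself never changes; only the table used to interpret it does, which is why the receiver must know which code page the sender used.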
Unicode
• An effort to create a single character set that includes every reasonable writing system.
• Its original encoding uses 2 bytes to represent a character:
  • ASCII characters: one byte holds the ASCII code and the other byte is zero.
  • Other characters: both bytes are used.
• Disadvantages of UCS-2 (the 2-byte Universal Character Set, the predecessor of UTF-16; UTF-16 adds surrogate pairs for code points above U+FFFF):
  • Endianness (byte order).
  • Doubles the size of files.
  • Not compatible with existing ASCII files.
Endianness
• Once a character is stored in more than one byte, the byte order (big-endian / little-endian) matters!
• This causes problems when transferring files between different computers.
• Solution: the Unicode Byte Order Mark (BOM), 0xFEFF in 16-bit Unicode.
• Always place the mark at the beginning of the character stream.
• When the input starts with 0xFFFE, the reader knows that the two bytes of every 16-bit unit must be swapped.
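A minimal sketch of how a reader might use the byte order mark, written in Python with the BOM constants from the standard `codecs` module. The function name `decode_utf16_with_bom` is hypothetical, and the sketch simply refuses input that has no mark.

```python
import codecs


def decode_utf16_with_bom(data: bytes) -> str:
    """Decode 16-bit Unicode text, using the leading BOM to pick the byte order."""
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return data[2:].decode("utf-16-le")
    raise ValueError("no byte order mark; the byte order must be known some other way")


big = codecs.BOM_UTF16_BE + "Hi".encode("utf-16-be")
little = codecs.BOM_UTF16_LE + "Hi".encode("utf-16-le")

print(big.hex())                      # feff00480069
print(little.hex())                   # fffe48006900
print(decode_utf16_with_bom(big))     # Hi
print(decode_utf16_with_bom(little))  # Hi
```

Both byte strings carry the same text; only the order of the two bytes in each 16-bit unit differs, and the mark tells the reader which order was used.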
Unicode – cons
• Yet:
  • Not every Unicode string has a byte order mark at the beginning.
  • Pure English files double in size for no reason.
  • Old files must be converted.
• Unicode was largely ignored for several years (until 1992).
• Solution: UTF-8 (8-bit Unicode Transformation Format).
UTF-8
• A variable-length character encoding.
• Every code point from 0 to 127 (the original ASCII codes) is stored in a single byte.
• Code points 128 and above are stored in 2 to 4 bytes, depending on the code point (the original design allowed up to 6 bytes; the current standard stops at 4).
• Outcomes:
  • Pure English files are identical to ASCII files.
  • No needlessly doubled files.
  • No need to convert old ASCII files.
  • The extra bytes allow a much richer character set.
  • Frequent characters use shorter encodings.
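The size claims above are easy to check with Python's built-in `str.encode`. The sample characters and the helper name `utf8_len` are chosen only for illustration.

```python
def utf8_len(ch: str) -> int:
    """Number of bytes UTF-8 needs for one character."""
    return len(ch.encode("utf-8"))


for ch in ["A", "é", "ג", "€", "\U0001F600"]:
    print(f"U+{ord(ch):04X} -> {utf8_len(ch)} byte(s)")
# U+0041 -> 1 byte(s)
# U+00E9 -> 2 byte(s)
# U+05D2 -> 2 byte(s)
# U+20AC -> 3 byte(s)
# U+1F600 -> 4 byte(s)

english = "Hello"
print(english.encode("utf-8") == english.encode("ascii"))  # True: same bytes as ASCII
print(len(english.encode("utf-16-le")))                    # 10: two bytes per character
```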
UTF-8 – How does it work?
• If we have an ASCII character:
  • It is placed in one byte whose MSB is zero.
• Otherwise, we need more than one byte:
  • The first byte tells us how many bytes are used to encode the character: it starts (from the MSB) with a sequence of ones followed by a single zero, and the number of ones equals the number of bytes used.
  • Each additional byte starts with the bits 10.
  • The remaining bits of all the bytes hold the character's code point.
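The rules above can be turned into a hand-written encoder. This is a teaching sketch under stated assumptions, not a replacement for the built-in codec: the name `utf8_encode` is hypothetical, only the 1 to 4 byte forms are handled, and surrogate code points (U+D800 to U+DFFF) are not rejected as a strict encoder would.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point by hand, following the bit layout of UTF-8."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("code point out of range")
    if code_point < 0x80:          # 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:         # 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (code_point >> 6),
                      0b10000000 | (code_point & 0b111111)])
    if code_point < 0x10000:       # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (code_point >> 12),
                      0b10000000 | ((code_point >> 6) & 0b111111),
                      0b10000000 | (code_point & 0b111111)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0b11110000 | (code_point >> 18),
                  0b10000000 | ((code_point >> 12) & 0b111111),
                  0b10000000 | ((code_point >> 6) & 0b111111),
                  0b10000000 | (code_point & 0b111111)])


encoded = utf8_encode(ord("ג"))                   # U+05D2
print(" ".join(f"{b:08b}" for b in encoded))      # 11010111 10010010
print(encoded == "ג".encode("utf-8"))             # True
```

In the printed bits, the first byte starts with 110 (two ones, so two bytes in total) and the second byte starts with 10; the remaining eleven bits, 10111 and 010010, spell out 0x5D2.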
Other encodings
• There are hundreds of different encodings.
• UTF-7, UTF-8, UTF-16 and UTF-32 can each represent every Unicode code point, which makes them the most reliable choice when working with languages other than English.
• When passing a sequence of characters (strings, files, etc.), one must state which encoding is used. Otherwise:
  • Gibberish.
  • Question marks.
  • Wrong representation of some characters.
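The failure modes listed above can be reproduced in Python. The sample word and the choice of Latin-1 as the "wrong" encoding are assumptions made for this sketch.

```python
text = "שלום"                 # Hebrew for "hello", four letters
data = text.encode("utf-8")   # eight bytes, two per letter

# Correct encoding: the text round-trips.
print(data.decode("utf-8") == text)  # True

# Wrong encoding: every byte becomes its own (wrong) character, i.e. gibberish.
garbled = data.decode("latin-1")
print(len(garbled))  # 8

# An encoding that cannot store the characters: question marks.
print(text.encode("ascii", errors="replace"))  # b'????'
```

Decoding with the wrong table raises no error here, which is exactly why the encoding has to be stated alongside the data.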
Standards
• E-mail header:
  • Content-Type: text/plain; charset="UTF-8"
• Web page meta tag:
  • <html> <head> <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> …
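On the receiving side, the declared charset is what tells the program how to decode the bytes. A minimal sketch using Python's standard `email` package follows; the message and the body bytes are invented for the example.

```python
from email import message_from_string

# A header-only message carrying the Content-Type line from the slide.
header = 'Content-Type: text/plain; charset="UTF-8"'
msg = message_from_string(header + "\n\n")

charset = msg.get_content_charset()  # the charset parameter, lower-cased
print(charset)                       # utf-8

# Hypothetical body bytes as they might arrive over the wire.
body_bytes = "שלום".encode("utf-8")
decoded = body_bytes.decode(charset)
print(decoded == "שלום")             # True
```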
Libraries for managing encodings
• Many libraries support conversion between character encodings, e.g.:
  • iconv (or a more stable implementation: libiconv), mostly on Unix.
  • The codecs module (Python).
  • International Components for Unicode (ICU), with libraries for C/C++ and Java.
  • UTF8-CPP (C++).
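As one example from the list, here is a short sketch of Python's `codecs` module. It looks up a codec by an alias and uses an incremental decoder to handle a multi-byte character that arrives split across two chunks, as can happen when reading from a network stream. The chunk boundaries are chosen only for illustration.

```python
import codecs

# Codec lookup accepts aliases and reports the canonical name.
info = codecs.lookup("UTF8")
print(info.name)  # utf-8

# "ג" takes two bytes in UTF-8; feed them to the decoder one at a time.
data = "ג".encode("utf-8")
decoder = codecs.getincrementaldecoder("utf-8")()
pieces = [
    decoder.decode(data[:1]),              # incomplete sequence: nothing yet
    decoder.decode(data[1:], final=True),  # sequence completed
]
print(pieces)  # ['', 'ג']
```

Decoding each chunk separately with `bytes.decode` would fail on the first chunk, because half of a multi-byte sequence is not valid UTF-8 on its own.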