lis508 lecture 1: bits, bytes and characters

lis508 lecture 1: bits, bytes and characters. Thomas Krichel 2002-09-23. Structure. Bits Bytes Character sets Coded character set Character encoding. Literature. Norton “new inside the PC” chapter 4 http://www.danbbs.dk/~erikoest/bb_terms.htm


  1. lis508 lecture 1: bits, bytes and characters Thomas Krichel 2002-09-23

  2. Structure • Bits • Bytes • Character sets • Coded character set • Character encoding

  3. Literature • Norton “new inside the PC” chapter 4 • http://www.danbbs.dk/~erikoest/bb_terms.htm • http://wwwinfo.cern.ch/asdoc/WWW/publications/ictp99/ictp99N2705.html • http://www.cl.cam.ac.uk/~mgk25/unicode.html

  4. Information • Information is best understood as “what it takes to answer a question”. • The simplest question has a “yes” or “no” answer. Therefore a bit, which holds exactly one such answer, is the natural measure of information. • Term first used by John Tukey in 1946. • Contraction of “binary digit”.

  5. Usage of bits • Computers are sometimes classified by • The number of bits they can process at one time, i.e. the register size. Larger registers let a computer handle more data per instruction, which generally makes it faster. • The number of bits they use to represent addresses, i.e. the address size. A larger address size allows more memory to be addressed, and so allows larger programs to run. • Graphics are also often described by the number of bits used to represent each dot.
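
The link between address size and program size is simple arithmetic: n address bits can name 2 power n distinct locations. A minimal sketch, assuming one byte per address; the helper name `addressable_bytes` is made up for this illustration.

```python
def addressable_bytes(address_bits: int) -> int:
    """Number of distinct addresses that fit in the given number of bits.

    Hypothetical helper; assumes each address names exactly one byte.
    """
    return 2 ** address_bits


for bits in (16, 32):
    print(bits, "address bits can name", addressable_bytes(bits), "bytes")
```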

  6. Many bits • The first chips processed 8 bits at a time. It became customary to refer to a group of 8 bits as a byte. • Larger units are • Kilobyte is 2 power 10 bytes • Megabyte is 2 power 20 bytes • Gigabyte is 2 power 30 bytes • Terabyte is 2 power 40 bytes • From ancient Greek words for "thousand", "large", "giant", and "monster", respectively. The prefix kilo- dates back to the French Revolution; the others were adopted later.

  7. More than a monster • In 1975, the General Conference on Weights and Measures (CGPM), based at Sèvres near Paris, agreed to add peta- (P) and exa- (E) • Petabyte is 2 power 50 bytes • Exabyte is 2 power 60 bytes • Nowadays they are followed by zettabyte (2 power 70) and yottabyte (2 power 80)
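
The units from kilobyte to exabyte follow one pattern: each step multiplies by 2 power 10. A small sketch of that arithmetic, using the binary definitions given on these slides; the table `UNIT_EXPONENTS` and the function `bytes_in` are names invented for this example.

```python
# Binary exponents for the units named on the slides (hypothetical table name).
UNIT_EXPONENTS = {
    "kilobyte": 10,
    "megabyte": 20,
    "gigabyte": 30,
    "terabyte": 40,
    "petabyte": 50,
    "exabyte": 60,
}


def bytes_in(unit: str) -> int:
    """Return the number of bytes in one unit, as 2 to the power of its exponent."""
    return 2 ** UNIT_EXPONENTS[unit]


for name in UNIT_EXPONENTS:
    print(f"1 {name} = 2 power {UNIT_EXPONENTS[name]} = {bytes_in(name)} bytes")
```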

  8. Hex numbers • A byte is often represented by two hex digits. • Each hex digit can encode 16 values • Written 0 to 9, then A B C D E F. F is 15. • Here, prefixed with 0x • Use the Microsoft calculator in scientific mode to convert.

  9. Decimal/binary numbers • 0 = 0 • 1 = 1 • 2 = 10 • 3 = 11 • 4 = 100 • 5 = 101 • 6 = 110 • 7 = 111 • 8 = 1000 • 9 = 1001 • 10 = 1010 • 11 = 1011 • 12 = 1100 • 13 = 1101 • 14 = 1110 • 15 = 1111
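
The conversions between decimal, binary and hex can also be done in a few lines of code instead of a calculator. A sketch using only Python built-ins (`format` and `int`); the two helper names are invented for this example.

```python
def byte_to_hex(value: int) -> str:
    """Write one byte (0 to 255) as two hex digits with the 0x prefix."""
    if not 0 <= value <= 255:
        raise ValueError("a byte holds values 0 to 255")
    return "0x" + format(value, "02X")


def byte_to_binary(value: int) -> str:
    """Write one byte (0 to 255) as eight binary digits."""
    if not 0 <= value <= 255:
        raise ValueError("a byte holds values 0 to 255")
    return format(value, "08b")


for n in (0, 10, 15, 255):
    print(n, byte_to_binary(n), byte_to_hex(n))

# Going the other way: int() parses a string in the given base.
print(int("1111", 2), int("0xFF", 16))
```

Each hex digit stands for exactly four binary digits, which is why one byte always fits in two hex digits.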

  10. Characters • Much of the information processed by computers is in the form of characters. • A character only makes sense to a human reader with a minimum of cultural background. • A character is not a glyph. • Example: ligatures, where a single glyph such as “ﬁ” stands for two characters.

  11. Representing characters • Computers don't understand text, they only understand numbers. For computers to be able to treat text, there must be a correspondence between numbers and text characters. Such a correspondence is called a coded character set. • Important examples are • ASCII • ISO 8859-1 • cp1252
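
The correspondence between characters and numbers can be inspected directly. A brief sketch using Python's built-in `ord` and `chr`; note that Python uses the Unicode coded character set, whose numbers agree with ASCII and ISO 8859-1 for the characters shown here.

```python
# ord() maps a character to its number, chr() maps the number back.
for ch in ("A", "a", "0", "é"):
    number = ord(ch)
    print(repr(ch), "->", number, "->", repr(chr(number)))
```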

  12. ASCII • American Standard Code for Information Interchange • 7-bit character set. There is no such thing as 8-bit ASCII • 95 printable symbols • 33 control characters (0-31, 127) • http://www.ccmr.cornell.edu/helpful_data/ascii2.html has a list.
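
The counts on this slide can be checked by enumerating the 7-bit range. A short sketch; the variable names are chosen for this example only.

```python
# All 7-bit codes: 0 to 127.
all_codes = range(2 ** 7)

# Control characters are codes 0 to 31 plus 127 (DEL).
control = [c for c in all_codes if c < 32 or c == 127]

# Printable symbols are codes 32 (space) to 126 (tilde).
printable = [c for c in all_codes if 32 <= c <= 126]

print(len(control), "control characters")
print(len(printable), "printable symbols:")
print("".join(chr(c) for c in printable))
```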

  13. ASCII control codes • ACK (6, ^F) used to acknowledge receipt of message, NAK (21, ^U) used to signal non-receipt • CR (13, ^M) is the carriage return • LF (10, ^J) is the linefeed • FF (12, ^L) is the form feed (new page) • BS (8, ^H) is the backspace • DEL (ALT-127) is delete • ESC (27, ^[) is the escape • Different programs use them in different ways, a big pain in the a…
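
The caret notation (^F, ^M and so on) is not arbitrary: the letter is the character whose code differs from the control code in one bit, the bit with value 64. A minimal sketch of that rule; the function name `caret_notation` is invented here.

```python
def caret_notation(code: int) -> str:
    """Return the ^X form of an ASCII control code (0 to 31, or 127).

    Flipping the bit with value 64 (0x40) turns the control code into
    the code of the character written after the caret.
    """
    if not (0 <= code <= 31 or code == 127):
        raise ValueError("not an ASCII control code")
    return "^" + chr(code ^ 0x40)


for name, code in [("ACK", 6), ("BS", 8), ("LF", 10), ("FF", 12),
                   ("CR", 13), ("NAK", 21), ("ESC", 27)]:
    print(name, code, caret_notation(code))
```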

  14. ISO-8859-1 • PCs work with bytes, so manufacturers were free to fill the other 128 positions. • ISO-8859-1, aka ISO-latin-1, extends ASCII with characters used by the western European languages. • It is the default character set of html. • Positions 128 to 159 are not used for printable characters. • Cp1252 fills most of these with graphic chars.
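
The difference between ISO-8859-1 and cp1252 shows up only for bytes 128 to 159. A sketch that decodes the same bytes both ways, assuming Python's standard `latin-1` and `cp1252` codecs; the helper `decode_both` is a name made up for this example.

```python
def decode_both(raw: bytes) -> tuple[str, str]:
    """Decode the same bytes as ISO-8859-1 and as cp1252 (hypothetical helper)."""
    return raw.decode("latin-1"), raw.decode("cp1252")


# 0xE9 lies outside the 128-159 range: both sets agree on it.
print([hex(ord(c)) for c in decode_both(b"\xe9")])

# 0x80 and 0x93 lie inside the range: ISO-8859-1 has no printable
# character there, while cp1252 assigns graphic characters.
for raw in (b"\x80", b"\x93"):
    latin, cp = decode_both(raw)
    print(raw, "latin-1:", hex(ord(latin)), "cp1252:", hex(ord(cp)), cp)
```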

  15. Three concepts for characters • Abstract Character Repertoire: the set of characters to be encoded, e.g., some alphabet or symbol set • Coded Character Set : a mapping from an abstract character repertoire to a set of non-negative integers • Character Encoding Scheme: a mapping from a coded character set to a serialized sequence of bytes
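
The three concepts can be told apart on a single character. In this sketch the character “é” belongs to a repertoire, the coded character set gives it one number, and several encoding schemes turn that number into different byte sequences. The codec names are those of Python's standard library.

```python
character = "é"  # a member of the abstract character repertoire

# Coded character set: one non-negative integer per character.
code = ord(character)
print("number in the coded character set:", code, hex(code))

# Character encoding schemes: the same number, serialized differently.
for scheme in ("latin-1", "utf-8", "utf-16-be", "utf-32-be"):
    print(scheme, character.encode(scheme).hex())
```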

  16. ISO 10646-1 • Defines the Universal Character Set (UCS) • UCS contains the characters required to write practically all known languages, even the likes of Gurmukhi, Oriya, Telugu, Bopomofo, Runic. • There are proposals for more, like Hieroglyphs and Tengwar. • Note that there are about 6800 known languages.

  17. UCS organization • ISO 10646 formally defines a 31-bit character set. Characters are represented as 32 bits, i.e. 4 bytes, or 8 hex digits. • The canonical form of ISO 10646 uses a four-dimensional coding space consisting of 128 groups. Each group consists of 256 planes, with each plane containing 256 rows, each having 256 cells.
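
In the 4-byte form, the four coordinates are simply the four bytes of the number, read from most to least significant. A sketch of that split, assuming the 31-bit code space described on this slide; the function name `ucs4_coordinates` is invented for this example.

```python
def ucs4_coordinates(code_point: int) -> tuple[int, int, int, int]:
    """Split a 31-bit UCS code into (group, plane, row, cell).

    Hypothetical helper: each coordinate is one byte of the 4-byte form.
    """
    if not 0 <= code_point < 2 ** 31:
        raise ValueError("UCS codes fit in 31 bits")
    group = (code_point >> 24) & 0x7F  # top byte, 7 usable bits
    plane = (code_point >> 16) & 0xFF
    row = (code_point >> 8) & 0xFF
    cell = code_point & 0xFF
    return group, plane, row, cell


# The euro sign has code 0x20AC: group 0, plane 0, row 0x20, cell 0xAC.
print(format(0x20AC, "08X"), ucs4_coordinates(0x20AC))
```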

  18. UCS organization • The first plane (Plane 0x00) of Group (0x00) is called the Basic Multilingual Plane (BMP). It has been fixed since first publication. • The subsequent 223 planes (0x01 to 0xDF) of Group 0x00, as well as planes 0x00 to 0xFF in Groups 0x01 to 0x5F are reserved for further standardization. • The last 32 planes (0xE0 to 0xFF) of Group 0x00, as well as all code positions of 32 groups (0x60 to 0x7F) are reserved for private use.
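
The allocation described on this slide can be written as a lookup on group and plane. This is only a sketch of the ranges as stated here (the 2002 state of the standard, not necessarily today's); the function name `ucs_zone` and its return strings are made up for illustration.

```python
def ucs_zone(code_point: int) -> str:
    """Classify a 31-bit UCS code by the group/plane ranges on the slide."""
    if not 0 <= code_point < 2 ** 31:
        raise ValueError("UCS codes fit in 31 bits")
    group = code_point >> 24
    plane = (code_point >> 16) & 0xFF
    if group == 0x00 and plane == 0x00:
        return "Basic Multilingual Plane"
    if group == 0x00 and plane >= 0xE0:
        return "private use"  # last 32 planes of Group 0x00
    if 0x60 <= group <= 0x7F:
        return "private use"  # the 32 private-use groups
    return "reserved for further standardization"


for code in (0x00000041, 0x00010000, 0x00E00000, 0x60000000):
    print(format(code, "08X"), ucs_zone(code))
```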

  19. Relationship with legacy sets • Let U+(four hex digits) denote characters in the BMP. • The UCS characters U+0000 to U+007F are identical to those in ASCII • The range U+0000 to U+00FF is identical to ISO 8859-1 (Latin-1).
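
Both identities can be verified mechanically: decoding every ASCII or Latin-1 byte must give the character with the same number. A sketch using Python's `ascii` and `latin-1` codecs; the helper `u_plus` is a name invented for this example.

```python
def u_plus(character: str) -> str:
    """Write a BMP character in U+ notation with four hex digits."""
    return f"U+{ord(character):04X}"


# Decode every byte value and compare with the character of the same number.
ascii_text = bytes(range(128)).decode("ascii")
latin1_text = bytes(range(256)).decode("latin-1")

ascii_matches = all(ord(ch) == i for i, ch in enumerate(ascii_text))
latin1_matches = all(ord(ch) == i for i, ch in enumerate(latin1_text))

print("ASCII identical to U+0000..U+007F:", ascii_matches)
print("Latin-1 identical to U+0000..U+00FF:", latin1_matches)
print(u_plus("A"), u_plus("é"))
```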

  20. Types of characters in UCS • Letters • Base characters • Ideographic characters • Combining characters • Digits • Extenders
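
Some of these types can be observed with Python's standard `unicodedata` module. This is only a rough sketch: the module reports Unicode general categories (such as `Ll` for a lowercase letter or `Mn` for a non-spacing mark), which do not map one-to-one onto the classes listed here, and it has no query for extenders.

```python
import unicodedata

samples = {
    "base letter": "e",
    "ideographic character": "\u4e2d",       # a CJK ideograph
    "combining character": "\u0301",         # combining acute accent
    "digit": "5",
}

for label, ch in samples.items():
    print(label, f"U+{ord(ch):04X}",
          "category:", unicodedata.category(ch),
          "combining class:", unicodedata.combining(ch))

# A base character followed by a combining character can be
# composed into a single precomposed character.
composed = unicodedata.normalize("NFC", "e\u0301")
print(len("e\u0301"), "characters become", len(composed))
```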

  21. http://openlib.org/home/krichel Thank you for your attention!
