190 likes | 367 Views
ENCODING AND DECODING. Experiencing one (or more) bytes out of your A’s. Overview. It’s not your father’s character set 8 bit characters ASCII The rest of the world wakes up to computers Unicode Character codes Different flavors Encoding and Decoding classes Example. The Good Old Days.
E N D
ENCODING AND DECODING Experiencing one (or more) bytes out of your A’s
Overview • It’s not your father’s character set • 8 bit characters • ASCII • The rest of the world wakes up to computers • Unicode • Character codes • Different flavors • Encoding and Decoding classes • Example
The Good Old Days • Focus on unaccented, English letters • Every letter, number, capital, etc • Represented by codes 0-127 • Space, 32; “A”, 65; “a”, 97 • Used 7 bits, one bit free on most computers • Wordstar and the 8th bit • Below 32 – control bits 7, beep; 12, formfeed
8th bit, values 128-255 • Everybody had their own ideas • OEM Character sets • IBM-PC -> graphics (horizontal bars, vertical bars, bars with dangles, etc.) • Outside U.S. different languages • Code 130
8th bit, values 128-255 • Everybody had their own ideas • OEM Character sets • IBM-PC -> graphics (horizontal bars, vertical bars, bars with dangles, etc.) • Outside U.S. different languages • Code 130
8th bit, values 128-255 • Everybody had their own ideas • OEM Character sets • IBM-PC -> graphics (horizontal bars, vertical bars, bars with dangles, etc.) • Outside U.S. different languages • Code 130 é in US, Gimel ג character in Israel • Difficult to exchange documents • Code pages – regional definition of bit values 128-255 • Israel: Code page 862 • Greek: Code page 737 • ISO/ANSI code pages • Asia – Alphabets had thousands of characters • No way to store in one byte (8 bits)
Unicode • Not a 16-bit code • A new way of thinking about characters • Old way: • Character “A” maps to memory or disk bits • A-> 0100 0001 • Unicode way: • Each letter in every alphabet maps to a “code point” • Abstract concept • “A” is Platonic “form” – just floats out there • A -> U+0639 code point
Unicode • Hello -> U+0048 U+0065 U+006C U+006C U+006F • Storing in 2 bytes each: • 0048 0065 006C 006C 006F (big endian) • Or 4800 6500 6C00 6C00 6F00 (little endian) • Need to have a Byte Order Mark (BOM) at beginning of stream • UTF8 coding system • Stores Unicode points (magic numbers) as 8 bit bytes • Values 0-127 go into byte 1 • Values 128+ go into bytes 2, 3, etc. • For characters up to 127, UTF8 looks just like ASCII
UNICODE Encodings • UTF-8 • UTF-16 – characters stored in 2 byte, 16-bit (halfword) sequences – also called UTF-2 • UTF-32 – characters stored in 4byte, 32 bit sequences • UTF-7 – forces a zero in high order bit - firewalls • Ascii Encoding – everything above 7 bits is dropped
Definitions • .NET uses UTF-16 encoding internally to store text • Encoding: • transfers a set of Unicode characters into a sequence of bytes • Send a string to a file or a network stream • Decoding: • transfers a sequence of bytes into a set of Unicode characters • Read a string from a file or a network stream • StreamReader, StreamWriter default to UTF-8
Encoding/Decoding Classes • UTF32Encoding class • Convert characters to and from UTF-32 encoding • UnicodeEncoding class • Convert characters to and from UTF-16 encoding • UTF8Encoding class to convert to and from UTF-8 encoding – 1, 2, 3, or 4 bytes per char • ASCIIEncoding class to convert to and from ASCII Encoding – drops all values > 127 • System.Text.Encoding supports a wide range of ANSI/ISO encodings
Convert a string into a stream of encoded bytes • Get an encoding object Encoding e = Encoding.GetEncoding(“Korean”); 2. use the encoding object’s GetBytes() method to convert a string into its byte representation byte[ ] encoded; encoded = e.GetBytes(“I’m gonna be Korean!”); Demo: D:\_Framework 2.0 Training Kits\70-536\Chapter 03\EncodingDemo
Write a file in encoded form FileStream fs = new FileStream("text.txt", FileMode.OpenOrCreate); ... StreamWriter t = new StreamWriter (fs, Encoding.UTF8); t. Write("This is in UTF8"); Read an encoded file FileStream fs = new FileStream("text.txt", FileMode.Open); ... StreamReader t = new StreamReader(fs, Encoding.UTF8); String s = t.ReadLine();
Summary • ASCII is one of oldest encoding standards. • UNICODE provides multilingual support • System.Text.Encoding has static methods for encoding and decoding text. • Use an overloaded Stream constructor that accepts an encoding object when writing a file. • Not necessary to specify Encoding object when reading, will default.
References • www.unicode.org • Unicode and .Net – what does .NET Provide? http://www.developerfusion.co.uk/show/4710/3/ • Hello Unicode, Goodbye ASCII http://www.nicecleanexample.com/ViewArticle.aspx?TID=unicode_encoding • The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) http://www.joelonsoftware.com/articles/Unicode.html