Overview of XML & XHTML

Instructor: Joseph DiVerdi, Ph.D., MBA

Character Sets • A Brief Digression...

Character Sets • Character • A Unit of a Written Language System, e.g., ay, bee, see, dee, eff, gee, aych, eye • Glyph • The Actual Printed or Displayed Shape of a Character, e.g., a b c 5 , $ ó

Character Sets • A Character May Associate With Several Glyphs • Close Quote - " or » • A Glyph May Correspond to Several Characters • Comma - Pause in Sentence or Decimal Indicator • In Certain Languages

Character Sets • Each Character Is Assigned • A Specific Numeric Value • Number of Characters in a Character Set • Limited by the Bit-depth of Its Encoding • 8-Bit Encoded Character Set - 256 characters • 16-Bit Encoded Character Set - 65,536 characters • HTML v2.0 & v3.2 are based on ISO 8859-1 • 8-Bit Character Set • AKA Latin-1
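As a rough illustration of the two points on this slide, the Python sketch below looks up the numeric value of a few characters and computes how many values a given bit depth allows. The helper name charset_capacity is made up for this example. Note that Python's ord() reports Unicode values, which coincide with ISO 8859-1 for values below 256.

```python
def charset_capacity(bits: int) -> int:
    """Return how many distinct values fit in the given bit depth."""
    return 2 ** bits


# ord() gives a character's numeric value; chr() goes the other way.
samples = {ch: ord(ch) for ch in "0^±"}

print(samples)               # {'0': 48, '^': 94, '±': 177}
print(charset_capacity(8))   # 256
print(charset_capacity(16))  # 65536
```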

Character Sets • ISO-8859-1 Character Set • 8-Bit Depth • First 128 Values From US-ASCII

Numeric Value   Glyph   Description
13              CR      carriage return
48              0       digit zero
65              A       uppercase aye
94              ^       caret
177             ±       plus-or-minus
191             ¿       inverted question mark
255             ÿ       lowercase wye w/umlaut
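A minimal sketch of how these numeric values turn into glyphs, assuming Python's built-in "iso-8859-1" and "ascii" codecs. The variable names are chosen for this example only. It decodes a few byte values from the slide and checks that the lower half of ISO 8859-1 agrees with US-ASCII.

```python
# A few numeric values from the ISO 8859-1 table, stored as raw bytes.
values = [48, 94, 177, 191, 255]
glyphs = bytes(values).decode("iso-8859-1")
print(glyphs)  # 0^±¿ÿ

# The first 128 values decode identically under US-ASCII and ISO 8859-1.
lower_half = bytes(range(128))
ascii_compatible = lower_half.decode("ascii") == lower_half.decode("iso-8859-1")
print(ascii_compatible)  # True
```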

Character Sets (continued) • Common Character Sets (8-Bit Except Where Noted)

ISO 8859-1   Latin-1
ISO 8859-5   Cyrillic
ISO 8859-6   Arabic
ISO 8859-7   Greek
ISO 8859-8   Hebrew
SHIFT_JIS    Japanese (multibyte, not 8-bit)
EUC_JP       Japanese (multibyte, not 8-bit)

Uses of Character Sets

Language     Code               Character Sets
French       fr                 iso-8859-1
Greek        el                 iso-8859-7
Hebrew       he (formerly iw)   iso-8859-8
Hungarian    hu                 iso-8859-2
Icelandic    is                 iso-8859-1
Italian      it                 iso-8859-1
Japanese     ja                 shift_jis, iso-2022-jp, euc-jp
Romanian     ro                 iso-8859-2
Russian      ru                 koi8-r, iso-8859-5
Serbian      sr                 iso-8859-5
Slovak       sk                 iso-8859-2
Spanish      es                 iso-8859-1
Turkish      tr                 iso-8859-9
Ukrainian    uk                 iso-8859-5
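To see why the character set has to match the language, the sketch below decodes one and the same byte under several of the character sets listed here, using Python's built-in codecs. The byte value 0xE1 is an arbitrary choice for illustration, and the variable names are hypothetical.

```python
# One byte, interpreted under five different character sets.
raw = bytes([0xE1])
charsets = ["iso-8859-1", "iso-8859-5", "iso-8859-7", "iso-8859-8", "koi8-r"]

decoded = {name: raw.decode(name) for name in charsets}
for name, glyph in decoded.items():
    # Print the charset, the glyph, and its Unicode code point.
    print(f"{name:12} {glyph}  U+{ord(glyph):04X}")
```

Each line of output shows a different letter (Latin, Cyrillic, Greek, or Hebrew), which is what a reader would see if a page were displayed with the wrong character set.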

Character Sets (continued) • 256 Characters are Sufficient • For Certain Languages • Insufficient for Others • Japanese (kanji) • Chinese • Korean • Vietnamese • Hence the Need For • 16-Bit Encoded Character Sets
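A small sketch of the limit described on this slide, assuming Python's built-in "iso-8859-1" and "shift_jis" codecs. The helper fits_in_latin1 is a name invented for this example. Two kanji cannot be encoded in an 8-bit character set, while a multibyte Japanese encoding spends two bytes on each.

```python
def fits_in_latin1(text: str) -> bool:
    """Return True if every character exists in ISO 8859-1."""
    try:
        text.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False


french = "déjà vu"
kanji = "漢字"

print(fits_in_latin1(french))  # True
print(fits_in_latin1(kanji))   # False

# A multibyte encoding handles the kanji, using two bytes per character.
sjis_bytes = kanji.encode("shift_jis")
print(len(kanji), len(sjis_bytes))  # 2 4
```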

Character Sets • 16-Bit Encoded Character Sets • Two Contiguous Bytes Represent One Character • 65,536 Possible Characters in One Set • Unicode Was Originally Designed as a 16-bit Character Set • Since Version 2.0 Its Code Space Extends to U+10FFFF, Encoded as UTF-8, UTF-16, or UTF-32 • Developed by the Unicode Consortium • Kept Synchronized With ISO/IEC 10646 • First 256 Slots Allocated to ISO 8859-1 • Backwards Compatible (woo-hoo!)
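The backwards-compatibility claim can be checked directly. The sketch below, with names chosen only for this example, confirms that each of the 256 ISO 8859-1 values lands on the Unicode slot with the same number, and shows a Latin-1 character stored as two contiguous bytes in UTF-16.

```python
def same_slot_as_latin1(value: int) -> bool:
    """Check that an ISO 8859-1 byte value maps to the same Unicode slot."""
    return bytes([value]).decode("iso-8859-1") == chr(value)


all_match = all(same_slot_as_latin1(v) for v in range(256))
print(all_match)  # True

# In big-endian UTF-16, a Latin-1 character is a zero byte plus its old value.
two_bytes = "é".encode("utf-16-be")
print(two_bytes)  # b'\x00\xe9'
```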

Character Sets • A Brief Digression... • Bottom Line • Specify Your Encoding As Required • Important For International Applications • Multi-Lingual Applications • There, now you know about it.
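In XML and XHTML, one way to specify the encoding is the encoding declaration at the top of the document. The sketch below uses a made-up one-element document and Python's standard xml.etree.ElementTree parser. It shows the parser honoring the declaration, and the same bytes failing to decode when a different encoding is assumed. The helper decodes_as is hypothetical.

```python
import xml.etree.ElementTree as ET

# A tiny document that declares its own encoding, stored as ISO 8859-1 bytes.
document = '<?xml version="1.0" encoding="ISO-8859-1"?><p>café ¿qué?</p>'
raw = document.encode("iso-8859-1")

# The parser reads the declaration and decodes the bytes accordingly.
root = ET.fromstring(raw)
text = root.text
print(text)  # café ¿qué?


def decodes_as(data: bytes, encoding: str) -> bool:
    """Return True if the bytes are valid in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


print(decodes_as(raw, "iso-8859-1"))  # True
print(decodes_as(raw, "utf-8"))       # False
```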
