Web Site Internationalization or Furansugo no dekiru mono wa imasen ka? (Japanese: "Is there anyone here who can speak French?") Instructor: Joseph DiVerdi, Ph.D., M.B.A.
Core Issue • "If the WWW is to reach a truly worldwide audience, it needs to be able to support the display of all the languages of the world, with all their unique alphabets and symbols, directionality, and specialized punctuation. This poses a big challenge to HTML constructs as we know them." • Web Design in a Nutshell
W3C's Efforts • "i18n" project • Spoken: eye-eighteen-en • Short for "internationalization": i, then 18 letters, then n • Two Primary Issues Addressed: • Alternate Character Sets • Take into account all writing systems in the world • How to Specify Languages & Their Unique Presentation Requirements • Within an HTML document • Current state of the art in • RFC 2070 • HTML v4.0
Character Sets • Character • A Unit of a Written Language System • Examples (by name): ay, bee, see, dee, eff, gee, aych, eye • Glyph • An Actual Printed or Displayed Character • Examples: a b c 5 , $ ó
Character Sets • A Character May Associate With Several Glyphs • Close Quote - " or » • A Glyph May Correspond to Several Characters • Comma - pause in a sentence or, in certain languages, decimal indicator
Character Sets • Each Character is Assigned • A Specific Numeric Value • Number of Characters in a Character Set • Limited by the Bit-Depth of its Encoding • 8-Bit Encoded Character Set - 256 characters • 16-Bit Encoded Character Set - 65,536 characters • HTML v2.0 & v3.2 are based on ISO 8859-1 • 8-Bit Character Set • AKA Latin-1
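The capacity limit is simple arithmetic: a fixed-width encoding with n bits has 2^n distinct values. A minimal Python sketch follows; Python and the helper name `charset_capacity` are illustrative choices, not something the slides use.

```python
def charset_capacity(bit_depth: int) -> int:
    """Number of distinct values a fixed-width encoding of this bit depth can hold."""
    return 2 ** bit_depth


# The two bit depths discussed on the slide.
for bits in (8, 16):
    print(f"{bits}-bit: {charset_capacity(bits):,} characters")
```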
Character Sets • ISO-8859-1 Character Set • 8-Bit Depth • First 128 Values From US-ASCII
Numeric Value   Glyph   Description
13              CR      carriage return
48              0       digit zero
65              A       uppercase aye
94              ^       caret
177             ±       plus-or-minus
191             ¿       inverted question mark
255             ÿ       lowercase wye w/umlaut
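The value-to-character mapping can be checked by decoding single bytes. This is a sketch in Python (an assumed choice of language); `latin1_glyph` is a hypothetical helper name, and the byte values are a few printable ones from ISO 8859-1.

```python
# A few printable byte values in ISO 8859-1 (65 is uppercase A).
SAMPLE_VALUES = [48, 65, 94, 177, 191, 255]


def latin1_glyph(value: int) -> str:
    """Return the character that ISO 8859-1 assigns to one byte value."""
    return bytes([value]).decode("iso-8859-1")


for value in SAMPLE_VALUES:
    print(value, latin1_glyph(value))

# The first 128 values decode identically under US-ASCII.
ascii_matches = all(
    bytes([v]).decode("ascii") == latin1_glyph(v) for v in range(128)
)
print("first 128 values match US-ASCII:", ascii_matches)
```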
Character Sets (continued) • Common character sets (the ISO 8859 sets are 8-bit; SHIFT_JIS and EUC_JP are multibyte Japanese encodings)
ISO 8859-1   Latin-1
ISO 8859-5   Cyrillic
ISO 8859-6   Arabic
ISO 8859-7   Greek
ISO 8859-8   Hebrew
SHIFT_JIS    Japanese
EUC_JP       Japanese
Uses of Character Sets
Language    Language Code   Character Sets
French      fr              iso-8859-1
Greek       el              iso-8859-7
Hebrew      iw              iso-8859-8
Hungarian   hu              iso-8859-2
Icelandic   is              iso-8859-1
Italian     it              iso-8859-1
Japanese    ja              shift_jis, iso-2022-jp, euc-jp
Romanian    ro              iso-8859-2
Russian     ru              koi8-r, iso-8859-5
Serbian     sr              iso-8859-5
Slovak      sk              iso-8859-2
Spanish     es              iso-8859-1
Turkish     tr              iso-8859-9
Ukrainian   uk              iso-8859-5
(The code iw for Hebrew is the older ISO 639 code; current usage is he.)
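When a language has more than one character set, as Russian does, the same letter is stored as different byte values, so the reader must know which set was used. A small Python sketch of that mismatch (Python is an assumed language; `decode_as` is a hypothetical helper):

```python
def decode_as(raw: bytes, charset: str) -> str:
    """Decode raw bytes using the named character set."""
    return raw.decode(charset)


letter = "\u042f"  # Cyrillic capital letter ya

# The same letter encoded in the two Russian character sets.
koi8_bytes = letter.encode("koi8-r")
iso_bytes = letter.encode("iso-8859-5")
print("byte values differ:", koi8_bytes != iso_bytes)

# Decoding with the matching set recovers the letter;
# decoding with the other set yields a different character.
print(decode_as(koi8_bytes, "koi8-r"))
print(decode_as(koi8_bytes, "iso-8859-5"))
```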
Character Sets (continued) • 256 Characters are Sufficient • For Certain Languages • Insufficient for Others • Japanese (kanji) • Chinese • Korean • Vietnamese • Hence the Need For • 16-Bit Encoded Character Sets
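The 256-character limit shows up as soon as text is encoded. The Python sketch below (an assumed language; `can_encode` is a hypothetical helper) tests whether a string fits a given character set.

```python
def can_encode(text: str, charset: str) -> bool:
    """Return True if every character of text exists in the character set."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


french = "fran\u00e7ais"
japanese = "\u65e5\u672c\u8a9e"  # the word "Japanese" written in kanji

print(can_encode(french, "iso-8859-1"))    # Latin-1 covers French
print(can_encode(japanese, "iso-8859-1"))  # kanji are not in Latin-1
print(can_encode(japanese, "shift_jis"))   # a Japanese character set covers them
```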
Character Sets • 16-Bit Encoded Character Sets • Two Contiguous Bytes Represent One Character • 65,536 Possible Characters in One Set • Unicode Was Designed as a 16-bit Character Set • Later versions extend beyond 65,536 characters • Developed by the Unicode Consortium • Practically Identical to ISO 10646-1 • First 256 Slots Allocated to ISO 8859-1 • Backwards Compatible (woo-hoo!)
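Both claims, two bytes per character and the Latin-1 compatibility of the first 256 slots, can be checked directly. A Python sketch under the assumption that Python's `utf-16-be` codec stands in for a plain 16-bit encoding of these characters:

```python
# Every ISO 8859-1 byte value decodes to the Unicode character
# with the same numeric value.
first_256_match = all(
    bytes([v]).decode("iso-8859-1") == chr(v) for v in range(256)
)
print("first 256 slots match ISO 8859-1:", first_256_match)

# In a 16-bit encoding, a Latin-1 character occupies two bytes:
# a zero byte followed by its Latin-1 value.
two_bytes = "\u00ff".encode("utf-16-be")
print(len(two_bytes), list(two_bytes))
```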
Specify Character Encoding • Document Character Encoding • Communicated Between Server & Client • Set With <META> tag & http-equiv attribute <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1"> • Stands In for the HTTP header Content-Type: text/html; charset=ISO-8859-1 • Required for Successful Validation • Browser Must Support Chosen Character Set • To Display Page Correctly
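On the receiving side, the character set is read out of the Content-Type value. A minimal Python sketch using the standard library's `email.message.Message` to parse the header; the helper name `charset_from_content_type` is hypothetical, and a real browser does considerably more than this.

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(header_value: str) -> Optional[str]:
    """Extract the charset parameter from a Content-Type header value.

    Returns the charset in lower case, or None if no charset is given.
    """
    message = Message()
    message["Content-Type"] = header_value
    return message.get_content_charset()


print(charset_from_content_type("text/html; charset=ISO-8859-1"))
print(charset_from_content_type("text/html"))
```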
Character Sets • HTML v4.0 adopts Unicode as its Document Character Set • v4.0 Browser Behavior: • Regardless of Document Creation Encoding • Browser converts characters to internal format • Interprets characters with HTML meaning, e.g., < and > • Converts character entities, e.g., &copy; to © • Where a character entity points outside the Latin-1 character set • e.g., &#982; (ϖ) for Pi • Uses the Unicode character to display the correct glyph
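The entity-conversion step can be tried out with Python's standard `html.unescape`, which resolves named and numeric character references. This is only a sketch of that one step, not of a full HTML v4.0 browser.

```python
import html

# A named entity inside Latin-1, a numeric reference outside Latin-1,
# and an entity for a character that has HTML meaning.
copyright_sign = html.unescape("&copy;")
pi_symbol = html.unescape("&#982;")
less_than = html.unescape("&lt;")

for label, char in [("&copy;", copyright_sign),
                    ("&#982;", pi_symbol),
                    ("&lt;", less_than)]:
    # ord() gives the Unicode numeric value of the resulting character.
    print(label, char, ord(char))
```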
Character Sets (continued) • Issues • Larger Data Transfers • Slower Processing • If it's Necessary • Just Do It
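The size cost is easy to measure. In this Python sketch (an assumed language), the same Latin-only text is encoded once in an 8-bit set and once in a 16-bit form; `utf-16-be` is used here as a stand-in for a 16-bit encoding.

```python
text = "Web Site Internationalization"

latin1_size = len(text.encode("iso-8859-1"))  # one byte per character
utf16_size = len(text.encode("utf-16-be"))    # two bytes per character for this text

print("8-bit encoding: ", latin1_size, "bytes")
print("16-bit encoding:", utf16_size, "bytes")
```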
v4.0 Language Tags • LANG attribute • Used Within Text Elements • To Switch to Other Languages Within a Document • Add to <HTML> Tag to Specify Language • For Entire Document <HTML LANG=fr> • To Turn On Norwegian • For Just One Element <P LANG=no> Something about Lutefisk </P>
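A program that consumes a page can read the LANG attributes back out. The Python sketch below uses the standard `html.parser.HTMLParser`; the class `LangCollector`, the helper `collect_langs`, and the sample markup are illustrative assumptions built from the two examples above.

```python
from html.parser import HTMLParser


class LangCollector(HTMLParser):
    """Record (tag, language code) for every element that carries LANG."""

    def __init__(self):
        super().__init__()
        self.langs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lower case.
        for name, value in attrs:
            if name == "lang":
                self.langs.append((tag, value))


def collect_langs(markup: str):
    collector = LangCollector()
    collector.feed(markup)
    collector.close()
    return collector.langs


SAMPLE = "<HTML LANG=fr><BODY><P LANG=no>Something about Lutefisk</P></BODY></HTML>"
print(collect_langs(SAMPLE))
```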
Language Codes • Representation of Language Names • See: • Table 27-1 of Web Design in a Nutshell • p 461 http://www.oclc.org/oclc/man/code/lang.htm
Language Codes • What Happens When Language is Specified? • Two General Answers: • Not Much • It Depends • An Individual viewer might configure his or her browser to respond differently to different language specifications • Search engines might respond to language specification • Consider LANG to be Structural Markup • Describe the Structure of the Document
Directionality • Many Languages Read from Right to Left • An International HTML Standard Needs to Take This Into Account • DIR attribute <P DIR=rtl> Left to Right from Read Languages Many </P>
Directionality • HTML v4.0 adds a tag to deal with documents containing combinations of left- and right-reading text • AKA bi-directional text or Bidi • <BDO> is used for bi-directional override • Specify a span of text that overrides the intrinsic directions of the text it contains <BDO DIR=ltr>An English phrase in an otherwise Hebrew text</BDO> • More Structured Markup
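The "intrinsic direction" of a character is a property recorded in the Unicode character database, which Python exposes through `unicodedata.bidirectional`. A short sketch (Python is an assumed language; the sample letters are arbitrary):

```python
import unicodedata

# One letter each from a left-to-right and two right-to-left scripts.
samples = {
    "Latin a": "a",
    "Hebrew alef": "\u05d0",
    "Arabic alef": "\u0627",
}

# "L" is left-to-right, "R" is right-to-left, "AL" is right-to-left Arabic.
directions = {
    label: unicodedata.bidirectional(char) for label, char in samples.items()
}
for label, direction in directions.items():
    print(label, direction)
```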
Cursive Joining • A Character's Shape Can Vary • Depending on its Position in a Word • In Some Writing Systems • In Arabic • Certain Characters Look Completely Different • When Used at the Beginning of a Word or • When Used as the Last Character of a Word • Also True of Many Other Languages
Cursive Joining • Zero-Width Unicode Characters Exist • Are Placed Between Characters • They Don't Appear in the Browser's Window • Act Purely as Instructions • To Specify Joining of the Neighboring Characters • &zwnj; • zero-width non-joiner • Prevents Joining of Characters Which Normally are Joined • &zwj; • zero-width joiner • Joins Characters Which Normally are Not Joined
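Because these characters are invisible, it helps to inspect them by name and code point. A Python sketch (an assumed language) that resolves the two entities and looks them up in the Unicode character database:

```python
import html
import unicodedata

# Resolve the two HTML entities to their Unicode characters.
zwnj = html.unescape("&zwnj;")
zwj = html.unescape("&zwj;")

for char in (zwnj, zwj):
    # Category "Cf" marks a format character: an instruction, not a visible glyph.
    print(f"U+{ord(char):04X}", unicodedata.name(char), unicodedata.category(char))
```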