Making Sense of Language Tags. 10th Metadata Open Forum. Presenter. Addison Phillips Globalization Architect, Yahoo! Chair, W3C Internationalization Core Working Group Co-Editor, Language Tag Registry Update (LTRU) Working Group (RFC 4646, RFC 4647, RFC 4646bis).
Making Sense of Language Tags • 10th Metadata Open Forum
Presenter • Addison Phillips • Globalization Architect, Yahoo! • Chair, W3C Internationalization Core Working Group • Co-Editor, Language Tag Registry Update (LTRU) Working Group (RFC 4646, RFC 4647, RFC 4646bis)
Languages, Language Tags, and Locales (oh my!) • Identifying language (and locale): the challenge • ISO 639 • IETF BCP 47 • RFC 4646, RFC 4647 • RFC 4646bis • Challenges for users
Human Language as Metadata • Some data is just data, but some data is human-readable text. • Text processing depends on language: • spelling, stemming, tokenization, word/line/sentence boundaries, thesauri, terminology, morphological analysis, font and stylistic traditions, collation. • IT systems depend on language negotiation: • localization, message selection, user interface, presentation, number/date/time/etc. formatting, list presentation
Human Language • "Ole Missus, de house is plum' jam full o' people, en dey's jes a-spi'lin' to see de gen'lemen!" (Mark Twain, Pudd'nhead Wilson)
Identifying Languages • Languages don’t form nice hierarchies • “splitters” vs “lumpers” • dialects, subdialects, regional and stylistic differences, patois • Differing communities with different needs • terminology, librarians, computer systems, translators, etc.
In the Beginning (ca. 1980 CE) Received Wisdom from the Dark Ages • Locales: • japanese, french, german, C • ENU, FRA, JPN • ja_JP.PCK • AMERICAN_AMERICA.WE8ISO8859P1 • Languages… … looked a lot like locales (and vice versa)
ISO 639 • Defines language identifier codes • Multiple parts: • ISO 639-1 (alpha2 codes, 676 possible) (136 codes) • ISO 639-2 (alpha3 codes, 17576 possible) (about 500) • ISO 639-3 (alpha3 codes) (about 7000) • ISO 639-4 (principles for encoding) • ISO 639-5 (language families) • ISO 639-6 (alpha4 codes) (under development)
Impact of ISO 639-3 • ISO 639-2 and 639-3 share a codespace • all 639-2 codes are also 639-3 codes • Macrolanguages
Human Language • en: "Ole Missus, de house is plum' jam full o' people, en dey's jes a-spi'lin' to see de gen'lemen!" (Mark Twain, Pudd'nhead Wilson)
Parallel Efforts • ISO 639: ISO 639-1 (early 1980s), ISO 639-2 (alpha3), ISO 639-3 • IETF BCP 47: RFC 1766 (1995), RFC 3066 (2001), RFC 4646 (2006), RFC 4646bis (2007)
BCP 47 • Internet Engineering Task Force (IETF) “Best Current Practice” (BCP) • Enable presentation, selection, and negotiation of content in protocols and formats • Widely used! • XML, HTML, RSS, MIME, SOAP, SMTP, LDAP, CSS, XSL, CCXML, Java, C#, ASP, perl, Apache, IE, Mozilla……….
Adds Granularity • Need to identify language on varying levels of mutual intelligibility and granularity • "Ole Missus, de house is plum' jam full o' people, en dey's jes a-spi'lin' to see de gen'lemen!" (Mark Twain, Pudd'nhead Wilson) • en-US • en
What’s a Locale • “a concept or identifier used by programmers to represent a particular collection of cultural, regional, or linguistic preferences.” • java.util.Locale • .Net Culture • LANG (setlocale in C, C++) • NLS_LANG in Oracle • … and so on…
Locales? Huh? Theatre Center News: The date of the last version of this document was 2003年3月20. A copy can be obtained for $50,0 or 1.234,57 грн. We would like to acknowledge contributions by the following authors (in alphabetical order): Alaa Ghoneim, Behdad Esfahbod, Ahmed Talaat, Eric Mader, Asmus Freytag, Avery Bishop, and Doug Felt.
Locale Identifiers • Different ideas: • “Accept-Locale” vs. Accept-Language • URIs/URNs, etc. • CLDR/LDML • And Requirements: • Operating environments and harmonization • App Servers • Web Services • New Solution? Cost of Adoption: • UTF-8 to the browser: 8 long years
IUC23, March 2003: Locales and Language Tags meet • “We really need locale identifiers.” • “Language tags are being (ab)used as locale identifiers anyway…” • “Not going to need a big new thing…” • “Yeah, we’ll write an RFC … we can do this really fast…”
BCP 47 (Historic) Basic Structure • Alphanumeric (ASCII only) subtags • Up to eight characters long • Separated by hyphens • Case not important (i.e. zh = ZH = zH = Zh) • Grammar: 1*8alphanum *[ “-” 1*8alphanum ]
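The historic structure above is small enough to check with one regular expression. The following is a minimal sketch, assuming Python; the names `HISTORIC_TAG`, `is_historic_well_formed`, and `same_tag` are made up for illustration. It only checks the shape of a tag and says nothing about whether the subtags are registered or meaningful.

```python
import re

# Historic BCP 47 shape as given on the slide: each subtag is 1 to 8 ASCII
# letters or digits, and subtags are separated by single hyphens.
HISTORIC_TAG = re.compile(r"[A-Za-z0-9]{1,8}(?:-[A-Za-z0-9]{1,8})*")


def is_historic_well_formed(tag: str) -> bool:
    """Return True if the whole string fits the historic subtag grammar."""
    return HISTORIC_TAG.fullmatch(tag) is not None


def same_tag(a: str, b: str) -> bool:
    """Compare two tags ignoring case, since case is not significant."""
    return a.lower() == b.lower()
```

For example, `is_historic_well_formed("zh-TW")` and `is_historic_well_formed("i-klingon")` both hold, while a tag with an empty subtag such as `en--US` is rejected.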
RFC 1766 • zh-TW: ISO 639-1 (alpha2) language code plus ISO 3166 (alpha2) country code • i-klingon: registered value
RFC 3066 • sco-GB: adds ISO 639-2 (alpha3) codes • But use alpha2 codes when they exist (so eng-GB is wrong)
Problems • Script Variation: • zh-Hant/zh-Hans • (sr-Cyrl/sr-Latn, az-Arab/az-Latn/az-Cyrl, etc.) • Obsolescence of registrations: • art-lojban (now jbo), i-klingon (now tlh) • Instability in underlying standards: • sr-CS (CS used to be Czechoslovakia and was later reassigned to Serbia and Montenegro) • Lack of a single authoritative, stable source
And More Problems • Lack of scripts • Little support for registered values in software • Reassignment of values by ISO 3166 • Lack of consistent tag formation (Chinese dialects?) • Standards not readily available, bad references • Bad implementation assumptions • 1*8 alphanum *[ “-” 1*8 alphanum] • 2*3 ALPHA [ “-” 2ALPHA ] • Many registrations to cover small variations • 8 German registrations to cover two variations
LTRU and RFC 4646 • Defines a generative syntax • machine readable • future proof, extensible • Defines a single source (IANA Language Subtag Registry) • Stable subtags, no conflicts • Machine readable • Defines when to use subtags • (sometimes)
Anatomy of a Language Tag • sl-Latn-IT-rozaj-1994-x-mine • sl: language, ISO 639-1/2 (alpha2/3) • Latn: script, ISO 15924 script codes (alpha4) • IT: region, ISO 3166 (alpha2) or UN M49 • rozaj-1994: registered variants • x-mine: private use and extension
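Because subtags can be told apart by length and content, a tag can be split into its parts without consulting the registry. The sketch below, in Python, illustrates the idea under simplifying assumptions: `ParsedTag` and `parse_tag` are hypothetical names, extended language subtags and grandfathered tags such as i-klingon are not handled, and nothing is validated against the IANA registry.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ParsedTag:
    language: str
    script: Optional[str] = None
    region: Optional[str] = None
    variants: List[str] = field(default_factory=list)
    extensions: List[str] = field(default_factory=list)
    private_use: Optional[str] = None


def _alpha(s: str) -> bool:
    return s.isascii() and s.isalpha()


def _alnum(s: str) -> bool:
    return s.isascii() and s.isalnum()


def _is_variant(s: str) -> bool:
    # Variants: 5 to 8 alphanumerics, or 4 characters starting with a digit.
    return _alnum(s) and (5 <= len(s) <= 8 or (len(s) == 4 and s[0].isdigit()))


def parse_tag(tag: str) -> ParsedTag:
    """Split a well-formed tag by subtag length and content (simplified)."""
    subtags = tag.split("-")
    if not (_alpha(subtags[0]) and 2 <= len(subtags[0]) <= 8):
        raise ValueError(f"unsupported primary language subtag in {tag!r}")
    parsed = ParsedTag(language=subtags[0].lower())
    i = 1
    # Script: exactly four letters.
    if i < len(subtags) and len(subtags[i]) == 4 and _alpha(subtags[i]):
        parsed.script = subtags[i].title()
        i += 1
    # Region: two letters (ISO 3166) or three digits (UN M49).
    if i < len(subtags) and (
        (len(subtags[i]) == 2 and _alpha(subtags[i]))
        or (len(subtags[i]) == 3 and subtags[i].isascii() and subtags[i].isdigit())
    ):
        parsed.region = subtags[i].upper()
        i += 1
    while i < len(subtags) and _is_variant(subtags[i]):
        parsed.variants.append(subtags[i].lower())
        i += 1
    # Extensions: a single character other than "x", then its own subtags.
    while i < len(subtags) and len(subtags[i]) == 1 and subtags[i].lower() != "x":
        start = i
        i += 1
        while i < len(subtags) and len(subtags[i]) > 1:
            i += 1
        parsed.extensions.append("-".join(subtags[start:i]).lower())
    # Private use: "x" and everything after it.
    if i < len(subtags) and subtags[i].lower() == "x":
        parsed.private_use = "-".join(subtags[i:]).lower()
        i = len(subtags)
    if i != len(subtags):
        raise ValueError(f"unexpected subtag {subtags[i]!r} in {tag!r}")
    return parsed
```

Calling `parse_tag("sl-Latn-IT-rozaj-1994-x-mine")` yields language `sl`, script `Latn`, region `IT`, variants `rozaj` and `1994`, and private use `x-mine`.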
More Examples • fr, de, nl, en, ja • fr-FR, fr-CA, de-DE, de-CH… • es-419 (Spanish for Americas) • en-US (English for USA) • de-CH-1996 (Old tags are all valid) • sl-rozaj-1994 (Multiple variants) • zh-t-wadegile (Extensions)
Solves the Script problem • zh-Hant (!= zh-TW) • zh-Hans (!= zh-CN) • Azerbaijani (az): Arab, Cyrl, Latn • Serbian (sr): Cyrl, Latn • Yiddish (yi): Hebr, Latn • Mongolian (mn): Cyrl, Latn, Hani • Belarusian (be): Cyrl, Latn • Etc.
Benefits • Subtag registry in one place: one source, machine-readable • Subtags identified by length/content • Extensible • Compatible with RFC 3066 tags • Stable: subtags are forever
Tag Choice • “Tag Content Wisely” • use the shortest tag reasonable • use as many subtags as necessary to disambiguate • don’t invent things; use the registry • map deprecated values to modern equivalents
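The last rule, mapping deprecated values to modern equivalents, can be sketched as a simple table lookup. This Python example is an illustration only: `DEPRECATED_TAGS` is a hand-written excerpt holding just the two registrations mentioned earlier in this deck, and `modernize` is a hypothetical helper. A real implementation would presumably load its mappings from the IANA Language Subtag Registry.

```python
# Hypothetical, hand-written excerpt (assumption): only the two obsolete
# registrations named in this deck. Real code should read the registry.
DEPRECATED_TAGS = {
    "art-lojban": "jbo",
    "i-klingon": "tlh",
}


def modernize(tag: str) -> str:
    """Return the modern equivalent of a deprecated tag, else the tag itself."""
    return DEPRECATED_TAGS.get(tag.lower(), tag)
```

Tags that are not in the table pass through unchanged, so `modernize("en-US")` simply returns `en-US`.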
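Specialized Codes • zxx: no linguistic content • und: undetermined language • mis: miscellaneous (uncoded) languages • Zxxx: script subtag for unwritten content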
Problems • Matching • Does “en-US” match “en-Latn-US”? • Tag Choices • Users have more to choose from. • Implementations • More to do, more to think about • (easier to parse, process, support the good stuff)
Tag Matching (RFC 4647) • Uses “Language Ranges” in a “Language Priority List” to select sets of content according to the language tag • Three Schemes • Basic Filtering • Extended Filtering • Lookup
Tags are not Tokens! • Many technologies would like language tags (attributes, etc.) to be atomic, but language tags have structure • <span class="foo" xml:lang="en-US" /> • .foo:lang(en) { color: red; } • Accept-Language: zh;q=1.0, de-DE;q=0.8
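As an illustration of treating the header as structured data, here is a minimal Python sketch that turns an Accept-Language value into an ordered list of language ranges. The function name `parse_accept_language` is an assumption made for this example, and the parsing is deliberately lenient: a missing q value defaults to 1.0 and an unparsable one is treated as 0.0.

```python
from typing import List, Tuple


def parse_accept_language(header_value: str) -> List[Tuple[str, float]]:
    """Return (language range, quality) pairs, highest quality first."""
    entries: List[Tuple[str, float]] = []
    for item in header_value.split(","):
        item = item.strip()
        if not item:
            continue
        parts = [part.strip() for part in item.split(";")]
        language_range = parts[0]
        quality = 1.0  # default weight when no q parameter is given
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0  # lenient choice made for this sketch
        entries.append((language_range, quality))
    # The sort is stable, so ranges with equal weight keep header order.
    entries.sort(key=lambda entry: entry[1], reverse=True)
    return entries
```

The resulting list can serve as the language priority list that the matching schemes in RFC 4647 take as input.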
Filtering • Ranges specify the least specific item • “en” matches “en”, “en-US”, “en-Brai”, “en-boont” • Basic matching uses plain prefixes • “en-US” matches “en-US” or “en-US-boont” but not “en-Latn-US” • Extended matching can match “inside bits” • “en-*-US”
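The two filtering schemes can be sketched in a few lines of Python. The function names are made up for this example, and the extended variant follows the subtag-by-subtag comparison described in RFC 4647 as I understand it, so treat it as an illustration and not a reference implementation.

```python
from typing import Callable, Iterable, List


def basic_filter_match(language_range: str, tag: str) -> bool:
    """Basic filtering: the range equals the tag or is a prefix ending at a hyphen."""
    r, t = language_range.lower(), tag.lower()
    return r == "*" or t == r or t.startswith(r + "-")


def extended_filter_match(language_range: str, tag: str) -> bool:
    """Extended filtering: range subtags may match "inside" the tag."""
    r = language_range.lower().split("-")
    t = tag.lower().split("-")
    if r[0] != "*" and r[0] != t[0]:
        return False
    ri, ti = 1, 1
    while ri < len(r):
        if r[ri] == "*":
            ri += 1                 # a wildcard matches zero or more subtags
        elif ti >= len(t):
            return False            # the tag ran out of subtags
        elif r[ri] == t[ti]:
            ri += 1
            ti += 1
        elif len(t[ti]) == 1:
            return False            # never skip past a singleton such as "x"
        else:
            ti += 1                 # skip a tag subtag the range does not name
    return True


def filter_tags(
    language_range: str,
    tags: Iterable[str],
    match: Callable[[str, str], bool] = basic_filter_match,
) -> List[str]:
    """Return every tag that the range selects under the given scheme."""
    return [tag for tag in tags if match(language_range, tag)]
```

With basic filtering, the range `en-US` selects `en-US-boont` but not `en-Latn-US`; with extended filtering, `en-*-US` selects `en-Latn-US` as well.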
Lookup • Range specifies the most specific tag in a match. • Returns exactly one item. • “en-US” might return either “en” or “en-US” but not “en-US-boont” • Mirrors the locale fallback mechanism and many language negotiation schemes.
Lookup and Language Negotiation (Global Binary Resources) • Resources “fall back” to find the best match • Falling back: zh-Hans-SG (Chinese, Simplified script, Singapore) → zh-Hans (Chinese, Simplified script) → zh (Chinese) → (root)
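Lookup and its fallback chain can be sketched as progressive truncation of the range. In this Python example, `fallback_chain` and `lookup` are hypothetical names, and the `default` argument plays the role of the root resource; the handling of single-character subtags follows my reading of RFC 4647.

```python
from typing import Iterable, List, Optional


def fallback_chain(language_range: str) -> List[str]:
    """List the range followed by its progressively truncated forms."""
    subtags = language_range.split("-")
    chain: List[str] = []
    while subtags:
        chain.append("-".join(subtags))
        subtags.pop()
        # A single-character subtag (such as "x") must not be left dangling.
        while subtags and len(subtags[-1]) == 1:
            subtags.pop()
    return chain


def lookup(
    priority_list: Iterable[str],
    available: Iterable[str],
    default: Optional[str] = None,
) -> Optional[str]:
    """Return the single best available tag, or the default if none fits."""
    by_lower = {tag.lower(): tag for tag in available}
    for language_range in priority_list:
        if language_range == "*":
            continue  # the wildcard carries no information for lookup
        for candidate in fallback_chain(language_range):
            hit = by_lower.get(candidate.lower())
            if hit is not None:
                return hit
    return default
```

Here `fallback_chain("zh-Hans-SG")` produces zh-Hans-SG, zh-Hans, zh, and a request for `en-US` against content tagged `en` and `en-US-boont` returns `en`, never the more specific `en-US-boont`.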
What Do I Do (Content Author)? • Not much. • Existing tags are all still valid: tagging is mostly unchanged. • Resist temptation to (ab)use the private use subtags. • Unless your language has script variations: • Tag content with the appropriate script subtag(s) • Script subtags only apply to a small number of languages: “zh”, “sr”, “uz”, “az”, “mn”, and a very small number of others.
What Do I Do (Programmer)? • Check code for compliance with 4646 • Decide on well-formed or validating • Implement suppress-script • Change to using the registry • Bother infrastructure folks (Java, MS, Mozilla, etc) to implement the standard
I need a new subtag… • Register new subtags with ietf-languages@iana.org • only primary language or variant subtags • read RFC 4646 for instructions • two-week review period with expert approval
LTRU Milestone Dates • RFC 4646 • Registry went live in December 2005 • RFC 4647 • (Anticipated) RFC 4646bis • This includes ISO 639-3 support, extended language subtags, and possibly ISO 639-6
RFC 4646bis (Internet-Draft) • Currently taking shape • Adds about 7000 additional primary language subtags from ISO 639-3 • Extended language subtags for Chinese and other languages being debated • … and some cleanup work on processes and procedures
Macrolanguages and Extlang • zh-Hant-HK: Chinese, Traditional Script, Hong Kong SAR • yue-Hant-HK: Cantonese, Traditional Script, Hong Kong SAR • zh-yue-Hant-HK (extlang): Chinese, Cantonese, Traditional Script, Hong Kong SAR
Things to Do (languages) • Get involved in LTRU • Get involved in W3C I18N Activity • Write implementations • Work on adoption of BCP 47: understand the impact • Then get involved with Locale identifiers…
Back to Locales… • IUC 20 Round Table • Suzanne Topping’s Multilingual Article • Tex Texin and the Locales list…
W3C and Unicode • W3C • Identifiers and cross-over with language tags • Web services • XML, HTML • Unicode Consortium • LDML • CLDR • Standards for content
Language Tags and Locale Identifiers REC (LTLI) • Working Draft developed by W3C I18N Architecture WG • effort currently moribund: needs community participation • defines standards and guidelines for using language tags in W3C technologies • defines relationship of language tags to locale identifiers • basis for efforts such as WS-I18N
Things to Read • Tag and Registry RFC http://www.ietf.org/rfc/rfc4646.txt • Matching RFC http://www.ietf.org/rfc/rfc4647.txt • 4646bis Draft http://www.ietf.org/internet-drafts/draft-ltru-4646bis-06.txt • References http://www.langtag.net http://www.inter-locale.com • LTRU Mailing List https://www1.ietf.org/mailman/listinfo/ltru