CLDR: The Common Locale Data Repository Locales for the World

  1. CLDR: The Common Locale Data Repository. Locales for the World. Lisa Moore, George Rhoten, Mark Davis, Steven Loomis

  2. Agenda • Why CLDR? • CLDR data • Tools and vetting • Today and the future LRC – XI The Localisation Factory

  4. Locales – does anything stay the same? "Theatre Center News: The date of the last version of this document was 2003年3月20日. A copy can be obtained for $50,0 or 1.234,57 грн. We would like to acknowledge contributions by the following authors (in alphabetical order): Alaa Ghoneim, Behdad Esfahbod, Ahmed Talaat, Eric Mader, Asmus Freytag, Avery Bishop, and Doug Felt."

  5. Locales – the many differences • Locales specify user preferences • Linguistic and cultural differences • Languages, scripts, writing systems, ordering, directionality, formatting, numbers, sizes • Even in the same locale, interoperability issues across platforms • Global economics has increased the need for greater globalization support in computer systems • Everyone expects more!

  6. Add the Universal Character Encoding • Unicode: Unique character codes for all languages …

  7. The Need for Common Locale Data • Computing environments often contain a variety of operating systems and software. • Historically, research into locale-sensitive data has been done separately by individuals and/or companies. • Because of political changes, it is easy for locale data to become out of date. • It is difficult to get complete agreement on correctness.

  8. Common Locale Data Project • Began as Common XML Locale Repository (CXLR) developed by OpenI18N in 2003 • CLDR project began in 2004 • Hosted by Unicode Consortium • http://www.unicode.org/cldr/ • Goals: • Common, necessary software locale data for all world languages • Collect and maintain locale data • XML format for effective interchange • Freely available

  9. CLDR in use (partial list) • Libraries and Environments • ICU – International Components for Unicode • JDK – Java Development Kit • Operating Systems • Solaris • AIX • MacOS X • Applications • OpenOffice.org • Acrobat • ModernBill

  11. What is a Locale? • A locale is an identifier referring to linguistic and cultural preferences • en_US, en_GB, ja_JP • These preferences can change over time due to cultural and political reasons • Introduction of new currencies, like the Euro • Standard sorting of Spanish changes • Many of these preferences have varying degrees of standardization • 12 and 24 hour format in the United States • This is a very broad topic
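
The identifiers on this slide (en_US, en_GB, ja_JP) combine a language code with a territory code. As a minimal sketch, assuming only that two-part form, the split could look like the following. The names `LocaleId` and `parse_locale_id` are hypothetical and not part of CLDR or any library; real identifiers may also carry script and variant parts, which this sketch ignores.

```python
from typing import NamedTuple, Optional


class LocaleId(NamedTuple):
    """Hypothetical container for the two parts shown on the slide."""
    language: str
    territory: Optional[str]


def parse_locale_id(identifier: str) -> LocaleId:
    """Split a simple identifier such as 'en_US' into language and territory.

    Assumption: the identifier has at most a language and a territory.
    """
    parts = identifier.split("_")
    language = parts[0].lower()
    territory = parts[1].upper() if len(parts) > 1 else None
    return LocaleId(language, territory)


for identifier in ("en_US", "en_GB", "ja_JP"):
    print(parse_locale_id(identifier))
```

Both en_US and en_GB share the language "en" and differ only in territory, which is why much locale data can be stored once at the language level.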

  12. Types of Locale Data • Dates/time/calendar formats • Number/currency formats • Measurement system • Collation specification • Sorting • Searching • Matching • Translated names for language, territory, script, timezones, currencies,… • Script and characters used by a language

  13. Locale Data Markup Language • Locale data described using XML • CLDR data uses LDML • Structure of CLDR controlled by Locale Data Markup Language (LDML) specification http://unicode.org/reports/tr35

  14. LDML Data Categories <ldml> <identity> <localeDisplayNames> <layout> <characters> <delimiters> <measurement> <dates> <numbers> <posix> <collations>

  15. Names <localeDisplayNames> • Provides translated display names for languages, territories, scripts, variants and keywords used in CLDR. • Most of this information is at the language level, since it typically does not vary by territory, only language. • An example: ICU Locale Explorer

  16. Names Examples From ga.xml (Irish):
      <localeDisplayNames>
        <languages>
          <language type="aa">Afar</language>
          <language type="ab">Abcáisis</language>…
        <scripts>
          <script type="Arab">Araibis</script>…
        <territories>
          <territory type="AD">Andóra</territory>
          <territory type="AE">Aontas na nÉimíríochtaí Arabacha</territory>…
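
To show how an application might read these display names, here is a small sketch using Python's standard `xml.etree.ElementTree`. The fragment is a closed-off copy of the excerpt above containing only the entries shown on the slide; the wrapping `<ldml>` root follows the category list on slide 14, and the helper name `display_names` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Well-formed copy of the slide's excerpt; only the entries quoted there.
LDML_FRAGMENT = """
<ldml>
  <localeDisplayNames>
    <languages>
      <language type="aa">Afar</language>
      <language type="ab">Abcáisis</language>
    </languages>
    <scripts>
      <script type="Arab">Araibis</script>
    </scripts>
    <territories>
      <territory type="AD">Andóra</territory>
      <territory type="AE">Aontas na nÉimíríochtaí Arabacha</territory>
    </territories>
  </localeDisplayNames>
</ldml>
"""


def display_names(ldml_text, group, item):
    """Return {code: translated name} for one group, e.g. languages/language."""
    root = ET.fromstring(ldml_text)
    path = f"localeDisplayNames/{group}/{item}"
    return {el.get("type"): el.text.strip() for el in root.findall(path)}


territories = display_names(LDML_FRAGMENT, "territories", "territory")
print(territories["AE"])  # Aontas na nÉimíríochtaí Arabacha
```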

  17. Characters <characters> • Allows for creation of exemplar character sets. An exemplar set specifies the set of characters that must be present in order to properly render the language. • Auxiliary exemplar set defines additional characters that may appear in foreign words or phrases. • Lower case only
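
One way to use an exemplar set is to check whether a piece of text stays within it. The sketch below uses the main Irish exemplar set from ga.xml and a deliberately simplified parser that understands only space-separated single characters and `x-y` ranges; the full set syntax in the LDML specification is richer. The function names are hypothetical.

```python
import unicodedata


def parse_exemplars(pattern: str) -> set:
    """Expand a simple exemplar pattern like '[a á b-e]' into a set of characters.

    Assumption: tokens are either one character or a three-character range.
    """
    chars = set()
    body = unicodedata.normalize("NFC", pattern).strip().strip("[]")
    for token in body.split():
        if len(token) == 3 and token[1] == "-":
            chars.update(chr(cp) for cp in range(ord(token[0]), ord(token[2]) + 1))
        else:
            chars.add(token)
    return chars


def uncovered(text: str, exemplars: set) -> set:
    """Return the letters in text that the exemplar set does not contain.

    Text is lowercased first because exemplar sets are lower case only.
    """
    normalized = unicodedata.normalize("NFC", text).lower()
    return {ch for ch in normalized if ch.isalpha() and ch not in exemplars}


IRISH_MAIN = parse_exemplars("[a á b-e é f-i í j-o ó p-u ú v-z]")

print(uncovered("Éire", IRISH_MAIN))    # set()
print(uncovered("Straße", IRISH_MAIN))  # {'ß'}
```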

  18. Date Formats <dates> • Defines representation of calendars using various calendaring systems (Gregorian, Buddhist, Islamic, Japanese, etc.) • Defines formatting for dates, times, eras and time zones • wide, abbreviated, or narrow • Date and time formats use patterns of letters to define proper formatting • Week information • Relative day/time translations (for example, yesterday, tomorrow, etc.) • An example: ICU Locale Explorer

  19. Characters / Dates Examples From ga.xml (Irish):
      <characters>
        <exemplarCharacters>[a á b-e é f-i í j-o ó p-u ú v-z]</exemplarCharacters>
        <exemplarCharacters type="auxiliary">[ḃ ċ ḋ ḟ ġ ṁ ṗ ṡ ṫ]</exemplarCharacters>
      </characters>…
      <dayContext type="format">
        <dayWidth type="abbreviated">
          <day type="sun">Domh</day>
          <day type="mon">Luan</day>…
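
As an illustration of how the day names might be consumed, the sketch below reads the abbreviated names from a closed-off copy of the `<dayContext>` excerpt and looks up the name for a given date. Only the two days quoted on the slide are included, so other days fall back to their key. The helper names and the mapping from Python's weekday numbers to LDML day keys are assumptions of this sketch.

```python
import datetime
import xml.etree.ElementTree as ET

# Well-formed copy of the slide's excerpt; only the two days quoted there.
DAYS_FRAGMENT = """
<dayContext type="format">
  <dayWidth type="abbreviated">
    <day type="sun">Domh</day>
    <day type="mon">Luan</day>
  </dayWidth>
</dayContext>
"""

# date.weekday() returns 0 for Monday through 6 for Sunday.
DAY_KEYS = ("mon", "tue", "wed", "thu", "fri", "sat", "sun")


def abbreviated_day_names(fragment):
    """Return {day key: abbreviated name} from a <dayContext> element."""
    root = ET.fromstring(fragment)
    path = "dayWidth[@type='abbreviated']/day"
    return {day.get("type"): day.text for day in root.findall(path)}


def day_name(date, names):
    """Look up the localized name, falling back to the key if none is present."""
    key = DAY_KEYS[date.weekday()]
    return names.get(key, key)


names = abbreviated_day_names(DAYS_FRAGMENT)
print(day_name(datetime.date(2006, 7, 17), names))  # Luan (a Monday)
```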

  20. Time Zone Names <timeZoneNames> • Based on Olson time zone database • Localized display names for standard, daylight, and generic representations of time zones. • Short and long display names.

  21. Numbers <numbers> • Specifies proper localized formatting of numeric quantities • Decimal • Scientific • Currency • Percentages • Includes localized decimal, thousands separators, currency symbols, etc.
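
The effect of localized decimal and thousands separators can be sketched in a few lines. The function below is hypothetical and simplified: it always groups digits in threes and takes the two separators as plain arguments, whereas CLDR expresses number formats through patterns and symbol tables.

```python
def format_decimal(value: float, decimal: str, group: str, digits: int = 2) -> str:
    """Format value with the given decimal and grouping separators.

    Assumption: grouping is always in threes, as Python's ',' format option does.
    """
    plain = f"{value:,.{digits}f}"  # e.g. '1,234.57'
    # translate() maps both characters in one pass, so they cannot clobber each other.
    return plain.translate(str.maketrans({",": group, ".": decimal}))


print(format_decimal(1234.57, decimal=".", group=","))  # 1,234.57
print(format_decimal(1234.57, decimal=",", group="."))  # 1.234,57
```

The second call reproduces the "1.234,57" form that appears in the mixed-locale example on slide 4.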

  22. Time Zones / Currencies From ga.xml (Irish) and root.xml:
      <timeZoneNames>
        <zone type="Europe/Dublin">
          <long>
            <standard>Meán-Am Greenwich</standard>
            <daylight>Am Samhraidh na hÉireann</daylight>
          </long>…
      <numbers>
        <currencies>
          <currency type="EUR">
            <displayName>Euro</displayName>
            <symbol>€</symbol>…

  23. Delimiters <delimiters> • Specifies primary and secondary pairs of delimiter characters to be used for bracketing quotations in text

  24. Delimiters Example From fr.xml (French):
      <delimiters>
        <quotationStart>«</quotationStart>
        <quotationEnd>»</quotationEnd>
        <alternateQuotationStart>“</alternateQuotationStart>
        <alternateQuotationEnd>”</alternateQuotationEnd>
      </delimiters>
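
A short sketch of how the French delimiters above could be applied: the primary pair brackets a quotation, and the alternate pair brackets a quotation nested inside another. The helper names `load_delimiters` and `quote` are hypothetical, and the sketch adds no spacing inside the marks.

```python
import xml.etree.ElementTree as ET

# The <delimiters> element exactly as quoted from fr.xml on the slide.
FR_DELIMITERS = """
<delimiters>
  <quotationStart>«</quotationStart>
  <quotationEnd>»</quotationEnd>
  <alternateQuotationStart>“</alternateQuotationStart>
  <alternateQuotationEnd>”</alternateQuotationEnd>
</delimiters>
"""


def load_delimiters(fragment):
    """Return {element name: delimiter character} from a <delimiters> element."""
    root = ET.fromstring(fragment)
    return {child.tag: child.text for child in root}


def quote(text, delimiters, nested=False):
    """Wrap text in the primary pair, or the alternate pair when nested."""
    prefix = "alternateQuotation" if nested else "quotation"
    return delimiters[prefix + "Start"] + text + delimiters[prefix + "End"]


french = load_delimiters(FR_DELIMITERS)
inner = quote("oui", french, nested=True)
print(quote("Il a dit " + inner, french))  # «Il a dit “oui”»
```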

  25. Collation <collations> • Information in collation directory, not main • XML version of Java/ICU collation syntax • Unicode Collation Algorithm (UCA) is the base http://unicode.org/reports/tr10 • Allows tailoring of the UCA on a per locale basis.

  26. Collation Example From collations/root.xml:
      <collations validSubLocales="ga ga_IE id id_ID ms ms_BN ms_MY nl nl_BE nl_NL pt pt_BR pt_PT">
        <collation type="standard">
          <rules> ...
            <s>ā</s> <t>Ā</t>
            <s>á</s> <t>Á</t>
            <s>ǎ</s> <t>Ǎ</t>
            <s>à</s> <t>À</t>…
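
In the rules above, `<s>` marks a secondary (accent) difference and `<t>` a tertiary (case) difference. The idea of comparing at several levels can be illustrated with a toy sort key. This is not the Unicode Collation Algorithm and does not read the CLDR rules: it is a sketch that approximates the primary level with the lowercased base letter, the secondary level with the combining marks, and the tertiary level with case. The function name `sort_key` is hypothetical.

```python
import unicodedata


def sort_key(word: str):
    """Build a three-level key: base letters, then accents, then case.

    Assumption: canonical decomposition (NFD) separates base letters from
    their accents, which holds for the Latin letters used here.
    """
    primary, secondary, tertiary = [], [], []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch) and primary:
            # Attach the accent to the base letter that precedes it.
            secondary[-1] += (ch,)
        else:
            primary.append(ch.lower())
            secondary.append(())
            tertiary.append(ch.isupper())
    return (primary, secondary, tertiary)


words = ["Á", "b", "a", "á", "A"]
print(sorted(words, key=sort_key))  # ['a', 'A', 'á', 'Á', 'b']
```

Accents only matter when the base letters are equal, and case only matters when the accents are equal too, so every form of "a" sorts before "b", and lower case precedes upper case as in the rule pairs above.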

  28. CLDR Tools • Export • ICU resource bundle generation • POSIX locale generator • OpenOffice.org format export • Survey tool • http://www.unicode.org/cgi-bin/cldr-survey

  29. Vetting Process for Data • Collect from different platforms, experts, submissions: new or revised • References to external sources strongly encouraged • Must be before freeze date for release • Use Survey Tool to Collect Data

  30. Causes of Conflicting Data • Typographical errors • Canda instead of Canada • Regional differences • German spelling is different between countries • Parts of speech • “март 2004” versus “3 марта” when the Russian word for March is used in a date • Context of usage • Normal German sorting versus German phonebook sorting • Standards versus common use • “Republic of Laos” versus “Laos” • Individual preferences • 24 hour time format versus 12 hour time format

  32. Latest Release: CLDR 1.4 • Released: July 17, 2006 • 360 locales: • 121 languages • 142 territories • 25% more data • 17,000 new or modified data items • Over 100 different contributors

  33. Challenges • Complex Formats • Experts knowledgeable both in technology and a specific language • Collation • Exemplar characters • Etc… • Require close interaction of CLDR experts with language experts

  34. Getting Involved • Simplest – anyone! • Use CLDR • Bug report / feature request • More Involved • Vetting, Assessment, Tools, Policies, Decisions, … • Any Unicode member eligible to name representatives including country liaison members

  35. Example Country Process (Finland) • Finnish Ministry of Education made CLDR data a major goal, 2004-06 • Research Institute for the Languages of Finland (“RILF” aka “Kotus”) designated agency • Two official languages (Finnish and Swedish) & four regional / minority languages (three Sámi & Romani as spoken in Finland) to be covered • Over 30 different parties represented: commercial, non-commercial, individuals • Results expected to lead to new/revised national standards

  36. For More Information • Unicode • http://www.unicode.org/ • CLDR • http://www.unicode.org/cldr/ • LDML specification • http://unicode.org/reports/tr35 • lisam@us.ibm.com

