CLDR: The Common Locale Data Repository. Locales for the World. Lisa Moore, George Rhoten, Mark Davis, Steven Loomis
Agenda • Why CLDR? • CLDR data • Tools and vetting • Today and the future LRC – XI The Localisation Factory
Locales – does anything stay the same? "Theatre Center News: The date of the last version of this document was 2003年3月20日. A copy can be obtained for $50,0 or 1.234,57 грн. We would like to acknowledge contributions by the following authors (in alphabetical order): Alaa Ghoneim, Behdad Esfahbod, Ahmed Talaat, Eric Mader, Asmus Freytag, Avery Bishop, and Doug Felt."
Locales – the many differences • Locales specify user preferences • Linguistic and cultural differences • Languages, scripts, writing systems, ordering, directionality, formatting, numbers, sizes • Even in the same locale, interoperability issues across platforms • Global economics has increased the need for greater globalization support in computer systems • Everyone expects more!
Add the Universal Character Encoding • Unicode: Unique character codes for all languages …
The Need for Common Locale Data • Computing environments often contain a variety of operating systems and software. • Historically, research into locale-sensitive data has been done separately by individuals and companies. • Because of political changes, it is easy for locale data to become out of date. • It is difficult to get complete agreement on correctness.
Common Locale Data Project • Began as the Common XML Locale Repository (CXLR), developed by OpenI18N in 2003 • CLDR project began in 2004 • Hosted by the Unicode Consortium: http://www.unicode.org/cldr/ • Goals: common, necessary software locale data for all world languages; collect and maintain locale data; XML format for effective interchange; freely available
CLDR in use (partial list) • Libraries and environments: ICU (International Components for Unicode), JDK (Java Development Kit) • Operating systems: Solaris, AIX, Mac OS X • Applications: OpenOffice.org, Acrobat, ModernBill
Agenda • Why CLDR? • CLDR data • Tools and vetting • Today and the future
What is a Locale? • A locale is an identifier referring to linguistic and cultural preferences, for example en_US, en_GB, ja_JP • These preferences can change over time for cultural and political reasons: the introduction of new currencies such as the euro, or changes to the standard sorting of Spanish • Many of these preferences have varying degrees of standardization, for example 12-hour and 24-hour time formats in the United States • This is a very broad topic
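To make the identifier structure concrete, here is a minimal Python sketch that splits identifiers such as en_US into a language and a territory and lists the files a lookup would consult. The function names `parse_locale` and `fallback_chain` are hypothetical, and the sketch assumes only the simple language_TERRITORY form shown on the slide; script and variant subtags are ignored.

```python
def parse_locale(identifier):
    """Split a simple locale identifier into (language, territory).

    Assumption: only 'll' or 'll_TT' forms are handled; script and
    variant subtags are out of scope for this sketch.
    """
    parts = identifier.replace("-", "_").split("_")
    language = parts[0].lower()
    territory = parts[1].upper() if len(parts) > 1 else None
    return language, territory


def fallback_chain(identifier):
    """List locales from most to least specific, ending at 'root'."""
    language, territory = parse_locale(identifier)
    chain = [language, "root"]
    if territory:
        chain.insert(0, f"{language}_{territory}")
    return chain


print(parse_locale("en_GB"))
print(fallback_chain("ja_JP"))
```

The chain reflects how territory-level data can be layered over language-level data, which in turn sits on top of root.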
Types of Locale Data • Date/time/calendar formats • Number/currency formats • Measurement system • Collation specification: sorting, searching, matching • Translated names for languages, territories, scripts, time zones, currencies, … • Script and characters used by a language
Locale Data Markup Language • Locale data described using XML • CLDR data uses LDML • Structure of CLDR is controlled by the Locale Data Markup Language (LDML) specification: http://unicode.org/reports/tr35
LDML Data Categories <ldml> <identity> <localeDisplayNames> <layout> <characters> <delimiters> <measurement> <dates> <numbers> <posix> <collations>
Names <localeDisplayNames> • Provides translated display names for languages, territories, scripts, variants and keywords used in CLDR. • Most of this information is stored at the language level, since it typically varies by language and not by territory. • An example: ICU Locale Explorer
Names Examples
From ga.xml (Irish):
<localeDisplayNames>
  <languages>
    <language type="aa">Afar</language>
    <language type="ab">Abcáisis</language>
    …
  <scripts>
    <script type="Arab">Araibis</script>
    …
  <territories>
    <territory type="AD">Andóra</territory>
    <territory type="AE">Aontas na nÉimíríochtaí Arabacha</territory>
    …
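A small Python sketch of how a consumer might read such display names with the standard library's `xml.etree.ElementTree`. The `SAMPLE` string is an assumption: it repeats a few of the Irish entries above, with the closing tags that the slide elides filled in so that the fragment parses. The helper name `display_names` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: a closed-off excerpt of ga.xml, reduced to a few entries.
SAMPLE = """<ldml>
  <localeDisplayNames>
    <languages>
      <language type="aa">Afar</language>
      <language type="ab">Abcáisis</language>
    </languages>
    <territories>
      <territory type="AD">Andóra</territory>
    </territories>
  </localeDisplayNames>
</ldml>"""


def display_names(xml_text, group, item):
    """Map each 'type' code to its translated name for one name group."""
    root = ET.fromstring(xml_text)
    path = f"localeDisplayNames/{group}/{item}"
    return {el.get("type"): (el.text or "").strip() for el in root.findall(path)}


languages = display_names(SAMPLE, "languages", "language")
territories = display_names(SAMPLE, "territories", "territory")
print(languages)
print(territories)
```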
Characters <characters> • Allows for creation of exemplar character sets. An exemplar set specifies the set of characters that must be present in order to properly render the language. • The auxiliary exemplar set defines additional characters that may appear in foreign words or phrases. • Lowercase only
Date Formats <dates> • Defines representation of calendars using various calendaring systems (Gregorian, Buddhist, Islamic, Japanese, etc.) • Defines formatting for dates, times, eras and time zones • wide, abbreviated, or narrow • Date and time formats use patterns of letters to define proper formatting • Week information • Relative day/time translations (for example, yesterday, tomorrow, etc.) • An example: ICU Locale Explorer
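The idea of letter patterns can be illustrated with a deliberately small Python sketch. It is not the LDML pattern engine: it assumes only numeric year (y), month (M) and day (d) fields, treats the length of a run of letters as the minimum digit count (with yy meaning a two-digit year), and ignores quoting, names, eras and time fields. The function name `format_date` is hypothetical.

```python
import re
from datetime import date


def format_date(value, pattern):
    """Fill runs of y, M and d in a pattern with numeric date fields."""

    def render(match):
        run = match.group(0)
        letter, count = run[0], len(run)
        if letter == "y":
            # 'yy' is the two-digit year; other lengths pad the full year.
            year = value.year % 100 if count == 2 else value.year
            return str(year).zfill(count)
        if letter == "M":
            return str(value.month).zfill(count)
        return str(value.day).zfill(count)

    # ([yMd])\1* matches one pattern letter followed by repeats of itself.
    return re.sub(r"([yMd])\1*", render, pattern)


day = date(2003, 3, 20)
print(format_date(day, "yyyy年M月d日"))
print(format_date(day, "dd/MM/yy"))
```

The same date value yields differently shaped strings purely by swapping the pattern, which is the property that lets one formatter serve many locales.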
Characters / Dates Examples
From ga.xml (Irish):
<characters>
  <exemplarCharacters>[a á b-e é f-i í j-o ó p-u ú v-z]</exemplarCharacters>
  <exemplarCharacters type="auxiliary">[ḃ ċ ḋ ḟ ġ ṁ ṗ ṡ ṫ]</exemplarCharacters>
</characters>
…
<dayContext type="format">
  <dayWidth type="abbreviated">
    <day type="sun">Domh</day>
    <day type="mon">Luan</day>
    …
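As a rough illustration of how an exemplar set could be used, the Python sketch below expands the Irish set into individual characters and reports letters in a text that the set does not cover. It assumes the simplified syntax visible in the example (space-separated single characters and x-y ranges); the full set syntax used by CLDR is richer. The names `expand_exemplars` and `uncovered` are hypothetical.

```python
def expand_exemplars(pattern):
    """Expand a simple '[a b-e ...]' exemplar pattern into a set of characters."""
    body = pattern.strip()
    if body.startswith("[") and body.endswith("]"):
        body = body[1:-1]
    chars = set()
    for item in body.split():
        if len(item) == 3 and item[1] == "-":
            # A range such as b-e covers every code point from b to e.
            chars.update(chr(cp) for cp in range(ord(item[0]), ord(item[2]) + 1))
        else:
            chars.update(item)
    return chars


def uncovered(text, exemplars):
    """Return the letters of text (lowercased) that are not in the exemplar set."""
    return {ch for ch in text.lower() if ch.isalpha() and ch not in exemplars}


IRISH = expand_exemplars("[a á b-e é f-i í j-o ó p-u ú v-z]")
print(sorted(IRISH))
print(uncovered("Straße", IRISH))
```

Lowercasing the text before the check mirrors the slide's note that exemplar sets are lowercase only.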
Time Zone Names <timeZoneNames> • Based on Olson time zone database • Localized display names for standard, daylight, and generic representations of time zones. • Short and long display names.
Numbers <numbers> • Specifies proper localized formatting of numeric quantities: decimal, scientific, currency, percentages • Includes localized decimal and thousands separators, currency symbols, etc.
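A minimal Python sketch of what localized separators mean in practice. The function `format_decimal` and its parameters are hypothetical; the separators are passed in by the caller here, whereas a real consumer would read them from the locale's number data.

```python
def format_decimal(value, decimal=".", group=",", fraction_digits=2):
    """Format a number, then swap in locale-specific separators."""
    text = f"{value:,.{fraction_digits}f}"  # e.g. '1,234.57'
    # translate() replaces both separators in a single pass, so swapping
    # ',' and '.' with each other cannot clobber one another.
    return text.translate(str.maketrans({",": group, ".": decimal}))


print(format_decimal(1234.57))
print(format_decimal(1234.57, decimal=",", group="."))
print(format_decimal(1234.57, decimal=",", group=" "))
```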
Time Zones / Currencies
From ga.xml (Irish) and root.xml:
<timeZoneNames>
  <zone type="Europe/Dublin">
    <long>
      <standard>Meán-Am Greenwich</standard>
      <daylight>Am Samhraidh na hÉireann</daylight>
    </long>
    …
<numbers>
  <currencies>
    <currency type="EUR">
      <displayName>Euro</displayName>
      <symbol>€</symbol>
      …
Delimiters <delimiters> • Specifies primary and secondary pairs of delimiter characters used to bracket quotations in text
Delimiters Example
From fr.xml (French):
<delimiters>
  <quotationStart>«</quotationStart>
  <quotationEnd>»</quotationEnd>
  <alternateQuotationStart>“</alternateQuotationStart>
  <alternateQuotationEnd>”</alternateQuotationEnd>
</delimiters>
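A short Python sketch of how the four delimiter values might be applied. The dictionary `FRENCH` copies the values from the French example; the helper `quote` and its `nested` flag are hypothetical names, with `nested=True` selecting the alternate (secondary) pair for a quotation inside a quotation.

```python
# Values taken from the French delimiters example; keys mirror the element names.
FRENCH = {
    "quotationStart": "«",
    "quotationEnd": "»",
    "alternateQuotationStart": "“",
    "alternateQuotationEnd": "”",
}


def quote(text, delimiters, nested=False):
    """Wrap text in the primary pair, or the alternate pair when nested."""
    prefix = "alternateQuotation" if nested else "quotation"
    return delimiters[prefix + "Start"] + text + delimiters[prefix + "End"]


inner = quote("oui", FRENCH, nested=True)
print(quote("Elle a dit " + inner, FRENCH))
```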
Collation <collations> • Collation data lives in the collation directory, not in main • XML version of the Java/ICU collation syntax • The Unicode Collation Algorithm (UCA) is the base: http://unicode.org/reports/tr10 • Allows tailoring of the UCA on a per-locale basis.
Collation Example
From collations/root.xml:
<collations validSubLocales="ga ga_IE id id_ID ms ms_BN ms_MY nl nl_BE nl_NL pt pt_BR pt_PT">
  <collation type="standard">
    <rules>
      ...
      <s>ā</s> <t>Ā</t>
      <s>á</s> <t>Á</t>
      <s>ǎ</s> <t>Ǎ</t>
      <s>à</s> <t>À</t>
      …
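To give a feel for what tailoring a sort order means, here is a toy Python sketch. It is not the UCA and does not read LDML rules: it assumes a single comparison level and uses traditional Spanish ordering, in which ch and ll count as separate letters sorted after c and l. The names `TRADITIONAL_SPANISH` and `sort_key` are hypothetical.

```python
# Assumption: one weight per letter, with 'ch' and 'll' treated as letters.
TRADITIONAL_SPANISH = "a b c ch d e f g h i j k l ll m n ñ o p q r s t u v w x y z".split()
WEIGHT = {element: index for index, element in enumerate(TRADITIONAL_SPANISH)}


def sort_key(word):
    """Turn a word into a list of weights, matching two-letter units first."""
    word = word.lower()
    key, i = [], 0
    while i < len(word):
        pair = word[i:i + 2]
        if len(pair) == 2 and pair in WEIGHT:
            key.append(WEIGHT[pair])
            i += 2
        else:
            # Characters outside the alphabet sort after every known letter.
            key.append(WEIGHT.get(word[i], len(WEIGHT)))
            i += 1
    return key


words = ["chico", "cosa", "dama", "luz", "llama"]
print(sorted(words))                # plain code point order
print(sorted(words, key=sort_key))  # tailored order
```

With the tailored key, "cosa" sorts before "chico" and "luz" before "llama", the reverse of plain code point order.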
Agenda • Why CLDR? • CLDR data • Tools and vetting • Today and the future
CLDR Tools • Export: ICU resource bundle generation, POSIX locale generator, OpenOffice.org format export • Survey tool: http://www.unicode.org/cgi-bin/cldr-survey
Vetting Process for Data • Collect new or revised data from different platforms, experts, and submissions • References to external sources strongly encouraged • Submissions must arrive before the freeze date for a release • Use the Survey Tool to collect data
Causes of Conflicting Data • Typographical errors: "Canda" instead of "Canada" • Regional differences: German spelling differs between countries • Parts of speech: "март 2004" versus "3 марта" when the Russian word for March is used in a date • Context of usage: normal German sorting versus German phonebook sorting • Standards versus common use: "Republic of Laos" versus "Laos" • Individual preferences: 24-hour time format versus 12-hour time format
Agenda • Why CLDR? • CLDR data • Tools and vetting • Today and the future
Latest Release: CLDR 1.4 • Released: July 17, 2006 • 360 locales: 121 languages, 142 territories • 25% more data • 17,000 new or modified data items • Over 100 different contributors
Challenges • Complex Formats • Experts knowledgeable both in technology and a specific language • Collation • Exemplar characters • Etc… • Require close interaction of CLDR experts with language experts
Getting Involved • Simplest (anyone!): use CLDR, file a bug report or feature request • More involved: vetting, assessment, tools, policies, decisions, … • Any Unicode member, including country liaison members, is eligible to name representatives
Example Country Process (Finland) • Finnish Ministry of Education made CLDR data a major goal, 2004-06 • Research Institute for the Languages of Finland ("RILF", also known as "Kotus") is the designated agency • Two official languages (Finnish and Swedish) and four regional / minority languages (three Sámi languages and Romani as spoken in Finland) to be covered • Over 30 different parties represented: commercial, non-commercial, individuals • Results expected to lead to new/revised national standards
For More Information • Unicode: http://www.unicode.org/ • CLDR: http://www.unicode.org/cldr/ • LDML specification: http://unicode.org/reports/tr35 • lisam@us.ibm.com