1. CLDR: The Common Locale Data Repository
Locales for the World
Lisa Moore
George Rhoten Mark Davis Steven Loomis
2. LRC – XI The Localisation Factory
Agenda
Why CLDR?
CLDR data
Tools and vetting
Today and the future
3. Agenda
Why CLDR?
CLDR data
Tools and vetting
Today and the future
4. Locales – does anything stay the same?
"Theatre Center News: The date of the last version of this document was 2003?3?20?. A copy can be obtained for $50,0 or 1.234,57 ???. We would like to acknowledge contributions by the following authors (in alphabetical order): Alaa Ghoneim, Behdad Esfahbod, Ahmed Talaat, Eric Mader, Asmus Freytag, Avery Bishop, and Doug Felt."
5. Locales – the many differences
Locales specify user preferences
Linguistic and cultural differences
Languages, scripts, writing systems, ordering, directionality, formatting, numbers, sizes
Even in the same locale, interoperability issues across platforms
Global economics has increased the need for greater globalization support in computer systems
Everyone expects more!
6. Add the Universal Character Encoding
Unicode: Unique character codes for all languages
7. The Need for Common Locale Data
Computing environments often contain a variety of operating systems and software.
Historically locale sensitive data research has been done by individuals and/or companies.
Because of political changes, it is easy for locale data to become out of date.
It is difficult to get complete agreement on correctness.
8. Common Locale Data Project
Began as Common XML Locale Repository (CXLR) developed by OpenI18N in 2003
CLDR project began in 2004
Hosted by Unicode Consortium
http://www.unicode.org/cldr/
Goals:
Common, necessary software locale data for all world languages
Collect and maintain locale data
XML format for effective interchange
Freely available
The Common Locale Data Repository (CLDR) was developed in response to the need for standardized locales based on Unicode. CLDR provides key building blocks for software to support the world’s languages. This data is used by a wide spectrum of companies for their software internationalization and localization – adapting software to the conventions of different languages and locations for such common tasks as formatting dates, times, time zones, numbers, and currency values; sorting text; and choosing languages or countries by name, among others.
The CLDR project collects and maintains locale data and uses the Locale Data Markup Language (LDML) to describe the data.
9. CLDR in use (partial list)
Libraries and Environments
ICU – International Components for Unicode
JDK – Java Development Kit
Operating Systems
Solaris
AIX
MacOS X
Applications
OpenOffice.org
Acrobat
ModernBill
10. Agenda
Why CLDR?
CLDR data
Tools and vetting
The future
11. What is a Locale?
A locale is an identifier referring to linguistic and cultural preferences
en_US, en_GB, ja_JP
These preferences can change over time due to cultural and political reasons
Introduction of new currencies, like the Euro
Standard sorting of Spanish changes
Many of these preferences have varying degrees of standardization
12 and 24 hour format in the United States
This is a very broad topic
A locale is a string identifier that refers to specific linguistic and cultural preferences. These preferences can include date/time formatting, number formatting, the spelling of certain names, and many other items.
These preferences can change over time for cultural and political reasons. For example, modern Spanish sorts differently from the older Spanish of the 1990s. In another example, some countries mandate how specific regions are referred to (this can happen when ownership of a region is in dispute).
Of course, these preferences are not absolute. For example, most people in the United States use 12-hour time, but some people in the US use 24-hour time. Some languages, like French and Japanese, have published standards for how to sort them. Other languages may not have had enough exposure to other cultures to have names for certain places or concepts.
There are many things that locale data could cover, including industry-specific topics like shoe size. CLDR limits its scope to a few specific topics.
Scope of data limited to common system applications
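To make the idea of a locale as a string identifier concrete, here is a minimal sketch that splits identifiers such as en_US into a language and a territory. The `LocaleId` type and `parse_locale` function are hypothetical names chosen for illustration, and only the simple language_TERRITORY shape shown on this slide is handled.

```python
from typing import NamedTuple, Optional


class LocaleId(NamedTuple):
    language: str
    territory: Optional[str]


def parse_locale(identifier: str) -> LocaleId:
    """Split an identifier such as "en_US" into language and territory.

    Simplification: only the language_TERRITORY shape is handled; script
    and variant parts of longer identifiers are ignored.
    """
    parts = identifier.replace("-", "_").split("_")
    language = parts[0].lower()
    territory = parts[1].upper() if len(parts) > 1 and parts[1] else None
    return LocaleId(language, territory)


for ident in ("en_US", "en_GB", "ja_JP", "ga"):
    print(ident, "->", parse_locale(ident))
```

Software would typically use the two parts as keys when looking up the preferences described above.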
12. Types of Locale Data
Dates/time/calendar formats
Number/currency formats
Measurement system
Collation specification
Sorting
Searching
Matching
Translated names for language, territory, script, timezones, currencies,…
Script and characters used by a language
This is a list of some of the topics for which CLDR provides translations and formats.
13. Locale Data Markup Language
Locale data described using XML
CLDR data uses LDML
Structure of CLDR controlled by Locale Data Markup Language (LDML) specification
http://unicode.org/reports/tr35
14. LDML Data Categories
<ldml>
<identity>
<localeDisplayNames>
<layout>
<characters>
<delimiters>
<measurement>
<dates>
<numbers>
<posix>
<collations>
15. Names
<localeDisplayNames>
Provides translated display names for languages, territories, scripts, variants and keywords used in CLDR.
Most of this information is at the language level, since it typically does not vary by territory, only language.
An example: ICU Locale Explorer
16. Names Examples
From ga.xml (Irish):
<localeDisplayNames>
<languages>
<language type="aa">Afar</language>
<language type="ab">Abcáisis</language>…
<scripts>
<script type="Arab">Araibis</script>…
<territories>
<territory type="AD">Andóra</territory>
<territory type="AE">Aontas na nÉimíríochtaí Arabacha</territory>…
Here is an example of what CLDR looks like. In this snippet of CLDR data, translations are provided for some language, script and country display names. The keys come from other standards, such as ISO 639 and ISO 3166. As you can see, CLDR is written in XML. This data can be used for web site preferences.
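As a rough illustration of how software might read such display names, the sketch below parses a small well-formed fragment modelled on the ga.xml excerpt using Python's standard `xml.etree.ElementTree`. The fragment is abbreviated, the closing tags are filled in so that it parses, and `display_names` is a hypothetical helper rather than part of any CLDR tool.

```python
import xml.etree.ElementTree as ET

# A well-formed fragment modelled on the ga.xml excerpt; the real file
# contains many more entries (elided with "…" on the slide).
GA_XML = """
<ldml>
  <localeDisplayNames>
    <languages>
      <language type="aa">Afar</language>
      <language type="ab">Abcáisis</language>
    </languages>
    <scripts>
      <script type="Arab">Araibis</script>
    </scripts>
    <territories>
      <territory type="AD">Andóra</territory>
      <territory type="AE">Aontas na nÉimíríochtaí Arabacha</territory>
    </territories>
  </localeDisplayNames>
</ldml>
"""


def display_names(root, group, item):
    """Return {code: name} for one group, e.g. ("territories", "territory")."""
    path = f"localeDisplayNames/{group}/{item}"
    return {el.get("type"): (el.text or "").strip() for el in root.findall(path)}


root = ET.fromstring(GA_XML)
territories = display_names(root, "territories", "territory")
print(territories["AE"])  # Aontas na nÉimíríochtaí Arabacha
```

A web site could use a table like `territories` to show a country picker in the user's own language.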
17. Characters
<characters>
Allows for creation of exemplar character sets. An exemplar set specifies the set of characters that must be present in order to properly render the language.
Auxiliary exemplar set defines additional characters that may appear in foreign words or phrases.
Lower case only
18. Date Formats
<dates>
Defines representation of calendars using various calendaring systems (Gregorian, Buddhist, Islamic, Japanese, etc.)
Defines formatting for dates, times, eras and time zones
wide, abbreviated, or narrow
Date and time formats use patterns of letters to define proper formatting
Week information
Relative day/time translations (for example, yesterday, tomorrow, etc. )
An example: ICU Locale Explorer
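The idea of date patterns built from letters can be sketched with a very small formatter. This is a simplification and not the LDML algorithm: it supports only the letters y, M, d and E, ignores quoting rules, and the pattern "EEE d/MM/yyyy" is a made-up example rather than the actual Irish pattern. The two abbreviated day names are the ones shown in the ga.xml excerpt in this deck; `format_date` is a hypothetical helper.

```python
import re
from datetime import date

# Only the two abbreviated day names quoted in this deck are filled in.
ABBREVIATED_DAYS = {"sun": "Domh", "mon": "Luan"}
DAY_KEYS = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]  # date.weekday() order


def format_date(value, pattern, day_names):
    """Replace each run of a pattern letter with the matching date field."""

    def render(match):
        field = match.group(0)
        letter, width = field[0], len(field)
        if letter == "y":
            return f"{value.year:0{width}d}"
        if letter == "M":
            return f"{value.month:0{width}d}"
        if letter == "d":
            return f"{value.day:0{width}d}"
        if letter == "E":
            return day_names[DAY_KEYS[value.weekday()]]
        return field  # unknown letters are passed through unchanged

    # A field is one letter repeated one or more times, e.g. "MM" or "yyyy".
    return re.sub(r"([A-Za-z])\1*", render, pattern)


print(format_date(date(2006, 7, 17), "EEE d/MM/yyyy", ABBREVIATED_DAYS))  # Luan 17/07/2006
```

The length of each letter run controls the width of the field, which is how a single pattern can express padded and unpadded forms.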
19. Characters / Dates Examples
From ga.xml (Irish):
<characters>
<exemplarCharacters> [a á b-e é f-i í j-o ó p-u ú v-z]
</exemplarCharacters>
<exemplarCharacters type="auxiliary"> [ḃ ċ ḋ ḟ ġ ṁ ṗ ṡ ṫ] </exemplarCharacters>
</characters>…
<dayContext type="format">
<dayWidth type="abbreviated">
<day type="sun">Domh</day>
<day type="mon">Luan </day>…
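One way to use an exemplar set is to check whether a piece of text stays within the characters a language normally needs. The sketch below expands the main Irish exemplar set from the excerpt above and reports letters that fall outside it. The parser is deliberately minimal (space-separated characters and simple x-y ranges only), and `parse_exemplars` and `uncovered` are hypothetical helper names.

```python
def parse_exemplars(spec):
    """Expand a simple exemplar set like "[a á b-e]" into a set of characters.

    Simplification: handles only space-separated single characters and
    "x-y" ranges, which is all the example on the slide uses.
    """
    chars = set()
    for token in spec.strip().strip("[]").split():
        if len(token) == 3 and token[1] == "-":
            chars.update(chr(cp) for cp in range(ord(token[0]), ord(token[2]) + 1))
        else:
            chars.update(token)
    return chars


def uncovered(text, exemplars):
    """Return the letters in text that fall outside the exemplar set."""
    return {ch for ch in text.lower() if ch.isalpha() and ch not in exemplars}


IRISH_MAIN = parse_exemplars("[a á b-e é f-i í j-o ó p-u ú v-z]")
print(uncovered("Aontas na nÉimíríochtaí Arabacha", IRISH_MAIN))  # set()
print(uncovered("Ærøskøbing", IRISH_MAIN))  # letters outside the Irish set
```

Text is lower-cased before the check because the exemplar sets are lower case only.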
20. Time Zone Names
<timeZoneNames>
Based on Olson time zone database
Localized display names for standard, daylight, and generic representations of time zones.
Short and long display names.
21. Numbers
<numbers>
Specifies proper localized formatting of numeric quantities
Decimal
Scientific
Currency
Percentages
Includes localized decimal, thousands separators, currency symbols, etc.
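The effect of localized decimal and thousands separators can be shown with a short sketch. The two symbol sets below are hypothetical stand-ins for the values that a locale's number data would supply, and `format_decimal` is an illustrative helper, not a CLDR or ICU API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NumberSymbols:
    decimal: str
    group: str


# Hypothetical symbol sets for illustration; real values would come from
# the number data of each locale.
DOT_DECIMAL = NumberSymbols(decimal=".", group=",")
COMMA_DECIMAL = NumberSymbols(decimal=",", group=".")


def format_decimal(value, symbols, fraction_digits=2):
    """Format with ASCII separators first, then swap in the localized ones."""
    ascii_text = f"{value:,.{fraction_digits}f}"
    integer_part, _, fraction_part = ascii_text.partition(".")
    localized = integer_part.replace(",", symbols.group)
    if fraction_part:
        localized += symbols.decimal + fraction_part
    return localized


print(format_decimal(1234.57, DOT_DECIMAL))    # 1,234.57
print(format_decimal(1234.57, COMMA_DECIMAL))  # 1.234,57
```

Splitting on the ASCII decimal point before substituting avoids the classic bug of replacing one separator with the other and then replacing it back.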
22. Time Zones / Currencies
From ga.xml (Irish) and root.xml:
<timeZoneNames>
<zone type="Europe/Dublin">
<long>
<standard>Meán-Am Greenwich</standard>
<daylight>Am Samhraidh na hÉireann </daylight>
</long>…
<numbers>
<currencies>
<currency type="EUR">
<displayName>Euro</displayName>
<symbol>€</symbol>…
23. Delimiters
<delimiters>
Specifies a primary and a secondary pair of delimiter characters to be used for bracketing quotations in text
24. Delimiters Example
From fr.xml (French):
<delimiters>
<quotationStart>«</quotationStart>
<quotationEnd>»</quotationEnd>
<alternateQuotationStart>“</alternateQuotationStart>
<alternateQuotationEnd>”</alternateQuotationEnd>
</delimiters>
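A small sketch of how the primary and alternate delimiters might be applied: the outer quotation uses the primary marks and a quotation nested inside it uses the alternate marks. The values are copied from the fr.xml excerpt above; the `Delimiters` class and `quote` function are hypothetical names, and French spacing conventions inside the marks are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Delimiters:
    quotation_start: str
    quotation_end: str
    alternate_start: str
    alternate_end: str


# Values copied from the fr.xml excerpt above.
FRENCH = Delimiters("«", "»", "“", "”")


def quote(text, delimiters, nested=False):
    """Wrap text in primary marks, or in the alternate marks when nested."""
    if nested:
        return f"{delimiters.alternate_start}{text}{delimiters.alternate_end}"
    return f"{delimiters.quotation_start}{text}{delimiters.quotation_end}"


inner = quote("CLDR", FRENCH, nested=True)
print(quote(f"Le projet {inner}", FRENCH))  # «Le projet “CLDR”»
```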
25. Collation
<collations>
Information in collation directory, not main
XML version of Java/ICU collation syntax
Unicode collation algorithm is the base http://unicode.org/reports/tr10
Allows tailoring of the UCA on a per locale basis.
26. Collation Example
From collations/root.xml:
<collations validSubLocales="ga ga_IE id id_ID ms ms_BN ms_MY nl nl_BE nl_NL pt pt_BR pt_PT">
<collation type="standard">
<rules>
...
<s>a</s>
<t>A</t>
<s>á</s>
<t>Á</t>
<s>a</s>
<t>A</t>
<s>ŕ</s>
<t>Ŕ</t>…
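The ordering expressed by rules like these can be approximated with a multi-level sort key: base letters are compared first, then accents, then case. The sketch below is only an imitation of that layered idea using Python's `unicodedata` module. It is not the Unicode Collation Algorithm and does not read CLDR rules; `sort_key` is a hypothetical helper.

```python
import unicodedata


def sort_key(text):
    """Three-level key: base letters, then accents, then case.

    This mimics the layered comparison only; it is not the UCA.
    """
    decomposed = unicodedata.normalize("NFD", text)
    base = [ch for ch in decomposed if not unicodedata.combining(ch)]
    marks = [ch for ch in decomposed if unicodedata.combining(ch)]
    primary = "".join(ch.lower() for ch in base)
    secondary = tuple(ord(ch) for ch in marks)
    tertiary = tuple(ch.isupper() for ch in base)
    return (primary, secondary, tertiary)


words = ["Á", "a", "A", "á", "b"]
print(sorted(words, key=sort_key))  # ['a', 'A', 'á', 'Á', 'b']
```

A case difference only matters when letters and accents are equal, and an accent difference only matters when the base letters are equal, which is why "Á" still sorts before "b".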
27. Agenda
Why CLDR?
CLDR data
Tools and vetting
Today and the future
28. CLDR Tools
Export
ICU resource bundle generation
POSIX locale generator
OpenOffice.org format export
Survey tool
http://www.unicode.org/cgi-bin/cldr-survey
29. Vetting Process for Data
Collect from different platforms, experts, submissions: new or revised
References to external sources strongly encouraged
Must be before freeze date for release
Use Survey Tool to collect data
Will show a demo of Survey Tool
30. Causes of Conflicting Data
Typographical errors
Canda instead of Canada
Regional differences
German spelling is different between countries
Parts of speech
“Март 2004” versus “3 марта” when the Russian word for March is used in a date
Context of usage
Normal German sorting versus German phonebook sorting
Standards versus common use
“Republic of Laos” versus “Laos”
Individual preferences
24 hour time format versus 12 hour time format
Now we will look at some examples of conflicting data. These are items that turn up when data comparisons are made. Not everything is an either-or case. Sometimes we find that the data needs to be restructured to accommodate both the old and the new data, because both could be correct.
Typographical errors: Sometimes this is due to data being typed incorrectly. Other times it can be due to using one locale’s translations as a template for another locale’s data.
Regional differences: Regional and sub-regional differences may require keeping both sets of data in different locales rather than choosing one over the other. For example, German as written in Germany and in Switzerland frequently differs in spelling, and American English sometimes differs from British English.
Context of usage: There is more than one way to sort German text. There is normal German sorting, and there is German phonebook sorting. For example, “öf” and “of” sort differently under normal German sorting and German phonebook sorting.
Parts of speech: Some languages distinguish between the way month names are written when cited independently and when written as part of a date. For example, “March 2004” at the heading of a calendar would be written as just the name March, but the date “3rd March, 2004” would require a different form meaning “of March”. CLDR accommodates such languages using a type value of “standalone” or “format”, respectively.
Standards vs. common use: CLDR uses the commonly used translation or format as the default. However, alternates are allowed in CLDR. Sometimes there is more than one right answer.
Misunderstanding: Sometimes translators don’t have enough knowledge about how CLDR works. Sometimes a translator will try to translate both the format and the characters of a date format instead of just the format. The localized characters of a date format are in a separate field of CLDR.
Uncommon cases: There are some items and concepts in CLDR that are not commonly known by all translators. For example, how does a translator translate the word “Interlingua” (a language) when the translator has never heard of the Interlingua language? Sometimes translators guess, and these guesses will appear during the vetting process.
Individual preferences: People have different preferences, and these can vary between translators. For example, the US military usually uses 24-hour time, but the rest of the United States uses 12-hour time.
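The “standalone” versus “format” distinction from the parts-of-speech case can be sketched as a lookup keyed by context. The table below fills in only March, using the Russian forms март (on its own) and марта (inside a date). The dictionary layout and the `month_name` helper are assumptions made for illustration, not the actual CLDR data structure.

```python
# Month names keyed by context type, following the "format" / "standalone"
# distinction described above. Only March is filled in, as an illustration.
RUSSIAN_MONTHS = {
    "format": {3: "марта"},     # used inside a full date: "of March"
    "standalone": {3: "март"},  # used on its own, e.g. a calendar heading
}


def month_name(names, month, in_date):
    """Pick the month form that fits the context it will appear in."""
    context = "format" if in_date else "standalone"
    return names[context][month]


print(f"3 {month_name(RUSSIAN_MONTHS, 3, in_date=True)} 2004")  # 3 марта 2004
print(f"{month_name(RUSSIAN_MONTHS, 3, in_date=False)} 2004")   # март 2004
```

Keeping both forms side by side is an example of restructuring the data so that two apparently conflicting submissions can both be correct.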
31. Agenda
Why CLDR?
CLDR data
Tools and vetting
Today and the future
32. Latest Release: CLDR 1.4
Released: July 17, 2006
360 locales:
121 languages
142 territories
25% more data
17,000 new or modified data items
Over 100 different contributors
Here is a summary of the latest CLDR release.
Complete POSIX-format data with POSIX conversion tool
More timezone translations
Data for UN M.49 regions, including continents and regions
Addition of ISO 4217 currency code change overs
Additional number and data tests to verify CLDR implementations
Mappings from language to script and territory
Various other fixes, additions, and extensions
Survey tool for improved collection of data
(read only to non-members)
33. Challenges
Complex Formats
Experts knowledgeable both in technology and a specific language
Collation
Exemplar characters
Etc…
Require close interaction of CLDR experts with language experts
There are some challenges for creating data for CLDR. Some of the information can be complex. Some items in CLDR have a very specific purpose and meaning, but a language expert may be unfamiliar with these purposes and meanings. Sometimes close interaction between experts can be difficult over the phone or face to face. Interacting over e-mail is easier.
34. Getting Involved
Simplest – anyone!
Use CLDR
Bug report / feature request
More Involved
Vetting, Assessment, Tools, Policies, Decisions, …
Any Unicode member eligible to name representatives including country liaison members
Who can participate in CLDR? Anyone can get involved! It can be as simple as suggesting a fix for a misspelled translation, or as big as submitting data for a whole new locale. We also welcome vetters who can verify that data is correct, tool writers, and many other people interested in the topic of locale data.
When submitting data to the CLDR project, references to standards, dictionaries or actual examples of everyday use frequently help to get the locale data vetted correctly. Please see the CLDR project web site for how to submit locale data and how to participate in the project.
Designed for most effective participation from people around the world
Meetings
By phone, never face to face
Short, frequent
Allows preparation between meetings
Resolves conflicts and new feature requests
Written
Email
Bug database submissions
35. Example Country Process (Finland)
Finnish Ministry of Education made CLDR data a major goal, 2004-06
Research Institute for the Languages of Finland (“RILF” aka “Kotus”) designated agency
Two official languages (Finnish and Swedish) & four regional / minority languages (three Sámi & Romani as spoken in Finland) to be covered
Over 30 different parties represented: commercial, non-commercial, individuals
Results expected to lead to new/revised national standards
36. For More Information
Unicode
http://www.unicode.org/
CLDR
http://www.unicode.org/cldr/
LDML specification
http://unicode.org/reports/tr35
lisam@us.ibm.com