Internationalization Using Locales • Achim Ruopp
Agenda • Working with multilingual data • Language and locale identifiers • Locale Data • Frameworks for locale support • Ideas/discussion how this could be used in compling
Not about character encoding • Read Jeremy’s slides from last quarter • http://students.washington.edu/jgk/talks/char-enc/char-encodings.pdf • Use Unicode wherever possible
Internationalization: More than Encoding Text • Where are the word breaks? • คลิกปุ่มเมาส์ขวา • Your balance is $1234.56... I think. • How do I sort these words in French? • cote (dimension) • côte (coast) • coté (with dimensions) • côté (side) • How do I uppercase this word in Turkish? • istiyorum - İstiyorum • How do I transcribe this text into Latin characters? • 인수문제를 - in'su'mun'je'reul'
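The Turkish uppercasing question can be made concrete with a small sketch. Python's built-in `str.upper()` is locale-independent, so it maps `i` to a dotless `I`, which is wrong for Turkish. The helpers `upper_tr` and `capitalize_tr` below are hypothetical names introduced only for this illustration; they cover just the dotted/dotless i pair and are not a substitute for a locale-aware library.

```python
def upper_tr(text: str) -> str:
    """Uppercase text using the Turkish dotted/dotless i rules (sketch only)."""
    # Turkish pairs i with İ (U+0130) and ı (U+0131) with I,
    # so both are mapped by hand before the generic uppercase step.
    return text.replace("i", "\u0130").replace("\u0131", "I").upper()


def capitalize_tr(text: str) -> str:
    """Uppercase only the first letter, as in the slide's example."""
    return upper_tr(text[:1]) + text[1:]


print("istiyorum".upper())         # ISTIYORUM  (locale-independent default)
print(upper_tr("istiyorum"))       # İSTİYORUM
print(capitalize_tr("istiyorum"))  # İstiyorum
```

The point of the sketch is that case mapping depends on the language of the text, not only on the characters in it.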
Cultural Conventions • What does this date stand for? • 3/8/2006 • What is the currency symbol for Hungary? • Linguistic characteristics of a language plus cultural conventions: together they make up a locale
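The date question can be illustrated with a minimal sketch. The same string yields two different dates depending on which convention is assumed. The `SHORT_DATE_PATTERNS` table and the `parse_short_date` helper are hypothetical, written only for this example; a real application would take the patterns from locale data rather than hard-coding them.

```python
from datetime import date, datetime

# Hypothetical, hand-written table for illustration only.
SHORT_DATE_PATTERNS = {
    "en-US": "%m/%d/%Y",  # month/day/year
    "en-GB": "%d/%m/%Y",  # day/month/year
}


def parse_short_date(text: str, locale_id: str) -> date:
    """Parse a short numeric date using the pattern assumed for locale_id."""
    return datetime.strptime(text, SHORT_DATE_PATTERNS[locale_id]).date()


print(parse_short_date("3/8/2006", "en-US"))  # 2006-03-08
print(parse_short_date("3/8/2006", "en-GB"))  # 2006-08-03
```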
Internet Language Tags • Used today: RFC 3066 (successor to RFC 1766) • Generative: ISO 639-1/2 language tag[-ISO 3166 country tag] • e.g. fr, en-US, ale-CA • Registered with IANA • e.g. no-nyn, zh-Hant • Exceptions • x-… • Several problems • Dependency on ISO standards • No generative options for dialects etc. • RFC3066bis should solve this
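As a rough sketch of the generative scheme above, the following splits an RFC 3066 style tag into its subtags and reads a two-letter second subtag as a country code. `LanguageTag` and `parse_tag` are hypothetical names for this illustration. The code checks only the shape of the tag; it does not validate subtags against the ISO code lists or the IANA registry.

```python
import re
from typing import NamedTuple, Optional, Tuple

# Shape of an RFC 3066 tag: a primary subtag of 1-8 letters, followed by
# any number of subtags of 1-8 letters or digits, separated by hyphens.
_TAG = re.compile(r"^[A-Za-z]{1,8}(-[A-Za-z0-9]{1,8})*$")


class LanguageTag(NamedTuple):
    language: str              # primary subtag, lowercased
    country: Optional[str]     # two-letter second subtag, uppercased
    other: Tuple[str, ...]     # any remaining subtags, unchanged
    kind: str                  # how the primary subtag is interpreted


def parse_tag(tag: str) -> LanguageTag:
    if not _TAG.match(tag):
        raise ValueError(f"not a well-formed language tag: {tag!r}")
    primary, *rest = tag.split("-")
    primary = primary.lower()
    if primary == "x":
        kind = "private use"
    elif primary == "i":
        kind = "IANA-registered"
    elif len(primary) == 2:
        kind = "ISO 639-1"
    elif len(primary) == 3:
        kind = "ISO 639-2"
    else:
        kind = "other"
    country = None
    # A two-letter second subtag is read as an ISO 3166 country code.
    if kind != "private use" and rest and len(rest[0]) == 2 and rest[0].isalpha():
        country = rest[0].upper()
        rest = rest[1:]
    return LanguageTag(primary, country, tuple(rest), kind)


for example in ["fr", "en-US", "ale-CA", "zh-Hant", "x-sil-swg"]:
    print(example, parse_tag(example))
```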
SIL Ethnologue • Cataloging all of the world’s 6,912 known living languages • http://www.ethnologue.com/ • Uses ISO/DIS 639-3 3-letter codes • E.g. Swabian dialect: x-sil-swg • Hope for consolidation with RFC3066 or successor once 639-3 becomes full standard • Not so well supported in programming frameworks
Types of Locale Data • Dates/time formats • Number/currency formats • Collation Specification • For sorting and comparison • Translated names for language, region, script, timezones, currencies,… • Script and characters used by a language • Measurement System • Paper sizes • …
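To show why a collation specification is locale data rather than a property of the characters, here is a minimal sketch of the accent ordering from the French word list earlier in the deck, where accent differences are compared starting from the end of the word. `french_sort_key` is a hypothetical helper for illustration; it ranks accents by the code point of their combining marks, which is a simplification, and a real collator handles far more than this.

```python
import unicodedata


def french_sort_key(word: str):
    """Sort key: base letters first, then accents compared from the word's end."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    base = []     # letters with accents stripped
    accents = []  # combining marks attached to each base letter
    for ch in decomposed:
        if unicodedata.combining(ch) and accents:
            accents[-1] += ch
        else:
            base.append(ch)
            accents.append("")
    # Reversing the accent list makes the last accent in the word decide first.
    return ("".join(base), tuple(reversed(accents)))


words = ["côté", "coté", "côte", "cote"]
print(sorted(words))                       # code point order
print(sorted(words, key=french_sort_key))  # order shown on the slide
```

Plain code point sorting puts coté before côte, while the key above reproduces the slide's order of cote, côte, coté, côté.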
Common Locale Data Repository • “The purpose of the Common Locale Data Repository project is to provide a general XML format for the exchange of locale information for use in application and system development, and to gather, store, and make available a common set of locale data generated in that format.” • http://www.unicode.org/cldr/
Common Locale Data Repository • Collection/vetting process • Contributors add/modify data • Reviewed by committee • Accessible over the web • Locale Data Markup Language XML format • E.g. http://unicode.org/cldr/data/common/main/fr.xml
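Because LDML is plain XML, locale data can be read with any XML parser. The sketch below uses a small hand-written fragment in the style of an LDML file; it is not copied from the CLDR data, and the `display_names` helper is a hypothetical name for this illustration.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment in the style of an LDML file (illustration only).
SAMPLE_LDML = """\
<ldml>
  <identity>
    <language type="fr"/>
  </identity>
  <localeDisplayNames>
    <languages>
      <language type="de">allemand</language>
      <language type="en">anglais</language>
    </languages>
    <territories>
      <territory type="HU">Hongrie</territory>
    </territories>
  </localeDisplayNames>
</ldml>
"""


def display_names(ldml_text: str, group: str, item: str) -> dict:
    """Return {code: translated name} for one group of display names."""
    root = ET.fromstring(ldml_text)
    path = f"localeDisplayNames/{group}/{item}"
    return {el.get("type"): el.text for el in root.findall(path)}


root = ET.fromstring(SAMPLE_LDML)
print(root.find("identity/language").get("type"))               # fr
print(display_names(SAMPLE_LDML, "languages", "language"))
print(display_names(SAMPLE_LDML, "territories", "territory"))
```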
Frameworks: POSIX Locale • Standard C/C++ library • LC_COLLATE – sorting/comparison • LC_CTYPE – behavior of character handling • LC_MONETARY – monetary formatting • LC_NUMERIC – numeric formatting • LC_TIME – date/time formatting • Also used by command-line tools on Un*x systems • Results can be platform-dependent • Stable, but feature set stuck in the 1980s
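Python's standard `locale` module wraps the C library's locale support, so it can serve as a quick sketch of the category model. The `format_number` helper is a hypothetical name for this illustration. It switches only `LC_NUMERIC`, falls back to the "C" locale if the requested one is not installed, and restores the previous setting afterwards. Which locale names exist, and what they produce, depends on the platform.

```python
import locale


def format_number(value: float, locale_name: str) -> str:
    """Format value with two decimals under the given LC_NUMERIC locale."""
    previous = locale.setlocale(locale.LC_NUMERIC)  # query the current setting
    try:
        try:
            locale.setlocale(locale.LC_NUMERIC, locale_name)
        except locale.Error:
            # Requested locale is not installed on this system.
            locale.setlocale(locale.LC_NUMERIC, "C")
        return locale.format_string("%.2f", value, grouping=True)
    finally:
        locale.setlocale(locale.LC_NUMERIC, previous)


print(format_number(1234.56, "C"))            # 1234.56
print(format_number(1234.56, "de_DE.UTF-8"))  # depends on installed locales
```

Note that `setlocale` changes process-wide state, which is one reason this model is awkward in multithreaded or multilingual programs.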
Frameworks: ICU Library • IBM Open Source project • Developed originally for the Taligent OS project in the late 80s/early 90s • Java and C++ APIs • Extensive locale data and APIs to use it • http://www.icu-project.org/cgi-bin/locexp • Also includes localization support • Everybody (Mac OS X, Java, DB2, Mathworks …) is using it • But …
Frameworks: Microsoft • Windows NLS API • Microsoft .NET Framework System.Globalization namespace • Similar set of data to ICU • Vetted by subsidiaries • APIs accessible from all MS programming languages • Localization support in different API
Microsoft demos • Culture Explorer • Microsoft Transliteration Utility
Extensibility • What if I don’t find the locale I need? • What if I need to modify some of the data? • ICU • Can create new locales • Microsoft • .NET Framework v2.0: custom cultures • Windows Vista: custom locales • LDML can be interchange format
Usages for Computational Linguistics • Up to the imagination • Transliteration use in MT • Named Entity Recognition • … • suggestions? • Most importantly: Do not reinvent the wheel! • Check if API or data you need is available • If possible write code in a language/locale-independent fashion
References • RFC3066bis • http://www.inter-locale.com/ID/why-rfc3066bis.html • Ethnologue • http://www.ethnologue.com/ • Common Locale Data Repository • http://www.unicode.org/cldr/ • POSIX Locale • http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap07.html • ICU • http://icu.sourceforge.net/ • Microsoft • http://www.microsoft.com/globaldev/ • UNGEGN Working Group on Romanization Systems • http://www.eki.ee/wgrs/