Unicode for Under Resourced Languages. Daniel Yacob The Ge’ez Frontier Foundation. SALTMIL 5: Genoa, Italy 2006. Overview. What is “Unicode”? More than Just Encoded Letters! Working with Unicode How Unicode can help you. Resources and how to apply them. Working for Unicode
Unicode for Under Resourced Languages Daniel Yacob The Ge’ez Frontier Foundation SALTMIL 5: Genoa, Italy 2006
Overview • What is “Unicode”? • More than Just Encoded Letters! • Working with Unicode • How Unicode can help you. • Resources and how to apply them. • Working for Unicode • How you can help Unicode. • How Unicode can help your U-RL.
My Background • Started Ethiopic software work in 1993 • transliterator, keyboard, fonts • Amharic Computational Linguistics in 1994 • “Extended Ethiopic” Unicode Standardization 1995-2004 • Corpus Collection 1997 – Present • Began Using Unicode in 1995 for Ethiopic • but no Unicode standard existed until 2000!
My Background • Little or no Unicode based resources in 1993-1997 • Today there is almost always an OpenSource project that you can start with and extend. • Minimize the time and labour you put into developing basic resources. • Avoid the maintenance trap. • We will assume the worst case scenario • You work on a language, using a script, with no pre-existing software resources at all.
What Unicode is Unicode … • is a consortium • is a process • is a community • is a conference • is a database • is a standard • is a collection of standards
What Unicode is not Unicode … • is not a font • is not a keyboard system • is not a transliteration system • is not the ISO • is not perfect • is not complete
Over 80 Scripts not Encoded! Courtesy of Michael Everson: http://evertype.com
Current State of the Unicode Standard: New Script Additions • For Unicode 5.0 (2006): N’Ko (West Africa), Balinese (Indonesia), Phags-pa (historical), Phoenician (historical), Cuneiform (historical) • For Unicode 5.1 (2008): Lepcha (India), Ol Chiki (India), Vai (Liberia), Saurashtra (India), Myanmar minorities (Myanmar), Kayah Li (Myanmar), Rejang (Indonesia), Sundanese (Indonesia), Carian, Lycian, Lydian (historical) Courtesy of Michael Everson: http://evertype.com
Working with Unicode Unicode is all About Text • Most applicable to problems where language is represented by text. • Unicode addresses some vocabulary but under the scope of localization (CLDR). • May not be the solution if you are not working with text represented in written form • Although, Unicode can be used for symbol processing
Working with Unicode Operating Systems • Almost anything from this millennium. • Apple MacOS Version ≥ 9.2 • Microsoft Windows CE, NT, XP, 2000 • Solaris ≥ 2.8 • Any GNU/Linux (for console use) • GNOME 2.0 or KDE 2.0 and Later
Working with Unicode The International Phonetic Alphabet (IPA)
Working with Unicode The International Phonetic Alphabet (IPA) • SIL Charis, Doulos, Gentium • free and most complete • matches “Times New Roman” style • http://scripts.sil.org/IPAhome
Working with Unicode If you need more letters… • Create Your own Fonts! • Use the Unicode Private Use Area (PUA) • this is Unicode’s extension mechanism. • does not break compatibility with Unicode software. • you must send your fonts with your work. • encode non-letter symbols (tokens, tags), no need for fonts.
Working with Unicode The PUA • 6,400 code points in the range E000-F8FF • 131,068 additional code points available in “planes” 15 & 16 • Work in Plane 0 first (0000 – FFFF) • Intended for company logos, ligatures used by typesetting software, etc.
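The ranges above can be checked with a few lines of Python. This is only a sketch using the standard library; the names `PUA_RANGES` and `is_private_use` are invented here for illustration and do not come from any Unicode tool.

```python
# Private Use Area ranges as defined by the Unicode Standard.
PUA_RANGES = [
    (0xE000, 0xF8FF),      # Plane 0 (BMP): start here
    (0xF0000, 0xFFFFD),    # Plane 15
    (0x100000, 0x10FFFD),  # Plane 16
]

def is_private_use(ch: str) -> bool:
    """Return True if the character falls in any Private Use Area."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in PUA_RANGES)

# Number of code points in each range (hypothetical helper values).
bmp_count = PUA_RANGES[0][1] - PUA_RANGES[0][0] + 1
supplementary_count = sum(hi - lo + 1 for lo, hi in PUA_RANGES[1:])

print(bmp_count, supplementary_count)
print(is_private_use("\uE000"), is_private_use("A"))
```

A check like this is handy before sharing a corpus: any text where `is_private_use` returns True will only display correctly with your own font.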
Working with Unicode Creating Your Own Fonts • Bitmap (BDF) • Faster to create • One size per font, not so scalable • Works best with X-Windows (Unix) • Outline (TrueType, PostScript, OpenType) • Takes more time • Scalable • MS Windows, Mac, Modern Unixes
Working with Unicode Bitmap Editors • Each letter is a matrix of pixels, like tiles • You toggle them on or off to shape your letters • GBDFED for recent GNOME/Linux • XBDFED for general Unix • Or search for “BDF Editor”
Working with Unicode Bitmap Editors Zoom View Within Edit Window
Working with Unicode Outline Editors • Create Bezier curves to outline scalable shapes • Here traced around a scanned image • FontForge http://fontforge.sf.net
Working with Unicode Creating Your Own Keyboards • No standard formats • Different on every operating system • May require some painful programming • transliteration may be a better alternative. • For small amounts of typing try: Ctrl+Shift+X₁X₂X₃X₄, where X₁ to X₄ are the four hexadecimal digits of the code point, e.g. Ctrl+Shift+1234
Working with Unicode Creating Your Own Keyboards Linux • Migration Toward Smart Common Input Method (SCIM) • simple table based • more complex as needed • http://scim.sf.net - or Yudit, Emacs for older Unixes, but you can only type in these applications.
Working with Unicode Creating Your Own Keyboards Windows • Keyman, most mature & robust • Keyboards created with KeymanDeveloper • $59 academic and developing world license • worth every cent • compiled keyboards also run under Linux with a SCIM module • http://tavultesoft.com
Working with Unicode Text Processing • International Components for Unicode (ICU) • http://icu.sf.net • Java, C/C++ • Bindings in: Python, Ruby, C#,Perl 6 (some Perl 5) • started by IBM, is OpenSource • managed by the Unicode president • check with ICU before • 700+ Encoding Conversions • convert legacy systems to and from Unicode • migrate corpora to Unicode
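Migrating legacy text to Unicode boils down to decoding with the old encoding and re-encoding as UTF-8. The sketch below uses Python's built-in codecs rather than ICU's converters, so it only covers encodings Python already knows; the function name `to_utf8` and the ISO-8859-1 sample are assumptions made for illustration.

```python
def to_utf8(legacy_bytes: bytes, legacy_encoding: str) -> bytes:
    """Decode bytes from a legacy encoding and re-encode them as UTF-8."""
    text = legacy_bytes.decode(legacy_encoding)  # bytes -> Unicode string
    return text.encode("utf-8")                  # Unicode string -> UTF-8 bytes

# A small sample "legacy" document in ISO-8859-1.
legacy = "café".encode("iso-8859-1")
migrated = to_utf8(legacy, "iso-8859-1")

print(legacy, migrated)
```

For a script with no registered legacy encoding (common for font-based hacks), you would need your own mapping table instead of a codec name.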
Working with Unicode Text Processing ICU: Normalization • Equate letters and diacritical symbols (the slide’s example involves U+0323, the combining dot below)
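Normalization can be tried out with Python's standard `unicodedata` module, which implements the same Unicode normalization forms that ICU does. This is a sketch, not ICU's API; the letter "a" and the helper `equivalent` are chosen here only for illustration.

```python
import unicodedata

decomposed = "a\u0323"   # 'a' followed by U+0323 COMBINING DOT BELOW
precomposed = "\u1EA1"   # LATIN SMALL LETTER A WITH DOT BELOW

def equivalent(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(decomposed == precomposed)            # raw comparison of code points
print(equivalent(decomposed, precomposed))  # comparison after normalization
```

Normalizing a corpus once, on import, avoids having to remember this step in every later search or comparison.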
Working with Unicode Text Processing ICU: Regular Expressions • Applies the Unicode Character Database • Categorize every character as one of • Letter • Number • Separator • Punctuation • Marks • Symbols • Others • Subcategories within each. Examples • Letter, Uppercase, lowercase, Other, … • Symbols, Math, Currency, Modifiers, … • Mark, spacing, non-spacing, enclosing • Defines 80 character property types
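The general categories listed above are exposed by Python's standard `unicodedata` module as two-letter codes, where the first letter is the major category and the second the subcategory. A minimal sketch follows; the `MAJOR` table and `describe` helper are invented for illustration.

```python
import unicodedata

# First letter of the two-letter General_Category code.
MAJOR = {
    "L": "Letter", "M": "Mark", "N": "Number", "P": "Punctuation",
    "S": "Symbol", "Z": "Separator", "C": "Other",
}

def describe(ch: str) -> tuple[str, str]:
    """Return (two-letter category code, major category name) for a character."""
    code = unicodedata.category(ch)
    return code, MAJOR[code[0]]

for ch in ["A", "ሀ", "5", "\u0323", "€", " "]:
    print(f"U+{ord(ch):04X}", describe(ch))
```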
Working with Unicode Text Processing ICU: Regular Expressions Set Operations • [^\p{Letter}] Negation • [\p{Letter}\p{Number}] Union • [\p{Letter}&\p{script=Cyrillic}] Intersection • [\p{Letter}-\p{Latin}] Difference • Important for a character set the size of Unicode.
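The four set operations can be imitated with ordinary Python sets to see what each one selects. This is a rough sketch and not ICU: it covers only Plane 0, and because the standard library has no script property, "Cyrillic" and "Latin" are approximated by the prefix of the character name, which is an assumption that misses some characters.

```python
import unicodedata

# All Plane 0 code points except surrogates.
BMP = [chr(cp) for cp in range(0x10000) if not 0xD800 <= cp <= 0xDFFF]

def having(predicate):
    """Collect the Plane 0 characters for which predicate(ch) is true."""
    return {ch for ch in BMP if predicate(ch)}

letters = having(lambda ch: unicodedata.category(ch).startswith("L"))
numbers = having(lambda ch: unicodedata.category(ch).startswith("N"))
# Approximation: script membership guessed from the character name.
cyrillic = having(lambda ch: unicodedata.name(ch, "").startswith("CYRILLIC"))
latin = having(lambda ch: unicodedata.name(ch, "").startswith("LATIN"))

not_letters = set(BMP) - letters       # negation
letters_or_numbers = letters | numbers # union
cyrillic_letters = letters & cyrillic  # intersection
non_latin_letters = letters - latin    # difference

print(len(letters), len(cyrillic_letters), len(non_latin_letters))
```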
Working with Unicode Text Processing ICU: Regular Expressions • Enhanced Word Boundaries • Classic RE: Hello There. G’day 123.456 • Unicode Word Boundaries: Hello There. G’day 123.456
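The difference between the two kinds of boundary can be seen on the slide's own sample sentence. The sketch below uses Python's `re` module; the second pattern is only a crude stand-in for Unicode word segmentation (it simply lets an apostrophe or full stop join two runs of word characters) and is not what ICU implements.

```python
import re

text = "Hello There. G’day 123.456"

# Classic behaviour: a word is a run of \w characters, so the
# apostrophe and the decimal point both split their tokens.
classic = re.findall(r"\w+", text)

# Rough approximation of Unicode-style boundaries (an assumption):
# keep an apostrophe or full stop when it sits between word characters.
approx_unicode = re.findall(r"\w+(?:['’.]\w+)*", text)

print(classic)
print(approx_unicode)
```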
Working with Unicode Text Processing ICU: Regular Expressions • Equivalence Classes • [=e=] matches all “e” [eèéêëēĕėęě] • not yet implemented • use Perl instead
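Until equivalence classes are available, a similar effect can be approximated by comparing base letters after canonical decomposition. The sketch below uses Python's standard `unicodedata` module; `base_letter` and `equivalence_class` are hypothetical helper names, and the approach only works for accented letters that have a canonical decomposition.

```python
import unicodedata

def base_letter(ch: str) -> str:
    """Return the first code point of the character's NFD decomposition."""
    return unicodedata.normalize("NFD", ch)[0]

def equivalence_class(base: str, candidates: str) -> list[str]:
    """Keep the candidates whose base letter equals `base`."""
    return [ch for ch in candidates if base_letter(ch) == base]

# e with grave, acute, circumflex, diaeresis, macron, breve,
# dot above, ogonek and caron, plus two non-members.
sample = "e\u00e8\u00e9\u00ea\u00eb\u0113\u0115\u0117\u0119\u011ba\u00e1"
matched = equivalence_class("e", sample)

print("".join(matched))
```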
Working with Unicode Overloading Perl Regex with Regexp::Ethiopic Simple Plurals: [#7#]ች vs [ሆሎሖሞሦሮሶሾቆቦቮቶቾኆኖኞኦኮዖዞዦዮዶጆጎጦጮጶጾፆፎፖ]ች
Working with Unicode Overloading Perl Regex with Regexp::Ethiopic • /[#3#]ያ/ matches: አንባቢያን, ሚያዚያ, ኢትዮጵያዊያን • /[#3,6#]ያ/ matches: አንባቢያን and አንባብያን, ሚያዚያ and ሚያዝያ, ኢትዮጵያዊያን and ኢትዮጵያውያን
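The idea behind the `[#N#]` classes can be sketched in Python without Regexp::Ethiopic. The code below is not that module's implementation: it assumes that the order of a syllable can be read from its position in the eight-column layout of the Ethiopic block (U+1200 onwards), which is only an approximation for the labialized rows, and the helper name `ethiopic_order_class` is invented here.

```python
import re
import unicodedata

def ethiopic_order_class(*orders: int) -> str:
    """Build a regex character class of Ethiopic syllables in the given orders (1-8)."""
    chars = []
    for cp in range(0x1200, 0x1358):
        # Skip unassigned code points inside the block.
        if not unicodedata.name(chr(cp), "").startswith("ETHIOPIC SYLLABLE"):
            continue
        # Assumption: column position within each row of eight gives the order.
        if (cp - 0x1200) % 8 + 1 in orders:
            chars.append(chr(cp))
    return "[" + "".join(chars) + "]"

third_order_ya = re.compile(ethiopic_order_class(3) + "ያ")        # like /[#3#]ያ/
third_or_sixth_ya = re.compile(ethiopic_order_class(3, 6) + "ያ")  # like /[#3,6#]ያ/
plural = re.compile(ethiopic_order_class(7) + "ች")                # like /[#7#]ች/

for word in ["አንባቢያን", "አንባብያን"]:
    print(word, bool(third_order_ya.search(word)), bool(third_or_sixth_ya.search(word)))
```

Because the class is generated from the code chart, it will contain more syllables than the hand-written list on the plurals slide.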
Working with Unicode Text Processing ICU: Transliteration • Defined by “transform rules” • One to one mappings: • α <> a; • β <> b; • Context Rules: • β } [aeiou] > b; • β } [^aeiou] > v;
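How ordered rules with a following context behave can be sketched in plain Python. This is not ICU's transform engine: the `RULES` table and `transliterate` function are invented for illustration, only the forward direction is handled, and the end of the text is treated as "no context", so the plain β > b mapping applies there.

```python
import re

# (source, required following context or None, target); first match wins.
RULES = [
    ("β", re.compile(r"[aeiou]"), "b"),   # β } [aeiou] > b
    ("β", re.compile(r"[^aeiou]"), "v"),  # β } [^aeiou] > v
    ("α", None, "a"),                     # α > a
    ("β", None, "b"),                     # β > b (fallback, e.g. at end of text)
]

def transliterate(text: str) -> str:
    out = []
    i = 0
    while i < len(text):
        for source, after, target in RULES:
            if not text.startswith(source, i):
                continue
            j = i + len(source)
            # The context is checked but not consumed.
            if after is not None and not after.match(text, j):
                continue
            out.append(target)
            i = j
            break
        else:
            out.append(text[i])  # no rule applies: copy the character
            i += 1
    return "".join(out)

print(transliterate("βa"), transliterate("βk"), transliterate("αβ"))
```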
Working with Unicode Text Processing ICU: Transliteration • Defined by “transform rules” • Applying UCD Properties • Θ } [:LowercaseLetter:] <> Th; • Θ <> TH; • Reverse Transliteration Context Rules • σ < [:^Letter:] { s } [:^Letter:] ; • ς < s } [:^Letter:] ; • σ < s ;
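The three reverse rules for sigma can be read as a small decision procedure. The Python sketch below mirrors them by hand rather than through ICU; it uses `str.isalpha()` as a stand-in for the Letter property and assumes that the start and end of the text count as non-letters. The function name is hypothetical.

```python
def latin_s_to_sigma(text: str) -> str:
    """Map Latin 's' to σ or final ς, following the three reverse rules."""
    out = []
    for i, ch in enumerate(text):
        if ch != "s":
            out.append(ch)
            continue
        prev_is_letter = i > 0 and text[i - 1].isalpha()
        next_is_letter = i + 1 < len(text) and text[i + 1].isalpha()
        if not prev_is_letter and not next_is_letter:
            out.append("σ")  # isolated s
        elif not next_is_letter:
            out.append("ς")  # s at the end of a word
        else:
            out.append("σ")  # s anywhere else
    return "".join(out)

print(latin_s_to_sigma("s"), latin_s_to_sigma("as"), latin_s_to_sigma("sa"))
```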
Working with Unicode Text Processing • ICU: Transliteration • Gets much more sophisticated • See also Perl’s Text::Transliterate
Working for Unicode Taking Your Work a Step Further • You’ve helped create an orthography – now make it official. • You’ve worked with a pre-existing un-encoded script using the PUA – now formalize it. • You’ve created a transliteration system – make it an ISO standard. • You’ve identified a dialect – encode it in ISO 639. • You’ve developed a keyboard – make it a national standard. • etc.
Working for Unicode Why go the extra mile (or kilometer)? • Ethnic pride and identity are promoted. • Literacy efforts can be encouraged. • The study of historic scripts is kept alive. • Communication between and amongst members of the community is promoted. • Government communication in times of emergency (disease, war, natural disaster) becomes possible. • Leads to localization, greater access to ICT. • …and you become the expert!
Working for Unicode What to Consider • The work will be more social than technical. • The work will take years (at least two). • Review Encoding History • Has this been attempted before and failed? Why? • Are there any non-Unicode encodings? • Determine the Stakeholders • The Government – will they support you, oppose you, jail you? • Political Parties, Religious, Education, Cultural Groups • Does anyone have something to lose by the encoding? • Communicate, Communicate, Communicate… • and be transparent. • the perception of being closed breeds suspicion and opposition. • …even 11 years after the fact, trust me on this.
Working for Unicode New Keyboard? • No international standardization working groups • Contribute Keyboard back to main project • Contact Local ICT Professionals Organization • Contact Local University CS Department • Contact Local Standards Body
Working for Unicode New Language or Dialect? • Contact the ISO/DIS 639-3 Registration Authority • http://sil.org/iso639-3/ • iso639-3@sil.org • Contact Language or Cultural Authority • Contact Local University Linguistics Department
Working for Unicode New Orthography? Or Un-encoded? • Contact the ISO 15924 Registration Authority • http://unicode.org/iso15924/ • Contact Language or Cultural Authority • Contact Local ICT Professionals Organization • Contact Local University CS Department • Contact Local University Linguistics Department • Contact Local Standards Body • Contact the Script Encoding Initiative
Working for Unicode The Script Encoding Initiative • http://linguistics.berkeley.edu/sei • Works with users on script proposals. • Helps raise money for script proposals to be written and free fonts to be created. • Works collaboratively with other groups (e.g. SIL) to avoid duplication of effort. • Helps seek experts to review proposals. • Participates at standards meetings on behalf of minority groups and scholars.
~fini~ • Conclusion • Use Unicode Now! • You can do it! • Yes you can do it! • There are no excuses anymore… • …it’s 2006 already, I’m telling you, you can do this! • and when you do (remember I have faith in you!) consider feeding back into the system via standardization. • Be a good citizen of earth, always ☺. Thank You for Listening. Are There Any Questions? This presentation: http://yacob.org/papers/