Unicode for Under Resourced Languages

Unicode for Under Resourced Languages. Daniel Yacob Ge’ez Frontier Foundation. SALTMIL 5: Genoa, Italy 2006. Overview. What is “Unicode”? More than Just Encoded Letters! Working with Unicode How Unicode can help you. Resources and how to apply them. Working for Unicode


Presentation Transcript


  1. Unicode for Under Resourced Languages Daniel Yacob Ge’ez Frontier Foundation SALTMIL 5: Genoa, Italy 2006

  2. Overview • What is “Unicode”? • More than Just Encoded Letters! • Working with Unicode • How Unicode can help you. • Resources and how to apply them. • Working for Unicode • How you can help Unicode. • How Unicode can help your U-RL.

  3. My Background • Started Ethiopic software work in 1993 • transliterator, keyboard, fonts • Amharic Computational Linguistics in 1994 • “Extended Ethiopic” Unicode Standardization 1995-2004 • Corpus Collection 1997 – Present • Began Using Unicode in 1995 for Ethiopic • but no Unicode standard existed until 2000!

  4. My Background • Little or no Unicode based resources in 1993-1997 • Today there is almost always an OpenSource project that you can start with and extend. • Minimize the time and labour you put into developing basic resources. • Avoid the maintenance trap. • We will assume the worst case scenario • You work on a language, using a script, with no pre-existing software resources at all.

  5. What Unicode is Unicode … • is a consortium • is a process • is a community • is a conference • is a database • is a standard • is a collection of standards

  6. What Unicode is not Unicode … • is not a font • is not a keyboard system • is not a transliteration system • is not the ISO • is not perfect • is not complete

  7. Over 80 Scripts not Encoded! Courtesy of Michael Everson: http://evertype.com

  8. Over 80 Scripts not Encoded! Courtesy of Michael Everson: http://evertype.com

  9. For Unicode 5.0 (2006): N’Ko (West Africa) Balinese (Indonesia) Phags-pa (historical) Phoenician (historical) Cuneiform (historical) For Unicode 5.1 (2008): Lepcha (India) Ol Chiki (India) Vai (Liberia) Saurashtra (India) Myanmar minorities (Myanmar) Kayah Li (Myanmar) Rejang (Indonesia) Sundanese (Indonesia) Carian, Lycian, Lydian (historical) Current State of the Unicode Standard: New Script Additions Courtesy of Michael Everson: http://evertype.com

  10. Working with Unicode Unicode is all About Text • Most applicable to problems where language is represented by text. • Unicode addresses some vocabulary but under the scope of localization (CLDR). • May not be the solution if you are not working with text represented in written form • Although, Unicode can be used for symbol processing

  11. Working with Unicode Operating Systems • Most anything from this millennium. • Apple MacOS Version ≥ 9.2 • Microsoft Windows CE, NT, XP, 2000 • Solaris ≥ 2.8 • Any GNU/Linux (for console use) • GNOME 2.0 or KDE 2.0 and Later

  12. Working with Unicode The International Phonetic Alphabet (IPA)

  13. Working with Unicode The International Phonetic Alphabet (IPA) • SIL Charis, Doulos, Gentium • free and most complete • matches “Times New Roman” style • http://scripts.sil.org/IPAhome

  14. Working with Unicode If you need more letters… • Create Your own Fonts! • Use the Unicode Private Use Area (PUA) • this is Unicode’s extension mechanism. • does not break compatibility with Unicode software. • you must send your fonts with your work. • encode non-letter symbols, no need for fonts.

  15. Working with Unicode The PUA • 6,400 code points in the range E000-F8FF • 131,068 additional code points available in “planes” 15 & 16 • Work in Plane 0 first (0000 – FFFF) • Intended for company logos, ligatures used by typesetting software, etc.
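The PUA ranges above can be checked programmatically. The following is a minimal sketch in Python (not part of the original slides); the helper name `is_private_use` is hypothetical, and the plane 15 and 16 ranges are taken from the Unicode Standard rather than from the slide.

```python
# Private Use Area ranges as (first, last) code points, inclusive.
PUA_RANGES = [
    (0xE000, 0xF8FF),      # Plane 0 (Basic Multilingual Plane)
    (0xF0000, 0xFFFFD),    # Plane 15
    (0x100000, 0x10FFFD),  # Plane 16
]


def is_private_use(ch: str) -> bool:
    """Return True if the single character ch lies in a Private Use Area."""
    cp = ord(ch)
    return any(first <= cp <= last for first, last in PUA_RANGES)


def range_size(first: int, last: int) -> int:
    """Number of code points in an inclusive range."""
    return last - first + 1


bmp_pua_size = range_size(*PUA_RANGES[0])
supplementary_pua_size = sum(range_size(*r) for r in PUA_RANGES[1:])
```

A check such as this is useful when auditing a corpus: any character for which `is_private_use` returns True only displays correctly with the custom font that was shipped alongside the text.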

  16. Working with Unicode Creating Your Own Fonts • Bitmap (BDF) • Faster to create • One size per font, not so scalable • Works best with X-Windows (Unix) • Outline (TrueType, PostScript, OpenType) • Takes more time • Scalable • MS Windows, Mac, Modern Unixes

  17. Working with Unicode Bitmap Editors • Each letter is a matrix of pixels, like tiles • You toggle them on or off to shape your letters • GBDFED for recent GNOME/Linux • XMBDFED for general Unix • Or search for “BDF Editor”

  18. Working with Unicode

  19. Working with Unicode Bitmap Editors Zoom View Within Edit Window

  20. Working with Unicode Outline Editors • Create Bezier curves to outline scalable shapes • Here traced around a scanned image • FontForge http://fontforge.sf.net

  21. Working with Unicode Creating Your Own Keyboards • No standard formats • Different on every operating system • May require some painful programming • transliteration may be a better alternative. • For small amounts of typing try: Ctrl+Shift+X1X2X3X4 Ctrl+Shift+1234

  22. Working with Unicode Creating Your Own Keyboards Linux • Migration Toward Smart Common Input Method (SCIM) • simple table based • more complex as needed • http://scim.sf.net - or Yudit, Emacs for older Unixes, but you can only type in these applications.

  23. Working with Unicode Creating Your Own Keyboards Windows • Keyman, most mature & robust • Keyboards created with KeymanDeveloper • $59 academic and developing world license • worth every cent • compiled keyboards also run under Linux with a SCIM module • http://tavultesoft.com

  24. Working with Unicode Text Processing • International Components for Unicode (ICU) • http://icu.sf.net • Java, C/C++ • Bindings in: Python, Ruby, C#, Perl 6 (some Perl 5) • started by IBM, is OpenSource • managed by the Unicode president • check with ICU before • 700+ Encoding Conversions • convert legacy systems to and from Unicode • migrate corpora to Unicode
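Migrating a legacy-encoded corpus to Unicode can be sketched without ICU. The example below is an illustration only: it uses Python's built-in codecs as a stand-in for ICU's converters, the function name `to_utf8` is hypothetical, and KOI8-R is chosen arbitrarily as the legacy encoding.

```python
def to_utf8(raw: bytes, legacy_encoding: str) -> bytes:
    """Decode bytes from a legacy encoding and re-encode them as UTF-8."""
    text = raw.decode(legacy_encoding)  # legacy bytes -> Unicode string
    return text.encode("utf-8")         # Unicode string -> UTF-8 bytes


# Simulate a line of a legacy corpus stored in KOI8-R.
legacy_bytes = "Привет".encode("koi8_r")
utf8_bytes = to_utf8(legacy_bytes, "koi8_r")
```

For a script whose legacy encoding is a font-specific hack with no registered converter, a custom mapping table would be needed instead; that case is not covered by this sketch.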

  25. Working with Unicode Text Processing ICU: Normalization • Equate letters and diacritical symbols (e.g. a base letter followed by the combining dot below, U+0323)
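What normalization does can be shown with a small sketch. It uses Python's standard `unicodedata` module rather than ICU, and the letter ḍ (d with dot below) is an assumed example, not one taken from the slide.

```python
import unicodedata

precomposed = "\u1e0d"   # ḍ as one code point: LATIN SMALL LETTER D WITH DOT BELOW
decomposed = "d\u0323"   # d followed by COMBINING DOT BELOW (U+0323)

# The two strings look identical on screen but differ code point by code point.
raw_equal = precomposed == decomposed

# Normalizing both to the same form (NFC composes, NFD decomposes) makes them comparable.
nfc_equal = unicodedata.normalize("NFC", decomposed) == precomposed
nfd_equal = unicodedata.normalize("NFD", precomposed) == decomposed
```

Normalizing a corpus to one form before searching, sorting, or counting avoids treating visually identical words as different.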

  26. Working with Unicode Text Processing ICU: Regular Expressions • Applies the Unicode Character Database • Categorize every character as one of • Letter • Number • Separator • Punctuation • Marks • Symbols • Others • Subcategories within each. Examples • Letter, Uppercase, lowercase, Other, … • Symbols, Math, Currency, Modifiers, … • Mark, spacing, non-spacing, enclosing • Defines 80 character property types
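The category scheme above can be explored with Python's standard `unicodedata` module, which exposes the General Category of each character as a two-letter code (major class, then subcategory). This is a sketch for illustration; the helper `describe` and the `MAJOR_CLASSES` table are names introduced here.

```python
import unicodedata

# First letter of the General Category code -> major class name.
MAJOR_CLASSES = {
    "L": "Letter",
    "M": "Mark",
    "N": "Number",
    "P": "Punctuation",
    "S": "Symbol",
    "Z": "Separator",
    "C": "Other",
}


def describe(ch: str) -> tuple:
    """Return (two-letter category code, major class name) for one character."""
    code = unicodedata.category(ch)
    return code, MAJOR_CLASSES[code[0]]


examples = {ch: describe(ch) for ch in ["A", "ሀ", "\u0323", "€", "5", " "]}
```

Here "Lu" is Letter, uppercase; "Lo" is Letter, other (used for Ethiopic syllables); "Mn" is Mark, non-spacing; "Sc" is Symbol, currency.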

  27. Working with Unicode Text Processing ICU: Regular Expressions Set Operations • [^\p{Letter}] Negation • [\p{Letter}\p{Number}] Union • [\p{Letter}&\p{script=Cyrillic}] Intersection • [\p{Letter}-\p{Latin}] Difference • Important for a character set the size of Unicode.
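The four set operations can be imitated with ordinary Python sets over a small sample of characters. This is only a sketch of the idea, not ICU's regular expression engine: Python's standard library has no script property, so `in_script` below is a rough, hypothetical approximation that inspects the character's Unicode name.

```python
import unicodedata


def is_letter(ch: str) -> bool:
    return unicodedata.category(ch).startswith("L")


def is_number(ch: str) -> bool:
    return unicodedata.category(ch).startswith("N")


def in_script(ch: str, script: str) -> bool:
    # Approximation: character names usually begin with the script name.
    return unicodedata.name(ch, "").startswith(script.upper())


sample = set("aЖβ7 ች!")

letters = {ch for ch in sample if is_letter(ch)}
numbers = {ch for ch in sample if is_number(ch)}
cyrillic = {ch for ch in sample if in_script(ch, "Cyrillic")}
latin = {ch for ch in sample if in_script(ch, "Latin")}

negation = sample - letters        # like [^\p{Letter}]
union = letters | numbers          # like [\p{Letter}\p{Number}]
intersection = letters & cyrillic  # like letters that are also Cyrillic
difference = letters - latin       # like letters that are not Latin
```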

  28. Working with Unicode Text Processing ICU: Regular Expressions • Enhanced Word Boundaries: Hello There. G’day 123.456 (Classic RE) vs. Hello There. G’day 123.456 (Unicode Word Boundaries)

  29. Working with Unicode Text Processing ICU: Regular Expressions • Equivalence Classes • [=e=] matches all “e” [eèéêëēĕėęě] • not yet implemented • use Perl instead
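An equivalence class such as [=e=] can be approximated by decomposing each character and discarding the combining marks. The sketch below uses Python's `unicodedata` module; it is not how ICU or Perl implement equivalence classes, and the helper names are hypothetical.

```python
import unicodedata


def base_letter(ch: str) -> str:
    """Strip combining marks from a character after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(
        c for c in decomposed if not unicodedata.category(c).startswith("M")
    )


def in_equivalence_class(ch: str, base: str) -> bool:
    """True if ch is the base letter itself or the base plus diacritics."""
    return base_letter(ch) == base


all_e = all(in_equivalence_class(ch, "e") for ch in "eèéêëēĕėęě")
```

This only works for letters that have a canonical decomposition; letters such as ø or ł, which Unicode encodes as atomic characters, would need an explicit table.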

  30. Working with Unicode Overloading Perl Regex with Regexp::Ethiopic Simple Plurals: [#7#]ች vs [ሆሎሖሞሦሮሶሾቆቦቮቶቾኆኖኞኦኮዖዞዦዮዶጆጎጦጮጶጾፆፎፖ]ች

  31. Working with Unicode Overloading Perl Regex with Regexp::Ethiopic • /[#3#]ያ/ • አንባቢያን • ሚያዚያ • ኢትዮጵያዊያን • /[#3,6#]ያ/ • አንባቢያን አንባብያን • ሚያዚያ ሚያዝያ • ኢትዮጵያዊያን ኢትዮጵያውያን
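The idea behind the [#n#] notation can be sketched in Python. This is not Regexp::Ethiopic itself and may differ from its behaviour; it relies on the assumption that each consonant occupies a row of eight code points starting at U+1200, so that a syllable's order is its offset within the row plus one. The labialized rows do not follow the seven-order pattern exactly, so the classes built here are approximate. The function name `order_class` is hypothetical.

```python
import re
import unicodedata

ETHIOPIC_FIRST = 0x1200  # start of the Ethiopic block
ETHIOPIC_LAST = 0x135A   # last syllable in the main block


def order_class(*orders: int) -> str:
    """Build a regex character class of all syllables in the given orders (1-8)."""
    chars = []
    for cp in range(ETHIOPIC_FIRST, ETHIOPIC_LAST + 1):
        ch = chr(cp)
        # Skip unassigned code points and anything that is not a syllable.
        if not unicodedata.name(ch, "").startswith("ETHIOPIC SYLLABLE"):
            continue
        if (cp - ETHIOPIC_FIRST) % 8 + 1 in orders:
            chars.append(ch)
    return "[" + "".join(chars) + "]"


third_then_ya = re.compile(order_class(3) + "ያ")             # like /[#3#]ያ/
third_or_sixth_then_ya = re.compile(order_class(3, 6) + "ያ")  # like /[#3,6#]ያ/
```

With these patterns, `third_then_ya` finds አንባቢያን but not the variant spelling አንባብያን, while `third_or_sixth_then_ya` finds both, mirroring the two result lists on the slide.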

  32. Working with Unicode Text Processing ICU: Transliteration • Defined by “transform rules” • One to one mappings: • α <> a; • β <> b; • Context Rules: • β } [aeiou] > b; • β } [^aeiou] > v;
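The way context rules take priority over plain mappings can be sketched in Python. This is a toy rule engine, not ICU's transliterator: the "}" context of the ICU rule is modelled with a regex lookahead, rules are tried in order at each position, and the first match wins. The names `RULES` and `transliterate` are hypothetical.

```python
import re

# Ordered rules: (pattern to match at the current position, replacement).
# (?=...) is a lookahead: it checks the following context without consuming it.
RULES = [
    (re.compile(r"β(?=[aeiou])"), "b"),   # β } [aeiou] > b
    (re.compile(r"β(?=[^aeiou])"), "v"),  # β } [^aeiou] > v
    (re.compile(r"α"), "a"),              # α > a
]


def transliterate(text: str) -> str:
    """Apply the first matching rule at each position, left to right."""
    out = []
    i = 0
    while i < len(text):
        for pattern, replacement in RULES:
            match = pattern.match(text, i)
            if match:
                out.append(replacement)
                i = match.end()
                break
        else:
            # No rule applies: copy the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)
```

Both context rules require a following character, so in this sketch a β at the very end of the text is left untouched.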

  33. Working with Unicode Text Processing ICU: Transliteration • Defined by “transform rules” • Applying UCD Properties • Θ } [:LowercaseLetter:] <> Th; • Θ <> TH; • Reverse Transliteration Context Rules • σ < [:^Letter:] { s } [:^Letter:] ; • ς < s } [:^Letter:] ; • σ < s ;

  34. Working with Unicode Text Processing • ICU: Transliteration • Gets much more sophisticated • See also Perl’s Text::Transliterate

  35. Working for Unicode Taking Your Work a Step Further • You’ve helped create an orthography – now make it official. • You’ve worked with a pre-existing un-encoded script using the PUA – now formalize it. • You’ve created a transliteration system – make it an ISO standard. • You’ve identified a dialect – encode it in ISO 639. • You’ve developed a keyboard – make it a national standard. • etc.

  36. Working for Unicode Why go the extra mile kilometer? • Ethnic pride and identity is promoted. • Literacy efforts can be encouraged. • The study of historic scripts is kept alive. • Communication between and amongst members of the community is promoted. • Government communication in times of emergency (disease, war, natural disaster). • Leads to localization, greater access to ICT. • …and you become the expert!

  37. Working for Unicode What to Consider • The work will be more social than technical. • The work will take years (at least two). • Review Encoding History • Has this been attempted before and failed? Why? • Are there any non-Unicode encodings? • Determine the Stakeholders • The Government – will they support you, oppose you, jail you? • Political Parties, Religious, Education, Cultural Groups • does anyone have something to lose by the encoding? • Communicate, Communicate, Communicate… • and be transparent. • the perception of being closed breeds suspicion and opposition. • …even 11 years after the fact, trust me on this.

  38. Working for Unicode New Keyboard? • No international standardization working groups • Contribute Keyboard back to main project • Contact Local ICT Professionals Organization • Contact Local University CS Department • Contact Local Standards Body

  39. Working for Unicode New Language or Dialect? • Contact the ISO/DIS 639-3 Registration Authority • http://sil.org/iso639-3/ • iso639-3@sil.org • Contact Language or Cultural Authority • Contact Local University Linguistics Department

  40. Working for Unicode New Orthography? Or Un-encoded? • Contact the ISO 15924 Registration Authority • http://unicode.org/iso15924/ • Contact Language or Cultural Authority • Contact Local ICT Professionals Organization • Contact Local University CS Department • Contact Local University Linguistics Department • Contact Local Standards Body • Contact the Script Encoding Initiative

  41. Working for Unicode The Script Encoding Initiative • http://linguistics.berkeley.edu/sei • Works with users on script proposals. • Helps raise money for script proposals to be written and free fonts to be created. • Works collaboratively with other groups (e.g. SIL) to avoid duplication of effort. • Helps seek experts to review proposals. • Participates at standards meetings on behalf of minority groups and scholars.

  42. ~fini~ • Conclusion • Use Unicode Now! • You can do it! • Yes you can do it! • There are no excuses anymore… • …it’s 2006 already, I’m telling you, you can do this! • and when you do (remember I have faith in you!) consider feeding back into the system via standardization. • Be a good citizen of earth, always ☺. Thank You for Listening. Are There Any Questions? This presentation: http://yacob.org/papers/