
Unicode Transforms in ICU



Presentation Transcript


  1. Unicode Transforms in ICU • Mark Davis, Chief SW Globalization Architect, IBM

  2. What is ICU? • Internationalization libraries for C, C++, Java* • Open source – non-viral • Sponsored by IBM • Sun’s Java licenses an earlier ICU version; ICU4J updates it. • Unicode standard compliant • full supplementary support • Cross-platform; extensible and customizable • High performance and thread-safe • Multiple locales in same thread – simultaneously • http://oss.software.ibm.com/icu/ 22nd International Unicode Conference

  3. ICU Features • Unicode text handling • Character set conversions (700+) • Collation & Searching • Locales (170+) • Resource Bundles • Calendar & Time zones • Complex-text layout engine • Breaks: character, word, line, & sentence • Formatting: Date & time, Messages, Numbers & currencies • Transforms: Normalization, Casing, Transliterations

  4. ICU Transforms • Powerful, flexible mechanism • Uppercase, Lowercase, Titlecase, Full/Halfwidth • Normalization • Hex, Character Names • Script to Script conversion… • Supports Styled Text, not just Plain Text • Chaining, Filters, Buffering • Customizable

  5. Transform Examples • “Any-Uppercase” a → A • “Any-Hex/Java” a → \u0061 • “Greek-Latin” α → a

  6. Filters • “[aeiou] Latin - Greek” • “Latin” is the source • “[aeiou]” is a filter; it restricts the transform to the English vowels. Uses UnicodeSet. • “Greek” is the target • “[^\u0000-\u007E] Any - Hex” • “A δ is…” → “A \u03B4 is\u2026”
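The second filter example above can be imitated without ICU. The following is a minimal sketch, not ICU code: `escapeOutsideFilter` and `appendEscape` are hypothetical helpers, and the Java-style surrogate-pair escaping of supplementary code points is an assumption about what the hex style should emit.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper (not an ICU function): append one UTF-16 code unit
// in Java-style \uXXXX form.
static void appendEscape(std::string& out, unsigned unit) {
    assert(unit <= 0xFFFF);
    char buf[8];
    std::snprintf(buf, sizeof buf, "\\u%04X", unit);
    out += buf;
}

// Imitates "[^\u0000-\u007E] Any-Hex": code points in U+0000..U+007E fall
// outside the filter and are copied unchanged; all others are hex-escaped.
std::string escapeOutsideFilter(const std::u32string& text) {
    std::string out;
    for (char32_t c : text) {
        if (c <= 0x7E) {
            out += static_cast<char>(c);
        } else if (c <= 0xFFFF) {
            appendEscape(out, static_cast<unsigned>(c));
        } else {
            // Supplementary code point: escape it as a surrogate pair.
            const unsigned v = static_cast<unsigned>(c) - 0x10000;
            appendEscape(out, 0xD800 + (v >> 10));
            appendEscape(out, 0xDC00 + (v & 0x3FF));
        }
    }
    return out;
}
```

Calling `escapeOutsideFilter(U"A \u03B4 is\u2026")` reproduces the slide's result; the filter is simply the `c <= 0x7E` test that decides which characters the transform may touch.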

  7. UnicodeSet • Ranges [ABC a-z] • Union [[:Lu:] [:P:]] • Intersection [[:Lu:] & [\u0000-\u01FF]] • Set Difference [[:Lu:] - [\u0000-\u01FF]] • Complement [^aeiou] • Properties • Uppercase letters [:Lu:] • Punctuation [:P:] • Script [:Greek:] • ICU 2.2: all enumerated Unicode 3.2 properties

  8. UnicodeSet Property Syntax • Either POSIX or Perl Style • \p{letter} • [:letter:] • Short or long form (UCD Property Aliases) • \p{general_category = uppercase_letter} • \p{gc=Lu} • Case-, Space-, Underbar-Insensitive

  9. Example Filter • “[:Lu:] Latin-Katakana; Latin-Hiragana” • Converts all uppercase Latin characters to Katakana, • then converts all other Latin characters to Hiragana.

  10. Chaining Transforms • “Kana-Latin; Any-Title” • たけだ, まさゆき • takeda, masayuki • Takeda, Masayuki • Any number of transforms can be chained
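Chaining amounts to feeding each transform's output into the next one. The sketch below illustrates that idea only: `toyKanaToLatin` and `toyTitle` are hypothetical stand-ins (a seven-entry kana table and ASCII-only title casing), not the real “Kana-Latin” or “Any-Title” transforms.

```cpp
#include <functional>
#include <map>
#include <string>
#include <vector>

using Text = std::u32string;
using Transform = std::function<Text(const Text&)>;

// Toy stand-in for "Kana-Latin": knows only the seven kana in the example.
Text toyKanaToLatin(const Text& in) {
    static const std::map<char32_t, Text> table = {
        {U'\u305F', U"ta"}, {U'\u3051', U"ke"}, {U'\u3060', U"da"},
        {U'\u307E', U"ma"}, {U'\u3055', U"sa"}, {U'\u3086', U"yu"},
        {U'\u304D', U"ki"},
    };
    Text out;
    for (char32_t c : in) {
        const auto it = table.find(c);
        if (it != table.end()) out += it->second;
        else out += c;  // unknown characters pass through unchanged
    }
    return out;
}

// Toy stand-in for "Any-Title": handles ASCII letters only.
Text toyTitle(const Text& in) {
    Text out;
    bool startOfWord = true;
    for (char32_t c : in) {
        const bool lower = c >= U'a' && c <= U'z';
        const bool upper = c >= U'A' && c <= U'Z';
        if (lower && startOfWord) c = c - U'a' + U'A';
        else if (upper && !startOfWord) c = c - U'A' + U'a';
        out += c;
        startOfWord = !(lower || upper);
    }
    return out;
}

// Runs the transforms in order, each on the previous result.
Text applyChain(const std::vector<Transform>& chain, Text text) {
    for (const Transform& t : chain) text = t(text);
    return text;
}
```

With the chain `{toyKanaToLatin, toyTitle}`, the kana name from the slide passes through “takeda, masayuki” on its way to “Takeda, Masayuki”. Because the chain is just a vector, any number of stages can be added.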

  11. Filtering plus Chaining • “NFD; [:Mark:] Remove; NFC” • Decompose • Remove accents (Marks) • Recompose

  12. Built-in Transforms • Normalization • Å → Å • Casing • a → A • Full ↔ Halfwidth • カ → ｶ • Character Names • a → {LATIN SMALL LETTER A} • Hex: XML, Java, C++, Perl, … styles • a → \u0061, U+0061, …

  13. Script ↔ Script Conversions • General conversions, e.g.: Greek-Latin • Source-Target Reversible: φ → ph → φ • Not Target-Source Reversible: f → φ → ph • Variants • By Language: Greek-German • By Standard: Greek-Latin/UNGEGN • Can build your own

  14. “Any-Latin” Example • 김, 국삼 → Gim, Gugsam • 김, 명희 → Gim, Myeonghyi • 정, 병호 → Jeong, Byeongho • たけだ, まさゆき → Takeda, Masayuki • ますだ, よしひこ → Masuda, Yoshihiko • やまもと, のぼる → Yamamoto, Noboru • Ρούτση, Άννα → Roútsē, Ánna • Καλούδης, Χρήστος → Kaloúdēs, Chrḗstos • Θεοδωράτου, Ελένη → Theodōrátou, Elénē

  15. Styled Text • Preserves individual styles on letters, where possible • απα → apa

  16. When Buffering • Conversions are not performed if they may extend over boundaries (after a “p”, the input could still become p, ph, or ps) • Key → Result • a → α • p → αp • a → απα • p → απαp • h → απαφ
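The keystroke table above can be reproduced with a small simulation. This is a sketch under stated assumptions, not ICU's implementation: the four-rule set `kRules` is a toy, the source file is assumed to be UTF-8, and stopping at a clipped match (so the held text is revisited on the next keystroke) is a choice made here to match the table.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct Rule { std::string from, to; };

// Toy rule set, longer patterns first so "ph" wins over "p".
static const std::vector<Rule> kRules = {
    {"ph", "φ"}, {"ps", "ψ"}, {"p", "π"}, {"a", "α"},
};

// Converts `text` in place from byte offset `start` and returns the offset
// where conversion stopped. In incremental mode it stops at a clipped match,
// i.e. when the remaining text is a proper prefix of some rule's pattern.
std::size_t transliterate(std::string& text, std::size_t start, bool incremental) {
    assert(start <= text.size());
    while (start < text.size()) {
        const std::string rest = text.substr(start);
        bool matched = false;
        for (const Rule& r : kRules) {
            if (rest.compare(0, r.from.size(), r.from) == 0) {
                text.replace(start, r.from.size(), r.to);
                start += r.to.size();
                matched = true;
                break;
            }
            const bool clipped = rest.size() < r.from.size() &&
                                 r.from.compare(0, rest.size(), rest) == 0;
            if (clipped && incremental) return start;  // wait for more input
        }
        if (!matched) ++start;
    }
    return start;
}

// Feeds keys one at a time and records what is shown after each keystroke.
std::vector<std::string> typeKeys(const std::string& keys) {
    std::vector<std::string> shown;
    std::string text;
    std::size_t start = 0;
    for (char k : keys) {
        text += k;
        start = transliterate(text, start, true);
        shown.push_back(text);
    }
    return shown;
}
```

`typeKeys("apaph")` yields the five results in the table: the “p” stays unconverted until the next key shows whether it begins “ph” or “ps”. Calling `transliterate` with `incremental` set to false flushes any held text.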

  17. Custom Rules • Similar to Regular Expressions • Variables • Property matches • Contextual matches • Rearrangement • $1, $2… • Quantifiers: • *, +, ?

  18. Differences from Regular Expressions • More Powerful… • Buffered/Keyboard • Styled Text • Ordered Rules • Cursor Backup • Less Powerful… • Only greedy quantifiers • No backup: so no (X | Y) • No “input-side back references”

  19. Example of Custom Rules • “UnixQuotes-RealQuotes” • \`\` > “ ; # two graves → left quote • \'\' > ” ; # two apostrophes → right quote • Example (SJ Mercury News online): ``expertise'' → “expertise”

  20. Rule Ordering • Find first rule that matches at start • If no match, or (isBuffered & clipped-Match) • advance start by 1 • Else if match, • Substitute text • Move start as specified • Continue until start reaches limit
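The loop described above can be sketched in a few lines. This is an illustration of the ordering idea only, assuming plain literal patterns and no buffering (so the clipped-match branch is left out); `Rule` and `applyRules` are hypothetical names, not ICU types.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct Rule { std::string from, to; };

// At each position, tries the rules in order and applies the first match;
// if none matches, advances by one. Stops when start reaches the limit.
std::string applyRules(const std::vector<Rule>& rules, std::string text) {
    std::size_t start = 0;
    while (start < text.size()) {
        bool matched = false;
        for (const Rule& r : rules) {
            assert(!r.from.empty());  // an empty pattern would never advance
            if (text.compare(start, r.from.size(), r.from) == 0) {
                text.replace(start, r.from.size(), r.to);
                start += r.to.size();  // move start past the substituted text
                matched = true;
                break;
            }
        }
        if (!matched) ++start;
    }
    return text;
}
```

With the rules `xy > c` and `yx > d`, `applyRules` turns `xyx-yxy-xyx` into `cx-dy-cx`, because both rules compete at every position instead of one rule sweeping the whole string first.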

  21. Rule Ordering Example • Transliterator rules: xy > c ; yx > d ; • Regular expressions: s/xy/c/g then s/yx/d/g • Input: xyx-yxy-xyx • Transliterator result: cx-dy-cx • Regular-expression result: cx-yc-cx
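The regular-expression side of this comparison is easy to check with the standard library. The sketch below assumes that running the two substitutions one after the other over the whole string is a fair stand-in for `s/xy/c/g` followed by `s/yx/d/g`; `sequentialSubstitute` is a hypothetical name.

```cpp
#include <regex>
#include <string>

// Applies each substitution to the whole string before starting the next,
// like running s/xy/c/g and then s/yx/d/g.
std::string sequentialSubstitute(std::string text) {
    text = std::regex_replace(text, std::regex("xy"), "c");
    text = std::regex_replace(text, std::regex("yx"), "d");
    return text;
}
```

For `xyx-yxy-xyx` the first pass already consumes the middle `xy`, leaving `cx-yc-cx` with no `yx` for the second pass to find. The transliterator's `cx-dy-cx` differs because its rules are tried together at each position.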

  22. Context • Rules: • {γ } [ Γ Κ Χ Ξ γ κ χ ξ ] > n; • γ > g; • Meaning: • Convert gamma into n IF followed by Γ, Κ, Χ, Ξ, γ, κ, χ, or ξ • Otherwise into g
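The effect of these two rules can be shown with a hand-written equivalent. This is a sketch of the behaviour only, not of how ICU evaluates context; `isGammaContext` and `transliterateGamma` are hypothetical names, and every character other than γ is simply passed through.

```cpp
#include <string>

// True for the letters listed in the rule's trailing context:
// Γ Κ Χ Ξ γ κ χ ξ.
static bool isGammaContext(char32_t c) {
    static const std::u32string set =
        U"\u0393\u039A\u03A7\u039E\u03B3\u03BA\u03C7\u03BE";
    return set.find(c) != std::u32string::npos;
}

// γ becomes n when the next character is in the context set, otherwise g.
// The context character itself is only inspected, never consumed.
std::u32string transliterateGamma(const std::u32string& in) {
    std::u32string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] != U'\u03B3') {
            out += in[i];
            continue;
        }
        const bool followed = i + 1 < in.size() && isGammaContext(in[i + 1]);
        out += followed ? U'n' : U'g';
    }
    return out;
}
```

For the input γγ the first gamma sees a gamma after it and becomes n, while the second has nothing after it and becomes g, giving “ng”.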

  23. Cursor Backup • Allows text to be revisited • Reduces rule-count • Example Rules • BY > ビ | ~Y ; • ~YO > ョ; • Steps (| marks the cursor) • 0: |BYO • 1: ビ|~YO • 2: ビョ|
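Cursor backup can be modelled by letting each rule say where matching resumes inside its own replacement. The following is a minimal sketch under stated assumptions: `Rule`, `applyRules`, and `slideRules` are hypothetical names, offsets are UTF-8 byte offsets, and the source file is assumed to be UTF-8.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct Rule {
    std::string from;    // text to match at the cursor
    std::string to;      // replacement text
    std::size_t cursor;  // byte offset inside `to` where matching resumes
};

// First matching rule wins; afterwards the cursor lands at r.cursor inside
// the replacement, so the tail of the replacement can be matched again.
std::string applyRules(const std::vector<Rule>& rules, std::string text) {
    std::size_t start = 0;
    while (start < text.size()) {
        bool matched = false;
        for (const Rule& r : rules) {
            assert(!r.from.empty() && r.cursor <= r.to.size());
            if (text.compare(start, r.from.size(), r.from) == 0) {
                text.replace(start, r.from.size(), r.to);
                start += r.cursor;
                matched = true;
                break;
            }
        }
        if (!matched) ++start;
    }
    return text;
}

// The two rules from the slide: "BY > ビ | ~Y" leaves the cursor right
// after ビ, and "~YO > ョ" leaves it at the end of its replacement.
std::vector<Rule> slideRules() {
    const std::string bi = "ビ";
    const std::string yo = "ョ";
    return {
        {"BY", bi + "~Y", bi.size()},
        {"~YO", yo, yo.size()},
    };
}
```

`applyRules(slideRules(), "BYO")` goes through the same states as the slide: `BYO`, then `ビ~YO` with the cursor before `~`, then `ビョ`. One `~Y…` rule per vowel can then serve every consonant, which is where the reduction in rule count comes from.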

  24. Demonstration • Public Demo • http://oss.software.ibm.com/icu/demo • (local copy, samples)

  25. More Information • http://oss.software.ibm.com/… • User Guide: /icu/userguide/ • C API: /icu/apiref/utrans_h.html • C++ API: /icu/apiref/ • Java API: /icu4j/doc/com/ibm/text/ • Latest version of these slides: http://www.macchiato.com

  26. ICU Transforms • Powerful, flexible mechanism • Uppercase, Lowercase, Titlecase, Full/Halfwidth • Normalization • Hex, Character Names • Script to Script conversion… • Supports Styled Text, not just Plain Text • Chaining, Filters, Buffering • Customizable

  27. Q & A

  28. Backup Slides • Not used in the presentation, except in response to questions

  29. Buffered Usage • No conversion for clipped match • Fill buffer • Transliterate • May have left-overs • Copy left-overs to start • Fill rest of buffer • Transliterate • [Diagram: buffer contents “…t…t” → “…τ…t”, then “th…” → “θ…”]

  30. Styled Text Handling • Transforms operate on Replaceable, an interface/abstract class defined by ICU • In ICU4C, UnicodeString is a Replaceable subclass (with no out-of-band data -- no styles) • ICU4J defines ReplaceableString, a Replaceable subclass, also with no styles • Clients must define their own Replaceable subclass that implements their styled text.

  31. Transliteration Sources • Søren Binks • http://homepage.mac.com/sirbinks/translit.html • UNGEGN • http://www.eki.ee/wgrs/ • …

  32. API: Information • Like other ICU APIs, can get each of the available Transform IDs: • count = Transliterator::countAvailableIDs(); • myID = Transliterator::getAvailableID(n); • And get a localizable name for each: • Transliterator::getDisplayName(myID, france, nameForUser); • Note: these are C++ APIs; C and Java are also available.

  33. API: Creation • Use an ID to create: • myTrans = Transliterator::createInstance("Latin-Greek");

  34. API: Simple usage • Convert entire string • myTrans->transliterate(myString);

  35. More Control • Specify Context • Use with Styled Text • [Diagram: the string abcdefghijklmnopqrstuvwxyz with four marked positions: contextStart, contextLimit, start, limit]
