
An ICU Overview

Mark Davis, Chief Globalization Architect, IBM Globalization Center of Competency. Agenda: What is ICU?, Architecture Overview, Significant New ICU Features, Near Future Features, References, Q and A.


Presentation Transcript


  1. An ICU Overview
     Mark Davis, Chief Globalization Architect, IBM
     IBM Globalization Center of Competency

  2. Agenda
     • What is ICU?
     • Architecture Overview
     • Significant New ICU Features
     • Near Future Features
     • References
     • Q and A
     23rd International Unicode Conference

  3. ICU Features
     • Unicode text handling
     • Character set conversions (700+)
     • Collation & Searching
     • Locales (170+)
     • Resource Bundles
     • Calendar & Time zones
     • Complex-text layout engine
     • Regular Expressions
     • Breaks: character, word, line, & sentence
     • Formatting
       • Date & time
       • Messages
       • Numbers & currencies
     • Transforms
       • Normalization
       • Casing
       • Transliterations

  4. Unicode Text Handling
     • C
       • UChar*: null-terminated or with length
     • C++
       • UnicodeString: full featured string class
     • Java
       • Uses normal JDK String, adds utilities
     • All handle supplementary characters
       • Required for GB 18030 and JIS 213 repertoire
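To make "supplementary characters" concrete: ICU's C and C++ string types and Java's String all store text as UTF-16, where a code point above U+FFFF occupies two code units (a surrogate pair). The sketch below uses Python's standard library rather than ICU itself, and the helper names `utf16_code_units` and `is_supplementary` are made up for illustration.

```python
def utf16_code_units(text):
    """Return the UTF-16 code units that encode `text`, as integers."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def is_supplementary(ch):
    """True when the single code point `ch` lies above the Basic Multilingual Plane."""
    return ord(ch) > 0xFFFF


# U+20000 is the first CJK Extension B ideograph, a supplementary character.
sample = "A\U00020000"
units = utf16_code_units(sample)

print(len(sample))               # number of code points
print([hex(u) for u in units])   # UTF-16 code units, including the surrogate pair
```

Code that indexes by 16-bit unit has to treat the two surrogates as one character, which is the handling this slide refers to.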

  5. Unicode Text Handling II
     • All Unicode properties
     • UnicodeSet
       • fast, low-memory
       • boolean combinations of properties & ranges
         • [[\p{whitespace}\p{Latin}]-[aeiuo]]
       • in regular expressions, transform filters, & stand-alone
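The pattern on this slide reads as "whitespace or Latin characters, minus the five vowels". As a rough sketch of that set algebra only, the Python below builds the same combination from ordinary frozensets. It is not UnicodeSet and does not use ICU: `str.isspace` and "character name starts with LATIN" are loose stand-ins for the real `\p{whitespace}` and `\p{Latin}` properties, the scan is limited to code points below U+0250, and `code_point_set` is a hypothetical helper.

```python
import unicodedata


def code_point_set(predicate, limit=0x250):
    """Collect the characters below `limit` that satisfy `predicate`."""
    return frozenset(chr(cp) for cp in range(limit) if predicate(chr(cp)))


# Approximations of \p{whitespace} and \p{Latin} (assumptions, see above).
whitespace = code_point_set(str.isspace)
latin = code_point_set(lambda ch: unicodedata.name(ch, "").startswith("LATIN"))

# [[\p{whitespace}\p{Latin}]-[aeiuo]]: union first, then set difference.
result = (whitespace | latin) - frozenset("aeiuo")

print("b" in result, "a" in result, " " in result, "1" in result)
```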

  6. Character Set Conversion
     • 700+ supported character sets
     • Precise alias information:
       • When you ask for “SJIS”, you can request the precise definition: windows, ibm, solaris,…
     • Buffer management handles characters that cross buffers
     • Customizations allowed for illegal sequences, and undefined characters
     • Unicode Text Compression – SCSU, BOCU
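Two of these points, characters that straddle a buffer boundary and configurable handling of illegal input, can be illustrated with an incremental decoder. The sketch uses Python's `codecs` module as a stand-in, not ICU's converter API. `decode_chunks` is a hypothetical helper, and Python's `shift_jis` codec is just one of the several Shift-JIS definitions the slide alludes to.

```python
import codecs


def decode_chunks(chunks, encoding="shift_jis", errors="strict"):
    """Decode a sequence of byte chunks, carrying partial characters over."""
    decoder = codecs.getincrementaldecoder(encoding)(errors=errors)
    pieces = [decoder.decode(chunk) for chunk in chunks]
    pieces.append(decoder.decode(b"", final=True))  # flush any buffered state
    return "".join(pieces)


encoded = "日本語".encode("shift_jis")

# Split in the middle of the first two-byte character.
split_chunks = [encoded[:1], encoded[1:]]
text = decode_chunks(split_chunks)

# An illegal byte can be replaced with U+FFFD instead of raising an error.
lenient = decode_chunks([b"abc\xff"], encoding="utf-8", errors="replace")

print(text, repr(lenient))
```

The decoder keeps the dangling lead byte from the first chunk and completes the character when the second chunk arrives, so callers can feed arbitrary buffer sizes.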

  7. Collation and Searching
     • Fast international comparison and string search; fully UCA compliant
     • Compressed sort keys, optimized string comparison, sublinear string search
     • Supports precise binary sortkey stability over time*
     • Fully data driven*
     • API / rule customizations: strength, normalization, upper vs. lowercase first, …
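The "strength" customization is easier to picture with a toy multi-level sort key: level 1 compares base letters, level 2 accents, level 3 case, and the strength setting decides how many levels take part. The Python below is only a sketch of that idea under heavy simplification. It is not the Unicode Collation Algorithm, it uses no ICU data, and `toy_sort_key` is a made-up name.

```python
import unicodedata


def toy_sort_key(text, strength=3):
    """Build a toy multi-level key; NOT the UCA algorithm that ICU implements."""
    primary, secondary, tertiary = [], [], []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch) and primary:
            # Attach the combining mark to the preceding base letter.
            secondary[-1] += ch
        else:
            primary.append(ch.casefold())   # level 1: base letter
            secondary.append("")            # level 2: accents on that letter
            tertiary.append(ch.isupper())   # level 3: case
    levels = (tuple(primary), tuple(secondary), tuple(tertiary))
    return levels[:strength]


words = ["Résumé", "resume", "résumé", "Resume"]
ordered = sorted(words, key=toy_sort_key)
print(ordered)
```

At strength 1 all four words compare equal; at strength 2 only the accents separate them; at strength 3 case is the final tie-breaker, with lowercase first in this toy.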

  8. Calendar & Time Zones
     • International Calendars – Arabic, Buddhist, Hebrew, and Japanese
     • Required for correct presentation of dates in some countries.

  9. Formatting
     • Date & time
     • Messages
       • Completely localizable, Plural support
     • Numbers & currencies
       • Scientific Notation, Spelled-out (checks, etc.)
       • Dual Currency support: e.g. Indian Rupee
         • In Hindi:
         • In English: 1,234.57 Rupees

  10. Transforms
      • Unicode Normalization*
        • Highly optimized for performance
        • performance utilities: concatenation, detection, comparison
      • Casing (upper, lower, title, folding)*
      • General Transforms
        • Script transliterations
        • Half-width/Full-width, Hex, etc.
        • Chain transforms together, filter source characters
        • Rule-based, customizable at runtime.
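Normalization and case folding both exist so that strings which should match actually compare equal. The sketch below shows the two operations with Python's standard library (`unicodedata.normalize` and `str.casefold`) rather than ICU's own API. The helper names `canonically_equal` and `caseless_equal` are hypothetical.

```python
import unicodedata

composed = "\u00e9"      # é as a single code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

nfc = unicodedata.normalize("NFC", decomposed)  # compose
nfd = unicodedata.normalize("NFD", composed)    # decompose


def canonically_equal(a, b):
    """Compare two strings after bringing both to NFD."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


def caseless_equal(a, b):
    """Compare two strings using case folding rather than lowercasing."""
    return a.casefold() == b.casefold()


print(composed == decomposed, canonically_equal(composed, decomposed))
print("Straße".lower() == "STRASSE".lower(), caseless_equal("Straße", "STRASSE"))
```

Case folding differs from lowercasing: folding maps ß to "ss", so the German pair matches only under folding.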

  11. Word, line & sentence breaks
      • Fast state-table implementation
      • Customizable
        • Rule-based – customizable at runtime
        • Special customizations, e.g. Thai

  12. Complex-text layout engine
      • Glyph processing, positioning & adjustment
        • ligature substitution, contextual forms, kerning, accent placement, Bidi scripts, etc.
      • Support for:
        • Drawing
        • Caret Display
        • Hit Testing
        • Selection Highlighting
        • Caret Movement
        • Layout Metrics
        • Line Break

  13. Architecture Overview
      • Locale Based Services
        • Locale is an identifier, not a container
        • Object in C++ and Java, char* in C
        • Default locale is set to the platform locale
        • Resource inheritance
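"Locale is an identifier, not a container" and "Resource inheritance" fit together: the locale ID is only a key, and a lookup walks from the most specific bundle toward a root bundle until it finds the requested item. The Python below is a simplified sketch of that walk, not ICU's resource bundle API. The bundle contents, `fallback_chain`, and `lookup` are all hypothetical, and the real fallback rules have more steps than this.

```python
def fallback_chain(locale_id):
    """List lookup candidates from most to least specific, ending at root."""
    parts = locale_id.split("_")
    chain = ["_".join(parts[:n]) for n in range(len(parts), 0, -1)]
    return chain + ["root"]


def lookup(bundles, locale_id, key):
    """Return the first value for `key` found along the fallback chain."""
    for candidate in fallback_chain(locale_id):
        bundle = bundles.get(candidate, {})
        if key in bundle:
            return bundle[key]
    raise KeyError(key)


# Hypothetical bundle contents, for illustration only.
BUNDLES = {
    "root": {"greeting": "Hello", "farewell": "Goodbye"},
    "de": {"greeting": "Hallo", "farewell": "Auf Wiedersehen"},
    "de_CH": {"greeting": "Grüezi"},
}

print(fallback_chain("de_CH"))
print(lookup(BUNDLES, "de_CH", "greeting"), lookup(BUNDLES, "de_CH", "farewell"))
```

Because `de_CH` only overrides the greeting, the farewell is inherited from `de`, and a locale with no bundle at all falls through to `root`.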

  14. Architecture Overview
      • Open and Close Service Model
        • Better performance by avoiding setup costs per operation
      • ICU Threading Model
        • Multiple versions in use simultaneously
        • Large resources shared in read-only cache
      • Modularization
        • Link against multiple ICU versions
        • Build partial ICU versions

  15. Architecture Overview
      • Data Driven Services
        • Customize at build-time or run-time
        • Interchange with other platforms; same results on each
      • Rule-based
        • Collation, Word-breaks, Transforms
      • Pattern-based
        • Formats, UnicodeSet
      • Table-based
        • Character Conversion

  16. Architecture Overview – ICU4C
      • Simple Error Handling
      • C++ subset for portability
      • Support for multi-threaded environment
      • Version Management
        • Multiple versions at the same time
        • Data and library versioning
      • String Buffer Management
        • Preflighting and overflow protection

  17. Recent Features (I)
      • Unicode Regular Expressions (phase 1)
      • Full Unicode properties/values
      • Charset Conversion Enhancements
      • Alias Management
      • platform matching
      • Compression: preserving binary order
      • Customization
      • Modularized ICU library building
      • Service Registration (phase 1)
      • Dual Currency Support

  18. Recent Features (II)
      • Memory Management
        • Load and unload ICU libraries
        • Choice of heap allocation
      • Performance
        • Collation
        • Fast Unicode Normalization
        • UnicodeSet
      • Test framework & tests

  19. 2003: SS 2.6, WS 2.8
      • Unicode 4.0 Update
      • More multi-threading support, customization, modularization
      • Improved RegEx, TextBoundaries, TextLayout
      • IDN conversion
      • Collation: UCA 4.0, Partial Sort Keys, Multi-charset
      • Ongoing work: porting, docs, perf.,…
      • Related: LDML

  20. References
      • ICU main site: http://oss.software.ibm.com/icu/
        • Links to Download ICU User Guide, Technical FAQ, Support, Bug Reports
      • Unicode Consortium: http://www.unicode.org
        • Unicode glossary, Unicode character database
      • IBM Developerworks: http://www.ibm.com/developerworks/unicode

  21. Questions and Answers
