An ICU Overview
Mark Davis
Chief Globalization Architect, IBM
IBM Globalization Center of Competency

Agenda
• What is ICU?
• Architecture Overview
• Significant New ICU Features
• Near Future Features
• References
• Q and A
23rd International Unicode Conference

ICU Features
• Unicode text handling
• Character set conversions (700+)
• Collation & Searching
• Locales (170+)
• Resource Bundles
• Calendar & Time zones
• Complex-text layout engine
• Regular Expressions
• Breaks: character, word, line, & sentence
• Formatting
  • Date & time
  • Messages
  • Numbers & currencies
• Transforms
  • Normalization
  • Casing
  • Transliterations

Unicode Text Handling
• C
  • UChar*: null-terminated or with length
• C++
  • UnicodeString: full-featured string class
• Java
  • Uses the normal JDK String, adds utilities
• All handle supplementary characters
  • Required for the GB 18030 and JIS X 0213 repertoires
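
A minimal sketch of what "handling supplementary characters" means for a JDK String, using only standard JDK methods. The class name and the sample code point U+20000 are illustrative choices, not taken from the slides; ICU4J's own utilities are not shown here.

```java
class SupplementaryDemo {
    // U+20000 lies outside the Basic Multilingual Plane, so a Java String
    // stores it as a surrogate pair (two char values).
    static final String TEXT = "A" + new String(Character.toChars(0x20000)) + "B";

    // Number of UTF-16 code units, which is what String.length() reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        System.out.println(codeUnits(TEXT));   // 4
        System.out.println(codePoints(TEXT));  // 3
        // Iterate by code point rather than by char so the pair stays together.
        TEXT.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
    }
}
```

Code that indexes by char would split the surrogate pair; iterating by code point avoids that.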

Unicode Text Handling II
• All Unicode properties
• UnicodeSet
  • fast, low-memory
  • boolean combinations of properties & ranges
  • [[\p{whitespace}\p{Latin}]-[aeiuo]]
  • usable in regular expressions, in transform filters, & stand-alone
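
The set expression above can be approximated with a java.util.regex character class. This is a rough analog, not ICU's UnicodeSet: the JDK spells the properties \p{IsWhite_Space} and \p{IsLatin}, and expresses set difference as intersection with a negated class. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class PropertySetDemo {
    // Union of White_Space and Latin-script characters, minus the five listed
    // vowels. "A - B" is written here as "A && [^B]".
    static final Pattern SET =
        Pattern.compile("[[\\p{IsWhite_Space}\\p{IsLatin}]&&[^aeiuo]]");

    // True if the given one-character string is a member of the set.
    static boolean contains(String oneChar) {
        return SET.matcher(oneChar).matches();
    }

    public static void main(String[] args) {
        System.out.println(contains("b"));       // Latin consonant
        System.out.println(contains("a"));       // excluded vowel
        System.out.println(contains(" "));       // whitespace
        System.out.println(contains("\u044F"));  // Cyrillic letter
    }
}
```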

Character Set Conversion
• 700+ supported character sets
• Precise alias information:
  • When you ask for “SJIS”, you can request the precise definition: windows, ibm, solaris, …
• Buffer management handles characters that cross buffers
• Customizations allowed for illegal sequences and undefined characters
• Unicode text compression: SCSU, BOCU
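
A small sketch of alias lookup and of customizing the treatment of illegal sequences, using the JDK's java.nio.charset package as a stand-in for ICU's converters (an assumption made so the example runs without ICU). The class name, the sample bytes, and the "latin1" alias are illustrative.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class ConversionDemo {
    // 0xFF can never appear in well-formed UTF-8.
    static final byte[] BAD = { 'a', (byte) 0xFF, 'b' };

    // Resolve an alias to the canonical charset name.
    static String canonicalName(String alias) {
        return Charset.forName(alias).name();
    }

    // Policy 1: substitute U+FFFD for anything illegal or unmappable.
    static String decodeReplacing(byte[] bytes) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE)
                .decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            throw new IllegalStateException(e); // not expected with REPLACE
        }
    }

    // Policy 2: report the problem to the caller instead of hiding it.
    static boolean strictDecodeFails(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return false;
        } catch (CharacterCodingException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(canonicalName("latin1"));   // ISO-8859-1
        System.out.println(decodeReplacing(BAD));      // a, U+FFFD, b
        System.out.println(strictDecodeFails(BAD));    // true
    }
}
```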

Collation and Searching
• Fast international comparison and string search; fully UCA compliant
• Compressed sort keys, optimized string comparison, sublinear string search
• Supports precise binary sort-key stability over time*
• Fully data driven*
• API / rule customizations: strength, normalization, upper vs. lowercase first, …
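
To illustrate the strength customization and sort keys mentioned above, here is a sketch using java.text.Collator from the JDK. It is an assumed stand-in: ICU4J ships its own collator with more options, which is not used here. The class and method names are hypothetical.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

class CollationDemo {
    // Build an English collator with the requested comparison strength.
    static Collator collator(int strength) {
        Collator c = Collator.getInstance(Locale.ENGLISH);
        c.setStrength(strength);
        return c;
    }

    // Sort a copy of the input in linguistic rather than binary order.
    static String[] sorted(String... words) {
        String[] copy = words.clone();
        Arrays.sort(copy, collator(Collator.TERTIARY));
        return copy;
    }

    public static void main(String[] args) {
        // PRIMARY strength ignores case differences; TERTIARY does not.
        System.out.println(collator(Collator.PRIMARY).compare("abc", "ABC"));       // 0
        System.out.println(collator(Collator.TERTIARY).compare("abc", "ABC") != 0); // true

        // Binary order would put "Banana" first because 'B' < 'a'.
        System.out.println(Arrays.toString(sorted("cherry", "Banana", "apple")));

        // Sort keys: compute once per string, then compare the keys repeatedly.
        Collator c = collator(Collator.TERTIARY);
        CollationKey k1 = c.getCollationKey("apple");
        CollationKey k2 = c.getCollationKey("Banana");
        System.out.println(k1.compareTo(k2) < 0); // true
    }
}
```

Sort keys pay off when the same strings are compared many times, for example as database index keys.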

Calendar & Time Zones
• International calendars: Arabic, Buddhist, Hebrew, and Japanese
• Required for correct presentation of dates in some countries
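
As a sketch of why non-Gregorian calendars matter for presentation, the example below converts one ISO date into the Buddhist and Japanese calendars with the JDK's java.time.chrono classes. These are assumed here as analogs of ICU's calendar classes, which are not shown; the sample date and class name are arbitrary.

```java
import java.time.LocalDate;
import java.time.chrono.JapaneseDate;
import java.time.chrono.ThaiBuddhistDate;
import java.time.temporal.ChronoField;

class CalendarDemo {
    static final LocalDate ISO = LocalDate.of(2003, 3, 24);

    // Year in the Thai Buddhist calendar (ISO year + 543).
    static int buddhistYear(LocalDate d) {
        return ThaiBuddhistDate.from(d).get(ChronoField.YEAR);
    }

    // Year within the current Japanese imperial era for that date.
    static int japaneseYearOfEra(LocalDate d) {
        return JapaneseDate.from(d).get(ChronoField.YEAR_OF_ERA);
    }

    public static void main(String[] args) {
        System.out.println(buddhistYear(ISO));       // 2546
        System.out.println(japaneseYearOfEra(ISO));  // 15 (Heisei era)
    }
}
```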

Formatting
• Date & time
• Messages
  • Completely localizable, plural support
• Numbers & currencies
  • Scientific notation, spelled-out (checks, etc.)
  • Dual currency support: e.g. Indian Rupee
    • In Hindi:
    • In English: 1,234.57 Rupees
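
A sketch of locale-aware number formatting and a plural-sensitive message, using java.text.NumberFormat and java.text.MessageFormat from the JDK as assumed stand-ins for the ICU formatters. The pattern string, class name, and sample values are illustrative; the JDK's "choice" sub-format is a simpler mechanism than full plural rules.

```java
import java.text.MessageFormat;
import java.text.NumberFormat;
import java.util.Locale;

class FormattingDemo {
    // Format with grouping separators and exactly two fraction digits.
    static String twoDecimals(double value) {
        NumberFormat nf = NumberFormat.getNumberInstance(Locale.US);
        nf.setMinimumFractionDigits(2);
        nf.setMaximumFractionDigits(2);
        return nf.format(value);
    }

    // Pick the wording from the count; the pattern could live in a resource
    // bundle so that translators can change it per locale.
    static String fileCount(int n) {
        MessageFormat mf = new MessageFormat(
            "{0,choice,0#no files|1#one file|1<{0,number,integer} files}",
            Locale.US);
        return mf.format(new Object[] { n });
    }

    public static void main(String[] args) {
        System.out.println(twoDecimals(1234.567)); // 1,234.57
        System.out.println(fileCount(0));          // no files
        System.out.println(fileCount(1));          // one file
        System.out.println(fileCount(3));          // 3 files
    }
}
```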

Transforms
• Unicode Normalization*
  • Highly optimized for performance
  • Performance utilities: concatenation, detection, comparison
• Casing (upper, lower, title, folding)*
• General Transforms
  • Script transliterations
  • Half-width/Full-width, Hex, etc.
  • Chain transforms together, filter source characters
  • Rule-based, customizable at runtime
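
The normalization, detection, and casing operations listed above can be sketched with the JDK's java.text.Normalizer and String case mapping. These are assumed analogs; ICU's optimized normalizer and its transliterators are not used here, and the class name is hypothetical.

```java
import java.text.Normalizer;
import java.util.Locale;

class TransformDemo {
    // Compose: "e" + combining acute becomes the single code point U+00E9.
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // Decompose: U+00E9 becomes "e" + combining acute.
    static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    // Detection: check without building a normalized copy.
    static boolean isNfc(String s) {
        return Normalizer.isNormalized(s, Normalizer.Form.NFC);
    }

    // Locale-neutral uppercasing; note that the length can change.
    static String upper(String s) {
        return s.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(nfc("e\u0301").equals("\u00E9")); // true
        System.out.println(nfd("\u00E9").length());          // 2
        System.out.println(isNfc("e\u0301"));                // false
        System.out.println(upper("stra\u00DFe"));            // STRASSE
    }
}
```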

Word, line & sentence breaks
• Fast state-table implementation
• Customizable
  • Rule-based: customizable at runtime
  • Special customizations, e.g. Thai
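
A minimal sketch of word-boundary iteration with java.text.BreakIterator from the JDK, assumed here as a stand-in for ICU's rule-based break iterator (runtime rule customization is not shown). The class name and the rule "keep segments that start with a letter or digit" are illustrative choices.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class BreakDemo {
    // Return the words of the text, skipping punctuation and whitespace segments.
    static List<String> words(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ENGLISH);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String piece = text.substring(start, end);
            // Boundaries also delimit spaces and punctuation; keep only real words.
            if (Character.isLetterOrDigit(piece.codePointAt(0))) {
                out.add(piece);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world!")); // [Hello, world]
    }
}
```

The same loop works with getLineInstance or getSentenceInstance in place of getWordInstance.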

Complex-text layout engine
• Glyph processing, positioning & adjustment
  • ligature substitution, contextual forms, kerning, accent placement, Bidi scripts, etc.
• Support for:
  • Drawing
  • Caret Display
  • Hit Testing
  • Selection Highlighting
  • Caret Movement
  • Layout Metrics
  • Line Break

Architecture Overview
• Locale Based Services
  • Locale is an identifier, not a container
  • Object in C++ and Java, char* in C
  • Default locale is set to the platform locale
  • Resource inheritance
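
Resource inheritance means that a lookup for a specific locale falls back through progressively more general locales down to the root. The sketch below prints that fallback chain using the JDK's ResourceBundle.Control, an assumed analog of ICU's resource bundle lookup; the base name "Messages" and the class name are hypothetical.

```java
import java.util.List;
import java.util.Locale;
import java.util.ResourceBundle;

class FallbackDemo {
    // The ordered list of locales that a bundle lookup would try.
    static List<Locale> chain(Locale locale) {
        return ResourceBundle.Control
            .getControl(ResourceBundle.Control.FORMAT_DEFAULT)
            .getCandidateLocales("Messages", locale); // "Messages" is a made-up base name
    }

    public static void main(String[] args) {
        // For de_DE the lookup tries de_DE, then de, then the root locale.
        for (Locale l : chain(Locale.GERMANY)) {
            System.out.println(l.equals(Locale.ROOT) ? "root" : l.toString());
        }
    }
}
```

Because the locale is only an identifier, the same value can select a collator, a formatter, and a resource bundle without carrying any data itself.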

Architecture Overview
• Open and Close Service Model
  • Better performance by avoiding setup costs per operation
• ICU Threading Model
  • Multiple versions in use simultaneously
  • Large resources shared in read-only cache
• Modularization
  • Link against multiple ICU versions
  • Build partial ICU versions

Architecture Overview
• Data Driven Services
  • Customize at build-time or run-time
  • Interchange with other platforms; same results on each
  • Rule-based: Collation, Word-breaks, Transforms
  • Pattern-based: Formats, UnicodeSet
  • Table-based: Character Conversion

Architecture Overview: ICU4C
• Simple Error Handling
• C++ subset for portability
• Support for multi-threaded environment
• Version Management
  • Multiple versions at the same time
  • Data and library versioning
• String Buffer Management
  • Preflighting and overflow protection

Recent Features (I)
• Unicode Regular Expressions (phase 1)
  • Full Unicode properties/values
• Charset Conversion Enhancements
  • Alias management: platform matching
  • Compression: preserving binary order
  • Customization
• Modularized ICU library building
• Service Registration (phase 1)
• Dual Currency Support

Recent Features (II)
• Memory Management
  • Load and unload ICU libraries
  • Choice of heap allocation
• Performance
  • Collation
  • Fast Unicode Normalization
  • UnicodeSet
• Test framework & tests

2003: SS 2.6, WS 2.8
• Unicode 4.0 update
• More multi-threading support, customization, modularization
• Improved RegEx, TextBoundaries, TextLayout
• IDN conversion
• Collation: UCA 4.0, Partial Sort Keys, Multi-charset
• Ongoing work: porting, docs, performance, …
• Related: LDML

References
• ICU main site: http://oss.software.ibm.com/icu/
  • Links to Download ICU, User Guide, Technical FAQ, Support, Bug Reports
• Unicode Consortium: http://www.unicode.org
  • Unicode glossary, Unicode character database
• IBM developerWorks: http://www.ibm.com/developerworks/unicode

Questions and Answers