ICU Overview: The Open-Source Unicode Library, v3.2
Markus Scherer, ICU Manager, IBM Globalization Center of Competency
27th Internationalization and Unicode Conference

Agenda
• Background
• What is ICU?
• Architecture Overview
• ICU Features and recent additions
• References
• Q and A

Why Globalization?

Unicode
• All world languages
• Efficient and effective processing
• Lossless data exchange
• Enables single-binary global software
• But… all languages ⇒ large, complex standard
  • 1,400 pages + Annexes + additional standards
  • 90,000+ characters
  • Major update every 3 years
  • 70 character properties, many multi-valued
  • Affects many processes: display, line-break, regex, …

Locales
• Features vary widely across languages & countries
  • Sorting, line breaks, date/time/number/currency formatting, codepage conversion, …
• Performance is key: easy to do the right thing; hard to do it fast

What is ICU?
• Globalization / Unicode / Locales
• Mature, widely used set of C/C++ and Java libraries
  • Basis for Java 1.1 internationalization, but goes far beyond
  • “ICU4C”: C/C++ libraries; “ICU4J”: Java library
• Very portable: identical results on all platforms / programming languages
  • C/C++: 30+ platforms/compilers
  • Java: IBM & Sun JDK
• Full threading model; customizable; modular
• Open source, but not viral
• ICU 3.2: 78 languages; 118 countries; 870 codepages

Who uses ICU? (Examples)
• Products within IBM
  • DB2, COBOL, InfoPrint Manager, Lotus Notes, Lotus Workplace, Tivoli Presentation Services, WebSphere, XML Parser, …
• Other companies and organizations
  • Adobe, Apple (Mac OS X), BEA, CERN, Cognos, Debian, HP, Inktomi, JD Edwards, Macromedia, Mathworks, Mozilla, NCR, OpenOffice, PayPal, SAP, Siebel, SIL, Software AG, Sun Microsystems (Solaris, Java), SuSE, Sybase, webMethods, …

ICU Features
• Unicode text handling
• Charset conversions (870+)
• Collation & Searching
• Locales (170+)
• Resource Bundles
• Calendar & Time zones
• Complex-text layout engine
• Unicode Regular Expressions
• Breaks: word, line, …
• Formatting
  • Date & time
  • Messages
  • Numbers & currencies
• Transforms
  • Normalization
  • Casing
  • Transliterations

Architecture Overview 1
• Locale Based Services
  • Locale is an identifier, not a container
  • Keywords for variants: de@collation=phonebook
  • Recent addition: accept-language support
• Resource inheritance: shared resources
  • [Diagram: inheritance tree with levels root → Language (en, de, zh) → Script (Hant, Hans) → Country (US, IE, DE, CH, TW, CN)]

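The inheritance tree on this slide can be read as a lookup chain: a request for a specific locale falls back to less specific ones until it reaches root. The following is a minimal sketch of that idea only, assuming underscore-separated locale IDs; the class and method names are hypothetical and this is not ICU's actual resource-lookup code (which also handles keywords such as @collation=phonebook).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only (not an ICU API).
class LocaleFallbackSketch {
    /** Returns the lookup chain for a locale ID, most specific first, ending in "root". */
    static List<String> fallbackChain(String localeId) {
        List<String> chain = new ArrayList<>();
        String current = localeId;
        while (!current.isEmpty()) {
            chain.add(current);
            // Drop the last subtag: country first, then script, then language.
            int cut = current.lastIndexOf('_');
            current = (cut < 0) ? "" : current.substring(0, cut);
        }
        chain.add("root");
        return chain;
    }

    public static void main(String[] args) {
        System.out.println(fallbackChain("de_CH"));      // [de_CH, de, root]
        System.out.println(fallbackChain("zh_Hant_TW")); // [zh_Hant_TW, zh_Hant, zh, root]
    }
}
```

A resource missing in de_CH would be looked up in de and then in root, which is why shared data only needs to be stored once.
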
Architecture Overview 2
• Open and Close Service Model
  • Open a service object, use it many times, close it when done
  • Better performance by avoiding setup costs per operation
  • Warning: use properly for maximum performance
• ICU Threading Model
  • Multiple service objects in use simultaneously, with same or different attributes
  • Large resources shared in read-only cache

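The "open once, use many times" pattern can be sketched with the JDK's java.text.NumberFormat, used here as a stand-in for an ICU service object (an assumption for illustration; in Java there is no explicit close step, and the class name below is hypothetical). The point is that the setup cost is paid once, outside the loop.

```java
import java.text.NumberFormat;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical example class; NumberFormat is a JDK class, not an ICU one.
class ReuseServiceSketch {
    static List<String> formatAll(double[] values, Locale locale) {
        // "Open" the service object once: locale data is resolved here.
        NumberFormat format = NumberFormat.getInstance(locale);
        List<String> out = new ArrayList<>();
        for (double v : values) {
            // Reuse the same object for every value instead of re-creating it.
            out.add(format.format(v));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(formatAll(new double[] { 1234.5, 0.25 }, Locale.US));
    }
}
```

JDK number formats are documented as generally not synchronized, so a common way to follow the threading model above is to give each thread its own formatter object while the underlying locale data stays shared.
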
Architecture Overview 3
• Data Driven Services
  • Customize at build-time or run-time
  • Interchange with other platforms; same results on each
• Rule-based
  • Collation, Word-breaks, Transforms
• Pattern-based
  • Formats, UnicodeSet
• Table-based
  • Character Conversion

Architecture Overview: ICU4C
• Simple Error Handling
• C++ subset for portability
• Support for multi-threaded environment
• Version Management
  • Multiple versions at the same time
  • Data and library versioning
• String Buffer Management
  • Preflighting and overflow protection
• Misc: Load/Unload ICU
• Recent Additions:
  • Runtime-settable memory allocation and mutex functions

Architecture Overview: ICU4J
• Supplement for Java
• Core globalization (no character conversion or regular expressions, no GUI components)
  • We do supply complex text support for Sun
• Modularized: products may add just needed functionality

ICU4J vs. JDK
• CLDR 1.2 (Common Locale Data Repository)
• Up-to-date globalization: standards-compliant; latest Unicode
• Supplementary characters (GB 18030, JIS X 0213, HKSCS)
  • Java 5 adds handling of supplementary characters
• Full properties (the JDK has only a fraction)
• Unicode Collation Algorithm
• Local calendars (Thailand, Japan, …); ISO dates
• Currencies, String Search, Int’l Domain Names
• Transforms: Case, Scripts, Normalization
• Much faster turn-around on bug fixes, enhancements

Unicode Text Handling
• C
  • UChar*: null-terminated or with length
• C++
  • UnicodeString: full featured string class
• Java
  • Uses normal JDK String, adds utilities
• All handle supplementary characters
  • Required for GB 18030/JIS X 0213/HKSCS repertoires

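Why supplementary characters need explicit handling can be seen with a plain JDK String (Java 5 or later): one character outside the Basic Multilingual Plane occupies two UTF-16 code units. This is a small sketch with a hypothetical class name, using U+20000 as an arbitrary sample code point.

```java
// Hypothetical example class; uses only JDK String/Character methods.
class SupplementarySketch {
    /** "A", then U+20000 (a supplementary CJK ideograph), then "B". */
    static String sample() {
        return "A" + new String(Character.toChars(0x20000)) + "B";
    }

    static int codeUnits(String s) {
        return s.length(); // counts UTF-16 code units
    }

    static int codePoints(String s) {
        return s.codePointCount(0, s.length()); // counts Unicode code points
    }

    public static void main(String[] args) {
        String s = sample();
        System.out.println(codeUnits(s));  // 4
        System.out.println(codePoints(s)); // 3
        // Iterate by code point so the surrogate pair is never split.
        s.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
    }
}
```

Code that indexes by code unit can cut the surrogate pair in half, which is the class of bug that code-point-aware utilities are meant to prevent.
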
Unicode Text Handling 2
• All Unicode 4.0.1 properties
• Direct API
  • Values, names, enumerations
• UnicodeSet
  • Fast, compact set operations
  • Pattern-based (both Perl & POSIX syntax for properties)
    • \p{greek} vs. [:greek:]
  • All properties:
    • [\p{lowercase}-[a-z]]
    • [\p{greek} & \p{uppercase}]

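The two set expressions on this slide (a difference and an intersection of property sets) can be approximated with java.util.regex character classes. This is a sketch under the assumption that the JDK regex syntax is an acceptable stand-in: the JDK spells the properties \p{IsLowercase}, \p{IsGreek} and \p{IsUppercase}, writes intersection as && and expresses subtraction as intersection with a negated class, so the patterns are not UnicodeSet syntax. The class name is hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical example class; JDK regex syntax, not ICU UnicodeSet syntax.
class PropertySetSketch {
    // Counterpart of [\p{lowercase}-[a-z]]: lowercase characters except ASCII a-z.
    static final Pattern LOWER_NOT_ASCII = Pattern.compile("[\\p{IsLowercase}&&[^a-z]]");

    // Counterpart of [\p{greek} & \p{uppercase}]: Greek script and uppercase.
    static final Pattern GREEK_UPPER = Pattern.compile("[\\p{IsGreek}&&\\p{IsUppercase}]");

    static boolean isLowerNonAscii(String s) {
        return LOWER_NOT_ASCII.matcher(s).matches();
    }

    static boolean isGreekUpper(String s) {
        return GREEK_UPPER.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isGreekUpper("\u03A9"));    // Greek capital omega: true
        System.out.println(isGreekUpper("\u03C9"));    // Greek small omega: false
        System.out.println(isLowerNonAscii("\u00E9")); // e with acute: true
        System.out.println(isLowerNonAscii("e"));      // ASCII e: false
    }
}
```
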
Data: Recent Additions
• Conforms to CLDR 1.2
  • 50% more data than CLDR 1.0: adding many translated terms for languages, scripts, countries, currencies, and time zones
  • Added data for new languages: Malayalam, Oriya, Welsh
• Reduced multiplatform install image size
• Improved XLIFF-ICU conversion tools
• Locale canonicalization spec defined and implemented (C+J)
  • Provides interoperability with POSIX and .NET locale IDs, more RFC 3066 support

Character Set Conversion
• Precise alias information:
  • When you ask for “SJIS”, you can request the precise definition by platform: Windows, IBM, Solaris, …
• Buffer management
  • Automatically handles characters that cross buffers
• Customizations allowed for:
  • illegal sequences
  • undefined characters
• Unicode Text Compression: SCSU, BOCU-1

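What "customizations for illegal sequences and undefined characters" means in practice can be sketched with the JDK's java.nio.charset decoder, where the caller chooses the error policy. This uses the JDK converter as an analogue, not ICU's converter API, and the class name is hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical example class; java.nio.charset is the JDK converter API.
class ConversionErrorSketch {
    /** Decodes UTF-8 bytes, applying the given policy to malformed and unmappable input. */
    static String decodeUtf8(byte[] bytes, CodingErrorAction onError) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(onError)
                .onUnmappableCharacter(onError)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    public static void main(String[] args) throws CharacterCodingException {
        byte[] bad = { 0x41, (byte) 0xFF, 0x42 }; // 0xFF can never occur in UTF-8
        // REPLACE substitutes U+FFFD for the bad byte; IGNORE drops it.
        System.out.println(decodeUtf8(bad, CodingErrorAction.REPLACE).length()); // 3
        System.out.println(decodeUtf8(bad, CodingErrorAction.IGNORE));           // AB
        try {
            decodeUtf8(bad, CodingErrorAction.REPORT); // REPORT throws instead
        } catch (CharacterCodingException e) {
            System.out.println("rejected: " + e);
        }
    }
}
```
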
Collation and Searching
• Fast international comparison and string search; fully UCA compliant
  • Compressed sort keys, optimized string comparison, sublinear string search
  • Incremental sortkeys for radix-sort
• Precise binary sortkey stability over time
• Fully data driven
• API / rule customizations
  • strength, normalization, upper vs. lowercase first, ignore punctuation, sort digits as numbers, …

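The "strength" customization can be illustrated with the JDK's java.text.Collator, which exposes the same concept. This is a sketch using the JDK class as a stand-in for an ICU collator; the class name is hypothetical and the JDK collator is not claimed to be UCA compliant.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical example class; java.text.Collator is the JDK collation API.
class CollationStrengthSketch {
    /** Compares two strings with an English collator at the given strength. */
    static int compare(String a, String b, int strength) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        collator.setStrength(strength);
        return collator.compare(a, b);
    }

    public static void main(String[] args) {
        // PRIMARY looks at base letters only, so case differences are ignored.
        System.out.println(compare("abc", "ABC", Collator.PRIMARY) == 0);  // true
        // TERTIARY also distinguishes case.
        System.out.println(compare("abc", "ABC", Collator.TERTIARY) == 0); // false
    }
}
```

In real use the collator would be created once and reused, following the open/use/close model from the architecture slides; it is created per call here only to keep the sketch short.
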
Collation and Searching: Recent Additions
• Numeric sorting: sequences of digits can be sorted numerically instead of alphabetically
  • e.g., filenames would sort "ab-2" < "ab-10"
  • without material performance cost
  • with reduced sortkey length
• Significantly improved sorting orders for many other languages
• Data in separate tree, for easier modularization and maintenance
• getFunctionalEquivalent API allows for better caching and UI support

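The effect of numeric sorting on the filename example can be sketched with a hand-written comparator. This is an illustration of the ordering only, under simplifying assumptions: it handles ASCII digits, compares all other characters by code unit rather than by collation rules, and is not how ICU implements the feature. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical comparator for illustration; not ICU's numeric collation.
class NumericOrderSketch {
    /** Compares strings so that runs of ASCII digits order by numeric value. */
    static int compare(String a, String b) {
        int i = 0, j = 0;
        while (i < a.length() && j < b.length()) {
            char ca = a.charAt(i), cb = b.charAt(j);
            if (isDigit(ca) && isDigit(cb)) {
                int iEnd = digitRunEnd(a, i), jEnd = digitRunEnd(b, j);
                int cmp = compareDigitRuns(a.substring(i, iEnd), b.substring(j, jEnd));
                if (cmp != 0) return cmp;
                i = iEnd;
                j = jEnd;
            } else {
                if (ca != cb) return Character.compare(ca, cb);
                i++;
                j++;
            }
        }
        // Whichever string still has characters left sorts last.
        return Integer.compare(a.length() - i, b.length() - j);
    }

    static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    static int digitRunEnd(String s, int from) {
        int k = from;
        while (k < s.length() && isDigit(s.charAt(k))) k++;
        return k;
    }

    /** Compares two digit strings by value without parsing, so length is unbounded. */
    static int compareDigitRuns(String x, String y) {
        x = stripLeadingZeros(x);
        y = stripLeadingZeros(y);
        if (x.length() != y.length()) return Integer.compare(x.length(), y.length());
        return x.compareTo(y);
    }

    static String stripLeadingZeros(String s) {
        int k = 0;
        while (k < s.length() - 1 && s.charAt(k) == '0') k++;
        return s.substring(k);
    }

    public static void main(String[] args) {
        List<String> names = new ArrayList<>(Arrays.asList("ab-10", "ab-2", "ab-1"));
        names.sort(NumericOrderSketch::compare);
        System.out.println(names); // [ab-1, ab-2, ab-10]
    }
}
```

A plain code-unit comparison would place "ab-10" before "ab-2" because '1' sorts before '2'; comparing the digit runs as numbers reverses that.
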
Calendar & Time Zones
• International Calendars: Arabic, Buddhist, Hebrew, Japanese
  • Required for correct presentation of dates in some countries
• Olson timezone support, with localizations
• Recent Additions:
  • RFC822 time zone format support in DateFormat (C+J) for compatibility
  • “Universal Time” conversions for high-precision date/time computations

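Two of the ideas on this slide, a non-Gregorian calendar and Olson time zone rules, can be shown with the JDK's java.time package (Java 8 or later). This is a sketch that uses JDK classes as an analogue for the ICU calendar and time zone services; the class name is hypothetical.

```java
import java.time.LocalDate;
import java.time.ZoneId;
import java.time.chrono.ThaiBuddhistDate;
import java.time.temporal.ChronoField;

// Hypothetical example class; java.time is a JDK API, not an ICU one.
class CalendarZoneSketch {
    /** Year of the given ISO date in the Thai Buddhist calendar. */
    static int buddhistYear(LocalDate isoDate) {
        return ThaiBuddhistDate.from(isoDate).get(ChronoField.YEAR);
    }

    /** UTC offset in effect at the start of the given day in an Olson zone. */
    static String offsetAt(String zoneId, LocalDate date) {
        return date.atStartOfDay(ZoneId.of(zoneId)).getOffset().toString();
    }

    public static void main(String[] args) {
        System.out.println(buddhistYear(LocalDate.of(2005, 4, 6)));                      // 2548
        System.out.println(offsetAt("America/Los_Angeles", LocalDate.of(2005, 1, 15))); // -08:00
        System.out.println(offsetAt("America/Los_Angeles", LocalDate.of(2005, 7, 15))); // -07:00
    }
}
```

The same instant has a different year number in the Buddhist calendar, and the same zone ID yields different offsets in winter and summer, which is why both need rule data rather than fixed constants.
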
Formatting
• Date & time: 8 formats per locale
• Messages
  • Completely localizable, Plural support
• Numbers & currencies
  • Scientific Notation, Spelled-out (checks, etc.)
  • Full Orthogonal Currency support
    • INR in Hindi:
    • INR in English: Rs. 1,234.57
    • INR in German: Rs. 1.234,57
• Recent Additions
  • POSIX migration library
  • Allows parsing multiple currencies with one formatter
  • Short and stand-alone month/day names

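The locale-dependent number layout in the English and German examples above, and the plural support for messages, can be sketched with the JDK's java.text classes. This is an illustration with JDK formatters as stand-ins; the currency symbol is left out on purpose because its rendering depends on the locale data in use, and the class name and message pattern are hypothetical.

```java
import java.text.MessageFormat;
import java.text.NumberFormat;
import java.util.Locale;

// Hypothetical example class; java.text formatters are JDK classes.
class FormattingSketch {
    /** Formats a value with exactly two fraction digits in the given locale. */
    static String number(double value, Locale locale) {
        NumberFormat format = NumberFormat.getNumberInstance(locale);
        format.setMinimumFractionDigits(2);
        format.setMaximumFractionDigits(2);
        return format.format(value);
    }

    /** Picks a message variant by count using a choice sub-format. */
    static String fileCount(int n) {
        MessageFormat message = new MessageFormat(
                "{0,choice,0#no files|1#one file|1<{0,number,integer} files}",
                Locale.ENGLISH);
        return message.format(new Object[] { n });
    }

    public static void main(String[] args) {
        System.out.println(number(1234.567, Locale.US));      // 1,234.57
        System.out.println(number(1234.567, Locale.GERMANY)); // 1.234,57
        System.out.println(fileCount(0)); // no files
        System.out.println(fileCount(1)); // one file
        System.out.println(fileCount(3)); // 3 files
    }
}
```

Because the message pattern is data, a translator can reorder or reword the variants without code changes, which is what "completely localizable" refers to.
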
Transforms
• Unicode Normalization
  • Highly optimized for performance
  • Performance utilities: concatenation, detection, comparison
• Casing (upper, lower, title, folding)
• General Transforms
  • Script transliterations
  • Half-width/Full-width, Hex, etc.
  • Chain transforms together, filter source characters
  • Rule-based, customizable at runtime
• IDNA: International Domain Names

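Normalization, locale-sensitive casing and IDNA each have a small JDK counterpart that makes the behavior concrete. This is a sketch using java.text.Normalizer, String.toUpperCase and java.net.IDN as analogues for the ICU transforms; the class name and the sample domain are hypothetical.

```java
import java.net.IDN;
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical example class; all calls are JDK APIs.
class TransformSketch {
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC); // compose
    }

    static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD); // decompose
    }

    /** Uppercases using the casing rules of the given language tag. */
    static String upper(String s, String languageTag) {
        return s.toUpperCase(Locale.forLanguageTag(languageTag));
    }

    /** Converts an internationalized domain name to its ASCII form. */
    static String toAsciiDomain(String domain) {
        return IDN.toASCII(domain);
    }

    public static void main(String[] args) {
        // "a" followed by a combining acute accent composes to one code point.
        System.out.println(nfc("a\u0301").length());          // 1
        System.out.println(nfd("\u00E1").length());           // 2
        // Turkish uppercases "i" to a dotted capital I, English to plain "I".
        System.out.println(upper("i", "tr").equals("\u0130")); // true
        System.out.println(upper("i", "en"));                  // I
        System.out.println(toAsciiDomain("b\u00FCcher.example")); // xn--bcher-kva.example
    }
}
```
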
Segmentation: word, line & sentence
• Fast state-table implementation
• Customizable
  • Rule-based, customizable at runtime
  • Special customizations, e.g. Thai
• Recent Additions:
  • Greatly improved performance when going backwards (common case when doing line break)
  • Java
    • The rules syntax has been extended. Rules can now return information about the types of characters they encountered.
    • Common compiled (binary) rule format with ICU4C

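Word segmentation is typically consumed as a sequence of boundary offsets. The following sketch uses the JDK's java.text.BreakIterator as a stand-in for the ICU break iterators and keeps only the segments that start with a letter or digit; the class name and the filtering rule are assumptions for illustration.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical example class; java.text.BreakIterator is a JDK API.
class WordBreakSketch {
    /** Returns the word-like segments of the text for the given locale. */
    static List<String> words(String text, Locale locale) {
        BreakIterator boundaries = BreakIterator.getWordInstance(locale);
        boundaries.setText(text);
        List<String> out = new ArrayList<>();
        int start = boundaries.first();
        for (int end = boundaries.next(); end != BreakIterator.DONE;
                start = end, end = boundaries.next()) {
            String piece = text.substring(start, end);
            // Spaces and punctuation are segments too; skip them here.
            if (Character.isLetterOrDigit(piece.codePointAt(0))) {
                out.add(piece);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world!", Locale.ENGLISH)); // [Hello, world]
    }
}
```
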
Unicode Regular Expressions
• Full Regex Implementation
  • ICU4C only: Java 1.4 has its own package (though not as powerful)
• All Unicode 4.0.1 Properties
  • supported through UnicodeSet
• Good performance
  • competitive with non-Unicode regex
• Recent Additions
  • Now features a C API, instead of just C++

Complex-text layout engine
• Glyph processing, positioning & adjustment
  • ligature substitution, contextual forms, kerning, accent placement, Bidi scripts, etc.
• Support for:
  • Drawing
  • Caret Display
  • Hit Testing
  • Selection Highlighting
  • Caret Movement
  • Layout Metrics
  • Line Break
• Recent addition: Canonical Equivalence: a + ´ or á

References
• ICU main site:
  • http://ibm.com/software/globalization/icu
  • New URL
  • Links to
    • Download ICU
    • User Guide, Technical FAQ, Support, Bug Reports
• Unicode Consortium
  • http://www.unicode.org
  • Unicode glossary, Unicode character database

Questions and Answers