Getting Started with ICU
Vladimir Weinstein, Eric Mader, Steven R. Loomis
Agenda
• Getting & setting up ICU4C
• Using conversion engine
• Using break iterator engine
• Getting & setting up ICU4J
• Using collation engine
• Using message formats
• Example analysis
27th Internationalization and Unicode Conference
Getting ICU4C
• http://ibm.com/software/globalization/icu
• Get the latest release
• Get the binary package
• Source download for modifying build options
• CVS for bleeding edge (read the instructions)
Setting up ICU4C
• Unpack binaries
• If you need to build from source:
    • Windows:
        • MSVC .NET 2003 project,
        • Cygwin + MSVC 6,
        • just Cygwin
    • Unix: runConfigureICU
        • make install
        • make check
Testing ICU4C
• Windows - run: cintltst, intltest, iotest
• Unix - make check (again)
• See it for yourself:
    #include <stdio.h>
    #include "unicode/utypes.h"
    #include "unicode/ures.h"

    int main(void) {
        UErrorCode status = U_ZERO_ERROR;
        /* Open the root bundle of ICU's own data */
        UResourceBundle *res = ures_open(NULL, "", &status);
        if(U_SUCCESS(status)) {
            printf("everything is OK\n");
        } else {
            printf("error %s opening resource\n", u_errorName(status));
        }
        ures_close(res);
        return 0;
    }
Conversion Engine - Opening
• ICU4C uses the open/use/close paradigm
• Open a converter:
    UErrorCode status = U_ZERO_ERROR;
    UConverter *cnv = ucnv_open(encoding, &status);
    if(U_FAILURE(status)) {
        /* process the error situation, die gracefully */
    }
• Almost all APIs use UErrorCode for status
• Check the error code!
What Converters are Available
• ucnv_countAvailable() - get the number of available converters
• ucnv_getAvailableName() - get the name of a particular converter
• Many frameworks offer this kind of listing
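As a point of comparison only: the JDK exposes a similar listing through java.nio.charset. The sketch below is not ICU code; the class and method names (ListCharsets, countAvailable, isAvailable) are made up for illustration.

```java
import java.nio.charset.Charset;
import java.util.SortedMap;

public class ListCharsets {
    /** Number of charsets the running JVM can convert to or from. */
    static int countAvailable() {
        return Charset.availableCharsets().size();
    }

    /** True if the named charset (or one of its aliases) is supported. */
    static boolean isAvailable(String name) {
        return Charset.isSupported(name);
    }

    public static void main(String[] args) {
        // Canonical charset names, sorted, one per line
        SortedMap<String, Charset> charsets = Charset.availableCharsets();
        System.out.println(charsets.size() + " charsets available");
        for (String name : charsets.keySet()) {
            System.out.println(name);
        }
    }
}
```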
Converting Text Chunk by Chunk
    char buffer[DEFAULT_BUFFER_SIZE];
    char *bufP = buffer;
    int32_t len = ucnv_fromUChars(cnv, bufP, DEFAULT_BUFFER_SIZE,
                                  source, sourceLen, &status);
    if(U_FAILURE(status)) {
        if(status == U_BUFFER_OVERFLOW_ERROR) {
            /* len holds the required size: allocate and convert again */
            status = U_ZERO_ERROR;
            bufP = (char *)malloc((len + 1) * sizeof(char));
            len = ucnv_fromUChars(cnv, bufP, len + 1,
                                  source, sourceLen, &status);
        } else {
            /* other error, die gracefully */
        }
    }
    /* do interesting stuff with the converted text */
Converting Text Character by Character
    UChar32 result;
    const char *source = start;
    const char *sourceLimit = start + len;
    while(source < sourceLimit) {
        result = ucnv_getNextUChar(cnv, &source, sourceLimit, &status);
        if(U_FAILURE(status)) {
            /* die gracefully */
        }
        /* do interesting stuff with the converted text */
    }
• Works only from code page to Unicode
Converting Text Piece by Piece
    while((!feof(f)) &&
          ((count=fread(inBuf, 1, BUFFER_SIZE , f)) > 0) ) {
        source = inBuf;
        sourceLimit = inBuf + count;
        do {
            target = uBuf;
            targetLimit = uBuf + uBufSize;
            ucnv_toUnicode(conv, &target, targetLimit,
                           &source, sourceLimit, NULL,
                           feof(f)?TRUE:FALSE, /* pass 'flush' when eof */
                           /* is true (when no more data will come) */
                           &status);
            if(status == U_BUFFER_OVERFLOW_ERROR) {
                // simply ran out of space - we'll reset the
                // target ptr the next time through the loop.
                status = U_ZERO_ERROR;
            } else {
                // Check other errors here and act appropriately
            }
            text.append(uBuf, target-uBuf);
            count += target-uBuf;
        } while (source < sourceLimit); // while simply out of space
    }
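The same piece-by-piece pattern can be sketched with the JDK's CharsetDecoder, which is not ICU but follows a comparable model: feed input in pieces, drain a small output buffer whenever it overflows, and signal the end of input on the last piece. The class name StreamingDecode, its helper names, the UTF-8 encoding and the 4-char output buffer are assumptions chosen for illustration; pieceSize is assumed to be at least 1.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

public class StreamingDecode {
    /** Decodes UTF-8 bytes in pieces of pieceSize bytes (pieceSize >= 1). */
    static String decode(byte[] data, int pieceSize) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        // Extra room for bytes of a sequence split across two pieces
        ByteBuffer in = ByteBuffer.allocate(pieceSize + 8);
        CharBuffer out = CharBuffer.allocate(4); // tiny on purpose: forces overflow
        StringBuilder text = new StringBuilder();
        int offset = 0;
        boolean endOfInput = false;
        try {
            while (!endOfInput) {
                int n = Math.min(pieceSize, data.length - offset);
                in.put(data, offset, n);
                offset += n;
                endOfInput = (offset == data.length);
                in.flip();
                CoderResult result;
                do {
                    result = decoder.decode(in, out, endOfInput);
                    if (result.isError()) {
                        result.throwException();
                    }
                    drain(out, text);
                } while (result.isOverflow()); // simply out of space: go again
                in.compact(); // keep any incomplete trailing sequence
            }
            while (decoder.flush(out).isOverflow()) {
                drain(out, text);
            }
            drain(out, text);
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("input is not valid UTF-8", e);
        }
        return text.toString();
    }

    /** Moves everything written to 'out' into 'text' and resets 'out'. */
    private static void drain(CharBuffer out, StringBuilder text) {
        out.flip();
        text.append(out);
        out.clear();
    }

    /** Encodes s as UTF-8 and decodes it again in pieces. */
    static String roundTrip(String s, int pieceSize) {
        return decode(s.getBytes(StandardCharsets.UTF_8), pieceSize);
    }
}
```

A piece size of 3 splits multi-byte sequences across pieces, which is the case the compact() call is there to handle.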
Clean up!
• Whatever is opened needs to be closed
• Converters use ucnv_close
• The sample uses a converter to convert code page data read from a file
Text Boundary Analysis
• Process of locating linguistic boundaries while formatting and processing text
• Many uses
• Relatively straightforward for English
• Hard for some other languages:
    • Chinese and Japanese
    • Thai
    • Hindi
Break Iteration - Introduction
• Character boundaries: grapheme clusters
• Word boundaries: word counting, double click selection
• Line break boundaries: where to break a line
• Sentence break boundaries: sentence counting, triple click selection
• ICU class - BreakIterator
Break Iteration – starting states
• The iterator points to a boundary between two characters
• A boundary is reported as the index of the character following it
• Use current() to get the boundary
• Use first() to set iterator to start of text
• Use last() to set iterator to end of text
Break Iteration - Navigation
• Use next() to move to next boundary
• Use previous() to move to previous boundary
• Both return DONE when there is no further boundary in that direction
Break Iteration – Checking a position
• Use isBoundary() to see if a position is a boundary
• Use preceding() to find the last boundary before a position
• Use following() to find the first boundary after a position
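A minimal sketch of these navigation calls, using the JDK's java.text.BreakIterator as a stand-in for the ICU class (the method names used here are the same in both; factory method names differ between ICU4C and Java). The class BoundaryDemo and its helpers are hypothetical. In the JDK version, following() and preceding() return the nearest boundary strictly after or strictly before the given offset.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class BoundaryDemo {
    private static BreakIterator wordIterator(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ENGLISH);
        it.setText(text);
        return it;
    }

    /** Collects every word boundary, walking forward with first() and next(). */
    static List<Integer> wordBoundaries(String text) {
        BreakIterator it = wordIterator(text);
        List<Integer> result = new ArrayList<>();
        for (int b = it.first(); b != BreakIterator.DONE; b = it.next()) {
            result.add(b);
        }
        return result;
    }

    /** First boundary after the offset. */
    static int following(String text, int offset) {
        return wordIterator(text).following(offset);
    }

    /** Last boundary before the offset. */
    static int preceding(String text, int offset) {
        return wordIterator(text).preceding(offset);
    }

    static boolean isBoundary(String text, int offset) {
        return wordIterator(text).isBoundary(offset);
    }

    public static void main(String[] args) {
        System.out.println(wordBoundaries("Hi there")); // [0, 2, 3, 8]
    }
}
```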
Break Iteration - Opening
• Use the factory methods:
    Locale locale = …; // locale to use for break iterators
    UErrorCode status = U_ZERO_ERROR;
    BreakIterator *characterIterator =
        BreakIterator::createCharacterInstance(locale, status);
    BreakIterator *wordIterator =
        BreakIterator::createWordInstance(locale, status);
    BreakIterator *lineIterator =
        BreakIterator::createLineInstance(locale, status);
    BreakIterator *sentenceIterator =
        BreakIterator::createSentenceInstance(locale, status);
• Don't forget to check the status!
Set the text
• We need to tell the iterator what text to use:
    UnicodeString text;
    readFile(file, text);
    wordIterator->setText(text);
• Reuse iterators by calling setText() again.
Break Iteration - Counting words in a file:
    int32_t countWords(BreakIterator *wordIterator, UnicodeString &text) {
        UErrorCode status = U_ZERO_ERROR;
        UnicodeString word;
        UnicodeSet letters(UnicodeString("[:letter:]"), status);
        int32_t wordCount = 0;
        int32_t start = wordIterator->first();
        for(int32_t end = wordIterator->next();
            end != BreakIterator::DONE;
            start = end, end = wordIterator->next()) {
            /* text is a reference, so use '.' rather than '->' */
            text.extractBetween(start, end, word);
            /* count only segments that contain at least one letter */
            if(letters.containsSome(word)) {
                wordCount += 1;
            }
        }
        return wordCount;
    }
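The same counting loop, sketched in Java with the JDK's java.text.BreakIterator standing in for the ICU class. The class name WordCount and the helper containsLetter are hypothetical; Character.isLetter plays the role of the [:letter:] set.

```java
import java.text.BreakIterator;
import java.util.Locale;

public class WordCount {
    /** Counts the segments between word boundaries that contain a letter. */
    static int countWords(String text, Locale locale) {
        BreakIterator words = BreakIterator.getWordInstance(locale);
        words.setText(text);
        int count = 0;
        int start = words.first();
        for (int end = words.next();
             end != BreakIterator.DONE;
             start = end, end = words.next()) {
            // Spaces, punctuation and digit runs are segments too; skip them
            if (containsLetter(text, start, end)) {
                count++;
            }
        }
        return count;
    }

    private static boolean containsLetter(String text, int start, int end) {
        return text.substring(start, end).codePoints().anyMatch(Character::isLetter);
    }

    public static void main(String[] args) {
        System.out.println(countWords("Hello, world! 42 apples.", Locale.ENGLISH)); // 3
    }
}
```

The segment "42" is a word-boundary segment but contains no letter, so it is not counted.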
Break Iteration – Breaking lines
    int32_t previousBreak(BreakIterator *breakIterator,
                          UnicodeString &text, int32_t location) {
        int32_t len = text.length();
        /* skip forward over trailing whitespace and control characters */
        while(location < len) {
            UChar c = text[location];
            if(!u_isWhitespace(c) && !u_iscntrl(c)) {
                break;
            }
            location += 1;
        }
        /* previous() takes no offset; preceding() finds the break before one */
        return breakIterator->preceding(location + 1);
    }
Break Iteration – Cleaning up
• Use delete to delete the iterators
    delete characterIterator;
    delete wordIterator;
    delete lineIterator;
    delete sentenceIterator;
Useful Links
• Homepage: http://ibm.com/software/globalization/icu
• API documents and User guide: http://ibm.com/software/globalization/icu/documents.jsp
Getting ICU4J
• Easiest: download a .jar file from the download section of http://ibm.com/software/globalization/icu
• Use the latest version if possible
• For sources, download the source .jar
• For bleeding edge, use the latest CVS (see the site for instructions)
Setting up ICU4J
• Check that you have the appropriate JDK version
• Try the test code (ICU4J 3.0 or later):
    import com.ibm.icu.util.ULocale;
    import com.ibm.icu.util.UResourceBundle;

    public class TestICU {
        public static void main(String[] args) {
            UResourceBundle resourceBundle =
                UResourceBundle.getBundleInstance(null, ULocale.getDefault());
        }
    }
• Add ICU's jar to the classpath on the command line
• Run the test suite
Building ICU4J
• Need ant in addition to JDK
• Use ant to build
• We also like Eclipse
Collation Engine
• More on collation tomorrow!
• Used for comparing strings
• Instantiation:
    ULocale locale = new ULocale("fr");
    Collator coll = Collator.getInstance(locale);
    // do useful things with the collator
• Lives in com.ibm.icu.text.Collator
String Comparison
• Works fast
• You get the result as soon as it is ready
• Use when you don't need to compare the same strings many times
    int compare(String source, String target);
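A minimal sketch of compare(), using the JDK's java.text.Collator, which has the same compare(String, String) method as the ICU4J class. The class name CompareDemo and the English locale are choices made for illustration.

```java
import java.text.Collator;
import java.util.Locale;

public class CompareDemo {
    /** Negative, zero or positive, like String.compareTo, but language-aware. */
    static int collate(String source, String target) {
        Collator coll = Collator.getInstance(Locale.ENGLISH);
        return coll.compare(source, target);
    }

    public static void main(String[] args) {
        // Code point order puts every uppercase ASCII letter before lowercase ones
        System.out.println("apple".compareTo("Banana") > 0);  // true
        // The collator orders by letter first, so "apple" sorts before "Banana"
        System.out.println(collate("apple", "Banana") < 0);   // true
    }
}
```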
Sort Keys
• Used when multiple comparisons are required
• Indexes in databases
• ICU4J has two classes
• Only compare sort keys generated by the same kind of collator
CollationKey class
• JDK compatible
• Saves the original string
• Compare keys with compareTo method
• Get the bytes with toByteArray method
• We used CollationKey as a key for a TreeMap structure
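One way the TreeMap idea might look, sketched with the JDK's java.text.CollationKey (the class the ICU4J one is compatible with). SortKeyDemo and sortWithKeys are hypothetical names; note that strings the collator considers equal map to the same key, so this simple version keeps only one of them.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.TreeMap;

public class SortKeyDemo {
    /** Sorts words by building each sort key once and letting TreeMap order them. */
    static List<String> sortWithKeys(List<String> words) {
        Collator coll = Collator.getInstance(Locale.ENGLISH);
        TreeMap<CollationKey, String> sorted = new TreeMap<>();
        for (String word : words) {
            // All keys come from the same collator, so they are comparable
            sorted.put(coll.getCollationKey(word), word);
        }
        return new ArrayList<>(sorted.values());
    }

    public static void main(String[] args) {
        System.out.println(sortWithKeys(Arrays.asList("pear", "Apple", "banana")));
        // [Apple, banana, pear]
    }
}
```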
RawCollationKey class
• Does not store the original string
• Get it by using getRawCollationKey method
• Mutable class, can be reused
• Simple and lightweight
Message Format - Introduction
• Assembles a user message from parts
• Some parts fixed, some supplied at runtime
• Order different for different languages:
    • English: My Aunt's pen is on the table.
    • French: The pen of my Aunt is on the table.
• Pattern string defines how to assemble parts:
    • English: {0}''s {2} is {1}.
    • French: {2} of {0} is {1}.
• Get pattern string from resource bundle
Message Format - Example
    String person = …; // e.g. "My Aunt"
    String place = …;  // e.g. "on the table"
    String thing = …;  // e.g. "pen"
    String pattern = resourceBundle.getString("personPlaceThing");
    MessageFormat msgFmt = new MessageFormat(pattern);
    Object arguments[] = {person, place, thing};
    String message = msgFmt.format(arguments);
    System.out.println(message);
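A self-contained sketch of the same idea, with the two patterns from the introduction written inline instead of loaded from a resource bundle. It uses the JDK's java.text.MessageFormat, whose pattern syntax matches for this simple case; the class name PenDemo and the method describe are hypothetical.

```java
import java.text.MessageFormat;

public class PenDemo {
    /** Fills the pattern with person ({0}), place ({1}) and thing ({2}). */
    static String describe(String pattern, String person, String place, String thing) {
        MessageFormat msgFmt = new MessageFormat(pattern);
        Object[] arguments = {person, place, thing};
        return msgFmt.format(arguments);
    }

    public static void main(String[] args) {
        // Two single quotes in the pattern produce one apostrophe in the output
        System.out.println(describe("{0}''s {2} is {1}.", "My Aunt", "on the table", "pen"));
        // The second pattern reorders the same three arguments
        System.out.println(describe("{2} of {0} is {1}.", "My Aunt", "on the table", "pen"));
    }
}
```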
Message Format – Different data types
• We can also format other data types, like dates
• We do this by adding a format type:
    String pattern = "On {0, date} at {0, time} there was {1}.";
    MessageFormat fmt = new MessageFormat(pattern);
    Object args[] = {new Date(System.currentTimeMillis()), // 0
                     "a power failure"                     // 1
                    };
    System.out.println(fmt.format(args));
• This will output:
    On Jul 17, 2004 at 2:15:08 PM there was a power failure.
Message Format – Format styles
• Add a format style:
    String pattern = "On {0, date, full} at {0, time, full} there was {1}.";
    MessageFormat fmt = new MessageFormat(pattern);
    Object args[] = {new Date(System.currentTimeMillis()), // 0
                     "a power failure"                     // 1
                    };
    System.out.println(fmt.format(args));
• This will output:
    On Saturday, July 17, 2004 at 2:15:08 PM PDT there was a power failure.
Message Format – Format style details
Message Format – No format type
• If no format type is given, the data is formatted like this:
Message Format – Counting files
• Pattern to display number of files:
    There are {1, number, integer} files in {0}.
• Code to use the pattern:
    String pattern = resourceBundle.getString("fileCount");
    MessageFormat fmt = new MessageFormat(pattern);
    String directoryName = … ;
    int fileCount = … ;
    Object args[] = {directoryName, Integer.valueOf(fileCount)};
    System.out.println(fmt.format(args));
• This will output messages like:
    There are 1,234 files in myDirectory.
Message Format – Problems counting files
• If there's only one file, we get:
    There are 1 files in myDirectory.
• Could fix by testing for special case of one file
• But, some languages need other special cases:
    • Dual forms
    • Different form for no files
    • Etc.
Message Format – Choice format
• Choice format handles all of this
• Use special format element:
    There {1, choice, 0#are no files|
                      1#is one file|
                      1<are {1, number, integer} files} in {0}.
• Using this pattern with the same code we get:
    There are no files in thisDirectory.
    There is one file in thatDirectory.
    There are 1,234 files in myDirectory.
Message Format – Choice format patterns
• Selects a string based on a number
• If the string is a format element, it is processed in turn
• Splits the real number line into two or more ranges
• Range specifiers are separated by a vertical bar ("|")
• Each range specifier is: lower limit, separator, string
• Separator indicates type of lower limit:
    • # : the lower limit is included in the range
    • < : the lower limit is excluded from the range
Message Format – Choice pattern details
• Here's our pattern again:
    There {1, choice, 0#are no files|
                      1#is one file|
                      1<are {1, number, integer} files} in {0}.
• First range is [0..1)
    • Really (-∞..1)
• Second range is [1..1]
• Third range is (1..∞)
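The three ranges can be checked with a small sketch. It uses the JDK's java.text.MessageFormat, which accepts this choice pattern; the pattern is written without the extra spaces and line breaks used on the slide. The class name FileCountDemo and the method describe are hypothetical, and Locale.US is fixed so the grouping separator is predictable.

```java
import java.text.MessageFormat;
import java.util.Locale;

public class FileCountDemo {
    static final String PATTERN =
        "There {1,choice,0#are no files|1#is one file|1<are {1,number,integer} files} in {0}.";

    /** Picks the wording for the file count, then fills in the directory name. */
    static String describe(String directoryName, int fileCount) {
        MessageFormat fmt = new MessageFormat(PATTERN, Locale.US);
        Object[] args = {directoryName, fileCount};
        return fmt.format(args);
    }

    public static void main(String[] args) {
        System.out.println(describe("thisDirectory", 0));
        System.out.println(describe("thatDirectory", 1));
        System.out.println(describe("myDirectory", 1234));
    }
}
```

The third choice string contains a format element of its own, so it is run through the formatter again with the same arguments.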
Message Format – Other details
• Format style can be a pattern string
    • Format type number: use DecimalFormat pattern
    • Format type date, time: use SimpleDateFormat pattern
• Quoting in patterns
    • Enclose special characters in single quotes
    • Use two consecutive single quotes to represent one
    The '{' character, the '#' character and the '' character.
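Both points can be tried out with a short sketch, again using the JDK's java.text.MessageFormat as a stand-in. The class name QuotingDemo, its method names and the "Total:" pattern are made up for illustration; Locale.US is fixed so the decimal and grouping separators are predictable.

```java
import java.text.MessageFormat;
import java.util.Locale;

public class QuotingDemo {
    /** Formats the quoting example pattern, which takes no arguments. */
    static String quoted() {
        String pattern = "The '{' character, the '#' character and the '' character.";
        return new MessageFormat(pattern, Locale.US).format(new Object[0]);
    }

    /** Uses a DecimalFormat pattern as the format style of a number element. */
    static String total(double amount) {
        MessageFormat fmt = new MessageFormat("Total: {0,number,#,##0.00}", Locale.US);
        return fmt.format(new Object[] {amount});
    }

    public static void main(String[] args) {
        System.out.println(quoted());      // quotes removed, '' becomes '
        System.out.println(total(1234.5)); // Total: 1,234.50
    }
}
```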