Sorting it all out: An introduction to collation Cathy Wissink Michael Kaplan Globalization Infrastructure and Font Technology Windows International Microsoft
Who is this talk geared towards? • This is a high-level introduction to the concepts of collation, assuming no prior knowledge. • Audience: • Developers new to concept • People who need to understand collation enough to “sell” this globalization feature to management • Not intended to be a “nuts and bolts” talk (see the presentation immediately following!) Prague, Czech Republic (IUC23)
Collation: Used Every Day! It may not be obvious, but you most likely use collation in some form every day: • Finding a mail slot for a colleague • Searching for an author at the bookstore • Library card catalog • Looking up a phone number
Any time you order or search for data in a logical fashion within a structure, you use collation!
Collation, the definition: • The culturally expected ordering of linguistic characters in a particular language • Often referred to as sorting, ordering, alphabetizing • Informants recognize correct vs. incorrect collation for their language, but often have a hard time explaining the particular collation rules
Great definitions, but what do they mean, really? • Every language (every culture) has an expected result when users search for data in “sorted” order • If the ordering isn’t perfectly correct, users have a very hard time finding data • This ordering can be influenced by a number of linguistic and orthographic elements within a language
Examples of linguistic elements that impact collation • “Character” order • Casing (upper case vs. lower case) • Modifiers (diacritics, Indic matras, vowel marks) • Radicals (CJK) • Stroke counts (CJK) • Syllable structure (SE Asian languages) • Pronunciation
Collation in Action • Latin scripts: English, French, Lithuanian, Swedish, Traditional Spanish • Chinese variants (Taiwanese orders) • Devanagari script: Hindi, Marathi • Tamil script: Tamil
English:
French:
Lithuanian:
Swedish:
Spanish (Traditional):
Devanagari Hindi: consonants with modifier marks (candrabindu U+0901, anusvara U+0902, or visarga U+0903) sort differently from the consonant alone. A consonant followed by one of these modifier marks has a lighter primary sorting weight than the same consonant without a modifier mark, so it sorts first.
Devanagari कँ (Devanagari Ka + candrabindu) कं (Devanagari Ka + anusvara) कः (Devanagari Ka + visarga) क (Devanagari Ka)
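The rule on the preceding slides can be sketched as a two-part sort key. This is only an illustration: it assumes the four strings listed above appear in sorted order, it handles a single consonant with at most one trailing mark, and the names `MARK_RANK`, `NO_MARK_RANK`, and `hindi_key` are hypothetical. Using the code point as the consonant rank is also an assumption made for brevity.

```python
# Hypothetical ranks: the three modifier marks sort before "no mark",
# so consonant + mark is lighter than the bare consonant.
MARK_RANK = {
    "\u0901": 0,  # candrabindu
    "\u0902": 1,  # anusvara
    "\u0903": 2,  # visarga
}
NO_MARK_RANK = 3


def hindi_key(syllable):
    """Sort key for one consonant optionally followed by one modifier mark."""
    consonant = syllable[0]
    mark = syllable[1] if len(syllable) > 1 else None
    # Assumption: the code point stands in for the consonant's real weight.
    return (ord(consonant), MARK_RANK.get(mark, NO_MARK_RANK))


syllables = ["\u0915", "\u0915\u0903", "\u0915\u0901", "\u0915\u0902"]
ordered = sorted(syllables, key=hindi_key)
print(ordered)
```

A plain code-point sort would place the bare consonant first, because it is a prefix of the other three strings; the key above reverses that.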
Devanagari • Hindi vs. Marathi • Two different languages within the Devanagari script, two different sort positions for Lla (U+0933)
Devanagari • Hindi: U+0932 < U+0933 < U+0934; that is: ल < ळ < ऴ • Marathi: U+0939 < U+0933 < U+0915 U+094D U+0937 (conjunct); that is: ह < ळ < क्ष
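One way to picture per-language ordering is a small rank table per language. The sketch below encodes only the relative orders given on this slide; the names `HINDI_ORDER`, `MARATHI_ORDER`, and `rank_in` are hypothetical, and a real collation table would cover the full script. The conjunct is treated as a single sortable unit, which is an assumption of this sketch.

```python
# Relative orders taken from the slide; everything else is omitted.
HINDI_ORDER = ["\u0932", "\u0933", "\u0934"]
MARATHI_ORDER = ["\u0939", "\u0933", "\u0915\u094d\u0937"]  # last item is the conjunct


def rank_in(order):
    """Build a key function that ranks units by their position in `order`."""
    table = {unit: position for position, unit in enumerate(order)}
    return lambda unit: table[unit]


units = ["\u0933", "\u0915\u094d\u0937", "\u0939"]
marathi_sorted = sorted(units, key=rank_in(MARATHI_ORDER))
code_point_sorted = sorted(units)
print(marathi_sorted)
print(code_point_sorted)
```

The code-point sort puts the conjunct first (it starts with U+0915), while the Marathi table puts it last, so the same script needs a different table per language.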
Tamil: a consonant + virama (halant) combination has a lighter primary weight than the consonant alone
Tamil க் (Tamil Ka + virama) க (Tamil Ka) ங் (Tamil Nga + virama) ங (Tamil Nga) ச் (Tamil Ca + virama) ச (Tamil Ca) ஞ் (Tamil Nya + virama) ஞ (Tamil Nya)
Myths about collation “Well, if I localize my product, these kinds of details don’t matter”
Myths about collation “If I already use Unicode in my product, sorting is covered by this universal encoding”
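A quick way to test this myth is to sort by code point, which is all the encoding itself provides. The sketch below relies on Python's default string comparison, which is by Unicode code point; the word list is invented for illustration.

```python
# Python compares strings by Unicode code point by default,
# i.e. the ordering that the encoding alone gives you.
words = ["apple", "Zebra", "\u00c9clair", "eclair"]  # \u00c9 is É

code_point_order = sorted(words)
print(code_point_order)
# Uppercase ASCII letters precede all lowercase ones, and accented
# letters land after all of basic Latin.
```

Few users would expect "Zebra" ahead of "apple", or "Éclair" separated from "eclair" by the rest of the alphabet.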
Myths about collation “One collation is good enough for Europe*, right?” * Replace with the market of your choice: Asia, North America, India, etc.
Myths about collation “One collation is good enough for the Latin* script, right?” * Replace with the script of your choice: Cyrillic, Han, Devanagari, etc.
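As a rough illustration of why one Latin-script collation is not enough, the sketch below sorts the same words under two simplified rules: Swedish, where å, ä, ö are separate letters after z, and German dictionary order, where ä, ö, ü sort with a, o, u. Both key functions are hypothetical, primary-level-only approximations, and the word list is invented.

```python
# Swedish alphabet: a-z followed by å, ä, ö as distinct letters.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyz\u00e5\u00e4\u00f6"

# German dictionary order (simplified): umlauts sort with their base vowel.
GERMAN_FOLD = str.maketrans({"\u00e4": "a", "\u00f6": "o", "\u00fc": "u"})


def swedish_key(word):
    """Rank each letter by its position in the Swedish alphabet."""
    return [SWEDISH_ALPHABET.index(ch) for ch in word.lower()]


def german_key(word):
    """Fold umlauts onto their base vowels before comparing."""
    return word.lower().translate(GERMAN_FOLD)


words = ["zon", "\u00e4rta", "arm"]  # \u00e4 is ä
swedish_sorted = sorted(words, key=swedish_key)
german_sorted = sorted(words, key=german_key)
print(swedish_sorted)
print(german_sorted)
```

The word beginning with ä ends up last under the Swedish rule and in the middle under the German one, even though the script and the encoding are identical.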
Why should I care about all this? • Ideally, a well-globalized product uses culturally correct collation where the users expect it, for example: • Address book • Document filing system • Database • … • Your users will expect collation in a surprising number of places!
Collation Example
Yet another collation example
How do I make sure my users get the results they expect? • Collation usually needs to address the user’s expected ordering, not the linguistic ordering of the data source (these two can differ!) • Swedish user, German data • Multiple users, multilingual data • The Switzerland example
How do I make sure my users get the results they expect? • Make sure you’re using a collation-aware mechanism to order data • Windows APIs such as CompareString, LCMapString • SQL Server 2000 collations • The .NET Framework's CompareInfo class • Except when you want non-linguistic collation…
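The slide names Windows and .NET mechanisms; as a stand-in in Python, the standard `locale` module offers a comparable collation-aware hook through `locale.strxfrm`. The sketch below is not the Windows API, and the helper name `linguistic_sort` is hypothetical. Results depend on which locales are installed on the machine, so no particular output is claimed.

```python
import locale


def linguistic_sort(words, locale_name=""):
    """Sort words with the platform's collation for the given locale.

    An empty locale_name means "use the user's default locale".
    """
    previous = locale.setlocale(locale.LC_COLLATE)  # remember current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        pass  # requested locale not installed; keep the current one
    try:
        # strxfrm turns each string into a key that compares in locale order.
        return sorted(words, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)  # restore


result = linguistic_sort(["pear", "apple", "fig"])
print(result)
```

The point of the design is that the application never hard-codes an ordering; it asks the platform for one based on the user's locale, in the same spirit as the mechanisms listed on the slide.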
When not to use linguistic collation When consistency across different cultures is required • “Case insensitive” file systems • File extension names (.INF, .GIF, etc.)
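For identifiers such as file extensions, a comparison that ignores the user's locale gives the same answer on every machine. A minimal sketch, assuming Python's `str.casefold()`, which applies Unicode case folding without consulting the locale; the helper name `same_extension` is hypothetical.

```python
def same_extension(a, b):
    """Compare two file extensions case-insensitively, independent of locale."""
    # casefold() never looks at the user's locale settings, so the
    # result is identical for every user and every culture.
    return a.casefold() == b.casefold()


matches = same_extension(".INF", ".inf")
differs = same_extension(".GIF", ".inf")
print(matches, differs)
```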
When not to use linguistic collation When users expect data in a specific collation other than their own • Excel column names • “ASCII” order
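Spreadsheet column names are a good example of a fixed, non-linguistic order: A through Z, then AA, AB, and so on. A minimal sketch, assuming uppercase ASCII names; the helper name `column_key` is hypothetical.

```python
def column_key(name):
    """Order spreadsheet-style column names: shorter first, then by code point."""
    return (len(name), name)


columns = ["AA", "B", "Z", "A", "AB"]
ordered_columns = sorted(columns, key=column_key)
print(ordered_columns)
```

A plain string sort would place "AA" and "AB" before "B", which is not what a spreadsheet user expects, and a linguistic sort would vary by culture where the column order must not.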
In summary… • Linguistically aware collation is an important feature of any well-globalized product • Collation needs to be considered at the language level • The encoding, region, or script level is not enough! • There are many collation-aware mechanisms out there (within the OS, for example); take advantage of them!
Other applicable IUC talks • Stay tuned for the second half of this tutorial! • Cathy's "Issues in Indic Collation" talk on Thursday afternoon
Other References • This tutorial's corresponding paper • Unicode Technical Note (UTN) #1: http://unicode.org/notes/tn1/ • Nadine Kano, Developing International Software (out of print, but still available on the web): http://microsoft.com/globaldev/dis_v1/disv1.asp • New! Developing International Software, 2nd edition (available now or very soon): http://microsoft.com/globaldev/dis_v2/disv2.asp • Michael Kaplan, Internationalization with VB: http://i18nWithVB.com/
Questions?
Don't forget to fill out your evals!