Sorting it all out: An introduction to collation

Cathy Wissink and Michael Kaplan
Globalization Infrastructure and Font Technology, Windows International, Microsoft

Who is this talk geared towards?
• This is a high-level introduction to the concepts of collation, assuming no prior knowledge.
• Audience:
  • Developers new to the concept
  • People who need to understand collation enough to “sell” this globalization feature to management
• Not intended to be a “nuts and bolts” talk (see the presentation immediately following!)
Prague, Czech Republic (IUC23)

Collation: Used Every Day!
It may not be obvious, but you most likely use collation in some form every day:
• Finding a mail slot for a colleague
• Searching for an author at the bookstore
• Library card catalog
• Looking up a phone number

Any time you order or search for data in a logical fashion within a structure, you use collation!

Collation, the definition:
• The culturally expected ordering of linguistic characters in a particular language
• Often referred to as sorting, ordering, alphabetizing
• Informants recognize correct vs. incorrect collation for their language, but often have a hard time explaining the particular collation rules

Great definitions, but what do they mean, really?
• Every language (every culture) has an expected result when users search for data in “sorted” order
• If the ordering isn’t perfectly correct, users have a very hard time finding data
• This ordering can be influenced by a number of linguistic and orthographic elements within a language

Examples of linguistic elements that impact collation
• “Character” order
• Casing (upper case vs. lower case)
• Modifiers (diacritics, Indic matras, vowel marks)
• Radicals (CJK)
• Stroke counts (CJK)
• Syllable structure (SE Asian languages)
• Pronunciation

Collation in Action
• Latin scripts: English, French, Lithuanian, Swedish, Traditional Spanish
• Chinese variants (Taiwanese orders)
• Devanagari script: Hindi, Marathi
• Tamil script: Tamil

English:

French:

Lithuanian:

Swedish:

Spanish (Traditional):

Devanagari
Hindi: consonants with modifier marks (candrabindu U+0901, anusvara U+0902 or visarga U+0903) sort differently from the consonant alone. A consonant followed by one of these modifier marks has a lighter primary sorting weight than the same consonant without a modifier mark.

Devanagari
कँ (Devanagari Ka + candrabindu)
कं (Devanagari Ka + anusvara)
कः (Devanagari Ka + visarga)
क (Devanagari Ka)
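The ordering on this slide can be reproduced with a toy sort key. This is only a sketch: the numeric ranks are illustrative assumptions rather than real collation weights, and `syllable_key` is a hypothetical helper that handles a single consonant with at most one modifier mark.

```python
# Toy sort key for the ordering shown on this slide.
# The ranks are illustrative assumptions, not real collation weights.
CANDRABINDU, ANUSVARA, VISARGA = "\u0901", "\u0902", "\u0903"

# A consonant carrying a modifier ranks before the bare consonant.
MODIFIER_RANK = {CANDRABINDU: 0, ANUSVARA: 1, VISARGA: 2}
BARE = 3

def syllable_key(syllable):
    """Key for one consonant optionally followed by one modifier mark."""
    base = syllable[0]
    rank = MODIFIER_RANK.get(syllable[1:], BARE)
    return (base, rank)

KA = "\u0915"
syllables = [KA, KA + VISARGA, KA + CANDRABINDU, KA + ANUSVARA]

by_code_point = sorted(syllables)                     # bare Ka comes first
by_slide_order = sorted(syllables, key=syllable_key)  # bare Ka comes last
```

A plain code point sort puts the bare consonant first because it is a prefix of the other strings, which is the opposite of the order shown above.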

Devanagari
• Hindi vs. Marathi
• Two different languages within the Devanagari script, two different sort positions for Lla (U+0933)

Devanagari
• Hindi: U+0932 < U+0933 < U+0934; that is: ल < ळ < ऴ
• Marathi: U+0939 < U+0933 < U+0915 + U+094D + U+0937 conjunct; that is: ह < ळ < क्ष
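One way to picture the two orders is as a shared default plus a per-language tailoring. The sketch below is a toy model under stated assumptions: untailored letters fall back to code point order (which matches the Hindi relation on the slide, and places the Ka conjunct under Ka for Hindi), the fractional weights are invented for illustration, and `make_key` is a hypothetical helper.

```python
LA, LLA, HA = "\u0932", "\u0933", "\u0939"
KSSA = "\u0915\u094d\u0937"  # Ka + virama + Ssa conjunct

def make_key(tailoring):
    """Build a sort key: tailored elements use the given weight,
    everything else falls back to its code point (an assumption)."""
    contractions = [e for e in tailoring if len(e) > 1]

    def key(text):
        weights, i = [], 0
        while i < len(text):
            # Prefer a multi-character element (such as the conjunct) if one starts here.
            match = next((c for c in contractions if text.startswith(c, i)), text[i])
            weights.append(tailoring.get(match, ord(match[0])))
            i += len(match)
        return weights

    return key

hindi_key = make_key({})  # no tailoring: plain code point order
marathi_key = make_key({LLA: ord(HA) + 0.3, KSSA: ord(HA) + 0.6})  # invented weights

letters = [KSSA, HA, LLA, LA]
hindi_order = sorted(letters, key=hindi_key)      # [KSSA, LA, LLA, HA]
marathi_order = sorted(letters, key=marathi_key)  # [LA, HA, LLA, KSSA]
```

The same four strings come out in two different orders, and only the tailoring table differs between them.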

Tamil
A consonant + virama (halant) combination has a lighter primary weight than the consonant alone.

Tamil
க் (Tamil Ka + virama)
க (Tamil Ka)
ங் (Tamil Nga + virama)
ங (Tamil Nga)
ச் (Tamil Ca + virama)
ச (Tamil Ca)
ஞ் (Tamil Nya + virama)
ஞ (Tamil Nya)
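The Tamil rule can be sketched at the word level by walking the string and recording, for each consonant, whether a virama follows it. This is a simplified illustration: `tamil_key` is a hypothetical helper, the 0/1 flags stand in for real weights, and vowel signs are not handled.

```python
VIRAMA = "\u0bcd"  # Tamil sign virama

def tamil_key(text):
    """Toy key: consonant + virama sorts before the same consonant alone."""
    key, i = [], 0
    while i < len(text):
        has_virama = text[i + 1 : i + 2] == VIRAMA
        key.append((text[i], 0 if has_virama else 1))
        i += 2 if has_virama else 1
    return key

KA, NGA = "\u0b95", "\u0b99"
items = [NGA, KA, NGA + VIRAMA, KA + VIRAMA]
ordered = sorted(items, key=tamil_key)  # Ka+virama, Ka, Nga+virama, Nga
```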

Myths about collation
“Well, if I localize my product, these kinds of details don’t matter”

Myths about collation
“If I already use Unicode in my product, sorting is covered by this universal encoding”
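The myth above is easy to check: sorting by Unicode code point is not the same as sorting the way readers expect. The sketch below compares a plain code point sort with a deliberately crude key that ignores case and accents. `crude_key` is a hypothetical helper and not a real collation; it only shows that the encoding alone does not supply the expected order.

```python
import unicodedata

def crude_key(word):
    """Strip combining marks and fold case. A rough stand-in, not real collation."""
    decomposed = unicodedata.normalize("NFD", word)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base.casefold()

words = ["zebra", "Apple", "\u00e9clair", "apple"]  # "\u00e9" is e with acute

by_code_point = sorted(words)                 # upper case first, accented word last
by_crude_key = sorted(words, key=crude_key)   # accented word lands between a and z
```

In code point order the accented word sorts after "zebra", which few English or French readers would expect.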

Myths about collation
“One collation is good enough for Europe*, right?”
* Replace with the market of your choice: Asia, North America, India, etc.

Myths about collation
“One collation is good enough for the Latin* script, right?”
* Replace with the script of your choice: Cyrillic, Han, Devanagari, etc.

Why should I care about all this?
• Ideally, a well-globalized product uses culturally correct collation where the users expect it, for example:
  • Address book
  • Document filing system
  • Database
  • …
• Your users will expect collation in a surprising number of places!

Collation Example

Yet another collation example

How do I make sure my users get the results they expect?
• Collation usually needs to address the user’s expected ordering, not the linguistic ordering of the data source (these two can differ!)
  • Swedish user, German data
  • Multiple users, multilingual data
  • The Switzerland example
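The "Swedish user, German data" case can be sketched with two toy keys applied to the same names. Both keys are simplified assumptions made for illustration: the Swedish one places å, ä, ö after z, and the German one treats umlauted vowels like their base letters. Neither covers the full rules of either language.

```python
# Swedish alphabet with a-ring, a-umlaut, o-umlaut after z (simplified).
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyz\u00e5\u00e4\u00f6"
SWEDISH_RANK = {ch: i for i, ch in enumerate(SWEDISH_ALPHABET)}

# German dictionary-style folding of umlauts and sharp s (simplified).
GERMAN_FOLD = str.maketrans({"\u00e4": "a", "\u00f6": "o", "\u00fc": "u", "\u00df": "ss"})

def swedish_key(word):
    # Unknown characters sort after the whole alphabet in this toy model.
    return [SWEDISH_RANK.get(ch, len(SWEDISH_ALPHABET)) for ch in word.lower()]

def german_key(word):
    return word.lower().translate(GERMAN_FOLD)

names = ["Zimmer", "\u00c4rger", "Apfel"]  # German words; "\u00c4" is A-umlaut

german_view = sorted(names, key=german_key)    # Apfel, Ärger, Zimmer
swedish_view = sorted(names, key=swedish_key)  # Apfel, Zimmer, Ärger
```

The data is the same German text in both cases; which result is "correct" depends on who is reading the list.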

How do I make sure my users get the results they expect?
• Make sure you’re using a collation-aware mechanism to order data
  • Windows APIs such as CompareString, LCMapString
  • SQL Server 2000 collations
  • The .NET Framework's CompareInfo class
• Except when you want non-linguistic collation…
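The mechanisms listed above are Windows, SQL Server and .NET facilities. As a stand-in for illustration only, the sketch below uses Python's standard `locale` module, which plays a similar role by delegating to the platform's collation. `sort_with_locale` is a hypothetical helper; which locale names are available (for example `sv_SE.UTF-8`) depends on the operating system, so the fallback to code point order is an assumption made to keep the example runnable everywhere.

```python
import locale

def sort_with_locale(words, locale_name=""):
    """Sort using the platform's collation for locale_name.

    An empty name means the user's default locale. If the locale is not
    installed, fall back to plain code point order (a non-linguistic sort).
    """
    previous = locale.setlocale(locale.LC_COLLATE)  # remember current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return sorted(words)
    try:
        # strxfrm turns each string into a key that compares in collation order.
        return sorted(words, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)

result = sort_with_locale(["pear", "apple", "fig"], "C")
```

The point of the design is that the application never encodes ordering rules itself; it asks the platform for the order that matches the chosen locale.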

When not to use linguistic collation
When consistency across different cultures is required
• “Case insensitive” file systems
• File extension names (.INF, .GIF, etc.)
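A well-known illustration of why file extensions need culture-independent handling is Turkish casing, where lower case "i" upper-cases to dotted "İ" (U+0130). This example is not from the slides. `turkish_upper` below is a hypothetical toy that imitates only that one rule, while Python's own `str.upper()` is locale-independent and serves as the non-linguistic comparison.

```python
def turkish_upper(text):
    """Toy imitation of Turkish upper-casing: i -> dotted capital I, dotless i -> I."""
    return text.replace("i", "\u0130").replace("\u0131", "I").upper()

def has_extension(filename, extension):
    """Culture-independent check: str.upper() does not vary with the user's locale."""
    return filename.upper().endswith(extension.upper())

name = "photo.gif"
linguistic_match = turkish_upper(name).endswith(".GIF")  # False: ends in ".GİF"
ordinal_match = has_extension(name, ".GIF")              # True
```

A file type check that changes its answer with the user's language is exactly the inconsistency this slide warns about.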

When not to use linguistic collation
When users expect data in a specific collation other than their own
• Excel column names
• “ASCII” order
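Spreadsheet-style column names are a small example of an ordering that is fixed by convention rather than by language: A to Z come first, then AA, AB, and so on. A minimal sketch, with `column_key` as a hypothetical helper that sorts by length before letters:

```python
def column_key(name):
    """Shorter column names first, then plain letter order within a length."""
    return (len(name), name)

columns = ["AA", "B", "Z", "A", "AB"]

alphabetical = sorted(columns)                 # ['A', 'AA', 'AB', 'B', 'Z']
column_order = sorted(columns, key=column_key) # ['A', 'B', 'Z', 'AA', 'AB']
```

An alphabetical sort, linguistic or not, interleaves the two-letter names with the one-letter names, which is not what a spreadsheet user expects.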

In summary…
• Linguistically aware collation is an important feature of any well-globalized product
• Collation needs to be considered at the language level
  • Encoding, region, and script level are not enough!
• There are many collation-aware mechanisms out there (within the OS, for example); take advantage of them!

Other applicable IUC talks
• Stay tuned for the second half of this tutorial!
• Cathy's "Issues in Indic Collation" talk on Thursday afternoon

Other References
• This tutorial's corresponding paper
• Unicode Technical Note (UTN) #1: http://unicode.org/notes/tn1/
• Nadine Kano, Developing International Software (out of print, but still available on the web): http://microsoft.com/globaldev/dis_v1/disv1.asp
• New! Developing International Software, 2nd edition (available now or very soon): http://microsoft.com/globaldev/dis_v2/disv2.asp
• Michael Kaplan, Internationalization with VB: http://i18nWithVB.com/

Questions?

Don't forget to fill out your evals!
