Developing a multilingual text analysis engine - does using Unicode solve all the issues?

Developing a multilingual text analysis engine - does using Unicode solve all the issues? Dr. Brian O’Donovan, IBM Ireland, Sept. 2002

Agenda • What is IBM LanguageWare? • Where/How is it Used? • Our Experiences of Conversion to Unicode • Benefits Accruing • Challenges to be Overcome • Word Identification problems • Future work

What is IBM LanguageWare? • A suite of tools to assist in linguistic analysis • Initially developed by IBM USA, but now developed by a Globally distributed team • Used in internal and external products • Available to OEMs as a toolkit

Where are We? • Developers • North Carolina, USA • Dublin, Ireland • Helsinki, Finland • Taipei, Taiwan • Yamato, Japan • Seoul, Korea • Major Customers • Various Locations, USA • Böblingen, Germany

Afrikaans Arabic Catalan Simplified Chinese Traditional Chinese Czech Danish Dutch (x2) English x4 regions & x2 domains Finnish French (x2) German (x3) Greek Hebrew Hungarian Icelandic Italian Japanese Korean Norwegian (x2) Polish Portuguese (x2) Russian Spanish Swedish Tamil Thai Turkish 28 (or 30+) Languages Supported

Lexical Analysis Hierarchy Summarization Grammar check Words = spell check, morphology, etc.

Where/How is it Used? • Word Processors • spell aid, grammar checker • Search Engines & Data Mining • Extracting Lemmatized form • Query Expansion (e.g. Synonyms) • Assisting in Taxonomy Generation • Translation Memory • Identification of linguistic units • NLP tools use as first stage of analysis • Machine Translation (limited) • Speech recognition (not currently)

Conversion to Unicode • Previous version used different code pages for each language • In fact a collection of different engines • New version has a single engine for all languages using Unicode (not all languages have been converted yet) • Provided many benefits • Raised some challenges

Benefits of Unicode • No more conversion tables • Standardized on utf-16 • Big-Endian Dictionaries • Converted to platform byte order on load • Possible to deal properly with multilingual text • Able to utilize ICU utilities • Can edit/view all test cases on one machine

Issues with Unicode • Lots of work • Converting dictionaries is not trivial • Changing code would be huge (we were rewriting anyway) • Large code page • Some ambiguous representations & orthographic rules • ICU Tokenisation not always to our liking

How we dealt with large code page • Our architecture is based upon Finite State Transducers (FSTs) • In FSTs you frequently need to build tables with entries for every possible next character • for single byte code page => 256 entries • for double byte code page => 64K entries • If not careful, dictionaries might grow to 256 times their previous size • We deal with this by building a character transition table per dictionary • Each UNICODE character appearing in the dictionary is assigned a code by which it is referenced within the dictionary • This is not a private code page • The table is dynamically created when building the dictionary and will map differently each time • When looking up dictionary each character is first mapped through the table • If any characters are not in mapping table => we won't find a match in the dictionary • New characters are automatically added to the mapping table once words with the characters are added to the dictionary • Typically European languages require fewer than 100 characters • e.g. punctuation symbols do not appear in the dictionary

FSA Node English Dictionary a b c Z

FSA Node Japanese Dictionary か山川部

Representation Issues • Even with utf-16 as a universal standard representation issues can arise • The letter ë is represented as 0x00EB • But could be e=0x0065 followed by umlaut=0x0308 • Arabic letter Heh (ه) = E5 in Windows code page • In UNICODE 0x0647 • But it has different forms e.g. ههه • Unicode defines 4 more code points for presentation forms • isolated=0xFEE9, final=0xFEEA, initial=0xFEEB and medial=0xFEEC

Arabic Shape Bitmaps (Heh) • Heh varies with position • Isolated • Initial • Medial • Final

Arabic Shape Bitmaps (Seen) • Seen varies much less • Isolated • Initial • Medial • Final

Orthographic Variation • A word may appear differently in a sample text than in the dictionary. • All languages have different rules about what is allowed/required. • Some languages have simple rules • e.g. English casing rules • a lowercase dictionary word can appear in text as titlecase or uppercase • a titlecase dictionary word could appear in text as uppercase • an uppercase or mixed case dictionary word must appear in text exactly the same.

Other languages have more complex rules • In France letters lose their accent when capitalized • e.g. é is capitalized as E not É • but this rule does not apply in French Canada • So être becomes Etre in Paris but Être in Montreal • In German capitalization can change the letter count • 'ß' is sometimes capitalized as 'SS' • In English we optionally drop accents • e.g. the name Zoë can be written Zoe in English • But in German dropped umlauts are represented by a following e • e.g. Böblingen can be written Boeblingen

Vowel dropping in Semitic languages • Semitic languages such as Arabic and Hebrew, have an orthographic rule that short vowels may be dropped. • Imagine that Arabic has the words hat, hit, hut • They will be written in dictionary as • Hat = هَت • Hut = هُت • Hit = هِت • In any piece of text the string هت could be any of these words

Arabic examples in Large font • Hat = هَت • Hut = هُت • Hit = هِت • Ht = هت

Inflection Variants walk walked walking Surface Forms walk Walk WALK walked Walked WALKED walking Walking WALKING Root Word walk Model of How Orthographic & Representation variation handled Orthography & Representation Inflection

Inflection Variants passé Surface Forms passé Passe passe‌´ Passé Passe‌‌´ Passe PASSÉ PASSE PASSE‌‌́´ Root Word passé Model of How Orthographic & Representation variation handled Orthography & Representation Inflection

Recognize talk l k a t 4 3 2 0 1

Recognize walk or talk w l k a t 4 3 2 0 1

Recognize walk, walked, walking talk, talked or talking d 6 w 5 e l k a t 4 3 2 0 1 i g n 9 7 8

Recognize all variants of walk or talk d 6 w 5 e l k a t 4 3 2 0 1 i g W n 9 7 a 8 T D 15 14 E K L A 13 12 11 I 10 G N 18 16 17

Recognize all variants of wálk or tálk d 6 w 5 e k l á 4 t 2 3 1 0 i a g n 19 9 7 8 á W T D 15 a 14 E K L 13 12 11 I G Á N 18 10 16 17

Word Breaking • ICU break iterators can only provide us with a first pass of the word segmentation • Significant improvements in ICU 2.2 • Sometimes the word segmentation is not obvious • multi word expressions • compound words • languages with no spaces

Multiword Expressions • Word break is not always obvious • French pommes de terre • as 3 words = apples of the ground • as 1 word = potatoes • French l'intelligence artificielle • really 2 words le and intelligence artificielle • Sometimes the word break is ambigous • English red tape • as 2 words = tape with red color • as 1 word = synonym of bureaucracy

Decompounding • German and related languages allow speakers to generate their own compound words by combining component words • typically a series of nouns and/or adjectives • ICU considers these one word • but to analyse them properly we need to break it into its constituent words and look these up in the dictionary • Can be computationally expensive • We can also sometimes get ambiguous results

Ambiguous Decompounding - example 1 • Wachstube • Option 1 = wax tube • Wachs = wax • Tube = tube • Option 2 = guard room • Wache = guard • Stube = room • Option 3 = awake room • Wach = awake • Stube = room

Ambiguous Decompounding - example 2 • Hochschullehrer = University Lecturer • 3 component words • Hoch = High • Schule = school • Lehrer = teacher • Could index under • Hochschullehrer = Univesity Lecturer • Hochschule = High School & Lehrer = Teacher • Hoch = Higher & Shullehrer = school teacher • However Schullehrer (=schoolteacher) is a lexicalised compound • The two words are so often used together that native speakers no consider them to be a single word • Hoch & Schullehrer is only valid decomposition • In fact most German Speakers would regard Hochschullehrer as a single lexical word because it is commonly used

Ambiguous Decompounding - example 2 • Neuroschaltungsverstärkung • Neuro = neuronal • Schaltungs = circuit • verstärkung = amplification • Could be Neuroschaltungs|verstärkung • interpreted as an amplification of a "neuronal circuit" • Could also be Neuro|schaltungsverstärkung • interpreted as a kind of circuit amplification which is done by neuronal technology

Languages with no spaces • Some languages (e.g. Chinese, Japanese, Thai) do not put spaces between words • Therefore, it is difficult to figure out where the word breaks are • Sometimes there are multiple possible word segmentations. • Sometimes the choice of segmentation can change the meaning • For these languages we use ICU break iterator to find "unambiguous word break" (e.g. punctuation symbol, line break, character type change) • Looking in the dictionary we find all possible word combinations within this text sequence • Using statistical techniques we figure out which word sequence is most likely

此 = this (ci) 路 = road (lu) 不 = no/not (bu) 通 = through (tong) 此路不通 = cul-de-sac (cilubutong) 行 = walk (xing) 得 = get (de) 不得 = forbidden (bu-de) 在 = be (ci) 在此 = here (zaici) 小 = small (xiao) 便 = convenience (bian) 小便 = urinate (xiaobian) 此路不通行此路不通行不得在此小便 Interp 1 此 /路 /不 /通 //行不得 /在此 /小便 No through way for pedestrians. Urination forbidden here. Interp 2 此路不通行不 / 得在此 / 小便 Cul-de-sac. Walking Forbidden. Urinate here. Chinese Segmentation Example 此路不通行不得在此小便

此 = this (ci) 路 = road (lu) 不 = no/not (bu) 通 = through (tong) 此路不通 = cul-de-sac (cilubutong) 行 = walk (xing) Interp 1 此 /路 /不 /通 //行 This road not through walk No through way for pedestrians. Chinese Segmentation Example 此路不通行

此 = this (ci) 路 = road (lu) 不 = no/not (bu) 通 = through (tong) 此路不通 = cul-de-sac (cilubutong) 行 = walk (xing) 得 = get (de) 不得 = forbidden (bu-de) 在 = be (ci) 在此 = here (zaici) 小 = small (xiao) 便 = convenience (bian) 小便 = urinate (xiaobian) Interp 1 此 /路 /不 /通 //行不得 /在此 /小便 This road not through walkForbidden here urinate No through way for pedestrians. Urination forbidden here. Chinese Segmentation Example 此路不通行/不得在此小便

此 = this (ci) 路 = road (lu) 不 = no/not (bu) 通 = through (tong) 此路不通 = cul-de-sac (cilubutong) 行 = walk (xing) 得 = get (de) 不得 = forbidden (bu-de) 在 = be (ci) 在此 = here (zaici) 小 = small (xiao) 便 = convenience (bian) 小便 = urinate (xiaobian) Interp 1 此 /路 /不 /通 //行不得 /在此 /小便 No through way for walkers. Urination forbidden here. Interp 2 此路不通 = cul-de-sac 行不 / 得 = walk not 在此 / 小便 = here urinate Cul-de-sac. Walking Forbidden. Urinate here. Chinese Segmentation Example 此路不通行不得在此小便

Future Challenges • Complete move of all languages to new architecture • Improving quality and breadth of linguistic data • more languages • more words (better user dictionary support) • richer relationships (e.g. part of, type of etc.) • Increasing Accuracy of analysis • Part of speech disambiguation • Ranking of parse results • Increasing performance speed • Latest version exceeds 2.5 Giga Char/hour on standard PC • Some customers say it is not fast enough, but this only allows 1.5 micro seconds/char • Available through ICU API

Developing a multilingual text analysis engine - does using Unicode solve all the issues?