Multilingual Computing
• Dr. Lu Qin (陸勤), csluqin@comp, Rm PQ 814, ext 7247
• Course material online: www.comp.polyu.edu.hk/~csluqin/comp341
• Lecture notes available: Friday 14:30 of the previous week
• Lab/tutorial handouts: Friday 14:30 of the previous week
• Schedule and announcements online
• Office hours: 2:30-3:30 Tues, 2:30-3:30 Thurs
• Labs: Mr. Joe Lam, Tel. 2766 7330, Rm QT406, email: cscwlam@comp.polyu.edu.hk
• Textbook: CJKV Information Processing, by Ken Lunde, O'Reilly, 1999

Teaching and Assessment
• Lectures (fundamentals)
  • Introduction
  • Characteristics of different languages (scripts)
  • Computer representations
  • Input processing and output processing
• Information processing techniques:
  • Open systems
  • Internationalization and localization
  • Algorithms
  • Software development for multilingual environments
  • Introduction to natural language processing
• Tutorials/labs (gain experience in using some common Chinese operating systems and programming), http://www4.comp.polyu.edu.hk/~cscwlam/cc/
  • MS Chinese Windows
  • Different programming environments
• Assessment: 60% final, 15% midterm, 20% project and homework (15% + 5%), 3% class participation, 2% punctuality

What is Multilingual Computing
• Computer processing of data in more than one language or script, including any human-computer interaction in which communication is achieved
• Bilingual, trilingual, vs. multilingual
• Fundamental issues:
  • Each language has its own characteristics, and dealing with them requires expert knowledge of that language. Example: count the number of words in “Multilingual Computing” vs. “多語言文字處理技術”
  • Ways to distinguish different scripts
  • How can a system be designed so that it can be used with different languages with minimal changes?
  • How can a system be designed so that it can be used for multiple languages?

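The word-counting example can be tried directly. The sketch below is only an illustration: the helper name `count_space_delimited_words` is made up for this example, and it deliberately uses the naive "split on whitespace" rule that works for English but not for Chinese.

```python
def count_space_delimited_words(text: str) -> int:
    """Count words by splitting on whitespace (adequate for space-delimited scripts only)."""
    return len(text.split())


english = "Multilingual Computing"
chinese = "多語言文字處理技術"

english_words = count_space_delimited_words(english)  # two space-delimited tokens
chinese_words = count_space_delimited_words(chinese)  # the whole string is one "token"
chinese_chars = len(chinese)                          # number of characters, not words

print(english_words, chinese_words, chinese_chars)
```

The Chinese phrase contains no spaces, so the same rule reports a single "word" even though the phrase has nine characters and several words. Finding the real word boundaries needs language-specific knowledge (word segmentation), which is the point of the slide.
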
Different Scripts (Written Languages)
• English: fixed alphabet; words are naturally delimited by spaces; more morphological changes, but very regular; more of a token-based language than a phonetic-based language; written left-to-right
  • Example: auto, automatic, autonomous, automation, auto-movement; spelling is easy to do
• Phonetic transcription systems: Pinyin, Jyutping (粵拼), International Phonetic Alphabet (IPA)
• Korean: Hanja (漢字) are similar to Chinese characters; Hangul is a two-dimensional, Pinyin-like system. In other words, Hangul is a phonetic script or phonetic transcription system.

Korean Hangul
• KA KEU NGOA SAN NUN KOAEN
• Romanization: using Roman letters to denote the phonetic transcription

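To make the "two-dimensional phonetic script" idea concrete, the sketch below splits a precomposed Hangul syllable into its phonetic letters (jamo). It assumes Unicode, which these slides have not introduced at this point: the constants are the Unicode Hangul syllable block layout, and the function name `decompose_hangul` is made up for this example.

```python
import unicodedata

HANGUL_BASE = 0xAC00                      # first precomposed syllable, 가
LEAD_BASE, VOWEL_BASE, TAIL_BASE = 0x1100, 0x1161, 0x11A7
LEAD_COUNT, VOWEL_COUNT, TAIL_COUNT = 19, 21, 28


def decompose_hangul(syllable: str) -> str:
    """Split one precomposed Hangul syllable into its leading, vowel and trailing jamo."""
    index = ord(syllable) - HANGUL_BASE
    if not 0 <= index < LEAD_COUNT * VOWEL_COUNT * TAIL_COUNT:
        raise ValueError("not a precomposed Hangul syllable")
    lead, rest = divmod(index, VOWEL_COUNT * TAIL_COUNT)
    vowel, tail = divmod(rest, TAIL_COUNT)
    jamo = chr(LEAD_BASE + lead) + chr(VOWEL_BASE + vowel)
    if tail:                              # tail index 0 means "no final consonant"
        jamo += chr(TAIL_BASE + tail)
    return jamo


# Cross-check against the standard library's canonical decomposition.
print(decompose_hangul("한") == unicodedata.normalize("NFD", "한"))
```

Each syllable block is a fixed arrangement of two or three phonetic letters, which is why a Hangul block can be romanized letter by letter.
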
Japanese Kana
• Hiragana (phonetic): can be used entirely without any Han characters, but is often used together with Han characters (Kanji); used for words of native Japanese or Chinese origin
• Katakana (phonetic): mainly denotes foreign words
• Hiragana, Katakana and Han characters are all written either left-to-right or top-to-bottom

The Chinese Language
• General characteristics
  • Sino-Tibetan language family (漢藏語系)
  • Ideographic in nature (表意文字)
  • 50+ languages in the PRC
  • Hanyu is the official language
  • 7 major Hanyu dialects
• Hanyu dialect similarities
  • Relatively unified writing system
  • Some dialect-specific characters and variant character writing
• Hanyu dialect differences
  • Different pronunciation across dialects
  • Different words (e.g. 係 and 是)
  • Word-order reversal (e.g. 找尋 and 尋找)
  • Different expressions / grammar (e.g. 先坐 and 坐先)

Chinese Characters
• Graphemics (the look, 形)
  • Strokes (distribution 1-30+), radicals (214+), components (500+), characters (65,000+)
  • Stroke sequence order
  • Variant writing (e.g. 教 都)
• Character formation
  • Bounded sets of radicals and components, but an unbounded alphabet / character set (charset)
  • 6 principles: pictographic 象形 (火), indicative 指事 (一 二), combined-meaning 會意 (炎 旦), ideo-phonetic 形聲 (訪), borrowed 假借 (孰 熟), transitive 轉注 (考 老)

Character Decomposition
• The most basic elements of characters are strokes (筆畫). The basic strokes are “一” (橫, horizontal), “丨” (豎, vertical), “丿” (撇, left-falling), “、” (點, dot) and “乛” (折, turning).
• Chinese components (部件) are composed of strokes. A component can be considered a functional unit, and it can reflect the meaning, pronunciation and origin of a character
• See http://glyph.iso10646hk.net
• Chinese character variants (異體字): and 鳥 for birds, thus, and

Phonetics (the sound, 音)
• Phoneme (音素, 單音): contrastive unit of speech (e.g. bag and tag)
  • Vowels (元音) and consonants (輔音)
  • Putonghua: single consonants, but vowels can be double: b, p, m, f, a, o, e, ai (two phonemes)
  • Cantonese: kwok, cheung, ng
• One character, one syllable: monosyllabic
• Tonal language: tone differentiates meaning
  • Putonghua: 5 tones
  • Cantonese: 9 tones(?)
• Semantics (the meaning, 義)
  • Meaning may derive from the components of a character (e.g. 廳)
  • Single-character words have multiple meanings (樂)
  • Multi-character words usually have less ambiguity (快樂 音樂)
• Written left-to-right and also top-to-bottom
• Pinyin system, Zhuyin system (only for learning characters, not as a general reading tool)

Character Set
• A character set is a collection of characters. The set usually has a name, such as the KangXi character set. Each character in a character set is unique: $C = \{c_i \mid 1 \le i \le n,\ c_i \text{ is a character}\}$
• Computer processing of a character set requires that each character in the set be assigned a unique binary value
• Encoding: the process of mapping a character to a numeric value
• A coded character set, normally referred to as a codeset $CC$, can be considered a set of tuples: $CC = \{(c_i, code_i) \mid c_i \in C \text{ and } code_i \in CODE\}$
  • where $code_i \ne code_j$ if $c_i \ne c_j$, and $CODE$, normally a set of integers in binary form, is also called the code space

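The definition above maps naturally onto a dictionary. The following is a minimal sketch, not a real standard: the four characters, the consecutive codes starting at 0, and the helper name `is_valid_codeset` are all chosen only for illustration.

```python
from typing import Dict


def is_valid_codeset(codeset: Dict[str, int]) -> bool:
    """Check the codeset rule: different characters must have different codes."""
    codes = list(codeset.values())
    return len(set(codes)) == len(codes)


# A tiny character set C and a code space CODE = {0, 1, 2, 3} (illustrative only).
characters = ["一", "二", "三", "四"]
codeset = {ch: code for code, ch in enumerate(characters)}   # encoding: char -> code
decode_table = {code: ch for ch, code in codeset.items()}    # decoding: code -> char

print(is_valid_codeset(codeset), codeset["三"], decode_table[2])
```

Because a dictionary key can appear only once, each character automatically has exactly one code; the helper checks the other direction, that no code is shared by two characters, which is what makes decoding possible.
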
• Note that $CODE$ is a set of numbers, usually consecutive.
• Examples: suppose
  $CODE_1 = \{00, 01, 10, 11\}$, $CODE_2 = \{0000, 0001, 0010, 0011\}$, $CODE_3 = \{1000, 1001, 1010, 1011\}$,
  $CC_1 = \{(c_i, code_i) \mid c_i \in C \text{ and } code_i \in CODE_1\}$
  $CC_2 = \{(c_i, code_i) \mid c_i \in C \text{ and } code_i \in CODE_2\}$
  $CC_3 = \{(c_i, code_i) \mid c_i \in C \text{ and } code_i \in CODE_3\}$
  Then $CC_1$, $CC_2$ and $CC_3$ are different codesets!
• A codeset can also be considered conceptually as a character set with a predetermined order, where the order is determined by the numerical values in $CODE$
• The length of the binary code depends on the size of $C$ or on some predetermined number
• Codepoint: a value in the code space
• For Chinese, since there are more than 256 characters in the set, at least 2 bytes (at most 64K codepoints) are necessary to represent all the Chinese characters.

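The three example code spaces and the code-length argument can be checked with a short sketch. The characters `a` to `d`, their assignment to codes in order, and the helper names `min_code_bits` and `min_code_bytes` are assumptions made for this illustration.

```python
def min_code_bits(charset_size: int) -> int:
    """Smallest number of bits that gives every character its own code."""
    if charset_size < 1:
        raise ValueError("a character set needs at least one character")
    return max(1, (charset_size - 1).bit_length())


def min_code_bytes(charset_size: int) -> int:
    """Same requirement, rounded up to whole bytes."""
    return (min_code_bits(charset_size) + 7) // 8


# One character set C paired with the three code spaces from the slide.
C = ["a", "b", "c", "d"]
CODE1 = ["00", "01", "10", "11"]
CODE2 = ["0000", "0001", "0010", "0011"]
CODE3 = ["1000", "1001", "1010", "1011"]
CC1, CC2, CC3 = (set(zip(C, code)) for code in (CODE1, CODE2, CODE3))

print(CC1 != CC2, CC2 != CC3)                  # same characters, different codesets
print(min_code_bytes(256), min_code_bytes(257), min_code_bytes(65536))
```

`CODE1` and `CODE2` have the same scalar values but different code lengths, and `CODE3` has different values altogether, so all three codesets differ. The byte counts show why one byte stops being enough as soon as the set has more than 256 characters, and why two bytes cover at most 65,536 codepoints.
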
Numerical Notations
• Decimal notation (10 distinct values): no prefix
• Binary notation (2 distinct values):
• Hexadecimal notation: 0xHH where H: 0..9, A..F
• Hexadecimal notation is normally used in place of binary notation for better readability
  • 1 to 4 binary digits -> 1 hex digit
• Scalar value: the actual numeric value of a number written with any fixed number of digits: $\mathrm{scalar}(0001) = 1_2$, $\mathrm{scalar}(0111) = 7_{16}$, $\mathrm{scalar}(01111) = F_{16} = 15_{10} = 1111_2$
• In a computer, 00AF and AF represent different things, but they have the same scalar value.

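The scalar-value examples can be reproduced with the built-in `int`. This is only a sketch; the wrapper name `scalar` mirrors the slide's notation and is not a standard function.

```python
def scalar(digits: str, base: int) -> int:
    """Numeric value of a digit string, ignoring how many leading zeros it has."""
    return int(digits, base)


# The binary examples from the slide.
print(scalar("0001", 2), scalar("0111", 2), scalar("01111", 2))

# 00AF and AF: same scalar value, different stored representation.
same_scalar = scalar("00AF", 16) == scalar("AF", 16)
two_bytes = bytes.fromhex("00AF")   # stored in 2 bytes
one_byte = bytes.fromhex("AF")      # stored in 1 byte
print(same_scalar, two_bytes != one_byte, len(two_bytes), len(one_byte))

# Each hex digit stands for four binary digits.
print(format(0xAF, "08b"))
```

The comparison of `two_bytes` and `one_byte` shows the last point of the slide: the stored form depends on the code length, while the scalar value does not.
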
ASCII Code Table
• 0x00 - 0x1F and 0x7F: control characters
• 0x20 - 0x7E: graphic characters (printable characters)
• Code range: the range of values used in codepoint assignment
• The code range for ASCII is 0x00 to 0x7F
• A code range need not start from scalar value zero

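The ASCII ranges above translate into a simple range test. A minimal sketch; the function name `classify_ascii` is made up for this example.

```python
def classify_ascii(code: int) -> str:
    """Classify a codepoint in the ASCII code range 0x00-0x7F."""
    if not 0x00 <= code <= 0x7F:
        raise ValueError("outside the ASCII code range")
    if code <= 0x1F or code == 0x7F:
        return "control"
    return "graphic"


counts = {"control": 0, "graphic": 0}
for code in range(0x80):
    counts[classify_ascii(code)] += 1

print(counts)
print(classify_ascii(ord("A")), classify_ascii(0x0A))
```

Counting over the whole code range gives 33 control characters (0x00 to 0x1F plus 0x7F) and 95 graphic characters, which together fill all 128 codepoints.
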
• Row-cell notation: a matrix in which a row number and a column number define a cell, and thus the order of the characters; it also avoids binary notation. This is particularly useful when the code range is not consecutive.
• Character subsets: characters of a similar nature are placed next to each other, with different subsets in different rows
• Some codepoints in the code space may not have any character assigned; they are called empty codepoints.

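As one concrete case of row-cell notation, the sketch below uses GB 2312 in its common EUC-CN byte form. This is an assumption for illustration: the slide does not name a codeset here. In that form the table has 94 rows of 94 cells and each of the two bytes is 0xA0 plus the row or cell number. The helper names are made up; the `gb2312` codec is Python's built-in one.

```python
ROWS = CELLS = 94   # GB 2312 is laid out as a 94 x 94 matrix


def rowcell_to_bytes(row: int, cell: int) -> bytes:
    """Convert a 1-based (row, cell) position to its two EUC-CN bytes."""
    if not (1 <= row <= ROWS and 1 <= cell <= CELLS):
        raise ValueError("row and cell must be in 1..94")
    return bytes([0xA0 + row, 0xA0 + cell])


def bytes_to_rowcell(encoded: bytes) -> tuple:
    """Convert two EUC-CN bytes back to a 1-based (row, cell) position."""
    first, second = encoded
    return (first - 0xA0, second - 0xA0)


def is_assigned(row: int, cell: int) -> bool:
    """True if the codec maps this cell to a character, False for an empty codepoint."""
    try:
        rowcell_to_bytes(row, cell).decode("gb2312")
        return True
    except UnicodeDecodeError:
        return False


encoded = rowcell_to_bytes(16, 1)          # row 16, cell 1
print(encoded.hex(), encoded.decode("gb2312"), is_assigned(16, 1))
```

The two-byte code range here runs from 0xA1 to 0xFE in each byte, so it is not one consecutive run of numbers; the (row, cell) pair gives each character a simple position without writing out the bytes.
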
Codeset Compatibility
• For two character sets $C_1$ and $C_2$: equivalence: $C_1 = C_2$; subset: $C_1 \subset C_2$; superset: $C_1 \supset C_2$; intersect: $C_1 \cap C_2$
  • Examples: GB & B5 -> ? GB & GBK -> ?
• For two coded character sets:
  $CC_1 = \{(c_{1i}, code_{1i}) \mid c_{1i} \in C_1 \text{ and } code_{1i} \in CODE_1\}$
  $CC_2 = \{(c_{2i}, code_{2i}) \mid c_{2i} \in C_2 \text{ and } code_{2i} \in CODE_2\}$
  If for every $(c_{1i}, code_{1i}) \in CC_1$ it is true that $(c_{1i}, code_{1i}) \in CC_2$, then $CC_2$ is said to be fully compatible with $CC_1$

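The full-compatibility condition is a subset test on (character, code) pairs. The sketch below checks it for ASCII and ISO 8859-1 (Latin-1), which are used here only as a convenient stand-in pair because Python's built-in codecs cover them; the GB, Big5 and GBK questions on the slide are left open. The function name `is_fully_compatible` is made up for this example.

```python
from typing import Dict


def is_fully_compatible(cc2: Dict[str, int], cc1: Dict[str, int]) -> bool:
    """True if every (character, code) pair of cc1 is also a pair of cc2."""
    return all(cc2.get(ch) == code for ch, code in cc1.items())


# Build the two codesets from the codecs: character -> code.
ascii_cc = {bytes([code]).decode("ascii"): code for code in range(0x80)}
latin1_cc = {bytes([code]).decode("latin-1"): code for code in range(0x100)}

# Same characters as ASCII, but every code shifted (illustrative only).
shifted_cc = {ch: code + 0x80 for ch, code in ascii_cc.items()}

print(is_fully_compatible(latin1_cc, ascii_cc))   # Latin-1 contains all ASCII pairs
print(is_fully_compatible(ascii_cc, latin1_cc))   # ASCII lacks the upper half
print(is_fully_compatible(shifted_cc, ascii_cc))  # same characters, different codes
```

The last case shows that comparing the character sets alone is not enough: `shifted_cc` and `ascii_cc` contain exactly the same characters, yet no pair matches, so the codesets are not compatible.
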