
COMP323 Foundations of Chinese Computing

COMP323 Foundations of Chinese Computing. Course Introduction. Lecturer Qin LU csluqin@comp.polyu.edu.hk Room PQ814, Tel. 27667247 Teaching Assistant (Responsible for some Labs and Project Assignments) Chen Yirong csyrchen@comp.polyu.edu.hk Room QT416, Tel. 2766 7326.


Presentation Transcript


  1. COMP323 Foundations of Chinese Computing

  2. Course Introduction • Lecturer • Qin LU • csluqin@comp.polyu.edu.hk • Room PQ814, Tel. 27667247 • Teaching Assistant (Responsible for some Labs and Project Assignments) • Chen Yirong • csyrchen@comp.polyu.edu.hk • Room QT416, Tel. 2766 7326 COMP323 Lecture 1

  3. Course Introduction • COMP323 Reference Books • CJKV Information Processing: Chinese, Japanese, Korean and Vietnamese Computing (PL1074.5 .L86) • An Introduction to Chinese, Japanese and Korean Computing (QA76.H7795) • 計算機中文信息處理 (PL1074.5.C42) and others • Tutorials and labs: PQ604A • Tuesday Group: 9:30 – 10:30 Tuesdays • Thursday Group: 9:30 – 10:30 Thursdays • Try to finish the labs and the online assignment/QA during lab hours

  4. Course Introduction • COMP323 Website • WebCT • Lecture notes available Wed. by 5pm • Print as NotePage • Method of Assessment • Course Work 55% • 2 Programming Assignments 20% • 2 online quizzes 20% • 1 online homework 5% • 4 online QA (labs) 8% • Class attendance (punctuality) 2% • Final Examination 45%

  5. Course Introduction • Introduction to Chinese Computing • Chinese Computing: computer processing of data related to Chinese, involving any human-computer interaction activity where communication is achieved using the Chinese language. • About one-fifth of the people in the world speak some form of Chinese as their native language, making it the language with the most native speakers.

  6. Course Introduction • Fundamental Problems with Chinese Computing • At Chinese Character Level • Large and not Closed Character Set • Computer Representation, Input and Output • At Chinese Language Level • Lack of Morphological Variation • Lack of Grammar • Very Arbitrary and Flexible • Superimposed Grammar • Texts are Running Together

  7. Course Introduction • Fundamental Problems with Chinese Computing

  8. Course Introduction • Fundamental Problems with Chinese Language • Bi-lingual, Tri-lingual and Multi-lingual Computing • Question: Is Hong Kong a multi-lingual society? • How can a system be designed so that it can be used with different languages with minimal changes? • How can a system be designed so that it can be used for multiple languages? • Distinguish Chinese and English Characters • Chinese Text, English Text or Chinese Text Mixed Together with English Text?

  9. Course Introduction • Multilingual Computing 多語言文字處理技術 (multilingual text processing technology) • Fundamental Problems with Chinese Language • Bi-lingual, Tri-lingual and Multi-lingual Computing • Example: Count the Number of (Chinese and/or English) Characters or Words?
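
The counting example on this slide can be sketched in Python. This is only an illustration under stated assumptions, not the course's own solution: the function name `count_mixed` is hypothetical, a "Chinese character" is taken to be any code point in the basic CJK Unified Ideographs block (U+4E00 to U+9FFF), and an "English word" is taken to be a maximal run of ASCII letters.

```python
import re

# Assumption: only the basic CJK Unified Ideographs block counts as Chinese;
# extension blocks and punctuation are ignored in this sketch.
CJK_CHAR = re.compile(r"[\u4e00-\u9fff]")
# Assumption: an English word is a maximal run of ASCII letters.
ENGLISH_WORD = re.compile(r"[A-Za-z]+")


def count_mixed(text):
    """Count Chinese characters and English words in mixed text."""
    return {
        "chinese_chars": len(CJK_CHAR.findall(text)),
        "english_words": len(ENGLISH_WORD.findall(text)),
    }


sample = "我在PolyU讀COMP課程"
print(count_mixed(sample))  # {'chinese_chars': 5, 'english_words': 2}
```

Counting Chinese words rather than characters is harder, because Chinese text runs together without spaces; that requires word segmentation, which the sketch does not attempt.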

  10. Tentative Teaching Content • Characteristics of Chinese Language • Reading System (Pronunciation) • Writing System (Look) • Computer Representation of Chinese Characters • Character Set Standards (GB, Big5 and Unicode ...) • Encoding Schemes (ISO and UTF …) • Chinese Character Input • Chinese Input Processing by (Pen, Image, Speech and) Key Stroke • Shape-based Keystroke Input Method • Phonetic-based Keystroke Input Method
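
To make the distinction between character set standards and encoding schemes concrete, here is a small sketch using Python's built-in codecs. The function name `encoded_bytes` and the sample string are assumptions chosen for illustration; the codec names (`gb2312`, `big5`, `utf-8`, `utf-16-be`) are standard Python codec names.

```python
# The same two characters produce different byte sequences under
# different character set standards and encoding schemes.
CODECS = ("gb2312", "big5", "utf-8", "utf-16-be")


def encoded_bytes(text):
    """Return a mapping from codec name to the encoded bytes of text."""
    return {codec: text.encode(codec) for codec in CODECS}


for codec, data in encoded_bytes("中文").items():
    # bytes.hex() with a separator needs Python 3.8 or later.
    print(f"{codec:10s} {len(data)} bytes  {data.hex(' ')}")
```

In this sketch GB2312, Big5 and UTF-16 each use two bytes per character for 中文, while UTF-8 uses three, so byte counts alone cannot tell a program how many characters a text contains.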

  11. Tentative Teaching Content • Chinese Character Output • Bitmap and Outline Font Representation • Compression • Scaling Problem • Software Development for Chinese • Text Processing, such as Character Searching, Editing, and Deletion … • Software Localization and Internationalization

  12. Tentative Teaching Content • Chinese Language Processing • Word Segmentation • Part-of-Speech (POS) Tagging • Syntactic Analysis (Grammatical Analysis) • Chinese Information (Document) Retrieval • Document Retrieval Models • Language-Related Issues • Advanced Topics (possibly) • Information Extraction • Text Summarization

  13. Lecture 1 Characteristics of Chinese

  14. The Chinese Language • General Characteristics • The official language in China is Mandarin (普通話), but there are many dialects in spoken form (50+). • Different Pronunciation across Different Dialects • Relatively Unified Writing System • Dialect-specific Characters and Variant Character Writing, e.g. 叻吓吔呃咁咗咩哂哋唔唥唧啱啲喐喥喺嗰嘅嘜嘞嘢 • Different words express the same meaning, e.g. 係 and 是 (to be) • Word order reversal, e.g. 找尋 and 尋找 (look for)

  15. The Chinese Language

  16. The Chinese Language • Characteristics of Chinese Characters • Each Chinese character is associated with three features: its look (called graphemics), its pronunciation (called phonetics), and its meaning (called semantics). • Graphemics (The Look) • Phonetics (The Sound) • Semantics (The Meaning)

  17. Chinese Writing System • Radicals (部首) • Chinese characters are composed of smaller units, called radicals. • 214+ radicals are used for indexing Chinese characters. • The advantage of a radical is that one does not have to know the pronunciation of a character to look it up in a dictionary. • 一丨丶丿乙亅二亠人儿入八冂冖冫几凵刀力勹匕匚匸十卜卩厂厶又口囗土士夊夊夕大女子宀寸小尢尸屮山巛工己巾乡广廴廾弋弓彐彡彳心戈戶手支攴文斗斤方无日曰月木欠止歹殳毋比毛氏气水火爪父爻爿片牙犬玄玉瓜瓦甘生用田疋疒癶白皮目矛矢石示禸禾穴立竹米糸缶网羊羽老而耒耳聿肉臣自至臼舌舛舟艮色艸虍虫血行衣襾見角言谷豆豕豸貝赤走足身車辛辰辵邑酉釆里金長門阜隶隹雨靑非面革韦韭音頁凬飛食首香馬骨高髟鬥鬯鬲鬼魚鳥鹵鹿麦麻黃黍黑黹黽鼎鼓鼠鼻齊
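
Radical-based indexing can be sketched in Python. Unicode reserves a "Kangxi Radicals" block (U+2F00 to U+2FD5) that holds the 214 traditional radicals. The lookup table `TOY_INDEX` and the function names below are hypothetical and contain only a handful of characters for illustration; a real dictionary index would also sort entries by residual stroke count.

```python
import unicodedata

# The Unicode "Kangxi Radicals" block runs from U+2F00 to U+2FD5 inclusive.
KANGXI_FIRST, KANGXI_LAST = 0x2F00, 0x2FD5


def kangxi_radicals():
    """Return the characters of the Kangxi Radicals block, in order."""
    return [chr(cp) for cp in range(KANGXI_FIRST, KANGXI_LAST + 1)]


# Hypothetical toy index: radical -> a few characters filed under it.
TOY_INDEX = {
    "木": ["本", "未", "朽", "朴"],
    "火": ["炬", "炒", "炊", "炕"],
}


def lookup(radical):
    """Look up characters by radical; no pronunciation is needed."""
    return TOY_INDEX.get(radical, [])


radicals = kangxi_radicals()
print(len(radicals))                  # 214
print(unicodedata.name(radicals[0]))  # KANGXI RADICAL ONE
print(lookup("木"))
```

Note that the radical code points in this block are distinct from the ordinary ideographs that look the same: U+2F00 is a compatibility character that NFKC normalization maps to the ideograph 一 (U+4E00).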

  18. Chinese Writing System • Radicals • Remark: Several radicals can stand alone as single, meaningful Chinese characters.
      Radical   Standalone   Examples
      木        Yes          本未术札朽朴朳杀杂机朵权
      火        Yes          炜炬炅炖炒炝炙炘炊炆炕炉
      心        Yes          伈芯志忐吣忘忍态忠念忿忽
      石        Yes          岩矾矿宕砀码研砆砌砂泵砍

  19. Chinese Writing System • Strokes (筆劃) • Radicals in turn are composed of smaller units, called strokes. • 30+ strokes are the most basic elements of a character. • The 5 basic strokes are “一” (横, a horizontal stroke), “丨” (竖, a vertical stroke), “丶” (点, a dot), “丿” (撇, a stroke curved to the left) and “乙” (折, a bend stroke).

  20. Chinese Writing System • Strokes • Stroke Order (筆順) • The strokes of each Chinese character are to be drawn in a certain defined order. • Basic principles are: from left to right, top to bottom, outside to inside, horizontal before vertical, left slant before right slant, center before two sides, etc. • See Animations here http://www.chinawestexchange.com/Chinese/characters.htm

  21. Chinese Writing System • Tree Structure of Chinese Characters

  22. Chinese Writing System • Character Classifications and Formation • Type 1: Pictographs (Picture Characters) (象形) • They look like the things they represent. • Examples are 日 (sun), 山 (mountain), 水 (water), 鸟 (bird), 火 (fire), 木 (tree), 車 (car, cart), and 口 (mouth, opening), etc. • Does this character 月 really look like a moon to you? Centuries ago, it was written like this: [figure: early form of 月]

  23. Chinese Writing System • Evolution of Chinese Characters

  24. Chinese Writing System • Character Classifications and Formation • Type 2: (Simple) Ideographs (指事 or 表意) • They represent abstract concepts or ideas, such as numbers and directions, e.g. 一 (one), 二 (two), 三 (three), 中 (center, middle), 上 (above), 下 (below), etc.

  25. Chinese Writing System • Character Classifications and Formation • Type 3: Compound Ideographs (會意) • Pictographs and ideographs can be combined to form more complex characters, which usually reflect the combined meaning of their parts. • Examples: • sun 日 + moon 月 = bright 明 • person 人 + person 人 = agree/follow 从 • sun 日 + tree 木 = east 東 (sun rising above the trees in the east) • tree 木 + tree 木 = forest 林 • forest 林 + one more tree 木 = full of trees 森 • More Interesting Animations from Internet http://www.language.berkeley.edu/fanjian/compound_ideographs.html

  26. Chinese Writing System • Character Classifications and Formation • Type 3: Compound Ideographs

  27. Chinese Writing System • Character Classifications and Formation • Type 3: Compound Ideographs

  28. Chinese Writing System • Character Classifications and Formation • Type 4: Phonetic Ideographs (形聲) • They usually have at least two component characters: one influences the sound and the other influences the meaning. • For example, in the character “跳” (tiào, jump), the left part “足” means “foot”. The meanings of characters that contain “足” are related to “foot” in some way. The right part “兆” (zhào) indicates the sound: the two characters share the same vowel sound. • They account for more than 90% of all Chinese characters in use today.

  29. Chinese Writing System • Thought to be the oldest type of character, pictographs were originally pictures of things. During the past 5,000 years or so they have become simplified and stylised. • Ideographs are graphical representations of abstract ideas. • Compound pictographs and ideographs combine two or more pictographs or ideographs to form new characters. Each component part contributes to the meaning of the compound character.

  30. Chinese Writing System • Semantic-phonetic compounds represent around 90% of all existing characters and consist of two parts: a semantic component or radical which hints at the meaning of the character, and a phonetic component which gives a clue to the pronunciation of the character. • Characters containing the same phonetic component may have the same sound and the same tone, the same sound but a different tone, the same initial or final sound, or a different sound and a different tone. • Phonetic components are generally a more reliable indication of pronunciation than semantic components are of meaning.

  31. Chinese Writing System • Traditional and Simplified Characters • Over time, frequently used and complex Chinese characters tend to be simplified, e.g. by retaining only one part of the traditional character. • More about Pitfalls and Complexities of Chinese to Chinese Conversion http://www.cjk.org/cjk/c2c/c2cbasis.htm

  32. Chinese Writing System • Chinese Language (Chinese Text) • Chinese characters are combined with other Chinese characters into words to express more complex ideas and concepts. • Question: How many Chinese characters are there? • The Chinese writing system is open-ended, meaning that there is no upper limit to the number of characters. The largest Chinese dictionaries include about 56,000 characters, but most of them are archaic, obscure or rare variant forms. • Knowledge of about 3,000 characters enables you to read about 99% of the characters in Chinese newspapers and magazines. To read Chinese literature, technical writings or classical Chinese, though, you need to be familiar with about 6,000 characters.
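
The coverage claim (a few thousand frequent characters account for most running text) can be checked on any corpus with a short script. The sketch below is an assumption-laden illustration: the function name `chars_needed` is hypothetical, only the basic CJK Unified Ideographs block is counted, and the toy input stands in for a real newspaper corpus.

```python
from collections import Counter


def chars_needed(text, target):
    """Return how many of the most frequent distinct Chinese characters
    are needed to cover at least `target` (0..1) of all Chinese
    character occurrences in `text`. Returns 0 for text with no Chinese."""
    # Assumption: only U+4E00..U+9FFF is treated as Chinese here.
    counts = Counter(ch for ch in text if "\u4e00" <= ch <= "\u9fff")
    total = sum(counts.values())
    covered = 0
    for rank, (_, n) in enumerate(counts.most_common(), start=1):
        covered += n
        if covered >= target * total:
            return rank
    return 0


# Toy corpus: 的 occurs 6 times, 一 3 times, 是 once.
toy = "的的的的的的一一一是"
print(chars_needed(toy, 0.5))  # 1
print(chars_needed(toy, 0.8))  # 2
print(chars_needed(toy, 1.0))  # 3
```

Run over a large newspaper corpus with `target=0.99`, a function like this would give the number of distinct characters a reader needs to know for that coverage level.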

  33. Chinese Reading System • Pronunciation • The phonetic information is not explicit. • Sometimes, you can guess the pronunciation from the component characters. • Sometimes, the pronunciation has no relation to the components at all. • This makes learning Chinese difficult without a phonetic transcription system. • Phonetic transcription: writing down pronunciations • Sufficiency: there are symbols to indicate all sounds in the language • Uniqueness: one sound is denoted by only one symbol

  34. Chinese Reading System • Pronunciation • Pinyin: transcribing Mandarin Chinese • Initial (usually a consonant, 輔音) and Final (built around a vowel, 元音) • In speech, Chinese words are created using just 21 beginning sounds called initials, and 37 ending sounds called finals. Initials and finals combine to create the basic sounds of Chinese. • For example, consider Beijing: in bei, b is an initial and ei is a final; in jing, j is an initial and ing is a final. • More about Pronunciation http://www.chinese-outpost.com/language/pronunciation/mandarin-chinese-initials-and-finals-table-1.asp
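
Splitting a Pinyin syllable into its initial and final can be sketched as a prefix match against the list of 21 initials. The function name `split_syllable` is hypothetical, the input is assumed to be a single lowercase syllable without tone marks, and syllables that start with y, w or a vowel are treated here as having no initial.

```python
# The 21 Mandarin initials in Pinyin spelling. Two-letter initials come
# first so that "zh", "ch", "sh" are matched before "z", "c", "s".
INITIALS = ("zh", "ch", "sh",
            "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s")


def split_syllable(syllable):
    """Split a toneless Pinyin syllable into (initial, final)."""
    for initial in INITIALS:
        if syllable.startswith(initial):
            return initial, syllable[len(initial):]
    # Assumption: anything else is a zero-initial syllable such as "an".
    return "", syllable


print(split_syllable("bei"))   # ('b', 'ei')
print(split_syllable("jing"))  # ('j', 'ing')
```

Checking the two-letter initials first is a simple design choice: with a different ordering, "zhong" would be split wrongly as "z" + "hong".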

  35. Chinese Reading System • Pronunciation • Pinyin

  36. Chinese Reading System • Pronunciation • Tones of Chinese • Chinese is a tonal language. • Mandarin has 4 tones (5 counting the neutral tone) and Cantonese has 6 tones (9 counting the entering tones), which makes Cantonese much harder to learn than Mandarin.

  37. Chinese Reading System • Pronunciation • Tones differentiate meanings. • Everyone seems to know this one: just by saying “ma” in different tones, you can ask, “Did mother scold the horse?” 妈骂马吗? (mā mà mǎ ma?) • 鞏俐 (Gǒng Lì, with third and fourth tones) is the name of the star of “Raise the Red Lantern” and other contemporary Chinese films. However, 公里 (gōng lǐ, with first and third tones) means kilometer.
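
A program can recover the tone of a Pinyin syllable from its diacritic by decomposing the text with Unicode normalization. This is a minimal sketch: the function name `tone_of` is hypothetical, and it assumes the standard Pinyin tone marks (macron, acute, caron, grave), reporting the neutral tone as 5.

```python
import unicodedata

# Combining marks used by Pinyin tone diacritics, mapped to tone numbers:
# macron (1st), acute (2nd), caron (3rd), grave (4th).
TONE_MARKS = {"\u0304": 1, "\u0301": 2, "\u030c": 3, "\u0300": 4}


def tone_of(syllable):
    """Return (toneless syllable, tone number); 5 means neutral tone."""
    tone = 5
    base = []
    # NFD splits a letter such as "ā" into "a" plus a combining macron.
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
        else:
            base.append(ch)
    # NFC recomposes what is left, so "ü" survives tone-mark removal.
    return unicodedata.normalize("NFC", "".join(base)), tone


print([tone_of(s) for s in "mā mà mǎ ma".split()])
# [('ma', 1), ('ma', 4), ('ma', 3), ('ma', 5)]
print([tone_of(s) for s in "gǒng lì".split()])  # [('gong', 3), ('li', 4)]
print([tone_of(s) for s in "gōng lǐ".split()])  # [('gong', 1), ('li', 3)]
```

The four "ma" syllables reduce to the same toneless string, which is why a system that discards tone marks cannot tell 妈, 骂, 马 and 吗 apart by sound alone.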
